From Fedora Project Wiki

Revision as of 15:46, 25 June 2024 by Catanzaro (talk | contribs) (Change title)

Opt-In Metrics for Fedora Workstation

This is a proposed Change for Fedora Linux.
This document represents a proposed Change. As part of the Changes process, proposals are publicly announced in order to receive community feedback. This proposal will only be implemented if approved by the Fedora Engineering Steering Committee.

Summary

The goal of this change proposal is to provide the Fedora community with accurate, representative data about the real-world use of Fedora Workstation. We believe that this data will accelerate the development of Fedora Workstation and ensure that it improves in line with our users’ needs and requirements.

Protecting user privacy is of utmost importance for this initiative. To this end, the service will only collect generic, standardized data, and will never collect anything that is personally identifying. It will also, of course, be fully open source. On the server side, the data will be stored in a way that prevents user identification.

Another important aspect of the initiative is that it will be run in a transparent manner, and will be governed as part of the Fedora Project. A new SIG will be responsible for the service, and will be open to community participation. It will publish analyses of the collected data, document how the service operates, share samples of the database, and respond to requests from the community.

Finally, we intend to ensure that metrics reporting is fully under the control of end users. Metrics collection will default to off, and will only be enabled through a clear on/off prompt in initial setup. Users will be able to view the data that has been collected locally, and will be able to remove the client software from their systems, should they choose to do so.

To address concerns that the community might have, the change owners have created a privacy and transparency checklist, which will be updated as the initiative progresses.

Owners

Current status

The proposal is to deploy an existing data collection system, Azafea, for Fedora Workstation. Azafea has both client and server components. Significant work is required to make a wide-scale deployment of Azafea possible (see the Scope section below).

This updated proposal obsoletes the original proposal.

  • Targeted release: Fedora Linux 42
  • Last updated: 2024-06-25
  • FESCo issue: <will be assigned by the Wrangler>
  • Tracker bug: <will be assigned by the Wrangler>
  • Release notes tracker: <will be assigned by the Wrangler>

Detailed Description

This section includes a detailed description of each aspect of the metrics proposal.

Data that will be collected

All collected data will be anonymous:

  • We will not collect identifying information, such as email addresses, online account details, or IP addresses.
  • We will only collect generic, standardized information. For example, we want to collect data on which apps are used, but we will never collect data on which websites are viewed or which files are opened.
  • On the server side, each metric will be stored separately and will not be linked to other metrics from the same system. This will prevent user fingerprinting through the cross-referencing of anonymous information.

All of the code in the data collection system will be open source and available for public inspection.

The data we plan on collecting will fall into the following categories:

Category | Examples
Hardware details | CPU, graphics, cameras, which peripherals are present.
System settings | The display language, which input methods are used, which accessibility features are enabled.
Desktop usage patterns | Which apps are used, how many open workspaces there are, how often each system settings panel is opened.
Performance reports | Disk and memory usage.
Evidence of problems | Counts of system crashes, OOM events, app crashes.

For more detailed information, see the preliminary list of metrics that we want to collect. This list indicates the purpose of each metric that we hope to collect.

Steps to ensure anonymity

The metrics that we hope to collect are all generic in nature, and do not contain personal or identifying information.

To prevent accidental collection of identifying information, the data we collect will be filtered on the client side, so that only known, standardized variables are included. For example, when recording which apps are used, we will only record known package names, in order to prevent custom apps with identifying metadata from being recorded.
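
The client-side filtering described above can be illustrated with a small allowlist check. This is only a sketch, not code from Azafea or the eos-* components: the `KNOWN_PACKAGES` set and the `filter_app_events` function are hypothetical names, and a real deployment would presumably derive its allowlist from Fedora package metadata.

```python
# Hypothetical allowlist of package names that are safe to report.
KNOWN_PACKAGES = frozenset({"firefox", "gnome-text-editor", "nautilus"})


def filter_app_events(app_names, known_packages=KNOWN_PACKAGES):
    """Keep only app names found in the allowlist; silently drop the rest."""
    return [name for name in app_names if name in known_packages]


# "jane-doe-tax-tool" stands for a custom app whose name could identify its user.
observed = ["firefox", "jane-doe-tax-tool", "nautilus"]
reportable = filter_app_events(observed)
print(reportable)
```

Because unknown names are dropped rather than reported under a placeholder, nothing about a custom app leaves the machine in this sketch, not even the fact that one was used.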

Wherever possible, the system will aggregate data locally prior to upload. For example, it can report the number of times that a feature was used in a week, instead of the exact time whenever it is used. This method further increases anonymity by reducing the precision of the data that is reported.
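
As a rough illustration of local aggregation, the sketch below collapses individual timestamped events into per-week counts, so that only the counts would be uploaded. The event names and the `aggregate_weekly` function are assumptions made for this example; the proposal does not specify how the client would implement aggregation.

```python
from collections import Counter
from datetime import datetime


def aggregate_weekly(events):
    """Collapse (feature, timestamp) pairs into counts per ISO year and week."""
    counts = Counter()
    for feature, timestamp in events:
        iso = timestamp.isocalendar()
        # iso[0] is the ISO year, iso[1] the ISO week number.
        counts[(feature, iso[0], iso[1])] += 1
    return dict(counts)


events = [
    ("night-light", datetime(2024, 6, 24, 9, 15)),
    ("night-light", datetime(2024, 6, 26, 22, 40)),
    ("night-light", datetime(2024, 7, 2, 8, 5)),
]
weekly = aggregate_weekly(events)
print(weekly)
```

The exact times of use are discarded during aggregation; only the feature name, the week, and the count survive.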

We will only deploy the service once it has undergone a thorough period of testing, during which we will verify that the database is only being populated with anonymous data. (Data from the testing phase of the system will be permanently deleted.)

How metrics data will be used

We anticipate that the data we collect will drive myriad improvements within Fedora as well as the wider ecosystem. These improvements include:

Resource prioritization - knowing which hardware, features and apps are used most will allow developers and partners to focus their efforts where they will have the most impact.

Software improvements - data about usage and performance patterns can drive optimisations in existing software, in terms of both technical and UX design.

Configuration enhancements - decisions about default settings and the default composition of Fedora Workstation can be based on observed usage patterns.

Better development practices - we aim to promote and encourage user and data driven development practices through this work.

To achieve these impacts, analysis of the collected data will be published and circulated to the relevant developers and projects.

Who will have access to metrics data

In the interests of transparency, we will put the following mechanisms in place for viewing the data that is collected:

  1. Raw data from the database will be published during the testing phase, prior to wide-scale deployment
  2. Members of the community will be able to join the metrics SIG, in order to get full ongoing access to the data
  3. After deployment, a randomly selected sample of the database will be published (once it has been manually checked)
  4. Members of the community will be able to request copies of the database from the SIG, which will be shared privately

This proposal is an attempt to balance the need to protect privacy with the need to provide transparency. We have a high degree of confidence that the database will only contain anonymous data (see “Steps to ensure anonymity”). However, there is always some risk that something could go wrong with data collection. Out of an abundance of caution, we therefore only want to share data once it has been manually checked.

Approval for changes to the metrics system

Any changes to the metrics system and its governance arrangements will require approval by FESCo. This includes any changes to:

  • the metrics data that is collected
  • the metrics SIG (its rules, role, composition, membership terms)
  • the technology used
  • the UI for user opt-in/opt-out
  • the hosting of the infrastructure or the involvement of third parties

User control

The proposed system aims to ensure that users are always in control of metrics collection on their systems. This will be achieved through the following:

  • The setting for metrics collection will enable or disable both local metrics collection and data upload
  • Metrics collection will be off by default
  • Metrics collection will only be enabled through an explicit opt in from the user, which will be presented as part of initial setup
  • It will always be possible for users to disable metrics collection from the system settings
  • It will be possible for users to view the metrics that have been collected locally on their systems
  • It will be possible for users to remove the metrics collection components from their systems, using dnf

Metrics system components

The metrics system would be composed of server and client Azafea components.

An Azafea server deployment consists of five components:

  1. an nginx proxy server
  2. azafea-metrics-proxy
  3. redis
  4. azafea itself
  5. a Postgres database

nginx proxies HTTP requests to azafea-metrics-proxy, which is itself a simple HTTP server that adds batches of metrics into the redis database, where they will be fetched by Azafea and stored into Postgres.
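
The data flow through the server components can be modelled in a few lines. The sketch below is not Azafea code: it uses an in-memory deque in place of redis and a plain list in place of Postgres, and the `MetricsPipeline` class and its method names are hypothetical. It only shows the hand-off order (proxy enqueues batches, Azafea drains them and stores each metric as its own row).

```python
from collections import deque


class MetricsPipeline:
    """Toy model of the server-side hand-off between components."""

    def __init__(self):
        self.queue = deque()  # stands in for the redis queue
        self.rows = []        # stands in for the Postgres tables

    def proxy_receive(self, batch):
        """Role of azafea-metrics-proxy: accept a batch and enqueue it."""
        self.queue.append(list(batch))

    def azafea_drain(self):
        """Role of Azafea: fetch queued batches and store each metric separately."""
        stored = 0
        while self.queue:
            batch = self.queue.popleft()
            for metric in batch:
                self.rows.append(metric)  # one row per metric, no batch link kept
                stored += 1
        return stored


pipeline = MetricsPipeline()
pipeline.proxy_receive([{"name": "app-launched"}, {"name": "oom-event"}])
pipeline.proxy_receive([{"name": "app-launched"}])
stored_count = pipeline.azafea_drain()
print(stored_count, len(pipeline.rows))
```

Note that the stored rows keep no reference to the batch they arrived in, which mirrors the stated goal of not linking metrics from the same system.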

The client side consists of the following components:

  • eos-metrics - a D-Bus interface that applications and services may use to record events, plus a GObject library that provides a simple API around the D-Bus interface
  • eos-event-recorder-daemon - the service that actually implements the D-Bus interface: it collects metrics recorded via D-Bus, batches them together, and sends them to the metrics server at predefined intervals
  • eos-metrics-instrumentation - the component that records system-level events by calling D-Bus methods on eos-event-recorder-daemon
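
The batching behaviour of the recorder daemon, combined with the opt-in switch described under "User control", might look roughly like the following. This is a simplified sketch with no D-Bus involved; the `EventRecorder` class, its methods, and the event names are all hypothetical and do not reflect the actual eos-event-recorder-daemon API.

```python
class EventRecorder:
    """Hypothetical stand-in for the daemon's record-and-batch behaviour."""

    def __init__(self, send, enabled=False):
        self._send = send        # callable that uploads one batch
        self.enabled = enabled   # off by default, mirroring the opt-in design
        self._pending = []

    def record(self, event_name):
        """Buffer an event locally, but only if the user has opted in."""
        if self.enabled:
            self._pending.append(event_name)

    def flush(self):
        """Upload buffered events as a single batch; return how many were sent."""
        if not self.enabled or not self._pending:
            return 0
        batch, self._pending = self._pending, []
        self._send(batch)
        return len(batch)


uploaded = []
recorder = EventRecorder(send=uploaded.append)
recorder.record("settings-panel-opened")  # ignored: the user has not opted in
recorder.enabled = True
recorder.record("settings-panel-opened")
recorder.record("app-launched")
sent = recorder.flush()
print(sent, uploaded)
```

A single flag gates both `record` and `flush` here, which corresponds to the requirement that one setting controls local collection and upload together. A real daemon would call `flush` on a timer rather than on demand.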

Feedback

The initial version of this proposal generated a huge amount of feedback and debate. We have put a lot of time and effort into engaging with this feedback, and the proposal has been substantially changed in response to it. We are grateful to the Fedora community for enabling us to improve the proposal in this way.

We know that there were issues with the original proposal, and that these led to serious concerns amongst the community. We hope that the updated proposal addresses these concerns, and look forward to receiving further feedback.

The following is a summary of the key points from the discussion so far, along with details of the steps that have been taken in response to them. Additional information is also included in the FAQ. If we have missed something from that discussion, please let us know.

Opt in or opt out?

The original proposal specified that metrics upload would be disabled by default, and that the UI setup would include an on by default switch to allow users to opt out. This aspect of the proposal attracted by far the most negative feedback.

As a result of this feedback, we have changed the proposal: we now propose that initial setup will show an explicit yes/no prompt which has no default value.

We recognise that feedback about the opt-out UI reflected wider concerns about the privacy and transparency of the metrics system, which we have addressed through other changes.

Proposal omissions

We received feedback that the original proposal omitted key details, including:

  • The benefit to Fedora
  • Which metrics will be collected
  • That each metric will be stored separately and will not be correlated
  • How members of the community will be able to access the database
  • Whether users will be able to view the local data that has been collected on their systems
  • That the metrics packages can be removed using DNF
  • The policy through which the collection of specific metrics will be approved

This information has now been added to the proposal.

Ability to view the entire data set

This was a frequent request in the feedback we received. We understand the motivation to have transparency and to verify what data is being collected.

“Who will have access to metrics data” contains an updated proposal which we hope will satisfy this desire while also preventing potential privacy issues.

Risks to anonymity if the metrics server is hacked

This was another major subject of discussion, with various concerns being raised.

We are confident that it will not be possible for the administrators of the metrics system to identify or fingerprint users under normal operation of the metrics server. We also want to emphasize the generic nature of the metrics we want to collect.

We have also committed to:

  • Take steps to minimize risks, such as having short retention of server logs
  • Manage the server through the metrics SIG, so that members of the community can contribute their expertise
  • Document the infrastructure setup for the metrics server once it has been set up, in order to solicit further feedback

These points have been added to our privacy and transparency checklist.

The metrics server will not store IP addresses or entire batches of metrics data. However, if Fedora infrastructure is compromised, an attacker could begin recording this information. We acknowledge this as a risk of the system.

Local data collection

The original proposal specified that local data collection would default to on, while upload of that data would default to off. Some pointed out that this would be a privacy risk.

In the new version of the proposal, local data collection will only be enabled after the user has consented to metrics collection.

Other suggestions

We received various other suggestions during the debate about the original change proposal. These included:

Provide fine-grained user control over which data is uploaded

This would add complexity to the system and to data analysis. We are also unsure how much these fine-grained controls would be used in practice. We are not rejecting this outright, but it is unlikely that we ourselves would be able to add it to the initial version of the system.

Only collect some metrics for a fixed time period

We agree that this makes sense for some metrics and we have added this to our privacy and transparency checklist, as a future work item.

Restrict metrics collection to a small sample of users

The main issues with this approach would be ensuring that the sample is representative, and preserving our ability to detect issues experienced by subsets of the user base.

Collaborate with a trusted third party

The idea behind this suggestion was for us to get additional oversight and input from an organization that has expertise in data privacy issues. We’d be very happy to do this, but are unsure who that third party would be. We are open to suggestions!

Adopt differential privacy techniques

Differential privacy would potentially allow Fedora systems to submit deliberately noisy data to the metrics server, while ensuring that the overall data set is still representative and useful. We would welcome collaboration from Fedora community members interested in improving the metrics collection system to adopt such techniques.
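
To make the idea concrete, the sketch below uses randomized response, one classic technique with differential privacy guarantees. It is offered only as an illustration of how individually inaccurate reports can still yield a useful aggregate; the proposal does not commit to this or any other mechanism, and the function names, the 0.75 truth probability, and the simulated population are assumptions chosen for the example.

```python
import random


def randomized_response(truth, rng, p_truth=0.75):
    """Report the true bit with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports, p_truth=0.75):
    """Invert the noise, using observed = p_truth * true + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


rng = random.Random(42)
# Simulated population: 30% of systems truly have some feature enabled.
population = [i < 3000 for i in range(10000)]
reports = [randomized_response(bit, rng) for bit in population]
estimate = estimate_true_rate(reports)
print(round(estimate, 3))
```

No single report can be trusted, since any given system may have answered at random, yet the estimate over many reports should land close to the true 30% rate. The trade-off is that more reports are needed for the same precision, which interacts with the sampling concerns discussed above.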

Benefit to Fedora

See “How metrics data will be used”.

Scope

  • Proposal owners: this change requires substantial technical and nontechnical work from the change owners. This will include:
    • Properly packaging eos-metrics, eos-event-recorder-daemon, and eos-metrics-instrumentation for Fedora
    • Modifying eos-metrics-instrumentation so that it does not send events that are not approved for use in Fedora
    • Creation of the metrics SIG and its various policies and procedures
    • Documentation for end users and members of the community
  • Other developers: Community Platform Engineering (CPE) will need to host the metrics server infrastructure.
  • Release engineering: #11514
  • Policies and guidelines: see "Approval for changes to the metrics system"
  • Trademark approval: N/A (not needed for this change)
  • Alignment with objectives: there are currently no Fedora Initiatives. However, the generated data will be broadly applicable to Fedora community activities.

Upgrade/Compatibility Impact

There are no special technical challenges in this regard.

Metrics collection will only be enabled in response to an explicit opt-in by the user, through a UI in either gnome-initial-setup or gnome-control-center. gnome-initial-setup is only shown for new installs, meaning that the only way to enable metrics on an upgraded system would be through gnome-control-center.

How to Test

Testing is not currently possible. Instructions will be provided when this changes.

User Experience

The user experience for the system will consist of:

  1. In initial setup, a UI to choose between metrics collection being on or off. There will be no default in the UI and users will have to explicitly choose one of the two options.
  2. In the Privacy settings, a switch to turn metrics collection on or off
  3. User documentation about the service
  4. A method to view locally collected metrics data

Dependencies

Packages wanting to collect metrics data will need to depend on eos-metrics. For example, to collect statistics about Settings usage, the gnome-control-center package would need to depend on eos-metrics in order to send a metric to eos-event-recorder-daemon.

Contingency Plan

  • Contingency mechanism: remove the eos-metrics, eos-event-recorder-daemon, and eos-metrics-instrumentation packages from the workstation-product comps group, and rebuild any packages that gained a dependency on eos-metrics.
  • Contingency deadline: beta freeze
  • Blocks release? If the change is incomplete, it will need to be reverted before release.

Documentation

This feature depends on several different upstream projects, each of which has its own documentation.

Client side components:

  • eos-metrics has online docs at D-Bus interface XML. API documentation is also built and installed in a docs subpackage.
  • eos-event-recorder-daemon and eos-metrics-instrumentation components do not have online documentation at this time.

Server-side documentation:

Release Notes

These will be provided if the proposal is approved and successfully implemented.