
= Metrics =

RFC
The purpose of this text is to compile thoughts, opinions and options for gathering metrics for the Fedora Project. Everyone seems to be able to find faults in the different types of metrics available, but few have been able to offer improvements and alternatives. No metric is perfect. At present, the more accurate the method, the more invasive it becomes. In an Open Source world privacy is important, so a balance must be found.

This is actively being discussed on the Fedora-users list and in the Fedora Advisory Board:

https://www.redhat.com/archives/fedora-advisory-board/2006-November/msg00239.html
https://www.redhat.com/archives/fedora-list/2006-November/msg05080.html

Metrics are actually important
Many people might say that metrics are numbers just for show and that gathering them just isn't worth it. The fact is that metrics are important for anyone trying to do something with limited resources. They allow us to put what few resources we do have to better use. If the developers spend 20% of their time debugging x86_64 and our metrics show that x86_64 is 1% of our install base, the argument could be made that less time needs to be spent on x86_64. (Totally made up numbers)

Metrics are also important for determining infrastructure needs, like how many users we have per mirror. Finding these numbers allows us to determine capacity and plan for future growth and spikes in traffic, like those found during a release launch.

Methods
There are a few methods available to us to determine how many installs / users are out there. Ideally we'd attempt to pick two different methods and compare their numbers. The closer the numbers, the more accurate we can assume our numbers are.

Lies, Damned Lies, and Statistics
There are a number of things that we can't get automatically, and there's no point in trying to get them. Here's the list. Anyone who thinks they can get these numbers without a survey, state your case to the list.


 * Fedora Users: We will never know how many people actually use Fedora.
 * Failed Installs: We will never know how many people, experienced or noob, try to install Fedora and fail.
 * Industry: We will never know whether people use Fedora for business, education, government, military or personal purposes, though with GeoIP and reverse DNS we might be able to ballpark this.
 * Server vs Desktop: The line between a server and a desktop is blurred by nature. We can never hope to even ballpark this number.

What we can get through automated methods:
 * Geographic location: GeoIP can even grab some major cities
 * Number of installs over the lifetime of a release. We currently have no way to measure this, though it is technically feasible.
 * Number of installs in a period of time. We currently have no way to measure this, though it is technically feasible.
 * Popular packages: Would require coordination with mirrors
 * General growth
 * Downloads
 * Use of EOL'd releases

Phone Home
Phoning home is one of the most obvious methods for determining what installations are out there. The idea is simple: place an image in the home page for Firefox, have yum contact a single site for its mirror list (it already does), have Anaconda dial home during an install, have a nightly cron job contact a central server, etc.

Embedded Image
The embedded image idea has been used by other projects. It could be any remote file: js, dtd, image, you name it. When a user opens a web browser, the home page is, or contains a link to, a file on one of our servers; we then mine relevant information from the logs.

Pros:
 * Easy, safe
 * Little effort to implement
 * Could be combined with a unique (per machine) identifier for better results
 * Could get regular updates unless someone changes the home page

Cons:
 * Not very reliable (What if someone doesn't install Firefox? Or even X for that matter?)
 * Mildly invasive
 * NAT issue (Many machines look like one)
 * Dynamic issue (One machine looks like many)
 * Requires further alterations for things like arch and release
 * Does not catch offline machines

Yum
Functionally similar to the embedded image tracking, yum phones home to a central server to retrieve a mirror list. This method is currently being used. Initially it was thought that we could also track the total number of hits to the mirror site in a day, the idea being that by default yum checks in once an hour. So 500,000 hits (per day) / 24 hours in a day = about 20,800 installs. This assumption has proved to be pretty far off, as many people alter their settings not to check in.
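The arithmetic behind that first estimate, as a minimal sketch. The hourly check-in default is the assumption stated above; the function name and the second, less frequent check-in rate are only illustrative.

```python
def estimate_installs(hits_per_day, checkins_per_day=24):
    """Estimate installs, assuming every install checks in a fixed number of times per day."""
    return hits_per_day / checkins_per_day


# 500,000 hits at one check-in per hour.
print(round(estimate_installs(500_000)))  # 20833

# If installs actually check in less often (here an assumed 6 times a day),
# the same hit count corresponds to far more installs.
print(round(estimate_installs(500_000, checkins_per_day=6)))  # 83333
```

The estimate is only as good as the assumed check-in rate, which is exactly where it broke down in practice.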

Since the hits/day / 24 hours = installs method proved off, we've changed to tracking unique IPs. Any time a new IP shows up, we add one to our total install base. This will not, however, let us know how many installs are out there at any point in time, since if someone removes Fedora from their system, their IP will never be removed from our list.

The current system tracks unique IPs based on users looking for rawhide or updates-released-fc6, and properly tracks arches.
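A rough sketch of what unique-IP tracking per repo and arch could look like. This is not the actual script in use: the log layout (client IP followed by the requested URL), the `repo` and `arch` query parameter names, and the sample lines (documentation-only addresses) are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse, parse_qs

# Hypothetical log records: "<client ip> <requested URL>".
SAMPLE_LOG = """\
192.0.2.10 /mirrorlist?repo=updates-released-fc6&arch=i386
192.0.2.10 /mirrorlist?repo=updates-released-fc6&arch=i386
198.51.100.7 /mirrorlist?repo=updates-released-fc6&arch=x86_64
203.0.113.5 /mirrorlist?repo=rawhide&arch=i386
203.0.113.5 /mirrorlist?repo=core-6&arch=i386
"""

# Only requests for these repos are counted, as described above.
TRACKED_REPOS = {"rawhide", "updates-released-fc6"}


def count_unique_ips(log_text):
    """Return {(repo, arch): number of distinct client IPs}."""
    seen = defaultdict(set)
    for line in log_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip malformed lines
        ip, url = parts
        query = parse_qs(urlparse(url).query)
        repo = query.get("repo", [None])[0]
        arch = query.get("arch", [None])[0]
        if repo in TRACKED_REPOS and arch:
            seen[(repo, arch)].add(ip)
    return {key: len(ips) for key, ips in seen.items()}


counts = count_unique_ips(SAMPLE_LOG)
for (repo, arch), n in sorted(counts.items()):
    print(repo, arch, n)
```

Because addresses are only ever added to the sets, the result is a running total of installs seen, not a count of installs that still exist.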

Pros:
 * Easy, safe
 * Will always have this information as long as we have a mirror list server
 * Could be combined with a unique (per machine) identifier for better results
 * Could get regular updates
 * Minimally invasive, it is actually a side effect of a service we provide
 * Automatically grab arch and release

Cons:
 * NAT issue (Many machines look like one)
 * Dynamic issue (One machine looks like many)
 * Difficult to track current install base, mostly tracks total installs.
 * Does not catch offline machines

Anaconda
Perhaps the most promising technology is adding a phone home into anaconda. Regular check-ins could be considered invasive by some, as they allow some sort of tracking. A one-time phone home during anaconda would call back to a server only once. Users would need to opt out (via check box or ks line) to not send the info back home. Alternatively, this could be done one time on first boot.

Pros:
 * Less invasive than regular phone homes
 * Could be combined with a unique (per machine) identifier for better results
 * Could be set up to send info about arch and release
 * Could be combined with a hardware scanner and software profile

Cons:
 * Adds complexity to Anaconda
 * NAT issue (Many machines look like one)
 * Dynamic issue (One machine looks like many)
 * Cannot track current number of installs, only total successful installs
 * Will not catch offline machines

Registration
Registration is another method that could be used to determine our user base. On its own it's fairly useless, basically allowing us only to count registered users. But when combined with a phone home method, we can get much more accurate information.

User Registration / Survey
User registration would be one method for us to determine who is using our systems and for what purpose. This could be voluntary or mandatory, the latter of which is most accurate. Simply asking our users how many installs they have, of what, and what they use them for could be very interesting.

Pros:
 * Requires no changes to the OS
 * Can get non-Fedora information (Do you use Ubuntu, for example)

Cons:
 * Mandatory registration may not be taken well among the community
 * Users lie or don't always know
 * To my knowledge no other major FOSS project requires registration

Machine registration
Evil? Probably, but it is the most reliable way to figure out what machines are installed. Let's not fool ourselves: as clever as we get with this, someone will be far more clever and find a way around it. Basically, a machine installs, and as part of anaconda or at first boot a dialog between the machine and our servers occurs. This could be mandatory or voluntary.

Pros:
 * Extremely accurate
 * Can account for Fedora installs that no longer exist and remove them from our records
 * Can get arbitrary information, installed packages, hardware profile, etc

Cons:
 * Evil
 * Does not catch offline machines
 * Requires changes in our current install method

Unique Identifiers
The most reliable way to get rid of the NAT and Dynamic issues is to have each machine identify itself to the server using a unique identifier. This must be combined with another method but can be incredibly effective.

Pros:
 * Accurate
 * When used with Phone Home, automated tracking becomes very easy
 * Can get trends and remove a machine that is no longer around
 * Requires minimal changes to whatever other mechanism it is partnered with
 * No more NAT / Dynamic issue

Cons:
 * Users may be uncomfortable with us tracking individual machines (even if there is no direct link back to them)
 * More invasive than just IP tracking
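To make the NAT and Dynamic points concrete, here is a small sketch of counting by identifier instead of by address. It assumes a random UUID generated once per install and a 30-day window for treating a machine as still around; the class, its method names and the window are all hypothetical, not an agreed design.

```python
import uuid
from datetime import date, timedelta


def new_machine_id():
    """Random identifier generated once per install; carries no hardware or user data."""
    return str(uuid.uuid4())


class InstallRegistry:
    """Server-side record of when each machine identifier was last seen."""

    def __init__(self):
        self.last_seen = {}

    def check_in(self, machine_id, source_ip, day):
        # source_ip is deliberately ignored: machines are counted by identifier.
        self.last_seen[machine_id] = day

    def active_installs(self, today, max_age_days=30):
        """Count identifiers seen within the last max_age_days days."""
        cutoff = today - timedelta(days=max_age_days)
        return sum(1 for day in self.last_seen.values() if day >= cutoff)


registry = InstallRegistry()
laptop, desktop, retired = new_machine_id(), new_machine_id(), new_machine_id()
today = date(2006, 12, 1)

# NAT: two machines behind one address still count as two.
registry.check_in(laptop, "192.0.2.1", today)
registry.check_in(desktop, "192.0.2.1", today)
# Dynamic addressing: one machine on a new address still counts once.
registry.check_in(laptop, "198.51.100.9", today)
# A machine that stopped checking in ages out of the active count.
registry.check_in(retired, "203.0.113.4", today - timedelta(days=90))

print(registry.active_installs(today))  # 2
print(len(registry.last_seen))          # 3
```

The same record gives both numbers the other methods struggle with: total identifiers ever seen, and those seen recently enough to count as current.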

Other Metrics
Most of this document is focused on install base, but as mentioned earlier, there are other metrics we could be getting.

Geographic Location
There are a couple of methods for getting this besides using user registration. The first is by enabling reverse DNS lookup on the awstats information we're getting. We did this for a while (http://fedoraproject.org/awstats/fedoraproject.org/10-06/). The other method is to use GeoIP to mine the information for us. Simple reverse DNS lookups have proved not to scale well in our environment: nightly scans of our logs took hours instead of minutes. Using GeoIP will require a few new Perl modules but scales much better than reverse DNS lookup.

Pros:
 * Geographical information is important, especially when discussing language support
 * Pretty non-invasive, like our yum method this is a side effect of a service we provide.

Cons:
 * Not extremely accurate but it doesn't have to be
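The sketch below illustrates the general shape of a GeoIP-style tally: each client address is resolved against a local table rather than with a DNS query per address. It is written in Python purely for illustration and does not use the real GeoIP database or its Perl modules; the prefix table is a made-up stand-in built from documentation address ranges, and the country assignments are arbitrary.

```python
import ipaddress
from collections import Counter

# Stand-in for a GeoIP database: network prefix -> country code.
# Documentation address ranges with made-up countries.
PREFIX_TO_COUNTRY = [
    (ipaddress.ip_network("192.0.2.0/24"), "US"),
    (ipaddress.ip_network("198.51.100.0/24"), "DE"),
    (ipaddress.ip_network("203.0.113.0/24"), "JP"),
]


def country_for(ip_text):
    """Look up a country code in the local table; no DNS query is made."""
    address = ipaddress.ip_address(ip_text)
    for network, country in PREFIX_TO_COUNTRY:
        if address in network:
            return country
    return "unknown"


def hits_by_country(client_ips):
    """Tally client addresses by country code."""
    return Counter(country_for(ip) for ip in client_ips)


sample_ips = ["192.0.2.10", "192.0.2.77", "198.51.100.7", "10.1.2.3"]
print(hits_by_country(sample_ips))
```

Addresses that match nothing land in an "unknown" bucket, which fits the point above that the numbers do not have to be exact to be useful.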

Popular Packages
Popular packages would be very easy to grab if we controlled all the mirrors, or at least had a way to grab logs from all of our primary mirrors. Even if we couldn't get all the mirrors we could get some of them, but aggregating different log types can be a pain.

Pros:
 * Knowing which packages are popular may help us allocate time better (adding more modules and plug-ins for popular packages)
 * Requires no changes to the OS

Cons:
 * Difficult to grab
 * Won't grab people using local mirrors.
 * Distorted by which packages are the defaults
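One way around the differing log types is to ignore the log layout entirely and look only for the RPM file name in each line. The sketch below assumes the usual name-version-release.arch.rpm file naming; the sample lines, paths and package versions are made up to resemble an HTTP access log and an FTP transfer log.

```python
import re
from collections import Counter

# Made-up sample lines in two different log layouts.
SAMPLE_LINES = [
    '192.0.2.10 - - [01/Dec/2006:10:00:00 +0000] '
    '"GET /pub/fedora/6/i386/firefox-1.5.0.7-7.fc6.i386.rpm HTTP/1.1" 200 1234',
    '198.51.100.7 - - [01/Dec/2006:10:01:00 +0000] '
    '"GET /pub/fedora/6/x86_64/firefox-1.5.0.7-7.fc6.x86_64.rpm HTTP/1.1" 200 1234',
    'Fri Dec  1 10:05:00 2006 3 203.0.113.5 2345 '
    '/pub/fedora/updates/6/i386/yum-3.0-6.noarch.rpm b _ o a anon ftp 0 * c',
    '192.0.2.10 - - [01/Dec/2006:10:06:00 +0000] '
    '"GET /pub/fedora/6/i386/repodata/repomd.xml HTTP/1.1" 200 951',
]

# Match the last path component of anything ending in .rpm.
RPM_FILE = re.compile(r'([^/\s"]+)\.rpm\b')


def package_name(line):
    """Return the package name from the first RPM file in the line, or None."""
    match = RPM_FILE.search(line)
    if not match:
        return None
    # Strip the arch, then split off version and release from the right,
    # so package names that contain hyphens stay intact.
    nvr, _, _arch = match.group(1).rpartition(".")
    parts = nvr.rsplit("-", 2)
    return parts[0] if len(parts) == 3 else None


def popular_packages(lines):
    """Count downloads per package name across all log lines."""
    names = (package_name(line) for line in lines)
    return Counter(name for name in names if name)


print(popular_packages(SAMPLE_LINES).most_common())
```

This counts downloads, not installs, so the distortions listed above (default packages, local mirrors) still apply to whatever it produces.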

One option is to combine package profiles with registration. With voluntary registration this may be a good way to figure out how people are using our OS.

Public Proxy Servers
One idea for totally anonymous phone homes is to use public proxy servers. Basically, have each machine phone home with a unique identifier, but through a public proxy server. There would then be no way for us to trace a profile back to an actual machine or user. This would allow users to send us hardware and software profiles totally anonymously, which may encourage more people to participate.
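A minimal client-side sketch of that idea, using Python's standard `urllib` proxy support. Both URLs are placeholders that do not exist, the report fields are invented for illustration, and the actual send is left commented out so the sketch runs offline. It also assumes a proxy that does not pass the client address along; the proxy operator itself would still see where the request came from.

```python
import json
import urllib.request
import uuid

# Hypothetical endpoints; neither exists.
PROXY_URL = "http://proxy.example.org:3128"
REPORT_URL = "http://metrics.example.org/report"


def build_report(machine_id, profile):
    """Serialize only the identifier and the profile; nothing that names the user."""
    payload = {"id": machine_id, "profile": profile}
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def build_proxied_opener(proxy_url=PROXY_URL):
    """Return an opener that routes HTTP requests through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


body = build_report(str(uuid.uuid4()), {"arch": "i386", "release": "fc6"})
request = urllib.request.Request(
    REPORT_URL, data=body, headers={"Content-Type": "application/json"}
)
opener = build_proxied_opener()
# opener.open(request) would send the report through the proxy.
print(request.get_method(), request.full_url)
```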