Infrastructure/Metrics

Revision as of 21:15, 25 May 2008 by Ricky


Metrics

RFC

The purpose of this text is to compile thoughts, opinions and options for gathering metrics for the Fedora Project. Everyone seems to be able to find faults in the different types of metrics available, but few have been able to offer improvements and alternatives. No metric is perfect. At present, the more accurate the method, the more invasive it becomes. In an Open Source world privacy is important, so a balance must be found.

This is actively being discussed on the Fedora-users list and in the Fedora Advisory Board:

https://www.redhat.com/archives/fedora-advisory-board/2006-November/msg00239.html
https://www.redhat.com/archives/fedora-list/2006-November/msg05080.html

Metrics are actually important

Many people might say that metrics are numbers just for show and that gathering them just isn't worth it. The fact is that metrics are important for anyone trying to do something with limited resources. They allow us to put what few resources we do have to better use. If the developers spend 20% of their time debugging x86_64 and our metrics show that x86_64 is 1% of our install base, the argument could be made that less time needs to be spent on x86_64. (Totally made up numbers)

Metrics are also important for determining infrastructure needs, like how many users per mirror we have. Finding these numbers allows us to determine capacity and plan for future growth and spikes in traffic, like those found during a release launch.


Methods

There are a few methods available to us to determine how many installs / users are out there. Ideally we'd pick two different methods and compare their numbers. The more closely the two results agree, the more confidence we can have in them.

Lies, Damned Lies, and Statistics

There are a number of things that we can't get automatically, and there's no point in trying to get them. Here's the list. Anyone who thinks they can get these numbers without a survey should state their case to the list.

What we can get through automated methods:

Phone Home

Phoning home is one of the most obvious methods for determining what installations are out there. The idea is simple: place an image in the home page for Firefox, have yum contact a single site for its mirror list (it already does), have Anaconda dial home during an install, have a nightly cron job contact a central server, etc.

Embedded Image

The embedded image idea has been used by other projects. It could be any remote file: js, dtd, image, you name it. When a user opens a web browser, the home page either is, or contains a link to, a file on one of our servers. We then mine the relevant information from the logs.

Pros:

Cons:

Yum

Functionally similar to the embedded image tracking, yum phones home to a central server to retrieve a mirror list. This method is currently being used. Initially it was thought that we could simply track the total number of hits to the mirror list server in a day. The idea was that, by default, yum checks in once an hour, so 500,000 hits per day / 24 check-ins per day = roughly 20,800 installs. This assumption has proved to be well off, as many people alter their settings not to check in.
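The arithmetic behind that first estimate can be written out as a short sketch. The function name and the `checkins_per_day` parameter are made up for illustration; only the 500,000 hits and the hourly default come from the text above. It also shows how sensitive the result is to the check-in rate, which is why the method broke down.

```python
def estimate_installs(hits_per_day: int, checkins_per_day: int = 24) -> int:
    """Estimate installs, assuming every install checks in a fixed number of times a day."""
    if checkins_per_day <= 0:
        raise ValueError("checkins_per_day must be positive")
    return hits_per_day // checkins_per_day


# The default assumption: every install checks in once an hour.
hourly = estimate_installs(500_000)

# Hypothetical alternative: clients really check in only four times a day.
# The same traffic would then imply six times as many installs.
sparse = estimate_installs(500_000, checkins_per_day=4)

print(hourly, sparse)
```

Because the real check-in rate varies per machine and is unknown to us, any single divisor gives a number we cannot trust.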

Since the "hits per day / 24" method proved inaccurate, we've changed to tracking unique IPs. Any time a new IP appears, we add one to our total install base. This will not, however, tell us how many installs are out there at any point in time: if someone removes Fedora from their system, their address never gets removed from our IP list.

The current system tracks unique IPs based on users looking for rawhide or updates-released-fc6, and properly tracks architectures.
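A minimal sketch of this kind of counting is shown below. It is not the script actually in use: it assumes the mirror list server writes Apache-style access logs and that requests carry `repo` and `arch` query parameters. The regular expression, function name and sample lines are all made up for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit

# Assumed log layout: Apache common log format, client address first.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (?P<url>\S+) [^"]*"'
)

# The two repositories named in the text above.
TRACKED_REPOS = {"rawhide", "updates-released-fc6"}


def unique_ips_by_arch(lines):
    """Return {arch: number of distinct client IPs} for the tracked repos."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        query = parse_qs(urlsplit(match.group("url")).query)
        repo = query.get("repo", [None])[0]
        arch = query.get("arch", [None])[0]
        if repo in TRACKED_REPOS and arch:
            seen[arch].add(match.group("ip"))
    return {arch: len(ips) for arch, ips in seen.items()}


# Illustrative sample lines using documentation-only addresses.
sample = [
    '192.0.2.10 - - [25/May/2008:21:15:00 +0000] "GET /mirrorlist?repo=updates-released-fc6&arch=i386 HTTP/1.1" 200 512',
    '192.0.2.10 - - [25/May/2008:22:15:00 +0000] "GET /mirrorlist?repo=updates-released-fc6&arch=i386 HTTP/1.1" 200 512',
    '192.0.2.11 - - [25/May/2008:21:20:00 +0000] "GET /mirrorlist?repo=rawhide&arch=x86_64 HTTP/1.1" 200 512',
    '192.0.2.12 - - [25/May/2008:21:25:00 +0000] "GET /mirrorlist?repo=core-6&arch=i386 HTTP/1.1" 200 512',
]
counts = unique_ips_by_arch(sample)
print(counts)
```

Repeated check-ins from one address count once, which removes the dependence on check-in frequency. Machines behind NAT and machines with dynamic addresses are still miscounted.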

Pros:

Cons:

Anaconda

Perhaps the most promising approach is adding a phone home to Anaconda. Regular check-ins could be considered invasive by some because they allow a degree of tracking, whereas a one-time phone home during Anaconda would contact a server only once. Users who do not want to send the information would need to opt out (via a check box or a ks line). Alternatively, this could be done once at first boot.

Pros:

Cons:

Registration

Registration is another method that could be used to determine our user base. On its own it's fairly useless, basically allowing us to count registered users. When combined with a phone home method, however, it can give us much more accurate information.

User Registration / Survey

User registration would be one way for us to determine who is using our systems and for what purpose. It could be voluntary or mandatory; the latter would be the most accurate. Simply asking our users how many installs they have, of what, and what they use them for could be very interesting.

Pros:

Cons:

Machine registration

Evil? Probably, but it is the most reliable way to figure out what machines are installed. Let's not fool ourselves: however clever we get with this, someone will be far more clever and find a way around it. Basically, when a machine installs, a dialog between the machine and our servers takes place as part of Anaconda or at first boot. This could be mandatory or voluntary.

Pros:

Cons:

Unique Identifiers

The most reliable way to get around the NAT and dynamic IP problem is to have each machine identify itself to the server using a unique identifier. This must be combined with another method, but it can be incredibly effective.
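One way this could work is sketched below, assuming a random identifier that is generated once and stored on the machine. The file location, function names and the shape of the check-in records are hypothetical, not part of any existing Fedora tool.

```python
import tempfile
import uuid
from pathlib import Path


def load_or_create_machine_id(path: Path) -> str:
    """Return the stored identifier, generating a random one on first use."""
    if path.exists():
        return path.read_text().strip()
    machine_id = str(uuid.uuid4())  # random, not derived from any hardware detail
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(machine_id + "\n")
    return machine_id


def count_installs(checkins):
    """Count distinct identifiers in (ip, machine_id) pairs, ignoring the IP."""
    return len({machine_id for _ip, machine_id in checkins})


# Client side: the identifier is stable across calls (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    id_file = Path(tmp) / "machine-id"
    first = load_or_create_machine_id(id_file)
    second = load_or_create_machine_id(id_file)

# Server side: two machines share one NAT address, and machine "a" also
# shows up later from a second, dynamic address.
checkins = [("192.0.2.1", "a"), ("192.0.2.1", "b"), ("198.51.100.7", "a")]
installs = count_installs(checkins)
print(first == second, installs)
```

Counting by identifier gives two installs here, where counting by address would also give two but for the wrong reason: it would merge the two NATed machines and double count the one that changed address.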

Pros:

Cons:

Other Metrics

Most of this document is focused on install base but, as mentioned earlier, there are other metrics we could be gathering.

Geographic Location

There are a couple of methods for getting this besides using user registration. The first is by enabling reverse DNS lookup on the awstats information we're getting. We did this in October for a while (http://fedoraproject.org/awstats/fedoraproject.org/10-06/). The other method is to use GeoIP to mine the information for us. Using simple reverse DNS lookups has proved to not scale well in our environment. Nightly scans of our logs took hours instead of minutes. Using GeoIP will require a few new Perl modules but scales much better than reverse DNS lookup.
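The difference in scaling comes from where the work happens: a reverse DNS lookup needs a network round trip per address, while a GeoIP-style lookup searches a local table of address ranges. The sketch below illustrates only that idea. It does not use the GeoIP library or its data; the range table contains documentation-only networks with placeholder country codes, and the function names are made up.

```python
import bisect
import ipaddress
from collections import Counter

# Hypothetical range table: RFC 5737 documentation networks mapped to
# placeholder country codes. A real database would hold many more ranges.
RANGES = sorted(
    (int(net.network_address), int(net.broadcast_address), country)
    for net, country in [
        (ipaddress.ip_network("192.0.2.0/24"), "AA"),
        (ipaddress.ip_network("198.51.100.0/24"), "BB"),
        (ipaddress.ip_network("203.0.113.0/24"), "CC"),
    ]
)
STARTS = [start for start, _end, _country in RANGES]


def country_for(ip: str) -> str:
    """Binary-search the local table; no network traffic is involved."""
    value = int(ipaddress.ip_address(ip))
    index = bisect.bisect_right(STARTS, value) - 1
    if index >= 0:
        start, end, country = RANGES[index]
        if start <= value <= end:
            return country
    return "unknown"


def hits_by_country(ips):
    """Tally log addresses by country code."""
    return Counter(country_for(ip) for ip in ips)


tally = hits_by_country(["192.0.2.55", "192.0.2.56", "203.0.113.1", "10.0.0.1"])
print(tally)
```

Each lookup costs a binary search over an in-memory list, so a nightly scan stays proportional to the size of the log rather than to DNS latency.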

Pros:

Cons:

Popular Packages

Popular packages would be very easy to grab if we controlled all the mirrors, or at least had a way to grab logs from all of our primary mirrors. Even if we couldn't get logs from all the mirrors we could get them from some, but aggregating different log types can be a pain.
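Once request paths have been pulled out of each mirror's logs, whatever their format, the counting itself is simple. The sketch below assumes that extraction has already been done and that files follow the usual name-version-release.arch.rpm naming. The function names, paths and package names are made up for illustration.

```python
import posixpath
from collections import Counter


def package_name(path: str):
    """Return the package name from an RPM path, or None for other files."""
    filename = posixpath.basename(path)
    if not filename.endswith(".rpm"):
        return None
    stem = filename[: -len(".rpm")]
    nvr, _, _arch = stem.rpartition(".")  # drop the trailing architecture
    parts = nvr.rsplit("-", 2)  # name, version, release
    if len(parts) != 3:
        return None
    return parts[0]


def popular_packages(*path_lists):
    """Merge download counts from any number of mirrors."""
    counts = Counter()
    for paths in path_lists:
        for path in paths:
            name = package_name(path)
            if name:
                counts[name] += 1
    return counts


# Hypothetical request paths from two mirrors with different directory layouts.
mirror_a = [
    "/pub/fedora/foo-1.0-1.fc6.i386.rpm",
    "/pub/fedora/bar-libs-2.3-4.fc6.x86_64.rpm",
    "/pub/fedora/repodata/repomd.xml",
]
mirror_b = ["/mirror/foo-1.0-1.fc6.x86_64.rpm"]

downloads = popular_packages(mirror_a, mirror_b)
print(downloads.most_common(1))
```

The hard part the text refers to is the step this sketch skips: turning each mirror's own log format into a plain list of requested paths.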

Pros:

Cons:

One option is to combine package profiles with registration. With voluntary registration this may be a good way to figure out how people are using our OS.

Public Proxy Servers

One idea for totally anonymous phone homes is to use public proxy servers. Basically, each machine would phone home with a unique identifier, but through a public proxy server. There would then be no way for us to trace a profile back to an actual machine or user. This would allow users to send us hardware and software profiles totally anonymously, which may encourage more people to participate.
