From Fedora Project Wiki

statistics++: Making Fedora Project data accessible
Ian Weller, Fedora Engineering, Red Hat, Inc.

Project overview

Fedora Infrastructure has made only a limited foray into statistics. The Statistics page on the Fedora Project Wiki reports the number of HTTP requests made to various infrastructure applications and the number of wiki edits made per month, but little else.

The statistics app in the first version of Fedora Community attempted to improve on the Statistics page, but ultimately failed because of the complexity of adding new and relevant automated queries to the platform and the limited amount of information Fedora's application servers could access.

Once the planned messaging infrastructure for infrastructure applications is in place, a statistics application can listen on the message bus, record activity, and store it in a database for later retrieval. This application will be called statistics++.

statistics++ consists of three components:

  1. datanommer, a server daemon that listens on the infrastructure message bus and records activity to a database
  2. datagrepper, an HTTP application that provides a RESTful web API for downloading data stored in the database based on a simple query syntax
  3. dataviewer, an HTTP application that produces automated data displays such as tables or charts
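
The datanommer role in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the message bus is stood in for by an in-process `queue.Queue`, the message fields (`topic`, `timestamp`, `body`) and the `activity` table are hypothetical, and SQLite is used only because it ships with Python.

```python
import json
import queue
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) activity table and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS activity ("
        "id INTEGER PRIMARY KEY, topic TEXT, timestamp TEXT, body TEXT)"
    )
    return conn


def record_messages(bus, conn):
    """Drain every message currently on the bus into the database.

    A real daemon would block on the bus indefinitely; this sketch stops
    once the stand-in queue is empty so that it can be run as a script.
    Returns the number of messages stored.
    """
    stored = 0
    while True:
        try:
            message = bus.get_nowait()
        except queue.Empty:
            break
        conn.execute(
            "INSERT INTO activity (topic, timestamp, body) VALUES (?, ?, ?)",
            # The body is kept as JSON text so each application can use
            # its own fields without a schema change (an assumption).
            (message["topic"], message["timestamp"], json.dumps(message["body"])),
        )
        stored += 1
    conn.commit()
    return stored
```

Storing the message body as JSON text is one simple way to keep the table independent of any single application's fields; a real deployment might prefer per-application tables.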

Target audience

datanommer is targeted toward infrastructure application developers who wish to make their data available for use in datagrepper and dataviewer.

datagrepper is targeted toward software developers who wish to generate their own queries for personal use or for inclusion in dataviewer.

dataviewer is targeted toward any user interested in statistics about the Fedora Project, such as Fedora users and developers, Red Hat executives, and journalists.

Goals

This project aims to solve the following problems:

  • Data on the Statistics wiki page can only be generated and validated by those who have access to Fedora log servers.
  • Data on the Statistics wiki page requires a human to generate the data each week.
  • Data on the Statistics wiki page does not encompass all infrastructure applications.
  • Data on the Statistics wiki page can be modified by anybody who can edit the wiki.
  • To generate data for other infrastructure applications (such as FAS, Koji, and Bodhi), separate code has to be written for each application in order to download data.

To solve these problems, statistics++ will have the following functionality:

  • Open, read-only access to any anonymized data collected by infrastructure applications
  • A standard RESTful API for downloading data
  • Flexible schemas for storing and retrieving data from infrastructure applications
  • Live updates of statistical data from infrastructure applications
  • An interface for creating automated queries and representing data in tables or charts
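
To make the idea of a standard API with a simple query syntax concrete, here is a minimal sketch of how a query string could be turned into filters over stored records. The parameter names (`topic`, `start`, `end`) and the record fields are hypothetical; this document does not define the actual query syntax.

```python
from urllib.parse import parse_qs


def parse_query(query_string):
    """Turn a URL query string into a filter dict (hypothetical syntax)."""
    params = parse_qs(query_string)
    return {
        "topic": params.get("topic", [None])[0],
        "start": params.get("start", [None])[0],
        "end": params.get("end", [None])[0],
    }


def run_query(records, filters):
    """Return the records that satisfy every filter that was supplied.

    Timestamps are assumed to be ISO 8601 strings in a single time zone,
    so plain string comparison orders them chronologically.
    `start` is inclusive and `end` is exclusive.
    """
    matches = []
    for record in records:
        if filters["topic"] is not None and record["topic"] != filters["topic"]:
            continue
        if filters["start"] is not None and record["timestamp"] < filters["start"]:
            continue
        if filters["end"] is not None and record["timestamp"] >= filters["end"]:
            continue
        matches.append(record)
    return matches
```

With this sketch, a request such as `topic=wiki.edit&start=2011-01-01&end=2012-01-01` would select the wiki edits recorded during 2011.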

Non-goals

  • Live pushing of data to other applications (that is the purpose of the messaging bus itself)

Details / design overview

Modularity

I decided to break statistics++ into three components to make it more modular. This has some benefits:

  • Each component can be versioned and updated separately, assuming there is no API breakage (there shouldn't be).
  • Other projects can decide to use the project as a whole or separate components (for example, using datanommer alone to avoid depending on the TG2 stack).
  • I get to reuse the name datanommer (the name for the statistics project started about two years ago that did effectively the same thing but was put on hold due to limited resources).

datanommer

datagrepper

dataviewer

Requirements for release

  1. The following applications must send activity messages over the message bus:
    • httpd
    • MediaWiki
    • FAS
    • Koji
    • Bodhi
  2. The datanommer service must run, connect to a message bus, listen for activity, parse activity messages, and store data in a database for all of the above applications.
  3. Data from before datanommer began running must be gathered from log files or application databases and placed in the database.
  4. The datagrepper service must run and respond to basic queries. The data schema for each infrastructure application and the query syntax must be documented, and examples in that documentation must function. The service must be capable of providing responses in JSON and compressing a response when requested.
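
The last requirement above (JSON responses, compressed when requested) could look roughly like the following. This is a framework-independent sketch: the function name `build_response` is hypothetical, and the handling of the `Accept-Encoding` header is deliberately simplified (it ignores quality values such as `gzip;q=0`).

```python
import gzip
import json


def build_response(records, accept_encoding=""):
    """Serialize records to JSON, gzip-compressing if the client allows it.

    Returns a (headers, body) pair, where body is bytes.
    """
    body = json.dumps(records).encode("utf-8")
    headers = {"Content-Type": "application/json"}

    # Simplified parsing: strip any ";q=..." parameter and compare names.
    encodings = [
        token.split(";")[0].strip().lower()
        for token in accept_encoding.split(",")
    ]
    if "gzip" in encodings:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"

    headers["Content-Length"] = str(len(body))
    return headers, body
```

In practice the web framework or the front-end HTTP server may already handle compression, in which case only the JSON serialization step would live in datagrepper.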

Use cases

  • Adam wants information on wiki edits made in 2011. He doesn't have experience with any programming languages, but if he could import the data into a spreadsheet program, he could work with it there.
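
One way to serve a use case like Adam's is to convert a JSON response into CSV, which spreadsheet programs can import directly. The sketch below assumes the response is a JSON array of flat objects; the field names in the usage note are hypothetical.

```python
import csv
import io
import json


def json_to_csv(json_text, columns):
    """Convert a JSON array of flat objects into CSV text.

    Only the named columns are written; other fields are ignored, and
    fields missing from a record are left as empty cells.
    """
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

For example, calling `json_to_csv(response_text, ["timestamp", "user", "page"])` would yield a header row followed by one row per wiki edit. Whether this conversion happens in datagrepper, in dataviewer, or in a client tool is left open here.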

Relationship to other services

statistics++ is directly related to the messaging bus project (fedbus and busmon).

statistics++ is indirectly related to every other infrastructure application, as we wish to include every infrastructure application in statistics++ eventually.

Reviewers

(Subject to change; names are basically placeholders.)

Schedule summary

Milestones aren't likely to change, but dates are subject to wild change depending on the status of messaging support in infrastructure.

Date        Milestone
2012-04-13  datanommer done:
  • Application configuration (messaging configuration and data schema) format complete
  • Successfully listens to messages and stores them in the database
2012-05-21  datagrepper done:
  • The long time allotted here takes into account my inexperience with TurboGears 2, the preferred web framework for Fedora Infrastructure.
2012-05-28  datagrepper Python client library done
2012-06-29  dataviewer done

Dependencies

For statistics++ to run on Fedora Infrastructure, a messaging bus must be in place.

For the inclusion of each infrastructure application in statistics++, that application must send messages over the messaging bus, and data generated from that application prior to inclusion must be imported into the database.
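
As an illustration of importing pre-existing data, the sketch below parses one httpd access-log line in the Common Log Format into a record that could be stored alongside live messages. The record layout and the `httpd.request` topic name are hypothetical. The client address is deliberately dropped, in keeping with the goal of exposing only anonymized data.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_log_line(line):
    """Return an anonymized record for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    # Month abbreviations are parsed according to the current locale;
    # this assumes an English or C locale.
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return {
        "topic": "httpd.request",  # hypothetical topic name
        "timestamp": when.isoformat(),
        "method": match.group("method"),
        "path": match.group("path"),
        "status": int(match.group("status")),
        # The client host is intentionally not copied into the record.
    }
```

Application databases (for example, the MediaWiki revision history) would need their own import code; only the log-file case is sketched here.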

Open issues

Resources for information

Responsible parties