From Fedora Project Wiki
 
(33 intermediate revisions by the same user not shown)
Line 5: Line 5:
 
== Executive summary ==
 
== Executive summary ==
  
This document is a specification for ''statistics++'', a set of software to aggregate, present, and display data and statistics about the Fedora community. Its primary goals are to make data about the Fedora Project easily accessible to the general public and automate current statistical analysis currently done by hand.
+
This document is a specification for ''statistics++'', a set of software to aggregate and display data and statistics about the Fedora community. Its primary goals are to make data about the Fedora Project easily accessible to the public and automate current statistical analysis done by hand.
  
statistics++ is a smaller project in [[Fedora Engineering/FY13 Plan|Fedora Engineering's FY13 plan]]. It depends on the development of a messaging bus inside Fedora Infrastructure. If all milestones are completed on time, the project's first release will be mid-July, 2012.
+
statistics++ is a smaller project in [[Fedora Engineering/FY13 Plan|Fedora Engineering's FY13 plan]]. It depends on a messaging bus existing within Fedora Infrastructure. If we complete all milestones on time, the project's first release will be mid-July 2012.
  
 
== Revision history ==
 
== Revision history ==
Line 16: Line 16:
 
== Project overview ==
 
== Project overview ==
  
Fedora Infrastructure has had a limited foray into the field of statistics. The [[Statistics]] page on the Fedora Project Wiki contains some limited information about the number of HTTP requests made to various infrastructure applications and the number of wiki edits made per month.
+
Fedora Infrastructure has had a limited foray into the field of statistics. The [[Statistics]] page on the Fedora Project Wiki has some limited information about the number of HTTP requests made to various infrastructure applications and the number of wiki edits made per month.
  
 
The [https://admin.fedoraproject.org/community/#statistics statistics app in the first version of Fedora Community] attempted to improve on the [[Statistics]] page, but ultimately failed because of the complexity of adding new and relevant automated queries to the platform and the limited amount of information Fedora's application servers could access.
 
The [https://admin.fedoraproject.org/community/#statistics statistics app in the first version of Fedora Community] attempted to improve on the [[Statistics]] page, but ultimately failed because of the complexity of adding new and relevant automated queries to the platform and the limited amount of information Fedora's application servers could access.
  
With the [[Fedora Engineering/FY13 Plan#AMQP Enablement|planned messaging infrastructure]] for infrastructure applications, a statistics application can be programmed to listen on the message bus, record activity, and store activity in a database for later retrieval. This program will be called ''statistics++''.
+
With the [[Fedora Engineering/FY13 Plan#AMQP Enablement|planned messaging infrastructure]] for infrastructure applications, we can create a statistics application to listen on the message bus, record activity, and store activity in a database for later retrieval. We call this program ''statistics++''.
  
 
statistics++ consists of three components:
 
statistics++ consists of three components:
# ''datanommer'', a server daemon that listens on the infrastructure message bus and records activity to a database
+
# <code>datanommer</code>, a server daemon that listens on the infrastructure message bus and records activity to a database
# ''datagrepper'', an HTTP application that provides a [http://en.wikipedia.org/wiki/Representational_state_transfer#RESTful_web_services RESTful web API] for downloading data stored in the database based on a simple query syntax
+
# <code>datagrepper</code>, an HTTP application that provides a [http://en.wikipedia.org/wiki/Representational_state_transfer#RESTful_web_services RESTful web API] for downloading data stored in the database based on a simple query syntax
# ''dataviewer'', an HTTP application that produces automated data displays such as tables or charts
+
# <code>dataviewer</code>, an HTTP application that produces automated data displays such as tables or charts
  
 
== Target audience ==
 
== Target audience ==
  
datanommer is targeted toward infrastructure application developers who wish to make their data available for use in datagrepper and dataviewer.
+
{|
 
+
|+ Table 1: statistics++ components and target audiences
datagrepper is targeted toward software developers who wish to generate their own queries for personal use or for inclusion in dataviewer.
+
|-
 
+
! Component !! Target audience
dataviewer is targeted toward any user interested in statistics about the Fedora Project, such as Fedora users and developers, Red Hat executives, and journalists.
+
|-
 +
| <code>datanommer</code> || Fedora Infrastructure application developers that want to make application data available for use in <code>datagrepper</code> and <code>dataviewer</code>
 +
|-
 +
| <code>datagrepper</code> || Programmers that want to generate queries on <code>datanommer</code>-provided data for personal use or for inclusion in <code>dataviewer</code>
 +
|-
 +
| <code>dataviewer</code> || Any user interested in statistics about the Fedora Project, including Fedora users and developers, Red Hat executives, and journalists
 +
|}
  
 
== Goals ==
 
== Goals ==
Line 39: Line 45:
 
This project aims to solve the following problems:
 
This project aims to solve the following problems:
  
* Data on the [[Statistics]] wiki page can only be generated and validated by those who have access to Fedora log servers.
+
* Data on the [[Statistics]] wiki page can only be generated and validated by those who have access to Fedora log servers
* Data on the [[Statistics]] wiki page requires a human to generate the data each week.
+
* Data on the [[Statistics]] wiki page requires a human to generate the data each week
* Data on the [[Statistics]] wiki page does not encompass all infrastructure applications.
+
* Data on the [[Statistics]] wiki page does not encompass all infrastructure applications
* Data on the [[Statistics]] wiki page can be modified by anybody who can edit the wiki.
+
* Anybody who can edit the wiki can change data on the [[Statistics]] wiki page
* To generate data for other infrastructure applications (such as FAS, Koji, Bodhi, and other applications), separate code has to be written for each application in order to download data.
+
* Programmers must write different code to generate data for each infrastructure application
  
To solve these problems, statistics++ will have the following functionality:
+
To solve these problems, statistics++ has the following functionality:
  
* Open, read-only access to any anonymized data collected by infrastructure applications
+
* Open, read-only access to any anonymous data collected by infrastructure applications
 
* A standard RESTful API for downloading data
 
* A standard RESTful API for downloading data
 
* Flexible schemas for storing and retrieving data from infrastructure applications
 
* Flexible schemas for storing and retrieving data from infrastructure applications
Line 63: Line 69:
 
=== Modularity ===
 
=== Modularity ===
  
I decided to break statistics++ into three components to make them more modular. There are some benefits to this:
+
I broke statistics++ into three components. There are some benefits to this:
* Each component can be versioned and updated separately, assuming there is no API breakage (there shouldn't be).
+
* We can version and update each component separately
* Other projects can decide to use the project as a whole or separate components (for example, using datanommer alone to prevent using the TG2 stack).
+
* Other projects can decide to use the project as a whole or as separate components (such as using <code>datanommer</code> alone to prevent using the TurboGears 2 stack)
* I get to reuse the name datanommer (the name for the statistics project started about two years ago that did effectively the same thing but was put on hold due to limited resources).
+
* I get to reuse the name for <code>datanommer</code> (the name for the statistics project started about two years ago and put on indefinite hold)
  
=== datanommer ===
+
=== <code>datanommer</code> ===
  
datanommer will be a system service written in Python. At a basic level, its purpose is to connect to a message bus, find messages that it is interested in, and store data from those messages into a database.
+
<code>datanommer</code> is a system daemon written in Python. At a basic level, its purpose is to connect to a message bus, listen for interesting messages, and store data from those messages into a database.
  
An init script or systemd service file (depending on the release) will be written for datanommer.
+
<code>datanommer</code> includes a SysV-style init script or systemd service file.
  
A configuration file defines data stored for each application. These data definitions are called ''schemas''. A schema represents a single application, but applications can have multiple schemas. Each schema consists of this configuration:
+
A configuration file defines data stored for each application called ''schemas''. A schema represents a single application, but applications can have multiple schemas. Each schema consists of this information:
* The namespace to check messages against (with named groups)
+
* The namespace to check messages against
* The fields that are stored in the database and their types (SQLAlchemy field types, most likely)
+
* The fields stored in the database and their types (SQLAlchemy field types, most likely)
* (optional) A regular expression for reading data in from log files using the ''datanommer-logread'' utility
+
* (optional) A regular expression for reading data in from log files using the <code>datanommer-logread</code> utility
  
When enabled, datanommer will check each message on the bus against its list of namespaces. If it matches any that datanommer knows, it will extract the data and store it in the database.
+
When enabled, <code>datanommer</code> checks each message on the bus against its list of namespaces. If it matches any that <code>datanommer</code> knows, it will extract the data and store it in the database.
  
=== datagrepper ===
+
=== <code>datagrepper</code> ===
  
datagrepper is a web frontend written in the TurboGears 2 framework, to be run through Apache httpd via WSGI. Its purpose is to accept queries to the statistics database and return the requested information.
+
<code>datagrepper</code> is a web frontend written in the TurboGears 2 framework, run with Apache httpd and WSGI. It accepts queries to the statistics database and return the requested information.
  
Depending on implementation, datagrepper may or may not need access to datanommer's configuration file. If the database is SQL-backed (i.e. PostgreSQL), datagrepper can determine the schema for each database based on table layouts. If a NoSQL database is used, datanommer could put information about the schema in the database. Alternatively to all of these choices, datagrepper can simply have access to datanommer's configuration file.
+
Depending on implementation, <code>datagrepper</code> may or may not need access to <code>datanommer</code>'s configuration file. If the database is SQL-backed (i.e. PostgreSQL), <code>datagrepper</code> can determine the schema for each database based on table layouts. If a NoSQL database is used, <code>datanommer</code> could put information about the schema in the database. Alternatively, <code>datagrepper</code> can simply have access to <code>datanommer</code>'s configuration file.
  
The index page of datagrepper shows available schemas that data can be downloaded from and what fields can be fetched or searched. By default, it presents output in HTML, but can be displayed in JSON.
+
The index page of <code>datagrepper</code> shows available schemas and what fields can be fetched or searched. It outputs this list in HTML or JSON.
  
The <code>/query</code> URI accepts a query string as either a GET or POST request. Query string variable names match those of the database fields. [https://docs.djangoproject.com/en/dev/topics/db/queries/#field-lookups Django-like field lookup arguments] will be accepted (for example, sending the query string <code>date__lte=2011-12-31</code> will return rows in the table where the "date" field is less than or equal to December 31, 2011). <code>/query</code> will accept a <code>__format</code> argument, which can either be <code>json</code> to return data in JSON or <code>csv</code> to return data in CSV.
+
The <code>/query</code> URI accepts a query string as either a GET or POST request. Query string variable names match those of the database fields. <code>/query</code> accepts [https://docs.djangoproject.com/en/dev/topics/db/queries/#field-lookups Django-like field lookup arguments] (for example, sending the query string <code>date__lte=2011-12-31</code> returns rows in the table where the "date" field is less than or equal to December 31, 2011). <code>/query</code> accepts a <code>__format</code> argument to output data in JSON or CSV.
  
==== datagrepper client API ====
+
(Thought: provide additional data outputs to work with other visualization programs, such as https://lwn.net/Articles/504741/)
  
A Python client API will be available for datagrepper which will automate some of the intricacies of downloading data via HTTP, using gzip compression, continuing queries and converting the JSON output to a Python object.
+
==== <code>datagrepper</code> client Python library ====
  
=== dataviewer ===
+
A Python library for accessing <code>datagrepper</code> will automate the intricacies of downloading data via HTTP, using gzip compression, continuing queries and converting the JSON output to a Python object.
  
dataviewer is a web frontend written in the TurboGears 2 framework, to be run through Apache httpd via WSGI. Its purpose is to make queries to datagrepper using the client API and display data in various formats (such as tables or charts).
+
=== <code>dataviewer</code> ===
  
The specific plan for defining what displays are available and how they get data is currently being discussed in the <code>#fedora-apps</code> IRC channel on freenode. At a basic level, the definition of how the displays work should be a per-client configuration as opposed to something distributed with dataviewer.
+
<code>dataviewer</code> is a web frontend written in the TurboGears 2 framework, run with Apache httpd and WSGI. It makes queries to <code>datagrepper</code> using the Python client library and displays data in various formats (such as tables or charts).
 +
 
 +
The specific plan for defining what displays are available and how they get data is being discussed in the <code>#fedora-apps</code> IRC channel on freenode.
  
 
== Requirements for release ==
 
== Requirements for release ==
  
 
# The following applications must send activity or log messages over the message bus:
 
# The following applications must send activity or log messages over the message bus:
#* httpd
+
#* Apache httpd
 
#* MediaWiki
 
#* MediaWiki
 
#* FAS
 
#* FAS
Line 112: Line 120:
 
#* AutoQA
 
#* AutoQA
 
#* Git (pkgs.fedoraproject.org and git.fedorahosted.org)
 
#* Git (pkgs.fedoraproject.org and git.fedorahosted.org)
# The datanommer service must run, connect to a message bus, listen for activity, parse activity messages and store data into a database for all of the above services.
+
# The <code>datanommer</code> service must run, connect to a message bus, listen for activity, parse activity messages and store data into a database for all the above services.
# Data from before datanommer began running must be gathered from log files or application databases and placed in the database.
+
# Data from before <code>datanommer</code> began running must be gathered from log files or application databases and placed in the database.
# The datagrepper service must run and respond to basic queries. The data schema for each infrastructure application and the query syntax must be documented, and examples in that documentation must function. The service must be capable of providing responses in JSON and compressing a response when requested.
+
# The <code>datagrepper</code> service must run and respond to basic queries. The data schema for each infrastructure application and the query syntax must have documentation, and examples in that documentation must function. The service must provide responses in JSON and compress a response when requested.
# Queries on [[Statistics]] using the above application data must be automated and displayed in dataviewer.
+
# Queries on the [[Statistics]] wiki page using the above application data must exist in <code>dataviewer</code>.
# Documentation must be written for:
+
# Documentation must exist for:
#* Adding schemas to datanommer
+
#* Adding schemas to <code>datanommer</code>
#* Using the datanommer API
+
#* Using the <code>datanommer</code> API
#* Using the datagrepper Python client library
+
#* Using the <code>datagrepper</code> Python client library
#* Adding displays to dataviewer
+
#* Adding displays to <code>dataviewer</code>
  
 
== Use cases ==
 
== Use cases ==
  
Within six months, statistics++ should be able to handle the following use cases:
+
Within six months, statistics++ should handle the following use cases:
  
 
* Adam wants information on wiki edits made in 2011. He doesn't have experience with any programming languages, but if he could import data into a spreadsheet program he can use the data that way.
 
* Adam wants information on wiki edits made in 2011. He doesn't have experience with any programming languages, but if he could import data into a spreadsheet program he can use the data that way.
* Brenda needs information on how often different architectures were requested from MirrorManager in order to provide information to FESCo on the debate of demoting an architecture to secondary.
+
* Brenda needs information on how often Fedora systems requested repodata for different architectures from MirrorManager to provide information to FESCo on the debate of demoting an architecture to secondary.
 
* Cathy is a journalist and wants to determine the year-by-year growth rate of the Fedora user base and compare that to the year-by-year growth rate of the Fedora contributor base.
 
* Cathy is a journalist and wants to determine the year-by-year growth rate of the Fedora user base and compare that to the year-by-year growth rate of the Fedora contributor base.
* David is interested in seeing how many packages were available at each release's end-of-life and whether the rate of change is increasing or decreasing.
+
* David wants to see how many packages were available at each release's end-of-life and whether the rate of change is increasing or decreasing.
* Ethan of the [[Websites]] team wants to see if a certain page was regularly accessed enough to see if it should continue to be maintained.
+
* Ethan of the [[Websites]] team wants to see if a certain page was regularly accessed enough to decide whether to remove it.
* Fred wants to determine how many packages required to remain in testing for a certain period of time actually receive positive or negative karma in Bodhi.
+
* Fred wants to determine how many packages required to stay in testing for a certain time actually receive positive or negative karma in Bodhi.
 +
* Giles wants to see information on mailing list user counts over time.
  
 
== Relationship to other services ==
 
== Relationship to other services ==
Line 141: Line 150:
 
== Reviewers ==
 
== Reviewers ==
  
Subject to change; names are basically placeholders.
+
(Subject to change)
  
 
* Infrastructure reviewer: [[User:kevin|Kevin Fenzi]]
 
* Infrastructure reviewer: [[User:kevin|Kevin Fenzi]]
Line 162: Line 171:
 
|-
 
|-
 
| 2012-04-20
 
| 2012-04-20
| datanommer written:
+
| <code>datanommer</code> written:
* Application configuration (messaging configuration and data schema) format complete
+
* Application configuration (messaging configuration and <code>data</code> schema) format complete
 
* Successfully listens to messages and stores them in the database
 
* Successfully listens to messages and stores them in the database
 
|-
 
|-
 
| 2012-04-27
 
| 2012-04-27
| datanommer packaged for EPEL and in production infrastructure (or staging if during a change freeze)
+
| <code>datanommer</code> packaged for EPEL and in production infrastructure (or staging if during a change freeze)
 
|-
 
|-
 
| 2012-05-21
 
| 2012-05-21
| datagrepper written:
+
| <code>datagrepper</code> written:
 
* Automatically determines schemas and lists them on the index page
 
* Automatically determines schemas and lists them on the index page
 
* <code>/query</code> functions as advertised
 
* <code>/query</code> functions as advertised
Line 178: Line 187:
 
| 2012-05-28
 
| 2012-05-28
 
|
 
|
* datagrepper packaged for EPEL and in production infrastructure (or staging if during a change freeze)
+
* <code>datagrepper</code> packaged for EPEL and in production infrastructure (or staging if during a change freeze)
* datagrepper Python client library done and functions as advertised
+
* <code>datagrepper</code> Python client library done and functions as advertised
 
|-
 
|-
 
| 2012-06-29
 
| 2012-06-29
| dataviewer written:
+
| <code>dataviewer</code> written:
 
* Capable of displaying most if not all data on the [[Statistics]] wiki page
 
* Capable of displaying most if not all data on the [[Statistics]] wiki page
 
|-
 
|-
 
| 2012-07-13
 
| 2012-07-13
| dataviewer packaged for EPEL and in production infrastructure (or staging if during a change freeze)
+
| <code>dataviewer</code> packaged for EPEL and in production infrastructure (or staging if during a change freeze)
 
|}
 
|}
 
For statistics++ to run on Fedora Infrastructure, a messaging bus must be in place, and all components of statistics++ must be packaged for [[EPEL]].
 
 
For the inclusion of each infrastructure application in statistics++, that application must send messages over the messaging bus, and data generated from that application prior to inclusion must be imported into the database.
 
  
 
== Open issues ==
 
== Open issues ==
  
* How does the datanommer configuration file define data types? (Currently thinking SQLAlchemy types will work best)
+
* How does the <code>datanommer</code> configuration file define data types? (Currently thinking SQLAlchemy types will work best)
* How should messages sent while datanommer is not listening be handled?
+
* How should messages sent while <code>datanommer</code> is not listening be handled?
* Should datanommer check for duplicate messages (for example, reading in log files during a time period when messages were received)? If so, should this be configured per-schema?
+
* Should <code>datanommer</code> check for duplicate messages (such as reading in log files during a time period when <code>datanommer</code> was running)? If so, should this be configured per-schema?
* How should datagrepper handle excessively large queries? Some large queries may take longer than a normal HTTP timeout to generate. Some ideas:
+
* How should <code>datagrepper</code> handle excessively large queries? Some large queries may take longer than a normal HTTP timeout to generate. Some ideas:
** Have a response that means "your query is generating, here's a code you can check to see if you can download it." Advantages: server can process query when it has idle time; downloads have less HTTP request overhead. Disadvantages: user has to wait for data; server has to retain data for some time period so it can be downloaded.
+
** Have a response that means "your query is generating, here's a code you can check to see if you can download it." Advantages: server can process query when it has idle time; downloads have less HTTP request overhead. Disadvantages: user has to wait for data; server has to retain data for some time period.
** MediaWiki style "query-continue" messages that give query string variables to be changed to access the next set of results
+
** MediaWiki style <code>query-continue</code> messages that give changes to query string variables to access the next set of results
 
* Should we use RRD as a secondary database for faster queries and rendering?
 
* Should we use RRD as a secondary database for faster queries and rendering?
* How should dataviewer be configured?
+
* How should <code>dataviewer</code> be configured?
* Should the dataviewer component be a separate web application or should it be part of the Fedora Community web framework?
+
* Should the <code>dataviewer</code> component be a separate web application or should it be part of the Fedora Community web framework?
  
 
== Resources for information ==
 
== Resources for information ==

Latest revision as of 13:10, 5 July 2012

statistics++: Making Fedora Project statistics accessible and automated
Ian Weller, Fedora Engineering, Red Hat, Inc.
Version 1.0 (Tue Mar 27 2012)

Executive summary

This document is a specification for statistics++, a set of software to aggregate and display data and statistics about the Fedora community. Its primary goals are to make data about the Fedora Project easily accessible to the public and automate current statistical analysis done by hand.

statistics++ is a smaller project in Fedora Engineering's FY13 plan. It depends on a messaging bus existing within Fedora Infrastructure. If we complete all milestones on time, the project's first release will be mid-July 2012.

Revision history

Version 1.0 — Tue Mar 27 2012
Initial specification release

Project overview

Fedora Infrastructure has so far made only a limited foray into statistics. The Statistics page on the Fedora Project Wiki reports the number of HTTP requests made to various infrastructure applications and the number of wiki edits made per month.

The statistics app in the first version of Fedora Community attempted to improve on the Statistics page, but ultimately failed because of the complexity of adding new and relevant automated queries to the platform and the limited amount of information Fedora's application servers could access.

With the planned messaging infrastructure for infrastructure applications, we can create a statistics application to listen on the message bus, record activity, and store activity in a database for later retrieval. We call this program statistics++.

statistics++ consists of three components:

  1. datanommer, a server daemon that listens on the infrastructure message bus and records activity to a database
  2. datagrepper, an HTTP application that provides a RESTful web API for downloading data stored in the database based on a simple query syntax
  3. dataviewer, an HTTP application that produces automated data displays such as tables or charts

Target audience

Table 1: statistics++ components and target audiences

Component   | Target audience
datanommer  | Fedora Infrastructure application developers that want to make application data available for use in datagrepper and dataviewer
datagrepper | Programmers that want to generate queries on datanommer-provided data for personal use or for inclusion in dataviewer
dataviewer  | Any user interested in statistics about the Fedora Project, including Fedora users and developers, Red Hat executives, and journalists

Goals

This project aims to solve the following problems:

  • Data on the Statistics wiki page can only be generated and validated by those who have access to Fedora log servers
  • Data on the Statistics wiki page requires a human to generate the data each week
  • Data on the Statistics wiki page does not encompass all infrastructure applications
  • Anybody who can edit the wiki can change data on the Statistics wiki page
  • Programmers must write different code to generate data for each infrastructure application

To solve these problems, statistics++ has the following functionality:

  • Open, read-only access to any anonymous data collected by infrastructure applications
  • A standard RESTful API for downloading data
  • Flexible schemas for storing and retrieving data from infrastructure applications
  • Live updates of statistical data from infrastructure applications
  • An interface for creating automated queries and representing data in tables or charts

Non-goals

This project should not attempt to solve the following problems:

  • Live pushing of data to other applications (the purpose of the messaging bus)

Details / design overview

Modularity

I broke statistics++ into three components. There are some benefits to this:

  • We can version and update each component separately
  • Other projects can decide to use the project as a whole or as separate components (such as using datanommer alone to avoid the TurboGears 2 stack)
  • I get to reuse the name datanommer (the name of a statistics project started about two years ago and put on indefinite hold)

datanommer

datanommer is a system daemon written in Python. At a basic level, its purpose is to connect to a message bus, listen for interesting messages, and store data from those messages into a database.

datanommer includes a SysV-style init script or a systemd service file, depending on the release.

A configuration file defines the data stored for each application. These data definitions are called schemas. A schema represents a single application, but an application can have multiple schemas. Each schema consists of this information:

  • The namespace to check messages against
  • The fields stored in the database and their types (SQLAlchemy field types, most likely)
  • (optional) A regular expression for reading data in from log files using the datanommer-logread utility

When enabled, datanommer checks each message on the bus against its list of namespaces. If a message matches a namespace that datanommer knows, datanommer extracts the data and stores it in the database.
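
The matching step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the datanommer implementation: the schema layout, the topic name org.fedoraproject.wiki.edit, the use of regular expressions as namespaces, and the in-memory store dictionary standing in for the database are all hypothetical.

```python
import re

# Hypothetical schema definitions; the specification does not fix a format.
# Each schema has a namespace (assumed here to be a regular expression matched
# against the message topic) and the fields to store with their types.
SCHEMAS = {
    "wiki_edit": {
        "namespace": r"^org\.fedoraproject\.wiki\.edit$",
        "fields": {"user": str, "page": str, "timestamp": int},
    },
}


def match_schemas(topic, schemas=SCHEMAS):
    """Return the names of all schemas whose namespace matches the topic."""
    return [
        name for name, schema in schemas.items()
        if re.match(schema["namespace"], topic)
    ]


def extract_row(schema, body):
    """Keep only the fields the schema declares, coerced to the declared type."""
    return {
        field: field_type(body[field])
        for field, field_type in schema["fields"].items()
        if field in body
    }


def handle_message(topic, body, store):
    """Store one row per matching schema; unmatched messages are ignored."""
    for name in match_schemas(topic):
        store.setdefault(name, []).append(extract_row(SCHEMAS[name], body))
```

Calling handle_message with a matching topic appends one row to the list kept under the schema name; a message whose topic matches no namespace leaves the store unchanged.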

datagrepper

datagrepper is a web frontend written in the TurboGears 2 framework, run with Apache httpd and WSGI. It accepts queries to the statistics database and returns the requested information.

Depending on implementation, datagrepper may or may not need access to datanommer's configuration file. If the database is SQL-backed (for example, PostgreSQL), datagrepper can determine each schema from the table layouts. If a NoSQL database is used, datanommer could put information about the schema in the database. Alternatively, datagrepper can simply have access to datanommer's configuration file.
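
As an illustration of the SQL-backed option, the following sketch reads schemas from table layouts. It makes several assumptions: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, it assumes one table per schema, and the function name discover_schemas is hypothetical. A PostgreSQL deployment would query its own catalog instead of the SQLite-specific statements shown here.

```python
import sqlite3


def discover_schemas(conn):
    """Map each table name to a {column name: declared type} dictionary."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    return {
        table: {
            column[1]: column[2]
            for column in conn.execute(f'PRAGMA table_info("{table}")')
        }
        for table in tables
    }
```

The returned dictionary contains the information the index page needs: the available schemas and the fields that can be fetched or searched.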

The index page of datagrepper shows available schemas and what fields can be fetched or searched. It outputs this list in HTML or JSON.

The /query URI accepts a query string as either a GET or POST request. Query string variable names match those of the database fields. /query accepts Django-like field lookup arguments (for example, sending the query string date__lte=2011-12-31 returns rows in the table where the "date" field is less than or equal to December 31, 2011). /query accepts a __format argument to output data in JSON or CSV.
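
The query syntax above can be sketched as follows. This is a minimal example, not the datagrepper implementation: the set of supported lookups, the default of JSON when no __format argument is given, and the function names are assumptions. Values are compared as strings, which orders ISO 8601 dates such as 2011-12-31 correctly but would not suit numeric fields.

```python
import operator
from urllib.parse import parse_qsl

# Assumed subset of Django-style lookup suffixes and their comparisons.
LOOKUPS = {
    "exact": operator.eq,
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
}


def parse_query(query_string):
    """Split a query string into (field, lookup, value) filters and a format."""
    filters, output_format = [], "json"  # default format is an assumption
    for key, value in parse_qsl(query_string):
        if key == "__format":
            output_format = value
            continue
        # "date__lte" becomes field "date" with lookup "lte";
        # a bare field name means an exact match.
        field, _, lookup = key.partition("__")
        filters.append((field, lookup or "exact", value))
    return filters, output_format


def apply_filters(rows, filters):
    """Keep the rows that satisfy every filter, comparing values as strings."""
    return [
        row for row in rows
        if all(
            LOOKUPS[lookup](str(row[field]), value)
            for field, lookup, value in filters
        )
    ]
```

A real service would translate the filters into database queries instead of filtering rows in memory, and would reject lookup names it does not know; here an unknown lookup raises KeyError.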

(Thought: provide additional data outputs to work with other visualization programs, such as https://lwn.net/Articles/504741/)

datagrepper client Python library

A Python library for accessing datagrepper will automate the intricacies of downloading data via HTTP, using gzip compression, continuing queries and converting the JSON output to a Python object.
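
A rough sketch of what such a library might do, using only the Python standard library. The function names, the base URL passed by the caller, and the choice to always request JSON are assumptions for illustration. Continuing queries is left out because the continuation mechanism is not yet specified.

```python
import gzip
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(base_url, **params):
    """Build a /query request that asks for JSON and allows gzip compression."""
    query = urlencode({**params, "__format": "json"})
    return Request(
        f"{base_url}/query?{query}",
        headers={"Accept-Encoding": "gzip"},
    )


def decode_response(body, content_encoding=None):
    """Decompress the body if the server gzipped it, then parse the JSON."""
    if content_encoding == "gzip":
        body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))


def query(base_url, **params):
    """Send the query and return the decoded JSON as a Python object."""
    with urlopen(build_request(base_url, **params)) as response:
        return decode_response(
            response.read(), response.headers.get("Content-Encoding")
        )
```

With this shape, a caller would write something like query(base_url, date__lte="2011-12-31") and receive a list or dictionary, without handling HTTP headers or compression directly.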

dataviewer

dataviewer is a web frontend written in the TurboGears 2 framework, run with Apache httpd and WSGI. It makes queries to datagrepper using the Python client library and displays data in various formats (such as tables or charts).

The specific plan for defining what displays are available and how they get data is being discussed in the #fedora-apps IRC channel on freenode.

Requirements for release

  1. The following applications must send activity or log messages over the message bus:
    • Apache httpd
    • MediaWiki
    • FAS
    • MirrorManager
    • Bodhi
    • Koji
    • AutoQA
    • Git (pkgs.fedoraproject.org and git.fedorahosted.org)
  2. The datanommer service must run, connect to a message bus, listen for activity, parse activity messages and store data into a database for all the above services.
  3. Data from before datanommer began running must be gathered from log files or application databases and placed in the database.
  4. The datagrepper service must run and respond to basic queries. The data schema for each infrastructure application and the query syntax must have documentation, and examples in that documentation must function. The service must provide responses in JSON and compress a response when requested.
  5. Queries on the Statistics wiki page using the above application data must exist in dataviewer.
  6. Documentation must exist for:
    • Adding schemas to datanommer
    • Using the datanommer API
    • Using the datagrepper Python client library
    • Adding displays to dataviewer

Use cases

Within six months, statistics++ should handle the following use cases:

  • Adam wants information on wiki edits made in 2011. He has no programming experience, but he could work with the data if he could import it into a spreadsheet program.
  • Brenda needs to know how often Fedora systems requested repodata for each architecture from MirrorManager, so that she can inform FESCo's debate on demoting an architecture to secondary.
  • Cathy is a journalist and wants to determine the year-by-year growth rate of the Fedora user base and compare that to the year-by-year growth rate of the Fedora contributor base.
  • David wants to see how many packages were available at each release's end-of-life and whether the rate of change is increasing or decreasing.
  • Ethan of the Websites team wants to see how often a certain page is accessed so that he can decide whether to remove it.
  • Fred wants to determine how many packages that are required to stay in testing for a certain time actually receive positive or negative karma in Bodhi.
  • Giles wants to see information on mailing list user counts over time.

Relationship to other services

statistics++ is directly related to the messaging bus project (fedbus and busmon).

statistics++ is indirectly related to every other infrastructure application, as we wish to include every infrastructure application in statistics++ eventually.

Reviewers

(Subject to change)

Schedule and milestones

Milestones aren't likely to change, but dates are subject to wild change depending on the status of messaging support in infrastructure.

Date Milestone
2012-04-13
2012-04-20 datanommer written:
  • Application configuration (messaging configuration and data schema) format complete
  • Successfully listens to messages and stores them in the database
2012-04-27 datanommer packaged for EPEL and in production infrastructure (or staging if during a change freeze)
2012-05-21 datagrepper written:
  • Automatically determines schemas and lists them on the index page
  • /query functions as advertised

(The long time allotted here accounts for my inexperience with TurboGears 2, the preferred web framework for Fedora Infrastructure.)

2012-05-28
  • datagrepper packaged for EPEL and in production infrastructure (or staging if during a change freeze)
  • datagrepper Python client library done and functions as advertised
2012-06-29 dataviewer written:
  • Capable of displaying most if not all data on the Statistics wiki page
2012-07-13 dataviewer packaged for EPEL and in production infrastructure (or staging if during a change freeze)

Open issues

  • How does the datanommer configuration file define data types? (Currently thinking SQLAlchemy types will work best)
  • How should messages sent while datanommer is not listening be handled?
  • Should datanommer check for duplicate messages (such as reading in log files during a time period when datanommer was running)? If so, should this be configured per-schema?
  • How should datagrepper handle excessively large queries? Some large queries may take longer than a normal HTTP timeout to generate. Some ideas:
    • Have a response that means "your query is generating, here's a code you can check to see if you can download it." Advantages: server can process query when it has idle time; downloads have less HTTP request overhead. Disadvantages: user has to wait for data; server has to retain data for some time period.
    • MediaWiki-style query-continue messages that tell the client which query string variables to change to fetch the next set of results
  • Should we use RRD as a secondary database for faster queries and rendering?
  • How should dataviewer be configured?
  • Should the dataviewer component be a separate web application or should it be part of the Fedora Community web framework?
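
To make the query-continue idea for large queries concrete, here is a minimal client-side sketch. Nothing in it is settled design: the `rows` and `query-continue` response keys, the `offset` variable, and the page size are assumptions, and the HTTP call is replaced by a local stand-in function.

```python
def fetch_all(fetch_page, params):
    """Collect rows across pages by following continuation hints."""
    rows = []
    params = dict(params)  # copy so the caller's dict is not modified
    while True:
        response = fetch_page(params)
        rows.extend(response["rows"])
        continuation = response.get("query-continue")
        if not continuation:
            return rows
        # Apply the server's suggested changes to the query string variables.
        params.update(continuation)


# Stand-in for an HTTP call: serves DATA two rows at a time.
DATA = [{"id": n} for n in range(5)]


def fake_fetch_page(params):
    offset = int(params.get("offset", 0))
    response = {"rows": DATA[offset:offset + 2]}
    if offset + 2 < len(DATA):
        response["query-continue"] = {"offset": offset + 2}
    return response


all_rows = fetch_all(fake_fetch_page, {"table": "wiki_edits"})
```

Each response stays small enough to generate within a normal HTTP timeout, at the cost of more requests per query.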

Resources for information

Responsible parties