From Fedora Project Wiki

Please use the "+" button at the top of the page to add your thoughts. This will split each one into a section so that discussion can follow.

Sign your comments with --~~~~!

Jsmidt's suggestion

Sorry to list what seems like "everything", but I believe these statistics are all important to various groups of people. As for use cases, each of these could be important for marketing: "Look, we fix a high percentage of bugs", "We have lots of new packages coming in", "Documentation is really improving", etc. Furthermore, each thing I list helps establish the health of the project in each area: "Are we doing enough to encourage translators to contribute?", "Are we fixing bugs well?", etc.

  • Number of people joining the project as a function of time
  • Number of Ambassadors as a function of time
  • Number of translators as a function of time
  • Number of BugZappers as a function of time
  • Number of packagers as a function of time
  • Number of people on the docs team as a function of time
  • Number of designers as a function of time
  • Number of wiki edits as a function of time
  • Number of packages in Fedora as a function of time
  • Number of people using Fedora as a function of time
  • Number of bugs opened each week as a function of time
  • Number of bugs closed each week as a function of time
  • Function of time as a function of time. (Better be linear).

--Jsmidt 01:16, 18 June 2009 (UTC)

Thanks for the wonderful suggestions, I'll get these integrated into the main use cases page shortly.
And just so you know — it never feels like time is linear ;) --Ian Weller 01:22, 18 June 2009 (UTC)

Statistics about the distribution itself

I don't know if this really is in alignment with this effort but there are some metrics that could be used to monitor the "health" and development of the distribution or various parts of it:

  • # of packages in comps and the comps groups
    • I have a bar graph in mind with the bar representing all packages and the groups are coloured parts showing the relative size.
If you don't mind me asking, what purpose would this use case serve the Fedora community? --Ian Weller 14:31, 25 June 2009 (UTC)
It would give an overview of how many packages the different application domains contain and how they develop over time. It would also show whether the comps groups grow together with the distribution (or not). --Ffesti 14:21, 30 June 2009 (UTC)
  • amount of content in the packages by file type
    • I have a script for the stats on Features/NoarchSubpackages. It could probably be extended to a finer-grained categorization of the package contents.
    • don't know whether this really makes sense
    • automatically generating the noarch stats would of course be a big help for the Feature
  • Connectivity of the dependency graph. We often have the problem that packages accidentally require packages that they really shouldn't (like a large part of GNOME being pulled in by fedora-release). The question is whether we can calculate a number for the distribution, or for each package, that makes such changes detectable. Such a metric could also be used to further thin out the dependencies throughout the distribution. Maybe mixing this with the comps information can make it easier to find the interesting packages.
I think that might be outside the scope of this project, but I'll definitely give it some thought. --Ian Weller 14:35, 25 June 2009 (UTC)
Figuring out the details and implementing them might be a bit too much for this project. Anyway, if you like the idea and know how to hook in a "statistic module" drop me a note. I might do that as a nice side project. --Ffesti 17:03, 30 June 2009 (UTC)
  • # of packages with a common postfix. There are a few very common postfixes in the package names: -devel, -doc, -data, -common. Knowing the numbers could give a better impression of what the distribution looks like.
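
The postfix count in the last item could be sketched roughly as follows. This is only an illustration: the function name `count_suffixes` and the sample package list are made up here, and a real run would need the package names from the repository metadata.

```python
# Postfixes named in the list item above.
SUFFIXES = ("-devel", "-doc", "-data", "-common")


def count_suffixes(package_names, suffixes=SUFFIXES):
    """Count how many package names end with each postfix."""
    counts = {suffix: 0 for suffix in suffixes}
    for name in package_names:
        for suffix in suffixes:
            if name.endswith(suffix):
                counts[suffix] += 1
                break  # a name can end with at most one of these postfixes
    return counts


# Hypothetical sample input, not real repository data.
sample = ["glibc-devel", "glibc-common", "bash", "bash-doc", "foo-devel"]
print(count_suffixes(sample))
```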

--Ffesti 10:04, 18 June 2009 (UTC)
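
One possible reading of the dependency graph idea above is to count, for each package, how many packages it pulls in transitively, and to flag packages whose count jumps between two snapshots. The sketch below assumes the dependencies are already available as a plain dict of package name to direct requirements; the names `closure_sizes` and `flag_growth`, the threshold, and the sample graphs are all hypothetical.

```python
def closure_sizes(requires):
    """Return, per package, the number of packages it requires transitively.

    `requires` maps a package name to an iterable of its direct requirements.
    """
    sizes = {}
    for pkg in requires:
        seen = set()
        stack = list(requires.get(pkg, ()))
        while stack:
            dep = stack.pop()
            if dep == pkg or dep in seen:
                continue  # skip self-references and already visited packages
            seen.add(dep)
            stack.extend(requires.get(dep, ()))
        sizes[pkg] = len(seen)
    return sizes


def flag_growth(before, after, threshold):
    """List packages whose transitive requirement count grew by >= threshold."""
    return sorted(p for p in after if after[p] - before.get(p, 0) >= threshold)


# Hypothetical snapshots: in `new`, fedora-release accidentally requires gnome-shell.
old = {"fedora-release": [], "gnome-shell": ["gtk", "glib"], "gtk": ["glib"], "glib": []}
new = dict(old, **{"fedora-release": ["gnome-shell"]})
print(flag_growth(closure_sizes(old), closure_sizes(new), threshold=2))
```

Whether a simple count like this is the right "number" is an open question; it is just the cheapest metric that would make the fedora-release example visible.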

talk/action ratios

  • mailing lists
    • posts
    • unique posters
    • posters from $group or $company
  • IRC channels (perhaps via meeting minutes/logs)
    • participants
    • lines per participant
  • for both of the above, it'd be interesting to see someone's "talk to action" ratio, where you can define (at search time) what is "talk" and what is "action" (wikipage edits could be both, etc. and definitely sending out emails is an important part of getting work done around here.)
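
A rough sketch of the "define talk and action at search time" idea: the caller passes in two predicates, so the same event log can be sliced in different ways. The event kinds (`list_post`, `irc_line`, `wiki_edit`, `commit`) and the function name are invented for this example.

```python
from collections import defaultdict


def talk_action_ratio(events, is_talk, is_action):
    """Compute talk/action per user from (user, kind) pairs.

    `is_talk` and `is_action` are caller-supplied predicates on the event
    kind; an event may count as both. Users with no actions get None.
    """
    talk = defaultdict(int)
    action = defaultdict(int)
    for user, kind in events:
        if is_talk(kind):
            talk[user] += 1
        if is_action(kind):
            action[user] += 1
    users = set(talk) | set(action)
    return {u: (talk[u] / action[u] if action[u] else None) for u in users}


# Hypothetical event log; here wiki edits count as both talk and action.
events = [
    ("alice", "list_post"), ("alice", "irc_line"), ("alice", "commit"),
    ("bob", "wiki_edit"), ("carol", "list_post"),
]
ratios = talk_action_ratio(
    events,
    is_talk=lambda kind: kind in {"list_post", "irc_line", "wiki_edit"},
    is_action=lambda kind: kind in {"commit", "wiki_edit"},
)
print(ratios)
```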

Mel Chua 18:41, 18 June 2009 (UTC)

Package information

  • Number of packagers / package
  • Number of commits / day
  • Packages altered / month

—Preceding unsigned comment added by Mmcgrath (talk • contribs)

Marketing Use Cases

To go along with Project FooBar.

Where are our visitors coming from?

What types of information are they consuming most?

Do they prefer one content type over another, e.g. audio over video?

More fine-grained metrics about news-related posts, e.g. does a given post have broad reach?

A way to judge international uptake of all types of posts, and which languages should we focus translations on?

Does posting certain types of content lead to attrition? (attrition=drop in people coming to the site)

Maybe something about our rate of new people viewing material or visiting sites?

Activity cycles, i.e. are there periods of time that are more conducive to posting content?

  • Pre-Release
  • Post-Release
  • Around Events
  • When it's quiet

This list goes on...

Jack Aboutboul on 2009.06.30

Mapping events to the master timeline

For all of the stats that track against time, it would be useful to map real world events on a timeline for comparison and analysis.

Examples include regular, irregular, and rare events:

  • Alpha, beta, RC
  • Release
  • Mid-point between releases
  • FUDCon
  • Fedora Activity Days
  • When a bunch of engineers disappear to work on RHEL or other products
  • Security breach
  • Other long system downtime
  • Changes in Infrastructure scale (that might permit greater wiki edits, for example)
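
As a minimal sketch of the mapping, assuming a statistic is stored as weekly (start date, value) pairs and events as (date, label) pairs, each week can simply be tagged with the events that fall inside it. The function name and all dates and values below are invented placeholders, not real project data.

```python
from datetime import date, timedelta


def annotate_weeks(weekly_counts, events):
    """Attach event labels to the week in which each event occurred.

    weekly_counts: list of (week_start, value) tuples.
    events: list of (day, label) tuples.
    Returns a list of (week_start, value, labels) tuples.
    """
    rows = []
    for start, value in weekly_counts:
        end = start + timedelta(days=7)
        labels = [label for day, label in events if start <= day < end]
        rows.append((start, value, labels))
    return rows


# Placeholder data: two weeks of wiki edit counts and two events.
weekly_edits = [(date(2009, 1, 5), 120), (date(2009, 1, 12), 310)]
events = [(date(2009, 1, 13), "Beta"), (date(2009, 1, 16), "Security breach")]
for row in annotate_weeks(weekly_edits, events):
    print(row)
```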

--Quaid 09:01, 6 July 2009 (UTC)

Automagic next-gen interpretation and analysis

Similar to the 'talk to action' measurement mentioned elsewhere in this Talk page.

What could we learn if we used tools such as natural language analysis? Or even really clever regular expression matching?

What if we could cross-compare activities of users in lists/IRC/blog posts with the skills and experiences of that user as captured in an opt-in database?

There is a level where we can seriously map based on our experiences with community building. For example, when the level of people not participating on a project list originally sparked by Red Hat reaches a certain level (~40%?), we can see that the project is more controlled and influenced by the wider community than just being a pet project of Red Hat.

For example, if we had these items ...

  • Database of skills/experience in FAS
  • Identity matching for IRC nicks helping in #fedora with FAS accounts
  • Natural language and expression matching data that makes assumptions based on rules created by humans

... we could determine something otherwise arbitrary, such as, "FAS user 'juansmith' is a Linux sysadmin expert, has tons of experience with GFS/LVM, and is a member of the bug triage and IRC helper teams. By analyzing IRC logs with natural language and pattern matching tools, we can determine that 'juansmith' asks and answers a large amount of questions about LVM and far less about GFS in #fedora."

In the past, people were interested in or scared of such data because it looks too much like performance analysis, as in, "Let's give 'juansmith' a t-shirt for answering 100 questions in #fedora!!!11!!!1!1!"

I can see other uses for this data. We can get an idea of where we have expertise in the community for helping each other, and where we do not. What kinds of problems plague our users the most, beyond the stories we tell and what is captured in the wiki? Who and what type of users are asking for what kind of help, how much help are they getting, do they seem to be coming back for more such help, is the help on topic for the channel, etc.?

--Quaid 23:55, 6 July 2009 (UTC)

Package review quotient

It's occasionally useful to know the number of reviews done per review submitted (or the reciprocal, since most folks review far less than they submit).

It's just two bugzilla queries: component->"Package Review", reporter->address gives you submitted reviews and assignee->address gives you reviewed tickets. There are some corner cases but by and large that's close enough.
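
Once the two query results are in hand, the quotient itself is a one-liner. The sketch below does not talk to Bugzilla at all; it assumes the tickets have already been fetched into plain dicts whose keys (`component`, `reporter`, `assignee`) mirror the fields named above, and the function name and addresses are hypothetical.

```python
def review_quotient(tickets, address):
    """Return reviews done per review submitted for `address`.

    Returns None when the person has submitted no reviews. Corner cases
    (reassigned or self-assigned tickets, etc.) are ignored here.
    """
    reviews = [t for t in tickets if t["component"] == "Package Review"]
    submitted = sum(1 for t in reviews if t["reporter"] == address)
    reviewed = sum(1 for t in reviews if t["assignee"] == address)
    if submitted == 0:
        return None
    return reviewed / submitted


# Hypothetical ticket records.
tickets = [
    {"component": "Package Review", "reporter": "a@example.com", "assignee": "b@example.com"},
    {"component": "Package Review", "reporter": "a@example.com", "assignee": "c@example.com"},
    {"component": "Package Review", "reporter": "b@example.com", "assignee": "a@example.com"},
    {"component": "kernel", "reporter": "a@example.com", "assignee": "a@example.com"},
]
print(review_quotient(tickets, "a@example.com"))
```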

Tibbs 20:06, 11 July 2009 (UTC)

exposing some statistics on the distribution lists

I can see from the "would like" list published on the Statistics_2.0 page that there is interest in making some mailing list statistics available:

Mailing lists

  • List activity
  • Popular threads
  • Most active posters
  • Number of subscriptions/unsubs over time

I can think of some useful visualizations of this data if it were available in some sanitised format.