Features/ABRTBacktraceDeduplication

= ABRT Backtrace Deduplication Service =

Summary
The backtrace deduplication service addresses the large number of duplicate crash reports that ABRT submits to Red Hat Bugzilla. It helps ABRT users find duplicate reports before filing a new bug, and it helps package maintainers triage, reassign, and merge bugs that have already been reported.

Owner

 * Name: Karel Klic
 * Email: kklic at redhat.com


 * Name: Michal Toman
 * Email: mtoman at redhat.com


 * Name: Miroslav Lichvar
 * Email: mlichvar at redhat.com


 * Name: Jan Smejda

Current status

 * Targeted release: Fedora 17
 * Last updated: 2012-03-26
 * Percentage of completion: 100%

Detailed Description
The backtrace deduplication server is a collection of newly developed tools that will be deployed on the ABRT Retrace Server hardware, which is part of the Fedora infrastructure. ABRT will contain a client tool and integration with the server.

Benefit to Fedora

 * 1) Red Hat Bugzilla receives many duplicate crash reports from ABRT clients, even for a single component. This makes ABRT reports less useful and causes developers to give them lower priority. Bugzilla also receives many low-quality reports, which should be closed without intervention from maintainers. For example, the simple-scan component is heavily affected: many of its bug reports are duplicates, and some reports incorrectly show __libc_message and similar functions as the crash function.
 * 2) Red Hat Bugzilla contains multiple crash reports filed against end-user applications that are all caused by a single bug in a library. These reports are then analyzed repeatedly by different developers, which wastes their time.

Scope
 * 1) Implementation of backtrace metrics and indexes in Btparser:
   * Damerau-Levenshtein distance
   * Jaro-Winkler distance
 * 2) Implementation of backtrace optimization in Btparser.
 * 3) Backtrace deduplication service for C/C++ backtraces, which takes a backtrace and a component, and checks backtraces from all related components (those of libraries used by the crashed binary) in Bugzilla.
   * name: faf-btserver-find-duplicates
 * 4) HTTP interface to the backtrace deduplication service, implemented as a CGI script:
   * name: faf-btserver-cgi
   * must contain a machine interface (plain text)
   * must contain a human interface (HTML)
   * distinguishes between the two by reading the HTTP_ACCEPT environment variable
   * Apache configuration file to activate the CGI script
   * requires a backtrace, component name, and operating system version from the user
   * responds with a list of bug ids, bug components, operating system versions, and similarities:
     * 625354 glib2 14 94%
     * 688952 glib2 15 94%
     * 654789 emacs 14 92%
 * 5) Crash report cleanup service, which merges crashes that are already reported in Bugzilla. It also finds low-quality reports and duplicates, and closes or reassigns them. The implementation consists of four scripts:
   * faf-btserver-cluster
     * The merging is done on a component level, where similar bugs from the same component are merged, and also on a cross-component level, where bugs from applications are matched to those of their library dependencies, and bugs in libraries are detected by searching for duplicates between components with shared dependencies.
     * Achieve the right balance between blaming the application and blaming the library. For example, many applications crash on the same library call, but we can reasonably assume the bug is not in the called function itself.
     * Compute distances and similarity indices between a bug (the backtrace of the bug) and all relevant bugs
     * Compute backtrace quality
     * Store the computed data in a bug report
     * The number of crash combinations to check is huge. Optimizations might be needed to limit checks to backtraces having the same library calls on the stack.
   * faf-btserver-prepare-actions
     * find similar bugs in the bug reports
     * check bug statuses and generate a list of desired actions to be performed on Bugzilla
   * faf-btserver-push-actions-bugzilla
     * performs the desired actions on Bugzilla
     * if a bug that is filed on an application but belongs to a library is detected, it will either be reassigned or a comment will be added
   * faf-btserver-actions-log: generates a log of desired actions on Bugzilla in a text file; this is useful for development, tweaking, and debugging
 * 6) Synchronization script to update server metadata: bugs, backtraces, builds, RPMs
 * 7) ABRT client using the backtrace deduplication server
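
To illustrate how a distance metric from the list above can be turned into a similarity percentage, here is a minimal sketch in Python. It is not the Btparser implementation (which is written in C). It assumes a backtrace is reduced to a list of function names, uses the restricted "optimal string alignment" variant of Damerau-Levenshtein distance, and normalizes by the longer backtrace. The function names `similarity` and `rank_candidates`, the threshold value, and the bug tuples are hypothetical.

```python
def damerau_levenshtein(a, b):
    """Optimal string alignment distance between two sequences of frames.

    Counts insertions, deletions, substitutions and transpositions of
    adjacent frames (restricted Damerau-Levenshtein variant).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def similarity(a, b):
    """Map the distance to a value in [0, 1]; 1.0 means identical backtraces."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - damerau_levenshtein(a, b) / longest


def rank_candidates(frames, known_bugs, threshold=0.9):
    """Return (bug_id, component, percent) for bugs at or above the threshold.

    known_bugs is an iterable of (bug_id, component, frames) tuples; this
    shape is an assumption made for the example.
    """
    ranked = []
    for bug_id, component, other in known_bugs:
        score = similarity(frames, other)
        if score >= threshold:
            ranked.append((bug_id, component, round(score * 100)))
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked
```

Comparing lists of function names rather than raw text keeps the metric independent of addresses and source line numbers, which typically differ between builds.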

How To Test

 * 1) via ABRT
 * 2) via web interface
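
When testing the web interface, both output modes are worth exercising, since the CGI script chooses between plain text and HTML based on the HTTP_ACCEPT environment variable. The following is a minimal sketch of that negotiation, not the faf-btserver-cgi source. The function names, the simplified Accept parsing (quality values are ignored), and the HTML layout are assumptions. The sample rows are the ones given in the Scope section.

```python
import html


def wants_html(accept_header):
    """Return True when the Accept header lists text/html (q-values ignored)."""
    media_types = [part.split(";")[0].strip().lower()
                   for part in accept_header.split(",")]
    return "text/html" in media_types


def render(matches, accept_header):
    """Render matches as (content_type, body).

    matches is a list of (bug_id, component, os_version, similarity_percent).
    """
    if wants_html(accept_header):
        rows = "".join(
            "<tr><td>{}</td><td>{}</td><td>{}</td><td>{}%</td></tr>".format(
                bug_id, html.escape(component),
                html.escape(str(os_version)), percent)
            for bug_id, component, os_version, percent in matches)
        return "text/html", "<table>" + rows + "</table>"
    # Machine interface: one match per line, fields separated by spaces.
    lines = ["{} {} {} {}%".format(*match) for match in matches]
    return "text/plain", "\n".join(lines) + "\n"


def respond(environ, matches):
    """Build a CGI response; a web server passes the Accept header as HTTP_ACCEPT."""
    content_type, body = render(matches, environ.get("HTTP_ACCEPT", ""))
    return "Content-Type: {}; charset=utf-8\r\n\r\n{}".format(content_type, body)
```

A browser normally sends an Accept header containing text/html and gets the table, while a client that sends no Accept header, or only text/plain, gets the line-oriented output.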

User Experience

 * 1) Maintainers: ABRT will file fewer duplicate bugs
 * 2) Maintainers: Bugs across components will be marked as duplicates, either
   * by adding a comment to each bug with links to the other bugs, or
   * by closing all bugs except one as duplicates, with the remaining open bug reassigned to a common library

Dependencies
None

Contingency Plan
ABRT continues to use duplicate hashes to detect duplicates, as before. Without the backtrace deduplication server, ABRT bugs are still filed on the software component that owns the crashed binary. Duplicates within a single component can be closed by extending an existing script, without having a server deployed.
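
The fallback of closing duplicates within a single component can be pictured with the sketch below. It only illustrates hash-based grouping and does not reproduce ABRT's actual duplicate hash or the existing script. The hashing scheme (SHA-1 over the component name and the top three function names), the input shape, and the rule that the lowest bug id is kept open are all assumptions.

```python
import hashlib
from collections import defaultdict


def duplicate_hash(component, frames, depth=3):
    """Hypothetical duplicate hash: component plus the top `depth` frames."""
    payload = "\n".join([component] + list(frames[:depth]))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def group_duplicates(bugs):
    """Group bugs that share a hash.

    bugs is an iterable of (bug_id, component, frames). Returns a dict that
    maps the lowest bug id of each group to the ids that would be closed
    as its duplicates.
    """
    groups = defaultdict(list)
    for bug_id, component, frames in bugs:
        groups[duplicate_hash(component, frames)].append(bug_id)
    result = {}
    for ids in groups.values():
        ids.sort()
        if len(ids) > 1:
            result[ids[0]] = ids[1:]
    return result
```

Because the component name is part of the hash, this approach cannot match an application crash to a library bug, which is the gap the deduplication server is meant to close.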

Documentation
No documentation is currently available.

If you want to see more details about implementation, you can check the source code:
 * See faf-btserver-* source code in Faf repository
 * See btparser source code (esp. lib/metrics.[ch]) in Btparser repository

Release Notes
Fedora's bug reporting tool (ABRT) now uses new, sophisticated server-side algorithms to discover duplicate bugs and direct new reports to the right operating system component.

Comments and Discussion

 * See Talk:Features/ABRTBacktraceDeduplication