Summer Coding 2010 ideas - Universal Build-ID

More information
The main page for Summer Coding 2010 ideas is Category:Summer Coding 2010 ideas.

Summary
Build-IDs are currently being put into binaries, shared libraries, core files and related debuginfo files to uniquely identify the build a user or developer is working with. There are a couple of conventions in place to use this information to identify "currently running" or "distro installed" builds. This helps with identifying what was being run and matching it to the corresponding package, sources and debuginfo, for tools that want to help the user see what is going on (at the moment mostly when things break). We would like to extend this to a more universal approach that helps people identify historical, local, non- or cross-distro or organisational builds, so that Build-IDs become useful outside the current "static" setup and retain information over time and across upgrades.

Build-ID background
Build-IDs are unique identifiers of "builds". A build is an executable, a shared library, the kernel, a module, etc. You can also find the build-id in a running process, a core file or a separate debuginfo file.

The main idea behind Build-IDs is to make ELF files "self-identifying": a Build-ID should uniquely identify a final executable or shared library. The default Build-ID calculation (done by the linker through --build-id, see the ld manual) computes a sha1 hash (160 bits/20 bytes) over all the ELF header bits and section contents in the file. The result is therefore unique among the set of meaningful contents for ELF files, and identical whenever the output file would otherwise have been identical. GCC now passes --build-id to the linker by default.
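To make the "self-identifying" idea concrete, here is a minimal sketch of pulling a Build-ID out of raw ELF note data. It assumes the Build-ID is stored as a GNU note (name "GNU", type NT_GNU_BUILD_ID) and that note fields are padded to 4 bytes. The function name find_build_id and the hand-built sample note are illustrative only; locating the note section inside a real file is left out.

```python
import struct

NT_GNU_BUILD_ID = 3  # note type used for GNU build-ids


def find_build_id(notes: bytes, little_endian: bool = True):
    """Return the build-id found in raw ELF note data as hex, or None.

    `notes` is the raw content of a note section or segment.
    Assumption: name and descriptor are each padded to 4 bytes.
    """
    fmt = "<III" if little_endian else ">III"
    offset = 0
    while offset + 12 <= len(notes):
        namesz, descsz, ntype = struct.unpack_from(fmt, notes, offset)
        offset += 12
        name = notes[offset:offset + namesz]
        offset += (namesz + 3) & ~3  # skip name plus padding
        desc = notes[offset:offset + descsz]
        offset += (descsz + 3) & ~3  # skip descriptor plus padding
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\0":
            return desc.hex()
    return None


# Hand-built note with a 20-byte (sha1-sized) descriptor, for illustration only.
example_id = bytes(range(20))
example_note = (
    struct.pack("<III", 4, len(example_id), NT_GNU_BUILD_ID)
    + b"GNU\0"
    + example_id
)
print(find_build_id(example_note))
```

A sha1-based Build-ID always comes out as 40 hex digits, which is the form used in the rest of this page.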

When an executable or shared library is loaded into memory, the Build-ID is loaded with it, so a core dump of a process also has the Build-IDs of the executable and the shared libraries embedded. When debuginfo is separated from the main executable or shared library into .debug files, the original Build-ID is copied over as well. This makes it easy to match a core file or a running process to the original executable and shared library builds, and matching those against the debuginfo files that provide more information for introspection and debugging should be trivial.

Fedora has had full support for build-ids since Fedora 8: https://fedoraproject.org/wiki/Releases/FeatureBuildId

Getting Build-IDs
A simple way to get the build-id(s) is through eu-unstrip (part of elfutils).

 * build-id from an executable, shared library or separate debuginfo file:

$ eu-unstrip -n -e FILE

 * build-ids of an executable and all shared libraries from a core file:

$ eu-unstrip -n --core COREFILE

 * build-ids of an executable and all shared libraries of a running process:

$ eu-unstrip -n --pid PID

 * build-id of the running kernel and all loaded modules:

$ eu-unstrip -n -k
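Tools that consume this output need to split each line into its fields. The sketch below assumes the five-field layout described in the eu-unstrip manual (START+SIZE, BUILDID@ADDRESS or "-", file, debug file, module name). The sample line is made up for illustration, and parse_unstrip_line is a hypothetical helper, not part of elfutils.

```python
from typing import NamedTuple, Optional


class Module(NamedTuple):
    start: int
    size: int
    build_id: Optional[str]  # None when the module has no build-id
    file: str
    debug_file: str
    name: str


def parse_unstrip_line(line: str) -> Module:
    """Parse one line of `eu-unstrip -n` output (assumed five-field layout)."""
    # maxsplit=4 keeps a module name that contains spaces in one piece.
    addr, build, file, debug_file, name = line.split(None, 4)
    start, size = (int(part, 16) for part in addr.split("+"))
    # The build-id field looks like HEX@ADDRESS, or "-" when absent.
    build_id = None if build == "-" else build.split("@")[0]
    return Module(start, size, build_id, file, debug_file, name.strip())


# Made-up sample line; the addresses are illustrative only.
sample = (
    "0x400000+0xd9000 c7a002ba1eb1dbc7c609d2e5fb9a57f10861dbdd@0x400284 "
    "/bin/bash /usr/lib/debug/bin/bash.debug [exe]"
)
print(parse_unstrip_line(sample).build_id)
```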

Current conventions and usage
Build-IDs are only as useful as the methods we build around them to look things up based on them.

The convention currently used by Fedora (and adopted by the upstream GNU toolchain, for example in GDB, to find files) is to include links in the debuginfo package that point to the ELF file and the debuginfo file under /usr/lib/debug/.build-id/XX/YYYY (where XX are the first two hex-digits of the build id and YYYY are all the others).

So for example the bash-debuginfo package has the following files/links:

/usr/lib/debug/.build-id/c7/a002ba1eb1dbc7c609d2e5fb9a57f10861dbdd -> ../../../../../bin/bash
/usr/lib/debug/.build-id/c7/a002ba1eb1dbc7c609d2e5fb9a57f10861dbdd.debug -> ../../bin/bash.debug

These files/links are added by the debugedit and find-debuginfo.sh programs, which make sure every executable and shared library (and the separate debuginfo packages) has a Build-ID embedded and that the links above are added under /usr/lib/debug/.build-id.

This makes it extremely easy to find the executable or shared library and the corresponding debuginfo given just the build-id, provided they are installed on your system.
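The XX/YYYY convention is simple enough to sketch in a few lines. The helper name build_id_links below is hypothetical. It only computes where the two links would live and does not check whether they exist on disk.

```python
BUILD_ID_ROOT = "/usr/lib/debug/.build-id"
HEX_DIGITS = set("0123456789abcdef")


def build_id_links(build_id: str, root: str = BUILD_ID_ROOT):
    """Return (link to the ELF file, link to the .debug file) for a build-id."""
    build_id = build_id.lower()
    if len(build_id) < 3 or not set(build_id) <= HEX_DIGITS:
        raise ValueError(f"not a usable build-id: {build_id!r}")
    # XX = first two hex digits, YYYY = all the others.
    link = f"{root}/{build_id[:2]}/{build_id[2:]}"
    return link, link + ".debug"


# The bash build-id from the example above.
for path in build_id_links("c7a002ba1eb1dbc7c609d2e5fb9a57f10861dbdd"):
    print(path)
```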

Since these are files included in the rpm package, it also makes it easy to find the package that provided the executable/library, that corresponds to the build id (gdb and systemtap will suggest the right debuginfo package to install based on the build-id they found for the program you wanted to introspect). You can ask yum to install it, or use repoquery to figure out the details of the package and binary involved.

For example, you find some core file and examine it (for instance with eu-unstrip as shown above), or a long running process is spending a lot of time in some section of code, and you find out that the Build-ID corresponding to that section of code is 84153a6428b291df6d62ce906b65ee9270ec6837. Now you can use yum (or repoquery) to figure out what that thing really is:

$ yum whatprovides \*/84/153a6428b291df6d62ce906b65ee9270ec6837
glibc-debuginfo-2.11.1-6.i686 : Debug information for package glibc
Repo        : updates-debuginfo
Matched from:
Filename    : /usr/lib/debug/.build-id/84/153a6428b291df6d62ce906b65ee9270ec6837

You install that package and then you'll find:

$ ls -l /usr/lib/debug/.build-id/84/153a6428b291df6d62ce906b65ee9270ec6837
/usr/lib/debug/.build-id/84/153a6428b291df6d62ce906b65ee9270ec6837 -> ../../../../../lib/libutil-2.11.1.so

The debuginfo package will also contain the source code of libutil.so and so you can start debugging.

But this is only for the latest current/up-to-date installed repository. There is no support for historical information, local builds, cross-distro, etc. Extending the usefulness of having build-ids is what this idea is about.

How do we scale this up/down? The actual Universal Build-IDs idea
What we would like is that when you get a Build-ID for something you can easily map it to the original developer, "creator", package, distributor, executable, sources, debuginfo files, etc.

This Build-ID can come from anything really, an old executable, a core file once made but never fully investigated, some currently running process that needs to be introspected, etc. And for various reasons parts or all of the original package, the executable itself, the libraries it relied on, the debuginfo packages, etc. could all be missing on the machine.

With an old core file, it might be all you have. A system could have been upgraded since a process started running, so the executable or any of the libraries it is using might only be in memory at the moment. The debuginfo package might never have been installed.

One use case to keep in mind when reading the various examples of situations where we want Build-ID mappings to work is that of the "canonical backtrace". This is the backtrace of a process (or from a core file) as a pure Build-ID + canonpc list. A canonpc is the pc adjusted for module & prelink bias so it's relative to the original module. Such a "canonical backtrace" is useful for identifying similar crashes. It is also the minimal information you need to provide to someone with access to the full Build-ID artefacts (binaries plus debuginfo) to extract some useful information from the crash.

The "canonical backtrace" example is interesting in two ways. First, to generate a backtrace one needs access to the unwind section of the executable or shared library (for example .eh_frame, which contains the data that shows how to unwind from a particular address in a module). So given an address and the corresponding Build-ID, one wants to look up the executable/shared library associated with it on the local machine. Secondly, it shows why one might have a Build-ID in "isolation": it was sent to you for examination, as the shortest way to convey which module was involved in some crash/backtrace. Since one might want to store this extracted information over a longer period of time to see if there are patterns in the crashes users report, you also need access to a historical Build-ID database for matching it.
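The Build-ID + canonpc list can be sketched as follows. This is a simplified illustration that assumes the module and prelink bias have already been combined into a single number per loaded module. LoadedModule, canonical_backtrace and all addresses are hypothetical; obtaining the raw pc values (the actual unwinding) is not shown.

```python
from typing import List, NamedTuple, Optional, Tuple


class LoadedModule(NamedTuple):
    build_id: str
    start: int  # first address the module occupies in the process
    end: int    # one past the last address it occupies
    bias: int   # assumed combined module + prelink bias


def canonical_backtrace(
    pcs: List[int], modules: List[LoadedModule]
) -> List[Tuple[Optional[str], int]]:
    """Turn raw pc values into (build-id, canonpc) pairs."""
    frames = []
    for pc in pcs:
        for mod in modules:
            if mod.start <= pc < mod.end:
                # canonpc: the pc made relative to the original module.
                frames.append((mod.build_id, pc - mod.bias))
                break
        else:
            frames.append((None, pc))  # pc is not inside any known module
    return frames


# Illustrative addresses only; the build-ids are the ones used on this page.
modules = [
    LoadedModule("c7a002ba1eb1dbc7c609d2e5fb9a57f10861dbdd", 0x400000, 0x4D9000, 0),
    LoadedModule("84153a6428b291df6d62ce906b65ee9270ec6837",
                 0x7F0000001000, 0x7F0000004000, 0x7F0000000000),
]
for build_id, canonpc in canonical_backtrace([0x7F0000001234, 0x41F2A0], modules):
    print(build_id, hex(canonpc))
```

Two crashes in the same build then produce the same list regardless of where the modules happened to be loaded, which is what makes the list comparable across reports.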

This matching now works for your current Fedora repository, through the "hack" of adding "Build-ID symlinks" to the debuginfo packages. But we would like to scale this up and down for various other situations. Here are a couple of situations for which we would like a convenient way to store matches of Build-IDs to executable modules and a way to query such a storage. For each situation, given a Build-ID, we would like to answer some generic questions like "what is that called?", "where did that come from?", and "where do I get it now?":


 * Up in Fedora, what about getting "historical" mappings? Given a Build-ID one would want to know which repository it is in, whether koji knows about it, and for which package, which version, which architecture, etc. So whenever Fedora creates a package that might be distributed to users, they (koji) would like to register all Build-IDs somewhere so it is easy to get at the original packages that contain the build, source and debuginfo bits.


 * Up towards other distributions, since all the low-level bits are upstream in the GNU toolchain.
 * Maybe through some PackageKit hook, so a user can always easily get at the bits given a Build-ID?


 * Up towards a general build-id mapping universe (build-id.org is available).
 * Generic registration, querying and mapping of build-ids.
 * This would be some central Webby service for anyone wanting to publish information indexed by build ID. There would be an anonymous public service for doing lookups. It would be some webby query protocol (XMLRPC?) or just a URL convention for downloading a query-result payload in XML.
 * For free distros this could be an agreed-upon schema that points to distro home, major release, url (or standard way to find) distro package containing binary, package containing debuginfo, file name within that package.
 * Organisations distributing their own binaries/packages (e.g. Mozilla) or ISVs could register the Build-IDs of their applications with as much information as they are willing to share, mapping back to just the organisation, to the specific binaries plus versions, or even to pointers (URLs) to the actual bits.
 * Maybe nice to have some Common Platform Enumeration http://cpe.mitre.org/ for it?
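As one possible shape for the "URL convention plus XML payload" variant, here is a rough sketch. Everything in it is an assumption made for illustration: the base URL, the XX/YYYY.xml layout (borrowed from the symlink convention), and the element and attribute names in the payload. No such service or schema exists yet. The values are taken from the glibc example above.

```python
import xml.etree.ElementTree as ET


def lookup_url(base_url: str, build_id: str) -> str:
    """Build a query URL for a build-id (hypothetical URL layout)."""
    build_id = build_id.lower()
    return f"{base_url.rstrip('/')}/{build_id[:2]}/{build_id[2:]}.xml"


# Hypothetical payload; element and attribute names are invented here.
SAMPLE_PAYLOAD = """\
<build id="84153a6428b291df6d62ce906b65ee9270ec6837">
  <distro name="Fedora" repo="updates-debuginfo"/>
  <binary package="glibc" file="/lib/libutil-2.11.1.so"/>
  <debuginfo package="glibc-debuginfo-2.11.1-6.i686"/>
</build>
"""


def parse_payload(xml_text: str) -> dict:
    """Extract the fields a tool would need from the sample payload."""
    root = ET.fromstring(xml_text)
    return {
        "build_id": root.get("id"),
        "repo": root.find("distro").get("repo"),
        "binary": root.find("binary").get("file"),
        "debuginfo_package": root.find("debuginfo").get("package"),
    }


print(lookup_url("https://buildid.example/", "84153a6428b291df6d62ce906b65ee9270ec6837"))
print(parse_payload(SAMPLE_PAYLOAD))
```

A plain URL convention like this could be served from static files, while an XMLRPC-style protocol would need a running service; which trade-off is right is part of the design work.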


 * Down towards a local database for a lone developer.
 * This could take the form of some build wrapper that adds the Build-ID and files to some local Build-ID database.
 * A script that the developer would invoke to store build bits (plus sources) for historical reasons.
 * Or maybe just an "archive" directory in which binaries are dropped and some inotify script registering the Build-IDs for them.
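For the lone-developer case, a local Build-ID database could be as small as one SQLite table. The sketch below is only an illustration: the table layout, the function names and the sample entry are invented here, and the three stored columns simply mirror the questions "what is that called?", "where did that come from?" and "where do I get it now?".

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (and if needed create) a local build-id database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS builds ("
        " build_id TEXT PRIMARY KEY,"
        " name TEXT NOT NULL,"      # what is that called?
        " origin TEXT NOT NULL,"    # where did that come from?
        " location TEXT NOT NULL)"  # where do I get it now?
    )
    return db


def register(db, build_id, name, origin, location):
    """Record a build; registering the same build-id again overwrites it."""
    with db:  # commits on success
        db.execute(
            "INSERT OR REPLACE INTO builds VALUES (?, ?, ?, ?)",
            (build_id.lower(), name, origin, location),
        )


def lookup(db, build_id):
    """Return (name, origin, location) for a build-id, or None if unknown."""
    return db.execute(
        "SELECT name, origin, location FROM builds WHERE build_id = ?",
        (build_id.lower(),),
    ).fetchone()


db = open_db()
# Hypothetical local build, for illustration only.
register(db, "00aa11bb22cc", "mytool", "local build", "/home/dev/archive/mytool")
print(lookup(db, "00aa11bb22cc"))
```

A build wrapper, a manually invoked script, or an inotify watcher on an archive directory could all call the same register function, so the choice of trigger stays independent of the storage.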


 * Or a local shop that builds upon an existing distro, but also has (internal) apps in their organization.
 * Likely there is some explicit mechanism to register a package locally for their internal users (some local koji).


 * To totally disorganized "installs" where people move around executables all the time.
 * This would mean populating the Build-ID database with information gathered through something like updatedb or inotify mechanism.

It might very well be that no one solution works correctly/efficiently/pragmatically for all the above situations. So you might have to just pick one of the above use cases and design a registration, storage and query mechanism for it. But keep in mind that it should scale up/down to the other situations so users/tools have an easy way of "proxying" this information between the different layers. So tools can have one registration/query mechanism that works for any Build-ID that they happen to come across.
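The "proxying" between layers can be pictured as a chain of resolvers that a tool tries from the most local layer outwards. This is a minimal sketch under that assumption. The layer names, the dict-backed resolvers and the sample entries are all hypothetical, and a real resolver might query a local database, an organisation's server or a public service instead.

```python
from typing import Callable, Dict, Iterable, Optional

Resolver = Callable[[str], Optional[Dict[str, str]]]


def dict_resolver(layer: str, table: Dict[str, Dict[str, str]]) -> Resolver:
    """Make a resolver backed by an in-memory table (stand-in for a real layer)."""
    def resolve(build_id: str) -> Optional[Dict[str, str]]:
        info = table.get(build_id)
        return None if info is None else {"layer": layer, **info}
    return resolve


def resolve_build_id(
    build_id: str, resolvers: Iterable[Resolver]
) -> Optional[Dict[str, str]]:
    """Ask each layer in order and return the first answer found."""
    for resolve in resolvers:
        info = resolve(build_id)
        if info is not None:
            return info
    return None


# Hypothetical layers, ordered from most local to most general.
chain = [
    dict_resolver("local", {"00aa11bb22cc": {"name": "mytool"}}),
    dict_resolver("distro", {
        "84153a6428b291df6d62ce906b65ee9270ec6837": {"name": "/lib/libutil-2.11.1.so"},
    }),
]
print(resolve_build_id("84153a6428b291df6d62ce906b65ee9270ec6837", chain))
```

With this shape a tool needs only one query entry point, and adding a new layer means adding one more resolver to the chain.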

As a bonus, think of the interactions of the above system/database/query/xml/mechanism with:


 * Tie-in to koji, yum, PackageKit, abrt, debuginfo-fs?

As this is not a fully worked out and scoped project yet, please feel free to add your ideas to the talk page, contact the mentors with your ideas or discuss on some appropriate mailing list.