From Fedora Project Wiki
Revision as of 18:37, 14 January 2010


Easier Python Debugging

Summary

Owner

  • Email: <dmalcolm@redhat.com>

Current status

  • Targeted release: Fedora 13
  • Last updated: 2010-01-13
  • Percentage of completion: 0%

Currently I'm stuck on this issue: http://sourceware.org/ml/archer/2009-q4/msg00129.html


Detailed Description

We ship Python wrappers for numerous libraries implemented in C and C++. Bugs (either in the libraries themselves, or in the usage of those libraries) can lead to complicated backtraces from gdb, and it can be hard to figure out what's going on at the Python level.

For example, see this complex backtrace (relating to bug 536786).

Walking through the stack frames, going up from the bottom (textually), or down from the top (numerically):

  • frames 26 and below show a pygtk application starting up.
  • An event comes in at frames 24/25, and is dispatched into pulsecore (frames 23->18; pstream_packet_callback, pa_context_simple_ack_callback), which:
    • calls a Python callback (down to frame 15),
    • ...which invokes Python code down to frame 3,
    • ...where it calls back into native code, whereupon the segfault happens, calling Py_DecRef on some object pointer.

Note that as it stands, all we see from the backtrace is that Python code was run: we have no way of telling what that Python code was.

In the above example, it happens that there is a bug in the application's Python code, which is sufficiently serious to cause a SIGSEGV error. This example uses the ctypes module, which is designed to expose machine-level details. It's fairly easy to write a one-liner of Python code using this module which causes the Python process to immediately fail with either a SIGSEGV or SIGABRT.

When using "native" C/C++ libraries, it's sadly common for bugs in the library to lead to SIGSEGV errors that immediately cause the whole Python process to terminate. Beyond that, poorly-designed error-handling in such libraries uses assert() or abort() at the C level, which immediately terminates the entire process. It's useful to be able to determine what was "really" going on when this happens.

A trickier problem is when a threading assertion fails: many libraries make assumptions about threads and locks, and allow the programmer to register callbacks, but impose conditions upon the kind of code run in those callbacks. When the threads and callback-registration hooks are wrapped at the Python level, these conditions continue to be required at the Python level, but mistakes here often lead to low-level error-handling that's difficult to debug.

For example, the GTK widget library requires that all communication with the X server happen within a GDK lock, to avoid garbling the single "conversation" between the process and the X server. The common way to implement this in a multi-threaded application is to restrict all calls to GTK to a single "primary" thread. See attachment 379251 to bug 543278 for an example of where a secondary thread in an application violates this, which leads to a low-level gdk_x_error() failure in the main thread: frames 16 to 28 of this backtrace are running Python code, but it's not at all clear from the backtrace _what_ said code is actually doing.
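
The "single primary thread" discipline can be illustrated without GTK at all. The following is a minimal sketch using only the Python 3 standard library; run_on_primary, drain_primary_queue and toolkit_call are hypothetical names invented for this illustration, and nothing here calls a real GTK API. Secondary threads only queue work, and the primary thread is the only one that runs it:

```python
import queue
import threading

# Hypothetical stand-in for a toolkit's "run this on the main thread" hook.
_primary_queue = queue.Queue()


def run_on_primary(func, *args):
    """Called from any thread: schedule func(*args) for the primary thread."""
    _primary_queue.put((func, args))


def drain_primary_queue():
    """Called only from the primary thread: run everything queued so far."""
    results = []
    while True:
        try:
            func, args = _primary_queue.get_nowait()
        except queue.Empty:
            return results
        results.append(func(*args))


def toolkit_call(label):
    """Stands in for a call that must only happen on the primary thread."""
    return (label, threading.get_ident())


def worker(n):
    # A secondary thread never touches the toolkit directly; it queues work.
    run_on_primary(toolkit_call, "update-%d" % n)


def demo(num_workers=3):
    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Every queued call runs here, on the thread that called demo().
    return drain_primary_queue()
```

Each result records the identifier of the thread that actually ran the call, so a test can check that it matches the thread that drained the queue. A secondary thread calling toolkit_call directly is the kind of mistake described above.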

Current state-of-the-art for debugging CPython backtraces

Python already has a gdbinit file with plenty of domain-specific hooks for debugging CPython, and we ship it in our python-devel subpackage. If you copy this to ~/.gdbinit you can then use "pyframe" and other commands to debug things, and figure out where we are in Python code from gdb. I used it when deciphering the example backtraces referred to above.

Unfortunately:

  • this script isn't very robust; if the data in the "inferior" process is corrupt, attempting to print it can lead to a SIGSEGV within that process
  • you have to go into gdb manually and run these commands by hand, and it's hard to do this correctly; any mistakes when doing this will typically cause a SIGSEGV in the inferior process; see e.g. bug 532552
  • the script is written in the gdb language and is thus hard to work with and extend

Proposal

gdb should provide rich information on what's going on at the Python level automatically. I plan to hook this in using gdb-archer, and make it automatic:

  • Biggest win: automatically display Python frame information in PyEval_EvalFrameEx in gdb backtraces, including in ABRT:
    • Python source file, line number, and function names
    • values of locals, if available
  • name of function for wrapped C functions
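
To make the wish list above concrete, here is a rough in-process analogue written for Python 3. It is not the proposed gdb hook: it runs inside the interpreter rather than inspecting it from outside, and describe_python_stack and safe_repr are hypothetical helper names. It only shows the kind of per-frame information (file, line, function name, locals) that the backtrace should surface, with a repr wrapper that refuses to fail on awkward values:

```python
import sys


def safe_repr(value, limit=60):
    """repr() that never raises and never returns unbounded output."""
    try:
        text = repr(value)
    except Exception as exc:
        text = "<repr failed: %s>" % type(exc).__name__
    return text if len(text) <= limit else text[:limit] + "..."


def describe_python_stack(frame=None):
    """Return one dict per Python frame, innermost frame first."""
    if frame is None:
        frame = sys._getframe(1)  # start at the caller's frame
    summary = []
    while frame is not None:
        code = frame.f_code
        summary.append({
            "file": code.co_filename,
            "line": frame.f_lineno,
            "function": code.co_name,
            "locals": {name: safe_repr(val)
                       for name, val in list(frame.f_locals.items())},
        })
        frame = frame.f_back
    return summary


def inner(count):
    message = "hello"
    return describe_python_stack()


def outer():
    return inner(3)
```

Calling outer() returns a list whose first two entries describe inner and outer, including the values of count and message. A gdb-side implementation would have to recover the same data by reading the interpreter's structures in the inferior process.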


See Alex's work: http://blogs.gnome.org/alexl/2008/11/18/gdb-is-dead-long-live-gdb/ and more recently: http://blogs.gnome.org/alexl/2009/09/21/archer-gdb-macros-for-glib/

I want the Python backtrace work to be integrated with the glib backtrace work, since pygtk regularly shows me backtraces with a mixture of both.

Alex's work is in glib git: http://git.gnome.org/browse/glib/commit/?id=efe9169234e226f594b4254618f35a139338c35f which does a:

 gdb.backtrace.push_frame_filter (GFrameFilter)

See http://tromey.com/blog/?p=522 for info on this.
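
As a conceptual model only, the idea of a stack of frame filters can be sketched in plain Python 3. This does not use the gdb module and is not the archer API: Frame, make_python_filter, apply_filters and format_frame are hypothetical names, and the mapping of frame numbers to Python locations stands in for data that a real filter would have to read out of the inferior process.

```python
from collections import namedtuple

# Hypothetical record for one native stack frame; not gdb's real frame type.
Frame = namedtuple("Frame", "number function note")


def make_python_filter(python_locations):
    """Build a filter that annotates interpreter frames.

    python_locations maps a frame number to a "file:line in func" string.
    """
    def python_filter(frames):
        for frame in frames:
            location = python_locations.get(frame.number)
            if frame.function == "PyEval_EvalFrameEx" and location:
                yield frame._replace(note=location)
            else:
                yield frame
    return python_filter


def apply_filters(frames, filters):
    """Pass the frames through each filter in turn."""
    for frame_filter in filters:
        frames = frame_filter(frames)
    return list(frames)


def format_frame(frame):
    """Render a frame roughly the way a backtrace line might look."""
    text = "#%d  %s ()" % (frame.number, frame.function)
    return text + ("  [%s]" % frame.note if frame.note else "")
```

Because each filter takes an iterable of frames and yields frames, a Python-aware filter and a glib-aware filter could in principle be chained in this style, which is the kind of integration wanted here.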

This needs a more recent version of gdb than the one in F-12; I'll need to build a local copy of the "archer-tromey-python" branch of gdb to work on this.

Archer upstream: http://sourceware.org/gdb/wiki/ProjectArcher

Benefit to Fedora

Backtraces from gdb (such as those from ABRT) that involve Python code will show what's going on at the Python level, as well as at the C level. This will make it much easier for developers to read backtraces when a library wrapped by Python (e.g. PyGTK) encounters a bug.

For Python developers, it should be possible to attach to a running Python process using gdb, then run "thread apply all backtrace" to get an overview of all C and Python code running in all threads within that process. I believe this ability would be unique to Fedora, and would be valuable for Python developers seeking additional visibility into their CPython processes.
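
For comparison, the Python half of such an overview can be approximated from inside the process. The sketch below (Python 3, standard library only) uses sys._current_frames() and traceback.format_stack(); dump_all_thread_stacks, waiting_worker and demo are hypothetical names. Unlike the gdb approach, it sees no C frames and needs the interpreter to still be able to run code, which is not the case after a crash:

```python
import sys
import threading
import traceback


def dump_all_thread_stacks():
    """Return {thread name: [formatted stack lines]} for every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    report = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, "thread-%d" % ident)
        report[name] = traceback.format_stack(frame)
    return report


def waiting_worker(ready, release):
    ready.set()      # tell the main thread we are running
    release.wait()   # block here so there is a stack to look at


def demo():
    ready, release = threading.Event(), threading.Event()
    worker = threading.Thread(target=waiting_worker,
                              args=(ready, release), name="demo-worker")
    worker.start()
    ready.wait()
    try:
        return dump_all_thread_stacks()
    finally:
        release.set()
        worker.join()
```

The report returned by demo() contains an entry for "demo-worker" whose stack mentions waiting_worker, alongside an entry for the calling thread.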

Scope

This will require extensions to the python srpm, and analogous changes to the python3 srpm.

It may well require co-ordination with the gdb srpm (such as API changes), and with the glib2 changes written by Alex referred to above.

How To Test

Ideas for test cases/coverage:

  • try attaching to a running (multithreaded) Python process and ensure that "thread apply all backtrace" generates meaningful results
  • ensure it plays well with Alex's GLib/GTK work; debug a multithreaded pygtk app
  • ensure it fails gracefully if python-debuginfo isn't installed
  • ensure that it fails gracefully if the inferior process has corrupted data (e.g. overwrites on the heap)
  • ensure that it fails gracefully if the inferior process has a corrupted stack
  • ensure that it works well under ABRT. It's easy to write one-liner python scripts that abuse the ctypes module in such a way as to cause /usr/bin/python to segfault/abort:
[david@brick ~]$ python -c "import ctypes; ctypes.string_at(0xffffffff)"
Segmentation fault (core dumped)
[david@brick ~]$ python -c "import ctypes; ctypes.string_at(0x0)"
python: Objects/stringobject.c:115: PyString_FromString: Assertion `str != ((void *)0)' failed.
Aborted (core dumped)
  • repeat all of the above for python3 and python3-debuginfo

In each case, gdb should give you meaningful information at the Python level, as well as at the C level.
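
The crash cases above could be driven from a small harness. This is a sketch under assumptions, not part of the feature: it is written for Python 3 on a POSIX system, describe_exit and run_snippet are hypothetical helper names, and it only reports how the child process ended, leaving inspection of the resulting core to gdb or ABRT.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by a signal.
        try:
            return "killed by %s" % signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal %d" % -returncode
    return "exited with status %d" % returncode


def run_snippet(source, python=sys.executable):
    """Run a Python one-liner in a child process and describe how it ended."""
    proc = subprocess.run(
        [python, "-c", source],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return describe_exit(proc.returncode)
```

Passing one of the ctypes one-liners from the list above to run_snippet should then report which signal, if any, terminated the interpreter. The python argument can point at a different interpreter to cover the python3 case.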

User Experience

Dependencies

This feature will require coordination with, and possible changes in, the gdb and glib2 packages.

Contingency Plan

The contingency plan would be to remove the additional .py files, deactivating the feature.

Documentation

Release Notes

Comments and Discussion

See also this bug: https://bugzilla.redhat.com/show_bug.cgi?id=552654