From Fedora Project Wiki
Revision as of 16:54, 24 August 2010

= Some speed optimization ideas for yum =

* use Cython to compile one or more of the .py files to .c code, and compile those into DSOs
* use PyPy; this would require building out a full PyPy stack: an alternative implementation of Python. Last time I looked at the generated .c code, I wasn't comfortable debugging the result (I didn't feel that debugging a crash in it would be feasible at 3am)
* use Unladen Swallow for Python 2; this would require porting the US 2.6 stack to 2.7, and a separate Python stack
* use Unladen Swallow for Python 3: wait until it gets merged (in Python 3.3), then port yum to Python 3

Using Cython seems to be the least invasive approach.

I've looked at the generated code and it seems debuggable; I'd be able to debug issues arising.

= Notes on Cython =

From upstream, on one simple example: "Simply compiling this in Cython merely gives a 35% speedup. This is better than nothing, but adding some static types can make a much larger difference."

The approach I'm currently thinking of is to simply compile the .py files with Cython, without using any of the non-Python extensions that Cython adds.

This would ensure that yum can continue to be usable without Cython: no non-standard language features.
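As a sketch of how that build step might be wired up (the project name and module list here are hypothetical; Cython is needed only at build time, and the pure-Python depsolve.py stays in the tree unchanged), a conventional setup.py could drive the compilation:

```python
# Hypothetical setup.py sketch: build an optional accelerated DSO from
# one yum module. yum itself keeps running from the plain .py file when
# no compiled extension is present.
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="yum-accelerated",
    # annotate=True corresponds to Cython's "-a" option: it emits an
    # HTML report showing the generated C for each line of .py source
    ext_modules=cythonize(["depsolve.py"], annotate=True),
)
```

Building in place would then be `python setup.py build_ext --inplace`, run from the directory containing depsolve.py.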

It might also be appropriate to add targeted optimizations to Cython, so that it does a better job when handling yum as input.

Going further, we could perhaps introduce pragma-style comments to the yum sources, e.g.

<pre>
# cython: isinstance(foo, dict)
</pre>

which Cython could take as an optimization hint to generate specialized code for the case when foo is a PyDictObject (though it might be more productive to support exact type matches here).
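To make the intent of such a hint concrete, here is a pure-Python sketch (the function names are mine, not yum's) of the generic lookup path versus the exact-dict path the hint would let the compiler select; in the generated C, the former corresponds to PyObject_GetItem and the latter could use the dict fast path directly:

```python
def generic_lookup(foo, key):
    # works for any mapping type; pays for the full protocol dispatch
    return foo[key]

def dict_specialized_lookup(foo, key):
    # the exact-type check is what a "# cython: isinstance(foo, dict)"
    # hint would justify hoisting or eliding entirely
    if type(foo) is not dict:
        raise TypeError("hint violated: expected an exact dict")
    return dict.__getitem__(foo, key)
```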

== Example of generated code ==

See [http://dmalcolm.fedorapeople.org/python-packaging/depsolve.html depsolve.html]. You can see the generated .c code by clicking on the yellow-colored .py code. This was generated using the "-a" option to Cython. Note that it was generated using a development copy of Cython, which has support for some lambdas.

Note in particular that the generated C code has comments throughout indicating which line of .py code (in context) it corresponds to; this matters for my own comfort level in feeling that this is supportable.

== Optimization potential ==

Why would it be faster to simply compile with Cython?
* it avoids the bytecode dispatch loop: the module's operations are "unrolled" directly into libpython API calls, which the C compiler can optimize individually: error-handling paths can be marked as unlikely, and the unrolled .c code should give us better CPU branch prediction. The result should also be more directly amenable to further optimization work: C-level profiling tools such as oprofile would indicate specifically where we're spending time in the .py code.
* it avoids some stack manipulation: temporaries are stored directly as locals within the .c code, and twiddling of ob_refcnt fields can be reduced
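The costs in question are easy to see with the stdlib dis module: each opcode below is one trip around the eval loop, with its operands pushed and popped on the value stack (exact opcode names vary between CPython versions):

```python
import dis

def add(a, b):
    return a + b

# Each opcode is dispatched by the interpreter loop, and the operands
# travel via the value stack; Cython-generated C would instead perform
# the operation directly, with the temporaries held in C locals.
dis.dis(add)
```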

Using Cython "bakes in" some values for builtins: calls to the builtin "len" are turned directly into calls to PyObject_Length, rather than double-checking each time what the value of __builtins__.len is and calling it. So this is a semantic difference from regular Python, and some monkey-patching is ruled out, but I think it's a reasonable optimization.
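The difference is observable in regular Python, where rebinding the builtin changes the behavior of already-defined functions; a Cython-compiled measure() would keep calling PyObject_Length and ignore the patch. (This sketch uses the Python 3 builtins module name; Python 2 of the era spelled it __builtin__.)

```python
import builtins

def measure(seq):
    # regular Python resolves "len" at call time, via globals then builtins
    return len(seq)

orig = builtins.len
builtins.len = lambda seq: 42          # monkey-patch the builtin
patched_result = measure([1, 2, 3])    # sees the patch under CPython
builtins.len = orig                    # restore the real builtin
normal_result = measure([1, 2, 3])
print(patched_result, normal_result)   # 42 3
```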

== Pessimization risks ==

* looking at _sort_reqs(), the line
<pre>mapper = {'EQ' : 1, 'LT' : 2, 'LE' : 3, 'GT' : 4, 'GE' : 5,</pre>
uses a dictionary which is never modified. The Cython-generated C code rebuilds this dictionary every time the function is called. As it turns out, Python 2.6's peephole optimizer doesn't optimize this either:

<pre>
>>> dis(depsolve.Depsolve._sort_reqs)
811           0 BUILD_MAP                6
              3 LOAD_CONST               1 (1)
              6 LOAD_CONST               2 ('EQ')
              9 STORE_MAP
             10 LOAD_CONST               3 (2)
             13 LOAD_CONST               4 ('LT')
             16 STORE_MAP
             17 LOAD_CONST               5 (3)
             20 LOAD_CONST               6 ('LE')
             23 STORE_MAP
             24 LOAD_CONST               7 (4)
             27 LOAD_CONST               8 ('GT')
             30 STORE_MAP
             31 LOAD_CONST               9 (5)
             34 LOAD_CONST              10 ('GE')
             37 STORE_MAP
</pre>
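A likely mitigation, whether or not Cython learns to optimize this itself, is to hoist such literals to module level so the dict is built once at import time; a sketch (names are illustrative, and the mapping is abbreviated to the entries quoted above):

```python
# Built once at import time, instead of on every call; neither the
# CPython eval loop nor Cython-generated C then rebuilds it per call.
_MAPPER = {'EQ': 1, 'LT': 2, 'LE': 3, 'GT': 4, 'GE': 5}

def flag_rank(flag):
    # stand-in for the lookup _sort_reqs() performs with "mapper"
    return _MAPPER[flag]
```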

TODO:
* measure the impact of using a Cython .c build of depsolve.py
** try building with Cython