From Fedora Project Wiki

< DaveMalcolm

Revision as of 19:39, 2 September 2010 by Dmalcolm (talk | contribs) (→‎Some speed optimization ideas for yum)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Some speed optimization ideas for yum

  • use Cython to compile one or more of the .py files to .c code and compile them into DSOs
  • use PyPy; would require building out a full PyPy stack: an alternative implementation of Python. Last time I looked a the generated .c code, I wasn't comfortable debugging the result (I didn't feel that debugging a crash in the result would be feasible at 3am)
  • use Unladen Swallow for Python 2: would require porting the US 2.6 stack to 2.7, and a separate Python stack
  • use Unladen Swallow for Python 3: wait until it gets merged (in Python 3.3); port yum to python 3
  • use shedskin to compile it down to C++. Shedskin seems to impose some major restrictions: lists/tuples need to have their elements all be of the same type, and for dictionaries, all keys must be of the same type, and all values of the same type (assuming that I'm reading builtin.hpp correctly).

Using Cython seems to be the least invasive approach.

I've looked at the generated code and it seems debuggable; I'd be able to debug issues arising.

Notes on Cython

From upstream, on one simple example: "Simply compiling this in Cython merely gives a 35% speedup. This is better than nothing, but adding some static types can make a much larger difference."

The approach I'm currently thinking of is to simply compile the .py files with Cython, without using any of the non-Python extensions that Cython adds.

Cython can't yet deal with all of yum's code, but the latest (unreleased) devel version of Cython can compile about half of yum's source files, including depsolve.py Given that depsolve.py contains profiler hooks, I'm guessing that that's a known area for performance tuning, so perhaps even just optimizing that file might be a win.

This would ensure that yum can continue to be usable without Cython: no non-standard language features.

This would mean that yum upstream could continue to target Python, but in, say Fedora 15 onwards we'd build our yum packages with this optimization.

It might perhaps be appropriate to add targeted optimizations to Cython, so that it does a better job when handling yum as input.

Going further, we could perhaps introduce pragma-style comments to yum sources e.g

  # cython: isinstance(foo, dict)

which Cython could take as an optimization hint to generate specialized code for handling the case when foo is a PyDictObject (might be more productive to support exact type matches here, though).

Example of generated code

See depsolve.html. You can see the generated .c code by clicking on the yellow-colored .py code. This was generated using the "-a" option to Cython. Note that this was generated using a development copy of Cython, which has the support for some lambdas.

Note in particular that the generated C code has comments throughout indicating which line of .py code (in context) it corresponds to: this is important for my own comfort level, in feeling that this is supportable.

Correctness bugs

Method:

$ cp /usr/lib/python2.7/site-packages/yum/depsolve.py .
$ python cython-devel/cython.py -a depsolve.py 
$ gcc -I/usr/include/python2.7 -lpython2.7 -g -O2 -fPIC -shared -o depsolve.so depsolve.c
# cp ~dmalcolm/depsolve.so /usr/lib/python2.7/site-packages/yum
# ls -al /usr/lib/python2.7/site-packages/yum/depsolve.*
-rw-r--r--. 1 root root   57757 Jul 22 16:51 /usr/lib/python2.7/site-packages/yum/depsolve.py
-rw-r--r--. 1 root root   35697 Jul 22 16:51 /usr/lib/python2.7/site-packages/yum/depsolve.pyc
-rw-r--r--. 1 root root   35641 Jul 22 16:51 /usr/lib/python2.7/site-packages/yum/depsolve.pyo
-rwxr-xr-x. 1 root root 1498275 Aug 24 15:21 /usr/lib/python2.7/site-packages/yum/depsolve.so
# python
Python 2.7 (r27:82500, Jul 28 2010, 18:07:57) 
[GCC 4.5.0 20100716 (Red Hat 4.5.0-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import yum
>>> import yum.depsolve
>>> yum.depsolve.__file__
'/usr/lib/python2.7/site-packages/yum/depsolve.so'
>>> dir (yum.depsolve)
['BOOLEAN_STATES', 'Conflicts', 'DBVERSION', 'DepCheck', 'Depsolve', 'Errors', 'LETTERFLAGS', 'ListPackageSack', 'PATTERNS_INDEXED_MAX', 'PATTERNS_MAX', 'PLUG_OPT_BOOL', 'PLUG_OPT_FLOAT', 'PLUG_OPT_INT', 'PLUG_OPT_STRING', 'PLUG_OPT_WHERE_ALL', 'PLUG_OPT_WHERE_MAIN', 'PLUG_OPT_WHERE_REPO', 'PO_CONFIG', 'PO_DIR', 'PO_DOC', 'PO_FILE', 'PO_GHOST', 'PO_INSTALLEDPKG', 'PO_LOCALPKG', 'PO_REMOTEPKG', 'REPO_PROBLEM_COMPS', 'REPO_PROBLEM_METADATA', 'REPO_PROBLEM_OTHER', 'REPO_PROBLEM_PACKAGE', 'REPO_PROBLEM_REPOMD', 'RPM_CHECKSUM_TYPES', 'RPM_TO_SQLITE', 'Requires', 'SYMBOLFLAGS', 'TR_DEPENDS', 'TR_DEPENDSON', 'TR_OBSOLETEDBY', 'TR_OBSOLETES', 'TR_UPDATEDBY', 'TR_UPDATES', 'TS_AVAILABLE', 'TS_ERASE', 'TS_FAILED', 'TS_INSTALL', 'TS_INSTALL_STATES', 'TS_OBSOLETED', 'TS_OBSOLETING', 'TS_REMOVE_STATES', 'TS_TRUEINSTALL', 'TS_UPDATE', 'TS_UPDATED', 'TX_BLACK', 'TX_GREY', 'TX_WHITE', 'TransactionMember', 'YUM_PID_FILE', '_', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '__pyx_scope_struct__compare_providers', '__pyx_scope_struct__requiringFromInstalled', '__test__', 'archDifference', 'canCoinstall', 'flags', 'logging', 'logginglevels', 'max', 'min', 'misc', 'os', 'packages', 'rpm', 'rpmUtils', 'types', 'unique', 'version_tuple_to_string', 'warnings']

Up to this point, everything's looking good.

Unfortunately, we're running into a bug with the generated code:

# yum update
Loaded plugins: auto-update-debuginfo, presto, refresh-packagekit
Found 9 installed debuginfo package(s)
Setting up Update Process
Resolving Dependencies
--> Running transaction check
---> Package GConf2.x86_64 0:2.31.7-1.fc14 set to be updated
Traceback (most recent call last):
  File "/usr/bin/yum", line 29, in <module>
    yummain.user_main(sys.argv[1:], exit_code=True)
  File "/usr/share/yum-cli/yummain.py", line 258, in user_main
    errcode = main(args)
  File "/usr/share/yum-cli/yummain.py", line 154, in main
    (result, resultmsgs) = base.buildTransaction() 
  File "/usr/lib/python2.7/site-packages/yum/__init__.py", line 910, in buildTransaction
    (rescode, restring) = self.resolveDeps()
  File "depsolve.py", line 714, in depsolve.Depsolve.resolveDeps (depsolve.c:13321)
  File "depsolve.py", line 814, in depsolve.Depsolve._resolveRequires (depsolve.c:15125)
  File "depsolve.py", line 882, in depsolve.Depsolve._checkInstall (depsolve.c:16046)
TypeError: unbound method _sort_req_key() must be called with Depsolve instance as first argument (got tuple instance instead)

The issue here is the call to:

        for req in sorted(txmbr_reqs, key=self._sort_req_key):

where sorted() is being passed _sort_req_key, which is defined as a @staticmethod. Cython is, however, generating an instance method, which expects to have "self" added to the front of the arguments. See http://docs.python.org/library/functions.html#staticmethod

This is a known limitation of cython-0.12; see http://docs.cython.org/src/userguide/limitations.html#behaviour-of-class-scopes

#0  instancemethod_call (func=<built-in function _sort_req_key>, arg=(('libldap-2.4.so.2()(64bit)', None, (None, None, None)),), kw=0x0) at /usr/src/debug/Python-2.7/Objects/classobject.c:2543
#1  0x0000003852248fc3 in PyObject_Call (func=<instancemethod at remote 0xcc30a0>, arg=<value optimized out>, kw=<value optimized out>) at /usr/src/debug/Python-2.7/Objects/abstract.c:2522
#2  0x00000038522497c8 in PyObject_CallFunctionObjArgs (callable=<instancemethod at remote 0xcc30a0>) at /usr/src/debug/Python-2.7/Objects/abstract.c:2753
#3  0x0000003852276ad3 in listsort (self=0x37aaef0, args=<value optimized out>, kwds=<value optimized out>) at /usr/src/debug/Python-2.7/Objects/listobject.c:2096
#4  0x0000003852248fc3 in PyObject_Call (func=<built-in method sort of list object at remote 0x37aaef0>, arg=<value optimized out>, kw=<value optimized out>) at /usr/src/debug/Python-2.7/Objects/abstract.c:2522
#5  0x00000038522de4e9 in builtin_sorted (self=<value optimized out>, args=<value optimized out>, kwds={'key': <instancemethod at remote 0xcc30a0>}) at /usr/src/debug/Python-2.7/Python/bltinmodule.c:2234
#6  0x0000003852248fc3 in PyObject_Call (func=<built-in function sorted>, arg=<value optimized out>, kw=<value optimized out>) at /usr/src/debug/Python-2.7/Objects/abstract.c:2522
#7  0x00000038522e3a87 in PyEval_CallObjectWithKeywords (func=<built-in function sorted>, arg=([('libldap-2.4.so.2()(64bit)', None, (None, None, None)), ('libxml2.so.2(LIBXML2_2.4.30)(64bit)', None, (None, None, None)), ('libORBit-2.so.0()(64bit)', None, (None, None, None)), ('dbus', None, (None, None, None)), ('libc.so.6()(64bit)', None, (None, None, None)), ('libgobject-2.0.so.0()(64bit)', None, (None, None, None)), ('libpolkit-gobject-1.so.0()(64bit)', None, (None, None, None)), ('rtld(GNU_HASH)', None, (None, None, None)), ('libc.so.6(GLIBC_2.3.4)(64bit)', None, (None, None, None)), ('/sbin/ldconfig', None, (None, None, None)), ('libdbus-1.so.3()(64bit)', None, (None, None, None)), ('libgmodule-2.0.so.0()(64bit)', None, (None, None, None)), ('libc.so.6(GLIBC_2.2.5)(64bit)', None, (None, None, None)), ('libgio-2.0.so.0()(64bit)', None, (None, None, None)), ('libc.so.6(GLIBC_2.4)(64bit)', None, (None, None, None)), ('sgml-common', None, (None, None, None)), ('libpthread.so.0(GLIBC_2.2.5)(64bit)', None, (None, None, None)), ('/usr/bin/killall', None, (None, None, None)), ('libpthread.so.0()(64bit)', ...(truncated), kw=<value optimized out>) at /usr/src/debug/Python-2.7/Python/ceval.c:3940
#8  0x00007fffed91c6f1 in __pyx_pf_8depsolve_8Depsolve__checkInstall (__pyx_self=<value optimized out>, __pyx_args=<value optimized out>, __pyx_kwds=<value optimized out>) at depsolve.c:16046

I'm able to work around this by moving static methods outside of the class scope.

Optimization potential

Why would it be faster to simply compile with Cython?

  • it avoids the bytecode dispatch loop: operations by the module are "unrolled" directly into libpython API calls, and this can be individually optimized by the C compiler: error handling can be marked as unlikely, and the unrolled .c code should give us better CPU branch prediction; the result should also be more directly amenable to further optimization work: C-level profiling tools such as oprofile would indicate specifically where we're spending time in the .py code, rather than it all showing up in the bytecode dispatch loop.
  • it avoids some stack manipulation: temporaries are stored directly as locals within the .c code, and twiddling of ob_refcnt fields can be reduced

Using Cython "bakes in" some values for builtins: calls to the builtin "len" are turned directly into calls to PyObject_Length, rather than doublechecking each time what the value of __builtins__.len is, and calling it. So this is a semantic difference from regular Python, and some monkey-patching is ruled out, but I think it's a reasonable optimization.

Correctness risks

Adding a compile-to-C and compile-to-DSO stages to the process introduces various points of failure, and possible semantic changes (see e.g the notes on builtins above).

Need a set of reproducible test cases to ensure that yum continues to behave as expected, to minimize risk here, and to quickly locate failures.

What about all of the plugins? In theory, nothing should change: a Cython-generated module has the same dictionary and its names are importable in the same way. But "the proof of the pudding is in the eating". Would need some unit tests, and some time to soak in rawhide.

Pessimization risks

  • older versions of Cython didn't do as good a job of constant folding/propagation as Python's peephole optimizer. Largely seems to be fixed in the devel branch, but worth keeping an eye out for. For example, yum's transactioninfo.py has:
 return txmember.ts_state in ('u', 'i')
 and not isinstance(txmember.po, (YumInstalledPackage, YumAvailablePackageSqlite))

and the 0.12.1-generated C was constructing the tuples: ('u', 'i') and (YumInstalledPackage, YumAvailablePackageSqlite) every time; the former can be constant; the latest devel version of Cython does, reducing it to a pair of comparisons for equality against const "u" and const "i"; see Cython/Compiler/Optimize.py (and we could profile yum and find further optimizations)

  • looking at _sort_reqs(), the line
mapper = {'EQ' : 1, 'LT' : 2, 'LE' : 3, 'GT' : 4, 'GE' : 5,

uses a dictionary which is never modified. The Cython-generated C code is rebuilding this dictionary every time the function is called. As it turns out, Python 2.6's peephole optimizer doesn't optimize this either:

>>> dis(depsolve.Depsolve._sort_reqs)
811           0 BUILD_MAP                6
              3 LOAD_CONST               1 (1)
              6 LOAD_CONST               2 ('EQ')
              9 STORE_MAP           
             10 LOAD_CONST               3 (2)
             13 LOAD_CONST               4 ('LT')
             16 STORE_MAP           
             17 LOAD_CONST               5 (3)
             20 LOAD_CONST               6 ('LE')
             23 STORE_MAP           
             24 LOAD_CONST               7 (4)
             27 LOAD_CONST               8 ('GT')
             30 STORE_MAP           
             31 LOAD_CONST               9 (5)
             34 LOAD_CONST              10 ('GE')
             37 STORE_MAP           

Profiling notes

Method: x86_64 F14 box with yum-3.2.27-19.fc14.noarch

Running "time yum update", with a warm local cache of metadata, with /usr/share/yum-cli/output.py:userconfirm modified to always return False, so that "Is this ok [y/N]: " immediately returns and does nothing, so that we can time the taken to download metadata and depsolve.

"Transaction Summary" gives:

Install       8 Package(s)
Upgrade     120 Package(s)
Remove        2 Package(s)

Total download size: 202 M
Exiting on user Command
Complete!

real    0m3.318s
user    0m2.977s
sys     0m0.337s
# time yum install "*"
real    1m45.649s
user    1m42.596s
sys     0m2.391s

Sends a lot of debug messages to stdout, and this is across ssh, though the bulk of time is spent in "user" above, so perhaps not a significant part of the cost.

Piping to /dev/null confirms this:

# time (yum install "*" > /dev/null)
real    1m44.228s
user    1m41.885s
sys     0m2.239s

Rerunning with a cython build of depsolve yields no effective saving:

real    1m44.613s
user    1m42.036s
sys     0m2.341s
# python -mcProfile -oyum-install.prof -- /usr/bin/yum install "*"

(turning on profiling seems to halve the speed, fwiw)

yum-install.prof can be viewed using "runsnake yum-install.prof" (yum install RunSnakeRun)

  • 110 seconds spent inside matchingPrcos in packages.py, of which:
    • 85 seconds spent inside str_eq in i18n.py
  • 49 seconds spent inside loadFiles in sqlsitesack.py
    • 29 seconds in share_data in misc.py (n.b.: this looks very optimizable; cython misses some optimization possibilities here for comparison against values from "types" which could be turned into c-level pointer equality)
  • 20 seconds spent inside _addToDictAsList in packageSack.py; but uses generators (not yet supported in cython)
  • packages.py: lots of issues with lambda
  • misc.py: class definition for 2.4 compat; del: upon removal; it builds:
    • user time for "yum update" reduces from 7.2seconds to 6.7seconds
    • user time for "yum install" reduces from 1m42.497s to 1m35.471s
  • i18n.py: generators not supported; renaming the two functions that use them to lose the __ prefix, and moving them to another file, and a "from my18n import *", and cython copes;
    • user time for "yum update" reduces from 7.2seconds to 6.7seconds
    • user time for "yum install" reduces from 1m35.471s to 1m15.342s (was 1m42 with no cython)

packages.py: (reworked all property lambdas to be functions; replaced yield on the dict values with return itervalues()

 $ gcc -I/usr/include/python2.7 -lpython2.7 -g -O2 -fPIC -shared -o packages.so packages.c

With /usr/lib/python2.7/site-packages/yum/depsolve.so /usr/lib/python2.7/site-packages/yum/i18n.so /usr/lib/python2.7/site-packages/yum/misc.so:

 1m12.636s

Adding packages.so: 1m8.839s (4 second saving)

packageSack.py: _return_all_provides(po): has lots of yield statements; doesn't look trivial; hand-implemented an iteration class

 1m7.790s (1 second saving)

transactioninfo.py: compiles

 1m7.041s (.7 second saving)

Ruled out for now:

  • sqlitesack.py (move tmpObject to outermost scope); has a "yield" in a difficult place
  • __init__.py: various "yield" statements
  • rpmsack.py: various "yield" statements


Some other optimizations:

  • Upping sys's checkinterval from 100 to 10000 reduces it to 1m6.640s
  • Upping gc thresholds from (700, 10, 10) to (70000, 10, 10) (which may be dubious: RAM!), reduces it further to 0m55.900s.


Current result: usertime down from 1m42s to 0m56s = from 102 seconds to 56 seconds: 54% of the original time, so almost 2* faster. Leaving aside the sys and gc optimizations it's 1m7s = 67s: 66% of the original time.

Various optimization notes

Profile! Profile! Profile!

  • lots of ideas here; approach: profile yum with a Cython build and find the slow spots, and add optimizations to Cython to generate better .c code
  • foo.has_key() is generating a function call each time (with tuple construction for the args); might perhaps test for dictionaries and have a specialized implementation that goes directly to PyDict_ API (though arguably this is a semantic change)
  • const string literal "" % (args): it's doing a PyNumber_Remainder; it could use typechecking and convert it in-place to a PyString_Format call, avoiding various indirections (much harder: compile this down to specialized C-code, based on the exact format strings) profile!!! This is a special-case of specialization of binops for where we know the types: would this be a new optimization for Cython? BinopNode.
  • min() builtin is being invoked on a pair of ints, when it could just inline directly: (in _common_prefix_len(): num = min(len(x), len(y)) ); profile, profile, profile!!!
  • quite a lot of complexity in parsing the arguments as well. Perhaps we could directly inline functions, with a fast path that detects when a function call is calling the known function, and avoid the thunking at both ends??? profile profile profile!

TODO

  • try adding Cython to the build of yum
  • measure the impact of using a Cython .c build of depsolve.py