Revision as of 23:23, 6 January 2010

Feature Name

Summary

Make Fedora's implementation of Python use a locale-aware default string encoding (generally "UTF-8"), rather than hardcoding "ascii".

Owner

Name: Dave Malcolm

Email: <dmalcolm@redhat.com>

Current status

Targeted release: Fedora 41
Last updated: (DATE)
Percentage of completion: XX%

Detailed Description

Python's site.py includes this fragment of code:

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

It is proposed to change the first conditional to if 1: so that Fedora's Python by default reads the locale from the environment and uses that encoding. This will generally mean UTF-8 is used, rather than ascii.
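As a rough sketch of what enabling that branch would mean, the selection logic can be pulled out into a small function. The helper name pick_default_encoding is hypothetical and is not part of site.py. It takes the (language, encoding) tuple in the shape returned by locale.getdefaultlocale() as a parameter, so different locales can be tried without touching the interpreter's real default encoding:

```python
def pick_default_encoding(loc, fallback="ascii"):
    """Mirror the first branch of setencoding() with the conditional enabled.

    loc is a (language, encoding) tuple in the shape returned by
    locale.getdefaultlocale(). This helper is illustrative only and is
    not part of site.py.
    """
    encoding = fallback
    if loc[1]:
        # The locale named an encoding, so use it in place of "ascii".
        encoding = loc[1]
    return encoding


# A UTF-8 session, using the tuple shape shown by getdefaultlocale().
print(pick_default_encoding(("en_US", "UTF8")))
# A locale tuple that names no encoding keeps the fallback.
print(pick_default_encoding((None, None)))
```

Under this sketch the change only takes effect where the environment actually names an encoding; otherwise the default stays at ascii.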

Background

CPython's "default encoding"

The C implementation of Python 2 has two ways it can represent text strings:

the classic legacy str object, in which each character is represented as a single byte in an undefined character set. This is represented internally as a struct PyStringObject.
unicode objects, where each character is represented as either a 16-bit or a 32-bit word in the Unicode character set (UCS). This is represented internally as a struct PyUnicodeObject. We use UCS4 (32-bit) in Fedora's builds of Python.

Python 2 will encode and decode between unicode objects and str objects based on what Python believes the character set and character encoding are for the str object.

CPython 2's implementation has an internal read-only variable called unicode_default_encoding, whose value is returned by sys.getdefaultencoding() (for brevity's sake I'm going to refer to this variable as default_encoding). Whenever Python passes a string to an external API or receives a string from one (e.g. any string ultimately passed to a C function), and the C binding has not explicitly specified its encode/decode requirements, Python consults the unicode_default_encoding variable to decide how to encode/decode that string. That means any time you print a string, open a file, or call a function in a CPython binding, the string is subject to the default encoding.

(In Python 3, the str object became a struct PyUnicodeObject, and struct PyStringObject became a bytes object)

The unicode_default_encoding is set in site.py to ascii for historical reasons. site.py then makes the default encoding read-only by removing the setdefaultencoding function from the sys module namespace. This means you cannot call sys.setdefaultencoding() afterwards without generating an exception, and so Python's default encoding is locked to ascii.

The reason for this appears to be an optimization within CPython: at the C level a struct PyUnicodeObject actually carries two copies of the string:

its UCS-{2,4} representation (this is the Py_UNICODE *str field), and
its encoded representation after encoding it according to the value in the global unicode_default_encoding variable; this is the PyObject *defenc field.

Think of this as a cached value of the string in the default encoding. The first time a unicode object is subject to encode/decode, it caches the encoded value of the string, to avoid having to encode/decode every time the unicode object needs to be accessed in its encoded form. This cached value is invalidated when the unicode string content changes, but there is no mechanism to invalidate it when the default encoding changes. Hence, I believe, the restrictions on changing the default encoding, and the possibility that any struct PyUnicodeObject instances created prior to the modification of the default encoding may exhibit incorrect behavior with respect to encoding.
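The stale-cache hazard described above can be illustrated with a toy model in pure Python. This is not CPython's code: the class CachedText and its attribute names are hypothetical stand-ins for struct PyUnicodeObject, its defenc field and the global unicode_default_encoding, and utf-16-le is used as the second encoding only because it makes the difference visible even for ASCII text:

```python
class CachedText(object):
    """Toy model of a unicode object that caches its encoded form."""

    # Stands in for the global unicode_default_encoding variable.
    default_encoding = "ascii"

    def __init__(self, text):
        self.text = text
        self._defenc = None  # stands in for the defenc field

    def encoded(self):
        # Encode once with the current default, then reuse the cached bytes.
        if self._defenc is None:
            self._defenc = self.text.encode(CachedText.default_encoding)
        return self._defenc


old = CachedText(u"abc")
first = old.encoded()                  # cached under "ascii"

CachedText.default_encoding = "utf-16-le"  # the default changes afterwards

stale = old.encoded()                  # still the bytes cached under "ascii"
fresh = CachedText(u"abc").encoded()   # a new object sees the new default

print(stale == first)
print(stale == fresh)
```

In this model the object created before the change keeps returning bytes in the old encoding, because nothing ties the cache to the encoding that produced it.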

The system locale's encoding

In Fedora our default encoding is UTF-8. This is normally set via login scripts in /etc/profile.d. Users may override the system default if they wish. In both cases the default language and encoding are exported via an environment variable.

It's possible to query this locale information from Python using the locale module:

>>> import locale
>>> print locale.getdefaultlocale()
('en_US', 'UTF8')

The encoding of stdout/stderr/stdin varies with TTY-connectivity

To add to the confusion, Py_InitializeEx sets up the encoding of each of stdout, stderr, stdin to the default locale encoding (typically UTF-8), _provided_ they are connected to a tty:

#0  PyFile_SetEncodingAndErrors (f=0xb7fc5020, enc=0x80edc28 "UTF-8", errors=0x0) at Objects/fileobject.c:458
#1  0x04fbdd49 in Py_InitializeEx (install_sigs=<value optimized out>) at Python/pythonrun.c:322
#2  0x04fbe29e in Py_Initialize () at Python/pythonrun.c:359
#3  0x04fc9886 in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:512
#4  0x080485c7 in main (argc=<value optimized out>, argv=<value optimized out>) at Modules/python.c:23

so that the Python interpreter, when run interactively from a terminal, uses UTF-8 for the standard streams:

>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdin.encoding
'UTF-8'
>>> sys.stdout.encoding
'UTF-8'
>>> sys.stderr.encoding
'UTF-8'

This means that a simple case (printing lower-case Greek alpha, beta, gamma) works when run directly:

[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"'
αβγ

...but fails if you redirect it to a file or pipe it into "less", despite the fact that the system locale is UTF-8, and thus "less" expects UTF-8 data:

[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' > foo.txt
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' | less
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
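The difference between the two cases can be reproduced without a terminal by writing the same text through streams with different encodings. The following is a minimal sketch that uses io.TextIOWrapper over an in-memory buffer to stand in for stdout. It models only the effect of the stream's encoding, not how Python 2's file object is implemented, and the helper name write_text is hypothetical:

```python
import io


def write_text(text, encoding):
    """Write text to an in-memory byte stream through the given codec."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


greek = u"\u03b1\u03b2\u03b3"

# With a UTF-8 stream the three characters are encoded to six bytes.
utf8_bytes = write_text(greek, "utf-8")

# With an ascii stream the same write fails, as in the tracebacks above.
try:
    write_text(greek, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(repr(utf8_bytes))
print(ascii_failed)
```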

PyGTK and Pango

A significant "gotcha" here is that the pango Python module forces the global default encoding variable to be 'utf-8'. It can do this because it's implemented in C against the CPython API, where there are no such restrictions; it directly calls PyUnicode_SetDefaultEncoding:

    /* set the default python encoding to utf-8 */
    PyUnicode_SetDefaultEncoding("utf-8");

Let's take a little test drive and see things in action for ourselves:

$ python
Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51)
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.setdefaultencoding('utf-8')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> import pango
>>> sys.getdefaultencoding()
'utf-8'

This hidden global side-effect can be particularly confusing, since the module is typically imported implicitly by other modules (e.g. by the gtk module).

This was first introduced in pygtk in a 2000-10-25 commit, and was moved from the pygtk module to the pango module in a 2006-04-01 commit in response to https://bugzilla.gnome.org/show_bug.cgi?id=328031

site.py

Looking over the source history in upstream's Subversion:

the site.py hook to set the default encoding from the locale was added on June 7th 2000 in rev 15634:

'Added support to set the default encoding of strings at startup time to the values defined by the C locale...'

the code was disabled by default 5 weeks later on July 15th 2000 in rev 16374 by effbot (Fredrik Lundh):

-- changed default encoding to "ascii".  you can still change
   the default via site.py...:

and the code was optimized two months later on Sept 18th 2000 in rev 17513, to only set it if it's changed:

Looking over upstream mailing list archives for this period:

(unfortunately side-tracked into a debate of "deprecated" vs "depreciated"); I may have missed some of the discussion though.

sys.setdefaultencoding

The function sys.setdefaultencoding is defined in Python/sysmodule.c; it calls PyUnicode_SetDefaultEncoding(encoding) on the string "encoding".

PyUnicode_SetDefaultEncoding is defined in Objects/unicodeobject.c; it has this code:

    /* Make sure the encoding is valid. As side effect, this also
       loads the encoding into the codec registry cache. */
    v = _PyCodec_Lookup(encoding);

then copies the encoding into the buffer unicode_default_encoding. This buffer supplies the return value for PyUnicode_GetDefaultEncoding(), which is used in many places inside the unicode implementation, as well as in bytearray_decode() in bytearrayobject.c and in PyString_AsDecodedObject() and PyString_AsEncodedObject() in stringobject.c, so it would seem that there's at least some risk in changing this setting.
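The validity check can be approximated from Python code with codecs.lookup(), which raises LookupError for an unknown encoding name. Treating it as the Python-level counterpart of the _PyCodec_Lookup call is an assumption on my part, and the helper name is_known_encoding is hypothetical:

```python
import codecs


def is_known_encoding(name):
    """Return True if the codec registry can resolve the encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# The spelling reported by locale.getdefaultlocale() resolves fine.
print(is_known_encoding("UTF8"))
# A made-up name is rejected before it could become the default.
print(is_known_encoding("no-such-encoding"))
```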

Material I'm assembling

(Probably should discuss the usage of the PyArg_ API, and the effect of the encoding on converting between PyObject* and char* for calling into libraries)

(Quoting jdennis from https://bugzilla.redhat.com/show_bug.cgi?id=243541)

Python when it outputs unicode strings will automatically translate them into
the default system encoding. The default encoding is set in site.py and cannot
be overridden by the user; once set in site.py it is locked. In Fedora and RHEL
our default encoding is UTF-8. This is normally set via login scripts in
/etc/profile.d. The user, if they wish, may choose to override the system default.
In both instances the default language and encoding is exported via an
environment variable.

In site.py there is code to allow the default encoding to be set from the locale
information discussed above, however this functionality is turned off and
instead is hardcoded to be ascii. This is clearly wrong IMHO. A typical
consequence of this is that an i18n python application using unicode strings will
fault with encoding exceptions when it tries to output any of its unicode
strings. The reason string output will throw exceptions is that the default
encoding is ascii: internally CPython will convert the unicode string using the
default codec (ascii), which of course will fail if the unicode string contains
characters outside the ascii character set, which is highly likely in non-latin
languages.

If the default encoding was UTF-8, as it should be by default to match the rest
of our environment, the encoding translations from Python's internal UCS-4
Unicode to UTF-8 would succeed. I have personally tested and verified this works.

Also, one should take into account that ascii is identical to UTF-8 by design
when the set of characters is composed only from the ascii character set.
Therefore applications which placed ascii strings into Python's unicode strings
will not see a regression. Applications which used i18n unicode strings
previously could only have worked correctly if they were manually encoding to
UTF-8 on every output call; they should also see no regression. Applications
which load unicode strings from translation catalogs would never have worked
correctly and will now work.

Note, the only way existing applications could have worked correctly is:

1) They load unicode strings and manually convert to UTF-8 on output (correct
default encoding removes the need for manual conversion on every output call).

2) They load their i18n strings from a message catalog in UTF-8 format. This is
typically specified as the codeset parameter in
gettext.bind_textdomain_codeset() or gettext.install(). In this case the strings
loaded from the catalog ARE NOT UNICODE (python has an explicit string type
called unicode, which in our builds is UCS-4); normal python strings are
represented as 'str' objects. When gettext is told to return strings via _()
using the UTF-8 codeset, python represents them as 'str' not 'unicode'; in other
words they are sequences of octets. When output, the default encoding is not
applied because they are not unicode strings, rather they are vanilla strings.
Thus output works in our environment because their entire lifetime in python is
as UTF-8.

However, there are many good reasons to work with i18n strings as unicode, not
byte sequences which happen to be represented as UTF-8 (e.g. can't count the
number of characters, can't concatenate, etc.). Thus applications should be able
to represent their i18n strings as unicode (internally as UCS-4) and output
correctly with correct translation to UTF-8 automatically applied by python, not
manually.

This is from site.py. Note the hardcoding of 'ascii'. If the first 'if 0:' test
allowed locale.getdefaultlocale() to be called it would allow the default
encoding to be correctly set from the environment. Site.py should be patched to
allow this.

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

(end quote)
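Two of the claims in the quote above can be checked directly: that ascii and UTF-8 produce identical bytes for pure-ascii text, and that a UTF-8 byte sequence does not count characters the way a unicode string does. This is a small sketch with variable names of my own choosing, using the same three Greek letters as the earlier examples:

```python
# Every ASCII character encodes to the same single byte under both codecs.
ascii_text = bytes(bytearray(range(128))).decode("ascii")
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Three Greek letters: three characters as text, six bytes as UTF-8.
greek = u"\u03b1\u03b2\u03b3"
greek_utf8 = greek.encode("utf-8")

print(same_bytes)
print(len(greek))
print(len(greek_utf8))
```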

Currently Fedora's Python implementation uses ascii.

Benefit to Fedora

Scope

(should survey other distributions, and discuss things with them)

How To Test

User Experience

Dependencies

Contingency Plan

In theory this is a one-line change in the site.py file shipped in our python rpm, and so it can be backed out by reverting that one-line change.

(It may be that Python applications develop a dependency on our Python having made this change and so would be broken by reverting)

Documentation

Release Notes

Comments and Discussion

See Talk:Features/YourFeatureName


Features/PythonEncodingUsesSystemLocale: Difference between revisions