From Fedora Project Wiki
Revision as of 19:24, 6 January 2010


Feature Name

Summary

Make Fedora's implementation of Python use a locale-aware default string encoding (generally "UTF-8"), rather than hardcoding "ascii".

Owner

  • Email: <dmalcolm@redhat.com>

Current status

  • Targeted release: Fedora 41
  • Last updated: (DATE)
  • Percentage of completion: XX%


Detailed Description

(Quoting jdennis from https://bugzilla.redhat.com/show_bug.cgi?id=243541)

Python when it outputs unicode strings will automatically translate them into
the default system encoding. The default encoding is set in site.py and cannot
be overridden by the user; once set in site.py it is locked. In Fedora and RHEL
our default encoding is UTF-8. This is normally set via login scripts in
/etc/profile.d. The user, if they wish, may choose to override the system default.
In both instances the default language and encoding is exported via an
environment variable.

In site.py there is code to allow the default encoding to be set from the locale
information discussed above; however, this functionality is turned off and
instead is hardcoded to be ascii. This is clearly wrong IMHO. A typical
consequence of this is that an i18n python application using unicode strings will
fault with encoding exceptions when it tries to output any of its unicode
strings. The reason string output will throw exceptions is that the default
encoding is ascii: internally CPython will convert the unicode string using the
default codec (ascii), which of course will fail if the unicode string contains
characters outside the ascii character set, which is highly likely in non-latin
languages.

If the default encoding was UTF-8, as it should be by default to match the rest
of our environment, the encoding translations from Python's internal UCS-4
Unicode to UTF-8 would succeed. I have personally tested and verified this works.

Also, one should take into account that ascii is identical to UTF-8 by design
when the set of characters is composed only from the ascii character set.
Therefore applications which placed ascii strings into Python's unicode strings
will not see a regression. Applications which used i18n unicode strings
previously could only have worked correctly if they were manually encoding to
UTF-8 on every output call; they should also see no regression. Applications
which load unicode strings from translation catalogs would never have worked
correctly and will now work.

Note, the only way existing applications could have worked correctly is:

1) They load unicode strings and manually convert to UTF-8 on output (correct
default encoding removes the need for manual conversion on every output call).

2) They load their i18n strings from a message catalog in UTF-8 format. This is
typically specified as the codeset parameter in
gettext.bind_textdomain_codeset() or gettext.install(). In this case the strings
loaded from the catalog ARE NOT UNICODE (python has an explicit string type
called unicode which in our builds is UCS-4); normal python strings are
represented as 'str' objects. When gettext is told to return strings via _()
using the UTF-8 codeset python represents them as 'str' not 'unicode', in other
words they are sequences of octets. When output, the default encoding is not
applied because they are not unicode strings, rather they are vanilla strings.
Thus output works in our environment because their entire lifetime in python is
as UTF-8.

However, there are many good reasons to work with i18n strings as unicode, not
byte sequences which happen to be represented as UTF-8 (e.g. can't count the
number of characters, can't concatenate, etc.). Thus applications should be able
to represent their i18n strings as unicode (internally as UCS-4) and output
correctly with correct translation to UTF-8 automatically applied by python, not
manually.

This is from site.py. Note the hardcoding of 'ascii'. If the first 'if 0:' test
allowed locale.getdefaultlocale() to be called it would allow the default
encoding to be correctly set from the environment. site.py should be patched to
allow this.

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !  

(end quote)
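
The failure mode described in the quote can be reproduced without touching the interpreter's default encoding by calling the codecs explicitly. The following is a minimal sketch and is not taken from the bug report: the sample string and the helper name try_encode are illustrative assumptions, and the explicit .encode() calls stand in for the implicit unicode-to-str conversion that Python 2 performs on output.

```python
# Illustrative only: explicit .encode() calls stand in for the implicit
# conversion that would use the interpreter's default encoding.

def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# Hypothetical translated message containing two non-ASCII characters
# (u-umlaut and sharp s), written with escapes to keep this source ASCII.
greeting = u"Gr\u00fc\u00dfe"

# The ascii codec rejects any character outside the ASCII range.
ascii_result = try_encode(greeting, "ascii")

# UTF-8 can represent these characters, so the conversion succeeds.
utf8_result = try_encode(greeting, "utf-8")

# For pure-ASCII text the two codecs produce identical bytes.
plain = u"hello"
same_bytes = try_encode(plain, "ascii") == try_encode(plain, "utf-8")

# A unicode string counts characters; its UTF-8 encoding counts octets.
char_count = len(greeting)      # number of characters
octet_count = len(utf8_result)  # number of octets after encoding
```

Here ascii_result is None because the ascii codec cannot represent the message, while the pure-ASCII string encodes to the same bytes under both codecs, which is the basis of the no-regression argument in the quote.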


Currently Fedora's Python implementation uses ascii as its default string encoding.

Python's site.py includes this fragment of code:

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !  


It is proposed to change the first conditional to if 1: so that Fedora's Python by default reads the locale from the environment and uses that encoding. This will generally mean UTF-8 is used, rather than ascii.
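
The effect of enabling the first conditional can be sketched in isolation. The function below is a hypothetical helper, not part of site.py: it takes a (language, encoding) pair in the shape returned by locale.getdefaultlocale() and applies the same selection rule, falling back to ascii when the locale supplies no encoding. The locale tuples passed to it are assumed example values.

```python
def choose_default_encoding(locale_info, fallback="ascii"):
    """Apply the selection rule from the first branch of setencoding().

    locale_info is a (language, encoding) pair in the shape returned by
    locale.getdefaultlocale(); either element may be None.
    """
    encoding = fallback
    if locale_info[1]:
        # The locale names an encoding, so prefer it over the fallback.
        encoding = locale_info[1]
    return encoding

# Assumed example inputs: a typical UTF-8 locale, and an environment in
# which no locale information is available.
utf8_choice = choose_default_encoding(("en_US", "UTF-8"))
unset_choice = choose_default_encoding((None, None))
```

With the assumed inputs, the first call selects UTF-8 and the second keeps ascii, so an environment without locale information would behave as it does today.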

Benefit to Fedora

Scope

How To Test

User Experience

Dependencies

Contingency Plan

Documentation

Release Notes

Comments and Discussion