User:Toshio/Python Unicode

From FedoraProject

Revision as of 06:54, 19 August 2010

Why python and unicode leads to frustration

In python-2.x, there are two types that deal with text.

1. str is for strings of bytes. These are very similar in nature to how strings are handled in C.
2. unicode is for strings of unicode codepoints.

Note: Just what the dickens is "Unicode"?
One mistake that people encountering this issue for the first time make is confusing the unicode type with the encodings of unicode that are stored in the str type: UTF8, UTF16, UTF32, UCS4, etc. In python, the unicode type stores an abstract sequence of code points. Each code point represents a "grapheme": characters (or pieces of characters) that you might write on a page to make words, sentences, or other pieces of text. By contrast, the unicode encodings are byte strings that assign a certain sequence of bytes to each unicode code point. What does that mean to you as a programmer? When you're dealing with text manipulations (finding the number of characters in a string or cutting a string on word boundaries) you should be dealing with unicode types, as they abstract characters in a manner that's appropriate for thinking of them as a sequence of letters that you will see on a page. When dealing with I/O (reading from and writing to the disk, printing to a terminal, sending something over the wire, etc.) you should be dealing with byte str, as those devices need the concrete bytes that represent your abstract characters.
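
To make the distinction concrete, here is a small sketch (the variable names are only for illustration). The same four code points produce byte strings of different lengths depending on which encoding is chosen:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: one abstract string, several concrete byte encodings.
text = u'caf\xe9'                        # four code points: c, a, f, e-acute

utf8_bytes = text.encode('utf-8')        # e-acute takes two bytes in UTF-8
utf16_bytes = text.encode('utf-16-le')   # two bytes per code point here
utf32_bytes = text.encode('utf-32-le')   # four bytes per code point

print(len(text))         # 4
print(len(utf8_bytes))   # 5
print(len(utf16_bytes))  # 8
print(len(utf32_bytes))  # 16
```

The length of the unicode value matches what a reader would count on the page; the lengths of the encoded values only tell you how much space the bytes take up.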

In the python-2.x world, these are used interchangeably in many APIs but there are several important APIs where only one or the other will do the right thing. When you give the wrong type of string to an API that wants the other type, you may end up with an exception being raised (UnicodeDecodeError or UnicodeEncodeError). However, these exceptions aren't always raised because python implicitly converts between types... sometimes.

Frustration #1: Inconsistent Errors

Although doing the right thing when possible seems like the right thing to do, it's actually the first source of frustration. A programmer can test out their program with a string like: "The quick brown fox jumped over the lazy dog" and not encounter any issues. But when they release their software into the wild, someone enters the string: "I sat down for coffee at the café" and suddenly an exception is thrown. The reason? The mechanism that converts between the two types is only able to deal with ASCII characters. Once you throw non-ASCII characters into your strings, you have to start dealing with the conversion manually.
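
The implicit conversion behaves as though python had decoded the byte str with the ascii codec. Spelling that step out explicitly shows why the first sentence passes and the second fails. This is only a sketch; the function name implicit_to_unicode is made up for illustration:

```python
def implicit_to_unicode(byte_string):
    """Mimic the automatic str -> unicode coercion, which only knows ASCII."""
    return byte_string.decode('ascii')

plain = b'The quick brown fox jumped over the lazy dog'
accented = b'I sat down for coffee at the caf\xc3\xa9'  # UTF-8 encoded bytes

implicit_to_unicode(plain)  # works: every byte is ASCII

try:
    implicit_to_unicode(accented)
    failed = False
except UnicodeDecodeError:
    # 0xc3 is outside the ASCII range, so the conversion blows up
    failed = True
```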

So, if I manually convert everything to either byte str or unicode, will I be okay? The answer is.... sometimes.

Frustration #2: Inconsistent APIs

The problem you run into when converting everything to byte str or unicode strings is that you'll be using someone else's API quite often (this includes the APIs in the python standard library) and find that the API will only accept byte str or only accept unicode strings. Or worse, that the code will accept either when you're dealing with strings that consist solely of ASCII but throw an error when you give it a str that's got non-ASCII characters or a unicode that has non-ASCII characters. When you encounter these APIs you first need to identify which type will work better and then you have to convert your values to the correct version for that code. This means two pieces of work for the programmer that wants to proactively fix all unicode errors in their code:

  1. You must keep track of what type your sequences of text are. Does my_sentence contain unicode or str? If you don't know that, then you're going to be in for a world of hurt.
  2. Anytime you call a function you need to evaluate whether that function will do the right thing with str or unicode values. Sending the wrong value here will lead to UnicodeErrors being thrown down the line.
Note: Mitigating factor
The python community has been standardizing on using unicode in all its APIs. Although there are some APIs that you need to send bytes to in order to be safe (including print, covered in the next frustration), it's usually safe to default to sending unicode to APIs.

Frustration #3: Inconsistent treatment of output

Alright, since the python community is moving to using unicode type everywhere, we might as well convert everything to unicode and use that by default, right? Sounds good most of the time but there's at least one huge caveat to be aware of. Anytime you output text to the terminal or to a file, the text has to be converted into bytes. Python will try to implicitly convert from unicode to bytes... but it will throw an exception if the text contains non-ASCII characters:

>>> string = unicode(raw_input(), 'utf8')
café
>>> log = open('/var/tmp/debug.log', 'w')
>>> log.write(string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)

Okay, this is simple enough to solve: Just convert to bytes and we're all set:

>>> string = unicode(raw_input(), 'utf8')
café
>>> string_for_output = string.encode('utf8', 'replace')
>>> log = open('/var/tmp/debug.log', 'w')
>>> log.write(string_for_output)
>>>

So that was simple, right? Well... there's one gotcha that makes things a bit harder. When you attempt to write non-ASCII unicode to a file-like object you get a traceback every time. But what happens when you use print? The terminal is a file-like object so it should raise an exception, right? The answer to that is.... sometimes.

$ python
>>> print u'café'
café

No exception. Okay, we're fine then?

We are until someone does one of the following:

  • Runs the script in a different locale
$ LC_ALL=C python
>>> # Note: if you're using a good terminal program when running in the C locale
>>> # The terminal program will prevent you from entering non-ASCII characters
>>> # python will still recognize them if you use the codepoint instead:
>>> print u'caf\xe9'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
  • Redirects output to a file:
$ cat test.py
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
print u'café'
$ ./test.py  >t
Traceback (most recent call last):
  File "./test.py", line 4, in <module>
    print u'café'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)

Why does this happen? It's because print in python-2.x is treated specially. Whereas the other file-like objects in python convert to ASCII unless you set them up differently, using print to output to the terminal will use the user's locale to convert before printing to the terminal. When print is not outputting to the terminal (being redirected to a file, for instance), the output is not converted to the user's locale but to ASCII instead.
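
One way to see the difference is to ask the stream which encoding it believes it has. The following is a rough sketch, assuming a hypothetical helper named output_encoding; it falls back to a caller-supplied encoding when the stream does not report one (as happens in python-2.x when stdout is redirected):

```python
import io
import sys

def output_encoding(stream, fallback='utf-8'):
    """Return the encoding a stream reports, or fallback if it reports none."""
    return getattr(stream, 'encoding', None) or fallback

# A terminal usually reports the locale's encoding; a raw byte stream
# (standing in here for a redirected stdout) reports nothing at all.
redirected = io.BytesIO()
chosen = output_encoding(redirected)
terminal_guess = output_encoding(sys.stdout)
```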

So what does this mean for you, as a programmer? Unless you have the luxury of controlling how your users use your code, you should always, always, always convert to bytes before outputting strings to the terminal (via print) or to a file. Python even provides you with a facility to do just this. If you know that every piece of unicode you send to a particular file-like object (for instance, stdout) should be converted to a particular byte encoding you can use StreamWriter to convert from unicode into a byte string. In particular, the getwriter function will return a StreamWriter class that will help you to wrap a file-like object for output. Using our print example:

$ cat test.py
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print u'café'
$ ./test.py  >t
$ cat t
café

Frustrations #4 and #5 -- The other shoes

In English, there's a saying "waiting for the other shoe to drop". It means that when one event (usually bad) happens, you come to expect another event (usually worse) to come after. In this case we have two other shoes.

Frustration #4: Now it doesn't take byte strings?!

If you wrap sys.stdout using codecs.getwriter() and think you are now safe to print any variable without checking its type, I must inform you that you're not paying enough attention to Murphy's Law. The StreamWriters that codecs.getwriter() provides will take unicode strings and transform them into byte strings before they get to stdout. But if you give it something that's already a byte string, it tries to turn that into unicode before transforming it back into a byte string.... and since it uses the ASCII codec to go from bytes to unicode, chances are that it'll blow up there.

>>> import codecs
>>> import sys
>>> UTF8Writer = codecs.getwriter('utf8')
>>> sys.stdout = UTF8Writer(sys.stdout)
>>> print 'café'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

To work around this, the kitchen library provides an alternate writer that can deal with both byte strings and unicode strings. Use it like this:

>>> import sys
>>> from kitchen.text.converters import getwriter
>>> UTF8Writer = getwriter('utf8')
>>> sys.stdout = UTF8Writer(sys.stdout)
>>> print u'café'
café
>>> print 'café'
café
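
The idea behind such a writer can be sketched in a few lines. This is not kitchen's actual implementation; TolerantWriter is a hypothetical class that simply encodes unicode and passes byte strings through untouched:

```python
import io

text_type = type(u'')  # unicode on python-2.x, str on python3

class TolerantWriter(object):
    """Wrap a byte stream; accept both unicode and byte strings."""

    def __init__(self, stream, encoding='utf-8', errors='replace'):
        self.stream = stream
        self.encoding = encoding
        self.errors = errors

    def write(self, data):
        if isinstance(data, text_type):
            # Abstract code points must become concrete bytes
            data = data.encode(self.encoding, self.errors)
        # Byte strings are assumed to be in the right encoding already
        self.stream.write(data)

buf = io.BytesIO()
out = TolerantWriter(buf)
out.write(u'caf\xe9\n')        # unicode is encoded
out.write(b'caf\xc3\xa9\n')    # bytes pass straight through
result = buf.getvalue()
```

The trade-off in this sketch is that byte strings are trusted blindly: if they are in a different encoding than the one the stream expects, the output will be a mix of encodings.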

Frustration #5: Exceptions

Okay, so we've gotten ourselves this far. We convert everything to unicode type. We're aware that we need to convert back into bytes before we write to the terminal. We've worked around the inability of the standard codec to deal with both byte strings and unicode strings. Are we all set? Well, there's at least one more gotcha: raising exceptions with a unicode message. Take a look:

>>> class MyException(Exception):
...     pass
...
>>> raise MyException(u'Cannot do this')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__main__.MyException: Cannot do this
>>>
>>> raise MyException(u'Cannot do this while at a café')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__main__.MyException
>>> 

No, I didn't truncate that last line; raising exceptions really cannot handle unicode strings and will output an exception without the message if the message is a unicode string containing non-ASCII characters. What happens if we use the codecs trick to work around this?

>>> import sys
>>> from kitchen.text.converters import getwriter
>>> sys.stderr = getwriter('utf8')(sys.stderr)
>>> raise MyException(u'Cannot do this')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__main__.MyException: Cannot do this
>>> raise MyException(u'Cannot do this while at a café')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__main__.MyException>>>

Not only did this also fail, it even swallowed the trailing newline that's normally there.... So how to make this work? Transform from unicode strings to bytes manually before outputting:

>>> raise MyException('Cannot do this while at a café')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__main__.MyException: Cannot do this while at a café
Note: we used kitchen.text.converters.getwriter in this example. If you use codecs.getwriter() instead, you'll find that raising an exception with a byte string is broken by the codec as well.

A few solutions

Now that we've identified the issues, what's a comprehensive solution to the problem?

Convert text at the border

If you get some piece of text from a library, read from a file, etc, turn it into a unicode type immediately. Since python is moving in the direction of unicode type everywhere (python3 turns the str type into the equivalent of python2's unicode, and python3's bytes type replaces python2's str), it's going to be easier to work with unicode type within your code.
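
A minimal sketch of converting at the border, assuming the incoming data is UTF-8. The helper name decode_at_border is hypothetical; kitchen's to_unicode, used in the example later on this page, serves a similar purpose:

```python
text_type = type(u'')  # unicode on python-2.x, str on python3

def decode_at_border(value, encoding='utf-8', errors='replace'):
    """Return value as a unicode string, decoding byte strings on the way in."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding, errors)

from_file = b'Shells available on caf\xc3\xa9.lan'
description = decode_at_border(from_file)
already_text = decode_at_border(u'caf\xe9')
mangled = decode_at_border(b'file\xff')  # invalid byte becomes U+FFFD
```

Using the 'replace' error handler means bad input is mangled rather than rejected; whether that is acceptable depends on what the text is used for.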

When the data needs to be treated as bytes use a naming convention

Sometimes you read textual data in but have to keep bytes around rather than converting to unicode. This is often the case where you need to use the value verbatim somewhere else, for instance, filenames or key values in a database. When you do this, use a naming convention for the data you're working with so you and others don't get confused about what's being stored in the value.

If you need both a string to present to the user and a byte value for an exact match, consider keeping both versions around. You can either use two variables for this or a dict whose key is the byte value.
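
For instance (a sketch; the filenames and variable names are invented for illustration), a _b suffix marks the byte version, and a dict keyed on the bytes keeps the displayable text alongside it:

```python
# Byte filenames exactly as the filesystem returned them
filenames_b = [b'/etc/shells', b'/var/tmp/file\xff']

# Key: exact bytes for lookups.  Value: unicode that is safe to show a user.
display_names = {}
for filename_b in filenames_b:
    display_names[filename_b] = filename_b.decode('utf-8', 'replace')

shown = display_names[b'/var/tmp/file\xff']  # lossy, for display only
```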

When outputting data, convert back into bytes

When you go to send your data back outside of your program (to the filesystem, over the network, displaying to the user, etc) turn the data back into bytes. How you do this will depend on the expected output format of the data. For displaying to the user, you can use the user's default encoding (remembering that they may have their encoding set to something that can't display every single unicode character). For writing to a file, your best bet is to pick a single encoding and stick with it.
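
A sketch of both cases, assuming a hypothetical helper named encode_for_user. The 'replace' error handler substitutes a question mark for anything the user's encoding cannot represent instead of raising an exception:

```python
import locale

def encode_for_user(text, encoding=None):
    """Encode unicode for display, degrading gracefully on unencodable characters."""
    if encoding is None:
        encoding = locale.getpreferredencoding()
    return text.encode(encoding, 'replace')

message = u'caf\xe9'
for_c_locale = encode_for_user(message, 'ascii')   # lossy but safe
for_file = message.encode('utf-8')                 # one fixed encoding for files
for_terminal = encode_for_user(message)            # depends on the user's locale
```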

You can use getwriter to do this automatically for sys.stdout. When printing to sys.stderr (for example, exceptions), be sure to convert to bytes manually.

Example: Putting this all together with kitchen

In Fedora, we found we needed to do these sorts of things all the time so we put together a library, kitchen, which contains a bunch of utility functions for doing these things. Here's a short example of using it:

#!/usr/bin/python -tt
import locale
import os
import sys
import unicodedata

from kitchen.text.converters import getwriter, to_bytes, to_unicode
from kitchen.i18n import get_translation_object

if __name__ == '__main__':
    # Setup translations via gettext
    translations = get_translation_object('example')
    # We use _() for marking strings that we operate on as unicode
    # This is pretty much everything
    _ = translations.ugettext
    # And _b() for marking strings that we operate on as bytes.
    # This is limited to exceptions
    _b = translations.lgettext

    # Setup stdout
    encoding = locale.getpreferredencoding()
    Writer = getwriter(encoding)
    sys.stdout = Writer(sys.stdout)

    # Load data.  Format is filename\0description
    # description should be utf8 but filename can be any legal filename on the filesystem
    # Sample datafile.txt:
    #   /etc/shells\x00Shells available on caf\xc3\xa9.lan
    #   /var/tmp/file\xff\x00File with non-utf8 data in the filename
    #
    # And to create /var/tmp/file\xff (under bash or zsh) do:
    #   echo 'Some data' > /var/tmp/file$'\377'
    datafile = open('datafile.txt', 'r')
    data = {}
    for line in datafile:
        # We're going to keep filename as bytes because we will need the exact
        # bytes to access files on a POSIX operating system.  description,
        # we'll immediately transform into unicode type.
        filename_b, description = line.split('\0', 1)
        # to_unicode defaults to decoding input as utf8 and replacing any
        # problematic bytes with the unicode replacement character
        # We accept mangling of the description here knowing that our file
        # format is supposed to use utf8 in that field.
        description = to_unicode(description, 'utf8').strip()
        data[filename_b] = description
    datafile.close()

    # We're going to add a pair of extra fields onto our data to show the
    # length of the description and the filesize.  We put those between the
    # filename and description because we haven't checked that the description
    # is free of NULLs.
    datafile = open('newdatafile.txt', 'w')

    # Name filename with a _b suffix to denote byte string of unknown encoding
    for filename_b in data:
        # Since we have the byte representation of filename, we can read any
        # filename
        if os.access(filename_b, os.F_OK):
            size = os.path.getsize(filename_b)
        else:
            size = 0
        # Because the description is unicode type,  we know the number of
        # characters corresponds to the length of the normalized unicode
        # string.
        length = len(unicodedata.normalize('NFC', data[filename_b]))

        # Print a summary to the screen
        print _b('filename: %s') % filename_b
        print _(u'file size: %s') % size
        print _(u'desc length: %s') % length
        print _(u'description: %s') % data[filename_b]

        # First combine the unicode portion
        line = u'%s\0%s\0%s' % (size, length, data[filename_b])
        # Since the filenames are bytes, turn everything else to bytes before combining
        # Turning into unicode first would be wrong as the bytes in filename_b
        # might not convert
        line_b = '%s\0%s\n' % (filename_b, to_bytes(line))

        # Just to demonstrate that getwriter will pass this through fine
        print _b('Wrote: %s') % line_b
        datafile.write(line_b)
    datafile.close()

    # And just to show how to properly deal with an exception.
    # Note two things about this call:
    # 1) We use the _b() function to translate the string.  This returns a
    #    byte string instead of a unicode string
    # 2) We send a byte string in.  Unfortunately, python's lgettext()
    #    functions will return a unicode string if a unicode string was given
    #    and no translation for that string is found in the message catalog.
    #    We should always enter byte strings for this reason.
    message = u'Demonstrate the proper way to raise exceptions.  Sincerely,  \u3068\u3057\u304a'
    raise Exception(_b(to_bytes(message)))


Designing APIs

If you're writing APIs to deal with text, your job gets even trickier because you need to think of not just what your code does when you give it data to process but also what happens when someone else gives it data to process. Here are a few techniques to use. However, you must be wary in what you do as each method has some drawbacks.

Polymorphism

Gotchas

Two functions

Reasons to avoid

One function and make the user convert

What to watch out for