User:Toshio/Python Unicode: Difference between revisions

Revision as of 15:56, 16 August 2010

Why python and unicode lead to frustration

In python-2.x, there are two types that deal with text.

1. str is for strings of bytes. These are very similar in nature to how strings are handled in C.
2. unicode is for strings of unicode code points.

Just what the dickens is "Unicode"?
One mistake that people encountering this issue for the first time make is confusing the unicode type with the encodings of unicode: UTF8, UTF16, UTF32, UCS4, etc. unicode in python is an abstract sequence of code points. Each code point represents a character, or a piece of a character such as a combining accent, that you might write on a page to make words, sentences, or other pieces of text. The unicode encodings are schemes that assign a certain sequence of bytes to each unicode code point.

What does that mean to you as a programmer? When you're doing text manipulation (finding the number of characters in a string or cutting a string on word boundaries) you should be dealing with unicode types, as they abstract characters in a manner that's appropriate for thinking of them as a sequence of letters that you will see on a page. When you're doing I/O (reading from or writing to disk, printing to a terminal, sending something over the wire, etc.) you should be dealing with byte str, as those devices need the concrete bytes that represent your abstract characters.
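To make the difference between code points and encoded bytes concrete, here is a minimal sketch. The variable names are made up for illustration, and the code is written so that it runs on both python-2.x and python-3.x.

```python
# -*- coding: utf-8 -*-
# One abstract string of four code points: c, a, f, e-with-acute-accent.
text = u'caf\xe9'

# The same text as concrete bytes in two different encodings.
utf8_bytes = text.encode('utf-8')       # the accented e takes two bytes
utf16_bytes = text.encode('utf-16-le')  # every code point here takes two bytes

code_point_count = len(text)
utf8_byte_count = len(utf8_bytes)
utf16_byte_count = len(utf16_bytes)

# Decoding with the matching encoding recovers the original code points.
round_tripped = utf8_bytes.decode('utf-8')
```

The length of the unicode string is the number you want when counting characters; the byte lengths only tell you how much space a particular encoding needs.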

In the python-2.x world, the two types are used pretty interchangeably, but there are several important APIs where only one or the other will do the right thing. When you give the wrong type of string to an API that wants the other one, you may end up with an exception being raised (UnicodeDecodeError or UnicodeEncodeError). However, these exceptions aren't always raised because python implicitly converts between types... sometimes.

Frustration #1: Inconsistent Errors

Although converting implicitly whenever possible seems like the right thing to do, it's actually the first source of frustration. A programmer can test out their program with a string like: "The quick brown fox jumped over the lazy dog" and not encounter any issues. But when they release their software into the wild, someone enters the string: "I sat down for coffee at the café" and suddenly an exception is thrown. The reason? The mechanism that converts between the two types is only able to deal with ASCII characters. Once you throw non-ASCII characters into your strings, you have to start dealing with the conversion manually.
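The hidden conversion can be spelled out by hand. The sketch below is an emulation, not python's own code: implicit_concat is a hypothetical helper that does explicitly what python-2.x does behind your back when you add a byte str to a unicode string, assuming the default 'ascii' codec is in effect.

```python
def implicit_concat(byte_str, unicode_str):
    """Emulate python-2.x's `byte_str + unicode_str` (hypothetical helper).

    python-2.x quietly decodes the byte str with the default 'ascii'
    codec first; this function makes that hidden step explicit.
    """
    return byte_str.decode('ascii') + unicode_str


# Pure ASCII bytes: the conversion succeeds and nobody notices it happened.
fine = implicit_concat(b'The quick brown fox', u' jumped')

# UTF-8 encoded "café": the ASCII decode step blows up.
try:
    implicit_concat(b'caf\xc3\xa9', u' au lait')
    failed = False
except UnicodeDecodeError:
    failed = True
```

Testing with only the first kind of input is exactly how the error slips through to your users.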

So, if I manually convert everything to either byte str or unicode strings, will I be okay? The answer is... sometimes.

Frustration #2: Inconsistent APIs

The problem you run into when converting everything to byte str or unicode strings is that you'll quite often be using someone else's API (this includes the APIs in the python standard library) and find that the API will only accept byte str or only accept unicode strings. Or worse, that the code will accept either when you're dealing with strings that consist solely of ASCII but throw an error when you give it a str or a unicode that has non-ASCII characters. When you encounter these APIs you first need to identify which type will work better and then you have to convert your values to the correct type for that code. This means two pieces of work for the programmer who wants to proactively fix all unicode errors in their code:

You must keep track of what type your sequences of text are. Does my_sentence contain unicode or str? If you don't know that, then you're going to be in for a world of hurt.
Anytime you call a function you need to evaluate whether that function will do the right thing with str or unicode values. Sending the wrong value here will lead to UnicodeErrors being thrown down the line.

Mitigating factor
The python community has been standardizing on using unicode in all its APIs. Although there are some APIs that you need to send bytes to in order to be safe (including print, covered in the next frustration), it's usually safe to default to sending unicode to APIs.

Frustration #3: Inconsistent treatment of output

Alright, since the python community is moving to using unicode type everywhere, we might as well convert everything to unicode and use that by default, right? Sounds good most of the time but there's at least one huge caveat to be aware of. Anytime you output text to the terminal or to a file, the text has to be converted into bytes. Python will try to implicitly convert from unicode to bytes... but it will throw an exception if the text contains non-ASCII characters:

>>> string = unicode(raw_input(), 'utf8')
café
>>> log = open('/var/tmp/debug.log', 'w')
>>> log.write(string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)

Okay, this is simple enough to solve: Just convert to bytes and we're all set:

>>> string = unicode(raw_input(), 'utf8')
café
>>> string_for_output = string.encode('utf8', 'replace')
>>> log = open('/var/tmp/debug.log', 'w')
>>> log.write(string_for_output)
>>>

So that was simple, right? Well... there's one gotcha that makes things a bit harder. When you attempt to write non-ASCII unicode to a file-like object you get a traceback every time. But what happens when you use print? The terminal is a file-like object so it should raise an exception, right? The answer to that is... sometimes.

$ python
>>> print u'café'
café

No exception. Okay, we're fine then?

We are until someone does one of the following:

Runs the script in a different locale:

$ LC_ALL=C python
>>> # Note: if you're using a good terminal program when running in the C locale
>>> # The terminal program will prevent you from entering non-ASCII characters
>>> # python will still recognize them if you use the codepoint instead:
>>> print u'caf\xe9'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)

Redirects output to a file:

$ cat test.py
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
print u'café'
$ ./test.py  >t
Traceback (most recent call last):
  File "./test.py", line 4, in <module>
    print u'café'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)

Why does this happen? It's because print in python-2.x is treated specially. The other file-like objects in python convert to ASCII unless you set them up differently, but when print outputs to the terminal, python uses the user's locale to convert the text first. When print is not outputting to the terminal (when the output is redirected to a file, for instance), the text is not converted to the user's locale but to ASCII instead.
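If you want to see which situation your own script is in, you can ask the stream. This is a small diagnostic sketch; describe_stream is a hypothetical helper name. On python-2.x, sys.stdout.encoding is typically taken from the locale when stdout is a terminal and is None when the output is redirected.

```python
import sys


def describe_stream(stream):
    """Return (is_terminal, encoding) for a file-like object."""
    isatty = getattr(stream, 'isatty', None)
    is_terminal = bool(isatty and isatty())
    # Not every file-like object has an encoding attribute.
    encoding = getattr(stream, 'encoding', None)
    return is_terminal, encoding


is_terminal, encoding = describe_stream(sys.stdout)
# Report on stderr so the result is visible even when stdout is redirected.
sys.stderr.write('terminal: %r, encoding: %r\n' % (is_terminal, encoding))
```

Try running it once directly and once with stdout redirected to a file to compare the two reports.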

So what does this mean for you, as a programmer? Unless you have the luxury of controlling how your users use your code, you should always, always, always convert to bytes before outputting strings to the terminal (via print) or to a file. Python even provides you with a facility to do just this. If you know that every piece of unicode you send to a particular file-like object (for instance, stdout) should be converted to a particular byte encoding, you can use a StreamWriter to convert from unicode into a byte string. In particular, the codecs.getwriter function will return a StreamWriter class that you can use to wrap a file-like object for output. Using our print example:

$ cat test.py
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print u'café'
$ ./test.py  >t
$ cat t
café
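The same wrapping works for any byte-oriented file-like object, not just sys.stdout. Here is a minimal sketch that uses an in-memory io.BytesIO buffer as a stand-in for a file opened in binary mode, so you can inspect exactly which bytes the StreamWriter produced.

```python
import codecs
import io

raw = io.BytesIO()                     # stands in for a binary file or pipe
writer = codecs.getwriter('utf8')(raw)

# Hand the writer unicode; it encodes to UTF-8 before passing bytes along.
writer.write(u'caf\xe9')

written = raw.getvalue()               # the bytes that reached the "file"
```

The accented character arrives in the buffer as the two bytes that UTF-8 uses for it.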

Frustrations #4 and #5 -- The other shoes

In English, there's a saying "waiting for the other shoe to drop". It means that when one event (usually bad) happens, you come to expect another event (usually worse) to come after. In this case we have two other shoes.

Frustration #4: Now it doesn't take byte strings?!

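If you wrap sys.stdout using codecs.getwriter() and think you are now safe to print any variable without checking its type, I must inform you that you're not paying enough attention to Murphy's Law. The StreamWriters that codecs.getwriter() provides expect unicode strings: hand one a byte str containing non-ASCII characters and it will try to decode those bytes as ASCII before encoding them, raising a UnicodeDecodeError.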
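One way to guard against mixed input is a small wrapper that encodes unicode and passes byte str through untouched. This is only a sketch under my own assumptions: MixedWriter is a hypothetical class, not something the codecs module provides, and it assumes any byte str it receives is already in the encoding you want.

```python
import io


class MixedWriter(object):
    """Hypothetical wrapper: accepts both unicode and byte str."""

    def __init__(self, stream, encoding='utf-8', errors='replace'):
        self.stream = stream
        self.encoding = encoding
        self.errors = errors

    def write(self, data):
        # type(u'') is unicode on python-2.x and str on python-3.x.
        if isinstance(data, type(u'')):
            data = data.encode(self.encoding, self.errors)
        # Byte str is passed through without any conversion.
        self.stream.write(data)

    def __getattr__(self, name):
        # Delegate everything else (flush, close, ...) to the real stream.
        return getattr(self.stream, name)


raw = io.BytesIO()
out = MixedWriter(raw)
out.write(u'caf\xe9 ')        # unicode gets encoded
out.write(b'caf\xc3\xa9')     # bytes are written as they are
result = raw.getvalue()
```

The drawback is that the wrapper cannot tell whether the bytes it is given are really in the expected encoding, so you still need to know where your byte str came from.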

Frustration #5: Exceptions

Okay, so we've gotten ourselves this far. We convert everything to unicode type. We're aware that we need to convert back into bytes before we write to the terminal. Are we all set? Well, there's at least one more gotcha: raising exceptions with a unicode message. Take a look:

>>> class MyException(Exception):
...     pass
...
>>> raise MyException(u'Cannot do this')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__main__.MyException: Cannot do this
>>>
>>> raise MyException(u'Cannot do this while at a café')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__main__.MyException
>>>

No, I didn't truncate that last line; raising exceptions really cannot handle non-ASCII unicode strings, and the traceback shows only the exception's name with no message.

A few solutions

On the

Use kitchen to convert at the border

I've started work on a library, kitchen, that can help with unicode issues in python-2.x. The part that's especially relevant is the converter functions. They allow you to transform your text from bytes to unicode and from unicode to bytes pretty painlessly.
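To illustrate what converting at the border looks like, here is a stdlib-only sketch. The helper names and signatures below are stand-ins written for this example and are not necessarily kitchen's actual API; the UTF-8 default and the 'replace' error handling are assumptions as well.

```python
text_type = type(u'')   # unicode on python-2.x, str on python-3.x


def to_unicode(obj, encoding='utf-8', errors='replace'):
    """Return obj as a unicode string, decoding it if it is a byte str."""
    if isinstance(obj, text_type):
        return obj
    if isinstance(obj, bytes):
        return obj.decode(encoding, errors)
    raise TypeError('expected a byte str or unicode string')


def to_bytes(obj, encoding='utf-8', errors='replace'):
    """Return obj as a byte str, encoding it if it is a unicode string."""
    if isinstance(obj, bytes):
        return obj
    if isinstance(obj, text_type):
        return obj.encode(encoding, errors)
    raise TypeError('expected a byte str or unicode string')


# Input border: bytes come in, unicode is used inside the program.
name = to_unicode(b'caf\xc3\xa9')
# Output border: unicode goes out as bytes.
payload = to_bytes(name)
```

Because each helper returns its argument unchanged when it is already the right type, it is safe to call them on values whose type you are unsure of.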

Designing APIs

If you're writing APIs to deal with text, there are a few techniques to use. However, you must be wary of what you do, as each method has some drawbacks.
