Why python and unicode leads to frustration
In python-2.x, there are two types that deal with text.
1. str is for strings of bytes. These are very similar in nature to how strings are handled in C.
2. unicode is for strings of unicode codepoints.
In the python-2.x world, these are used pretty interchangeably, but there are several important APIs where only one or the other will do the right thing. When you give the wrong type of string to an API that wants the other one, you may end up with an exception being raised (UnicodeDecodeError or UnicodeEncodeError). However, these exceptions aren't always raised because python implicitly converts between types... sometimes.
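To make the distinction between the two types concrete, here is a minimal sketch. It uses the word "café" (written with an escape so the source stays ASCII) and assumes UTF-8 as the byte encoding; the variable names are only for illustration.

```python
# One word, seen as code points and as bytes.
text = u'caf\xe9'             # unicode: 4 code points (\xe9 is e-acute)
data = text.encode('utf-8')   # byte string: the same word encoded as UTF-8

print(len(text))   # 4
print(len(data))   # 5, because e-acute takes two bytes in UTF-8

# Decoding with the matching encoding gets the original text back.
round_trip = data.decode('utf-8')
```

The two values describe the same word but are not interchangeable: one counts characters, the other counts bytes.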
Frustration #1: Inconsistent Errors
Although doing the right thing when possible seems like the right thing to do, it's actually the first source of frustration. A programmer can test out their program with a string like "The quick brown fox jumped over the lazy dog" and not encounter any issues. But when they release their software into the wild, someone enters the string "I sat down for coffee at the café" and suddenly an exception is thrown. The reason? The mechanism that converts between the two types is only able to deal with ASCII characters. Once you throw non-ASCII characters into your strings, you have to start dealing with the conversion manually.
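The behaviour described above can be imitated by decoding with the ASCII codec by hand, which is what the implicit conversion amounts to. This is only a sketch: the helper name implicit_style_decode is hypothetical, and the second sentence is assumed to arrive as UTF-8 bytes.

```python
def implicit_style_decode(data):
    """Decode bytes with the ASCII codec, as the implicit conversion does."""
    return data.decode('ascii')

plain = b'The quick brown fox jumped over the lazy dog'
accented = u'I sat down for coffee at the caf\xe9'.encode('utf-8')

# Pure ASCII input converts without complaint.
print(implicit_style_decode(plain))

# A single non-ASCII byte is enough to make the same call fail.
try:
    implicit_style_decode(accented)
except UnicodeDecodeError as exc:
    print('UnicodeDecodeError: %s' % exc)

# Converting manually, with the encoding the bytes really use, succeeds.
explicit = accented.decode('utf-8')
```

The test string and the real-world string go through exactly the same code path; only the data differs, which is why the error shows up late.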
So, if I manually convert everything to either byte str or unicode strings, will I be okay? The answer is... sometimes.
Frustration #2: Inconsistent APIs
The problem you run into when converting everything to byte str or unicode strings is that you'll be using someone else's API quite often (this includes the APIs in the python standard library) and find that the API will only accept byte str or only accept unicode strings. Or worse, that the code will accept either when you're dealing with strings that consist solely of ASCII but throw an error when you give it a str that's got non-ASCII characters or a unicode that has non-ASCII characters. When you encounter these APIs you first need to identify which type will work better and then you have to convert your values to the correct version for that code. This means two pieces of work for the programmer that wants to proactively fix all unicode errors in their code:
- You must keep track of what type your sequences of text are. Does my_sentence contain unicode or str? If you don't know that, then you're going to be in for a world of hurt.
- Anytime you call a function you need to evaluate whether that function will do the right thing with str or unicode values. Sending the wrong value here will lead to UnicodeErrors being thrown down the line.
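One way to keep track of types, sketched here under the assumption that a function wants unicode only, is to check the type at the function boundary so that a wrong value fails immediately instead of down the line. The names require_text and shout are hypothetical and exist only for this illustration.

```python
text_type = type(u'')   # unicode on python-2.x, str on python-3.x

def require_text(value, name='value'):
    """Return value unchanged if it is text; otherwise fail right away."""
    if not isinstance(value, text_type):
        raise TypeError('%s must be %s, got %s'
                        % (name, text_type.__name__, type(value).__name__))
    return value

def shout(sentence):
    """A toy API that only does the right thing with unicode input."""
    sentence = require_text(sentence, 'sentence')
    return sentence.upper() + u'!'

print(shout(u'hello'))
```

The check behaves the same for ASCII and non-ASCII input, so a mistake in the caller surfaces during testing rather than when the first accented character arrives.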
Frustration #3: Inconsistent treatment of output
Alright, since the python community is moving to using the unicode type everywhere, we might as well convert everything to unicode and use that by default, right? Sounds good most of the time but there's at least one huge problem with this. Anytime you output text to the terminal or to a file, the text has to be converted into bytes. Python will try to implicitly convert from unicode to bytes... but it will throw an exception if the text contains non-ASCII characters:
    >>> string = unicode(raw_input(), 'utf8')
    café
    >>> log = open('/var/tmp/debug.log', 'w')
    >>> log.write(string)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
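A minimal sketch of doing the conversion yourself before writing: encode the text explicitly and hand the file bytes. The helper name write_text is hypothetical, UTF-8 is an assumed encoding, and a temporary file stands in for the log path used above.

```python
import os
import tempfile

def write_text(path, text, encoding='utf-8'):
    """Encode text explicitly, then write the resulting bytes."""
    with open(path, 'wb') as handle:   # binary mode: we supply the bytes
        handle.write(text.encode(encoding))

# Demo against a throwaway file rather than a real log.
fd, demo_path = tempfile.mkstemp(suffix='.log')
os.close(fd)
try:
    write_text(demo_path, u'caf\xe9')
    with open(demo_path, 'rb') as handle:
        written = handle.read()
finally:
    os.remove(demo_path)

print(repr(written))
```

Because the encoding is chosen explicitly, the write no longer depends on the implicit ASCII conversion.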
Frustration #4: Exceptions
Okay, so we've gotten ourselves this far. We convert everything to the unicode type. We're aware that we need to convert back into bytes before we write to the terminal. Are we all set? Well, there's at least one more gotcha: raising exceptions with a unicode message. Take a look:
    >>> class MyException(Exception):
    ...     pass
    ...
    >>> raise MyException(u'Cannot do this')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    __main__.MyException: Cannot do this
    >>> raise MyException(u'Cannot do this while at a café')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    __main__.MyException
    >>>
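No, I didn't truncate that last line; raising exceptions really cannot handle non-ASCII characters in unicode strings. As the second traceback shows, the exception is printed without its message when the message contains them.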
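One possible workaround, offered as a sketch rather than as the recommended fix, is to encode the message to a byte str before raising on python-2.x, since byte str messages are printed as-is. The helper to_exception_message is hypothetical and UTF-8 is an assumed encoding; on python-3.x the helper passes the message through unchanged.

```python
import sys

PY2 = sys.version_info[0] == 2
text_type = type(u'')   # unicode on python-2.x, str on python-3.x

class MyException(Exception):
    pass

def to_exception_message(msg, encoding='utf-8'):
    """Return msg in a form the traceback machinery can display."""
    if PY2 and isinstance(msg, text_type):
        # python-2.x: hand the exception bytes instead of unicode.
        return msg.encode(encoding, 'replace')
    return msg

try:
    raise MyException(to_exception_message(u'Cannot do this while at a caf\xe9'))
except MyException as exc:
    shown = str(exc)   # no longer empty, and str() does not raise
```

The cost is that code catching the exception on python-2.x receives bytes in the message and has to decode them again if it wants unicode.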
A few solutions
On the
Use kitchen to convert at the border
For python-2.x, I've started work on a library, kitchen, that can help with unicode issues. The part that's especially relevant is its converter functions, which let you transform your text from bytes to unicode and from unicode to bytes pretty painlessly.
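To illustrate the idea of converting at the border, here is a minimal sketch of a pair of converters. These are hypothetical stand-ins named as_text and as_bytes, not kitchen's actual functions or signatures; UTF-8 and the 'replace' error handler are assumed defaults.

```python
text_type = type(u'')     # unicode on python-2.x, str on python-3.x
binary_type = type(b'')   # str on python-2.x, bytes on python-3.x

def as_text(value, encoding='utf-8', errors='replace'):
    """Return value as text, decoding it only if it is bytes."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, binary_type):
        return value.decode(encoding, errors)
    raise TypeError('expected text or bytes, got %s' % type(value).__name__)

def as_bytes(value, encoding='utf-8', errors='replace'):
    """Return value as bytes, encoding it only if it is text."""
    if isinstance(value, binary_type):
        return value
    if isinstance(value, text_type):
        return value.encode(encoding, errors)
    raise TypeError('expected text or bytes, got %s' % type(value).__name__)

# Incoming data is decoded once; outgoing data is encoded once.
incoming = as_text(b'caf\xc3\xa9')
outgoing = as_bytes(incoming)
```

Calling as_text on everything that enters the program and as_bytes on everything that leaves it keeps the inside of the program in a single type.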
Designing APIs
If you're writing APIs to deal with text, there are a few techniques to use. However, you must be wary of what you do, as each method has some drawbacks.