Why python and unicode lead to frustration

In python-2.x, there are two types that deal with text.

1. str is for strings of bytes. These are very similar in nature to how strings are handled in C.
2. unicode is for strings of unicode code points.
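
For instance, in a python-2.x interpreter session the two types are distinct:

    >>> type('abc')    # a byte str literal
    <type 'str'>
    >>> type(u'abc')   # a unicode literal
    <type 'unicode'>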

Note: Just what the dickens is "Unicode"?

One mistake that people encountering this issue for the first time make is confusing the unicode type and the encodings of Unicode: UTF-8, UTF-16, UTF-32, UCS-4, etc. unicode in python is an abstract sequence of code points. Each code point represents a "grapheme": a character (or piece of a character) that you might write on a page to make words, sentences, or other pieces of text. The Unicode encodings are byte strings that assign a certain sequence of bytes to each unicode code point. What does that mean to you as a programmer? When you're doing text manipulation (finding the number of characters in a string or cutting a string on word boundaries), you should be dealing with unicode types, as they abstract characters in a manner appropriate for thinking of them as the sequence of letters you will see on a page. When you're doing I/O (reading from and writing to disk, printing to a terminal, sending something over the wire, etc.), you should be dealing with byte str, as those devices need concrete implementations of what bytes represent your abstract characters.
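
To make that concrete, here is a minimal python-2.x session (the 5-byte length assumes the text is encoded as UTF-8):

    >>> u = u'café'              # unicode: 4 code points, 4 graphemes
    >>> len(u)
    4
    >>> b = u.encode('utf-8')    # encode to a byte str for I/O
    >>> len(b)                   # 'é' occupies two bytes in UTF-8
    5
    >>> b.decode('utf-8') == u   # decode back to unicode for text manipulation
    True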

In the python-2.x world, these are used pretty interchangeably, but there are several important APIs where only one or the other will do the right thing. When you give the wrong type of string to an API that wants the other one, you may end up with an exception being raised (UnicodeDecodeError or UnicodeEncodeError). However, these exceptions aren't always raised, because python implicitly converts between the types... sometimes.

Although converting implicitly when possible seems like the right thing to do, it's actually the first source of frustration. A programmer can test out their program with a string like "The quick brown fox jumped over the lazy dog" and not encounter any issues. But when they release their software into the wild, someone enters the string "I sat down for coffee at the café" and suddenly an exception is thrown. The reason? The mechanism that converts between the two types can only deal with ASCII characters. Once you throw non-ASCII characters into your strings, you have to start dealing with the conversion manually.
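
Here's a minimal python-2.x illustration (assuming the byte str holds UTF-8 encoded text, as it would coming from a UTF-8 terminal):

    >>> u'The quick brown fox' + ' jumped'   # ASCII-only bytes: implicit conversion works
    u'The quick brown fox jumped'
    >>> u'I sat down for coffee at the ' + 'café'
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)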

So, if I manually convert everything to either byte str or unicode, will I be okay?

A few solutions


Use kitchen to convert at the border

For python-2.x I've started work on a library, kitchen, that can help with unicode issues. The part that's especially relevant here is the converter functions. They allow you to transform your text from byte str to unicode and from unicode to byte str pretty painlessly.
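
Here's a sketch of converting at the border with kitchen's converter functions (the encoding argument shown is my assumption about a sensible default; check the kitchen documentation for the full signatures):

    from kitchen.text.converters import to_unicode, to_bytes

    def process_line(line):
        # Convert to unicode as soon as the bytes enter the program...
        text = to_unicode(line, encoding='utf-8')
        # ...do all text manipulation on the unicode type...
        text = text.strip().title()
        # ...and convert back to a byte str only when writing back out.
        return to_bytes(text, encoding='utf-8')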


Designing APIs

If you're writing APIs that deal with text, there are a few techniques you can use. However, you must be careful about what you do, as each method has some drawbacks.

Polymorphism
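
As a sketch of the polymorphic approach (the convention assumed here, not spelled out above, is: work on unicode internally and return the same type the caller passed in):

    def shout(msg, encoding='utf-8'):
        # Hypothetical polymorphic API: accepts byte str or unicode.
        if isinstance(msg, str):
            # byte str in, byte str out: decode, operate, re-encode
            return msg.decode(encoding).upper().encode(encoding)
        # unicode in, unicode out
        return msg.upper()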

Gotchas

Two functions
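
As I read this approach, the API exposes one function per type, so callers always know which type they'll get back; a hypothetical sketch:

    def shout_unicode(msg):
        # unicode in, unicode out
        return msg.upper()

    def shout_bytes(msg, encoding='utf-8'):
        # byte str in, byte str out
        return shout_unicode(msg.decode(encoding)).encode(encoding)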

Reasons to avoid

One function and make the user convert
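
A minimal sketch of this approach as described by the heading: accept only unicode and refuse byte str, so the caller has to convert at the call site:

    def shout(msg):
        # Hypothetical API: only unicode is accepted; callers convert.
        if not isinstance(msg, unicode):
            raise TypeError('shout() expects unicode, not %r' % type(msg))
        return msg.upper()

    # The caller converts at the border (assuming UTF-8 input bytes):
    # shout(some_bytes.decode('utf-8'))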

What to watch out for