UTF-8 explained

2009-05-18

This is a short explanation of UTF-8—what it is, how it works, and why it’s popular.

Description

UTF-8 is a character encoding. First of all, you need to understand what a character is. The problem is, it’s hard to explain, so instead here are some examples: a5!Ω→김лÜ. A character encoding, then, is a method of representing characters so that computers can work with them. Since computers only understand binary values, a character encoding basically specifies how to turn characters into 0s and 1s.

Traditionally in the Western computing world, a character is represented by a single byte, which in most systems is an octet (eight bits). For example, ASCII encodes the character A as the number 65, or 01000001 in 8-bit binary. It should be obvious why 8 bits (256 combinations) are not enough to encode all characters from all languages in the world.
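To make the A=65 mapping concrete, here is a minimal Python 3 sketch. It only uses the built-in ord() and format() functions; the variable names are mine.

```python
# Look up the number assigned to "A" and show it as 8 binary digits.
code = ord("A")
bits = format(code, "08b")
print(code, bits)  # 65 01000001

# A single octet can hold only 2**8 distinct values.
combinations = 2 ** 8
print(combinations)  # 256
```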

UTF-8 is one of the more recent character encodings, and it supports every character in the huge list known as Unicode. Unicode assigns a number, called a code point, to each character that it recognises. The idea is similar to the A=65 mapping in ASCII. UTF-8 provides a way to represent these code points as bits, for the purpose of file storage or network transmission.
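The difference between a code point (Unicode’s number for a character) and its UTF-8 representation (the octets that get stored or sent) is easier to see side by side. The following is a small Python 3 sketch using the example characters from above; describe() is a hypothetical helper name, not part of any library.

```python
def describe(ch):
    """Return (code point, UTF-8 octets) for a single character."""
    return ord(ch), ch.encode("utf-8")

for ch in "a5!Ω→김лÜ":
    code_point, octets = describe(ch)
    # Print the code point in U+XXXX notation, then the octets in hex.
    print("U+%04X" % code_point, octets.hex(), len(octets))
```

For instance, Ω has the code point U+03A9 but is stored as the two octets CE A9, while a is stored as the single octet 61.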

Technical details

If a character’s code point consists of 7 bits or fewer (i.e. code points 0-127), it is represented as one octet with the format 0xxxxxxx, where the x’s are the character’s code point in binary, padded with 0s at the front if necessary to fill up the 7 bits.

For 8-11 bits, the representation is 2 octets of the form: 110xxxxx 10xxxxxx.

For 12-16 bits, 3 octets: 1110xxxx 10xxxxxx 10xxxxxx.

… And so on.

Note how the number of leading 1s in the first octet of a multi-octet sequence equals the number of octets the character occupies. Octets of the form 0xxxxxxx are reserved for the first 128 Unicode characters (7-bit code points), while octets of the form 10xxxxxx are continuations of the preceding octet(s).
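The bit patterns above can be turned into code fairly directly. What follows is a rough Python 3 sketch, not a production encoder: utf8_encode() and sequence_length() are names I made up, the four-octet case (17-21 bits, 11110xxx followed by three continuation octets) is the next step of the pattern rather than something spelled out above, and the only validation is a range check.

```python
def utf8_encode(code_point):
    """Encode one code point by hand, following the bit patterns above."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point < 0x80:        # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 8-11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 12-16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 17-21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def sequence_length(lead_octet):
    """Read the sequence length off the leading octet."""
    if lead_octet < 0x80:             # 0xxxxxxx: a single-octet character
        return 1
    if lead_octet >> 5 == 0b110:      # 110xxxxx
        return 2
    if lead_octet >> 4 == 0b1110:     # 1110xxxx
        return 3
    if lead_octet >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError("continuation octet or invalid leading octet")


print(utf8_encode(0x3A9).hex())  # cea9
print(sequence_length(0xCE))     # 2
```

Comparing utf8_encode(n) with chr(n).encode("utf-8") is a quick way to check the sketch against Python’s own encoder.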

Popularity and support

Part of UTF-8’s popularity is due to its backwards compatibility with ASCII-based encodings, because the first 128 Unicode characters correspond with those of ASCII. For example, software that only supports ISO-8859-1 (a commonly used superset of ASCII) can still read a UTF-8 file containing only English characters. Even if the file contains a few non-English characters, the worst case is these characters will be replaced with multi-character gibberish, but the English characters will remain intact.
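This behaviour is easy to reproduce. The Python 3 sketch below pretends to be ISO-8859-1-only software by decoding UTF-8 octets with the "iso-8859-1" codec; the sample strings are my own.

```python
# Pure English text: the UTF-8 octets are identical to ASCII,
# so an ISO-8859-1 reader sees exactly the same text.
english = "plain English text"
seen = english.encode("utf-8").decode("iso-8859-1")
print(seen == english)  # True

# One non-English character: é becomes two octets (C3 A9), which an
# ISO-8859-1 reader shows as two separate characters.
garbled = "café".encode("utf-8").decode("iso-8859-1")
print(garbled)  # cafÃ©
```

Only the é turns into gibberish; the "caf" before it comes through untouched.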

In terms of library support, each programming library generally chooses one encoding as its “native” encoding (e.g. UTF-8 in GLib and UTF-16 in modern Win32), but can usually convert between UTF-8 and UTF-16/UCS-2 at the minimum (GLib, Win32).
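As an illustration of this kind of conversion, here is a minimal sketch using Python 3’s built-in codecs rather than the GLib or Win32 functions. The helper names are hypothetical, and I have assumed little-endian UTF-16 without a byte order mark.

```python
def utf8_to_utf16(data):
    """Convert UTF-8 octets to UTF-16 (little-endian, no BOM)."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16_to_utf8(data):
    """Convert UTF-16 (little-endian, no BOM) octets back to UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


utf8_bytes = "Ω→김".encode("utf-8")
utf16_bytes = utf8_to_utf16(utf8_bytes)
print(len(utf8_bytes), len(utf16_bytes))          # 8 6
print(utf16_to_utf8(utf16_bytes) == utf8_bytes)   # True
```

Both directions go through the code points in the middle, which is why the round trip loses nothing.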

Further reading

Another great read for anyone who wishes to understand encodings is Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).


Unicode with Python 2 and PyGTK

2009-04-24

Playing with Unicode in Python 2 is not fun, and combining this with third-party libraries brings even more headaches. This post explains how Unicode in PyGTK is handled.

Note: This information is only valid for Python 2.x. It will likely change when PyGTK releases support for Python 3.

Calling GTK+ functions: PyGTK accepts str and unicode objects as input. str objects are assumed to be in UTF-8. If you pass a non-UTF-8 str to a GTK+ function, it will work until you try to show it, at which point you’ll get a “PangoWarning: Invalid UTF-8 string passed to pango_layout_set_text()”.

Handling GTK+ return values: PyGTK functions always return strings as str objects. In most (all?) cases, the strings are encoded in UTF-8. Ideally, Python programs should use unicode strings internally, so it’s wise to convert the output of PyGTK function calls to unicode.

Example:

label1.set_text("Some UTF-8 string")
label1.set_text(u"Some Unicode string")
x = label1.get_text()  # x is a str object containing UTF-8 encoded bytes.
y = unicode(x, 'utf-8')  # y is the unicode version of x.
y = x.decode('utf-8')  # Same as above.
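If you do this conversion in many places, it may be worth wrapping it in small helpers. The sketch below is one possible way to do it; to_unicode() and is_valid_utf8() are hypothetical names of my own, not part of PyGTK, and the byte strings stand in for what a call such as label1.get_text() would return. It relies on bytes being an alias of str in Python 2.6 and later, so the same code should also run on Python 3.

```python
def to_unicode(value, encoding="utf-8"):
    """Decode byte strings (e.g. PyGTK return values); pass text through."""
    if isinstance(value, bytes):  # bytes is str on Python 2.6+
        return value.decode(encoding)
    return value


def is_valid_utf8(data):
    """Check a byte string before handing it to a GTK+ function."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


raw = b"caf\xc3\xa9"  # stand-in for a UTF-8 str returned by PyGTK
print(to_unicode(raw) == u"caf\xe9")   # True
print(is_valid_utf8(b"\xff\xfe"))      # False
```

Checking with is_valid_utf8() first is meant to catch the kind of input that would otherwise only show up later as a PangoWarning.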


fix-columns-i18n merged

2007-11-26

There’s a long-standing internationalisation bug in Exaile that has been nagging me for some time. In localised environments, column names get saved to the settings in translated form. This has annoying consequences.

No more; I’ve merged the fix-columns-i18n branch that I have been working on for a while. There’s just one catch: with existing settings you’ll see no columns at all, which means no cells, which means completely empty playlists. You can simply re-enable them from View → Columns.

Edit (2008-01-29): The remaining bug is fixed now.

