
Transliterating arbitrary text into Latin script

This post explores one of the capabilities of the PyICU library, namely its text transformation module. Specifically, we’ll look at the simplest use case: transliterating text into Latin script.

Say you are given a list of phrases, names, titles, whatever, in a script that you can’t read. You want to be able to differentiate them, but how, when they all look like random lines and curves? Well, let’s turn them into Latin characters!

>>> import icu
>>> tr = icu.Transliterator.createInstance('Any-Latin; Title').transliterate
>>> tr('Ἀριστοτέλης, Πλάτων, Σωκράτης')
'Aristotélēs, Plátōn, Sōkrátēs'

There we go. Even if you still can’t pronounce these names correctly, at least they’re hopefully easier to recognise because they are now in a script that you are more used to reading (unless you’re Greek, of course).

'Any-Latin; Title' means we want to transliterate from any script to Latin, then convert it to title case. If that’s too simple, the ICU documentation has the gory details of all the supported transforms.
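The transform string can chain further steps. For instance, ICU also ships a Latin-ASCII transform that strips the remaining diacritics, in case you want plain ASCII; a quick sketch (exact output may vary between ICU versions):

>>> ascii_tr = icu.Transliterator.createInstance('Any-Latin; Latin-ASCII').transliterate
>>> ascii_tr('Ἀριστοτέλης, Πλάτων, Σωκράτης')
'Aristoteles, Platon, Sokrates'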

Easy, no?

Caveats

Do not rely on the output as a pronunciation guide unless you know what you’re doing. For example, ICU transliterates the Korean character 꽃 as kkoch to keep the transform reversible, even though the word sounds nothing like the gunmaker’s or Kochie’s surname, and definitely not like the synonym for rooster (the modern romanisation, kkot, matches the actual pronunciation more closely).
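You can check this with a plain Any-Latin transliterator (a sketch based on the behaviour described above; results may differ between ICU versions):

>>> icu.Transliterator.createInstance('Any-Latin').transliterate('꽃')
'kkoch'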

The transliteration of Han characters (shared between Chinese, Japanese, and Korean) uses Chinese Pinyin, and thus may not resemble the Japanese and Korean romanisations at all. This makes the transliteration of many Japanese texts particularly awful.

>>> tr('日本国')  # "Nippon-koku" = Japan
'Rì Běn Guó'

Oops, that could start an Internet war. Use a different library if you are primarily dealing with Japanese text.

Another unfortunate thing with ICU is that there are still scripts that it doesn’t support at all. For example, it can’t transliterate to/from Javanese.

>>> tr('ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ')
'ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ'

Maybe one day.


UTF-8 explained

This is a short explanation of UTF-8—what it is, how it works, and why it’s popular.

Description

UTF-8 is a character encoding.

First of all, you need to understand what a character is. The problem is, it’s hard to explain, so instead here are some examples of characters: N 2 @ ũ æ Ω ओ 김 Ѝ ≥ → ☢ ★.

A character encoding, then, is a method to represent characters so computers can understand them. If you consider the fact that computers only understand binary values, a character encoding basically specifies how to turn characters into 0s and 1s.

Traditionally in the Western computing world, a character is represented by a single byte, which in most systems is an octet (eight bits). For example, the ASCII character encoding encodes the character A to the number 65, or 01000001 in 8-bit binary. It should be obvious why 8 bits (256 combinations) are not enough to encode all characters from all languages in the world.

UTF-8 is one of the more recently developed character encodings, and it supports every character in the huge list known as Unicode. Unicode assigns a number, called a code point, to each character it recognises; the idea is similar to the A=65 mapping in ASCII. UTF-8 provides a way to represent these code points as bits, for the purposes of file storage or network transmission.
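To make the mapping concrete, here is a quick Python 3 sketch: ord() returns a character’s code point, and encode() turns it into the octets that UTF-8 uses to represent it:

>>> ord('A')              # code point of A, the same value as in ASCII
65
>>> ord('Ω')              # a code point beyond ASCII's 128 characters
937
>>> 'Ω'.encode('utf-8')   # that code point as octets, ready for disk or network
b'\xce\xa9'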

Technical details

In UTF-8, if a character’s code point can be represented with 7 bits or fewer (i.e. code points 0-127), it is encoded as a single octet of the form 0xxxxxxx, where the x’s are the code point in binary, padded with 0’s at the front if necessary to fill the 7 bits.

For 8-11 bits, the representation is 2 octets of the form: 110xxxxx 10xxxxxx.

For 12-16 bits, 3 octets: 1110xxxx 10xxxxxx 10xxxxxx.

… and so on, up to 4 octets (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx) for code points that need 17-21 bits, which is enough to cover the rest of the Unicode range.

Note how, for multi-octet sequences, the number of leading 1’s in the first octet equals the total number of octets the character occupies. Octets of the form 0xxxxxxx are reserved for the first 128 Unicode characters (7-bit code points), while octets of the form 10xxxxxx are continuations of the preceding octet(s).
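As a sanity check, here is a small Python 3 sketch (the bits() helper is just for illustration) that prints the bit pattern of each octet:

>>> def bits(s):
...     return ' '.join(format(b, '08b') for b in s.encode('utf-8'))
...
>>> bits('A')    # 7-bit code point: one octet, leading 0
'01000001'
>>> bits('é')    # U+00E9 needs 8 bits: two octets
'11000011 10101001'
>>> bits('김')   # U+AE40 needs 16 bits: three octets
'11101010 10111001 10000000'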

Popularity and support

Part of UTF-8’s popularity is due to its backwards compatibility with ASCII-based encodings, because the first 128 Unicode characters correspond to those of ASCII. For example, software that only supports ISO-8859-1 (a commonly used superset of ASCII) can still read a UTF-8 file containing only English characters. Even if the file contains a few non-English characters, the worst case is that those characters will show up as multi-character gibberish, while the English characters remain intact.
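That worst case is easy to reproduce in Python 3: decoding UTF-8 bytes as ISO-8859-1 garbles the é but leaves the English letters alone:

>>> 'café'.encode('utf-8').decode('iso-8859-1')
'cafÃ©'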

In terms of library support, each programming library generally chooses one encoding as its “native” encoding (e.g. UTF-8 in GLib and UTF-16 in modern Win32), but can usually convert between UTF-8 and UTF-16/UCS-2 at the minimum (GLib, Win32).
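In Python, for instance, converting between the two is a one-liner in either direction:

>>> '★'.encode('utf-8')
b'\xe2\x98\x85'
>>> '★'.encode('utf-16-le')
b'\x05&'
>>> b'\x05&'.decode('utf-16-le')
'★'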

Further reading

The descriptions of various Unicode-related terms in this article are incomplete and, in order to simplify things, slightly inaccurate. For example, they ignore the fact that a character can be represented with multiple code points (ä can be written as “a with umlaut” or as “a” + “umlaut”).
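Python’s unicodedata module makes that distinction visible; a sketch:

>>> import unicodedata
>>> composed = '\u00e4'       # ä as a single code point
>>> decomposed = 'a\u0308'    # 'a' followed by a combining diaeresis
>>> composed == decomposed    # they render identically but compare unequal
False
>>> unicodedata.normalize('NFC', decomposed) == composed
True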

If you’re interested in understanding more about Unicode, the Wikipedia article can be a good starting point. There is also a whole forest of topics related to international text representation and rendering, some of which can be found on Wikipedia.

Unicode in Python 2

… is a pain.

One of the things I like about Python is that it normally makes it harder to shoot yourself in the foot (monkey patching, anyone?). The only exception that is very frustrating for me is Python 2’s Unicode support, which is ugly and difficult to get right.
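The classic trap, sketched in a Python 2 session: concatenating a byte string that contains non-ASCII bytes with a unicode string triggers a silent ASCII decode, which blows up at runtime:

>>> 'caf\xc3\xa9' + u'!'
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)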

Really, at this point I don’t care much about the other (planned) changes in Python 3. If Unicode support could be made as transparent as it is in Java or .NET, I would be really happy.