Tag Archives: i18n

Software internationalisation

Transliterating arbitrary text into Latin script

This post explores one of the capabilities of the PyICU library, namely its text transformation module. Specifically, we’ll look at the simplest use case: transliterating text into Latin script.

Say you are given a list of phrases, names, titles, whatever, in a script that you can’t read. You want to be able to differentiate them, but how, when they all look like random lines and curves? Well, let’s turn them into Latin characters!

>>> import icu
>>> tr = icu.Transliterator.createInstance('Any-Latin; Title').transliterate
>>> tr('Ἀριστοτέλης, Πλάτων, Σωκράτης')
'Aristotélēs, Plátōn, Sōkrátēs'

There we go. Even if you still can’t pronounce these names correctly, at least they’re hopefully easier to recognise because they are now in a script that you are more used to reading (unless you’re Greek, of course).

'Any-Latin; Title' means we want to transliterate from any script to Latin, then convert it to title case. If that’s too simple, the ICU documentation has the gory details of all the supported transforms.

Easy, no?

Caveats

Do not rely on the output as pronunciation guide unless you know what you’re doing. For example, the Korean character 꽃 is transliterated by ICU as kkoch to keep it reversible, even though the word certainly does not sound like the gunmaker’s nor Kochie’s last names, and definitely not like the synonym for rooster (the modern romanisation, which matches closer to the correct pronunciation, is kkot).

The transliteration of Han characters (shared between Chinese, Japanese, and Korean) uses Chinese Pinyin, and thus may not resemble the Japanese and Korean romanisations at all. This makes the transliteration of many Japanese texts particularly awful.

>>> tr('日本国')  # "Nippon-koku" = Japan
'Rì Běn Guó'

Oops, that could start an Internet war. Use a different library if you are primarily dealing with Japanese text.

Another unfortunate thing with ICU is that there are still scripts that it doesn’t support at all. For example, it can’t transliterate to/from Javanese.

>>> tr('ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ')
'ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ'

Maybe one day.

Unicode with Python 2 and PyGTK

Playing with Unicode in Python 2 is not fun, and combining this with third-party libraries brings even more headaches. This post explains how Unicode in PyGTK is handled.

Note: This information is only valid for Python 2.x. It will likely change when PyGTK releases support for Python 3.

Calling GTK+ functions: PyGTK accepts str and unicode objects as input. str objects are assumed to be in UTF-8. If you pass a non-UTF-8 str to a GTK+ function, it will work until you try to show it, where you’ll get a “PangoWarning: Invalid UTF-8 string passed to pango_layout_set_text()”.

Handling GTK+ return values: PyGTK functions always return strings as str objects. In most (all?) cases, the strings are encoded in UTF-8. Ideally, Python programs should use unicode strings internally, so it’s wise to convert the output of PyGTK function calls to unicode.

Example:

label1.set_text("Some UTF-8 string")
label1.set_text(u"Some Unicode string")
x = label1.get_text()  # x is an str object containing UTF-8 string.
y = unicode(x, 'utf-8')  # y is the unicode version of x.
y = x.decode('utf-8')  # Same as above.

fix-columns-i18n merged

There’s a long-standing internationalisation bug in Exaile that has been nagging me for some time. In localised environments, column names get saved to the settings in translated form. This has annoying consequences.

No more; I’ve merged the fix-columns-i18n branch that I have been working on for a while. There’s just a catch: using existing settings you’ll see no columns at all, which means no cells, which means completely empty playlists. You can simply re-enable them from View → Columns.

Edit (2008-01-29): The remaining bug is fixed now.