Tag Archives: python

Transliterating arbitrary text into Latin script

This post explores one of the capabilities of the PyICU library, namely its text transformation module. Specifically, we’ll look at the simplest use case: transliterating text into Latin script.

Say you are given a list of phrases, names, titles, whatever, in a script that you can’t read. You want to be able to differentiate them, but how, when they all look like random lines and curves? Well, let’s turn them into Latin characters!

>>> import icu
>>> tr = icu.Transliterator.createInstance('Any-Latin; Title').transliterate
>>> tr('Ἀριστοτέλης, Πλάτων, Σωκράτης')
'Aristotélēs, Plátōn, Sōkrátēs'

There we go. Even if you still can’t pronounce these names correctly, at least they’re hopefully easier to recognise because they are now in a script that you are more used to reading (unless you’re Greek, of course).

'Any-Latin; Title' means we want to transliterate from any script to Latin, then convert it to title case. If that’s too simple, the ICU documentation has the gory details of all the supported transforms.
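Transform IDs can be chained with semicolons, as 'Any-Latin; Title' already does. As a rough sketch, the helper below appends ICU’s Latin-ASCII transform so that the diacritics are stripped as well, which can be handy when you need plain ASCII for sorting or searching. The name romanise is made up for this example, and the code assumes PyICU is installed.

```python
def romanise(text, transform_id='Any-Latin; Latin-ASCII'):
    """Transliterate text into Latin script, then fold it to plain ASCII."""
    # PyICU is imported lazily so that the helper can be defined without it.
    import icu
    transliterator = icu.Transliterator.createInstance(transform_id)
    return transliterator.transliterate(text)
```

Calling romanise('Πλάτων') should give the name without the accent and the macron, but check the output for your own scripts before relying on it.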

Easy, no?

Caveats

Do not rely on the output as a pronunciation guide unless you know what you’re doing. For example, the Korean character 꽃 is transliterated by ICU as kkoch to keep it reversible, even though the word certainly does not sound like the gunmaker’s or Kochie’s last name, and definitely not like the synonym for rooster (the modern romanisation, which is closer to the correct pronunciation, is kkot).

The transliteration of Han characters (shared between Chinese, Japanese, and Korean) uses Chinese Pinyin, and thus may not resemble the Japanese and Korean romanisations at all. This makes the transliteration of many Japanese texts particularly awful.

>>> tr('日本国')  # "Nippon-koku" = Japan
'Rì Běn Guó'

Oops, that could start an Internet war. Use a different library if you are primarily dealing with Japanese text.

Another unfortunate thing with ICU is that there are still scripts that it doesn’t support at all. For example, it can’t transliterate to/from Javanese.

>>> tr('ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ')
'ꦫꦩꦏꦮꦸꦭꦲꦶꦁꦱ꧀ꦮꦂꦒ'

Maybe one day.


Christoph Gohlke won a PSF Community Service Award in 2014 (and it went unnoticed)

In October 2014, Christoph Gohlke won a Python Software Foundation Community Service Award. Gohlke is the (sole?) maintainer of the most impressive collection of Windows binaries of Python extensions, and his website is often the only place to find Win64 builds of some libraries.

Despite his significant contribution, the award seems to have gone unnoticed. No official announcement, no news articles, no written acknowledgement of what he did that warranted the award. It’s not even listed in the PSF Community Service Awards page, which seems to have stopped being updated one year ago (update: this is fixed now).

I’m no management expert, but this reflects badly on the PSF. The point of giving an award is not just to make the recipient feel good and continue what they’re doing, but also to encourage others to contribute to the award-giver’s interests. In this case, the first seems to have been cheapened, and the second is not accomplished at all.

Win32 Python: Getting all window titles

This post shows how you can retrieve all window titles in Microsoft Windows using Python’s ctypes module. Moreover, it also acts as a ctypes tutorial, showing how to create and use callback functions.

The following is the full code. Keep reading if you want to understand how it works. (Note: If you are reading this as a ctypes tutorial and are having trouble following the explanation, you may want to go through my previous tutorial first.)

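import ctypes
from ctypes import wintypes

user32 = ctypes.windll.user32

# BOOL CALLBACK EnumWindowsProc(HWND hwnd, LPARAM lParam)
EnumWindowsProc = ctypes.WINFUNCTYPE(wintypes.BOOL, wintypes.HWND, wintypes.LPARAM)

# Declare argument types so that handles are not truncated on 64-bit Windows.
EnumWindows = user32.EnumWindows
EnumWindows.argtypes = [EnumWindowsProc, wintypes.LPARAM]
GetWindowText = user32.GetWindowTextW
GetWindowText.argtypes = [wintypes.HWND, wintypes.LPWSTR, ctypes.c_int]
GetWindowTextLength = user32.GetWindowTextLengthW
GetWindowTextLength.argtypes = [wintypes.HWND]
IsWindowVisible = user32.IsWindowVisible
IsWindowVisible.argtypes = [wintypes.HWND]

titles = []
def foreach_window(hwnd, lParam):
	# Called once per top-level window; collect the titles of visible ones.
	if IsWindowVisible(hwnd):
		length = GetWindowTextLength(hwnd)
		buff = ctypes.create_unicode_buffer(length + 1)  # +1 for the terminating null
		GetWindowText(hwnd, buff, length + 1)
		titles.append(buff.value)
	return True  # Returning False would stop the enumeration.
EnumWindows(EnumWindowsProc(foreach_window), 0)

print(titles)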
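If you are not on Windows, the same callback mechanism can be tried out with the C library’s qsort, which also takes a function pointer. This is only a sketch for experimenting with ctypes callbacks, not part of the window-title code; the way the C library is located below is an assumption that should hold on common Linux, macOS, and Windows setups.

```python
import ctypes
import ctypes.util
import sys

# Locate the C runtime; this lookup is an assumption about the platform.
if sys.platform == 'win32':
    libc = ctypes.cdll.msvcrt
else:
    libc = ctypes.CDLL(ctypes.util.find_library('c'))

# int (*compar)(const int *, const int *)
CompareFunc = ctypes.CFUNCTYPE(
    ctypes.c_int, ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int))

def compare(a, b):
    # a and b are pointers to the two elements being compared.
    return (a[0] > b[0]) - (a[0] < b[0])

libc.qsort.argtypes = [
    ctypes.c_void_p, ctypes.c_size_t, ctypes.c_size_t, CompareFunc]
libc.qsort.restype = None

values = (ctypes.c_int * 5)(5, 1, 4, 2, 3)
# Keep a reference to the wrapped callback for as long as C code may call it.
callback = CompareFunc(compare)
libc.qsort(values, len(values), ctypes.sizeof(ctypes.c_int), callback)
sorted_values = list(values)
```

As with EnumWindowsProc, the function type is built first from the return type followed by the argument types, and the Python function is then wrapped in it before being handed to C.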


Pango: Determine if a font is monospaced

If you have a GtkFontButton, finding out whether the chosen font is monospaced is quite a complicated process. Here is a complete walk-through.

(By the way, I will be using PyGTK’s Pango documentation because the C version is a mess.)

FontButton.get_font_name returns the font family (a.k.a. “font name”), style, and size; for example, “Liberation Serif Italic 14”. The first thing we need to do is pick just the family name. We do this by going through a PangoFontDescription.

desc_str = font_button.get_font_name()  # Liberation Serif Italic 14
desc = pango.FontDescription(desc_str)
family_name = desc.get_family()  # Liberation Serif

Next, check whether the font family describes a monospaced font. Here is where it gets dodgy. We need an arbitrary PangoContext, which can be obtained from a GtkWidget using Widget.get_pango_context. We then list all available font families and find the one with the appropriate name. Call FontFamily.is_monospace to finish the job.

(By the way, this is also a good place to show off Python’s for-else construct.)

context = widget.get_pango_context()  # widget can be any GtkWidget.
for family in context.list_families():
	if family.get_name() == family_name:
		break
else:  # Should not happen.
	assert False
family.is_monospace()  # False -- Liberation Serif is proportional.
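For readers who have not met for-else before, here is a small sketch of the same pattern using plain strings instead of Pango objects. The function name and the list of families are made up for illustration.

```python
def find_family(family_names, wanted):
    """Return the index of `wanted`, or raise LookupError if it is absent."""
    for index, name in enumerate(family_names):
        if name == wanted:
            break  # Found it; the else clause below is skipped.
    else:
        # Runs only when the loop finished without hitting `break`.
        raise LookupError('no such font family: ' + wanted)
    return index

# Hypothetical list standing in for context.list_families().
families = ['DejaVu Sans Mono', 'Liberation Serif', 'Liberation Mono']
position = find_family(families, 'Liberation Serif')
```

Raising an exception in the else branch, rather than using assert, keeps the check active even when Python runs with assertions disabled.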

Win32 Python: getting user’s display name using ctypes

This post explains how you can obtain the user’s display name (a.k.a. “real name” or “full name”) in Windows, using Python’s ctypes module. It also serves as a mini tutorial/demonstration of ctypes.

First, a bit of background. I researched this while working on a patch for Jokosher. When you create a new project in Jokosher, it will prompt you with a dialog asking for the name of the project and so on. One of the fields in this dialog is the Author field, which by default should be filled with the logged-in user’s real name. While there are several ways to get the user’s login ID (a.k.a. “username”), there is no easy way to get their real name (display name) in Windows.

This is where ctypes and GetUserNameEx come in. ctypes is a Python library that lets you call C functions. GetUserNameEx is the C function in the Win32 API that we want to call.

For the impatient, here is the full code. Continue reading if you want to know how it works and maybe learn a bit about ctypes. Otherwise, copy away. Note, however, that it does not have any error checking whatsoever.

import ctypes

def get_display_name():
	GetUserNameEx = ctypes.windll.secur32.GetUserNameExW
	NameDisplay = 3

	size = ctypes.pointer(ctypes.c_ulong(0))
	GetUserNameEx(NameDisplay, None, size)

	nameBuffer = ctypes.create_unicode_buffer(size.contents.value)
	GetUserNameEx(NameDisplay, nameBuffer, size)
	return nameBuffer.value
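Since the code above does no error checking, here is one way it might be added. Treat it as a sketch, not a tested drop-in replacement: the function name and constants are made up for this example, it only runs on Windows, and it assumes that the first probing call fails with ERROR_MORE_DATA while reporting the required buffer size.

```python
import ctypes

NAME_DISPLAY = 3       # EXTENDED_NAME_FORMAT value for the display name
ERROR_MORE_DATA = 234  # Expected error when the buffer is too small

def get_display_name_checked():
    """Return the user's display name, raising OSError on failure."""
    from ctypes import wintypes
    secur32 = ctypes.WinDLL('secur32', use_last_error=True)
    get_user_name_ex = secur32.GetUserNameExW
    get_user_name_ex.argtypes = [
        ctypes.c_int, wintypes.LPWSTR, ctypes.POINTER(wintypes.ULONG)]
    get_user_name_ex.restype = wintypes.BOOLEAN

    size = wintypes.ULONG(0)
    # First call: no buffer, so it should fail and report the needed size.
    if not get_user_name_ex(NAME_DISPLAY, None, ctypes.byref(size)):
        error = ctypes.get_last_error()
        if error != ERROR_MORE_DATA:
            raise ctypes.WinError(error)

    name_buffer = ctypes.create_unicode_buffer(size.value)
    # Second call: fill the buffer; any failure here is a real error.
    if not get_user_name_ex(NAME_DISPLAY, name_buffer, ctypes.byref(size)):
        raise ctypes.WinError(ctypes.get_last_error())
    return name_buffer.value
```

Accounts without a display name make the call fail, so a caller would probably want to catch OSError and fall back to the login ID.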


Using matplotlib in a Web application

matplotlib’s FAQ has a section dealing with the exact topic of this post: using matplotlib in a web application server. The problem is that I couldn’t find it easily from my Web searches. What I’m doing here is adding my own twist to the answer, and hopefully making it slightly more search-friendly.

While testing a Web app that I was working on, I noticed that it would often hang. At first I dismissed it as a server problem, but it kept occurring on one particular page. A few hours and many head scratches later, I narrowed the problem down to matplotlib.

# Negative example; do not use.

import matplotlib.pyplot as plt

def callback():
	# ... (process data)
	fig = plt.figure()
	# ... (draw stuff)
	fig.savefig(path)

I used matplotlib to draw a plot and save it to a file. The code was quite long, but it involved steps similar to the above listing. The first time it ran, everything went OK. The second time, it always hung at the pyplot.figure call. This smelled like a threading / deadlock problem, so I tried to put a lock on the pyplot calls (which I should have done anyway, considering pyplot operates on a single plot at a time). Still, it didn’t work.

After some Web searches and more head scratching, I accidentally arrived at the FAQ entry mentioned earlier.

Here’s the gist of the available solutions.

First option: configure matplotlib to use the Anti-Grain Geometry backend. Continue using pyplot, carefully grouping its commands together and surrounding them with a lock.

from threading import Lock
lock = Lock()

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def callback():
	# ... (process data)
	with lock:
		fig = plt.figure()
		# ... (draw stuff)
		fig.savefig(path)

Second, better option: dump pyplot and use matplotlib’s object-oriented API instead. For this one, you don’t need to care about threads or locking or whatever.

from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from matplotlib.figure import Figure

def callback():
	# ... (process data)
	fig = Figure()
	canvas = FigureCanvas(fig)
	# ... (draw stuff)
	canvas.print_figure(path)

EBML/Matroska parser for Python

This post explains a Python EBML parser that I wrote. (EBML is Matroska’s binary “markup language”.) It is implemented as a single-file library and is available under a free software licence.

Background

I’ve been working to implement Matroska (mka, mkv, webm) tag-reading support in Exaile. Mutagen—the tag library that we use—currently doesn’t have this feature, so I looked elsewhere.

Choices

Previously, I had a working solution using hachoir-metadata, but it doesn’t really make sense to depend on another large tagging library when we’re already using Mutagen. To make matters worse, I accidentally deleted the branch during our recent Bazaar upgrade problem.

I started shopping around for other possible solutions and found videoparser, which seemed quite nice and compact. It’s still a different library, though, and it doesn’t seem to be packaged in Debian.

I was considering just using it anyway for yet another temporary hack when I chanced on MatroskaParser.pm, a Perl library written by “Omion (on HA)”. It’s only 816 lines of Perl; discounting the README and the Matroska elements table, we’re looking at fewer than 450.

Solution

I decided to translate MatroskaParser.pm into Python. Despite the horror stories out there about Perl, this particular code is written in a style that is extremely readable if you’re somewhat familiar with the language.

Well, I’ve finished the porting: 250 lines of EBML parser written in Python. Parts of MatroskaParser.pm that are not relevant—mainly the validity checker and the Block parser—have been removed, and the output data structure has been simplified. The next job is to actually extract tags out of the structure.
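To give a flavour of what an EBML parser has to do at the lowest level, here is a sketch of decoding the variable-length integers that EBML uses for element IDs and sizes. It is not the code from the port; the function names are made up, and special cases such as the “unknown size” marker are ignored.

```python
import io

def read_vint(stream, keep_marker=False):
    """Read one EBML variable-length integer; return (value, byte_count)."""
    first = stream.read(1)
    if not first:
        raise EOFError('no more data')
    first_byte = first[0]
    if first_byte == 0:
        raise ValueError('integers longer than 8 bytes are not handled')
    # The number of leading zero bits, plus one, gives the total length.
    length = 1
    mask = 0x80
    while not first_byte & mask:
        length += 1
        mask >>= 1
    # IDs keep the length marker bit; sizes have it stripped.
    value = first_byte if keep_marker else first_byte & (mask - 1)
    rest = stream.read(length - 1)
    if len(rest) != length - 1:
        raise EOFError('truncated integer')
    for byte in rest:
        value = (value << 8) | byte
    return value, length

def read_element_header(stream):
    """Return (element_id, data_size) for the element starting here."""
    element_id, _ = read_vint(stream, keep_marker=True)
    data_size, _ = read_vint(stream)
    return element_id, data_size

# The ID of the EBML header element, followed by a one-byte size of 31.
sample = io.BytesIO(bytes([0x1A, 0x45, 0xDF, 0xA3, 0x9F]))
header = read_element_header(sample)
```

A full parser repeats this for every element, then uses an elements table to decide whether the payload is an integer, a float, a string, or a container of further elements.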

Matroska tags

Matroska tags are quite different from MP3 and Vorbis tags, in that they’re not just a flat list of key-value pairs. Consider the following snippet.

[{'SimpleTag': [{'TagName': ['TITLE'],
                 'TagString': ['Light + Shade']},
                {'TagName': ['ARTIST'],
                 'TagString': ['Mike Oldfield']}],
  'Targets': [{'TargetTypevalue': [50]}]},
 {'SimpleTag': [{'TagName': ['TITLE'],
                 'TagString': ['Surfing']}],
  'Targets': [{'TargetTypevalue': [30]}]}]

There are two types of tags in this example. The first (target type: 50) describes the album (title: Light + Shade, artist: Mike Oldfield), while the second (target type: 30) describes the track (title: Surfing). Translating this structure into tags that Exaile can understand is not hard; it just needs a bit of planning.
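As a sketch of that planning, the structure can first be grouped into one flat dictionary per target level. This is not the code that ended up in Exaile; the function name is made up, the key spellings follow the snippet above, and falling back to level 50 when no target type is given is an assumption based on my reading of the Matroska specification.

```python
tags = [{'SimpleTag': [{'TagName': ['TITLE'],
                        'TagString': ['Light + Shade']},
                       {'TagName': ['ARTIST'],
                        'TagString': ['Mike Oldfield']}],
         'Targets': [{'TargetTypevalue': [50]}]},
        {'SimpleTag': [{'TagName': ['TITLE'],
                        'TagString': ['Surfing']}],
         'Targets': [{'TargetTypevalue': [30]}]}]

def flatten_tags(tags, default_level=50):
    """Group SimpleTag entries by target level: {level: {name: value}}."""
    result = {}
    for tag in tags:
        targets = tag.get('Targets') or [{}]
        level = targets[0].get('TargetTypevalue', [default_level])[0]
        level_tags = result.setdefault(level, {})
        for simple in tag.get('SimpleTag', []):
            name = simple['TagName'][0]
            # Binary tags have no TagString; store an empty string for them.
            level_tags[name] = simple.get('TagString', [''])[0]
    return result

by_level = flatten_tags(tags)
album_title = by_level[50]['TITLE']  # 'Light + Shade'
track_title = by_level[30]['TITLE']  # 'Surfing'
```

From here, mapping to a flat tag set is a matter of deciding which level feeds which field, for example level 50 for album and album artist, and level 30 for title and artist.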

By the way, notice that Matroska makes implementing album artists / compilation albums very intuitive: you can have an artist tag at album level, and another at track level. There are even other levels specified. As a further example, because Light + Shade consists of two CDs labelled Light and Shade respectively, you could use them as the titles at level 40 (between album and track); however, this is not common practice.

Another tricky part is getting the track length out of the structure. Under /Segment/Info, you’ll find something like

[{'Duration': [14821615.0],
  'TimecodeScale': [22674]}]

At first I randomly assumed that the duration was specified in seconds, and got around 171 days as output, which is obviously wrong. Apparently you need to apply this formula to get the length in seconds:

Length = Duration * TimecodeScale / 10^9

Note that TimecodeScale may be omitted; it is one of the few important elements with default values (1,000,000 in its case).
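The formula and the default value can be put together in a few lines. This is a sketch that works on the parsed structure shown above; the function name is made up for the example.

```python
DEFAULT_TIMECODE_SCALE = 1000000  # Nanoseconds per tick when omitted

def track_length_seconds(info):
    """Compute the length in seconds from a parsed /Segment/Info dict."""
    duration = info['Duration'][0]
    scale = info.get('TimecodeScale', [DEFAULT_TIMECODE_SCALE])[0]
    return duration * scale / 1e9

info = [{'Duration': [14821615.0],
         'TimecodeScale': [22674]}]
length = track_length_seconds(info[0])  # About 336 seconds
```

That comes to roughly five and a half minutes, which is a far more believable track length than 171 days.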

Code

The code is now available in Exaile’s repository. It’s licensed under GPL 2+ with the standard Exaile exception, although I will consider relicensing it if there is interest.

Notice that the last 100-or-so lines make up the Matroska tagging part. Depending on your needs, you may need to expand the list of elements based on the Matroska specification. There are also 40 lines of code that subclass the parser to use GIO to read the files; you may want to remove this chunk of code if it’s not relevant to you.

Future

Matroska read-only tag support will be in Exaile 0.3.2. Maybe one day I’ll add write support and integrate the whole thing into Mutagen, but don’t count on it. If anyone wants to do it, I’m more than happy to help.

What about WebM?

Funny how I made this post shortly before WebM was announced. Coincidence? Yes, unfortunately; I’m not as cool as the Mozilla and Opera people, who were let in on Google’s secret.

At this point, the WebM container is mostly just a subset of Matroska (the only incompatibility I’ve noticed is the change in doctype, from matroska to webm). As far as I know, they use the exact same EBML structure for tags, so there’s no reason Exaile or this code shouldn’t be able to read tags from a WebM file.