This post explains an EBML parser that I wrote in Python. (EBML is Matroska’s binary “markup language”.) It is implemented as a single-file library and is available under a free software licence.
Background
I’ve been working to implement Matroska (mka, mkv, webm) tag-reading support in Exaile. Mutagen—the tag library that we use—currently doesn’t have this feature, so I looked elsewhere.
[Update (2010-06-16): News about Exaile 0.3.2, location of the Matroska parser in Exaile's source tree, and discussion on WebM support.]
Choices
Previously, I had a working solution using hachoir-metadata, but it doesn’t really make sense to depend on another large tagging library when we’re already using Mutagen. To make matters worse, I accidentally deleted the branch during our recent Bazaar upgrade problem.
I started shopping around for other possible solutions and found videoparser, which seemed quite nice and compact. It’s still a different library, though, and it doesn’t seem to be packaged in Debian.
I was considering just using it anyway for yet another temporary hack when I chanced on MatroskaParser.pm, a Perl library written by “Omion (on HA)”. It’s only 816 lines of Perl; discounting the README and the Matroska elements table, we’re looking at fewer than 450.
Solution
I took the crazy decision of translating MatroskaParser.pm into Python. Despite the horror stories out there about Perl, this particular code is written in a style that is extremely readable if you’re somewhat familiar with the language.
Well, I’ve finished the porting: 250 lines of EBML parser written in Python. Parts of MatroskaParser.pm that are not relevant—mainly the validity checker and the Block parser—have been removed, and the output data structure has been simplified. The next job is to actually extract tags out of the structure.
Matroska tags
Matroska tags are quite different from MP3 and Vorbis tags, in that they’re not just a flat list of key-value pairs. Consider the following snippet.
[{'SimpleTag': [{'TagDefault': [1],
                 'TagLanguage': ['und'],
                 'TagName': ['TITLE'],
                 'TagString': ['Light + Shade']},
                {'TagDefault': [1],
                 'TagLanguage': ['und'],
                 'TagName': ['ARTIST'],
                 'TagString': ['Mike Oldfield']}],
  'Targets': [{'TargetTypevalue': [50]}]},
 {'SimpleTag': [{'TagDefault': [1],
                 'TagLanguage': ['und'],
                 'TagName': ['TITLE'],
                 'TagString': ['Surfing']}],
  'Targets': [{'TargetTypevalue': [30]}]}]
There are two types of tags in this example. The first (target type: 50) describes the album (title: Light + Shade, artist: Mike Oldfield), while the second (target type: 30) describes the track (title: Surfing). Translating this structure into tags that Exaile can understand is not hard; it just needs a bit of planning.
(By the way, notice that Matroska makes implementing album artists / compilation albums very intuitive: you can have an artist tag at album level, and another at track level. There are even other levels specified.)
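To make the album/track distinction concrete, here is a minimal sketch of how the nested structure could be flattened into plain key-value tags. It is not the code that ships in Exaile: the function name flatten_tags, the FLAT_KEYS table, and the output key names (album, albumartist, title, artist) are all made up for illustration, and treating a missing target type as album level (50) is an assumption.

```python
ALBUM, TRACK = 50, 30  # target type values seen in the snippet above

# (target level, Matroska tag name) -> flat tag key.
# The flat key names are illustrative, not Exaile's actual tag names.
FLAT_KEYS = {
    (ALBUM, 'TITLE'): 'album',
    (ALBUM, 'ARTIST'): 'albumartist',
    (TRACK, 'TITLE'): 'title',
    (TRACK, 'ARTIST'): 'artist',
}

def flatten_tags(tags):
    """Collapse a list of parsed Tag elements into a flat dict of lists."""
    flat = {}
    for tag in tags:
        targets = (tag.get('Targets') or [{}])[0]
        # Assumption: a tag without an explicit target type is album level.
        level = targets.get('TargetTypevalue', [ALBUM])[0]
        for simple in tag.get('SimpleTag', []):
            key = FLAT_KEYS.get((level, simple['TagName'][0]))
            if key is not None and 'TagString' in simple:
                flat.setdefault(key, []).append(simple['TagString'][0])
    return flat

tags = [
    {'SimpleTag': [{'TagName': ['TITLE'], 'TagString': ['Light + Shade']},
                   {'TagName': ['ARTIST'], 'TagString': ['Mike Oldfield']}],
     'Targets': [{'TargetTypevalue': [50]}]},
    {'SimpleTag': [{'TagName': ['TITLE'], 'TagString': ['Surfing']}],
     'Targets': [{'TargetTypevalue': [30]}]},
]
flat = flatten_tags(tags)
print(flat)
```

Keeping the mapping in a table means other levels or tag names can be supported by adding entries rather than more branches.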
Another tricky part is getting the track length out of the structure. Under /Segment/Info, you’ll find something like
[{'Duration': [14821615.0],
  'TimecodeScale': [22674]}]
At first I assumed the duration was specified in seconds and got around 171 days as output, which is obviously wrong. Duration is actually counted in units of TimecodeScale nanoseconds, so you need to apply this formula to get the length in seconds:
Length = Duration * TimecodeScale / 10^9
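As a quick check of the formula against the example values, here is a small sketch. The function name length_in_seconds is hypothetical, and the fallback of 1000000 for a missing TimecodeScale is my understanding of the Matroska default rather than something taken from the parser.

```python
def length_in_seconds(info):
    """Convert a parsed /Segment/Info entry into a length in seconds."""
    duration = info['Duration'][0]
    # Assumption: fall back to 1,000,000 ns (1 ms) if TimecodeScale is absent.
    scale = info.get('TimecodeScale', [1000000])[0]
    # Duration is in ticks of `scale` nanoseconds; 10^9 ns make one second.
    return duration * scale / 1e9

info = [{'Duration': [14821615.0], 'TimecodeScale': [22674]}]
seconds = length_in_seconds(info[0])
print(seconds)  # roughly 336 seconds, i.e. a bit over five and a half minutes
```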
Code
The code is now available in Exaile’s repository. It’s licensed under GPL 2+ with the standard Exaile exception, although I will consider relicensing it if there is interest.
Notice that the last 100-or-so lines make up the Matroska tagging part. Depending on your needs, you may need to expand the list of elements based on either MatroskaParser.pm or the Matroska specification.
Future
Matroska read-only tag support will be in Exaile 0.3.2. Maybe one day I’ll add write support and integrate the whole thing into Mutagen, but don’t count on it. If anyone wants to do it, I’m more than happy to help.
My next goal is to create a subclass of the EBML parser that uses GIO. It probably won’t be relevant to most people, so consider this just a heads-up.
What about WebM?
Funny how I made this post shortly before WebM was announced. Coincidence? Yes, unfortunately; I’m not as cool as the Mozilla and Opera people, who were let in on Google’s secret.
At this point, the WebM container is mostly just a subset of Matroska (the only incompatibility I’ve noticed is the change in doctype, from matroska to webm). As far as I know, they use the exact same EBML structure for tags, so there’s no reason Exaile or this code shouldn’t be able to read tags from a WebM file.
Posted by Johannes Sasongko