I was originally asked to write an article about translation and Linux. After I'd foolishly agreed, I realized what a task that would be: rather like writing an article about translation and Windows. Where do you start? So, with the editor's approval, I propose to look at one particular application, one which is in fact not unique to Linux. Hopefully, this article will provide an insight into some of the less familiar software available to translators, in particular open-source software.
This article isn't intended as a user guide to OmegaT, so a link to more detailed information on installation and use can be found at the end.
OmegaT... oh my WHAT?
"OmegaT, the last computer-assisted translation tool" was how OmegaT was described by its developer when it appeared on the Internet in early 2001. It might be the last one you're likely to have heard of. But what exactly is it, and how does it differ from other translation memory applications?
The key features of OmegaT are: it is basic (the functionality is very limited); it is database-oriented rather than text-oriented; it is free; it is open-source; and it is programmed in Java. The last of these characteristics is of particular interest if you are one of the few people reading this article who are not using Microsoft Windows. More about that later.
In the translation memory world, OmegaT is a lean, mean machine. The basic concept is that you, the translator, have data which is useful to you in the form of past translations, and OmegaT's function is to let you get at that data as fast as possible. Anything else is secondary. In this respect, OmegaT differs markedly from applications such as Trados TWB and Wordfast, in which the translation memory functions are an extension of the familiar word processor environment. It is much more similar in concept to Deja Vu: rather than working in a document and sending your translations to a database, you find yourself translating text in a database, and injecting your translations into the text when you have finished.
Launch the application, and you are immediately presented with a candidate for the ugliest user interface of any translation memory ever written. You are unlikely, though, to find it confusing, because there isn't much to be confused by. On the right-hand side and taking up most of the screen is a column of cells, initially empty. This is where the text to be translated will appear, one segment in each cell. To the left of the text window is an area in which further information such as fuzzy matches is displayed, together with a text search box and some statistical information. Along the top of the screen is a menu bar with the items "File," "Edit," "Tools," "Server" and "Language." That's it: not a toolbar or button in sight.
Compared to some translation memory programs, the procedure for setting up a "project" may seem unnecessarily complex and tedious. Several files and directories are involved, about which the documentation is not particularly clear, and you must take care to ensure that everything is in the right place. You are instructed explicitly to examine one file, the "file handlers," in a text editor, and modify it manually if necessary: not difficult, but hardly in the spirit of modern user-friendly GUIs, either, and one of the signs that OmegaT is still at the beta stage. There are reasons behind the complexity, notably the fact that OmegaT makes it easier for you, at least in the long run, to process projects involving multiple files.
Once a project has been created, the texts identified by the project are loaded into the user interface. The ugliness now takes on a new dimension: the fonts are an odd mix. Worse, if you have a high resolution selected on your monitor, the text may be difficult to read comfortably. (I can see the headline on ZNET now: Open-source makes you go blind, says Ballmer.) In fairness, the problem may well be nothing to do with OmegaT, and might be resolved by judicious tweaking of the font properties of the Java run-time environment. It is also likely to be confined to the Linux version of Java. It would, however, be helped greatly were OmegaT to have a zoom function such as you would normally find in a word processor. But no matter, take a close look or, if your eyesight is bad, a very close look.
Probably the first thing to strike the experienced translation memory user here is that OmegaT segments text by the paragraph. This is an interesting alternative to the more usual method of sentence-by-sentence segmenting. There are also good reasons for it. One is that the sentence-by-sentence translation approach employed by most translation memory applications constrains the translator, forcing him or her to use one-for-one substitution on the sentence level, or at least influencing his or her likelihood of taking a step back and restructuring whole paragraphs. Personally, I feel there is some justification in this argument, even for applications which provide a facility for merging sentences. Another argument is that more context is preserved within the translation memories, as matches return not just the sentence, but the whole paragraph. The drawback is that since the segments are larger, full or close matches may be much less frequent.
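To make the difference concrete, here is a minimal sketch of the two segmenting approaches in Python. It is not OmegaT's code (OmegaT is written in Java, and I haven't examined how it identifies paragraphs); the function names and the naive sentence rule, which breaks after ".", "!" or "?", are my own assumptions for illustration.

```python
import re


def segment_by_paragraph(text):
    """Split text into paragraphs: blocks separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def segment_by_sentence(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace.

    An illustrative rule only; real segmenters must handle abbreviations,
    numbers and similar cases.
    """
    sentences = []
    for paragraph in segment_by_paragraph(text):
        parts = re.split(r"(?<=[.!?])\s+", paragraph)
        sentences.extend(s for s in parts if s)
    return sentences


sample = "First sentence. Second sentence.\n\nA new paragraph."
paragraph_segments = segment_by_paragraph(sample)
sentence_segments = segment_by_sentence(sample)
print(len(paragraph_segments), "paragraph segments")
print(len(sentence_segments), "sentence segments")
```

The same text yields fewer, larger segments when split by paragraph, which is exactly why each match carries more context but exact matches become rarer.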
Which alternative is better depends upon your reasons for using a translation memory. Many see translation memory as a tool for finding chunks of text which have already been translated, and for saving the translator the task of translating them (or, increasingly frequently, the customer the chore of paying for them). If you regard your translation memory as a database not of "chunks of text," but of terms and phrases the quality of which is determined by the context provided, you might be prepared to forgo the productivity benefits in return for more context. In fact, there are those in the industry, notably MultiCorpora, vendors of the Canadian MultiTrans product, who advocate maintaining the entire corpus, thus providing context at document level.
Whatever the respective merits of sentence-based and paragraph-based segmenting, my personal view is "let the user decide." ForeignDesk is one application I'm aware of (there are no doubt others) which offers the user the choice.
User interface and navigating within the text
The second thing to strike the eye is that the (target) text before and after the active segment is displayed in cells, also reminiscent of Deja Vu. Unlike the content of the active segment cell, however, the lines in the cells preceding and following it don't wrap. You can select any of the ten or so cells either side of the active cell by clicking on them, but the lack of line wrapping means that you only see the first 150 characters or so of each paragraph. (It's worth noting that if segmenting were performed at sentence level, this would be a much more acceptable value.) In other words, the context of the active sentence remains nebulous. You can step through the segments (i.e. paragraphs) one by one with Ctrl-N (next segment) and Ctrl-P (previous segment), move to any segment displayed by clicking on it with the mouse, or jump in either direction to a particular segment by entering the string number ("string" being the term used by OmegaT for "segment") in the field provided.
These are the only direct means of navigating through the text. If you're used to an application which is an extension of your word processor, such as Wordfast or Trados TWB, this comes as something of a shock. You can't leave an active segment open and simply page up a few paragraphs to find that phrase you're sure came up half an hour ago. As no formatting is displayed, you can't use visual cues, for example to navigate at a glance to "the paragraph just before the last heading." And what translator, in Paragraph 1,012, would be inspired to take a peek at Paragraph 354? Hardly a realistic scenario.
However, these limitations aren't as severe as they might first appear. You soon get used to scrolling back through the segments. Much more valuable for navigation, though, is the "Keyword search" function, more about which later. If you feel unhappy about being unable to see the text layout, there is a simple solution: keep a copy of the source text file open in a word processor in a different window, and toggle between the two as needed.
Illustration of OmegaT, showing a translated document with two fuzzy matches for the active string. In the lower left-hand corner, the source text in OpenOffice.org
You translate by overwriting the text in the active segment. (On my system, in fact, you can't actually overwrite; overwrite/insert toggling doesn't work. This may however be a function of my configuration.) The source text is reproduced for you in the target text cell, although it is not generally needed. Hit <Return>, and the segment (paragraph) is saved and the next segment called up. Soon, you are likely to notice fuzzy matches appearing in the "Current project" window on the left.
OmegaT scans automatically for fuzzy matches, the top five of which appear in a table on the left-hand side of the interface. The table displays a serial number, the translation, and the percentage of fuzzy matching. You can toggle between the matches with Ctrl-<number>. The translation of the matches is displayed, and you toggle to see the corresponding source; I wonder whether I'm alone in finding this odd.
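For readers curious about what "the top five fuzzy matches" involves, the following Python sketch ranks stored segments by similarity to the active one. I don't know how OmegaT calculates its percentages; here the score comes from Python's standard difflib.SequenceMatcher, and the function name and the tiny English-German memory are invented for the example.

```python
from difflib import SequenceMatcher


def top_fuzzy_matches(source, memory, limit=5):
    """Return up to `limit` (percent, stored_source, stored_target) tuples, best first.

    `memory` is a hypothetical list of (source, target) pairs. The score is
    difflib's similarity ratio, used here as a stand-in for whatever
    measure OmegaT applies.
    """
    scored = []
    for stored_source, stored_target in memory:
        ratio = SequenceMatcher(None, source, stored_source).ratio()
        scored.append((round(ratio * 100), stored_source, stored_target))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:limit]


memory = [
    ("Open the file menu.", "Öffnen Sie das Menü Datei."),
    ("Close the file menu.", "Schließen Sie das Menü Datei."),
    ("Restart the computer.", "Starten Sie den Computer neu."),
]
matches = top_fuzzy_matches("Open the edit menu.", memory)
for number, (percent, stored_source, stored_target) in enumerate(matches, start=1):
    # Serial number, match percentage and stored translation, as in the table.
    print(number, f"{percent}%", stored_target)
```

Comparing every stored segment in turn is the simplest possible approach; a real translation memory engine would need something cleverer to stay fast on large memories.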
Searching for key terms
Now to the "Keyword search" function. This is where OmegaT really comes into its own. It remains user-unfriendly: you can't launch a search from the active segment and must type the words into the "Keyword search" box (though you can also cut and paste from the active window). OmegaT searches for any segment containing all the keywords entered, regardless of where in the segment they occur. It therefore resembles the Boolean "AND" syntax of a web search engine more closely than the "Concordance" function of Trados TWB or Wordfast. If your search is successful, a pop-up window appears with a table similar to that for fuzzy matches. Should you want more context (as the whole paragraph is reproduced, this is seldom the case), clicking on the number shown for a match takes you to the segment concerned. The search is fast. The Keyword search function is available for the source text only.
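The "all keywords, anywhere in the segment" behaviour is easy to picture in code. This is a minimal Python sketch under my own assumptions (case-insensitive substring matching, a hypothetical function name and sample segments), not a description of how OmegaT implements the search.

```python
def keyword_search(segments, query):
    """Return (number, segment) pairs for segments containing every keyword.

    Keywords are the whitespace-separated words of `query`. Matching is
    case-insensitive and ignores word order and position; both choices are
    assumptions made for this sketch.
    """
    keywords = query.lower().split()
    if not keywords:
        return []
    hits = []
    for number, segment in enumerate(segments, start=1):
        text = segment.lower()
        if all(keyword in text for keyword in keywords):
            hits.append((number, segment))
    return hits


segments = [
    "The pump must be switched off before the filter is removed.",
    "Remove the filter and rinse it under running water.",
    "Switch the pump on again once the filter has been refitted.",
]
hits = keyword_search(segments, "filter pump")
for number, segment in hits:
    print(number, segment)
```

Note that the query "filter pump" finds segments in which the two words are far apart and in the opposite order, which a phrase-based concordance search would miss.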
Editing functions? What editing functions?
OmegaT's developer has clearly embraced a database philosophy with heart, mind and soul (or, you might say, hook, line and sinker). Working in OmegaT is a little like working in Excel, but worse. The fact is that OmegaT has virtually no editing functions. The usual cursor keys and key combinations are there (e.g. Ctrl-<right cursor key> to move one word to the right), and you can cut and paste. To enter soft returns within cells, I resorted to cutting and pasting from a word processor (though this may be due to my configuration). There isn't even an undo function, much less a spellchecker.
Another sign of OmegaT's number-cruncher origins: as you translate, it lets you know how productive you are. A comment in the top left-hand corner informs you of the total number of segments and the number of the current segment, the total number of words in the project at creation, the total number now, and the number left to be translated. If you bill by the word (source or target), you might find these figures useful, and they can at least be described as psychological productivity enhancers, as you work through the night to meet that deadline in the morning...
Text file formats
If ever there was a case of hiding one's talents under a bushel, this is it. OmegaT can process two file formats: plain text and HTML. Given this limited choice, most translators probably wouldn't give it a second glance. Any translation memory application is expected at the very least to support MS Word, the "de facto" word processing format in the industry, and demand is also increasing for support for other formats such as Excel and PowerPoint, not to mention all manner of desk-top publishing and open-tag formats. You can, of course, convert your word processing file to plain text, translate it, and cut and paste the finished translation back into your word processor. In fact, if you do so you are likely to find the actual translation process appreciably faster, whatever application you are using. Unfortunately, the process of inserting the translation back into the final document, possibly paragraph by paragraph, and restoring the original formatting is not only time-consuming, but also a further source of error. Quite simply, very few translators are likely to find it acceptable.
To add insult to injury, OmegaT's documentation excuses the lack of file format filters with reference to its open-source status. In other words, if you don't like it, you can modify it. Indeed, you can. Under the terms of the licence, you can do almost anything you like with it. Unfortunately, most translators don't happen to be able to program in Java (or anything else, for that matter).
In fact, even in unmodified form, OmegaT is capable of much more. Its HTML parser can be regarded as an effective XML parser which also happens to be capable of parsing HTML, the latter being something of a poor relation of the former (see Alan Melby's article in the April 2000 issue of the TJ for an introduction to XML). Can I translate that out of geek-speak for you? Indeed I can. It means that OmegaT can read the XML file formats of certain applications which, in turn, have good or even excellent filters for conversion to more familiar formats such as MS Word or RTF. In particular, by dint of a simple procedure, OmegaT can read OpenOffice.org 1.0/StarOffice 6.0 files. (Note: OpenOffice.org 1.0 and StarOffice 6.0 are technically almost identical. The former is the open-source version of the latter, and as this article is about open-source software, I'll refer to it rather than to its commercial sister.)
It works like this. Open your MS Word file in OpenOffice.org, and save it as a native OpenOffice.org file. Open that file with a zip utility such as WinZip, and extract the file content.xml. Rename content.xml as content.html, and OmegaT will recognize it as a file for translation when you create a project. When your translation is finished, simply follow the same procedure in reverse. When you consider that OpenOffice.org preserves 99% of Word's formatting, the benefits are obvious.
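If you would rather not do the unzipping and renaming by hand, the procedure can be scripted. The following Python sketch uses only the standard zipfile module; the function names and the file name document.sxw are my own, and the demonstration runs on a tiny stand-in archive rather than a real OpenOffice.org file, so treat it as an outline of the idea rather than a tested conversion tool.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_for_translation(document_path, work_dir):
    """Copy content.xml out of the archive and save it as content.html."""
    html_path = Path(work_dir) / "content.html"
    with zipfile.ZipFile(document_path) as archive:
        html_path.write_bytes(archive.read("content.xml"))
    return html_path


def repack_translation(document_path, translated_html, output_path):
    """Write a copy of the archive with content.xml replaced by the translation."""
    translated = Path(translated_html).read_bytes()
    with zipfile.ZipFile(document_path) as source, \
            zipfile.ZipFile(output_path, "w") as target:
        for item in source.infolist():
            if item.filename == "content.xml":
                data = translated
            else:
                data = source.read(item.filename)
            # Reusing the original entry keeps its name and compression method.
            target.writestr(item, data)


# Demonstration on a stand-in archive (not a real OpenOffice.org document).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    original = tmp / "document.sxw"
    with zipfile.ZipFile(original, "w") as archive:
        archive.writestr("content.xml", "<text>Hello</text>")
        archive.writestr("styles.xml", "<styles/>")

    html = extract_for_translation(original, tmp)
    # Stand-in for the translation step that OmegaT would perform.
    text = html.read_text(encoding="utf-8").replace("Hello", "Bonjour")
    html.write_text(text, encoding="utf-8")

    result = tmp / "document_translated.sxw"
    repack_translation(original, html, result)
    with zipfile.ZipFile(result) as archive:
        repacked = archive.read("content.xml").decode("utf-8")
        names = archive.namelist()

print(repacked)
```

The repacking step writes a new archive rather than modifying the original, so the source document remains untouched if anything goes wrong.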
The translation memory
OmegaT maintains a database for a project in progress in a file with the name <project name>.bin. This turns out to be a gzip archive (nicely concealed without the .gz extension), itself containing a file also with the name <project name>.bin. Once a project is compiled, OmegaT creates a new directory and places in it a copy of this file, which is then named <project name>_<language>.tm. The application supports multiple TMs: once this directory has been created, you can add other memories created earlier to it, and OmegaT will search them automatically.
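Should you want to peek inside the project file yourself, the outer gzip layer can be removed with a few lines of Python. This sketch handles only that outer layer; the format of the file inside is not something I can vouch for, and the function names and the file name myproject.bin are invented for the example, which runs on a stand-in file.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def is_gzip(path):
    """Check for the two-byte gzip magic number at the start of the file."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


def unpack_project_file(bin_path, output_path):
    """Decompress a gzip-compressed file to output_path."""
    with gzip.open(bin_path, "rb") as packed, open(output_path, "wb") as unpacked:
        shutil.copyfileobj(packed, unpacked)


# Demonstration on a stand-in file (not a real OmegaT project file).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    project_file = tmp / "myproject.bin"
    with gzip.open(project_file, "wb") as handle:
        handle.write(b"stand-in payload")

    detected = is_gzip(project_file)
    unpack_project_file(project_file, tmp / "myproject.unpacked")
    payload = (tmp / "myproject.unpacked").read_bytes()

print(detected, payload)
```

Checking the magic number first is a cheap way of confirming that a file without the .gz extension really is a gzip archive before trying to decompress it.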
Management of memories is otherwise difficult, mainly owing to the limited documentation and the complete lack of export filters. However, I was able to convert memories to plain text in OpenOffice.org, so exporting them to a practical interchange format should be an easy matter for anyone with basic macro programming skills.
OmegaT's list of flaws is long. To recap:
* crude user interface
* extremely limited editing functions
* lack of memory management functions
* limited documentation
However, these more obvious flaws tend to hide the fact that under the skin, OmegaT has a well-programmed and fast translation memory engine. More than one translator has installed a promising application with a pretty user interface and a full complement of features only to find that it takes 30 seconds to open and close each segment.
In addition to sound core functionality, OmegaT has three other characteristics which make it unusual, in fact almost unique among TM applications.
Firstly, as already mentioned, OmegaT runs on Java, which in turn means that it will run on any platform on which Java runs. Those include Windows 95/98/2000/ME/NT/XP, UNIX, Linux, Solaris (both x86 and Sun SPARC), and Macintosh OS X. If your preferred operating system is Linux (like mine) or Macintosh OS X, your choice of translation memory programs is much smaller, and OmegaT correspondingly more interesting.
Secondly, OmegaT is free. The issue here is not so much cost; translation memory software has fallen dramatically in price in recent years and should now be affordable for most translators. Rather, the issue is one of availability. A glance at any translators' mailing list will show that all manner of obstacles can arise. The respective versions of the TM software and the operating system may be incompatible; when another application, in particular a word processor, is needed, all three must be compatible, but are often not. Dongles, which are often needed, can sometimes be a source of problems. By contrast, OmegaT is always there. Simply download it from the Internet, download the Java run-time environment, install both, and start working. The JRE is available from a number of sites, and you can be confident that it will be around for a long time to come. If you want to be sure of having access to OmegaT at all times, just copy it onto a diskette. Five times, if you like: at 125 KB, it is tiny. Admittedly, the idea of a translation memory for use in an emergency might sound a little far-fetched, but stories of translators facing looming deadlines and an unworkable system seem to be surprisingly frequent.
Thirdly, and most importantly, OmegaT is open-source. In a nutshell, that means that anyone can modify it, improve it, extend it, and is indeed encouraged to do so. In fact, when I began writing this article, it was on the assumption that there would be no further contribution from the original developer. In the open-source world, however, programs never die; at most, they only fade away. Even should no further development be forthcoming from the project's originator, users can still develop the application further themselves or pay a programmer to do the work. Users of open-source software are not therefore at the mercy of a commercial developer who may choose to abandon the product.
However, OmegaT's developer has now decided to revive the project, and has already made a number of major improvements. All the more reason to take a closer look at open-source software.
To find out more about OmegaT and to download a free copy, visit the OmegaT home page.