Last weekend I spent an hour looking through books in our home library that I had not pulled out for a while. One of them, a work on linguistics and Bible translation,1 started with these two paragraphs:
Few of us even think of the extraordinary complexity of the speech process. Most of us simply talk. True enough we talk with varying degrees of fluency, but every normal child learns to talk, and this is quite a remarkable achievement.
Consider the simple fact that there is an infinite number of sentences potentially available in each language known to us. The sentence which stands at the beginning of this paragraph has probably never before been written down or spoken or even thought. Very similar sentences have been produced before, and other similar sentences will be produced again, but it is probable that until this chapter was written no one had ever before written or said: "Consider the simple fact that there is an infinite number of sentences potentially available in each language known to us."
When I first read this I found myself marveling at the incredible variety of human expression through language. But that amazement came to a screeching halt when I realized what this statement means for the translation technology that I've been talking about for years and that many of us are using: translation memory.
To test the statement's validity, I rushed to my computer to perform a Google search on the sentence in question. Not a single hit! Indeed, chances are that the two occurrences of the sentence above were truly the only times this particular sentence was ever written. (Of course, now it has been mentioned in this article as well, so it will show up in Google from now on.)
What does this mean for us? A translation memory in its simplest form, of course, is nothing more than a collection of translated segments, which most typically occur in sentence form. We all know that some kinds of texts have a fairly high degree of repetition, including instructional text, legal boilerplate, and certain medical phrasing, but the repetition in the majority of texts does not happen on the sentence level.
Until very recently, the translation memory component in most translation environment tools was stuck at almost exactly the same place it had occupied in the early 1990s. Perfect and fuzzy matching on the segment level and manual concordance searches through the translation memory were essentially the only ways to get to the data, meaning that the largest amount of data in our translation memories was doomed to the life of Sleeping Beauties: lots of data, all slumbering, most of it beautiful.
Only in the last year or two have things started to change. Partly due to increased competition and partly due to frustration on the side of us consumers about the restricted usefulness of the old paradigms, tool developers have been forced to look at their existing technology to try to find ways to extend its use.
Hold on to that thought while we investigate one area in which the use of translation memories has traditionally not been fully exploited. Let's look at our example sentence again:
- "Consider the simple fact that there is an infinite number of sentences potentially available in each language known to us" has 0 Google hits2
- "Consider the simple fact" has about 108,000 Google hits
- "an infinite number of sentences" has 64,800 Google hits
- "potentially available in each" has 41,300 Google hits
- "language known to us" has 35,600 Google hits
I know that we're typically not worried about what is or is not available through search engines. But as translators we should be concerned about how we can reuse data that has already been translated, either by us or by somebody else. And the numbers above show us that there is an infinitely greater likelihood of finding matches for segments within a sentence than for the complete sentence.
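The arithmetic behind those numbers is easy to see: a single sentence contains only one whole-sentence "match candidate," but dozens of shorter subsegments. Here is a small sketch that enumerates the word n-grams of the example sentence; the 4-to-5-word window is an arbitrary choice for illustration.

```python
# Sketch: enumerate word n-gram "subsegments" of a sentence, to show how
# many reusable chunks hide inside one unrepeatable sentence.
def subsegments(sentence, min_len=4, max_len=5):
    words = sentence.split()
    return [
        " ".join(words[i:i + n])
        for n in range(min_len, max_len + 1)
        for i in range(len(words) - n + 1)
    ]

s = ("Consider the simple fact that there is an infinite number "
     "of sentences potentially available in each language known to us")

subs = subsegments(s)
print(len(subs))   # -> 33 chunks of 4-5 words from this one 20-word sentence
print(subs[0])     # -> 'Consider the simple fact'
```

The 20-word sentence yields 33 four- and five-word chunks, and as the Google counts above show, phrases of that length recur constantly even when the full sentence never does.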
As it so happens, this is exactly the one area that almost all tool developers have recently focused on: the extended and (semi-)automated use of so-called subsegments, or segments within whole sentences. Interestingly, though, the developers have approached this goal from very different angles. Here are some examples:
- Trados Studio uses so-called AutoSuggest dictionaries to distill data from translation memories and then offers suggestions based on the source segment and the first few keystrokes.
- memoQ performs subsegment searches based on its Longest Substring Concordance (LSC) technology to suggest subsegments in its regular search pane.
- Déjà Vu X2 uses its DeepMiner technology to analyze matches of subsegments between source and target based on number of occurrences and uses those for a variety of things, including correction of fuzzy matches.
- Star Transit looks into the target part of its reference materials (Transit's equivalent of a translation memory) if no matches in the source are found and makes suggestions based on your first few keystrokes.
There are two other tools, Lingotek and MultiTrans, that have used subsegment searches for a long time, but partly due to their limited market penetration they have not been able to make the splash that the better-known tools are now creating.
What does this mean for our practical work?
The first thing that comes to mind is that the quality of the materials within a translation memory is more important than ever. Whereas before we might not have had to worry about the existence of poorly translated segments within our translation memory (chances were that the prince would never show up to wake those Sleeping Beauties anyway), now every translation unit or segment pair within a translation memory is actually being used by the subsegmenting abilities of our tools. If you used to roll your eyes at "translation memory maintenance" as one of those things that one should do but never did anything about, you might now actually have to start taking action. A good starting place might be to look at Olifant, the most powerful (and free) translation memory maintenance tool currently available.
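What does such maintenance look for? As a minimal sketch, the script below scans TMX data (the standard exchange format for translation memories) and flags translation units that would pollute subsegment suggestions. The sample data and the two rules, empty targets and untranslated source-equals-target pairs, are illustrative; real maintenance in a tool like Olifant covers far more.

```python
# Sketch of TM maintenance: flag suspicious translation units in TMX data.
# The embedded TMX sample and the flagging rules are illustrative only.
import xml.etree.ElementTree as ET

tmx = """<tmx version="1.4"><body>
  <tu><tuv xml:lang="en"><seg>Press Enter.</seg></tuv>
      <tuv xml:lang="de"><seg>Druecken Sie die Eingabetaste.</seg></tuv></tu>
  <tu><tuv xml:lang="en"><seg>Untranslated text</seg></tuv>
      <tuv xml:lang="de"><seg>Untranslated text</seg></tuv></tu>
  <tu><tuv xml:lang="en"><seg>Orphan segment</seg></tuv>
      <tuv xml:lang="de"><seg></seg></tuv></tu>
</body></tmx>"""

def suspicious_units(tmx_text):
    """Yield (source, target, reason) for units worth a human look."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = [tuv.findtext("seg") or "" for tuv in tu.iter("tuv")]
        if len(segs) < 2:
            continue
        source, target = segs[0], segs[1]
        if not target.strip():
            yield source, target, "empty target"
        elif source == target:
            yield source, target, "source equals target"

for src, tgt, reason in suspicious_units(tmx):
    print(f"{reason}: {src!r}")
```

Either of the flagged units would, under the new paradigm, feed bad subsegments into every future suggestion, which is precisely why maintenance now pays off.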
The second arena where the new subsegmenting feature comes into play is a renewed appreciation of data from external sources. While aligning previously translated materials from the same client might not always have been cost-effective under the old translation memory paradigm of perfect and fuzzy matches, subsegmenting offers much higher leverage and dramatically increases the usefulness of a variety of data sources. Aside from aligned data and well-maintained legacy translation memories, data resources such as the ones offered by TDA or MyMemory might also prove to be more useful and productive.
So, yes, the authors of that long-forgotten book in my library might have been right in regard to many complete sentences. But our technology has just given us access to the smaller building blocks of language, and it would be a shame not to use those.
1 Peter Cotterell & Max Turner: Linguistics & Biblical Interpretation. InterVarsity, 1989.
2 These and all other Google statistics were generated on September 13, 2011.