Volume 15, No. 4 
October 2011

  Translation Journal
The Translator & the Computer

Building Blocks

by Jost Zetzsche, Ph.D.

Last weekend I spent an hour looking through books in our home library that I had not pulled out for a while. One of them, a work on linguistics and Bible translation,1 started with these two paragraphs:

Few of us even think of the extraordinary complexity of the speech process. Most of us simply talk. True enough we talk with varying degrees of fluency, but every normal child learns to talk, and this is quite a remarkable achievement.

Consider the simple fact that there is an infinite number of sentences potentially available in each language known to us. The sentence which stands at the beginning of this paragraph has probably never before been written down or spoken or even thought. Very similar sentences have been produced before, and other similar sentences will be produced again, but it is probable that until this chapter was written no one had ever before written or said: "Consider the simple fact that there is an infinite number of sentences potentially available in each language known to us."

When I first read this I found myself marveling at the incredible variety of human expression through language. But that amazement came to a screeching halt when I realized what this statement means for the translation technology that I've been talking about for years and that many of us are using—translation memory.

To test the statement's validity, I rushed to my computer to perform a Google search on the sentence in question. Not a single hit! Indeed, chances are that the two occurrences of the sentence above were truly the only times this particular sentence was ever written. (Of course, now it has been mentioned in this article as well, so it will show up in Google from now on.)

What does this mean for us? A translation memory in its simplest form, of course, is nothing more than a collection of translated segments that occur most typically in sentence form. We all know that some kinds of texts have a fairly high degree of repetition, including instructional text, legal boilerplate, and certain medical phrasing, but in the majority of texts the repetition does not happen on the sentence level.
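In code, that simplest form is little more than a lookup table. Here is a minimal sketch (illustrative only; real TM formats such as TMX carry rich metadata alongside each unit, and the example segment pairs are invented):

```python
# A translation memory in its simplest form: translated segment pairs,
# keyed by the exact source sentence.
tm = {
    "Click OK to continue.": "Klicken Sie auf OK, um fortzufahren.",
    "See page 12 for details.": "Details finden Sie auf Seite 12.",
}

def exact_match(tm, source):
    """Return the stored translation only on a perfect segment match."""
    return tm.get(source)

print(exact_match(tm, "Click OK to continue."))  # -> Klicken Sie auf OK, um fortzufahren.
print(exact_match(tm, "Click OK to proceed."))   # -> None: one changed word, no match
```

One changed word is enough to come away empty-handed, which is exactly why whole-sentence repetition yields so little.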

Until very recently, the translation memory component in most translation environment tools was stuck at almost exactly the same place it had occupied in the early 1990s. Perfect and fuzzy matching on the segment level and manual concordance searches through the translation memory were essentially the only ways to get to the data, meaning that most of the data in our translation memories was doomed to the life of Sleeping Beauties—lots of data, all slumbering, most of it beautiful.
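That classic perfect/fuzzy paradigm can be sketched in a few lines. In this sketch, Python's standard difflib stands in for the more refined edit-distance scoring that commercial tools use, and the segment pairs are invented:

```python
import difflib

# Toy translation memory (source -> target).
tm = {
    "Click OK to continue.": "Klicken Sie auf OK, um fortzufahren.",
    "See page 12 for details.": "Details finden Sie auf Seite 12.",
}

def fuzzy_matches(tm, source, threshold=0.6):
    """Rank stored segments by surface similarity to the new source
    segment and return those above the match threshold."""
    scored = []
    for src, tgt in tm.items():
        score = difflib.SequenceMatcher(None, source, src).ratio()
        if score >= threshold:
            scored.append((round(score, 2), src, tgt))
    return sorted(scored, reverse=True)

# "proceed" instead of "continue": no perfect match, but a usable fuzzy one.
print(fuzzy_matches(tm, "Click OK to proceed."))
```

Everything below the threshold stays asleep in the database — which, with whole sentences as the unit of comparison, is most of it.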

Only in the last year or two have things started to change. Partly due to increased competition and partly due to frustration on the side of us consumers about the restricted usefulness of the old paradigms, tool developers have been forced to look at their existing technology to try to find ways to extend its use.

Hold on to that thought while we investigate one area in which the use of translation memories has traditionally not been fully exploited. Let's look at our example sentence again:

  • "Consider the simple fact that there is an infinite number of sentences potentially available in each language known to us" has 0 Google hits2
  • "Consider the simple fact" has about 108,000 Google hits
  • "an infinite number of sentences" has 64,800 Google hits
  • "potentially available in each" has 41,300 Google hits
  • "language known to us" has 35,600 Google hits

I know that we're typically not worried about what is or is not available through search engines. But as translators we should be concerned about how we can reuse data that has already been translated, either by us or by somebody else. And the numbers above show us that there is an infinitely greater likelihood of finding matches for segments within a sentence than for the complete sentence.
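The arithmetic behind that greater likelihood is easy to see: a single 20-word sentence yields dozens of three-to-five-word subsegments, each a separate chance of a match. A quick illustration:

```python
def subsegments(sentence, min_words=3, max_words=5):
    """Enumerate every contiguous run of 3-5 words in a sentence --
    each one a separate candidate for a subsegment match."""
    words = sentence.split()
    return [
        " ".join(words[i:i + n])
        for n in range(min_words, max_words + 1)
        for i in range(len(words) - n + 1)
    ]

s = ("Consider the simple fact that there is an infinite number of "
     "sentences potentially available in each language known to us")
print(len(subsegments(s)))  # -> 51 candidate lookups from a single sentence
```

Where the full sentence offered one chance of a hit, its subsegments offer fifty-one.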

As it so happens, this is exactly the area that almost all tool developers have recently focused on: the extended and (semi-)automated use of so-called subsegments, or segments within whole sentences. Interestingly, though, the developers have approached this goal from very different angles. Here are some examples:

  • Trados Studio uses the so-called AutoComplete dictionaries to distill data from translation memories and then offers suggestions based on the source segment and the first few keystrokes.
  • memoQ performs subsegment searches based on its Longest Substring Concordance (LSC) technology to suggest subsegments in its regular search pane.
  • Déjà Vu X2 uses its DeepMiner technology to analyze matches of subsegments between source and target based on number of occurrences and uses those for a variety of things, including correction of fuzzy matches.
  • Star Transit looks into the target part of its reference materials (Transit's equivalent of a translation memory) if no matches are found in the source and makes suggestions based on your first few keystrokes.
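As a rough illustration of the substring-based approach behind features like longest-substring concordance (the tools' actual algorithms are proprietary; this sketch, with invented data, only conveys the idea):

```python
import difflib

def longest_substring_hits(tm_sources, query, min_len=10):
    """For each stored source segment, find the longest character run it
    shares with the query sentence and report runs above min_len."""
    hits = []
    for src in tm_sources:
        m = difflib.SequenceMatcher(None, query, src).find_longest_match(
            0, len(query), 0, len(src))
        if m.size >= min_len:
            hits.append((src, query[m.a:m.a + m.size]))
    return hits

tm_sources = [
    "There is an infinite number of sentences in every language.",
    "Press OK to continue.",
]
query = "Consider that there is an infinite number of sentences available."
print(longest_substring_hits(tm_sources, query))
```

Neither stored segment matches the query as a whole, yet the first one still surrenders a long, immediately reusable stretch of text.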

There are two other tools—Lingotek and MultiTrans—that have used subsegment searches for a long time, but, partly due to their lack of market penetration, they have not been able to make the splash that the better-known tools are now creating.

What does this mean for our practical work?

The first thing that comes to mind is that the quality of the materials within a translation memory is more important than ever. Whereas before we might not have had to worry about the existence of poorly translated segments within our translation memory—chances were that the prince would never show up to wake those Sleeping Beauties anyway—now every translation unit or segment pair within a translation memory is actually being used by the subsegmenting abilities of our tools. If you used to roll your eyes at "translation memory maintenance" as one of those things one should do but never did anything about, you might now actually have to take action. A good starting place is Olifant, the most powerful (and free) translation memory maintenance tool currently available.
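What taking action might mean in the simplest case: stripping the obvious junk from a TM export. This is a toy stand-in, with invented data, for what a dedicated maintenance tool like Olifant does far more thoroughly:

```python
def clean_tm(units):
    """Drop obvious junk from a list of (source, target) pairs: empty
    targets, untranslated source==target copies, and exact duplicates."""
    seen = set()
    cleaned = []
    for src, tgt in units:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt or src == tgt or (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned

units = [
    ("Click OK.", "Klicken Sie auf OK."),
    ("Click OK.", "Klicken Sie auf OK."),  # duplicate
    ("Cancel", ""),                        # empty target
    ("README.txt", "README.txt"),          # copied, not translated
]
print(clean_tm(units))  # -> [('Click OK.', 'Klicken Sie auf OK.')]
```

Junk that a whole-sentence match would rarely have surfaced now resurfaces constantly through subsegment suggestions, so it pays to weed it out once at the source.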

The second arena where the new subsegmenting features come into play is a renewed appreciation of the usefulness of data from external sources. While aligning previously translated materials from the same client might not always have been cost-effective under the old translation memory paradigm of perfect and fuzzy matches, subsegmenting yields much higher leverage and dramatically increases the usefulness of a variety of data sources. Aside from aligned data and well-maintained legacy translation memories, resources such as those offered by TDA or MyMemory might also prove more useful and productive.

So, yes, the authors of that long-forgotten book in my library might have been right in regard to many complete sentences. But our technology has just given us access to the smaller building blocks of language, and it would be a shame not to use those.



1 Peter Cotterell & Max Turner: Linguistics & Biblical Interpretation. InterVarsity, 1989.

2 These and all other Google statistics were generated on September 13, 2011.