Different approaches to the automatic evaluation of machine translation (MT) quality are considered. We describe several methods for the automatic
evaluation of MT, such as methods based on string matching and n-gram models. The candidate translations produced by Google and PROMPT are compared with
the reference translation by an automatic translation evaluation program, and the results of the evaluation are presented.
Keywords: automatic evaluation, quality of translation, machine translation, BLEU, F-measure, TER.
The idea of machine translation (MT) of natural languages first appeared in the seventeenth century, but became a reality only at the end of the twentieth
century. Today, computer programs are widely used to automate the translation process. Although great progress has been made in the field of machine
translation, fully automated translations are far from being perfect. Nevertheless, countries continue spending millions of dollars on various automatic
translation programs. In the early 1990s, the U.S. government sponsored a competition among MT systems. Perhaps one of the valuable outcomes of that
enterprise was a corpus of manually produced numerical evaluations of MT quality with respect to a set of reference translations. The development of
MT systems has given impetus to a large number of investigations, thereby encouraging many researchers to seek reliable methods for automatic MT evaluation.
Machine translation evaluation serves two purposes: a relative estimate allows one to find out whether one MT system is better than another, and an
absolute estimate (a value ranging from 0 to 1) gives an absolute measure of efficiency (for example, a value equal to unity means a perfect translation).
However, the development of appropriate methods for numerical MT quality evaluation is a challenging task. In many fields of science, measurable efficiency
indices exist, such as, for example, the difference between the predicted and actually observed results. Since natural languages are complicated, an
assessment of translation correctness is extremely difficult. Two completely different sequences of words (sentences) can be fully equivalent (e.g., There is a vase on the table and The vase is on the table), and two sequences that differ by a small detail can have completely different
meanings (e.g., There is no vase on the table, and There is a vase on the table).
Traditionally, the bases for evaluating MT quality are adequacy (the translation conveys the same meaning as the original text) and fluency (the
translation is correct from the grammatical point of view). Most modern methods of MT quality assessment rely on reference translations. Earlier approaches
to scoring a ‘candidate’ text with respect to a reference text were based on the idea of similarity of a candidate text (the text translated by
an MT system) and a reference text (the text translated by a professional translator), i.e., the similarity score was to be proportional to the number of
matching words. At about the same time, a different idea was put forward. It was based on the fact that matching words in the right order in the candidate
and reference sentences should have higher scores than matching words out of order.
Perhaps the simplest version of the same idea is that a candidate text should be rewarded for containing longer contiguous subsequences of matching words.
Papineni et al. reported that a particular version of this idea, which they call ‘BLEU,’ correlates very highly with human
judgments. Doddington proposed another version of this idea, now commonly known as the ‘NIST’ score. Although the BLEU and NIST measures
might be useful for comparing the relative quality of different MT outputs, it is difficult to gain insight from such measures.
In this paper, we consider different methods of MT quality assessment and analyze candidate and reference translations. In the following
sections, we describe several automatic MT evaluation methods: some of them are based on string matching, while others, such as n-gram models, draw on
techniques from information retrieval. We then assess the quality of translation using an automatic evaluation program.
2. Methods of automatic MT quality evaluation
To date, the main approach to the quality assessment of language models for MT systems relies on statistical methods. In this case, the model
is, in fact, a probability distribution over the set of all sentences of a language. Obviously, such a distribution cannot be represented explicitly;
therefore, more compact algorithms are used. Let us briefly consider the models currently used in commercial and experimental systems of MT quality
assessment with unlimited dictionaries.
2.1 Method of approximate string matching
In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is a technique of finding strings that match a
pattern approximately (rather than exactly). The problem of approximate string matching is typically divided into two sub-problems: finding
an approximate substring inside a given string and finding dictionary strings that match the pattern approximately.
The word error rate (WER) is a metric based on this approach. The WER is calculated as the sum of insertions, deletions, and substitutions, normalized by
the length of the reference sentence. If the WER is equal to zero, the translation is identical to the reference text. The main problem lies in the fact
that the resulting estimate is not always in the range from 0 to 1. In some cases, when the translation is wrong, the WER can be greater than 1.
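As an illustration only (a schematic Python sketch, not the implementation used by any of the tools discussed here), the WER can be computed by dynamic programming over words:

```python
def wer(candidate, reference):
    """Word error rate: minimum number of word insertions, deletions and
    substitutions turning the candidate into the reference, normalized
    by the reference length."""
    c, r = candidate.split(), reference.split()
    # dp[i][j] = edit distance between the first i candidate words
    # and the first j reference words
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(len(c) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(c) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if c[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(r)
```

A perfect translation gives 0, while a long, entirely wrong candidate measured against a short reference gives a value above 1, which illustrates the normalization problem noted above.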
Another version of the WER is the WERg metric, in which the sum of insertions, deletions and substitutions is normalized by the Levenshtein distance, i.e.,
the length of the edits. In information theory and computational linguistics, the Levenshtein distance (editorial distance, or edit distance) between two
strings is defined as the minimum number of edits needed to transform one string into the other, with allowable edit operations being insertion, deletion,
or substitution of a single character. The advantage of this metric is that the translation quality value always lies in the range from 0 to
1 (even in the worst case of a complete mismatch, or in the absence of a translation, the value will not exceed unity).
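The Levenshtein distance itself is easily computed by dynamic programming; a minimal character-level sketch in Python:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to transform string a into string b
    (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```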
Experiments performed by Blatz et al. have shown that the WERg metric is not reliable and does not agree with the estimates obtained when the
machine translation is analyzed by humans.
The position-independent error rate (PER) neglects the order of the words in the string matching operation. In this case, the difference between the
candidate text and the reference text, normalized by the length of the reference translation, is calculated.
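Since PER ignores word order, it can be sketched as a bag-of-words comparison. The exact normalization varies between authors; the version below is one common formulation, given only for illustration:

```python
from collections import Counter

def per(candidate, reference):
    """Position-independent error rate: word-level mismatch between the
    two sentences treated as bags of words, normalized by the reference
    length (one common formulation; details vary by author)."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    matches = sum((c & r).values())   # multiset intersection
    ref_len = sum(r.values())
    cand_len = sum(c.values())
    errors = max(ref_len, cand_len) - matches
    return errors / ref_len
```

A candidate that contains exactly the reference words in any order scores 0, which is precisely what distinguishes PER from WER.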
Another widely used metric for assessing translation quality is the translation error rate (TER). This metric makes it possible to measure the
number of edits required to change a system output into one of the given reference translations.
In fact, any string matching metric can be used for assessing the MT quality. One such example is the “string kernel,” which allows one to take
into account different levels of natural language (e.g., morphological, lexical, etc.), or the relationship between synonyms.
2.2 N-gram models
N-gram language models rely on the explicit assumption that the probability of the next word in a sentence depends only on the previous n - 1 words. In
practice, models with n = 1, 2, 3, and 4 are used; for the English language, three-gram and four-gram models are the most successful.
Today, almost all systems of MT quality assessment rely on n-gram models. In this case, the probability of the whole sentence is calculated as the
product of the probabilities of its constituent n-grams.
The main advantages of n-gram models are their relative simplicity and the possibility of constructing a model that can be trained on a sufficiently
large corpus of a language. However, such models are not devoid of drawbacks. The n-gram models make it impossible to simulate semantic and
pragmatic relationships in a language. In fact, if a dictionary contains N words, the number of possible word pairs is N².
Even if only 0.1% of them actually occur in the language, the minimum volume of the language corpus necessary to obtain statistically valid estimates
will amount to 125 billion words, or about 1 terabyte. For three-gram models, the minimum corpus will reach hundreds of thousands of terabytes.
To overcome these drawbacks, well-developed smoothing techniques are used, which enable the estimation of model parameters under conditions of
insufficient or non-existent data.
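The two ideas above (sentence probability as a product of n-gram probabilities, plus smoothing for unseen data) can be illustrated with a toy bigram model using add-one (Laplace) smoothing, one of the simplest of the smoothing techniques mentioned; this is a didactic sketch, not a production language model:

```python
from collections import Counter

def train_bigram_model(corpus):
    """Count unigrams and bigrams in a tokenized corpus (a list of
    sentences, each a list of words), padded with sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        padded = ["<s>"] + sent + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sentence_prob(sent, unigrams, bigrams, vocab_size):
    """P(sentence) as a product of add-one-smoothed bigram
    probabilities P(w_i | w_{i-1})."""
    padded = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, word in zip(padded, padded[1:]):
        # Laplace (add-one) smoothing gives unseen bigrams a small
        # but nonzero probability
        p *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return p
```

Trained on even a tiny corpus, the model assigns a seen word order a higher probability than a scrambled one, while smoothing keeps the scrambled probability nonzero.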
The main metrics based on n-grams are BLEU, NIST, F-measure, and METEOR.
BLEU (Bilingual Evaluation Understudy) is an algorithm for automatic evaluation of the quality of a machine translation, in which the output is compared
to the reference translation using n-grams. This metric of MT quality assessment was first proposed and implemented by Papineni et al.
Measuring translation quality is a challenging task, primarily due to the lack of definition of an ‘absolutely correct’ translation. The most
common technique of translation quality assessment is to compare the output of automated and human translations of the same document. But this is not as
simple as it may seem: one translator's translation may differ from that of another translator. This inconsistency between different reference translations
presents a serious problem, especially when different reference translations are used to assess the quality of automated translation solutions.
A document translated by specially designed automated software can have a 60% match with the translation done by one translator and a 40% match with that
of another translator. Although both professional translations are technically correct (they are grammatically correct, they convey the same meaning,
etc.), 60% overlap of words is a sign of higher MT quality. Thus, although reference translations are used for comparison, they cannot be a completely
objective and consistent measurement of the MT quality.
The BLEU metric scores the MT quality on a scale from 0 to 1. The closer the score to unity, the greater is the overlap with the reference translation and,
therefore, the better the MT system. In short, the BLEU metric measures how many words coincide between the candidate and the reference, with the best
score given not to isolated matching words but to matching word sequences. For example, a string of four words in the translation that matches the human
reference translation (in the same order) will have a positive impact on the BLEU score and is weighted more heavily (and scored higher) than a one- or
two-word match.
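A simplified single-sentence, single-reference version of BLEU can be sketched as follows. Real BLEU is computed at the corpus level and supports multiple references; this sketch only illustrates the clipped n-gram precisions and the brevity penalty:

```python
import math
from collections import Counter

def ngrams(words, n):
    """Multiset of all n-grams (as tuples) in a word sequence."""
    return Counter(zip(*(words[i:] for i in range(n))))

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty."""
    c, r = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(c, n)
        ref_ngrams = ngrams(r, n)
        clipped = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        if clipped == 0:
            return 0.0          # any zero precision zeroes the score
        log_prec += math.log(clipped / total)
    # brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / max(len(c), 1))
    return bp * math.exp(log_prec / max_n)
```

Because four-word sequences must match in order, a candidate with the right words in the wrong order scores far lower than one that preserves the reference word order, as described above.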
The NIST (National Institute of Standards and Technology) precision measure is another metric used to evaluate MT output. NIST was intended as an
improved version of BLEU; in this case, the arithmetic mean of the n-gram scores is calculated. An important difference from the BLEU metric is that
NIST also relies on a frequency component. Whereas BLEU simply calculates the n-gram precision, assigning an equal weight to
each exact match, NIST also calculates how informative each matching n-gram is.
For example, even if the bigram ‘on the’ coincides with the same phrase in the reference text, the translation still receives a lower score
than the correct matching of the bigram ‘size distribution,’ because the latter phrase is less likely to occur.
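The information weight underlying NIST can be sketched directly from the document's own example: for an n-gram it is the base-2 logarithm of the ratio of the (n-1)-gram prefix count to the full n-gram count, so a rare continuation such as ‘size distribution’ scores higher than a frequent one such as ‘on the.’ The counts below are invented purely for illustration:

```python
import math

def nist_info(ngram, counts):
    """NIST information weight of a matching n-gram (n >= 2):
    log2(count of the (n-1)-word prefix / count of the full n-gram).
    Rare continuations are more informative and score higher."""
    return math.log2(counts[ngram[:-1]] / counts[ngram])
```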
The F-measure is a metric that calculates the harmonic mean of precision and recall. The metric is based on the search for the best match between the
candidate and reference translations (the ratio of the total number of matching words to the lengths of the translation and the reference text).
It is often useful to combine precision and recall into a single averaged value in this way.
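A word-level sketch of the F-measure follows (harmonic mean of precision and recall over bag-of-words overlap; the published metric additionally rewards contiguous matches, which this simplified version omits):

```python
from collections import Counter

def f_measure(candidate, reference):
    """Harmonic mean of word-level precision and recall between a
    candidate and a reference translation (bag-of-words overlap)."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())   # matching words, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())   # fraction of candidate matched
    recall = overlap / sum(r.values())      # fraction of reference matched
    return 2 * precision * recall / (precision + recall)
```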
The Metric for Evaluation of Translation with Explicit ORdering (METEOR) is an improved version of the F-measure. This system was designed to address
some of the weaknesses of the BLEU metric. METEOR scores the output by matching the automated and reference translations word for word. When more than
one reference translation is available, the automated translation is compared with each of them and the best result is reported.
One can have different attitudes toward the various metrics, but at present BLEU, METEOR, and NIST are the most widely used; it is against these metrics
that all other MT quality assessment systems are compared. The developers of the F-measure claim that their metric shows the best agreement with the
assessment made by a human. However, this is not always the case: the F-measure does not work well when the average edit distance is smallest.
Empirical data show that more attention should be paid to the completeness (recall) of the translation. Studies suggest that recall is most often the
parameter that allows one to determine the quality of a translation.
3. Automatic evaluation of the quality of statistical (Google) and rule-based (PROMPT) MT systems
Translation is an intellectual challenge, and, therefore, skepticism about the possibility of using a computer for automated translation is quite natural.
However, the creators of MT systems have managed to endow their systems with a form of understanding, and machine translation now belongs to a class of
artificial intelligence programs.
Currently, we can speak of two approaches to written translation: the first one is machine translation based on the rules of the source and target
languages and the second approach involves statistical machine translation.
The earliest “translation engines” in machine-based translations were all based on the direct, so-called “transformer,” approach.
Input sentences of the source language were transformed directly into output sentences of the target language, using a simple form of parsing. The parser
did a rough analysis of the source sentence, dividing it into subject, object, predicate, etc. Source words were then replaced by target words selected
from a dictionary, and their order rearranged so as to comply with the rules of the target language. This approach was used for a long time, only to be
finally replaced by a less direct approach, which is called “linguistic knowledge.” Modern computers, which have more processing power and more
memory, can do what was impossible in the 1960s. Linguistic-knowledge translators have two sets of grammar rules: one for the source language, and the
other for the target language. In addition, modern computers analyze not only grammar (morphological and syntactic structure) of the source language but
also the semantic information. They also have information about the idiomatic differences between the languages, which prevents them from making silly
mistakes. A representative of the rule-based approach to machine translation is the PROMPT system, developed by the leading Russian developer of linguistic software.
The second approach is based on a statistical method: by analyzing a large number of parallel texts (identical texts in the source and target languages),
the program selects the variants that coincide most often and uses them in the translation. It does not apply grammatical rules, since its algorithms are
based on statistical analysis rather than traditional rule-based analysis. In addition, the lexical units here are word combinations, rather than separate
words. One of the well-known examples of this approach is “Google Translate,” which is based on statistical machine
translation. However, the translated sentences are sometimes so discordant that it is impossible to understand them.
In this section, using concrete examples, we compare the quality of translations made by such MT systems as Google (http://translate.google.ru/) and PROMPT (www.translate.ru).
For the analysis, we selected five titles, abstracts, and keywords from the ‘Kvantovaya Elektronika’ journal, which is first published in
Russian and then translated into English by a group of professional translators.
[The Russian originals of the five selected titles, abstracts, and keywords are only partially preserved in this copy and are omitted. Their topics include: the size distribution of Au nanoparticles in a liquid under copper vapour laser irradiation (~10^6 W/cm^2); laser filaments propagating at a small angle to each other in a sapphire crystal, with distributions of energy and free-electron concentration; plasma plumes produced by laser irradiation of metals (Cu, Al, Sn, Pb) and droplets of target material; neutrons placed in a common potential well (quantum nucleonics); and cnoidal waves, for which new particular solutions were found.]
The corresponding translations were taken from http://iopscience.iop.org/1063-7818/42/2.
For an automatic analysis, we used the relevant software that is publicly available from http://www.languagestudio.com/LanguageStudioDesktop.aspx#Pro.
Language Studio™ Lite is a free tool that provides key metrics for translation quality. This tool can be used to measure not only the quality
but also improvements in quality, because custom translation engines are constantly being updated via a quality-improvement feedback cycle. Language
Studio™ Lite currently supports such metrics as BLEU, F-measure, and TER.
From the point of view of syntax, the abstracts presented for the analysis are characterized mainly by simple sentences, i.e., ‘smth is presented’ or ‘smth is investigated.’ Besides, compound sentences with an object clause are frequently used, for example, ‘it is shown that ...’ or ‘it is found that ...’. As to the vocabulary, translators most often use one-word terms (waveguide), two-word terms (light wave, uncertainty relation), and three-word terms (target material droplets), whereas four-word terms (crystal-like spatially periodic structure) are extremely rare.
For the program to score the translations correctly, we preprocessed the reference translations and the candidate translations made by Google and
PROMPT: each sentence started a new paragraph, and the texts were converted into .txt format.
Initially, we compared the reference translation and the outputs from Google and PROMPT, using n-gram metrics. [The detailed evaluation reports are not
reproduced here; the summary scores are discussed below.]
The results of the comparison show that Google scored 62.554, while PROMPT scored only 35.528. All this suggests that Google copes well with the
vocabulary, while PROMPT experiences some difficulties in translating unknown words (however, we believe that proper training of this MT system may yield
better results). In fact, this is not surprising, since statistical translation relies on n-gram models. All the advantages of statistical systems
manifest themselves when the system is trained for a sufficiently long time and high-quality corpora of parallel texts are available. Moreover, qualified
linguists are not required in this case, and the system can be trained during its operation. However, these systems have some drawbacks: large parallel
corpora of texts are needed for training; such systems rely on a complex mathematical apparatus; high-quality translation is only possible for phrases
that match the n-gram model; and the translation strongly depends on the corpora used for training.
The second analysis was performed using metrics such as BLEU, F-measure and TER. The two outputs were compared simultaneously with the reference
translation. As a consequence, we have the following results:
[The detailed evaluation report is omitted.]
As in the previous test, Google shows better results, which is not surprising, because scientific texts are highly standardized. The features of
scientific and technical texts include syntactic and semantic completeness, frequent use of clichéd structures, a comprehensive system of connecting
elements (coordinating and subordinating conjunctions), etc. Scientific speech is characterized by complicated syntax, which is reflected in the use of
sophisticated coordinated and subordinated sentences and in the complexity of simple sentences, mainly with appositives. In addition, scientific and
technical texts are characterized, first of all, by the frequent use of highly specialized and scientific terms. This is explained by the fact that
scientific terminology evolves due to the need for experts in a field to communicate with precision and brevity, but often has the effect of excluding
those who are unfamiliar with the particular specialized language of the group. Modern terminology is accurate, efficient, nominative, stylistically
neutral, and lacks emotional bias.
All of the above allows Google to cope so well with standardized texts. Nevertheless, it should be noted that PROMPT does a much better job when it comes
to grammar. Thus, there are more grammatically correct sentences in the PROMPT output than in the Google output. This is not surprising, because PROMPT
relies on rule-based machine translation (RBMT). RBMT is based on linguistic description of two natural languages (bilingual dictionaries and other
databases containing morphological, grammatical and semantic information), formal grammars, and proper translation algorithms. The quality of translation
depends on the size of linguistic databases (dictionaries) and the depth of description of natural languages.
4. Conclusions
An overview of the most commonly used metrics of MT evaluation has been presented. Automatic evaluation of MT quality by such metrics as BLEU, F-measure,
and TER has significantly improved statistical MT. Typically, these metrics show good correlation of candidate translations with reference translations.
One of the
major drawbacks of these metrics is that they cannot provide an assessment of the MT quality at the semantic or pragmatic levels. Nevertheless, at
present these metrics are the only available systems of automatic translation quality assessment.
The quality of the outputs from Google and PROMPT is compared with the reference translations, using n-gram models and different metrics. In both
cases, the Google output shows good correlation with the reference translation. The best match is registered at the vocabulary level which is to be
expected, because the basis of the statistical translation is the n-gram model. The worst results in terms of grammar are also shown by Google, which
is also understandable because PROMPT relies on the RBMT-model in which translation depends on the size of linguistic databases (dictionaries) and the
depth of description of natural languages, i.e., the maximum number of features of grammatical structures.
Since the translation into English is a priority for Google, this MT system is constantly being improved. All this suggests that the potential of transfer
translation systems will be sooner or later exhausted, while the translation quality of statistical MT systems will eventually improve. Nevertheless, we
believe that in the future machine translation will combine these two approaches (rule-based and statistical), as well as the universal semantic
hierarchy (USH) approach, in order to produce a correct translation.
The development of efficient and reliable MT evaluation metrics has been actively investigated in recent years. One of the most important tasks is to go
beyond n-gram statistics while continuing to use a fully automatic regime. The importance of a fully automated metric cannot be overstated, as
it should provide the highest rate of development and progress of MT systems.
The author thanks S.N. Vekovishcheva for valuable advice during the preparation of the manuscript.
1. White, J., O’Connell, T., and Carlson, L. (1993) “Evaluation of Machine Translation.” In Human Language Technology: Proceedings of the
Workshop (ARPA), pp 206–210.
2. Melamed, I.D. (1995) “Automatic Evaluation and Uniform Filter Cascades for Inducing N-Best Translation Lexicons.” In Third Workshop on Very
Large Corpora (WVLC3), pp 184–198, Boston.
3. Brew, C., and Thompson, H. (1994) “Automatic Evaluation of Computer Generated Text: A Progress Report on the TextEval Project.” In Human
Language Technology: Proceedings of the Workshop (ARPA/ISTO), pp 108–113.
4. Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002) “BLEU: a Method for Automatic Evaluation of Machine Translation.” In Proceedings
of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp 311–318, Philadelphia.
5. Doddington, G. (2002) “Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics.” In Human Language
Technology: Notebook Proceedings, pp 128–132, San Diego.
6. Turian, J.P., Shen, L., and Melamed, I.D. (2003) “Evaluation of Machine Translation and its Evaluation.” In Proceedings of MT Summit IX; New
Orleans, USA, 23-28 September 2003.
9. Blatz, J., Fitzgerald, E., Foster, G., Gandrabur, S., Goutte, C., Kulesza, A., Sanchis, A., and Ueffing, N. (2004) “Confidence Estimation for
Machine Translation.” In Proceedings of COLING, pp 315–321, Geneva.
12. Cancedda, N., and Yamada, K. (2005). “Method and Apparatus for Evaluating Machine Translation Quality.” US Patent Application 20050137854.
15. Melamed, I.D., Green, R., and Turian, J.P. (2003) “Precision and Recall of Machine Translation.” In Proc. HLT-03, pp 61–63.
17. Lavie, A., Sagae, K., and Jayaraman, S. (2004) “The Significance of Recall in Automatic Metrics for MT Evaluation.” Proceedings of the
Sixth Conference of the Association for Machine Translation in the Americas (AMTA'04), pp 134–143.
18. Banerjee, S., and Lavie, A. (2007) “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.” In
Proceedings of the Second Workshop on Statistical Machine Translation, pp 228–231, Prague.
19. Ulitkin, I. (2011) “Computer-assisted Translation Tools: A Brief Review.” Translation Journal, Vol. 15, No. 1, January 2011.