A Quality Evaluation Template for Machine Translation | January 2016 | Translation Journal


A Quality Evaluation Template for Machine Translation

Even though Machine Translation (MT) is one of the most advanced and elaborate research fields within Translation Technology, the quality of MT output has always been a great concern, and MT evaluation is a popular research topic. In this paper, we first provide an overview of existing translation quality assessment methods for human translation, including translation industry quality standards and theoretical approaches to translation quality. We then analyse some of the existing metrics for the evaluation of MT, both automatic and manual. While automatic metrics such as BLEU are cheap and suitable for tracking progress in MT research, developing a specific system, or comparing different systems, they have various limitations compared to manual evaluation. Manual MT evaluation methods tend to overcome these drawbacks while being, at the same time, expensive, time-consuming and subjective. Finally, we introduce a quantitative MT evaluation method based on an error-count technique. This method is an attempt to combine techniques for machine and human translation evaluation for the purpose of evaluating the quality of MT.

Keywords: machine translation, human translation, translation quality, machine translation evaluation

1 Introduction

Today’s market offers a great variety of translation tools and resources that are used by professional translators to increase their productivity and the quality of translation products. Stand-alone tools, as well as online applications and resources, have become a common part of working practice for individual translators and translation companies. These include online dictionaries, term banks, language corpora, term extractors, translation memories (TM) and machine translation (MT) systems. MT is the oldest and one of the most elaborate types of technology for translators. Many MT systems are available online for free and are very popular among the general public, but not as popular among professionals in the field. According to a survey on the use of translation technologies among professional translators (Torres Domínguez, 2012), only 21% of them currently use MT, compared to, for instance, TM software, which is used by 54%. This is due to the unsatisfactory quality of the output of MT systems, which has always been a great concern: the working time saved by applying MT is often spent on post-editing, or even on retranslating the same passages from scratch. For translation companies, implementing MT implies large investments, which are not recouped in the short term.

Even though there is a common opinion that MT can only be used to get the ‘gist’ of a text, the development of technologies is moving forward and this idea is becoming more and more questionable. However, in order to achieve higher quality, it is necessary to be able to measure it. The question of how MT quality is currently measured within the translation industry was investigated in a recent survey within the QTLaunchpad project (Doherty et al., 2013). It shows that there is no unified MT quality assessment method adopted across the industry, and some companies do not perform any MT evaluation at all. Existing methods include human evaluation, automatic MT evaluation metrics such as BLEU, and combined methods. However, all of them have their drawbacks, and there is no method that fits all evaluation purposes. Automatic evaluation metrics cannot cover all aspects of translation quality and tend to favour specific types of MT systems (see Section 2.2.1). Human evaluation, on the other hand, is time-consuming and expensive, as well as subjective. We suggest that the human evaluation procedure could be made more efficient and objective by developing a quantitative metric based on the quality parameters and standards used in the translation industry to evaluate human translation.

In this paper we will focus on free online MT systems and MT evaluation methods. We will describe some of the existing automatic MT evaluation metrics, followed by an overview of translation evaluation methods for human translation used in the translation industry and in academia. Finally, based on the described metrics, we will propose a list of parameters for the evaluation of MT systems.

2 Approaches to translation quality assessment

The quality of machine translation has been an issue since its early days. As early as the 1950s, researchers were dreaming of fully automatic high quality machine translation (FAHQMT). Over the years it was conceded that FAHQMT is very unlikely to be achieved in the near future and that MT can only be used for other purposes, such as ‘gisting’ or post-editing. Nowadays, however, technologies continue to improve, and it is becoming obvious that existing MT systems are capable of reaching higher standards. Therefore, a standard and precise MT quality assessment method needs to be available to translation providers and MT users.

The QTLaunchpad project[1] conducted a survey (Doherty et al., 2013) on quality assessment in the translation industry, which showed that the most popular way to measure MT quality was human evaluation (69% of participants). Other respondents (22%) reported using automatic evaluation metrics (we will return to these in the following sections), and 13% adopted internally developed evaluation methods. 35% of the respondents opted for a combination of human and automatic evaluation methods, and 7% did not have any formal evaluation metric in place (Doherty et al., 2013: 11). In the next two sections we discuss how the problem of translation quality is approached in translation studies and in research on MT, and suggest a template for quantitative MT quality assessment based on error-count methods.

2.1 Quality assessment of human translation

Creating an optimal and efficient method for translation quality assessment (TQA) is a challenge for various reasons. Apart from having to decide what features actually constitute a good translation, one has to take into consideration a number of other factors, such as the genre of the translated text or the purpose of the translation. Thus, an evaluation of the translation of a legal document should put more weight on the aspect of accuracy, which is not the case for literary translation. Another factor is the purpose of the translation, i.e. whether the translation is made only for ‘gisting’, for internal communication, or for publication (dissemination). And finally, the purpose of the evaluation itself plays an important role, be it assessment of a professional translator’s performance for remuneration, quality assurance within a specific translation service provider, comparison of MT systems, measuring the progress of MT systems over time, etc. All these factors make it difficult to create one unique metric that would fit any purpose and conditions.

The issue of TQA has been addressed both in the translation industry and in academia. The industry is focused on providing the best product for the client. Several industry standards have been developed, such as the LISA QA Model, SAE-J2450[2], the ATA Framework of Standard Error Marking[3], UNE-EN 15038[4], the ITR BlackJack Quality Metric[5], and others, including the standards of such organisations as the Translation Centre for the Bodies of the European Union[6] and the Translation Bureau of the Canadian Government. These standards, created mostly for the specific purposes of a particular organisation, are not suitable for all situations in which QA needs to be performed. Moreover, they are not based on well-elaborated theoretical frameworks, and some concepts (such as ‘major omission’ or ‘minor mistake’) are not clearly defined and thus depend on the rater’s individual judgment.

Researchers in translation studies have addressed this issue by reflecting on what a good translation is and how it is best evaluated, while also covering the subject of translation teaching (Gouadec, 1981; Larose, 1987; Hurtado, 1995; House, 1997; Darwish, 2001; Nord, 2005; Williams, 2004; and others). Each of these approaches focuses on a different aspect of translation quality. House’s functionalist model relies on the situational characteristics of the source text and the translation, and their comparison from a functional point of view. The main parameter assessed in this approach is the functional equivalence of the two texts, or in other words, how well the purpose of the translation matches the purpose of the original text. However, this approach has certain inadequacies, since a translation is not always performed with the same purpose as that for which the original text was created.

Nord’s (2005) ‘translation-oriented’ approach to TQA tries to overcome this drawback by stating that it is the function of the translation that is the most important quality factor, thus putting the suitability of the target text for its purpose at the centre of the TQA process. This type of approach has been criticised (e.g. in Williams, 2009) for focusing too much on the target text. In addition, such approaches do not provide a precise, quantitative model for TQA, but rather a general idea of what constitutes a good translation.

Thus, theoretical approaches to TQA provide good arguments regarding the aspects that have to be taken into consideration when evaluating a translation, but they do not provide a practical tool that could answer industry needs and be used on a daily basis. Opting for more precise quantitative methods, recent works often propose a scale based on error-count scores. These have much in common with the existing industry standards, such as the LISA QA Model, and, in fact, consist of a categorisation of errors. Below, we provide an overview of some parameter scales, including the ones that will serve as a foundation for building our own scale.


[1] http://www.qt21.eu/

[3] http://www.atanet.org/certification/aboutexams_error.php [last accessed on 30.05.2014]

[6] http://cdt.europa.eu/EN/Pages/Homepage.aspx

Information Integrity
- Omission unjustifiable by translation technique or strategy
- Logical order of original argument
- Addition unjustifiable by translation technique or strategy
- Factual errors
- Intra-textual referential integrity

Linguistic Integrity
- Grammatical errors affecting meaning
- Syntactical errors
- Inappropriate register or style
- Punctuation errors affecting meaning
- Inconsistent use of terms
- Incorrect modality
- Faulty sentence structure
- Unidiomatic usage
- Misplaced modifiers
- Incorrect acronyms, abbreviations, capitalization
- Faulty/ambiguous/no parallelism
- Shift of subject
- Shift of tense
- Incorrect tense sequence
- Inconsistent tense usage
- Shift of number
- Shift of voice
- Nonfunctional interruption
- Awkward inversion
- Awkward coordination
- Improper subordination

Translation Integrity

Fitness for Purpose
- Fit for purpose

Figure 1. Darwish (2001) Parameter List

Most of the scales attempt to evaluate two aspects of translation quality: fidelity and fluency. Fidelity (in some works also referred to as ‘Content Transfer’, ‘Information Integrity’ or ‘Accuracy’) covers issues related to the transmission of information from the source text into the target text. Parameters of this type include, for instance, omission of information present in the source text, addition of unnecessary information to the target text, different types of factual errors, and others. Fluency (sometimes called ‘Language Parameters’ or ‘Linguistic Integrity’) concerns the use of the target language, its correctness and fluency. This category thus covers all types of grammar errors, style issues, misuse of idiomatic expressions, etc.

Ali Darwish (2001) proposes a scale that also follows this schema. In his model for assessing a translator’s competence and the quality of the translation as a product, he argues that each translation should be evaluated with regard to the purpose of the translation, be it communicative, literal, reader-centred, and so on. He distinguishes two attributes of translation: information integrity (which we call fidelity) and linguistic integrity (which we call fluency); these attributes can be present or absent in a given translation. In addition to these two categories, there is a third category of parameters, ‘Translation Integrity’, which consists of variables of quality that can be measured on a continuous scale: Accuracy, Precision, Correctness, Completeness, Consistency and Clarity. These variables contribute to the presence or absence of translation integrity.

Darwish suggests that these observations can then be developed into a metrics model, which can be adjusted according to the purpose of the translation, since each parameter is assigned a certain weight. One possible scale is presented in Figure 1 (Darwish, 2001, p. 25).

Many translation evaluation methods have been criticised for taking into consideration only word- or sentence-based errors, while the concept of translation quality also includes such general text-level aspects as coherence, transmission of the source text content, precision, etc. One solution to this problem is suggested by Cristina Toledo (2010), who proposes to combine an error-based scale with a text-level evaluation scale. The global text-level evaluation scale is provided in Table 1, translated into English. It matches the translated text with one of five levels of translation quality, measured by the quality of transmission of the source text content and the quality of expression of the target text. For each level, these parameter values are expressed by descriptive statements about the translation.

Table 1. Global evaluation parameters by Toledo (2010).


Level 1
Transmission Quality: Full transmission of the source text information; minimal revision. The translator’s work is equal to that of an experienced professional translator.
Expression Quality: All or almost all is written as an original text in the target language. Minimal errors in terminology, grammar or orthography.

Level 2
Transmission Quality: Almost full transmission, some insignificant omissions. Average revision; the translator can work with no supervision.
Expression Quality: A large part of the text is written as an original in the target language; some isolated errors in terminology, grammar or orthography.

Level 3
Transmission Quality: Transmission of the general idea(s), but with gaps and imprecisions; conscientious revision. The translator can work independently.
Expression Quality: Some parts are written like an original text in the target language, but others read like a translation; a considerable number of errors in terminology, grammar or orthography.

Level 4
Transmission Quality: Transmission distorted by omissions or erroneous interpretations; very deep revision. The translator needs supervision.
Expression Quality: Almost all of the text reads like a translation; multiple errors in terminology, grammar or orthography.

Level 5
Transmission Quality: Transmission with many defects; too much revision required. The translation is mostly incoherent.
Expression Quality: Shows incapacity to express ideas in the target language.

  1. Language/locale
  2. Subject field/domain
  3. Terminology
  4. Text Type
  5. Audience
  6. Purpose
  7. Register
  8. Style
  9. Content correspondence
  10. Output modality
  11. File format
  12. Production technology

Figure 2. Dimensions in MQM

The most complete and detailed evaluation model of the ones presented here was developed by the QTLaunchpad project. Their Multidimensional Quality Metric (MQM) (Lommel et al., 2013) is designed so that it can be adjusted to suit different translation types, including both MT and human translation, different text genres, and different translation purposes. Furthermore, the MQM addresses the problem of the quality of the source text, which often affects the quality of the translation product. Another important feature of the system is that it supports various degrees of granularity, i.e. it can be used both when just a quick assessment is needed and when the purpose of evaluation is precise, fine-grained quality feedback.

The concept of dimensions is central to the model, as they help the reviewer decide which error types are relevant for a particular evaluation task. For example, the dimension ‘Register’ can be assigned the value ‘formal’, in which case the evaluation will take into account the error types relevant for this type of text. There are twelve such dimensions (Figure 2). The list of errors has a hierarchical structure with a maximum of four levels, which allows for different levels of granularity (see the example in Figure 3). In total, the list contains over 120 error types at different levels[7].

2. Accuracy
    17. Bilingual terminology
        32. Bilingual terminology, normative
        33. Overly literal
        34. False friend
        35. Inconsistent number
        36. Inconsistent entities
        37. Non-matching dates/times
            3. Non-matching date
            4. Non-matching time

Figure 3. Fragment of the MQM error list

An important feature specific to this model, compared to the others described above, is that it was created for both human and machine translation evaluation, and therefore covers error types that are typical of MT, such as lack of grammatical agreement or use of the wrong part of speech.

Most of the metrics described above were developed specifically for assessing human translation. However, they are not always suitable for the evaluation of MT systems, or they need to be specifically adjusted (as in the case of the MQM metric). In the next section we discuss existing metrics and methods developed specifically for the purpose of evaluating MT, their similarities with the metrics for human translation evaluation, and the particularities of MT compared to human translation from the point of view of assessment strategies.

2.2 MT Evaluation

The metrics discussed above are not always suitable for the evaluation of MT, for a number of reasons. First of all, as research in MT goes on, the performance of MT systems has to be assessed in order to register improvements. In addition, it is necessary to compare different competing MT systems, a great number of which are currently available on the market or under development. Therefore, evaluation methods have to be precise and quantitative in order to register even the smallest differences. As MT evaluation has to be performed on a regular basis, the procedure needs to be clear and relatively fast and simple, whereas, as we have seen, many methods of human translation assessment only provide a theoretical background on translation quality while leaving the rating procedure unclear.

[7] The full list with detailed explanations is available online https://docs.google.com/document/d/1hItozVqPPq4QFflUwGiDSLFoptKgi7HxS7NtFzCv0pM/pub#h.8nxu2cfs70it

In the following sections we look into TQA methods specific for MT, both automatic and manual, and discuss their advantages and drawbacks, as well as similarities with evaluation of human translation.

2.2.1 Automatic metrics

The fastest, cheapest and least labour-intensive way of measuring MT quality is the automatic method. The intuition behind it is that a good automatic translation is one that is close to a human translation. Segments (often sentences) of the automatically translated candidate text are compared to segments of one or several reference human translations. This can be done word by word (unigrams) or by word groups (n-grams).

The most established automatic metric in the field is BLEU (Bilingual Evaluation Understudy) (Papineni et al., 2002). It was one of the first metrics to show high correlation with human evaluation. The BLEU score is calculated against one or several reference translations, which are compared with the MT output text by segments, normally sentences; the scores for all segments are averaged over the whole corpus to obtain a total score of translation quality, which is always a value between 0 and 1, with 1 meaning that the output is identical to the reference translation. Since even a human translation will almost never be identical to the reference translation, it is not necessary to achieve a value of 1. Using multiple reference translations can increase the BLEU score, since there are more matching possibilities.

The BLEU metric is based on the calculation of precision. However, as explained by Papineni et al. (2002, p. 312), precision simply counts the number of unigrams in the candidate translation which occur in any reference translation and then divides by the total number of words in the candidate translation. At the same time, the example in Figure 4 shows that MT systems can generate ungrammatical phrases which would still receive a high precision score: the MT output in this example would receive a precision score of 1. Therefore BLEU uses a modified unigram precision (see Figure 4), which is computed by first counting ‘the maximum number of times a word occurs in any single reference translation. Next, one clips the total count of each candidate word by its maximum reference count, adds these clipped counts up, and divides by the total (unclipped) number of candidate words.’ (Papineni et al., 2002, p. 312) Modified n-gram precision is computed similarly: the algorithm counts all candidate n-grams and their maximum occurrences in the references; the candidate counts are clipped by the maximum counts, summed, and divided by the total number of candidate n-grams. Thus, the candidate sentence in Figure 4 receives a bigram precision score of 0. There is also an additional brevity penalty (BP) for translations that are significantly shorter than the reference, which is similar to the concept of recall[8].
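The clipping procedure just quoted can be sketched in a few lines of Python. This is an illustrative fragment, not the official BLEU implementation (tokenisation, sentence averaging and the brevity penalty are omitted); it reproduces the Figure 4 example:

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_ngram_precision(candidate, references, n=1):
    """Clipped n-gram precision as described by Papineni et al. (2002).
    candidate is a list of tokens; references is a list of token lists."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # Maximum count of each n-gram in any single reference translation.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each candidate count by its maximum reference count.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# The degenerate candidate from Figure 4:
cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(modified_ngram_precision(cand, refs, n=1))  # 2/7, as in Figure 4
print(modified_ngram_precision(cand, refs, n=2))  # 0.0: no reference contains 'the the'
```

Unmodified precision would give this candidate 7/7 = 1, since every token occurs somewhere in a reference; clipping caps ‘the’ at its maximum single-reference count of 2.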

The intuition behind the BLEU score is based on two concepts related to translation quality: adequacy and fluency. A translation that uses the same words (unigrams) as the reference is considered adequate, while one that uses similar structures (n-grams) is more fluent. Thus, the higher-order n-gram scores account for fluency.

It has been widely recognised that the BLEU metric has certain drawbacks, one of them being the way it behaves with rule-based systems: while it usually correlates highly with human judgments when evaluating statistical systems, this is not the case for rule-based systems (Koehn and Monz, 2006: 105). Callison-Burch et al. (2006) point out other weaknesses of the metric, such as its failure to cover synonyms or paraphrases. In addition, all words are weighted equally, so there is no difference between missing a content-bearing word and missing a functional word. And finally, recall is considered at most indirectly. Despite all these drawbacks, the BLEU metric is useful for evaluating small changes in the same system and is still the most popular metric in the MT community.

[8] The concepts of precision and recall in the context of translation evaluation can be explained as follows: precision is the percentage of translated words that were translated correctly, and recall is the percentage of all words that were translated.

Figure 4. Modified unigram precision

Candidate: the the the the the the the.

Reference 1: The cat is on the mat.

Reference 2: There is a cat on the mat.

Modified Unigram Precision = 2/7

There are similar metrics that try to deal with the drawbacks of BLEU and are, in fact, modifications of the BLEU formula. Thus, the NIST metric introduced by Doddington (2002) also measures n-gram precision, but it offers a solution to the problem of word weight, which consists in giving more weight to less frequent n-grams. This imposes a greater penalty for missing more informative n-grams. Another difference is that NIST uses an arithmetic mean of co-occurrences over n, instead of the geometric mean used by BLEU.
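The information weighting at the heart of NIST can be sketched as follows. This is an illustrative fragment rather than Doddington’s full metric: it only computes the weight of a single n-gram from a toy reference corpus, where an n-gram is as informative as it is rare relative to its own prefix:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def nist_info(gram, ref_tokens):
    """Doddington (2002) information weight: log2 of how much rarer the full
    n-gram is than its (n-1)-gram prefix in the reference data."""
    n = len(gram)
    full = ngram_counts(ref_tokens, n)[gram]
    if full == 0:
        return 0.0
    # For unigrams the 'prefix' is the empty n-gram, which occurs once per token.
    prefix = len(ref_tokens) if n == 1 else ngram_counts(ref_tokens, n - 1)[gram[:-1]]
    return math.log2(prefix / full)

refs = "the cat sat on the mat".split()
print(nist_info(("the",), refs))        # log2(6/2): 'the' is 2 of 6 tokens
print(nist_info(("the", "cat"), refs))  # log2(2/1): 'cat' follows only one of the two 'the's
```

A frequent n-gram like ‘of the’ thus contributes little to the score, while a rare, content-bearing n-gram contributes a lot, which is exactly the weighting BLEU lacks.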

Another well-known metric is called METEOR (Banerjee and Lavie, 2005), and it was also created as an alternative to BLEU. In contrast to the previous two metrics, the idea is to focus on recall rather than precision. Furthermore, it counts only unigrams, without taking higher-order n-grams into consideration. The candidate text is aligned to a reference text by an algorithm that matches tokens in stages, with at most one match per token. Only two texts can be aligned at a time; therefore, if there is more than one reference text, an alignment is made for each of them. The matches are made by three modules: the first matches identical tokens, the second matches words after they are stemmed with the Porter stemmer, and the third matches unigrams that are synonyms in the WordNet database[9]. Fluency is captured by a penalty for a high number of chunks, i.e. contiguous runs of matched unigrams. For instance, when the segment ‘the president spoke to the audience’ is matched to ‘the president then spoke to the audience’, there are two chunks, but if the segments matched exactly there would be only one chunk and therefore no penalty.
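A simplified sketch of METEOR’s exact-match stage and chunk penalty is shown below. This is only an approximation of the published metric: it matches greedily rather than searching for the alignment with the fewest chunks, and it omits the stemming and WordNet modules:

```python
def exact_matches(candidate, reference):
    """Greedy one-to-one exact unigram alignment; a simplification of
    METEOR's search for the alignment with the fewest chunks."""
    used = set()
    pairs = []  # (candidate index, reference index)
    for ci, tok in enumerate(candidate):
        for ri, rtok in enumerate(reference):
            if ri not in used and tok == rtok:
                pairs.append((ci, ri))
                used.add(ri)
                break
    return pairs

def chunk_count(pairs):
    """Number of matched runs that are contiguous in both sentences."""
    chunks, prev = 0, None
    for ci, ri in pairs:
        if prev is None or ci != prev[0] + 1 or ri != prev[1] + 1:
            chunks += 1
        prev = (ci, ri)
    return chunks

def meteor_exact(candidate, reference):
    """Recall-weighted harmonic mean of unigram precision and recall,
    discounted by the chunk (fragmentation) penalty."""
    pairs = exact_matches(candidate, reference)
    m = len(pairs)
    if m == 0:
        return 0.0
    p, r = m / len(candidate), m / len(reference)
    fmean = 10 * p * r / (r + 9 * p)
    penalty = 0.5 * (chunk_count(pairs) / m) ** 3
    return fmean * (1 - penalty)

cand = "the president spoke to the audience".split()
ref = "the president then spoke to the audience".split()
print(chunk_count(exact_matches(cand, ref)))  # 2 chunks, as in the example above
print(meteor_exact(cand, ref))
```

The extra ‘then’ in the reference splits the six matched words into two chunks, so the candidate is penalised slightly even though every one of its words is matched.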

A completely different but also intuitive method is the Word Error Rate (WER) metric, which counts how many substitutions, deletions and insertions are needed to transform the candidate translation into the reference. This metric was used in works such as Tillmann et al. (1997) and Vidal (1997), but it has the significant disadvantage of depending on a specific reference translation.
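WER is simply word-level edit distance normalised by the reference length; a minimal sketch using the standard dynamic-programming recurrence:

```python
def word_error_rate(candidate, reference):
    """Minimum number of substitutions, deletions and insertions needed to
    turn the candidate into the reference, divided by the reference length."""
    c, r = len(candidate), len(reference)
    # d[i][j] = edit distance between candidate[:i] and reference[:j]
    d = [[0] * (r + 1) for _ in range(c + 1)]
    for i in range(c + 1):
        d[i][0] = i
    for j in range(r + 1):
        d[0][j] = j
    for i in range(1, c + 1):
        for j in range(1, r + 1):
            cost = 0 if candidate[i - 1] == reference[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[c][r] / r

cand = "the cat sat on mat".split()
ref = "the cat is on the mat".split()
print(word_error_rate(cand, ref))  # 2 edits / 6 reference words = 1/3
```

The reference-dependence problem is visible here: a perfectly acceptable paraphrase of the reference can receive a high WER simply because its surface form differs.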


The Translation Edit Rate (TER) score, which is based on a somewhat similar idea, was proposed by Snover et al. (2006). TER measures the amount of human post-editing needed for the MT output to exactly match a reference translation. Based on their experiment, the authors claim that with only one reference translation TER yields the same correlation with human judgment as BLEU would with four reference translations. The number of edits is calculated by dynamic programming; the algorithm is described in more detail in Snover et al. (2006, p. 226).

Other automatic metrics include the Position-independent word error rate (PER) (Tillmann et al., 1997), ROUGE (Lin and Och, 2004), and others. All of them provide a reasonably reliable way of performing fast and cheap MT evaluation, for instance in order to observe improvements in an MT system under development, or to diagnose or compare MT systems, but none of them reaches high enough quality to replace human judgment when a precise evaluation is needed. Moreover, since they take into account only sentence-length segments, they cannot pass judgment on such text-level properties as consistency, intratextual references, style or grammaticality, among others. Finally, automatic metrics are reference-based, i.e. they rely on one or several reference human translations, but they cannot take into account all possible synonymic structures and paraphrases.

2.2.2 Human raters

A more precise, but also more expensive and time-consuming, way of evaluating MT output is to use human raters. Similarly to many automatic metrics like BLEU, some of these methods measure the two translation quality dimensions, i.e. fidelity and fluency. For instance, these can be evaluated on a 1-to-n scale: raters assign each translated sentence two scores, for example from 1 to 5, each score rating one of the dimensions (as described, among others, by Koehn and Monz, 2006). In addition, each dimension is sometimes rated by different characteristics. For example, fluency can be measured by intelligibility, clarity, readability and naturalness. Fidelity, which shows how much information is retained in the translation, can be measured by bilingual raters on a 5-point scale, where 5 is given to translations in which all the information from the source text is preserved, and 1 marks a translation with almost no information from the source text. When the raters are monolingual, a good human translation is provided, which serves as a reference for evaluating the adequacy of the MT output.

This type of metric is relatively simple and easy to implement, but on the other hand it suffers from high subjectivity. Raters often disagree on scores, and for more precise evaluation multiple raters are required. In addition, bilingual raters are not always available, so one has to resort to reference-based evaluation, which is also one of the drawbacks of automatic metrics.

The informativeness of a translation is often measured by task-based evaluation, where the translated text is used to perform a certain task, for example answering multiple-choice questions about the content of the text or extracting specific information from the MT output (Voss and Tate, 2006). The percentage of correct answers then gives the informativeness score. Another task-based method consists in measuring the time it takes evaluators to read segments of the output text (supposedly, a better translation is easier to read). The cloze task (Taylor, 1957) is another test of text readability: raters are given translated sentences with certain words replaced by gaps (for instance, every 8th word) and have to guess the missing words. The number of correctly identified words normally correlates with the readability of the text. In addition, there are metrics for MT-translated texts that are to be post-edited: the cost of post-editing is measured by the number of corrected words, the amount of time spent, or the number of keystrokes the editors have to make (Jurafsky and Martin, 2009, p. 931). Task-based metrics can turn out to be even more time- and cost-consuming than other human evaluation methods. Moreover, they require a clear procedure for calculating the final score based on the observed data.
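The cloze task itself is easy to automate. The sketch below blanks every n-th word of an output segment and keeps an answer key for scoring raters; the gap frequency (every 5th word here, every 8th in the example above) is a choice of the evaluator:

```python
def cloze(text, every=8):
    """Blank out every n-th word, as in Taylor's (1957) cloze procedure;
    returns the gapped text and the answer key for scoring raters."""
    words = text.split()
    answers = {}
    for i in range(every - 1, len(words), every):
        answers[i] = words[i]
        words[i] = "____"
    return " ".join(words), answers

masked, answers = cloze("one two three four five six seven eight nine ten", every=5)
print(masked)   # one two three four ____ six seven eight nine ____
print(answers)  # {4: 'five', 9: 'ten'}
```

A rater’s readability score is then simply the proportion of gaps for which their guess matches the answer key.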

In the ranking method, MT systems are compared to each other: the rater is provided with a source sentence and its translations and has to order them according to their quality (see, for instance, Callison-Burch, 2008). These metrics are best for comparing different systems, but they say nothing about how well the ‘best’ system actually performs. In addition, they are also subjective, since what constitutes a ‘better translation’ is not clearly defined.

The error-count metrics, initially developed for assessing human translation, have also been applied to MT evaluation. The Multidimensional Quality Metric (Lommel et al., 2013) is a universal metric applicable to both human and machine translation. Using error-count scales for the evaluation of MT has certain advantages and drawbacks. It requires specific training, as the evaluators need to learn the different types of errors; it is time-consuming and needs well-prepared bilingual raters; and the error weights have to be adjusted according to the purpose of the evaluation and the characteristics of the texts. However, due to its quantitative nature, this method allows a more precise and fine-grained evaluation than methods where quality is assessed in terms of n-point scales or even the concepts of ‘good’ and ‘bad’.

3. Template for MT Quality Assessment

For our parameter list we combined parameters from three different sources: the QTLaunchpad Multidimensional Quality Metrics, Cristina Toledo’s research (Toledo, 2010) and Ali Darwish’s Transmetrics approach. Table 2 presents the final list with an indication of the source of every parameter.

Since MT is often used only for ‘gisting’, i.e. to get a general idea of the content of the source text, it appears necessary to evaluate MT output texts from this point of view in addition to the error-count evaluation. For this reason, the ‘Global Parameters’ directly measure the two translation dimensions, fidelity and fluency, and include the additional parameters of informativeness, completeness, clarity and logical order of the arguments of the text. While the error-count parameters are evaluated by the number of corresponding errors, the global parameters are given a score on a scale.

From the MQM metric list we adopted only the parameters from the Fluency, Accuracy and Verity categories; the other two (Design and Legacy Compatibility) we omitted, as we do not consider them relevant to the task of evaluating free online MT systems. Furthermore, our classification is less detailed and consists of only two error-type levels within one dimension. In some cases (e.g. for monolingual terminology issues) we do not distinguish all the subtypes present in the MQM schema, considering them too specific and not relevant for comparing MT systems. For the same reason, we eliminated all parameters that refer to local regulations and standards (for example, ‘inconsistency with company style guidelines’). Instead, we added some error types that, in our opinion, should be taken into account. To the Grammar type we added the sub-types ‘improper subordination’, ‘wrong modality’ and ‘wrong tense’ (from Darwish, 2001) and ‘wrong aspect’ and ‘wrong preposition’ (own parameters), as well as ‘miscollocation’ (Darwish, 2001), ‘bad connection’ and ‘lack of logic’ (Toledo, 2010), and the two mistranslation types ‘wrong meaning’ and ‘overly free’ (Toledo, 2010). From the Verity parameters we kept only ‘fidelity’ and ‘completeness’ (in our scheme they belong to the Global Parameters category) and discarded the other two, ‘legal requirements’ and ‘local suitability’, which we do not consider relevant to our purposes. In addition, we introduced the parameters of ‘clarity’ and ‘logical order of argument’ (adopted from Darwish, 2001), ‘fluency’ and ‘informativeness’. The ‘fluency’ parameter measures how idiomatic and natural the output text is, while ‘informativeness’ reflects how much of the information in the source text was preserved in the translation.

Table 2. Parameters and sources.

I. Fluency

1. Unintelligible
2. Monolingual terminology
3. Inconsistency (e.g. between abbreviations)
4. Style (inappropriate variants and/or slang, tautology)
5. Register
6. Duplication
7. Unclear reference
8. Spelling (includes diacritics and capitalisation)
9. Orthography
   - Punctuation
   - Character encoding/interpretation error
10. Grammar
   - Morphology (wrong word form)
   - Part of speech
   - Agreement
   - Word order
   - Improper subordination (Darwish, 2001)
   - Wrong modality (Darwish, 2001)
   - Wrong tense (Darwish, 2001)
   - Wrong aspect (own parameter)
   - Wrong preposition (own parameter)
11. Locale convention
   - Date format
   - Time format
   - Measurement format
   - Number format
   - Quote marks format
12. Broken link/cross-reference
13. Miscollocation (Darwish, 2001)
14. Bad connection, wrong use of connectors (Toledo, 2010)
15. Lack of logic (Toledo, 2010)

II. Fidelity

16. Bilingual terminology
17. Mistranslation
   - Wrong meaning (Toledo, 2010)
   - Overly free (Toledo, 2010)
   - Overly literal
   - False friend
   - Inconsistent number
   - Inconsistent entities
   - Non-matching dates/times
   - Unconverted value
18. Omission
19. Untranslated
20. Addition

III. Global Parameters

21. Fidelity
22. Completeness
23. Clarity (the text is understandable) (Darwish, 2001)
24. Logical order of argument (Discourse Consistency) (Darwish, 2001)
25. Fluency (own parameter)
26. Informativeness (own parameter)

Parameters without a source indication are adopted from the MQM.

Using the selected parameters, we created an evaluation template consisting of two parts: an error-count part and a global evaluation part. In the first part, the error types are grouped into the ‘Fluency’ and ‘Fidelity’ categories, following the two translation quality dimensions. The rater counts the errors corresponding to each parameter and registers the number in the respective cell of the table (Table 3). The errors are then summed into a total.

Table 3. Template part I: error-count.

Error Type (the ‘Number of Errors’ column is filled in by the rater)

I. Fluency
- Unintelligible
- Monolingual terminology
- Inconsistency (e.g. between abbreviations)
- Style (inappropriate variants and/or slang, tautology)
- Register
- Duplication
- Unclear reference
- Spelling (includes diacritics and capitalisation)
- Orthography: punctuation
- Orthography: character encoding/interpretation error
- Grammar: morphology (wrong word form)
- Grammar: part of speech
- Grammar: agreement
- Grammar: word order
- Grammar: improper subordination
- Grammar: wrong modality
- Grammar: wrong tense
- Grammar: wrong aspect
- Grammar: wrong preposition
- Locale convention: date format
- Locale convention: time format
- Locale convention: measurement format
- Locale convention: number format
- Locale convention: quote marks format
- Broken link/cross-reference
- Miscollocation
- Bad connection, wrong use of connectors
- Lack of logic

II. Fidelity
- Bilingual terminology
- Mistranslation: wrong meaning
- Mistranslation: overly free
- Mistranslation: overly literal
- Mistranslation: false friend
- Mistranslation: inconsistent number
- Mistranslation: inconsistent entities
- Mistranslation: non-matching dates/times
- Mistranslation: unconverted value
- Omission
- Untranslated
- Addition

Total Number
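The counting and summing of part I can be sketched as follows; the parameter names follow the template, but the counts are invented examples:

```python
# Sketch of part I of the template: error counts per parameter, grouped
# into the Fluency and Fidelity dimensions. Counts are invented.

template = {
    "Fluency": {
        "Spelling": 3,
        "Grammar: word order": 2,
        "Register": 1,
    },
    "Fidelity": {
        "Mistranslation: wrong meaning": 2,
        "Omission": 1,
    },
}

def error_totals(counts):
    """Sum the errors per dimension and overall, as in the
    'Total Number' row of Table 3."""
    totals = {dim: sum(errs.values()) for dim, errs in counts.items()}
    totals["Total"] = sum(totals.values())
    return totals

print(error_totals(template))  # {'Fluency': 6, 'Fidelity': 3, 'Total': 9}
```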


In the second part (Table 4), the translation output is evaluated according to the Global Parameters on a scale from 1 to 5, where a better translation receives a higher score. For instance, a perfectly fluent text gets a fluency score of 5. An average global score is then calculated.

Table 4. Template part II: global parameters.











Fidelity

Completeness

Clarity (the text is understandable)

Logical order of argument (Discourse structure)

Fluency

Informativeness

Average score
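The part II calculation amounts to a simple average; in this minimal sketch the scores are invented examples:

```python
# Sketch of part II of the template: each global parameter is scored
# from 1 to 5 and the scores are averaged. Scores are invented.

global_scores = {
    "Fidelity": 4,
    "Completeness": 5,
    "Clarity": 3,
    "Logical order of argument": 4,
    "Fluency": 3,
    "Informativeness": 4,
}

average_global = sum(global_scores.values()) / len(global_scores)
print(round(average_global, 2))  # 3.83
```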


The MT systems evaluated with this template receive two separate scores corresponding to the two template parts, and can thus be compared both in terms of the number of errors and in terms of the general quality of the whole text. Considering these two aspects is a way of assuring a more objective measurement of translation quality in cases where, for instance, the output text does not contain any specific errors from the list, but has a poor discourse structure that makes it difficult for the reader to understand.
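Assuming both scores have been obtained, comparing systems might look like the sketch below. The numbers are invented, and the comparison rule used here (fewer errors first, then a higher global average as tie-break) is our illustrative assumption, not part of the template:

```python
# Sketch of comparing MT systems by the template's two scores
# (numbers invented; the ordering rule is an illustrative assumption).

systems = {
    "SystemA": {"errors": 9, "global_avg": 3.83},
    "SystemB": {"errors": 14, "global_avg": 3.17},
}

def rank_systems(results):
    """Order systems by ascending error count, breaking ties with the
    descending average global score."""
    return sorted(results, key=lambda s: (results[s]["errors"],
                                          -results[s]["global_avg"]))

print(rank_systems(systems))  # ['SystemA', 'SystemB']
```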

4 Conclusions

In spite of the rapid development of translation technologies, the quality of MT remains a major concern, especially for professional translators, who are still reluctant to incorporate MT as a regular component of their working process. In order to change this situation, the quality of MT output has to be evaluated with high accuracy and objectivity. There is a variety of MT quality evaluation methods, each suitable for specific purposes; however, none of them is universal enough to be applied to any kind of translation quality assessment, and all of them have drawbacks.

In this paper we discussed existing translation quality assessment methods, including automatic MT evaluation metrics and parameter-based scales used for evaluating human translations. Based on these, we proposed a new parameter list better suited to evaluating machine translation systems, and from this list we built a translation evaluation template consisting of two evaluation scales. The first is an error-count scale, used by raters to assess the translation in terms of the number of errors. The second evaluates different translation quality parameters for the whole text on a scale from 1 to 5. We suggest that this combined method can be successfully applied to evaluate the quality of machine translation output. In order to demonstrate its efficiency, future research will include experiments with free and commercial MT systems in which our method is compared to other methods of MT evaluation.

References


Banerjee, S., & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association for Computational Linguistics. Ann Arbor, Michigan.

Callison-Burch, C., Osborne, M., & Koehn, P. (2006). Re-evaluating the Role of BLEU in Machine Translation Research. In Proceedings of EACL.

Darwish, A. (2001). Transmetrics: A Formative Approach to Translator Competence Assessment and Translation Quality Evaluation for the New Millennium. http://www.translocutions.com/translation/transmetrics_2001_revision.pdf [last access: 19.01.2015]

Doddington, G. (2002). Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT ’02) (pp. 138–145). San Francisco, CA: Morgan Kaufmann Publishers Inc.

Doherty, D. et al. (2013). Mapping the Industry I: Findings on Translation Technologies and Quality Assessment. European Commission Report. http://www.qt21.eu/launchpad/sites/default/files/QTLP_Survey2i.pdf [last access: 19.01.2015]

Gouadec, D. (1981). Paramètres de l’évaluation des traductions. Meta: Journal Des Traducteurs, 26(2), 99–116.

House, J. (1977). A Model for Translation Quality Assessment. Gunter Narr Verlag Tübingen.

Hurtado Albir, A. (1995). La didáctica de la traducción. Evolución y estado actual. In P. Hernandez y J. M. Bravo (eds.). Perspectivas de la traducción, Universidad de Valladolid, pp. 49-74.

Jurafsky, D., & Martin, J.H. (2009). Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd edition. Prentice Hall.

Koehn, P., & Monz, C. (2006, June). Manual and automatic evaluation of machine translation between European languages. In Proceedings of the Workshop on Statistical Machine Translation (pp. 102-121). Association for Computational Linguistics.

Larose, R. (1987). Théories contemporaines de la traduction. Montreal: Presses de l’Université de Québec.

Lin, C., & Och, F. J. (2004). Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004) (pp. 21–26).

Lommel, A. R., Burchardt, A., & Uszkoreit, H. (2013). Multidimensional Quality Metrics: A Flexible System for Assessing Translation Quality. In Proceedings of ASLIB Translating and the Computer 34. Retrieved from http://www.mt-archive.info/10/Aslib-2013-Lommel.pdf

Nord, Ch. (2005) Text Analysis in Translation: Theory, Methodology, and Didactic Application of a Model for Translation Oriented Text Analysis. Amsterdamer Publikationen Zur Sprache Und Literatur 94.

Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 311–318). Philadelphia.

Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA-2006) (pp. 223–231). Cambridge, MA.

Taylor, W. L. (1957). Cloze readability scores as indices of individual differences in comprehension and aptitude. Journal of Applied Psychology, 4, 19–26.

Tillmann, C., Vogel, S., Ney, H., Zubiaga, A., & Sawaf, H. (1997). Accelerated DP based search for statistical translation. In Proceedings of the 5th European Conference on Speech Communication and Technology (pp. 2667–2670). Rhodes, Greece.

Toledo Báez, M. C. (2010). El resumen automático y la evaluación de traducciones en el contexto de la traducción especializada. Peter Lang International Academic Publishers.

Torres Domínguez, R. (2012). 2012 Use of Translation Technologies Survey. http://mozgorilla.com/download/19/ [last access: 2.03.2014].

Vidal, E. (1997). Finite-state speech-to-speech translation. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Munich, Germany.

Voss, C. R., & Tate, C. R. (2006). Task-based Evaluation of Machine Translation (MT) Engines : Measuring How Well People Extract Who, When, Where-Type Elements in MT Output. In Proceedings of the 11th Conference of the European Association for Machine Translation (EAMT).

Williams, M. (2004). Translation Quality Assessment: An Argumentation-centred Approach. University of Ottawa Press.

Williams, M. (2009). Translation quality assessment. Mutatis Mutandis, 2(1), 3-23.




[3] http://www.atanet.org/certification/aboutexams_error.php [last accessed on 30.05.2014]




[7] The full list with detailed explanations is available online https://docs.google.com/document/d/1hItozVqPPq4QFflUwGiDSLFoptKgi7HxS7NtFzCv0pM/pub#h.8nxu2cfs70it

[8] The concepts of precision and recall in the context of translation evaluation can be explained as follows: precision is the percentage of translated words that were translated correctly, and recall is the percentage of all words that were translated.
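These definitions can be expressed directly in code; the word counts below are invented for illustration:

```python
# Precision and recall in the translation-evaluation sense described
# above: precision = share of translated words that are correct;
# recall = share of all source words that were translated at all.
# Counts are invented examples.

def precision_recall(translated, correct, total_words):
    precision = correct / translated
    recall = translated / total_words
    return precision, recall

p, r = precision_recall(translated=90, correct=72, total_words=100)
# precision = 72/90 = 0.8, recall = 90/100 = 0.9
```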

