A Comparative Study of Machine Translation Evaluation Systems | July 2016 | Translation Journal


A Comparative Study of Machine Translation Evaluation Systems

Abstract

Given the importance of judging and evaluating the output of Machine Translation (MT) systems, the present study focuses on comparing two common evaluation approaches: human evaluation and automatic evaluation.

The materials of the study were selected from economic texts: twenty English sentences were taken from "Translating Economic Texts", published by Payam-e-Nour University.

To assess the translation output by human judgment, twenty MA students of Translation Studies participated in this study as evaluators. These evaluators judged the variables of adequacy and fluency, the main factors in human evaluation methods. To evaluate the collected data automatically, the BLEU evaluation method was used.

Generally speaking, both evaluation methods lead to the same results; that is, human and automatic evaluation methods rank the systems similarly, but human evaluation yields more precise results. The main difference between them is the time and cost of evaluating the data.

Key terms:

Machine Translation, Evaluation Systems, Human Evaluation, Automatic Evaluation, BLEU.

Introduction

The complicated task of translation demands much effort and time, and since the invention of computers it has been hoped that machine translation (MT), the technology for the automation of translation, would be an appropriate means of performing this task.

To reach this goal, various ideas and methodologies have been explored over the past decades. Despite some 50 years of research on MT, there is still no generally accepted methodology for evaluating the performance of MT systems.

The first attempt to evaluate MT output was made by the well-known ALPAC committee and attracted much attention. According to Philipp Koehn (2010), a major deficiency of MT evaluation is that systems are assessed by people with little or no experience of MT techniques, who are unable to judge what is possible and what is unrealistic. Another problem faced by the MT community in evaluating MT systems is the ambiguity in the nature of the goals and tasks of MT performance. According to Arnold et al. (1993), evaluation of MT output is constrained by different factors, such as the purpose for which the MT system is used, differences in the purpose and use of the evaluation results, and the methodology used to perform the evaluation. Philipp Koehn (2010) enumerated further factors: the different groups contributing to MT studies are actually interested in different aspects of MT systems. Developers of MT systems are mainly concerned with the findings of evaluations, because they need to know the errors made by a system in order to improve it further, while an MT user is more interested in accuracy and the cost of using the system.

The main concern of this study is to compare data collected through different evaluation methods. Generally, there are two common methods of MT evaluation: human evaluation and automatic evaluation. In this study, both types are investigated with respect to factors such as precision, speed and cost, in order to identify the most precise, fastest and most economical method of evaluation.

Review of literature

The history of machine translation traces back to the 17th century, when René Descartes in 1629 proposed a universal language in which the same symbols would be used to refer to equivalent ideas in different languages (Hutchins, 1986).

With the growth of interest in the automation of translation in the 1950s, many MT groups were formed in the US and Russia. In 1952, with an emphasis on the necessity and feasibility of MT, the first MT conference was held at the Massachusetts Institute of Technology (MIT), organized by Yehoshua Bar-Hillel (Hutchins, 1986). At the request of the funding agencies, the Automatic Language Processing Advisory Committee (ALPAC) was formed in 1964. After studying different aspects of MT, the committee issued its report in 1966, which concluded that "MT was slow, less accurate and twice as expensive as human translation" (Hutchins & Somers, 1992). Many researchers rejected the report, viewing it as narrow, biased and short-sighted, and saw the ALPAC report as severely damaging to MT studies.

In the late 1990s, with the growth of computing power and artificial intelligence, remarkable development can be seen in MT studies, and powerful, useful MT systems came to be used on personal computers to give readers the gist of the articles they are interested in.

According to Arnold et al. (1996), the first attempt to evaluate the performance of machine translation systems was the ALPAC report, as a result of which work on the development of MT systems was suspended for decades.

In 1976, the European Commission (EC) bought a version of an MT system called Systran and later developed its own project, named Eurotra. To improve the performance of Eurotra, the EC needed recommendations for evaluation; as a result, a report on evaluating the quality of MT, which came to be called the 'Van Slype Report', was published in 1979. The report stressed the importance of judging a system based on its context of use and user requirements.

Between 1992 and 1994, DARPA (the Defense Advanced Research Projects Agency) was also working seriously on MT evaluation. Between 1992 and 1999, EAGLES (the Expert Advisory Group on Language Engineering Standards), set up by the EC, had several aims, one of which was to propose standards, guidelines and recommendations for good practice in the evaluation of language engineering products. Most MT evaluation takes place under contract and often under confidentiality agreements; consequently, there is little constructive criticism of methodology. According to Saedi et al. (2009), evaluations made by MT researchers are often minimal and misleading: the demonstration of a system with a carefully selected set of sentences or sentence types is not a basis for claims about a large-scale system.

An important matter to take into consideration is identifying the principal areas in which evaluation can take place, which aspects are important, and which methods may be employed (Atashgah & Bijankhan, 2009).

The differences between MT and human translation are the main problem for MT evaluation. The quality of human translation is expected to be publishable, but the best quality MT can offer is an understanding of the main idea of the text. Therefore, as Schwarzl (1999) states, the evaluation methods for human translation cannot be applied to MT.

Douglas Arnold et al. (1994) believe that the main concerns in assessing MT output are judging the performance of a system and selecting the best MT system among the different available systems.

As M. Dobrinkat (2008) notes, the main purposes of evaluating MT systems are:

  1. To compare different MT systems, or different versions of one system: evaluation helps determine which system is best in a certain respect, or for a specific purpose or domain.
  2. To optimize performance by finding system modifications that yield improved evaluation results.

Van Slype (1979) classified MT evaluation into two subcategories: macro-evaluation and micro-evaluation. Macro-evaluation considers aspects relating to user requirements, such as the goodness of the translation, whereas micro-evaluation considers the sources of insufficiency, and so tries to look inside the translation system's black box (ibid.).

Evaluation Methods

Increasing attention to MT during the 1950s and 1960s made Machine Translation Evaluation (MTE) a necessity. Different goals were sought by the different groups using MTE: funders wanted to know the return on their investments, researchers wanted to be informed of the latest progress in the field, and for users the applicability of the program was of most importance.

A traditional method of evaluating MT output is to look at the output and judge by hand whether it is correct or not (Philipp Koehn, 2010). This task is usually done by bilingual evaluators who understand both the input and output languages. Evaluation is done at the sentence level, but a longer document context may be essential to carry out the judgments (Philipp Koehn, 2010). The quality of MT output can be judged from the language perspective or the usability perspective. According to White (2003a), an intuitive way to do assessment is to rate the goodness of the translation directly: evaluators are asked to rate a translation, normally presented sentence by sentence, in terms of their intuitive judgment. A set of attributes such as fidelity, intelligibility, adequacy and fluency determine the goodness of a translation.

A more common approach to human evaluation is to use a graded scale when eliciting judgments from human evaluators. Two common criteria in human evaluation are fluency and adequacy. Philipp Koehn (2010) illustrates how these criteria are scored: adequacy from one to five, based on how much of the meaning of the source language is transferred into the target language, and fluency from one, for incomprehensible text, to five, for flawless text.
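The graded judgments described above aggregate very simply: each evaluator assigns an integer from 1 to 5 per criterion, and a system-level score is the mean of those judgments. A minimal sketch in Python (the scale labels follow Koehn's scheme; the example ratings are hypothetical):

```python
# Adequacy and fluency scales as in Koehn (2010); the labels are descriptive.
ADEQUACY = {5: "all meaning", 4: "most meaning", 3: "much meaning",
            2: "little meaning", 1: "none"}
FLUENCY = {5: "flawless", 4: "good", 3: "non-native",
           2: "disfluent", 1: "incomprehensible"}

def mean_score(ratings):
    """Average a list of 1-5 judgments into a single system-level score."""
    return sum(ratings) / len(ratings)

# Hypothetical adequacy judgments from five evaluators for one sentence:
print(mean_score([4, 3, 4, 2, 3]))  # prints 3.2
```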

As M. Dobrinkat (2008) notes, human evaluation assesses many aspects of translation, such as adequacy, intelligibility and accuracy. But performing this comprehensive task is expensive and time-consuming; therefore, an automatic method for assessing the quality of machine translation output is preferable. Ideally, we would like a computer to tell us quickly whether our system got better after a change or not. This is the objective of automatic machine translation evaluation (Philipp Koehn, 2010).

Different automatic metrics are used to assess MT output, but the main concern of the present study is the BLEU method of automatic MT evaluation, which is presented briefly here. The Bilingual Evaluation Understudy (BLEU) method of scoring MT output was developed in the IBM labs (Papineni et al., 2002) as a rapid and economical way to evaluate machine translation automatically. The method was designed to correlate with human assessment. In this scoring method, the assessment of MT output is conducted at the sentence level.

The basic idea of BLEU is to reward closeness to a human translation used as a reference translation, using modified n-gram precision (Philipp Koehn, 2010). The precision is determined by the weighted overlap of n-grams between the candidate translation and the reference translation (for n = 1, …, 4). The final score, between 0 and 1, tells how close the candidate is to the reference translation. BLEU is currently the most commonly used score for comparing MT systems and evaluating improvements, because it is easy to compute and provides reasonable performance. In modified n-gram precision, the count of each candidate n-gram is clipped to the maximum number of occurrences of that n-gram in any one of the reference translations.
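The clipping and n-gram overlap just described can be sketched in a few lines of Python. This is a simplified illustration, not the official BLEU implementation: in particular, the brevity penalty here uses the shortest reference length, whereas standard BLEU uses the closest reference length (the two coincide when there is a single reference, as in this study).

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it occurs in any single reference translation."""
    cand = ngram_counts(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    total = sum(cand.values())
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / total if total else 0.0

def bleu(candidate, references, max_n=4):
    """Geometric mean of the clipped 1..max_n-gram precisions,
    times a brevity penalty for candidates shorter than the reference."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c = len(candidate)
    r = min(len(ref) for ref in references)  # simplification: shortest reference
    brevity = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Word-order differences lower the higher-order n-gram precisions:
ref = "all factors are free and equal in the market".split()
hyp = "all factors in the market are free and equal".split()
print(round(bleu(hyp, [ref]), 2))  # prints 0.48
```

An identical candidate scores exactly 1.0, and any candidate missing all 4-grams of the reference scores 0, which is why BLEU is normally reported over whole documents rather than single sentences.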

Statement of problem

The use of MT evaluation metrics has been widely adopted as standard practice within the MT community. In most cases, they provide reasonable rankings of MT systems according to their translation performance. However, for some language pairs and text types, evaluation metrics may fail to do so; therefore, for each new language pair or text type, the reliability of the evaluation methods needs to be tested. Moreover, the evaluation metrics currently in use provide no more information about translation quality than a system ranking, which is a limitation of these metrics. The main purpose of this paper is therefore to compare the findings of two common methods of MT evaluation (human and automatic) and, at the same time, based on the data collected by these two methods, to make a clear judgment about the performance of two MT systems (Google Translate and the Pars translation system), taking into account factors such as the acceptability and usability of the texts they translate.

Research questions:

Based on the above discussions, the following questions have been raised:

  1. Do different evaluation methods lead to the same results?
  2. Which of the examined evaluation methods is more reliable?

The Design, Method and Procedures

This study attempts to compare two common kinds of evaluation methods in MT studies, namely human evaluation and automatic evaluation of MT systems. According to Koehn (2010), in manual evaluation the outputs are assessed against a standard of correctness; however, correctness is a broad measure of assessment, so it is divided into two sub-criteria, fluency and adequacy, which Philipp Koehn (2010) defines as follows:

Fluency: is the output good, fluent Persian (the target language of this study)? This involves both grammatical correctness and idiomatic word choice.

Adequacy: does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted?

To assess the collected data automatically, the BLEU method of MT evaluation was applied. As described above, BLEU rewards closeness to a human reference translation using modified n-gram precision (for n = 1, …, 4), yielding a final score between 0 and 1, and was designed to correlate with human assessment (Philipp Koehn, 2010).

The research is therefore designed, first, to evaluate the outputs of both Google Translate and the Pars translation system, humanly and automatically, in order to identify the similarities and differences between the methods of MT assessment, and second, to find the most reliable method of MT evaluation. In the human evaluation method, the adequacy and fluency of the translated sentences were assessed on a scale from 5, for high-quality translation, down to 1. The BLEU method was applied to assess the outputs of the investigated systems automatically. To conduct the human evaluation, twenty subjects were selected as human evaluators, all of them MA students of Translation Studies at Kharazmi University. To examine the usability of the output, the participants were given reference translations and, considering these reference sentences, judged the adequacy and fluency of the sentences translated by the MT systems. The materials of the study were selected from "Translating Economic Texts", published by Payam-e-Nour University. The final results of the study are shown in tables.

Significance of study

This study attempts to evaluate the performance of existing English-to-Persian MT systems from different perspectives on MT performance. To determine the acceptability and usability of the translated text, the researcher analyzes the collected data based on the opinions of translators, as professionals in this field of study. This increases the value and applicability of the findings, in contrast to previous studies conducted without participants who had sufficient knowledge of translation and its principles. Also, in order to determine the best method of evaluation, the translated texts were analyzed by both human and automatic MT evaluation methods.

Discussion and Results

To evaluate the collected data automatically, the BLEU method of evaluation was selected for this study. This method was presented by Papineni et al. (2002) and is a language-independent metric that provides a quick overview of the performance of an MT system. For measuring the quality of MT, the method follows a very simple hypothesis: the closer a machine translation is to a professional human translation, the better it is. In practice, BLEU works by comparing a system's translation against available reference translations produced by human translators. The results of the evaluation are presented in the following tables:

1. In the Walrasian scheme, factors of production are concrete items in existence at a moment of time.

Reference translation:

در طرح والراس: عوامل تولید اقلام مشخصی را تشکیل می دهند که در لحظه ای از زمان وجود دارند.

Candidate 1(Google translation system):

در این طرح Walrasian، عوامل تولید اقلام بتن در وجود در یک لحظه از زمان است.

10/16 = 62%

62% of the text translated by the Google translation system matches the reference translation.

Candidate 2(Pars translation system)

6/14 = 42%

در طرح Walrasian، عوامل تولید اقلام ملموس در وجود در یک گشتاور زمان هست.

42% of the text translated by the Pars Translation System matches the reference sentence.

2. Walras seems deliberately to slur over the distinction between income from work and income from property.

Reference Translation:

به نظر میرسد والراس آگاهانه تمایز بین درآمد از کار و درآمد از مالکیت را لوث میکند .

Candidate 1: 10/20= 50% similarity with reference translation.

به نظر میرسد عمدا به بیش از تمایز بین درآمد حاصل از کار و درآمد حاصل از اموال لکه دار کردن.

Candidate 2: 8/21= 38% similarity with reference translation.

عمدا به نظر میرساند که بر روی تمایز ما بین درآمد از کار و درآمد از دارایی در بهم بر هم Walrasian بنویسید.

3. All factors are free and equal in the market.

Reference Translation:

در بازار همه عوامل آزاد و برابرند.

Candidate 1: 7/7=100% similarity with RT

همه عوامل در بازار آزاد و برابر هستند.

Candidate 2: 6/8= 75% similarity with RT

تمام عوامل مجانی و برابر در بازار هستند.

4. Economic activity was organized on the assumption of cheap and abundant oil.

Reference Translation:

فعالیتهای اقتصادی را بر اساس نفت ارزان و فراوان سازماندهی می کردند.

Candidate 1: 6/12= 50%

فعالیتهای اقتصادی در این فرض از نفت ارزان و فراوان برگزار شد.

Candidate 2: 6/12= 50%

فعالیت اقتصاد روی فرض از ارزان و نفت فراوان سازمان داده شد.

5. We say that the economy is experiencing inflation.

Reference Translation:

میگوییم که اقتصاد دچار تورم شده است.

Candidate 1: 4/9 = 44% similarity with RT

ما میگویند که اقتصاد در حال تجربه تورم است.

Candidate 2: 5/7= 71%

ما میگوییم اقتصاد که تورم تجربه است.

System / Sentence    Google translate    Pars Translation system
                     (Candidate 1)       (Candidate 2)

Sentence 1           0.6                 0.4
Sentence 2           0.5                 0.4
Sentence 3           1.0                 0.7
Sentence 4           0.5                 0.5
Sentence 5           0.4                 0.7
Total score          3.0                 2.7

Table 1. Results of BLEU evaluation of the MT outputs

As shown in Table 1, Candidate 1 gained the higher BLEU score, with a total of 3.0, against Candidate 2, which gained 2.7.
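The totals in Table 1 are simply the sums of the per-sentence scores, as a quick check confirms (scores copied from Table 1):

```python
google = [0.6, 0.5, 1.0, 0.5, 0.4]  # Candidate 1, sentences 1-5 (Table 1)
pars = [0.4, 0.4, 0.7, 0.5, 0.7]    # Candidate 2, sentences 1-5 (Table 1)
print(round(sum(google), 1), round(sum(pars), 1))  # prints 3.0 2.7
```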

Human Evaluation

To evaluate the collected data on the translations performed by both MT systems humanly, the adequacy and fluency of the translated text were analyzed using the human evaluation method presented in Philipp Koehn (2010). In this method, both evaluation criteria are scored from 1 to 5 based on the quality of the translation, which in the current study was assessed by 20 MA translation students. The details of the scale are given in the following table:

Score   Adequacy         Score   Fluency

5       All meaning      5       Flawless Persian
4       Most meaning     4       Good Persian
3       Much meaning     3       Non-native Persian
2       Little meaning   2       Disfluent Persian
1       None             1       Incomprehensible

Table 2. Scoring adequacy and fluency, based on Philipp Koehn (2010)

As previously noted, the output of both MT systems under study was evaluated by 20 MA students in order to compare the findings of the two models of evaluation. The following tables show the average scores given for each system by the participants of the study:

Note: a star (*) under a score indicates the score assigned to each sentence by the participants.

Sentences translated by Google translate (Candidate 1), each scored for fluency and adequacy on the 1-5 scale:

1. در این طرح Walrasian، عوامل تولید اقلام بتن در وجود در یک لحظه از زمان است.
2. به نظر میرسد عمدا به بیش از تمایز بین درآمد حاصل از کار و درآمد حاصل از اموال لکه دار کردن.
3. همه عوامل در بازار آزاد و برابر هستند.
4. فعالیتهای اقتصادی در این فرض از نفت ارزان و فراوان برگزار شد.
5. ما میگویند که اقتصاد در حال تجربه تورم است.

Table 3. Scores assigned by participants to the text translated by Candidate 1 (Google translate system)

Sentences translated by the Pars translation system (Candidate 2), each scored for fluency and adequacy on the 1-5 scale:

1. در طرح Walrasian، عوامل تولید اقلام ملموس در وجود در یک گشتاور زمان است.
2. عمدا به نظر میرساند که بر روی تمایز مابین درآمد از کار و درآمد از دارایی در بهم بر هم Walrasian بنویسد.
3. تمام عوامل مجانی و برابر در بازار هستند.
4. فعالیت اقتصادی روی فرض از ارزان و نفت فراوان سازمان داده شده.
5. ما میگوییم اقتصاد که تورم تجربه است.

Table 4. Scores assigned by participants to the text translated by Candidate 2 (Pars translation system)

According to Philipp Koehn's model of evaluation, the reference translation is the yardstick for assessing and marking MT outputs, and the full score (5) is assigned to it as the best, standard translation of the sentences; therefore, the closer the translated text is to the reference translation, the better it is. The findings of the human evaluation method are shown in Table 5.

Criteria    Translated text by   N   Min score   Max score   Mean

Fluency     Reference            5   5           5           5
            Google MT            5   2           4           2.6
            Pars MT              5   1           3           1.8

Adequacy    Reference            5   5           5           5
            Google MT            5   2           4           3.4
            Pars MT              5   1           3           2

Table 5. Mean fluency and adequacy scores of the MT systems, based on Philipp Koehn's model

According to the results shown in Table 5, with regard to the fluency criterion, the performance of the Google translate system, with a mean of 2.6, is more acceptable than that of the Pars system, with a mean score of 1.8. For the adequacy variable, the same conclusion can be drawn from the table: the Google MT system, with a mean score of 3.4, performed better than the Pars MT system, with a mean score of 2.

Conclusion

As discussed previously, the main concern of this study is to compare human and automatic methods of evaluating MT systems. To evaluate the collected data automatically, the BLEU model was applied, and according to the analyzed data both evaluation methods reach the same result, which answers research question No. 1. Regarding research question No. 2, human evaluation performed by professional translators is a more expensive and time-consuming method than automatic MT evaluation; on the other hand, because human evaluators examine the details of the translated text carefully, automatic MT evaluation methods are less reliable than human evaluation.

 

References:

  1. ALPAC. (1966). Language and Machines: Computers in Translation and Linguistics. A report by the Automatic Language Processing Advisory Committee, Division of Behavioral Sciences, National Academy of Sciences, National Research Council. Washington, D.C.: National Academy of Sciences - National Research Council.
  2. Arnold, D., Sadler, L. and Humphreys, R. L. (1993). Evaluation: An Assessment. Machine Translation, Vol. 8, No. 1-2. Special issue on Evaluation of MT Systems.
  3. Arnold, D., et al. (1996). Machine Translation: An Introductory Guide. Manchester: Blackwell.
  4. Saedi, C., et al. (2009). Automatic Translation between English and Persian Texts. CAASL-3 – Third Workshop on Computational Approaches to Arabic Script-based Languages [at] MT Summit XII, August 26, 2009, Ottawa, Ontario, Canada.
  5. Dobrinkat, M. (2008). Domain Adaptation in Statistical Machine Translation Systems via User Feedback. Master's thesis, Helsinki University of Technology.
  6. Hutchins, W. John and Somers, Harold L. (1992). An Introduction to Machine Translation. London: Academic Press.
  7. Hutchins, W. John. (1986). Machine Translation: Past, Present, Future. West Sussex, England: Ellis Horwood Limited.
  8. Atashgah, M. and Bijankhan, M. (2009). Corpus-based Analysis for Multi-token Units in Persian. CAASL-3 – Third Workshop on Computational Approaches to Arabic Script-based Languages [at] MT Summit XII, August 26, 2009, Ottawa, Ontario, Canada.
  9. Papineni, K. (2002). Machine Translation Evaluation: n-grams to the Rescue. LREC-2002: Third International Conference on Language Resources and Evaluation, Proceedings, Las Palmas de Gran Canaria, Spain, 27 May - 2 June 2002.
  10. Koehn, Philipp. (2010). Statistical Machine Translation. New York: Cambridge University Press.
  11. Schwarzl, A. (1999). The (Im)Possibilities of Machine Translation. Peter Lang.
  12. Van Slype, G. (1979). Critical Study of Methods for Evaluating the Quality of Machine Translation. Final report. Brussels: Bureau Marcel van Dijk [for] European Commission.
  13. White, J. (2003a). How to Evaluate Machine Translation. In H. Somers, ed., Computers and Translation: A Translator's Guide. Amsterdam/Philadelphia: John Benjamins.