MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation with Reinforced Factors

MT SUMMIT 2013. Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng. September 2nd-6th, 2013, Nice, France. Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory, Department of Computer and Information Science, University of Macau

Description

Presentation PPT at MT SUMMIT 2013: Language-independent Model for Machine Translation Evaluation with Reinforced Factors. International Association for Machine Translation, 2013. Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng. Proceedings of the 14th biennial Machine Translation Summit (MT Summit 2013), Nice, France, 2-6 September 2013. Open-source tool: https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)

Transcript of MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation with Reinforced Factors

Page 1: MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation with Reinforced Factors

MT SUMMIT 2013

Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng

September 2nd-6th, 2013, Nice, France

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory

Department of Computer and Information Science

University of Macau

Page 2

The importance of machine translation (MT) evaluation

Automatic MT evaluation metrics introduction
1. Lexical similarity

2. Linguistic features

3. Metrics combination

Designed metric: LEPOR Series
1. Motivation

2. LEPOR Metrics Description

3. Performances on international ACL-WMT corpora

4. Publications and Open source tools

Further information

Page 3

• Eager communication among people of different nationalities

– promotes translation technology

• Rapid development of machine translation

– machine translation (MT) began as early as the 1950s (Weaver, 1955)

– big progress since the 1990s due to the development of computers (storage capacity and computational power) and enlarged bilingual corpora (Marino et al., 2006)

Page 4

• Some recent works of MT research:

– Och (2003) presented MERT (Minimum Error Rate Training) for log-linear SMT

– Su et al. (2009) used the Thematic Role Templates model to improve translation

– Xiong et al. (2011) employed a maximum-entropy model, etc.

– Data-driven methods, including example-based MT (Carl and Way, 2003) and statistical MT (Koehn, 2010), became the main approaches in the MT literature.

Page 5

• How well do MT systems perform, and are they making progress?

• Difficulties of MT evaluation

– language variability results in no single correct translation

– the natural languages are highly ambiguous and different languages do not always express the same content in the same way (Arnold, 2003)

Page 6

• Traditional manual evaluation criteria:

– intelligibility (measuring how understandable the sentence is)

– fidelity (measuring how much information the translated sentence retains as compared to the original) by the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966)

– adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility) by the Defense Advanced Research Projects Agency (DARPA) of the US (White et al., 1994)

Page 7

• Problems of manual evaluation:

– Time-consuming

– Expensive

– Unrepeatable

– Low agreement (Callison-Burch et al., 2011)

Page 8

2.1 Lexical similarity

2.2 Linguistic features

2.3 Metrics combination

Page 9

• Precision-based

BLEU (Papineni et al., 2002 ACL)

• Recall-based

ROUGE (Lin, 2004 WAS)

• Precision and Recall

Meteor (Banerjee and Lavie, 2005 ACL)

Page 10

• Word-order based

NKT_NSR (Isozaki et al., 2010 EMNLP), Port (Chen et al., 2012 ACL), ATEC (Wong et al., 2008 AMTA)

• Word-alignment based

AER (Och and Ney, 2003 J.CL)

• Edit distance-based

WER (Su et al., 1992 Coling), PER (Tillmann et al., 1997 EUROSPEECH), TER (Snover et al., 2006 AMTA)

Page 11

• Language model

LM-SVM (Gamon et al., 2005 EAMT)

• Shallow parsing

GLEU (Mutton et al., 2007 ACL), TerrorCat (Fishel et al., 2012 WMT)

• Semantic roles

Named entity, morphological, synonymy, paraphrasing, discourse representation, etc.

Page 12

• MTeRater-Plus (Parton et al., 2011 WMT)

– Combine BLEU, TERp (Snover et al., 2009) and Meteor (Banerjee and Lavie, 2005; Lavie and Denkowski, 2009)

• MPF & WMPBleu (Popovic, 2011 WMT)

– Arithmetic mean of F score and BLEU score

• SIA (Liu and Gildea, 2006 ACL)

– Combine the advantages of n-gram-based metrics and loose-sequence-based metrics

Page 13

• hLEPOR: harmonic mean of enhanced Length Penalty, Precision, n-gram Position difference Penalty and Recall

Page 14

• Weaknesses in existing metrics:

– perform well on certain language pairs but poorly on others, which we call the language-bias problem;

– consider no linguistic information (leading to low correlation with human judgments) or too many linguistic features (hard to replicate), which we call the extremism problem;

– use incomprehensive factors (e.g. BLEU focuses on precision only).

– What to do?

Page 15

• To address some of the existing problems:

– Design tunable parameters to address the language-bias problem;

– Use concise or optimized linguistic features for the linguistic extremism problem;

– Design augmented factors.

Page 16

• Sub-factors:

• $\mathit{ELP} = \begin{cases} e^{1-\frac{r}{c}}, & c < r \\ e^{1-\frac{c}{r}}, & c \ge r \end{cases}$ (1)

• $r$: length of the reference sentence

• $c$: length of the candidate (system-output) sentence
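As a minimal illustration, Eq. (1) translates directly into Python; the function name `elp` and its signature are our own, not from the paper:

```python
import math

def elp(r: int, c: int) -> float:
    """Enhanced Length Penalty (Eq. 1): penalizes candidate sentences
    whose length c differs from the reference length r."""
    if c < r:
        return math.exp(1 - r / c)  # candidate shorter than reference
    return math.exp(1 - c / r)      # candidate longer (or equal: no penalty)
```

For equal lengths `elp(10, 10)` returns 1.0 (no penalty), and the penalty decays exponentially as the two lengths diverge.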

Page 17

• $\mathit{NPosPenal} = \exp(-\mathit{NPD})$ (2)

• $\mathit{NPD} = \frac{1}{\mathit{Length}_{output}} \sum_{i=1}^{\mathit{Length}_{output}} |PD_i|$ (3)

• $PD_i = |\mathit{MatchN}_{output} - \mathit{MatchN}_{ref}|$ (4)

• $\mathit{MatchN}_{output}$: position of the matched token in the output sentence

• $\mathit{MatchN}_{ref}$: position of the matched token in the reference sentence
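Eqs. (2)-(4) can be sketched as below; we assume the matched-token positions have already been computed by the alignment step (the slide examples normalize positions by sentence length), and the function name is illustrative:

```python
import math

def n_pos_penal(match_pos_out, match_pos_ref, length_output):
    """N-gram Position difference Penalty (Eqs. 2-4).

    match_pos_out / match_pos_ref: parallel lists giving the position
    of each matched token in the output and reference sentences."""
    # PD_i = |MatchN_output - MatchN_ref|              (Eq. 4)
    pds = [abs(o - r) for o, r in zip(match_pos_out, match_pos_ref)]
    # NPD = (1 / Length_output) * sum(|PD_i|)          (Eq. 3)
    npd = sum(pds) / length_output
    # NPosPenal = exp(-NPD)                            (Eq. 2)
    return math.exp(-npd)
```

When all matched tokens sit at identical positions, NPD is 0 and the penalty factor is 1 (no penalty); larger position differences shrink the factor toward 0.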

Page 18

Fig. 1. N-gram word alignment algorithm

Page 19

Fig. 2. Example of n-gram word alignment

Page 20

Fig. 3. Example of NPD calculation

Page 21

• N-gram precision and recall:

• $\mathit{Precision} = \frac{\mathit{Aligned}_{num}}{\mathit{Length}_{output}}$ (5)

• $\mathit{Recall} = \frac{\mathit{Aligned}_{num}}{\mathit{Length}_{reference}}$ (6)

• $\mathit{HPR} = \frac{(\alpha+\beta)\,\mathit{Precision} \times \mathit{Recall}}{\alpha\,\mathit{Precision} + \beta\,\mathit{Recall}}$ (7)
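A minimal sketch of Eqs. (5)-(7); the function name and default weights are our own illustration (in practice α and β are tuned per language pair):

```python
def hpr(aligned_num, length_output, length_reference, alpha=1.0, beta=1.0):
    """Weighted harmonic mean of n-gram precision and recall (Eqs. 5-7)."""
    precision = aligned_num / length_output      # Eq. 5
    recall = aligned_num / length_reference      # Eq. 6
    # Eq. 7: (alpha + beta) * P * R / (alpha * P + beta * R)
    return (alpha + beta) * precision * recall / (alpha * precision + beta * recall)
```

With α = β = 1 this reduces to the ordinary F1 harmonic mean of precision and recall; raising β weights recall more heavily.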

Page 22

• Sentence-level hLEPOR Metric:

• $\mathit{hLEPOR} = \mathit{Harmonic}(w_{LP}\mathit{LP},\; w_{NPosPenal}\mathit{NPosPenal},\; w_{HPR}\mathit{HPR}) = \sum_{i=1}^{n} w_i \Big/ \sum_{i=1}^{n} \frac{w_i}{\mathit{Factor}_i} = \frac{w_{LP}+w_{NPosPenal}+w_{HPR}}{\frac{w_{LP}}{\mathit{LP}}+\frac{w_{NPosPenal}}{\mathit{NPosPenal}}+\frac{w_{HPR}}{\mathit{HPR}}}$ (8)

• System-level hLEPOR Metric:

• $\mathit{hLEPOR} = \frac{1}{\mathit{num}_{sent}} \sum_{i=1}^{\mathit{num}_{sent}} \mathit{hLEPOR}_i$ (9)
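The two levels of the metric can be sketched as follows; function names, parameter names, and the default weights of 1.0 are illustrative assumptions (the actual weights are tuned, see Table 1):

```python
def hlepor(lp, npos_penal, hpr_score, w_lp=1.0, w_np=1.0, w_hpr=1.0):
    """Sentence-level hLEPOR (Eq. 8): weighted harmonic mean of the
    three factors LP, NPosPenal and HPR."""
    weights = [w_lp, w_np, w_hpr]
    factors = [lp, npos_penal, hpr_score]
    return sum(weights) / sum(w / f for w, f in zip(weights, factors))

def system_hlepor(sentence_scores):
    """System-level hLEPOR (Eq. 9): mean of the sentence-level scores."""
    return sum(sentence_scores) / len(sentence_scores)
```

The harmonic mean rewards balance: a low value in any one factor drags the whole sentence score down, unlike an arithmetic mean.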

Page 23

• Example: employing linguistic features:

Fig. 4. Example of n-gram POS alignment

Fig. 5. Example of NPD calculation

Page 24

• Enhanced version with linguistic features:

• $\mathit{hLEPOR}_E = \frac{1}{w_{hw}+w_{hp}}\big(w_{hw}\,\mathit{hLEPOR}_{word} + w_{hp}\,\mathit{hLEPOR}_{POS}\big)$ (10)

• The system-level scores $\mathit{hLEPOR}_{word}$ and $\mathit{hLEPOR}_{POS}$ use the same algorithm, applied to the word sequence and the POS sequence, respectively.
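Eq. (10) is a weighted average of the two system-level scores; a one-line sketch, with illustrative names and default weights:

```python
def hlepor_e(hlepor_word, hlepor_pos, w_hw=1.0, w_hp=1.0):
    """Enhanced hLEPOR (Eq. 10): weighted average of the word-level
    and POS-level system scores."""
    return (w_hw * hlepor_word + w_hp * hlepor_pos) / (w_hw + w_hp)
```

Setting `w_hp=0` recovers the purely lexical metric; raising it shifts weight toward the POS-level (linguistic) score.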

Page 25

• With multiple references:

• Select the alignment that results in the minimum NPD score.

Fig. 6. N-gram alignment with multiple references
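The selection rule above can be sketched in one line; `best_alignment` and the caller-supplied `npd_score` function are our own illustrative names:

```python
def best_alignment(candidate_alignments, npd_score):
    """With multiple references, keep the candidate n-gram alignment
    whose NPD score (computed by npd_score) is minimal."""
    return min(candidate_alignments, key=npd_score)
```

Here `npd_score` would wrap the NPD computation of Eq. (3) against the corresponding reference.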

Page 26

• How reliable is the automatic metric?

• Evaluation criteria for evaluation metrics:

– Human judgments are currently the gold standard to approach.

• Correlation with human judgments:

• System-level Spearman rank correlation coefficient:

– $\rho_{XY} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$ (11)

– $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$
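A minimal sketch of Eq. (11), assuming the two rankings contain no ties (ties require the averaged-rank correction not shown here); names are illustrative:

```python
def spearman_rho(rank_x, rank_y):
    """System-level Spearman rank correlation (Eq. 11) between two
    rankings, e.g. metric ranks vs. human-judgment ranks of MT systems."""
    n = len(rank_x)
    # d_i: rank difference for system i
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```

Identical rankings give 1.0 and exactly reversed rankings give -1.0, the two extremes of the coefficient.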

Page 27

• Training data (WMT08)

– 2,028 sentences for each document

– English vs Spanish/German/French/Czech

• Testing data (WMT11)

– 3,003 sentences for each document

– English vs Spanish/German/French/Czech

Page 28

Table 1. Values of tuned parameters

Page 29

Table 2. Correlation with human judgments on WMT11 corpora

Page 31

• Ongoing and further works:

– The combination of translation and evaluation, tuning the translation model using evaluation metrics

– Evaluation models from the perspective of semantics

– The exploration of unsupervised evaluation models, extracting features from source and target languages

Page 32

• Evaluation work is, in essence, closely related to similarity measurement; here we have applied it to MT evaluation. The same techniques can be further developed for other areas:

– information retrieval

– question answering

– search

– text analysis

– etc.

Page 33

MT SUMMIT 2013, September 2nd-6th, 2013, Nice, France

Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory

Department of Computer and Information Science

University of Macau