MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation with Reinforced Factors

MT SUMMIT 2013. Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng. September 2nd-6th, 2013, Nice, France. Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory, Department of Computer and Information Science, University of Macau

Description

Presentation PPT at MT SUMMIT 2013: Language-independent Model for Machine Translation Evaluation with Reinforced Factors. International Association for Machine Translation, 2013. Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng. Proceedings of the 14th biennial Machine Translation Summit (MT Summit 2013), Nice, France, 2-6 September 2013. Open-source tool: https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)

Transcript of MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation with Reinforced Factors

Page 1: MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation with Reinforced Factors

MT SUMMIT 2013

Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng

September 2nd-6th, 2013, Nice, France

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory

Department of Computer and Information Science

University of Macau

Page 2

The importance of machine translation (MT) evaluation

Automatic MT evaluation metrics introduction
1. Lexical similarity

2. Linguistic features

3. Metrics combination

Designed metric: LEPOR Series
1. Motivation

2. LEPOR Metrics Description

3. Performances on international ACL-WMT corpora

4. Publications and Open source tools

Further information

Page 3

• Eager communication among people of different nationalities

– promotes translation technology

• Rapid development of machine translation

– machine translation (MT) began as early as the 1950s (Weaver, 1955)

– big progress since the 1990s due to the development of computers (storage capacity and computational power) and enlarged bilingual corpora (Marino et al., 2006)

Page 4

• Some recent works of MT research:

– Och (2003) presented MERT (Minimum Error Rate Training) for log-linear SMT

– Su et al. (2009) used the Thematic Role Templates model to improve translation

– Xiong et al. (2011) employed a maximum-entropy model, etc.

– Data-driven methods, including example-based MT (Carl and Way, 2003) and statistical MT (Koehn, 2010), became the main approaches in the MT literature.

Page 5

• How well do MT systems perform, and are they making progress?

• Difficulties of MT evaluation

– language variability results in no single correct translation

– the natural languages are highly ambiguous and different languages do not always express the same content in the same way (Arnold, 2003)

Page 6

• Traditional manual evaluation criteria:

– intelligibility (measuring how understandable the sentence is)

– fidelity (measuring how much information the translated sentence retains as compared to the original) by the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966)

– adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility) by the Defense Advanced Research Projects Agency (DARPA) of the US (White et al., 1994)

Page 7

• Problems of manual evaluation:

– Time-consuming

– Expensive

– Unrepeatable

– Low agreement (Callison-Burch et al., 2011)

Page 8

2.1 Lexical similarity

2.2 Linguistic features

2.3 Metrics combination

Page 9

• Precision-based

BLEU (Papineni et al., 2002 ACL)

• Recall-based

ROUGE (Lin, 2004 WAS)

• Precision and Recall

Meteor (Banerjee and Lavie, 2005 ACL)

Page 10

• Word-order based

NKT_NSR (Isozaki et al., 2010 EMNLP), Port (Chen et al., 2012 ACL), ATEC (Wong et al., 2008 AMTA)

• Word-alignment based

AER (Och and Ney, 2003 J.CL)

• Edit distance-based

WER (Su et al., 1992 Coling), PER (Tillmann et al., 1997 EUROSPEECH), TER (Snover et al., 2006 AMTA)

Page 11

• Language model

LM-SVM (Gamon et al., 2005 EAMT)

• Shallow parsing

GLEU (Mutton et al., 2007 ACL), TerrorCat (Fishel et al., 2012 WMT)

• Semantic roles

Named entity, morphological, synonymy, paraphrasing, discourse representation, etc.

Page 12

• MTeRater-Plus (Parton et al., 2011 WMT)

– Combine BLEU, TERp (Snover et al., 2009) and Meteor (Banerjee and Lavie, 2005; Lavie and Denkowski, 2009)

• MPF & WMPBleu (Popovic, 2011 WMT)

– Arithmetic mean of F score and BLEU score

• SIA (Liu and Gildea, 2006 ACL)

– Combine the advantages of n-gram-based metrics and loose-sequence-based metrics

Page 13

• hLEPOR: harmonic mean of enhanced Length Penalty, Precision, n-gram Position difference Penalty and Recall

Page 14

• Weaknesses in existing metrics:

– perform well on certain language pairs but poorly on others, which we call the language-bias problem;

– consider no linguistic information (leading to low correlation with human judgments) or too many linguistic features (hard to replicate), which we call the extremism problem;

– use incomprehensive factors (e.g. BLEU focuses on precision only).

– What to do?

Page 15

• To address some of the existing problems:

– Design tunable parameters to address the language-bias problem;

– Use concise or optimized linguistic features for the linguistic extremism problem;

– Design augmented factors.

Page 16

• Sub-factors:

• $\mathit{ELP} = \begin{cases} e^{1-\frac{r}{c}}, & c < r \\ e^{1-\frac{c}{r}}, & c \ge r \end{cases}$ (1)

• $r$: length of the reference sentence

• $c$: length of the candidate (system-output) sentence
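As a minimal illustration, Eq. (1) translates directly into Python; the function name `elp` and its signature are our own, not from the paper:

```python
import math

def elp(r: int, c: int) -> float:
    """Enhanced Length Penalty (Eq. 1): penalizes candidate sentences
    whose length c differs from the reference length r."""
    if c < r:
        return math.exp(1 - r / c)  # candidate shorter than reference
    return math.exp(1 - c / r)      # candidate longer (or equal: no penalty)
```

For equal lengths `elp(10, 10)` returns 1.0 (no penalty), and the penalty decays exponentially as the two lengths diverge.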

Page 17

• $\mathit{NPosPenal} = \exp(-\mathit{NPD})$ (2)

• $\mathit{NPD} = \frac{1}{\mathit{Length}_{output}} \sum_{i=1}^{\mathit{Length}_{output}} |PD_i|$ (3)

• $PD_i = |\mathit{MatchN}_{output} - \mathit{MatchN}_{ref}|$ (4)

• $\mathit{MatchN}_{output}$: position of the matched token in the output sentence

• $\mathit{MatchN}_{ref}$: position of the matched token in the reference sentence
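Eqs. (2)-(4) can be sketched as below; we assume the matched-token positions have already been computed by the alignment step (the slide examples normalize positions by sentence length), and the function name is illustrative:

```python
import math

def n_pos_penal(match_pos_out, match_pos_ref, length_output):
    """N-gram Position difference Penalty (Eqs. 2-4).

    match_pos_out / match_pos_ref: parallel lists giving the position
    of each matched token in the output and reference sentences."""
    # PD_i = |MatchN_output - MatchN_ref|              (Eq. 4)
    pds = [abs(o - r) for o, r in zip(match_pos_out, match_pos_ref)]
    # NPD = (1 / Length_output) * sum(|PD_i|)          (Eq. 3)
    npd = sum(pds) / length_output
    # NPosPenal = exp(-NPD)                            (Eq. 2)
    return math.exp(-npd)
```

When all matched tokens sit at identical positions, NPD is 0 and the penalty factor is 1 (no penalty); larger position differences shrink the factor toward 0.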

Page 18

Fig. 1. N-gram word alignment algorithm

Page 19

Fig. 2. Example of n-gram word alignment

Page 20

Fig. 3. Example of NPD calculation

Page 21

• N-gram precision and recall:

• $\mathit{Precision} = \frac{\mathit{Aligned}_{num}}{\mathit{Length}_{output}}$ (5)

• $\mathit{Recall} = \frac{\mathit{Aligned}_{num}}{\mathit{Length}_{reference}}$ (6)

• $\mathit{HPR} = \frac{(\alpha+\beta)\,\mathit{Precision} \times \mathit{Recall}}{\alpha\,\mathit{Precision} + \beta\,\mathit{Recall}}$ (7)
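A minimal sketch of Eqs. (5)-(7); the function name and default weights are our own illustration (in practice α and β are tuned per language pair):

```python
def hpr(aligned_num, length_output, length_reference, alpha=1.0, beta=1.0):
    """Weighted harmonic mean of n-gram precision and recall (Eqs. 5-7)."""
    precision = aligned_num / length_output      # Eq. 5
    recall = aligned_num / length_reference      # Eq. 6
    # Eq. 7: (alpha + beta) * P * R / (alpha * P + beta * R)
    return (alpha + beta) * precision * recall / (alpha * precision + beta * recall)
```

With α = β = 1 this reduces to the ordinary F1 harmonic mean of precision and recall; raising β weights recall more heavily.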

Page 22

• Sentence-level hLEPOR Metric:

• $\mathit{hLEPOR} = \mathit{Harmonic}(w_{LP}\mathit{LP},\; w_{NPosPenal}\mathit{NPosPenal},\; w_{HPR}\mathit{HPR}) = \sum_{i=1}^{n} w_i \Big/ \sum_{i=1}^{n} \frac{w_i}{\mathit{Factor}_i} = \frac{w_{LP}+w_{NPosPenal}+w_{HPR}}{\frac{w_{LP}}{\mathit{LP}}+\frac{w_{NPosPenal}}{\mathit{NPosPenal}}+\frac{w_{HPR}}{\mathit{HPR}}}$ (8)

• System-level hLEPOR Metric:

• $\mathit{hLEPOR} = \frac{1}{\mathit{num}_{sent}} \sum_{i=1}^{\mathit{num}_{sent}} \mathit{hLEPOR}_i$ (9)
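The two levels of the metric can be sketched as follows; function names, parameter names, and the default weights of 1.0 are illustrative assumptions (the actual weights are tuned, see Table 1):

```python
def hlepor(lp, npos_penal, hpr_score, w_lp=1.0, w_np=1.0, w_hpr=1.0):
    """Sentence-level hLEPOR (Eq. 8): weighted harmonic mean of the
    three factors LP, NPosPenal and HPR."""
    weights = [w_lp, w_np, w_hpr]
    factors = [lp, npos_penal, hpr_score]
    return sum(weights) / sum(w / f for w, f in zip(weights, factors))

def system_hlepor(sentence_scores):
    """System-level hLEPOR (Eq. 9): mean of the sentence-level scores."""
    return sum(sentence_scores) / len(sentence_scores)
```

The harmonic mean rewards balance: a low value in any one factor drags the whole sentence score down, unlike an arithmetic mean.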

Page 23

• Example: employing linguistic features:

Fig. 4. Example of n-gram POS alignment

Fig. 5. Example of NPD calculation

Page 24

• Enhanced version with linguistic features:

• $\mathit{hLEPOR}_E = \frac{1}{w_{hw}+w_{hp}}\big(w_{hw}\,\mathit{hLEPOR}_{word} + w_{hp}\,\mathit{hLEPOR}_{POS}\big)$ (10)

• The system-level scores $\mathit{hLEPOR}_{word}$ and $\mathit{hLEPOR}_{POS}$ use the same algorithm, applied to the word sequence and the POS sequence, respectively.
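Eq. (10) is a weighted average of the two system-level scores; a one-line sketch, with illustrative names and default weights:

```python
def hlepor_e(hlepor_word, hlepor_pos, w_hw=1.0, w_hp=1.0):
    """Enhanced hLEPOR (Eq. 10): weighted average of the word-level
    and POS-level system scores."""
    return (w_hw * hlepor_word + w_hp * hlepor_pos) / (w_hw + w_hp)
```

Setting `w_hp=0` recovers the purely lexical metric; raising it shifts weight toward the POS-level (linguistic) score.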

Page 25

• With multiple references:

• Select the alignment that results in the minimum NPD score.

Fig. 6. N-gram alignment with multiple references
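The selection rule above can be sketched in one line; `best_alignment` and the caller-supplied `npd_score` function are our own illustrative names:

```python
def best_alignment(candidate_alignments, npd_score):
    """With multiple references, keep the candidate n-gram alignment
    whose NPD score (computed by npd_score) is minimal."""
    return min(candidate_alignments, key=npd_score)
```

Here `npd_score` would wrap the NPD computation of Eq. (3) against the corresponding reference.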

Page 26

• How reliable is the automatic metric?

• Evaluation criteria for evaluation metrics:

– Human judgments are currently the gold standard to approach.

• Correlation with human judgments:

• System-level Spearman rank correlation coefficient:

– $\rho_{XY} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$ (11)

– $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$
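A minimal sketch of Eq. (11), assuming the two rankings contain no ties (ties require the averaged-rank correction not shown here); names are illustrative:

```python
def spearman_rho(rank_x, rank_y):
    """System-level Spearman rank correlation (Eq. 11) between two
    rankings, e.g. metric ranks vs. human-judgment ranks of MT systems."""
    n = len(rank_x)
    # d_i: rank difference for system i
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```

Identical rankings give 1.0 and exactly reversed rankings give -1.0, the two extremes of the coefficient.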

Page 27

• Training data (WMT08)

– 2,028 sentences for each document

– English vs Spanish/German/French/Czech

• Testing data (WMT11)

– 3,003 sentences for each document

– English vs Spanish/German/French/Czech

Page 28

Table 1. Values of tuned parameters

Page 29

Table 2. Correlation with human judgments on WMT11 corpora

Page 31

• Ongoing and further works:

– The combination of translation and evaluation, tuning the translation model using evaluation metrics

– Evaluation models from the perspective of semantics

– The exploration of unsupervised evaluation models, extracting features from source and target languages

Page 32

• Evaluation work is, in essence, closely related to similarity measurement; here we have applied it to MT evaluation. The same techniques can be further developed for other areas:

– information retrieval

– question answering

– search

– text analysis

– etc.

Page 33

MT SUMMIT 2013, September 2nd-6th, 2013, Nice, France

Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory

Department of Computer and Information Science

University of Macau