Re-evaluating Bleu
Alison Alvarez
Machine Translation Seminar
February 16, 2006
Overview
• The Weaknesses of Bleu: Introduction; Precision and Recall; Fluency and Adequacy; Variations Allowed by Bleu; Bleu and Tides 2005
• An Improved Model: Overview of the Model; Experiment; Results
• Conclusions
Introduction
• Bleu has been shown to have high correlations with human judgments
• Bleu has been used by MT researchers for five years, sometimes in place of manual human evaluations
• But does minimizing this error rate accurately reflect improvements in translation quality?
Precision and Bleu
• Of my answers, how many are right/wrong?
• Precision = |B ∩ C| / |C|, i.e. A/C
[Venn diagram: reference translation B and hypothesis translation C, with overlap region A]
Precision and Bleu
Bleu is a precision-based metric
• The modified precision score, pn:
p_n = ( Σ_{S ∈ Candidates} Σ_{ngram ∈ S} Count_matched(ngram) ) / ( Σ_{S ∈ Candidates} Σ_{ngram ∈ S} Count(ngram) )
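The clipped counting is easier to see in code. A minimal Python sketch (the function names and the whitespace tokenization are illustrative, not from the paper):

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hypotheses, references_list, n):
    """Corpus-level modified n-gram precision: each hypothesis n-gram
    count is clipped to the largest count observed for that n-gram in
    any of the sentence's reference translations."""
    matched = total = 0
    for hyp, refs in zip(hypotheses, references_list):
        hyp_counts = Counter(ngrams(hyp.split(), n))
        max_ref = Counter()  # max reference count per n-gram
        for ref in refs:
            for ng, c in Counter(ngrams(ref.split(), n)).items():
                max_ref[ng] = max(max_ref[ng], c)
        matched += sum(min(c, max_ref[ng]) for ng, c in hyp_counts.items())
        total += sum(hyp_counts.values())
    return matched / total if total else 0.0
```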
Recall and Bleu
• Of the potential answers, how many did I retrieve/miss?
• Recall = |B ∩ C| / |B|, i.e. A/B
[Venn diagram as above: reference translation B, hypothesis translation C, overlap region A]
Recall and Bleu
• Because Bleu uses multiple reference translations at once, recall cannot be calculated: a good hypothesis need not contain all the items in every reference, so there is no well-defined set of items to be recalled
Fluency and Adequacy to Evaluators
• Fluency: “How do you judge the fluency of this translation?” Judged with no reference translation, against the standard of written English.
• Adequacy: “How much of the meaning expressed in the reference is also expressed in the hypothesis translation?”
Variations
• Bleu allows variations in word and phrase order that reduce fluency
• No constraints are placed on the order in which matching n-grams occur
Variations
[Figure: a pair of example translations, one with its phrases reordered, omitted from this transcript]
The two translations shown have the same bigram score; the sketch below reproduces the effect.
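The effect is easy to verify mechanically. A self-contained Python check with invented sentences; the filler word at each phrase boundary keeps the set of matching bigrams identical under reordering:

```python
from collections import Counter

def bigram_precision(hyp, ref):
    """Clipped bigram precision of a hypothesis against one reference."""
    def bigrams(s):
        toks = s.split()
        return Counter(zip(toks, toks[1:]))
    h, r = bigrams(hyp), bigrams(ref)
    return sum(min(c, r[b]) for b, c in h.items()) / sum(h.values())

ref  = "the cat sat on the mat"
hyp1 = "the cat sat quietly on the mat"   # fluent phrase order
hyp2 = "on the mat quietly the cat sat"   # phrases permuted
print(bigram_precision(hyp1, ref))        # 4/6
print(bigram_precision(hyp2, ref))        # 4/6 -- same despite reordering
```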
Bleu and Tides 2005
• Bleu scores showed significant divergence from human judgments in the 2005 Tides Evaluation
• Bleu ranked the system judged best by humans as sixth in performance
Bleu and Tides 2005
• Reference: Iran had already announced Kharazi would boycott the conference after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs
• System A: Iran has already stated that Kharazi’s statements to the conference because of the Jordanian King Abdullah II in which he stood accused Iran of interfering in Iraqi affairs.
• N-gram matches: 1-gram: 27; 2-gram: 20; 3-gram: 15; 4-gram: 10
• Human scores: Adequacy: 3, 2; Fluency: 3, 2 (from Callison-Burch 2005)
Bleu and Tides 2005
• Reference: Iran had already announced Kharazi would boycott the conference after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs
• System B: Iran already announced that Kharazi will not attend the conference because of statements made by Jordanian Monarch Abdullah II who has accused Iran of interfering in Iraqi affairs.
• N-gram matches: 1-gram: 24; 2-gram: 19; 3-gram: 15; 4-gram: 12
• Human scores: Adequacy: 5, 4; Fluency: 5, 4 (from Callison-Burch 2005)
An Experiment with Bleu
Bleu and Tides 2005
• “This opens the possibility that in order for Bleu to be valid only sufficiently similar systems should be compared with one another”
Additional Flaws
• Multiple human reference translations are expensive
• N-grams that appear in multiple reference translations are weighted the same as n-grams that appear in only one
• Content words are weighted the same as common words: ‘the’ counts the same as ‘Parliament’
• Bleu accounts for the diversity of human translations, but not for synonyms
An Extension of Bleu
• Described in Babych & Hartley, 2004
• Adds weights to matched items using tf.idf or the S-score
Addressing Flaws
• Can work with only one human translation
• Recall can actually be calculated (the paper is not very clear about how the reference sentence is selected)
• Content words are weighted differently than common words: ‘the’ does not count the same as ‘Parliament’
Calculating the tf/idf Score
• tf.idf(i,j) = (1 + log(tf(i,j))) × log(N / df(i)), if tf(i,j) ≥ 1
where:
tf(i,j) is the number of occurrences of the word w_i in the document d_j;
df(i) is the number of documents in the corpus where the word w_i occurs;
N is the total number of documents in the corpus. (from Babych 2004)
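Read literally, the formula damps repeated occurrences (the 1 + log term) and zeroes out words that occur in every document. A minimal sketch with an invented toy corpus:

```python
import math

def tf_idf(word, doc, corpus):
    """tf.idf(i,j) = (1 + log(tf(i,j))) * log(N / df(i)), for tf >= 1.
    `doc` is a token list; `corpus` is a list of such documents."""
    tf = doc.count(word)                      # occurrences of w_i in d_j
    if tf < 1:
        return 0.0                            # defined only for tf >= 1
    df = sum(1 for d in corpus if word in d)  # documents containing w_i
    n = len(corpus)
    return (1 + math.log(tf)) * math.log(n / df)

corpus = [["the", "parliament", "met", "today"],
          ["the", "cat", "sat", "on", "the", "mat"],
          ["parliament", "passed", "the", "law"]]
print(tf_idf("parliament", corpus[0], corpus))  # content word: positive weight
print(tf_idf("the", corpus[0], corpus))         # occurs in every doc: 0.0
```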
Calculating the S-Score
• The S-score was calculated as:
S(i,j) = log( ((P_doc(i,j) − P_corp-doc(i)) × (N − df(i)) / N) / P_corp(i) )
where:
P_doc(i,j) is the relative frequency of the word in the text;
P_corp-doc(i) is the relative frequency of the same word in the rest of the corpus, without this text;
(N − df(i)) / N is the proportion of texts in the corpus where this word does not occur;
P_corp(i) is the relative frequency of the word in the whole corpus, including this particular text.
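A toy Python rendering of these definitions (the slide's equation was garbled in transcription and is reconstructed from the four components listed above, so the exact form, and the helper below, should be checked against Babych 2004):

```python
import math

def s_score(word, doc_idx, docs):
    """S-score of `word` in docs[doc_idx], following the definitions
    above; `docs` is a list of token lists, one per text."""
    doc = docs[doc_idx]
    rest = [w for i, d in enumerate(docs) if i != doc_idx for w in d]
    whole = [w for d in docs for w in d]
    n = len(docs)
    p_doc = doc.count(word) / len(doc)          # rel. freq. in this text
    p_corp_doc = rest.count(word) / len(rest)   # rel. freq. elsewhere
    df = sum(1 for d in docs if word in d)      # texts containing the word
    p_corp = whole.count(word) / len(whole)     # rel. freq. in whole corpus
    arg = (p_doc - p_corp_doc) * (n - df) / n / p_corp
    return math.log(arg) if arg > 0 else float("-inf")
```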
Integrating the S-Score
• If for a lexical item in a text the S-score > 1, all counts for the N-grams containing this item are increased by the S-score (not just by 1, as in the baseline BLEU approach).
• If the S-score ≤ 1, the usual N-gram count is applied: the count is increased by 1.
From Babych 2004
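A sketch of one way this rule could enter the match counting (my own reading: where an n-gram contains several items with S-score > 1, the slide does not say which applies, so the largest is taken here):

```python
def weighted_match_count(matched_ngrams, s_scores):
    """Sum matched n-grams, crediting an n-gram that contains a lexical
    item with S-score > 1 by that S-score rather than by 1."""
    total = 0.0
    for ngram in matched_ngrams:
        best = max((s_scores.get(w, 0.0) for w in ngram), default=0.0)
        total += best if best > 1.0 else 1.0
    return total
```

Dividing such a weighted match count by the analogous weighted total count would then give the weighted precision and recall figures reported below.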
The Experiment
• Used 100 French-English texts from the DARPA-94 evaluation corpus
• Included two reference translations
• Results from four different MT systems
The Experiment
• Stage 1: tf.idf and S-scores are calculated on the two reference translations
• Stage 2: N-gram based evaluation of the MT output using precision and recall; n-gram matches are adjusted by the n-gram weights (tf.idf or S-score)
• Stage 3: Comparison with human scores
Results for tf/idf
(paired values: computed against reference 1 / reference 2)

System ([ade] / [flu])      | BLEU (refs 1&2) | Prec.(w) ref1 / ref2 | Recall(w) ref1 / ref2 | Fscore(w) ref1 / ref2
CANDIDE (0.677 / 0.455)     | 0.3561          | 0.4767 / 0.4709      | 0.3363 / 0.3324       | 0.3944 / 0.3897
GLOBALINK (0.710 / 0.381)   | 0.3199          | 0.4289 / 0.4277      | 0.3146 / 0.3144       | 0.3630 / 0.3624
MS (0.718 / 0.382)          | 0.3003          | 0.4217 / 0.4218      | 0.3332 / 0.3354       | 0.3723 / 0.3737
REVERSO (NA / NA)           | 0.3823          | 0.4760 / 0.4756      | 0.3643 / 0.3653       | 0.4127 / 0.4132
SYSTRAN (0.789 / 0.508)     | 0.4002          | 0.4864 / 0.4813      | 0.3759 / 0.3734       | 0.4241 / 0.4206
Corr r(2) with [ade] – MT   | 0.5918          | 0.3399 / 0.3602      | 0.7966 / 0.8306       | 0.6479 / 0.6935
Corr r(2) with [flu] – MT   | 0.9807          | 0.9665 / 0.9721      | 0.8980 / 0.8505       | 0.9853 / 0.9699
Results for S-Score
System ([ade] / [flu])      | BLEU (refs 1&2) | Prec.(w) ref1 / ref2 | Recall(w) ref1 / ref2 | Fscore(w) ref1 / ref2
CANDIDE (0.677 / 0.455)     | 0.3561          | 0.4570 / 0.4524      | 0.3281 / 0.3254       | 0.3820 / 0.3785
GLOBALINK (0.710 / 0.381)   | 0.3199          | 0.4054 / 0.4036      | 0.3086 / 0.3086       | 0.3504 / 0.3497
MS (0.718 / 0.382)          | 0.3003          | 0.3963 / 0.3969      | 0.3237 / 0.3259       | 0.3563 / 0.3579
REVERSO (NA / NA)           | 0.3823          | 0.4547 / 0.4540      | 0.3563 / 0.3574       | 0.3996 / 0.4000
SYSTRAN (0.789 / 0.508)     | 0.4002          | 0.4633 / 0.4585      | 0.3666 / 0.3644       | 0.4094 / 0.4061
Corr r(2) with [ade] – MT   | 0.5918          | 0.2945 / 0.2996      | 0.8046 / 0.8317       | 0.6184 / 0.6492
Corr r(2) with [flu] – MT   | 0.9807          | 0.9525 / 0.9555      | 0.9093 / 0.8722       | 0.9942 / 0.9860
Results
• The weighted n-gram model beats BLEU in correlation with adequacy
• The f-score metric is the most strongly correlated with fluency
• Scores based on a single reference translation are stable
Conclusions
• The Bleu model can be too coarse to differentiate between very different MT systems
• Adequacy is harder to predict than fluency
• Adding weights and using recall and f-scores can bring higher correlations with adequacy and fluency scores
References
• Chris Callison-Burch, Miles Osborne and Philipp Koehn. 2006. Re-evaluating the Role of Bleu in Machine Translation Research. To appear in EACL-06.
• Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02). Philadelphia, PA. July 2002. pp. 311-318.
• Bogdan Babych and Anthony Hartley. 2004. Extending the BLEU MT Evaluation Method with Frequency Weightings. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04). Barcelona, Spain. July 2004.
• Dan Melamed, Ryan Green and Joseph P. Turian. 2003. Precision and Recall of Machine Translation. In Proceedings of the Human Language Technology Conference (HLT-NAACL), pages 61-63. Edmonton, Alberta. May 2003. http://citeseer.csail.mit.edu/melamed03precision.html
• Deborah Coughlin. 2003. Correlating Automated and Human Assessments of Machine Translation Quality. In Proceedings of MT Summit IX.
• LDC. 2005. Linguistic Data Annotation Specification: Assessment of Fluency and Adequacy in Translations. Revision 1.5.
Precision and Bleu
• The Brevity Penalty is designed to compensate for overly terse translations:
BP = 1 if c > r; e^(1 − r/c) if c ≤ r
where c = length of the corpus of hypothesis translations and r = effective reference corpus length
Precision and Bleu
• Thus, the total Bleu score is:
BLEU = BP × exp( Σ_{n=1}^{N} w_n log p_n )
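Putting the pieces together, a sketch of the full metric; it reuses modified_precision() from the earlier block, and the uniform weights w_n = 1/N and the closest-reference-length rule for r are the conventional choices, assumed here rather than taken from the slides:

```python
import math

def bleu(hypotheses, references_list, max_n=4):
    """Corpus BLEU sketch: brevity penalty times the geometric mean of
    modified n-gram precisions, with uniform weights w_n = 1/max_n."""
    c = sum(len(h.split()) for h in hypotheses)
    # Effective reference length: per sentence, the reference length
    # closest to the hypothesis length (shorter wins ties).
    r = sum(min((len(ref.split()) for ref in refs),
                key=lambda L: (abs(L - len(hyp.split())), L))
            for hyp, refs in zip(hypotheses, references_list))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    precisions = [modified_precision(hypotheses, references_list, n)
                  for n in range(1, max_n + 1)]
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) undefined; real implementations smooth this
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))
```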
Flaws in the Use of Bleu
• Papers report experiments evaluated with Bleu, but with no manual evaluation (Callison-Burch 2005)