Challenges in Predicting Machine Translation Utility for Human Post-Editors
Michael Denkowski and Alon Lavie
Language Technologies Institute, Carnegie Mellon University
October 29, 2012
[Diagram: an MT system turns source text into a fast translation; human translators turn source text into a good translation. Can we get a good, fast translation?]
MT with Human Post-Editing
[Diagram: the MT system's fast translation is handed to translators, who post-edit it into a good, fast translation. When the MT output is not usable, the outcome is instead a very slow re-translation.]
Introduction
Utility prediction: we need to reliably predict the usability of automatic translations.
"Referenceless" utility prediction:
• Corresponds to the confidence estimation task
• Confidence estimation for post-editing (Specia, 2011)
• WMT 2012 Shared Quality Estimation Task, for post-editing (Callison-Burch et al., 2012)
Reference-aided utility prediction:
• Corresponds to the MT evaluation task
• This work
This Work
Machine translation as a starting point for human translators
• Goal is utility for post-editing
• Compare post-editing to traditional adequacy-driven tasks
Examine results of a post-editing experiment
• Simulate a real-world localization scenario
• Examine challenges in predicting translation usefulness for human translators
Adequacy Tasks
Adequacy: semantic similarity to reference translations
Significant research efforts on improving end quality of machine translation:
• ACL Workshops on Statistical Machine Translation (Callison-Burch et al., 2011)
• NIST Open Machine Translation Evaluations (Przybocki et al., 2009)
Measured by absolute scores or rankings
Motivation: MT for user consumption, input for other NLP tasks
Post-Editing
Human-targeted translation edit rate (HTER, Snover et al., 2006)
1. Human translators correct MT output
2. Automatically calculate number of edits using TER
TER = (# of edits) / (# of reference words)
Edits: insertion, deletion, substitution, block shift
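As a concrete reference, here is a minimal sketch of the TER computation in Python. It assumes whitespace tokenization and ignores TER's block-shift operation, so it reduces to word-level Levenshtein distance; real TER implementations additionally search for block shifts at cost 1.

    def simple_ter(hypothesis, reference):
        """Word-level edit distance (insert/delete/substitute) divided by
        reference length. Full TER additionally allows block shifts."""
        hyp, ref = hypothesis.split(), reference.split()
        # d[i][j] = edits needed to turn hyp[:i] into ref[:j]
        d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
        for i in range(len(hyp) + 1):
            d[i][0] = i
        for j in range(len(ref) + 1):
            d[0][j] = j
        for i in range(1, len(hyp) + 1):
            for j in range(1, len(ref) + 1):
                sub = 0 if hyp[i - 1] == ref[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # delete from hyp
                              d[i][j - 1] + 1,        # insert into hyp
                              d[i - 1][j - 1] + sub)  # substitute or match
        return d[len(hyp)][len(ref)] / len(ref)

For HTER, the "reference" passed in is the human post-edited version of the MT output rather than an independent reference translation.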
Translation Example: WMT 2011 Czech–English Track
Ref: He was supposed to pay half a million to Lubos G.
1: He had for Lubosi G. to pay half a million crowns. (HTER: 0.27)
Post-edited: He had to pay Lubos G. half a million crowns.
2: He had to pay lubosi G. half a million kronor. (HTER: 0.09)
Post-edited: He had to pay Lubos G. half a million kronor.
Translation Example: WMT 2011 Czech–English Track
Ref: The problem is that life of the lines is two to four years.
1: The problem is that life is two lines, up to four years. (0.49, 0.29)
Post-edited: The problem is that life of the lines is two to four years.
2: The problem is that the durability of lines is two or four years. (0.34, 0.14)
Post-edited: The problem is that the life of lines is two to four years.
MT Post-Editing Experiment
90 sentences from Google Docs documentation
Translated from English to Spanish by two systems:
• Microsoft Translator
• Moses system (Europarl)
180 MT outputs total
Sent to human translators at the Kent State Institute for Applied Linguistics for post-editing
Translators never saw the reference translations
MT Post-Editing Experiment
Data collected from professional translators (in training):
Post-edited translations
Expert post-editing ratings:
1: No editing required
2: Minor editing, meaning preserved
3: Major editing, meaning lost
4: Re-translate
From parallel data:
Independent reference translations
MT Post-Editing Experiment
Evaluate post-edited results using standard MT evaluation metrics:
BLEU (Papineni et al., 2002):
• n-gram precision with a brevity penalty
TER (Snover et al., 2006):
• Minimum edit distance
Meteor (Denkowski and Lavie, 2011):
• Tunable alignment-based metric
Task: Reference-assisted utility prediction
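For intuition about what BLEU measures, here is a minimal sentence-level sketch (real BLEU is corpus-level, supports multiple references, and the smoothing used here is an illustrative shortcut rather than the standard scheme):

    from collections import Counter
    from math import exp, log

    def simple_bleu(hypothesis, reference, max_n=4):
        """Geometric mean of clipped 1..4-gram precisions times a brevity
        penalty. Corpus BLEU pools n-gram counts over all sentences."""
        hyp, ref = hypothesis.split(), reference.split()
        log_precisions = []
        for n in range(1, max_n + 1):
            hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
            ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            # Clip each n-gram's count by its count in the reference.
            matched = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            total = max(sum(hyp_ngrams.values()), 1)
            log_precisions.append(log(max(matched, 0.1) / total))  # crude smoothing
        brevity = 1.0 if len(hyp) >= len(ref) else exp(1 - len(ref) / max(len(hyp), 1))
        return brevity * exp(sum(log_precisions) / max_n)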
MT Post-Editing Results
Average rating: 1.69
Average HTER: 12.4
Automatic metric scores:

                 BLEU   TER    Meteor
    Post-edited  79.2   12.4   90.0
    MT vs Ref    31.7   49.5   58.2
    Post vs Ref  34.1   48.3   59.2
MT Post-Editing Results
    r        4-pt   BLEU   TER    Meteor
    4-point  –      0.32   0.28   0.33
    HTER     0.49   0.26   0.24   0.27
Metric correlation with post-editing scores
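The table's r is presumably plain Pearson correlation between sentence-level metric scores and the human scores; a self-contained sketch (the talk does not specify the exact implementation used):

    def pearson_r(xs, ys):
        """Pearson correlation between two equal-length lists of scores."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
        sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
        return cov / (sd_x * sd_y)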
MT Post-Editing Experiment
Oracle experiment: tune Meteor to maximize correlation
How well can we (over)fit expert post-editing ratings?
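The oracle tuning loop amounts to a search over Meteor's parameters for the setting whose scores best correlate with the ratings. A hypothetical sketch, where score_with_params stands in for running Meteor under one parameter setting (this is not the actual Meteor tuning code):

    from itertools import product

    def oracle_tune(segments, ratings, param_grid, score_with_params):
        """Grid-search parameter settings; keep the one whose segment-level
        scores correlate best with expert ratings (deliberate overfitting)."""
        best_r, best_setting = float("-inf"), None
        for values in product(*param_grid.values()):
            setting = dict(zip(param_grid.keys(), values))
            scores = [score_with_params(seg, setting) for seg in segments]
            r = pearson_r(scores, ratings)  # from the sketch above
            if r > best_r:
                best_r, best_setting = r, setting
        return best_setting, best_r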
The Meteor Metric
Flexible alignment:
Scoring features:
• Precision/Recall contribution (insertions, deletions)
• Fragmentation penalty (reordering)
• Content/function word contribution
• Flexible match weights
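A stripped-down sketch of Meteor's scoring function using exact matches only (real Meteor also matches stems, synonyms, and paraphrases, solves a minimum-fragmentation alignment, and weights content vs. function words; the parameter defaults below are illustrative, not Meteor's tuned values):

    def simple_meteor(hypothesis, reference, alpha=0.85, beta=1.5, gamma=0.45):
        """Weighted P/R harmonic mean, discounted by a fragmentation penalty
        computed from the number of contiguous chunks of matched words."""
        hyp, ref = hypothesis.split(), reference.split()
        # Greedy exact matching: each hyp word takes the first unused ref word.
        used = [False] * len(ref)
        pairs = []  # (hyp_index, ref_index)
        for i, word in enumerate(hyp):
            for j, rword in enumerate(ref):
                if not used[j] and word == rword:
                    used[j] = True
                    pairs.append((i, j))
                    break
        m = len(pairs)
        if m == 0:
            return 0.0
        precision, recall = m / len(hyp), m / len(ref)
        fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
        # A new chunk starts whenever consecutive matches are not adjacent
        # in both the hypothesis and the reference.
        chunks = 1 + sum(1 for (i1, j1), (i2, j2) in zip(pairs, pairs[1:])
                         if i2 != i1 + 1 or j2 != j1 + 1)
        penalty = gamma * (chunks / m) ** beta
        return (1 - penalty) * fmean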
MT Post-Editing Results
    r        4-pt   BLEU   TER    Meteor   Meteor (oracle)
    4-point  –      0.32   0.28   0.33     0.35
    HTER     0.49   0.26   0.24   0.27     0.34
Metric correlation with post-editing scores
MT Post-Editing Experiment
Additional experiment: translation usability
Divide translations into two groups:
• Suitable for post-editing (1-2)
• Not suitable for post-editing (3-4)
Examine metric score distribution of each group
Assess metric ability to distinguish between usable and non-usable translations
Unfair advantage: reference translations
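One simple way to quantify the separation the following histograms show is the accuracy of the best single score cutoff; a sketch (my framing, not a statistic reported in the talk):

    def best_threshold_accuracy(scores, usable):
        """Best accuracy achievable by classifying a translation as usable
        (rating 1-2) when its metric score is at least some cutoff t."""
        n = len(scores)
        # Majority-class baseline covers the "no useful cutoff" case.
        best = max(sum(usable), n - sum(usable)) / n
        for t in sorted(set(scores)):
            correct = sum((s >= t) == u for s, u in zip(scores, usable))
            best = max(best, correct / n)
        return best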
Usability Experiment Results
[Histograms: number of sentences by BLEU score (left) and by oracle Meteor score (right), split into usable and non-usable translations.]
Larger Data Set
Are our results skewed by the small size of the data set (180 sentences)?
WMT12 Quality Estimation Task:
1832 English-to-Spanish MT outputs
HTER scores and 5-point multiple-expert ratings
Run usability experiment with this data
WMT 2012 Quality Estimation Task Data
[Histograms: number of sentences by BLEU score (left) and by oracle Meteor score (right) on the WMT 2012 Quality Estimation data, split into usable and non-usable translations.]
Usability vs HTER
How well do experts and HTER agree?
[Histograms: number of sentences by HTER, split into usable and non-usable translations; left: Kent State data, right: WMT 2012 data.]
Usability vs HTER (WMT12)
[Scatter plot: expert rating (1–5) against HTER (0–100) for the WMT12 data.]
Conclusions
MT for post-editing utility is a significantly different task from MT for adequacy.
Current MT tools under-perform when predicting post-editing usability.
Even metrics that use post-editing information (HTER) don't match expert assessments.
To improve post-editing usability, we need better data, better metrics, and better MT systems.
www.transcenter.info
Challenges in Predicting Machine Translation Utility for Human Post-Editors
Michael Denkowski and Alon Lavie
Language Technologies Institute, Carnegie Mellon University
October 29, 2012