10. Lucia Specia (USFD) Evaluation of Machine Translation


Transcript of 10. Lucia Specia (USFD) Evaluation of Machine Translation


Translation Quality Assessment:

Evaluation and Estimation

Lucia Specia

University of [email protected]

EXPERT Winter School, 12 November 2013


Overview

“Machine Translation evaluation is better understood than Machine Translation”

(Carbonell and Wilks, 1991)


Outline

1 Translation quality

2 Manual metrics

3 Task-based metrics

4 Reference-based metrics

5 Quality estimation

6 Conclusions


Why is evaluation important?

Compare MT systems

Measure progress of MT systems over time

Diagnosis of MT systems

Assess (and pay) human translators

Quality assurance

Tuning of SMT systems

Decision on fitness-for-purpose

...


Why is evaluation hard?

What does quality mean?

Fluent? Adequate? Easy to post-edit?

Quality for whom/what?

End-user: gisting (Google Translate), internal communications, or publication (dissemination)
MT system: tuning or diagnosis
Post-editor: draft translations (light vs heavy post-editing)
Other applications, e.g. CLIR


Overview

Ref: Do not buy this product, it’s their craziest invention!

MT: Do buy this product, it’s their craziest invention!

Severe if end-user does not speak source language

Trivial to post-edit by translators

Ref: The battery lasts 6 hours and it can be fully recharged in 30 minutes.

MT: Six-hour battery, 30 minutes to full charge last.

Ok for gisting - meaning preserved

Very costly for post-editing if style is to be preserved


Overview

How do we measure quality?

Manual metrics:

Error counts, ranking, acceptability, 1-N judgements on fluency/adequacy
Task-based human metrics: productivity tests (HTER, PE time, keystrokes), user satisfaction, reading comprehension

Automatic metrics:

Based on human references: BLEU, METEOR, TER, ...
Reference-less: quality estimation


Outline

1 Translation quality

2 Manual metrics

3 Task-based metrics

4 Reference-based metrics

5 Quality estimation

6 Conclusions


Judgements in an n-point scale

Adequacy using 5-point scale (NIST-like)

5 All meaning expressed in the source fragment appears in the translation fragment.

4 Most of the source fragment meaning is expressed in the translation fragment.

3 Much of the source fragment meaning is expressed in the translation fragment.

2 Little of the source fragment meaning is expressed in the translation fragment.

1 None of the meaning expressed in the source fragment is expressed in the translation fragment.


Judgements in an n-point scale

Fluency using 5-point scale (NIST-like)

5 Native language fluency. No grammar errors, good word choice and syntactic structure in the translation fragment.

4 Near native language fluency. Few terminology or grammar errors which don’t impact the overall understanding of the meaning.

3 Not very fluent. About half of the translation contains errors.

2 Little fluency. Wrong word choice, poor grammar and syntactic structure.

1 No fluency. Absolutely ungrammatical and for the most part doesn’t make any sense.


Judgements in an n-point scale

Issues:

Subjective judgements

Hard to reach significant agreement

Is it reliable at all? Can we use multiple annotators?

Are fluency and adequacy really separable?

Ref: Absolutely ungrammatical and for the most part doesn’t make any sense.

MT: Absolutely sense doesn’t ungrammatical for the and most make any part.


Ranking

WMT-13 Appraise tool: rank translations best-worst (w. ties)


Ranking

Issues:

Subjective judgements: what does “best” mean?

Hard to judge for long sentences

Ref: The majority of existing work focuses on predicting some form of post-editing effort to help professional translators.

MT1: Few of the existing work focuses on predicting some form of post-editing effort to help professional translators.

MT2: The majority of existing work focuses on predicting some form of post-editing effort to help machine translation.

Only serve for comparison purposes - the best system might not be good enough

Absolute evaluation can do both


Error counts

More fine-grained

Aimed at diagnosis of MT systems, quality control of human translation.

E.g.: Multidimensional Quality Metrics (MQM)

Machine and human translation quality
Takes quality of source text into account
Actual metric is based on a specification


MQM

Issues selected based on a given specification (dimensions):

Language/locale

Subject field/domain

Text Type

Audience

Purpose

Register

Style

Content correspondence

Output modality, ...


MQM

Issue types (core):

Altogether: 120 categories


MQM

Issue types: http://www.qt21.eu/launchpad/content/high-level-structure-0

Combining issue types: TQ = 100 − AccP − (FluP_T − FluP_S) − (VerP_T − VerP_S)

translate5: open source graphical (Web) interface for inline error annotation: www.translate5.net
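As a rough illustration (not from the original slides), the combination formula can be read as a small scoring function. The argument names spell out the target/source penalty pairs behind AccP, FluP and VerP, and the example penalty values are made up:

```python
def mqm_tq(acc_p, flu_p_tgt, flu_p_src, ver_p_tgt, ver_p_src):
    """Overall MQM quality: accuracy penalties on the target, plus fluency
    and verity penalties on the target discounted by those already present
    in the source text."""
    return 100 - acc_p - (flu_p_tgt - flu_p_src) - (ver_p_tgt - ver_p_src)

# Hypothetical penalties: 5 accuracy points, fluency 8 (target) vs 3 (source),
# verity 2 (target) vs 1 (source)
print(mqm_tq(5, 8, 3, 2, 1))   # 89
```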


Error counts

Issues:

Time consuming

Requires training, esp. to distinguish between fine-grained error types

Different errors are more relevant for different specifications: need to select and weight them accordingly


Outline

1 Translation quality

2 Manual metrics

3 Task-based metrics

4 Reference-based metrics

5 Quality estimation

6 Conclusions


Post-editing

Productivity analysis: Measure translation quality within task. E.g. Autodesk - Productivity test through post-editing

2-day translation and post-editing, 37 participants
In-house Moses (Autodesk data: software)
Time spent on each segment


Post-editing

PET: Records time, keystrokes, edit distance


Post-editing - PET

How often post-editing (PE) a translation tool output is faster than translating from scratch (HT):

System     Faster than HT
Google     94%
Moses      86.8%
Systran    81.20%
Trados     72.40%

Comparing the time to translate from scratch with the time to PE MT, in seconds:

Annotator    HT (s)    PE (s)    HT/PE    PE/HT
Average      31.89     18.82     1.73     0.59
Deviation     9.99      6.79     0.26     0.09


User satisfaction

Solving a problem: E.g.: Intel measuring user satisfaction with un-edited MT

Translation is good if customer can solve problem

MT for Customer Support websites

Overall customer satisfaction: 75% for English→Chinese
95% reduction in cost
Project cycle from 10 days to 1 day
From 300 to 60,000 words translated/hour
Customers in China using MT texts were more satisfied with support than natives using original texts (68%)!


Reading comprehension

Defense language proficiency test (Jones et al., 2005):


Reading comprehension

MT quality: function of

1 Text passage comprehension, as measured by answer accuracy, and

2 Time taken to complete a test item (read a passage + answer its questions)


Reading comprehension

Compared to Human Translation (HT):


Task-based metrics

Issues:

Final goal needs to be very clear

Can be more cost/time consuming

Final task has to have a meaningful metric

Other elements may affect the final quality measurement (e.g. Chinese vs. Americans)


Outline

1 Translation quality

2 Manual metrics

3 Task-based metrics

4 Reference-based metrics

5 Quality estimation

6 Conclusions


Automatic metrics

Compare output of an MT system to one or more reference (usually human) translations: how close is the MT output to the reference translation?

Numerous metrics: BLEU, NIST, etc.

Advantages:
Fast and cheap, minimal human labour, no need for bilingual speakers
Once test set is created, can be reused many times
Can be used on an on-going basis during system development to test changes


Automatic metrics

Disadvantages:

Very few metrics look at variable ways of saying the same thing (word-level): stems, synonyms, paraphrases
Individual sentence scores are not very reliable, aggregate scores on a large test set are required
Very few of these metrics penalise different mismatches differently
Reference translations are only a subset of the possible good translations


String matching

BLEU: BiLingual Evaluation Understudy

Most widely used metric, both for MT system evaluation/comparison and SMT tuning

Matching of n-grams between MT and Ref: rewards same words in equal order

#clip(g): count of reference n-grams g which happen in a hypothesis sentence h, clipped by the number of times g appears in the reference sentence for h; #(g'): number of n-grams in the hypotheses

n-gram precision p_n for a set of MT translations H:

p_n = [ Σ_{h∈H} Σ_{g∈ngrams(h)} #clip(g) ] / [ Σ_{h∈H} Σ_{g'∈ngrams(h)} #(g') ]
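A minimal sketch of the clipped (modified) n-gram precision above, assuming tokenised sentences and, for brevity, a single reference per hypothesis (BLEU proper clips against the maximum count over all references):

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hyps, refs, n):
    """Corpus-level modified n-gram precision p_n over parallel lists of
    hypothesis and reference token lists."""
    matched = total = 0
    for hyp, ref in zip(hyps, refs):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # clip each hypothesis n-gram count by its count in the reference
        matched += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total += sum(hyp_counts.values())
    return matched / total if total else 0.0
```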


BLEU

Combine (mean of the log) the n-gram precisions p_1 .. p_N: (1/N) Σ_{n=1}^{N} log p_n

Bias towards translations with fewer words (denominator)
Brevity penalty to penalise MT sentences that are shorter than the reference

Compares the overall number of words w_h of the entire hypotheses set with the reference length w_r:

BP = 1 if w_h ≥ w_r; e^(1 − w_r/w_h) otherwise

BLEU = BP · exp( (1/N) Σ_{n=1}^{N} log p_n )
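Continuing the sketch, the brevity penalty and the final score can be assembled as below, reusing clipped_precision from the previous snippet. This is a simplification of the official BLEU implementation, not a replacement for it:

```python
import math

def bleu(hyps, refs, max_n=4):
    """Corpus BLEU: geometric mean of p_1..p_max_n times the brevity penalty."""
    precisions = [clipped_precision(hyps, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0                      # any zero precision zeroes the geometric mean
    log_mean = sum(math.log(p) for p in precisions) / max_n
    w_h = sum(len(h) for h in hyps)     # total hypothesis length
    w_r = sum(len(r) for r in refs)     # total reference length
    bp = 1.0 if w_h >= w_r else math.exp(1 - w_r / w_h)
    return bp * math.exp(log_mean)
```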


BLEU

Scale: 0-1, but highly dependent on the test set

Rewards fluency by matching high n-grams (up to 4)

Adequacy rewarded by unigrams and brevity penalty – poor model of recall

Synonyms and paraphrases only handled if they are in any of the reference translations

All tokens are equally weighted: missing out on a content word = missing out on a determiner

Better for evaluating changes in the same system than comparing different MT architectures


BLEU

Not good at sentence-level, unless smoothing is applied:

Ref: the Iraqi weapons are to be handed over to the army within two weeks

MT: in two weeks Iraq’s weapons will give army

1-gram precision: 4/8
2-gram precision: 1/7
3-gram precision: 0/6
4-gram precision: 0/5
BLEU = 0
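For sentence-level use, off-the-shelf implementations expose smoothing options. A quick check of the example with NLTK (assuming NLTK is installed; the exact smoothed value depends on the smoothing method chosen):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = "the Iraqi weapons are to be handed over to the army within two weeks".split()
hyp = "in two weeks Iraq's weapons will give army".split()

# Unsmoothed 4-gram BLEU collapses towards 0: the 3- and 4-gram precisions are 0
print(sentence_bleu([ref], hyp))
# Smoothing (here: add a small epsilon to zero counts) gives a usable non-zero score
print(sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method1))
```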


BLEU

Importance of clipping and brevity penalty

Ref1: the Iraqi weapons are to be handed over to the army within two weeks

Ref2: the Iraqi weapons will be surrendered to the army in two weeks

MT: the the the the
Count for the should be clipped at 2: max count of the word in any reference. Unigram score = 2/4 (not 4/4)

MT: Iraqi weapons will be
1-gram precision: 4/4
2-gram precision: 3/3
3-gram precision: 2/2
4-gram precision: 1/1
Precision (p_n) = 1

Precision score penalised because h < r
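The clipping rule on the first example can be checked by hand (a standalone sketch; references and counts copied from the slide, clipping against the maximum count in any reference):

```python
from collections import Counter

refs = ["the Iraqi weapons are to be handed over to the army within two weeks".split(),
        "the Iraqi weapons will be surrendered to the army in two weeks".split()]
hyp = "the the the the".split()

max_ref_count = max(ref.count("the") for ref in refs)   # "the" appears at most 2 times
clipped = min(Counter(hyp)["the"], max_ref_count)       # 4 clipped to 2
print(clipped / len(hyp))                               # 0.5, not 1.0
```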


Edit distance

WER: Word Error Rate:

Levenshtein edit distance

Minimum proportion of insertions, deletions, and substitutions needed to transform an MT sentence into the reference sentence

Heavily penalises reorderings: correct translation in the wrong location: deletion + insertion

WER = (S + D + I) / N

PER: Position-independent word Error Rate:

Does not penalise reorderings: output and reference sentences are unordered sets
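A minimal sketch of WER via Levenshtein distance over tokens, plus a simplified position-independent variant (PER definitions vary slightly in how extra hypothesis words are counted, so this is only one reasonable reading):

```python
from collections import Counter

def wer(hyp, ref):
    """Word Error Rate: (S + D + I) / N via dynamic-programming edit distance."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def per(hyp, ref):
    """Simplified position-independent error rate: bag-of-words mismatches over N."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)
```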


Edit distance: TER

TER: Translation Error Rate

Adds shift operation

REF: SAUDI ARABIA denied this week information published in the AMERICAN new york times

HYP: [this week] the saudis denied information published in the ***** new york times

1 Shift, 2 Substitutions, 1 Deletion → 4 Edits: TER = 4/13 = 0.31

Human-targeted TER (HTER)

TER between MT and its post-edited version
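The arithmetic of the example, spelled out (a trivial check; HTER uses the same computation with the post-edited sentence standing in for the reference):

```python
# Edit counts from the example above
shifts, substitutions, deletions, insertions = 1, 2, 1, 0
ref_length = 13

ter = (shifts + substitutions + deletions + insertions) / ref_length
print(round(ter, 2))   # 0.31
```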


Alignment-based

METEOR:

Unigram Precision and Recall

Align MT output with reference. Take best scoring pair for multiple refs.

Matching considers word inflection variations (stems), synonyms/paraphrases

Fluency addressed via a direct penalty: fragmentation of the matching

METEOR score = F-mean score discounted for fragmentation = F-mean * (1 - DF)


METEOR

Example:

Ref: the Iraqi weapons are to be handed over to the army within two weeks

MT: in two weeks Iraq’s weapons will give army

Matching:

Ref: Iraqi weapons army two weeks

MT: two weeks Iraq’s weapons army

P = 5/8 = 0.625

R = 5/14 = 0.357

F-mean = 10*P*R/(9P+R) = 0.3731

Fragmentation: 3 frags of 5 words = (3)/(5) = 0.6

Discounting factor: DF = 0.5 * (0.6**3) = 0.108

METEOR: F-mean * (1 - DF) = 0.373 * 0.892 = 0.333
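The arithmetic of the example, spelled out (numbers taken from the slide; full METEOR also performs stem/synonym matching when building the alignment):

```python
matches, hyp_len, ref_len, chunks = 5, 8, 14, 3   # matched unigrams, lengths, matched chunks

P = matches / hyp_len                  # unigram precision             -> 0.625
R = matches / ref_len                  # unigram recall                -> 0.357
f_mean = 10 * P * R / (9 * P + R)      # recall-weighted harmonic mean -> 0.373
frag = chunks / matches                # fragmentation                 -> 0.6
DF = 0.5 * frag ** 3                   # fragmentation discount        -> 0.108
meteor = f_mean * (1 - DF)             # final score                   -> 0.333
```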


Others

WMT shared task on metrics:

TerrorCat

DepRef

MEANT and TINE

TESLA

LEPOR

ROSE

AMBER

Many other linguistically motivated metrics where matching is not done at the word-level (only)

...


Outline

1 Translation quality

2 Manual metrics

3 Task-based metrics

4 Reference-based metrics

5 Quality estimation

6 Conclusions


Overview

Quality estimation (QE): metrics that provide an estimate on the quality of unseen translations

No access to reference translations

Quality defined by the data

Quality = Can we publish it as is?

Quality = Can a reader get the gist?

Quality = Is it worth post-editing it?

Quality = How much effort to fix it?


Framework

Training: X (examples of source texts & translations) and Y (quality scores for the examples in X) → feature extraction → features → machine learning → QE model


Framework

Prediction: source text s' → MT system → translation t'; feature extraction → features → QE model → quality score y'


Framework

Main components to build a QE system:

1 Definition of quality: what to predict

2 (Human) labelled data (for quality)

3 Features

4 Machine learning algorithm


Definition of quality

Predict 1-N absolute scores for adequacy/fluency

Predict 1-N absolute scores for post-editing effort

Predict average post-editing time per word

Predict relative rankings

Predict relative rankings for same source

Predict percentage of edits needed for sentence

Predict word-level edits and their types

Predict BLEU, etc. scores for document


Datasets

SHEF (several): http://staffwww.dcs.shef.ac.uk/people/L.Specia/resources.html

LIG (10K, fr-en): http://www-clips.imag.fr/geod/User/marion.potet/index.php?page=download

LIMSI (14K, fr-en, en-fr, 2 post-editors): http://web.limsi.fr/Individu/wisniews/recherche/index.html


Features

Indicators extracted from the source text, the MT system and the translation: complexity indicators, confidence indicators, fluency indicators, adequacy indicators


QuEst

Goal: framework to explore features for QE

Feature extractors for 150+ features of all types: Java

Machine learning: wrappers for a number of algorithms in the scikit-learn toolkit, grid search, feature selection

Open source: http://www.quest.dcs.shef.ac.uk/


State of the art in QE

WMT12-13 shared tasks

Sentence- and word-level estimation of PE effort

Datasets and language pairs:

Quality                                  Year      Languages
1-5 subjective scores                    WMT12     en-es
Ranking all sentences best-worst         WMT12/13  en-es
HTER scores                              WMT13     en-es
Post-editing time                        WMT13     en-es
Word-level edits: change/keep            WMT13     en-es
Word-level edits: keep/delete/replace    WMT13     en-es
Ranking 5 MTs per source                 WMT13     en-es; de-en

Evaluation metric:

MAE = ( Σ_{i=1}^{N} |H(s_i) − V(s_i)| ) / N
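MAE is the average absolute difference between predicted scores H and human scores V; a quick check with made-up numbers:

```python
H = [2.8, 4.1, 3.0]        # hypothetical predicted quality scores
V = [3.0, 4.5, 2.5]        # hypothetical human (gold) scores
mae = sum(abs(h - v) for h, v in zip(H, V)) / len(H)
print(round(mae, 3))       # 0.367
```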


Baseline system

Features:

number of tokens in the source and target sentences

average source token length

average number of occurrences of words in the target

number of punctuation marks in source and target sentences

LM probability of source and target sentences

average number of translations per source word

% of source 1-grams, 2-grams and 3-grams in frequency quartiles 1 and 4

% of seen source unigrams

SVM regression with RBF kernel with the parameters γ, ε and C optimised using a grid-search and 5-fold cross validation on the training set
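A minimal sketch of such a baseline with scikit-learn, using only a handful of the resource-free surface features above (no LM or lexical-translation features). train_pairs, train_labels and test_pairs are placeholder variables, and the parameter grids are illustrative:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

PUNCT = set(",.;:!?")

def surface_features(src, tgt):
    """A few baseline features: token counts, average source token length,
    average target word repetition, punctuation counts."""
    s, t = src.split(), tgt.split()
    return [
        len(s),                               # number of source tokens
        len(t),                               # number of target tokens
        sum(len(w) for w in s) / len(s),      # average source token length
        len(t) / len(set(t)),                 # avg occurrences of target words
        sum(c in PUNCT for c in src),         # source punctuation marks
        sum(c in PUNCT for c in tgt),         # target punctuation marks
    ]

# train_pairs: list of (source, MT) strings; train_labels: human quality scores
X = np.array([surface_features(s, t) for s, t in train_pairs])
y = np.array(train_labels)

# SVR with RBF kernel; C, gamma and epsilon tuned by grid search with 5-fold CV
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1], "epsilon": [0.1, 0.2]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
grid.fit(X, y)
predicted = grid.predict(np.array([surface_features(s, t) for s, t in test_pairs]))
```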


Results - scoring sub-task (WMT12)

System ID                    MAE     RMSE
• SDLLW M5PbestDeltaAvg      0.61    0.75
UU best                      0.64    0.79
SDLLW SVM                    0.64    0.78
UU bltk                      0.64    0.79
Loria SVMlinear              0.68    0.82
UEdin                        0.68    0.82
TCD M5P-resources-only*      0.68    0.82
Baseline bb17 SVR            0.69    0.82
Loria SVMrbf                 0.69    0.83
SJTU                         0.69    0.83
WLV-SHEF FS                  0.69    0.85
PRHLT-UPV                    0.70    0.85
WLV-SHEF BL                  0.72    0.86
DCU-SYMC unconstrained       0.75    0.97
DFKI grcfs-mars              0.82    0.98
DFKI cfs-plsreg              0.82    0.99
UPC 1                        0.84    1.01
DCU-SYMC constrained         0.86    1.12
UPC 2                        0.87    1.04
TCD M5P-all                  2.09    2.32


Results - scoring sub-task (WMT13)

System ID                MAE      RMSE
• SHEF FS                12.42    15.74
SHEF FS-AL               13.02    17.03
CNGL SVRPLS              13.26    16.82
LIMSI                    13.32    17.22
DCU-SYMC combine         13.45    16.64
DCU-SYMC alltypes        13.51    17.14
CMU noB                  13.84    17.46
CNGL SVR                 13.85    17.28
FBK-UEdin extra          14.38    17.68
FBK-UEdin rand-svr       14.50    17.73
LORIA inctrain           14.79    18.34
Baseline bb17 SVR        14.81    18.22
TCD-CNGL open            14.81    19.00
LORIA inctraincont       14.83    18.17
TCD-CNGL restricted      15.20    19.59
CMU full                 15.25    18.97
UMAC                     16.97    21.94


Open issues

Agreement between annotators

Absolute value judgements: difficult to achieve consistency even in highly controlled settings

WMT12: 30% of initial dataset discarded
Remaining annotations had to be scaled


Open issues

Annotation costs: active learning to select subset of instances to be annotated (Beck et al., ACL 2013)


Open issues

Curse of dimensionality: feature selection to identify relevant info for dataset (Shah et al., MT Summit 2013)

[Chart: prediction error across datasets (WMT12, EAMT-11 en-es, EAMT-11 fr-en, EAMT-09 s1-s4, GALE-11 s1-s2) for feature sets BL, AF, BL+PR, AF+PR and FS]

Common feature set identified, but nuanced subsets for specific datasets


Open issues

How to use estimated PE effort scores? Do users prefer detailed estimates (sub-sentence level), an overall estimate for the complete sentence, or not seeing bad sentences at all?

Too much information vs hard-to-interpret scores

IBM’s Goodness metric

MATECAT project investigating it


Outline

1 Translation quality

2 Manual metrics

3 Task-based metrics

4 Reference-based metrics

5 Quality estimation

6 Conclusions


Conclusions

(Machine) Translation evaluation & estimation: still an open problem

Different metrics for: different purposes/users, different needs, different notions of quality

Quality estimation: learning of these different notions, but requires labelled data

Solution:

Think of what quality means in your scenario
Measure significance
Measure agreement if manual metrics
Use various metrics
Invent your own metric!


Translation Quality Assessment:

Evaluation and Estimation

Lucia Specia

University of [email protected]

EXPERT Winter School, 12 November 2013
