Lucia Specia - Estimativa de qualidade em TA

Estimativa da qualidade da tradução automática (Machine Translation Quality Estimation). Lucia Specia, University of Sheffield. Faculdade de Letras da Universidade do Porto, 13 May 2013. Sections: Quality of Machine Translation, Quality Estimation, Open issues, Conclusions.

Page 1: Lucia Specia - Estimativa de qualidade em TA

Quality of Machine Translation Quality Estimation Open issues Conclusions

Estimativa da qualidade da tradução automática (Machine Translation Quality Estimation)

Lucia Specia

University of Sheffield

Faculdade de Letras da Universidade do Porto
13 May 2013

Estimativa da qualidade da traducao automatica 1 / 31

Page 2: Lucia Specia - Estimativa de qualidade em TA

Outline

1 Quality of Machine Translation

2 Quality Estimation

3 Open issues

4 Conclusions

Page 4: Lucia Specia - Estimativa de qualidade em TA

Introduction

Machine Translation:

Around since the early 1950s

Increasingly popular since 1990: statistical approaches

Software tools and data available to build translation systems - Moses and others

Increasing demand for cheaper and faster translations

How do we measure quality and progress over time?

So far... mostly automatic evaluation metrics

Page 9: Lucia Specia - Estimativa de qualidade em TA

MT evaluation metrics

N-gram matching between system output and one or more reference translations: BLEU and many others

Issue 1: Too many possible good-quality translations; need thousands of references to capture valid variations

Solution: HyTER (Language Weaver) annotation tool to generate all possible correct translations! [DM12]

  - Translations built bottom-up from word/phrase translation equivalents using FSA
  - 2-2.5 hours' worth of expert annotation per sentence
  - One annotator: 5.2 × 10^6 paths
  - A bunch of annotators: 8.5 × 10^11 paths
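
The path counts above grow so quickly because alternatives at each position of a compact network multiply. The sketch below is a toy illustration only, not HyTER's implementation: the lattice, its phrases and the `count_paths` helper are all assumptions made up for this example.

```python
from functools import lru_cache

# Hypothetical toy lattice: each state maps to a list of (phrase, next_state) arcs.
LATTICE = {
    0: [("the battery", 1), ("its battery", 1)],
    1: [("lasts", 2), ("runs for", 2), ("works for", 2)],
    2: [("6 hours", 3), ("six hours", 3)],
    3: [],  # final state, no outgoing arcs
}
FINAL = 3


def count_paths(lattice, start, final):
    """Count distinct start-to-final paths in an acyclic lattice."""

    @lru_cache(maxsize=None)
    def paths_from(state):
        if state == final:
            return 1
        # Sum the path counts of every successor reachable through one arc.
        return sum(paths_from(nxt) for _, nxt in lattice[state])

    return paths_from(start)


print(count_paths(LATTICE, 0, FINAL))  # 2 * 3 * 2 = 12
```

Three positions with 2, 3 and 2 alternatives already give 12 translations; a full sentence with many such positions reaches millions of paths.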

Page 12: Lucia Specia - Estimativa de qualidade em TA

MT evaluation metrics

Issue 2: Difficult to quantify severity of mismatching n-grams

  ref: Do not buy this product, it’s their craziest invention!
  sys: Do buy this product, it’s their craziest invention!

Some attempts to weight mismatches differently - sparse, lexicalised approach

However, the same error is more or less important depending on the user or purpose:

  - Severe if the end-user does not speak the source language
  - Trivial for translators to post-edit
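
To see why n-gram matching barely notices the dropped negation, here is a minimal sketch of clipped n-gram precision, one component of BLEU (not full BLEU: no brevity penalty, no geometric mean over n). The tokenisation and the function names are assumptions made for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(sys_tokens, ref_tokens, n):
    """Share of system n-grams found in the reference, with clipped counts."""
    sys_counts = Counter(ngrams(sys_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    total = sum(sys_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in sys_counts.items())
    return matched / total


# Hand-tokenised, lower-cased versions of the slide's example (an assumption).
ref = "do not buy this product , it 's their craziest invention !".split()
sys_out = "do buy this product , it 's their craziest invention !".split()

print(clipped_precision(sys_out, ref, 1))  # 1.0: every system word is in the reference
print(clipped_precision(sys_out, ref, 2))  # 0.9: only the bigram ("do", "buy") is unmatched
```

The scores stay near perfect even though the meaning is reversed.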

Page 16: Lucia Specia - Estimativa de qualidade em TA

MT evaluation metrics

Conversely:

  ref: The battery lasts 6 hours and it can be fully recharged in 30 minutes.
  sys: Six-hours battery, 30 minutes to full charge last.

  - OK for gisting - meaning preserved
  - Very costly for post-editing if style is to be preserved

Page 18: Lucia Specia - Estimativa de qualidade em TA

Task-based evaluation

Measure translation quality within a task. E.g. Autodesk - productivity test through post-editing [Aut11]

  - 2-day translation and post-editing, 37 participants
  - In-house Moses (Autodesk data: software)
  - Time spent on each segment

Page 19: Lucia Specia - Estimativa de qualidade em TA

Task-based evaluation

E.g.: Intel - user satisfaction with unedited MT

Translation is good if the customer can solve the problem

MT for Customer Support websites [Int10]

  - Overall customer satisfaction: 75% for English→Chinese
  - 95% reduction in cost
  - Project cycle from 10 days to 1 day
  - From 300 to 60,000 words translated/hour
  - Customers in China using MT texts were more satisfied with support than natives using original texts (68%)!

MT for chat and community forums [Int12]

  - ∼60% “understandable and actionable” (→English/Spanish)
  - Max ∼10% “not understandable” (→Chinese)

Page 24: Lucia Specia - Estimativa de qualidade em TA

Outline

1 Quality of Machine Translation

2 Quality Estimation

3 Open issues

4 Conclusions

Page 25: Lucia Specia - Estimativa de qualidade em TA

Overview

Metrics either depend on references or on post-editing/use of translations (task-based)

Our proposal

Quality assessment without reference, prior to post-editing/use of translations

Page 27: Lucia Specia - Estimativa de qualidade em TA

Overview

Why don’t translators use (more) MT?

  - Translations are not good enough!
  - What about TMs? Aren’t fuzzy matches useful?

Page 31: Lucia Specia - Estimativa de qualidade em TA

Framework

Quality estimation (QE): provide an estimate of quality for new translated text *before* it is post-edited

Quality = post-editing effort

No access to reference translations: machine learning techniques to predict post-editing effort scores

Considers interaction with TM systems: only used for low fuzzy match cases, or to select between TM and MT

QTLaunchPad project

Multidimensional Quality Metrics for MT and HT, for manual and (semi-)automatic evaluation (QE): http://www.qt21.eu/launchpad/

Page 36: Lucia Specia - Estimativa de qualidade em TA

Framework

[Figure: QE framework diagram. Labels: Source text, MT system, Translation, QE system, Quality score; Examples: source & translations, quality scores; Quality indicators]
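
The framework can be sketched as: extract quality indicators from (source, translation) pairs, learn from examples with known quality scores, then predict a score for a new translation without any reference. The sketch below is a toy under stated assumptions: the two indicators, the nearest-neighbour learner, the training sentences and their effort scores are all invented for illustration and are not the system described in the talk.

```python
import math


def extract_features(source, translation):
    """Two toy quality indicators (assumed for illustration)."""
    src, tgt = source.split(), translation.split()
    length_ratio = len(tgt) / max(len(src), 1)
    # Share of target tokens copied verbatim from the source
    # (a crude proxy for untranslated words).
    copied = sum(1 for tok in tgt if tok in src) / max(len(tgt), 1)
    return (length_ratio, copied)


def train(examples):
    """For k-NN, 'training' just stores (feature vector, score) pairs."""
    return [(extract_features(s, t), score) for s, t, score in examples]


def predict(model, source, translation, k=2):
    """Average the scores of the k training examples closest in feature space."""
    query = extract_features(source, translation)
    nearest = sorted(model, key=lambda item: math.dist(item[0], query))[:k]
    return sum(score for _, score in nearest) / len(nearest)


# Made-up examples: (source, MT output, post-editing effort; higher = more effort).
EXAMPLES = [
    ("la casa es grande", "the house is big", 1.0),
    ("el gato duerme", "the cat sleeps", 1.0),
    ("no compre este producto", "no compre this product", 4.0),
    ("el libro es muy caro", "el libro is caro", 5.0),
]

model = train(EXAMPLES)
print(predict(model, "el perro come", "the dog eats"))          # 1.0
print(predict(model, "no abra la puerta", "no abra the door"))  # 4.5
```

Any regression or classification algorithm could replace the nearest-neighbour step; the point is only that the prediction needs the source and the translation, not a reference.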

Page 37: Lucia Specia - Estimativa de qualidade em TA

Examples of positive results

Time to post-edit subset of sentences predicted as “good” (low effort) vs time to post-edit random subset of sentences

  Language   no QE            QE
  fr-en      0.75 words/sec   1.09 words/sec
  en-es      0.32 words/sec   0.57 words/sec

Accuracy in selecting best translation among 4 MT systems

  Best MT system   Highest QE score
  54%              77%

Page 40: Lucia Specia - Estimativa de qualidade em TA

State-of-the-art

Quality indicators:

[Figure: Source text, MT system and Translation, annotated with complexity indicators, confidence indicators, fluency indicators and adequacy indicators]

Learning algorithms: wide range

Datasets: few with absolute human scores (1-4/5 scores, PE time, edit distance)

Page 43: Lucia Specia - Estimativa de qualidade em TA

Outline

1 Quality of Machine Translation

2 Quality Estimation

3 Open issues

4 Conclusions

Page 44: Lucia Specia - Estimativa de qualidade em TA

State-of-the-art indicators

Shallow indicators:

  - (S/T/S-T) Sentence length
  - (S/T) Language model
  - (S/T) Token-type ratio
  - (S) Average number of possible translations per word
  - (S) % of n-grams belonging to different frequency quartiles of a source language corpus
  - (T) Untranslated/OOV words
  - (T) Mismatching brackets, quotation marks
  - (S-T) Preservation of punctuation
  - (S-T) Word alignment score, etc.

These do well for estimating post-editing effort...

...but are not enough for other aspects of quality, e.g. adequacy
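
A few of the shallow indicators listed above need nothing beyond string processing. The following is a minimal sketch, assuming whitespace tokenisation and reading S as source, T as translation and S-T as a comparison of the two; the function names and the exact definitions (for instance, distinct words divided by tokens) are illustrative choices, not the talk's feature set.

```python
import string


def type_token_ratio(tokens):
    """Distinct words (types) divided by tokens; one common definition."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def brackets_balanced(text):
    """True if every bracket in the text is properly opened and closed."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in pairs.values():
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


def punctuation_count(text):
    return sum(ch in string.punctuation for ch in text)


def shallow_indicators(source, target):
    """Compute a small dictionary of shallow S, T and S-T indicators."""
    src, tgt = source.split(), target.split()
    return {
        "source_length": len(src),
        "target_length": len(tgt),
        "length_ratio": len(tgt) / len(src) if src else 0.0,
        "target_type_token_ratio": type_token_ratio(tgt),
        "target_brackets_balanced": brackets_balanced(target),
        "punctuation_difference": abs(punctuation_count(source) - punctuation_count(target)),
    }


features = shallow_indicators(
    "The battery (new) lasts 6 hours.",
    "A bateria (nova dura 6 horas",
)
print(features)
```

In this made-up pair the translation loses a closing bracket and the final full stop, which the last two indicators pick up.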

Page 46: Lucia Specia - Estimativa de qualidade em TA

State-of-the-art indicators

Linguistic indicators - count-based:

(S/T/S-T) Content/non-content words

(S/T/S-T) Nouns/verbs/... NP/VP/...

(S/T/S-T) Deictics (references)

(S/T/S-T) Discourse markers (references)

(S/T/S-T) Named entities

(S/T/S-T) Zero-subjects

(S/T/S-T) Pronominal subjects

(S/T/S-T) Negation indicators

(T) Subject-verb / adjective-noun agreement

(T) Language Model of POS

(T) Grammar checking (dangling words)

(T) Coherence

Page 47: Lucia Specia - Estimativa de qualidade em TA

State-of-the-art indicators

Linguistic indicators - alignment-based:

(S-T) Correct translation of pronouns

(S-T) Matching of dependency relations

(S-T) Matching of named entities

(S-T) Alignment of parse trees

(S-T) Alignment of predicates & arguments, etc.

Some indicators are language-dependent; others need resources that are language-dependent but apply to most languages, e.g. LM of POS tags

Page 49: Lucia Specia - Estimativa de qualidade em TA

State-of-the-art indicators

Fine-grained, lexicalised indicators:

\[
\text{target-word} = \text{``process''} =
\begin{cases}
1, & \text{if source-word} = \text{``hdhh alamlyt''} \\
0, & \text{otherwise}
\end{cases}
\]

\[
\text{target-word} = \text{``process''} =
\begin{cases}
1, & \text{if source-pos} = \text{``DT DTNN''} \\
0, & \text{otherwise}
\end{cases}
\]

Closer to error detection

Need large amounts of training data [BHAO11], or rule-based (RB) approaches
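
One possible reading of these indicators is a binary feature that fires for a specific pairing of a target word with the source phrase (or source POS sequence) it is aligned to. The sketch below follows that reading; the factory function, its argument names and the assumption that alignments are already available are all hypothetical.

```python
def lexicalised_indicator(target_word, source_condition):
    """Build a binary feature for one (target word, source condition) pair.

    The returned function yields 1 when the aligned target word equals
    target_word and its aligned source side equals source_condition, else 0.
    """

    def feature(aligned_target, aligned_source):
        return 1 if aligned_target == target_word and aligned_source == source_condition else 0

    return feature


# Word-based and POS-based variants of the slide's example.
process_from_word = lexicalised_indicator("process", "hdhh alamlyt")
process_from_pos = lexicalised_indicator("process", "DT DTNN")

print(process_from_word("process", "hdhh alamlyt"))  # 1
print(process_from_word("process", "something else"))  # 0
print(process_from_pos("process", "DT DTNN"))  # 1
```

Each such pair is its own feature, which is why this family is sparse and needs large amounts of training data.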

Page 51: Lucia Specia - Estimativa de qualidade em TA

Do these indicators work?

To some extent... Issues:

Representation of shallow/deep indicators: counts, ratios, (absolute) differences?

\[
F = S - T, \quad F = |S - T|, \quad F = \frac{T}{S}, \quad F = \frac{S - T}{S}, \ \ldots
\]

Resources to extract deep indicators: availability and reliability

Data to extract fine-grained indicators: need previously translated and post-edited data, esp. for negative examples
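
The four representations can be computed side by side for any count-based indicator. In this minimal sketch, `s` and `t` stand for the same count taken on the source and on the translation (for example, number of tokens); the function name, the dictionary keys and the zero-source fallback are assumptions.

```python
def representations(s, t):
    """Return the four candidate representations of a source/target count pair."""
    return {
        "difference": s - t,                                # F = S - T
        "absolute_difference": abs(s - t),                  # F = |S - T|
        "ratio": t / s if s else 0.0,                       # F = T / S
        "normalised_difference": (s - t) / s if s else 0.0, # F = (S - T) / S
    }


# Example: 10 source tokens, 8 target tokens.
print(representations(10, 8))
```

Which form works best is exactly the open question raised on the slide; the code only makes the alternatives concrete.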

Page 55: Lucia Specia - Estimativa de qualidade em TA

Manual scoring: agreement between translators

Absolute value judgements: difficult to achieve consistency across annotators even in highly controlled setup

en-es news WMT12 dataset: 3 professional translators, 1-5 scores

  - 15% of initial dataset discarded: annotators disagreed by more than one category
  - Remaining annotations had to be scaled (0.33, 0.17, 0.50)

Page 57: Lucia Specia - Estimativa de qualidade em TA

Manual scoring: Agreement between translators

en-pt subtitles of TV series: 3 non-professional annotators, 1-4 scores

  - 351 cases (41%): full agreement
  - 445 cases (52%): partial agreement
  - 54 cases (7%): null agreement

Agreement by score:

  Score   Full
  4       59%
  3       35%
  2       23%
  1       50%
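
Agreement counts of this kind can be derived from raw annotations as sketched below. The definitions are assumptions, since the slide does not spell them out: "full" means all three annotators gave the same score, "partial" means exactly two agree, and "null" means all three differ. The sample annotations are invented.

```python
from collections import Counter


def agreement_category(scores):
    """Classify one item's scores as full, partial or null agreement."""
    distinct = len(set(scores))
    if distinct == 1:
        return "full"
    if distinct == len(scores):
        return "null"
    return "partial"


def agreement_summary(annotations):
    """Return the share of items in each agreement category."""
    counts = Counter(agreement_category(scores) for scores in annotations)
    total = len(annotations)
    return {cat: counts[cat] / total for cat in ("full", "partial", "null")}


# Made-up scores from three annotators on a 1-4 scale.
sample = [(4, 4, 4), (3, 3, 2), (1, 2, 4), (2, 2, 2)]
print(agreement_summary(sample))  # {'full': 0.5, 'partial': 0.25, 'null': 0.25}
```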

Page 59: Lucia Specia - Estimativa de qualidade em TA

More objective ways of annotating translations

HTER: Edit distance between MT output and its minimally post-edited version

\[
\text{HTER} = \frac{\#\text{edits}}{\#\text{words in post-edited version}}
\]

Edits: substitute, delete, insert, shift

Analysis by Maarit Koponen (WMT-12) on post-edited translations with HTER and 1-5 scores

  - A number of cases where translations with low HTER (few edits) were assigned low quality scores (high post-editing effort), and vice versa
  - Certain edits seem to require more cognitive effort than others - not captured by HTER
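
The HTER formula can be approximated with a word-level edit distance. This sketch is a simplification: it counts substitutions, deletions and insertions but omits the shift operation, so it can overestimate the number of edits when words are merely reordered. Whitespace tokenisation and the function names are assumptions.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insert, delete, substitute; no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def hter_without_shifts(mt_output, post_edited):
    """Edits needed to turn the MT output into its post-edited version,
    divided by the number of words in the post-edited version."""
    mt, pe = mt_output.split(), post_edited.split()
    if not pe:
        raise ValueError("post-edited version must not be empty")
    return edit_distance(mt, pe) / len(pe)


print(hter_without_shifts("do buy this product", "do not buy this product"))  # 0.2
```

One inserted word out of five gives 0.2, a low score for an edit that reverses the meaning, which echoes the observation that edit counts do not capture how much an edit matters.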

Page 63: Lucia Specia - Estimativa de qualidade em TA

Quality of Machine Translation Quality Estimation Open issues Conclusions

More objective ways of annotating translations

TIME: varies considerably across translators (expected)

[Figure: post-editing time in Seconds (0 to 600) for Segments 1 to 20, Annotators A1 to A8]

[Figure: post-editing time in Seconds / word (0.00 to 25.00) for Segments 1 to 20, Annotators A1 to A8]

Can we normalise this variation?

A dedicated QE system for each translator?
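One possible way to normalise the variation across translators, offered here only as an assumption and not as the approach taken in the talk, is to standardise each annotator's seconds-per-word values against that annotator's own mean and standard deviation. The function name and the timing values below are hypothetical.

```python
from statistics import mean, pstdev


def normalise_per_annotator(times):
    """Convert each annotator's seconds-per-word values to z-scores.

    times maps an annotator label to a list of seconds-per-word values,
    one per segment.
    """
    normalised = {}
    for annotator, values in times.items():
        mu = mean(values)
        sigma = pstdev(values)
        if sigma == 0:
            # An annotator with constant speed carries no relative signal.
            normalised[annotator] = [0.0 for _ in values]
        else:
            normalised[annotator] = [(v - mu) / sigma for v in values]
    return normalised


# Hypothetical data: A2 is ten times slower than A1 on every segment.
times = {"A1": [1.0, 2.0, 3.0], "A2": [10.0, 20.0, 30.0]}
norm = normalise_per_annotator(times)
print(norm)
```

After standardisation the two annotators have identical profiles, so a single QE model could be trained on the pooled values. Whether that is preferable to a dedicated model per translator remains the open question raised above.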

Time, HTER, Keystrokes: data from 8 post-editors

PET: http://pers-www.wlv.ac.uk/~in1676/pet/

How to use estimated PE effort scores?

Should (supposedly) bad quality translations be filtered out or shown to translators (different scores/colour codes as in TMs)?

Wasting time to read scores and translations vs wasting “gisting” information

How to define a threshold on the estimated translation quality to decide what should be filtered out?

Translator dependent

Task dependent (SDL)

Do translators prefer detailed estimates (sub-sentence level) or an overall estimate for the complete sentence?

Too much information vs hard-to-interpret scores
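The filtering question can be made concrete with a small sketch. It assumes the estimated score is a predicted HTER (lower means less post-editing effort) and that each translator has their own cut-off. The threshold values, field names and the default are all hypothetical, chosen only to show the mechanics.

```python
def split_by_threshold(segments, thresholds, default=0.4):
    """Split segments into (shown, filtered) using per-translator thresholds.

    Each segment is a dict with 'translator', 'mt' and 'predicted_hter'.
    A segment is shown when its predicted HTER does not exceed the
    translator's threshold, and filtered out otherwise.
    """
    shown, filtered = [], []
    for seg in segments:
        limit = thresholds.get(seg["translator"], default)
        if seg["predicted_hter"] <= limit:
            shown.append(seg)
        else:
            filtered.append(seg)
    return shown, filtered


# Hypothetical thresholds and predictions.
thresholds = {"T1": 0.3, "T2": 0.6}
segments = [
    {"translator": "T1", "mt": "segment a", "predicted_hter": 0.25},
    {"translator": "T1", "mt": "segment b", "predicted_hter": 0.45},
    {"translator": "T2", "mt": "segment c", "predicted_hter": 0.45},
    {"translator": "T3", "mt": "segment d", "predicted_hter": 0.50},
]
shown, filtered = split_by_threshold(segments, thresholds)
print([s["mt"] for s in shown], [s["mt"] for s in filtered])
```

The same predicted score of 0.45 is shown to one translator and filtered for another, which illustrates why a single global threshold may not suit every translator or task.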

Conclusions

It is possible to estimate at least certain aspects of MT quality, especially with respect to PE effort: QuEst http://quest.dcs.shef.ac.uk/

PE effort estimates can be used in real applications

Ranking translations: filter out bad quality translations

Selecting translations from multiple MT systems

Commercial products by SDL (document-level for gisting) and Multilizer

A number of open issues to be investigated...

Collaboration with “human translators” essential

My vision

Sub-sentence level QE (error detection), highlighting errors but also giving an overall estimate for the sentence
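The two applications named above, ranking translations and selecting among multiple MT systems, reduce to sorting or taking a minimum over estimated effort scores. The sketch below assumes each candidate comes with a predicted effort score where lower is better. The system names, translations and scores are hypothetical and do not come from QuEst or any real system.

```python
def rank_systems(candidates):
    """Return system names ordered from lowest to highest predicted effort."""
    return sorted(candidates, key=lambda system: candidates[system][1])


def select_translation(candidates):
    """Return (system, translation, score) for the lowest predicted effort."""
    best = rank_systems(candidates)[0]
    translation, score = candidates[best]
    return best, translation, score


# Hypothetical candidates for one source sentence: (translation, predicted effort).
candidates = {
    "system_a": ("translation from a", 0.42),
    "system_b": ("translation from b", 0.18),
    "system_c": ("translation from c", 0.31),
}
ranking = rank_systems(candidates)
best = select_translation(candidates)
print(ranking, best)
```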
