Lucia Specia - Quality estimation in MT


Machine translation quality estimation

Lucia Specia

University of [email protected]

Faculdade de Letras da Universidade do Porto, 13 May 2013

Outline

1 Quality of Machine Translation

2 Quality Estimation

3 Open issues

4 Conclusions

Introduction

Machine Translation:

Around since the early 1950s

Increasingly popular since 1990: statistical approaches

Software tools and data available to build translation systems - Moses and others

Increasing demand for cheaper and faster translations

How do we measure quality and progress over time?

So far... mostly automatic evaluation metrics

Introduction

MT evaluation metrics

N-gram matching between system output and one or more reference translations: BLEU and many others

Issue 1: Too many possible good quality translations, need thousands of references to capture valid variations

Solution: HyTER (Language Weaver) annotation tool to generate all possible correct translations! [DM12]

Translations built bottom-up from word/phrase translation equivalents using FSA

2-2.5 hours' worth of expert annotation per sentence

One annotator: 5.2 × 10^6 paths

A bunch of annotators: 8.5 × 10^11 paths

Issue 2: Difficult to quantify severity of mismatching n-grams

ref: Do not buy this product, it’s their craziest invention!

sys: Do buy this product, it’s their craziest invention!

Some attempts to weight mismatches differently - sparse, lexicalised approach

However, the same error is more or less important depending on the user or purpose:

Severe if end-user does not speak source language

Trivial to post-edit by translators

Conversely:

ref: The battery lasts 6 hours and it can be fully recharged in 30 minutes.

sys: Six-hours battery, 30 minutes to full charge last.

Ok for gisting - meaning preserved

Very costly for post-editing if style is to be preserved
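The two ref/sys pairs above can be run through a bare n-gram matcher to see the mismatch between overlap and usefulness. The sketch below is a minimal illustration, not BLEU itself: it computes clipped n-gram precision only (no brevity penalty, no smoothing), and the tokeniser and all function names are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude tokeniser (an assumption for this sketch): lowercase words, digits,
    # apostrophes and hyphens; all other punctuation is dropped.
    return re.findall(r"[a-z0-9'-]+", text.lower())


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(sys_text, ref_text, n):
    # Share of system n-grams that also occur in the reference, with each
    # n-gram credited at most as often as it appears in the reference.
    sys_counts = Counter(ngrams(tokenize(sys_text), n))
    ref_counts = Counter(ngrams(tokenize(ref_text), n))
    total = sum(sys_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in sys_counts.items())
    return matched / total


negation_ref = "Do not buy this product, it's their craziest invention!"
negation_sys = "Do buy this product, it's their craziest invention!"
battery_ref = "The battery lasts 6 hours and it can be fully recharged in 30 minutes."
battery_sys = "Six-hours battery, 30 minutes to full charge last."

for name, sys_text, ref_text in [
    ("negation", negation_sys, negation_ref),
    ("battery", battery_sys, battery_ref),
]:
    p1 = clipped_precision(sys_text, ref_text, 1)
    p2 = clipped_precision(sys_text, ref_text, 2)
    print(f"{name}: unigram precision {p1:.2f}, bigram precision {p2:.2f}")
```

Under these assumptions the negation example, whose meaning is inverted, matches every unigram and six of its seven bigrams, while the meaning-preserving battery example matches only three of eight unigrams and one of seven bigrams.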

Task-based evaluation

Measure translation quality within task. E.g. Autodesk - Productivity test through post-editing [Aut11]

2-day translation and post-editing, 37 participants

In-house Moses (Autodesk data: software)

Time spent on each segment

E.g.: Intel - User satisfaction with un-edited MT

Translation is good if customer can solve problem

MT for Customer Support websites [Int10]

Overall customer satisfaction: 75% for English→Chinese

95% reduction in cost

Project cycle from 10 days to 1 day

From 300 to 60,000 words translated/hour

Customers in China using MT texts were more satisfied with support than natives using original texts (68%)!

MT for chat and community forums [Int12]

∼60% “understandable and actionable” (→English/Spanish)

Max ∼10% “not understandable” (→Chinese)

Overview

Metrics either depend on references or post-editing/use of translations (task-based)

Our proposal

Quality assessment without reference, prior to post-editing/use of translations

Overview

Why don’t translators use (more) MT?

Translations are not good enough!

What about TMs? Aren’t fuzzy matches useful?

Framework

Quality estimation (QE): provide an estimate of quality for new translated text *before* it is post-edited

Quality = post-editing effort

No access to reference translations: machine learning techniques to predict post-editing effort scores

Considers interaction with TM systems: only used for low fuzzy match cases, or to select between TM and MT

QTLaunchPad project

Multidimensional Quality Metrics for MT and HT, for manual and (semi-)automatic evaluation (QE): http://www.qt21.eu/launchpad/

Framework

[Diagram: Source text → MT system → Translation → QE system → Quality score. The QE system is built from examples (source & translations, quality scores) and quality indicators.]
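The framework can be sketched end to end in a few lines: indicators are extracted from each (source, translation) pair, a learner is fitted on examples with known quality scores, and the fitted model scores new translations without a reference. This is a toy illustration under stated assumptions: the three indicators, the nearest-neighbour learner, the class and function names, and the training scores are all invented for the example and are not the method or data behind the slides.

```python
import math


def extract_indicators(source, translation):
    # Three toy quality indicators: source length, translation length,
    # and their ratio (all counted in whitespace-separated tokens).
    src = source.split()
    tgt = translation.split()
    ratio = len(tgt) / len(src) if src else 0.0
    return [float(len(src)), float(len(tgt)), ratio]


class NearestNeighbourQE:
    """Predicts a quality score as the mean score of the k closest examples."""

    def __init__(self, k=1):
        self.k = k
        self.examples = []

    def fit(self, pairs, scores):
        self.examples = [
            (extract_indicators(src, tgt), score)
            for (src, tgt), score in zip(pairs, scores)
        ]
        return self

    def predict(self, source, translation):
        x = extract_indicators(source, translation)
        ranked = sorted(self.examples, key=lambda ex: math.dist(x, ex[0]))
        nearest = ranked[: self.k]
        return sum(score for _, score in nearest) / len(nearest)


# Invented training examples; scores stand for post-editing effort (0 = none).
train_pairs = [
    ("good morning", "bom dia"),
    ("the cat sat", "o gato sentou"),
    ("this is a long and rather complicated sentence", "esta frase"),
]
train_scores = [0.0, 0.1, 0.8]

qe = NearestNeighbourQE(k=1).fit(train_pairs, train_scores)
print(qe.predict("a b c d e f g", "x y"))
```

A real system would use many more indicators and a stronger learner; the point here is only the data flow from examples and indicators to a score for an unseen translation.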

Examples of positive results

Time to post-edit subset of sentences predicted as “good” (low effort) vs time to post-edit random subset of sentences

Language   no QE            QE
fr-en      0.75 words/sec   1.09 words/sec
en-es      0.32 words/sec   0.57 words/sec

Accuracy in selecting best translation among 4 MT systems

Best MT system   Highest QE score
54%              77%
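Selecting by highest QE score amounts to an argmax over the candidate translations of one source sentence. A minimal sketch, assuming each candidate already carries a predicted quality score where higher is better; the tuple layout, system names and scores are hypothetical.

```python
def select_best(candidates):
    # candidates: list of (system_name, translation, predicted_quality) tuples,
    # where a higher predicted_quality means a better translation.
    if not candidates:
        raise ValueError("no candidate translations")
    return max(candidates, key=lambda candidate: candidate[2])


# Hypothetical outputs of 4 MT systems for one source sentence.
candidates = [
    ("system_1", "translation one", 0.42),
    ("system_2", "translation two", 0.77),
    ("system_3", "translation three", 0.31),
    ("system_4", "translation four", 0.58),
]

best_system, best_translation, best_score = select_best(candidates)
print(best_system, best_score)
```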

State-of-the-art

Quality indicators:

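[Diagram: Source text → MT system → Translation. Complexity indicators come from the source text, confidence indicators from the MT system, fluency indicators from the translation, and adequacy indicators from the source text and translation together.]

Learning algorithms: wide range

Datasets: few with absolute human scores (1-4/5 scores, PE time, edit distance)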

State-of-the-art indicators

Shallow indicators:

(S/T/S-T) Sentence length

(S/T) Language model

(S/T) Token-type ratio

(S) Average number of possible translations per word

(S) % of n-grams belonging to different frequency quartiles of a source language corpus

(T) Untranslated/OOV words

(T) Mismatching brackets, quotation marks

(S-T) Preservation of punctuation

(S-T) Word alignment score, etc.

These do well for estimating post-editing effort...

...but are not enough for other aspects of quality, e.g. adequacy
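Several of the shallow indicators need nothing beyond string processing. The sketch below covers sentence length, token-type ratio, mismatching brackets and preservation of punctuation; the function names, the whitespace tokenisation and the exact definitions (for instance tokens divided by distinct types, where the inverse is also in use) are assumptions for illustration. Indicators that need external resources, such as language models or alignment scores, are left out.

```python
import string
from collections import Counter

OPENERS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPENERS.values())


def token_type_ratio(tokens):
    # Tokens divided by distinct types (one common definition).
    return len(tokens) / len(set(tokens)) if tokens else 0.0


def has_mismatched_brackets(text):
    # True if any bracket is unclosed, unopened, or closed in the wrong order.
    expected = []
    for ch in text:
        if ch in OPENERS:
            expected.append(OPENERS[ch])
        elif ch in CLOSERS:
            if not expected or expected.pop() != ch:
                return True
    return bool(expected)


def punctuation_difference(source, translation):
    # Sum over punctuation marks of the absolute difference in their counts.
    src = Counter(ch for ch in source if ch in string.punctuation)
    tgt = Counter(ch for ch in translation if ch in string.punctuation)
    return sum(abs(src[ch] - tgt[ch]) for ch in set(src) | set(tgt))


def shallow_indicators(source, translation):
    src, tgt = source.split(), translation.split()
    return {
        "src_length": len(src),
        "tgt_length": len(tgt),
        "length_ratio": len(tgt) / len(src) if src else 0.0,
        "src_token_type_ratio": token_type_ratio(src),
        "tgt_token_type_ratio": token_type_ratio(tgt),
        "tgt_mismatched_brackets": has_mismatched_brackets(translation),
        "punctuation_difference": punctuation_difference(source, translation),
    }


indicators = shallow_indicators(
    "The cat (a tabby) sat.",
    "O gato (um malhado sentou.",
)
print(indicators)
```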

Linguistic indicators - count-based:

(S/T/S-T) Content/non-content words

(S/T/S-T) Nouns/verbs/... NP/VP/...

(S/T/S-T) Deictics (references)

(S/T/S-T) Discourse markers (references)

(S/T/S-T) Named entities

(S/T/S-T) Zero-subjects

(S/T/S-T) Pronominal subjects

(S/T/S-T) Negation indicators

(T) Subject-verb / adjective-noun agreement

(T) Language Model of POS

(T) Grammar checking (dangling words)

(T) Coherence

Linguistic indicators - alignment-based:

(S-T) Correct translation of pronouns

(S-T) Matching of dependency relations

(S-T) Matching of named entities

(S-T) Alignment of parse trees

(S-T) Alignment of predicates & arguments, etc.

Some indicators are language-dependent, others need resources that are language-dependent, but apply to most languages, e.g. LM of POS tags

Fine-grained, lexicalised indicators:

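target-word = “process”:

$$\begin{cases} 1, & \text{if source-word} = \text{“hdhh alamlyt”} \\ 0, & \text{otherwise} \end{cases}$$

$$\begin{cases} 1, & \text{if source-pos} = \text{“DT DTNN”} \\ 0, & \text{otherwise} \end{cases}$$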

Closer to error detection

Need large amounts of training data [BHAO11], or RB approaches
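A lexicalised indicator of this kind is a binary function tied to one target word and one property of the source words aligned to it. A minimal sketch, assuming each target word comes with its aligned source words and their POS tags in a dictionary; the field names and the `make_indicator` helper are hypothetical.

```python
def make_indicator(target_word, source_field, source_value):
    # Returns a binary feature function: 1 if the example has the given target
    # word and its aligned source side matches source_value, 0 otherwise.
    def indicator(example):
        matches = (
            example["target_word"] == target_word
            and example[source_field] == source_value
        )
        return 1 if matches else 0

    return indicator


word_indicator = make_indicator("process", "source_word", "hdhh alamlyt")
pos_indicator = make_indicator("process", "source_pos", "DT DTNN")

example = {
    "target_word": "process",
    "source_word": "hdhh alamlyt",
    "source_pos": "DT DTNN",
}
other = {
    "target_word": "process",
    "source_word": "something else",
    "source_pos": "NN",
}

print(word_indicator(example), pos_indicator(example))
print(word_indicator(other), pos_indicator(other))
```

One such function exists per (target word, source value) pair, which is why the feature space is sparse and needs large amounts of training data.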

Do these indicators work?

To some extent... Issues:

Representation of shallow/deep indicators: counts, ratios, (absolute) differences?

$F = S - T$, $F = |S - T|$, $F = \frac{T}{S}$, or a ratio involving $S - T$

Resources to extract deep indicators: availability and reliability

Data to extract fine-grained indicators: need previously translated and post-edited data, esp. for negative examples
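The representation question is easy to make concrete: the same pair of source-side and target-side counts yields different feature values depending on how they are combined. A small sketch, assuming S and T are counts of the same phenomenon (for example nouns) in source and translation; the function and key names are made up here, and the ratio is taken as T/S.

```python
def representations(s, t):
    # s, t: counts of one phenomenon in the source (S) and translation (T).
    return {
        "difference": s - t,
        "absolute_difference": abs(s - t),
        "ratio": t / s if s else 0.0,  # guarded against an empty source count
    }


# 10 nouns in the source, 8 in the translation.
print(representations(10, 8))
# Swapping the counts flips the sign of the difference but not its magnitude.
print(representations(8, 10))
```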

Manual scoring: agreement between translators

Absolute value judgements: difficult to achieve consistency across annotators even in highly controlled setup

en-es news WMT12 dataset: 3 professional translators, 1-5 scores

15% of initial dataset discarded: annotators disagreed by more than one category

Remaining annotations had to be scaled (0.33, 0.17, 0.50)

Manual scoring: Agreement between translators

en-pt subtitles of TV series: 3 non-professional annotators, 1-4 scores

351 cases (41%): full agreement

445 cases (52%): partial agreement

54 cases (7%): null agreement

Agreement by score:

Score   Full agreement
4       59%
3       35%
2       23%
1       50%
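With three annotators per segment, the three agreement levels can be counted mechanically. The sketch below assumes "full" means all three scores are identical, "partial" means exactly two coincide, and "null" means all three differ; these definitions and the toy annotations are assumptions, not taken from the dataset described above.

```python
from collections import Counter


def agreement_level(scores):
    # scores: the scores given to one segment by each annotator.
    distinct = len(set(scores))
    if distinct == 1:
        return "full"
    if distinct == len(scores):
        return "null"
    return "partial"


def agreement_summary(annotations):
    # Fraction of segments at each agreement level.
    counts = Counter(agreement_level(scores) for scores in annotations)
    total = len(annotations)
    return {level: counts[level] / total for level in ("full", "partial", "null")}


# Invented 1-4 scores from 3 annotators for 4 segments.
toy_annotations = [(4, 4, 4), (3, 3, 2), (1, 2, 3), (4, 4, 4)]
summary = agreement_summary(toy_annotations)
print(summary)
```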

More objective ways of annotating translations

HTER: Edit distance between MT output and its minimally post-edited version

$$\mathrm{HTER} = \frac{\#\text{edits}}{\#\text{words in post-edited version}}$$

Edits: substitute, delete, insert, shift

Analysis by Maarit Koponen (WMT-12) on post-edited translations with HTER and 1-5 scores

A number of cases where translations with low HTER (few edits) were assigned low quality scores (high post-editing effort), and vice-versa

Certain edits seem to require more cognitive effort than others - not captured by HTER
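The HTER ratio can be approximated with a word-level edit distance. The sketch below is a simplification, not a full HTER implementation: it counts substitutions, deletions and insertions but omits the shift operation, so it overestimates the score whenever the post-editor merely moved words. The function names and whitespace tokenisation are assumptions.

```python
def word_edit_distance(hyp, ref):
    # Levenshtein distance over word lists (substitute, delete, insert).
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def approx_hter(mt_output, post_edited):
    # Edits divided by the number of words in the post-edited version.
    hyp, ref = mt_output.split(), post_edited.split()
    if not ref:
        raise ValueError("post-edited version is empty")
    return word_edit_distance(hyp, ref) / len(ref)


# One inserted word out of five in the post-edited version.
print(approx_hter("Do buy this product", "Do not buy this product"))
# A pure reordering costs two edits here; with shifts it would cost one.
print(approx_hter("b a", "a b"))
```

The first call also illustrates the point made about cognitive effort: restoring a dropped negation is a single edit, yet the error it fixes is severe.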

TIME: varies considerably across translators (expected)

[Figure: post-editing time in seconds, segments 1-20, by annotator]

[Figure: post-editing time in seconds per word, segments 1-20, by annotator]

Can we normalise this variation?

A dedicated QE system for each translator?
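One possible way to normalise the variation, offered only as a sketch of an option and not as a claim about what works, is to standardise each translator's times against that translator's own mean and spread, so that a score expresses "slow or fast for this person". The function name, the data layout and the numbers are invented for illustration.

```python
from statistics import mean, pstdev


def normalise_per_annotator(times):
    # times: dict mapping annotator -> list of seconds per word, one per segment.
    # Returns z-scores computed separately for each annotator.
    normalised = {}
    for annotator, values in times.items():
        mu = mean(values)
        sigma = pstdev(values)
        normalised[annotator] = [
            (value - mu) / sigma if sigma else 0.0 for value in values
        ]
    return normalised


# Invented data: annotator B is twice as slow as A on every segment.
toy_times = {"A": [1.0, 2.0, 3.0], "B": [2.0, 4.0, 6.0]}
z_scores = normalise_per_annotator(toy_times)
print(z_scores)
```

After normalisation the two annotators in the toy data get identical profiles, since they differ only in overall speed; whether this carries over to real post-editing data remains an open question.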

Time, HTER, Keystrokes: data from 8 post-editors

PET: http://pers-www.wlv.ac.uk/~in1676/pet/

How to use estimated PE effort scores?

Should (supposedly) bad quality translations be filtered out or shown to translators (different scores/colour codes as in TMs)?

Wasting time to read scores and translations vs wasting “gisting” information

How to define a threshold on the estimated translation quality to decide what should be filtered out?

Translator dependent

Task dependent (SDL)

Do translators prefer detailed estimates (sub-sentence level) or an overall estimate for the complete sentence?

Too much information vs hard-to-interpret scores
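If thresholds are translator and task dependent, filtering reduces to a lookup followed by a comparison. A minimal sketch, assuming predicted effort lies between 0 (no edits needed) and 1; every threshold value, key and field name below is hypothetical and carries no recommendation.

```python
DEFAULT_THRESHOLD = 0.5  # hypothetical fallback value

# Hypothetical thresholds on predicted post-editing effort,
# keyed by (translator, task); "any" applies to every translator.
THRESHOLDS = {
    ("translator_a", "post-editing"): 0.3,
    ("translator_b", "post-editing"): 0.6,
    ("any", "gisting"): 0.8,
}


def threshold_for(translator, task):
    # Most specific entry first, then the task-wide entry, then the default.
    for key in ((translator, task), ("any", task)):
        if key in THRESHOLDS:
            return THRESHOLDS[key]
    return DEFAULT_THRESHOLD


def split_by_threshold(segments, translator, task):
    # Returns (shown, filtered_out) according to the applicable threshold.
    limit = threshold_for(translator, task)
    shown = [s for s in segments if s["predicted_effort"] <= limit]
    filtered_out = [s for s in segments if s["predicted_effort"] > limit]
    return shown, filtered_out


segments = [
    {"id": 1, "predicted_effort": 0.1},
    {"id": 2, "predicted_effort": 0.45},
    {"id": 3, "predicted_effort": 0.9},
]

shown, filtered_out = split_by_threshold(segments, "translator_a", "post-editing")
print([s["id"] for s in shown], [s["id"] for s in filtered_out])
```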

Conclusions

It is possible to estimate at least certain aspects of MT quality, esp. wrt PE effort: QuEst http://quest.dcs.shef.ac.uk/

PE effort estimates can be used in real applications

Ranking translations: filter out bad quality translations

Selecting translations from multiple MT systems

Commercial products by SDL (document-level for gisting) and Multilizer

A number of open issues to be investigated...

Collaboration with “human translators” essential

My vision

Sub-sentence level QE (error detection), highlighting errors but also giving an overall estimate for the sentence

[Aut11] Autodesk. Translation and Post-Editing Productivity. http://translate.autodesk.com/productivity.html, 2011.

[BHAO11] Nguyen Bach, Fei Huang, and Yaser Al-Onaizan. Goodness: a method for measuring machine translation confidence. Pages 211–219, Portland, Oregon, 2011.

[DM12] Markus Dreyer and Daniel Marcu. HyTER: Meaning-equivalent semantics for translation evaluation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 162–171, Montreal, Canada, 2012.

[Int10] Intel. Being Streetwise with Machine Translation in an Enterprise Neighborhood. http://mtmarathon2010.info/JEC2010_Burgett_slides.pptx, 2010.

[Int12] Intel. Enabling Multilingual Collaboration through Machine Translation. http://media12.connectedsocialmedia.com/intel/06/8647/Enabling_Multilingual_Collaboration_Machine_Translation.pdf, 2012.
