Enterprise MT Content Drift: Challenges, Impacts and Advanced Solutions AMTA 2014 Welocalize and...

Enterprise MT Content Drift: Challenges, Impacts and Advanced Solutions

Alon Lavie & Olga Beregovaya

AMTA 2014

October 25, 2014

Outline

• Welocalize MT Program for eDell: workflow, processes, challenges
• Safaba EMTGlobal MT
• Enterprise Content Drift: the evidence
• Identifying Content Drift: Indicators and their correlation
• Overnight Retraining: the approach
• Overnight Retraining: the Pilot Study and Results
• EMTGlobal v4.0: Advanced Rapid Adaptation
• Welocalize: Expected Impacts
• Summary and Conclusions

Welocalize Approach to MT Program Deployment

Content Sent Through the TMT Process

eDell content types handled through the MT PE Process vary between different – mainly Marketing – content categories:
• Partner Marketing
• Global Support
• Channel Support
• Consumer Marketing Communication
• Corporate HR
• Customer Proposals
• Product Launches
• Global eDell Web content

The daily/weekly/monthly volumes per content type vary depending on Dell's business priorities

Translation Setup:
• Source document is pre-translated by translation memory matches augmented by Safaba MT
• Translation Memory “fuzzy match” threshold typically 75-85%
• Pre-translations are presented to the human translator as a starting point for editing; translators can use or ignore the suggested pre-translations
• Currently 28 languages go through the TM/MT workflow
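The TM-then-MT pre-translation setup described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `pretranslate` function, the `mt_translate` callback and the use of Python's `difflib.SequenceMatcher` as a stand-in for a real TM fuzzy-match score are hypothetical, not the Welocalize or Safaba implementation. The default threshold of 0.75 is the low end of the 75-85% range quoted above.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional, Tuple


def fuzzy_score(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for a real TM fuzzy-match score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pretranslate(
    source: str,
    tm: Dict[str, str],
    mt_translate: Callable[[str], str],
    threshold: float = 0.75,
) -> Tuple[str, str]:
    """Return (suggestion, origin), where origin is 'TM' or 'MT'."""
    best_src: Optional[str] = None
    best_score = 0.0
    # Find the closest source segment stored in the translation memory.
    for tm_src in tm:
        score = fuzzy_score(source, tm_src)
        if score > best_score:
            best_src, best_score = tm_src, score
    # Use the TM match if it clears the threshold, otherwise fall back to MT.
    if best_src is not None and best_score >= threshold:
        return tm[best_src], "TM"
    return mt_translate(source), "MT"
```

Either way the result is only a suggestion: the translator is free to edit or discard it, as the slide notes.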

Post-Editing Productivity Assessment:
• Contrastive translation projects that measure and compare translation team productivity with MT post-editing versus translation using just translation memories
• Productivity measured by contrasting translated words per hour under both conditions: MT-PE throughput / HT throughput
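The throughput ratio above can be written out as a small calculation. The function names are hypothetical, and the definition of a "productivity delta" as the ratio minus one is an assumption made for this sketch; the slides do not state the exact formula.

```python
def throughput(words: int, hours: float) -> float:
    """Translated words per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return words / hours


def productivity_ratio(mt_pe_wph: float, ht_wph: float) -> float:
    """MT-PE throughput divided by HT throughput, as on the slide."""
    if ht_wph <= 0:
        raise ValueError("HT throughput must be positive")
    return mt_pe_wph / ht_wph


def productivity_delta(mt_pe_wph: float, ht_wph: float) -> float:
    """Relative gain over HT (assumed definition): ratio minus one."""
    return productivity_ratio(mt_pe_wph, ht_wph) - 1.0
```

With purely illustrative figures of 600 words per hour for MT post-editing and 500 for translation from TM only, the ratio is 1.2 and the delta is +20%; a negative delta would mean post-editing was slower than the TM-only condition.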

Translation with MT Post-Editing for Dell

MT Post-Editing Productivity Assessment Evaluated by Welocalize in the context of the Dell MT Program

[Chart: BLEU, PE Distance and Productivity Delta; percentage axis from 50.00% to 300.00%]

Productivity Gains through Retraining

LOCALE ID   Initial engine   Retrained Engine
CH_TI   -11.75%   4.7%
CS-CZ   37.53%   -
DA-DK   88.67%   -
DE_DE   20.24%   31.2%
EL_EL   18.36%   51.3%
ES_ES   28.5%
ES_LX   2.31%   99.3%
FI-FI   102.80%
FR_FR   21.73%   46.4%
HE-IL   25.43%   -
HU-HU   32.53%   -
IT-IT   84.89%   -
JA-JP   20.62%   -
KO-KR   13.83%   -
NL-NL   27.85%   -
NO-NO   59.71%   -
PL-PL   33.83%   -
PT_BR   23.77%   31.3%
PT_PT   30.24%   27.7%
RU_RU   22.88%   36.6%

[Chart: Productivity Delta and Fluency; axis: Human Evaluation Fluency Score (1-5)]

[Chart: Productivity Delta and Adequacy; axis: Human Evaluation Adequacy Score (1-5)]

[Chart: Adequacy, Fluency and PE Distance Correlation; panels: Adequacy & PE Distance, Fluency & PE Distance; locales: de_DE, es_ES/LA, fr_FR/CA, it_IT, pt_BR; axis from -1.00 to 1.00]

A Solid Engine Translates into Solid Gains

1.5 Years Later - Welocalize Post-Editing Adoption

Q4/Q3/Q2/Q1 Volumes

Dell - “High-traffic” MT Program

Quarterly MT throughput volumes allow Welocalize and Safaba to accumulate post-edits sufficient for far more frequent re-trainings than scheduled maintenance engine updates

MT quality results are consistently above target – engine degradation will force translators to compensate with additional effort

Continuously above target, monitoring trend

[Chart: Final Post-Edited Output Quality; Result vs. Target by week, Week 40 to Week 53; axis from 99.30% to 100.00%]

Client-Specific MT Adaptation

The majority of the MT systems Safaba develops are built and optimized for specific client content types

Data Scenario:
• Some amount of client-specific data: translation memories, terminology glossaries and monolingual data resources
• Additional domain-specific and general background data resources: other client-specific content types, TAUS data, other general parallel and monolingual background data

Safaba EMTGlobal

Client-Specific MT Adaptation

Safaba Suite of Adaptation Approaches:
• Data selection, filtering and prioritization methods
• Data mixture and interpolation methods
• Model mixture and interpolation methods
• Client-specific Automated Post-Editing (Language Optimization Engine)
• Styling and Formatting post-processing modules
• Terminology and DNT runtime overrides

Safaba EMTGlobal

Client-specific Enterprise MT systems often degrade in performance over time for two main reasons:

1. Client content, even in controlled-domains, gradually changes over time: new products, new terminology, new content developers

2. The typical integrated setup of MT and translation memories: TMs are updated more frequently, so over time, only “harder” source segments are sent for translation to MT

Currently, full MT system retraining is resource- and time-consuming:

MT systems are relatively static – they are fully retrained only periodically (typically only a couple of times per year)

The Result: MT accuracy for new projects declines over time, and post-editing productivity also declines over time

We see strong evidence of “content drift” over time with many of our clients, especially in post-editing setups

Enterprise Content Drift

Evidence from Safaba EMTGlobal Systems for Dell MT Program: BLEU scores before and after retraining on held-out “recent” incremental data

[Chart: BLEU scores by language (AR-EG, CS, DA, DE, EL, ES-ES, FI, FR-FR, HE, HU, IT, JA, KO) for 2013 and 2014]

Enterprise Content Drift

Evidence from a typical client-specific MT system: EMTGlobal English-to-German Dell MT System:
• February 2013 System: 565K client + 964K background segments
• March 2014 System: 594K client + 6,795K background segments

Two test sets:
• “Original” test set from February 2013 system build (1,200 segments)
• “Incremental” test set extracted from incremental data (500 segments)

System Test Scores and Statistics:

Metric             DE Feb. 2013   DE March 2014
Gloss Inconsist.   55.7 %         24.8 %
Orig. BLEU         51.0           52.9
Orig. MET          63.4           64.2
Orig. TER          38.2           36.9
Orig. LEN          101.2          100.5
Orig. OOVs         63             33
Incr. BLEU         41.7           60.5
Incr. MET          56.6           69.9
Incr. TER          45.0           30.3
Incr. LEN          101.2          99.9
Incr. OOVs                        31


Analysis of Content Drift Over Time:

Three EMTGlobal MT systems for Dell:
• English to Chinese, Spanish and German
• Systems trained and deployed in February 2013

Test sets:
• “Original” test set from February 2013 system build (1,200 segments)
• “Incremental” test set extracted from 2014 incremental data (500 segments)
• Data sets extracted from live Dell production projects in August-2013, December-2013 and March-2014 along with their post-edited references

Analysis of Content Drift Over Time: BLEU Scores

[Chart: BLEU scores for Chinese, Spanish and German on the Feb, Aug, Dec, Mar, Inc-2013 and Inc-2014 data sets]

Content Drift Indicators

Goal: Establish real-time quantifiable measures that are indicative of Enterprise Content Drift
• Immediate: Available immediately at MT production time, prior to any post-editing of the MT output
• Predictive: Strongly correlate with expected MT evaluation score and post-editing effort
• Similar to real-time MT Quality Estimation scores, but specific to capturing content drift

Three Measures:
• Core Out-of-Vocabulary (OOV) Type and Token fractions: Fraction of source types (tokens) that were out-of-vocabulary in the core MT system (OOVs)
• Source-side Unigram Coverage: Fraction of source type (token) unigrams that were observed in the MT system training data
• Source-side Trigram Coverage: Fraction of source type (token) trigrams that were observed in the MT system training data
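The three measures can be sketched in a few lines of Python. This is a minimal illustration, not Safaba's implementation: the function names are hypothetical, input is assumed to be pre-tokenized, and the OOV fraction is approximated as one minus unigram coverage against the training data, whereas a real core MT engine may have a vocabulary that differs from its raw training data.

```python
from typing import Dict, Iterable, List, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[Ngram]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def collect_ngrams(corpus: Iterable[List[str]], n: int) -> Set[Ngram]:
    """Set of n-gram types observed in a tokenized corpus."""
    seen: Set[Ngram] = set()
    for sentence in corpus:
        seen.update(ngrams(sentence, n))
    return seen


def coverage(batch: List[List[str]], known: Set[Ngram], n: int) -> Tuple[float, float]:
    """Return (type coverage, token coverage) of the batch's n-grams."""
    occurrences: List[Ngram] = []
    for sentence in batch:
        occurrences.extend(ngrams(sentence, n))
    if not occurrences:
        return 1.0, 1.0  # nothing to cover; a convention chosen for this sketch
    types = set(occurrences)
    type_cov = sum(1 for g in types if g in known) / len(types)
    token_cov = sum(1 for g in occurrences if g in known) / len(occurrences)
    return type_cov, token_cov


def drift_indicators(train: List[List[str]], batch: List[List[str]]) -> Dict[str, float]:
    """Source-side indicators for a new batch, computed before any post-editing."""
    uni_type, uni_tok = coverage(batch, collect_ngrams(train, 1), 1)
    tri_type, tri_tok = coverage(batch, collect_ngrams(train, 3), 3)
    return {
        # Assumption: OOV fraction approximated as 1 - unigram coverage.
        "oov_type_fraction": 1.0 - uni_type,
        "oov_token_fraction": 1.0 - uni_tok,
        "unigram_type_coverage": uni_type,
        "unigram_token_coverage": uni_tok,
        "trigram_type_coverage": tri_type,
        "trigram_token_coverage": tri_tok,
    }
```

Because only the source side and the training data are needed, all of these values are available at MT production time, which is the "Immediate" property listed above.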

Identifying Enterprise Content Drift

Performance of Content Drift Indicators on Dell EMTGlobal Systems:

[Chart: OOVs (Fraction of Tokens) for Chinese, Spanish and German]

Performance of Content Drift Indicators on Dell EMTGlobal Systems:

[Chart: Source Trigram Coverage for Chinese, Spanish and German]

“Overnight” Incremental Adaptation

Objective: Counter “content drift” and help maintain and accelerate post-editing productivity with fast and frequent incremental adaptation retraining

Setting: New additional post-edited client data is deposited and made available for adaptation in small incremental batches

Challenge: Full offline system retraining is slow and computationally intensive and can take several days

Safaba Solution: implement fast “light-weight” adaptations that can be executed, tested and deployed into production within hours (“overnight”)

Suffix-array variant of Moses supports rapid updating of indexed training data

Safaba Language Optimization Engine (automated post-editing module) supports rapid retraining

KenLM supports rapid rebuilding of language models

Currently in pilot testing with Welocalize and Dell

The Approach:

Goal: Rapid MT System Adaptation using Incremental Data

Current Approach: Language Optimization Engine (LOE) Incremental Retraining
• Safaba EMTGlobal MT systems include a core MT engine and a target-side Language Optimization Engine
• Retraining the LOE component is fast – typically within a few hours
• Not equivalent to full MT system retraining, but effective in closing the gap

New Approach: EMTGlobal v4.0 Advanced Adaptation Technology:
• Supports significantly improved client-specific adaptation within the core MT engine
• Supports rapid incremental retraining of core MT engines
• Much closer to full MT system retraining, in a similar time frame to LOE retraining
• Will be available in late Q4 of 2014

Safaba Overnight Retraining

The Approach:

Full Solution: Overnight Retraining
• Incremental data from post-edited MT projects is delivered to Safaba
• Incremental system retraining is launched automatically, completed within hours
• Newly-adapted version of the MT system is automatically tested and QAed for quality
• Newly-adapted version of the MT system is deployed into production
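The four steps above amount to a retrain, test, deploy cycle, which might be sketched as below. Everything here is hypothetical: `overnight_cycle`, the `retrain`, `evaluate` and `deploy` callbacks and the `min_gain` parameter are illustrative names, and the QA rule (deploy only when the candidate does not score lower than the current system on held-out data) is an assumption, since the slides do not specify how the automatic QA decides.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Segment = Tuple[str, str]  # (source, post-edited reference)


@dataclass
class CycleResult:
    deployed: bool
    baseline_score: float
    candidate_score: float


def overnight_cycle(
    current_system: object,
    incremental_batch: List[Segment],
    held_out: List[Segment],
    retrain: Callable[[object, List[Segment]], object],
    evaluate: Callable[[object, List[Segment]], float],
    deploy: Callable[[object], None],
    min_gain: float = 0.0,
) -> CycleResult:
    """Run one retrain -> test -> deploy cycle (higher score means better)."""
    baseline_score = evaluate(current_system, held_out)
    if not incremental_batch:
        # No new post-edited data was delivered, so there is nothing to adapt on.
        return CycleResult(False, baseline_score, baseline_score)
    candidate = retrain(current_system, incremental_batch)
    candidate_score = evaluate(candidate, held_out)
    # QA gate (assumed rule): ship only if the candidate does not regress.
    passed = candidate_score - baseline_score >= min_gain
    if passed:
        deploy(candidate)
    return CycleResult(passed, baseline_score, candidate_score)
```

Passing the three stages in as callbacks keeps the sketch independent of any particular MT toolkit; in practice `evaluate` would be an automated metric computed on a held-out sample of recent post-edited segments.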

The Pilot Project

Pilot project with Welocalize to assess impact of Overnight Retraining on Safaba EMTGlobal Dell MT systems, using samples of real post-edited translation project data

Setup:
• Languages: English to Chinese, Spanish and German
• Baseline Systems: 2014 retrained Dell EMTGlobal 3.0 MT systems
• Incremental Data: Three batches of incremental data from live translation projects

Methodology:
• Three versions of the MT systems:
  • Baseline
  • Baseline + Retrained on Data Set #1
  • Baseline + Retrained on Data Set #1 & #2
• MT Evaluation:
  • Translate Data Set #3 (unseen) with the three versions of the MT system
  • Assess impact on translation performance using automated MT evaluation metrics
  • Additional analysis using Safaba “Content Drift Indicators”

                Original number of segments   Number of segments post-filtering
Language pair   Set 1   Set 2   Set 3         Set 1   Set 2   Set 3
ENUS-ESXL       1108    4553    704           926     2411    528
ENUS-ZHCN       3191    2181    1328          1143    1084    714
ENUS-DEDE       3043    1220    2270          2325    977     1466
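To see how much of each incremental batch survives filtering, the counts in the table can be turned into retention rates. The numbers below are copied from the table; the `retention_rates` helper and the dictionary layout are hypothetical, and the slides do not describe the filtering criteria themselves.

```python
from typing import Dict, Tuple

# Segment counts per language pair for Sets 1-3, copied from the table above.
SEGMENT_COUNTS: Dict[str, Dict[str, Tuple[int, int, int]]] = {
    "ENUS-ESXL": {"original": (1108, 4553, 704), "filtered": (926, 2411, 528)},
    "ENUS-ZHCN": {"original": (3191, 2181, 1328), "filtered": (1143, 1084, 714)},
    "ENUS-DEDE": {"original": (3043, 1220, 2270), "filtered": (2325, 977, 1466)},
}


def retention_rates(
    counts: Dict[str, Dict[str, Tuple[int, int, int]]]
) -> Dict[str, Tuple[float, ...]]:
    """Fraction of segments kept after filtering, per language pair and set."""
    return {
        pair: tuple(kept / total for kept, total in zip(c["filtered"], c["original"]))
        for pair, c in counts.items()
    }


if __name__ == "__main__":
    for pair, rates in retention_rates(SEGMENT_COUNTS).items():
        print(pair, ", ".join(f"{r:.1%}" for r in rates))
```

For example, Spanish Set 3 keeps 528 of 704 segments (75%), while Chinese Set 1 keeps 1143 of 3191 (about 36%), so the usable adaptation data is considerably smaller than the raw batch sizes suggest.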

Pilot Results: Automated Metric Scores

English-to-Chinese: Incremental Adaptation of Language Optimization Engine (LOE)

Incrementally retraining on Data Sets #1 & #2 results in gain of +3.0 BLEU points on Data Set #3

[Chart: BLEU, METEOR and TER for the 2013 System, 2014 Baseline, Baseline+DS1 and Baseline+DS1&2]

Pilot Results: Content Drift Indicator Statistics

English-to-Chinese: Incremental Adaptation of Language Optimization Engine (LOE)

Adding Data Sets #1 & #2 reduces Data Set #3 OOVs by 0.3%, improves unigram coverage by 0.36% and improves trigram coverage by 14.22%

[Chart: OOV Tokens, Unigrams Covered and Trigrams Covered for the 2013 System, 2014 Baseline, Baseline+DS1 and Baseline+DS1&2]

Preliminary Results: Advanced Adaptation with EMTGlobal v4.0

English-to-Chinese: Incremental Adaptation with EMTGlobal v4.0

Incrementally retraining on Data Sets #1 & #2 results in gain of +6.8 BLEU points on Data Set #3

[Chart: BLEU, METEOR and TER for the 2014 Baseline and Baseline+DS1&2]

Summary of Pilot Results

Excellent results for English-to-Chinese!

Spanish and German results show no gain or loss in MT accuracy as a result of LOE incremental retraining with the available data sets

Performance on Data Set #3 remains completely flat with both retrainings according to all automated metrics

Data analysis with Content Drift Indicators reveals that Data Sets #1 & #2 for these two language pairs did not contain novel translations sufficient for improving MT performance on Data Set #3

No significant reduction in Data Set #3 OOVs

No significant improvement in coverage of source-side n-grams

Translators were asked to compare each engine iteration using the same source strings


Day 1: Read the MT output first. Then read the source text (ST). Then score the segment for Adequacy and Fluency

Adequacy

On a 4-point scale, rate how much of the meaning is rendered in the translation:

4 Everything
3 Most
2 Little
1 None

Fluency

Rate on a 4-point scale the extent to which the translation is well-formed grammatically, contains correct spellings, adheres to common use of terms, titles and names, is intuitively acceptable and can be sensibly interpreted by a native speaker:

4 Flawless
3 Good
2 Disfluent
1 Incomprehensible

*Based on TAUS Adequacy/Fluency Guidelines

Comparing the iterations: Compare the NEW MT output to that of the previous week and indicate with an X in the corresponding column whether it is better / worse / equal. If it is better or worse, indicate in the error categories & comment column what has improved or regressed.
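Judgments collected under this protocol could be aggregated along the following lines. This is a sketch only: the record layout (`adequacy`, `fluency`, `vs_previous`) and the `summarize` function are hypothetical and do not reflect the actual evaluation sheet used in the pilot.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List

VALID_COMPARISONS = ("better", "worse", "equal")


def summarize(judgments: List[Dict]) -> Dict:
    """Average the 4-point scores and tally better/worse/equal marks."""
    if not judgments:
        raise ValueError("no judgments to summarize")
    for j in judgments:
        # Both scales run from 1 (worst) to 4 (best) in the guidelines above.
        if not (1 <= j["adequacy"] <= 4 and 1 <= j["fluency"] <= 4):
            raise ValueError(f"score out of range: {j}")
        if j["vs_previous"] not in VALID_COMPARISONS:
            raise ValueError(f"unknown comparison: {j['vs_previous']}")
    tally = Counter(j["vs_previous"] for j in judgments)
    return {
        "mean_adequacy": mean(j["adequacy"] for j in judgments),
        "mean_fluency": mean(j["fluency"] for j in judgments),
        "comparison": {label: tally.get(label, 0) for label in VALID_COMPARISONS},
    }
```

Keeping the better/worse/equal tally separate from the mean scores makes it possible to see whether a new engine iteration changed individual segments even when the averages stay flat.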

Overnight Retraining Pilot Evaluation Setup

• The human evaluation results we have at our disposal are a work in progress, based on evaluating a small subset of translated data and just one iteration of “overnight retraining”

• The improvements observed by automated metrics are not yet reflected in the human assessment

• Human evaluation results are consistent between Baseline+DS1 and DS2: no degradation is introduced, but from the translator perspective no significant change in quality is captured; this possibly requires a larger evaluation set or a different approach to evaluation string selection

Translator feedback: improvement in fluency but no improvement in capturing the meaning of the whole sentence; punctuation has improved, but the translation still needs improvement; part of the sentence is now more fluent

Interim Chinese Pilot Results - Welocalize

We need to be more granular than “Quality” and look at “Relevance” (coverage and fluency will increase based on Safaba findings)

Our expected benefits from this approach (it needs to be in sync with sufficient daily volumes):
• No need to wait for scheduled retrainings
• Two things will happen: the translator gets more used to post-editing, and the MT engines catch up with the changes in the source content in the “live” mode
• Benefit for the client: once the actual ongoing engine relevance statistics have been captured, we’ll be able to predict higher throughputs and offer better discounts

Welocalize “Wins” from “Overnight Retraining”

Summary and Conclusions

• Enterprise Content Drift is a natural and frequent phenomenon in large-scale commercial MT implementation projects
• Enterprise MT systems need to constantly adapt or else are likely to significantly degrade in translation accuracy and value over time
• Safaba’s Content Drift Indicators can identify and quantify content drift and can be effectively used to predict the impact of incremental MT system retraining
  • Are being incorporated into Safaba’s new EMTGlobal MT Monitoring Portal
• Safaba’s “Overnight Retraining” incremental adaptation is effective in combating content drift, maintaining/improving MT system performance over time and maintaining translator productivity levels
• Safaba’s upcoming EMTGlobal v4.0 will dramatically enhance these capabilities!
