Enterprise MT Content Drift: Challenges, Impacts and Advanced Solutions
Alon Lavie & Olga Beregovaya
AMTA 2014
October 25, 2014
Outline
Welocalize MT Program for eDell: workflow, processes, challenges
Safaba EMTGlobal MT
Enterprise Content Drift: the evidence
Identifying Content Drift: Indicators and their correlation
Overnight Retraining: the approach
Overnight Retraining: the Pilot Study and Results
EMTGlobal v4.0: Advanced Rapid Adaptation
Welocalize: Expected Impacts
Summary and Conclusions
Welocalize Approach to MT Program Deployment
Content Sent Through the TMT Process
eDell content types handled through the MT PE Process vary between different – mainly Marketing – content categories: Partner Marketing Global Support Channel Support Consumer Marketing Communication Corporate HR Customer Proposals Product Launches Global eDell Web content
The daily/weekly/monthly volumes per content type vary depending on the Dell business priorities
Translation Setup:
Source document is pre-translated by translation memory matches augmented by Safaba MT
Translation Memory “fuzzy match” threshold is typically 75-85%
Pre-translations are presented to the human translator as a starting point for editing; translators can use or ignore the suggested pre-translations
Currently 28 languages go through the TM/MT workflow
Post-Editing Productivity Assessment:
Contrastive translation projects that measure and compare translation team productivity with MT post-editing versus translation using just translation memories
Productivity measured by contrasting translated words per hour under both conditions: MT-PE throughput / HT throughput
Translation with MT Post-Editing for Dell
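The throughput contrast described on this slide reduces to simple arithmetic. The following is a minimal sketch, assuming (the slides do not state this) that a "Productivity Delta" is the MT-PE / HT throughput ratio minus one. The function name and the sample throughputs are hypothetical and are not taken from the Dell program.

```python
def productivity_delta(mt_pe_words_per_hour: float, ht_words_per_hour: float) -> float:
    """Relative productivity change of MT post-editing over TM-only translation.

    Assumption: delta = (MT-PE throughput / HT throughput) - 1.
    """
    if ht_words_per_hour <= 0:
        raise ValueError("HT throughput must be positive")
    return mt_pe_words_per_hour / ht_words_per_hour - 1.0


# Hypothetical throughputs in translated words per hour.
delta = productivity_delta(mt_pe_words_per_hour=600.0, ht_words_per_hour=500.0)
print(f"{delta:.1%}")  # 20.0%
```

Under this reading, a negative delta means post-editing was slower than translating with translation memories alone.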
MT Post-Editing Productivity Assessment: evaluated by Welocalize in the context of the Dell MT Program
[Chart: BLEU, PE Distance and Productivity Delta; axes 0.00 to 90.00 and 0.00% to 300.00%]
LOCALE ID   Initial engine   Retrained Engine
CH_TI       -11.75%          4.7%
CS-CZ       37.53%           -
DA-DK       88.67%           -
DE_DE       20.24%           31.2%
EL_EL       18.36%           51.3%
ES_ES       28.5%
ES_LX       2.31%            99.3%
FI-FI       102.80%
FR_FR       21.73%           46.4%
HE-IL       25.43%           -
HU-HU       32.53%           -
IT-IT       84.89%           -
JA-JP       20.62%           -
KO-KR       13.83%           -
NL-NL       27.85%           -
NO-NO       59.71%           -
PL-PL       33.83%           -
PT_BR       23.77%           31.3%
PT_PT       30.24%           27.7%
RU_RU       22.88%           36.6%
Productivity Gains through Retraining
[Chart: Productivity Delta and Fluency. X axis: Human Evaluation Fluency Score (1-5); Y axis: Productivity Delta (-20% to 100%)]
[Chart: Productivity Delta and Adequacy. X axis: Human Evaluation Adequacy Score (1-5); Y axis: Productivity Delta (-20% to 100%)]
[Chart: Adequacy, Fluency and PE Distance Correlation. Series: Adequacy & PE Distance, Fluency & PE Distance; locales: de_DE, es_ES/LA, fr_FR/CA, it_IT, pt_BR; correlation axis -1.00 to 1.00]
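The slide does not say which correlation coefficient was used. As an illustration only, the sketch below assumes a Pearson correlation between per-segment human scores and PE distance. The data are hypothetical, and PE distance is treated here as a value between 0 and 1 purely for the example.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-segment data: human adequacy scores and PE distance.
adequacy = [5, 4, 4, 3, 2, 1]
pe_distance = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80]
r = pearson(adequacy, pe_distance)
print(f"{r:.2f}")  # negative: higher adequacy goes with less editing
```

A value near -1 would indicate that segments rated as more adequate needed fewer edits, which is the relationship the chart is probing.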
A Solid Engine Translates into Solid Gains
1.5 Years Later: Welocalize Post-Editing Adoption
Q4/Q3/Q2/Q1 Volumes
Dell - “High-traffic” MT Program
Quarterly MT throughput volumes allow Welocalize and Safaba to accumulate post-edits sufficient for far more frequent re-trainings than scheduled maintenance engine updates
MT quality results are consistently above target – engine degradation will force translators to compensate with additional effort
Continuously above target, monitoring trend
[Chart: Final Post-Edited Output Quality. Series: Result, Target; X axis: Week 40 to Week 53; Y axis: 99.30% to 100.00%]
Client-Specific MT Adaptation
The majority of the MT systems Safaba develops are built and optimized for specific client content types
Data Scenario:
Some amount of client-specific data: translation memories, terminology glossaries and monolingual data resources
Additional domain-specific and general background data resources: other client-specific content types, TAUS data, other general parallel and monolingual background data
Safaba EMTGlobal
Client-Specific MT Adaptation
Safaba Suite of Adaptation Approaches:
Data selection, filtering and prioritization methods
Data mixture and interpolation methods
Model mixture and interpolation methods
Client-specific Automated Post-Editing (Language Optimization Engine)
Styling and Formatting post-processing modules
Terminology and DNT runtime overrides
Safaba EMTGlobal
Client-specific Enterprise MT systems often degrade in performance over time for two main reasons:
1. Client content, even in controlled domains, gradually changes over time: new products, new terminology, new content developers
2. The typical integrated setup of MT and translation memories: TMs are updated more frequently, so over time, only “harder” source segments are sent for translation to MT
Currently, full MT system retraining is resource- and time-consuming:
MT systems are relatively static: they are fully retrained only periodically (typically only a couple of times per year)
The Result: MT accuracy for new projects declines over time, and post-editing productivity declines with it
We see strong evidence of “content drift” over time with many of our clients, especially in post-editing setups
Enterprise Content Drift
Evidence from Safaba EMTGlobal Systems for Dell MT Program: BLEU scores before and after retraining on held-out “recent” incremental data
[Chart: BLEU (axis 0 to 70) by language: AR-EG, CS, DA, DE, EL, ES-ES, FI, FR-FR, HE, HU, IT, JA, KO; series: 2013, 2014]
Enterprise Content Drift
Evidence from a typical client-specific MT system: EMTGlobal English-to-German Dell MT System:
February 2013 System: 565K client + 964K background segments
March 2014 System: 594K client + 6,795K background segments
Two test sets:
“Original” test set from February 2013 system build (1,200 segments)
“Incremental” test set extracted from incremental data (500 segments)
System Test Scores and Statistics:
Lang  System      Gloss Inconsist.  Orig. BLEU  Orig. MET  Orig. TER  Orig. LEN  Orig. OOVs  Incr. BLEU  Incr. MET  Incr. TER  Incr. LEN  Incr. OOVs
DE    Feb. 2013   55.7 %            51.0        63.4       38.2       101.2      63          41.7        56.6       45.0       101.2      107
DE    March 2014  24.8 %            52.9        64.2       36.9       100.5      33          60.5        69.9       30.3       99.9       31
Enterprise Content Drift
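The English-to-German figures reported in the table can be turned into explicit differences. The snippet below is only worked arithmetic on numbers copied from that table; the dictionary keys are illustrative names, not identifiers from any Safaba tool.

```python
# Scores copied from the English-to-German table (February 2013 vs. March 2014 systems).
feb_2013 = {"orig_bleu": 51.0, "incr_bleu": 41.7, "incr_ter": 45.0, "incr_oovs": 107}
mar_2014 = {"orig_bleu": 52.9, "incr_bleu": 60.5, "incr_ter": 30.3, "incr_oovs": 31}

# Gap between the two test sets for the older system: a rough size of the drift.
drift_gap = round(feb_2013["orig_bleu"] - feb_2013["incr_bleu"], 1)

# Change brought by the retrained system on each measure.
deltas = {key: round(mar_2014[key] - feb_2013[key], 1) for key in feb_2013}

print(drift_gap)  # 9.3
print(deltas["incr_bleu"], deltas["orig_bleu"])  # 18.8 1.9
```

The retrained system moves BLEU on the original test set only slightly, while the gain on the incremental test set is much larger, which is the pattern the slide presents as evidence of drift.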
Analysis of Content Drift Over Time:
Three EMTGlobal MT systems for Dell: English to Chinese, Spanish and German
Systems trained and deployed in February 2013
Test sets:
“Original” test set from February 2013 system build (1,200 segments)
“Incremental” test set extracted from 2014 incremental data (500 segments)
Data sets extracted from live Dell production projects in August-2013, December-2013 and March-2014 along with their post-edited references
Enterprise Content Drift
Analysis of Content Drift Over Time: BLEU Scores
[Chart: BLEU (axis 0 to 70) for Chinese, Spanish and German; series: Feb, Aug, Dec, Mar, Inc-2013, Inc-2014]
Content Drift Indicators
Goal: Establish real-time quantifiable measures that are indicative of Enterprise Content Drift
Immediate: Available immediately at MT production time, prior to any post-editing of the MT output
Predictive: Strongly correlate with expected MT evaluation score and post-editing effort
Similar to real-time MT Quality Estimation scores, but specific to capturing content drift
Three Measures:
Core Out-of-Vocabulary (OOV) Type and Token fractions: Fraction of source types (tokens) that were out-of-vocabulary in the core MT system (OOVs)
Source-side Unigram Coverage: Fraction of source type (token) unigrams that were observed in the MT system training data
Source-side Trigram Coverage: Fraction of source type (token) trigrams that were observed in the MT system training data
Identifying Enterprise Content Drift
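The three measures can be sketched in a few lines of Python. This is not Safaba's implementation: it assumes whitespace-tokenized source text, treats the core system vocabulary as a plain set of unigrams, and uses hypothetical function names and toy sentences.

```python
from typing import Iterable, List, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[Ngram]:
    """All contiguous n-grams of one tokenized sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def collect_ngrams(sentences: Iterable[List[str]], n: int) -> Set[Ngram]:
    """Set of n-grams observed in a corpus (for example, the training data)."""
    seen: Set[Ngram] = set()
    for tokens in sentences:
        seen.update(ngrams(tokens, n))
    return seen


def coverage(sentences: List[List[str]], known: Set[Ngram], n: int) -> Tuple[float, float]:
    """Return (type coverage, token coverage) of source n-grams found in `known`."""
    occurrences = [g for tokens in sentences for g in ngrams(tokens, n)]
    if not occurrences:
        raise ValueError("no n-grams of this order in the input")
    types = set(occurrences)
    type_cov = sum(g in known for g in types) / len(types)
    token_cov = sum(g in known for g in occurrences) / len(occurrences)
    return type_cov, token_cov


def oov_fractions(sentences: List[List[str]], vocabulary: Set[Ngram]) -> Tuple[float, float]:
    """Return (OOV type fraction, OOV token fraction) against a unigram vocabulary."""
    type_cov, token_cov = coverage(sentences, vocabulary, 1)
    return 1.0 - type_cov, 1.0 - token_cov


# Toy data: "training" text and newly arriving source content.
training = [["restart", "the", "laptop"], ["update", "the", "driver"]]
incoming = [["restart", "the", "tablet"], ["update", "the", "tablet", "driver"]]

oov_types, oov_tokens = oov_fractions(incoming, collect_ngrams(training, 1))
trigram_cov = coverage(incoming, collect_ngrams(training, 3), 3)
print(f"OOV types {oov_types:.2f}, OOV tokens {oov_tokens:.2f}, trigram coverage {trigram_cov}")
```

All three values need only the source text and the training-side n-gram sets, which is why they are available before any post-editing takes place.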
Content Drift Indicators
Performance of Content Drift Indicators on Dell EMTGlobal Systems:
OOVs (Fraction of Tokens)
Identifying Enterprise Content Drift
[Chart: OOVs as a fraction of tokens (axis 0.00 to 6.00) for Chinese, Spanish and German; series: Feb, Aug, Dec, Mar, Inc-2013, Inc-2014]
Content Drift Indicators
Performance of Content Drift Indicators on Dell EMTGlobal Systems:
Source Trigram Coverage
Identifying Enterprise Content Drift
[Chart: Source Trigram Coverage (axis 0.00 to 80.00) for Chinese, Spanish and German; series: Feb, Aug, Dec, Mar, Inc-2013, Inc-2014]
“Overnight” Incremental Adaptation
Objective: Counter “content drift” and help maintain and accelerate post-editing productivity with fast and frequent incremental adaptation retraining
Setting: New additional post-edited client data is deposited and made available for adaptation in small incremental batches
Challenge: Full offline system retraining is slow and computationally intensive and can take several days
Safaba Solution: implement fast “light-weight” adaptations that can be executed, tested and deployed into production within hours (“overnight”)
Suffix-array variant of Moses supports rapid updating of indexed training data
Safaba Language Optimization Engine (automated post-editing module) supports rapid retraining
KenLM supports rapid rebuilding of language models
Currently in pilot testing with Welocalize and Dell
The Approach:
Goal: Rapid MT System Adaptation using Incremental Data
Current Approach: Language Optimization Engine (LOE) Incremental Retraining
Safaba EMTGlobal MT systems include a core MT engine and a target-side Language Optimization Engine
Retraining the LOE component is fast – typically within a few hours
Not equivalent to full MT system retraining, but effective in closing the gap
New Approach: EMTGlobal v4.0 Advanced Adaptation Technology:
Supports significantly improved client-specific adaptation within the core MT engine
Supports rapid incremental retraining of core MT engines
Much closer to full MT system retraining, in a time frame similar to LOE retraining
Will be available in late Q4 of 2014
Safaba Overnight Retraining
The Approach:
Full Solution: Overnight Retraining
Incremental data from post-edited MT projects is delivered to Safaba
Incremental system retraining is launched automatically, completed within hours
Newly-adapted version of the MT system is automatically tested and QAed for quality
Newly-adapted version of the MT system is deployed into production
Safaba Overnight Retraining
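The four steps of the full solution (deliver, retrain, test, deploy) can be pictured as a small control loop. The sketch below is an illustration under stated assumptions, not Safaba's pipeline: the `Engine` class, the `retrain` and `score` callables, and the rule of deploying only when the score does not regress are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

SegmentPair = Tuple[str, str]  # (source segment, post-edited target)


@dataclass
class Engine:
    version: int
    training_pairs: List[SegmentPair]


def overnight_retrain(
    current: Engine,
    incremental: List[SegmentPair],
    retrain: Callable[[Engine, List[SegmentPair]], Engine],
    score: Callable[[Engine], float],
    min_gain: float = 0.0,
) -> Engine:
    """Retrain on newly delivered post-edits; deploy only if QA does not regress."""
    if not incremental:
        return current  # nothing new was delivered
    candidate = retrain(current, incremental)
    if score(candidate) - score(current) >= min_gain:
        return candidate  # passes the (assumed) quality gate: goes to production
    return current  # fails the gate: production engine stays as it was


# Toy stand-ins for the real retraining and evaluation steps.
def toy_retrain(engine: Engine, pairs: List[SegmentPair]) -> Engine:
    return Engine(engine.version + 1, engine.training_pairs + pairs)


def toy_score(engine: Engine) -> float:
    return float(len(engine.training_pairs))


production = Engine(version=1, training_pairs=[("Hello", "Hallo")])
production = overnight_retrain(production, [("Thanks", "Danke")], toy_retrain, toy_score)
print(production.version)  # 2
```

Passing the retraining and scoring steps in as callables keeps the loop independent of whichever component is being adapted.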
The Pilot Project
Pilot project with Welocalize to assess impact of Overnight Retraining on Safaba EMTGlobal Dell MT systems, using samples of real post-edited translation project data
Setup:
Languages: English to Chinese, Spanish and German
Baseline Systems: 2014 retrained Dell EMTGlobal 3.0 MT systems
Incremental Data: Three batches of incremental data from live translation projects
Methodology:
Three versions of the MT systems: Baseline; Baseline + Retrained on Data Set #1; Baseline + Retrained on Data Set #1 & #2
MT Evaluation:
Translate Data Set #3 (unseen) with the three versions of the MT system
Assess impact on translation performance using automated MT evaluation metrics
Additional analysis using Safaba “Content Drift Indicators”
Safaba Overnight Retraining
Data
            Original number of segments    Number of segments post-filtering
            Set 1    Set 2    Set 3        Set 1    Set 2    Set 3
ENUS-ESXL   1108     4553     704          926      2411     528
ENUS-ZHCN   3191     2181     1328         1143     1084     714
ENUS-DEDE   3043     1220     2270         2325     977      1466
Safaba Overnight Retraining
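The data table implies how much of each batch survived filtering. The snippet below is worked arithmetic on the counts copied from that table; the filtering criteria themselves are not described in the slides, and the helper name is hypothetical.

```python
# (original, post-filtering) segment counts per data set, copied from the data table.
counts = {
    "ENUS-ESXL": [(1108, 926), (4553, 2411), (704, 528)],
    "ENUS-ZHCN": [(3191, 1143), (2181, 1084), (1328, 714)],
    "ENUS-DEDE": [(3043, 2325), (1220, 977), (2270, 1466)],
}


def retention(original: int, kept: int) -> float:
    """Fraction of segments that remain after filtering."""
    return kept / original


for pair, sets in counts.items():
    rates = ", ".join(
        f"Set {i}: {retention(original, kept):.1%}"
        for i, (original, kept) in enumerate(sets, start=1)
    )
    print(f"{pair}  {rates}")
```

Retention varies widely between batches and language pairs, so the usable size of each incremental batch is noticeably smaller than its delivered size.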
Pilot Results: Automated Metric Scores
English-to-Chinese: Incremental Adaptation of Language Optimization Engine (LOE)
Incrementally retraining on Data Sets #1 & #2 results in gain of +3.0 BLEU points on Data Set #3
[Chart: BLEU, METEOR and TER (axis 30 to 70); series: 2013 System, 2014 Baseline, Baseline+DS1, Baseline+DS1&2]
Safaba Overnight Retraining
Pilot Results: Content Drift Indicator Statistics
English-to-Chinese: Incremental Adaptation of Language Optimization Engine (LOE)
Adding Data Sets #1 & #2 reduces Data Set #3 OOVs by 0.3%, improves unigram coverage by 0.36% and improves trigram coverage by 14.22%
[Chart: OOV Tokens (axis 0.00% to 7.00%); Unigrams Covered and Trigrams Covered (axis 0.4 to 1); series: 2013 System, 2014 Baseline, Baseline+DS1, Baseline+DS1&2]
Safaba Overnight Retraining
Preliminary Results: Advanced Adaptation with EMTGlobal v4.0
English-to-Chinese: Incremental Adaptation with EMTGlobal v4.0
Incrementally retraining on Data Sets #1 & #2 results in gain of +6.8 BLEU points on Data Set #3
[Chart: BLEU, METEOR and TER (axis 30 to 75); series: 2014 Baseline, Baseline+DS1&2]
Safaba Overnight Retraining
Summary of Pilot Results
Excellent results for English-to-Chinese!
Spanish and German results show no gain or loss in MT accuracy as a result of LOE incremental retraining with the available data sets
Performance on Data Set #3 remains completely flat with both retrainings according to all automated metrics
Data analysis with Content Drift Indicators reveals that Data Sets #1 & #2 for these two language pairs did not contain novel translations sufficient for improving MT performance on Data Set #3
No significant reduction in Data Set #3 OOVs
No significant improvement in coverage of source-side n-grams
Safaba Overnight Retraining
Translators were asked to compare each engine iteration using the same source strings
Day 1: read the MT output first. Then read the source text (ST). Then score the segment for Adequacy and Fluency
Adequacy
On a 4-point scale, rate how much of the meaning is rendered in the translation:
4 Everything
3 Most
2 Little
1 None
Fluency
Rate on a 4-point scale the extent to which the translation is well-formed grammatically, contains correct spellings, adheres to common use of terms, titles and names, is intuitively acceptable and can be sensibly interpreted by a native speaker:
4 Flawless
3 Good
2 Disfluent
1 Incomprehensible
*Based on TAUS Adequacy/Fluency Guidelines
Comparing the iterations: Compare the NEW MT output to that of the previous week and indicate with an X in the corresponding column whether it is better / worse / equal. If it is better or worse, indicate in the error categories & comment column what has improved or regressed.
Overnight Retraining Pilot Evaluation Setup
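One way to record and summarize judgments collected under this setup is sketched below. It is an assumption-laden illustration rather than the Welocalize scoring sheet: the `Judgment` record, its field names and the sample judgments are hypothetical, while the 4-point scales and the better / worse / equal comparison follow the slide.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import List

VERDICTS = {"better", "worse", "equal"}


@dataclass
class Judgment:
    adequacy: int  # 4 Everything, 3 Most, 2 Little, 1 None
    fluency: int   # 4 Flawless, 3 Good, 2 Disfluent, 1 Incomprehensible
    verdict: str   # new MT output compared with the previous iteration

    def __post_init__(self) -> None:
        for name in ("adequacy", "fluency"):
            if getattr(self, name) not in (1, 2, 3, 4):
                raise ValueError(f"{name} must be on the 4-point scale")
        if self.verdict not in VERDICTS:
            raise ValueError("verdict must be better, worse or equal")


def summarize(judgments: List[Judgment]) -> dict:
    """Mean adequacy and fluency plus a tally of the iteration comparison."""
    return {
        "mean_adequacy": mean(j.adequacy for j in judgments),
        "mean_fluency": mean(j.fluency for j in judgments),
        "verdicts": Counter(j.verdict for j in judgments),
    }


# Hypothetical judgments for three segments.
sample = [Judgment(3, 3, "equal"), Judgment(4, 3, "better"), Judgment(2, 2, "equal")]
summary = summarize(sample)
print(summary["mean_adequacy"], dict(summary["verdicts"]))
```

Keeping the verdict tally separate from the mean scores makes it visible when average scores stay flat even though individual segments were judged better or worse.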
• The human evaluation results we have at our disposal are a work in progress, based on evaluating a small subset of translated data and just one iteration of “overnight retraining”
• The improvements observed by automated metrics are not yet reflected in the human assessment
• Human evaluation results are consistent between Baseline+DS1 and Baseline+DS1&2: no degradation is introduced, but from the translator perspective no significant change in quality is captured; this possibly requires a larger evaluation set or a different approach to evaluation string selection
Translator feedback: improvement in fluency but no improvement in capturing the meaning of the whole sentence; punctuation has improved, but the translation still needs improvement; part of the sentence is now more fluent
Interim Chinese Pilot Results - Welocalize
We need to be more granular than “Quality” and look at “Relevance” (coverage and fluency will increase based on Safaba findings)
Our expected benefits from this approach (needs to be in sync with sufficient daily volumes):
No need to wait for scheduled retrainings
Two things will happen: the translator gets more used to post-editing, and the MT engines catch up with the changes in the source content in the “live” mode
Benefit for the client: once the actual ongoing engine relevance statistics have been captured, we’ll be able to predict higher throughputs and offer better discounts
Welocalize “Wins” from “Overnight Retraining”
Summary and Conclusions
Enterprise Content Drift is a natural and frequent phenomenon in large-scale commercial MT implementation projects
Enterprise MT systems need to constantly adapt or else are likely to significantly degrade in translation accuracy and value over time
Safaba’s Content Drift Indicators can identify and quantify content drift and can be effectively used to predict the impact of incremental MT system retraining
These indicators are being incorporated into Safaba’s new EMTGlobal MT Monitoring Portal
Safaba’s “Overnight Retraining” incremental adaptation is effective in combating content drift and maintaining/improving MT system performance over time and maintaining translator productivity levels
Safaba’s upcoming EMTGlobal v4.0 will dramatically enhance these capabilities!