Machine translation evaluation
Hermes Traducciones y Servicios Lingüísticos
MT at Hermes
Pure RBMT engines with pre- and post-processing macros.
Texts from technical domains.
The applied-technology department has been working on MT engines for over a year.
Over 250,000 words post-edited with internal engines in the
last year.
Average new word count for projects post-edited with internal
engines: 9,000 words.
Our purpose with MT evals
Automated metrics might help us:
predict PE time and productivity gains;
negotiate reasonable discounts;
evaluate quality of engines;
measure performance of applied-technology department;
not depend on human-reported data.
What we hoped to find
We hoped some metric would correlate with productivity gain
data provided by post-editors.
We gathered BLEU, F-Measure, METEOR and TER
values.
Ideally, we would end up relying on automated metrics rather than time and productivity measurements reported by post-editors.
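As an illustration of the check we hoped would work, a minimal sketch follows; the per-project segments and gain figures are invented placeholders, and sacrebleu/scipy are assumed tools rather than the exact in-house pipeline.

```python
# Minimal sketch: does an automated metric track post-editor-reported
# productivity gains? All project data below is an invented placeholder;
# sacrebleu and scipy are assumed tools, not necessarily the in-house ones.
import sacrebleu
from scipy.stats import pearsonr

# One tuple per project: raw MT segments, post-edited segments (used as
# reference), and the productivity gain reported by the post-editor (%).
projects = [
    (["Agregue los eventos al modelo nuevo."],
     ["Agregue los eventos al nuevo modelo."], 40.0),
    (["Abra el archivo para editar el modelo."],
     ["Abra el archivo para modificar el modelo."], 25.0),
    (["Guarde el proyecto antes de cerrar la aplicación."],
     ["Guarde el proyecto antes de cerrar la aplicación."], 55.0),
]

bleu_scores, gains = [], []
for mt_segments, pe_segments, gain in projects:
    bleu_scores.append(sacrebleu.corpus_bleu(mt_segments, [pe_segments]).score)
    gains.append(gain)

# A strong positive Pearson r would have let us predict gains from BLEU alone.
r, p_value = pearsonr(bleu_scores, gains)
print(f"Pearson r between BLEU and productivity gain: {r:.2f} (p={p_value:.3f})")
```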
What we hoped to find
[Scatter plot: automated metric scores plotted against productivity gain % (the hoped-for correlation)]
What we actually found: No correlation
[Scatter plot: BLEU, F-Measure, TER, and METEOR scores plotted against productivity gain %; no correlation visible]
Reasons for the variability
Different CAT environments (Trados Studio, memoQ,
Idiom, TagEditor, etc.).
Different engines (per domain, per client, etc.).
Different clients, different needs.
Different post-editors.
Or, if the same post-editor, different post-editing skills over time.
Different word volumes.
Specific productivity- or consistency-enhancement processing can affect metrics negatively.
Productivity-enhancement example
Source: Add events as described in Adding Events to a Model.
PE: Agregue los eventos como se describe en Adición de eventos a un
modelo.
Raw 1: Agregue los eventos como se describe en la adición de los eventos a
un modelo.
Raw 2: Agregue los eventos como se describe en Adding Events to a Model.
Scores:
       Raw 1    Raw 2
BLEU   68.59    53.33
TER    17.65    29.41
Metrics for Raw 1 are significantly
better, but Raw 2 is faster to post-edit
thanks to automatic terminology
insertion tools (such as Xbench).
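For reference, segment-level scores like the ones above can be recomputed with an off-the-shelf library; the sketch below uses sacrebleu, which is an assumption (the slide does not name the scoring tool), so smoothing differences mean its numbers will not match the table exactly.

```python
# Segment-level BLEU and TER for the two raw MT outputs, scored against the
# post-edited segment as reference. sacrebleu is an assumed tool here; its
# sentence-level smoothing means the figures will not match the slide exactly.
import sacrebleu

pe    = "Agregue los eventos como se describe en Adición de eventos a un modelo."
raw_1 = "Agregue los eventos como se describe en la adición de los eventos a un modelo."
raw_2 = "Agregue los eventos como se describe en Adding Events to a Model."

for name, raw in (("Raw 1", raw_1), ("Raw 2", raw_2)):
    bleu = sacrebleu.sentence_bleu(raw, [pe]).score
    ter = sacrebleu.sentence_ter(raw, [pe]).score
    print(f"{name}: BLEU = {bleu:.2f}, TER = {ter:.2f}")
```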
Human evaluation
Adequacy: How much of the meaning expressed in the gold-standard translation or the source is also expressed in the target translation?
4. Everything
3. Most
2. Little
1. None
Fluency: To what extent is a target-side translation grammatically well formed, without spelling errors, and experienced as using natural/intuitive language by a native speaker?
4. Flawless
3. Good
2. Dis-fluent
1. Incomprehensible
Source: TAUS MT evaluation guidelines https://evaluation.taus.net/resources/adequacy-fluency-guidelines
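A minimal sketch of how such 4-point ratings can be aggregated per engine follows; the ratings and engine names are hypothetical placeholders, not TAUS data.

```python
# Aggregate TAUS-style adequacy/fluency ratings (4-point scales) per engine.
# The ratings and engine names below are hypothetical placeholders.
from statistics import mean

# One row per rated segment: (engine, adequacy 1-4, fluency 1-4).
ratings = [
    ("engine_A", 4, 3),
    ("engine_A", 3, 4),
    ("engine_B", 3, 2),
    ("engine_B", 2, 2),
]

for engine in sorted({row[0] for row in ratings}):
    adequacy = mean(a for e, a, _ in ratings if e == engine)
    fluency = mean(f for e, _, f in ratings if e == engine)
    print(f"{engine}: mean adequacy = {adequacy:.2f}, mean fluency = {fluency:.2f}")
```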
Conclusions
We combine automated metrics with time/productivity data reported by post-editors for the final evaluation of internal MT performance.
Poor post-editing skills or any project-specific contingency can be counterbalanced by good automated metrics.
We look for qualitative information in automated metrics, not
quantitative.
BLEU values of 65 and 70 for two different engines tell us both
are good engines, not that one will render 5% better results than
the other.