11. Manuel Leiva & Juanjo Arevalillo (Hermes): Evaluation of Machine Translation


Page 1

Machine translation evaluation

Hermes Traducciones y Servicios Lingüísticos

Page 2

MT at Hermes

Pure RBMT engines with pre- and post-processing macros.

Texts from technical domains.

The applied-technology department has been working on MT engines for over a year.

Over 250,000 words post-edited with internal engines in the last year.

Average new word count for projects post-edited with internal engines: 9,000 words.

Page 3

Our purpose with MT evals

Automated metrics might help us:

predict PE time and productivity gains;

negotiate reasonable discounts;

evaluate quality of engines;

measure performance of applied-technology department;

not depend on human-reported data.


Page 4

What we hoped to find

We hoped some metric would correlate with productivity gain data provided by post-editors.

We gathered BLEU, F-Measure, METEOR and TER values.

Ideally, we would end up relying on automated metrics rather than time and productivity measurements reported by post-editors.

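For reference, this is roughly how such scores can be gathered with off-the-shelf tooling. A minimal sketch assuming a recent release of the sacrebleu Python package, with hypothetical segment lists standing in for a real project; METEOR is not part of sacrebleu and would need a separate tool (e.g. NLTK).

```python
# Minimal sketch: corpus-level BLEU and TER for one post-edited project,
# using the sacrebleu package. Segment lists are hypothetical placeholders.
import sacrebleu

raw_mt = [
    "Raw MT output for segment 1.",
    "Raw MT output for segment 2.",
]
post_edited = [
    "Post-edited version of segment 1.",
    "Post-edited version of segment 2.",
]

# sacrebleu reports both metrics on a 0-100 scale.
bleu = sacrebleu.corpus_bleu(raw_mt, [post_edited])  # higher is better
ter = sacrebleu.corpus_ter(raw_mt, [post_edited])    # lower is better

print(f"BLEU: {bleu.score:.2f}")
print(f"TER:  {ter.score:.2f}")
```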

Pages 5-6

What we hoped to find

[Scatter plots: automated metric scores plotted against productivity gain %]

Pages 7-8

What we actually found: No correlation

[Scatter plots: BLEU, F-Measure, TER and METEOR scores plotted against productivity gain %, showing no correlation]
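One way to quantify that lack of correlation is a Pearson correlation coefficient between each metric and the reported productivity gain. A minimal sketch assuming scipy and per-project scores that have already been collected; all numbers below are hypothetical placeholders, not the data behind the charts.

```python
# Sketch: Pearson correlation between automated metric scores and the
# productivity gains reported by post-editors, across projects.
# All numbers are hypothetical placeholders.
from scipy.stats import pearsonr

# project id -> (BLEU, TER, reported productivity gain %)
projects = {
    "proj-01": (62.1, 28.4, 35.0),
    "proj-02": (48.7, 41.9, 60.0),
    "proj-03": (55.3, 33.0, 20.0),
    "proj-04": (70.2, 22.5, 45.0),
}

bleu_scores = [v[0] for v in projects.values()]
ter_scores = [v[1] for v in projects.values()]
gains = [v[2] for v in projects.values()]

for name, scores in (("BLEU", bleu_scores), ("TER", ter_scores)):
    r, p = pearsonr(scores, gains)
    print(f"{name} vs productivity gain: r = {r:.2f} (p = {p:.2f})")
```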

Page 9

Reasons for the variability

Different CAT environments (Trados Studio, memoQ, Idiom, TagEditor, etc.).

Different engines (per domain, per client, etc.).

Different clients, different needs.

Different post-editors.

Or, with the same post-editor, post-editing skills that change over time.

Different word volumes.

Specific productivity or consistency-enhancement processing can affect metrics negatively.


Page 10

Productivity-enhancement example

Source: Add events as described in Adding Events to a Model.

PE: Agregue los eventos como se describe en Adición de eventos a un modelo.

Raw 1: Agregue los eventos como se describe en la adición de los eventos a un modelo.

Raw 2: Agregue los eventos como se describe en Adding Events to a Model.

Scores:

          Raw 1    Raw 2
BLEU      68.59    53.33
TER       17.65    29.41

Metrics for Raw 1 are significantly better, but Raw 2 is faster to post-edit thanks to automatic terminology insertion tools (such as Xbench).
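For illustration, the two raw outputs above can be scored against the post-edited sentence at segment level. A minimal sketch assuming sacrebleu; exact values depend on tokenizer settings and will not necessarily match the figures on the slide.

```python
# Sketch: sentence-level BLEU and TER for the example above, scored against
# the post-edited sentence with sacrebleu.
import sacrebleu

post_edited = "Agregue los eventos como se describe en Adición de eventos a un modelo."
raw_1 = "Agregue los eventos como se describe en la adición de los eventos a un modelo."
raw_2 = "Agregue los eventos como se describe en Adding Events to a Model."

for name, hyp in (("Raw 1", raw_1), ("Raw 2", raw_2)):
    bleu = sacrebleu.sentence_bleu(hyp, [post_edited])  # 0-100, higher is better
    ter = sacrebleu.sentence_ter(hyp, [post_edited])    # 0-100, lower is better
    print(f"{name}: BLEU = {bleu.score:.2f}, TER = {ter.score:.2f}")
```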

Page 11

Human evaluation

Adequacy: How much of the meaning expressed in the gold-standard translation or the source is also expressed in the target translation?

4. Everything
3. Most
2. Little
1. None

Fluency: To what extent is the target translation grammatically well formed, without spelling errors, and experienced as natural/intuitive language by a native speaker?

4. Flawless
3. Good
2. Disfluent
1. Incomprehensible


Source: TAUS MT evaluation guidelines https://evaluation.taus.net/resources/adequacy-fluency-guidelines
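Purely as an illustration (this is not part of the TAUS guidelines themselves), averaging such 1-4 judgements across reviewers is straightforward; a minimal sketch with made-up segment ids, reviewer names and scores.

```python
# Sketch: averaging adequacy/fluency judgements on the 1-4 scale described
# above. Segment ids, reviewers and scores are hypothetical.
from statistics import mean

ratings = [
    # (segment id, reviewer, adequacy 1-4, fluency 1-4)
    ("seg-001", "reviewer-a", 4, 3),
    ("seg-001", "reviewer-b", 3, 3),
    ("seg-002", "reviewer-a", 2, 2),
    ("seg-002", "reviewer-b", 3, 2),
]

adequacy = mean(r[2] for r in ratings)
fluency = mean(r[3] for r in ratings)
print(f"Mean adequacy: {adequacy:.2f} / 4")
print(f"Mean fluency:  {fluency:.2f} / 4")
```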

Page 12

Conclusions

We combine automated metrics with time/productivity data reported by post-editors for the final evaluation of internal MT performance.

Poor post-editing skills or any project-specific contingency can be counter-balanced with good automated metrics.

We look for qualitative information in automated metrics, not quantitative.

BLEU values of 65 and 70 for two different engines tell us that both are good engines, not that one will render 5% better results than the other.
