tauyou MT platform: the basics

Diego Bartolome

performance demanded in high-end markets

performance demanded in low-end markets

sustaining technology

disruptive technology

Objectives for Machine Translation

Productivity gains

Direct cost reduction

Quality consistency

New uses for Machine Translation

Multilingual customer support

Social Media monitoring

Applications enabled by Big Data

Internet of Everything / Internet of Things

Speech-to-Speech translation


What is your experience with MT?

1. Quality Metrics

2. Cost reduction

3. Impact on Delivery Times

4. Feedback on quality

5. Your Feelings

Machine Translation Types

Google/Bing Translator vs. tauyou

Advantages: big(ger) data

State-of-the-art technology

Costs of Machine Translation

Internal development – people and time

Free tools – Google + Bing

DIY solutions

Traditional pricing model

tauyou managed solution

Revenue from Machine Translation

Translation as a Service

Private Machine Translation Portal

MT of internal communication (flat rate)


Questions

1. Where do you provide value now?

2. Where do you think the value will be?

3. How important is confidentiality?

4. Do you care about control?

5. How much could you invest in MT?

(time, people, money)

6. When will your solution be available?

On Language Quality

Some Languages Sorted

From EN into

1) FR, ES, PT, IT

2) DE, NL, HE, DA, NO, SV

3) ZH, JA, RU

4) KO, AR, TR, HI

On Domain Quality

Who is willing to pay?

Where does your revenue come from?

What are your key skills?

What domains achieve good quality?

… Quality Order of your domains ...

Questions

1. What is your main motivation?

2. Can you try more than 1 domain?

3. Can you train at least 2 language pairs?

4. Can you pilot several MT vendors?

5. What are your expectations?

Data acquisition

OPUS corpora

WMT workshops


Multilingual websites


Corpora building

Related vs. unrelated materials

Percentage of out-of-domain

Does monolingual data help?

Corpora extension with linguistic processing

Ad-hoc corpus for file translation

The more, the better?

Data cleaning

Clean translation memories

Length, punctuation, terminology, …

Inconsistencies, repetitions, ...

Segment splitting

Optimize weight of most frequent n-grams

Validate their translations

Add out-of-domain data (optimization)


Data cleaning and selection are key processes

Simply adding more data may harm quality
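
The cleaning checks listed above (length, punctuation, repetitions, inconsistencies) can be sketched as a simple filter over translation memory segments. This is a minimal illustration, not the tauyou pipeline: the function names, the thresholds (`max_ratio`, `max_len`) and the choice of checks are assumptions made for the example.

```python
from collections import defaultdict

def clean_segments(pairs, max_ratio=3.0, max_len=100):
    """Filter (source, target) pairs; thresholds are illustrative assumptions."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # drop empty segments
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src > max_len or n_tgt > max_len:
            continue  # drop overly long segments
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # drop pairs with a suspicious length ratio
        if (src[-1] in ".?!") != (tgt[-1] in ".?!"):
            continue  # drop pairs whose final punctuation disagrees
        if (src, tgt) in seen:
            continue  # drop exact repetitions
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

def inconsistent_sources(pairs):
    """Return sources that have more than one distinct translation."""
    targets = defaultdict(set)
    for src, tgt in pairs:
        targets[src.strip()].add(tgt.strip())
    return {s: sorted(t) for s, t in targets.items() if len(t) > 1}
```

Inconsistent sources are only reported, not removed, because deciding which translation is correct usually needs a linguist or a glossary.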

Training strategies

One single system with all TMs

+ glossaries

+ linguistic processing input/output

+ forbidden words lists

Layered approach

Generic → domain → subdomain → client

Model optimization

Filter the translation tables

Remove the garbage + tune weights

Optimize language models

Adapt them to the translation purpose

Tune parameters correctly

Tune set, test set, optimization parameters

Improve tokenization, recasing, ...
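
As an illustration of filtering a translation table, the sketch below prunes low-scoring entries and keeps only the best few candidates per source phrase. It assumes a Moses-style text phrase table with fields separated by `|||` and the direct translation probability as the third score; the slides do not name a toolkit or format, so the field layout, `score_index`, `min_score` and `top_k` are all assumptions.

```python
from collections import defaultdict

def prune_phrase_table(lines, score_index=2, min_score=0.01, top_k=20):
    """Keep at most top_k entries per source phrase, dropping weak ones.

    Assumes lines like: "source ||| target ||| s0 s1 s2 s3 [||| ...]".
    """
    by_source = defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 3:
            continue  # malformed line
        try:
            scores = [float(s) for s in fields[2].split()]
        except ValueError:
            continue  # non-numeric scores
        if score_index >= len(scores) or scores[score_index] < min_score:
            continue  # too unlikely to be useful
        by_source[fields[0]].append((scores[score_index], line.rstrip("\n")))

    kept = []
    for entries in by_source.values():
        entries.sort(key=lambda e: e[0], reverse=True)  # best score first
        kept.extend(line for _, line in entries[:top_k])
    return kept
```

Pruning makes the table smaller and decoding faster; whether it also improves quality depends on the data and should be checked on a held-out test set.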

Workflow integration

Use MT as a secondary TM

Bilingual pre-translated translation files

CAT tool integration

Differentiated workflow

Continuous improvement


Use updated TMs in new trainings

Immediate (incremental) retraining

Rule-based automatic post-editing

Selective pre- and/or post-processing

Source content optimization
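
Rule-based automatic post-editing can be as simple as an ordered list of regular-expression substitutions applied to the MT output. The rules below are hypothetical examples, not rules from the presentation, and they are language-specific: French, for instance, requires a space before some punctuation marks.

```python
import re

# Hypothetical rules, applied in order: (pattern, replacement)
RULES = [
    (re.compile(r"\s+([,.;:!?])"), r"\1"),                   # no space before punctuation
    (re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE), r"\1"),   # collapse a doubled word
    (re.compile(r" {2,}"), " "),                             # collapse repeated spaces
]

def post_edit(text, rules=RULES):
    """Apply each substitution rule in turn to the MT output."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text.strip()
```

In practice such rules are typically collected from recurring post-editor corrections, so that each fix is made once rather than in every job.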

Linguistic processing notes

In the source and/or target language

Grammar checking

Entity detection

Proper nouns, alphanumeric words, ...

Compound word splitting

Sentence reordering
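
One common use of entity detection is to protect alphanumeric words (part numbers, model names) from being altered by the MT engine: mask them before translation and restore them afterwards. The sketch below shows that idea; the placeholder format `__ENTn__` and the function names are assumptions for the example, and a real engine would need to be configured to pass the placeholders through unchanged.

```python
import re

# A word containing at least one digit and at least one ASCII letter
ALNUM = re.compile(r"\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b")

def mask_entities(text):
    """Replace alphanumeric words with numbered placeholders."""
    found = []

    def replace(match):
        found.append(match.group(0))
        return f"__ENT{len(found) - 1}__"

    return ALNUM.sub(replace, text), found

def unmask_entities(text, found):
    """Put the original words back in place of the placeholders."""
    for index, entity in enumerate(found):
        text = text.replace(f"__ENT{index}__", entity)
    return text
```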

The Post-editor profile

Do the skills needed differ from those for translation?

Post-editing guidelines

Full vs. light post-editing



Do you have the right resources to start?

Quality Metrics

SMT metrics: BLEU, NIST

Feedback from translators

Translation time vs. Post-editing time

Word Error Rate (WER) or Edit Distance

Cost reduction


Are you able to measure?
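
Word Error Rate is the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization; comparing the raw MT output against its post-edited version this way gives a rough measure of post-editing effort.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

WER can exceed 1.0 when the hypothesis is much longer than the reference, so it is best read alongside post-editing time rather than on its own.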

"Change before you have to." (Jack Welch)