Catalan daily goes Catalan

Post on 22-Apr-2015

LocWord 2012, A4
Magí Camps (La Vanguardia)
Blanca Vidal (Lucy Software)

[1] Introduction, background

[Bar chart: Newspapers in Catalan, net circulation. Bar values: 79,239; 45,309; 31,762; 15,662; 6,779 (axis from 0 to 90,000). Source: Estudi General de Mitjans (EGM), 2012]

Introduction, background

Results

Increase
• +4% of copies
• +7% of readers

Distribution
• 57% Spanish
• 43% Catalan

Introduction, background

Why a Catalan version?
• Celebration of LV’s 130th anniversary
• Normalization of the use of Catalan
• Investment to face the crisis
• Opportunity to consolidate LV’s hegemony

[2] Customer goals

• To publish two language editions of the same newspaper daily (supplements included).
• Journalists should be able to write in either of the two languages.
• Neither quality nor distribution timeframes should be affected.

Customer requirements (MT)
• Tailor-made system
• Complying with LV’s style guide
• Seamless integration into the journalists’ workflow
• Translation of Hermes XML and InDesign formats
• Reliability, high availability
• High performance

[3] Ramp-up phase

Project set-up

Work areas
• MT linguistic improvement/tuning
• Post-editing preparation
• MT system set-up and integration
• MT lexicon training

Duration
• 8 months (+ 3 months)

Staff
• LV: 10-12 in-house journalists
• Lucy: 3 computational linguists / lexicographers, 1 software developer
• Incyta: 2 professional post-editors

Important! On-site support

Subphases

TASKS Phase 1 Phase 2 Phase 3 Phase 4

Linguistic improvement/tuning

- Language-type definition x

- Creation of a corpus of real texts x x x x

- Analysis of the translation quality x x x x

- Error reporting (lexicon and grammar errors) x x x x

- Linguistic implementation (lex and grammar) x x x x

- Pre and post-editing filters x x x x

Post-editing preparation

- Gathering of MT post-editing guidelines x

- Evaluation of post-editing effort x x

- Creation and training of the post-editing team x

Technical set-up

- System set-up and integration x

- Preparation of XML converters x

Maintenance

- Lexicon maintenance training x

Duration 2 mo 3 mo 3 mo 3 mo

[a] Linguistic tuning

Language model

Corpus

Translation quality (TQ)

Analysis and error-reporting

Implementation

Accomplished improvement data

Linguistic tuning

Catalan language model
• no exclusion
• compliant with standards
• innovative in terminology
• dynamic in syntactical structures

Corpus
• ES: 500,000 transl. units – 8,300,000 words
• CA: 250,000 transl. units – 3,000,000 words

Conclusions
• No specific domains (except Sports)
• Culture: proper names
• Opinion: idioms, plays on words
• Errors not repetitive
• % style to be post-edited

Linguistic tuning

Translation quality
• Perfect: 74%
• Minimal post-editing: 24%
• Medium post-editing: 2%

Linguistic tuning

Analysis and error reporting
• Semi-automatic detection of missing words
• Terminology lists
• New and different translations, error reporting

Implementation
• Proper names [44.5% of the TUs]
• Idioms
• Alternatives

Linguistic tuning

Accomplished improvement data

• Work in figures
  - 40,000 lexicon entries (20,000 for each transl. direction)
  - Around 440 grammar rules
  - Around 7,200 words in the proper names files (each transl. direction)

• Non-measurable work
  - Understanding of the MT system
  - Understanding of the newspaper specificities
  - Support in the style guide taking into account MT

• Improvement
  - ES>CA: 41% diff => 35% better, 4% similar, 2% worse
  - CA>ES: 36% diff => 32% better, 3% similar, 1% worse
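The improvement figures can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the names `results`, `breakdown_total` and `net_gain` are not from the slides, and "net gain" (better minus worse, in percentage points) is an assumed convenience measure, not a metric the authors report.

```python
# Percentages as reported on the slide: share of output that differed after
# tuning ("changed") and how that changed output was judged.
results = {
    "ES>CA": {"changed": 41, "better": 35, "similar": 4, "worse": 2},
    "CA>ES": {"changed": 36, "better": 32, "similar": 3, "worse": 1},
}

def breakdown_total(row):
    """Sum of the better/similar/worse shares; should equal the changed share."""
    return row["better"] + row["similar"] + row["worse"]

def net_gain(row):
    """Hypothetical summary figure: better minus worse, in percentage points."""
    return row["better"] - row["worse"]

for direction, row in results.items():
    consistent = breakdown_total(row) == row["changed"]
    print(direction, "consistent:", consistent, "net gain:", net_gain(row))
```

For both directions the three shares add up to the reported difference, so the slide's breakdown is internally consistent.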

[b] Post-editing

Post-editing

• Metrics on translation volume
• Metrics on post-editing effort
• Specificities of the text
• Post-editors' workspace
• Post-editing resources
• Error reporting process and tools
• Post-editing team and profile

Post-editing: metrics

File | Total translation units | Lex/gram post-edition | % | Style post-edition | %
LV_2010-10-27 | 2,474 | 464 | 18.79% | 394 | 15.96%

Conclusions

• Different sections had different levels of post-editing
• What style corrections could be avoided?
• Post-editing speed: 1,000-1,500 words/h
• Daily volume: 75,000 words
• New post-editing team: 20 post-editors/12 editors

(= 42.512 words)
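The speed, volume and team-size figures in these conclusions can be combined into a rough workload estimate. This is a back-of-the-envelope sketch, not the authors' planning model: it assumes the 75,000 words are split evenly among the 20 post-editors and leaves the 12 editors out of the calculation. The function names are illustrative.

```python
DAILY_WORDS = 75_000          # daily volume from the slide
SPEED_RANGE = (1_000, 1_500)  # post-editing speed, words per hour
TEAM_SIZE = 20                # post-editors (assumption: editors not counted)

def person_hours(words, words_per_hour):
    """Total post-editing hours needed for the given volume."""
    return words / words_per_hour

def hours_per_post_editor(words, words_per_hour, team_size):
    """Hours each post-editor works if the volume is split evenly (assumption)."""
    return person_hours(words, words_per_hour) / team_size

for speed in SPEED_RANGE:
    total = person_hours(DAILY_WORDS, speed)
    each = hours_per_post_editor(DAILY_WORDS, speed, TEAM_SIZE)
    print(f"{speed} words/h: {total:.1f} person-hours, {each:.2f} h per post-editor")
```

Under these assumptions the daily edition needs between 50 and 75 person-hours of post-editing, or roughly 2.5 to 3.75 hours per post-editor.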

Post-editing: resources, workspace

Post-editors should be proficient in their skills BUT should also:

• Be trained on MT post-editing: post-editing guide
  - Classified frequent MT errors
  - Reference document for training
• Have an integrated workspace: CMS adapted to the new workflow
  - New processing status
  - New mark-ups
• Have resources at a click: resources on the intranet language portal
  - Bilingual style guide
  - Links to all reference dictionaries
  - MT portal for any journalist

Post-editing: resources, workspace

La Vanguardia’s intranet: linguistic portal

Post-editing: error reporting, team

Error reporting
• Crucial for continuous improvement
• Not automated (yet)
• Provide better support to error reporting

Definition of post-editing profile and team
• Proficient in Catalan
• Journalist background

[c] System integration

During phase 1: pre-production
• Pre-production set-up and installation
• Hermes XML converter
• Changes in the LT engine to translate InDesign files

During phase 3: production
• Production installation
• Tests (load, performance and stress)
• Performance: 500-1,200 words/sec
• Definition of the final installation size
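To put the measured throughput in perspective, the sketch below estimates how long the engine alone would need for one day's copy. It assumes the 75,000-word daily volume quoted in the post-editing metrics, a single engine working at a constant rate, and no queueing or file-conversion overhead; `translation_seconds` is an illustrative helper, not part of any product API.

```python
DAILY_WORDS = 75_000                 # daily volume quoted in the post-editing metrics
THROUGHPUT_RANGE = (500, 1_200)      # measured performance, words per second

def translation_seconds(words, words_per_second):
    """Raw machine translation time, ignoring conversion and queueing overhead."""
    return words / words_per_second

for rate in THROUGHPUT_RANGE:
    seconds = translation_seconds(DAILY_WORDS, rate)
    print(f"{rate} words/sec: {seconds:.1f} s ({seconds / 60:.1f} min)")
```

Even at the lower bound the raw translation step takes about two and a half minutes, which suggests that human post-editing, not MT throughput, dominates the daily schedule.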

System integration

• Production: balanced high performance (HP) and high availability (HA) configuration
• System requirements: standard Windows Server with a low hardware footprint (e.g. dual-core/quad-core 2.5-3 GHz CPU, 2-4 GB RAM, running Windows Server 2003/2008)

[Diagram: system architecture. Labels: Maintenance / Pre-production (Hermes, InDesign, Web Service); Language portal; Production (InDesign, Hermes, Web Service)]

[4] Operation: production process

Staff
• 20 post-editors
• 12 editors

Effort
• 30 min linguistic review
• 10 min journalistic review
• 70,000 words/day + supplements

Timeline
• Start: 5 p.m.
• First edition: 11.30 p.m.
• Second edition: 2.30 a.m.
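The timeline implies a fixed production window each evening. The following sketch computes the length of that window and the average word rate it implies. It assumes, purely for illustration, that the whole interval from the 5 p.m. start to the first-edition deadline is available for the 70,000 words (supplements excluded); `hours_between` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def hours_between(start, end):
    """Hours from start to end (both 'HH:MM'), rolling over midnight if needed."""
    fmt = "%H:%M"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 <= t0:
        t1 += timedelta(days=1)  # the deadline falls on the following day
    return (t1 - t0).total_seconds() / 3600

first_window = hours_between("17:00", "23:30")   # start to first edition
second_window = hours_between("17:00", "02:30")  # start to second edition

DAILY_WORDS = 70_000  # words/day from the slide, supplements excluded
required_rate = DAILY_WORDS / first_window  # average words per hour, whole team

print(first_window, second_window, round(required_rate))
```

The first edition leaves a 6.5-hour window and the second a 9.5-hour window, so the team as a whole has to clear on the order of 10,000 to 11,000 words per hour to meet the first deadline.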

Operation: production process

Challenge accomplished!

[5] Next goals

Success! Yes. Thanks to
• Close work and cooperation
• Three parties involved
• Time and effort investment
• Customisation

Next!
• How to reduce post-editing effort
• How to re-use post-edited text

Thank you for your attention

Magí Camps
La Vanguardia
mcamps@lavanguardia.es
www.lavanguardia.es

Blanca Vidal
Lucy Software Ibérica
blanca.vidal@lucysoftware.com
www.lucysoftware.com

Ignasi Navarro
Incyta
Ignasi_navarro@incyta.com
www.incyta.com