European Patent Office Wolfgang Täger December 2006 European Patent Office European Machine Translation Programme

Page 1

European Patent Office

Wolfgang Täger

December 2006

European Machine Translation Programme

Page 2

Programme Partners and Goals

• Trigger: Success of JP-EN patent translation

• Agreement EPO - Member States

1. MT of patents / abstracts / communications to/from English

2. Three language pairs per year

3. First three languages: FR - DE - ES

• Candidates for next year: Swedish, Dutch, Italian, Romanian, Greek

Page 3

MT engine

Trial with SMT system (Language Weaver)

Call for tender: Winner Worldlingo (Systran)

Going public (esp@cenet): December 2006

Needed: Improve translation by specific dictionaries

Page 4

Dictionary format

Desiderata

• open standard

• XML-Unicode

• support features of MT engines

• support conditional translations (e.g. based on IPC)

Not intended for terminology work: no definitions; the focus is lexical, not semantic.

OLIF format was chosen
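
The "conditional translations (e.g. based on IPC)" requirement can be illustrated with a minimal sketch. It is not the OLIF schema and not the engine's actual mechanism: the `Entry` class, the `translate` function and the sample entry are hypothetical, invented here only to show a translation that depends on the IPC symbol of the document being translated.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One source term with a default translation and IPC-conditioned overrides.

    Hypothetical structure for illustration; not an OLIF entry.
    """
    source: str
    default: str
    # Maps an IPC symbol prefix (e.g. "F16L") to the translation preferred there.
    by_ipc: dict[str, str] = field(default_factory=dict)


def translate(entry: Entry, ipc_symbol: str) -> str:
    """Return the translation whose IPC prefix matches most specifically."""
    best_prefix = ""
    for prefix in entry.by_ipc:
        if ipc_symbol.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix = prefix
    return entry.by_ipc[best_prefix] if best_prefix else entry.default


# Illustrative entry (invented for this sketch): German "Leitung".
leitung = Entry(
    source="Leitung",
    default="line",
    by_ipc={"F16L": "pipe", "H01B": "conductor"},
)

print(translate(leitung, "F16L 9/02"))   # pipe
print(translate(leitung, "G06F 17/28"))  # line
```

Longest-prefix matching is one possible design choice: it lets a subclass-level condition override a broader one without ordering the entries by hand.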

How to get dictionaries? By bilingual term extraction!

Page 5

Available corpora

560,000 EP-B publications => claims in EN, DE, FR

300,000 DE-T2 publications

37,000 ES-B3/T3 publications

=> Align corpora for term extraction, concordancing, translation memory (and SMT)

Diagram: contents of each publication type (CL = claims, DESC = description)

EP-B1: CL EN, CL FR, CL DE; DESC in EN or FR or DE

DE-T2: (CL DE), DESC DE

ES-B3/T3 (LaTeX): CL ES, DESC ES

Page 7

Alignment & Extraction

Alignment: Trial at EPO with internally developed software

External companies did not improve on this result during the call for tender.

Page 8

Alignment & Extraction

Call for tender for bilingual term extraction

Winner: DFKI

1. Alignment of corpora, POS tagging, identification of terms

2. Pairing of terms using clues such as co-occurrence score, string similarity, grammatical clues, position, available dictionaries, ...

3. Providing further information such as gender, inflection, transitivity, countability, ...
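
Step 2 can be sketched for two of the listed clues, co-occurrence score and string similarity. This is not DFKI's implementation: the Dice coefficient, the `difflib` ratio, the weights and the toy corpus are assumptions chosen for illustration, and terms are matched by plain substring search rather than by the POS-based term identification of step 1.

```python
from difflib import SequenceMatcher


def dice(src_term, tgt_term, aligned_pairs):
    """Dice coefficient over aligned segments: 2 * |both| / (|src| + |tgt|)."""
    n_src = n_tgt = n_both = 0
    for src_seg, tgt_seg in aligned_pairs:
        in_src = src_term in src_seg
        in_tgt = tgt_term in tgt_seg
        n_src += in_src
        n_tgt += in_tgt
        n_both += in_src and in_tgt
    total = n_src + n_tgt
    return 2 * n_both / total if total else 0.0


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]; helps with cognates."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_score(src_term, tgt_term, aligned_pairs, w_cooc=0.8, w_str=0.2):
    """Weighted combination of the two clues (the weights are arbitrary here)."""
    return (w_cooc * dice(src_term, tgt_term, aligned_pairs)
            + w_str * similarity(src_term, tgt_term))


def best_pairs(src_terms, tgt_terms, aligned_pairs):
    """Pick the highest-scoring target candidate for each source term."""
    return {
        s: max(tgt_terms, key=lambda t: pair_score(s, t, aligned_pairs))
        for s in src_terms
    }


# Toy aligned corpus, invented for this sketch.
pairs = [
    ("a valve connected to the pump", "ein mit der Pumpe verbundenes Ventil"),
    ("the pump has a housing", "die Pumpe hat ein Gehäuse"),
    ("the valve is closed", "das Ventil ist geschlossen"),
]
src_terms = ["pump", "valve", "housing"]
tgt_terms = ["Pumpe", "Ventil", "Gehäuse"]

print(best_pairs(src_terms, tgt_terms, pairs))
```

On this toy corpus each source term is paired with the target term that occurs in exactly the same segments. Real patent corpora would need the further clues listed above (grammatical clues, position, existing dictionaries) to resolve ties.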

Page 9

Validation & Concordancing

Development of OLIF editor at EPO

• Remove noise

• Correct entries

• Use concordancer (provides statistics based on parallel corpora)

=> DEMO
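
As an illustration of what a concordancer over parallel corpora can provide, here is a minimal sketch. It is not the EPO tool: the function names and the sample segments are invented. For a source term it lists each hit in context with its aligned target segment, and counts how often each candidate translation appears opposite the term.

```python
from collections import Counter


def concordance(term, aligned_pairs, width=20):
    """Return (left context, hit, right context, target segment) per matching pair.

    Only the first occurrence in each source segment is reported.
    """
    hits = []
    for src, tgt in aligned_pairs:
        start = src.lower().find(term.lower())
        if start == -1:
            continue
        end = start + len(term)
        hits.append((src[max(0, start - width):start], src[start:end],
                     src[end:end + width], tgt))
    return hits


def candidate_stats(term, candidates, aligned_pairs):
    """Count the segments in which each candidate appears opposite the term."""
    counts = Counter()
    for _, _, _, tgt in concordance(term, aligned_pairs):
        for cand in candidates:
            if cand.lower() in tgt.lower():
                counts[cand] += 1
    return counts


# Toy aligned corpus, invented for this sketch.
pairs = [
    ("the line is connected to the tank", "die Leitung ist mit dem Tank verbunden"),
    ("a signal line carries the data", "eine Signalleitung überträgt die Daten"),
    ("the production line is stopped", "die Fertigungslinie wird angehalten"),
]

for left, hit, right, tgt in concordance("line", pairs):
    print(f"{left}[{hit}]{right}  |  {tgt}")

stats = candidate_stats("line", ["Leitung", "Linie"], pairs)
print(stats["Leitung"], stats["Linie"])  # 2 1
```

Counts of this kind give a validator evidence for keeping, correcting or removing a dictionary entry.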

Page 10

OLIF format

• Support of more languages

• Clarification of inflection scheme

• Clarification of term vs lex approach

• Tools

Page 11

Relational database ??

Diagram labels: Concept, Term, SurfForm, Lemma, InflForm, LexType, RegEx, Infl, SemRelTransl, Naming

Page 12

Relational database ??

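Example values, paired by position with the diagram labels of the previous slide:

Concept: „hot drink ...“

Term: grüner Tee

SurfForm: grüner

Lemma: grün

InflForm: Nom. Sg. str. f. pos.

LexType: DE, Adj

RegEx: -er

Infl: iLike „klein“

Labels shown without example values: SemRelTransl, Naming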
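
One possible reading of this proposal can be sketched with SQLite. The slides only name the diagram labels, so every table layout, key and relationship below is an assumption, as is reading "iLike" as "inflects like". The sketch omits RegEx and SemRelTransl and uses only the example values shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (id INTEGER PRIMARY KEY, gloss TEXT);
CREATE TABLE term    (id INTEGER PRIMARY KEY, text TEXT, lang TEXT);
-- Assumed: Naming links a concept to the terms that name it.
CREATE TABLE naming  (concept_id INTEGER REFERENCES concept(id),
                      term_id    INTEGER REFERENCES term(id));
-- Assumed: LexType (language, part of speech) and Infl are stored per lemma.
CREATE TABLE lemma   (id INTEGER PRIMARY KEY, text TEXT, lang TEXT,
                      pos TEXT, infl_like TEXT);
-- Assumed: a surface form is one inflected form of a lemma inside a term.
CREATE TABLE surf_form (id INTEGER PRIMARY KEY,
                        term_id  INTEGER REFERENCES term(id),
                        lemma_id INTEGER REFERENCES lemma(id),
                        text TEXT, infl_form TEXT);
""")

# Example values taken from the slide.
conn.execute("INSERT INTO concept VALUES (1, 'hot drink ...')")
conn.execute("INSERT INTO term VALUES (1, 'grüner Tee', 'DE')")
conn.execute("INSERT INTO naming VALUES (1, 1)")
conn.execute("INSERT INTO lemma VALUES (1, 'grün', 'DE', 'Adj', 'klein')")
conn.execute(
    "INSERT INTO surf_form VALUES (1, 1, 1, 'grüner', 'Nom. Sg. str. f. pos.')"
)

# Walk from the concept down to the lemma of the inflected adjective.
row = conn.execute("""
SELECT c.gloss, t.text, s.text, l.text, l.infl_like
FROM concept c
JOIN naming n    ON n.concept_id = c.id
JOIN term t      ON t.id = n.term_id
JOIN surf_form s ON s.term_id = t.id
JOIN lemma l     ON l.id = s.lemma_id
""").fetchone()
print(row)
```

Splitting surface forms from lemmas is what would let a single inflection pattern serve many entries, which appears to be the motivation for the "iLike" reference on the slide.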

Page 13

End

Thank you!