Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Multilingual Term Extraction as a Service from Acrolinx Ben Gottesman Michael Klemme Acrolinx CHAT2013

Presenters: Ben Gottesman and Michael Klemme (Acrolinx). This presentation is part of the TaaS project, funded by the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no. 296312.

Transcript of Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Page 1: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Multilingual Term Extraction as a Service from Acrolinx

Ben Gottesman

Michael Klemme

Acrolinx

CHAT2013

Page 2: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Definitions

term extraction: automatically identifying potential terms in a document (corpus)

multilingual term extraction: automatically identifying potential terms and their translations in a document and its translation (parallel corpus / translation memory)

(… or, if the source-language terminology already exists, just identify translations)

The wizard begins creating the bootable image.

Der Assistent beginnt mit der Erstellung des bootfähigen Image.

Page 3: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Synonyms

Identify same-language synonyms via translations in common

German: Die Spannungsversorgung für die Elektronik wird vom Speisegerät G526 sichergestellt.

English: The voltage supply for the electronics is maintained by the power supply unit G526.

German: Spannungsversorgung für interne Speisung (X3e)

English: Power supply for internal supply (X3e)

German: Unterspannung in der Stromversorgung

English: Undervoltage in the power supply

German terms: Spannungsversorgung, Stromversorgung

English terms: voltage supply, power supply

Page 4: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Outline

• What is multilingual term extraction?

• What is the workflow from customer perspective?

– customer use case examples

– show extraction results, demonstrate human validation

• How does the extraction work?

– how we identify candidates

• source-language candidates

• translation candidates

– how we filter translation candidates

– how we identify source-language synonyms

• What is Acrolinx and how does MTE fit in?

Page 6: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Workflow: Customer perspective

1. Customer provides translated documents

2. Acrolinx provides extracted multilingual term candidates to customer

3. Customer validates candidates

4. Validated results become (or are added to) customer’s term bank

Page 7: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Customer use cases, past examples

Use case 1

– de-<en,fr,es,it,pt> (mostly de-en)

– ~142,000 bilingual segments; ~2,685,000 tokens (total)

Use case 2

– de-<en,fr> (all data trilingual)

– ~132,000 bilingual segments; ~1,259,000 tokens

– data document-aligned, not segment-aligned, so extra step required

Use case 3

– en-de

– ~942,000 bilingual segments; ~25,000,000 tokens

– extract translations of a given list of keywords

– determine which keywords don’t occur in data
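
The last step of use case 3 (finding keywords that never occur in the data) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `normalize` and `missing_keywords` are hypothetical, matching is on whole lowercase tokens, and the segments are just the example sentences from these slides, not customer data.

```python
import re

def normalize(text):
    """Lowercase and keep only word tokens, joined by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))

def missing_keywords(keywords, segments):
    """Return the keywords that occur in none of the source segments.

    Matching is on whole tokens, so "volt" does not match "voltage".
    """
    # Pad with spaces so a keyword can only match at token boundaries.
    padded = [f" {normalize(segment)} " for segment in segments]
    return [
        keyword for keyword in keywords
        if not any(f" {normalize(keyword)} " in text for text in padded)
    ]

# Source-language (English) segments taken from the slide examples.
SEGMENTS = [
    "The wizard begins creating the bootable image.",
    "Undervoltage in the power supply",
]
KEYWORDS = ["bootable image", "power supply", "voltage supply", "wizard"]

print(missing_keywords(KEYWORDS, SEGMENTS))
```

A real run would probably also need lemmatization so that inflected forms count as occurrences; that is left out here.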

Page 8: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Results

• human validation in Excel

“Baugruppe” has been translated inconsistently into English in the past.

Mark respective translations as preferred/deprecated to guide translators in the future.
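
One way to surface such inconsistencies before validation is to group the extracted pairs by source term and list every term with more than one distinct translation. The sketch below assumes the results are available as (source, translation) pairs; the function name `inconsistent_terms` and the English translations shown are hypothetical, since the slide does not list the actual variants.

```python
from collections import defaultdict

def inconsistent_terms(pairs):
    """Map each source term with several distinct translations to those translations."""
    by_source = defaultdict(set)
    for source, translation in pairs:
        by_source[source].add(translation)
    return {
        source: sorted(translations)
        for source, translations in by_source.items()
        if len(translations) > 1
    }

# Hypothetical extraction results; the translations are made up for illustration.
PAIRS = [
    ("Baugruppe", "assembly"),
    ("Baugruppe", "module"),
    ("Baugruppe", "assembly"),
    ("Eingangsspannung", "input voltage"),
]

print(inconsistent_terms(PAIRS))
```

A terminologist would then mark one of the listed translations as preferred and the others as deprecated.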

Page 9: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Results

“Stromversorgung” and “Einspeisung” have translations in common.

→ automatically identified as possible synonyms, so same Cluster ID

To validate synonym link, edit Subcluster IDs to be the same.

Mark respective variants as preferred/deprecated to guide authors.

Page 11: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

How does the extraction work?

• Extract source-language term candidates from source-language text (unless source-language terminology exists)

– linguistics-based

• especially part-of-speech patterns

– same functionality built into the core Acrolinx product

The wizard begins creating the bootable image.
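
To give a rough idea of what a part-of-speech pattern can look like, the sketch below picks out maximal runs of adjectives and nouns that end in a noun. It is a minimal illustration, not the Acrolinx implementation: the pattern, the tag names, and the hand-assigned tags for the example sentence are all assumptions, and a real pipeline would get its tags from a tagger.

```python
from typing import List, Tuple

# Hand-assigned tags for the slide's example sentence (assumed tagging).
TAGGED = [
    ("The", "DET"), ("wizard", "NOUN"), ("begins", "VERB"),
    ("creating", "VERB"), ("the", "DET"), ("bootable", "ADJ"),
    ("image", "NOUN"), (".", "PUNCT"),
]

def extract_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return maximal token runs matching the pattern (ADJ|NOUN)* NOUN."""
    candidates = []
    run: List[Tuple[str, str]] = []
    # The sentinel at the end flushes a run that reaches the last token.
    for token, tag in tagged + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((token, tag))
            continue
        # The run has ended: drop trailing adjectives so it ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run:
            candidates.append(" ".join(tok for tok, _ in run))
        run = []
    return candidates

print(extract_candidates(TAGGED))  # ['wizard', 'bootable image']
```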

Page 12: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

How does the extraction work?

The wizard begins creating the bootable image.

Der Assistent beginnt mit der Erstellung des bootfähigen Image.

• Extract translation candidates of each source-language term candidate from target-language text

– use statistical phrase-alignment technology

– the same technology used in statistical machine translation
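
The slides do not say which alignment model is used. As a simplified stand-in, the sketch below trains IBM Model 1, a classic word-level alignment model from statistical machine translation, on a tiny made-up parallel corpus. It aligns single words rather than phrases, omits the NULL word, and the corpus and function names are hypothetical.

```python
from collections import defaultdict

# Toy parallel corpus (made up, lowercased, pre-tokenized): (source, target).
CORPUS = [
    ("the wizard".split(), "der assistent".split()),
    ("a wizard".split(), "ein assistent".split()),
    ("the image".split(), "das image".split()),
    ("an image".split(), "ein image".split()),
]

def ibm_model1(corpus, iterations=10):
    """Estimate t(target_word | source_word) with EM (IBM Model 1, no NULL word)."""
    t = defaultdict(lambda: 1.0)  # uniform start; only the ratios matter
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in corpus:
            for f in tgt:
                # E-step: share each target word among the source words.
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    options = {f: p for (f, e), p in t.items() if e == source_word}
    return max(options, key=options.get)

model = ibm_model1(CORPUS)
print(best_translation(model, "wizard"), best_translation(model, "image"))
```

The resulting translation probabilities are the kind of quantity from which a confidence score for each translation candidate could be derived.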

Page 13: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

How does the extraction work?

• Filter translation candidates based on:

– confidence score calculated from translation probabilities

• can adjust threshold to favour precision or recall

– surface characteristics (closed-class words, punctuation)

– term-candidacy of translation (if possible for language)

translation candidates for “Eingangsspannung” (pink = filtered out)
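
The first two filter criteria can be sketched as below. This is an illustration under assumptions: the candidate list and its confidence scores are invented (the slide's actual figure is not reproduced in this transcript), the closed-class word list is a tiny hypothetical sample, and the third criterion (term-candidacy of the translation) is left out.

```python
import string

# Small illustrative closed-class word list for English (not exhaustive).
CLOSED_CLASS = {"the", "a", "an", "of", "for", "in", "to", "and", "is"}

# Invented candidates for "Eingangsspannung" with invented confidence scores.
CANDIDATES = [
    ("input voltage", 0.62),
    ("the input voltage", 0.21),
    ("voltage", 0.09),
    ("input voltage ,", 0.05),
    ("input", 0.02),
]

def keep_candidate(text, confidence, threshold):
    """Apply the confidence threshold and the surface checks to one candidate."""
    tokens = text.lower().split()
    if not tokens or confidence < threshold:
        return False
    # Reject candidates containing a token made up only of punctuation.
    if any(all(ch in string.punctuation for ch in tok) for tok in tokens):
        return False
    # Reject candidates that start or end with a closed-class word.
    if tokens[0] in CLOSED_CLASS or tokens[-1] in CLOSED_CLASS:
        return False
    return True

def filter_candidates(candidates, threshold):
    """Return the texts of the candidates that survive all filters."""
    return [text for text, conf in candidates if keep_candidate(text, conf, threshold)]

print(filter_candidates(CANDIDATES, 0.05))  # ['input voltage', 'voltage']
print(filter_candidates(CANDIDATES, 0.10))  # ['input voltage']
```

Raising the threshold from 0.05 to 0.10 drops the weaker candidate, which trades recall for precision.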

Page 14: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

• Identify synonyms (‘cluster’ candidates)

– link confidence based on the degree to which translations are shared

– can adjust threshold to favour precision or recall of links

How does the extraction work?

cluster around “Stromwandler” (minimum link confidence threshold = 0.01)

Page 15: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

• Identify synonyms (‘cluster’ candidates)

– link confidence based on the degree to which translations are shared

– can adjust threshold to favour precision or recall of links

How does the extraction work?

cluster around “Stromwandler” (minimum link confidence threshold = 0.03)
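
The effect of the threshold on clusters can be illustrated with a small sketch. The slides do not give the formula for link confidence, so Jaccard overlap of the translation sets is assumed here purely for illustration; the translation sets and the thresholds are also made up and differ from the 0.01 and 0.03 shown on the slides.

```python
from itertools import combinations

# Hypothetical translation sets for some German term candidates.
TRANSLATIONS = {
    "Stromversorgung": {"power supply", "power supply unit"},
    "Spannungsversorgung": {"power supply", "voltage supply"},
    "Einspeisung": {"power supply", "infeed", "feed"},
    "Baugruppe": {"assembly", "module"},
}

def link_confidence(a, b):
    """Assumed measure: Jaccard overlap of two translation sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(translations, threshold):
    """Group terms connected by links with confidence >= threshold (> 0)."""
    parent = {term: term for term in translations}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(translations, 2):
        if link_confidence(translations[a], translations[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for term in translations:
        groups.setdefault(find(term), []).append(term)
    return sorted(sorted(group) for group in groups.values())

print(cluster(TRANSLATIONS, 0.2))
print(cluster(TRANSLATIONS, 0.3))
```

With the lower threshold, the three power-related terms form one cluster; with the higher one, "Einspeisung" drops out because it shares only one of four translations with each of the others.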

Page 17: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

What is Acrolinx?

Acrolinx is Content Optimization Software. It helps authors make their text

– more correct,

– more consistent,

– and more readable.

Page 18: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

What is Acrolinx?

Acrolinx is Content Optimization Software. It helps authors make their text

– more correct,

– more consistent,

– and more readable.

Consistent use of terminology is an important factor in the readability of text. Acrolinx provides:

– term extraction (monolingual, aka term harvesting)

– terminology management

– term checking

Multilingual Term Extraction as a Service is a natural complement to the prior terminology functions.

Page 19: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Visit Acrolinx at tekom!

→ Hall 3, Stand 310

Acrolinx @ tekom

Page 21: Multilingual Term Extraction as a Service from Acrolinx, CHAT2013

Questions?