Multilingual Term Extraction as a Service from Acrolinx, CHAT2013
Multilingual Term Extraction
as a Service from Acrolinx
Ben Gottesman
Michael Klemme
Acrolinx
CHAT2013
term extraction: automatically identifying potential terms in a document (corpus)
multilingual term extraction: automatically identifying potential terms and their translations in a document and its translation (parallel corpus / translation memory)
(… or, if the source-language terminology already exists, just identify translations)
The wizard begins creating the bootable image.
Der Assistent beginnt mit der Erstellung des bootfähigen Image.
Definitions
Identify same-language synonyms via translations in common
Synonyms
German: Die Spannungsversorgung für die Elektronik wird vom Speisegerät G526 sichergestellt.
English: The voltage supply for the electronics is maintained by the power supply unit G526.

German: Spannungsversorgung für interne Speisung (X3e)
English: Power supply for internal supply (X3e)

German: Unterspannung in der Stromversorgung
English: Undervoltage in the power supply

German terms: Spannungsversorgung, Stromversorgung
English terms: voltage supply, power supply
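The idea on this slide can be sketched in a few lines of Python. The term pairs below are read off the three example segments; the function name and data layout are illustrative assumptions, not part of the Acrolinx service.

```python
from collections import defaultdict

# Term pairs read off the three example segments on the slide.
term_pairs = [
    ("Spannungsversorgung", "voltage supply"),
    ("Spannungsversorgung", "power supply"),
    ("Stromversorgung", "power supply"),
]


def synonyms_by_shared_translation(pairs):
    """Group source terms that share at least one translation."""
    sources_by_target = defaultdict(set)
    for source, target in pairs:
        sources_by_target[target].add(source)
    # Only translations reached from two or more source terms hint at synonymy.
    return {
        target: sorted(sources)
        for target, sources in sources_by_target.items()
        if len(sources) > 1
    }


shared = synonyms_by_shared_translation(term_pairs)
print(shared)
# {'power supply': ['Spannungsversorgung', 'Stromversorgung']}
```

Because both German terms were translated as "power supply", they surface as possible synonyms; a human still has to confirm the link.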
• What is multilingual term extraction?
• What is the workflow from customer perspective?
  – customer use case examples
  – show extraction results, demonstrate human validation
• How does the extraction work?
  – how we identify candidates
    • source-language candidates
    • translation candidates
  – how we filter translation candidates
  – how we identify source-language synonyms
• What is Acrolinx and how does MTE fit in?
Outline
1. Customer provides translated documents
2. Acrolinx provides extracted multilingual term candidates to customer
3. Customer validates candidates
4. Validated results become (or are added to) customer’s term bank
Workflow: Customer perspective
Use case 1
– de-<en,fr,es,it,pt> (mostly de-en)
– ~142,000 bilingual segments; ~2,685,000 tokens (total)
Use case 2
– de-<en,fr> (all data trilingual)
– ~132,000 bilingual segments; ~1,259,000 tokens
– data document-aligned, not segment-aligned, so extra step required
Use case 3
– en-de
– ~942,000 bilingual segments; ~25,000,000 tokens
– extract translations of a given list of keywords
– determine which keywords don’t occur in data
Customer use cases, past examples
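The last step of use case 3, finding which keywords do not occur in the data, might look roughly like the sketch below. It assumes plain case-insensitive substring matching against the source-language segments; the function name is hypothetical, and the keyword "input voltage" is made up for illustration. A production check would more likely match on tokens or lemmas.

```python
def split_by_occurrence(keywords, source_segments):
    """Return (found, missing) keyword lists using case-insensitive substring search."""
    haystack = [segment.casefold() for segment in source_segments]
    found, missing = [], []
    for keyword in keywords:
        needle = keyword.casefold()
        if any(needle in segment for segment in haystack):
            found.append(keyword)
        else:
            missing.append(keyword)
    return found, missing


# Two English segments borrowed from earlier slides stand in for the corpus.
segments = [
    "The wizard begins creating the bootable image.",
    "Undervoltage in the power supply",
]
found, missing = split_by_occurrence(
    ["bootable image", "power supply", "input voltage"], segments
)
print(found)    # ['bootable image', 'power supply']
print(missing)  # ['input voltage']
```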
• human validation in Excel
Results
“Baugruppe” has been translated inconsistently into English in the past.
Mark respective translations as preferred/deprecated to guide translators in the future.
Results
“Stromversorgung” and “Einspeisung” have translations in common.
→ automatically identified as possible synonyms, so same Cluster ID
To validate synonym link, edit Subcluster IDs to be the same.
Mark respective variants as preferred/deprecated to guide authors.
• Extract source-language term candidates from source-language text (unless source-language terminology exists)
  – linguistics-based
    • especially part-of-speech patterns
  – same functionality built into the core Acrolinx product
How does the extraction work?
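As a rough illustration of part-of-speech patterns, the sketch below pulls out maximal runs of adjectives and nouns that end in a noun. The pattern itself, the tag names, and the hand-assigned tags are assumptions made for this example; the slides do not say which patterns or tagger Acrolinx uses.

```python
def extract_candidates(tagged_tokens):
    """Collect maximal runs of ADJ/NOUN tokens that end in a NOUN."""
    candidates, run = [], []
    # A sentinel with a non-matching tag flushes the final run.
    for token, tag in list(tagged_tokens) + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((token, tag))
            continue
        # The run has ended: drop trailing adjectives so it ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run:
            candidates.append(" ".join(tok for tok, _ in run))
        run = []
    return candidates


# Tags assigned by hand for the example sentence; a real pipeline would use a tagger.
tagged = [
    ("The", "DET"), ("wizard", "NOUN"), ("begins", "VERB"),
    ("creating", "VERB"), ("the", "DET"), ("bootable", "ADJ"),
    ("image", "NOUN"), (".", "PUNCT"),
]
candidates = extract_candidates(tagged)
print(candidates)  # ['wizard', 'bootable image']
```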
The wizard begins creating the bootable image.
Der Assistent beginnt mit der Erstellung des bootfähigen Image.
• Extract translation candidates of each source-language term candidate from target-language text
  – use statistical phrase-alignment technology
  – the same technology used in statistical machine translation
• Filter translation candidates
… based on:
  – confidence score calculated from translation probabilities
    • can adjust threshold to favour precision or recall
  – surface characteristics (closed-class words, punctuation)
  – term-candidacy of translation (if possible for language)
How does the extraction work?
translation candidates for “Eingangsspannung” (pink = filtered out)
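The first two filter criteria can be sketched as below. Everything specific here is an assumption for illustration: the closed-class word list, the rule that a candidate may not start or end with such a word, the threshold values, and the confidence scores, which are invented and not taken from the slide. The third criterion (term-candidacy of the translation) is left out.

```python
import string

# A tiny, hypothetical closed-class word list for English.
CLOSED_CLASS = {"the", "a", "an", "of", "in", "for", "and", "or", "to", "by"}


def keep_candidate(text, confidence, threshold):
    """Apply the confidence threshold and two surface checks to one candidate."""
    words = text.casefold().split()
    if not words or confidence < threshold:
        return False
    # Reject candidates that start or end with a closed-class word.
    if words[0] in CLOSED_CLASS or words[-1] in CLOSED_CLASS:
        return False
    # Reject candidates containing a token made up only of punctuation.
    if any(all(ch in string.punctuation for ch in word) for word in words):
        return False
    return True


# Invented candidates and scores for "Eingangsspannung".
scored = [
    ("input voltage", 0.62),
    ("the input voltage", 0.21),
    ("input voltage ,", 0.30),
    ("voltage", 0.05),
]
kept_strict = [t for t, c in scored if keep_candidate(t, c, threshold=0.2)]
kept_loose = [t for t, c in scored if keep_candidate(t, c, threshold=0.01)]
print(kept_strict)  # ['input voltage']
print(kept_loose)   # ['input voltage', 'voltage']
```

Lowering the threshold lets the low-confidence candidate "voltage" through, which is the precision/recall trade-off the slide mentions.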
• Identify synonyms (‘cluster’ candidates)
– link confidence based on the degree to which translations are shared
– can adjust threshold to favour precision or recall of links
How does the extraction work?
cluster around “Stromwandler” (minimum link confidence threshold = 0.01)
cluster around “Stromwandler” (minimum link confidence threshold = 0.03)
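The effect of the minimum link confidence threshold can be sketched as follows. The slides do not define how link confidence is computed, so this example substitutes the Jaccard overlap of two terms' translation sets as a stand-in and forms clusters as connected components. The translation sets are invented, and the thresholds are on a different scale from the 0.01 and 0.03 shown on the slides.

```python
from itertools import combinations


def link_confidence(a, b):
    """Jaccard overlap of two translation sets (a stand-in measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(translations, min_confidence):
    """Connected components over links at or above min_confidence."""
    parent = {term: term for term in translations}

    def find(term):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for a, b in combinations(translations, 2):
        if link_confidence(translations[a], translations[b]) >= min_confidence:
            parent[find(a)] = find(b)

    groups = {}
    for term in translations:
        groups.setdefault(find(term), []).append(term)
    return sorted(sorted(group) for group in groups.values())


# Invented translation sets, for illustration only.
translations = {
    "Stromversorgung": {"power supply"},
    "Spannungsversorgung": {"power supply", "voltage supply"},
    "Einspeisung": {"power supply", "feed", "infeed"},
    "Assistent": {"wizard"},
}
clusters_loose = cluster(translations, min_confidence=0.2)
clusters_strict = cluster(translations, min_confidence=0.4)
print(clusters_loose)
# [['Assistent'], ['Einspeisung', 'Spannungsversorgung', 'Stromversorgung']]
print(clusters_strict)
# [['Assistent'], ['Einspeisung'], ['Spannungsversorgung', 'Stromversorgung']]
```

Raising the threshold drops the weaker links, so the cluster shrinks, which mirrors the difference between the two "Stromwandler" cluster views.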
Acrolinx is Content Optimization Software. It helps authors make their text
– more correct,
– more consistent,
– and more readable.
Consistent use of terminology is an important factor in the readability of text. Acrolinx provides:
– term extraction (monolingual, aka term harvesting)
– terminology management
– term checking
Multilingual Term Extraction as a Service is a natural complement to these terminology functions.
What is Acrolinx?
Visit Acrolinx at tekom!
→ Hall 3, Stand 310
Acrolinx @ tekom
Questions?