Post on 20-Aug-2015
Wednesday, 4 June
Beyond Data: Delivering Machine Translation with Subject Matter Expertise
John Tinsley, Iconic Translation Machines
TAUS Machine Translation Showcase 2014, Dublin (Ireland)
The research within the project MosesCore leading to these results has received funding from the European Union 7th Framework Programme, grant agreement no 288487
Beyond Data: Delivering Machine Translation with Subject Matter Expertise
John Tinsley, Director / Co-Founder
TAUS MT Showcase, 4th June 2014, Dublin
What is Linguistic Engineering?
[Diagram labels: Data Engineering; Training Data; Input; Pre-processing; Post-processing; Output]
Patents: an MT nightmare
L is an organic group selected from -CH2-(OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 …
maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C.
Long Sentences
Technical constructions
Largest single document: 249,322 words
Longest Sentence: 1,417 words
Data Engineering + Linguistic Engineering: an “ensemble” architecture
[Diagram labels: Input; Training Data; Output]
MT engines: Moses (three instances), RBMT
Components:
• Patent input classifier
• Chinese pre-ordering rules
• Japanese script normalisation
• Korean pharma tokenizer
• German compounding rules
• Spanish med-device entity recognizer
• Client TM/terminology (optional)
• Multi-output combination
• Statistical post-editing
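The “ensemble” idea on this slide, with language-specific pre-processing, a choice of engine, and post-processing around the core MT step, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the keyword-based classifier rule, and the placeholder engines are hypothetical and are not Iconic's implementation. Unicode NFKC folding stands in for the kind of step that script normalisation might involve.

```python
import unicodedata
from typing import Callable, Dict, List

# A processing step takes text and returns text.
Step = Callable[[str], str]


def normalise_script(text: str) -> str:
    """Fold full-width characters to their ASCII forms via Unicode NFKC."""
    return unicodedata.normalize("NFKC", text)


def tidy_spacing(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def classify_input(text: str) -> str:
    """Toy input classifier (hypothetical rule): claim-like text contains 'wherein'."""
    return "claims" if "wherein" in text.lower() else "general"


def claims_engine(text: str) -> str:
    """Placeholder for an engine tuned on patent claims; it only tags the text."""
    return f"[claims] {text}"


def general_engine(text: str) -> str:
    """Placeholder for a general-purpose engine; it only tags the text."""
    return f"[general] {text}"


def translate(text: str, pre: List[Step], engines: Dict[str, Step], post: List[Step]) -> str:
    """Run pre-processing, route to one engine, then run post-processing."""
    for step in pre:
        text = step(text)
    output = engines[classify_input(text)](text)
    for step in post:
        output = step(output)
    return output


ENGINES: Dict[str, Step] = {"claims": claims_engine, "general": general_engine}

# Full-width letters are folded, the text is routed to the claims engine,
# and the double space is collapsed afterwards.
result = translate("ＡＢＣ wherein  x", [normalise_script], ENGINES, [tidy_spacing])
print(result)
```

Keeping each component as a plain text-to-text function means a step such as a tokenizer or a pre-ordering rule set could be added to or removed from a language pair without touching the engines.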
If you don’t understand it, you can’t translate it
MT with Subject Matter Expertise
“Allopurinol-induced serious cutaneous adverse reactions (SCAR), including Steven Johnson’s syndrome (SJS) and toxic epidermal necrolysis (TEN), are associated with a genetic marker, the HLA-B*5801 allele.”
“IPTranslator is perfect for someone who needs to search [patents] across multiple languages and is useful in the case of both patentability and infringement searches.”
– Aalt van de Kuilen, Global Head of Patent Information, Abbott
Machine Translation for Patents
What is the value for users?
Specialist solutions deliver more useable outcomes for the user
Post-editing = Increased productivity
For information purposes = Extract more meaning
Multilingual search = Retrieve more relevant results
De-risking the machine translation proposition
What is the value for users?
Typical prerequisites: Data + Time + €€€ = ???
Our prerequisites: No data needed + Systems are ready to go + No upfront cost = Evaluate immediately
Customisation. Refinement.
» Incorporation of user feedback
» Incremental training with post-edits
» Tuning for specific input types
Iconic in practice
Business Need
Machine Translation technology for the legal industry
Iconic had a domain-specific MT solution for that industry
Process (1)
Translation samples required for initial evaluation
Delivered immediately and initial results were positive
Process (2)
Integrate Iconic with GlobalSight for productivity pilot
“The complexities and unforeseen but inevitable surprises of MT integration in large scale production processes were handled both competently and efficiently.”
Performance
>20% productivity increase for translator post-editing Iconic output
“Iconic delivered measurable productivity gains from the outset”
Looking forward
• Ongoing improvement through feedback from translators
• Ongoing improvement through the incorporation of post-edits
• More than 4 million words translated to date for Asian languages
• Periodic roll-out of new languages over time
Need: short-term solution to provide on-demand translation through a web search interface
Process: integrate directly through Iconic API and evaluate quality and throughput concurrently
Outcomes: in 5 months of production for English-Portuguese alone, we processed:
• 15,526 translation requests
• 14,606,374 words
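For a sense of scale, the two totals above can be turned into averages. The sketch below uses only the figures on the slide; the even spread across five months is an assumption, since the slide gives no monthly breakdown. It works out to roughly 940 words per request and about 2.9 million words per month.

```python
# Totals reported for English-Portuguese over 5 months of production.
total_requests = 15_526
total_words = 14_606_374
months = 5  # assumes volume was spread evenly, which the slide does not state

words_per_request = total_words / total_requests
words_per_month = total_words / months
requests_per_month = total_requests / months

print(f"Average words per request: {words_per_request:.0f}")
print(f"Average words per month: {words_per_month:,.0f}")
print(f"Average requests per month: {requests_per_month:,.0f}")
```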
Take home messages…
All content is not created equal
We cannot afford to be dogmatic when it comes to MT
Domain specific MT is about more than just data
Know your subject matter!