Machine Translation: Challenges and Approaches
Nizar Habash
Post-doctoral Fellow
Center for Computational Learning Systems
Columbia University
Invited Lecture
CS 4705: Introduction to Natural Language Processing
Fall 2004
Sounds like Faulkner?
http://www.ee.ucla.edu/~simkin/sounds_like_faulkner.html
It lay on the table a candle burning at each corner upon the envelope tied in a soiled pink garter two artificial flowers. Not hit a man in glasses.
William Faulkner, "The Sound and the Fury"
It was once a shade, which was in all beautiful weather under a tree and varied like the branches in the wind.
Es war einmal ein Schatten, der lag bei jedem schönen Wetter unter einem Baum und schwankte wie die Zweige im Wind.
Helmut Wördemann, "Der unzufriedene Schatten" (Translated by Systran)
Answer: the first passage is Faulkner, the second is Machine Translation
Progress in MT
Statistical MT example
From a talk by Charles Wayne, DARPA
2002:
insistent Wednesday may recurred her trips to Libya tomorrow for flying
Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment .
2003:
Egyptair Has Tomorrow to Resume Its Flights to Libya
Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
Human Translation:
Egypt Air May Resume its Flights to Libya Tomorrow
Cairo, April 6 (AFP) - An Egypt Air official announced, on Tuesday, that Egypt Air will resume its flights to Libya as of tomorrow, Wednesday, after the UN Security Council had announced the suspension of the embargo imposed on Libya.
Road Map
• Why Machine Translation (MT)?
• Multilingual Challenges for MT
• MT Approaches
• MT Evaluation
Why (Machine) Translation?
Languages in the world
• 6,800 living languages
• 600 with written tradition
• 95% of world population speaks 100 languages
Translation Market
• $8 Billion Global Market
• Doubling every five years
(Donald Barabé, invited talk, MT Summit 2003)
Why Machine Translation?
• Full Translation
  – Domain specific
    • Weather reports
• Machine-aided Translation
  – Translation dictionaries
  – Translation memories
  – Requires post-editing
• Cross-lingual NLP applications
  – Cross-language IR
  – Cross-language Summarization
Road Map
• Why Machine Translation (MT)?
• Multilingual Challenges for MT
  – Orthographic variations
  – Lexical ambiguity
  – Morphological variations
  – Translation divergences
• MT Paradigms
• MT Evaluation
Multilingual Challenges
• Orthographic Variations
  – Ambiguous spelling
    • كتب الاولاد اشعارا
    • كَتَبَ الأولادُ اشعَاراً
  – Ambiguous word boundaries
• Lexical Ambiguity
  – Bank → بنك (financial) vs. ضفة (river)
  – Eat → essen (human) vs. fressen (animal)
Multilingual Challenges
Morphological Variations
• Affixation vs. Root+Pattern
  write → written    كتب → مكتوب
  kill → killed    قتل → مقتول
  do → done    فعل → مفعول
• Tokenization
  And the cars → and the cars
  والسيارات → w Al SyArAt (labels: conj, article, noun, plural)
  Et les voitures → et le voitures
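The tokenization step above can be sketched as a greedy proclitic stripper. This is only an illustration under stated assumptions: the input is the transliterated form `wAlSyArAt` (the slide's `w Al SyArAt` joined back together), the clitic list is a hypothetical two-entry toy, and a real tokenizer would need morphological analysis to avoid splitting words whose stem merely begins with `w` or `Al`.

```python
# Hypothetical, tiny proclitic list: conjunction "w" (and), article "Al" (the).
PROCLITICS = ["w", "Al"]

def tokenize(word):
    """Greedily strip known proclitics, in order, from the front of a word."""
    tokens = []
    for clitic in PROCLITICS:
        # Only split if something is left over to serve as the stem.
        if word.startswith(clitic) and len(word) > len(clitic):
            tokens.append(clitic)
            word = word[len(clitic):]
    tokens.append(word)
    return tokens

arabic_tokens = tokenize("wAlSyArAt")
print(" ".join(arabic_tokens))
```

The same function leaves a word without these prefixes untouched, which is the behaviour one would want for most stems, but it is not a substitute for a real morphological analyzer.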
Multilingual Challenges
Translation Divergences
• How languages map semantics to syntax
• Found in 35% of sentences in the TREC El Norte Corpus (Dorr et al 2002)
• Divergence Types
  – Categorial (X tener hambre ↔ X be hungry) [98%]
  – Conflational (X dar puñaladas a Z ↔ X stab Z) [83%]
  – Structural (X entrar en Y ↔ X enter Y) [35%]
  – Head Swapping (X cruzar Y nadando ↔ X swim across Y) [8%]
  – Thematic (X gustar a Y ↔ Y like X) [6%]
Translation Divergences
conflation
لست هنا (I-am-not here)
I am not here
Je ne suis pas ici (I not be not here)
[Figure: dependency trees ليس → (انا, هنا); be → (I, not, here); etre → (Je, ne, pas, ici)]
Translation Divergences
categorial, thematic and structural
I am cold
انا بردان (I cold)
קר לי (cold for-me)
tengo frio (I-have cold)
[Figure: dependency trees for the four sentences, e.g. be → (I, cold) and tener → (Yo, frio)]
Translation Divergences
head swap and categorial
I swam across the river quickly
اسرعت عبور النهر سباحة (I-sped crossing the-river swimming)
[Figure: dependency trees swim → (I, across → river, quickly) and اسرع → (انا, عبور → نهر, سباحة)]
Translation Divergences
head swap and categorial
חציתי את הנהר בשחיה במהירות (I-crossed obj river in-swim speedily)
[Figure: dependency tree חצה → (אני, את → נהר, ב → שחיה, ב → מהירות)]
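Translation Divergences
head swap and categorial
[Figure: the Hebrew, Arabic and English dependency trees side by side, with the part of speech of each corresponding node]
English / Arabic / Hebrew
swim (verb) / سباحة (noun) / שחיה (noun)
across (prep) / عبور (noun) / חצה (verb)
quickly (adverb) / اسرع (verb) / מהירות (noun)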
Translation Divergences
Orthography+Morphology+Syntax
mom’s car
妈妈的车 (mama de che)
سيارة ماما (sayyArat mama)
la voiture de maman
[Figure: shared dependency structure car → possessed-by → mom]
Road Map
• Why Machine Translation (MT)?
• Multilingual Challenges for MT
• MT Approaches
  – Gisting / Transfer / Interlingua
  – Statistical / Symbolic / Hybrid
  – Practical Considerations
• MT Evaluation
MT Approaches
MT Pyramid
Analysis: Source word → Source syntax → Source meaning
Generation: Target meaning → Target syntax → Target word
Gisting: Source word → Target word
MT Approaches
Gisting Example
Sobre la base de dichas experiencias se estableció en 1988 una metodología.
Envelope her basis out speak experiences them settle at 1988 one methodology.
On the basis of these experiences, a methodology was arrived at in 1988.
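The gisting output above reads the way it does because each word is translated in isolation. A minimal sketch of that word-for-word lookup follows; `TOY_LEXICON` is a hypothetical dictionary built by hand for this one sentence, with a single (often wrong-sense) gloss per word, and is not taken from any real gisting system.

```python
# Hypothetical one-gloss-per-word dictionary for the example sentence.
TOY_LEXICON = {
    "sobre": "envelope",      # also "on", but a context-free lookup cannot tell
    "la": "her",
    "base": "basis",
    "de": "out",
    "dichas": "speak",
    "experiencias": "experiences",
    "se": "them",
    "estableció": "settle",
    "en": "at",
    "una": "one",
    "metodología": "methodology",
}

def gist(sentence, lexicon):
    """Translate word by word; tokens missing from the lexicon pass through."""
    words = sentence.rstrip(".").split()
    glosses = [lexicon.get(word.lower(), word) for word in words]
    text = " ".join(glosses)
    return text[0].upper() + text[1:] + "."

gist_output = gist(
    "Sobre la base de dichas experiencias se estableció en 1988 una metodología.",
    TOY_LEXICON,
)
print(gist_output)
```

No reordering, agreement or sense selection happens here, which is exactly what separates gisting from the transfer and interlingua approaches.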
MT Approaches
MT Pyramid
Gisting: Source word → Target word
Transfer: Source syntax → Target syntax
MT Approaches
Transfer Example
• Transfer Lexicon
  – Map SL structure to TL structure
poner (:subj X, :obj mantequilla, :mod en (:obj Y)) → butter (:subj X, :obj Y)
X puso mantequilla en Y → X buttered Y
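One way to picture a transfer-lexicon entry is as a rewrite rule over dependency trees. The sketch below encodes trees as nested dictionaries with a `head` key and relation keys such as `:subj`; that encoding and the function name are assumptions made for illustration, not the representation used by any particular transfer system.

```python
def transfer_poner_mantequilla(tree):
    """Apply one hypothetical transfer rule:
    poner(:subj X, :obj mantequilla, :mod en(:obj Y)) -> butter(:subj X, :obj Y).
    Returns the target tree, or None if the source tree does not match."""
    if tree.get("head") != "poner":
        return None
    obj = tree.get(":obj", {})
    mod = tree.get(":mod", {})
    if obj.get("head") != "mantequilla" or mod.get("head") != "en":
        return None
    if ":subj" not in tree or ":obj" not in mod:
        return None
    # X and Y are carried over unchanged; the verb, its object and the
    # preposition are conflated into the single target verb "butter".
    return {"head": "butter", ":subj": tree[":subj"], ":obj": mod[":obj"]}

source_tree = {
    "head": "poner",
    ":subj": {"head": "X"},
    ":obj": {"head": "mantequilla"},
    ":mod": {"head": "en", ":obj": {"head": "Y"}},
}
target_tree = transfer_poner_mantequilla(source_tree)
print(target_tree)
```

A full transfer lexicon would hold many such rules, each tied to a specific language pair, which is part of why transfer systems are costly to build.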
MT Approaches
MT Pyramid
Gisting: Source word → Target word
Transfer: Source syntax → Target syntax
Interlingua: Source meaning → Target meaning
MT Approaches
Interlingua Example: Lexical Conceptual Structure
(Dorr, 1993)
MT Approaches
MT Pyramid
Resources needed at each level:
Word level: Dictionaries/Parallel Corpora
Syntax level: Transfer Lexicons
Meaning level: Interlingual Lexicons
MT Approaches
Statistical vs. Symbolic
[Figure: the MT pyramid (source and target word, syntax and meaning; analysis and generation)]
MT Approaches
Noisy Channel Model
Portions from http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf
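In the noisy channel formulation, the translation of a source sentence f is the target sentence e that maximizes P(e) · P(f | e): a language model scores fluency and a translation model scores faithfulness. The sketch below shows only that argmax over a fixed candidate list. The candidates reuse the "tengo frio" example from the divergence slides, and every probability is a made-up number for illustration.

```python
import math

def noisy_channel_decode(candidates, lm_prob, tm_prob):
    """Return the candidate e that maximizes log P(e) + log P(f | e)."""
    return max(
        candidates,
        key=lambda e: math.log(lm_prob[e]) + math.log(tm_prob[e]),
    )

# Hypothetical probabilities for the source sentence f = "tengo frio".
lm_prob = {"I am cold": 0.02, "I have cold": 0.001}  # P(e): target fluency
tm_prob = {"I am cold": 0.3, "I have cold": 0.6}     # P(f | e): faithfulness

best_translation = noisy_channel_decode(list(lm_prob), lm_prob, tm_prob)
print(best_translation)
```

The literal candidate wins on the translation model alone, but the language model outweighs it, which is the division of labour the noisy channel model is meant to capture. A real decoder searches a vast candidate space instead of scoring a short list.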
MT Approaches
IBM Model (Word-based Model)
http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf
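As a rough illustration of how a word-based model can learn word translation probabilities from sentence pairs alone, here is a sketch of EM training in the style of IBM Model 1, the simplest of the IBM models. It is simplified (no NULL word, fixed number of iterations), the three-sentence corpus is a hypothetical toy, and the function and variable names are not from the cited lecture.

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    """corpus: list of (source_tokens, target_tokens) pairs.
    Returns a table t[(tgt, src)] estimating P(tgt | src)."""
    target_vocab = {tgt for _, tgts in corpus for tgt in tgts}
    # Start from a uniform translation table.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected (tgt, src) link counts
        total = defaultdict(float)  # expected link counts per source word
        for srcs, tgts in corpus:
            for tgt in tgts:
                # E-step: split one unit of count for tgt among the source words.
                norm = sum(t[(tgt, src)] for src in srcs)
                for src in srcs:
                    share = t[(tgt, src)] / norm
                    count[(tgt, src)] += share
                    total[src] += share
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(
            float,
            {(tgt, src): c / total[src] for (tgt, src), c in count.items()},
        )
    return t

toy_corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
table = train_ibm_model1(toy_corpus)
for src in ["das", "Haus", "Buch", "ein"]:
    best = max(["the", "house", "book", "a"], key=lambda tgt: table[(tgt, src)])
    print(src, "->", best)
```

No word alignments are given in the input; the co-occurrence pattern across sentence pairs is enough for the expected counts to concentrate on the right pairs.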
MT Approaches
Statistical vs. Symbolic vs. Hybrid
[Figure: the MT pyramid (source and target word, syntax and meaning; analysis and generation)]
MT Approaches
Hybrid Example: GHMT
• Generation-Heavy Hybrid Machine Translation
• Lexical transfer but NO structural transfer
Maria puso la mantequilla en el pan.
Source tree: poner (:subj Maria, :obj mantequilla, :mod en (:obj pan))
After lexical transfer: {lay, locate, place, put, render, set, stand} (:subj Maria, :obj {butter, bilberry}, :mod {on, in, into, at} (:obj {bread, loaf}))
MT Approaches
Hybrid Example: GHMT
• LCS-driven Expansion
• Conflation Example
PUT_V [CAUSE GO] (Agent: MARIA, Theme: BUTTER_N, Goal: BREAD)
BUTTER_V [CAUSE GO] (Agent: MARIA, Goal: BREAD)
Categorial Variation: BUTTER_N ↔ BUTTER_V
MT Approaches
Hybrid Example: GHMT
• Structural Overgeneration
put (Maria, butter, on (bread))
lay (Maria, butter, at (loaf))
render (Maria, butter, into (loaf))
butter (Maria, bread)
bread (Maria, butter)
…
MT Approaches
Hybrid Example: GHMT
Target Statistical Resources
• Structural N-gram Model
  – Long-distance
  – Lexemes
  – Example: buy → (John, car → (a, red))
• Surface N-gram Model
  – Local
  – Surface-forms
  – Example: John bought a red car
MT Approaches
Hybrid Example: GHMT
Linearization & Ranking
Maria buttered the bread -47.0841
Maria butters the bread -47.2994
Maria breaded the butter -48.7334
Maria breads the butter -48.835
Maria buttered the loaf -51.3784
Maria butters the loaf -51.5937
Maria put the butter on bread -54.128
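The ranking step can be sketched as a sort over the overgenerated candidates. The sentences and scores below are copied from the slide; treating the scores as log-probabilities from the target-side n-gram models (so that higher is better) is an assumption, and the code does not compute any scores itself.

```python
# (sentence, score) pairs taken from the slide, deliberately out of order.
candidates = [
    ("Maria put the butter on bread", -54.128),
    ("Maria breads the butter", -48.835),
    ("Maria buttered the loaf", -51.3784),
    ("Maria buttered the bread", -47.0841),
    ("Maria butters the loaf", -51.5937),
    ("Maria breaded the butter", -48.7334),
    ("Maria butters the bread", -47.2994),
]

# Rank from highest to lowest score and keep the top candidate.
ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
best_sentence, best_score = ranked[0]
print(best_sentence, best_score)
```

Because the symbolic stages overgenerate freely, the quality of the final output rests largely on how well this statistical ranking separates good candidates from bad ones.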
MT Approaches
Practical Considerations
• Resources Availability
  – Parsers and Generators
    • Input/Output compatibility
  – Translation Lexicons
    • Word-based vs. Transfer/Interlingua
  – Parallel Corpora
    • Domain of interest
    • Bigger is better
• Time Availability
  – Statistical training, resource building
MT Approaches
Resource Poverty
• No Parser?
• No Translation Dictionary?
• Parallel Corpus
  – Align with rich language
  – Extract dictionary
  – Parse rich side
  – Infer parses
  – Build a statistical parser
Road Map
• Why Machine Translation (MT)?
• Multilingual Challenges for MT
• MT Approaches
• MT Evaluation
MT Evaluation
• More art than science
• Wide range of Metrics/Techniques
  – interface, …, scalability, …, faithfulness, ... space/time complexity, … etc.
• Automatic vs. Human-based
  – Dumb Machines vs. Slow Humans
MT Evaluation Metrics
• System-based Metrics
  Count internal resources: size of lexicon, number of grammar rules, etc.
  – easy to measure
  – not comparable across systems
  – not necessarily related to utility
(Church and Hovy 1993)
MT Evaluation Metrics
• Text-based Metrics
  – Sentence-based Metrics
    • Quality: Accuracy, Fluency, Coherence, etc.
    • 3-point scale to 100-point scale
  – Comprehensibility Metrics
    • Comprehension, Informativeness,
    • x-point scales, questionnaires
    • most related to utility
    • hard to measure
MT Evaluation Metrics
• Text-based Metrics (cont’d)
  – Amount of Post-Editing
    • number of keystrokes per page
    • not necessarily related to utility
• Cost-based Metrics
  – Cost per page
  – Time per page
Human-based Evaluation Example
Accuracy Criteria
5 contents of original sentence conveyed (might need minor corrections)
4 contents of original sentence conveyed BUT errors in word order
3 contents of original sentence generally conveyed BUT errors in relationship between phrases, tense, singular/plural, etc.
2 contents of original sentence not adequately conveyed, portions of original sentence incorrectly translated, missing modifiers
1 contents of original sentence not conveyed, missing verbs, subjects, objects, phrases or clauses
Human-based Evaluation Example
Fluency Criteria
5 clear meaning, good grammar, terminology and sentence structure
4 clear meaning BUT bad grammar, bad terminology or bad sentence structure
3 meaning graspable BUT ambiguities due to bad grammar, bad terminology or bad sentence structure
2 meaning unclear BUT inferable
1 meaning absolutely unclear
Fluency vs. Accuracy
[Figure: plot with axes Accuracy and Fluency, locating con MT, FAHQ MT, Prof. MT and Info. MT]
Automatic Evaluation Example
Bleu Metric
• Bleu – BiLingual Evaluation Understudy (Papineni et al 2001)
– Modified n-gram precision with length penalty
– Quick, inexpensive and language independent
– Correlates highly with human evaluation
– Bias against synonyms and inflectional variations
Test Sentence
colorless green ideas sleep furiously
Gold Standard References
all dull jade ideas sleep irately
drab emerald concepts sleep furiously
colorless immature thoughts nap angrily
Automatic Evaluation Example
Bleu Metric
For the test sentence and gold standard references above:
Unigram precision = 4 / 5 = 0.8 (colorless, ideas, sleep and furiously each appear in a reference; green does not)
Bigram precision = 2 / 4 = 0.5 (ideas sleep and sleep furiously each appear in a reference)
$\text{Bleu Score} = (a_1 a_2 \cdots a_n)^{1/n} = (0.8 \times 0.5)^{1/2} = 0.6325$, reported as 63.25, where $a_i$ is the modified $i$-gram precision
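The Bleu arithmetic for the "colorless green ideas" example can be reproduced with a short sketch of modified n-gram precision (counts clipped to the maximum seen in any single reference) combined by a geometric mean up to bigrams. It leaves out the length penalty, which equals 1 here because the 5-word test sentence matches the length of two of the references. The function names are illustrative and do not belong to any Bleu toolkit.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Candidate n-gram counts, clipped by the max count in any one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], c)
    clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu_without_length_penalty(candidate, references, max_n):
    """Geometric mean of the modified 1..max_n-gram precisions."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

test_sentence = "colorless green ideas sleep furiously".split()
gold_references = [
    "all dull jade ideas sleep irately".split(),
    "drab emerald concepts sleep furiously".split(),
    "colorless immature thoughts nap angrily".split(),
]
unigram_p = modified_precision(test_sentence, gold_references, 1)
bigram_p = modified_precision(test_sentence, gold_references, 2)
score = bleu_without_length_penalty(test_sentence, gold_references, 2)
print(unigram_p, bigram_p, round(score, 4))
```

The example also shows the bias noted above: "green" gets no credit even though "jade" and "emerald" appear in the references, because matching is on exact word forms.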
Automatic Evaluation Example
Bleu Metric