Machine Translation - Stanford University
Machine Translation
Introduction to MT
Dan Jurafsky
Machine Translation
• Fully automatic
• Helping human translators
Enter Source Text:
这 不过 是 一 个 时间 的 问题 .
Translation from Stanford’s Phrasal:
This is only a matter of time.
Google Translate
• Fried ripe plantains:
  • http://laylita.com/recetas/2008/02/28/platanos-maduros-fritos/
Machine Translation
• The Story of the Stone (“The Dream of the Red Chamber”)
  • Cao Xueqin 1792
• Chinese gloss: Dai-yu alone at bed on think-of-with-gratitude Bao-chai… again listen to window outside bamboo tip plantain leaf of on, rain sound sigh drop, clear cold penetrate curtain, not feeling again fall down tears come.
• Hawkes translation: As she lay there alone, Dai-yu’s thoughts turned to Bao-chai… Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.
Difficulties in Chinese to English translation
• Long Chinese sentences: 4 English sentences to 1 Chinese
• Chinese no pronouns or articles (English the, a)
• Chinese has locative post-positions, English prepositions
  • Chinese bed on, window outside, English on the bed, outside the window
• Chinese rarely marks tense:
  • English as, turned to, had begun,
  • Chinese tou, ‘penetrate’ -> English penetrated
• Chinese relative clauses are before the noun, English after
  • Chinese: [window outside bamboo on] rain
  • English: rain [on the bamboo outside the window]
• Stylistic and cultural differences
  • Chinese bamboo tip plantain leaf -> bamboos and plantains
  • Chinese rain sound sigh drop -> insistent rustle of the rain
  • Chinese ma ‘curtain’ -> curtains of her bed
Alignment in Machine Translation
Early MT History
1946 Booth and Weaver discuss MT in New York
1947-48 idea of dictionary-based direct translation
1947 Warren Weaver suggests translation by computer
1949 Weaver memorandum
1952 all 18 MT researchers in world meet at MIT
1954 IBM/Georgetown Demo Russian-English MT
1955-65 lots of labs take up MT
http://www.hutchinsweb.me.uk/PPF-TOC.htm
1949 Weaver memorandum
• http://www.mt-archive.info/Weaver-1949.pdf
• “There are certain invariant properties which are… common to all languages”
• “When I look at an article in Russian, I say ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”
• “[If] one can see… N words on either side, then, if N is large enough, one can unambiguously decide the meaning of the central word.”
The History of MT: Pessimism
• 1959/1960
  • Yehoshua Bar-Hillel “Report on the state of MT in US and GB”
  • FAHQ (fully automatic high quality) MT too hard because we would have to encode all of human knowledge
  • Instead we should work on computer tools for human translators
The claim that fully automatic high quality MT is impossible
Yehoshua Bar-Hillel. 1960. A Demonstration of the Nonfeasibility of Fully Automatic High Quality Translation.
• Little John was looking for his toy box. Finally he found it. The box was in the pen. John was very happy.
• Pen1: Enclosure for small children
• Pen2: Writing utensil
• The box was in the pen. -> Pen1: Enclosure for small children
The claim that fully automatic high quality MT is impossible
Yehoshua Bar-Hillel, 1960
“I now claim that no existing or imaginable program will enable an electronic computer to determine…”
The state of the art in MT
History of MT: Further Pessimism
The ALPAC report
• Headed by John R. Pierce of Bell Labs
• Conclusions:
  • MT doesn’t work
    • MT a failure: all current MT work had to be post-edited
    • Intelligibility and informativeness worse than human
  • We don’t need MT anyhow
    • Already too many human translators from Russian
• Results: MT research suffered
  • Funding loss
  • Number of research labs declined
  • Association for Machine Translation and Computational Linguistics dropped MT from its name
MT in the modern age
• 1975-1985
  • Resurgence of MT in Europe and Japan
  • Domain-specific rule-based systems
• 1990-present
  • Rise of Statistical Machine Translation
Machine Translation
Language Divergences
Language Similarities and Divergences
• Typology:
  • the study of systematic cross-linguistic similarities and differences
• What are the dimensions along which human languages vary?
Syntactic Variation: Basic Word Orders
• SVO (Subject-Verb-Object) languages
  • English, German, French, Mandarin
  • I baked a pizza
• SOV languages
  • Japanese, Hindi
  • English: He adores listening to music
  • Japanese: kare ha ongaku wo kiku no ga daisuki desu
    (gloss: he music to listening adores)
• VSO languages
  • Irish, Classical Arabic, Tagalog
In many languages one word order is more basic
Morphology
• Morpheme: “Minimal meaningful unit of language”
  Word = Morpheme + Morpheme + Morpheme + …
• Stems: (base form, root)
  • hope+ing → hoping
  • hop → hopping
• Affixes
  • Prefixes: Antidisestablishmentarianism (anti-, dis-)
  • Suffixes: Antidisestablishmentarianism (-ment, -arian, -ism)
  • Infixes: hingi (borrow) – humingi (borrower) in Tagalog (infix -um-)
  • Circumfixes: sagen (say) – gesagt (said) in German (ge- … -t)
Morphemes per Word
Scale from isolating (1 morpheme per word) to synthetic (4 morphemes per word):
Vietnamese 1.06
English 1.68
Yakut (Turkic) 2.17
Swahili 2.55
West Greenlandic (Eskimo-Inuit) 3.72
Joseph Greenberg. 1954. A Quantitative Approach to the Morphological Typology of Language. IJAL 26:3.
Few morphemes per word: Cantonese
“He said this was the biggest building in the whole country”
Each word in this sentence has one morpheme (and one syllable):
keui wa chyuhn gwok jeui daaih gaan nguk haih li gaan
he say entire country most big bldg house is this bldg
Many Morphemes per word: Turkish
uygarlaştıramadıklarımızdanmışsınızcasına
uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
Behaving as if you are among those whom we could not cause to become civilized
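The morphemes-per-word ratio on the preceding slides can be computed directly once a text has been segmented by hand. The sketch below assumes morpheme boundaries are written with "+" as in the Turkish example; the function name is hypothetical and no automatic morphological analysis is attempted.

```python
def morphemes_per_word(segmented_text, boundary="+"):
    """Average number of morphemes per word in a pre-segmented text.

    Words are separated by whitespace; morphemes inside a word are
    separated by `boundary` (an assumed notation, as on the Turkish slide).
    """
    words = segmented_text.split()
    if not words:
        raise ValueError("text contains no words")
    total_morphemes = sum(len(word.split(boundary)) for word in words)
    return total_morphemes / len(words)


# The Turkish word from the slide, already segmented into morphemes.
turkish = "uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına"
# The Cantonese sentence from the slide: every word is a single morpheme.
cantonese = "keui wa chyuhn gwok jeui daaih gaan nguk haih li gaan"

print(morphemes_per_word(turkish))    # 11.0
print(morphemes_per_word(cantonese))  # 1.0
```

A single word or sentence is of course not a corpus average like Greenberg's figures; the two slide examples only mark the extremes of the scale.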
Word Segmentation
Are word boundaries marked in writing?
• Some writing systems: boundaries between words not marked
  • Chinese, Japanese, Thai
  • Word segmentation becomes an important part of text normalization for MT
• Some languages tend to have sentences that are quite long, closer to English paragraphs than sentences:
  • Modern Standard Arabic, Chinese
  • Sentence segmentation may be necessary for MT between these languages and languages like English
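One simple baseline for word segmentation is greedy forward maximum matching against a word list. The slides do not name a method, so this is only an illustrative sketch: the tiny lexicon is an assumption built from the Chinese sentence on the opening slide, and real segmenters use much larger dictionaries or statistical models.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching.

    At each position take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy lexicon covering the example sentence only.
LEXICON = {"这", "不过", "是", "一", "个", "时间", "的", "问题"}

print(max_match("这不过是一个时间的问题", LEXICON))
# ['这', '不过', '是', '一', '个', '时间', '的', '问题']
```

Greedy matching can go wrong whenever a longer dictionary word overlaps the correct boundary, which is one reason segmentation is treated as a real preprocessing problem for MT.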
Inferential Load: cold vs. hot languages
• Hot languages:
  • Who did what to whom is marked explicitly
  • English
• Cold languages:
  • The hearer has more “figuring out” of who the various actors in the various events are
  • Japanese, Chinese
Balthasar Bickel. 2003. Referential density in discourse and syntactic typology. Language 79:2, 708-36
Inferential Load: The bracketed noun phrases (blue on the slide) are not in the Chinese original
飓风丽塔已经减弱为第三级飓风, Rita weakened and was downgraded to a Category 3 storm;
ø 迫近美国德克萨斯州和路易斯安那州, [Rita/it/the storm] is moving close to Texas and Louisiana;
当局表示, the authorities announced;
虽然 ø 在登陆前可能再稍微减弱, although [Rita/it/the storm] might weaken again before landing,
但 ø 仍然会非常危险, [Rita/it/the storm] is still very dangerous;
ø 预料 ø 会在当地时间星期六凌晨在德州和路易斯安那州之间登陆, [the authorities] predict [Rita/it/the storm] will arrive at the Texas-Louisiana border on Saturday morning local time;
ø 直接吹袭休斯敦市东面的主要炼油设施。 [Rita/it/the storm] will directly hit the oil-refining industry east of Houston.
Lexical Divergences
• Word to phrases:
  • English computer science
  • French informatique
• Part of Speech divergences
  • English She likes to sing
  • German Sie singt gerne [She sings likefully]
  • English I’m hungry
  • Spanish Tengo hambre [I have hunger]
Lexical Specificity Divergences
• Grammatical specificity
  • Spanish: plural pronouns have gender (ellos/ellas)
  • English: plural pronouns no gender (they)
• So translating “they” from English to Spanish, need to figure out gender of the referent
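A minimal sketch of why the referent matters when translating "they". It assumes the referents have already been identified and labeled with grammatical gender, which is the hard part in practice; the function name and the "f"/"m" labels are hypothetical.

```python
def spanish_for_they(referent_genders):
    """Pick the Spanish subject pronoun for English 'they'.

    referent_genders: one 'f' or 'm' label per member of the group
    (assumed to come from an earlier coreference step).
    Spanish uses 'ellas' only when every referent is feminine;
    masculine and mixed groups take 'ellos'.
    """
    if not referent_genders:
        raise ValueError("cannot choose a pronoun without a referent")
    if all(gender == "f" for gender in referent_genders):
        return "ellas"
    return "ellos"


print(spanish_for_they(["f", "f"]))  # ellas
print(spanish_for_they(["f", "m"]))  # ellos
```

The English source carries none of this information, so the system has to recover it from context before it can choose a word.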
Lexical Divergences: Semantic Specificity
English brother: Mandarin gege (older brother), didi (younger brother)
English wall: German Wand (inside), Mauer (outside)
English fish: Spanish pez (the creature), pescado (fish as food)
Cantonese ngau: English cow, beef
Predicate Argument divergences
• English: The bottle floated out.
• Spanish: La botella salió flotando. [The bottle exited floating]
• Satellite-framed languages:
  • direction of motion is marked on the satellite
  • Crawl out, float off, jump down, walk over to, run after
  • Most of Indo-European, Hungarian, Finnish, Chinese
• Verb-framed languages:
  • direction of motion is marked on the verb
  • Spanish, French, Arabic, Hebrew, Japanese, Tamil, Polynesian, Mayan, Bantu families
L. Talmy. 1985. Lexicalization patterns: Semantic Structure in Lexical Form.
Predicate Argument divergences: Heads and Argument swapping
Heads:
English: X swim across Y
Spanish: X cruzar Y nadando
English: I like to eat
German: Ich esse gern
English: I’d prefer vanilla
German: Mir wäre Vanille lieber
Arguments:
Spanish: Y me gusta
English: I like Y
German: Der Termin fällt mir ein
English: I forget the date
Dorr, Bonnie J., "Machine Translation Divergences: A Formal Description and Proposed Solution," Computational Linguistics, 20:4, 597-633
Predicate-Argument Divergence Counts
Found divergences in 32% of sentences in UN Spanish/English Corpus
Divergence type     Spanish               English           %
Part of Speech      X tener hambre        Y have hunger     98%
Phrase/Light verb   X dar puñaladas a Z   X stab Z          83%
Structural          X entrar en Y         X enter Y         35%
Heads swap          X cruzar Y nadando    X swim across Y   8%
Arguments swap      X gustar a Y          Y likes X         6%
B. Dorr et al. 2002. DUSTer: A Method for Unraveling Cross-Language Divergences for Statistical Word-Level Alignment
Machine Translation
Three classical methods for MT
3 Classical methods for MT
• Direct
• Transfer
• Interlingua
Three MT Approaches: Direct, Transfer, Interlingual
Direct Translation
• Proceed word-by-word through text
• Translating each word
• No intermediate structures except morphology
• Knowledge is in the form of
  • Huge bilingual dictionary
  • word-to-word translation information
• After word translation, can do simple reordering
  • Adjective ordering English -> French/Spanish
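The procedure above can be sketched in a few lines: look each word up in a bilingual dictionary, then apply one local reordering pass. The three-entry English-to-Spanish dictionary and its part-of-speech tags are assumptions made for illustration; a real direct system would rely on a very large dictionary and morphological processing, and this toy ignores gender agreement.

```python
# Hypothetical toy dictionary: English word -> (Spanish word, part of speech).
BILINGUAL_DICT = {
    "the": ("la", "DET"),
    "green": ("verde", "ADJ"),
    "witch": ("bruja", "NOUN"),
}


def direct_translate(sentence):
    """Word-by-word translation followed by simple adjective reordering."""
    # Step 1: translate each word independently; unknown words pass through.
    entries = [BILINGUAL_DICT.get(word, (word, "UNK"))
               for word in sentence.lower().split()]

    # Step 2: local reordering, ADJ NOUN -> NOUN ADJ, with no parse tree.
    output = []
    i = 0
    while i < len(entries):
        is_adj_noun = (i + 1 < len(entries)
                       and entries[i][1] == "ADJ"
                       and entries[i + 1][1] == "NOUN")
        if is_adj_noun:
            output.extend([entries[i + 1], entries[i]])
            i += 2
        else:
            output.append(entries[i])
            i += 1
    return " ".join(word for word, _tag in output)


print(direct_translate("the green witch"))  # la bruja verde
```

Because nothing here looks beyond adjacent words, any reordering that depends on sentence structure is out of reach for this approach.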
Direct MT Dictionary entry
Direct MT
Problems with direct MT
• German
• Chinese
The Transfer Model
• Idea: apply contrastive knowledge, i.e., knowledge about the difference between two languages
• Steps:
  • Analysis: Syntactically parse source language
  • Transfer: Rules to turn this parse into parse for target language
  • Generation: Generate target sentence from parse tree
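The transfer and generation steps can be sketched on a toy parse tree. Everything concrete here is an assumption for illustration: the tuple encoding of trees, the small English-to-Spanish lexicon, and the single structural rule (a nominal made of adjective plus noun becomes noun plus adjective). The analysis step is taken as given, so the source tree is written by hand rather than produced by a parser.

```python
# Hypothetical lexical transfer table (English -> Spanish).
LEXICAL_TRANSFER = {"the": "la", "green": "verde", "witch": "bruja"}


def transfer(tree):
    """Transfer: rewrite a source-language tree into a target-language tree.

    A tree is (label, children); a leaf is (tag, word).
    """
    label, children = tree
    if isinstance(children, str):  # leaf node: translate the word
        return (label, LEXICAL_TRANSFER.get(children, children))
    new_children = [transfer(child) for child in children]
    # Assumed structural rule: NOM -> ADJ N  becomes  NOM -> N ADJ.
    if label == "NOM" and [child[0] for child in new_children] == ["ADJ", "N"]:
        new_children.reverse()
    return (label, new_children)


def generate(tree):
    """Generation: read the target sentence off the leaves, left to right."""
    _label, children = tree
    if isinstance(children, str):
        return children
    return " ".join(generate(child) for child in children)


# Analysis output, written by hand for "the green witch".
source_tree = ("NP", [("DET", "the"),
                      ("NOM", [("ADJ", "green"), ("N", "witch")])])

print(generate(transfer(source_tree)))  # la bruja verde
```

Unlike word-by-word reordering, the rule fires on a tree node, so it applies wherever that node appears in a larger parse.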
English to French
• English: Adjective Noun
• French: Noun Adjective
• This is not always true
  • Route mauvaise ‘bad road, badly-paved road’
  • Mauvaise route ‘wrong road’
• But is a reasonable first approximation
• Rule:
Transfer rules
Transferring the green witch….
Interlingua
• Instead of N² sets of transfer rules
  • Use meaning as a representation language
1. Parse source sentence into meaning representation
2. Generate target sentence from meaning.
• Intuition: Use other NLP applications to do MT work
  • English book to Spanish: libro or reservar
  • Disambiguate book into concepts BOOKVOLUME and RESERVE
• Need 2N systems (a parser and generator for each language)
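The scaling argument can be checked with a little arithmetic. The sketch assumes one transfer rule set per ordered (source, target) language pair, which gives N(N-1), on the order of the N² quoted on the slide; the function names are hypothetical.

```python
def transfer_rule_sets(n_languages):
    """One rule set per ordered (source, target) pair: N * (N - 1)."""
    return n_languages * (n_languages - 1)


def interlingua_systems(n_languages):
    """One parser into and one generator out of the interlingua per language."""
    return 2 * n_languages


for n in (2, 5, 10, 20):
    print(n, transfer_rule_sets(n), interlingua_systems(n))
# 2 2 4
# 5 20 10
# 10 90 20
# 20 380 40
```

For two languages the interlingua is the more expensive design; the advantage appears as the number of languages grows.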
Interlingua for Mary did not slap the green witch