![Page 1: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/1.jpg)
An Overview of the AVENUE Project
Presented by Lori Levin
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA USA
![Page 2: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/2.jpg)
AVENUE Project
• Dr. Jaime Carbonell, PI
• Dr. Alon Lavie, Co-PI
• Dr. Lori Levin, Co-PI
• Dr. Robert Frederking
• Dr. Ralf Brown
• Dr. Rodolfo Vega
• Mapudungun
  – Dr. Eliseo Cañulef
  – Rosendo Huisca
  – and others
• Erik Peterson
• Christian Monson
• Ariadna Font Llitjós
• Alison Alvarez
• Roberto Aranovich
• Dr. Jeff Good
• Dr. Katharina Probst
• Hebrew
  – Dr. Shuly Wintner
  – student
This research was funded in part by NSF grant number IIS-0121-631.
![Page 3: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/3.jpg)
MT Approaches
[Figure: MT approaches triangle, from the source "Mi chiamo Lori" to the target "My name is Lori"]
• Direct: SMT, EBMT
• Transfer Rules: Syntactic Parsing yields Pronoun-acc-1-sg chiamare-1sg N; Text Generation starts from [np poss-1sg “name”] BE-pres N
• Interlingua: introduce-self, reached through Semantic Analysis and followed by Sentence Planning
• AVENUE: Automate Rule Learning
![Page 4: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/4.jpg)
Approaches to MT
• Direct
  – Works best with large parallel corpora
    • Millions of words
  – Can be done without linguistic resources
• Interlingua
  – Useful when you are translating between more than two languages
  – Requires linguistic knowledge
• Transfer
  – Requires linguistic knowledge
![Page 5: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/5.jpg)
Useful Resources for MT
• Parallel corpus
• Monolingual corpus
• Lexicon
• Morphological Analyzer (lemmatizer)
• Human Linguist
• Human non-linguist
![Page 6: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/6.jpg)
Low Resource Situations
• Indigenous languages
  – May lack large corpora
  – May lack a computational linguist
• “Strategic” Languages
  – Aside from standard written Arabic and Chinese
• Resource-rich language: limited domain
  – Most of the large parallel corpora are newspaper, parliamentary proceedings, or broadcast news
  – Fewer resources for conversation related to humanitarian aid.
![Page 7: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/7.jpg)
Why Machine Translation for Languages with Limited Resources?
• We are in the age of information explosion
  – With the internet, the web, and Google, anyone can get the information they want anytime…
• But what about the text in all those other languages?
  – How do they read all this English stuff?
  – How do we read all the stuff that they put online?
• MT for these languages would enable:
  – Better government access to native indigenous and minority communities
  – Better minority and native community participation in information-rich activities (health care, education, government) without giving up their languages.
  – Civilian and military applications (disaster relief)
  – Language preservation
![Page 8: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/8.jpg)
Mixed Resource Situations
• Some resources are available and others aren’t.
![Page 9: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/9.jpg)
Omnivorous MT
• Eat whatever resources are available
• Eat large or small amounts of data
![Page 10: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/10.jpg)
AVENUE’s Inventory
• Resources
  – Parallel corpus
  – Monolingual corpus
  – Lexicon
  – Morphological Analyzer (lemmatizer)
  – Human Linguist
  – Human non-linguist
• Techniques
  – Rule based transfer system
  – Example Based MT
  – Morphology Learning
  – Rule Learning
  – Interactive Rule Refinement
  – Multi-Engine MT
![Page 11: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/11.jpg)
The Avenue Low Resource Scenario
[Figure: AVENUE architecture diagram. Stages: Elicitation, Morphology, Rule Learning, Run-Time System, Rule Refinement. Components: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Morphology Analyzer, Learning Module, Learned Transfer Rules, Handcrafted rules, Lexical Resources, Run Time Transfer System, Decoder, Translation Correction Tool, Rule Refinement Module. INPUT TEXT enters the run-time system, which produces OUTPUT TEXT.]
![Page 12: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/12.jpg)
![Page 13: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/13.jpg)
![Page 14: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/14.jpg)
![Page 15: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/15.jpg)
AVENUE
• Rules can be written by hand or learned automatically.
• Hybrid
  – Rule-based transfer
  – Statistical decoder
  – Multi-engine combinations with SMT and EBMT
![Page 16: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/16.jpg)
AVENUE systems (Small and experimental, but tested on unseen data)
• Hebrew-to-English
  – Alon Lavie, Shuly Wintner, Katharina Probst
  – Hand-written and automatically learned
  – Automatic rules trained on 120 sentences perform slightly better than about 20 hand-written rules.
• Hindi-to-English
  – Lavie, Peterson, Probst, Levin, Font, Cohen, Monson
  – Automatically learned
  – Performs better than SMT when training data is limited to 50K words
![Page 17: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/17.jpg)
AVENUE systems (Small and experimental, but tested on unseen data)
• English-to-Spanish
  – Ariadna Font Llitjós
  – Hand-written, automatically corrected
• Mapudungun-to-Spanish
  – Roberto Aranovich and Christian Monson
  – Hand-written
• Dutch-to-English
  – Simon Zwarts
  – Hand-written
![Page 18: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/18.jpg)
![Page 19: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/19.jpg)
Elicitation
• Get data from someone who is
  – Bilingual
  – Literate
    • With consistent spelling
  – Not experienced with linguistics
![Page 20: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/20.jpg)
English-Hindi Example
Elicitation Tool: Erik Peterson
![Page 21: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/21.jpg)
English-Chinese Example
Note: Translator has to insert spaces between words in Chinese.
![Page 22: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/22.jpg)
English-Arabic Example
![Page 23: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/23.jpg)
Purpose of Elicitation
• Provide a small but highly targeted corpus of hand aligned data
  – To support machine learning from a small data set
  – To discover basic word order
  – To discover how syntactic dependencies are expressed
  – To discover which grammatical meanings are reflected in the morphology or syntax of the language

srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell

srcsent: Tú estás cayendo
tgtsent: eymi petu ütünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling

srcsent: Tú caíste
tgtsent: eymi ütrunagimi
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
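Elicitation records of this shape are straightforward to read mechanically. The following is a minimal sketch, assuming that each inner group of the `aligned` field lists space-separated, 1-based word positions for the source and then the target sentence; the function names `parse_alignment` and `aligned_words` are hypothetical and not part of the AVENUE tools.

```python
import re

def parse_alignment(text):
    """Parse '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    # Each inner group has the form '(source positions,target positions)'.
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", text):
        pairs.append(([int(i) for i in src.split()],
                      [int(j) for j in tgt.split()]))
    return pairs

def aligned_words(srcsent, tgtsent, aligned):
    """Return (source phrase, target phrase) pairs for one record."""
    src, tgt = srcsent.split(), tgtsent.split()
    return [(" ".join(src[i - 1] for i in s),
             " ".join(tgt[j - 1] for j in t))
            for s, t in parse_alignment(aligned)]

pairs = aligned_words("Tú estás cayendo", "eymi petu ütünagimi",
                      "((1,1),(2 3,2 3))")
print(pairs)
```

For the second record on the slide, this pairs "Tú" with "eymi" and the two-word progressive "estás cayendo" with "petu ütünagimi".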
![Page 24: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/24.jpg)
Languages
• The set of feature structures with English sentences has been delivered to the Linguistic Data Consortium as part of the Reflex program.
• Translated (by LDC) into:
  – Thai
  – Bengali
• Plans to translate into:
  – Seven “strategic” languages per year for five years.
    • As one small part of a language pack (BLARK) for each language.
![Page 25: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/25.jpg)
Languages
• Spanish version in progress at New Mexico State University (Helmreich and Cowie)
  – Plans to translate into Guarani
• Portuguese version in progress in Brazil (Marcello Modesto)
  – Plans to translate into Karitiana
    • 200 speakers
• Plans to translate into Inupiaq (Kaplan and MacLean)
![Page 26: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/26.jpg)
Previous Elicitation Work
• Pilot corpus
  – Around 900 sentences
  – No feature structures
• Mapudungun
  – Two partial translations
• Quechua
  – Three translations
• Aymara
  – Seven translations
• Hebrew
• Hindi
  – Several translations
• Dutch
![Page 27: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/27.jpg)
![Page 28: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/28.jpg)
AVENUE Machine Translation System
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)

Parts of the rule:
• Type information: NP::NP
• Synchronous Context Free Rules: [DET ADJ N] -> [DET N DET ADJ]
• Alignments: (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
• x-side constraints
• y-side constraints
• xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori Levin (Co-PI)
Rule learning: Katharina Probst
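To make the role of the alignments concrete, here is a minimal sketch of how a rule of this shape could be held in memory and used to reorder a constituent. It is an illustration only: the class `TransferRule`, its `apply` method, and the toy lexicon are assumptions, feature constraints are ignored, and nothing here reflects the actual AVENUE transfer engine.

```python
from dataclasses import dataclass

@dataclass
class TransferRule:
    rule_type: str    # type information, e.g. "NP::NP"
    x_side: list      # source constituent sequence
    y_side: list      # target constituent sequence
    alignments: list  # (x_index, y_index) pairs, 1-based as on the slide

    def apply(self, source_words, lexicon):
        """Translate and reorder source_words; feature constraints are ignored."""
        if len(source_words) != len(self.x_side):
            raise ValueError("input does not match the x-side of the rule")
        target = [None] * len(self.y_side)
        for x, y in self.alignments:
            # Fill target position y with the translation of source word x.
            target[y - 1] = lexicon[source_words[x - 1]]
        return target

np_rule = TransferRule(
    "NP::NP", ["DET", "ADJ", "N"], ["DET", "N", "DET", "ADJ"],
    [(1, 1), (1, 3), (2, 4), (3, 2)],
)
# Hypothetical toy lexicon for the slide's example.
toy_lexicon = {"the": "ha", "old": "zaqen", "man": "ish"}
result = np_rule.apply(["the", "old", "man"], toy_lexicon)
print(result)
```

Because X1 is aligned to both Y1 and Y3, the determiner is copied twice, which matches the doubled "ha-" in "ha-ish ha-zaqen".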
![Page 29: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/29.jpg)
Rule Learning - Overview
• Goal: Acquire Syntactic Transfer Rules
• Use available knowledge from the major-language side (grammatical structure)
• Three steps:
  1. Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
  2. Compositionality Learning: use previously learned rules to learn hierarchical structure
  3. Constraint Learning: refine rules by learning appropriate feature constraints
![Page 30: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/30.jpg)
Flat Seed Rule Generation
Learning Example: NP
Eng: the big apple
Heb: ha-tapuax ha-gadol
Generated Seed Rule:
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2))
![Page 31: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/31.jpg)
Flat Seed Rule Generation
• Create a “flat” transfer rule specific to the sentence pair, partially abstracted to POS
  – Words that are aligned word-to-word and have the same POS in both languages are generalized to their POS
  – Words that have complex alignments (or not the same POS) remain lexicalized
• One seed rule for each translation example
• No feature constraints associated with seed rules (but mark the example(s) from which it was learned)
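The seed generation step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the project's learner: the function name `flat_seed_rule` is hypothetical, POS tags are assumed to be given for both sides, and a word is generalized whenever every word it is aligned to carries the same POS (so that "the" in the slide's example, aligned to two articles, still generalizes to ART).

```python
def flat_seed_rule(src, tgt, alignments, category):
    """src/tgt: lists of (word, pos); alignments: 1-based (x, y) pairs."""
    def side(tokens, idx):
        out = []
        for i in range(1, len(tokens) + 1):
            word, pos = tokens[i - 1]
            links = [a for a in alignments if a[idx] == i]
            same_pos = all(src[x - 1][1] == tgt[y - 1][1] for x, y in links)
            # Generalize to the POS only if the word is aligned and the POS
            # agrees across every link; otherwise keep the word lexicalized.
            out.append(pos if links and same_pos else f'"{word}"')
        return out

    return {
        "type": f"{category}::{category}",
        "x_side": side(src, 0),
        "y_side": side(tgt, 1),
        "alignments": sorted(alignments),
    }

seed = flat_seed_rule(
    [("the", "ART"), ("big", "ADJ"), ("apple", "N")],
    [("ha", "ART"), ("tapuax", "N"), ("ha", "ART"), ("gadol", "ADJ")],
    [(1, 1), (1, 3), (2, 4), (3, 2)],
    "NP",
)
print(seed["x_side"], seed["y_side"])
```

On the slide's example this yields the x-side [ART ADJ N] and the y-side [ART N ART ADJ]; an unaligned target word would stay in the rule as a quoted literal.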
![Page 32: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/32.jpg)
Compositionality Learning
Initial Flat Rules:
S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N]
((X1::Y1) (X2::Y2))
Generated Compositional Rule:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4))
![Page 33: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/33.jpg)
Compositionality Learning
• Detection: traverse the c-structure of the English sentence, add compositional structure for translatable chunks
• Generalization: adjust constituent sequences and alignments
• Two implemented variants:
  – Safe Compositionality: there exists a transfer rule that correctly translates the sub-constituent
  – Maximal Compositionality: Generalize the rule if supported by the alignments, even in the absence of an existing transfer rule for the sub-constituent
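The generalization step (adjusting constituent sequences and alignments) can be sketched as follows. The sketch assumes that detection has already identified the translatable chunks and their spans on both sides; the helper names `collapse` and `generalize` are hypothetical, and the code does not check whether a transfer rule exists for each chunk, so it does not distinguish the safe and maximal variants.

```python
def collapse(seq, spans):
    """Replace each 1-based inclusive span (start, end, label) by its label.
    Returns the new sequence and a map from old index to new index."""
    new_seq, index_map = [], {}
    i = 1
    while i <= len(seq):
        span = next((s for s in spans if s[0] == i), None)
        if span:
            start, end, label = span
            new_seq.append(label)
            for old in range(start, end + 1):
                index_map[old] = len(new_seq)
            i = end + 1
        else:
            new_seq.append(seq[i - 1])
            index_map[i] = len(new_seq)
            i += 1
    return new_seq, index_map

def generalize(x_side, y_side, alignments, chunks):
    """chunks: list of (label, (x_start, x_end), (y_start, y_end))."""
    new_x, x_map = collapse(x_side, [(xs, xe, lab) for lab, (xs, xe), _ in chunks])
    new_y, y_map = collapse(y_side, [(ys, ye, lab) for lab, _, (ys, ye) in chunks])
    # Links inside a chunk all collapse onto a single constituent-level link.
    new_align = sorted({(x_map[x], y_map[y]) for x, y in alignments})
    return new_x, new_y, new_align

new_x, new_y, new_align = generalize(
    ["ART", "ADJ", "N", "V", "ART", "N"],
    ["ART", "N", "ART", "ADJ", "V", "P", "ART", "N"],
    [(1, 1), (1, 3), (2, 4), (3, 2), (4, 5), (5, 7), (6, 8)],
    [("NP", (1, 3), (1, 4)), ("NP", (5, 6), (7, 8))],
)
print(new_x, new_y, new_align)
```

Applied to the flat S rule from the previous slide, this reproduces the compositional rule S::S [NP V NP] -> [NP V P NP] with alignments (X1::Y1) (X2::Y2) (X3::Y4).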
![Page 34: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/34.jpg)
Constraint Learning
Input: Rules and their Example Sets
S::S [NP V NP] -> [NP V P NP] {ex1,ex12,ex17,ex26}
((X1::Y1) (X2::Y2) (X3::Y4))
NP::NP [ART ADJ N] -> [ART N ART ADJ] {ex2,ex3,ex13}
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N] {ex4,ex5,ex6,ex8,ex10,ex11}
((X1::Y1) (X2::Y2))
Output: Rules with Feature Constraints:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4)
(X1 NUM = X2 NUM)
(Y1 NUM = Y2 NUM)
(X1 NUM = Y1 NUM))
![Page 35: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/35.jpg)
Constraint Learning
• Goal: add appropriate feature constraints to the acquired rules
• Methodology:
  – Preserve general structural transfer
  – Learn specific feature constraints from example set
• Seed rules are grouped into clusters of similar transfer structure (type, constituent sequences, alignments)
• Each cluster forms a version space: a partially ordered hypothesis space with a specific and a general boundary
• The seed rules in a group form the specific boundary of a version space
• The general boundary is the (implicit) transfer rule with the same type, constituent sequences, and alignments, but no feature constraints
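As a rough illustration of learning feature constraints from an example set, the sketch below keeps only those agreement constraints that hold in every example attached to a rule. It is a deliberate simplification and not the version-space procedure described above; the function name `agreement_constraints`, the data layout, and the feature values are all assumptions made for the example.

```python
from itertools import combinations

def agreement_constraints(examples, feature):
    """examples: non-empty list of dicts mapping a constituent name
    (e.g. 'X1') to its feature dict. Returns the constituent pairs whose
    `feature` values are present and equal in every example."""
    names = sorted(set.intersection(*(set(ex) for ex in examples)))
    kept = []
    for a, b in combinations(names, 2):
        if all(feature in ex[a] and feature in ex[b]
               and ex[a][feature] == ex[b][feature] for ex in examples):
            kept.append((a, b))
    return kept

# Toy feature values, invented for illustration only.
toy_examples = [
    {"X1": {"NUM": "sg"}, "X2": {"NUM": "sg"}, "X3": {"NUM": "pl"}},
    {"X1": {"NUM": "pl"}, "X2": {"NUM": "pl"}, "X3": {"NUM": "pl"}},
]
constraints = agreement_constraints(toy_examples, "NUM")
print(constraints)
```

With these toy examples only the pair (X1, X2) survives, which corresponds to a constraint of the form (X1 NUM = X2 NUM).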
![Page 36: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/36.jpg)
Transfer and Decoding
![Page 37: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/37.jpg)
The Transfer Engine
Analysis: Source text is parsed into its grammatical structure. Determines transfer application ordering.
Example:
ראיתי את האיש הזקן
(I) saw *acc the man the old
[Figure: source tree. S over VP; VP over V P NP; NP over D N D Adj.]

Transfer: A target language tree is created by reordering, insertion, and deletion. Source words translated with transfer lexicon.
[Figure: target tree. S over NP VP; NP over N ("I"); VP over V ("saw") and NP; NP over DET Adj N ("the old man").]

Generation: Target language constraints are checked, target morphology applied, and final translation produced.
E.g. “saw” in past tense selected.
Final translation: “I saw the old man”
![Page 38: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/38.jpg)
Symbolic Decoder
• System rarely finds a full parse/transfer for complete input sentence
• XFER engine produces comprehensive lattice of segment translations
• Decoder selects best combination of translation segments
• Search for optimal scoring path of partial translations, based on multiple features:
  – Target Language Model scores
  – XFER Rule Scores
  – Path Fragmentation
  – Other features…
• Symbolic decoding essential for scenarios where there is insufficient data for training large target LM
  – Effective Rule Scoring is crucial
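The search for the best combination of segment translations can be sketched as a dynamic program over source positions. This is a minimal illustration under stated assumptions: each lattice entry already carries a single combined score, fragmentation is modeled as a fixed penalty per segment, and the function name `decode`, the penalty value, and the toy scores are hypothetical rather than taken from the AVENUE decoder.

```python
def decode(n, segments, fragment_penalty=1.0):
    """n: number of source words. segments: (start, end, text, score) tuples
    with 0 <= start < end <= n, covering source words start..end-1.
    Returns (best_score, texts), or None if no path covers the whole input."""
    best = {0: (0.0, [])}
    for pos in range(n):
        if pos not in best:
            continue  # no partial translation ends at this position
        score_so_far, texts = best[pos]
        for start, end, text, score in segments:
            if start != pos:
                continue
            # Each extra segment costs a penalty, favoring less fragmented paths.
            candidate = score_so_far + score - fragment_penalty
            if end not in best or candidate > best[end][0]:
                best[end] = (candidate, texts + [text])
    return best.get(n)

toy_lattice = [
    (0, 1, "I", 1.0),
    (1, 2, "saw", 1.0),
    (2, 4, "the old man", 2.0),
    (1, 4, "saw the old man", 3.5),
]
best_path = decode(4, toy_lattice)
print(best_path)
```

Since every segment moves strictly forward, the best score at a position is final by the time that position is expanded, so a single left-to-right pass is enough.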
![Page 39: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/39.jpg)
![Page 40: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/40.jpg)
Rule Refinement
![Page 41: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/41.jpg)
Interactive and Automatic Refinement of Translation Rules
• Problem: Improve Machine Translation quality.
• Proposed Solution: Put bilingual speakers back into the loop; use their corrections to detect the source of the error and automatically improve the lexicon and the grammar.
• Approach: Automate post-editing efforts by feeding them back into the MT system. Go beyond post-editing by automatically refining the translation rules that caused an error.
• Goal: Improve MT coverage and overall quality.
![Page 42: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/42.jpg)
Technical Challenges
• Elicit minimal MT information from non-expert users
• Automatically refine and expand translation rules minimally
  – Manually written
  – Automatically learned
• Automatic evaluation of the refinement process
![Page 43: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/43.jpg)
Error Typology for Automatic Rule Refinement (simplified)
• Missing word
• Extra word
• Wrong word order
  – Local vs. long distance
  – Word vs. phrase
  – + Word change
• Incorrect word
  – Sense
  – Form
  – Selectional restrictions
  – Idiom
• Wrong agreement
  – Missing constraint
  – Extra constraint
![Page 44: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/44.jpg)
TCTool (Demo)
Interactive elicitation of error information
Actions:
• Add a word
• Delete a word
• Modify a word
• Change word order

| | precision | recall |
|---|---|---|
| error detection | 90% | 89% |
| error classification | 72% | 71% |
![Page 45: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/45.jpg)
Types of Refinement Operations
Automatic Rule Adaptation
1. Refine a translation rule: R0 -> R1 (change R0 to make it more specific or more general)
R0: NP [DET ADJ N] -> NP [DET N ADJ]
  a nice house -> *una casa bonito
R1: NP [DET ADJ N] -> NP [DET N ADJ], with N gender = ADJ gender
  a nice house -> una casa bonita
![Page 46: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/46.jpg)
Types of Refinement Operations
Automatic Rule Adaptation
2. Bifurcate a translation rule: R0 -> R0 (same, general rule) + R1 (add a new, more specific rule)
R0: NP [DET ADJ N] -> NP [DET N ADJ]
  a nice house -> una casa bonita
R1: NP [DET ADJ N] -> NP [DET ADJ N], with ADJ type: pre-nominal
  a great artist -> un gran artista
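The two refinement operations can be sketched on a simple dictionary representation of a rule. This is an illustration under assumptions: the rule layout, the function names `refine` and `bifurcate`, and the constraint strings are hypothetical notation chosen for the example, not the format used by the AVENUE rule refinement module.

```python
import copy

def refine(rule, constraint):
    """R0 -> R1: return a more specific copy of the rule with one added constraint."""
    new_rule = copy.deepcopy(rule)
    new_rule["constraints"].append(constraint)
    return new_rule

def bifurcate(rule, new_y_side, new_alignments, constraint):
    """Keep R0 unchanged and return [R0, R1], where R1 is a more specific variant."""
    specific = copy.deepcopy(rule)
    specific["y_side"] = list(new_y_side)
    specific["alignments"] = list(new_alignments)
    specific["constraints"].append(constraint)
    return [rule, specific]

r0 = {
    "type": "NP::NP",
    "x_side": ["DET", "ADJ", "N"],
    "y_side": ["DET", "N", "ADJ"],
    "alignments": [(1, 1), (2, 3), (3, 2)],
    "constraints": [],
}
# Operation 1: add gender agreement between the target noun and adjective.
r1_refined = refine(r0, "(Y2 gender) = (Y3 gender)")
# Operation 2: keep R0 and add a rule for pre-nominal adjectives.
rules = bifurcate(r0, ["DET", "ADJ", "N"], [(1, 1), (2, 2), (3, 3)],
                  "(Y2 type) = pre-nominal")
print(r1_refined["constraints"], rules[1]["y_side"])
```

Both operations copy the rule before changing it, so the original R0 stays available, which is what bifurcation requires.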
![Page 47: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/47.jpg)
A concrete example
Stages: Error Information Elicitation, Refinement Operation Typology, Automatic Rule Adaptation
SL: Gaudí was a great artist
MT system output: TL: Gaudí era un artista grande
Action: Change word order
Ucorrection: *Gaudí era un artista grande -> Gaudí era un gran artista
[Figure labels on the sentence pair: clue word, error, correction]
![Page 48: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/48.jpg)
Mapudungun
• Indigenous Language of Chile and Argentina
• ~ 1 Million Mapuche Speakers
![Page 49: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/49.jpg)
Mapudungun Language
• 900,000 Mapuche people
• At least 300,000 speakers of Mapudungun
• Polysynthetic

sl: pe-rke-fi-ñ Maria
gloss: ver-REPORT-3pO-1pSgS/IND
tl: DICEN QUE LA VI A MARÍA
(They say that) I saw Maria.
![Page 50: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/50.jpg)
AVENUE Mapudungun
• Joint project between Carnegie Mellon University, the Chilean Ministry of Education, and Universidad de la Frontera.
![Page 51: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/51.jpg)
Mapudungun to Spanish Resources
• Initially:
  – Large team of native speakers at Universidad de la Frontera, Temuco, Chile
    • Some knowledge of linguistics
    • No knowledge of computational linguistics
  – No corpus
  – A few short word lists
  – No morphological analyzer
• Later: Computational Linguists with non-native knowledge of Mapudungun
• Other considerations:
  – Produce something that is useful to the community, especially for bilingual education
  – Experimental MT systems are not useful
![Page 52: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/52.jpg)
Mapudungun
[Figure: the AVENUE architecture diagram (Elicitation, Morphology, Rule Learning, Run-Time System, Rule Refinement), annotated for Mapudungun with:]
• Corpus: 170 hours of spoken Mapudungun
• Example Based MT
• Spelling checker
• Spanish Morphology from UPC, Barcelona
![Page 53: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/53.jpg)
Mapudungun Products
• http://www.lenguasamerindias.org/
  – Click: traductor mapudungún
  – Dictionary lookup (Mapudungun to Spanish)
  – Morphological analysis
  – Example Based MT (Mapudungun to Spanish)
![Page 54: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/54.jpg)
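I didn’t see Maria
Figure (parse trees): on the Mapudungun side, S → VP, with V pe followed by the verb suffixes (VSuff, grouped under VSuffG nodes) la, fi, ñ, and NP → N Maria; on the Spanish side, S → VP, with V “no” vi, then “a”, then NP → N María.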
![Page 55: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/55.jpg)
Transfer to Spanish: Top-Down
Figure (parse trees): the Mapudungun tree for pe + la + fi + ñ Maria (V with VSuff and VSuffG nodes, NP → N Maria) beside a partially built Spanish tree, S → VP → V “a” NP.
VP::VP [VBar NP] -> [VBar "a" NP]
(
 (X1::Y1)
 (X2::Y3)
 ((X2 type) = (*NOT* personal))
 ((X2 human) =c +)
 (X0 = X1)
 ((X0 object) = X2)
 (Y0 = X0)
 ((Y0 object) = (X0 object))
 (Y1 = Y0)
 (Y3 = (Y0 object))
 ((Y1 objmarker person) = (Y3 person))
 ((Y1 objmarker number) = (Y3 number))
 ((Y1 objmarker gender) = (Y3 gender))
)
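The effect of this rule, inserting Spanish “a” before a human, non-pronominal object, can be illustrated with a minimal Python sketch. This is not the AVENUE transfer engine: the function name `apply_vp_rule`, the dictionary layout, and the example noun phrases are assumptions made only for illustration.

```python
def apply_vp_rule(vbar, np):
    """Toy version of VP::VP [VBar NP] -> [VBar "a" NP].

    vbar and np are dicts with a 'words' list and a 'feats' dict
    (a hypothetical representation, not the project's own).
    Returns the Spanish-side word list, or None when the rule's
    constraints are not satisfied.
    """
    feats = np["feats"]
    # ((X2 type) = (*NOT* personal)): the object must not be a personal pronoun.
    if feats.get("type") == "personal":
        return None
    # ((X2 human) =c +): the object must be explicitly marked as human.
    if feats.get("human") != "+":
        return None
    # Target side [VBar "a" NP]: verb group, inserted marker, object.
    return vbar["words"] + ["a"] + np["words"]


vbar = {"words": ["no", "vi"], "feats": {}}
maria = {"words": ["María"], "feats": {"human": "+", "type": "proper"}}
libro = {"words": ["el", "libro"], "feats": {"human": "-"}}  # invented non-human object

print(apply_vp_rule(vbar, maria))
print(apply_vp_rule(vbar, libro))
```

The first call yields the word sequence of “no vi a María”; the second returns `None` because the object is not marked human, so a different rule would have to apply.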
![Page 56: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/56.jpg)
Mapudungun
• Indigenous Language of Chile and Argentina
• ~ 1 Million Mapuche Speakers
![Page 57: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/57.jpg)
Collaboration
• Mapuche Language Experts
  – Universidad de la Frontera (UFRO)
    • Instituto de Estudios Indígenas (IEI)
      – Institute for Indigenous Studies
• Chilean Funding
  – Chilean Ministry of Education (Mineduc)
    • Bilingual and Multicultural Education Program
Eliseo Cañulef
Rosendo Huisca
Hugo Carrasco
Hector Painequeo
Flor Caniupil
Luis Caniupil Huaiquiñir
Marcela Collio Calfunao
Cristian Carrillan Anton
Salvador Cañulef
Carolina Huenchullan Arrúe
Claudio Millacura Salas
![Page 58: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/58.jpg)
Accomplishments
• Corpora Collection
  – Spoken Corpus
    • Collected: Luis Caniupil Huaiquiñir
    • Medical Domain
    • 3 of 4 Mapudungun Dialects
      – 120 hours of Nguluche
      – 30 hours of Lafkenche
      – 20 hours of Pwenche
    • Transcribed in Mapudungun
    • Translated into Spanish
  – Written Corpus
    • ~ 200,000 words
    • Bilingual Mapudungun – Spanish
    • Historical and newspaper text
nmlch-nmjm1_x_0405_nmjm_00:
M: <SPA>no pütokovilu kay ko
C: no, si me lo tomaba con agua
M: chumgechi pütokoki femuechi pütokon pu <Noise>
C: como se debe tomar, me lo tomé pués
nmlch-nmjm1_x_0406_nmlch_00:
M: Chengewerkelafuymiürke
C: Ya no estabas como gente entonces!
![Page 59: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/59.jpg)
Accomplishments
• Developed at UFRO
  – Bilingual Dictionary with Examples
    • 1,926 entries
  – Spelling Corrected Mapudungun Word List
    • 117,003 fully-inflected word forms
  – Segmented Word List
    • 15,120 forms
    • Stems translated into Spanish
![Page 60: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/60.jpg)
Accomplishments
• Developed at LTI using Mapudungun language resources from UFRO
  – Spelling Checker
    • Integrated into OpenOffice
  – Hand-built Morphological Analyzer
  – Prototype Machine Translation Systems
    • Rule-Based
    • Example-Based
  – Website: LenguasAmerindias.org
![Page 61: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/61.jpg)
AVENUE Hebrew
• Joint project of Carnegie Mellon University and University of Haifa
![Page 62: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/62.jpg)
Hebrew Language
• Native language of about 3-4 Million in Israel
• Semitic language, closely related to Arabic and with similar linguistic properties
  – Root+Pattern word formation system
  – Rich verb and noun morphology
  – Particles attach as prefixes to the following word: definite article (H), prepositions (B,K,L,M), coordinating conjunction (W), relativizers ($,K$)…
• Unique alphabet and Writing System
  – 22 letters represent (mostly) consonants
  – Vowels represented (mostly) by diacritics
  – Modern texts omit the diacritic vowels, thus an additional level of ambiguity: one “bare” word can correspond to several vocalized words
  – Example: MHGR → mehager, m+hagar, m+h+ger
![Page 63: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/63.jpg)
Hebrew Resources
• Morphological analyzer developed at Technion
• Constructed our own Hebrew-to-English lexicon, based primarily on existing “Dahan” H-to-E and E-to-H dictionary
• Human Computational Linguists
• Native Speakers
![Page 64: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/64.jpg)
Hebrew
Figure (AVENUE architecture diagram). Stage labels: Elicitation, Morphology, Rule Learning, Run-Time System, Rule Refinement. Component labels: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Morphology Analyzer, Learning Module, Learned Transfer Rules, Handcrafted rules, Lexical Resources, Run Time Transfer System, Decoder, Translation Correction Tool, Rule Refinement Module, INPUT TEXT, OUTPUT TEXT.
![Page 65: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/65.jpg)
Flat Seed Rule Generation
Learning Example: NP
Eng: the big apple
Heb: ha-tapuax ha-gadol
Generated Seed Rule:
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2))
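How a flat seed rule could be read off a single word-aligned, POS-tagged example can be sketched in a few lines of Python. The function name `flat_seed_rule` and its input format are assumptions for illustration, not the project's Learning Module; the `->` separator follows the notation used in the deck's hand-written rules.

```python
def flat_seed_rule(cat, src_pos, tgt_pos, alignments):
    """Build a flat seed rule string from one word-aligned example.

    cat        -- constituent type of the example (e.g. "NP")
    src_pos    -- POS tags of the source words, in order
    tgt_pos    -- POS tags of the target words, in order
    alignments -- (source_index, target_index) pairs, 1-based
    """
    header = f"{cat}::{cat} [{' '.join(src_pos)}] -> [{' '.join(tgt_pos)}]"
    # One (Xi::Yj) line per word alignment, in source order.
    body = "\n".join(f" (X{i}::Y{j})" for i, j in sorted(alignments))
    return f"{header}\n(\n{body}\n)"


# "the big apple" / "ha-tapuax ha-gadol":
# the -> ha (both occurrences), big -> gadol, apple -> tapuax
rule = flat_seed_rule(
    "NP",
    ["ART", "ADJ", "N"],
    ["ART", "N", "ART", "ADJ"],
    [(1, 1), (1, 3), (2, 4), (3, 2)],
)
print(rule)
```

The rule is "flat" because it records only POS tags and word alignments; no internal constituent structure or feature constraints have been learned yet.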
![Page 66: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/66.jpg)
Compositionality Learning
Initial Flat Rules:
S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N]
((X1::Y1) (X2::Y2))
Generated Compositional Rule:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4))
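One way to picture the step from the flat S rule to the compositional one is a greedy substitution: wherever a lower-level rule matches a span on both sides and the word alignments stay inside that span pair, the span is replaced by the rule's category. The Python sketch below is an assumption-laden illustration of that idea (the function names and the greedy longest-first strategy are not taken from the slides) and uses 0-based indices internally.

```python
def find_spans(seq, pattern):
    """All (start, end) positions where pattern occurs in seq."""
    n = len(pattern)
    return [(i, i + n) for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]


def consistent(links, s_span, t_span):
    """True if at least one link lies in the span pair and none crosses out of it."""
    touched = False
    for x, y in links:
        in_s = s_span[0] <= x < s_span[1]
        in_t = t_span[0] <= y < t_span[1]
        if in_s != in_t:
            return False
        touched = touched or in_s
    return touched


def remap(i, span):
    """Index of position i after span has been collapsed to one symbol."""
    if i < span[0]:
        return i
    if i < span[1]:
        return span[0]
    return i - (span[1] - span[0] - 1)


def compose(src, tgt, links, sub_rules):
    """Greedily replace spans covered by sub_rules (longest source side first)."""
    src, tgt, links = list(src), list(tgt), set(links)
    for cat, s_pat, t_pat in sorted(sub_rules, key=lambda r: -len(r[1])):
        replaced = True
        while replaced:
            replaced = False
            for s_span in find_spans(src, s_pat):
                for t_span in find_spans(tgt, t_pat):
                    if consistent(links, s_span, t_span):
                        links = {(remap(x, s_span), remap(y, t_span)) for x, y in links}
                        src = src[:s_span[0]] + [cat] + src[s_span[1]:]
                        tgt = tgt[:t_span[0]] + [cat] + tgt[t_span[1]:]
                        replaced = True
                        break
                if replaced:
                    break
    return src, tgt, sorted(links)


flat_src = "ART ADJ N V ART N".split()
flat_tgt = "ART N ART ADJ V P ART N".split()
flat_links = [(0, 0), (0, 2), (1, 3), (2, 1), (3, 4), (4, 6), (5, 7)]  # 0-based
np_rules = [
    ("NP", "ART ADJ N".split(), "ART N ART ADJ".split()),
    ("NP", "ART N".split(), "ART N".split()),
]
result = compose(flat_src, flat_tgt, flat_links, np_rules)
print(result)
```

Converted back to 1-based indices, the surviving links are (X1::Y1), (X2::Y2), (X3::Y4), which matches the generated compositional rule on the slide.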
![Page 67: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/67.jpg)
Constraint Learning
Input: Rules and their Example Sets
S::S [NP V NP] -> [NP V P NP] {ex1,ex12,ex17,ex26}
((X1::Y1) (X2::Y2) (X3::Y4))
NP::NP [ART ADJ N] -> [ART N ART ADJ] {ex2,ex3,ex13}
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N] {ex4,ex5,ex6,ex8,ex10,ex11}
((X1::Y1) (X2::Y2))
Output: Rules with Feature Constraints:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4)
(X1 NUM = X2 NUM)
(Y1 NUM = Y2 NUM)
(X1 NUM = Y1 NUM))
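As a rough intuition for how agreement constraints might be proposed from a rule's example set, the Python sketch below keeps a constraint only when two constituents carry the same feature value in every example. It is a deliberate simplification and not the project's learning algorithm; the function name and the two example feature sets are invented for illustration.

```python
from itertools import combinations


def agreement_constraints(examples, feature):
    """Propose (A f = B f) for every pair that agrees on f in all examples.

    examples -- list of dicts mapping a constituent name (e.g. 'X1')
                to that constituent's feature dict
    """
    # Only consider constituents present in every example.
    names = sorted(set.intersection(*(set(ex) for ex in examples)))
    constraints = []
    for a, b in combinations(names, 2):
        if all(
            feature in ex[a] and feature in ex[b] and ex[a][feature] == ex[b][feature]
            for ex in examples
        ):
            constraints.append(f"({a} {feature} = {b} {feature})")
    return constraints


# Invented feature values for two examples of S::S [NP V NP] -> [NP V P NP]
examples = [
    {"X1": {"NUM": "sg"}, "X2": {"NUM": "sg"}, "X3": {"NUM": "pl"},
     "Y1": {"NUM": "sg"}, "Y2": {"NUM": "sg"}, "Y4": {"NUM": "pl"}},
    {"X1": {"NUM": "pl"}, "X2": {"NUM": "pl"}, "X3": {"NUM": "sg"},
     "Y1": {"NUM": "pl"}, "Y2": {"NUM": "pl"}, "Y4": {"NUM": "sg"}},
]
constraints = agreement_constraints(examples, "NUM")
print(constraints)
```

On this toy data the result includes the three constraints shown on the slide, plus pairs implied by them; a real learner would also have to prune redundant constraints and avoid over-generalizing from small example sets.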
![Page 68: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/68.jpg)
Challenges for Hebrew MT
• Paucity of existing language resources for Hebrew
  – No publicly available broad coverage morphological analyzer
  – No publicly available bilingual lexicons or dictionaries
  – No POS-tagged corpus or parse tree-bank corpus for Hebrew
  – No large Hebrew/English parallel corpus
• Scenario well suited for CMU transfer-based MT framework for languages with limited resources
![Page 69: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/69.jpg)
Hebrew Morphology Example
• Input word: B$WRH
0 1 2 3 4
|--------B$WRH--------|
|-----B-----|$WR|--H--|
|--B--|-H--|--$WRH---|
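The lattice can be read as a set of arcs between character positions, and every gap-free path from position 0 to position 4 is one candidate segmentation. The Python sketch below enumerates those paths; the arc list is one reading of the three rows of the diagram, and the function name `lattice_paths` is an assumption for illustration.

```python
def lattice_paths(arcs, start, end):
    """Enumerate all arc sequences that cover [start, end] without gaps."""
    if start == end:
        return [[]]
    paths = []
    for s, e, lex in arcs:
        if s == start:
            # Extend this arc with every way of covering the remainder.
            for rest in lattice_paths(arcs, e, end):
                paths.append([lex] + rest)
    return paths


# (span start, span end, lexeme), read off the three rows of the lattice
arcs = [
    (0, 4, "B$WRH"),
    (0, 2, "B"), (2, 3, "$WR"), (3, 4, "H"),
    (0, 1, "B"), (1, 2, "H"), (2, 4, "$WRH"),
]
paths = lattice_paths(arcs, 0, 4)
for p in paths:
    print(" + ".join(p))
```

Because arcs from different rows share end points, the lattice licenses more paths than the three rows drawn on the slide, which is why a downstream component has to choose among the analyses.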
![Page 70: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/70.jpg)
Hebrew Morphology Example
Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
![Page 71: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/71.jpg)
Sample Output (dev-data)
maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat
a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police
in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money
![Page 72: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/72.jpg)
Quechua to Spanish MT
• V-Unit: funded Summer project in Cusco (Peru) June-August 2005 [preparations and data collection started earlier]
• Intensive Quechua course in Centro Bartolome de las Casas (CBC)
• Worked together with two native Quechua speakers and one non-native speaker on developing infrastructure (correcting elicited translations, segmenting and translating a list of the most frequent words)
![Page 73: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/73.jpg)
Quechua to Spanish Prototype MT System
• Stem Lexicon (semi-automatically generated): 753 lexical entries
• Suffix lexicon: 21 suffixes
  – (150 Cusihuaman)
• Quechua morphology analyzer
• 25 translation rules
• Spanish morphology generation module
• User-Studies: 10 sentences, 3 users (2 native, 1 non-native)
![Page 74: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/74.jpg)
Quechua facts
• Agglutinative language
• A stem can often have 10 to 12 suffixes, but it can have up to 28 suffixes
• Supposedly clear cut boundaries, but in reality several suffixes change when followed by certain other suffixes
• No irregular verbs, nouns or adjectives
• Does not mark for gender
• No adjective agreement
• No definite or indefinite articles (‘topic’ and ‘focus’ markers perform a task similar to that of articles and intonation in English or Spanish)
![Page 75: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/75.jpg)
Quechua examples
– taki+ni (also written takiniy)
  sing 1sg (I sing)
  canto
– taki+sha+ni (takishaniy)
  sing progr 1sg (I am singing)
  estoy cantando
– taki+pa+ku+q+chu?
  taki sing
  -pa+ku to join a group to do something
  -q agentive
  -chu interrogative
  (para) cantar con la gente (del pueblo)? (to sing with the people (of the village)?)
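The segmentations above can be reproduced with a very small stem-plus-suffix splitter. The Python sketch below is only an illustration: the lexicon holds just the morphemes glossed on this slide, `-pa+ku` is treated as a single entry (an assumption), and the suffix changes at morpheme boundaries that the deck mentions are ignored.

```python
STEMS = {"taki": "sing"}
SUFFIXES = {
    "ni": "1sg",
    "sha": "progr",
    "paku": "join a group to do something",  # -pa+ku kept as one unit (assumption)
    "q": "agentive",
    "chu": "interrogative",
}


def segment(word, stems=STEMS, suffixes=SUFFIXES):
    """Return every split of word into a known stem followed by known suffixes."""
    results = []

    def strip(rest, acc):
        if not rest:
            results.append(acc)
            return
        for suf in suffixes:
            if rest.startswith(suf):
                strip(rest[len(suf):], acc + [suf])

    for stem in stems:
        if word.startswith(stem):
            strip(word[len(stem):], [stem])
    return results


def gloss(morphs, stems=STEMS, suffixes=SUFFIXES):
    """Look up the gloss of each morpheme in a segmentation."""
    return [stems.get(m) or suffixes[m] for m in morphs]


for w in ["takini", "takishani", "takipakuqchu"]:
    for seg in segment(w):
        print(w, "=", "+".join(seg), "|", " ".join(gloss(seg)))
```

With a realistic lexicon the splitter would return several competing segmentations per word, so some ranking or disambiguation step would be needed on top of it.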
![Page 76: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/76.jpg)
Quechua Resources
• A few native speakers, not linguists
• A computational linguist learning Quechua
• Two fluent, but non-native linguists
![Page 77: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/77.jpg)
Quechua
Figure (AVENUE architecture diagram). Stage labels: Elicitation, Morphology, Rule Learning, Run-Time System, Rule Refinement. Component labels: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Morphology Analyzer, Learning Module, Learned Transfer Rules, Handcrafted rules, Lexical Resources, Run Time Transfer System, Decoder, Translation Correction Tool, Rule Refinement Module, INPUT TEXT, OUTPUT TEXT.
Annotation on this slide: Parallel Corpus: OCR with correction.
![Page 78: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/78.jpg)
Grammar rules
;taki+sha+ni -> estoy cantando (I am singing)
{VBar,3}
VBar::VBar : [V VSuff VSuff] -> [V V]
(
 (X1::Y2)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) =c ger)
 ((y2 mood) = (x2 mood))
 ((y1 form) =c estar)
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 tense) = (x3 tense))
 ((x0 tense) = (x3 tense))
 ((y1 mood) = (x3 mood))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
)
Figure (generation step): the feature structures (lex = cantar, mood = ger) and (lex = estar, person = 1, number = sg, tense = pres, mood = ind) pass through Spanish Morphology Generation to yield “estoy cantando”.
![Page 79: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/79.jpg)
Hindi Resources
• Large statistical lexicon from the Linguistic Data Consortium (LDC)
• Parallel Corpus from LDC
• Morphological Analyzer-Generator from LDC
• Lots of native speakers
• Computational linguists with little or no knowledge of Hindi
• Experimented with the size of the parallel corpus
  – Miserly and large scenarios
![Page 80: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/80.jpg)
Hindi
Figure (AVENUE architecture diagram). Stage labels: Elicitation, Morphology, Rule Learning, Run-Time System, Rule Refinement. Component labels: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Morphology Analyzer, Learning Module, Learned Transfer Rules, Handcrafted rules, Lexical Resources, Run Time Transfer System, Decoder, Translation Correction Tool, Rule Refinement Module, INPUT TEXT, OUTPUT TEXT.
Annotations on this slide: 15,000 Noun Phrases from Penn TreeBank; Parallel Corpus; EBMT; SMT; Supported by DARPA TIDES.
![Page 81: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/81.jpg)
Manual Transfer Rules: Example
; NP1 ke NP2 -> NP2 of NP1
; Ex: jIvana ke eka aXyAya
; life of (one) chapter
; ==> a chapter of life
;
{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
(
 (X1::Y2)
 (X2::Y1)
; ((x2 lexwx) = 'kA')
)
{NP,13}
NP::NP : [NP1] -> [NP1]
(
 (X1::Y1)
)
{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
(
 (X1::Y2)
 (X2::Y1)
)
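Figure (parse trees): Hindi NP → PP NP1, with PP → NP P (NP → N jIvana, P ke) and NP1 → Adj N (eka aXyAya); English NP → NP1 PP, with NP1 → Adj N (one chapter) and PP → P NP (of, NP → N life).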
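The reordering performed by rules {NP,12} and {PP,12} can be sketched as a recursive walk over the Hindi tree that permutes children wherever a rule matches. The Python below is a toy illustration, not the run-time transfer system: the tuple-based tree encoding, the `RULES` table, and the four-word lexicon (taken from the glosses on the slide) are assumptions.

```python
# (parent label, child labels) -> order in which children appear on the target side
RULES = {
    ("NP", ("PP", "NP1")): [1, 0],    # {NP,12}: [PP NP1] -> [NP1 PP]
    ("PP", ("NP", "Postp")): [1, 0],  # {PP,12}: [NP Postp] -> [Prep NP]
}
LEXICON = {"jIvana": "life", "ke": "of", "eka": "one", "aXyAya": "chapter"}


def transfer(node):
    """Return the English word list for a Hindi tree node.

    A leaf is (POS, word); an internal node is (label, [children]).
    """
    label, children = node
    if isinstance(children, str):
        # Leaf: translate the word, falling back to the source form.
        return [LEXICON.get(children, children)]
    child_labels = tuple(child[0] for child in children)
    # Use the rule's target order if one matches, else keep source order.
    order = RULES.get((label, child_labels), range(len(children)))
    words = []
    for i in order:
        words.extend(transfer(children[i]))
    return words


tree = ("NP", [
    ("PP", [("NP", [("N", "jIvana")]), ("Postp", "ke")]),
    ("NP1", [("Adj", "eka"), ("N", "aXyAya")]),
])
print(" ".join(transfer(tree)))
```

The sketch produces the literal “one chapter of life”; turning “one” into the article “a”, as in the comment at the top of the rule, would be left to lexical choice in the decoder.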
![Page 82: An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bfa81a28abf838c998f1/html5/thumbnails/82.jpg)
| System | BLEU | M-BLEU | NIST |
|---|---|---|---|
| EBMT | 0.058 | 0.165 | 4.22 |
| SMT | 0.093 | 0.191 | 4.64 |
| XFER (naïve) man grammar | 0.055 | 0.177 | 4.46 |
| XFER (strong) no grammar | 0.109 | 0.224 | 5.29 |
| XFER (strong) learned grammar | 0.116 | 0.231 | 5.37 |
| XFER (strong) man grammar | 0.135 | 0.243 | 5.59 |
| XFER+SMT | 0.136 | 0.243 | 5.65 |
Very miserly training data.
Seven combinations of components
Strong decoder allows re-ordering
Three automatic scoring metrics
Hindi-English