Transcript of Morphological Processing for Statistical Machine Translation
Morphological Processing for Statistical Machine Translation
Presenter: Nizar Habash
COMS E6998: Topics in Computer Science: Machine Translation
February 7, 2013
Reading Set #1
Papers Discussed
• Nizar Habash and Fatiha Sadat. 2006. Arabic Preprocessing Schemes for Statistical Machine Translation.
• Nimesh Singh and Nizar Habash. 2012. Hebrew Morphological Preprocessing for Statistical Machine Translation.
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
The Basic Idea
• Reduction of word sparsity improves translation quality
• This reduction can be achieved by
– increasing training data, or by
– morphologically driven preprocessing
Introduction
• Morphologically rich languages are especially challenging for SMT
• Model sparsity, high OOV rate especially under low-resource conditions
• A common solution is to tokenize the source words in a preprocessing step
• Lower OOV rate → better SMT (in terms of BLEU)
• Increased token symmetry → better SMT models
• conj+article+noun :: conj article noun
• wa+Al+kitAb :: and the book
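The clitic-splitting idea above can be sketched in a few lines of Python. This is a toy of my own (the regular expression and the decision to always split are a simplification, not the papers' systems), using Buckwalter transliteration:

```python
import re

# Toy decliticizer: split a conjunction (w/f) and the article (Al) off the
# front of an Arabic word (Buckwalter transliteration), marking clitics
# with '+'. Illustrative only -- real systems must disambiguate first.
PROCLITIC_RE = re.compile(r"^(?P<conj>[wf])?(?P<det>Al)?(?P<base>.+)$")

def decliticize(word):
    m = PROCLITIC_RE.match(word)
    tokens = []
    if m.group("conj"):
        tokens.append(m.group("conj") + "+")
    if m.group("det"):
        tokens.append(m.group("det") + "+")
    tokens.append(m.group("base"))
    return tokens

print(decliticize("wAlktAb"))  # ['w+', 'Al+', 'ktAb'] :: 'and the book'
```

Note that blindly matching letters over-splits any word that merely begins with w, f, or Al, which is exactly why the papers compare regex splitting against full morphological disambiguation.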
Introduction
• Different tokenizations can be used
• No one “correct” tokenization. Tokenizations vary in terms of
• Scheme (what) and Technique (how)
• Accuracy
• Consistency
• Sparsity reduction
• The two papers consider different preprocessing options and other settings to study SMT from Arabic/Hebrew to English
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Linguistic Issues
• Arabic & Hebrew are Semitic languages
– Root-and-pattern morphology
– Extensive use of affixes and clitics
• Rich morphology
– Clitics: [CONJ+ [PART+ [DET+ BASE +PRON]]]
w+ l+ Al+ mktb
and+ for+ the+ office
– Morphotactics: w+l+Al+mktb → wllmktb
Linguistic Issues
• Orthographic & Morphological Ambiguity
– Arabic wjdnA وجدنا
• wjd+nA wajad+nA (we found)
• w+jd+nA wa+jad~u+nA (and our grandfather)
– Hebrew bšwrh בשורה
• bšwrh ‘gospel’
• b+šwrh ‘in+(a/the) line’
• b+šwr+h ‘in her bull’ [lit. in+bull+her]
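One way to see this ambiguity concretely is to enumerate all segmentations of a surface form against a lexicon. The sketch below is a toy of my own construction (the mini-lexicon mixes the Arabic and Hebrew examples from this slide; real analyzers like BAMA are far richer):

```python
# Enumerate proclitic+base+enclitic segmentations licensed by a toy lexicon.
CLITICS = {"w": "and", "b": "in"}           # proclitics
ENCLITICS = {"nA": "our/us", "h": "her"}    # enclitic pronouns
LEXICON = {"wjd": "found", "jd": "grandfather",
           "bšwrh": "gospel", "šwrh": "line", "šwr": "bull"}

def analyses(word):
    out = []
    for i in (0, 1):                        # optional one-letter proclitic
        pro, rest = word[:i], word[i:]
        if pro and pro not in CLITICS:
            continue
        for j in range(len(rest), 0, -1):   # split off an optional enclitic
            base, enc = rest[:j], rest[j:]
            if base in LEXICON and (not enc or enc in ENCLITICS):
                out.append("+".join(t for t in (pro, base, enc) if t))
    return out

print(analyses("wjdnA"))   # ['wjd+nA', 'w+jd+nA']
print(analyses("bšwrh"))   # ['bšwrh', 'b+šwrh', 'b+šwr+h']
```

Even with this tiny lexicon, each surface form already has multiple legitimate readings; choosing among them is the disambiguation problem.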
Arabic Orthographic Ambiguity
wdrst AltAlbAt AlErbyAt ktAbA bAlSynyp
w+drs+t Al+Talb+At Al+Erb+y+At ktAb+A b+Al+Syn+y+p
And+study+they the+student+f.pl. the+Arab+f.pl. book+a in+the+Chinese
The Arab students studied a book in Chinese
the+arab students studied a+book in+chinese
th+rb stdnts stdd +bk n+chns → thrb stdnts stdd bk nchns
to+herb so+too+dents studded bake in chains?
(Annotations in the original figure: extra w+, repeated Al+)
Arabic Morphemes
[Chart: proclitics (CONJ w+/f+, PART b+/k+/l+, DET/FUT Al+/s+), the word base (prefix + STEM = ROOT+PATTERN + suffix, with verbal circumfixes such as t+, n+, y+ … +wA, +n, +yn and nominal endings such as +p, +At, +An, +wn, +yn), and enclitic pronouns (+y, +nA, +k, +km, +kn, +h, +hA, +hm, +hn)]
• Clitics are optional, affixes are obligatory!
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Approach (Habash & Sadat 2006 / Singh & Habash 2012)
• Preprocessing scheme: what to tokenize
• Preprocessing technique: how to tokenize
– Regular expressions
– Morphological analysis
– Morphological tagging / disambiguation
– Unsupervised morphological segmentation
• Scheme and technique are not always independent
Arabic Preprocessing Schemes
• ST: simple tokenization
• D1: decliticize conjunctions w+/f+
• D2: D1 + decliticize particles b+/l+/k+/s+
• D3: D2 + decliticize the article Al+ and pronominal clitics
• BW: morphological stem and affixes
• EN: D3 + lemmatize, English-like POS tags, subject marker
• ON: orthographic normalization
• WA: wa+ decliticization
• TB: Arabic Treebank tokenization
• L1: lemmatize, Arabic POS tags
• L2: lemmatize, English-like POS tags

Input: wsyktbhA? ‘and he will write it?’
ST: wsyktbhA ?
D1: w+ syktbhA ?
D2: w+ s+ yktbhA ?
D3: w+ s+ yktb +hA ?
BW: w+ s+ y+ ktb +hA ?
EN: w+ s+ ktb/VBZ S:3MS +hA ?
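Given a fully disambiguated analysis, applying a scheme is mechanical: a scheme is essentially the set of clitic classes to split off. The sketch below captures that idea in miniature (the class labels and the hand-written analysis are mine; TOKAN's actual declarative language is much richer):

```python
# Hand-segmented analysis of wsyktbhA 'and he will write it'
ANALYSIS = [("w", "CONJ"), ("s", "FUT"), ("yktb", "BASE"), ("hA", "PRON")]

# A scheme = which clitic classes become separate tokens.
SCHEMES = {
    "ST": set(),
    "D1": {"CONJ"},
    "D2": {"CONJ", "FUT", "PART"},
    "D3": {"CONJ", "FUT", "PART", "PRON"},
}

def tokenize(analysis, scheme):
    pro, base, enc = [], "", []
    for form, cls in analysis:
        if cls in SCHEMES[scheme]:
            if base:                 # clitic after the base: enclitic
                enc.append("+" + form)
            else:                    # clitic before the base: proclitic
                pro.append(form + "+")
        else:                        # unsplit material stays on the base
            base += form
    return pro + [base] + enc

for s in ("ST", "D1", "D2", "D3"):
    print(s, " ".join(tokenize(ANALYSIS, s)))
```

Running this reproduces the ST/D1/D2/D3 rows of the example above; BW and EN would additionally need the stem's own affixes and POS information.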
Arabic Preprocessing Techniques
• REGEX: regular expressions
• BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter 2002; 2004)
– Pick the first analysis
– Use TOKAN (Habash 2006)
• a generalized tokenizer
• assumes a disambiguated morphological analysis
• declarative specification of any preprocessing scheme
• MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow 2005)
– Multiple SVM classifiers + a combiner
– Selects one BAMA analysis
– Use TOKAN
Hebrew Preprocessing Techniques/Schemes
• Regular expressions
o RegEx-S1: conjunctions ו ‘and’ and ש ‘that/who’
o RegEx-S2: RegEx-S1 + prepositions ב ‘in’, כ ‘like/as’, ל ‘to/for’, and מ ‘from’
o RegEx-S3: RegEx-S2 + the article ה ‘the’
o RegEx-S4: RegEx-S3 + pronominal enclitics
• Morfessor (Creutz and Lagus, 2007)
o Morf: unsupervised splitting into morphemes
• Hebrew Morphological Tagger (Adler, 2009)
o Htag: Hebrew morphological analysis and disambiguation
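The cumulative RegEx schemes can be sketched as greedy prefix peeling. The letter sets below follow the slide, but the peeling loop (and its behavior on words that merely begin with a prefix letter) is my own simplification:

```python
# Cumulative prefix sets per scheme (S4's pronominal enclitics omitted).
SCHEME_PREFIXES = {
    "RegEx-S1": "וש",        # conjunctions: ו 'and', ש 'that/who'
    "RegEx-S2": "ושבכלמ",    # + prepositions ב, כ, ל, מ
    "RegEx-S3": "ושבכלמה",   # + article ה 'the'
}

def split_prefixes(word, scheme):
    """Greedily peel known one-letter prefixes off the front of a word."""
    letters = SCHEME_PREFIXES[scheme]
    tokens = []
    while len(word) > 1 and word[0] in letters:
        tokens.append(word[0] + "+")
        word = word[1:]
    return tokens + [word]

print(split_prefixes("ובחדר", "RegEx-S2"))  # ['ו+', 'ב+', 'חדר'] 'and in the room'
```

Greedy peeling misfires on words whose stem happens to start with a prefix letter, which is one reason the accuracy column in the next table degrades for the more aggressive schemes and why Htag's full disambiguation helps.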
Tokenization System Statistics

Method | Token Increase | Similarity to Baseline | OOV Reduction (DEV) | Accuracy (Gold-S4) | Accuracy (Gold Scheme)
RegEx-S1 | 113% | 87.4% | 26% | 70.1% | 99.7% (S1)
RegEx-S2 | 141% | 62.2% | 50% | 65.3% | 79.1% (S2)
RegEx-S3 | 163% | 46.3% | 60% | 68.2% | 70.6% (S3)
RegEx-S4 | 190% | 33.8% | 66% | 54.5% |
• Aggressive tokenization schemes have:
– More tokens
– More change from the baseline (untokenized)
– Fewer OOVs (baseline OOV rate is 7%)
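As a concrete (made-up) illustration of how two of the table's columns are computed: token increase is the token-count ratio after vs. before tokenization, and OOV reduction compares test-set type OOVs under each tokenization.

```python
def stats(base_train, base_test, tok_train, tok_test):
    """Token increase and OOV reduction of a tokenization vs. the baseline."""
    token_increase = len(tok_train) / len(base_train)
    base_oov = set(base_test) - set(base_train)   # OOV types, untokenized
    tok_oov = set(tok_test) - set(tok_train)      # OOV types, tokenized
    oov_reduction = 1 - len(tok_oov) / len(base_oov)  # assumes base_oov nonempty
    return token_increase, oov_reduction

# Tiny made-up corpora: tokenizing reveals that the unseen word wdrs
# is just w+ plus the already-seen word drs.
base_train = "wAlktAb ktAb drs".split()
base_test  = "wAlktAb wdrs ktAb".split()
tok_train  = "w+ Al+ ktAb ktAb drs".split()
tok_test   = "w+ Al+ ktAb w+ drs ktAb".split()

ti, oov = stats(base_train, base_test, tok_train, tok_test)
print(f"token increase {ti:.0%}, OOV reduction {oov:.0%}")  # 167%, 100%
```

On real corpora the trade-off emerges: more aggressive splitting keeps shrinking OOV but inflates token counts and can hurt accuracy.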
Tokenization System Statistics

Method | Token Increase | Similarity to Baseline | OOV Reduction (DEV) | Accuracy (Gold-S4) | Accuracy (Gold Scheme)
RegEx-S1 | 113% | 87.4% | 26% | 70.1% | 99.7% (S1)
RegEx-S2 | 141% | 62.2% | 50% | 65.3% | 79.1% (S2)
RegEx-S3 | 163% | 46.3% | 60% | 68.2% | 70.6% (S3)
RegEx-S4 | 190% | 33.8% | 66% | 54.5% |
Morf | 124% | 81.6% | 96% | 72.9% |
Htag | 130% | 71.8% | 56% | 94.0% |
Gold-S4 | 136% | 68.4% | | |
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Arabic-English Experiments
• Portage phrase-based MT (Sadat et al., 2005)
• Training data: 5 million parallel words only
– All news genre
– Learning curve: 1%, 10%, and 100%
• Language modeling: 250 million words
• Development/tuning data: MT03 eval set
• Test data: MT04
– Mixed genre: news, speeches, editorials
• Metric: BLEU (Papineni et al., 2002)
Arabic-English Experiments
• Each experiment:
– Select a preprocessing scheme
– Select a preprocessing technique
• Some combinations do not exist (e.g., REGEX with EN)
Arabic-English Results
[Chart: BLEU learning curves at 1%, 10%, and 100% of the training data, one curve per technique: MADA > BAMA > REGEX]
Hebrew-English Experiments
• Phrase-based statistical MT
• Moses (Koehn et al., 2007)
• MERT (Och, 2003) tuned for BLEU (Papineni et al., 2002)
• Language models: English Gigaword (5-gram) plus training (3-gram)
• True casing for English output
• Training data: 850,000 words
Hebrew-English Experiments
• Compare seven systems
• Vary only the preprocessing
• Baseline, RegEx-S{1-4}, Morf, and Htag
• Metrics
• BLEU, NIST (Doddington, 2002),
• METEOR (Banerjee & Lavie, 2005)
Results
Blind Test
Method | BLEU | NIST | METEOR | OOV
Baseline | 19.31 | 5.4951 | 44.36 | 1311
RegEx-S1 | 20.39 | 5.6468 | 45.46 | 985
RegEx-S2 | 21.69 | 5.8082 | 46.50 | 671
RegEx-S3 | 21.61 | 5.8761 | 46.60 | 567
RegEx-S4 | 21.07 | 5.8067 | 46.03 | 461
Morf | 22.25 | 5.9751 | 46.53 | 48
Htag | 22.79 | 6.1033 | 48.20 | 556
Combo1 | 22.72 | 6.0381 | 47.20 | 74
Combo2 | 22.69 | 6.0275 | 47.17 | 250
• Htag is consistently best, and Morf is consistently second best, in terms of BLEU and NIST
Results

Blind Test
Method | BLEU | NIST | METEOR | OOV
Baseline | 19.31 | 5.4951 | 44.36 | 1311
RegEx-S1 | 20.39 | 5.6468 | 45.46 | 985
RegEx-S2 | 21.69 | 5.8082 | 46.50 | 671
RegEx-S3 | 21.61 | 5.8761 | 46.60 | 567
RegEx-S4 | 21.07 | 5.8067 | 46.03 | 461
Morf | 22.25 | 5.9751 | 46.53 | 48
Htag | 22.79 | 6.1033 | 48.20 | 556
Combo1 | 22.72 | 6.0381 | 47.20 | 74
Combo2 | 22.69 | 6.0275 | 47.17 | 250
• Morf has very low OOV, but still does worse than Htag (and even more poorly according to METEOR), indicating that it sometimes over-tokenizes.
Results
Blind Test
Method | BLEU | NIST | METEOR | OOV
Baseline | 19.31 | 5.4951 | 44.36 | 1311
RegEx-S1 | 20.39 | 5.6468 | 45.46 | 985
RegEx-S2 | 21.69 | 5.8082 | 46.50 | 671
RegEx-S3 | 21.61 | 5.8761 | 46.60 | 567
RegEx-S4 | 21.07 | 5.8067 | 46.03 | 461
Morf | 22.25 | 5.9751 | 46.53 | 48
Htag | 22.79 | 6.1033 | 48.20 | 556
Combo1 | 22.72 | 6.0381 | 47.20 | 74
Combo2 | 22.69 | 6.0275 | 47.17 | 250
• Within RegEx, BLEU peaks at S2/S3, similar to Arabic D2 (Habash & Sadat, 2006)
Results
Translation Example
Hebrew יש לנו קומקום ופלאטה בחדר.
Reference We have an electric kettle and a hotplate in our room.
Baseline We have brought ופלאטה in the room.
RegEx-S1 We have קומקום and פלאטה in the room.
RegEx-S2 We have קומקום and פלאטה in the room.
RegEx-S3 We've got קומקום and פלאטה in the room.
RegEx-S4 We have kettle and ופלאט room.
Morf We've got a complete wonder anywhere.
Htag We've got kettle and פלאטה in the room.
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Conclusions
• Preprocessing is useful for improving Arabic-English & Hebrew-English SMT
– But as more training data is added, the value diminishes
• Tokenization with a morphological tagger does best, but requires a lot of linguistic knowledge
• Morfessor does quite well with no linguistic information, and significantly reduces OOV (though perhaps erroneously)
• The optimal scheme/technique choice varies with training data size
– In Arabic, with large amounts of training data, splitting off conjunctions and particles performs best
– With small amounts of training data, following an English-like tokenization performs best