Transcript of Morphological Processing for Statistical Machine Translation
Morphological Processing for Statistical Machine Translation
Presenter: Nizar Habash
COMS E6998: Topics in Computer Science: Machine Translation
February 7, 2013
Reading Set #1
Papers Discussed
• Nizar Habash and Fatiha Sadat. 2006. Arabic Preprocessing Schemes for Statistical Machine Translation.
• Nimesh Singh and Nizar Habash. 2012. Hebrew Morphological Preprocessing for Statistical Machine Translation.
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
The Basic Idea
• Reduction of word sparsity improves translation quality
• This reduction can be achieved by
– increasing training data, or by
– morphologically driven preprocessing
Introduction
• Morphologically rich languages are especially challenging for SMT
• Model sparsity, high OOV rate especially under low-resource conditions
• A common solution is to tokenize the source words in a preprocessing step
• Lower OOV rate → better SMT (in terms of BLEU)
• Increased token symmetry → better SMT models
• conj+article+noun :: conj article noun
• wa+Al+kitAb :: and the book
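The clitic-splitting idea above can be sketched in a few lines of Python. This is a toy of my own (the regular expression and the decision to always split are a simplification, not the papers' systems), using Buckwalter transliteration:

```python
import re

# Toy decliticizer: split a conjunction (w/f) and the article (Al) off the
# front of an Arabic word (Buckwalter transliteration), marking clitics
# with '+'. Illustrative only -- real systems must disambiguate first.
PROCLITIC_RE = re.compile(r"^(?P<conj>[wf])?(?P<det>Al)?(?P<base>.+)$")

def decliticize(word):
    m = PROCLITIC_RE.match(word)
    tokens = []
    if m.group("conj"):
        tokens.append(m.group("conj") + "+")
    if m.group("det"):
        tokens.append(m.group("det") + "+")
    tokens.append(m.group("base"))
    return tokens

print(decliticize("wAlktAb"))  # ['w+', 'Al+', 'ktAb'] :: 'and the book'
```

Note that blindly matching letters over-splits any word that merely begins with w, f, or Al, which is exactly why the papers compare regex splitting against full morphological disambiguation.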
Introduction
• Different tokenizations can be used
• No one “correct” tokenization. Tokenizations vary in terms of
• Scheme (what) and Technique (how)
• Accuracy
• Consistency
• Sparsity reduction
• The two papers consider different preprocessing options and other settings to study SMT from Arabic/Hebrew to English
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Linguistic Issues
• Arabic & Hebrew are Semitic languages
– Root-and-pattern morphology
– Extensive use of affixes and clitics
• Rich morphology
– Clitics: [CONJ+ [PART+ [DET+ BASE +PRON]]]
w+ l+ Al+ mktb
and+ for+ the+ office
– Morphotactics: w+l+Al+mktb → wllmktb
Linguistic Issues
• Orthographic & Morphological Ambiguity
– Arabic wjdnA وجدنا
• wjd+nA wajad+nA (we found)
• w+jd+nA wa+jad~u+nA (and our grandfather)
– Hebrew bšwrh בשורה
• bšwrh ‘gospel’
• b+šwrh ‘in+(a/the) line’
• b+šwr+h ‘in her bull’ [lit. in+bull+her]
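One way to see this ambiguity concretely is to enumerate all segmentations of a surface form against a lexicon. The sketch below is a toy of my own construction (the mini-lexicon mixes the Arabic and Hebrew examples from this slide; real analyzers like BAMA are far richer):

```python
# Enumerate proclitic+base+enclitic segmentations licensed by a toy lexicon.
CLITICS = {"w": "and", "b": "in"}           # proclitics
ENCLITICS = {"nA": "our/us", "h": "her"}    # enclitic pronouns
LEXICON = {"wjd": "found", "jd": "grandfather",
           "bšwrh": "gospel", "šwrh": "line", "šwr": "bull"}

def analyses(word):
    out = []
    for i in (0, 1):                        # optional one-letter proclitic
        pro, rest = word[:i], word[i:]
        if pro and pro not in CLITICS:
            continue
        for j in range(len(rest), 0, -1):   # split off an optional enclitic
            base, enc = rest[:j], rest[j:]
            if base in LEXICON and (not enc or enc in ENCLITICS):
                out.append("+".join(t for t in (pro, base, enc) if t))
    return out

print(analyses("wjdnA"))   # ['wjd+nA', 'w+jd+nA']
print(analyses("bšwrh"))   # ['bšwrh', 'b+šwrh', 'b+šwr+h']
```

Even with this tiny lexicon, each surface form already has multiple legitimate readings; choosing among them is the disambiguation problem.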
Arabic Orthographic Ambiguity
wdrst AltAlbAt AlErbyAt ktAbA bAlSynyp
w+drs+t Al+Talb+At Al+Erb+y+At ktAb+A b+Al+Syn+y+p
And+study+they the+student+f.pl. the+Arab+f.pl. book+a in+the+Chinese
The Arab students studied a book in Chinese
the+arab students studied a+book in+chinese
th+rb stdnts stdd +bk n+chns → thrb stdnts stdd bk nchns
to+herb so+too+dents studded bake in chains?
(Annotations in the original figure: extra w+, repeated Al+)
Arabic Morphemes
[Chart: proclitics (CONJ w+/f+, PART b+/k+/l+, DET/FUT Al+/s+), the word base (prefix + STEM = ROOT+PATTERN + suffix, with verbal circumfixes such as t+, n+, y+ … +wA, +n, +yn and nominal endings such as +p, +At, +An, +wn, +yn), and enclitic pronouns (+y, +nA, +k, +km, +kn, +h, +hA, +hm, +hn)]
• Clitics are optional, affixes are obligatory!
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Approach (Habash & Sadat 2006 / Singh & Habash 2012)
• Preprocessing scheme: what to tokenize
• Preprocessing technique: how to tokenize
– Regular expressions
– Morphological analysis
– Morphological tagging / disambiguation
– Unsupervised morphological segmentation
• Scheme and technique are not always independent
Arabic Preprocessing Schemes
• ST: simple tokenization
• D1: decliticize conjunctions w+/f+
• D2: D1 + decliticize particles b+/l+/k+/s+
• D3: D2 + decliticize the article Al+ and pronominal clitics
• BW: morphological stem and affixes
• EN: D3 + lemmatize, English-like POS tags, subject marker
• ON: orthographic normalization
• WA: wa+ decliticization
• TB: Arabic Treebank tokenization
• L1: lemmatize, Arabic POS tags
• L2: lemmatize, English-like POS tags

Input: wsyktbhA? ‘and he will write it?’
ST: wsyktbhA ?
D1: w+ syktbhA ?
D2: w+ s+ yktbhA ?
D3: w+ s+ yktb +hA ?
BW: w+ s+ y+ ktb +hA ?
EN: w+ s+ ktb/VBZ S:3MS +hA ?
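Given a fully disambiguated analysis, applying a scheme is mechanical: a scheme is essentially the set of clitic classes to split off. The sketch below captures that idea in miniature (the class labels and the hand-written analysis are mine; TOKAN's actual declarative language is much richer):

```python
# Hand-segmented analysis of wsyktbhA 'and he will write it'
ANALYSIS = [("w", "CONJ"), ("s", "FUT"), ("yktb", "BASE"), ("hA", "PRON")]

# A scheme = which clitic classes become separate tokens.
SCHEMES = {
    "ST": set(),
    "D1": {"CONJ"},
    "D2": {"CONJ", "FUT", "PART"},
    "D3": {"CONJ", "FUT", "PART", "PRON"},
}

def tokenize(analysis, scheme):
    pro, base, enc = [], "", []
    for form, cls in analysis:
        if cls in SCHEMES[scheme]:
            if base:                 # clitic after the base: enclitic
                enc.append("+" + form)
            else:                    # clitic before the base: proclitic
                pro.append(form + "+")
        else:                        # unsplit material stays on the base
            base += form
    return pro + [base] + enc

for s in ("ST", "D1", "D2", "D3"):
    print(s, " ".join(tokenize(ANALYSIS, s)))
```

Running this reproduces the ST/D1/D2/D3 rows of the example above; BW and EN would additionally need the stem's own affixes and POS information.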
Arabic Preprocessing Techniques
• REGEX: regular expressions
• BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter 2002; 2004)
– Pick the first analysis
– Use TOKAN (Habash 2006)
• a generalized tokenizer
• assumes a disambiguated morphological analysis
• declarative specification of any preprocessing scheme
• MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow 2005)
– Multiple SVM classifiers + a combiner
– Selects one BAMA analysis
– Use TOKAN
Hebrew Preprocessing Techniques/Schemes
• Regular expressions
o RegEx-S1: conjunctions ו ‘and’ and ש ‘that/who’
o RegEx-S2: RegEx-S1 + prepositions ב ‘in’, כ ‘like/as’, ל ‘to/for’, and מ ‘from’
o RegEx-S3: RegEx-S2 + the article ה ‘the’
o RegEx-S4: RegEx-S3 + pronominal enclitics
• Morfessor (Creutz and Lagus, 2007)
o Morf: unsupervised splitting into morphemes
• Hebrew Morphological Tagger (Adler, 2009)
o Htag: Hebrew morphological analysis and disambiguation
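The cumulative RegEx schemes can be sketched as greedy prefix peeling. The letter sets below follow the slide, but the peeling loop (and its behavior on words that merely begin with a prefix letter) is my own simplification:

```python
# Cumulative prefix sets per scheme (S4's pronominal enclitics omitted).
SCHEME_PREFIXES = {
    "RegEx-S1": "וש",        # conjunctions: ו 'and', ש 'that/who'
    "RegEx-S2": "ושבכלמ",    # + prepositions ב, כ, ל, מ
    "RegEx-S3": "ושבכלמה",   # + article ה 'the'
}

def split_prefixes(word, scheme):
    """Greedily peel known one-letter prefixes off the front of a word."""
    letters = SCHEME_PREFIXES[scheme]
    tokens = []
    while len(word) > 1 and word[0] in letters:
        tokens.append(word[0] + "+")
        word = word[1:]
    return tokens + [word]

print(split_prefixes("ובחדר", "RegEx-S2"))  # ['ו+', 'ב+', 'חדר'] 'and in the room'
```

Greedy peeling misfires on words whose stem happens to start with a prefix letter, which is one reason the accuracy column in the next table degrades for the more aggressive schemes and why Htag's full disambiguation helps.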
Tokenization System Statistics

Method | Token Increase | Similarity to Baseline | OOV Reduction (DEV) | Accuracy (Gold-S4) | Accuracy (Gold Scheme)
RegEx-S1 | 113% | 87.4% | 26% | 70.1% | 99.7% (S1)
RegEx-S2 | 141% | 62.2% | 50% | 65.3% | 79.1% (S2)
RegEx-S3 | 163% | 46.3% | 60% | 68.2% | 70.6% (S3)
RegEx-S4 | 190% | 33.8% | 66% | 54.5% |
• Aggressive tokenization schemes have:
– More tokens
– More change from the baseline (untokenized)
– Fewer OOVs (baseline OOV rate is 7%)
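As a concrete (made-up) illustration of how two of the table's columns are computed: token increase is the token-count ratio after vs. before tokenization, and OOV reduction compares test-set type OOVs under each tokenization.

```python
def stats(base_train, base_test, tok_train, tok_test):
    """Token increase and OOV reduction of a tokenization vs. the baseline."""
    token_increase = len(tok_train) / len(base_train)
    base_oov = set(base_test) - set(base_train)   # OOV types, untokenized
    tok_oov = set(tok_test) - set(tok_train)      # OOV types, tokenized
    oov_reduction = 1 - len(tok_oov) / len(base_oov)  # assumes base_oov nonempty
    return token_increase, oov_reduction

# Tiny made-up corpora: tokenizing reveals that the unseen word wdrs
# is just w+ plus the already-seen word drs.
base_train = "wAlktAb ktAb drs".split()
base_test  = "wAlktAb wdrs ktAb".split()
tok_train  = "w+ Al+ ktAb ktAb drs".split()
tok_test   = "w+ Al+ ktAb w+ drs ktAb".split()

ti, oov = stats(base_train, base_test, tok_train, tok_test)
print(f"token increase {ti:.0%}, OOV reduction {oov:.0%}")  # 167%, 100%
```

On real corpora the trade-off emerges: more aggressive splitting keeps shrinking OOV but inflates token counts and can hurt accuracy.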
Tokenization System Statistics

Method | Token Increase | Similarity to Baseline | OOV Reduction (DEV) | Accuracy (Gold-S4) | Accuracy (Gold Scheme)
RegEx-S1 | 113% | 87.4% | 26% | 70.1% | 99.7% (S1)
RegEx-S2 | 141% | 62.2% | 50% | 65.3% | 79.1% (S2)
RegEx-S3 | 163% | 46.3% | 60% | 68.2% | 70.6% (S3)
RegEx-S4 | 190% | 33.8% | 66% | 54.5% |
Morf | 124% | 81.6% | 96% | 72.9% |
Htag | 130% | 71.8% | 56% | 94.0% |
Gold-S4 | 136% | 68.4% | | |
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Arabic-English Experiments
• Portage phrase-based MT (Sadat et al., 2005)
• Training data: 5 million parallel words only
– All news genre
– Learning curve: 1%, 10%, and 100%
• Language modeling: 250 million words
• Development/tuning data: MT03 eval set
• Test data: MT04
– Mixed genre: news, speeches, editorials
• Metric: BLEU (Papineni et al., 2002)
Arabic-English Experiments
• Each experiment:
– Select a preprocessing scheme
– Select a preprocessing technique
• Some combinations do not exist (e.g., REGEX with EN)
Arabic-English Results
[Chart: BLEU learning curves at 1%, 10%, and 100% of the training data, one curve per technique: MADA > BAMA > REGEX]
Hebrew-English Experiments
• Phrase-based statistical MT
• Moses (Koehn et al., 2007)
• MERT (Och, 2003) tuned for BLEU (Papineni et al., 2002)
• Language models: English Gigaword (5-gram) plus training (3-gram)
• True casing for English output
• Training data: 850,000 words
Hebrew-English Experiments
• Compare seven systems
• Vary only the preprocessing
• Baseline, RegEx-S{1-4}, Morf, and Htag
• Metrics
• BLEU, NIST (Doddington, 2002),
• METEOR (Banerjee & Lavie, 2005)
Results
Blind Test
Method | BLEU | NIST | METEOR | OOV
Baseline | 19.31 | 5.4951 | 44.36 | 1311
RegEx-S1 | 20.39 | 5.6468 | 45.46 | 985
RegEx-S2 | 21.69 | 5.8082 | 46.50 | 671
RegEx-S3 | 21.61 | 5.8761 | 46.60 | 567
RegEx-S4 | 21.07 | 5.8067 | 46.03 | 461
Morf | 22.25 | 5.9751 | 46.53 | 48
Htag | 22.79 | 6.1033 | 48.20 | 556
Combo1 | 22.72 | 6.0381 | 47.20 | 74
Combo2 | 22.69 | 6.0275 | 47.17 | 250
• Htag is consistently best, and Morf is consistently second best, in terms of BLEU and NIST
Results

Blind Test
Method | BLEU | NIST | METEOR | OOV
Baseline | 19.31 | 5.4951 | 44.36 | 1311
RegEx-S1 | 20.39 | 5.6468 | 45.46 | 985
RegEx-S2 | 21.69 | 5.8082 | 46.50 | 671
RegEx-S3 | 21.61 | 5.8761 | 46.60 | 567
RegEx-S4 | 21.07 | 5.8067 | 46.03 | 461
Morf | 22.25 | 5.9751 | 46.53 | 48
Htag | 22.79 | 6.1033 | 48.20 | 556
Combo1 | 22.72 | 6.0381 | 47.20 | 74
Combo2 | 22.69 | 6.0275 | 47.17 | 250
• Morf has very low OOV, but still does worse than Htag (and even more poorly according to METEOR), indicating that it sometimes over-tokenizes.
Results
Blind Test
Method | BLEU | NIST | METEOR | OOV
Baseline | 19.31 | 5.4951 | 44.36 | 1311
RegEx-S1 | 20.39 | 5.6468 | 45.46 | 985
RegEx-S2 | 21.69 | 5.8082 | 46.50 | 671
RegEx-S3 | 21.61 | 5.8761 | 46.60 | 567
RegEx-S4 | 21.07 | 5.8067 | 46.03 | 461
Morf | 22.25 | 5.9751 | 46.53 | 48
Htag | 22.79 | 6.1033 | 48.20 | 556
Combo1 | 22.72 | 6.0381 | 47.20 | 74
Combo2 | 22.69 | 6.0275 | 47.17 | 250
• Within RegEx, BLEU peaks at S2/S3, similar to Arabic D2 (Habash & Sadat, 2006)
Results
Translation Example
Hebrew יש לנו קומקום ופלאטה בחדר.
Reference We have an electric kettle and a hotplate in our room.
Baseline We have brought ופלאטה in the room.
RegEx-S1 We have קומקום and פלאטה in the room.
RegEx-S2 We have קומקום and פלאטה in the room.
RegEx-S3 We've got קומקום and פלאטה in the room.
RegEx-S4 We have kettle and ופלאט room.
Morf We've got a complete wonder anywhere.
Htag We've got kettle and פלאטה in the room.
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Conclusions
• Preprocessing is useful for improving Arabic-English & Hebrew-English SMT
– But as more training data is added, the value diminishes
• Tokenization with a morphological tagger does best, but requires a lot of linguistic knowledge
• Morfessor does quite well with no linguistic information, and significantly reduces OOV (though perhaps erroneously)
• The optimal scheme/technique choice varies with training data size
– In Arabic, with large amounts of training data, splitting off conjunctions and particles performs best
– With small amounts of training data, following an English-like tokenization performs best