20 Dec 2008 IIT Kharagpur Slide 1 Statistical Machine Translation English to Hindi Sumit Goswami Nirav Shah Devshri Roy Sudeshna Sarkar

20 Dec 2008 IIT Kharagpur Slide 1

Statistical Machine Translation English to Hindi

Sumit Goswami

Nirav Shah

Devshri Roy

Sudeshna Sarkar

20 Dec 2008 IIT Kharagpur Slide 2

Introduction

• Machine translation (MT) is the automatic translation from one natural language into another language using computers.

• Statistical machine translation (SMT) is an approach to MT that is characterized by the use of machine learning methods.

20 Dec 2008 IIT Kharagpur Slide 3

Objective

• The objective of our work is to explore different ways of combining statistical techniques with linguistic inputs to improve a baseline statistical machine translation system from English to Hindi.

20 Dec 2008 IIT Kharagpur Slide 4

Problem with the limited training data

• In general, in statistical machine translation, the more data is provided for training, the higher the quality of translation.

• The size of the training data is limited to 50k sentence pairs.

• As a result, the bilingual data available in the resulting phrase translation table is limited.

20 Dec 2008 IIT Kharagpur Slide 5

Proposed Methodology

• We propose increasing the bilingual data in the phrase translation table by appending entries from a freely available dictionary.

20 Dec 2008 IIT Kharagpur Slide 6

Problem with the word translation

• The main problem with word-by-word translation is that words with equivalent meanings may not appear in the same order in both sentences.

• For example:

In English: “I read a book.”

In Hindi: “Maine Pustak Padhi.”

• In English sentences, the typical word order is subject-verb-object (SVO). In Hindi, it is subject-object-verb (SOV).
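The reordering in this example can be sketched in a few lines. The word alignment below is written by hand for this one sentence pair (the article "a" is left unaligned), and `is_monotone` is a hypothetical helper rather than part of any toolkit.

```python
# Hand-written alignment for the example sentence pair (an assumption,
# not the output of an alignment tool).
english = ["I", "read", "a", "book"]
hindi = ["Maine", "Pustak", "Padhi"]

# (english_index, hindi_index) pairs: I-Maine, read-Padhi, book-Pustak.
alignment = [(0, 0), (1, 2), (3, 1)]


def is_monotone(pairs):
    """Return True if target positions never decrease as the source advances."""
    targets = [t for _, t in sorted(pairs)]
    return all(a <= b for a, b in zip(targets, targets[1:]))


# The verb "read" maps to the last Hindi word, so the links cross.
print(is_monotone(alignment))
```

A word-by-word system that keeps the source order would place the verb before the object, which is the SVO order rather than the SOV order Hindi expects.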

20 Dec 2008 IIT Kharagpur Slide 7

Methodologies

• We used phrase-based modeling instead of word-based modeling.

• However, phrase-based models are limited to mapping small text chunks (phrases) without any explicit use of linguistic information, be it morphological, syntactic, or semantic. Such additional information is valuable when integrated in pre-processing or post-processing.

• We extended the phrase-based statistical machine translation models using a factored representation.

20 Dec 2008 IIT Kharagpur Slide 8

Resources

• Statistical machine translation – Moses

• Hindi Dictionary – Sabdakosh

• PoS Tagger

• Hindi Morphological Analyser

• Training and Testing

– TIDES IIIT Dataset

– EILMT Tourism Dataset

20 Dec 2008 IIT Kharagpur Slide 9

Addition of new source of knowledge

• Downloaded the Hindi dictionary Sabdakosh

• Preprocessed the parallel data

• Obtained a maximum likelihood lexical translation table.

• Generated a phrase translation table.

20 Dec 2008 IIT Kharagpur Slide 10

Aligned word table

• Get the English-Hindi equivalents of almost all major and frequently used root words

• We ensure that each Hindi translation is placed next to its English word.

Example:

want ||| AvaSyakawA ||| (0 0)

want ||| cAhanA ||| (0 0)

war ||| yuxXa ||| (0 0)
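A minimal sketch of producing lines in this layout from dictionary entries follows. The function name `to_aligned_lines` and the in-memory dictionary are hypothetical, and the trailing `(0 0)` is copied literally from the example rather than computed.

```python
def to_aligned_lines(dictionary):
    """Turn {english: [hindi, ...]} into 'english ||| hindi ||| (0 0)' lines."""
    lines = []
    for english in sorted(dictionary):
        for hindi in dictionary[english]:
            # Single-word entries: word 0 on one side pairs with word 0 on the other.
            lines.append(f"{english} ||| {hindi} ||| (0 0)")
    return lines


# Toy entries taken from the example above.
toy_dictionary = {"want": ["AvaSyakawA", "cAhanA"], "war": ["yuxXa"]}

for line in to_aligned_lines(toy_dictionary):
    print(line)
```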

20 Dec 2008 IIT Kharagpur Slide 11

Lexical translation table

• We compute the word translation table w(e|h) as well as its inverse w(h|e).

• Example

English   Hindi        w(e|h)
want      AvaSyakawA   0.5
want      cAhanA       0.5
war       yuxXa        0.5
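One common way to obtain such a table is relative-frequency (maximum likelihood) estimation over word pairs. The sketch below assumes that approach; the function name `lexical_table` and the toy pairs are hypothetical, so its numbers are not expected to match the example values, which come from the full dictionary.

```python
from collections import Counter


def lexical_table(pairs):
    """Return {(e, h): (w(e|h), w(h|e))} estimated by relative frequency."""
    pair_counts = Counter(pairs)
    e_counts = Counter(e for e, _ in pairs)
    h_counts = Counter(h for _, h in pairs)
    return {
        # w(e|h) = count(e, h) / count(h); w(h|e) = count(e, h) / count(e)
        (e, h): (n / h_counts[h], n / e_counts[e])
        for (e, h), n in pair_counts.items()
    }


# Hypothetical (English, Hindi) word pairs, for illustration only.
pairs = [("want", "cAhanA"), ("want", "AvaSyakawA"), ("need", "AvaSyakawA")]
table = lexical_table(pairs)

for (e, h), (w_e_h, w_h_e) in sorted(table.items()):
    print(e, h, w_e_h, w_h_e)
```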

20 Dec 2008 IIT Kharagpur Slide 12

Scoring of words in the translation table

Five different phrase translation scores are assigned to each entry, and the phrase translation table is generated.

• phrase translation probability φ(e|h)

• lexical weighting lex(e|h)

• phrase translation probability φ(h|e)

• lexical weighting lex(h|e)

• phrase penalty (always exp(1) = 2.718)

Example:

want ||| AvaSyakawA ||| (0) ||| (0) ||| 0.5 0.5 1 0.5 2.718

want ||| cAhanA ||| (0) ||| (0) ||| 0.5 0.5 1 0.5 2.718

war ||| yuxXa ||| (0) ||| (0) ||| 0.5 0.5 0.5 0.5 2.718
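To make the field order concrete, here is a small sketch that reads one entry into named fields. `PhraseEntry` and `parse_entry` are hypothetical names; the score order follows the list above, and the two alignment fields are skipped.

```python
from dataclasses import dataclass


@dataclass
class PhraseEntry:
    english: str
    hindi: str
    phi_e_given_h: float   # phrase translation probability φ(e|h)
    lex_e_given_h: float   # lexical weighting lex(e|h)
    phi_h_given_e: float   # phrase translation probability φ(h|e)
    lex_h_given_e: float   # lexical weighting lex(h|e)
    phrase_penalty: float  # constant, exp(1) rounded to 2.718


def parse_entry(line):
    """Parse 'english ||| hindi ||| align ||| align ||| five scores'."""
    fields = [field.strip() for field in line.split("|||")]
    scores = [float(score) for score in fields[-1].split()]
    if len(scores) != 5:
        raise ValueError(f"expected 5 scores, got {len(scores)}")
    return PhraseEntry(fields[0], fields[1], *scores)


entry = parse_entry("want ||| AvaSyakawA ||| (0) ||| (0) ||| 0.5 0.5 1 0.5 2.718")
print(entry)
```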

20 Dec 2008 IIT Kharagpur Slide 13

Phrase Based Statistical Machine Translation

We used phrase-based modeling to generate the phrase table from the 50k parallel corpus provided by ICON, and appended the preprocessed English-Hindi dictionary to that phrase translation table.

Example output of the phrase translation table:

! time came for ||| samaya AyA waba waka parisZWiwiyAM ||| () (0) (1) (3) ||| (1) (2) () (3) () ||| 0.5 8.11523e-08 0.5 5.50722e-12 2.718

! time came for ||| samaya AyA waba waka ||| () (0) (1) (3) ||| (1) (2) () (3) ||| 0.5 8.11523e-08 0.5 5.14693e-07 2.718

! time came ||| samaya AyA waba ||| () (0) (1) ||| (1) (2) () ||| 0.5 2.21728e-06 0.5 4.83502e-05 2.718

! time came ||| samaya AyA ||| () (0) (1) ||| (1) (2) ||| 0.333333 2.21728e-06 0.5 0.0328891 2.718

time ||| samaya ||| (0) ||| (0) ||| 0.5 0.5 0.5 0.5 2.718
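The appending step might look like the sketch below. The slides do not say how entries present in both sources are handled, so keeping the corpus-derived line and skipping the dictionary duplicate is an assumption; `append_dictionary` is a hypothetical name.

```python
def append_dictionary(phrase_lines, dictionary_lines):
    """Append dictionary lines whose (source, target) pair is not already present."""

    def key(line):
        source, target = line.split("|||")[:2]
        return source.strip(), target.strip()

    seen = {key(line) for line in phrase_lines}
    merged = list(phrase_lines)
    for line in dictionary_lines:
        if key(line) not in seen:  # assumption: corpus entries take priority
            seen.add(key(line))
            merged.append(line)
    return merged


# Toy inputs reusing lines shown in these slides.
phrase_lines = ["time ||| samaya ||| (0) ||| (0) ||| 0.5 0.5 0.5 0.5 2.718"]
dictionary_lines = [
    "time ||| samaya ||| (0) ||| (0) ||| 0.5 0.5 0.5 0.5 2.718",
    "war ||| yuxXa ||| (0) ||| (0) ||| 0.5 0.5 0.5 0.5 2.718",
]

merged = append_dictionary(phrase_lines, dictionary_lines)
print(len(merged))
```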

20 Dec 2008 IIT Kharagpur Slide 14

Factored Translation Model

• Extended the phrase-based SMT model using a factored representation

• We annotate each word with a feature vector

• The feature vector includes the

– surface form

– root

– part-of-speech tag

– the morphological information

• The annotation is used to construct SMT models that can be combined to maximize translation quality

20 Dec 2008 IIT Kharagpur Slide 15

Sentence Analysis

• The sentence analysis is broken up into the following steps:

– Tokenize the given sentence

– Take the surface form of each word

– Generate the root word

– Generate the part-of-speech factor

– Obtain other morphological information such as gender and number
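The steps above can be sketched as a small pipeline. The lookup table `ANALYSES` is a hypothetical stand-in for the PoS tagger and morphological analyser, and the `surface|root|pos|morph` output layout is borrowed from the Hindi tagged file sample shown later in these slides.

```python
# Hypothetical stand-in for the PoS tagger and morphological analyser:
# surface form -> (root, part-of-speech, morphology).
ANALYSES = {
    "galiyoM": ("galI", "n", "n.f.p"),
    "meM": ("meM", "p", "p.null.null"),
}


def factor_sentence(sentence):
    """Annotate each token as surface|root|pos|morph."""
    factored = []
    for surface in sentence.split():  # tokenize on whitespace
        # Unknown words keep their surface form as root and get placeholder tags.
        root, pos, morph = ANALYSES.get(surface, (surface, "unk", "unk"))
        factored.append("|".join([surface, root, pos, morph]))
    return " ".join(factored)


print(factor_sentence("galiyoM meM"))
```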

20 Dec 2008 IIT Kharagpur Slide 16

Sample of Factored Translation

<Sentence id="2">
1 ye <fs af='yaha,P,any,p,a,,0,'>
2 loga <fs af='loga,n,m,s,,0,,'>|<fs af='loga,n,m,p,,0,,'>|<fs af='loga,n,m,s,,1,,'>
3 kAPI <fs af='kAPI,n,f,s,,0,,'>|<fs af='kAPI,n,f,p,,0,,'>|<fs af='kAPI,n,f,s,,1,,'>|<fs af='kAPI,n,f,p,,1,,'>|<fs af='kAPI,D,,,,,,'>
4 pariSramI <fs af='pariSramI,n,f,s,,0,,'>|<fs af='pariSramI,n,f,p,,0,,'>|<fs af='pariSramI,n,f,s,,1,,'>|<fs af='pariSramI,n,f,p,,1,,'>
5 hEM <fs af='hE,v,any,p,u,,,hE'>|<fs af='hE,v,any,p,a,,,hE'>
</Sentence>
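Each `af` attribute holds one candidate analysis as eight comma-separated values, and alternatives are separated by `|`. A rough sketch of reading them follows. The field names in `AF_FIELDS` are assumptions about what each position means, since the slides do not name them, and `parse_analyses` is a hypothetical helper.

```python
import re

# Assumed meaning of the eight positions; not stated in the slides.
AF_FIELDS = ["root", "category", "gender", "number",
             "person", "case", "marker", "suffix"]

# Tolerates a missing space between "fs" and "af".
AF_PATTERN = re.compile(r"<fs\s*af='([^']*)'>")


def parse_analyses(text):
    """Return one dict per <fs af='...'> alternative found in the text."""
    analyses = []
    for match in AF_PATTERN.finditer(text):
        values = match.group(1).split(",")
        analyses.append(dict(zip(AF_FIELDS, values)))
    return analyses


sample = "<fs af='loga,n,m,s,,0,,'>|<fs af='loga,n,m,p,,0,,'>|<fs af='loga,n,m,s,,1,,'>"
for analysis in parse_analyses(sample):
    print(analysis)
```

The word "loga" has three candidate analyses here, which shows why a later disambiguation step is needed before a single factor string can be emitted per word.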

20 Dec 2008 IIT Kharagpur Slide 17

Sample of Hindi Tagged File

apane|apanA|adj|adj.m.s 30|30 sAWiyoM|sAWI|n|n.m.p ko|ko|p|p.null.null sAWa|sAWa|D|D.null.null lekara|le|v|v.any.any saraxAra|saraxAra|n|n.m.s guraxIwasiMha|siMha|n|n.m.s kalakawAkI|kA|sh_n|sh_n.f.s galiyoM|galI|n|n.f.p meM|meM|p|p.null.null vilIna|vilIna|n|n.m.s ho|ho|v|v.any.any gae|jA|v|v.m.p .

unake|vaha|sh_P|sh_P.m.a anya|anya|adj|adj.any.any kRewra|kRewra|n|n.m.s hE|hE|v|v.any.s raMgawa|raMgawa|n|n.f.s va|va|Avy|Avy.null.null greta|greta|adj|adj.any.any nikobAra .

unakI|vaha|sh_P|sh_P.f.a muKya|muKya|adj|adj.any.any gawiviXiyAMportableyarameM|meM|p|p.null.null keMxriwa|keMxriwa|adj|adj.any.any hEM|hE|v|v.any.p jahAM|jahAz|D|D.null.null para|para|p|p.null.null unakI|vaha|sh_P|sh_P.f.a aneka|aneka|n|n.m.s sAmAjika|sAmAjika|adj|adj.any.any waWA|waWA|D|D.null.null sAMskqwika|sAMskqwika|adj|adj.any.any saMsWAeM|saMsWA|n|n.f.p hEM|hE|v|v.any.p .

20 Dec 2008 IIT Kharagpur Slide 18

Sample of English Tagged File

city_NN palace_NN is_VBZ a_DT magnificent_JJ structure_NN ,_, the_DT palace_NN occupies_VBZ one_CD seventh_JJ of_IN the_DT walled_JJ city_NN of_IN jaipur_NN and_CC is_VBZ a_DT wonderful_JJ blend_VB of_IN rajput_NN and_CC mughal_JJ architecture_NN ._.

the_DT jeep_FW safari_FW not_RB only_RB refreshes_VBZ and_CC revitalizes_VBZ but_CC one_PRP feels_VBZ close_RB to_TO nature_NN while_IN diving_NN through_IN the_DT quiet_JJ and_CC beautiful_JJ countryside_NN ._.

boparais_NN organization_NN is_VBZ running_VBG a_DT camping_NN site_NN at_IN barog_NN in_IN district_NN solan_NN ._.

this_DT was_VBD to_TO prevent_VB tobacco_NN smuggling_NN from_IN coimbatore_NN ._.

shimla_NN is_VBZ surrounded_VBN by_IN pine_VB ,_, cedar_NN ,_, oak_NN and_CC rhododendron_NN forests_NNS ._.

the_DT monumental_JJ labor_NN of_IN love_NN of_IN a_DT great_JJ ruler_NN for_IN his_PRP$
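Tokens in this file have the form `word_TAG`. A minimal sketch of splitting them into pairs is shown below; `split_tagged` is a hypothetical helper, and splitting on the last underscore is an assumption that keeps punctuation tokens such as `,_,` intact.

```python
def split_tagged(sentence):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rsplit on the last underscore; a token without "_" raises ValueError.
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


sample = "shimla_NN is_VBZ surrounded_VBN by_IN pine_VB ,_, cedar_NN"
for word, tag in split_tagged(sample):
    print(word, tag)
```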

20 Dec 2008 IIT Kharagpur Slide 19

NIST score = 4.3187  BLEU score = 0.0976 for system "iit kgp"

# ------------------------------------------------------------------------
Individual N-gram scoring
        1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
        ------  ------  ------  ------  ------  ------  ------  ------  ------
NIST:   3.7437  0.5263  0.0433  0.0048  0.0006  0.0001  0.0000  0.0000  0.0000  "iit kgp"
BLEU:   0.4408  0.1444  0.0575  0.0248  0.0119  0.0062  0.0034  0.0020  0.0013  "iit kgp"

# ------------------------------------------------------------------------
Cumulative N-gram scoring
        1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
        ------  ------  ------  ------  ------  ------  ------  ------  ------
NIST:   3.7437  4.2700  4.3133  4.3181  4.3187  4.3188  4.3188  4.3188  4.3188  "iit kgp"
BLEU:   0.4408  0.2523  0.1541  0.0976  0.0641  0.0435  0.0302  0.0215  0.0157  "iit kgp"

MT evaluation scorer ended on 2008 Dec 1 at 11:38:48

Evaluation on TIDES IIIT Test Dataset
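The cumulative BLEU row can be related to the individual row: cumulative BLEU up to n-grams is the brevity penalty times the geometric mean of the first n individual precisions. The sketch below assumes a brevity penalty of 1.0, which is consistent with the TIDES figures but is not reported in the scorer output; `cumulative_bleu` is a hypothetical helper, not the scorer's own code.

```python
import math


def cumulative_bleu(precisions, brevity_penalty=1.0):
    """Cumulative BLEU for n = 1..len(precisions) from individual precisions."""
    scores = []
    log_sum = 0.0
    for n, precision in enumerate(precisions, start=1):
        log_sum += math.log(precision)
        # Geometric mean of the first n precisions, scaled by the penalty.
        scores.append(brevity_penalty * math.exp(log_sum / n))
    return scores


# Individual 1-gram to 4-gram BLEU values from the TIDES evaluation above.
individual = [0.4408, 0.1444, 0.0575, 0.0248]
cumulative = cumulative_bleu(individual)
print([round(score, 4) for score in cumulative])
```

Under that assumption the 4-gram value comes out close to the reported overall BLEU score of 0.0976.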

20 Dec 2008 IIT Kharagpur Slide 20

MT evaluation scorer began on 2008 Dec 11 at 00:28:11
command line:

NIST score = 5.1704  BLEU score = 0.1873 for system "iit kgp"

# ------------------------------------------------------------------------
Individual N-gram scoring
        1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
        ------  ------  ------  ------  ------  ------  ------  ------  ------
NIST:   4.3795  0.7168  0.0645  0.0080  0.0016  0.0002  0.0001  0.0001  0.0000  "iit kgp"
BLEU:   0.5282  0.2334  0.1269  0.0837  0.0607  0.0469  0.0364  0.0290  0.0246  "iit kgp"

# ------------------------------------------------------------------------
Cumulative N-gram scoring
        1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
        ------  ------  ------  ------  ------  ------  ------  ------  ------
NIST:   4.3795  5.0963  5.1608  5.1688  5.1704  5.1706  5.1706  5.1707  5.1707  "iit kgp"
BLEU:   0.5200  0.3457  0.2462  0.1873  0.1490  0.1226  0.1028  0.0876  0.0759  "iit kgp"

MT evaluation scorer ended on 2008 Dec 11 at 00:28:14

Evaluation on EILMT Tourism Test Dataset

20 Dec 2008 IIT Kharagpur Slide 21

Comparison of Results

System          BLEU 1-gram   BLEU 3-gram   NIST score
IIIT baseline   0.1059        0.1649        3.9
IIT, Kgp        0.5282        0.2462        5.1

20 Dec 2008 IIT Kharagpur Slide 22

Future Work

• NER followed by a transliteration system

• Increasing the amount of parallel text by paraphrasing

• Extract parallel data from comparable bilingual corpora so as to increase our training corpus

Thank You