
Andy Way, IGK Summer School, Edinburgh, Sept. 2006

Hybrid Data-Driven Models of Machine Translation

Andy Way (& Declan Groves)

National Centre for Language Technology, School of Computing,

Dublin City University, Dublin 9, Ireland



Outline

• Motivations
• Example-Based Machine Translation
  – Marker-Based EBMT
• Statistical Machine Translation
• Experiments:
  – Language Pairs & Corpora Used
  – EBMT and PBSMT baseline systems
  – Hybrid System Experiments
    • Making use of merged data sets
• ‘Phrases’, ‘Chunks’ and Training-Test Corpora
• Conclusions
• Future Work


Motivations

• Most MT research carried out today is corpus-based:
  – Example-Based Machine Translation (EBMT)
  – Statistical Machine Translation (SMT)
• Lack of comparative research:
  – Relative unavailability of EBMT systems
  – Lack of participation of EBMT researchers in competitive evaluations
  – Dominance of the SMT approach


Example-Based Machine Translation

• As with SMT, EBMT makes use of information extracted from sententially-aligned bilingual corpora. In general:
  – SMT only uses parameters, throws away data
  – EBMT makes use of linguistic units directly
• During translation:
  1. Source side of bitext is searched for close matches
  2. Source-target subsentential links are determined
  3. Relevant target fragments retrieved and recombined to derive final translation.



EBMT: An Example

• Assumes an aligned bilingual corpus of examples against which input text is matched
• Best match is found using a similarity metric based on word co-occurrence, POS, generalized templates and bilingual dictionaries (exact and fuzzy matching)

Given the Corpus

The shop is open on Monday : Le magasin est ouvert lundi
John went to the swimming pool : Jean est allé à la piscine
The butcher’s is next to the baker’s : La boucherie est à côté de la boulangerie


EBMT: An Example

• Identify useful fragments
• Isolate useful fragments:

  on Monday : lundi
  John went to : Jean est allé à
  the baker’s : la boulangerie

• We can now translate new input that is covered by these fragments
• Recombination depends on nature of examples used
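The fragment lookup and recombination step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input sentence "John went to the baker's on Monday" is a made-up test case built from the three isolated fragments, and the greedy longest-match cover with in-order concatenation is a simplification, not the recombination strategy of any particular EBMT system.

```python
from typing import Dict, List

# Fragment pairs isolated from the three-sentence example corpus.
FRAGMENTS: Dict[str, str] = {
    "on monday": "lundi",
    "john went to": "Jean est allé à",
    "the baker's": "la boulangerie",
}


def translate(sentence: str, fragments: Dict[str, str]) -> str:
    """Cover the input left to right with the longest known source fragment
    and concatenate the stored target fragments in source order."""
    words = sentence.lower().split()
    output: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in fragments:
                output.append(fragments[candidate])
                i = j
                break
        else:
            # No fragment covers this word: pass it through untranslated.
            output.append(words[i])
            i += 1
    return " ".join(output)


# Hypothetical input, chosen so that the three fragments cover it exactly.
print(translate("John went to the baker's on Monday", FRAGMENTS))
```

Because the target fragments are simply concatenated in source order, this sketch only works when both languages share the same fragment order, which is one reason recombination depends on the nature of the examples used.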


Marker-Based EBMT at DCU

• Gaijin: [Veale & Way], RANLP ‘97
• [Gough et al.], AMTA ‘02
• wEBMT: [Way & Gough], Comp. Linguistics ‘03
• [Gough & Way], EAMT ‘04
• [Way & Gough], TMI ‘04
• [Gough], PhD Thesis ‘05
• [Way & Gough], Natural Language Engineering ‘05
• [Way & Gough], Machine Translation ‘05
• [Groves & Way], ACL w/shop on Data-Driven MT ‘05
• [Groves & Way], Machine Translation & EAMT ‘06
• MaTrEx: [Armstrong et al.], TC-STAR OpenLab ‘06
• [Stroppa et al.], NIST MT-Eval ‘06, AMTA ‘06, IWSLT-06


System Development

System            Lang. Pairs   #Sent. Pairs
Gaijin ‘97        EN-DE         1836
wEBMT ‘03         FR-EN         219K Penn-II NPs, VPs
TMI-04            FR-EN         203,000
ACL-05            FR-EN         322,000
MaTrEx OpenLab    ES-EN         958,000
MaTrEx NIST-06    AR-EN         3,000,000
MaTrEx AMTA-06    Basque-EN     276,000
MaTrEx IWSLT-06   IT-EN         40,000


Marker-Based EBMT

“The Marker Hypothesis states that all natural languages have a closed set of specific words or morphemes which appear in a limited set of grammatical contexts and which signal that context.” [Green, 1979]

• Universal psycholinguistic constraint: languages are marked for syntactic structure at surface level by closed set of lexemes or morphemes

The Dearborn, Mich., energy company stopped paying a dividend in the third quarter of 1984 because of troubles at its Midland nuclear plant.


• Three NPs start with determiners, one with a possessive pronoun
• Nominal element will appear soon to the right
• Sets of determiners and possessive pronouns small and finite

• Four prepositional phrases, with prepositional heads
• NP object will appear soon to the right
• Set of prepositions small and finite

Marker-Based EBMT: Chunking

• Use a set of closed-class marker words to segment aligned source and target sentences during a pre-processing stage

• <PUNC> now used as end-of-chunk marker

Determiners           <DET>
Quantifiers           <QUANT>
Prepositions          <PREP>
Conjunctions          <CONJ>
Wh-Adverbs            <WRB>
Possessive Pronouns   <POSS>
Personal Pronouns     <PRON>
Punctuation Marks     <PUNC>

• English Marker words extracted from CELEX


Marker-Based EBMT: Chunking (2)

• Enables the use of basic syntactic markup for extraction of translation resources
• Source-target sentence pairs are tagged with marker categories in pre-processing stage

EN: <PRON> you click apply <PREP> to view <DET> the effect <PREP> of <DET> the selection
FR: <PRON> vous cliquez <PREP> sur appliquer <PREP> pour visualiser <DET> l’ effet <PREP> de <DET> la sélection

• Aligned source-target chunks created by segmenting sentences based on these marker tags along with cognate and word co-occurrence information:

<PRON> you click apply : <PRON> vous cliquez sur appliquer
<PREP> to view : <PREP> pour visualiser
<DET> the effect : <DET> l’effet
<PREP> of the selection : <PREP> de la sélection

• Chunks must contain at least one non-marker word: this ensures chunks contain useful contextual information
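The monolingual half of this chunking step can be sketched as follows. This is a minimal sketch, assuming a tiny hand-written marker lexicon (`MARKERS` is hypothetical; the slides take the English marker words from CELEX), and it covers segmentation only, not the source-target chunk alignment that additionally uses cognate and word co-occurrence information.

```python
from typing import Dict, List, Tuple

# Tiny illustrative marker lexicon (hypothetical, covers only the example).
MARKERS: Dict[str, str] = {
    "you": "<PRON>", "to": "<PREP>", "of": "<PREP>", "the": "<DET>",
}


def marker_chunks(sentence: str, markers: Dict[str, str]) -> List[Tuple[str, str]]:
    """Split a sentence into (tag, chunk) pairs at marker words.

    A marker word opens a new chunk only if the current chunk already holds
    at least one non-marker word, so every chunk carries some content.
    """
    chunks: List[Tuple[str, str]] = []
    current: List[str] = []
    has_content = False
    for word in sentence.lower().split():
        if word in markers and has_content:
            # Close the running chunk; its tag is that of its first word.
            chunks.append((markers.get(current[0], "<LEX>"), " ".join(current)))
            current, has_content = [], False
        current.append(word)
        if word not in markers:
            has_content = True
    if current:
        if not has_content and chunks:
            # Trailing marker words with no content join the previous chunk.
            tag, text = chunks[-1]
            chunks[-1] = (tag, text + " " + " ".join(current))
        else:
            chunks.append((markers.get(current[0], "<LEX>"), " ".join(current)))
    return chunks


for tag, chunk in marker_chunks(
        "you click apply to view the effect of the selection", MARKERS):
    print(tag, chunk)
```

Note how "of" and "the" end up in the same chunk: "the" cannot open a new chunk because the chunk started by "of" does not yet contain a non-marker word.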


Marker-Based EBMT: Lexicon & Template Extraction

• Chunks containing only one non-marker word in both source and target languages can then be used to extract a word-level lexicon:

<PREP> to : <PREP> pour
<LEX> view : <LEX> visualiser
<LEX> effect : <LEX> effet
<DET> the : <DET> l’
<PREP> of : <PREP> de

• In a final pre-processing stage, we produce a set of generalized marker templates by replacing marker words with their tags:

<PRON> click apply : <PRON> cliquez sur appliquer
<PREP> view : <PREP> visualiser
<DET> effect : <DET> effet
<PREP> the selection : <PREP> la sélection

• Any marker word pair can now be inserted at the appropriate tag location.
• More general examples add flexibility to the matching process and improve coverage (and quality)
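Both extraction steps above can be illustrated with a short Python sketch. The marker lexicons `EN_MARKERS` and `FR_MARKERS` and the function names are hypothetical and cover only the running example; the rule that only the chunk-initial marker word is replaced by its tag is inferred from the template examples above, not from a published specification.

```python
from typing import Dict, List, Tuple

# Hypothetical marker lexicons covering only the running example.
EN_MARKERS: Dict[str, str] = {
    "you": "<PRON>", "to": "<PREP>", "of": "<PREP>", "the": "<DET>",
}
FR_MARKERS: Dict[str, str] = {
    "vous": "<PRON>", "pour": "<PREP>", "de": "<PREP>",
    "l'": "<DET>", "la": "<DET>",
}


def generalize(chunk: str, markers: Dict[str, str]) -> str:
    """Replace a chunk-initial marker word with its tag."""
    words = chunk.split()
    if words and words[0] in markers:
        words[0] = markers[words[0]]
    return " ".join(words)


def lexicon_entries(src: str, tgt: str,
                    src_markers: Dict[str, str],
                    tgt_markers: Dict[str, str]) -> List[Tuple[str, str]]:
    """Extract word-level entries from a chunk pair that has exactly one
    non-marker word on each side; otherwise return nothing."""
    src_words, tgt_words = src.split(), tgt.split()
    src_content = [w for w in src_words if w not in src_markers]
    tgt_content = [w for w in tgt_words if w not in tgt_markers]
    if len(src_content) != 1 or len(tgt_content) != 1:
        return []
    entries: List[Tuple[str, str]] = []
    src_tag = src_markers.get(src_words[0])
    tgt_tag = tgt_markers.get(tgt_words[0])
    # Pair the leading marker words only when their categories agree.
    if src_tag is not None and src_tag == tgt_tag:
        entries.append((f"{src_tag} {src_words[0]}", f"{tgt_tag} {tgt_words[0]}"))
    entries.append((f"<LEX> {src_content[0]}", f"<LEX> {tgt_content[0]}"))
    return entries


print(generalize("of the selection", EN_MARKERS), ":",
      generalize("de la sélection", FR_MARKERS))
print(lexicon_entries("to view", "pour visualiser", EN_MARKERS, FR_MARKERS))
```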


Marker-Based EBMT

• During translation:
  – Resources are searched from maximal (specific source-target sentence-pairs) to minimal context (word-for-word translation).
  – Retrieved example translation candidates are recombined, along with their weights, based on source sentence order
  – System outputs n-best list of translations


Phrase-Based SMT

• SMT translation and language models now make use of phrase-translations in TM, along with word correspondences, to improve translation output.
  – Better modelling of syntax and local word-reordering
• Phrase extraction heuristics based on word alignments shown to be better than more syntactically motivated approaches [Koehn et al., 2003]
  – Perform word alignment in both source-target and target-source directions
  – Take intersection of unidirectional alignments
  – Extend the intersection iteratively into the union by adding adjacent alignments within the alignment space [Och & Ney 2003, Koehn et al., 2003].
  – Extract all possible phrases from sentence pairs which correspond to these alignments
  – Phrase probabilities can be calculated from relative frequencies
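The extraction pipeline listed above can be sketched in Python. This is a simplified illustration, not the exact heuristic of [Och & Ney 2003] or [Koehn et al., 2003]: the growing step here accepts any union point adjacent to an accepted point, phrase extraction takes only the minimal target span for each source span, and all function names and the `max_len` limit are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Alignment = Set[Tuple[int, int]]  # (source index, target index)


def symmetrize(s2t: Alignment, t2s: Alignment) -> Alignment:
    """Start from the intersection, then repeatedly add union points that
    are adjacent (including diagonally) to an already accepted point."""
    union = s2t | t2s
    aligned = set(s2t & t2s)
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - aligned):
            neighbours = {(i + di, j + dj)
                          for di in (-1, 0, 1) for dj in (-1, 0, 1)}
            if neighbours & aligned:
                aligned.add((i, j))
                added = True
    return aligned


def extract_phrases(src: List[str], tgt: List[str], alignment: Alignment,
                    max_len: int = 4) -> List[Tuple[str, str]]:
    """Collect phrase pairs consistent with the word alignment."""
    pairs: List[Tuple[str, str]] = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 >= max_len:
                continue
            # Consistency: no target word in [j1, j2] links outside [i1, i2].
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append((" ".join(src[i1:i2 + 1]),
                              " ".join(tgt[j1:j2 + 1])))
    return pairs


def relative_frequencies(pairs: List[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Estimate p(target | source) as count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    src_counts = Counter(s for s, _ in pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}
```

In practice the phrase pairs from every sentence pair in the corpus are pooled before `relative_frequencies` is applied, so that frequent pairs receive higher probability.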


Experiments

Publication                   Training Corpus        Language Pair   Rationale
Way & Gough, NLE-05           203K-sent. Sun TM      EN-FR           How does EBMT fare compared to WB-SMT?
Groves & Way, ACL SMT-05      203K-sent. Sun TM      EN-FR           How does EBMT fare compared to PB-SMT? What about combining EBMT & SMT chunks?
Groves & Way, MT-06           322K-sent. Europarl    EN-FR           How does changing domain affect all this?
Armstrong et al., OpenLab-06  958K-sent. Europarl    ES-EN           What about a different language pair & more training data?
Stroppa et al., AMTA-06       273K-sent. EF TM       Basque-EN       What about a more different language pair?


EBMT vs. WB-SMT

• [Way & Gough, 05] (cf. talk here in May 05): on 203K-sentence Sun TM (4.8M words), and a 4K-sentence test set (average sentence length 13.1 words EN, 15.2 words FR), EBMT > vanilla WB-SMT (Giza++, CMU-Cambridge statistical toolkit, ISI ReWrite Decoder) for FR-EN
• Best BLEU scores:
  – EN→FR: .453 EBMT, .338 WB-SMT
  – FR→EN: .461 EBMT, .446 WB-SMT


EBMT & PB-SMT (on Sun TM): English-French

[Figure: Bleu, Prec, Recall and WER for PBSMT (GIZA, 1.73M entries), PBSMT (EBMT, 403,278 entries) and EBMT]

• The Phrase-Based system using GIZA data outperforms the same system seeded with EBMT data on all metrics, bar Precision (0.6598 vs. 0.6661)
• Marker-Based EBMT system beats both Phrase-Based SMT systems, particularly for BLEU (0.4409 vs. 0.3758) and Recall (0.6877 vs. 0.5759).


EBMT & PB-SMT (on Sun TM): French-English

[Figure: scores for the Phrase-Based SMT systems and the EBMT system]

• Scores for all systems are better for FR→EN than for EN→FR
• Again, the Phrase-Based system using GIZA data outperforms the same system seeded with EBMT data.
• As for EN→FR, the Marker-Based EBMT system significantly outperforms both Phrase-Based SMT systems for FR→EN.


Towards Hybridity

• Decided to merge data sources
  – Combine parts of EBMT sub-sentential alignments with parts of the data induced using GIZA++
• Performed a number of experiments using:
  – EBMT Phrases + GIZA++ Words (SEMI-HYBRID)
    • Investigate if quality of EBMT phrases is better than GIZA++ phrases
  – All Data (HYBRID); GIZA++ Words & Phrases + EBMT Words & Phrases
    • EBMT phrases will be used instead of SMT n-grams
    • EBMT phrases should add extra probability to ‘more useful’ SMT phrases; i.e. the probabilities of the phrases in the intersection of these two sets are boosted

[Figure: overlapping sets of EBMT phrases and Giza++ phrases]
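The boosting effect on phrases in the intersection can be illustrated with a small Python sketch. It assumes, purely for illustration, that the two sets of extracted pairs are concatenated and probabilities are re-estimated by relative frequency; the function name and the toy phrase pairs are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

PhrasePair = Tuple[str, str]  # (source phrase, target phrase)


def merged_phrase_table(giza_pairs: Iterable[PhrasePair],
                        ebmt_pairs: Iterable[PhrasePair]) -> Dict[PhrasePair, float]:
    """Pool the extracted pairs from both sources and re-estimate
    p(target | source) by relative frequency over the pooled counts."""
    counts = Counter(giza_pairs) + Counter(ebmt_pairs)
    src_totals: Counter = Counter()
    for (src, _), c in counts.items():
        src_totals[src] += c
    return {(src, tgt): c / src_totals[src] for (src, tgt), c in counts.items()}


# Toy data: GIZA++ proposes two translations, EBMT confirms one of them.
giza = [("the effect", "l' effet"), ("the effect", "effet")]
ebmt = [("the effect", "l' effet")]

giza_only = merged_phrase_table(giza, [])
hybrid = merged_phrase_table(giza, ebmt)
print(giza_only[("the effect", "l' effet")])  # probability before merging
print(hybrid[("the effect", "l' effet")])     # probability after merging
```

With the GIZA++ data alone the two candidate translations share the probability mass equally; once the EBMT pair is added, the translation found by both methods receives the larger share.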


Merging Data Sources: EN→FR Results

[Figure: scores per metric for SEMI-HYBRID (430,336 entries), HYBRID (2.05M entries) and the EBMT system]

• Using EBMT phrases + GIZA words improves significantly on using EBMT data alone
• Merging all the EBMT and GIZA data improves on all metrics, most significantly for BLEU score (0.4259 vs. 0.3643 SEMI-HYBRID).
• EBMT system still wins out for BLEU score, Recall and WER


Merging Data Sources: FR→EN Results

[Figure: scores per metric for SEMI-HYBRID (430,336 entries), HYBRID (2.05M entries) and the EBMT system]

• Using EBMT phrases + GIZA words shows improvements on PBSMT system seeded with EBMT data, but improves only on the GIZA seeded system’s BLEU score (0.4888 vs. 0.4198).
• However, merging all data improves on both PBSMT systems on all metrics
• EBMT system beats Hybrid system only on Recall and WER


Results: Discussion

• PBSMT
  – Best PBSMT BLEU scores (with Giza++ data only): 0.375 (E-F), 0.420 (F-E);
  – Seeding PBSMT with EBMT data gets good scores: for BLEU, 0.364 (E-F), 0.395 (F-E); note differences in data size (1.73M vs. 403K)
  – PBSMT loses out to EBMT system
• Semi-Hybrid System
  – Seeding Pharaoh with SMT words and EBMT phrases improves over baseline Giza++ seeded system;
  – Data size diminishes considerably (430K vs. 1.73M);
  – Worse results than for EBMT system.
• Fully-Hybrid System
  – Better results than for ‘semi-hybrid’ system: E-F 0.426 (0.396), F-E 0.489 (0.427);
  – Data size increases to 2.04M phrase table entries
  – For F-E, Hybrid system beats EBMT on BLEU (0.4888 vs. 0.4611) & Precision (0.6927 vs. 0.6782); EBMT ahead for Recall & WER.


EBMT & PB-SMT (on Europarl)

• [Groves & Way, 06a/b]
  1. Added SMT chunks to EBMT system, giving a hybrid ‘statistical EBMT’ system
  2. New domain: Europarl (FR-EN, 322K sentence pairs) [Koehn, 05]
• Extracted training data from designated training sets, filtering based on sentence length and relative sentence length (ratio of 1.5 used).
  – Allowed us to extract high-quality training sets

# sentence pairs   # words
78K                1.49M
156K               2.98M
322K               6.12M

• For testing, randomly extracted 5000 sentences from the Europarl common test set. Avg. sentence lengths: 20.5 words (French), 19.0 words (English)
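The length-based filtering of training pairs described above might look like the following Python sketch. The ratio threshold of 1.5 comes from the slide; the absolute length cap `max_len=40` and the whitespace tokenisation are assumptions made only for this example, since the slides do not state them.

```python
from typing import Iterable, List, Tuple


def keep_pair(src: str, tgt: str, max_len: int = 40,
              max_ratio: float = 1.5) -> bool:
    """Keep a sentence pair only if both sides are non-empty, neither side
    exceeds max_len words, and the longer side is at most max_ratio times
    the shorter side."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    if n_src > max_len or n_tgt > max_len:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def filter_corpus(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Apply keep_pair with its default thresholds to a whole corpus."""
    return [(s, t) for s, t in pairs if keep_pair(s, t)]


corpus = [
    ("The shop is open on Monday", "Le magasin est ouvert lundi"),
    ("Yes", "Oui , bien sûr , monsieur le président"),
]
print(filter_corpus(corpus))
```

The second toy pair is dropped because its target side is far more than 1.5 times as long as its source side, the kind of mismatch that often signals a misaligned sentence pair.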


EBMT vs. PBSMT

• Compared the performance of our Marker-Based EBMT system against that of a PB-SMT system built using:
  – Pharaoh Phrase-Based Decoder [Koehn, 04]
  – SRI LM toolkit [Stolcke, 02].
  – Refined alignment strategy [Och & Ney, 03]
• Trained on incremental data sets, tested on 5000 sentence test set
  – Effect of increasing training data on translation quality
• Performed translation for FR-EN
• Evaluated translation quality automatically using BLEU [Papineni et al., 02], Precision & Recall (GTM toolkit [Turian et al., 03]) and Word-error rate (WER)
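Of the metrics listed above, WER is the simplest to state: the word-level edit distance between system output and reference, divided by the reference length. The sketch below is a generic single-reference implementation with whitespace tokenisation, given only to make the metric concrete; it is not the scoring script used in the experiments.

```python
def wer(hypothesis: str, reference: str) -> float:
    """Word error rate: minimum number of word insertions, deletions and
    substitutions needed to turn the hypothesis into the reference,
    divided by the number of reference words."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix processed so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


print(wer("Jean est allé à la piscine", "Jean est allé à la boulangerie"))
```

Lower is better, and because insertions are counted the value can exceed 1 when the output is much longer than the reference.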


EBMT vs. PBSMT: French-English

[Figure: Bleu, Prec, Recall and WER charts for the 78K, 156K and 322K training sets]

• Doubling the amount of data improves performance across the board for both EBMT and PBSMT
• PBSMT system clearly outperforms EBMT system, on average achieving 0.07 BLEU score higher
• PBSMT achieves a significantly lower WER (e.g. 68.55 vs. 82.43 for the 322K data set)
• Increasing amount of training data results in:
  – 3-5% increase in relative BLEU for PBSMT
  – 6.2% to 10.3% relative BLEU score improvement for EBMT


EBMT vs. PBSMT: English-French

[Figure: Bleu, Prec, Recall and WER charts for the 78K, 156K and 322K training sets]

• PBSMT continues to outperform EBMT system by some distance
  – e.g. 0.1933 vs. 0.1488 BLEU score, 0.518 vs. 0.4578 Recall for 322K data set
• Difference between systems is somewhat less for EN→FR than for FR→EN
  – EBMT system performance much more consistent for both directions
  – PBSMT system performs 2% BLEU score worse (10% relative) for EN→FR than for FR→EN
• French-English is ‘easier’
  – Fewer agreement errors, problems with boundary friction, e.g. le → the (FR→EN), the → le, la, les, l’ (EN→FR)
• EBMT scores higher for EN→FR than for FR→EN in terms of BLEU score
  – Cf. [Callison-Burch et al., 06], BLEU for evaluating non-n-gram-based systems
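The slides quote BLEU differences both in absolute terms and as relative percentages. As a small worked illustration, the two can be computed for the 322K English-French BLEU scores quoted above (0.1933 for PBSMT, 0.1488 for EBMT); the helper names are hypothetical.

```python
def absolute_gain(score: float, baseline: float) -> float:
    """Difference in raw score (0.01 corresponds to one BLEU point)."""
    return score - baseline


def relative_gain(score: float, baseline: float) -> float:
    """Difference expressed as a fraction of the baseline score."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (score - baseline) / baseline


pbsmt_bleu, ebmt_bleu = 0.1933, 0.1488  # 322K EN-FR scores from the slide
print(f"absolute: {absolute_gain(pbsmt_bleu, ebmt_bleu):.4f}")
print(f"relative: {relative_gain(pbsmt_bleu, ebmt_bleu):.1%}")
```

Here a gap of roughly 4.5 BLEU points corresponds to a relative difference of about 30% over the EBMT score, which is why low absolute scores can produce large relative percentages.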


Hybrid System Experiments

• Decided to merge elements of EBMT marker-based alignments with PBSMT phrases and words induced via GIZA++
• Number of Hybrid Systems
  – LEX-EBMT: Replaced EBMT lexicon with higher quality PBSMT word-alignments, to lower WER
  – H-EBMT vs. H-PBSMT: Merged PBSMT words and phrases with EBMT data (words and phrases) and passed resulting data to baseline EBMT and baseline PBSMT systems
  – H-EBMT-LM: Reranked the output of H-EBMT systems using the PBSMT system’s equivalent language model


Hybrid Experiments: French-English

[Figure: scores (y-axis 0.1 to 0.24) for the 78K, 156K and 322K training sets; systems: EBMT, LEX-EBMT, H-EBMT, H-EBMT-LM, PBSMT, H-PBSMT]

• Use of the improved lexicon (LEX-EBMT) leads to only slight improvements (average relative increase of 2.9% BLEU)
• Adding Hybrid data improves above baselines, for both EBMT (H-EBMT) and PBSMT (H-PBSMT)
  – H-PBSMT system achieves higher BLEU score trained on 78K & 156K compared with PBSMT system when trained on twice as much data.
• The addition of the language model to the H-EBMT system helps guide word order after lexical selection and thus improves results further


Hybrid Experiments: English-French

[Figure: scores (y-axis 0.1 to 0.24) for the 78K, 156K and 322K training sets; systems: EBMT, LEX-EBMT, H-EBMT, H-EBMT-LM, PBSMT, H-PBSMT]

• We see similar results for EN→FR as for FR→EN
  – The more SMT-like the EBMT system becomes, the more the BLEU scores fall in line with other metrics, i.e. higher for FR→EN than for EN→FR
• Using the hybrid data set we get a 15% average relative increase in BLEU score for the EBMT system, and 6.2% for the H-PBSMT system over its baseline
• The H-PBSMT system performs almost as well as the baseline system trained on over 4 times the amount of data


SMT ‘phrases’ vs. EBMT ‘chunks’

        SMT     EBMT      BOTH      SMT-ONLY   EBMT-ONLY
78K     1.17M   242,907   47,311    1.12M      195,596
156K    2.45M   470,588   92,662    2.36M      378,026
322K    5.15M   928,717   181,669   4.97M      747,048

• Many more SMT phrases are derived than EBMT chunks
  – Not reflected in scores
• Doubling amount of data doubles amount of sub-sentential alignments for both systems
  – Indicates the heterogeneous nature of the Europarl corpus
• Taking the 322K training set:
  – 93.0% SMT chunks found only once, 99.4% occur < 10 times
  – 96.6% EBMT chunks found only once, 99.8% occur < 10 times
• Of the top 10 most frequent chunks in SMT-only set, 7 are made up solely of marker words:

  du : of the
  de la : of the
  union européenne : union
  états membres : member states
  de l : of the
  dans le : in the
  n est : is
  parlement européen : parliament
  que nous : that we
  que la : that the
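The overlap columns of the table and the "found only once" percentages are simple set and count statistics. The sketch below shows how such figures could be computed from two lists of extracted pairs; the function names and the toy data are hypothetical and do not reproduce the Europarl numbers.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

PhrasePair = Tuple[str, str]  # (source phrase, target phrase)


def overlap_stats(smt_pairs: Iterable[PhrasePair],
                  ebmt_pairs: Iterable[PhrasePair]) -> Dict[str, int]:
    """Count distinct pairs per extraction method and in their overlap."""
    smt, ebmt = set(smt_pairs), set(ebmt_pairs)
    return {
        "SMT": len(smt),
        "EBMT": len(ebmt),
        "BOTH": len(smt & ebmt),
        "SMT-ONLY": len(smt - ebmt),
        "EBMT-ONLY": len(ebmt - smt),
    }


def singleton_share(pair_occurrences: List[PhrasePair]) -> float:
    """Fraction of distinct pairs that were extracted exactly once."""
    counts = Counter(pair_occurrences)
    if not counts:
        raise ValueError("no pairs given")
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Toy occurrences (one entry per extraction event).
smt_occ = [("of the", "du"), ("of the", "du"), ("of the", "de la"),
           ("member states", "états membres")]
ebmt_occ = [("member states", "états membres"),
            ("of the selection", "de la sélection")]

print(overlap_stats(smt_occ, ebmt_occ))
print(singleton_share(smt_occ))
```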


Remarks

• [Groves & Way, 05] showed how an EBMT system outperforms a PBSMT system when trained on the Sun Microsystems’ data set
• This time around, the baseline PBSMT system achieves higher quality than all variants of the EBMT system
  – Heterogeneous Europarl vs. Homogeneous Sun data
  – Chunk coverage is lower on Europarl data set: 6% translations produced using chunks alone (Sun) vs. 1% on Europarl
  – EBMT system considered 13 words on average for direct translation (vs. 7 for Sun data)
• Significant improvements seen when using higher-quality lexicon
• Improvements also seen when LM introduced
• H-PBSMT system able to outperform baseline PBSMT system
• Further gains to be made from hybrid corpus-based approaches
  – Small overlap on chunks extracted via EBMT and SMT methods


Hybrid ‘Example-Based SMT’: The MaTrEx system


Hybrid Example-Based SMT

• [Armstrong et al., 06], OpenLab MT-EVAL (March 06): adding EBMT chunks to ‘vanilla Pharaoh’ PB-SMT system adds about 4 BLEU points for ES-EN

• [Stroppa et al., 06]: adding EBMT chunks to ‘vanilla Pharaoh’ PB-SMT system adds about 5 BLEU points for Basque-EN

• Good performance in IWSLT-06


‘Phrases’, ‘Chunks’ and Training-Test Corpora

• SMT phrases are contiguous word sequences (n-grams)
• Typically, EBMT performance is comparable with PB-SMT with fewer sub-sentential alignments
• As EBMT chunks are different from SMT ‘phrases’, use them if available in your PB-SMT systems (cf. OpenLab ES-EN and AMTA Basque-EN results). They:
  – Provide longer sequences of context, hence better translations
  – Reinforce probability of good but infrequent SMT ‘phrases’
• As SMT ‘phrases’ are different from EBMT chunks, use them if available in your EBMT systems
• SMT ‘phrases’ typically shorter than EBMT chunks, so more useful where training/test material is more heterogeneous: where EBMT chunks are ‘too long’ to cover the input data, SMT n-grams can fill in before we need to resort to W2W translation (always last resort)
• cf. CMU findings in recent NIST MT-Eval …


‘Phrases’, ‘Chunks’ and Training-Test Corpora

• Looks like EBMT better on homogeneous training data:
  – EBMT > PB-SMT on Sun TM (EN-FR)
  – EBMT > PB-SMT on EF TM (Basque-EN)
• SMT better on (more) heterogeneous data
  – PB-SMT > EBMT on Europarl (EN-FR)
• Predictors of Usefulness of Approach given Text Type:
  – Chunk coverage
  – Amount of W2W Translation


Conclusions

• Combining SMT ‘phrases’ and EBMT chunks in a hybrid ‘statistical EBMT’ or ‘example-based SMT’ system will improve your system output

• Blind adherence to one approach will guarantee that your performance is less than it could otherwise be

• John Hutchins: “EBMT is Hybrid MT”
• Joe Olive: “Need combination of ‘rules’ and statistics”


Ongoing & Future Work

• Automatic detection of Marker Words
  – Most common SMT phrases consist mainly of marker words
• Plan to increase levels of hybridity
  – Code a simple EBMT decoder, factoring in Marker-Based recombination approach along with probabilities
  – Use exact sentence matching in PBSMT, as in EBMT
  – Integration of generalized templates into PBSMT system (and reintegrate them into EBMT system)
  – Integrate marker tag information into SMT language and translation models
  – Hybrid EBMT-EBMT System (with CMU)?!

• What’s the contribution of EBMT chunks if an SMT system is allowed as much training data as it likes?


Thank you for your attention.
