Thomas Schoenemann University of Düsseldorf, Germany ACL 2013, Sofia, Bulgaria Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment


Page 1:

Thomas Schoenemann, University of Düsseldorf, Germany

ACL 2013, Sofia, Bulgaria

Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment

Page 2:

Outline

1. Word Alignment

2. Fertility-based models
   - IBM-3 and IBM-4 specifically

3. Removing Deficiency

4. Maximum Likelihood Training
   - expectation maximization (EM)

Page 3:

Word Alignment

Given: a bilingual sentence pair, e.g.

Task: find out which words correspond to each other

Page 4:

Considered Approach

Overall strategy:

1. Design a probabilistic model (or take an existing one) for translation and word alignment, with a (manageable) set of base probabilities

2. Learn the base probabilities from a set of training data (sentence pairs without alignments)

3. To annotate a given sentence pair: compute the most likely alignment

Approach for this talk:
- probabilistic approach
- data driven
- unsupervised: no alignments given
- based on (Brown et al. 1993)

Page 5:

Considered Models

• Alignment and Translation Model:

for a target sentence and a source sentence:

- conditional model

- alignments are hidden variables

• Considered Alignments:

- each target word corresponds to at most one source word (mainly for computational reasons)
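The model formula on this slide did not survive extraction. As a hedged reconstruction, assuming the notation of Brown et al. (1993) (target sentence $f_1^J$, source sentence $e_1^I$, alignment $a_1^J$ with $a_j = 0$ marking an unaligned target word; none of these symbols appear in the transcript), a conditional model with hidden alignments would read:

```latex
p(f_1^J \mid e_1^I) \;=\; \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I),
\qquad a_j \in \{0, \dots, I\} \ \text{for } j = 1, \dots, J
```

Each target position $j$ carries exactly one value $a_j$, which encodes the restriction that a target word corresponds to at most one source word.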

Page 6:

Generative Process for Fertility Based Models

Given a source sentence:

1. For each source word i: decide on the number of target words aligned to it (its fertility). Then decide on the number of unaligned target words.

2. For each source word i and each k up to its fertility: decide on the kth target word aligned to it

3. For each source word i and each k up to its fertility: decide on the position of that target word
- distortion model
- IBM-3/4/5 differ
- source of deficiency

The remaining positions are filled with the unaligned target words.

Page 7:

IBM-3, Distortion and Deficiency

IBM-3: deficient, since we could choose j=1 (already taken)

This work: nondeficient variant of IBM-3:

Page 8:

Reflection

IBM-3:

This work: nondeficient variant of IBM-3:

• Same base probabilities as the original IBM-3

• Nondeficiency achieved by renormalization

• Relation to parametric models

• Need to keep track of all taken positions (just like the IBM-5)
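A minimal sketch of the renormalization idea from this slide, assuming a toy table of base distortion probabilities for one source word. The numbers, the function names and the `taken` set are illustrative and not taken from the talk. The deficient variant reads the base probability directly, so mass assigned to already taken positions is lost; the nondeficient variant divides by the mass of the positions that are still free.

```python
def deficient_distortion(base, j):
    """IBM-3 style: probability of position j, ignoring which positions are taken."""
    return base[j]


def nondeficient_distortion(base, j, taken):
    """Renormalize the same base probabilities over the positions still free."""
    if j in taken:
        return 0.0
    free_mass = sum(p for pos, p in base.items() if pos not in taken)
    return base[j] / free_mass


# Hypothetical base distortion probabilities for one source word
# over a target sentence with positions 1..4.
base = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}
taken = {1}  # position 1 was already filled by an earlier word

# Mass that the deficient model leaves for the legal (free) positions.
deficient_total = sum(deficient_distortion(base, j) for j in base if j not in taken)
# Mass of the renormalized model over all positions.
nondeficient_total = sum(nondeficient_distortion(base, j, taken) for j in base)
```

In this toy case the deficient model keeps only about 0.6 of the probability mass on legal placements, while the renormalized distribution sums to 1. The price is that the set of taken positions has to be tracked, as noted in the last bullet.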

Page 9:

IBM-4

1. Word Classes:

2. Center position of „die“ (the closest previous aligned word):

IBM-4 (deficient):

This work: nondeficient variant of IBM-4: renormalization

Page 10:

Nondeficient IBM-4

Deficient IBM-4:

Nondeficient IBM-4:

Leave out position 13 because we have to place „neigeuse“ afterwards.

Page 11:

Training: Maximum Likelihood

Maximize the likelihood of the training corpus:

Subject to simplex/probability constraints on the base probabilities

• nonconcave maximization problem

• many local maxima

• no global algorithms known

• method of choice: expectation maximization (Dempster et al. 1977, Neal & Hinton 1998)

• for convenience: take the negative logarithm of the objective
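The objective itself is missing from the transcript. A hedged reconstruction, assuming sentence pairs $(f_s, e_s)$ for $s = 1, \dots, S$, hidden alignments $a$ and base probabilities collected in $\theta$ (all symbols are assumptions, not from the slide), is the negative log-likelihood with the alignments summed out:

```latex
\min_{\theta} \; -\sum_{s=1}^{S} \log \sum_{a} p_{\theta}(f_s, a \mid e_s)
\quad \text{s.t. every base distribution in } \theta
\text{ is nonnegative and sums to } 1
```

The inner sum over alignments sits inside the logarithm, which is what makes the problem nonconcave in its original maximization form.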

Page 12:

EM: A Majorize-Minimize Method

… assuming there is just one variable to minimize
- in practice there are several thousand variables, with simplex (a.k.a. probability) constraints
- and for us, the bounding functions will not be convex

[Figure: negative log-likelihood function]
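To make the one-variable picture concrete, here is a small sketch of EM as a majorize-minimize method on a toy problem that is not from the talk: estimating a single mixing weight `pi` between two components whose per-observation likelihoods `a` and `b` are fixed, made-up numbers. The E-step computes weights that define a bounding function touching the objective at the current point, and the M-step minimizes that bound in closed form. Unlike the nondeficient models in the talk, the bound here is convex.

```python
import math


def neg_log_likelihood(pi, a, b):
    """-sum_n log(pi * a_n + (1 - pi) * b_n) for a single mixing weight pi."""
    return -sum(math.log(pi * an + (1.0 - pi) * bn) for an, bn in zip(a, b))


def em_step(pi, a, b):
    """One majorize-minimize step.

    E-step: the weights w_n fix a bounding function that touches the
    objective at the current pi.
    M-step: that bound is minimized at the mean of the weights.
    """
    weights = [pi * an / (pi * an + (1.0 - pi) * bn) for an, bn in zip(a, b)]
    return sum(weights) / len(weights)


# Made-up likelihoods of five observations under two fixed components.
a = [0.9, 0.8, 0.2, 0.7, 0.1]
b = [0.1, 0.3, 0.6, 0.2, 0.9]

pi = 0.5
history = [neg_log_likelihood(pi, a, b)]
for _ in range(20):
    pi = em_step(pi, a, b)
    history.append(neg_log_likelihood(pi, a, b))
```

The values in `history` never increase, which is the defining property of a majorize-minimize scheme: each bound lies above the objective and agrees with it at the current iterate.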

Page 13:

EM: Problems for IBM-3 and IBM-4

cf. (Udupa & Maji 2006)

For IBM-3 and IBM-4:

• evaluating the negative log likelihood at a given point is intractable (deficient + nondeficient)

• bounding function known up to weights (= expectations)

- computing the weights is intractable (deficient + nondeficient)

- there are exponentially many weights (nondeficient)

⇒ approximations

• Approximated bounding functions: non-convex (nondeficient)

⇒ local minimization with projected gradient descent (250 iter.) (e.g. Bertsekas 1999)
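A minimal sketch of projected gradient descent under a simplex constraint, to illustrate the last bullet. The slide does not say which projection routine or step size is used; the sort-based Euclidean projection, the fixed step of 0.1 and the toy objective below are assumptions chosen for illustration. Only the iteration count of 250 comes from the slide.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold of the largest valid k
    return [max(x - theta, 0.0) for x in v]


def projected_gradient_descent(grad, x0, step=0.1, iterations=250):
    """Take a gradient step, then project back onto the simplex."""
    x = project_to_simplex(x0)
    for _ in range(iterations):
        g = grad(x)
        x = project_to_simplex([xi - step * gi for xi, gi in zip(x, g)])
    return x


# Toy objective: -sum_i c_i * log(x_i) with made-up counts c.
# On the simplex its minimizer is c / sum(c), i.e. [0.75, 0.25].
counts = [3.0, 1.0]


def grad(x):
    return [-c / xi for c, xi in zip(counts, x)]


solution = projected_gradient_descent(grad, [0.5, 0.5])
```

For this convex toy objective the iterates settle at the normalized counts. In the nondeficient setting of the talk the bounding function is non-convex, so the same scheme only finds a local minimum.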

Page 14:

E-Step: Computing the Weights of the Bounding Function

• Minimizing the bounding function decomposes into smaller problems (one per probability distribution)

• Here: consider only distortion for the IBM-3. For each sentence:

- for the deficient variant: need expectation of i aligning to j

- for the nondeficient variant: need expectation of i aligning to j when choosing from a set of open positions (exponentially many)

• In both cases: hillclimbing procedure as in (Brown et al. 1993)

- gives a likely alignment and neighbors → approximate expectations

- incremental for the deficient variant (fast)

- not/only partially incremental for the nondeficient variant (slow)

• This method is also used to compute alignments
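A rough sketch of the hillclimbing idea, assuming the move and swap neighborhood of Brown et al. (1993) and a hypothetical `score` callback that returns the (unnormalized) probability of an alignment. The toy scoring table and all function names are illustrative. The incremental score updates that make the deficient variant fast are not shown; every neighbor is rescored from scratch here.

```python
def neighbors(alignment, num_source):
    """All alignments that are one move or one swap away."""
    result = []
    length = len(alignment)
    for j in range(length):                  # moves: re-align one target word
        for i in range(num_source + 1):      # 0 stands for the empty word
            if i != alignment[j]:
                cand = list(alignment)
                cand[j] = i
                result.append(tuple(cand))
    for j1 in range(length):                 # swaps: exchange two links
        for j2 in range(j1 + 1, length):
            if alignment[j1] != alignment[j2]:
                cand = list(alignment)
                cand[j1], cand[j2] = cand[j2], cand[j1]
                result.append(tuple(cand))
    return result


def hillclimb(alignment, num_source, score):
    """Move to the best neighbor until no neighbor scores higher."""
    current = tuple(alignment)
    while True:
        best = max(neighbors(current, num_source), key=score)
        if score(best) <= score(current):
            return current
        current = best


def approximate_link_expectations(alignment, num_source, score):
    """Normalize scores over the hillclimbing result and its neighbors."""
    best = hillclimb(alignment, num_source, score)
    pool = [best] + neighbors(best, num_source)
    total = sum(score(a) for a in pool)
    expect = {}
    for a in pool:
        weight = score(a) / total
        for j, i in enumerate(a):
            expect[(i, j)] = expect.get((i, j), 0.0) + weight
    return best, expect


# Toy score: product of made-up link probabilities table[i][j]
# for source index i (0 = empty word) and target index j (0-based).
table = [[0.1, 0.1], [0.7, 0.2], [0.2, 0.7]]


def toy_score(a):
    prod = 1.0
    for j, i in enumerate(a):
        prod *= table[i][j]
    return prod


best, expect = approximate_link_expectations((0, 0), 2, toy_score)
```

The dictionary `expect` maps a pair (i, j) to the approximate expectation of source word i aligning to target position j, and `best` doubles as the alignment that would be reported for the sentence pair.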

Page 15:

Briefly Mentioned

More contributions in the paper:

• For the IBM-3: pooled distortion model

• Reduced deficiency for the IBM-4: words can no longer be placed outside of the sentence (but still on top of one another)

• In both cases: parametric models handled via EM and projected gradient descent

based on

Page 16:

Experimental Setup

• Europarl German ↔ English and Spanish ↔ English
- gold alignments: my own and from Lambert et al.

• all sentences lower cased

• in deficient mode: deficient empty word model (Och & Ney 2003)

• 100000 sentence pairs (leads to 1 day running time, 8GB memory)

• Evaluation metric: weighted F-Measure (Fraser & Marcu 2007)
- accuracy measure (higher values = better alignments)
- α = 0.1 (recall more important than precision)
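For reference, a small sketch of the weighted F-measure as defined by Fraser & Marcu (2007), F = 1 / (α / precision + (1 − α) / recall). The helper that scores predicted links against a single gold link set is an assumption; the slide does not say how the gold annotation is split into link types, and the example links are made up.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted links against one gold link set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def weighted_f_measure(precision, recall, alpha=0.1):
    """Weighted harmonic mean; a small alpha puts most weight on recall."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Made-up links as (source position, target position) pairs.
predicted = {(1, 1), (2, 2), (3, 4)}
gold = {(1, 1), (2, 2), (3, 3), (4, 4)}

p, r = precision_recall(predicted, gold)
f = weighted_f_measure(p, r)
```

With α = 0.1 a drop in recall costs far more than the same drop in precision: a precision of 0.5 with perfect recall scores higher than perfect precision with a recall of 0.5.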

Page 17:

Results – Short Story

Alignment accuracy:

• the nondeficient IBM-3 is clearly better than the deficient one

• IBM-4: all level, no winners

• IBM-5 beats everything

Also tried:

• phrase based translation (Moses Experiment Management System)

• training run in both directions, diag-grow-final-and

• tuning (MERT) on 750 sentence pairs

• run for all models and variants

• the various BLEU scores offer no conclusions

Page 18:

Results – Long Story (1)

Page 19:

Results – Long Story (2)

[Figure: IBM-3, deficient vs. nondeficient; accuracy (weighted F-measure) for de|en and en|de; axis range 73.00 to 81.00]

Page 20:

Results – Long Story (3)

[Figure: IBM-4 (50x50 word classes), deficient (GIZA++) vs. deficient (our) vs. nondeficient; accuracy (weighted F-measure) for de|en and en|de; axis range 73 to 82]

Page 21:

Results – Long Story (4)

[Figure: IBM 3/4/5 (GIZA++), IBM-3 vs. IBM-4 vs. IBM-5; accuracy (weighted F-measure) for de|en and en|de; axis range 70 to 82]

Page 22:

Related work on Word Alignment

Models:
Brown et al. 1993; Vogel, Ney, Tillmann 1996; Wang & Waibel 1998; Melamed 2000; Marcu & Wong 2002; Deng & Byrne 2005; Fraser & Marcu 2007; Mauser et al. 2009

Algorithms:
Al-Onaizan et al. 1999; Matusov, Zens, Ney 2004; Taskar et al. 2005; Udupa & Maji 2005; Lacoste-Julien et al. 2006; Cromières, Kurohashi 2009

Regularity terms:
Liang, Taskar, Klein 2006; Graca, Ganchev, Taskar 2010; Bansal, Quirk, Moore 2011; Vaswani et al. 2012

Page 23:

Conclusion

Contributed:

• Nondeficient variants of IBM-3 and IBM-4

• Maximum Likelihood Training based on EM

- E-steps solved by hillclimbing

- M-steps solved by projected gradient ascent

Findings:

• important goal of probabilistic modeling (theoretical value)

• improvements for IBM-3 (f-measure on gold alignments)

• otherwise no improvements (IBM-4, BLEU scores)

• IBM-5 beats everything (f-measures on gold alignments)

Page 24:

Thank you!

Questions?

Source code and gold alignments online:

https://github.com/Thomas1205/RegAligner

http://user.phil-fak.uni-duesseldorf.de/~tosch/

Page 25:

The Bounding Function

Need to iteratively solve problems of the form (ignoring a constant)

exponentially many terms

previous parameters

product of factors

sum of logarithms

factor evaluated with the parameters to be currently optimized
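The formula that these labels annotate is missing from the transcript. A hedged reconstruction, using the standard EM bounding function and assumed notation (sentence pairs $(f_s, e_s)$, alignments $a$, previous parameters $\theta^{\text{old}}$, model factors $\phi_k$; none of these symbols appear on the slide), would be:

```latex
\min_{\theta} \; -\sum_{s} \underbrace{\sum_{a}}_{\text{exponentially many terms}}
  \underbrace{p_{\theta^{\text{old}}}(a \mid f_s, e_s)}_{\text{previous parameters}}
  \; \log p_{\theta}(f_s, a \mid e_s),
\qquad
\log \underbrace{p_{\theta}(f_s, a \mid e_s)}_{\text{product of factors}}
  = \underbrace{\sum_{k} \log \phi_k(\theta;\, f_s, a, e_s)}_{\text{sum of logarithms}}
```

Each $\phi_k$ is a factor evaluated with the parameters currently being optimized, while the posterior weights are held fixed at the previous parameters.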