Thomas Schoenemann University of Düsseldorf, Germany ACL 2013, Sofia, Bulgaria Training...

Thomas Schoenemann, University of Düsseldorf, Germany

ACL 2013, Sofia, Bulgaria

Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment

Outline

1. Word Alignment

2. Fertility-based models - IBM-3 and IBM-4 specifically

3. Removing Deficiency

4. Maximum Likelihood Training - expectation maximization (EM)

Word Alignment

Given: a bilingual sentence pair, e.g.

Task: find out which words correspond to each other

Considered Approach

Overall strategy:

1. Design a probabilistic model (or take an existing one) for translation and word alignment, with a (manageable) set of base probabilities

2. Learn the base probabilities from a set of training data (sentence pairs without alignments)

3. To annotate a given sentence pair: compute the most likely alignment

Approach for this talk:

- probabilistic approach

- data driven

- unsupervised: no alignments given

- based on (Brown et al. 1993)

Considered Models

• Alignment and Translation Model:

for a target sentence and a source sentence :

• Considered Alignments:

- each target word corresponds to at most one source word (mainly for computational reasons)

conditional model

alignments are hidden variables
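The formula on this slide did not survive extraction. In the usual notation of Brown et al. (1993), which is an assumption here, the target sentence is $f_1^J$, the source sentence is $e_1^I$ and the alignment is $a_1^J$. The conditional model with hidden alignments can then be sketched as:

```latex
p(f_1^J \mid e_1^I) \;=\; \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I),
\qquad a_j \in \{0, 1, \dots, I\} \quad \text{for } j = 1, \dots, J
```

Here $a_j = i$ says that target word $j$ corresponds to source word $i$, and $a_j = 0$ marks an unaligned target word, which matches the restriction that each target word corresponds to at most one source word.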

Generative Process for Fertility Based Models

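Given a source sentence $e_1^I$:

1. For $i = 1, \dots, I$: decide on the number $\Phi_i$ of target words aligned to $e_i$. Then decide on the number $\Phi_0$ of unaligned words. This gives $J = \sum_{i=0}^{I} \Phi_i$ target words.

2. For $i = 1, \dots, I$: for $k = 1, \dots, \Phi_i$: decide on the $k$th target word aligned to $e_i$

3. For $i = 1, \dots, I$: for $k = 1, \dots, \Phi_i$: decide on the position of the target word

- distortion model

- IBM-3/4/5 differ

- source of deficiency

The remaining positions are filled with the unaligned words.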

IBM-3, Distortion and Deficiency

IBM-3: deficient, we could choose j=1 (already taken)

This work: nondeficient variant of IBM-3:

Reflection

IBM-3:

This work: nondeficient variant of IBM-3:

• Same base probabilities as the original IBM-3

• Nondeficiency achieved by renormalization

• Relation to parametric models

• Need to keep track of all taken positions (just like the IBM-5)
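The renormalization idea can be illustrated with a minimal Python sketch. It assumes a table of base distortion probabilities for a single source position and a set of target positions that are still open. The function name, the variable names and the numbers are made up for illustration, and this is not the RegAligner implementation.

```python
def nondeficient_distortion(base, open_positions):
    """Renormalize base distortion probabilities over the open positions.

    base: dict mapping target position j -> base probability for one
          source position (illustrative values only)
    open_positions: set of target positions that are not yet taken
    """
    total = sum(base[j] for j in open_positions)
    return {j: base[j] / total for j in sorted(open_positions)}


# Hypothetical base probabilities for one source word and 4 target positions.
base = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}

# Position 1 is already taken. A deficient model would still assign it
# probability 0.4; the renormalized variant spreads that mass over 2, 3, 4.
open_positions = {2, 3, 4}
probs = nondeficient_distortion(base, open_positions)
```

The set of open positions changes after every placement, which is why all taken positions have to be tracked.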

IBM-4

1. Word Classes:

2. Center position of „die“ (the closest previous aligned word ):

IBM-4 (deficient):

This work: nondeficient variant of IBM-4: renormalization

Nondeficient IBM-4

Deficient IBM-4:

Nondeficient IBM-4:

Leave out position 13 because we have to place „neigeuse“ afterwards.

Training: Maximum Likelihood

Maximize the likelihood of the training corpus:

Subject to simplex/probability constraints on the base probabilities

• nonconcave maximization problem

• many local maxima

• no global algorithms known

• method of choice: expectation maximization (Dempster et al. 1977, Neal & Hinton 1998)

• for convenience: take the negative logarithm of the objective
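The objective itself is missing from the extracted slide. Under the assumption that $s$ indexes the training sentence pairs $(f_s, e_s)$ and $\theta$ collects all base probabilities (both symbols are introduced here for illustration), the negative log-likelihood problem has the following general shape:

```latex
\min_{\theta} \; -\sum_{s} \log \sum_{a} p_{\theta}(f_s, a \mid e_s)
\quad \text{s.t. every base distribution in } \theta \text{ lies on a probability simplex}
```

The sum over the hidden alignments $a$ inside the logarithm is what makes the problem nonconcave in its original (maximization) form.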

EM: A Majorize-Minimize Method

… assuming there is just one variable to minimize

- in practice there are several thousand variables, with simplex (a.k.a. probability) constraints

- and for us, the bounding functions will not be convex

[Figure: the negative log-likelihood function]
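A one-variable toy case shows the majorize-minimize behavior of EM. The sketch below is not one of the alignment models: it assumes a mixture of two fixed components with a single unknown mixing weight `pi`, and all names and numbers are illustrative. In this toy case the bounding function is convex and has a closed-form minimizer, unlike the situation described above.

```python
import math


def neg_log_likelihood(pi, a, b):
    """-sum_n log(pi * a_n + (1 - pi) * b_n) for a two-component mixture."""
    return -sum(math.log(pi * an + (1.0 - pi) * bn) for an, bn in zip(a, b))


def em_step(pi, a, b):
    """One EM step: build the bound at `pi`, then minimize it."""
    # E-step: posterior weight of the first component for every observation.
    weights = [pi * an / (pi * an + (1.0 - pi) * bn) for an, bn in zip(a, b)]
    # M-step: the minimizer of the bounding function is the mean weight.
    return sum(weights) / len(weights)


# Toy data: likelihood of each observation under the two fixed components.
a = [0.9, 0.8, 0.2, 0.7, 0.1]
b = [0.1, 0.3, 0.6, 0.2, 0.9]

pi = 0.5
history = [neg_log_likelihood(pi, a, b)]
for _ in range(20):
    pi = em_step(pi, a, b)
    history.append(neg_log_likelihood(pi, a, b))
```

Each entry of `history` is no larger than the previous one, because every bounding function lies above the negative log-likelihood and touches it at the current point.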

EM: Problems for IBM-3 and IBM-4

cf. (Udupa & Maji 2006)

For IBM-3 and IBM-4:

• evaluating the negative log likelihood at a given point is intractable (deficient + nondeficient)

• bounding function known up to weights (= expectations)

- computing the weights is intractable (deficient + nondeficient)

- there are exponentially many weights (nondeficient)

⇒ approximations

• Approximated bounding functions: non-convex (nondeficient)

⇒ local minimization with projected gradient descent (250 iterations) (e.g. Bertsekas 1999)
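Projected gradient descent under a simplex constraint can be sketched as follows. The objective is an assumption modelled on a renormalized distortion term: each term is a triple (weight, chosen position, open positions) and contributes the weighted negative log of the renormalized probability. The backtracking step size and the positivity floor are choices of this sketch and are not taken from the talk.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of v onto {p : p_k >= 0, sum_k p_k = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0.0:
            theta = t  # the last k that passes the test fixes the threshold
    return [max(x - theta, 0.0) for x in v]


def objective(p, terms):
    """Sum of -w * log(p[j] / sum of p over the open positions)."""
    return -sum(w * math.log(p[j] / sum(p[k] for k in open_pos))
                for w, j, open_pos in terms)


def gradient(p, terms):
    grad = [0.0] * len(p)
    for w, j, open_pos in terms:
        norm = sum(p[k] for k in open_pos)
        grad[j] -= w / p[j]
        for k in open_pos:
            grad[k] += w / norm
    return grad


def projected_gradient_descent(p, terms, iterations=250, floor=1e-8):
    for _ in range(iterations):
        grad = gradient(p, terms)
        step = 1.0
        while step > 1e-12:  # backtracking: halve until the objective drops
            cand = project_to_simplex([x - step * g for x, g in zip(p, grad)])
            # Pragmatic safeguard (not an exact projection): keep entries
            # strictly positive so that all logarithms stay finite.
            cand = [max(x, floor) for x in cand]
            total = sum(cand)
            cand = [x / total for x in cand]
            if objective(cand, terms) < objective(p, terms):
                p = cand
                break
            step *= 0.5
    return p


# Hypothetical weights (expectations) over 3 positions, indexed from 0.
terms = [(2.0, 0, (0, 1, 2)), (1.0, 1, (1, 2)),
         (1.0, 2, (0, 2)), (0.5, 1, (0, 1, 2))]
start = [1.0 / 3.0] * 3
result = projected_gradient_descent(start, terms)
```

Because the objective need not be convex, a run like this only reaches a local minimum that depends on the starting point.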

E-Step: Computing the Weights of the Bounding Function

• Minimizing the bounding function decomposes into smaller problems (one per probability distribution)

• Here: consider only distortion for the IBM-3. For each sentence:

- for the deficient variant: need expectation of i aligning to j

- for the nondeficient variant: need expectation of i aligning to j when choosing from a set of open positions (exponentially many)

• In both cases: hillclimbing procedure as in (Brown et al. 1993)

- gives a likely alignment and neighbors → approximate expectations

- incremental for the deficient variant (fast)

- not/only partially incremental for the nondeficient variant (slow)

• This method is also used to compute alignments
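The hillclimbing idea can be sketched with move and swap neighborhoods in the spirit of Brown et al. (1993). In the sketch below the scoring function is a hypothetical stand-in for the model probability of an alignment, and the function names are illustrative. No incremental score updates are attempted, so every neighbor is scored from scratch.

```python
def neighbors(alignment, num_source):
    """Alignments that differ from `alignment` by one move or one swap.

    alignment[j] is the source position (0 = unaligned) of target word j.
    """
    for j, a_j in enumerate(alignment):
        for i in range(num_source + 1):         # move: realign target word j
            if i != a_j:
                yield alignment[:j] + [i] + alignment[j + 1:]
    for j in range(len(alignment)):
        for k in range(j + 1, len(alignment)):  # swap two target words
            if alignment[j] != alignment[k]:
                swapped = list(alignment)
                swapped[j], swapped[k] = swapped[k], swapped[j]
                yield swapped


def hillclimb(alignment, num_source, score):
    """Go to the best neighbor until no neighbor scores higher."""
    current = list(alignment)
    current_score = score(current)
    while True:
        best = max(neighbors(current, num_source), key=score, default=current)
        if score(best) <= current_score:
            return current, current_score
        current, current_score = best, score(best)


def toy_score(alignment):
    """Hypothetical score that prefers a monotone, diagonal alignment."""
    return -sum(abs(a_j - (j + 1)) for j, a_j in enumerate(alignment))


final, final_score = hillclimb([0, 0, 0], 3, toy_score)
```

The final alignment together with its neighbors is the set over which approximate expectations would be accumulated.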

Briefly Mentioned

More contributions in the paper:

• For the IBM-3: pooled distortion model

• Reduced deficiency for the IBM-4: words can no longer be placed outside of the sentence (but still on top of one another)

• In both cases: parametric models handled via EM and projected gradient descent

based on

Experimental Setup

• Europarl German ↔ English and Spanish ↔ English

- gold alignments: my own and from Lambert et al.

• all sentences lower cased

• in deficient mode: deficient empty word model (Och & Ney 2003)

• 100000 sentence pairs (leads to 1 day running time, 8GB memory)

• Evaluation metric: weighted F-Measure (Fraser & Marcu 2007)

- accuracy measure (higher values = better alignments)

- α = 0.1 (recall more important than precision)
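The weighted F-measure can be sketched in a few lines of Python. The sketch assumes the common convention that precision is measured against possible links and recall against sure links. Whether the gold alignments used here distinguish the two is not stated on the slide, and the function and argument names are illustrative.

```python
def weighted_f_measure(predicted, sure, possible=(), alpha=0.1):
    """F_alpha = 1 / (alpha / precision + (1 - alpha) / recall)."""
    predicted, sure = set(predicted), set(sure)
    possible = set(possible) | sure      # sure links also count as possible
    if not predicted or not sure:
        return 0.0
    precision = len(predicted & possible) / len(predicted)
    recall = len(predicted & sure) / len(sure)
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


gold = {(1, 1), (2, 2)}

# Precision 1.0, recall 0.5: one correct link, one gold link missed.
high_precision = weighted_f_measure({(1, 1)}, gold)

# Precision 0.5, recall 1.0: all gold links found, two spurious links.
high_recall = weighted_f_measure({(1, 1), (2, 2), (3, 3), (4, 4)}, gold)
```

With alpha = 0.1 the second alignment scores higher than the first, which is the sense in which recall is more important than precision.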

Results – Short Story

Alignment accuracy:

• the nondeficient IBM-3 is clearly better than the deficient one

• IBM-4: all variants are on the same level, no winner

• IBM-5 beats everything

Also tried:

• phrase based translation (Moses Experiment Management System)

• training run in both directions, diag-grow-final-and

• tuning (MERT) on 750 sentence pairs

• run for all models and variants

• the various BLEU scores offer no conclusions

Results – Long Story (1)


[Figure: IBM-3, deficient vs. nondeficient; accuracy (weighted F-measure) for de|en and en|de]

[Figure: IBM-4 (50x50 word classes), deficient (GIZA++) vs. deficient (our) vs. nondeficient; accuracy (weighted F-measure) for de|en and en|de]

[Figure: IBM 3/4/5 (GIZA++), IBM-3 vs. IBM-4 vs. IBM-5; accuracy (weighted F-measure) for de|en and en|de]

Related work on Word Alignment

Models: Brown et al. 1993; Vogel, Ney, Tillmann 1996; Wang & Waibel 1998; Melamed 2000; Marcu & Wong 2002; Deng & Byrne 2005; Fraser & Marcu 2007; Mauser et al. 2009

Algorithms: Al-Onaizan et al. 1999; Matusov, Zens, Ney 2004; Taskar et al. 2005; Udupa & Maji 2005; Lacoste-Julien et al. 2006; Cromières, Kurohashi 2009

Regularity terms: Liang, Taskar, Klein 2006; Graca, Ganchev, Taskar 2010; Bansal, Quirk, Moore 2011; Vaswani et al. 2012

Conclusion

Contributed:

• Nondeficient variants of IBM-3 and IBM-4

• Maximum Likelihood Training based on EM

- E-steps solved by hillclimbing

- M-steps solved by projected gradient ascent

Findings:

• nondeficiency is an important goal of probabilistic modeling (theoretical value)

• improvements for IBM-3 (f-measure on gold alignments)

• otherwise no improvements (IBM-4, BLEU scores)

• IBM-5 beats everything (f-measures on gold alignments)

Thank you!

Questions?

Source code and gold alignments online:

https://github.com/Thomas1205/RegAligner

http://user.phil-fak.uni-duesseldorf.de/~tosch/

The Bounding Function

Need to iteratively solve problems of the form (ignoring a constant)

exponentially many terms

previous parameters

product of factors

sum of logarithms

factor evaluated with the parameters to be currently optimized
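The formula that these labels annotate is missing from the extracted text. A plausible reconstruction is the standard EM bounding function. It is written here with assumed symbols: $\theta'$ for the previous parameters, $\theta$ for the parameters being optimized, $s$ for the sentence pairs and $\phi_k$ for the individual factors of the model.

```latex
\min_{\theta} \; -\sum_{s} \sum_{a} p_{\theta'}(a \mid f_s, e_s) \, \log p_{\theta}(f_s, a \mid e_s),
\qquad
\log p_{\theta}(f_s, a \mid e_s) \;=\; \sum_{k} \log \phi_k(\theta)
```

In this reading, the sum over alignments $a$ has exponentially many terms, the weights $p_{\theta'}(a \mid f_s, e_s)$ depend on the previous parameters, and the logarithm turns the product of factors into a sum of logarithms of factors evaluated at the current $\theta$.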
