Thomas Schoenemann, University of Düsseldorf, Germany
ACL 2013, Sofia, Bulgaria
Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment
Outline
1. Word Alignment
2. Fertility-based models
- IBM-3 and IBM-4 specifically
3. Removing Deficiency
4. Maximum Likelihood Training
- expectation maximization (EM)
Word Alignment
Given: a bilingual sentence pair, e.g.
Task: find out which words correspond to each other
Considered Approach
Overall strategy:
1. Design a probabilistic model (or take an existing one) for translation and word alignment, with a (manageable) set of base probabilities
2. Learn the base probabilities from a set of training data (sentence pairs without alignments)
3. To annotate a given sentence pair: compute the most likely alignment
Approach for this talk:
- probabilistic approach
- data driven
- unsupervised: no alignments given
- based on (Brown et al. 1993)
Considered Models
• Alignment and Translation Model:
for a target sentence and a source sentence (a conditional model; the alignments are hidden variables)
• Considered Alignments:
- each target word corresponds to at most one source word (mainly for computational reasons)
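The slide's formula did not survive extraction. As a hedged reconstruction, assuming the notation of (Brown et al. 1993) with a target sentence f_1^J, a source sentence e_1^I and an alignment a_1^J (these symbols are assumptions, not taken from the slide), a conditional model with hidden alignments has the form:

```latex
p(f_1^J \mid e_1^I) \;=\; \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I),
\qquad a_j \in \{0, 1, \dots, I\} \ \text{for } j = 1, \dots, J
```

Here a_j = 0 marks an unaligned target word, so each target word corresponds to at most one source word.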
Generative Process for Fertility Based Models
Given a source sentence:
1. For each source word: decide on the number of target words aligned to it (its fertility). Then decide on the number of unaligned target words.
2. For each source word and each k up to its fertility: decide on the kth target word aligned to it.
3. For each source word and each k up to its fertility: decide on the position of that target word.
- distortion model
- IBM-3/4/5 differ
- source for deficiency
The remaining positions are filled with the unaligned words.
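The generative process above can be sketched as a toy sampler. This is a minimal illustration, not the talk's implementation: the function name, the table layout (`fertility`, `translation`, the `"NULL"` entry for unaligned words) and the fixed `n_unaligned` argument are all hypothetical, and positions are drawn uniformly from the open positions as a stand-in for a real distortion model.

```python
import random


def sample_sentence(source, fertility, translation, n_unaligned, rng):
    """Toy sampler for a fertility-based model (all tables are hypothetical).

    fertility[e]   : list of probabilities for fertility 0, 1, 2, ...
    translation[e] : dict mapping target words to probabilities
    Returns (target words, alignment), where alignment[j] = 0 means unaligned.
    """
    # Step 1: one fertility per source word.
    phi = [rng.choices(range(len(fertility[e])), weights=fertility[e])[0]
           for e in source]

    # Step 2: choose the target words generated by each source word.
    words = []
    for e, n in zip(source, phi):
        table = translation[e]
        words.append(rng.choices(list(table), weights=list(table.values()), k=n)
                     if n > 0 else [])

    # Step 3: choose positions. Assumption: uniform over the open positions.
    length = sum(phi) + n_unaligned
    target = [None] * length
    alignment = [0] * length
    open_positions = list(range(length))
    for i, generated in enumerate(words, start=1):
        for w in generated:
            j = open_positions.pop(rng.randrange(len(open_positions)))
            target[j] = w
            alignment[j] = i

    # The remaining positions are filled with unaligned words.
    null_table = translation["NULL"]
    for j in open_positions:
        target[j] = rng.choices(list(null_table),
                                weights=list(null_table.values()))[0]
    return target, alignment


toy_fertility = {"das": [0.0, 1.0], "haus": [0.0, 0.0, 1.0]}
toy_translation = {"das": {"the": 1.0},
                   "haus": {"house": 0.5, "home": 0.5},
                   "NULL": {"of": 1.0}}
toy_target, toy_alignment = sample_sentence(
    ["das", "haus"], toy_fertility, toy_translation, 1, random.Random(0))
```

Because positions are only ever drawn from the open ones, this toy sampler never places two words on the same position; that is exactly the property the deficient distortion models lack.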
IBM-3, Distortion and Deficiency
IBM-3: deficient, we could choose j=1 (already taken)
This work: nondeficient variant of IBM-3:
Reflection
IBM-3:
This work: nondeficient variant of IBM-3:
• Same base probabilities as the original IBM-3
• Nondeficiency achieved by renormalization
• Relation to parametric models
• Need to keep track of all taken positions (just like the IBM-5)
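A minimal sketch of the renormalization idea, assuming a single distortion row of base probabilities over target positions (the function name and the toy numbers are hypothetical): the same base probabilities are used, but they are divided by the mass of the positions that are still open.

```python
def nondeficient_distortion(base_probs, open_positions):
    """Renormalize base distortion probabilities over the open positions.

    base_probs     : dict position -> base probability (hypothetical toy row)
    open_positions : positions that are not yet taken
    """
    mass = sum(base_probs[j] for j in open_positions)
    if mass <= 0.0:
        raise ValueError("no probability mass on the open positions")
    return {j: base_probs[j] / mass for j in sorted(open_positions)}


# Position 1 is already taken, positions 2 and 3 are still open.
toy_row = {1: 0.5, 2: 0.3, 3: 0.2}
renormalized = nondeficient_distortion(toy_row, {2, 3})
# Mass the deficient model would still assign to the taken position:
wasted_mass = toy_row[1]
```

In the deficient variant the row is used as it is, so half of the probability mass in this toy case goes to an impossible placement. The renormalized row depends on the whole set of taken positions, which is why these have to be tracked.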
IBM-4
1. Word Classes:
2. Center position of „die“ (the closest previous aligned word):
IBM-4 (deficient):
This work: nondeficient variant of IBM-4: renormalization
Nondeficient IBM-4
Deficient IBM-4:
Nondeficient IBM-4:
Leave out position 13 because we have to place „neigeuse“ afterwards.
Training: Maximum Likelihood
Maximize the likelihood of the training corpus:
Subject to simplex/probability constraints on the base probabilities
• nonconcave maximization problem
• many local maxima
• no global algorithms known
• method of choice: expectation maximization (Dempster et al. 1977, Neal & Hinton 1998)
• for convenience: take the negative logarithm of the objective
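The objective on this slide was lost in extraction. A hedged reconstruction of its general shape, assuming sentence pairs (f_s, e_s) indexed by s and base probabilities collected in θ (the symbols are assumptions, not the slide's own):

```latex
\min_{\theta} \; - \sum_{s} \log \sum_{a} p_{\theta}(f_s, a \mid e_s)
\qquad \text{s.t. } \theta \text{ satisfies the simplex (probability) constraints}
```

The inner sum runs over all alignments a of sentence pair s, which is what makes the problem nonconcave in its original (maximization) form.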
EM: A Majorize-Minimize Method
… assuming there is just one variable to minimize
- in practice there are several thousand variables, with simplex (a.k.a. probability) constraints
- and for us, the bounding functions will not be convex
[Figure: plot of the negative log likelihood function]
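To make the one-variable picture concrete, here is a toy sketch of EM as a majorize-minimize method. It is not one of the talk's models: it assumes a two-component mixture whose only unknown is the mixing weight `lam`, with hypothetical per-sample likelihoods `lik_a` and `lik_b`. Each EM step minimizes a bound on the negative log likelihood, so the recorded values never increase.

```python
import math


def neg_log_likelihood(lam, lik_a, lik_b):
    """Negative log likelihood of a two-component mixture with weight lam."""
    return -sum(math.log(lam * a + (1.0 - lam) * b)
                for a, b in zip(lik_a, lik_b))


def em_step(lam, lik_a, lik_b):
    """One EM step: compute the weights (E), then minimize the bound (M)."""
    responsibilities = [lam * a / (lam * a + (1.0 - lam) * b)
                        for a, b in zip(lik_a, lik_b)]
    return sum(responsibilities) / len(responsibilities)


def run_em(lam, lik_a, lik_b, iterations=20):
    """Run EM and record the negative log likelihood after every step."""
    trace = [neg_log_likelihood(lam, lik_a, lik_b)]
    for _ in range(iterations):
        lam = em_step(lam, lik_a, lik_b)
        trace.append(neg_log_likelihood(lam, lik_a, lik_b))
    return lam, trace


# Hypothetical likelihoods of four samples under the two components.
toy_a = [0.9, 0.8, 0.1, 0.7]
toy_b = [0.1, 0.3, 0.6, 0.2]
final_lam, nll_trace = run_em(0.5, toy_a, toy_b)
```

In this toy case the bound is convex and has a closed-form minimizer; the talk's point is that for the nondeficient models neither holds.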
EM: Problems for IBM-3 and IBM-4
cf. (Udupa & Maji 2006)
For IBM-3 and IBM-4:
• evaluating the negative log likelihood at a given point is intractable (deficient + nondeficient)
• bounding function known up to weights (= expectations)
- computing the weights is intractable (deficient + nondeficient)
- there are exponentially many weights (nondeficient)
⇒ approximations
• Approximated bounding functions: non-convex (nondeficient)
⇒ local minimization with projected gradient descent (250 iter.) (e.g. Bertsekas 1999)
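A minimal sketch of projected gradient descent under a simplex constraint. Assumptions: the objective is the simple weighted negative log, -Σ w_j log p_j, rather than the talk's non-convex bound; the sort-based Euclidean projection onto the simplex is a standard routine chosen here for illustration; the step size is hypothetical. Only the iteration count of 250 is taken from the slide.

```python
def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    sorted_desc = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, value in enumerate(sorted_desc, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        if value - candidate > 0.0:
            theta = candidate  # the last k that passes gives the shift
    return [max(x - theta, 0.0) for x in v]


def projected_gradient_descent(weights, steps=250, step_size=0.05):
    """Minimize -sum_j w_j * log(p_j) over the simplex (toy objective)."""
    total = sum(weights)
    w = [x / total for x in weights]
    p = [1.0 / len(w)] * len(w)
    for _ in range(steps):
        # Gradient of the toy objective; the floor avoids division by zero.
        gradient = [-wj / max(pj, 1e-12) for wj, pj in zip(w, p)]
        p = project_simplex([pj - step_size * g for pj, g in zip(p, gradient)])
    return p


toy_solution = projected_gradient_descent([3.0, 1.0, 1.0])
```

For this toy objective the minimizer is known in closed form (the normalized weights), which makes the result easy to check. A non-convex bounding function has no such closed form, and the same loop then only finds a local minimum.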
E-Step: Computing the Weights of the Bounding Function
• Minimizing the bounding function decomposes into smaller problems (one per probability distribution)
• Here: consider only distortion for the IBM-3. For each sentence:
- for the deficient variant: need expectation of i aligning to j
- for the nondeficient variant: need expectation of i aligning to j when choosing from a set of open positions (exponentially many)
• In both cases: hillclimbing procedure as in (Brown et al. 1993)
- gives a likely alignment and neighbors → approximate expectations
- incremental for the deficient variant (fast)
- not/only partially incremental for the nondeficient variant (slow)
• This method is also used to compute alignments
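A rough sketch of the hillclimbing idea, assuming the usual move and swap neighborhood of (Brown et al. 1993). Everything else is hypothetical: the toy scoring table `LOG_T`, the fertility penalty, and the function names. The score is recomputed from scratch for every neighbor, which corresponds to the slow, non-incremental case.

```python
import math

# Hypothetical toy log-scores LOG_T[j][i]: target position j, source index i
# (i = 0 is the empty word). Three target words, two source words.
LOG_T = [[-3.0, -0.2, -2.5],
         [-3.0, -2.0, -0.3],
         [-1.0, -2.5, -0.9]]


def toy_log_score(a):
    """Toy stand-in for the model's log probability of alignment a."""
    score = sum(LOG_T[j][i] for j, i in enumerate(a))
    for i in (1, 2):  # crude fertility penalty on the two source words
        score -= 0.5 * max(0, a.count(i) - 1)
    return score


def neighbors(a, num_source):
    """All alignments reachable from a by one move or one swap."""
    for j in range(len(a)):
        for i in range(num_source + 1):
            if i != a[j]:
                yield a[:j] + [i] + a[j + 1:]
    for j1 in range(len(a)):
        for j2 in range(j1 + 1, len(a)):
            if a[j1] != a[j2]:
                b = list(a)
                b[j1], b[j2] = b[j2], b[j1]
                yield b


def hillclimb(a, num_source, log_score):
    """Move to the best neighbor until no neighbor improves the score."""
    a, best = list(a), log_score(a)
    while True:
        candidate = max(neighbors(a, num_source), key=log_score)
        if log_score(candidate) <= best:
            return a, best
        a, best = candidate, log_score(candidate)


def approximate_link_expectations(a, num_source, log_score):
    """Expectation of j aligning to i, restricted to a and its neighbors."""
    pool = [list(a)] + list(neighbors(a, num_source))
    top = max(log_score(b) for b in pool)
    weights = [math.exp(log_score(b) - top) for b in pool]
    total = sum(weights)
    expect = [[0.0] * (num_source + 1) for _ in a]
    for b, w in zip(pool, weights):
        for j, i in enumerate(b):
            expect[j][i] += w / total
    return expect


best_alignment, best_score = hillclimb([0, 0, 0], 2, toy_log_score)
link_expectations = approximate_link_expectations(best_alignment, 2, toy_log_score)
```

The returned alignment doubles as the (approximately) most likely alignment, and the normalized scores of it and its neighbors stand in for the intractable exact expectations.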
Briefly Mentioned
More contributions in the paper:
• For the IBM-3: pooled distortion model
• Reduced deficiency for the IBM-4: words can no longer be placed outside of the sentence (but still on top of one another)
• In both cases: parametric models handled via EM and projected gradient descent
based on
Experimental Setup
• Europarl German ↔ English and Spanish ↔ English
- gold alignments: my own and from Lambert et al.
• all sentences lower cased
• in deficient mode: deficient empty word model (Och & Ney 2003)
• 100,000 sentence pairs (leads to 1 day running time, 8GB memory)
• Evaluation metric: weighted F-Measure (Fraser & Marcu 2007)
- accuracy measure (higher values = better alignments)
- α = 0.1 (recall more important than precision)
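For reference, a small sketch of the weighted F-measure in the form given by Fraser & Marcu (2007), F = 1 / (α / precision + (1 - α) / recall). The function name and the zero-handling convention are assumptions; how precision and recall are computed from the gold alignments is not shown here.

```python
def weighted_f_measure(precision, recall, alpha=0.1):
    """Weighted F-measure: 1 / (alpha / precision + (1 - alpha) / recall)."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0  # assumed convention for degenerate inputs
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# With alpha = 0.1, recall dominates: swapping the two values changes the score.
high_recall = weighted_f_measure(precision=0.5, recall=1.0)
high_precision = weighted_f_measure(precision=1.0, recall=0.5)
```

With α = 0.1, an alignment with perfect recall and mediocre precision scores far higher than the reverse, which is what "recall more important than precision" means in practice.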
Results – Short Story
Alignment accuracy:
• the nondeficient IBM-3 is clearly better than the deficient one
• IBM-4: all level, no winners
• IBM-5 beats everything
Also tried:
• phrase based translation (Moses Experiment Management System)
• training run in both directions, diag-grow-final-and
• tuning (MERT) on 750 sentence pairs
• run for all models and variants
• the various BLEU scores offer no conclusions
Results – Long Story (1)
Results – Long Story (2)
[Bar chart: IBM-3, deficient vs. nondeficient, for de|en and en|de. Vertical axis: accuracy (weighted F-measure), ticks from 73.00 to 81.00.]
Results – Long Story (3)
[Bar chart: IBM-4 (50x50 word classes), deficient (GIZA++) vs. deficient (our) vs. nondeficient, for de|en and en|de. Vertical axis: accuracy (weighted F-measure), ticks from 73 to 82.]
Results – Long Story (4)
[Bar chart: IBM 3/4/5 (GIZA++), IBM-3 vs. IBM-4 vs. IBM-5, for de|en and en|de. Vertical axis: accuracy (weighted F-measure), ticks from 70 to 82.]
Related work on Word Alignment
Models: Brown et al. 1993; Vogel, Ney, Tillmann 1996; Wang & Waibel 1998; Melamed 2000; Marcu & Wong 2002; Deng & Byrne 2005; Fraser & Marcu 2007; Mauser et al. 2009
Algorithms: Al-Onaizan et al. 1999; Matusov, Zens, Ney 2004; Taskar et al. 2005; Udupa & Maji 2005; Lacoste-Julien et al. 2006; Cromières, Kurohashi 2009
Regularity terms: Liang, Taskar, Klein 2006; Graca, Ganchev, Taskar 2010; Bansal, Quirk, Moore 2011; Vaswani et al. 2012
Conclusion
Contributed:
• Nondeficient variants of IBM-3 and IBM-4
• Maximum Likelihood Training based on EM
- E-steps solved by hillclimbing
- M-steps solved by projected gradient descent (on the negative log likelihood bound)
Findings:
• nondeficiency: an important goal of probabilistic modeling (theoretical value)
• improvements for IBM-3 (f-measure on gold alignments)
• otherwise no improvements (IBM-4, BLEU scores)
• IBM-5 beats everything (f-measures on gold alignments)
Thank you!
Questions?
Source code and gold alignments online:
https://github.com/Thomas1205/RegAligner
http://user.phil-fak.uni-duesseldorf.de/~tosch/
The Bounding Function
Need to iteratively solve problems of the form (ignoring a constant)
exponentially many terms
previous parameters
product of factors
sum of logarithms
factor evaluated with the parameters to be currently optimized
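The formula these labels annotate was lost in extraction. A hedged reconstruction of the generic EM bounding function that matches the labels, assuming sentence pairs (f_s, e_s), previous parameters θ', and a model that factorizes into base-probability factors q_k (all symbols are assumptions):

```latex
\min_{\theta} \; - \sum_{s}
\underbrace{\sum_{a}}_{\text{exponentially many terms}}
\underbrace{p_{\theta'}(a \mid f_s, e_s)}_{\text{previous parameters}}
\, \log p_{\theta}(f_s, a \mid e_s),
\qquad
\log \underbrace{\prod_{k} q_k(\theta; a, s)}_{\text{product of factors}}
= \underbrace{\sum_{k} \log q_k(\theta; a, s)}_{\text{sum of logarithms}}
```

Each q_k is a factor evaluated with the parameters currently being optimized, while the weights p_{θ'}(a | f_s, e_s) are fixed during the minimization.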