MT Book Ch. 7: Optimization
Nara Institute of Science and Technology
Overview
15/07/09 2
• MT decoding: ⟨ê, d̂⟩ = argmax_(e,d) wᵀh(f, e, d)
• Need to find w that assigns higher scores to better translations (e, d)
• Better translations = translations with lower error
• f: source sentence, e: target sentence, d: derivation, w: weight vector, h(·): feature function
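The decoding rule can be sketched over a k-best list of candidates. A minimal sketch; the weight vector and feature values below are hypothetical, not from the slides.

```python
# Sketch of linear-model decoding over a k-best list.
# Each candidate pairs a translation e with its feature vector h(f, e, d).

def score(w, h):
    """Linear model score w^T h."""
    return sum(wi * hi for wi, hi in zip(w, h))

def decode_kbest(w, candidates):
    """Pick the candidate (e, h) whose score w^T h is highest."""
    return max(candidates, key=lambda c: score(w, c[1]))

w = [0.5, -1.0]                          # hypothetical weights
candidates = [
    ("I saw a black cat", [2.0, 0.5]),   # hypothetical feature vectors
    ("saw a black cat",   [1.5, 1.0]),
]
best, _ = decode_kbest(w, candidates)
```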
Loss Minimization
• Given parallel corpus (F, E), find the w that minimizes a loss function l(·) plus a regularization term:
  ŵ = argmin_w [ l(F, E; w) + λ·R(w) ]    (second term: regularization)
• e.g., l(F, E; w) = 1 − BLEU(E, decode_w(F))
• λ is a regularization constant to avoid overfitting
Problems to Consider
1. Search space is vast
  • impossible to consider all candidates
  • the correct translation is rarely reachable
2. Approximation of the error function
  • error metrics (e.g. BLEU) are not differentiable
  • corpus-level metrics must be split to the sentence level
3. How to calculate argmax wᵀh
Batch Learning
• Given parallel corpus (F, E), initialize w and iterate:
  1. decode the whole corpus F with the current w, and get k-best lists C
  2. optimize w
  3. loop until convergence
• vs. online learning: optimize w per sentence
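The contrast between the two loops can be sketched as control flow; `decode_kbest` and the optimizer passed in are hypothetical stand-ins for a real decoder and a real weight optimizer.

```python
# Sketch of batch vs. online learning over a parallel corpus.

def batch_learn(corpus, w, decode_kbest, optimize, iterations=3):
    for _ in range(iterations):
        kbest_lists = [decode_kbest(w, f) for f in corpus]  # 1. decode whole corpus F
        w = optimize(w, kbest_lists)                        # 2. optimize w on all lists
    return w                                                # 3. loop until convergence

def online_learn(corpus, w, decode_kbest, optimize, iterations=3):
    for _ in range(iterations):
        for f in corpus:
            w = optimize(w, [decode_kbest(w, f)])  # optimize w per sentence
    return w

# Dummy decoder/optimizers just to exercise the control flow
decode = lambda w, f: [f]

batch_calls = []
def step_batch(w, lists):
    batch_calls.append(len(lists))
    return w + 1

online_calls = []
def step_online(w, lists):
    online_calls.append(len(lists))
    return w + 1

w_batch = batch_learn([1, 2, 3], 0, decode, step_batch)    # 3 updates, 3 lists each
w_online = online_learn([1, 2, 3], 0, decode, step_online) # 9 updates, 1 list each
```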
Minimum Error Rate Training (MERT)
• Given an error function error(E, Ê), directly minimize it
  • E: reference translations, Ê: system translations
  • e.g. error(E, Ê) = 1 − BLEU(E, Ê)
• In other words: ŵ = argmin_w error(E, decode_w(F))
• Since error(·) is not differentiable w.r.t. w, gradient-based methods are not applicable
• Instead, use Powell's method (gradients not required)
Powell's Method
• Iteratively fix a direction and find the optimal w in that direction
• Applicable when gradients are not available
[Figure: Powell's method trajectory w0 → w1 → w2 → w3, alternating line searches along the axes x1 and x2]
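The fix-a-direction-then-line-search procedure can be sketched as a simple derivative-free coordinate search. This is a minimal sketch, not a full Powell implementation: the quadratic objective and the grid of step sizes are illustrative stand-ins (a real run would minimize 1 − BLEU).

```python
# Derivative-free coordinate search in the spirit of Powell's method:
# repeatedly fix one direction b_m and line-search for the best step along it.

def line_search(f, w, direction, gammas):
    """Return the step gamma in `gammas` minimizing f(w + gamma * direction)."""
    return min(gammas, key=lambda g: f([wi + g * di for wi, di in zip(w, direction)]))

def coordinate_search(f, w, iterations=20):
    dims = len(w)
    gammas = [i / 10.0 for i in range(-50, 51)]  # crude grid of step sizes
    for _ in range(iterations):
        for m in range(dims):
            direction = [1.0 if i == m else 0.0 for i in range(dims)]  # b_m
            g = line_search(f, w, direction, gammas)
            w = [wi + g * di for wi, di in zip(w, direction)]
    return w

# Toy objective with its minimum at (1, -2); stands in for the error surface
f = lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2
w = coordinate_search(f, [0.0, 0.0])
```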
Optimization in One Direction
• The 1-best translation is parameterized by a scalar γ along a direction b_m:
  score(γ) = (w + γ·b_m)ᵀh(f, e, d) = wᵀh(f, e, d) + γ·h_m(f, e, d)
• b_m: one-hot vector with mth dim = 1
• Each candidate's score is a line in γ, with intercept wᵀh and slope h_m
• At each γ the candidate with the highest score is selected, so the 1-best traces the upper envelope of the lines, and the error becomes a step function of γ
[Figure: score lines for candidates c1–c4 over γ, their upper envelope, and the resulting step-shaped error curve]
e.g.) f = 黒い 猫 を 見た
      e = I saw a black cat
      c1 = I saw black cat
      c2 = saw a black cat
      …
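MERT's line search in one direction can be sketched by sampling γ: each candidate is a line (intercept wᵀh, slope h_m), the winner at each γ sits on the upper envelope, and we keep the γ with the lowest error. A sampled sketch with hypothetical candidates; exact MERT instead enumerates the envelope's intersection points.

```python
# Sketch of MERT's line search in one direction (sampled, not exact).
# Each candidate c is a line score(gamma) = intercept + slope * gamma,
# where intercept = w^T h(c) and slope = h_m(c).

def best_gamma(candidates, gammas):
    """candidates: list of (intercept, slope, error). Returns (gamma*, error)."""
    best = None
    for g in gammas:
        # candidate on the upper envelope at this gamma, and its error
        _, err = max(((i + s * g, e) for i, s, e in candidates), key=lambda t: t[0])
        if best is None or err < best[1]:
            best = (g, err)
    return best

# Hypothetical candidates c1..c3: (intercept, slope, sentence error)
cands = [(1.0, -1.0, 0.3), (0.0, 1.0, 0.1), (0.5, 0.0, 0.5)]
gammas = [i / 10.0 for i in range(-20, 21)]
g_star, err = best_gamma(cands, gammas)
```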
Corpus-level Error
• Sentence-level losses are summed to get the corpus-level error
[Figure: the sentence-level envelopes and error curves of sentence 1 and sentence 2 are added to form the multi-sentence error curve]
• γ*: find the γ that minimizes the overall error!
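Summing per-sentence errors along the shared search direction can be sketched as follows, reusing the (intercept, slope, error) representation per candidate; the sentences and values are hypothetical.

```python
# Sketch: corpus-level error along one direction is the sum, over sentences,
# of the error of whichever candidate wins at each gamma.

def corpus_error(sentences, g):
    """sentences: list of candidate lists [(intercept, slope, error), ...]."""
    total = 0.0
    for cands in sentences:
        # winner on this sentence's envelope at gamma = g
        _, err = max(((i + s * g, e) for i, s, e in cands), key=lambda t: t[0])
        total += err
    return total

sent1 = [(1.0, -1.0, 0.3), (0.0, 1.0, 0.1)]   # hypothetical candidates
sent2 = [(0.2, 0.5, 0.0), (0.4, -0.5, 0.6)]
gammas = [i / 10.0 for i in range(-20, 21)]
g_star = min(gammas, key=lambda g: corpus_error([sent1, sent2], g))
```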
Problems of Powell's Method
• Sensitive to the initialization of w
• Not suitable for high-dimensional feature vectors
Softmax Loss
• Translation probability:
  P(e, d | f) = exp(wᵀh(f, e, d)) / Σ_(e′,d′) exp(wᵀh(f, e′, d′))
• Loss is the negative log-likelihood of the oracle translations:
  l(F, E; w) = −Σ log P(ê, d̂ | f), where the oracle translations ⟨ê, d̂⟩ are the lowest-error candidates
• Gradient-based methods (e.g. L-BFGS) are applicable
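Over a k-best list, both the loss and its gradient have closed forms (the gradient is expected features minus oracle features). A minimal sketch; the candidate features and errors are hypothetical.

```python
import math

# Sketch of the softmax loss over a k-best list: P(e,d|f) is a softmax of
# w^T h, and the loss is the negative log-probability of the oracle
# (lowest-error) candidate.

def softmax_loss_and_grad(w, candidates):
    """candidates: list of (features, error). Returns (loss, gradient)."""
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h, _ in candidates]
    z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / z for s in scores]
    oracle = min(range(len(candidates)), key=lambda i: candidates[i][1])
    loss = -math.log(probs[oracle])
    # gradient = expected features under the model - oracle features
    expected = [sum(p * h[j] for p, (h, _) in zip(probs, candidates))
                for j in range(len(w))]
    grad = [expected[j] - candidates[oracle][0][j] for j in range(len(w))]
    return loss, grad

cands = [([1.0, 0.0], 0.1), ([0.0, 1.0], 0.9)]  # hypothetical (h, error) pairs
loss, grad = softmax_loss_and_grad([0.0, 0.0], cands)
```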
Max Margin Loss
• Make sure the distances between correct translations and incorrect translations are large
• For all oracle and non-oracle pairs, penalize when the difference in error is greater than the difference in score, e.g. with a hinge loss:
  l = Σ max(0, (error(E, e) − error(E, ê)) − (wᵀh(f, ê, d̂) − wᵀh(f, e, d)))
• Optimization methods for SVMs are applicable (e.g. SMO)
f: 黒い猫を見た, e (correct): I saw a black cat

                               error   score (= wᵀh)
  e* (oracle)  I saw black cat   0.1       0.4
  e  (system)  see red dog       0.9       0.3

  → difference in error: large; difference in score: small → bad!
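The pairwise penalty can be sketched directly. The 1-dimensional feature vectors below are hypothetical, chosen only so that wᵀh reproduces the scores 0.4 and 0.3 from the example.

```python
# Sketch of the margin-based loss: for each oracle / non-oracle pair,
# penalize when the difference in error exceeds the difference in score
# (a hinge loss, as in margin-rescaled SVM formulations).

def max_margin_loss(w, oracle, others):
    """oracle: (h, error); others: list of (h, error)."""
    def score(h):
        return sum(wi * hi for wi, hi in zip(w, h))
    h_star, err_star = oracle
    loss = 0.0
    for h, err in others:
        margin = (err - err_star) - (score(h_star) - score(h))
        loss += max(0.0, margin)  # zero only when the score gap covers the error gap
    return loss

# Oracle "I saw black cat" (error 0.1, score 0.4) vs. system "see red dog"
# (error 0.9, score 0.3), with hypothetical 1-d features
w = [1.0]
loss = max_margin_loss(w, ([0.4], 0.1), [([0.3], 0.9)])
```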
Pairwise Ranking Optimization (PRO)
• Parameter estimation as a ranking problem
• A classifier learns w to rank candidates by error
• Generate training examples from pairs of candidates
  • positive example: h(cand1) − h(cand2) = (−4, 6)
  • negative example: h(cand3) − h(cand1) = (3, −7)
  • wᵀ{h(cand1) − h(cand2)} > 0 ⇔ wᵀh(cand1) > wᵀh(cand2)
• Off-the-shelf linear binary classifiers can be used
f: 黒い猫を見た, e (correct): I saw a black cat

                              error   score (= wᵀh)   h
  e (cand1)  I see black cat    0.3       ???         (−1, 2)
  e (cand2)  see black dog      0.7       ???         (3, −4)
  e (cand3)  see red dog        0.9       ???         (2, −5)
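The reduction to binary classification can be sketched end to end: build feature-difference examples from candidate pairs, then fit any linear binary classifier. A plain perceptron stands in here for the off-the-shelf classifier; the candidates are the ones from the table above.

```python
from itertools import combinations

# Sketch of PRO: each pair of candidates yields a feature-difference vector,
# labeled +1 if the first candidate has lower error.

def make_pairs(candidates):
    """candidates: list of (h, error) -> list of (h_diff, label)."""
    examples = []
    for (h1, e1), (h2, e2) in combinations(candidates, 2):
        if e1 == e2:
            continue
        diff = [a - b for a, b in zip(h1, h2)]
        label = 1 if e1 < e2 else -1
        examples.append((diff, label))
        examples.append(([-d for d in diff], -label))  # mirrored example
    return examples

def perceptron(examples, dims, epochs=10):
    """Stand-in linear classifier; any linear binary classifier works."""
    w = [0.0] * dims
    for _ in range(epochs):
        for x, y in examples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

cands = [([-1.0, 2.0], 0.3), ([3.0, -4.0], 0.7), ([2.0, -5.0], 0.9)]
w = perceptron(make_pairs(cands), dims=2)
# rank candidates by learned score; should match the error order
ranked = sorted(cands, key=lambda c: -sum(wi * hi for wi, hi in zip(w, c[0])))
```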
Minimum Bayes Risk
• Minimize the expected loss:
  l = Σ_(e,d) P(e, d | f) · error(E, e),  where P(e, d | f) ∝ exp(γ·wᵀh(f, e, d))
• γ = 0: all candidates are equally likely
• γ = 1: softmax
• γ → ∞: the highest-scoring candidate gets probability 1 (MERT)
• Differentiable, and considers many candidates ⟨e, d⟩
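The expected loss and the effect of γ can be sketched over a candidate list; the features and errors are hypothetical.

```python
import math

# Sketch of the Bayes risk: expected error under a gamma-scaled softmax
# over the candidates.  gamma = 0 gives a uniform distribution;
# gamma -> infinity concentrates on the 1-best (the MERT objective).

def bayes_risk(w, candidates, gamma):
    """candidates: list of (h, error)."""
    scores = [gamma * sum(wi * hi for wi, hi in zip(w, h)) for h, _ in candidates]
    m = max(scores)                       # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum(wt / z * err for wt, (_, err) in zip(weights, candidates))

cands = [([1.0], 0.1), ([0.0], 0.9)]      # hypothetical (h, error) pairs
uniform = bayes_risk([1.0], cands, gamma=0.0)    # mean error
peaked = bayes_risk([1.0], cands, gamma=100.0)   # ~ error of the 1-best
```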
Sentence-level BLEU
• Sentence-level error functions are needed for optimization, but BLEU is a corpus-level metric
  • 4-gram precision is often 0 at the sentence level
  • sentence-level scores can diverge from human judgments
• Sentence-level alternatives:
  • sentence-level error
  • linear BLEU
  • (expected BLEU)
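A smoothed sentence-level BLEU can be sketched as follows: add-one smoothing keeps each n-gram precision strictly positive, so a zero 4-gram count no longer zeroes out the whole score. The exact smoothing scheme here is an assumption (one of several in the "BLEU+1" family), not the chapter's definition.

```python
import math
from collections import Counter

# Sketch of a smoothed sentence-level BLEU.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(ref, hyp, max_n=4):
    ref, hyp = ref.split(), hyp.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())  # clipped n-gram matches
        total = sum(h.values())
        # add-one smoothing keeps each precision strictly positive
        log_prec += math.log((match + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec)

score = sentence_bleu("I saw a black cat", "I saw black cat")
```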