MaxForce: Max-Violation Perceptron and Forced Decoding for...
MaxForce: Max-Violation Perceptron and Forced Decoding for Scalable MT Training
Scalable Training for MT Finally Made Successful
Heng Yu (Chinese Acad. of Sciences), Liang Huang (CUNY), Haitao Mi (IBM T. J. Watson), Kai Zhao (CUNY)
[Figure: lattice over positions 0 to 6 for "Bush held talks with Sharon"]
Discriminative Training for SMT
• discriminative training is dominant in parsing / tagging
• can use arbitrary, overlapping, lexicalized features
• but not very successful yet in machine translation
• most efforts on MT training tune feature weights on the small dev set (~1k sents), not the training set!
• as a result can only use ~10 dense features (MERT)
• or ~10k rather impoverished features (MIRA/PRO)
• Liang et al (2006) train on the training set but failed
[Figure: data split into training set (>100k sentences), dev set (~1k sents), test set (~1k sents)]
Timeline for MT Training
[Figure: timeline of methods placed against the training set (>100k sentences), dev set (~1k sents), and test set (~1k sents)]
• MERT (Och ’02): dense features
• Standard Perceptron (Liang et al 2006): a noble failure
• MIRA (Watanabe+ ’07) (Chiang+ ’08-’12), PRO (Hopkins+May ’11), Regression (Bazrafshan+ ’12): pseudo sparse features
• HOLS (Flanigan+ ’13): sparse features as one dense feature
• our work (2013): violation-fixing perceptron with truly sparse features
Why previous work fails
• their learning methods are based on exact search
• MT has huge search spaces => severe search errors
• learning algorithms should fix search errors
• full updates (perceptron/MIRA/PRO) can’t fix search errors
• MT involves latent variables (derivations not annotated)
• perceptron/MIRA was not designed for latent variables
• we need better variants for perceptron
Why our approach works
• use a variant of perceptron tailored for inexact search
• fix search errors in the middle of the search
• “partial updates” instead of “full updates”
• use forced decoding lattice as the target to update to
• use parallelized minibatch to speed up learning
• result: scaled to a large portion of the training data
• 20M sparse features => +2.0 BLEU over MERT/PRO
MT as Structured Classification
• with latent variables (hidden derivations)
• input x: 那 人 咬 了 狗
• reference y: the man bit the dog (reached by all gold derivations)
• best derivation: yields y = the dog bit the man (wrong translation)
• best gold derivation: the best among all gold derivations
• update: penalize best derivation (−) and reward best gold derivation (+)
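The update on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three toy derivations, their feature names, and the helper names `score` and `latent_perceptron_update` are hypothetical and do not come from the slides. It shows one step of "penalize best derivation, reward best gold derivation" when several derivations yield the reference.

```python
from collections import defaultdict


def score(weights, feats):
    """Dot product between a sparse weight vector and sparse features."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def latent_perceptron_update(weights, derivations, reference):
    """One update step; `derivations` is a list of (translation, features).

    Returns True if the weights were changed.
    """
    best = max(derivations, key=lambda d: score(weights, d[1]))
    gold = [d for d in derivations if d[0] == reference]
    if not gold or best[0] == reference:
        return False  # unreachable reference, or the model is already right
    best_gold = max(gold, key=lambda d: score(weights, d[1]))
    for name, value in best_gold[1].items():
        weights[name] += value  # reward best gold derivation
    for name, value in best[1].items():
        weights[name] -= value  # penalize best (wrong) derivation
    return True


# Hypothetical toy derivations for x = 那 人 咬 了 狗 (feature names are made up).
DERIVATIONS = [
    ("the man bit the dog", {"order:monotone": 1, "lm:man_bit": 1}),
    ("the man bit the dog", {"order:swap": 2, "lm:man_bit": 1}),
    ("the dog bit the man", {"order:swap": 1, "lm:dog_bit": 1}),
]

weights = defaultdict(float, {"lm:dog_bit": 1.0})
changed = latent_perceptron_update(weights, DERIVATIONS, "the man bit the dog")
```

The initial weights are chosen so that the wrong translation scores highest. After one update the first gold derivation wins, and a second call makes no change.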
Outline
• Motivations
• Phrase-based Translation and Forced Decoding
• Violation-Fixing Perceptron for SMT
• Update Strategies: Early Update and Max-Violation
• Feature Design
• Experiments
Phrase-based translation
Source: 布什 与 沙龙 举行 了 会谈 (Bushi yu Shalong juxing le huitan)
• phrase pairs: Bushi → Bush; yu Shalong → with Sharon; juxing le huitan → held talks
• other translation options shown: with, Sharon, held, meetings
• coverage vectors along the derivation: _ _ _ _ _ _ → ● _ _ _ _ _ (Bush) → ● _ _ ● ● ● → ● ● ● ● ● ●
• another state shown: ● ● _ ● ● ●
• output: Bush held talks with Sharon
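The coverage-vector bookkeeping in this example can be sketched as follows. This is a minimal illustration, assuming a hypothetical toy phrase table (`PHRASES`, keyed by half-open source spans) and no distortion limit or scoring. The helper names `render`, `successors`, and `follow` are made up for this sketch.

```python
SOURCE = ["Bushi", "yu", "Shalong", "juxing", "le", "huitan"]

# Hypothetical toy phrase table: source span (start, end) -> target options.
PHRASES = {
    (0, 1): ["Bush"],
    (1, 3): ["with Sharon"],
    (3, 6): ["held talks"],
    (1, 2): ["with"],
    (2, 3): ["Sharon"],
    (3, 5): ["held"],
    (5, 6): ["talks", "meetings"],
}


def render(coverage):
    """Draw a coverage vector the way the slides do."""
    return " ".join("●" if covered else "_" for covered in coverage)


def successors(coverage, output):
    """Yield every hypothesis reachable by translating one uncovered span."""
    for (i, j), targets in PHRASES.items():
        if any(coverage[i:j]):
            continue  # span overlaps words that are already translated
        new_coverage = coverage[:i] + (True,) * (j - i) + coverage[j:]
        for target in targets:
            yield new_coverage, output + (target,)


def follow(path):
    """Apply the given target phrases in order and record coverage vectors."""
    coverage, output = (False,) * len(SOURCE), ()
    trace = [render(coverage)]
    for phrase in path:
        coverage, output = next(
            s for s in successors(coverage, output) if s[1][-1] == phrase
        )
        trace.append(render(coverage))
    return trace, " ".join(output)


trace, translation = follow(["Bush", "held talks", "with Sharon"])
```

Here `trace` holds the four coverage vectors of the derivation and `translation` is the resulting English string. A real decoder would also score each hypothesis and prune.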
Language Model and Beam Search
• split each -LM state into many +LM states
• ● _ _ _ _ _ : ... Bush
• ● _ _ ● ● ● : ... talks, ... talk, ... meeting
• ● ● ● ● ● ● : ... Sharon, ... Shalong
Forced Decoding
Source: Bushi yu Shalong juxing le huitan
Reference: Bush held talks with Sharon
• both as data selection (more literal) and oracle derivations
• +LM states: ● _ _ _ _ _ ... Bush; ● _ _ ● ● ● ... talks / talk / meeting; ● ● ● ● ● ● ... Sharon / Shalong
• one gold derivation (marked in the figure)
• gold derivation lattice (positions 0 to 6): _ _ _ _ _ _ → ● _ _ _ _ _ → ● _ _ ● ● _ → ● _ _ ● ● ● → ● ● _ ● ● ● → ● ● ● ● ● ●
• lattice edges: Bush, held, talks, with, Sharon, held talks, with Sharon
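Forced decoding can be sketched as a search that only keeps phrase applications whose output matches the reference at the current position. The code below is an illustration under stated assumptions: the toy phrase table `PHRASES` and the function name `forced_decode` are hypothetical, there is no distortion limit, and all gold derivations are enumerated rather than packed into a lattice.

```python
# Hypothetical toy phrase table: source span (start, end) -> target options.
PHRASES = {
    (0, 1): ["Bush"],
    (1, 3): ["with Sharon"],
    (3, 6): ["held talks"],
    (1, 2): ["with"],
    (2, 3): ["Sharon"],
    (3, 5): ["held"],
    (5, 6): ["talks", "meetings"],
}


def forced_decode(source_len, phrases, reference):
    """Return every derivation that covers the source and yields `reference`."""
    ref = tuple(reference.split())
    results = []

    def extend(coverage, pos, derivation):
        if all(coverage):
            if pos == len(ref):
                results.append(list(derivation))
            return
        for (i, j), targets in phrases.items():
            if any(coverage[i:j]):
                continue  # span already translated
            for target in targets:
                words = tuple(target.split())
                if ref[pos:pos + len(words)] != words:
                    continue  # output would leave the reference
                new_coverage = coverage[:i] + (True,) * (j - i) + coverage[j:]
                extend(new_coverage, pos + len(words),
                       derivation + [((i, j), target)])

    extend((False,) * source_len, 0, [])
    return results


gold = forced_decode(6, PHRASES, "Bush held talks with Sharon")
unreachable = forced_decode(6, PHRASES, "Bush met Sharon")
```

With this toy table the reference has several gold derivations, because "held talks" and "with Sharon" can each be produced by one phrase or by two. A reference that the phrase table cannot produce yields an empty list, which is the data-selection effect mentioned on the slide.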
Unreachable Sentences and Prefix
Source: 联合国 派遣 50 名 观察员 监督 玻利维亚 恢复 民主 政治 以来 首次 全国 大选
Pinyin: Lianheguo paiqian 50 ming guanchayuan jiandu Boliweiya huifu minzhu zhengzhi yilai shouci quanguo daxuan
Reference: U.N. sent 50 observers to monitor the 1st election since Bolivia restored democracy
[Figure: word alignment between the source and the reference, with numeric labels]
• distortion limit causes unreachability (hiero would be better)
• but we can still use reachable prefix-pairs of unreachable pairs
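The effect of a distortion limit on reachability can be illustrated with a small sketch. It assumes a simple jump-distance criterion (distance between the end of the previous source span and the start of the next), which is a simplification; real decoders may define the limit differently. The names `jumps` and `reachable_prefix` are hypothetical, and the spans reuse the earlier "Bush held talks with Sharon" example rather than the Bolivia sentence.

```python
def jumps(spans):
    """Jump distance before each phrase, for source spans in target order."""
    previous_end = 0
    distances = []
    for start, end in spans:
        distances.append(abs(start - previous_end))
        previous_end = end
    return distances


def reachable_prefix(spans, limit):
    """Number of leading phrases that can be produced within the limit."""
    count = 0
    for distance in jumps(spans):
        if distance > limit:
            break
        count += 1
    return count


# Source spans in target order: Bush (0,1), held talks (3,6), with Sharon (1,3).
SPANS = [(0, 1), (3, 6), (1, 3)]
full = reachable_prefix(SPANS, limit=6)     # whole derivation is allowed
partial = reachable_prefix(SPANS, limit=4)  # only a prefix survives
```

Under the tighter limit the last phrase needs a jump that is too long, so only the prefix "Bush held talks" remains usable in this toy setting.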
Sentence/Word Reachability Ratio
• how many sentence pairs pass forced decoding?
• the ratio drops dramatically as sentences get longer
• prefixes boost coverage
[Figure: ratio of complete coverage (0% to 100%) vs. sentence length (10 to 70); curves for distortion-unlimited and distortion limits 6, 4, 2, 0]
[Figure: ratio of complete coverage (0% to 100%) vs. sentence length (10 to 70); curves dist-6, dist-4, dist-2, dist-0]
Number of Gold Derivations
• exponential in sentence length (on fully reachables)
• these are the “latent variables” in learning
[Figure: average number of derivations (0 to 100000) vs. sentence length (5 to 50); curves dist-6, dist-4, dist-2, dist-0]
Outline
• Background: Phrase-based Translation (Koehn, 2004)
• Forced Decoding
• Violation-Fixing Perceptron for MT Training
• Update strategy
• Feature design
• Experiments
Structured Perceptron (Collins 02)
• challenges in applying perceptron for MT
• the inference (decoding) is vastly inexact (beam search)
• we know standard perceptron doesn’t work for MT
• intuition: the learner should fix the search error first
• binary classification: x → y = +1 or y = −1; constant # of classes; exact inference gives z; update weights w if y ≠ z
• structured classification: x = 那 人 咬 了 狗, y = the man bit the dog; exponential # of classes; inference gives z (exact in the standard perceptron, inexact for MT); update weights w if y ≠ z
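For reference, the standard structured perceptron with exact inference can be sketched on a toy tagging task. This is an illustration only: the two training sentences, the tag set, and the feature templates in `phi` are hypothetical, and exact inference is done by brute-force enumeration, which is feasible only because the example is tiny. The update is the usual one: if the prediction z differs from y, add the features of y and subtract the features of z.

```python
from collections import Counter, defaultdict
from itertools import product

TAGS = ("N", "V")

# Hypothetical toy training data: (words, gold tags).
DATA = [
    (("dogs", "bark"), ("N", "V")),
    (("cats", "sleep"), ("N", "V")),
]


def phi(words, tags):
    """Sparse features: word/tag pairs and tag bigrams."""
    feats = Counter()
    previous = "<s>"
    for word, tag in zip(words, tags):
        feats[("word", word, tag)] += 1
        feats[("bigram", previous, tag)] += 1
        previous = tag
    return feats


def predict(weights, words):
    """Exact inference by enumerating all tag sequences (exponentially many)."""
    def total(tags):
        return sum(weights.get(f, 0.0) * v for f, v in phi(words, tags).items())
    return max(product(TAGS, repeat=len(words)), key=total)


def train(data, epochs=5):
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            guess = predict(weights, words)
            if guess != gold:  # update weights if y != z
                for f, v in phi(words, gold).items():
                    weights[f] += v
                for f, v in phi(words, guess).items():
                    weights[f] -= v
    return weights


weights = train(DATA)
```

With beam search in place of the enumeration in `predict`, the returned z may no longer be the true argmax, which is the inexact-inference problem the slide points to.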
![Page 57: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/57.jpg)
Search Error: Gold Derivations Pruned
16
_ _ _ _ _ _
0 1 2 3 4 5 6
● _ _ _ _ _
_ _ ●_ _ __ _ _ _ ● _
_ _ _ _ _ ●● ● _ ● _ _
_ ● ● ● _ __ _ ● ● ●_
_ ● ● ● _ _
● _ _ ● ● _
● _ ● ● _ ●_ ● _ ● ● ●
● _ _ ● ● ●_ ● ● _ _ __ _ ● ●_ _
_ _ ● _ ● _
_ ● _ _ ● _
_ _ _ _ _ _ ● _ _ _ _ _ ● _ _ ● ● _ ● _ _ ● ● ● ● ● _ ● ● ● ● ● ● ● ● ●
0 1 2 3 4 5 6
Bushheld
talks with Sharon
held talks with Sharongold derivation lattice
real decoding beam search
![Page 58: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/58.jpg)
Search Error: Gold Derivations Pruned
[Figure: real decoding beam search (top) and the gold derivation lattice (bottom), both over steps 0 to 6. The lattice edges are labeled "Bush", "held", "talks with Sharon", and "held talks with Sharon". The gold derivations are pruned from the decoding beam.]
should fix search errors here!
![Page 64: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/64.jpg)
Fixing Search Error 1: Early Update
• early update (Collins and Roark, 2004): update as soon as the correct sequence falls off the beam
• up to this point the incorrect prefix scores higher than the correct one
• that’s a “violation”, which we want to fix
• the standard perceptron update does not guarantee a violation
• with pruning, the correct sequence might score higher at the end!
• such an update is called “invalid” because it doesn’t fix the search error
[Figure: beam search under the model. The correct sequence falls off the beam (pruned), and the early update is made at that point, where a violation is guaranteed: the incorrect prefix scores higher up to this point. The standard update at the end of the sequence has no such guarantee.]
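The early-update rule can be sketched on a toy tagging task. This is a minimal illustration, not the MT decoder from the talk: the feature set (`emit`, `trans`), the function names, and the tiny example are all assumptions made up for the sketch. The update is applied at the first step where the gold prefix is no longer in the beam, so the prefix it is compared against is guaranteed to score at least as high.

```python
from collections import defaultdict

def features(words, tags):
    """Emission and tag-bigram counts for a (partial) tag sequence."""
    feats = defaultdict(int)
    prev = "<s>"
    for word, tag in zip(words, tags):
        feats[("emit", word, tag)] += 1
        feats[("trans", prev, tag)] += 1
        prev = tag
    return feats

def score(weights, feats):
    """Dot product of the weight vector with a sparse feature vector."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def early_update_step(weights, words, gold_tags, tagset, beam_size=1):
    """Run beam search once and apply at most one perceptron update.

    Returns the step at which the update happened, or None if none was needed.
    """
    beam = [()]
    for i in range(1, len(words) + 1):
        candidates = [p + (t,) for p in beam for t in tagset]
        candidates.sort(key=lambda p: score(weights, features(words[:i], p)),
                        reverse=True)
        beam = candidates[:beam_size]
        gold_prefix = tuple(gold_tags[:i])
        at_end = i == len(words)
        # Early update: the gold prefix fell off the beam, so every item still
        # in the beam scores at least as high as it (a violation).
        # At the last step, also update if the gold sequence is not on top.
        if gold_prefix not in beam or (at_end and beam[0] != gold_prefix):
            for f, v in features(words[:i], gold_prefix).items():
                weights[f] = weights.get(f, 0.0) + v
            for f, v in features(words[:i], beam[0]).items():
                weights[f] = weights.get(f, 0.0) - v
            return i
    return None

weights = {}
words, gold, tagset = ["a", "b"], ["X", "Y"], ["Y", "X"]
first = early_update_step(weights, words, gold, tagset)
second = early_update_step(weights, words, gold, tagset)
```

With a beam of size 1 and all-zero weights, the gold prefix is pruned at the first step, so the first call updates on a one-word prefix only.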
![Page 71: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/71.jpg)
Early Update w/ Latent Variable
• the gold-standard derivations are not annotated
• we treat any reference-producing derivation as good
[Figure: gold derivation lattice over steps 0 to 6, with edges labeled "Bush", "held", "talks with Sharon", and "held talks with Sharon", and beam search under the model. Once all correct derivations fall off the beam, the early update is made and decoding stops. A violation is guaranteed there: the incorrect prefix scores higher up to this point.]
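A minimal sketch of the "all correct derivations fall off" test, under simplifying assumptions: a derivation is a tuple of target phrases, a step is the number of phrases applied so far (a real phrase-based decoder would index steps by source words covered), and the gold derivations are given as an explicit list. The function names and the toy beams are hypothetical.

```python
def gold_prefixes(gold_derivations):
    """Map each step i to the set of length-i prefixes of any gold derivation."""
    prefixes = {}
    for derivation in gold_derivations:
        for i in range(1, len(derivation) + 1):
            prefixes.setdefault(i, set()).add(tuple(derivation[:i]))
    return prefixes

def first_fall_off(beams, gold_derivations):
    """Return the first step at which no beam item is a gold prefix, else None.

    beams[i - 1] holds the partial derivations kept at step i.
    """
    good = gold_prefixes(gold_derivations)
    for i, beam in enumerate(beams, start=1):
        if not any(tuple(item) in good.get(i, set()) for item in beam):
            return i  # every reference-producing prefix has been pruned
    return None

# Two derivations that both produce "Bush held talks with Sharon".
gold = [
    ("Bush", "held", "talks with Sharon"),
    ("Bush", "held talks with Sharon"),
]
# Toy beams: a gold prefix survives step 1 but none survives step 2.
beams = [
    [("Bush",), ("Sharon",)],
    [("Bush", "held talks"), ("Sharon", "held")],
]
update_step = first_fall_off(beams, gold)
```

Because any reference-producing derivation counts as good, the update is triggered only when the beam contains no gold prefix at all, not when one particular derivation is pruned.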
![Page 72: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/72.jpg)
Fixing Search Error 2: Max-Violation
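• early update works but learns slowly because its updates are partial (prefix-only)
• max-violation: update at the prefix where the violation is largest
• the “worst mistake” in the search space
• we call these methods “violation-fixing perceptrons” (Huang et al 2012)
[Figure: beam search under model w, showing the best and the worst item in the beam at each step and the update points labeled early, max-violation, local, and std; the standard update is invalid. Labeled derivations: $d^-_i$, $d^+_i$, $d^+_{i^*}$, $d^-_{i^*}$, $d^+_{|x|}$, $d^y_{|x|}$, $d^-_{|x|}$.]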
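Choosing the update point can be sketched from per-step scores. The sketch assumes two score traces are already available (the best correct prefix at each step, for example from forced decoding, and the best incorrect item in the beam); the function names and the numbers are made up for illustration.

```python
def violations(correct_scores, incorrect_scores):
    """Per-step violation: best in-beam incorrect score minus best correct score."""
    return [bad - good for good, bad in zip(correct_scores, incorrect_scores)]

def early_step(in_beam):
    """First step (1-based) at which the correct prefix is not in the beam."""
    for i, kept in enumerate(in_beam, start=1):
        if not kept:
            return i
    return None

def max_violation_step(correct_scores, incorrect_scores):
    """Step (1-based) with the largest violation, or None if nothing is violated."""
    v = violations(correct_scores, incorrect_scores)
    if not v:
        return None
    best = max(range(len(v)), key=v.__getitem__)
    return best + 1 if v[best] > 0 else None

correct = [1.0, 1.5, 1.0, 2.0]      # hypothetical scores of the best correct prefix
incorrect = [0.5, 2.0, 3.0, 2.5]    # hypothetical scores of the best incorrect item
in_beam = [True, False, False, False]

early = early_step(in_beam)
max_viol = max_violation_step(correct, incorrect)
```

In this toy trace the early update would fire at the step where the correct prefix first falls off, while max-violation picks a later step where the score gap is largest.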
![Page 82: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/82.jpg)
Early Update vs. Max-Violation
[Figure: beam search over steps 0 to 6, marking the steps at which the early update and the max-violation update are made. Inset: the update-point diagram under model w (early, max-violation, local, std; best and worst in the beam; the standard update is invalid).]
![Page 83: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/83.jpg)
Latent-Variable Perceptron
[Figure: update points along the correct sequence, drawn between the best and the worst item in the beam:
• early: where the correct sequence falls off the beam
• max-violation: the biggest violation
• latest: the last valid update
• full (standard): invalid update!
Inset: the update-point diagram under model w (early, max-violation, local, std; the standard update is invalid).]
![Page 88: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/88.jpg)
Roadmap of the techniques
• structured perceptron (Collins, 2002)
• latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009)
• perceptron w/ inexact search (Collins & Roark, 2004; Huang et al 2012)
• latent-variable perceptron w/ inexact search (Yu et al 2013)
• hiero, syntactic parsing, semantic parsing, transliteration
![Page 89: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/89.jpg)
Feature Design
• Dense features:
  • standard phrase-based features (Koehn, 2004)
• Sparse features:
  • rule-identification features (unique id for each rule)
  • word-edges features
    • lexicalized local translation context within a rule
  • non-local features
    • dependency between consecutive rules
![Page 101: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/101.jpg)
WordEdges Features (local)
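[Figure: source sentence "<s> 布什 与 沙龙 举行 了 会谈 </s>" (Bush / with / Sharon / hold / past marker / talks) and target "<s> Bush held a few talks", with rule r1 (布什 → Bush) followed by rule r2, which produces "held a few talks".]
• the first and last Chinese words in the rule
• the first and last English words in the rule
• the two Chinese words surrounding the rule
Combo Features:
100010=沙龙|held
010001=举行|talks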
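A minimal sketch of how such combo features could be generated. Two things are assumptions inferred from the two examples on the slide, not stated in it: the slot order (left neighbour, first Chinese word, last Chinese word, right neighbour, first English word, last English word) and the source span of rule r2 (举行 了 会谈). The function names are hypothetical.

```python
def word_edges(source, src_span, target):
    """Six WordEdges slots for one rule.

    source   -- source words including <s> and </s>
    src_span -- (start, end) half-open span of the rule in `source`
    target   -- target words produced by the rule
    """
    start, end = src_span
    return (
        source[start - 1],  # Chinese word to the left of the rule
        source[start],      # first Chinese word in the rule
        source[end - 1],    # last Chinese word in the rule
        source[end],        # Chinese word to the right of the rule
        target[0],          # first English word in the rule
        target[-1],         # last English word in the rule
    )

def combo(slots, mask):
    """Join the slots selected by a 0/1 mask, e.g. '100010'."""
    picked = [slot for slot, bit in zip(slots, mask) if bit == "1"]
    return mask + "=" + "|".join(picked)

source = ["<s>", "布什", "与", "沙龙", "举行", "了", "会谈", "</s>"]
r2_slots = word_edges(source, (4, 7), ["held", "a", "few", "talks"])
feature_a = combo(r2_slots, "100010")
feature_b = combo(r2_slots, "010001")
```

Under these assumptions the two masks reproduce the two feature strings shown on the slide.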
![Page 108: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/108.jpg)
Lexical backoffs and combos
• Lexical features are often too sparse
• 6 kinds of lexical backoffs with various budgets
• total budget can’t exceed 10 (bilexical)
[Figure: example derivation over the source sentence <s> 布什 与 沙龙 举行 了 会谈 </s>; rule r1 produces "Bush", rule r2 produces "held a few talks"]
Example features:
• 100010=沙龙|held
• P00010=NN|held
• 010001=举行|talks
• 0c0001=举|talks
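The template strings above can be read as one backoff choice per slot. The following is a minimal sketch of that reading, not the authors' implementation. The slot layout (four source-side slots followed by two target-side slots), the per-kind costs in `COST`, and the helper names are all assumptions made for illustration. Only three of the six backoff kinds (full word `1`, POS tag `P`, first character `c`) are modelled, since those are the only ones visible in the examples.

```python
from itertools import product

# Hypothetical costs: the slides only say that backoffs have "various budgets"
# and that the total may not exceed 10 (a bilexical feature).
COST = {"0": 0, "1": 5, "P": 2, "c": 3}


def backoff(kind, word, pos_tag):
    """Return the backed-off form of one slot, or None if the slot is unused."""
    if kind == "1":
        return word        # full lexical item
    if kind == "P":
        return pos_tag     # part-of-speech backoff
    if kind == "c":
        return word[0]     # first-character backoff
    return None


def templates(num_slots, budget=10):
    """Yield every non-empty template whose total backoff cost fits the budget."""
    for kinds in product(COST, repeat=num_slots):
        cost = sum(COST[k] for k in kinds)
        if 0 < cost <= budget:
            yield "".join(kinds)


def instantiate(template, slots):
    """Build a feature string; slots is a list of (word, pos_tag) pairs."""
    parts = [backoff(k, w, t) for k, (w, t) in zip(template, slots) if k != "0"]
    return template + "=" + "|".join(parts)


# Hypothetical slot contents for the example sentence (tags are illustrative).
slots = [("沙龙", "NN"), ("举行", "VV"), ("了", "AS"), ("会谈", "NN"),
         ("held", "VBD"), ("talks", "NNS")]
print(instantiate("100010", slots))
print(instantiate("0c0001", slots))
```

With these assumed costs, a template that uses two full words spends the whole budget, which matches the "bilexical" cap on the slide; cheaper backoffs leave room for combining more slots.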
![Page 109: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/109.jpg)
Non-Local Features (trivial)
[Figure: example derivation over the source sentence <s> 布什 与 沙龙 举行 了 会谈 </s>; rule r1 produces "Bush", rule r2 produces "held a few talks"]
• two consecutive rule ids (rule bigram model)
• the last two English words and the current rule
• should explore a lot more!
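The two non-local templates listed above could be extracted along these lines. This is only a sketch: the function name, the feature-string format, and the way the English prefix is passed in are assumptions, not taken from the slides.

```python
def non_local_features(prev_rule_id, rule_id, english_prefix):
    """Return the two non-local features for applying rule_id after prev_rule_id.

    english_prefix is the list of English words produced so far,
    including the <s> marker (hypothetical representation).
    """
    last_two = "_".join(english_prefix[-2:])
    return [
        "rule_bigram=" + prev_rule_id + "|" + rule_id,  # two consecutive rule ids
        "last2_rule=" + last_two + "|" + rule_id,       # last two English words + current rule
    ]


print(non_local_features("r1", "r2", ["<s>", "Bush"]))
```

Both features depend on context outside the current rule, which is why they cannot be precomputed per rule and have to be evaluated during search.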
![Page 120: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/120.jpg)
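Experiments
• Data sets

| Scale | Language | sent. | dev | tst |
|---|---|---|---|---|
| Small | Ch-En | 30k | nist06 news | nist08 news |
| Large | Ch-En | 240k | nist06 news | nist08 news |
| Large | Sp-En | 170k | newstest2012 | newstest2013 |

• annotations on the table, in order of appearance: 10x dev, 120x dev, 31x dev
• Sp-En reachable ratio: 55% (sent.), 43.9% (word)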
![Page 124: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/124.jpg)
Perceptron: std, early, and max-violation
• standard perceptron (Liang et al’s “bold”) works poorly
• b/c invalid update ratio is very high (search quality is low)
• max-violation converges faster than early update
• this explains why Liang et al ’06 failed
• std ~ “bold”; local ~ “local”
[Figure: BLEU (17 to 26) vs. number of iterations (2 to 20) for MaxForce, MERT, early, local, and standard]
[Figure: ratio of invalid updates (50% to 90%) vs. beam size (2 to 30) for the standard perceptron, with a curve labelled "+non-local feature"]
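The max-violation update named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that, for each search step, the feature vector of the best gold (forced-decoding) prefix and of the best prefix in the beam have already been collected, and all names are hypothetical.

```python
from collections import Counter


def score(w, feats):
    """Dot product between a sparse weight dict and a sparse feature Counter."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())


def max_violation_update(w, gold_prefixes, beam_prefixes):
    """Update w in place at the step with the largest violation.

    gold_prefixes[i] / beam_prefixes[i]: feature Counters of the best gold
    prefix and the best beam prefix at step i (assumed precomputed).
    Returns the chosen step, or None if no step violates.
    """
    best_step, best_violation = None, 0.0
    for i, (gold, beam) in enumerate(zip(gold_prefixes, beam_prefixes)):
        violation = score(w, beam) - score(w, gold)
        if violation > best_violation:
            best_step, best_violation = i, violation
    if best_step is None:
        return None  # updating here would be an invalid update, so skip it
    for f, v in gold_prefixes[best_step].items():
        w[f] = w.get(f, 0.0) + v   # reward the gold prefix
    for f, v in beam_prefixes[best_step].items():
        w[f] = w.get(f, 0.0) - v   # penalise the wrongly preferred prefix
    return best_step


w = {"a": 1.0}
gold = [Counter({"g": 1}), Counter({"g": 2})]
beam = [Counter({"a": 1}), Counter({"a": 3})]
print(max_violation_update(w, gold, beam), w)
```

Early update would instead stop at the first step where the gold prefix falls off the beam, and the standard perceptron would always update on the full sequence, even when the gold derivation already outscores the search output.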
![Page 125: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/125.jpg)
Parallelized Perceptron
• mini-batch perceptron (Zhao and Huang, 2013) much faster than iterative parameter mixing (McDonald et al, 2010)
• 6 CPUs => ~4x speedup; 24 CPUs => ~7x speedup
[Figure: BLEU (22 to 24) vs. time (0 to 4) for MERT, PRO-dense, minibatch (24-core), minibatch (6-core), minibatch (1 core), and single processor]
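The idea behind the mini-batch parallelization can be sketched as below. This is a simplified illustration, not the Zhao and Huang (2013) implementation: `find_update` is a hypothetical callback that decodes one example under the current weights and returns a sparse update (empty if there is no violation), and summing the per-example updates is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def minibatch_epoch(w, examples, find_update, batch_size=4, workers=2):
    """Run one pass over examples, applying updates once per mini-batch."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            # Every example in the batch is decoded against the same weights,
            # so the decoding work can be spread across workers.
            deltas = list(pool.map(lambda ex: find_update(w, ex), batch))
            # Weights change only after the whole batch has been decoded.
            for delta in deltas:
                for f, v in delta.items():
                    w[f] = w.get(f, 0.0) + v
    return w


def toy_update(w, ex):
    """Toy stand-in for decoding: ask for +1 on a feature until it reaches 1."""
    return {ex: 1.0} if w.get(ex, 0.0) < 1 else {}


print(minibatch_epoch({}, ["a", "a", "b"], toy_update, batch_size=2))
```

In the toy run both copies of "a" fall in the same batch and see the same weights, so both contribute an update; a purely sequential perceptron would have updated only once. Python threads are used here only to show the structure; a real speedup on CPU-bound decoding would need processes or separate machines.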
![Page 130: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/130.jpg)
Internal comparison with different features
• dense: 11 standard features for phrase-based MT
• ruleid: rule identification feature
• word-edges: word-edges features with back-offs
• non-local: non-local features with back-offs
[Figure: BLEU (18 to 26) vs. number of iterations (2 to 16) for MERT, dense, +ruleid, +word-edges, and +non-local]
• plot annotations: dense: 11 features; ruleid: 0.1%; wordedges: 99.6%; non-local: 0.3%
• gains marked on the plot, in order of appearance: +0.9 bleu, +2.3, +0.7
![Page 131: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/131.jpg)
External comparison with MERT & PRO
• MERT, PRO-dense/medium/sparse all tune on dev-set
• PRO-sparse uses the same features as ours
[Figure: BLEU (10 to 26) vs. number of iterations (2 to 16) for MaxForce, MERT, PRO-dense, PRO-medium, and PRO-large]
![Page 139: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/139.jpg)
Final Results on FBIS Data

| System | Alg. | Tune on | Features | Dev | Test |
|---|---|---|---|---|---|
| Moses | MERT | dev set | 11 | 25.5 | 22.5 |
| Cubit | MERT | dev set | 11 | 25.4 | 22.5 |
| Cubit | PRO | dev set | 11 | 25.6 | 22.6 |
| Cubit | PRO | dev set | 3k | 26.3 | 23.0 |
| Cubit | PRO | dev set | 36k | 17.7 | 14.3 |
| Cubit | MaxForce | Train set | 23M | 27.8 | 24.5 |

• MaxForce over MERT: +2.3 (Dev), +2.0 (Test)
• Moses: state-of-the-art phrase-based system in C++
• Cubit: phrase-based system (Huang and Chiang, 2007) in python
• almost identical baseline scores with MERT
• max-violation takes ~47 hours on 24 CPUs (23M features)
![Page 141: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/141.jpg)
Results on Spanish-English set
• Data-set: Europarl corpus, 170k sentences
• dev/test set: newstest2012 / 2013 (one-reference only)
• +1 in 1-ref bleu ~ +2 in 4-ref bleu
• bleu improvement is comparable to Chinese w/ 4-refs

| system | algorithm | #feat. | dev | test |
|---|---|---|---|---|
| Moses | MERT | 11 | 27.4 | 24.4 |
| Cubit | MaxForce | 21M | 28.7 | 25.5 |

• MaxForce over MERT: +1.3 (dev), +1.1 (test)
• Sp-En reachable ratio: 55% (sent.), 43.9% (word)
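The two reachable ratios above can differ because reachable sentences tend to be shorter than unreachable ones. A rough sketch of how such ratios might be computed is shown below. It assumes that the word ratio counts source-side words in fully reachable sentences, which the slides do not spell out, and `is_reachable` is a hypothetical stand-in for a forced-decoding check.

```python
def reachable_ratios(corpus, is_reachable):
    """Return (sentence ratio, word ratio) of the reachable part of a corpus.

    corpus: list of (source_tokens, reference_tokens) pairs.
    is_reachable: hypothetical predicate, True if forced decoding can
    reproduce the reference from the source.
    """
    reachable = [pair for pair in corpus if is_reachable(*pair)]
    sent_ratio = len(reachable) / len(corpus)
    total_words = sum(len(src) for src, _ in corpus)
    word_ratio = sum(len(src) for src, _ in reachable) / total_words
    return sent_ratio, word_ratio


# Toy corpus: pretend only sources of at most 3 tokens are reachable.
toy_corpus = [(["a", "b"], ["x"]), (["a", "b", "c", "d", "e", "f"], ["y"])]
print(reachable_ratios(toy_corpus, lambda src, ref: len(src) <= 3))
```

In the toy corpus half of the sentences are reachable but they hold only a quarter of the words, the same direction of gap as the 55% vs. 43.9% figures.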
![Page 142: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/142.jpg)
Conclusion
• a simple yet effective online learning approach for MT
• scaled to (a large portion of) the training set for the first time
• able to incorporate 20M sparse lexicalized features
• no need to define BLEU+1, or hope/fear derivations
• no learning rate or hyperparameters
• +2.3/+2.0 BLEU points better than MERT/PRO
• the three ingredients that made it work
• violation-fixing perceptron: early-update and max-violation
• forced decoding lattice helps
• minibatch parallelization scales it up to big data
![Page 148: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/148.jpg)
Roadmap of the techniques
• structured perceptron (Collins, 2002)
• latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009)
• perceptron w/ inexact search (Collins & Roark, 2004; Huang et al 2012)
• latent-variable perceptron w/ inexact search (Yu et al 2013)
• hiero, syntactic parsing, semantic parsing, transliteration
• replacing EM for partially-observed data
![Page 149: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/149.jpg)
20 years of Statistical MT
• word alignment: IBM models (Brown et al 90, 93)
• translation model (choose one from below)
• SCFG (ITG: Wu 95, 97; Hiero: Chiang 05, 07) or STSG (GHKM 04, 06; Liu+ 06; Huang+ 06)
• PBMT (Och+Ney 02; Koehn et al 03)
• evaluation metric: BLEU (Papineni et al 02)
• decoding algorithm: cube pruning (Chiang 07; Huang+Chiang 07)
• training algorithm (choose one from below)
• MERT (Och 03): ~10 dense features on dev set
• MIRA (Chiang et al 08-12) or PRO (Hopkins+May 11): ~10k feats on dev set
• MaxForce: 20M+ feats on training set; +2/+1.5 BLEU over MERT/PRO
• Max-Violation Perceptron with Forced Decoding: fixes search errors
• first successful effort of online large-scale discriminative training for MT
![Page 150: MaxForce: Max-Violation Perceptron and Forced Decoding for ...web.engr.oregonstate.edu/~huanlian/slides/maxforce-anim.pdf · Discriminative Training for SMT • discriminative training](https://reader035.fdocuments.in/reader035/viewer/2022081402/5f0d5f877e708231d43a07fe/html5/thumbnails/150.jpg)
When learning with vastly inexact search, you should use a principled method such as max-violation.
Thank you!