NRC Report Conclusion
Tu Zhaopeng, 2009-09-08
NIST06
The Portage System
For Chinese large-track entry, used simple,
but carefully-tuned, phrase-based system:
Pre-process source text
Viterbi decoding using loglinear model
Nbest rescoring using fancier loglinear model
Post-process raw translation
NIST06
Pre-processing:
Convert to GB2312, removing traditional
characters with no GB2312 representation
Segment using LDC segmenter
Translate numbers and dates using rules
Strip non-ASCII OOVs
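A minimal sketch of the first and last steps above, assuming Python's built-in `gb2312` codec stands in for the conversion. It only drops characters that have no GB2312 code point and does not map traditional characters to simplified ones. The function names and the vocabulary argument are hypothetical, not part of Portage.

```python
def to_gb2312_safe(text: str) -> str:
    """Drop every character that has no GB2312 representation."""
    return text.encode("gb2312", errors="ignore").decode("gb2312")


def strip_non_ascii_oovs(tokens, vocab):
    """Remove out-of-vocabulary tokens that contain non-ASCII characters.

    ASCII OOVs (names, numbers, URLs) are kept so they can pass through
    to the output untranslated.
    """
    return [t for t in tokens if t in vocab or t.isascii()]
```

Keeping ASCII OOVs is an assumption about why only non-ASCII ones are stripped: untranslated Chinese tokens are noise in English output, while ASCII strings are often usable as they are.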
NIST06
Post-processing
Truecase using 4-gram HMM (via SRILM disambig)
trained on parallel corpus
Detokenization heuristics
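The slides do not list the detokenization heuristics, so the following is only an illustrative sketch of what such rules commonly look like for English. The three regular-expression rules and the function name are assumptions, not Portage's actual rule set.

```python
import re


def detokenize(tokens):
    """Join tokens and tidy spacing with a few illustrative English rules."""
    text = " ".join(tokens)
    # No space before closing punctuation: "word ," -> "word,"
    text = re.sub(r"\s+([,.;:!?%)\]])", r"\1", text)
    # No space after opening brackets: "( word" -> "(word"
    text = re.sub(r"([(\[])\s+", r"\1", text)
    # Re-attach clitics split off by the tokenizer: "was n't" -> "wasn't"
    text = re.sub(r"\s+(n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    return text
```

Truecasing is left out of the sketch because the slides name a specific tool (SRILM `disambig`) whose invocation is not given here.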
NIST06 Rescoring
Rescoring based on 5k-best lists, using Powell’s
algorithm to find max-BLEU weights
Features (22)
All 12 decoder features
Character length
IBM2 scores in both directions
IBM1-based “missing word” feature (compare score of best
translation for each word to best known)
Posterior probabilities calculated from nbest list for:
sentence length, phrases, words, unigrams, and bigrams.
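As a rough sketch of how n-best rescoring with a posterior feature can work, the code below computes a sentence-length posterior from one n-best list and combines it with the decoder score in a two-feature log-linear model. The softmax over decoder scores, the function names, and the two weights are assumptions for illustration. In the system described above the weights would come from Powell's algorithm tuned for BLEU, which is not shown here.

```python
import math
from collections import defaultdict


def length_posteriors(nbest):
    """Posterior probability of each hypothesis length within one n-best list.

    `nbest` is a list of (tokens, decoder_log_score) pairs. Scores are turned
    into a distribution with a softmax, then mass is summed per length.
    """
    top = max(score for _, score in nbest)
    weights = [math.exp(score - top) for _, score in nbest]  # stable softmax
    total = sum(weights)
    posterior = defaultdict(float)
    for (tokens, _), w in zip(nbest, weights):
        posterior[len(tokens)] += w / total
    return posterior


def rescore(nbest, w_decoder, w_length):
    """Return the tokens of the hypothesis with the best log-linear score."""
    post = length_posteriors(nbest)

    def score(hyp):
        tokens, decoder_score = hyp
        return w_decoder * decoder_score + w_length * math.log(post[len(tokens)])

    return max(nbest, key=score)[0]
```

With a length weight of zero this returns the decoder's own 1-best. A positive weight favours hypotheses whose length is shared by much of the list's probability mass, which is the intuition behind the other posterior features as well.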
NIST06 Search Parameters
NIST08
Towards Tighter Integration of Rule-based and
Statistical MT in Serial System Combination
Rule-based
Systran
Phrase-based
Portage
NIST08
Annotation of Systran output, five
different chunk types:
named entities, numbers, dates
unknown words or unlikely sequences of short
words
‘strong’ rules: very reliable chunks, e.g., rules
based on a long distance syntactic relationship, or
a long multiword expression
NIST09 Serial system combination
NIST09
NRC system trained on SY/EN parallel corpus:
use SYSTRAN to translate ZH half of parallel ZH/EN
training corpus, discarding UN, HKH/L corpora for
efficiency → 3M sentence pairs
preprocess SY: strip markup, tokenize, lowercase
standard phrase-based training
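The SY preprocessing step might look roughly like the sketch below, assuming the SYSTRAN markup consists of angle-bracket tags and that a simple word/punctuation split is an acceptable stand-in for the real tokenizer. Both regular expressions and the function name are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")         # assumed markup: angle-bracket tags
TOKEN = re.compile(r"\w+|[^\w\s]")   # words, or single punctuation marks


def preprocess_sy(line: str) -> str:
    """Strip markup, tokenize, and lowercase one line of SYSTRAN output."""
    no_markup = TAG.sub(" ", line)
    tokens = TOKEN.findall(no_markup)
    return " ".join(tokens).lower()
```

After this step the SY side is plain lowercased tokens, so it can be treated as the source language of an ordinary SY/EN parallel corpus.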
NIST09
Two strategies that didn't work:
Exploit SY/EN surface similarity: boost HMM ttable
scores of similar forms, prior to phrase extraction →
no improvement
Use SY case information: adopt SY case for aligned
EN words; no improvement compared to baseline
independent truecaser
NIST09
Common features:
phrase table based on symmetrized HMM word
alignments (4 features: lex+rf, fwd+bkw)
5g mixture LM from parallel corpus (Foster &
Kuhn, WMT07)
6g LM from GW
word count and distortion
NIST09
Useful:
rescoring with IBM- and nbest-based features
(Ueffing and Ney, CL07; Chen et al., IWSLT05):
+0.3 BLEU
greedy feature pruning for rescoring: +0.3
BLEU
truecasing with "title trick": +0.3 BLEU
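The slides do not say how the greedy feature pruning was done. One common reading is backward elimination, sketched below under that assumption. `evaluate` is a hypothetical callback that would retune the rescoring weights on a feature subset and return a dev-set score such as BLEU.

```python
def greedy_prune(features, evaluate):
    """Backward elimination: drop features while the dev score does not fall.

    `evaluate(feature_subset)` is a hypothetical callback that retunes the
    rescoring weights on the subset and returns a dev-set score such as BLEU.
    """
    kept = list(features)
    best = evaluate(kept)
    improved = True
    while improved and len(kept) > 1:
        improved = False
        # Score every one-feature-removed subset and keep the best one.
        candidates = [[f for f in kept if f != drop] for drop in kept]
        scored = [(evaluate(c), c) for c in candidates]
        top_score, top_subset = max(scored, key=lambda pair: pair[0])
        if top_score >= best:
            best, kept, improved = top_score, top_subset, True
    return kept, best
```

Accepting ties (`>=`) is a design choice in this sketch: a feature that does not help on the dev set is dropped, which keeps the rescoring model smaller and less prone to overfitting the tuning set.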