Introduction to Machine Transliteration
Transcript of Introduction to Machine Transliteration
Introduction to Machine Transliteration
Yoh Okuno / @nokuno
#TokyoNLP
About me
• Name: Yoh Okuno / @nokuno
• Software Engineer at Yahoo! Japan
• Interest: NLP, Machine Learning, Data Mining
• Skills: C/C++, Python, Hadoop, and English
• Website: http://yoh.okuno.name/
What is transliteration?
[Zhang+ 12] Whitepaper of NEWS 2012 Shared Task on Machine Transliteration
What is transliteration?
• Transliteration is defined as phonetic translation of names across languages
• Similar to Letter-to-Phoneme (L2P) conversion and Pronunciation Inference
• The reverse operation of transliteration is called back-transliteration
Examples of Transliteration
• The shared task supports 14 language pairs
All language pairs at NEWS 2012
Two types of transliteration
1. Transliteration mining
– Given noisy source-target language pairs, find the correct transliterations among them
2. Transliteration generation
– Given source-language characters, generate a ranked list of target-language characters
1. Transliteration mining
[Jiampojamarn+ 07] Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion
Character alignment
• Align pairs of Kana and Kanji characters monotonically and detect alignment failures
• Uses techniques from statistical machine translation
• Used m2m-aligner because of its features
http://code.google.com/p/m2m-aligner/
四季多彩 しきたさい → 四|季|多|彩| し|き|た|さい|
西都原 さいとばる → 西|都|原| さい|と|ばる|
iPhone あいふぉん → i|Ph|o|n|e| あい|ふ|ぉ|ん|_|
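The `|`-delimited lines above are m2m-aligner's alignment output, where each position pairs one source segment with one target segment and `_` marks a deletion. A minimal sketch of parsing that format (the helper name is mine, not part of the tool):

```python
def parse_alignment(source: str, target: str):
    """Split m2m-aligner style '|'-delimited strings into aligned segment pairs."""
    src = [s for s in source.split("|") if s]
    tgt = [t for t in target.split("|") if t]
    # Each position pairs one source segment with one target segment;
    # '_' marks a null (deleted) segment on that side.
    return list(zip(src, tgt))

print(parse_alignment("四|季|多|彩|", "し|き|た|さい|"))
# [('四', 'し'), ('季', 'き'), ('多', 'た'), ('彩', 'さい')]
```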
Training m2m-aligner
• Trained on 3 datasets
– Mozc’s dictionary (1.5M words)
– unidic (230k words)
– alt-cannadic (400k words) → most suitable
• Just run 2 commands
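Before running the aligner, each word/reading pair has to be put into its training format. A small sketch, assuming the tab-separated, space-tokenized input format described in the aligner's documentation (the helper name is mine):

```python
def to_m2m_input(word: str, reading: str) -> str:
    """Format a word/reading pair as one m2m-aligner input line:
    characters separated by spaces, the two sides separated by a tab."""
    return " ".join(word) + "\t" + " ".join(reading)

print(to_m2m_input("四季多彩", "しきたさい"))
# 四 季 多 彩<TAB>し き た さ い
```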
Trained results
• Three files are generated:
– Alignment
– Error
– Model
Applying m2m-aligner
• Applied to 6 datasets
– Social IME shared dictionary (93k words)
– Mined from Wikipedia (169k words)
– Crawled MS-IME dictionary (18k words)
– Manually corrected MS-IME dictionary (92k words)
– Hatena keyword (315k words)
– Mined from Aozora Bunko (225k words)
What is Social IME?
• The most popular “cloud-based” Japanese input method (230k unique users per month)
http://www.social-ime.com/
Shared Dictionary of Social IME
• The dictionary is shared with all users
• Noisy & crazy → needs cleaning!
Mining words from Wikipedia
grep-like pattern: “[一-龠]+([ぁ-んヴー]+)”
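The pattern above looks for a kanji run followed by its kana reading in parentheses. A minimal sketch using the slide's character ranges; the full-width parentheses and the sample sentence are my assumptions for illustration:

```python
import re

# Kanji run followed by its kana reading in full-width parentheses,
# e.g. 東京（とうきょう）. Character ranges follow the slide's pattern.
PATTERN = re.compile(r"([一-龠]+)（([ぁ-んヴー]+)）")

text = "東京（とうきょう）は日本の首都。"
for word, reading in PATTERN.findall(text):
    print(word, reading)  # 東京 とうきょう
```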
Crawling MS-IME user dictionary
Hatena keyword
http://developer.hatena.ne.jp/ja/documents/keyword/misc/catalog
Mining Aozora Bunko
http://satomacoto.blogspot.com/2012/01/blog-post.html
Applied results
• Run:
• Results:

| Dataset | Social IME | Wikipedia | MS-IME | MS-IME2 | Hatena | Aozora |
|---|---|---|---|---|---|---|
| Size | 93k | 169k | 18k | 97k | 314k | 255k |
| Align | 48k | 137k | 16k | 86k | 235k | 114k |
| Error | 45k | 32k | 2k | 10k | 78k | 110k |
Alignment examples
• Not perfect, but practical precision
From Social IME:
From Wikipedia:
“ゃ, ゅ, ょ, っ” should be combined with the previous character
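One possible post-processing fix for the small-kana issue above, as a minimal sketch (the function and its character set are my assumptions, not part of m2m-aligner):

```python
SMALL_KANA = set("ゃゅょっぁぃぅぇぉ")

def merge_small_kana(segments):
    """Merge small kana (ゃ/ゅ/ょ/っ, ...) into the previous segment,
    so e.g. ['き', 'ょ', 'う'] becomes ['きょ', 'う']."""
    merged = []
    for seg in segments:
        if merged and seg and seg[0] in SMALL_KANA:
            merged[-1] += seg
        else:
            merged.append(seg)
    return merged

print(merge_small_kana(["き", "ょ", "う"]))  # ['きょ', 'う']
```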
Error examples (from Social IME)
• Error analysis is the most interesting part!
Abbreviations:
Emoticons (顔文字):
Personal information:
Error examples (from Hatena)
Length limit (16 chars):
Chinese / Korean / old Japanese words:
Semantic translation:
Error examples (from Aozora)
• Many old Japanese words cannot be aligned
• Many semantic translations in old Japanese
Aligning Mozc dictionary
• Aligned the Mozc dictionary with the cannadic model

| Data | Input | Alignment | Error |
|---|---|---|---|
| Size | 1488k | 1424k | 64k |

• Error examples:
ぎんごう 銀行
かくちょうだかいだかい 格調高い
あくせられーた 一方通行
{こうたろう/ひろたろう} 廣太郎
Conclusion
• Described how to clean the Social IME / Wikipedia / MS-IME dictionaries using m2m-aligner
• Future work: automatically classify pairs with alignment errors into emoticons, abbreviations, personal information, and so on
2. Transliteration Generation
[Jiampojamarn+ 08] Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion
DirecTL+: String Transduction Model
• Training and decoding tool developed by the same author as m2m-aligner (now a Googler)
• Uses the structured perceptron and MIRA
• Requires an aligned corpus (m2m-aligner format)
• http://code.google.com/p/directl-p/
Adopted joint model
• The joint model is better than a pipeline
Structured Perceptron
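The structured-perceptron update can be sketched as: decode with the current weights, then add the gold output's features and subtract the prediction's. A toy, self-contained sketch; the feature function and per-position decoder below are stand-ins of my own, not DirecTL+'s actual feature set or search:

```python
from collections import defaultdict

def features(x, y):
    # Toy emission-style features: (input segment, output segment) pairs.
    f = defaultdict(int)
    for xi, yi in zip(x, y):
        f[(xi, yi)] += 1
    return f

def decode(x, labels, w):
    # Independent per-position argmax; a real decoder would search over
    # n-gram features with dynamic programming.
    return [max(labels, key=lambda y: w[(xi, y)]) for xi in x]

def perceptron_train(data, labels, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = decode(x, labels, w)
            if y_hat != y_gold:  # update only on mistakes
                for k, v in features(x, y_gold).items():
                    w[k] += v
                for k, v in features(x, y_hat).items():
                    w[k] -= v
    return w

data = [(["s", "i"], ["し", "ぃ"]), (["k", "a"], ["か", "_"])]
w = perceptron_train(data, ["し", "ぃ", "か", "_"])
print(decode(["s", "i"], ["し", "ぃ", "か", "_"], w))  # ['し', 'ぃ']
```

MIRA follows the same loop but scales each update to the smallest change that separates gold from the prediction by the loss margin.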
Features for transliteration
• Target 1-gram, 2-gram, and combinations
Evaluation Metrics
• Word Accuracy: top-1 accuracy
• Mean F-score: character-based accuracy
• MRR: top-k ranking metric using the position of the first correct candidate
• MAP: top-k ranking metric using all correct candidates
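Two of the metrics above can be sketched directly from their definitions; this is a generic illustration, not the shared task's official scorer:

```python
def word_accuracy(gold, predictions):
    """Top-1 accuracy: fraction of words whose best candidate is correct."""
    return sum(p[0] in g for g, p in zip(gold, predictions)) / len(gold)

def mrr(gold, predictions):
    """Mean reciprocal rank of the first correct candidate (0 if none found)."""
    total = 0.0
    for g, preds in zip(gold, predictions):
        for rank, p in enumerate(preds, 1):
            if p in g:
                total += 1.0 / rank
                break
    return total / len(gold)

gold = [{"あいふぉん"}, {"さいとばる"}]
preds = [["あいふぉん", "あいほん"], ["さいとはら", "さいとばる"]]
print(word_accuracy(gold, preds))  # 0.5
print(mrr(gold, preds))            # 0.75
```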
Experiments
• MIRA outperformed the perceptron and other methods
Conclusion
• Proposed a joint model for transliteration / letter-to-phoneme conversion
• MIRA outperformed the structured perceptron
• Features including unigrams and linear-chain features perform well
References
• [Zhang+ 12] Whitepaper of NEWS 2012 Shared Task on Machine Transliteration
• [Jiampojamarn+ 07] Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion
• [Jiampojamarn+ 08] Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion
Any Questions?