A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute...
-
Upload
naomi-sutton -
Category
Documents
-
view
215 -
download
0
Transcript of A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute...
![Page 1: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/1.jpg)
A Joint Source-Channel Model for Machine Transliteration
Li Haizhou, Zhang Min, Su JianInstitute for Infocomm Research
21 Heng Mui Keng Terrace, Singapore 119613{hli,sujian,mzhang}@i2r.a-star.edu.sg
ACL 2004
![Page 2: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/2.jpg)
2
Abstract
This paper presents a new framework that allows direct orthographical mapping (DOM) between two different languages, through a joint source- channel model, also called n-gram TM model,
Automated the orthographic alignment process derives the aligned transliteration units from a bilingual dictionary.
State-of-the-art machine learning algorithm.
![Page 3: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/3.jpg)
3
Research Question
Out of vocabulary Transliterating English names into Chinese is not
straightforward. For example,/Lafontant/ is transliterated into 拉丰
唐 (La-Feng-Tang) while /Constant/ becomes 康斯坦特 (Kang-Si-Tan-Te), where syllable /-tant/ in the two names are transliterated differently depending on the names’ language of origin.
![Page 4: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/4.jpg)
4
Research Question
For a given English name E as the observed channel output, one seeks a posteriori the most likely Chinese transliteration C that maximizes P(C|E).
– P(E|C) : the probability of transliterating C to E through a noisy channel, which is also called transformation rules.
– P(C) : n-gram language models.
Likewise, in C2E transliteration
– P(E) : n-gram language models.
?
?
![Page 5: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/5.jpg)
5
Research Question
E intermediat phonemic C– P(C|E) = P(C|P)P(P|E) P(E|C) = P(E|P)P(P|C) , P: phonemic– For Example:
Chinese characters -> PinYin -> IPA English characters -> IPA
Orthographic alignment– For example:
one-to-many mappings
![Page 6: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/6.jpg)
6
Research Question
English name :
Chinese name :
?
![Page 7: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/7.jpg)
7
Machine Transliteration - Review
phoneme-based techniques– Transformation-based learning algorithm
(Meng et al. 2001; Jung et al, 2000;Virga & Khudanpur, 2003)– Using finite state transducer that implements transformation rules.
(Knight & Graehl, 1998),
Web- (Sigir 2004)– Using the Web for Automated Translation Extraction
Ying Zhang, Phil Vines– Learning Phonetic Similarity for Matching Named Entity
Translations and Mining New Translations Wai Lam, Ruizhang Huang, Pik-Shan Cheung
– Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval
Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu+, and Lee-Feng Chien
![Page 8: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/8.jpg)
8
Joint Source-Channel Model
For K aligned transliteration units , we propose to estimate P(E,C) by a joint source-channel model.
– Suppose exists an alignment γ with :
:
:
:
γ ?
![Page 9: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/9.jpg)
9
Joint Source-Channel Model
A transliteration unit correspondence <e,c> is called a transliteration pair.
– E2C transliteration can be formulated as
– Similarly the C2E back-transliteration as
An n-gram transliteration model is defined as the conditional probability, or transliteration probability, of a transliteration pair <e,c>k depending on its immediate n predecessor pairs:
![Page 10: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/10.jpg)
10
Transliteration alignment
A bilingual dictionary contains entries mapping English names to their respective Chinese transliterations.
– To obtain the statistics, the bilingual dictionary needs to be aligned. The maximum likelihood approach, through EM Algorithm.
![Page 11: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/11.jpg)
11
Transliteration alignment
:
:
:
:
γ
![Page 12: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/12.jpg)
12
The experiments
Database from the bilingual dictionary “Chinese Transliteration of Foreign Personal Names”
– Xinhua News Agency 37,694 unique English ent
ries and their official Chinese transliteration.
13 subsets for 13-flod open test
13-fold open tests : error rates < 1% deviation.
– EM iteration converges at 8th round ?
![Page 13: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/13.jpg)
13
The experiments
For a test set W composed of V names– Each name has been aligned into a sequence of transliteration pai
r tokens.– The cross-entropy Hp(W) of a model on data W is defined as
,where and
WT is the total number of aligned transliteration pair tokens in the data W.
– The perplexity :
![Page 14: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/14.jpg)
14
The experiments
![Page 15: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/15.jpg)
15
The experiments
In word error report, a word is considered correct only if an exact match happens between transliteration and the reference.
The character error rate is the sum of deletion, insertion and substitution errors..
![Page 16: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/16.jpg)
16
The experiments
5,640 transliteration pairs are cross mappings between 3,683 English and 374 Chinese units.
– Each English unit have 1.53 = 5,640/3,683 Chinese correspondences.
– Each Chinese unit have 15.1 = 5,640/374 English back-transliteration units.
Monolingual language perplexity– Chinese and English independently using the n-gram language
models
![Page 17: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/17.jpg)
17
The experiments
![Page 18: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/18.jpg)
18
Discussions
![Page 19: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/19.jpg)
19
Conclusions
In this paper, we propose a new framework (DOM) for transliteration. n-gram TM is a successful realization of DOM paradigm.
It generates probabilistic orthographic transformation rules using a data driven approach. By skipping the intermediate phonemic interpretation, the transliteration error rate is reduced significantly.
Place and company names : – /Victoria-Fall/ becomes 维多利亚瀑布 (Pinyin:Wei Duo
Li Ya Pu Bu).– It can also be easily extended to handle such name tran
slation.?
![Page 20: A Joint Source-Channel Model for Machine Transliteration Li Haizhou, Zhang Min, Su Jian Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore.](https://reader035.fdocuments.in/reader035/viewer/2022062519/5697bff11a28abf838cbb673/html5/thumbnails/20.jpg)
20
Ca Va