Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji
description
Transcript of Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji
![Page 1: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/1.jpg)
Intelligent Database Systems Lab
國立雲林科技大學National Yunlin University of Science and Technology
Automatic Extraction of Translational Japanese-KATAKANAand English Word Pairs from Bilingual Corpora
Advisor : Dr. Hsu
Graduate : Chun Kai Chen
Author : Keita Tsuji
ICCPOL (2001) 245-250
![Page 2: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/2.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Outline
Motivation Objective Introduction Extraction of Traslational Word Pairs Experimental Results Conclusions Personal Opinion
![Page 3: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/3.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Motivation
The bilingual lexicon is in demand in many fields ─ cross-language information retrieval─ machine translation
The bilingual corpora ─ become widely available reflecting the change in
publishing activities
![Page 4: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/4.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Objective
Propose a method ─ automatically extract translational Japanese-KATAKA
NA and English word pairs from bilingual corpora based on transliteration rules
─ < グラフ , graph>
![Page 5: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/5.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Introduction(1/2)
Many of the methods proposed ─ depend heavily on word frequency in the corpora─ cannot treat low-frequency words properly
The low-frequency words ─ include newly-coined words which are especially in de
mand in many fields
![Page 6: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/6.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Introduction(2/2)
Against these background─ we propose a method to extract translational KATAKA
NA-English word pairs from bilingual corpora The method
─ applies all the existing transliteration rules to each mora unit in a KATAKANA word
─ extract English word which matched or partially-matched to one of these transliteration candidates as translation
![Page 7: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/7.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Extraction of Traslational Word Pairs
Pick up J Decompose J
Generate candidates
Pick up E Identify LCS P(J, E)=maxi Dice()
Transliteration rules
グラフ ‘グ’ ,’ラ’ ,’フ’
Ti(J)=graf
Ti(J)=graph
graph P(グラフ , graph)=1P(グラフ , library)=0.36
S(‘guraf’, ‘graph’) = ‘gra’)
Construction of Transliteration Rule
Measure for Matching
![Page 8: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/8.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Construction of Transliteration Rule(1/2)
1. Decompose KATAKANA word ─ ‘ディスパッチャー’─ ‘ディ’ , ‘ ス’ , ‘ パッ’ and ‘ チャー’
2. Extract transliteration rule for each unit manually─ ‘ディスパッチャー’ and ‘dispatcher’─ ‘ディ’ = ‘di’─ ‘ス’ = ‘s’─ ‘パッ’ = ‘pat’─ ‘チャー’ = ‘cher’
![Page 9: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/9.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Construction of Transliteration Rule(2/2)
3. Repeat (1) and (2) for all the word pairs in the list
─ count the frequency of each rule─ rank them for each KATAKANA unit
4. Add Hepburn transliteration rules into the above rules
─ If the source list is large enough, this process might not be necessary
![Page 10: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/10.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Assume
When Japanese borrows an English word─ the word is transliterated on the basis of Japanese mora unit ─ their correspondences are stable
These transliterations are free from context ─ have no relation to the preceding─ following mora units
The number of transliteration rules is small They do not vary drastically from time to time nor fro
m domain to domain
![Page 11: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/11.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Extraction of Traslational Word Pairs
Pick up J and decompose it into units ─ according to the same framework we used at TR construction
Using all the transliteration rules in TR─ generate all the possible transliteration candidates for J─ Henceforth we represent the i-th transliteration candidate of J as Ti(J)
Pick up E which co-occurred with J ─ occurred in the same aligned segment in the corpus─ identify the longest common subsequence with each Ti(J)
If the following P(J, E) exceeds certain threshold─ extract pair J and E as translation
P(J, E)=maxi Dice(L(S(Ti(J), E)), L(Ti(J)), L(E))
![Page 12: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/12.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Extraction of Traslational Word Pairs
Pick up J Decompose J
Generate candidates
Pick up E Identify LCS P(J, E)=maxi Dice()
Transliteration rules
グラフ ‘グ’ ,’ラ’ ,’フ’
Ti(J)=graf
Ti(J)=graph
graph P(グラフ , graph)=1P(グラフ , library)=0.36
S(‘guraf’, ‘graph’) = ‘gra’)
Construction of Transliteration Rule
Measure for Matching
![Page 13: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/13.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Dice Measure J: KATAKANA word in the corpus E: English word in the corpus L(w): number of characters of word w T(J): transliteration candidate of word J S(w1, w2): longest common subsequence of w1 and w2
─ (e.g. S(‘guraf’, ‘graph’) = ‘gra’)
Dice(k,m,n)=k*2/(m+n)
Dice(L(S(Ti(J), E)), L(Ti(J)), L(E))
Dice(L(S(graf, graph)), L(graf), L(graph))
Dice(3,4,5)=3*2/(4+5)=0.67
Ti(J)=graf
Ti(J)=graph
![Page 14: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/14.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Example P( グラフ , graph)
= max( Dice(L(S(graf, graph)), L(graf), L(graph)), Dice(L(S(graph, graph)), L(graph), L(graph)), Dice(L(S(graff, graph)), L(graff), L(graph)), … Dice(L(S(gulerfe, graph)), L(gulerfe), L(graph))) = max(0.67, 1.00, 0.60, …, 0.33, 0.33) = 1.00
P( グラフ , library)=max(0.36, 0.33, ..., 0.29, 0.29)=0.36
![Page 15: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/15.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Device for Time-saving(1/2)
Procedure requires much computational time when applied to the actual data─ using all rules in TR often leads to the combinatory exp
losion of the number of transliteration candidates─ identifying the longest common subsequence often req
uires much time
![Page 16: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/16.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Device for Time-saving(2/2)
Apply less transliteration rules to longer KATAKANA word─ at TR construction, we have ranked transliteration rules according
to their frequencies─ applied top 12/(the number of units in J)+1 rules to each unit of J
3*6*4 candidates
12/3+1=5
3*5*4 candidates
3 6 4
![Page 17: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/17.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Other Alternatives(1/2)
Transliteration Rule (TR)─ comparing the effectiveness of TR and HR (Hepburn tr
ansliteration rule)─ HR is an well-known rule for transliterating Japanese t
o alphabet strings and is easily available─ HR gives each KATAKANA unit a unique alphabet str
ing TR usually generates many T(J) HR generates only one T(J). If HR alone can produce good result, we can save computation
al time
![Page 18: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/18.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Other Alternatives(2/2)
Measure for Matching─ Our method uses the combination of Dice and NPT_sc
ore─ Mdic was used on behalf of Dice:
Mdic(k, m, n)=(1+logk)*k*2/(m+n)─ Bgrm was used on behalf of the combination of Dice a
nd NPT_score:
Bgrm(T(J), E)=|NT(J) ∩ NE|/|NT(J) ∪ NE|
![Page 19: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/19.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Experiment Data(1/2)
Source Data for TR─ from 3,742 transliterational term pairs in dictionary of
artificial intelligence─ added Hepburn transliteration rules to them
Bilingual Corpora─ eight bilingual corpora, Japanese-English parallel abstr
acts/titles of academic papers,─ domains of these corpora are artificial intelligence (AI),
forestry (FR), information processing (IP) and architecture (AC)
![Page 20: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/20.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Experiment Data(2/2)
The purpose of using abstract and title corpora─ examine the influence of size and the degree of parallelism of eac
h segment to the extraction results─ abstracts are larger and noisier than titles
The purpose of using four domains ─ examine the influence of domain difference to the results─ results of the other three domains do not significantly differ from
that of artificial intelligence─ need not be so nervous about the domain of source data for TR co
nstruction
![Page 21: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/21.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Results(1/3)
TR is much more effective than simple HR(Hepburn transliteration rules)─ TR_Dice achieved 93% precision at 75% recall─ HR_Dice at 75% recall remained 4%
Dice is more effective than Mdic─ many word pairs which are long and morphologically-r
elated but not translational─ Mdic between these word pairs tend to become higher t
han those of short translational pairs, which leads to the decrease of precision
![Page 22: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/22.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Results(2/3) The combination of Dice and NPT_score is more effe
ctive than Bgrm Our method is more effective than using combination
of NPT and Match─ in [1] tends to generate strings which contain incorrect translation
s. ‘グラフ’ is transliterated into ‘ghurlaoffphu’ contains not only correct translation ‘graph’, but also incorrect one ‘
hop’.─ Match depends only on the length of English words and their mat
ched parts─ Therefore, it tends to evaluate short English words
Match(‘ghurlaoffphu’, ‘hop’)=3/3=1)
![Page 23: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/23.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Results(3/3)
TR which was constructed based on artificial intelligence term pairs also performed well against the other three domains─ This indicates the domain-independence of TR ─ we do not have to construct TR for each domain
Our method achieved 96-100% precision at 75% recall against title corpora
Our method performed well against abstract corpora (83-93% precision at 75% recall)
![Page 24: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/24.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Error Analysis(1/2)
Most of the translational word pairs not extracted were those containing transliteration units which were not listed in TR─ For instance
‘アーキテクチャ’ and ‘architecture’ was not extracted because ‘ キ’ = ‘chi’ ,‘チャ’ = ‘ture’ were not listed in TR
─ We can solve this problem by enriching TR
![Page 25: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/25.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Error Analysis(2/2)
A few of the translational word pairs which were not extracted were acronyms and their original forms─ cannot be extracted based on transliterations by nature
Many of the non-translational word pairs wrongly extracted were morphologically related pairs. ─ For instance
‘プログラマ’ (programmer) and ‘program’ , ‘クラス’ (class) and ‘subclass’
─ Emphasizing the first and the last unit matching might be effective
![Page 26: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/26.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Conclusion
We proposed a method─ for extracting translational KATAKANA-English word
pairs from bilingual corpora
The experiment─ shows our method is highly effective
By enriching TR and introducing some heuristics─ the performance will become higher
![Page 27: Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji](https://reader036.fdocuments.in/reader036/viewer/2022062519/56815193550346895dbfc99d/html5/thumbnails/27.jpg)
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Personal Opinion
Advantage─ Extraction of Traslational Word Pairs─ Error Analysis
Disadvantage─ Construction of Transliteration Rule
─ Assume and limit