An Information Theoretic Approach to Bilingual Word Clustering
Manaal Faruqui & Chris Dyer
Language Technologies Institute
SCS, CMU
Word Clustering
Grouping of words capturing syntactic, semantic and distributional regularities
[Figure: example words grouped into clusters: Iran, USA, India, Paris, London; 11, 13.4, 22,000, 100; play, laugh, eat, run, fight; good, nice, better, awesome, cool]
Bilingual Word Clustering
What?
• Clustering words of two languages simultaneously
• Inducing a dependence between the two clusterings
Why?
• To obtain better clusterings (hypothesis)
How?
• By using cross-lingual information
Bilingual Word Clustering
Assumption: Aligned words convey information about their respective clusters
Bilingual Word Clustering
Existing: Monolingual Models
Proposed: Monolingual + Bilingual Hints
Related Work
• Bilingual Word Clustering (Och, 1999)
  • Language model based objective for monolingual component
  • Word alignment count-based similarity function for bilingual
• Linguistic structure transfer (Täckström et al., 2012)
  • Maximize the correspondence between clusters of aligned words
  • Alternate optimization of mono & bi objective
  • Clustering of only top 1 million words
• POS tagging (Snyder & Barzilay, 2010)
• Word sense disambiguation (Diab, 2003)
• Bilingual graph based projections (Das and Petrov, 2011)
Monolingual Objective
(Brown, 1992)
[Figure: word sequence S = w1 w2 w3 w4, each word wi generated by its cluster ci, with the clusters c1 c2 c3 c4 forming a chain under clustering C]
P(S; C) = P(c1) * P(w1|c1) * P(c2|c1) * P(w2|c2) * …
• Maximize the likelihood of the word sequence given the clustering
H(S; C) = E [ -log P(S; C) ]
• Minimize the entropy (surprisal) of the word sequence given the clustering
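The monolingual objective can be illustrated with a small sketch. It computes -log P(S; C) for a toy corpus under a class-based bigram model, using maximum-likelihood counts. The function name, the toy sentences and the two clusterings are assumptions made for illustration, not part of the talk, and sentences are assumed to be non-empty.

```python
import math
from collections import Counter

def class_bigram_neg_log_likelihood(sentences, cluster_of):
    """Return -log P(S; C) summed over all sentences, where
    P(S; C) = P(c1) * P(w1|c1) * P(c2|c1) * P(w2|c2) * ...
    and every probability is a maximum-likelihood estimate from `sentences`."""
    word_count = Counter()        # count(w)
    cluster_count = Counter()     # count(c)
    start_count = Counter()       # count(c as first cluster of a sentence)
    transition_count = Counter()  # count(c_prev, c)
    prev_count = Counter()        # count(c_prev) as the source of a transition

    for sent in sentences:
        clusters = [cluster_of[w] for w in sent]
        for w, c in zip(sent, clusters):
            word_count[w] += 1
            cluster_count[c] += 1
        start_count[clusters[0]] += 1
        for c_prev, c in zip(clusters, clusters[1:]):
            transition_count[(c_prev, c)] += 1
            prev_count[c_prev] += 1

    nll = 0.0
    for sent in sentences:
        clusters = [cluster_of[w] for w in sent]
        nll -= math.log(start_count[clusters[0]] / len(sentences))  # P(c1)
        for i, (w, c) in enumerate(zip(sent, clusters)):
            nll -= math.log(word_count[w] / cluster_count[c])       # P(w|c)
            if i > 0:
                c_prev = clusters[i - 1]
                nll -= math.log(transition_count[(c_prev, c)] / prev_count[c_prev])
    return nll

# Toy data (hypothetical): compare a syntactic clustering with a single cluster.
sentences = [["the", "dog", "runs"], ["a", "cat", "sleeps"]]
syntactic = {"the": "DET", "a": "DET", "dog": "N", "cat": "N",
             "runs": "V", "sleeps": "V"}
single = {w: "ALL" for w in syntactic}

nll_syntactic = class_bigram_neg_log_likelihood(sentences, syntactic)
nll_single = class_bigram_neg_log_likelihood(sentences, single)
```

With the same number of clusters held fixed, a lower value means the clustering explains the word sequence better; here the syntactic clustering scores lower than lumping every word into one cluster.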
Bilingual Objective
Maximize the information we know about one clustering given another
[Figure: clusters 1, 2, 3 of Language 1 and clusters 1, 2, 3 of Language 2, connected by word alignments]
Bilingual Objective
Minimize the entropy of one clustering given the other
[Figure: clusters 1, 2, 3 of Language 1 and clusters 1, 2, 3 of Language 2, connected by word alignments]
Bilingual Objective
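For aligned words x in clustering C and y in clustering D, the association between Cx and Dy can be written as:
p(Cx|Dy) + p(Dy|Cx)
where
p(Dy|Cx) = a / (a + b)
[Figure: clusters Cx and Cw in one language and Dy and Dz in the other, joined by alignment edges labelled a, b, c; a is the number of edges between Cx and Dy, b the number of edges from Cx to Dz]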
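A minimal sketch of how these conditional probabilities could be estimated from alignment edges. It assumes each aligned word pair contributes one edge between the clusters of its two words, and that the edge label c in the figure counts edges between Cw and Dy; the function name and the toy data are hypothetical.

```python
from collections import Counter

def cluster_conditionals(alignments, cluster_c, cluster_d):
    """alignments: iterable of (x, y) aligned word pairs, one per alignment edge.
    Returns two dicts keyed by (Ci, Dj): p(Dj|Ci) and p(Ci|Dj)."""
    joint = Counter()
    for x, y in alignments:
        joint[(cluster_c[x], cluster_d[y])] += 1

    edges_from_c = Counter()  # total edges leaving each cluster of C
    edges_from_d = Counter()  # total edges leaving each cluster of D
    for (ci, dj), n in joint.items():
        edges_from_c[ci] += n
        edges_from_d[dj] += n

    p_d_given_c = {(ci, dj): n / edges_from_c[ci] for (ci, dj), n in joint.items()}
    p_c_given_d = {(ci, dj): n / edges_from_d[dj] for (ci, dj), n in joint.items()}
    return p_d_given_c, p_c_given_d

# Toy edges (assumed): a = 3 edges Cx-Dy, b = 1 edge Cx-Dz, c = 2 edges Cw-Dy.
cluster_c = {"x1": "Cx", "w1": "Cw"}
cluster_d = {"y1": "Dy", "z1": "Dz"}
alignments = [("x1", "y1")] * 3 + [("x1", "z1")] + [("w1", "y1")] * 2

p_d_given_c, p_c_given_d = cluster_conditionals(alignments, cluster_c, cluster_d)
```

With these counts, p(Dy|Cx) = a / (a + b) = 3/4, and under the stated assumption about c, p(Cx|Dy) = a / (a + c) = 3/5.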
Bilingual Objective
• Thus for the two clusterings,
AVI(C, D) = E(i, j) [ -log p(Ci|Dj) - log p(Dj|Ci) ]
• Aligned Variation of Information
• Captures the mutual information content of the two clusterings
• Has distance metric properties
  • Non-negative: AVI(C, D) ≥ 0
  • Symmetric: AVI(C, D) = AVI(D, C)
  • Triangle Inequality: AVI(C, E) ≤ AVI(C, D) + AVI(D, E)
  • Identity of Indiscernibles: AVI(C, D) = 0, iff C ≅ D
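The AVI formula can be sketched in a few lines. This version assumes the expectation E(i, j) is taken under the empirical distribution of alignment edges between cluster pairs, which the slide does not spell out; the function name and the toy word pairs are hypothetical.

```python
import math
from collections import Counter

def aligned_variation_of_information(alignments, cluster_c, cluster_d):
    """AVI(C, D) = E(i, j) [ -log p(Ci|Dj) - log p(Dj|Ci) ],
    with the expectation weighted by the share of alignment edges
    that link cluster Ci to cluster Dj (an assumption of this sketch)."""
    joint = Counter((cluster_c[x], cluster_d[y]) for x, y in alignments)
    total = sum(joint.values())

    edges_from_c, edges_from_d = Counter(), Counter()
    for (ci, dj), n in joint.items():
        edges_from_c[ci] += n
        edges_from_d[dj] += n

    avi = 0.0
    for (ci, dj), n in joint.items():
        p_joint = n / total
        avi -= p_joint * (math.log(n / edges_from_d[dj])    # log p(Ci|Dj)
                          + math.log(n / edges_from_c[ci]))  # log p(Dj|Ci)
    return avi

# Toy English-German alignments (hypothetical).
alignments = [("dog", "Hund"), ("cat", "Katze"), ("run", "laufen")]
english = {"dog": 0, "cat": 0, "run": 1}
german_matching = {"Hund": 0, "Katze": 0, "laufen": 1}
german_mismatched = {"Hund": 0, "Katze": 1, "laufen": 1}

avi_matching = aligned_variation_of_information(alignments, english, german_matching)
avi_mismatched = aligned_variation_of_information(alignments, english, german_mismatched)
```

When the two clusterings correspond one-to-one along the alignments, every conditional probability is 1 and the AVI is zero; moving "Katze" to the other cluster splits the edges and the AVI becomes positive.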
Joint Objective
α [ H(C) + H(D) ] + β AVI(C, D)
• Monolingual term α [ H(C) + H(D) ]: word sequence information
• Bilingual term β AVI(C, D): cross-lingual information
• α, β are the weights of the mono and bi objectives resp.
Inference
[Figure: factor graph for Monolingual & Bilingual Word Clustering, with monolingual factors and bilingual factors]
We want to do MAP inference on the factor graph
Inference
• Optimization
  • Optimal solution is a hard combinatorial problem (Och, 1995)
  • Greedy hill climbing word exchange (Martin et al., 1995)
  • Transfer word to the cluster with max improvement
• Initialization
  • Round-robin based on frequency
• Termination
  • No. of words exchanged < 0.1% of (vocab1 + vocab2)
  • At least 5 complete iterations
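The procedure above can be sketched as a generic exchange loop. This is a simplified illustration, not the authors' implementation: it treats a single vocabulary (standing in for vocab1 + vocab2), re-evaluates a caller-supplied `objective` from scratch for every candidate move, and adds a `max_iterations` safety cap that is not on the slide. All names are hypothetical.

```python
def round_robin_init(vocab_by_frequency, num_clusters):
    """Assign the i-th most frequent word to cluster i mod K."""
    return {w: i % num_clusters for i, w in enumerate(vocab_by_frequency)}

def exchange_clustering(vocab_by_frequency, num_clusters, objective,
                        min_iterations=5, stop_fraction=0.001,
                        max_iterations=100):
    """Greedy hill climbing: move each word to the cluster that lowers
    `objective(assignment)` the most, until few words change cluster."""
    assignment = round_robin_init(vocab_by_frequency, num_clusters)
    iteration = 0
    while iteration < max_iterations:  # safety cap (assumption of this sketch)
        iteration += 1
        moved = 0
        for word in vocab_by_frequency:
            original = assignment[word]
            best_cluster, best_score = original, objective(assignment)
            for k in range(num_clusters):
                if k == original:
                    continue
                assignment[word] = k          # tentatively move the word
                score = objective(assignment)
                if score < best_score:
                    best_cluster, best_score = k, score
            assignment[word] = best_cluster   # keep the best move (or stay)
            if best_cluster != original:
                moved += 1
        # Stop after at least `min_iterations` passes once fewer than
        # 0.1% of the vocabulary was exchanged in a pass.
        if iteration >= min_iterations and moved < stop_fraction * len(vocab_by_frequency):
            break
    return assignment

# Toy objective (hypothetical): number of words not in their target cluster.
target = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

def toy_objective(assignment):
    return sum(1 for w, c in assignment.items() if c != target[w])

result = exchange_clustering(list(target), 2, toy_objective)
```

A practical implementation would update the objective incrementally from cached counts instead of recomputing it for every candidate move.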
Evaluation
Named Entity Recognition (NER)
• Core information extraction task
• Very sensitive to word representations
• Word clusters are useful for downstream tasks (Turian et al., 2010)
• Can be directly used as features for NER
  • English (Finkel & Manning, 2009), German (Faruqui & Padó, 2010)
Data and Tools
German NER
• Training & Test data: CoNLL 2003
  • 220,000 and 55,000 tokens resp.
• Corpora for clustering: WIT-3 (Cettolo et al., 2012)
  • Collection of TED talks
  • {Arabic, English, French, Korean, Turkish} – German
  • Around 1.5 million German tokens for each pair
• Stanford NER for training (Finkel and Manning, 2009)
  • In-built functionality to use word clusters for generalization
• cdec for unsupervised word alignments (Dyer et al., 2013)
Experiments
Joint objective: α [ H(C) + H(D) ] + β AVI(C, D)
Baseline: No clusters
1. Bilingual Information Only
  • α = 0, β = 1
  • Objective: AVI(C, D)
2. Monolingual Information Only
  • α = 1, β = 0
  • Objective: H(C) + H(D)
3. Monolingual + Bilingual Information
  • α = 1, β = 0.1
  • Objective: H(C) + H(D) + 0.1 AVI(C, D)
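As a small illustration of how the three settings weight the same quantities, the sketch below evaluates the joint objective for each (α, β) pair. The entropy and AVI values are made-up numbers chosen only to show the arithmetic, and the function and dictionary names are hypothetical.

```python
def joint_objective(h_c, h_d, avi, alpha, beta):
    """alpha * [H(C) + H(D)] + beta * AVI(C, D)."""
    return alpha * (h_c + h_d) + beta * avi

# (alpha, beta) for the three experimental settings on the slide.
SETTINGS = {
    "bilingual only": (0.0, 1.0),
    "monolingual only": (1.0, 0.0),
    "monolingual + bilingual": (1.0, 0.1),
}

# Hypothetical values for H(C), H(D) and AVI(C, D).
h_c, h_d, avi = 5.0, 6.0, 2.0

scores = {name: joint_objective(h_c, h_d, avi, alpha, beta)
          for name, (alpha, beta) in SETTINGS.items()}
```

With β = 0.1 the bilingual term acts as a light regularizer on top of the monolingual entropies rather than as an equal partner.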
Alignment Edge Filtering
• Word alignments are not perfect
• We filter out alignment edges between two words (x, y) if:
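2 * b / ( (a + b + c) + (b + d) ) ≤ η
[Figure: words x and y joined by an alignment edge labelled b; x has further edges labelled a and c, y has a further edge labelled d]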
• Training η for different language pairs:
English 0.1
French 0.1
Arabic 0.3
Turkish 0.5
Korean 0.7
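The filtering rule can be sketched as follows, assuming a, b, c and d are alignment counts, so that (a + b + c) is the total number of alignments of x and (b + d) the total for y. The function name and the toy German-English counts are hypothetical.

```python
from collections import Counter

def filter_alignment_edges(edge_counts, eta):
    """edge_counts: dict mapping (x, y) -> number of times x is aligned to y.
    Drops every edge whose score 2 * count(x, y) / (total(x) + total(y))
    is less than or equal to eta; returns the surviving edges."""
    total_x = Counter()
    total_y = Counter()
    for (x, y), n in edge_counts.items():
        total_x[x] += n
        total_y[y] += n

    kept = {}
    for (x, y), n in edge_counts.items():
        score = 2 * n / (total_x[x] + total_y[y])
        if score > eta:
            kept[(x, y)] = n
    return kept

# Toy alignment counts (hypothetical).
edge_counts = {
    ("Haus", "house"): 8,
    ("Haus", "home"): 1,
    ("Haus", "the"): 1,
    ("der", "the"): 9,
}

kept_loose = filter_alignment_edges(edge_counts, eta=0.15)
kept_strict = filter_alignment_edges(edge_counts, eta=0.5)
```

A larger η keeps only edges between words that are aligned to each other most of the time, which is consistent with the higher thresholds listed for the more distant language pairs.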
Results
F1 scores of German NER trained using different word clusters on the Training set
Results
F1 scores of German NER trained using different word clusters on the Test set
Ongoing Work
[Figure: Multilingual Word Clustering, with monolingual and bilingual factors]
Ongoing Work
Current work: Parallel Data
Mono1 + Parallel Data
Mono1 + Parallel Data + Mono2
Conclusion
• Novel information theoretic model for bilingual clustering
• The bilingual objective has an intuitive meaning
• Joint optimization of the mono + bi objective
• Improvement in clustering quality over monolingual clustering
• Extendable to any number of languages incorporating both monolingual and parallel data
Thank You!