Cross Language Concept Mining
{Motaz.Saad, David.Langlois, Kamel.Smaili}@loria.fr
1. OVERVIEW
Journalist Review System (JRS)
Objective: Build a Journalist Review System (JRS) that enables media trackers (journalists) to collect multilingual comparable articles concerning a given topic, and to perform the following:
• Explore & review opinions.
• Automatically detect the split of public opinion (e.g., for vs. against an issue or person).
• Identify & review more detailed opinions (joy, sadness, anger, ...).
Requirements:
• Comparable corpora for training/testing.
• Comparability Measure (CM): to compare multilingual articles.
• Sentiment Based Comparability Measure (SCM): to compare the opinions of comparable articles.
2. COMPARABLE CORPORA
• Sources: Wikipedia encyclopedia and Euronews website.
• Aligning Wikipedia articles
  ⇒ Use interlanguage links
  ⇒ [[ar:مطر]] [[de:Regen]] [[es:Lluvia]] [[fr:Pluie]] [[en:Rain]]
• Aligning Euronews articles
  ⇒ Parse the HTML links of each English article and fetch the corresponding Arabic/French articles.
• Corpora information: publicly available at http://sf.net/projects/crlcl/
                         AFEWC                       eNews
                         English  French   Arabic    English  French   Arabic
Articles                 40290    40290    40290     34442    34442    34442
Sentences                4.8M     2.7M     1.2M      744K     746K     622K
Avg #sentences/article   119      69       30        21       21       17
Avg #words/article       2266     1435     548       198      200      161
Words                    91.3M    57.8M    22M       6.8M     6.9M     5.5M
Vocabulary               2.8M     1.9M     1.5M      232K     256K     373K
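The interlanguage-link alignment of Wikipedia articles can be illustrated with a short sketch. It assumes the article wikitext still carries inline links of the form [[fr:Pluie]]; the function name `interlanguage_links` and the regular expression are illustrative choices, not taken from the authors' tools.

```python
import re

# Matches inline interlanguage links such as [[fr:Pluie]]:
# a lowercase language code, a colon, then the article title.
INTERLANG_LINK = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\[\]|]+)\]\]")


def interlanguage_links(wikitext, languages=("en", "fr", "ar")):
    """Return {language code: article title} for the requested languages."""
    links = {}
    for lang, title in INTERLANG_LINK.findall(wikitext):
        if lang in languages:
            links[lang] = title.strip()
    return links


# Hypothetical wikitext fragment reusing the "Rain" example links.
sample = "[[de:Regen]] [[es:Lluvia]] [[fr:Pluie]] [[en:Rain]]"
print(interlanguage_links(sample))  # {'fr': 'Pluie', 'en': 'Rain'}
```

Articles whose link sets contain all three target languages would then form one aligned English/French/Arabic triple.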
3. COMPARABILITY MEASURE (CM)
• CM is based on cosine similarity between comparable articles.
• Word weights are represented either as binary values or as word frequencies.
• The cosine measure (cosineCM) gives better recall than the binary measure (binCM).
Figure: Recall at ranks R1, R5 and R10 for binCM and cosineCM.
            R1     R5     R10
binCM       0.36   0.81   1
cosineCM    0.49   0.86   1
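A minimal sketch of cosine similarity over bag-of-words vectors, with the two weighting schemes named above (binary and frequency). It assumes both articles have already been mapped into a common vocabulary, for example through a bilingual dictionary; that mapping step, and the function names `weights`, `cosine` and `cm`, are assumptions for illustration. How the two weightings correspond to the binCM and cosineCM labels in the figure is not spelled out here.

```python
import math
from collections import Counter


def weights(tokens, binary=False):
    """Build a sparse word-weight vector: 1.0 per word (binary) or its count."""
    counts = Counter(tokens)
    if binary:
        return {word: 1.0 for word in counts}
    return {word: float(count) for word, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[word] * v[word] for word in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cm(src_tokens, tgt_tokens, binary=False):
    """Comparability score of two token lists in a shared vocabulary."""
    return cosine(weights(src_tokens, binary), weights(tgt_tokens, binary))


# Toy articles, assumed to be already translated into one language.
src = ["rain", "water", "rain", "cloud"]
tgt = ["rain", "cloud", "sky"]
print(cm(src, tgt, binary=True))  # binary weights
print(cm(src, tgt))               # frequency weights
```

To retrieve the counterpart of a source article, every candidate target article would be scored with `cm` and ranked; recall at R1, R5 and R10 then counts how often the true counterpart appears among the top 1, 5 or 10 candidates.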
4. SENTIMENT BASED COMPARABILITY MEASURE (SCM)
\[
\mathrm{scm}(c) = \left| \frac{\sum_{C(S_x)=c} P(S_x \mid c)}{N_x} - \frac{\sum_{C(S_y)=c} P(S_y \mid c)}{N_y} \right|
\]
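The formula can be read as the absolute difference between two per-document averages. The sketch below makes several assumptions that are not stated on the poster: each sentence S comes with a predicted class C(S) and a classifier probability P(S|c), and N_x and N_y are taken to be the total number of sentences in each document. The names `class_score` and `scm` are illustrative.

```python
def class_score(sentences, c):
    """Sum P(S|c) over sentences labelled c, divided by the sentence count.

    `sentences` is a list of (label, probability) pairs. Dividing by the
    total number of sentences is an assumption about N_x and N_y.
    """
    if not sentences:
        return 0.0
    return sum(prob for label, prob in sentences if label == c) / len(sentences)


def scm(doc_x, doc_y, c):
    """Absolute difference of the class-c scores of two documents."""
    return abs(class_score(doc_x, c) - class_score(doc_y, c))


# Hypothetical classifier output for a source and a target article.
doc_x = [("subjective", 0.9), ("objective", 0.8),
         ("subjective", 0.7), ("objective", 0.6)]
doc_y = [("subjective", 0.6), ("objective", 0.9)]
print(scm(doc_x, doc_y, "subjective"))  # close to 0.1
```

Under this reading, a value near 0 means the two articles carry a similar amount of class-c content, and larger values mean they diverge.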
5. SCM RESULTS
Corpora       scm(subjective)  scm(objective)  scm(negative)  scm(positive)
Parallel
  AFP         0.02             0.02            0.1            0.12
  ANN         0.05             0.06            0.1            0.1
  ASB         0.07             0.1             0.12           0.14
  TED         0.06             0.06            0.08           0.07
  UN          0.05             0.02            0.07           0.08
Comparable
  eNews       0.07             0.15            0.11           0.15
  AFEWC       0.11             0.19            0.11           0.16
AFP: Agence France-Presse, ANN: Annahar newspaper, ASB: Assabah newspaper, TED: talks from ted.com, UN: United Nations resolutions.
- Comparing CM results for parallel/comparable corpora ⇒ CM can capture comparability.
- Comparable articles do not have the same opinions ⇒ they vary in their objectivity and positivity.
6. MORPHOLOGICAL ANALYSIS
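Figure: Arabic words grouped by shared root, with English and French translations.
Transliteration   Arabic   English    French
katab             كتب      to write   écrire
  maktab          مكتب     office     bureau
  kitab           كتاب     book       livre
  maktaba         مكتبة    library    bibliothèque
tair              طير      to fly     voler
  ta-iar          طيار     pilot      pilote
  matar           مطار     airport    aéroport
  ta-ira          طائرة    airplane   avion
  ta-ir           طائر     bird       oiseau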
• Stemming and lemmatization for English and French
• Rooting and light stemming for Arabic
  ⇒ Light stemming removes suffixes and prefixes
  ⇒ Rooting removes suffixes and prefixes and reduces the word to its root
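Light stemming can be sketched as stripping at most one known prefix and one known suffix while keeping a minimum stem length. The affix lists below are a small illustrative selection and the three-letter minimum is an assumption; neither is the stemmer actually used for the poster. Rooting, which additionally reduces the stem to its root (e.g. مكتب to كتب), needs pattern matching and is not covered here.

```python
# Illustrative affix lists (assumed); longer affixes are tried first.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]
MIN_STEM_LENGTH = 3  # assumed lower bound so short words are left intact


def light_stem(word):
    """Strip at most one prefix and one suffix from an Arabic word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[:-len(suffix)]
            break
    return word


for w in ["المكتبة", "مكتبة", "طائرة", "كتب"]:
    print(w, "->", light_stem(w))
```

With these lists, مكتبة (library) is reduced to مكتب, while a full rooting step would go further and map both to كتب.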
7. COVERAGE RATE OF THE BILINGUAL DICTIONARY
Figure: Coverage rate of the bilingual dictionary (scale from 0% to 100%).
morphAr-lemma       57%
morphAr-stemEn      50%
root-lemma          40%
root-stemEn         39%
lightStem-lemma     41%
lightStem-stemEn    41%
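One plausible way to compute such a coverage rate is the share of distinct corpus words, after morphological normalisation, that have an entry in the bilingual dictionary. This definition is an assumption, since the poster does not state the formula, and `coverage_rate` is a hypothetical helper name.

```python
def coverage_rate(vocabulary, dictionary_entries):
    """Fraction of distinct vocabulary items found in the dictionary."""
    vocab = set(vocabulary)
    if not vocab:
        return 0.0
    return len(vocab & set(dictionary_entries)) / len(vocab)


# Toy example: normalised corpus words vs. dictionary headwords.
corpus_words = ["write", "book", "office", "airport", "book"]
dictionary_headwords = ["write", "book", "bird"]
print(f"{coverage_rate(corpus_words, dictionary_headwords):.0%}")  # 50%
```

Running the same computation after each normalisation scheme (morphological analysis, rooting, light stemming on the Arabic side; lemmatization or stemming on the English side) would give one rate per combination, which is the kind of comparison the figure reports.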
8. FUTURE WORK
• Elaborate a multilingual document representation model based on Latent Semantic Indexing to enhance CM.
• Elaborate SCM by enhancing sentiment detection and by reviewing more detailed sentiments, i.e., emotions in words (joy, anger, pleasure, ...). This will be done by exploiting annotated lexicons and semantic networks.
• Develop an interface for journalists to review comparable articles.
9. REFERENCES
• Saad, M.; Langlois, D. & Smaili, K. (2013), Comparing Multilingual Comparable Articles Based On Opinions, in 'Proceedings of the Sixth Workshop on Building and Using Comparable Corpora', Association for Computational Linguistics, Sofia, Bulgaria, pp. 105-111.
• Saad, M.; Langlois, D. & Smaili, K. (2013), Extracting Comparable Articles from Wikipedia and Measuring Their Comparabilities, in '5th International Conference on Corpus Linguistics', University of Alicante, Spain.