Cross-Lingual Linking of News Stories using ESA
Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Enabling Networked Knowledge
Cross-Lingual Linking of News Stories using ESA
Nitish Aggarwal, Kartik Asooja, Paul Buitelaar, Tamara Polajnar, Jorge Gracia
DERI, NUI Galway, Ireland
OEG, UPM, Madrid, Spain
Tuesday, 18 Dec, 2012
CL!NSS, FIRE-2012
Overview
Problem Space
Approach
– Search Space Reduction
– Semantic Ranking
Cross-Lingual Explicit Semantic Analysis (CL-ESA)
Evaluations
Conclusion & Future Work
Problem Space
Cross-lingual news story linking
– Identify the same news articles in different languages
– Cross-lingual plagiarism detection
Data set
– 50 English news stories
– 50K Hindi news stories
Challenge
– Not directly translated
  – Similar keywords in different stories
  – Different keywords in similar stories
Approach
Search Space Reduction
– News publication dates
  – by taking a K-day window
– Vocabulary overlap
  – translating English news stories using Google Translate
Semantic Ranking
– Rank the news stories by their semantic relatedness
– CL-ESA semantic relatedness score
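The date-based search space reduction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the story identifiers, the example dates, and the reading of the K-day window as a symmetric number of days on each side of the English story's publication date are all assumptions.

```python
from datetime import date

def within_window(source_date, candidate_date, days_each_side):
    """True if candidate_date lies at most days_each_side days before or
    after source_date (assumed symmetric window)."""
    return abs((candidate_date - source_date).days) <= days_each_side

def reduce_by_date(source_date, candidates, days_each_side):
    """Keep the ids of candidate stories published inside the window.

    candidates is a list of (story_id, publication_date) pairs.
    """
    return [story_id for story_id, published in candidates
            if within_window(source_date, published, days_each_side)]

# Hypothetical Hindi candidates for one English story.
candidates = [
    ("hi-001", date(2012, 3, 1)),
    ("hi-002", date(2012, 3, 4)),
    ("hi-003", date(2012, 3, 9)),
]
kept = reduce_by_date(date(2012, 3, 3), candidates, days_each_side=2)
print(kept)
```

Only the stories that survive this filter would need to be scored by the more expensive semantic ranking step.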
Semantic Ranking/Relatedness
Corpus-based Relatedness
– Semantic meaning as a distributional vector
  – Words that occur in similar contexts tend to have similar or related meanings, i.e. the meaning of a word can be defined in terms of its context (Distributional Hypothesis; Harris, 1954)
– Latent Semantic Analysis (LSA)
  – Latent or implicit semantics (unsupervised)
– Explicit Semantic Analysis (ESA)
  – Explicit semantics from explicitly derived concepts (supervised)
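To make the idea of explicit concept vectors concrete, here is a small sketch of an ESA-style representation. It assumes a tiny hypothetical concept collection (two made-up concepts with toy descriptions) and a plain TF-IDF weighting; the weighting scheme and all names are illustrative choices, not details taken from the slides.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(concept_docs):
    """Map each term to {concept: TF-IDF weight} over a concept collection."""
    n_concepts = len(concept_docs)
    term_counts = {c: Counter(text.lower().split())
                   for c, text in concept_docs.items()}
    doc_freq = Counter(t for counts in term_counts.values() for t in counts)
    index = defaultdict(dict)
    for concept, counts in term_counts.items():
        for term, freq in counts.items():
            index[term][concept] = freq * math.log(n_concepts / doc_freq[term])
    return index

def esa_vector(text, index):
    """Represent a text as the sum of the concept vectors of its terms."""
    vec = defaultdict(float)
    for term in text.lower().split():
        for concept, weight in index.get(term, {}).items():
            vec[concept] += weight
    return dict(vec)

# Hypothetical concept collection standing in for an encyclopedia.
concepts = {
    "Cricket": "bat ball wicket match",
    "Election": "vote ballot party match",
}
index = build_inverted_index(concepts)
vec = esa_vector("bat ball", index)
print(vec)
```

Each dimension of the resulting vector is a named concept, which is what distinguishes this representation from the latent dimensions of LSA.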
Cross-lingual ESA (CL-ESA)
[Diagram: for each language (EN, HI, ES), an inverted index maps every word (Word1 … Wordn) to a weighted concept vector w1*URI1 + w2*URI2 + … + wn*URIn. Term@en and Term@hi are each mapped to such a vector (w11*URI1 + w12*URI2 + … + w1n*URIn), and the vector cosine of the two gives their semantic relatedness.]
Multilingual Wikipedia Index
– EN, DE, ES, PT, FR, NL, HI
– Easily extendable to other languages
Performed better than CL-latent models
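The cross-lingual comparison can be sketched as below, assuming that the per-language inverted indexes share the same concept identifiers (the URIs), so that vectors built from different languages live in one space. The two toy indexes, their weights, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def concept_vector(tokens, index):
    """Sum the concept vectors of all tokens found in a per-language index."""
    vec = defaultdict(float)
    for token in tokens:
        for uri, weight in index.get(token, {}).items():
            vec[uri] += weight
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors keyed by concept URI."""
    dot = sum(w * v.get(uri, 0.0) for uri, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical per-language indexes sharing language-independent URIs.
index_en = {"election": {"URI1": 1.0}, "cricket": {"URI2": 1.0}}
index_hi = {"चुनाव": {"URI1": 1.0}, "क्रिकेट": {"URI2": 1.0}}

en_vec = concept_vector(["election"], index_en)
related = cosine(en_vec, concept_vector(["चुनाव"], index_hi))
unrelated = cosine(en_vec, concept_vector(["क्रिकेट"], index_hi))
print(related, unrelated)
```

Because both vectors are expressed over the same URIs, no translation of the text itself is needed at comparison time.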
Experiments
Run1
– Window of 4 days (2 days before and 2 days after)
– Rank all news stories using CL-ESA
Run2
– Window of 14 days (7 days before and 7 days after)
– Rank all news stories using modified CL-ESA
Run3
– English stories were translated into Hindi using Google Translate
– Took the top 1000 Hindi news stories by vocabulary overlap
– Re-rank all news stories using CL-ESA
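The two-stage shape of Run3 (shortlist by vocabulary overlap, then re-rank) might look like the sketch below. The slides do not define the overlap measure, so counting shared distinct tokens is an assumption, and the Jaccard function is only a stand-in for the CL-ESA relatedness score so that the example runs on its own. Token values and story ids are placeholders.

```python
def vocabulary_overlap(a_tokens, b_tokens):
    """Number of distinct tokens shared by two token lists (assumed measure)."""
    return len(set(a_tokens) & set(b_tokens))

def jaccard(a_tokens, b_tokens):
    """Stand-in relatedness score; the real system would use CL-ESA here."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def shortlist_then_rerank(query_tokens, candidates, top_k, relatedness):
    """Keep the top_k candidates by overlap, then order them by relatedness."""
    shortlisted = sorted(
        candidates,
        key=lambda sid: vocabulary_overlap(query_tokens, candidates[sid]),
        reverse=True,
    )[:top_k]
    return sorted(
        shortlisted,
        key=lambda sid: relatedness(query_tokens, candidates[sid]),
        reverse=True,
    )

# Placeholder tokens for a translated English story and three Hindi stories.
query = ["t1", "t2", "t3"]
candidates = {
    "hi-1": ["t1", "t2", "x1", "x2", "x3", "x4"],
    "hi-2": ["t3"],
    "hi-3": ["x9"],
}
ranking = shortlist_then_rerank(query, candidates, top_k=2, relatedness=jaccard)
print(ranking)
```

In this toy case the shortlist drops `hi-3`, and the re-ranking step then reverses the order produced by raw overlap, which illustrates why the two stages can disagree.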
Evaluation: Results
[Results table for the CL!NSS challenge; values not preserved in the transcript]
Conclusion
– Initial approach for cross-lingual linking of news stories
– Bigger window with modified CL-ESA works best
– Translated vocabulary overlap did not work well
Future Work
– Use other ranking scores: LSA, LDA
– Evaluate the separate effect of components
  – Bigger window size vs. ranking function
Thank You
Questions?