Context Based Search
By Shatabdi Kundu (2010EET2553)
Computer Technology, M.Tech, IIT Delhi
Email ID: shatabdikundu@live.com
Project Guide: Prof. Santanu Chaudhury
Electrical Engineering Department, IIT Delhi
Email ID: santanuc@ee.iitd.ac.in
June 22, 2011 | Shatabdi Kundu :: 2010EET2553 | Prof. Santanu Chaudhury | 09 MAY 2011 | 1 of 16
Outline
Introduction to Topic Models: Probabilistic Modelling
Latent Dirichlet Allocation
Topic Discovery using WordNet
Work Done
Results
Conclusion and Future Work
References
Probabilistic Modelling
Treat data as observations that arise from a generative probabilistic process that includes hidden variables
For documents, the hidden variables reflect the thematic structure of the collection
Infer the hidden structure using posterior inference
What are the topics that describe this collection?
Situate new data into the estimated model
How does this query or new document fit into the estimated topic structure?
Intuition behind LDA
Generative Process
Cast these intuitions into a generative probabilistic process
Each document is a random mixture of corpus-wide topics
Each word is drawn from one of those topics
Graphical Models
Nodes are random variables
Edges denote possible dependence
Observed variables are shaded
Plates denote replicated structure
Graphical Models
The structure of the graph defines the pattern of conditional dependence between the ensemble of random variables.
E.g., this graph corresponds to

p(y, x_1, ..., x_N) = p(y) ∏_{n=1}^{N} p(x_n | y)    (1)
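The factorization in equation (1) can be evaluated directly. A minimal sketch (not from the slides), assuming hypothetical distributions for a binary class y and three conditionally independent binary observations x_1..x_3:

```python
import numpy as np

# Illustrative (assumed) distributions: p(y) and p(x_n = 1 | y) for n = 1..3.
p_y = np.array([0.6, 0.4])
p_x_given_y = np.array([[0.7, 0.2],
                        [0.5, 0.9],
                        [0.1, 0.4]])

def joint(y, xs):
    """Evaluate p(y, x_1..x_N) = p(y) * prod_n p(x_n | y) for binary x_n."""
    probs = np.where(xs == 1, p_x_given_y[:, y], 1.0 - p_x_given_y[:, y])
    return p_y[y] * probs.prod()

xs = np.array([1, 0, 1])
# Marginalizing over y recovers p(x_1..x_N), as the graph structure implies.
total = sum(joint(y, xs) for y in (0, 1))
```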
Latent Dirichlet Allocation
1. Draw each topic βk ∼ Dir(η), for k ∈ {1, ..., K}
2. For each document d:
   1. Draw topic proportions θd ∼ Dir(α)
   2. For each word n:
      1. Draw topic assignment Zd,n ∼ Mult(θd)
      2. Draw word Wd,n ∼ Mult(βZd,n)
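The generative process above can be sampled directly. A minimal sketch (not from the slides), with K, the vocabulary size V, document length N, and the hyperparameters η and α all chosen as illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 7, 1000, 50        # topics, vocabulary size, words per document (assumed)
eta, alpha = 0.01, 0.1       # Dirichlet hyperparameters (assumed)

# Step 1: draw each topic beta_k ~ Dir(eta); rows are topic-word distributions.
beta = rng.dirichlet(np.full(V, eta), size=K)

def generate_document():
    # Step 2a: draw topic proportions theta_d ~ Dir(alpha).
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)     # Step 2b-i: Z_{d,n} ~ Mult(theta_d)
        w = rng.choice(V, p=beta[z])   # Step 2b-ii: W_{d,n} ~ Mult(beta_{Z_{d,n}})
        words.append(w)
    return words

doc = generate_document()
```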
Latent Dirichlet Allocation
From a collection of documents, infer
Per-word topic assignment Zd,n
Per-document topic proportions θd
Per-corpus topic distributions βk
Use posterior expectations to perform the task at hand, e.g. information retrieval, document similarity, etc.
Topic Discovery using Wordnet
Lexical relations used for finding the latent topics
synsets (synonym sets) as basic units
hyponymy: a semantic relation between word meanings. E.g. {maple} is a hyponym of {tree}
hypernymy: the inverse of hyponymy. E.g. {tree} is a hypernym of {maple}
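These relations form chains that can be walked upward. A minimal sketch (not the WordNet API itself), using a tiny hand-built hypernym map to mimic the {maple} → {tree} example:

```python
# Tiny illustrative hypernym map; real WordNet relates whole synsets,
# not single words, and has a much richer graph.
hypernym_of = {
    "maple": "tree",
    "oak": "tree",
    "tree": "plant",
    "rose": "flower",
    "flower": "plant",
}

def is_hyponym(word, ancestor):
    """True if `ancestor` lies somewhere on the hypernym chain above `word`."""
    while word in hypernym_of:
        word = hypernym_of[word]
        if word == ancestor:
            return True
    return False
```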
Work Done
I took a collection of 10 documents that had a total of around 28K words.
I removed the stop words and rare words, along with punctuation marks and numbers.
Then I trained a 7-topic LDA model on this corpus.
Now I had 7 topics, each with its 5 most probable words.
I then used the lexical relations of WordNet to identify the hidden topics, using common parents of all the words in each topic.
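The last step, finding a common parent for the words in a topic, can be sketched as follows (assumed helper over a hand-built hypernym map; the slides use real WordNet relations):

```python
# Illustrative hypernym map standing in for WordNet's hypernym relations.
hypernym_of = {
    "maple": "tree", "oak": "tree", "pine": "tree",
    "tree": "plant", "rose": "flower", "flower": "plant",
    "plant": "entity",
}

def ancestors(word):
    """Ordered hypernym chain from the word up to the root."""
    chain = []
    while word in hypernym_of:
        word = hypernym_of[word]
        chain.append(word)
    return chain

def common_parent(words):
    """Lowest hypernym shared by all words: a candidate topic name."""
    shared = None
    for w in words:
        chain = ancestors(w)
        shared = chain if shared is None else [a for a in shared if a in chain]
    return shared[0] if shared else None
```

For a topic whose top words are "maple", "oak", and "pine", this yields "tree" as the topic name.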
Results after training LDA model
This model only selects appropriate words within a topic but does not name the topic
Discovering the topic name is done using WordNet
Results after applying WordNet
The above result gives us the hidden topic names of the words that comprised the documents.
This kind of model can be used for identifying topics when given only a word.
Conclusion and Future Work
We will now work on searching based on topics (context) using this model.
Essentially, we will deal with the geo-intent of queries and decide on the topic to which they belong, for better retrieval of information.
References
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
Jun Fu Cai, Wee Sun Lee, Yee Whye Teh. NUS-ML: Improving Word Sense Disambiguation Using Topic Features. SemEval (2007).
David M. Blei, Jon D. McAuliffe. Supervised Topic Models. NIPS (2007).
WordNet. http://www.shiffman.net/teaching/a2z/wordnet
Thank You