Example
• 16,000 documents
• 100 topics
• Picked the words with large p(w|z) in each topic
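Selecting the high-probability words for each topic can be sketched as follows (the topic names, words, and probabilities are toy values for illustration, not from the experiment):

```python
def top_words(topic_word_probs, k=3):
    """Return the k words with largest p(w|z) for each topic.

    topic_word_probs: dict mapping topic -> {word: p(w|z)}.
    """
    result = {}
    for topic, probs in topic_word_probs.items():
        # Sort the vocabulary of this topic by descending probability.
        ranked = sorted(probs.items(), key=lambda wp: wp[1], reverse=True)
        result[topic] = [w for w, _ in ranked[:k]]
    return result

# Toy illustration: two "topics" over a tiny vocabulary.
p_w_given_z = {
    "arts":    {"film": 0.05, "show": 0.04, "music": 0.03, "tax": 0.001},
    "budgets": {"million": 0.06, "tax": 0.05, "budget": 0.04, "film": 0.001},
}
print(top_words(p_w_given_z, k=2))
# {'arts': ['film', 'show'], 'budgets': ['million', 'tax']}
```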
Unseen document
• Given a new document, compute the variational parameters γ and φn
• γi − αi gives the expected number of words allocated to topic i
• φn approximates p(zn|w)
• Look at the cases where these values are relatively large
• 4 topics were found in the example document
Unseen document (contd.)
• Under the bag-of-words assumption, the words of "William Randolph Hearst Foundation" are assigned to different topics
Applications and empirical results
• Document modeling
• Document classification
• Collaborative filtering
Document modeling
• Task: density estimation, i.e. assign high likelihood to unseen documents
• Measure of goodness: perplexity
• Perplexity decreases monotonically in the likelihood, so lower is better
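The perplexity measure can be sketched directly; `perplexity` below is a hypothetical helper and the log-likelihoods are toy numbers:

```python
import math

def perplexity(log_likelihoods, doc_lengths):
    """perplexity = exp(- sum_d log p(w_d) / sum_d N_d).

    Higher held-out likelihood gives lower perplexity, so lower is better.
    """
    return math.exp(-sum(log_likelihoods) / sum(doc_lengths))

# Toy numbers: two held-out documents of 100 and 50 words.
print(perplexity([-230.0, -120.0], [100, 50]))  # ≈ 10.31
```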
The experiment
| Corpus               | Articles | Terms  |
|----------------------|----------|--------|
| Scientific abstracts | 5,225    | 28,414 |
| Newswire articles    | 16,333   | 23,075 |
The experiment (contd.)
• Preprocessing: removed stop words and words appearing only once
• 10% of the data held out for testing
• All models trained with the same stopping criteria
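A sketch of this preprocessing, assuming tokenized documents and a toy stop-word list:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to"}  # illustrative subset

def preprocess(docs):
    """Drop stop words and words that appear exactly once in the corpus."""
    counts = Counter(w for doc in docs for w in doc)
    return [[w for w in doc
             if w not in STOP_WORDS and counts[w] > 1]
            for doc in docs]

docs = [["the", "gene", "expression", "of", "gene"],
        ["protein", "expression", "assay"]]
print(preprocess(docs))
# [['gene', 'expression', 'gene'], ['expression']]
```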
Results (document modeling)
Overfitting in the mixture of unigrams
• The posterior over topics becomes peaked on the training set
• An unseen document may contain words never seen in training
• Such words receive a vanishingly small probability, driving down the whole document's likelihood
• Remedy: smoothing
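One standard smoothing remedy is add-alpha (Laplace) smoothing, sketched here on toy counts:

```python
def smoothed_probs(counts, vocab, alpha=1.0):
    """Add-alpha (Laplace) smoothing: every vocabulary word gets nonzero
    probability, so an unseen word no longer zeroes out a whole document."""
    total = sum(counts.get(w, 0) for w in vocab)
    denom = total + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / denom for w in vocab}

probs = smoothed_probs({"grain": 3, "wheat": 1}, vocab=["grain", "wheat", "oil"])
print(probs["oil"])  # ≈ 0.143, nonzero even though "oil" was never seen
```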
Overfitting in pLSI
• A mixture of topics per document is allowed
• Marginalize over d to find p(w)
• A test document is restricted to the topic proportions seen in the training documents
• "Folding in": ignore the p(z|d) parameters and refit p(z|dnew)
LDA
• Documents can have different proportions of topics
• No folding-in heuristics needed for unseen documents
Document classification
• Generative vs. discriminative approaches
• Choice of features matters in document classification
• LDA as a dimensionality reduction technique
• Use the posterior Dirichlet parameters γ*(w) as the LDA features
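Assuming the variational Dirichlet parameters γ*(w) are available for a document, using them as a low-dimensional feature vector can be sketched as (toy values):

```python
def lda_features(gamma):
    """Use the variational Dirichlet parameters gamma*(w) as a K-dimensional
    feature vector in place of V-dimensional word counts.
    Normalizing gives the expected topic proportions of the document."""
    total = sum(gamma)
    return [g / total for g in gamma]

# Toy posterior for one document over K = 4 topics; compare with the
# V = 15,818 raw word features used in the experiment.
print(lda_features([12.0, 3.0, 4.0, 1.0]))  # [0.6, 0.15, 0.2, 0.05]
```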
![Page 15: Example 16,000 documents 100 topic Picked those with large p(w|z)](https://reader036.fdocuments.in/reader036/viewer/2022081603/56649e575503460f94b4fee4/html5/thumbnails/15.jpg)
The experiment
• Binary classification
• 8,000 documents, 15,818 words
• LDA trained without reference to the true class labels
• 50 topics
• Trained SVM on the LDA features
• Compared with SVM on all word features
• LDA reduced feature space by 99.6%
GRAIN vs NOT GRAIN
EARN vs NOT EARN
LDA in document classification
• Feature space was reduced substantially, yet performance improved
• Results warrant further investigation
• Could also be used for feature selection
Collaborative filtering
• A collection of users and the movies they prefer
• Models trained on the observed users
• Task: given a new user and all but one of the movies they preferred, predict the held-out movie
• Only users who rated at least 100 movies positively were kept
• Trained on 89% of the data
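The hold-one-out evaluation protocol can be sketched as follows (hypothetical helper, toy movie list):

```python
def hold_one_out(user_movies):
    """Split a user's preferred movies: all but the last are observed,
    and the last is held out for the model to predict."""
    return user_movies[:-1], user_movies[-1]

observed, held_out = hold_one_out(["alien", "brazil", "casablanca"])
print(observed, held_out)  # ['alien', 'brazil'] casablanca
```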
Some quantities required…
• Probability of the held-out movie: p(w|wobs)
– For the mixture of unigrams and pLSI, sum out the topic variable
– For LDA, sum out both the topic and Dirichlet variables (the quantity is still efficient to compute)
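For the mixture models, summing out the topic variable looks like this (the posterior and topic-word probabilities below are toy values, not fitted parameters):

```python
def predictive_prob(word, topic_posterior, word_given_topic):
    """p(w | w_obs) = sum_z p(z | w_obs) * p(w | z):
    sum out the topic variable given the observed movies."""
    return sum(topic_posterior[z] * word_given_topic[z].get(word, 0.0)
               for z in topic_posterior)

topic_posterior = {"action": 0.7, "drama": 0.3}  # p(z | w_obs), toy values
word_given_topic = {
    "action": {"alien": 0.2, "speed": 0.1},
    "drama":  {"alien": 0.05, "casablanca": 0.3},
}
print(predictive_prob("alien", topic_posterior, word_given_topic))  # ≈ 0.155
```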
Results (collaborative filtering)
Further work
• Other approaches for inference and parameter estimation
• Embedded in another model
• Other types of data
• Partial exchangeability
Example – Visual words
• Document = image
• Words = image features (e.g. bars, circles)
• Topics = object categories (e.g. face, airplane)
• Bag of words = no spatial relationship between the features
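Mapping raw image features to visual words is a nearest-neighbor quantization step; here is a sketch with a toy 2-D codebook (real systems quantize high-dimensional descriptors, e.g. 128-D SIFT):

```python
def nearest_codeword(feature, codebook):
    """Quantize an image feature to its nearest codebook entry
    (its 'visual word'), by squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(codebook, key=lambda name: sqdist(feature, codebook[name]))

# Toy 2-D descriptors standing in for real image features.
codebook = {"bar": (0.0, 1.0), "circle": (1.0, 0.0)}
print(nearest_codeword((0.9, 0.2), codebook))  # circle
```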
Visual words
Identifying the visual words and topics
Conclusion
• Exchangeability and the De Finetti theorem
• The Dirichlet prior yields a generative bag-of-words model
• The independence assumptions of the Dirichlet distribution motivate correlated topic models
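The generative bag-of-words story can be sketched end to end, sampling the Dirichlet via normalized Gamma draws (the topics and word probabilities below are toy values):

```python
import random

def generate_document(alpha, topic_words, n_words, rng=random.Random(0)):
    """LDA's generative story: draw topic proportions theta ~ Dirichlet(alpha),
    then for each word draw a topic z ~ theta and a word w ~ p(w|z).
    Word order never matters: the document is an exchangeable bag of words."""
    # A Dirichlet sample is a vector of Gamma draws normalized to sum to 1.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    theta = [g / sum(gammas) for g in gammas]
    topics = list(topic_words)
    doc = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]
        words = topic_words[z]
        doc.append(rng.choices(list(words), weights=list(words.values()))[0])
    return doc

topic_words = {"arts": {"film": 0.7, "music": 0.3},
               "budgets": {"tax": 0.6, "million": 0.4}}
print(generate_document([1.0, 1.0], topic_words, n_words=5))
```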
Implementations
• In C (by one of the authors): http://www.cs.princeton.edu/~blei/lda-c/
• In C and Matlab: http://chasen.org/~daiti-m/dist/lda/
References
• D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
• J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering object categories in image collections. MIT AI Lab Memo AIM-2005-005, February 2005.
• D. Blei and J. Lafferty. Correlated topic models. Advances in Neural Information Processing Systems 18, 2005.