Movie topics- Efficient features for movie recommendation systems
Efficient Features for Movie Recommendation Systems
Project presentation
Suvir Bhargav
Outline
● Motivation and why movie reviews
● Problem statement
● How? or the overall system
● Text preprocessing approaches
● Postprocessing: movie topics from a reviews corpus
● Similarity
● Experimental setup and results
Thanks to Sean Lind, source: http://www.silveroakcasino.com/blog/posts/netflix/what-to-watch-on-netflix.html
Motivation
● movie genres are not enough.
● classify movies
○ keywords
○ moods
○ IMDb ratings
○ micro genres
micro genres
source: http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/
Why movie reviews?
Source: a sample user-written movie review from IMDb
Problem statement
● Feature extraction from user reviews of movies
● Use extracted features to find similar movies.
The overall system
Movie reviews corpus
● preprocessing
○ tokenization, stopword removal, lemmatization.
● post processing
○ topic modeling: movie topics from a reviews corpus
● similarity measure
○ return movies with similar topic distributions
Simple information extraction
Text preprocessing
Figure credit to nltk book.
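The preprocessing steps named above (tokenization, stopword removal, lemmatization) can be illustrated with a small standard-library sketch. It is not the pipeline used in the project, which relied on nltk and gensim. The stopword list is a tiny hand-picked sample, and `crude_lemma` is a hypothetical stand-in that only strips a plural "s"; a real lemmatizer does far more.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK ships a much fuller one).
STOPWORDS = {"a", "an", "and", "the", "is", "was", "were",
             "of", "in", "it", "this", "to", "but"}

def crude_lemma(token):
    """Rough stand-in for a lemmatizer: strips a trailing plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(review):
    """Lowercase, tokenize on letters, drop stopwords, then normalize tokens."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [crude_lemma(t) for t in tokens if t not in STOPWORDS]

tokens = preprocess("The villains in this movie were scary, but the plot was thin.")
```

Here `tokens` holds the content words of the review, ready to be counted or fed to a topic model.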
Post processing
Document representation: Vector Space Model (VSM)
Picture credit: pyevolve
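As a minimal sketch of the vector space model, each preprocessed review can be turned into a vector of term counts over a shared vocabulary. The documents below are made-up toy data, and the helper names `build_vocabulary` and `to_vector` are hypothetical, not part of any library.

```python
from collections import Counter

def build_vocabulary(docs):
    """Collect every distinct token across the corpus, in sorted order."""
    return sorted({token for doc in docs for token in doc})

def to_vector(doc, vocab):
    """Represent one document as term counts aligned with the vocabulary."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

# Toy corpus of already-tokenized reviews (made-up data).
docs = [
    ["scary", "ghost", "ghost"],
    ["funny", "romance"],
    ["scary", "funny"],
]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]
```

Every document now has the same length as the vocabulary, so documents can be compared position by position.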
Post processing: generative model
source: David Blei’s slides
Post processing: LDA
For each document in the collection, the words are generated in a two-stage process:
1) Randomly choose a distribution over topics.
2) For each word in the document
a) Randomly choose a topic from the distribution over topics in step 1.
b) Randomly choose a word from the corresponding distribution over the vocabulary.
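The two-stage generative process above can be sketched in plain Python. This only simulates how LDA assumes documents arise; it does not fit a model. The two topics, their word probabilities, and the Dirichlet parameter `alpha` are made-up illustrative values, and `sample_dirichlet` and `generate_document` are hypothetical helper names.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document following the two-stage process."""
    vocab = sorted(topics[0])
    # Step 1: choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Step 2a: choose a topic from the document's topic distribution.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # Step 2b: choose a word from that topic's distribution over the vocabulary.
        weights = [topics[k][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

# Two hypothetical topics over a toy vocabulary (made-up probabilities).
topics = [
    {"ghost": 0.5, "scream": 0.4, "laugh": 0.05, "romance": 0.05},
    {"ghost": 0.05, "scream": 0.05, "laugh": 0.5, "romance": 0.4},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
```

Because `theta` mixes both topics, a single generated document can contain words from each of them.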
Documents exhibit multiple topics
Movie topics from a reviews corpus
Similarity Measure
● Cosine Similarity● KL divergence● Hellinger distance
Cosine Similarity
Hellinger Distance
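The three measures listed above can be written down for two topic distributions p and q. The functions below use the standard textbook definitions, which may differ in detail from the formulas shown on the original slides. The project used numpy; this sketch uses only the standard library so it stays self-contained, and the small `eps` in the KL divergence is an assumption added to avoid dividing by zero.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); asymmetric, and 0 when the distributions are identical."""
    return sum(a * math.log((a + eps) / (b + eps))
               for a, b in zip(p, q) if a > 0)

def hellinger_distance(p, q):
    """Hellinger distance; symmetric and bounded between 0 and 1."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)

# Two made-up topic distributions over three topics.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
scores = (cosine_similarity(p, q), kl_divergence(p, q), hellinger_distance(p, q))
```

Cosine similarity grows as distributions become more alike, while KL divergence and Hellinger distance shrink, so rankings built on them sort in opposite directions.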
The overall system: implementation
Movie reviews corpus
● preprocessing
○ nltk and gensim’s simple preprocessing.
● post processing
○ gensim python wrapper to MALLET
○ index the topic distributions of the query movie, q, and of the 1k movies corpus, C.
● similarity measure
○ python numpy implementation
○ apply distance metric on indexed q and C.
○ sort and pick top 5 movies.
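The last stage above (apply a distance metric between q and every movie in C, sort, and pick the top 5) can be sketched as follows. The movie names and topic distributions are made-up placeholders, Hellinger distance is chosen only as one of the measures the slides mention, and `top_similar` is a hypothetical helper. The project used numpy; plain Python is used here to keep the sketch self-contained.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)

def top_similar(query, corpus, k=5):
    """Return the k corpus titles whose topic distributions are closest to query."""
    scored = [(hellinger_distance(query, dist), title)
              for title, dist in corpus.items()]
    scored.sort()  # smallest distance first
    return [title for _, title in scored[:k]]

# Made-up topic distributions for a toy corpus C (three topics per movie).
corpus = {
    "movie_a": [0.7, 0.2, 0.1],
    "movie_b": [0.6, 0.3, 0.1],
    "movie_c": [0.5, 0.3, 0.2],
    "movie_d": [0.3, 0.4, 0.3],
    "movie_e": [0.2, 0.3, 0.5],
    "movie_f": [0.05, 0.05, 0.9],
}
query = [0.7, 0.2, 0.1]  # topic distribution of the query movie q
recommendations = top_similar(query, corpus, k=5)
```

With a similarity measure such as cosine similarity, the sort would need to be reversed so that the largest scores come first.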
Experimental setup
Movie reviews corpus of 1k movies
reviews data source: IMDb
Evaluation criteria
Conclusion
● Movie topics as efficient features for RS
○ represent movies by underlying semantic patterns
○ useful for capturing movie genre and mood.
○ but not so well with plot.
○ user-written movie reviews are useful movie meta-data.
● The developed prototype
○ easy to add more movie meta-data
○ python allows scalability.
○ Topics as an explanation need further tuning.
Future directions
● Movie review preprocessing
○ bigrams, trigrams.
○ create multi-word movie keywords or language construction
● Building complex topic models
○ Hierarchical LDA
○ author-topic model
■ include authorship information.
■ similarity between authors
Questions ?
Thank You
Image src: http://www.brinvy.biz/177215/batman-catching-a-ride-on-supermans-back-funny-hd-wallpaper-x.html
Extra slides
List of extra slides and notes
● Original LDA paper
● Introduction to probabilistic topic modeling
● A. Huang’s "Similarity measures for text document clustering"
● Another good LDA description
● Integrating out multinomial parameters in LDA
● Language construction in micro genres
LDA