Topic Retrieval and Articles Recommendation
cs229.stanford.edu/proj2016spr/poster/050.pdf · 2017. 9. 23.
Topic Retrieval and Articles Recommendation
Yu Wang, Xinyu Shen, Jinzhi Wang
Motivation

For a paper of interest, searching by hand for more papers on a similar topic is time-consuming and effortful. We propose a new method of exploring papers through machine learning that saves this effort.

Paper selection by human (time- and effort-consuming):
Existing paper → topics extracted by human → multiple possible topics → search for each topic → multiple papers for each topic → result papers

Paper selection by machine (higher efficiency and accuracy):
Existing paper → machine learning → machine-extracted topics → compare paper and topics in data → machine select → result papers

Data & Preprocessing
Data source: 2011–2015 CS229 course project reports.

Number of documents | Total words | Unique words | Dictionary size
1,298 | 2.4 million | 6,522 | 39,588
For each document:
• Convert from PDF to txt
• Remove non-English tokens (numbers, symbols, signs, etc.)
• Convert all words to lowercase
• Remove word suffixes (word stemming) [obtained dictionary]
• Remove trivial words (e.g. the, and, was, we)

Vectorize the selected features into w columns and the documents into d rows; this gives the text feature matrix X, where X_{i,j} is the number of appearances of word j in document i. To remove non-common words such as names or rare jargon, we filter out words that appear fewer than 3 times [obtained unique words]. An example of a paragraph before and after processing:
Before: Welcome to CS 229! This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include: supervised learning (generative / discriminative learning…

After: welcome course provid broad introduc machine learn statistic pattern recogni topic include supervis learn generat discriminat learn
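The pipeline above can be sketched in Python. This is a minimal sketch, not the authors' code: the stop-word list and the crude suffix stripper stand in for a real stemmer and stop list, and treating the < 3 filter as a corpus-wide count threshold is an assumption.

```python
import re
from collections import Counter

# Illustrative subset of trivial words; the authors' full stop list is not given.
STOP_WORDS = {"the", "and", "was", "we", "to", "a", "of", "in", "this"}

def strip_suffix(word):
    # Crude suffix removal as a stand-in for a real stemmer (longest suffix first).
    for suf in ("ation", "tion", "ing", "ive", "es", "ed", "al", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text):
    # Keep lowercase English words only, drop stop words, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]

def feature_matrix(docs, min_count=3):
    # Build X with one row per document and one column per kept word;
    # words with corpus-wide count < min_count are filtered out.
    counts = [Counter(preprocess(d)) for d in docs]
    total = Counter()
    for c in counts:
        total.update(c)
    vocab = sorted(w for w, n in total.items() if n >= min_count)
    X = [[c[w] for w in vocab] for c in counts]
    return vocab, X
```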
Methodology

Training
Set the cluster number to 20.
• K-means: obtain a topic assignment for each document.
• Latent Dirichlet Allocation (LDA): obtain a topic assignment for each word.

Analyzing
• Matrix Y: labeling matrix for documents.
• Matrix Z: labeling matrix for words.
• Convert matrix X / Z into a matrix of topic distributions over documents in the k-means / LDA method.

Testing
• Add a row vector representing the word composition of the test document and plug it into the k-means / LDA method.
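The k-means training step can be sketched as a minimal pure-Python clustering of the document count vectors. This is a simplified sketch under the poster's setup (k clusters, one label per document); function names and the toy iteration scheme are illustrative, not the authors' implementation.

```python
import random

def kmeans(X, k=20, iters=50, seed=0):
    # X: list of document feature vectors (rows of the text feature matrix).
    # Returns one cluster label per document, i.e. the contents of Matrix Y.
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(X, k)]
    labels = [0] * len(X)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, x in enumerate(X):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])),
            )
        # Update step: move each center to the mean of its assigned documents.
        for c in range(k):
            members = [X[i] for i in range(len(X)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

On the real data, k would be 20 and X would have 1,298 rows; the toy example below uses two obvious clusters.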
[Diagram: analyzing and testing, k-means path]
Matrix X: every document has a unique word combination (rows Doc 1 … Doc i, columns Word 1 … Word j). For each document, cosine distance is computed with reference to Matrix Y, yielding the distribution of topics for each document (rows Doc 1 … Doc i plus the test doc, columns Topic 1 … Topic 20). If testing, a row vector of the word composition of the test document is added, giving the distribution of topics for the test doc.
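The cosine-distance lookup in this path could look like the sketch below (helper names are hypothetical). Ranking by highest cosine similarity is equivalent to ranking by lowest cosine distance.

```python
from math import sqrt

def cosine_similarity(a, b):
    # Standard cosine similarity between two topic-distribution vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_docs(doc_topics, test_topics, top_n=3):
    # Rank training documents by similarity of their topic distribution
    # to the test document's topic distribution; return the top_n indices.
    order = sorted(
        range(len(doc_topics)),
        key=lambda i: -cosine_similarity(doc_topics[i], test_topics),
    )
    return order[:top_n]
```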
[Diagram: analyzing and testing, LDA path]
Matrix Z describes which topic/cluster each word belongs to (rows word 1 … word j, columns Topic 1 … Topic 20). Referring to the word composition of each document, a topic distribution is obtained for each document (rows Doc 1 … Doc i plus the test doc, columns Topic 1 … Topic 20); the document's topic is the cluster with the highest weight. If testing, the test vector of word composition of the document is compared in the same way.
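Projecting a document's word counts through the word-topic matrix Z can be sketched as below. This is a hedged sketch: it assumes each row of Z holds a word's topic weights, and the function names are illustrative.

```python
def doc_topic_from_words(X, Z):
    # X: docs x words count matrix; Z: words x topics weight matrix from LDA.
    # Each document's topic distribution is its word counts projected onto
    # topics and normalized to sum to 1.
    dist = []
    for row in X:
        topics = [
            sum(row[w] * Z[w][t] for w in range(len(row)))
            for t in range(len(Z[0]))
        ]
        total = sum(topics) or 1.0
        dist.append([v / total for v in topics])
    return dist

def dominant_topic(topic_dist):
    # "The doc topic is the highest cluster among all."
    return max(range(len(topic_dist)), key=lambda t: topic_dist[t])
```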
Results

Reading list:
• Machine Learning Applied to the Detection of Retinal Blood Vessels
• Supervised Deep Learning For Multi-Class Image Classification

Top 3 recommendation list:

K-means:
• Implementing Machine Learning Algorithms on GPUs for Real-Time Traffic Sign Classification
• Equation to LaTeX
• Object classification for autonomous vehicle navigation of Stanford campus

LDA:
• Pedestrian Detection Using Structured SVM
• FarmX: Leaf based disease identification in farms
• Identifying Gender From Images of Faces
Comparison

[Figures: topic distributions — relation between topic and paper (k-means); word frequency for each topic (LDA)]

• Documents recommended by the k-means method have a distribution very similar to the compound distribution of the reading-list papers.
• The distribution of documents recommended by LDA deviates more, which may indicate higher variance error.
[Figure: clusters for k-means and unique words]

[Diagram: training]
Matrix X (rows Doc 1 … Doc i, columns Word 1 … Word j); the number of clusters/labels is set to k = 20. K-means assigns a cluster number to each document, producing Matrix Y (one cluster/topic label per document, e.g. 1, 2, 3, …). LDA assigns a topic distribution to each word, producing Matrix Z (rows word 1 … word j, columns Topic 1, Topic 2, …, with percentage weights such as 30%, 12%, …).