Topic Retrieval and Articles Recommendation (cs229.stanford.edu/proj2016spr/poster/050.pdf · 2017. 9. 23.)



Yu Wang, Xinyu Shen, Jinzhi Wang

Motivation

For a paper of interest, searching by hand for more papers on a similar topic is time-consuming and effortful. We propose a method of exploring papers through machine learning that saves this effort.

Paper selection by human (time- and effort-consuming): starting from an existing paper, the reader extracts multiple possible topics by hand, searches for each topic, collects multiple papers per topic, and finally narrows them down to the result papers.

Paper selection by machine (higher efficiency and accuracy): starting from an existing paper, machine learning extracts the topics, compares the paper and topics against the data, and the machine selects the result papers.

Data & Preprocessing

Data source: 2011 ~ 2015 CS229 course project reports

Number of documents: 1,298
Total words: 2.4 million
Unique words: 6,522
Dictionary size: 39,588

For each document:
• Convert the file from PDF to txt.
• Remove non-English tokens (numbers, symbols, signs, etc.).
• Convert all words to lowercase.
• Remove word suffixes (word stemming) [obtains the dictionary].
• Remove trivial stop words (e.g. the, and, was, we).

Vectorize the selected features into w columns and the documents into d rows, giving the text feature matrix X, where X_i,j is the number of appearances of word j in document i.

To remove uncommon words such as names and rare jargon, we filter out words that appear fewer than 3 times [obtains the unique words]. An example of a paragraph before and after processing:

Welcome to CS 229! This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include: supervised learning ( generative / discriminative learning…

welcome  course  provid broad  introduc machine  learn  statistic  pattern  recogni topic  include  supervis learn  generat discriminat learn
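The preprocessing pipeline above can be sketched roughly as follows. This is an illustrative stand-in, not the authors' code: the suffix list is a toy assumption in place of a real stemmer, and the stop-word list is abbreviated.

```python
import re
from collections import Counter

# Abbreviated stop-word list (assumption; the real list would be longer).
STOP_WORDS = {"the", "and", "was", "we", "to", "a", "this"}
# Toy suffix list standing in for a proper stemmer such as Porter's.
SUFFIXES = ("tion", "ing", "ive", "es", "s")

def stem(word):
    """Strip one common suffix (very rough word stemming)."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text):
    """Lowercase, keep only alphabetic tokens, drop stop words, stem."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())  # removes numbers/symbols
    return [stem(t) for t in tokens if t not in STOP_WORDS]

doc = "Welcome to CS 229! This course provides a broad introduction."
tokens = preprocess(doc)
counts = Counter(tokens)  # one row of the text feature matrix X
print(tokens)
```

Counting the processed tokens per document, and then filtering out words that appear fewer than 3 times across the corpus, yields the feature matrix X described above.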

Methodology

Training
• Set the cluster number to 20.
• K-means: obtain a topic assignment for each document.
• Latent Dirichlet Allocation (LDA): obtain a topic assignment for each word.

Analyzing
• Matrix Y: labeling matrix for documents.
• Matrix Z: labeling matrix for words.
• Convert matrix X / Z to a matrix of topic distributions over documents for the k-means / LDA method.

Testing
• Add a row vector representing the word composition of the test document and plug it into the k-means / LDA method.
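The k-means training step can be sketched in miniature as below. This is a minimal pure-Python illustration on tiny toy data, not the project's implementation; the actual experiments use k = 20 on the full document-word matrix, presumably via a library routine.

```python
import random

def kmeans(rows, k, iters=20, seed=0):
    """Minimal k-means: returns one cluster label per row (matrix Y)."""
    random.seed(seed)
    centers = [list(r) for r in random.sample(rows, k)]
    assign = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, r in enumerate(rows):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centers[c])),
            )
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Four toy "documents" as 4-word count vectors, two per topic.
X = [[5, 0, 1, 0], [4, 1, 0, 0], [0, 5, 0, 2], [1, 4, 0, 1]]
labels = kmeans(X, k=2)  # documents 0,1 land together; 2,3 land together
```

LDA would instead assign topic proportions at the word level (matrix Z), from which a per-document topic distribution is derived.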

Analyzing & Testing

Matrix X (documents × words): every document has a unique word combination; rows are Doc 1 … Doc i and columns are Word 1 … Word j. Each document is compared by cosine distance, referring to matrix Y.

Matrix Y (documents × topics): the distribution of Topic 1 … Topic 20 over each document. If testing, a row vector of the test document's word composition is added, yielding the distribution of topics on the test document.

Matrix Z (words × topics): describes which topic/cluster each word belongs to, with words in rows and Topic 1 … Topic 20 in columns. A document's topic is the cluster with the highest share among all topics; referring to the word composition of each document, the topic distribution of the test document is compared against its test vector of word composition.
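The testing step described above, comparing a test document's topic distribution against matrix Y by cosine distance, can be sketched as follows. Names and the toy data are illustrative, not taken from the poster.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recommend(Y, test_dist, top_n=3):
    """Rank training documents by similarity to the test distribution."""
    ranked = sorted(range(len(Y)), key=lambda i: cosine(Y[i], test_dist),
                    reverse=True)
    return ranked[:top_n]

# Toy 3-topic distributions for four training docs (rows of matrix Y).
Y = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
test_dist = [0.75, 0.15, 0.10]  # topic distribution of the test document
print(recommend(Y, test_dist, top_n=2))  # → [0, 2]
```

Documents 0 and 2 share the test document's dominant topic, so they rank first; in the project this ranking produces the top-3 recommendation lists shown in the Results.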

Results

Reading list
• Machine Learning Applied to the Detection of Retinal Blood Vessels
• Supervised Deep Learning For Multi-Class Image Classification

Top 3 recommendation list

K-means:
• Implementing Machine Learning Algorithms on GPUs for Real-Time Traffic Sign Classification
• Equation to LaTeX
• Object classification for autonomous vehicle navigation of Stanford campus

LDA:
• Pedestrian Detection Using Structured SVM
• FarmX: Leaf based disease identification in farms
• Identifying Gender From Images of Faces

Comparison

[Figures: distribution; relation between topic and paper (k-means); word frequency for each topic (LDA)]

• Documents recommended by the k-means method have a distribution very similar to the compound distribution of the reading-list papers.
• The distribution of documents recommended by LDA deviates more; this may indicate more variance error.
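One way to put a number on how much a recommended set's topic distribution deviates from the reading list's compound distribution is total variation distance. This metric and the toy distributions below are our illustrative choice, not the poster's actual measurement.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Toy 3-topic distributions (illustrative values only).
reading_list = [0.60, 0.30, 0.10]  # compound distribution of reading list
kmeans_recs  = [0.55, 0.35, 0.10]  # close to the reading list
lda_recs     = [0.30, 0.30, 0.40]  # deviates more

print(total_variation(reading_list, kmeans_recs))  # → 0.05
print(total_variation(reading_list, lda_recs))     # → 0.30
```

A smaller distance for the k-means recommendations would correspond to the "very similar distribution" observation above.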

[Figure: clusters for k-means & unique words]

Training diagram: starting from matrix X (Doc 1 … Doc i over Word 1 … Word j), set the number of clusters/labels (k = 20). K-means assigns a cluster number to each document, giving matrix Y (one cluster/topic label per document, e.g. 1, 2, 3, …). LDA assigns a cluster number to each word, giving matrix Z (topic proportions per word, e.g. word 1: Topic 1 30%, Topic 2 12%, …).