Topic Retrieval and Articles Recommendation (cs229.stanford.edu/proj2016spr/poster/050.pdf · 2017. 9. 23.)



Yu Wang, Xinyu Shen, Jinzhi Wang

Motivation

For a paper of interest, searching by hand for more papers on a similar topic is time-consuming and effortful. We propose a method of exploring papers through machine learning that saves this effort.

Paper selection by human (time- and effort-consuming): starting from an existing paper, the reader extracts multiple possible topics by hand, searches for each topic, collects multiple papers per topic, and finally narrows them down to the result papers.

Paper selection by machine (higher efficiency and accuracy): starting from an existing paper, machine learning extracts the topics, compares the paper and topics against the data, and the machine selects the result papers.

Data & Preprocessing

Data source: 2011 ~ 2015 CS229 course project reports

Number of documents: 1,298
Total words: 2.4 million
Unique words: 6,522
Dictionary size: 39,588

For each document:
• Convert the file from PDF to txt.
• Remove non-English tokens (numbers, symbols, signs, etc.).
• Convert all words to lowercase.
• Remove word suffixes (word stemming) [obtains the dictionary].
• Remove trivial stop words (e.g. the, and, was, we).

Vectorize the selected features into w columns and the documents into d rows, giving the text feature matrix X, where X_i,j is the number of appearances of word j in document i.

To remove uncommon words such as names and rare jargon, we filter out words that appear fewer than 3 times [obtains the unique words]. An example of a paragraph before and after processing:

Welcome to CS 229! This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include: supervised learning ( generative / discriminative learning…

welcome  course  provid broad  introduc machine  learn  statistic  pattern  recogni topic  include  supervis learn  generat discriminat learn
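The preprocessing pipeline above can be sketched roughly as follows. This is an illustrative stand-in, not the authors' code: the suffix list is a toy assumption in place of a real stemmer, and the stop-word list is abbreviated.

```python
import re
from collections import Counter

# Abbreviated stop-word list (assumption; the real list would be longer).
STOP_WORDS = {"the", "and", "was", "we", "to", "a", "this"}
# Toy suffix list standing in for a proper stemmer such as Porter's.
SUFFIXES = ("tion", "ing", "ive", "es", "s")

def stem(word):
    """Strip one common suffix (very rough word stemming)."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text):
    """Lowercase, keep only alphabetic tokens, drop stop words, stem."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())  # removes numbers/symbols
    return [stem(t) for t in tokens if t not in STOP_WORDS]

doc = "Welcome to CS 229! This course provides a broad introduction."
tokens = preprocess(doc)
counts = Counter(tokens)  # one row of the text feature matrix X
print(tokens)
```

Counting the processed tokens per document, and then filtering out words that appear fewer than 3 times across the corpus, yields the feature matrix X described above.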

Methodology

Training
• Set the cluster number to 20.
• K-means: obtain a topic assignment for each document.
• Latent Dirichlet Allocation (LDA): obtain a topic assignment for each word.

Analyzing
• Matrix Y: labeling matrix for documents.
• Matrix Z: labeling matrix for words.
• Convert matrix X / Z to a matrix of topic distributions over documents for the k-means / LDA method.

Testing
• Add a row vector representing the word composition of the test document and plug it into the k-means / LDA method.
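The k-means training step can be sketched in miniature as below. This is a minimal pure-Python illustration on tiny toy data, not the project's implementation; the actual experiments use k = 20 on the full document-word matrix, presumably via a library routine.

```python
import random

def kmeans(rows, k, iters=20, seed=0):
    """Minimal k-means: returns one cluster label per row (matrix Y)."""
    random.seed(seed)
    centers = [list(r) for r in random.sample(rows, k)]
    assign = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, r in enumerate(rows):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centers[c])),
            )
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Four toy "documents" as 4-word count vectors, two per topic.
X = [[5, 0, 1, 0], [4, 1, 0, 0], [0, 5, 0, 2], [1, 4, 0, 1]]
labels = kmeans(X, k=2)  # documents 0,1 land together; 2,3 land together
```

LDA would instead assign topic proportions at the word level (matrix Z), from which a per-document topic distribution is derived.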

Analyzing & Testing

Matrix X (documents × words): every document has a unique word combination; rows are Doc 1 … Doc i and columns are Word 1 … Word j. Each document is compared by cosine distance, referring to matrix Y.

Matrix Y (documents × topics): the distribution of Topic 1 … Topic 20 over each document. If testing, a row vector of the test document's word composition is added, yielding the distribution of topics on the test document.

Matrix Z (words × topics): describes which topic/cluster each word belongs to, with words in rows and Topic 1 … Topic 20 in columns. A document's topic is the cluster with the highest share among all topics; referring to the word composition of each document, the topic distribution of the test document is compared against its test vector of word composition.
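The testing step described above, comparing a test document's topic distribution against matrix Y by cosine distance, can be sketched as follows. Names and the toy data are illustrative, not taken from the poster.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recommend(Y, test_dist, top_n=3):
    """Rank training documents by similarity to the test distribution."""
    ranked = sorted(range(len(Y)), key=lambda i: cosine(Y[i], test_dist),
                    reverse=True)
    return ranked[:top_n]

# Toy 3-topic distributions for four training docs (rows of matrix Y).
Y = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
test_dist = [0.75, 0.15, 0.10]  # topic distribution of the test document
print(recommend(Y, test_dist, top_n=2))  # → [0, 2]
```

Documents 0 and 2 share the test document's dominant topic, so they rank first; in the project this ranking produces the top-3 recommendation lists shown in the Results.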

Results

Reading list
• Machine Learning Applied to the Detection of Retinal Blood Vessels
• Supervised Deep Learning For Multi-Class Image Classification

Top 3 recommendation list

K-means:
• Implementing Machine Learning Algorithms on GPUs for Real-Time Traffic Sign Classification
• Equation to LaTeX
• Object classification for autonomous vehicle navigation of Stanford campus

LDA:
• Pedestrian Detection Using Structured SVM
• FarmX: Leaf based disease identification in farms
• Identifying Gender From Images of Faces

Comparison

[Figures: distribution; relation between topic and paper (k-means); word frequency for each topic (LDA)]

• Documents recommended by the k-means method have a distribution very similar to the compound distribution of the reading-list papers.
• The distribution of documents recommended by LDA deviates more; this may indicate more variance error.
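One way to put a number on how much a recommended set's topic distribution deviates from the reading list's compound distribution is total variation distance. This metric and the toy distributions below are our illustrative choice, not the poster's actual measurement.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Toy 3-topic distributions (illustrative values only).
reading_list = [0.60, 0.30, 0.10]  # compound distribution of reading list
kmeans_recs  = [0.55, 0.35, 0.10]  # close to the reading list
lda_recs     = [0.30, 0.30, 0.40]  # deviates more

print(total_variation(reading_list, kmeans_recs))  # → 0.05
print(total_variation(reading_list, lda_recs))     # → 0.30
```

A smaller distance for the k-means recommendations would correspond to the "very similar distribution" observation above.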

[Figure: clusters for k-means & unique words]

Training diagram: starting from matrix X (Doc 1 … Doc i over Word 1 … Word j), set the number of clusters/labels (k = 20). K-means assigns a cluster number to each document, giving matrix Y (one cluster/topic label per document, e.g. 1, 2, 3, …). LDA assigns a cluster number to each word, giving matrix Z (topic proportions per word, e.g. word 1: Topic 1 30%, Topic 2 12%, …).