Predictive Text Embedding using LINE
Shashank Gupta (201507574), Nishant Prateek (201225113), Karan Chandnani (201505507)
IRE Major Project, Spring 2016
Outline
1. Introduction: Predictive Text Embedding, Text Networks
2. Embeddings: Bipartite Network Embedding, Heterogeneous Text Network Embedding
3. Training Algorithm: Pre-training + Fine-tuning
4. Methods and Experiments: Dataset, Experiments
5. Results: Accuracy, Discussion
6. Bibliography
Predictive Text Embedding: An Introduction
Retains the advantages of unsupervised text embeddings, but naturally utilizes labeled information in representation learning.
An effective low-dimensional representation is learned jointly from limited labeled examples and a large amount of unlabeled examples.
Optimized for particular tasks, e.g. text classification, sentiment analysis, etc.
Text Networks
Word-Word Network: the word co-occurrence network, denoted G_ww = (V, E_ww), captures the word co-occurrence information in local contexts of the unlabeled data. V is a vocabulary of words and E_ww is the set of edges between words.
Word-Document Network: denoted G_wd = (V ∪ D, E_wd), is a bipartite network where D is a set of documents and V is a set of words. E_wd is the set of edges between words and documents.
Word-Label Network: denoted G_wl = (V ∪ L, E_wl), is a bipartite network that captures category-level word co-occurrences. L is a set of class labels and V a set of words. E_wl is a set of edges between words and classes.
Heterogeneous Text Network: the combination of the word-word, word-document, and word-label networks constructed from both unlabeled and labeled text data. It captures different levels of word co-occurrences and contains both labeled and unlabeled information.
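The three networks above can be built from raw counts. A minimal sketch in Python, assuming tokenized documents and using raw co-occurrence / frequency counts as edge weights (the function name and window size are illustrative, not from the paper):

```python
from collections import Counter

def build_text_networks(docs, labels=None, window=2):
    """Build edge-weight counters for the three networks of the
    heterogeneous text network.

    docs: list of token lists; labels: optional list of class labels.
    Edge weights are raw co-occurrence counts.
    """
    e_ww = Counter()  # word-word: co-occurrence within a sliding window
    e_wd = Counter()  # word-document: frequency of word in document
    e_wl = Counter()  # word-label: frequency of word under a label
    for d, tokens in enumerate(docs):
        for i, w in enumerate(tokens):
            e_wd[(w, d)] += 1
            if labels is not None:
                e_wl[(w, labels[d])] += 1
            # symmetric word-word edges within the context window
            for c in tokens[i + 1 : i + 1 + window]:
                e_ww[(w, c)] += 1
                e_ww[(c, w)] += 1
    return e_ww, e_wd, e_wl
```

Combining the counters built from labeled and unlabeled documents yields the heterogeneous text network.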
Bipartite Network Embedding
The heterogeneous text network is composed of three individual bipartite networks, so we need a method to embed each of these bipartite networks into a low-dimensional space.
Given a bipartite network G = (V_A ∪ V_B, E), where V_A and V_B are two disjoint sets of vertices of different types and E is the set of edges between them, we first define the conditional probability of vertex v_i in set V_A generated by vertex v_j in set V_B as:
p(v_i \mid v_j) = \frac{\exp(\vec{u}_i^T \cdot \vec{u}_j)}{\sum_{i' \in A} \exp(\vec{u}_{i'}^T \cdot \vec{u}_j)}
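This conditional probability is simply a softmax over the side-A vertices. A minimal numeric sketch with NumPy (the max-subtraction step is a standard stability trick we add here; the function name is illustrative):

```python
import numpy as np

def conditional_prob(U_a, U_b, i, j):
    """p(v_i | v_j) = exp(u_i . u_j) / sum_{i'} exp(u_{i'} . u_j).

    U_a: embedding matrix for side-A vertices (one row per vertex);
    U_b: embedding matrix for side-B vertices.
    """
    scores = U_a @ U_b[j]      # dot products of v_j with every side-A vertex
    scores -= scores.max()     # stabilize the softmax numerically
    p = np.exp(scores)
    return p[i] / p.sum()
```

By construction, summing over all i in V_A gives 1 for any fixed v_j.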
Heterogeneous Text Network Embedding
O_{pte} = O_{ww} + O_{wd} + O_{wl}
where
O_{ww} = -\sum_{(i,j) \in E_{ww}} w_{ij} \log p(v_i \mid v_j),
O_{wd} = -\sum_{(i,j) \in E_{wd}} w_{ij} \log p(v_i \mid d_j),
O_{wl} = -\sum_{(i,j) \in E_{wl}} w_{ij} \log p(v_i \mid l_j)
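Each term of the objective can be evaluated directly from these definitions. A small sketch, assuming edge weights stored as a dict of (i, j) index pairs and dense embedding matrices (names are illustrative):

```python
import numpy as np

def softmax_prob(U_a, U_b, i, j):
    """p(v_i | v_j), a softmax over the side-A vertices."""
    s = U_a @ U_b[j]
    s -= s.max()           # numerical stability
    e = np.exp(s)
    return e[i] / e.sum()

def network_objective(edges, U_a, U_b):
    """O = -sum_{(i,j) in E} w_ij log p(v_i | v_j) for one bipartite
    network; O_pte is this quantity summed over E_ww, E_wd and E_wl."""
    return -sum(w * np.log(softmax_prob(U_a, U_b, i, j))
                for (i, j), w in edges.items())
```

In practice the softmax denominator over the whole vocabulary is too expensive, which is why training uses negative sampling (next section) rather than evaluating this objective exactly.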
Pre-training + Fine-Tuning
We used the pre-training and fine-tuning approach to optimize the objective function O_pte. We learn the embeddings with unlabeled data first, and then fine-tune the embeddings with the word-label network.
Algorithm: Pre-training + Fine-tuning
Data: G_ww, G_wd, G_wl, number of samples T, number of negative samples K
Result: word embeddings w
while iter ≤ T do
    sample an edge from E_ww and draw K negative edges; update the word embeddings
    sample an edge from E_wd and draw K negative edges; update the word and document embeddings
end while
while iter ≤ T do
    sample an edge from E_wl and draw K negative edges; update the word embeddings
end while
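One phase of the loop above can be sketched as LINE-style SGD with negative sampling. This is a toy illustration, assuming list-valued embeddings and uniform negative sampling over side-A vertices (real implementations sample negatives from a degree^0.75 noise distribution and run updates asynchronously):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_bipartite(edges, emb_a, emb_b, T=1000, K=2, lr=0.025):
    """SGD with negative sampling on a single bipartite network.

    edges: list of (i, j, weight) with i from side A and j from side B;
    emb_a, emb_b: dicts mapping vertex ids to embedding lists, updated
    in place. Assumes side A has at least two vertices so negatives
    can be drawn.
    """
    weights = [w for _, _, w in edges]
    nodes_a = list(emb_a)
    for _ in range(T):
        # sample an edge with probability proportional to its weight
        i, j, _ = random.choices(edges, weights=weights, k=1)[0]
        negatives = [n for n in nodes_a if n != i]
        pairs = [(i, 1.0)] + [(random.choice(negatives), 0.0)
                              for _ in range(K)]
        for a, label in pairs:
            # logistic gradient: pull positives together, push negatives apart
            dot = sum(x * y for x, y in zip(emb_a[a], emb_b[j]))
            g = lr * (label - sigmoid(dot))
            for d in range(len(emb_b[j])):
                ua, ub = emb_a[a][d], emb_b[j][d]
                emb_a[a][d] += g * ub
                emb_b[j][d] += g * ua
```

Pre-training alternates this update over E_ww and E_wd; fine-tuning then runs it over E_wl only.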
Dataset
For this project, we use the Large Movie Review Dataset. This dataset consists of 25,000 movie reviews from IMDB.
The test set additionally contains another 25,000 movie reviews.
Apart from this, there are another 50,000 unlabeled movie reviews that we used for unsupervised pre-training.
Experiments
For the first phase of the project, we did unsupervised training on the Movie Review Dataset using the word2vec (skip-gram) model. This served as the baseline for further experiments.
For the next part, we tried unsupervised training on the word-word network described in section 2. We first tried training with random initialization. We then decided to leverage the word2vec embeddings and use those as the initialization for G_ww. This gave us slightly better results than random initialization.
For the third part, we took the unsupervised embeddings obtained in the previous step (with word2vec initialization) and fine-tuned them with the word-label network from section 2, using random edge sampling. The probability of each edge being sampled is proportional to its weight in the heterogeneous text network. This gave us a further increase in performance.
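Weight-proportional edge sampling can be implemented with a cumulative-weight table and binary search, which is the simple scheme we used in place of the paper's alias method. A sketch (function name illustrative):

```python
import bisect
import itertools
import random

def make_edge_sampler(edges):
    """Return a sampler that draws edges with probability proportional
    to their weights, via a cumulative-weight table and binary search
    (O(log |E|) per draw after O(|E|) setup).

    edges: dict mapping edge keys (e.g. (word, doc) pairs) to weights.
    """
    items = list(edges.items())                       # [(key, weight), ...]
    cum = list(itertools.accumulate(w for _, w in items))
    def sample():
        r = random.uniform(0.0, cum[-1])
        return items[bisect.bisect_left(cum, r)][0]
    return sample
```

For example, an edge with weight 3.0 alongside one with weight 1.0 should be drawn roughly 75% of the time.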
Accuracy
Algorithm                                  Accuracy
Skip-gram (word2vec)                       84.86%
Unsupervised (G_ww)                        75.37%
Unsupervised + Fine-tuning (G_ww + G_wl)   77.83%
Discussion
Though the unsupervised pre-training + fine-tuning approach gave us the best results among our network-embedding runs, it still underperforms the skip-gram model. Our results fail to align with those reported in the paper. This could be a result of replacing the alias table method for edge sampling in step 3 with simple random sampling.
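For reference, the alias table (Walker's) method mentioned above gives O(1) draws after O(n) setup, versus O(log n) per draw for binary search over cumulative weights. A sketch of the standard construction (not the paper's exact code):

```python
import random

def build_alias_table(weights):
    """Walker's alias method: normalize weights to mean 1, then pair
    each under-full cell with an over-full 'alias' cell."""
    n = len(weights)
    total = sum(weights)
    prob = [w * n / total for w in weights]
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                     # s's leftover mass comes from l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_draw(prob, alias):
    """O(1) weighted draw: pick a cell uniformly, then keep it or
    fall through to its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```

With weights [3, 1], index 0 should come up roughly 75% of the time, matching the weight-proportional sampling the training algorithm requires.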
References
J. Weston, S. Chopra, and K. Adams. #TagSpace: Semantic embeddings from hashtags. In EMNLP, pages 1822-1827, 2014.
J. Tang, M. Qu, and Q. Mei. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165-1174. ACM, 2015.
J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In WWW, pages 1067-1077, 2015.
Large Movie Review Dataset: http://ai.stanford.edu/~amaas/data/sentiment/