Predictive Text Embedding using LINE
Shashank Gupta (201507574), Nishant Prateek (201225113), Karan Chandnani (201505507)
IRE Major Project, Spring 2016
Outline
1. Introduction: Predictive Text Embedding, Text Networks
2. Embeddings: Bipartite Network Embedding, Heterogeneous Text Network Embedding
3. Training Algorithm: Pre-training + Fine-tuning
4. Methods and Experiments: Dataset, Experiments
5. Results: Accuracy, Discussion
6. Bibliography
Predictive Text Embedding: An Introduction
Retains the advantages of unsupervised text embeddings, but naturally utilizes labeled information in representation learning.
An effective low-dimensional representation is learned jointly from limited labeled examples and a large amount of unlabeled examples.
Optimized for particular tasks, e.g. text classification, sentiment analysis, etc.
Text Networks
Word-Word Network: the word co-occurrence network, denoted G_ww = (V, E_ww), captures the word co-occurrence information in local contexts of the unlabeled data. V is a vocabulary of words and E_ww is the set of edges between words.
Word-Document Network: denoted G_wd = (V ∪ D, E_wd), is a bipartite network where D is a set of documents and V is a set of words. E_wd is the set of edges between words and documents.
Word-Label Network: denoted G_wl = (V ∪ L, E_wl), is a bipartite network that captures category-level word co-occurrences. L is a set of class labels and V a set of words. E_wl is a set of edges between words and classes.
Heterogeneous Text Network: the combination of the word-word, word-document, and word-label networks constructed from both unlabeled and labeled text data. It captures different levels of word co-occurrences and contains both labeled and unlabeled information.
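The three networks above can be built from raw counts. A minimal sketch in Python, assuming tokenized documents and using raw co-occurrence / frequency counts as edge weights (the function name and window size are illustrative, not from the paper):

```python
from collections import Counter

def build_text_networks(docs, labels=None, window=2):
    """Build edge-weight counters for the three networks of the
    heterogeneous text network.

    docs: list of token lists; labels: optional list of class labels.
    Edge weights are raw co-occurrence counts.
    """
    e_ww = Counter()  # word-word: co-occurrence within a sliding window
    e_wd = Counter()  # word-document: frequency of word in document
    e_wl = Counter()  # word-label: frequency of word under a label
    for d, tokens in enumerate(docs):
        for i, w in enumerate(tokens):
            e_wd[(w, d)] += 1
            if labels is not None:
                e_wl[(w, labels[d])] += 1
            # symmetric word-word edges within the context window
            for c in tokens[i + 1 : i + 1 + window]:
                e_ww[(w, c)] += 1
                e_ww[(c, w)] += 1
    return e_ww, e_wd, e_wl
```

Combining the counters built from labeled and unlabeled documents yields the heterogeneous text network.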
Bipartite Network Embedding
The heterogeneous text network is composed of three individual bipartite networks, so we need a method to embed each of these bipartite networks into a low-dimensional space.
Given a bipartite network G = (V_A ∪ V_B, E), where V_A and V_B are two disjoint sets of vertices of different types and E is the set of edges between them, we first define the conditional probability of vertex v_i in set V_A generated by vertex v_j in set V_B as:
p(v_i \mid v_j) = \frac{\exp(\vec{u}_i^T \cdot \vec{u}_j)}{\sum_{i' \in A} \exp(\vec{u}_{i'}^T \cdot \vec{u}_j)}
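This conditional probability is simply a softmax over the side-A vertices. A minimal numeric sketch with NumPy (the max-subtraction step is a standard stability trick we add here; the function name is illustrative):

```python
import numpy as np

def conditional_prob(U_a, U_b, i, j):
    """p(v_i | v_j) = exp(u_i . u_j) / sum_{i'} exp(u_{i'} . u_j).

    U_a: embedding matrix for side-A vertices (one row per vertex);
    U_b: embedding matrix for side-B vertices.
    """
    scores = U_a @ U_b[j]      # dot products of v_j with every side-A vertex
    scores -= scores.max()     # stabilize the softmax numerically
    p = np.exp(scores)
    return p[i] / p.sum()
```

By construction, summing over all i in V_A gives 1 for any fixed v_j.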
Heterogeneous Text Network Embedding
O_{pte} = O_{ww} + O_{wd} + O_{wl}
where
O_{ww} = -\sum_{(i,j) \in E_{ww}} w_{ij} \log p(v_i \mid v_j),
O_{wd} = -\sum_{(i,j) \in E_{wd}} w_{ij} \log p(v_i \mid d_j),
O_{wl} = -\sum_{(i,j) \in E_{wl}} w_{ij} \log p(v_i \mid l_j)
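Each term of the objective can be evaluated directly from these definitions. A small sketch, assuming edge weights stored as a dict of (i, j) index pairs and dense embedding matrices (names are illustrative):

```python
import numpy as np

def softmax_prob(U_a, U_b, i, j):
    """p(v_i | v_j), a softmax over the side-A vertices."""
    s = U_a @ U_b[j]
    s -= s.max()           # numerical stability
    e = np.exp(s)
    return e[i] / e.sum()

def network_objective(edges, U_a, U_b):
    """O = -sum_{(i,j) in E} w_ij log p(v_i | v_j) for one bipartite
    network; O_pte is this quantity summed over E_ww, E_wd and E_wl."""
    return -sum(w * np.log(softmax_prob(U_a, U_b, i, j))
                for (i, j), w in edges.items())
```

In practice the softmax denominator over the whole vocabulary is too expensive, which is why training uses negative sampling (next section) rather than evaluating this objective exactly.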
Pre-training + Fine-Tuning
We used the pre-training and fine-tuning approach to optimize the objective function O_pte. We learn the embeddings with unlabeled data first, and then fine-tune the embeddings with the word-label network.
Algorithm: Pre-training + Fine-tuning
Data: G_ww, G_wd, G_wl, number of samples T, number of negative samples K
Result: word embeddings w
while iter ≤ T do
    sample an edge from E_ww and draw K negative edges; update the word embeddings
    sample an edge from E_wd and draw K negative edges; update the word and document embeddings
end while
while iter ≤ T do
    sample an edge from E_wl and draw K negative edges; update the word embeddings
end while
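One phase of the loop above can be sketched as LINE-style SGD with negative sampling. This is a toy illustration, assuming list-valued embeddings and uniform negative sampling over side-A vertices (real implementations sample negatives from a degree^0.75 noise distribution and run updates asynchronously):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_bipartite(edges, emb_a, emb_b, T=1000, K=2, lr=0.025):
    """SGD with negative sampling on a single bipartite network.

    edges: list of (i, j, weight) with i from side A and j from side B;
    emb_a, emb_b: dicts mapping vertex ids to embedding lists, updated
    in place. Assumes side A has at least two vertices so negatives
    can be drawn.
    """
    weights = [w for _, _, w in edges]
    nodes_a = list(emb_a)
    for _ in range(T):
        # sample an edge with probability proportional to its weight
        i, j, _ = random.choices(edges, weights=weights, k=1)[0]
        negatives = [n for n in nodes_a if n != i]
        pairs = [(i, 1.0)] + [(random.choice(negatives), 0.0)
                              for _ in range(K)]
        for a, label in pairs:
            # logistic gradient: pull positives together, push negatives apart
            dot = sum(x * y for x, y in zip(emb_a[a], emb_b[j]))
            g = lr * (label - sigmoid(dot))
            for d in range(len(emb_b[j])):
                ua, ub = emb_a[a][d], emb_b[j][d]
                emb_a[a][d] += g * ub
                emb_b[j][d] += g * ua
```

Pre-training alternates this update over E_ww and E_wd; fine-tuning then runs it over E_wl only.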
Dataset
For this project, we use the Large Movie Review Dataset. This dataset consists of 25,000 movie reviews from IMDB.
The test set additionally contains another 25,000 movie reviews.
Apart from this, there are another 50,000 unlabeled movie reviews that we used for unsupervised pre-training.
Experiments
For the first phase of the project, we did unsupervised training on the Movie Review Dataset using the word2vec (skip-gram) model. This served as the baseline for further experiments.
For the next part, we tried unsupervised training on the word-word network described in section 2. We first tried training with random initialization. We then decided to leverage the word2vec embeddings and use those as the initialization for G_ww. This gave us slightly better results than random initialization.
For the third part, we took the unsupervised embeddings obtained in the previous step (with word2vec initialization) and fine-tuned them with the word-label network from section 2, using random edge sampling. The probability of each edge being sampled is proportional to its weight in the heterogeneous text network. This gave us a further increase in performance.
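Weight-proportional edge sampling can be implemented with a cumulative-weight table and binary search, which is the simple scheme we used in place of the paper's alias method. A sketch (function name illustrative):

```python
import bisect
import itertools
import random

def make_edge_sampler(edges):
    """Return a sampler that draws edges with probability proportional
    to their weights, via a cumulative-weight table and binary search
    (O(log |E|) per draw after O(|E|) setup).

    edges: dict mapping edge keys (e.g. (word, doc) pairs) to weights.
    """
    items = list(edges.items())                       # [(key, weight), ...]
    cum = list(itertools.accumulate(w for _, w in items))
    def sample():
        r = random.uniform(0.0, cum[-1])
        return items[bisect.bisect_left(cum, r)][0]
    return sample
```

For example, an edge with weight 3.0 alongside one with weight 1.0 should be drawn roughly 75% of the time.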
Accuracy
Algorithm                                  Accuracy
Skip-gram (word2vec)                       84.86%
Unsupervised (G_ww)                        75.37%
Unsupervised + Fine-tuning (G_ww + G_wl)   77.83%
Discussion
Though the unsupervised pre-training + fine-tuning approach gave us the best results among our network-embedding runs, it still underperforms the skip-gram model. Our results fail to align with those reported in the paper. This could be a result of replacing the alias table method for edge sampling in step 3 with simple random sampling.
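For reference, the alias table (Walker's) method mentioned above gives O(1) draws after O(n) setup, versus O(log n) per draw for binary search over cumulative weights. A sketch of the standard construction (not the paper's exact code):

```python
import random

def build_alias_table(weights):
    """Walker's alias method: normalize weights to mean 1, then pair
    each under-full cell with an over-full 'alias' cell."""
    n = len(weights)
    total = sum(weights)
    prob = [w * n / total for w in weights]
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                     # s's leftover mass comes from l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_draw(prob, alias):
    """O(1) weighted draw: pick a cell uniformly, then keep it or
    fall through to its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```

With weights [3, 1], index 0 should come up roughly 75% of the time, matching the weight-proportional sampling the training algorithm requires.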
References
J. Weston, S. Chopra, and K. Adams. #TagSpace: Semantic embeddings from hashtags. In EMNLP, pages 1822-1827, 2014.
J. Tang, M. Qu, and Q. Mei. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165-1174. ACM, 2015.
J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In WWW, pages 1067-1077, 2015.
Large Movie Review Dataset: http://ai.stanford.edu/~amaas/data/sentiment/