Introduction to word embeddings with Python


Introduction to word embeddings

Pavel Kalaidin, @facultyofwonder

Moscow Data Fest, September 12th, 2015

the distributional hypothesis: words that occur in similar contexts tend to have similar meanings

лойс

"годно, лойс" ("good stuff, лойс")

"лойс за песню" ("лойс for the song")

"из принципа не поставлю лойс" ("I won't give a лойс on principle")

"взаимные лойсы" ("mutual лойсы")

"лойс, если согласен" ("лойс if you agree")

What is the meaning of лойс?


кек

"кек, что ли?" ("кек, really?")

"кек)))))))"

"ну ты кек" ("you're such a кек")

What is the meaning of кек?


vector representations of words

a simple and flexible platform for understanding text (and probably not messing up)

one-hot encoding?

1 0 0 0 0 0 0 0 … 0 0 (a single 1 in a |V|-dimensional vector)
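
A minimal sketch of what this looks like in code (the toy vocabulary is made up for illustration):

```python
# one-hot encoding: a |V|-dimensional vector with a single 1
vocab = ["cat", "dog", "likes", "milk"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("dog"))  # [0, 1, 0, 0]
```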

co-occurrence matrix

recall: word-document co-occurrence matrix for LSA

from the entire document to a window (length 5-10)
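
A sketch of window-based counting (window size and toy corpus are made up):

```python
# count co-occurrences within a symmetric window around each word
from collections import defaultdict

def cooccurrences(sentences, window=5):
    counts = defaultdict(int)
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    counts[(word, sent[j])] += 1
    return counts

corpus = [["the", "cat", "likes", "milk"], ["the", "dog", "likes", "milk"]]
print(cooccurrences(corpus, window=2)[("cat", "milk")])  # 1
```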

still seems suboptimal -> big, sparse, etc.

lower dimensions: we want dense vectors (say, 25-1000)

How?

matrix factorization?

SVD of co-occurrence matrix

lots of memory?
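
One way this looks in code, a sketch assuming SciPy (matrix size and density are stand-ins):

```python
# dense vectors via truncated SVD of a (stand-in) co-occurrence matrix
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

X = sparse_random(5000, 5000, density=0.001, format="csr")  # pretend co-occurrence counts
U, S, Vt = svds(X, k=100)  # keep only the top 100 singular values
word_vectors = U * S       # one dense 100-dim vector per word (row)
```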

idea: directly learn low-dimensional vectors

here comes word2vec

Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al., 2013 [paper]

idea: instead of capturing co-occurrence counts

predict surrounding words

Two models:

CBOW: predicting the word given its context

skip-gram: predicting the context given a word

Explained in great detail here, so we'll skip it for now. Also see: word2vec Parameter Learning Explained, Rong [paper]

CBOW: several times faster than skip-gram, slightly better accuracy for frequent words

skip-gram: works well with small amounts of data, represents rare words and phrases well

Examples?

w_woman - w_man = w_queen - w_king

classic example

<censored example>

word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method, Goldberg and Levy, 2014 [arxiv]

all done with gensim: github.com/piskvorky/gensim/
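
A minimal training sketch (toy corpus; keyword names follow gensim 4.x, where older versions used size= instead of vector_size=):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "likes", "milk"], ["the", "dog", "likes", "milk"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1: skip-gram

print(model.wv.most_similar("cat", topn=3))
# on a real corpus, the classic analogy becomes a query:
# model.wv.most_similar(positive=["woman", "king"], negative=["man"])
```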

...failing to take advantage of the vast amount of repetition in the data

so back to co-occurrences

GloVe, for Global Vectors. Pennington et al., 2014: nlp.stanford.edu/pubs/glove.pdf

ratios of co-occurrence probabilities seem to cancel noise

The gist: model ratios with vectors
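
(In the paper's notation, roughly: find word vectors w and context vectors w̃ such that F(w_i - w_j, w̃_k) = P_ik / P_jk, where P_ik = X_ik / X_i is the probability of seeing word k in the context of word i.)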

The model

Preserving linearity

Preventing mixing dimensions

Restoring symmetry, part 1


Restoring symmetry, part 2

Least squares problem it is now
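
(The weighted least-squares objective from the paper: J = Σ_{i,j} f(X_ij) (w_i·w̃_j + b_i + b̃_j - log X_ij)², where f is a weighting function that caps the influence of very frequent co-occurrences.)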

SGD -> AdaGrad

ok, Python code

glove-python: github.com/maciejkula/glove-python
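
A usage sketch following the project's README (toy corpus, made-up hyperparameters):

```python
from glove import Corpus, Glove

sentences = [["the", "cat", "likes", "milk"], ["the", "dog", "likes", "milk"]]

corpus = Corpus()
corpus.fit(sentences, window=5)          # build the co-occurrence matrix

glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=10, no_threads=2)
glove.add_dictionary(corpus.dictionary)  # attach the word -> id mapping

print(glove.most_similar("cat", number=3))
```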

two sets of vectors: input and context (+ biases)

to combine them: average / sum / drop one

complexity: |V|^2 (worst case: all word pairs)

complexity: |C|^0.8 (in practice: non-zero co-occurrences only)

Evaluation: it works

#spb #gatchina #msk #kyiv #minsk #helsinki

Compared to word2vec

#spb #gatchina #msk #kyiv #minsk #helsinki

Abusing models

music playlists: github.com/mattdennewitz/playlist-to-vec
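
The trick, sketched (not the repo's actual code): treat each playlist as a sentence and each track ID as a word, then train word2vec as usual.

```python
# playlists as sentences, track IDs as words (made-up data)
from gensim.models import Word2Vec

playlists = [["track_42", "track_7", "track_99"],
             ["track_7", "track_99", "track_13"]]
model = Word2Vec(playlists, vector_size=50, window=5, min_count=1)
print(model.wv.most_similar("track_7", topn=2))  # tracks that co-occur with track_7
```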

deep walk: DeepWalk: Online Learning of Social Representations [link]

predicting hashtags; interesting read: #TAGSPACE: Semantic Embeddings from Hashtags [link]

RusVectōrēs: distributional semantic models for Russian: ling.go.mail.ru/dsm/en/

corpus matters

building block for bigger models ╰(*´︶`*)╯

</slides>