Introduction to word embeddings with Python

Transcript of Introduction to word embeddings with Python

Page 1

Introduction to word embeddings

Pavel Kalaidin (@facultyofwonder)

Moscow Data Fest, September 12th, 2015

Page 4

distributional hypothesis

Page 5

лойс

Page 6

nice, лойс

лойс for the song

I won't give it a лойс, on principle

mutual лойсы

лойс if you agree

What is the meaning of лойс?

Page 8

кек

Page 9

кек, or what?

кек)))))))

you're such a кек

What is the meaning of кек?

Page 11

vectorial representations of words

Page 12

simple and flexible platform for understanding text and probably not messing up

Page 13

one-hot encoding?

[1 0 0 0 0 0 ... 0 0]  (a vocabulary-sized vector with a single 1)
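
A minimal sketch of what this looks like in code; the toy vocabulary is made up for illustration:

import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]          # toy vocabulary
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # a |V|-dimensional vector of zeros with a single 1 at the word's index
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("cat"))  # [0. 1. 0. 0. 0.]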

Page 14

co-occurrence matrix

recall: word-document co-occurrence matrix for LSA

Page 16

from the entire document to a window (length 5-10)
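
A sketch of window-based co-occurrence counting (the window length and toy sentence are illustrative):

from collections import defaultdict

def cooccurrence_counts(tokens, window=5):
    # counts[(w, c)] = how often c occurs within `window` positions of w
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".split()
print(cooccurrence_counts(tokens, window=2)[("cat", "sat")])  # 1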

Page 17

still seems suboptimal -> big, sparse, etc.

Page 18

lower dimensions, we want dense vectors (say, 25-1000)

Page 19

How?

Page 20

matrix factorization?

Page 21

SVD of co-occurrence matrix
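
One way to sketch this step: a truncated SVD of a sparse co-occurrence matrix, keeping the top k dimensions as word vectors (the matrix here is a random stand-in for one built as above):

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

X = sparse_random(10000, 10000, density=0.001, format="csr")  # stand-in co-occurrence matrix
U, S, Vt = svds(X, k=100)   # truncated SVD: top 100 singular triplets
word_vectors = U * S        # each row: a dense 100-dim word vector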

Page 22

lots of memory?

Page 23

idea: directly learn low-dimensional vectors

Page 24

here comes word2vec

Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al., 2013 [paper]

Page 25

idea: instead of capturing co-occurrence counts, predict surrounding words

Page 26

Two models:

CBOW: predicting the word given its context

skip-gram: predicting the context given a word

Explained in great detail here, so we'll skip it for now. Also see: word2vec Parameter Learning Explained, Rong [paper]

Page 28

CBOW: several times faster than skip-gram, slightly better accuracy for frequent words.

Skip-gram: works well with a small amount of data, represents rare words and phrases well.
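
In gensim (which the talk uses below) the choice between the two models is a single flag; a minimal sketch with a made-up corpus:

from gensim.models import Word2Vec

toy = [["the", "cat", "sat"], ["the", "dog", "sat"]]  # made-up corpus
cbow = Word2Vec(toy, min_count=1, sg=0)       # sg=0: CBOW (the default)
skipgram = Word2Vec(toy, min_count=1, sg=1)   # sg=1: skip-gram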

Page 29

Examples?

Page 36

w_woman - w_man = w_queen - w_king

classic example
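
As a vector query this is one line; the sketch below loads a pretrained model through gensim's downloader (a later gensim convenience, not part of the talk; the model name is one of its stock options):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # any pretrained vectors would do

# w_king - w_man + w_woman should land near w_queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically: [('queen', ...)]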

Page 37

<censored example>

Page 38

word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method, Goldberg and Levy, 2014 [arxiv]

Page 39

all done with gensim: github.com/piskvorky/gensim/
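
A minimal training sketch (the corpus is made up; note the size parameter of older gensim versions is called vector_size in gensim 4):

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]  # made-up corpus

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vec = model.wv["cat"]                # a dense 100-dim vector
print(model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity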

Page 40

...failing to take advantage of the vast amount of repetition in the data

Page 41

so back to co-occurrences

Page 42

GloVe, for Global Vectors. Pennington et al., 2014: nlp.stanford.edu/pubs/glove.pdf

Page 43

Ratios seem to cancel noise

Page 44

The gist: model ratios with vectors
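
For reference, the starting point from the GloVe paper, which the next slides refine step by step: with P_ik the probability that word k appears in the context of word i, look for a function F of the word vectors that captures the ratio:

F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}
\;\;\longrightarrow\;\;
F\left((w_i - w_j)^\top \tilde{w}_k\right) = \frac{P_{ik}}{P_{jk}}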

Page 45

The model

Page 46

Preserving linearity

Page 47

Preventing mixing dimensions

Page 48

Restoring symmetry, part 1

Page 49

recall:

Page 51

Restoring symmetry, part 2

Page 52

Least squares problem it is now
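
The resulting objective from the GloVe paper, a weighted least-squares fit to the log co-occurrence counts:

J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha}, & x < x_{\max} \\ 1, & \text{otherwise} \end{cases}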

Page 53

SGD -> AdaGrad

Page 54

ok, Python code

Page 55

glove-python: github.com/maciejkula/glove-python
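
A usage sketch following the glove-python README (toy corpus; parameter values are illustrative):

from glove import Corpus, Glove

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]  # made-up corpus

corpus = Corpus()
corpus.fit(sentences, window=5)        # build the co-occurrence matrix

glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=10, no_threads=2)
glove.add_dictionary(corpus.dictionary)

print(glove.most_similar("cat"))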

Page 56

two sets of vectors: input and context, plus biases

combine them: average / sum / drop one

Page 57

complexity: |V|^2

Page 58

complexity: |C|^0.8 in practice

Page 59

Evaluation: it works

Page 60

#spb #gatchina #msk #kyiv #minsk #helsinki

Page 61

Compared to word2vec

Page 62

#spb #gatchina #msk #kyiv #minsk #helsinki

Page 65

Abusing models

Page 66

music playlists: github.com/mattdennewitz/playlist-to-vec

Page 67

deep walk: DeepWalk: Online Learning of Social Representations [link]

Page 69

predicting hashtags; interesting read: #TAGSPACE: Semantic Embeddings from Hashtags [link]

Page 70

RusVectōrēs: distributional semantic models for Russian: ling.go.mail.ru/dsm/en/

Page 72

corpus matters

Page 73

building block for bigger models ╰(*´︶`*)╯

Page 74
