Word2vec: From intuition to practice using gensim



Page 1: Word2vec: From intuition to practice using gensim

WORD2VEC: FROM INTUITION TO PRACTICE USING GENSIM

Edgar Marca
[email protected]

Python Peru Meetup
September 1st, 2016
Lima - Perú

Page 2: Word2vec: From intuition to practice using gensim

About Edgar Marca

# Software Engineer at Love Mondays.
# One of the organizers of the Data Science Lima Meetup.
# Machine Learning and Data Science enthusiast.
# I speak a little Portuguese.


Page 3: Word2vec: From intuition to practice using gensim

DATA SCIENCE LIMA MEETUP

Page 4: Word2vec: From intuition to practice using gensim

Data Science Lima Meetup

Data

# 5 meetups so far, with the 6th just around the corner.
# 410 datanauts in the Meetup group.
# 329 people in the Facebook group.

Organizers

# Manuel Solorzano.
# Dennis Barreda.
# Freddy Cahuas.
# Edgar Marca.


Page 5: Word2vec: From intuition to practice using gensim

Data Science Lima Meetup

Figure: Photo of the fifth Data Science Lima Meetup.

Page 6: Word2vec: From intuition to practice using gensim

DATA

Page 7: Word2vec: From intuition to practice using gensim

Data Never Sleeps

Figure: How much data is generated every minute?¹

¹ Data Never Sleeps 3.0: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/


Page 8: Word2vec: From intuition to practice using gensim

NATURAL LANGUAGE PROCESSING

Page 9: Word2vec: From intuition to practice using gensim

Introduction

# Text is the core business of internet companies today.
# Machine Learning and natural language processing techniques are applied to big datasets to improve search, ranking, and many other tasks (spam detection, ad recommendations, email categorization, machine translation, speech recognition, etc.).


Page 10: Word2vec: From intuition to practice using gensim

Natural Language Processing

Problems with text

# Messy.
# Irregularities of the language.
# Hierarchical structure.
# Sparse nature.


Page 11: Word2vec: From intuition to practice using gensim

REPRESENTATIONS FOR TEXTS

Page 12: Word2vec: From intuition to practice using gensim

Contextual Representation


Page 13: Word2vec: From intuition to practice using gensim

How to learn good representations?


Page 14: Word2vec: From intuition to practice using gensim

One-hot Representation

One-hot encoding

Represent every word as a vector in $\mathbb{R}^{|V|}$ with all 0s and a single 1 at the index of that word.


Page 15: Word2vec: From intuition to practice using gensim

One-hot Representation

Example:

Let $V = \{\text{the}, \text{hotel}, \text{nice}, \text{motel}\}$. Then

$$
w_{\text{the}} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix},\quad
w_{\text{hotel}} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix},\quad
w_{\text{nice}} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix},\quad
w_{\text{motel}} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
$$

We represent each word as a completely independent entity. This word representation does not directly give us any notion of similarity.


Page 16: Word2vec: From intuition to practice using gensim

One-hot Representation

For instance

$$\langle w_{\text{hotel}}, w_{\text{motel}} \rangle_{\mathbb{R}^4} = 0 \qquad (1)$$

$$\langle w_{\text{hotel}}, w_{\text{cat}} \rangle_{\mathbb{R}^4} = 0 \qquad (2)$$

We can try to reduce the size of this space from $\mathbb{R}^4$ to something smaller and find a subspace that encodes the relationships between words.
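A minimal numpy sketch of these one-hot vectors (numpy and the variable names are illustrative, not from the talk):

```python
import numpy as np

# One-hot vectors for the vocabulary V = {the, hotel, nice, motel}.
vocab = ["the", "hotel", "nice", "motel"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Every pair of distinct one-hot vectors is orthogonal, so their
# dot product is 0: the representation encodes no similarity at all.
print(one_hot["hotel"] @ one_hot["motel"])  # 0.0
print(one_hot["hotel"] @ one_hot["nice"])   # 0.0
```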


Page 17: Word2vec: From intuition to practice using gensim

One-hot Representation

Problems

# The dimension depends on the vocabulary size.
# Leads to data sparsity, so we need more data.
# Provides no useful similarity information to the system.
# Encodings are arbitrary.


Page 18: Word2vec: From intuition to practice using gensim

Bag-of-words representation

# Sum of one-hot codes.
# Ignores the order of words.

Examples:

# vocabulary = (monday, tuesday, is, a, today)
# "monday monday" = [2, 0, 0, 0, 0]
# "today is monday" = [1, 0, 1, 0, 1]
# "today is tuesday" = [0, 1, 1, 0, 1]
# "is a monday today" = [1, 0, 1, 1, 1]
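A minimal Python sketch of the bag-of-words construction above (the helper function and the extra example string are illustrative, not from the talk):

```python
from collections import Counter

vocabulary = ["monday", "tuesday", "is", "a", "today"]

def bag_of_words(text):
    """Count how many times each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

print(bag_of_words("today is monday"))  # [1, 0, 1, 0, 1]
print(bag_of_words("monday is today"))  # [1, 0, 1, 0, 1]  (same vector: order is lost)
```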


Page 19: Word2vec: From intuition to practice using gensim

Distributional hypothesis

"You shall know a word by the company it keeps!"
Firth (1957)


Page 20: Word2vec: From intuition to practice using gensim

Language Modeling (Unigrams, Bigrams, etc)

A language model is a probabilistic model that assigns a probability to any sequence of $n$ words, $P(w_1, w_2, \ldots, w_n)$.

Unigrams

Assuming that the word occurrences are completely independent:

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i) \qquad (3)$$


Page 21: Word2vec: From intuition to practice using gensim

Language Modeling (Unigrams, Bigrams, etc)

Bigrams

The probability of the sequence depends on the pairwise probability of each word in the sequence and the word next to it.

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1}) \qquad (4)$$
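A minimal sketch of estimating these probabilities by counting (the toy corpus is illustrative, not from the talk):

```python
from collections import Counter

# Illustrative toy corpus.
corpus = ["today is monday", "today is tuesday", "is a monday today"]

tokens = [w for s in corpus for w in s.split()]
unigrams = Counter(tokens)
bigrams = Counter()
for s in corpus:
    words = s.split()
    bigrams.update(zip(words, words[1:]))

total = sum(unigrams.values())

def p_unigram(sentence):
    """Equation (3): product of independent word probabilities."""
    p = 1.0
    for w in sentence.split():
        p *= unigrams[w] / total
    return p

def p_bigram(sentence):
    """Equation (4): product of P(w_i | w_{i-1}) estimated by counting."""
    words = sentence.split()
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

print(p_unigram("today is monday"))  # 0.018: treats words as independent
print(p_bigram("today is monday"))   # 0.222: uses word-to-word context
```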


Page 22: Word2vec: From intuition to practice using gensim

Word Embeddings

Word Embeddings

A set of language modeling and feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers in a low-dimensional space relative to the vocabulary size ("continuous space").

# Vector space models (VSMs) represent (embed) words in a continuous vector space.
# Semantically similar words are mapped to nearby points.
# The basic idea is the Distributional Hypothesis: words that appear in the same contexts share semantic meaning.


Page 23: Word2vec: From intuition to practice using gensim

WORD2VEC

Page 24: Word2vec: From intuition to practice using gensim

Distributional hypothesis

"You shall know a word by the company it keeps!"
Firth (1957)


Page 25: Word2vec: From intuition to practice using gensim

Word2Vec

Figure: The two original papers published in association with word2vec by Mikolov et al. (2013).

# Efficient Estimation of Word Representations in Vector Space: https://arxiv.org/abs/1301.3781
# Distributed Representations of Words and Phrases and their Compositionality: https://arxiv.org/abs/1310.4546

Page 26: Word2vec: From intuition to practice using gensim

Continuous Bag of Words and Skip-gram


Page 27: Word2vec: From intuition to practice using gensim

Contextual Representation

A word is represented by its context in use.


Page 28: Word2vec: From intuition to practice using gensim

Contextual Representation


Page 29: Word2vec: From intuition to practice using gensim

Word Vectors


Page 30: Word2vec: From intuition to practice using gensim

Word Vectors


Page 31: Word2vec: From intuition to practice using gensim

Word Vectors


Page 32: Word2vec: From intuition to practice using gensim

Word Vectors


Page 33: Word2vec: From intuition to practice using gensim

Word2Vec

# $v_{king} - v_{man} + v_{woman} \approx v_{queen}$
# $v_{paris} - v_{france} + v_{italy} \approx v_{rome}$
# Learns from raw text.
# Made a huge splash in the NLP world.
# Comes pretrained, if you don't have any specialized vocabulary (see the sketch below).
# Word2vec is a computationally efficient model for learning word embeddings.
# Word2Vec is a successful example of "shallow" learning.
# A very simple feedforward neural network with a single hidden layer, backpropagation, and no non-linearities.
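The analogies above are easy to check with pretrained vectors. A minimal sketch, assuming gensim's downloader module (added to gensim after this talk) and one of its bundled GloVe models; any pretrained KeyedVectors, such as the Google News word2vec model, works the same way:

```python
import gensim.downloader as api

# Download a small set of pretrained vectors (illustrative choice of model).
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# paris - france + italy ≈ rome
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
```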

Page 34: Word2vec: From intuition to practice using gensim

Word2vec


Page 35: Word2vec: From intuition to practice using gensim

Gensim

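Training a model on your own corpus takes a few lines with gensim. A minimal sketch (the toy corpus is illustrative; parameter names follow gensim 4.x, where the older `size` argument of the talk's era became `vector_size`):

```python
from gensim.models import Word2Vec

# Illustrative toy corpus: gensim expects an iterable of tokenized sentences.
sentences = [
    ["data", "science", "lima", "meetup"],
    ["python", "peru", "meetup"],
    ["machine", "learning", "and", "data", "science"],
]

# sg=1 selects the skip-gram architecture; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(model.wv["data"])               # the learned 100-dimensional vector
print(model.wv.most_similar("data"))  # nearest words by cosine similarity
```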

Page 36: Word2vec: From intuition to practice using gensim

APPLICATIONS

Page 37: Word2vec: From intuition to practice using gensim

What the Fuck Are Trump Supporters Thinking?


Page 38: Word2vec: From intuition to practice using gensim

What the Fuck Are Trump Supporters Thinking?


Page 39: Word2vec: From intuition to practice using gensim

What the Fuck Are Trump Supporters Thinking?

# They gathered four million tweets belonging to more than two thousand hard-core Trump supporters.

# Distances between those vectors encoded the semantic distance between their associated words (e.g. the vector representation of the word morons was near idiots but far away from funny).

Link: https://medium.com/adventurous-social-science/what-the-fuck-are-trump-supporters-thinking-ecc16fb66a8d


Page 40: Word2vec: From intuition to practice using gensim

Restaurant Recommendation

http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opentable


Page 41: Word2vec: From intuition to practice using gensim

Restaurant Recommendation

http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opentable


Page 42: Word2vec: From intuition to practice using gensim

Song Recommendations

Link: https://social.shorthand.com/mawsonguy/3CfQA8mj2S/playlist-harvesting

Page 43: Word2vec: From intuition to practice using gensim

TAKEAWAYS

Page 44: Word2vec: From intuition to practice using gensim

Takeaways

# If you don't have enough data, you can use pre-trained models.
# Remember: garbage in, garbage out.
# Every data set will come out with different results.
# Use Word2vec as a feature extractor (see the sketch below).
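One common way to use Word2vec as a feature extractor is to average the vectors of a document's words. A minimal sketch, assuming a trained gensim model like the one above (the helper name is illustrative, not from the talk):

```python
import numpy as np

def document_vector(model, tokens):
    """Average the word2vec vectors of the in-vocabulary tokens in a document."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

# The resulting fixed-length vector can feed any downstream
# classifier, e.g. scikit-learn's LogisticRegression.
features = document_vector(model, ["data", "science", "meetup"])
```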


Page 45: Word2vec: From intuition to practice using gensim


Page 46: Word2vec: From intuition to practice using gensim

Obrigado (Thank you)
