Word Embeddings

Page 1:

CSE 6240: Web Search and Text Mining. Spring 2020

Prof. Srijan Kumar with Arindum Roy and Roshan Pati

Word Embeddings

Page 2:

Administrivia

• Homework: Will be released today after class

• Project Reminder: Teams due Monday Jan 20.

• A fun exercise at the end of the class!

Page 3:

Homework Policy

• Late day policy: 3 late days (3 x 24-hour chunks)
  – Use as needed
• Collaboration:

  – OK to talk, discuss the questions, and explore potential directions for solving them. However, you must write your own solutions and code separately, NOT as a group activity.

  – Please list the students you collaborated with.
• Zero tolerance on plagiarism
  – Follow the GT academic honesty rules

Page 4:

Recap So Far

1. IR and text processing
2. Evaluation of IR systems

Page 5:

Today’s Lecture

• Representing words and phrases
  – Neural network basics
  – Word2vec
  – Continuous bag of words
  – Skip-gram model

Some slides in this lecture are inspired by the slides of Prof. Leonid Sigal, UBC.

Page 6:

Representing a Word: One Hot Encoding

• Given a vocabulary:

dog, cat, person, holding, tree, computer, using

Page 7:

Representing a Word: One Hot Encoding

• Given a vocabulary, assign each word an index:

dog      1
cat      2
person   3
holding  4
tree     5
computer 6
using    7

Page 8:

Representing a Word: One Hot Encoding

• Given a vocabulary, convert to One Hot Encoding:

dog      1  [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
cat      2  [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]
person   3  [ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ]
holding  4  [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ]
tree     5  [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 ]
computer 6  [ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ]
using    7  [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ]
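As a quick sketch (not from the slides), the same encoding in Python; the vocabulary and the 10-dimensional padding mirror the table above, and the helper name is illustrative:

```python
import numpy as np

# Vocabulary from the slide; the word with index i+1 gets a 1 in position i.
# Vectors are padded to 10 dimensions, as in the table above.
vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, dim=10):
    """Return a one-hot vector of length `dim` for `word`."""
    vec = np.zeros(dim, dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("person"))  # [0 0 1 0 0 0 0 0 0 0]
```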

Page 9:

Recap: Bag of Words Model

• Represent a document as a collection of words (after cleaning the document)
  – The order of words is irrelevant
  – The document “John is quicker than Mary” is indistinguishable from the doc “Mary is quicker than John”
• Rank documents according to the overlap between query words and document words

Page 10:

Representing Phrases: Bag of Words

bag of words representation

[Table columns: Dog, Cat, Person, Holding, Tree, Computer, Using]

Page 11:

Representing Phrases: Bag of Words

bag of words representation

person holding dog {3, 4, 1} [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]


Page 12:

Representing Phrases: Bag of Words

bag of words representation

person holding dog  {3, 4, 1}  [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat  {3, 4, 2}  [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]


Page 13:

Representing Phrases: Bag of Words

bag of words representation

person holding dog     {3, 4, 1}  [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat     {3, 4, 2}  [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer  {3, 7, 6}  [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]


Page 14:

Representing Phrases: Bag of Words

bag of words representation

person holding dog     {3, 4, 1}  [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat     {3, 4, 2}  [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer  {3, 7, 6}  [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]

person using computer + person holding cat  {3, 7, 6, 3, 4, 2}  [ 0, 1, 2, 1, 0, 1, 1, 0, 0, 0 ]

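A phrase vector is just the sum of the one-hot vectors of its words, which reproduces the rows above; a minimal sketch in Python (helper names are illustrative):

```python
import numpy as np

vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
index = {word: i for i, word in enumerate(vocab)}

def bag_of_words(phrase, dim=10):
    """Sum the one-hot vectors of the words in `phrase`; order is irrelevant."""
    vec = np.zeros(dim, dtype=int)
    for word in phrase.split():
        vec[index[word]] += 1
    return vec

print(bag_of_words("person holding dog"))
# [1 0 1 1 0 0 0 0 0 0]
print(bag_of_words("person using computer person holding cat"))
# [0 1 2 1 0 1 1 0 0 0]
```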

Page 15:

Distributional Hypothesis [Lenci, 2008]

• The degree of semantic similarity between two linguistic expressions is a function of the similarity of their linguistic contexts

• Similarity in meaning ∝ Similarity of context

• Simple definition: context = surrounding words

Page 16:

What Is The Meaning Of “Bardiwac”?

• He handed her glass of bardiwac.
• Beef dishes are made to complement the bardiwac.
• Nigel staggered to his feet, face flushed from too much bardiwac.
• Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
• I dined off bread and cheese and this excellent bardiwac.
• The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

Page 17:

What Is The Meaning Of “Bardiwac”?

• He handed her glass of bardiwac.
• Beef dishes are made to complement the bardiwac.
• Nigel staggered to his feet, face flushed from too much bardiwac.
• Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
• I dined off bread and cheese and this excellent bardiwac.
• The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

Inference: bardiwac is an alcoholic beverage made from grapes

Page 18:

Geometric Interpretation: Co-occurrence As Feature

• Recall the term-document matrix
  – Rows are terms, columns are documents, cells represent the number of times a term appears in a document
• Here we create a word-word co-occurrence matrix
  – Rows and columns are words
  – Cell (R, C) means “how many times does word C appear in the neighborhood of word R”
• Neighborhood = a window of fixed size around the word
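A minimal sketch of building such a matrix, assuming a tokenized corpus and a symmetric window (function name and window size are illustrative):

```python
from collections import defaultdict

def cooccurrence_matrix(tokens, window=2):
    """counts[R][C] = number of times word C appears within
    `window` positions (the neighborhood) of word R."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

tokens = "the cat sat on the mat the dog sat on the log".split()
counts = cooccurrence_matrix(tokens)
print(counts["sat"]["on"])  # 2: 'on' falls in the window of both uses of 'sat'
```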

Page 19:

Row Vectors in Co-occurrence Matrix

• A row vector describes the usage of the word in the corpus/document
• Row vectors can be seen as coordinates of points in an n-dimensional Euclidean space
• Example: n = 2, dimensions = ‘get’ and ‘use’

[Figure: co-occurrence matrix]

Page 20:

Distance And Similarity

• Selected two dimensions: ‘get’ and ‘use’
• Similarity between words = spatial proximity in the dimension space
• Measured by the Euclidean distance

Page 21:

Distance And Similarity

• Exact position in the space depends on the frequency of the word
• More frequent words will appear farther from the origin
  – E.g., say ‘dog’ is more frequent than ‘cat’
  – Does not mean it is more important
• Solution: Ignore the length and look only at the direction

Page 22:

Angle And Similarity

• Angle ignores the exact location of the point
• Method: Normalize by the length of the vectors, or use only the angle as a distance measure
• Standard metric: Cosine similarity between vectors
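To make the contrast concrete, a sketch with made-up row vectors in the (‘get’, ‘use’) space: ‘dog’ is five times as frequent as ‘cat’ but points in the same direction, so the Euclidean distance is large while the cosine similarity is 1 (all numbers illustrative):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; ignores vector lengths."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

dog = np.array([35.0, 15.0])  # co-occurrence counts with 'get' and 'use'
cat = np.array([7.0, 3.0])    # same direction, lower frequency

print(np.linalg.norm(dog - cat))    # Euclidean distance: ~30.5
print(cosine_similarity(dog, cat))  # cosine similarity: 1.0
```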

Page 23:

Issues with Co-occurrence Matrix

• Problem with using the co-occurrence matrix directly:
  – The resulting vectors are very high dimensional
  – Dimension size = number of words in the corpus
    • Billions!
  – Down-sampling dimensions is not straightforward
    • How many columns to select?
    • Which columns to select?
• Solution: Compression or dimensionality reduction techniques

Page 24:

SVD for Dimensionality Reduction

• SVD = Singular Value Decomposition
• For an input matrix X, compute X = U S V^T
  – U = left-singular vectors of X, and V = right-singular vectors of X
  – S is a diagonal matrix
    • Diagonal values of S are called singular values
• The first r columns of U give an r-dimensional vector for every row of X
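A sketch with numpy on a toy co-occurrence matrix (the values are illustrative); keeping the first r columns of U gives the low-dimensional word vectors:

```python
import numpy as np

# Toy 4-word co-occurrence matrix X (rows and columns are words).
X = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 1.],
              [1., 0., 0., 3.],
              [0., 1., 3., 0.]])

U, S, Vt = np.linalg.svd(X)  # X = U @ np.diag(S) @ Vt

r = 2
word_vectors = U[:, :r]      # one r-dimensional vector per row of X
print(word_vectors.shape)    # (4, 2)
```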

Page 25:

Word Visualization via Dimensionality Reduction

Page 26:

Issues with SVD

• Computational cost of SVD on an N x M matrix is O(NM^2), where N < M
  – Impossible for large vocabularies or document collections
  – Impractical for a real corpus
• It is hard to incorporate out-of-sample or new words/documents
  – The entire row in the matrix will be 0

Page 27:

Word2Vec: Representing Word Meanings

Key idea: Predict the surrounding words of every word

Benefits:
• Faster
• Easier to incorporate new words and documents

Main paper: Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., NeurIPS, 2013.

Page 28:

Two Styles of Learning Word2Vec

• Continuous Bag of Words (CBOW): uses the context words in a window to predict the middle word
• Skip-gram: uses the middle word to predict the context words in a window

Page 29:

Neural Network Basics: Neuron

• Basic building block of neural networks
• Input is a vector: x = [x1, …, xm]
• Weights and bias:
  – Neuron has weights w = [w1, w2, …, wm]
  – Bias term = b (or w0)
• Activation function:
  – Transforms the aggregate
  – e.g., sigmoid, ReLU
• Output computation: y = f(w · x + b)
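The output line can be exercised directly; a minimal sketch of a single neuron with a sigmoid activation (all numbers illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """y = f(w . x + b): weighted sum of inputs plus bias, then activation."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])  # input vector [x1, ..., xm]
w = np.array([0.1, 0.4, -0.2])  # weights [w1, ..., wm]
b = 0.3                         # bias term
print(neuron(x, w, b))          # single scalar output
```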

Page 30:

Neural Network Basics: Fully Connected Layer

• A layer whose neurons are connected to all the neurons in the previous layer
  – Each neuron takes as input all the outputs from the previous layer
• Multiple layers can be stacked together
• Example: 3 fully connected layers
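A sketch of the 3-layer example in numpy (layer sizes and weights are illustrative, with the sigmoid from the previous slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fully_connected(x, W, b):
    """Every output neuron sees every input: one matrix multiply per layer."""
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # input dim, two hidden widths, output dim
params = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=sizes[0])
for W, b in params:  # 3 fully connected layers, stacked
    x = fully_connected(x, W, b)
print(x)             # final output: 2 values
```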

Page 31:

Neural Network Basics: More About Layers

• Input layer: input vectors are given as inputs here
• Hidden layer: intermediate representation of the inputs
  – Multiple hidden layers can be stacked together
• Output layer: final output
  – Can have one or more neurons in the output layer
• Note that information flows in one direction

Page 32:

CBOW: Continuous Bag of Words

Example: “The cat sat on floor” (window size 2)
Input: context words
Output: middle word
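A sketch of the training pairs this sentence produces, assuming a symmetric window of 2 (the helper name is illustrative):

```python
def cbow_pairs(tokens, window=2):
    """Yield (context words, middle word) pairs for CBOW training."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        yield context, target

for context, target in cbow_pairs("the cat sat on floor".split()):
    print(context, "->", target)
# e.g. ['the', 'cat', 'on', 'floor'] -> sat
```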

Page 33:

The Architecture

Architecture: input layer, hidden layer, and output layer
• Fully connected layers

Input: one-hot vectors of the context words
Desired output: one-hot vector of the middle word

Page 34:

The Architecture

Input size: R^|V|

Hidden layer size: R^N

Output size: R^|V|

Input-to-hidden layer weight matrix: W of size |V| x N
• All inputs share the W matrix
Hidden-to-output layer weight matrix: W’ of size N x |V|
• All weight matrices are shared across all examples

Page 35:

Parameters To Be Learned

• Size of the input and output word vectors = |V|

• All weights are to be learned during the training process

Page 36:

Input to Hidden Layer

• Matrix multiplication generates the hidden vector
  – Multiply the input one-hot vector with the input-to-hidden layer matrix
• One multiplication per input
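Because the input is one-hot, this multiplication simply selects one row of W; a minimal sketch (shapes illustrative):

```python
import numpy as np

V, N = 7, 3                  # vocabulary size and hidden size
W = np.random.default_rng(0).normal(size=(V, N))

x = np.zeros(V)
x[1] = 1.0                   # one-hot input, e.g., 'cat'
h = x @ W                    # input-to-hidden multiplication

print(np.allclose(h, W[1]))  # True: h is just row 1 of W
```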

Page 37:

Input to Hidden Layer

Multiplication for ‘cat’

Page 38:

Input to Hidden Layer

Multiplication for ‘on’

Page 39:

Hidden Layer

• Aggregation is done at the hidden layer
  – Example: simple averaging
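Putting the pieces together, a sketch of the full CBOW forward pass with simple averaging at the hidden layer (random weights stand in for trained parameters; shapes follow the architecture slides):

```python
import numpy as np

V, N = 7, 3                          # vocabulary size |V|, hidden size N
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))          # input-to-hidden weights, |V| x N
W_prime = rng.normal(size=(N, V))    # hidden-to-output weights, N x |V|

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_forward(context_ids):
    """Average the context word embeddings, then score every vocabulary word."""
    h = W[context_ids].mean(axis=0)  # hidden vector: simple averaging
    return softmax(h @ W_prime)      # probability distribution over |V| words

probs = cbow_forward([0, 1, 3, 4])   # context word indices (illustrative)
print(probs.argmax(), probs.sum())   # predicted word index; probabilities sum to 1
```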