Word Embeddings
CSE 6240: Web Search and Text Mining, Spring 2020
Prof. Srijan Kumar, with Arindum Roy and Roshan Pati
Administrivia
• Homework: Will be released today after class
• Project Reminder: Teams due Monday Jan 20.
• A fun exercise at the end of the class!
Homework Policy
• Late day policy: 3 late days (3 x 24-hour chunks)
  – Use as needed
• Collaboration:
  – OK to talk, discuss the questions, and discuss potential directions for solving them. However, you need to write your own solutions and code separately, NOT as a group activity.
  – Please list the students you collaborated with.
• Zero tolerance on plagiarism
  – Follow the GT academic honesty rules
Recap So Far
1. IR and text processing
2. Evaluation of IR systems
Today’s Lecture
• Representing words and phrases
  – Neural network basics
  – Word2vec
  – Continuous bag of words
  – Skip-gram model
Some slides in this lecture are adapted from slides by Prof. Leonid Sigal, UBC.
Representing a Word: One-Hot Encoding
• Given a vocabulary:
  dog, cat, person, holding, tree, computer, using
Representing a Word: One-Hot Encoding
• Given a vocabulary, assign each word an index:
  dog 1, cat 2, person 3, holding 4, tree 5, computer 6, using 7
Representing a Word: One-Hot Encoding
• Given a vocabulary, convert each word to a one-hot vector:
  dog      1  [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
  cat      2  [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]
  person   3  [ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ]
  holding  4  [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ]
  tree     5  [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 ]
  computer 6  [ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ]
  using    7  [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ]
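A minimal Python sketch (not from the slides) of this encoding; the toy vocabulary and the `one_hot` helper are illustrative:

```python
# Minimal sketch: one-hot encoding over a toy vocabulary.
vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("person"))  # [0, 0, 1, 0, 0, 0, 0]
```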
Recap: Bag of Words Model
• Represent a document as a collection of words (after cleaning the document)
  – The order of words is irrelevant
  – The document “John is quicker than Mary” is indistinguishable from the document “Mary is quicker than John”
• Rank documents according to the overlap between query words and document words
Representing Phrases: Bag of Words
Bag-of-words representation (columns: Dog, Cat, Person, Holding, Tree, Computer, Using)
Representing Phrases: Bag of Words
Bag-of-words representation:
person holding dog   {3, 4, 1}   [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
Representing Phrases: Bag of Words
Bag-of-words representation:
person holding dog   {3, 4, 1}   [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat   {3, 4, 2}   [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
Representing Phrases: Bag of Words
Bag-of-words representation:
person holding dog      {3, 4, 1}   [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat      {3, 4, 2}   [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer   {3, 7, 6}   [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]
Representing Phrases: Bag of Words
Bag-of-words representation:
person holding dog                          {3, 4, 1}           [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat                          {3, 4, 2}           [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer                       {3, 7, 6}           [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]
person using computer, person holding cat   {3, 7, 6, 3, 4, 2}  [ 0, 1, 2, 1, 0, 1, 1, 0, 0, 0 ]
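A minimal Python sketch (not from the slides) of the bag-of-words representation for phrases; the vocabulary and the `bag_of_words` helper are illustrative:

```python
# Minimal sketch: bag-of-words count vectors; word order is ignored.
from collections import Counter

vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]

def bag_of_words(phrase):
    """Return a vector of word counts over the vocabulary."""
    counts = Counter(phrase.lower().split())
    return [counts[word] for word in vocab]

print(bag_of_words("person holding dog"))
# [1, 0, 1, 1, 0, 0, 0]
print(bag_of_words("person using computer person holding cat"))
# [0, 1, 2, 1, 0, 1, 1]  ('person' appears twice)
```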
Distributional Hypothesis [Lenci, 2008]
• The degree of semantic similarity between two linguistic expressions is a function of the similarity of their linguistic contexts
• Similarity in meaning ∝ Similarity of context
• Simple definition: context = surrounding words
What Is The Meaning Of “Bardiwac”?
• He handed her a glass of bardiwac.
• Beef dishes are made to complement the bardiwac.
• Nigel staggered to his feet, face flushed from too much bardiwac.
• Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
• I dined off bread and cheese and this excellent bardiwac.
• The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.
What Is The Meaning Of “Bardiwac”?
• He handed her a glass of bardiwac.
• Beef dishes are made to complement the bardiwac.
• Nigel staggered to his feet, face flushed from too much bardiwac.
• Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
• I dined off bread and cheese and this excellent bardiwac.
• The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.
Inference: bardiwac is an alcoholic beverage made from grapes.
Geometric Interpretation: Co-occurrence As Feature
• Recall the term-document matrix
  – Rows are terms, columns are documents, cells represent the number of times a term appears in a document
• Here we create a word-word co-occurrence matrix
  – Rows and columns are words
  – Cell (R, C) means “how many times does word C appear in the neighborhood of word R”
• Neighborhood = a window of fixed size around the word
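A minimal Python sketch (not part of the lecture) of building such a word-word co-occurrence matrix with a fixed-size window; the toy corpus and the window size of 2 are illustrative:

```python
# Minimal sketch: word-word co-occurrence counts with a symmetric window.
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """counts[r][c] = how many times word c appears within `window`
    positions of word r, summed over all sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

corpus = [["the", "cat", "sat", "on", "the", "floor"],
          ["the", "dog", "sat", "on", "the", "mat"]]
counts = cooccurrence(corpus, window=2)
print(counts["sat"]["on"])  # 2: 'on' occurs near 'sat' in both sentences
```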
Row Vectors in Co-occurrence Matrix
• A row vector describes the usage of the word in the corpus/document
• Row vectors can be seen as coordinates of points in an n-dimensional Euclidean space
• Example: n = 2, dimensions = ‘get’ and ‘use’
(Figure: co-occurrence matrix)
Distance And Similarity
• Selected two dimensions: ‘get’ and ‘use’
• Similarity between words = spatial proximity in the dimension space
• Measured by the Euclidean distance
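A minimal Python sketch (not from the slides) of measuring word similarity by Euclidean distance between row vectors; the ‘get’/‘use’ counts below are made up. It also previews the issue on the next slide: a frequent and a rare word with the same usage pattern end up far apart.

```python
# Minimal sketch: Euclidean distance between two co-occurrence row vectors.
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

dog = [80, 40]  # hypothetical (get, use) counts for a frequent word
cat = [8, 4]    # same direction of usage, but a much rarer word
print(euclidean(dog, cat))  # ~80.5: large distance despite similar usage
```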
Distance And Similarity
• Exact position in the space depends on the frequency of the word
• More frequent words will appear farther from the origin
  – E.g., say ‘dog’ is more frequent than ‘cat’
  – That does not mean it is more important
• Solution: Ignore the length and look only at the direction
Angle And Similarity
• Angle ignores the exact location of the point
• Method: Normalize by the length of the vectors, or use only the angle as a distance measure
• Standard metric: Cosine similarity between vectors
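A minimal Python sketch (not from the slides) of cosine similarity; the vectors reuse the made-up ‘get’/‘use’ counts above, showing that normalizing away length makes the frequent and the rare word maximally similar.

```python
# Minimal sketch: cosine similarity between two row vectors.
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([80, 40], [8, 4]))   # 1.0: same direction
print(cosine_similarity([80, 40], [4, 80]))  # much smaller: different usage
```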
Issues with Co-occurrence Matrix
• Problem with using the co-occurrence counts directly:
  – The resulting vectors are very high-dimensional
  – Dimension size = number of words in the corpus (billions!)
  – Down-sampling dimensions is not straightforward
    • How many columns to select?
    • Which columns to select?
• Solution: Compression or dimensionality reduction techniques
SVD for Dimensionality Reduction
• SVD = Singular Value Decomposition
• For an input matrix X, decompose X = U S V^T
  – U = left-singular vectors of X, and V = right-singular vectors of X
  – S is a diagonal matrix; its diagonal values are called singular values
• Keeping the top r singular values, matrix U gives an r-dimensional vector for every row of X
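A minimal numpy sketch (not from the slides) of using SVD to get low-dimensional word vectors from a co-occurrence matrix; the random matrix and the choice r = 2 are purely illustrative:

```python
# Minimal sketch: rank-r truncation of a co-occurrence matrix via SVD.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((7, 7))                 # toy |V| x |V| co-occurrence matrix

U, S, Vt = np.linalg.svd(X, full_matrices=False)

r = 2                                  # target dimensionality
word_vectors = U[:, :r] * S[:r]        # one r-dimensional vector per word (row of X)
print(word_vectors.shape)              # (7, 2)
```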
Word Visualization via Dimensionality Reduction
Issues with SVD
• Computational cost for SVD of an N x M matrix is O(NM^2), where N < M
  – Infeasible for large word vocabularies or large numbers of documents
  – Impractical for a real corpus
• It is hard to incorporate out-of-sample or new words/documents
  – The entire row in the matrix will be 0
Word2Vec: Representing Word Meanings
Key idea: Predict the surrounding words of every word
Benefits:
• Faster
• Easier to incorporate new words and documents
Main paper: Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., NeurIPS, 2013.
Two Styles of Learning Word2Vec
• Continuous Bag of Words (CBOW): uses the context words in a window to predict the middle word
• Skip-gram: uses the middle word to predict the context words in a window
Neural Network Basics: Neuron
• Basic building block of neural networks
• Input is a vector: x = [x1, …, xm]
• Weights and bias:
  – Neuron has weights w = [w1, w2, …, wm]
  – Bias term = b (or w0)
• Activation function:
  – Transforms the aggregate
  – e.g., sigmoid, ReLU
• Output computation: y = f(w · x + b), where f is the activation function
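A minimal Python sketch (not from the slides) of a single neuron with a sigmoid activation; the weights, bias, and input values are made up:

```python
# Minimal sketch: one neuron computing y = f(w . x + b) with f = sigmoid.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Weighted sum of inputs plus bias, passed through the activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

print(neuron(x=[1.0, 2.0, 3.0], w=[0.5, -0.25, 0.1], b=0.2))  # ~0.62
```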
Neural Network Basics: Fully Connected Layer
• A layer whose neurons are connected to all the neurons in the previous layer
  – Each neuron takes as input all the outputs from the previous layer
• Multiple layers can be stacked together
• Example: 3 fully connected layers
Neural Network Basics: More About Layers
• Input layer: input vectors are given as inputs here
• Hidden layer: intermediate representation of the inputs
  – Multiple hidden layers can be stacked together
• Output layer: final output
  – Can have one or more neurons in the output layer
• Note that information flows in one direction
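A minimal numpy sketch (not from the slides) of stacked fully connected layers with information flowing in one direction, from the input layer through a hidden layer to the output layer; the layer sizes and the ReLU activation are illustrative choices:

```python
# Minimal sketch: a feed-forward pass through two fully connected layers.
import numpy as np

def dense(x, W, b):
    """Fully connected layer: every output unit sees every input unit."""
    return np.maximum(0.0, W @ x + b)  # ReLU activation

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input layer (4 units)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # hidden layer (5 units)
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)   # output layer (2 units)

hidden = dense(x, W1, b1)
output = dense(hidden, W2, b2)
print(hidden.shape, output.shape)               # (5,) (2,)
```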
CBOW: Continuous Bag of Words
Example: “The cat sat on floor” (window size 2)
Input: context words
Output: middle word
The Architecture
Architecture: input layer, hidden layer, and output layer
• Fully connected layers
Input: one-hot vectors of the context words
Desired output: one-hot vector of the middle word
The Architecture
Input size: R^|V|
Hidden layer size: R^N
Output size: R^|V|
Input-to-hidden weight matrix: W, of size |V| x N
• All inputs share the same W matrix
Hidden-to-output weight matrix: W', of size N x |V|
• All weight matrices are shared across all examples
Parameters To Be Learned
• Size of the input and output word vector = |V|
• All weights are to be learned during the training process
Input to Hidden Layer
• Matrix multiplication generates the hidden vector
  – Multiplication of the input one-hot vector with the input-to-hidden weight matrix
• One multiplication per input
Input to Hidden Layer
Multiplication for ‘cat’
Input to Hidden Layer
Multiplication for ‘on’
Hidden Layer
• Aggregation is done at the hidden layer
  – Example: simple averaging
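A minimal numpy sketch (not from the slides) of the two steps just described: multiplying a one-hot vector by W selects a single row of W, and the hidden vector is the simple average of the rows selected by the context words. The sizes and word indices are illustrative.

```python
# Minimal sketch: CBOW input-to-hidden step (row lookup + averaging).
import numpy as np

V, N = 7, 3                                    # vocabulary size, hidden size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))                    # input-to-hidden matrix, |V| x N

def one_hot(i, size=V):
    x = np.zeros(size)
    x[i] = 1.0
    return x

context_ids = [1, 3]                           # e.g. indices of 'cat' and 'on'
rows = [one_hot(i) @ W for i in context_ids]   # each product equals row W[i]
hidden = np.mean(rows, axis=0)                 # simple averaging at the hidden layer
print(np.allclose(hidden, W[context_ids].mean(axis=0)))  # True
```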