Word Embeddings
CSE 6240: Web Search and Text Mining, Spring 2020
Prof. Srijan Kumar, with Arindum Roy and Roshan Pati
Administrivia
• Homework: Will be released today after class
• Project Reminder: Teams due Monday Jan 20.
• A fun exercise at the end of the class!
Homework Policy
• Late day policy: 3 late days (3 x 24-hour chunks)
  – Use as needed
• Collaboration:
  – OK to talk, discuss the questions, and discuss potential directions for solving them. However, you need to write your own solutions and code separately, NOT as a group activity.
  – Please list the students you collaborated with.
• Zero tolerance on plagiarism
  – Follow the GT academic honesty rules
Recap So Far
1. IR and text processing
2. Evaluation of IR systems
Today’s Lecture
• Representing words and phrases
  – Neural network basics
  – Word2vec
  – Continuous bag of words
  – Skip-gram model
Some slides in this lecture are adapted from slides by Prof. Leonid Sigal, UBC.
Representing a Word: One-Hot Encoding
• Given a vocabulary:
  dog, cat, person, holding, tree, computer, using
Representing a Word: One-Hot Encoding
• Given a vocabulary, assign each word an index:
  dog 1, cat 2, person 3, holding 4, tree 5, computer 6, using 7
Representing a Word: One-Hot Encoding
• Given a vocabulary, convert each word to a one-hot vector:
  dog      1  [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
  cat      2  [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]
  person   3  [ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ]
  holding  4  [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ]
  tree     5  [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 ]
  computer 6  [ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 ]
  using    7  [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ]
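A minimal Python sketch (not from the slides) of this encoding; the toy vocabulary and the `one_hot` helper are illustrative:

```python
# Minimal sketch: one-hot encoding over a toy vocabulary.
vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("person"))  # [0, 0, 1, 0, 0, 0, 0]
```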
Recap: Bag of Words Model
• Represent a document as a collection of words (after cleaning the document)
  – The order of words is irrelevant
  – The document “John is quicker than Mary” is indistinguishable from the document “Mary is quicker than John”
• Rank documents according to the overlap between query words and document words
Representing Phrases: Bag of Words
Bag-of-words representation (columns: Dog, Cat, Person, Holding, Tree, Computer, Using)
Representing Phrases: Bag of Words
Bag-of-words representation:
person holding dog   {3, 4, 1}   [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
Representing Phrases: Bag of Words
Bag-of-words representation:
person holding dog   {3, 4, 1}   [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat   {3, 4, 2}   [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
Representing Phrases: Bag of Words
Bag-of-words representation:
person holding dog      {3, 4, 1}   [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat      {3, 4, 2}   [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer   {3, 7, 6}   [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]
Representing Phrases: Bag of Words
Bag-of-words representation:
person holding dog                          {3, 4, 1}           [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0 ]
person holding cat                          {3, 4, 2}           [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0 ]
person using computer                       {3, 7, 6}           [ 0, 0, 1, 0, 0, 1, 1, 0, 0, 0 ]
person using computer, person holding cat   {3, 7, 6, 3, 4, 2}  [ 0, 1, 2, 1, 0, 1, 1, 0, 0, 0 ]
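A minimal Python sketch (not from the slides) of the bag-of-words representation for phrases; the vocabulary and the `bag_of_words` helper are illustrative:

```python
# Minimal sketch: bag-of-words count vectors; word order is ignored.
from collections import Counter

vocab = ["dog", "cat", "person", "holding", "tree", "computer", "using"]

def bag_of_words(phrase):
    """Return a vector of word counts over the vocabulary."""
    counts = Counter(phrase.lower().split())
    return [counts[word] for word in vocab]

print(bag_of_words("person holding dog"))
# [1, 0, 1, 1, 0, 0, 0]
print(bag_of_words("person using computer person holding cat"))
# [0, 1, 2, 1, 0, 1, 1]  ('person' appears twice)
```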
Distributional Hypothesis [Lenci, 2008]
• The degree of semantic similarity between two linguistic expressions is a function of the similarity of their linguistic contexts
• Similarity in meaning ∝ Similarity of context
• Simple definition: context = surrounding words
What Is The Meaning Of “Bardiwac”?
• He handed her a glass of bardiwac.
• Beef dishes are made to complement the bardiwac.
• Nigel staggered to his feet, face flushed from too much bardiwac.
• Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
• I dined off bread and cheese and this excellent bardiwac.
• The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.
What Is The Meaning Of “Bardiwac”?
• He handed her a glass of bardiwac.
• Beef dishes are made to complement the bardiwac.
• Nigel staggered to his feet, face flushed from too much bardiwac.
• Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
• I dined off bread and cheese and this excellent bardiwac.
• The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.
Inference: bardiwac is an alcoholic beverage made from grapes.
Geometric Interpretation: Co-occurrence As Feature
• Recall the term-document matrix
  – Rows are terms, columns are documents, cells represent the number of times a term appears in a document
• Here we create a word-word co-occurrence matrix
  – Rows and columns are words
  – Cell (R, C) means “how many times does word C appear in the neighborhood of word R”
• Neighborhood = a window of fixed size around the word
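A minimal Python sketch (not part of the lecture) of building such a word-word co-occurrence matrix with a fixed-size window; the toy corpus and the window size of 2 are illustrative:

```python
# Minimal sketch: word-word co-occurrence counts with a symmetric window.
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """counts[r][c] = how many times word c appears within `window`
    positions of word r, summed over all sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

corpus = [["the", "cat", "sat", "on", "the", "floor"],
          ["the", "dog", "sat", "on", "the", "mat"]]
counts = cooccurrence(corpus, window=2)
print(counts["sat"]["on"])  # 2: 'on' occurs near 'sat' in both sentences
```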
Row Vectors in Co-occurrence Matrix
• A row vector describes the usage of the word in the corpus/document
• Row vectors can be seen as coordinates of points in an n-dimensional Euclidean space
• Example: n = 2, dimensions = ‘get’ and ‘use’
(Figure: co-occurrence matrix)
Distance And Similarity
• Selected two dimensions: ‘get’ and ‘use’
• Similarity between words = spatial proximity in the dimension space
• Measured by the Euclidean distance
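A minimal Python sketch (not from the slides) of measuring word similarity by Euclidean distance between row vectors; the ‘get’/‘use’ counts below are made up. It also previews the issue on the next slide: a frequent and a rare word with the same usage pattern end up far apart.

```python
# Minimal sketch: Euclidean distance between two co-occurrence row vectors.
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

dog = [80, 40]  # hypothetical (get, use) counts for a frequent word
cat = [8, 4]    # same direction of usage, but a much rarer word
print(euclidean(dog, cat))  # ~80.5: large distance despite similar usage
```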
Distance And Similarity
• Exact position in the space depends on the frequency of the word
• More frequent words will appear farther from the origin
  – E.g., say ‘dog’ is more frequent than ‘cat’
  – That does not mean it is more important
• Solution: Ignore the length and look only at the direction
Angle And Similarity
• Angle ignores the exact location of the point
• Method: Normalize by the length of the vectors, or use only the angle as a distance measure
• Standard metric: Cosine similarity between vectors
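A minimal Python sketch (not from the slides) of cosine similarity; the vectors reuse the made-up ‘get’/‘use’ counts above, showing that normalizing away length makes the frequent and the rare word maximally similar.

```python
# Minimal sketch: cosine similarity between two row vectors.
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([80, 40], [8, 4]))   # 1.0: same direction
print(cosine_similarity([80, 40], [4, 80]))  # much smaller: different usage
```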
Issues with Co-occurrence Matrix
• Problem with using the co-occurrence counts directly:
  – The resulting vectors are very high-dimensional
  – Dimension size = number of words in the corpus (billions!)
  – Down-sampling dimensions is not straightforward
    • How many columns to select?
    • Which columns to select?
• Solution: Compression or dimensionality reduction techniques
SVD for Dimensionality Reduction
• SVD = Singular Value Decomposition
• For an input matrix X, decompose X = U S V^T
  – U = left-singular vectors of X, and V = right-singular vectors of X
  – S is a diagonal matrix; its diagonal values are called singular values
• Keeping the top r singular values, matrix U gives an r-dimensional vector for every row of X
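A minimal numpy sketch (not from the slides) of using SVD to get low-dimensional word vectors from a co-occurrence matrix; the random matrix and the choice r = 2 are purely illustrative:

```python
# Minimal sketch: rank-r truncation of a co-occurrence matrix via SVD.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((7, 7))                 # toy |V| x |V| co-occurrence matrix

U, S, Vt = np.linalg.svd(X, full_matrices=False)

r = 2                                  # target dimensionality
word_vectors = U[:, :r] * S[:r]        # one r-dimensional vector per word (row of X)
print(word_vectors.shape)              # (7, 2)
```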
Word Visualization via Dimensionality Reduction
Issues with SVD
• Computational cost for SVD of an N x M matrix is O(NM^2), where N < M
  – Infeasible for large word vocabularies or large numbers of documents
  – Impractical for a real corpus
• It is hard to incorporate out-of-sample or new words/documents
  – The entire row in the matrix will be 0
Word2Vec: Representing Word Meanings
Key idea: Predict the surrounding words of every word
Benefits:
• Faster
• Easier to incorporate new words and documents
Main paper: Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., NeurIPS, 2013.
Two Styles of Learning Word2Vec
• Continuous Bag of Words (CBOW): uses the context words in a window to predict the middle word
• Skip-gram: uses the middle word to predict the context words in a window
Neural Network Basics: Neuron
• Basic building block of neural networks
• Input is a vector: x = [x1, …, xm]
• Weights and bias:
  – Neuron has weights w = [w1, w2, …, wm]
  – Bias term = b (or w0)
• Activation function:
  – Transforms the aggregate
  – e.g., sigmoid, ReLU
• Output computation: y = f(w · x + b), where f is the activation function
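A minimal Python sketch (not from the slides) of a single neuron with a sigmoid activation; the weights, bias, and input values are made up:

```python
# Minimal sketch: one neuron computing y = f(w . x + b) with f = sigmoid.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Weighted sum of inputs plus bias, passed through the activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

print(neuron(x=[1.0, 2.0, 3.0], w=[0.5, -0.25, 0.1], b=0.2))  # ~0.62
```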
Neural Network Basics: Fully Connected Layer
• A layer whose neurons are connected to all the neurons in the previous layer
  – Each neuron takes as input all the outputs from the previous layer
• Multiple layers can be stacked together
• Example: 3 fully connected layers
Neural Network Basics: More About Layers
• Input layer: input vectors are given as inputs here
• Hidden layer: intermediate representation of the inputs
  – Multiple hidden layers can be stacked together
• Output layer: final output
  – Can have one or more neurons in the output layer
• Note that information flows in one direction
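A minimal numpy sketch (not from the slides) of stacked fully connected layers with information flowing in one direction, from the input layer through a hidden layer to the output layer; the layer sizes and the ReLU activation are illustrative choices:

```python
# Minimal sketch: a feed-forward pass through two fully connected layers.
import numpy as np

def dense(x, W, b):
    """Fully connected layer: every output unit sees every input unit."""
    return np.maximum(0.0, W @ x + b)  # ReLU activation

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input layer (4 units)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # hidden layer (5 units)
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)   # output layer (2 units)

hidden = dense(x, W1, b1)
output = dense(hidden, W2, b2)
print(hidden.shape, output.shape)               # (5,) (2,)
```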
CBOW: Continuous Bag of Words
Example: “The cat sat on floor” (window size 2)
Input: context words
Output: middle word
The Architecture
Architecture: input layer, hidden layer, and output layer
• Fully connected layers
Input: one-hot vectors of the context words
Desired output: one-hot vector of the middle word
The Architecture
Input size: R^|V|
Hidden layer size: R^N
Output size: R^|V|
Input-to-hidden weight matrix: W, of size |V| x N
• All inputs share the same W matrix
Hidden-to-output weight matrix: W', of size N x |V|
• All weight matrices are shared across all examples
Parameters To Be Learned
• Size of the input and output word vector = |V|
• All weights are to be learned during the training process
Input to Hidden Layer
• Matrix multiplication generates the hidden vector
  – Multiplication of the input one-hot vector with the input-to-hidden weight matrix
• One multiplication per input
Input to Hidden Layer
Multiplication for ‘cat’
Input to Hidden Layer
Multiplication for ‘on’
Hidden Layer
• Aggregation is done at the hidden layer
  – Example: simple averaging
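A minimal numpy sketch (not from the slides) of the two steps just described: multiplying a one-hot vector by W selects a single row of W, and the hidden vector is the simple average of the rows selected by the context words. The sizes and word indices are illustrative.

```python
# Minimal sketch: CBOW input-to-hidden step (row lookup + averaging).
import numpy as np

V, N = 7, 3                                    # vocabulary size, hidden size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))                    # input-to-hidden matrix, |V| x N

def one_hot(i, size=V):
    x = np.zeros(size)
    x[i] = 1.0
    return x

context_ids = [1, 3]                           # e.g. indices of 'cat' and 'on'
rows = [one_hot(i) @ W for i in context_ids]   # each product equals row W[i]
hidden = np.mean(rows, axis=0)                 # simple averaging at the hidden layer
print(np.allclose(hidden, W[context_ids].mean(axis=0)))  # True
```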