Improving Distributional Similarity with Lessons Learned from Word Embeddings

Authors: Omer Levy, Yoav Goldberg, Ido Dagan

Presentation: Collin Gress

Motivation

• We want to do NLP tasks. How do we represent words?

• We generally want vectors. Think neural networks.

• What are some ways to get vector representations of words?

• Distributional hypothesis: "words that are used and occur in the same contexts tend to purport similar meanings" (Wikipedia)

Vector representations of words and their surrounding contexts

• Word2vec [1]

• GloVe [2]

• PMI – Pointwise mutual information

• SVD of PMI – Singular value decomposition of the PMI matrix

Very briefly: Word2vec

• This paper focuses on skip-gram with negative sampling (SGNS), which predicts context words from a target word.

• The optimization problem is solvable by gradient descent. We want to maximize w · c for word-context pairs that occur in the dataset, and minimize it for "hallucinated" word-context pairs [0].

• For every real word-context pair in the dataset, hallucinate k word-context pairs. That is, given some target word, draw k contexts from P(c) = count(c) / Σ_c' count(c') (a single update of this kind is sketched below).

• We end up with a vector w ∈ ℝ^d for every word in the dataset and, similarly, a vector c ∈ ℝ^d for every context in the dataset.

See the Mikolov paper [1] for details.
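
To make the objective concrete, here is a minimal numpy sketch of a single SGNS update for one observed word-context pair plus k hallucinated ones. It assumes pre-initialized embedding matrices W and C and a precomputed noise distribution; the names (sgns_step, noise_probs) are illustrative and this is not word2vec's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W, C, word_id, ctx_id, noise_probs, k=5, lr=0.025):
    """One stochastic update for a single observed (word, context) pair.

    W, C: word and context embedding matrices, shape (vocab_size, d).
    noise_probs: noise distribution over context ids (sums to 1).
    k: number of hallucinated (negative) contexts per real pair.
    """
    w = W[word_id]
    # One real context plus k negatives drawn from the noise distribution.
    neg_ids = rng.choice(len(noise_probs), size=k, p=noise_probs)
    ctx_ids = np.concatenate(([ctx_id], neg_ids))
    labels = np.concatenate(([1.0], np.zeros(k)))   # 1 = real, 0 = hallucinated

    scores = sigmoid(C[ctx_ids] @ w)                # sigma(w . c) for each context
    errors = labels - scores                        # gradient of the log-likelihood wrt scores

    grad_w = errors @ C[ctx_ids]                    # accumulate over all k+1 contexts
    grad_c = np.outer(errors, w)
    W[word_id] += lr * grad_w                       # gradient ascent on the objective
    C[ctx_ids] += lr * grad_c
```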

Very briefly: GloVe

• Learn d-dimensional vectors w and c as well as word- and context-specific scalar biases b_w and b_c such that w · c + b_w + b_c = log(count(w, c)) for all word-context pairs in the dataset [0].

• The objective is "solved" by a factorization of the log-count matrix: W · C^T + b_w + b_c ≈ M^log(count) (a sketch of the corresponding loss follows below).
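
As a rough illustration of that objective, the sketch below computes a GloVe-style weighted least-squares loss over the nonzero cells of a dense co-occurrence matrix. The weighting function f(x) = min(1, (x / x_max)^α) comes from the GloVe paper [2]; the function name and the dense-matrix assumption are mine.

```python
import numpy as np

def glove_loss(W, C, bw, bc, counts, x_max=100.0, alpha=0.75):
    """Weighted least-squares loss over the nonzero cells of a co-occurrence matrix.

    W: (V_w, d) word vectors, C: (V_c, d) context vectors.
    bw, bc: per-word and per-context scalar biases.
    counts: dense (V_w, V_c) co-occurrence count matrix.
    """
    i, j = np.nonzero(counts)                   # only observed pairs contribute
    x = counts[i, j].astype(float)
    # GloVe's weighting function down-weights very frequent pairs.
    f = np.minimum(1.0, (x / x_max) ** alpha)
    pred = np.sum(W[i] * C[j], axis=1) + bw[i] + bc[j]
    return np.sum(f * (pred - np.log(x)) ** 2)
```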

Very briefly: Pointwise mutual information (PMI)

• PMI(w, c) = log( P(w, c) / (P(w) · P(c)) )
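
A minimal sketch of how the PMI matrix can be computed from a word-context count matrix; clipping negative values to zero (positive PMI, PPMI) is the variant typically used in practice, since unobserved pairs would otherwise have PMI = log(0).

```python
import numpy as np

def pmi_matrix(counts, positive=True):
    """PMI for every cell of a word-context co-occurrence count matrix.

    counts: (V_w, V_c) matrix of co-occurrence counts; assumes every word
    and every context occurs at least once in the corpus.
    positive=True clips negative values to zero (PPMI), which also handles
    the unobserved pairs whose PMI would otherwise be log(0) = -inf.
    """
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total   # P(w)
    p_c = counts.sum(axis=0, keepdims=True) / total   # P(c)
    p_wc = counts / total                             # P(w, c)
    with np.errstate(divide="ignore"):                # zero cells -> -inf, clipped below
        pmi = np.log(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0) if positive else pmi
```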

PMI: example

Source: https://en.wikipedia.org/wiki/Pointwise_mutual_information

PMI matrices for word-context pairs in practice

• Very sparse: most word-context pairs are never observed in the corpus.

Interesting relationships between PMI and SGNS; PMI and GloVe

• SGNS implicitly factorizes the PMI matrix shifted by some constant [0]. Specifically, SGNS finds optimal vectors w and c such that w · c = PMI(w, c) − log(k); in matrix form, W · C^T = M^PMI − log(k).

• Recall that in GloVe we learn d-dimensional vectors w and c as well as word- and context-specific scalar biases b_w and b_c such that w · c + b_w + b_c = log(count(w, c)) for all word-context pairs in the dataset.

• If we fix the biases to b_w = log(count(w)) and b_c = log(count(c)), we get a problem nearly equivalent to factorizing the PMI matrix shifted by log(|D|), i.e. W · C^T = M^PMI − log(|D|) (a short derivation follows below).

• Or, in simple terms, SGNS (Word2vec) and GloVe aren't too different from PMI.
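
The step from fixed biases to a shifted PMI factorization follows directly from the definition of PMI; writing #(·) for corpus counts and |D| for the total number of word-context pairs:

```latex
% Fixing the GloVe biases to b_w = log #(w), b_c = log #(c):
\begin{aligned}
\vec{w}\cdot\vec{c} &= \log \#(w,c) - b_w - b_c = \log\frac{\#(w,c)}{\#(w)\,\#(c)} \\
\operatorname{PMI}(w,c) &= \log\frac{P(w,c)}{P(w)\,P(c)} = \log\frac{\#(w,c)\,|D|}{\#(w)\,\#(c)} \\
\Rightarrow\quad \vec{w}\cdot\vec{c} &= \operatorname{PMI}(w,c) - \log|D|
\end{aligned}
```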

Very briefly: SVD of the PMI matrix

• Singular value decomposition of the PMI matrix gives us dense vectors.

• Factorize the PMI matrix M into a product of three matrices, i.e. M = U · Σ · V^T.

• Why does that help? Keeping only the top d singular values gives low-dimensional representations: W = U_d · Σ_d and C = V_d (sketched below).
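
A minimal sketch of the truncated SVD step, assuming a dense (P)PMI matrix; real implementations would use a sparse solver such as scipy.sparse.linalg.svds, and the function name here is illustrative.

```python
import numpy as np

def svd_embeddings(ppmi, d=500):
    """Dense word/context vectors from a truncated SVD of a (P)PMI matrix.

    ppmi: (V_w, V_c) PMI or PPMI matrix; d: number of dimensions to keep.
    """
    # Full SVD: ppmi = U @ diag(S) @ Vt; keep only the top-d singular values.
    U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
    W = U[:, :d] * S[:d]   # W = U_d * Sigma_d   (word vectors)
    C = Vt[:d].T           # C = V_d             (context vectors)
    return W, C
```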

Thesis

• The performance gains of word embeddings are largely attributable to hyperparameter optimization by the algorithm designers rather than to the algorithms themselves.

• The PMI and SVD baselines used for comparison in embedding papers were the most "vanilla" versions, hence the apparent superiority of the embedding algorithms.

• The hyperparameters of GloVe and Word2vec can be applied to PMI and SVD, drastically improving their performance.

Pre-processing Hyperparameters

• Dynamic Context Window: context word counts are weighted by their distance to the target word. Word2vec does this by setting each context word's weight to (window_size − distance + 1) / window_size; GloVe uses 1 / distance (both schemes are sketched after this list).

• Subsampling: remove very frequent words from the corpus. Word2vec does this by removing words that are more frequent than some threshold t with probability 1 − sqrt(t / f), where f is the corpus-wide frequency of the word.

• Subsampling can be "dirty" or "clean". In dirty subsampling, words are removed before word-context pairs are formed; in clean subsampling, it is done after.

• Deleting Rare Words: exactly what you would expect. Negligible effect on performance.
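
A small sketch of the two weighting/sampling rules above, assuming distances and relative frequencies have already been computed; the threshold t = 1e-5 is a commonly used word2vec default rather than a value prescribed by these slides, and the function names are illustrative.

```python
import numpy as np

def dynamic_window_weight(distance, window_size, scheme="word2vec"):
    """Weight given to a context word at `distance` positions from the target."""
    if scheme == "word2vec":
        return (window_size - distance + 1) / window_size
    if scheme == "glove":
        return 1.0 / distance
    raise ValueError(f"unknown scheme: {scheme}")

def subsample_keep_prob(word_freq, t=1e-5):
    """Probability of keeping a token whose corpus-wide relative frequency is word_freq.

    Words rarer than the threshold t are always kept; more frequent words are
    dropped with probability 1 - sqrt(t / f).
    """
    return min(1.0, np.sqrt(t / word_freq))
```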

Association Metric Hyperparameters

• Shifted PMI: as previously discussed, SGNS implicitly factorizes the PMI matrix shifted by log(k). When working with PMI matrices directly, we can simply apply this transformation by picking some constant k, so that each cell of the matrix becomes PMI(w, c) − log(k).

• Context Distribution Smoothing: used in Word2vec to smooth the context distribution for negative sampling: P_α(c) = count(c)^α / Σ_c' count(c')^α, where α is some constant. It can be used with PMI in the same sort of way: PMI_α(w, c) = log( P(w, c) / (P(w) · P_α(c)) ), with P_α(c) defined as above (sketched below).

• Context distribution smoothing helps to correct PMI's bias towards word-context pairs where the context is rare.
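
Putting the two hyperparameters together, the sketch below builds a shifted, context-distribution-smoothed PPMI matrix from raw counts. The defaults α = 0.75 and k = 5 mirror common word2vec settings and are illustrative.

```python
import numpy as np

def shifted_smoothed_ppmi(counts, alpha=0.75, k=5):
    """PPMI with context-distribution smoothing (alpha) and a log(k) shift.

    counts: (V_w, V_c) word-context co-occurrence counts.
    """
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    # Smoothed context distribution: raise context counts to alpha, then normalize.
    c_alpha = counts.sum(axis=0) ** alpha
    p_c_alpha = (c_alpha / c_alpha.sum())[np.newaxis, :]
    with np.errstate(divide="ignore"):                 # unobserved pairs -> -inf
        pmi = np.log(p_wc / (p_w * p_c_alpha))
    return np.maximum(pmi - np.log(k), 0.0)            # shift by log(k), clip at zero
```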

Post-processing Hyperparameters

• Adding Context Vectors: the GloVe paper [2] suggests that it is useful to return w + c for word representations rather than just w.

• The new w + c representation holds more information. Previously, two vectors would have high cosine similarity if the two words are replaceable with one another; in the new form, weight is also awarded if one word tends to appear in the context of the other [0].

• Eigenvalue Weighting: in SVD, weight the singular value matrix by raising it to some power p, then define W = U_d · Σ_d^p (see the sketch after this list).

• The authors observed that SGNS produces "symmetric" word and context matrices, i.e. neither is orthonormal and neither is favored by the training objective [0]. Symmetry is obtainable in SVD by letting p = 1/2.

• Vector Normalization: the general practice is to normalize word vectors with L2 normalization; it may be worthwhile to experiment with other variants.
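
A sketch of how the three post-processing options might be combined on top of an SVD factorization; the particular combination shown (p = 0.5, w + c, L2 normalization) is one plausible configuration, not the paper's single recommended setting.

```python
import numpy as np

def postprocess(U, S, Vt, d=500, p=0.5, add_context=True):
    """Eigenvalue weighting, optional w + c, and L2 normalization on top of an SVD.

    U, S, Vt: SVD factors of the (shifted, smoothed) PPMI matrix.
    Assumes words and contexts share a vocabulary, so W and C have equal shape.
    """
    W = U[:, :d] * (S[:d] ** p)   # eigenvalue weighting: W = U_d * Sigma_d^p
    C = Vt[:d].T                  # context vectors: C = V_d
    vectors = W + C if add_context else W
    # Row-wise L2 normalization so that dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)
```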

Experiment Setup: Hyperparameter Space

Experiment Setup: Training

• Trained on an English Wikipedia dump: 77.5 million sentences, 1.5 billion tokens.

• d = 500 used for SVD, SGNS, and GloVe.

• GloVe trained for 50 iterations.

Experiment Setup: Testing

• Similarity task: models are evaluated on word similarity tasks over six datasets. Each dataset is human-labeled with word-pair similarity scores.

• Similarity scores are calculated with cosine similarity.

• Analogy task: two analogy datasets are used, with analogies of the form "a is to a* as b is to b*", where b* is not given. Example: "Paris is to France as Tokyo is to _".

• Analogies are solved using 3CosAdd and 3CosMul (both are sketched below).

• 3CosAdd: argmax_{b* ∈ V_W \ {a, a*, b}} ( cos(b*, a*) − cos(b*, a) + cos(b*, b) )

• 3CosMul: argmax_{b* ∈ V_W \ {a, a*, b}} ( cos(b*, a*) · cos(b*, b) / (cos(b*, a) + ε) ), where ε = 0.001
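
A sketch of both analogy objectives over a row-normalized embedding matrix; shifting cosines into [0, 1] for 3CosMul keeps the product and quotient positive, and the function and variable names are illustrative.

```python
import numpy as np

def solve_analogy(E, vocab, a, a_star, b, method="3CosMul", eps=0.001):
    """Return the word b* completing "a is to a* as b is to b*".

    E: (V, d) row-normalized embedding matrix; vocab: list of words in row order.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    cos_a, cos_a_star, cos_b = (E @ E[idx[w]] for w in (a, a_star, b))
    if method == "3CosAdd":
        scores = cos_a_star - cos_a + cos_b
    else:  # 3CosMul: shift cosines from [-1, 1] to [0, 1] to keep them positive
        pos = lambda x: (x + 1.0) / 2.0
        scores = pos(cos_a_star) * pos(cos_b) / (pos(cos_a) + eps)
    scores[[idx[a], idx[a_star], idx[b]]] = -np.inf    # exclude the query words
    return vocab[int(np.argmax(scores))]
```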

Experiment Results

Experiment Results Cont.

Key Takeaways

• The average score of SGNS (Word2vec) is lower than that of SVD for window sizes 2 and 5; SGNS never outperforms SVD by more than 1.7%.

• Previous results to the contrary were due to comparing tuned Word2vec against vanilla SVD.

• GloVe is only superior to SGNS on analogy tasks using 3CosAdd (generally considered inferior to 3CosMul).

• CBOW seems to perform well only on the MSR analogy dataset.

• SVD does not benefit from shifting the PMI matrix.

• Using SVD with an eigenvalue weighting of 1 results in poor performance compared to 0.5 or 0.

Recommendations

• Tune hyperparameters based on the task at hand (duh).

• If you're using PMI, always use context distribution smoothing.

• If you're using SVD, always use eigenvalue weighting.

• SGNS always performs well and is computationally efficient to train and use.

• With SGNS, use many negative examples, i.e. prefer a larger k.

• Experiment with the w = w + c variation in SGNS and GloVe.

References

• [0] Levy, Omer, Yoav Goldberg, and Ido Dagan. "Improving distributional similarity with lessons learned from word embeddings." Transactions of the Association for Computational Linguistics 3 (2015): 211-225.

• [1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.

• [2] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.