
Linguistic Regularities in Sparse and Explicit Word Representations

Omer Levy and Yoav Goldberg

Bar-Ilan University, Israel

(CoNLL 2014)

Papers in ACL 2014*

[Chart: share of "Neural Networks & Word Embeddings" vs. "Other Topics"]

* Sampling error: +/- 100%

Neural Embeddings

• Dense vectors

• Each dimension is a latent feature

• Common software package: word2vec

• Italy: (−7.35, 9.42, 0.88, …) ∈ ℝ^100

• "Magic": king − man + woman = queen (analogies; see the sketch below)
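This analogy "magic" can be reproduced with a pre-trained word2vec model through gensim; a minimal sketch, where the vector file name is a placeholder assumption rather than something from the talk:

```python
# Minimal sketch (assumes gensim is installed and a pre-trained
# word2vec file is available; the file name below is hypothetical).
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("word2vec_vectors.bin", binary=True)

# king - man + woman ~= queen: gensim adds the "positive" vectors,
# subtracts the "negative" ones, and returns the nearest neighbors.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# With typical pre-trained vectors this returns something like [('queen', ...)]
```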

Representing words as vectors is not new!

Explicit Representations (Distributional)

• Sparse vectors

• Each dimension is an explicit context

• Common association metric: PMI / PPMI (see the sketch after this slide)

• Italy: (Rome: 17, pasta: 5, Fiat: 2, …) ∈ ℝ^|Vocab|, |Vocab| ≈ 100,000

• Does the same "magic" work for explicit representations too?

• Baroni et al. (2014) showed that embeddings outperform explicit representations, but…
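As an illustration of how such sparse vectors can be built, a minimal PPMI sketch; the toy corpus and the window size are illustrative assumptions:

```python
# Minimal sketch: sparse PPMI vectors from word-context co-occurrence counts.
# The toy corpus and the window size are illustrative assumptions.
from collections import Counter
import math

corpus = [["italy", "rome", "pasta"], ["italy", "fiat", "rome"]]
window = 2

pair = Counter()  # (word, context) -> co-occurrence count
for sent in corpus:
    for i, w in enumerate(sent):
        for c in sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]:
            pair[(w, c)] += 1

total = sum(pair.values())
w_count, c_count = Counter(), Counter()  # marginal counts
for (w, c), n in pair.items():
    w_count[w] += n
    c_count[c] += n

def ppmi_vector(word):
    """Sparse vector {context: PPMI}; only positive PMI values are kept."""
    vec = {}
    for (w, c), n in pair.items():
        if w == word:
            pmi = math.log(n * total / (w_count[w] * c_count[c]))
            if pmi > 0:
                vec[c] = pmi
    return vec

print(ppmi_vector("italy"))  # e.g. {'rome': ..., 'pasta': ..., 'fiat': ...}
```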

Questions

• Are analogies unique to neural embeddings?

  Compare neural embeddings with explicit representations.

• Why does vector arithmetic reveal analogies?

  Unravel the mystery behind neural embeddings and their "magic".

Background

Mikolov et al. (2013a,b,c)

• Neural embeddings have interesting geometries

• These patterns capture "relational similarities"

• Can be used to solve analogies:

  man is to woman as king is to queen

  (in general: a is to a* as b is to b*)

• With simple vector arithmetic:

  a − a* = b − b*, rearranged as b − a + a* = b*

• Examples (all vectors in ℝ^n):

  king − man + woman = queen

  Tokyo − Japan + France = Paris

  best − good + strong = strongest

Are analogies unique to neural embeddings?

• Experiment: compare embeddings to explicit representations

• Learn different representations from the same corpus

• Evaluate with the same recovery method:

  argmax_{b*} cos(b*, b − a + a*)

Analogy Datasets

• 4 words per analogy: a is to a* as b is to b*

• Given 3 words: a is to a* as b is to ?

• Guess the best-suiting b* from the entire vocabulary V

  • Excluding the question words a, a*, b

• MSR: ~8,000 syntactic analogies

• Google: ~19,000 syntactic and semantic analogies

(An evaluation sketch in code follows below.)
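A minimal sketch of this recovery procedure (the additive objective over length-normalized vectors, excluding the question words); the `vectors` dictionary is an assumed input and is not tied to any particular toolkit:

```python
# Minimal sketch of analogy recovery with the additive objective.
# Assumes `vectors` is a dict {word: np.ndarray} of equal-length vectors;
# how they were trained (embedding or explicit/PPMI) does not matter here.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def solve_analogy(vectors, a, a_star, b):
    """Return argmax_{b*} cos(b*, b) - cos(b*, a) + cos(b*, a*),
    excluding the three question words."""
    va, va_s, vb = (normalize(vectors[w]) for w in (a, a_star, b))
    best_word, best_score = None, -np.inf
    for w, v in vectors.items():
        if w in (a, a_star, b):
            continue
        x = normalize(v)  # cosine of unit vectors is a dot product
        score = x.dot(vb) - x.dot(va) + x.dot(va_s)
        if score > best_score:
            best_word, best_score = w, score
    return best_word

# Usage: solve_analogy(vectors, "man", "woman", "king")  ->  ideally "queen"
```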

Embedding vs Explicit (Round 1)

[Bar chart: accuracy on the two analogy datasets]

                 MSR      Google
  Embedding      54%      63%
  Explicit       29%      45%

Many analogies recovered by explicit, but many more by embedding.

Why does vector arithmetic reveal analogies?

• We wish to find the b* closest to b − a + a*

• This is done with cosine similarity:

  argmax_{b* ∈ V} cos(b*, b − a + a*)
    = argmax_{b* ∈ V} [ cos(b*, b) − cos(b*, a) + cos(b*, a*) ]

• vector arithmetic = similarity arithmetic
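As a sanity check of this equivalence, a small numerical sketch; it assumes all word vectors are length-normalized (the standard setup, under which the two rankings share the same argmax), and the random vocabulary is purely illustrative:

```python
# Minimal check: with unit-length vectors, ranking by cos(x, b - a + a*)
# gives the same argmax as ranking by cos(x, b) - cos(x, a) + cos(x, a*).
# The random "vocabulary" below is purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab = {f"w{i}": v / np.linalg.norm(v)
         for i, v in enumerate(rng.normal(size=(1000, 50)))}
a, a_star, b = vocab["w1"], vocab["w2"], vocab["w3"]

def cos(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

combined = {w: cos(x, b - a + a_star) for w, x in vocab.items()}
decomposed = {w: cos(x, b) - cos(x, a) + cos(x, a_star) for w, x in vocab.items()}

# The two objectives differ only by a constant positive factor,
# so they select the same word.
print(max(combined, key=combined.get) == max(decomposed, key=decomposed.get))  # True
```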

Why does vector arithmetic reveal analogies?

• We wish to find the x closest to king − man + woman

• This is done with cosine similarity:

  argmax_x cos(x, king − man + woman)
    = argmax_x [ cos(x, king) − cos(x, man) + cos(x, woman) ]

• vector arithmetic = similarity arithmetic

• Intuition: cos(x, king) asks "royal?", cos(x, woman) asks "female?"

What does each similarity term mean?

• Observe the joint features with explicit representations!

  queen ∩ king      queen ∩ woman
  uncrowned         Elizabeth
  majesty           Katherine
  second            impregnate
  …                 …
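One way to inspect such joint features, reusing the sparse {context: weight} vectors from the PPMI sketch above; ranking shared contexts by the product of their weights is an illustrative choice here, not the paper's procedure:

```python
# Minimal sketch: inspect the contexts shared by two sparse word vectors.
# `sparse_vectors` is assumed to map word -> {context: PPMI weight},
# e.g. built as in the PPMI sketch earlier in this transcript.
def shared_contexts(sparse_vectors, w1, w2, top=10):
    common = set(sparse_vectors[w1]) & set(sparse_vectors[w2])
    # Rank shared contexts by how strongly both words associate with them.
    return sorted(common,
                  key=lambda c: sparse_vectors[w1][c] * sparse_vectors[w2][c],
                  reverse=True)[:top]

# Usage: shared_contexts(sparse_vectors, "queen", "king")
#        shared_contexts(sparse_vectors, "queen", "woman")
```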

Can we do better?

Let's look at some mistakes…

England − London + Baghdad = ?

  Expected answer: Iraq

  Answer returned by the additive objective: Mosul

The Additive Objective

cos(Iraq, England) − cos(Iraq, London) + cos(Iraq, Baghdad) = 0.15 − 0.13 + 0.63 = 0.65

cos(Mosul, England) − cos(Mosul, London) + cos(Mosul, Baghdad) = 0.13 − 0.14 + 0.75 = 0.74

• Problem: one similarity might dominate the rest

• Much more prevalent in the explicit representation

• Might explain why the explicit representation underperformed

How can we do better?

• Instead of adding similarities, multiply them!

  argmax_{b*} [ cos(b*, b) · cos(b*, a*) ] / cos(b*, a)

Embedding vs Explicit (Round 2)

Multiplication > Addition

[Bar chart: accuracy with the additive vs. multiplicative objective]

               Embedding           Explicit
               MSR     Google      MSR     Google
  Add          54%     63%         29%     45%
  Mul          59%     67%         57%     68%

Explicit is on-par with Embedding

[Bar chart: accuracy with the multiplicative objective]

                 MSR      Google
  Embedding      59%      67%
  Explicit       57%      68%

Explicit is on-par with Embedding

• Embeddings are not "magical"

• Embedding-based similarities have a more uniform distribution

• The additive objective performs better on smoother distributions

• The multiplicative objective overcomes this issue

Conclusion

• Are analogies unique to neural embeddings?

  No! They occur in sparse and explicit representations as well.

• Why does vector arithmetic reveal analogies?

  Because vector arithmetic is equivalent to similarity arithmetic.

• Can we do better?

  Yes! The multiplicative objective is significantly better.

More Results and Analyses (in the paper)

• Evaluation on closed-vocabulary analogy questions (SemEval 2012)

• Experiments with a third objective function (PairDirection)

• Do different representations reveal the same analogies?

• Error analysis

• A feature-level interpretation of how word similarity reveals analogies

Thanks − for + listening = :)