19 word embeddings - Colorado State University


Transcript of 19 word embeddings - Colorado State University

Page 1: 19 word embeddings - Colorado State University


Word Embeddings

Chapter 6 in Jurafsky & Martin. Slides based on slides by Dan Jurafsky.

The problem with bag-of-words representation

•  Does not capture any notion of meaning or relatedness of words:

The vectors that represent different words are all orthogonal!

Page 2: 19 word embeddings - Colorado State University


How do we get at the meaning of words?

•  Let's take a closer look at words and the different types of relationships between them.

Words, Lemmas, Senses, Definitions

pepper, n.
Pronunciation: Brit. /ˈpɛpə/, U.S. /ˈpɛpər/
Forms: OE peopor (rare), OE pipcer (transmission error), OE pipor, OE pipur (rare) ...
Etymology: A borrowing from Latin. Etymon: Latin piper. < classical Latin piper, a loanword < Indo-Aryan (as is ancient Greek πέπερι); compare Sanskrit ...

I. The spice or the plant.
1. a. A hot pungent spice derived from the prepared fruits (peppercorns) of the pepper plant, Piper nigrum (see sense 2a), used from early times to season food, either whole or ground to powder (often in association with salt). Also (locally, chiefly with distinguishing word): a similar spice derived from the fruits of certain other species of the genus Piper; the fruits themselves.
The ground spice from Piper nigrum comes in two forms, the more pungent black pepper, produced from black peppercorns, and the milder white pepper, produced from white peppercorns: see BLACK adj. and n. Special uses 5a, PEPPERCORN n. 1a, and WHITE adj. and n. Special uses 7b(a). cubeb, mignonette pepper, etc.: see the first element.
b. With distinguishing word: any of certain other pungent spices derived from plants of other families, esp. ones used as seasonings.
Cayenne, Jamaica pepper, etc.: see the first element.
2. a. The plant Piper nigrum (family Piperaceae), a climbing shrub indigenous to South Asia and also cultivated elsewhere in the tropics, which has alternate stalked entire leaves, with pendulous spikes of small green flowers opposite the leaves, succeeded by small berries turning red when ripe. Also more widely: any plant of the genus Piper or the family Piperaceae.
b. Usu. with distinguishing word: any of numerous plants of other families having hot pungent fruits or leaves which resemble pepper (1a) in taste and in some cases are used as a substitute for it.

(Oxford English Dictionary | The definitive record of the English language)


betel-, malagueta, wall pepper, etc.: see the first element. See also WATER PEPPER n. 1.
c. U.S. The California pepper tree, Schinus molle. Cf. PEPPER TREE n. 3.
3. Any of various forms of capsicum, esp. Capsicum annuum var. annuum. Originally (chiefly with distinguishing word): any variety of the C. annuum Longum group, with elongated fruits having a hot, pungent taste, the source of cayenne, chilli powder, paprika, etc., or of the perennial C. frutescens, the source of Tabasco sauce. Now frequently (more fully sweet pepper): any variety of the C. annuum Grossum group, with large, bell-shaped or apple-shaped, mild-flavoured fruits, usually ripening to red, orange, or yellow and eaten raw in salads or cooked as a vegetable. Also: the fruit of any of these capsicums.
Sweet peppers are often used in their green immature state (more fully green pepper), but some new varieties remain green when ripe. bell-, bird-, cherry-, pod-, red pepper, etc.: see the first element. See also CHILLI n. 1, PIMENTO n. 2, etc.
II. Extended uses.
4. a. Phrases. to have pepper in the nose: to behave superciliously or contemptuously. to take pepper in the nose, to snuff pepper: to take offence, become angry. Now arch.
b. In other allusive and proverbial contexts, chiefly with reference to the biting, pungent, inflaming, or stimulating qualities of pepper.
†c. slang. Rough treatment; a severe beating, esp. one inflicted during a boxing match. Cf. Pepper Alley n. at Compounds 2, PEPPER v. 3. Obs.
5. Short for PEPPERPOT n. 1a.
6. colloq. A rapid rate of turning the rope in a game of skipping. Also: skipping at such a rate.


The slide annotates this entry with labels marking the lemma (pepper, n.), each sense (1a, 1b, 2a, ...), and each definition.

Page 3: 19 word embeddings - Colorado State University


A sense or “concept” is the meaning component of a word

There are relations between senses

Relation: Synonymy

•  Synonyms have the same meaning in some or all contexts.
–  filbert / hazelnut
–  couch / sofa
–  big / large
–  automobile / car
–  vomit / throw up
–  water / H2O

Page 4: 19 word embeddings - Colorado State University


Relation: Synonymy

•  Note that there are probably no examples of perfect synonymy.
–  Even if many aspects of meaning are identical
–  They still may not preserve acceptability, based on notions of politeness, slang, etc.

Relation: Synonymy?

water/H2O

big/large

brave/courageous

Page 5: 19 word embeddings - Colorado State University


Relation: Antonymy

•  Senses that are opposites with respect to one aspect of meaning
•  Otherwise, they are very similar!

dark/light   short/long   fast/slow   rise/fall
hot/cold   up/down   in/out

Relation: Similarity

•  Words with similar meanings. Not synonyms, but sharing some element of meaning
–  car, bicycle
–  cow, horse

Page 6: 19 word embeddings - Colorado State University


Ask humans to rate the similarity of words

word1     word2        similarity
vanish    disappear    9.8
behave    obey         7.3
belief    impression   5.95
muscle    bone         3.65
modest    flexible     0.98
hole      agreement    0.3

SimLex-999 dataset (Hill et al., 2015)

Relation: Word relatedness

•  Also called "word association"
•  Words can be related in any way, perhaps via a semantic frame or field

•  car, bicycle: similar
•  car, gasoline: related, not similar

Page 7: 19 word embeddings - Colorado State University


Semantic field

•  Words that
–  cover a particular semantic domain
–  bear structured relations with each other

hospitals: surgeon, scalpel, nurse, anaesthetic, hospital
restaurants: waiter, menu, plate, food, chef
houses: door, roof, kitchen, family, bed

Relation: superordinate / subordinate

•  One sense is a subordinate of another if the first sense is more specific, denoting a subclass of the other
–  car is a subordinate of vehicle
–  mango is a subordinate of fruit

•  Conversely, superordinate:
–  vehicle is a superordinate of car
–  fruit is a superordinate of mango

Page 8: 19 word embeddings - Colorado State University


So far

•  Concepts or word senses
–  Have a complex many-to-many association with words
•  Have relations with each other
–  Synonymy
–  Antonymy
–  Similarity
–  Relatedness
–  Superordinate/subordinate
–  Connotation

Let's define words by their usage

•  Words can be defined by their environments (the words around them)

•  Zellig Harris (1954): If A and B have almost identical environments we say that they are synonyms.

Page 9: 19 word embeddings - Colorado State University


What does ongchoi mean?

Suppose you see these sentences:
•  Ong choi is delicious sautéed with garlic
•  Ong choi is superb over rice
•  Ong choi leaves with salty sauces

And you've also seen these:
•  …spinach sautéed with garlic over rice
•  Chard stems and leaves are delicious
•  Collard greens and other salty leafy greens

Conclusion:
–  Ongchoi is a leafy green like spinach, chard, or collard greens

Ong choi: Ipomoea aquatica, "water spinach"

Yamaguchi, Wikimedia Commons, public domain

Page 10: 19 word embeddings - Colorado State University


We'll build a model of meaning focusing on similarity

•  Each word = a vector
–  Not just "word" or word45
•  Similar words are "nearby in space"

[Figure: 2-D projection of word vectors. Positive words (good, nice, wonderful, amazing, terrific, fantastic, very good, incredibly good) cluster together; negative words (bad, worst, worse, not good, incredibly bad, dislike) cluster together; function words (now, you, i, that, with, by, to, 's, are, is, a, than) form their own region.]

Words as vectors

•  Called an "embedding" because it's embedded into a space

•  The standard way to represent meaning in NLP

•  Fine-grained model of meaning for similarity
–  e.g., NLP tasks like sentiment analysis

•  With raw words, the same word must appear in training and test
•  With embeddings: it's OK as long as similar words occurred in training!

Page 11: 19 word embeddings - Colorado State University


The word-word matrix

•  Two words are similar in meaning if their context vectors are similar

             aardvark  computer  data  pinch  result  sugar  …
apricot          0         0      0      1      0       1
pineapple        0         0      0      1      0       1
digital          0         2      1      0      1       0
information      0         1      6      0      4       0
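A minimal sketch (not from the slides) of how such a word-word count matrix can be built with a ±2 word context window; the toy corpus, the window size, and the variable names are illustrative choices.

from collections import Counter, defaultdict

corpus = [
    "apricot jam is sweet with a pinch of sugar".split(),
    "digital information systems store data".split(),
]

window = 2
cooc = defaultdict(Counter)  # cooc[target][context] = co-occurrence count

for sent in corpus:
    for i, target in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[target][sent[j]] += 1

print(cooc["apricot"])  # e.g. Counter({'jam': 1, 'is': 1})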

Cosine for computing word similarity

v_i is the count for word v in context i; w_i is the count for word w in context i; cos(v, w) is the cosine similarity of v and w.


\vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}|\cos\theta, \qquad \frac{\vec{a}\cdot\vec{b}}{|\vec{a}|\,|\vec{b}|} = \cos\theta \quad (6.9)

The cosine similarity metric between two vectors \vec{v} and \vec{w} thus can be computed as:

\text{cosine}(\vec{v},\vec{w}) = \frac{\vec{v}\cdot\vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}} \quad (6.10)

For some applications we pre-normalize each vector, dividing it by its length to create a unit vector of length 1. Thus we could compute a unit vector from \vec{a} by dividing it by |\vec{a}|. For unit vectors, the dot product is the same as the cosine.

The cosine value ranges from 1 for vectors pointing in the same direction, through 0 for vectors that are orthogonal, to -1 for vectors pointing in opposite directions. But raw frequency values are non-negative, so the cosine for these vectors ranges from 0 to 1.

Let's see how the cosine computes which of the words apricot or digital is closer in meaning to information, just using raw counts from the following simplified table:

             large  data  computer
apricot          2     0     0
digital          0     1     2
information      1     6     1

\cos(\textit{apricot}, \textit{information}) = \frac{2+0+0}{\sqrt{4+0+0}\,\sqrt{1+36+1}} = \frac{2}{2\sqrt{38}} = .16

\cos(\textit{digital}, \textit{information}) = \frac{0+6+2}{\sqrt{0+1+4}\,\sqrt{1+36+1}} = \frac{8}{\sqrt{5}\,\sqrt{38}} = .58 \quad (6.11)

The model decides that information is closer to digital than it is to apricot, a result that seems sensible. Fig. 6.7 shows a visualization.
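The same computation can be reproduced in a few lines of Python; this sketch simply mirrors Eq. 6.10 and the raw counts in the table above (the cosine helper is an illustration, not code from the text).

import math

def cosine(v, w):
    # Eq. 6.10: dot product divided by the product of the vector lengths
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm = math.sqrt(sum(vi * vi for vi in v)) * math.sqrt(sum(wi * wi for wi in w))
    return dot / norm

apricot = [2, 0, 0]        # counts for contexts: large, data, computer
digital = [0, 1, 2]
information = [1, 6, 1]

print(round(cosine(apricot, information), 2))  # 0.16
print(round(cosine(digital, information), 2))  # 0.58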

6.5 TF-IDF: Weighing terms in the vector

The co-occurrence matrix in Fig. 6.5 represented each cell by the raw frequency of the co-occurrence of two words.

It turns out, however, that simple frequency isn't the best measure of association between words. One problem is that raw frequency is very skewed and not very discriminative. If we want to know what kinds of contexts are shared by apricot and pineapple but not by digital and information, we're not going to get good discrimination from words like the, it, or they, which occur frequently with all sorts of words and aren't informative about any particular word.

It's a bit of a paradox. Words that occur nearby frequently (maybe sugar appears often in our corpus near apricot) are more important than words that only appear …


Page 12: 19 word embeddings - Colorado State University


Cosine as a similarity metric

•  +1: vectors point in the same direction

•  -1: vectors point in opposite directions

•  0: vectors are orthogonal

             large  data  computer
apricot          1     0     0
digital          0     1     2
information      1     6     1

Which pair of words is more similar?

\cos(\vec{v},\vec{w}) = \frac{\vec{v}\cdot\vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}

cosine(apricot, information) = \frac{1+0+0}{\sqrt{1+0+0}\,\sqrt{1+36+1}} = \frac{1}{\sqrt{38}} = .16

cosine(digital, information) = \frac{0+6+2}{\sqrt{0+1+4}\,\sqrt{1+36+1}} = \frac{8}{\sqrt{5}\,\sqrt{38}} = .58

cosine(apricot, digital) = \frac{0+0+0}{\sqrt{1}\,\sqrt{5}} = 0

Page 13: 19 word embeddings - Colorado State University



[Figure: the count vectors for digital, apricot, and information plotted in two of their dimensions (x-axis: Dimension 1, 'large'; y-axis: Dimension 2, 'data').]

But raw frequency is a bad representation

•  Frequency is clearly useful; if sugar appears a lot near apricot, that's useful information.

•  But overly frequent words like the, it, or they are not very informative about the context

•  Can use tf-idf over co-occurrence matrices
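A sketch of one common way to apply tf-idf weighting to such a co-occurrence matrix; the exact variant (log base, smoothing, treating context columns as "documents") is an assumption here, not something the slides prescribe.

import numpy as np

# rows: apricot, pineapple, digital, information
# columns: aardvark, computer, data, pinch, result, sugar
counts = np.array([
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 2, 1, 0, 1, 0],
    [0, 1, 6, 0, 4, 0],
], dtype=float)

tf = np.log10(counts + 1)                             # dampened term frequency
df = np.count_nonzero(counts, axis=0)                 # rows in which each context appears
idf = np.log10(counts.shape[0] / np.maximum(df, 1))   # rare contexts get more weight
tfidf = tf * idf
print(np.round(tfidf, 2))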

Page 14: 19 word embeddings - Colorado State University


Sparse versus dense vectors

•  tf-idf vectors are
–  high dimensional (length |V| = 20,000 to 50,000)
–  sparse (most elements are zero)

•  Alternative: learn vectors which are
–  lower dimensional (length 50–1000)
–  dense (most elements are non-zero)

Sparse versus dense vectors

•  Why dense?
–  May be easier to use as features in machine learning (fewer weights to tune)
–  Dense vectors may generalize better than explicit counts
–  They may do better at capturing synonymy:
   •  car and automobile are synonyms, but they are distinct dimensions in a count vector
   •  a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't
–  In practice, they work better

Page 15: 19 word embeddings - Colorado State University


Embedding methods

•  Word2vec (Mikolov et al.): https://code.google.com/archive/p/word2vec/
•  GloVe (Pennington, Socher, Manning): http://nlp.stanford.edu/projects/glove/

Word2vec

•  Popular embedding method

•  Very fast to train

•  Several libraries provide code

•  Idea: predict rather than count

Page 16: 19 word embeddings - Colorado State University


Word2vec

–  Instead of counting how often each word w occurs near "apricot"

–  Train a classifier on a binary prediction task:

•  Is w likely to show up near "apricot"?

–  We don’t actually care about this task

•  But we'll take the learned classifier weights as the word embeddings

Approach: predict if candidate word c is a "neighbor"

•  Treat the target word t and a true neighboring context word c as positive examples.

•  Randomly sample other words in the lexicon to get negative examples

•  Use logistic regression to train a classifier to distinguish those two cases

•  Use the weights as the embeddings

Page 17: 19 word embeddings - Colorado State University


Skip-Gram Training Data

•  Assume a ±2 word window around a target word, e.g.:
–  I would like to go someplace nearby for lunch

•  The training data:

–  input/output pairs (centering on go)

Skip-Gram Training data

I would like to go someplace nearby for lunch

Positive {target, context} pairs:
{go, like}
{go, to}
{go, someplace}
{go, nearby}
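A small sketch that generates these positive pairs from the sentence with a ±2 word window; the helper name positive_pairs is just for illustration.

sentence = "I would like to go someplace nearby for lunch".split()
window = 2

def positive_pairs(tokens, window):
    # yield (target, context) for every context word within the window
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield (target, tokens[j])

print([p for p in positive_pairs(sentence, window) if p[0] == "go"])
# [('go', 'like'), ('go', 'to'), ('go', 'someplace'), ('go', 'nearby')]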


Page 18: 19 word embeddings - Colorado State University


Skip-Gram Training data

I would like to go someplace nearby for lunch

Positive {+ target, context}    Negative {- target, context}
{go, like}                      {go, aardvark}
{go, to}                        {go, incubate}
{go, someplace}                 {go, twelve}
{go, nearby}                    {go, therefore}
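A sketch of how negative examples might be paired with the positives; real word2vec implementations sample noise words from a frequency-weighted (unigram raised to the 0.75 power) distribution, while this illustration samples uniformly from a tiny made-up vocabulary.

import random

vocab = ["aardvark", "incubate", "twelve", "therefore", "lunch", "nearby"]
k = 1  # number of negative examples per positive pair

def with_negatives(pos_pairs, vocab, k):
    for target, context in pos_pairs:
        yield ("+", target, context)
        for _ in range(k):
            yield ("-", target, random.choice(vocab))  # noise word

positives = [("go", "like"), ("go", "to"), ("go", "someplace"), ("go", "nearby")]
for example in with_negatives(positives, vocab, k):
    print(example)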


The classifier's goal

•  Compute the probability that c is a real context word and not a fake context word

•  P(+|t, c) – real context word

•  P(-|t, c) – not a context word

Page 19: 19 word embeddings - Colorado State University


Computing word similarity

•  Remember: two vectors are similar if they have a high dot product

–  Cosine is just a normalized dot product

•  So:

–  similarity(t, c) ∝ t · c

•  We’ll normalize to get a probability using the sigmoid function:

P(+ \mid t, c) = \frac{1}{1 + e^{-\,\text{sim}(t,c)}}, \quad \text{where } \text{sim}(t,c) = t \cdot c
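A minimal sketch of this probability, assuming sim(t, c) = t · c and using small made-up embedding vectors.

import numpy as np

def p_positive(t, c):
    # P(+|t,c) = sigmoid of the target/context dot product
    return 1.0 / (1.0 + np.exp(-np.dot(t, c)))

t = np.array([0.2, -0.1, 0.4])  # target embedding (illustrative values)
c = np.array([0.3, 0.0, 0.5])   # context embedding (illustrative values)
print(p_positive(t, c))          # P(+|t,c); P(-|t,c) = 1 - P(+|t,c)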

Cost function

•  We want to maximize:

\sum_{(t,c)\,\in\,+} \log P(+\mid t,c) \;+\; \sum_{(t,c)\,\in\,-} \log P(-\mid t,c)

•  Notation:
–  the + set is the pairs from the positive training data
–  the – set is the pairs sampled from the negative data

•  Train using stochastic gradient descent

Page 20: 19 word embeddings - Colorado State University


Negative sampling

•  Problem: Too many negative examples to use them all

•  Solution: sample k negative examples n_1, ..., n_k for each positive pair (t, c). The contribution of that pair to the cost function is now:

\log P(+\mid t,c) + \sum_{i=1}^{k} \log P(-\mid t, n_i)


• Minimize the similarity of the (t,c) pairs drawn from the negative examples. We can express this formally over the whole training set as:

L(\theta) = \sum_{(t,c)\,\in\,+} \log P(+\mid t,c) \;+\; \sum_{(t,c)\,\in\,-} \log P(-\mid t,c) \quad (6.33)

Or, focusing in on one word/context pair (t,c) with its k noise words n_1 ... n_k, the learning objective L is:

L(\theta) = \log P(+\mid t,c) + \sum_{i=1}^{k} \log P(-\mid t,n_i)
          = \log \sigma(c \cdot t) + \sum_{i=1}^{k} \log \sigma(-n_i \cdot t)
          = \log \frac{1}{1+e^{-c\cdot t}} + \sum_{i=1}^{k} \log \frac{1}{1+e^{\,n_i \cdot t}} \quad (6.34)

That is, we want to maximize the dot product of the word with the actual context words, and minimize the dot products of the word with the k negative sampled non-neighbor words.

We can then use stochastic gradient descent to train to this objective, iteratively modifying the parameters (the embeddings for each target word t and each context word or noise word c in the vocabulary) to maximize the objective.
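A sketch of the per-pair objective in Eq. 6.34 written with NumPy; the random vectors stand in for the target, context, and noise embeddings.

import numpy as np

def log_sigmoid(x):
    # numerically stable log σ(x) = -log(1 + e^(-x))
    return -np.logaddexp(0.0, -x)

def pair_objective(t, c, noise):
    # L = log σ(c·t) + Σ_i log σ(-n_i·t)
    positive = log_sigmoid(np.dot(c, t))
    negative = sum(log_sigmoid(-np.dot(n, t)) for n in noise)
    return positive + negative

rng = np.random.default_rng(0)
d = 5
t, c = rng.normal(size=d), rng.normal(size=d)
noise = [rng.normal(size=d) for _ in range(2)]  # k = 2 noise words
print(pair_objective(t, c, noise))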

Note that the skip-gram model thus actually learns two separate embeddings for each word w: the target embedding t and the context embedding c. These embeddings are stored in two matrices, the target matrix T and the context matrix C. So each row i of the target matrix T is the 1 × d vector embedding t_i for word i in the vocabulary V, and each column i of the context matrix C is a d × 1 vector embedding c_i for word i in V. Fig. 6.13 shows an intuition of the learning task for the embeddings encoded in these two matrices.

Figure 6.13: The skip-gram model tries to shift embeddings so the target embedding (here for apricot) is closer to (has a higher dot product with) context embeddings for nearby words (here jam) and further from (has a lower dot product with) context embeddings for words that don't occur nearby (here aardvark).

Just as in logistic regression, then, the learning algorithm starts with randomly initialized W and C matrices, and then walks through the training corpus using gradient descent to move W and C so as to maximize the objective in Eq. 6.34. Thus the matrices W and C function as the parameters θ that logistic regression is tuning.

Learning the classifier

•  We’ll adjust the word weights to

–  make the positive pairs more likely

–  and the negative pairs less likely,

–  over the entire training set.

Page 21: 19 word embeddings - Colorado State University


Summary: How to learn word2vec (skip-gram) embeddings

•  Start with V random 300-dimensional vectors as initial embeddings

•  Using a logistic regression-like method:

–  Take a corpus and take pairs of words that co-occur as positive examples

–  Take pairs of words that don't co-occur as negative examples

–  Train the classifier to distinguish these by adjusting the embeddings to improve classifier performance

–  Throw away the classifier and keep the embeddings.
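In practice, libraries implement this whole loop. A hedged sketch using the gensim library (an assumption, not something the slides prescribe; parameter names follow gensim 4.x).

from gensim.models import Word2Vec

sentences = [
    "i would like to go someplace nearby for lunch".split(),
    "ong choi is delicious sauteed with garlic over rice".split(),
]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality
    window=2,         # ±2 word context window
    sg=1,             # skip-gram (sg=0 would be CBOW)
    negative=5,       # k = 5 negative samples per positive pair
    min_count=1,
)

vectors = model.wv  # "throw away the classifier, keep the embeddings"
print(vectors.most_similar("lunch", topn=3))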

Or in other words

•  Start with some initial embeddings (e.g., random)

•  iteratively make the embeddings for a word

–  more like the embeddings of its neighbors

–  less like the embeddings of other words.

Page 22: 19 word embeddings - Colorado State University


Properties of embeddings: Word similarity/relatedness

Nearest words to some embeddings (Mikolov et al. 2013)


…matrix is repeated between each one-hot input and the projection layer h. For the case of C = 1, these two embeddings must be combined into the projection layer, which is done by multiplying each one-hot context vector x by W to give us two input vectors (let's say v_i and v_j). We then average these vectors:

h = W \cdot \frac{1}{2C} \sum_{-c \le j \le c,\; j \ne 0} v^{(j)} \quad (19.31)

As with skip-grams, the projection vector h is multiplied by the output matrix W'. The result o = W'h is a 1 × |V| dimensional output vector giving a score for each of the |V| words. In doing so, the element o_k was computed by multiplying h by the output embedding for word w_k: o_k = v'_k h. Finally we normalize this score vector, turning the score for each element o_k into a probability by using the softmax function.

19.5 Properties of embeddings

We'll discuss in Section 17.8 how to evaluate the quality of different embeddings. But it is also sometimes helpful to visualize them. Fig. 17.14 shows the words/phrases that are most similar to some sample words using the phrase-based version of the skip-gram algorithm (Mikolov et al., 2013a).

target:   Redmond              Havel                     ninjutsu        graffiti      capitulate
          Redmond Wash.        Vaclav Havel              ninja           spray paint   capitulation
          Redmond Washington   president Vaclav Havel    martial arts    grafitti      capitulated
          Microsoft            Velvet Revolution         swordsmanship   taggers       capitulating

Figure 19.14: Examples of the closest tokens to some target words using a phrase-based extension of the skip-gram algorithm (Mikolov et al., 2013a).

One semantic property of various kinds of embeddings that may play a role in their usefulness is their ability to capture relational meanings.

Mikolov et al. (2013b) demonstrates that the offsets between vector embeddings can capture some relations between words, for example that the result of the expression vector('king') - vector('man') + vector('woman') is a vector close to vector('queen'); the left panel in Fig. 17.15 visualizes this by projecting a representation down into 2 dimensions. Similarly, they found that the expression vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'). Levy and Goldberg (2014a) shows that various other kinds of embeddings also seem to have this property. We return in the next section to these relational properties of embeddings and how they relate to meaning compositionality: the way the meaning of a phrase is built up out of the meaning of the individual vectors.


Embeddings capture relational meaning!

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)

vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
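With pretrained vectors, the analogy test is a one-liner, e.g. via gensim's KeyedVectors; the vector file name below is a placeholder for any word2vec-format file.

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)  # placeholder path

# vector('king') - vector('man') + vector('woman') ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected to return something like [('queen', 0.71)]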

Page 23: 19 word embeddings - Colorado State University


Page 24: 19 word embeddings - Colorado State University


Embeddings reflect cultural bias!

•  Ask “Paris : France :: Tokyo : x”

–  x = Japan

•  Ask “father : doctor :: mother : x”

–  x = nurse

•  Ask “man : computer programmer :: woman : x” –  x = homemaker

Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer programmer as woman is to homemaker? debiasing word embeddings." In Advances in Neural Information Processing Systems, pp. 4349-4357. 2016.

Train embeddings on old books to study changes in word meaning

[Figure: word vectors are trained separately on books from different periods (e.g., 1900, 1950, 2000), so the "dog" word vector from 1920 can be compared with the "dog" word vector from 1990.]

Page 25: 19 word embeddings - Colorado State University


Visualizing changes in word meaning

Project 300 dimensions down to 2

~30 million books, 1850–1990, Google Books data.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. Proceedings of ACL 2016.

Now use our historical embeddings as a tool to investigate cultural biases

•  From the historical embeddings

•  Compute historical biases of words:

Gender bias: how much closer a word is to "woman" synonyms than "man" synonyms.

Ethnic bias: how much closer a word is to last names of a given ethnicity than to names of Anglo ethnicity

•  Correlate with occupational data from historical census

•  Look at how all these change over time
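A minimal sketch of such a bias score: a word's average cosine similarity to a small "woman" word list minus its average similarity to a "man" word list. The word lists and the vectors mapping are illustrative stand-ins for the synonym sets and historical embeddings used in the paper.

import numpy as np

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

def gender_bias(word, vectors, woman_words, man_words):
    # positive score: the word sits closer to the "woman" side of the space
    w = vectors[word]
    to_woman = np.mean([cosine(w, vectors[x]) for x in woman_words])
    to_man = np.mean([cosine(w, vectors[x]) for x in man_words])
    return to_woman - to_man

# usage (assuming `vectors` maps words to numpy arrays, e.g. gensim KeyedVectors):
# gender_bias("nurse", vectors, ["woman", "she", "her"], ["man", "he", "him"])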

Page 26: 19 word embeddings - Colorado State University


Embedding bias correlates with occupation data

Is "nurse" closer to "man" than to "woman"?

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences 2018.

Embeddings reflects gender bias in occupations across time

Page 27: 19 word embeddings - Colorado State University


Conclusion

•  Concepts or word senses are hard to define formally

•  Embeddings = vector models of meaning

–  More fine-grained than just a string or index

–  Especially good at modeling similarity/analogy

•  Just download and use!

–  Useful in practice but also encode cultural stereotypes