# Lecture 10: Capturing semantics: Word Embeddings


Sanjeev Arora, Elad Hazan

COS 402: Machine Learning and Artificial Intelligence, Fall 2016

(Borrows from slides of D. Jurafsky, Stanford U.)

## Last time

N-gram language models: Pr[w2 | w1].

(Assign a probability to each sentence; trained using an unlabeled corpus.)

Unsupervised learning.

Today: unsupervised learning for semantics (meaning).

## Semantics

The study of the meaning of linguistic expressions.

Mastering and fluently operating with meaning seems a precursor to AI (Turing test, etc.).

But every bit of partial progress can immediately be used to improve technology:

- Improving web search
- Siri, etc.
- Machine translation
- Information retrieval, ...

## Let's talk about meaning

What does the sentence "You like green cream" mean? "Come up and see me sometime"? How did words arrive at their meanings?

Your answers seem to involve a mix of:

- Grammar/syntax
- How words map to physical experience: physical sensations (sweet, breathing, pain, etc.) and mental states (happy, regretful) appear to be experienced roughly the same way by everybody, which helps anchor some meanings
- (At some level, this becomes philosophy.)

## What are simple tests for understanding meaning?

Which of the following means the same as "pulchritude": (a) Truth, (b) Disgust, (c) Anger, (d) Beauty?

Analogy questions: Man : Woman :: King : ??

How can we teach a computer the notion of word similarity?

Test: think of a word that co-occurs with: cow, drink, babies, calcium.

## Distributional hypothesis of meaning [Harris '54], [Firth '57]

The meaning of a word is determined by the words it co-occurs with.

"Tiger" and "lion" are similar because they co-occur with similar words (jungle, hunt, predator, claws, etc.).

A computer could learn similarity by simply reading a text corpus; no need to wait for full AI!

## How can we quantify the distribution of nearby words?

"A bottle of tesgino is on the table." "Everybody likes tesgino." "Tesgino makes you drunk." "We make tesgino out of corn."

Recall bigram frequency:

$$P(\text{beer}, \text{drunk}) = \frac{\text{count}(\text{beer}, \text{drunk})}{\#\text{bigrams in corpus}}$$

Redefine:

count(beer, drunk) = # of times "beer" and "drunk" appear within 5 words of each other.

## Matrix of all such bigram counts (#columns = V ≈ 10^6)

|         | study | alcohol | drink | bottle | lion |
|---------|-------|---------|-------|--------|------|
| Beer    | 35    | 552     | 872   | 677    | 2    |
| Milk    | 45    | 20      | 935   | 250    | 0    |
| Jungle  | 12    | 35      | 110   | 28     | 931  |
| Tesgino | 2     | 67      | 75    | 57     | 0    |

Which words look most similar to each other?
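The windowed count just defined is easy to compute directly. A minimal sketch in Python; the function name and the tiny tokenized corpus are illustrative, not from the lecture:

```python
from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count how often each unordered word pair occurs within
    `window` words of each other (the redefined count(w1, w2))."""
    counts = Counter()
    for i, w in enumerate(tokens):
        # Look only ahead so each pair occurrence is counted once.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            counts[tuple(sorted((w, tokens[j])))] += 1
    return counts

# Toy corpus (illustrative):
tokens = "everybody likes beer beer makes you drunk we drink beer".split()
counts = cooccurrence_counts(tokens, window=5)
print(counts[("beer", "drunk")])  # → 3
```

Collecting these counts for every pair over a large corpus fills in the count matrix above, one row per target word.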

## Vector representation of "beer"

Distributional hypothesis: word similarity boils down to vector similarity.

Semantic embedding [Deerwester et al. 1990]: ignore bigrams with frequent words like "and", "for".

## Cosine similarity

Rescale vectors to unit length, then compute the dot product.

*(Excerpt from Jurafsky & Martin, §19.2, "Sparse Vector Models: Positive Pointwise Mutual Information":)*

|             | computer | data | pinch | result | sugar |
|-------------|----------|------|-------|--------|-------|
| apricot     | 0        | 0    | 0.56  | 0      | 0.56  |
| pineapple   | 0        | 0    | 0.56  | 0      | 0.56  |
| digital     | 0.62     | 0    | 0     | 0      | 0     |
| information | 0        | 0.58 | 0     | 0.37   | 0     |

Figure 19.6: The Add-2 Laplace smoothed PPMI matrix from the add-2 smoothing counts in Fig. 17.5.

The cosine, like most measures for vector similarity used in NLP, is based on the dot product operator from linear algebra, also called the inner product:

$$\text{dot-product}(\vec{v}, \vec{w}) = \vec{v} \cdot \vec{w} = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \dots + v_N w_N \tag{19.10}$$

Intuitively, the dot product acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions (orthogonal vectors) will be very dissimilar, with a dot product of 0.

This raw dot product, however, has a problem as a similarity metric: it favors long vectors. The vector length is defined as

$$|\vec{v}| = \sqrt{\sum_{i=1}^{N} v_i^2} \tag{19.11}$$

The dot product is higher if a vector is longer, with higher values in each dimension. More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. Raw dot product thus will be higher for frequent words. But this is a problem; we'd like a similarity metric that tells us how similar two words are regardless of their frequency.

The simplest way to modify the dot product to normalize for the vector length is to divide the dot product by the lengths of each of the two vectors. This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, following from the definition of the dot product between two vectors $\vec{a}$ and $\vec{b}$:

$$\vec{a} \cdot \vec{b} = |\vec{a}||\vec{b}|\cos\theta \qquad \frac{\vec{a} \cdot \vec{b}}{|\vec{a}||\vec{b}|} = \cos\theta \tag{19.12}$$

The cosine similarity metric between two vectors $\vec{v}$ and $\vec{w}$ thus can be computed as:

$$\text{cosine}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}||\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}} \tag{19.13}$$

For some applications we pre-normalize each vector, by dividing it by its length, creating a unit vector of length 1.

- High when two vectors have large values in the same dimensions.
- Low (in fact 0) for orthogonal vectors with zeros in complementary distribution.
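Eq. 19.13 translates directly into code. A minimal sketch, reusing the beer/milk/jungle rows of the count matrix from the earlier slide (the function name is illustrative):

```python
import math

def cosine(v, w):
    """Cosine similarity (Eq. 19.13): dot product divided by the
    product of the two vector lengths."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

# Rows of the count matrix (contexts: study, alcohol, drink, bottle, lion):
beer   = [35, 552, 872, 677, 2]
milk   = [45, 20, 935, 250, 0]
jungle = [12, 35, 110, 28, 931]

print(cosine(beer, milk))    # high: similar contexts
print(cosine(beer, jungle))  # low: different contexts
```

Note that cosine ignores vector length, so a frequent word and a rare word with the same context profile come out identical, which is exactly the normalization the excerpt argues for.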


Other measures of similarity have been defined.

## Reminder about properties of cosine

- -1: vectors point in opposite directions
- +1: vectors point in the same direction
- 0: vectors are orthogonal

## Nearest neighbors (cosine similarity)

An important use of embeddings: they allow language-processing systems to make a guess when labeled data is insufficient (borrow data/trained features of nearby words).

(Semi-supervised learning: learn embeddings from a large unlabeled corpus; use them as help for supervised learning.)

Apple: 'macintosh', 'microsoft', 'iphone', 'ibm', 'software', 'ios', 'ipad', 'ipod', 'apples', 'fruit'

Ball: 'balls', 'game', 'play', 'thrown', 'back', 'hit', 'throwing', 'player', 'throw', 'catch'
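Neighbor lists like the ones above come from ranking the whole vocabulary by cosine similarity to a query word. A minimal sketch over the toy count matrix from the earlier slide (these are raw count vectors, not real trained embeddings, and the function names are illustrative):

```python
import math

def cosine(v, w):
    """Dot product of v and w divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) *
                  math.sqrt(sum(b * b for b in w)))

def nearest_neighbors(word, vectors, k=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    sims = [(other, cosine(vectors[word], vec))
            for other, vec in vectors.items() if other != word]
    sims.sort(key=lambda t: t[1], reverse=True)
    return [w for w, _ in sims[:k]]

# Toy vectors: rows of the count matrix (study, alcohol, drink, bottle, lion).
vectors = {
    "beer":    [35, 552, 872, 677, 2],
    "milk":    [45, 20, 935, 250, 0],
    "jungle":  [12, 35, 110, 28, 931],
    "tesgino": [2, 67, 75, 57, 0],
}
print(nearest_neighbors("beer", vectors, k=2))  # → ['tesgino', 'milk']
```

Even on this tiny matrix the distributional hypothesis shows through: "tesgino" lands closest to "beer" because they share drink-like contexts.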

Anthropology field trip: visualization of word vectors obtained from dating profiles [Loren Shure, mathworks.com].

Evolution of new word meanings [Kulkarni, Al-Rfou, Perozzi, Skiena 2015].

## Problems with the above naive embedding method

Raw word frequency is not a great measure of association between words; it's very skewed: "the", "and", and "of" are very frequent, but maybe not the most discriminative.

We'd rather have a measure that asks whether a context word is particularly informative about the target word: Positive Pointwise Mutual Information (PPMI).

## Pointwise Mutual Information (PMI)

Pointwise mutual information: do events x and y co-occur more than if they were independent?

PMI between two words (Church & Hanks 1989): do words x and y co-occur more than if they were independent?

$$\text{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
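The PMI formula, with the positive clipping that turns it into PPMI, can be sketched from raw counts. The counts in the example are hypothetical, chosen only to exercise the formula:

```python
import math

def ppmi(count_xy, count_x, count_y, total):
    """Positive PMI from co-occurrence counts:
    PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ), clipped at 0."""
    if count_xy == 0:
        return 0.0
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return max(0.0, math.log2(p_xy / (p_x * p_y)))

# Hypothetical counts: "beer" and "drunk" co-occur 100 times; "beer"
# appears 1000 times and "drunk" 500 times among 1,000,000 windows.
print(ppmi(100, 1000, 500, 1_000_000))
```

Clipping negative values at 0 is what makes the measure "positive": pairs that co-occur less than chance would predict are treated as uninformative rather than anti-correlated.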