Lecture 10: Capturing semantics: Word Embeddings
Sanjeev Arora, Elad Hazan
COS 402 – Machine Learning and Artificial Intelligence, Fall 2016
(Borrows from slides of D. Jurafsky, Stanford U.)


  • Last time

    N-gram language models: Pr[w2 | w1]

    (Assign a probability to each sentence; trained using an unlabeled corpus)

    Unsupervised learning

    Today: Unsupervised learning for semantics (meaning)

  • Semantics

    Study of meaning of linguistic expressions.

    Mastering and fluently operating with it seems a precursor to AI (Turing test, etc.)

    But every bit of partial progress can immediately be used to improve technology: improving web search, Siri, etc., machine translation, information retrieval, ...

  • Let's talk about meaning

    What does the sentence "You like green cream" mean? "Come up and see me sometime." How did words arrive at their meanings?

    Your answers seem to involve a mix of:

    Grammar/syntax. How words map to physical experience. Physical sensations (sweet, breathing, pain, etc.) and mental states (happy, regretful) appear to be experienced roughly the same way by everybody, which helps anchor some meanings (at some level, this becomes philosophy).

  • What are simple tests for understanding meaning?

    Which of the following means the same as pulchritude: (a) Truth (b) Disgust (c) Anger (d) Beauty?

    Analogy questions: Man : Woman :: King : ??

    How can we teach a computer the notion of word similarity?

  • Test: Think of a word that co-occurs with: cow, drink, babies, calcium

    Distributional hypothesis of meaning [Harris '54], [Firth '57]:

    The meaning of a word is determined by the words it co-occurs with.

    Tiger and lion are similar because they co-occur with similar words (jungle, hunt, predator, claws, etc.)

    A computer could learn similarity by simply reading a text corpus; no need to wait for full AI!

  • How can we quantify the distribution of nearby words?

    A bottle of tesgino is on the table. Everybody likes tesgino. Tesgino makes you drunk. We make tesgino out of corn.

    Recall bigram frequency: P(beer, drunk) = count(beer, drunk) / (total # of bigrams in the corpus)

    Redefine: count(beer, drunk) = # of times beer and drunk appear within 5 words of each other.
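    Below is a minimal sketch (not from the lecture) of how such windowed co-occurrence counts could be collected from a tokenized corpus; the whitespace tokenization and the 5-word window are assumptions.

    from collections import defaultdict

    def cooccurrence_counts(tokens, window=5):
        """Count how often each pair of words appears within `window` words of each other."""
        counts = defaultdict(int)
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                c = tokens[j]
                counts[(w, c)] += 1
                counts[(c, w)] += 1  # symmetric: "beer near drunk" is the same event as "drunk near beer"
        return counts

    corpus = "everybody likes beer and beer makes you drunk".split()
    counts = cooccurrence_counts(corpus)
    print(counts[("beer", "drunk")])  # 2: both occurrences of 'beer' are within 5 words of 'drunk'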

  • Matrix of all such "bigram" counts (# columns = V = 10^6)

                 study  alcohol  drink  bottle  lion
    Beer           35      552    872     677     2
    Milk           45       20    935     250     0
    Jungle         12       35    110      28   931
    Tesgino         2       67     75      57     0

    Which words look most similar to each other?

    Vector representation of beer: its row of counts.

    Distributional hypothesis: word similarity boils down to vector similarity.

    Semantic Embedding [Deerwester et al. 1990] (see the sketch below)

    Ignore bigrams with frequent words like 'and', 'for'.
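    The Deerwester et al. (1990) "semantic embedding" mentioned above is, on the usual latent-semantic-analysis reading, a truncated SVD of a word-context count matrix. Below is a minimal sketch under that assumption, applied to the toy matrix from this slide; the embedding dimension k = 2 is an arbitrary choice.

    import numpy as np

    # Toy word-context count matrix from the slide (rows: words, columns: contexts)
    words = ["beer", "milk", "jungle", "tesgino"]
    contexts = ["study", "alcohol", "drink", "bottle", "lion"]
    C = np.array([[35, 552, 872, 677,   2],
                  [45,  20, 935, 250,   0],
                  [12,  35, 110,  28, 931],
                  [ 2,  67,  75,  57,   0]], dtype=float)

    # Truncated SVD: keep the top-k singular directions as the embedding space.
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    k = 2
    embeddings = U[:, :k] * S[:k]  # each row is a k-dimensional word vector

    for w, v in zip(words, embeddings):
        print(w, np.round(v, 1))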

  • Cosine Similarity

    Cosine similarity: rescale vectors to unit length, then compute the dot product.

    [From Section 19.2, "Sparse Vector Models: Positive Pointwise Mutual Information":]

    Figure 19.6: The Add-2 Laplace smoothed PPMI matrix from the add-2 smoothing counts in Fig. 17.5.

                  computer  data  pinch  result  sugar
    apricot          0       0     0.56    0      0.56
    pineapple        0       0     0.56    0      0.56
    digital          0.62    0     0       0      0
    information      0       0.58  0       0.37   0

    The cosine, like most measures for vector similarity used in NLP, is based on the dot product operator from linear algebra, also called the inner product:

    dot-product(\vec{v}, \vec{w}) = \vec{v} \cdot \vec{w} = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \dots + v_N w_N    (19.10)

    Intuitively, the dot product acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions (orthogonal vectors) will be very dissimilar, with a dot product of 0.

    This raw dot product, however, has a problem as a similarity metric: it favors long vectors. The vector length is defined as

    |\vec{v}| = \sqrt{\sum_{i=1}^{N} v_i^2}    (19.11)

    The dot product is higher if a vector is longer, with higher values in each dimension. More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. Raw dot product thus will be higher for frequent words. But this is a problem; we'd like a similarity metric that tells us how similar two words are regardless of their frequency.

    The simplest way to modify the dot product to normalize for the vector length is to divide the dot product by the lengths of each of the two vectors. This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, following from the definition of the dot product between two vectors \vec{a} and \vec{b}:

    \vec{a} \cdot \vec{b} = |\vec{a}| |\vec{b}| \cos\theta, \quad\text{so}\quad \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| |\vec{b}|} = \cos\theta    (19.12)

    The cosine similarity metric between two vectors \vec{v} and \vec{w} thus can be computed as:

    cosine(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}| |\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2} \sqrt{\sum_{i=1}^{N} w_i^2}}    (19.13)

    For some applications we pre-normalize each vector by dividing it by its length, creating a unit vector of length 1. Thus we could compute a unit vector from \vec{a} by dividing it by |\vec{a}|.

    High when two vectors have large values in the same dimensions.

    Low (in fact 0) for orthogonal vectors with zeros in complementary distribution.
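    Below is a minimal sketch (not from the lecture) that applies the cosine formula (19.13) to the count vectors from the earlier beer/milk/jungle/tesgino table.

    import numpy as np

    def cosine(v, w):
        # Cosine similarity: dot product divided by the product of the vector lengths (Eq. 19.13).
        return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

    # Rows of the count matrix from the earlier slide: (study, alcohol, drink, bottle, lion)
    vectors = {
        "beer":    np.array([35, 552, 872, 677,   2], dtype=float),
        "milk":    np.array([45,  20, 935, 250,   0], dtype=float),
        "jungle":  np.array([12,  35, 110,  28, 931], dtype=float),
        "tesgino": np.array([ 2,  67,  75,  57,   0], dtype=float),
    }

    for other in ["milk", "jungle", "tesgino"]:
        print(f"cosine(beer, {other}) = {cosine(vectors['beer'], vectors[other]):.3f}")

    Beer and tesgino come out far more similar to each other than either is to jungle, matching the intuition from the table.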


    Other measures of similarity have been defined.

  • Reminder about properties of cosine

    -1: vectors point in opposite directions
    +1: vectors point in the same direction
     0: vectors are orthogonal

  • Nearest neighbors (cosine similarity)

    Important use of embeddings: allow language processing systems to make a guess when labeled data is insufficient (borrow data/trained features of nearby words).

    (Semi-supervised learning: learn embeddings from a large unlabeled corpus; use them as help for supervised learning.)

    Apple: 'macintosh' 'microsoft' 'iphone' 'ibm' 'software' 'ios' 'ipad' 'ipod' 'apples' 'fruit'

    Ball: 'balls' 'game' 'play' 'thrown' 'back' 'hit' 'throwing' 'player' 'throw' 'catch'
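    A minimal sketch (not from the lecture) of nearest-neighbor lookup by cosine similarity over a table of word vectors; the small vocabulary and the randomly generated 50-dimensional embeddings are placeholders standing in for vectors learned from a corpus.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["apple", "macintosh", "iphone", "ball", "game", "throw"]
    E = rng.normal(size=(len(vocab), 50))  # placeholder embedding matrix, one row per word

    def nearest_neighbors(word, k=3):
        # Pre-normalize rows to unit length so that dot products are cosine similarities.
        unit = E / np.linalg.norm(E, axis=1, keepdims=True)
        sims = unit @ unit[vocab.index(word)]
        order = np.argsort(-sims)  # most similar first
        return [(vocab[i], round(float(sims[i]), 3)) for i in order if vocab[i] != word][:k]

    print(nearest_neighbors("apple"))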

  • Anthropology field trip: visualization of word vectors obtained from dating profiles

    [Loren Shure, mathworks.com]

  • Evolution of new word meanings (Kulkarni, Al-Rfou, Perozzi, Skiena 2015)

  • Problems with the above naive embedding method

    Raw word frequency is not a great measure of association between words. It's very skewed: 'the' and 'of' are very frequent, but maybe not the most discriminative.

    We'd rather have a measure that asks whether a context word is particularly informative about the target word: Positive Pointwise Mutual Information (PPMI).

  • Pointwise Mutual Information (PMI)

    Pointwise mutual information: Do events x and y co-occur more than if they were independent?

    PMI between two words (Church & Hanks 1989): Do words x and y co-occur more than if they were independent?

    PMI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
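    A minimal sketch (not from the lecture) of turning a word-context count matrix into a PPMI matrix, following the definition above; clipping negative values at zero is what makes the measure "positive" PMI.

    import numpy as np

    def ppmi(C):
        # C: word-context count matrix (rows = target words, columns = context words).
        total = C.sum()
        P_wc = C / total                       # joint probabilities P(w, c)
        P_w = P_wc.sum(axis=1, keepdims=True)  # marginals P(w)
        P_c = P_wc.sum(axis=0, keepdims=True)  # marginals P(c)
        with np.errstate(divide="ignore"):     # log2(0) = -inf for zero counts; clipped below
            pmi = np.log2(P_wc / (P_w * P_c))
        return np.maximum(pmi, 0)              # keep only positive associations

    # Toy count matrix from the earlier beer/milk/jungle/tesgino slide
    C = np.array([[35, 552, 872, 677,   2],
                  [45,  20, 935, 250,   0],
                  [12,  35, 110,  28, 931],
                  [ 2,  67,  75,  57,   0]], dtype=float)
    print(np.round(ppmi(C), 2))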