Distributional Semantics
João Sedoc, IntroHLT class, November 4, 2019


Page 1:

João Sedoc, IntroHLT class
November 4, 2019

Distributional Semantics

Page 2:

Nida example:
  A bottle of tesgüino is on the table.
  Everybody likes tesgüino.
  Tesgüino makes you drunk.
  We make tesgüino out of corn.

From context words humans can guess tesgüino means
◦ an alcoholic beverage like beer

Intuition for algorithm:
◦ Two words are similar if they have similar word contexts.

Intuition of distributional word similarity

Page 3:

If we consider optometrist and eye-doctor we find that, as our corpus of utterances grows, these two occur in almost the same environments. In contrast, there are many sentence environments in which optometrist occurs but lawyer does not... It is a question of the relative frequency of such environments, and of what we will obtain if we ask an informant to substitute any word he wishes for optometrist (not asking what words have the same meaning). These and similar tests all measure the probability of particular environments occurring with particular elements... If A and B have almost identical environments we say that they are synonyms.

–Zellig Harris (1954)

“You shall know a word by the company it keeps!” –John Firth (1957)

Distributional Hypothesis

Page 4:

Distributional models of meaning = vector-space models of meaning = vector semantics

Intuitions:
Zellig Harris (1954):
◦ “oculist and eye-doctor … occur in almost the same environments”
◦ “If A and B have almost identical environments we say that they are synonyms.”

Firth (1957):
◦ “You shall know a word by the company it keeps!”


Page 5:

Intuition

Model the meaning of a word by “embedding” it in a vector space.

The meaning of a word is a vector of numbers.
◦ Vector models are also called “embeddings”.

Contrast: word meaning is represented in many computational linguistic applications by a vocabulary index (“word number 545”).

vec(“dog”) = (0.2, -0.3, 1.5, …)
vec(“bites”) = (0.5, 1.0, -0.4, …)
vec(“man”) = (-0.1, 2.3, -1.5, …)


Page 6:

Term-document matrix


Page 7:

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               1               8            15
soldier             2               2              12            36
fool               37              58               1             5
clown               6             117               0             0

Term-document matrix
Each cell: count of term t in a document d: tf_{t,d}
◦ Each document is a count vector in ℕ^|V| (|V| = vocabulary size): a column below


Page 8:

Two documents are similar if their vectors are similar.


             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               1               8            15
soldier             2               2              12            36
fool               37              58               1             5
clown               6             117               0             0

Term-document matrix

Page 9:

The words in a term-document matrix

Each word is a count vector in ℕ^D (D = number of documents): a row below


             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               1               8            15
soldier             2               2              12            36
fool               37              58               1             5
clown               6             117               0             0

Page 10:

Two words are similar if their vectors are similar.


             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               1               8            15
soldier             2               2              12            36
fool               37              58               1             5
clown               6             117               0             0

The words in a term-document matrix
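A minimal sketch (not from the slides; variable names are my own) of how these document and word similarities can be computed directly from the term-document matrix above with NumPy:

```python
import numpy as np

# Term-document counts from the slides (rows = terms, columns = plays).
terms = ["battle", "soldier", "fool", "clown"]
docs = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
M = np.array([
    [ 1,   1,  8, 15],   # battle
    [ 2,   2, 12, 36],   # soldier
    [37,  58,  1,  5],   # fool
    [ 6, 117,  0,  0],   # clown
], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents are columns: the comedies are closer to each other than to Henry V.
print(cosine(M[:, 0], M[:, 1]))  # As You Like It vs. Twelfth Night  (~0.58)
print(cosine(M[:, 0], M[:, 3]))  # As You Like It vs. Henry V        (~0.18)

# Words are rows: "fool" is closer to "clown" than to "battle".
print(cosine(M[2], M[3]))        # fool vs. clown   (~0.87)
print(cosine(M[2], M[0]))        # fool vs. battle  (~0.15)
```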

Page 11:

Brilliant insight: Use running text as implicitly supervised training data!

• A word s near fine
• Acts as gold ‘correct answer’ to the question
• “Is word w likely to show up near fine?”

• No need for hand-labeled supervision
• The idea comes from neural language modeling
  • Bengio et al. (2003)
  • Collobert et al. (2011)

Page 12:

Word2vec

• Popular embedding method
• Very fast to train
• Code available on the web
• Idea: predict rather than count

Page 13:

◦ Instead of counting how often each word w occurs near “fine”
◦ Train a classifier on a binary prediction task:
◦ Is w likely to show up near “fine”?

◦ We don’t actually care about this task
◦ But we’ll take the learned classifier weights as the word embeddings

Word2vec
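For concreteness, here is a toy sketch of the skip-gram-with-negative-sampling idea behind word2vec. It is an illustration only, not the slides' or Mikolov et al.'s implementation: the corpus, hyperparameters, and single random negative sample per pair are my assumptions, and real trainers add frequency-weighted negative sampling, subsampling, and far more data.

```python
import numpy as np

# Toy corpus (illustrative only; real word2vec trains on billions of tokens).
corpus = "we make tesguino out of corn everybody likes tesguino".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, dim, window, lr = len(vocab), 10, 2, 0.05

rng = np.random.default_rng(0)
W_target = rng.normal(scale=0.1, size=(V, dim))    # the embeddings we keep
W_context = rng.normal(scale=0.1, size=(V, dim))   # context-side classifier weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for i, word in enumerate(corpus):
        t = idx[word]
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue
            c = idx[corpus[j]]
            v_t, v_c = W_target[t].copy(), W_context[c].copy()
            # Positive pair: push the classifier's sigmoid(t . c) toward 1.
            g = sigmoid(v_t @ v_c) - 1.0
            W_target[t] -= lr * g * v_c
            W_context[c] -= lr * g * v_t
            # One random negative sample: push sigmoid(t . neg) toward 0.
            n = int(rng.integers(V))
            v_n = W_context[n].copy()
            g = sigmoid(v_t @ v_n)
            W_target[t] -= lr * g * v_n
            W_context[n] -= lr * g * v_t

# The learned target-side classifier weights are the word embeddings.
print(W_target[idx["tesguino"]])
```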

Page 14:

Word2vec

Page 15:
Page 16:
Page 17:

Dense embeddings you can download!

• Word2vec (Mikolov et al.): https://code.google.com/archive/p/word2vec/

• FastText: http://www.fasttext.cc/

• GloVe (Pennington, Socher, Manning): http://nlp.stanford.edu/projects/glove/
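One common way to load such pretrained vectors (a tooling choice of mine, not something the slides prescribe) is gensim's downloader API; a minimal sketch:

```python
import gensim.downloader as api

# Downloads (on first use) and loads 100-dimensional GloVe vectors
# trained on Wikipedia + Gigaword; other model names are listed by api.info().
wv = api.load("glove-wiki-gigaword-100")

print(wv.most_similar("fine", topn=5))
print(wv.similarity("fast", "rapid"))
```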

Page 18:

Why vector models of meaning? Computing the similarity between words

“fast” is similar to “rapid”
“tall” is similar to “height”

Question answering:
Q: “How tall is Mt. Everest?”
Candidate A: “The official height of Mount Everest is 29029 feet”


Page 19:

Analogy: Embeddings capture relational meaning!

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)

vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)

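With pretrained vectors loaded (as in the gensim sketch above), the analogies can be queried by adding and subtracting vectors; a sketch, assuming this GloVe vocabulary is lowercased:

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # any pretrained KeyedVectors works here

# king - man + woman: add 'king' and 'woman', subtract 'man'.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Paris - France + Italy
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
```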

Page 20:

Evaluating similarity
Extrinsic (task-based, end-to-end) evaluation:
◦ Question answering
◦ Spell checking
◦ Essay grading

Intrinsic evaluation:
◦ Correlation between algorithm and human word similarity ratings
◦ WordSim-353: 353 noun pairs rated 0-10, e.g. sim(plane, car) = 5.77
◦ Taking TOEFL multiple-choice vocabulary tests
◦ Levied is closest in meaning to:
  imposed, believed, requested, correlated

Page 21:

Evaluating embeddings
Compare to human scores on word similarity-type tasks:
• WordSim-353 (Finkelstein et al., 2002)
• SimLex-999 (Hill et al., 2015)
• Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012)
• TOEFL dataset: Levied is closest in meaning to: imposed, believed, requested, correlated
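A sketch of how such an intrinsic evaluation is typically scored: compute the model's similarity for each word pair and report the Spearman correlation with the human ratings. The ratings below are illustrative (only plane/car = 5.77 comes from the slides):

```python
import gensim.downloader as api
from scipy.stats import spearmanr

wv = api.load("glove-wiki-gigaword-100")

# Word pairs with human ratings on a 0-10 scale (illustrative excerpt).
pairs = [("plane", "car", 5.77), ("tiger", "cat", 7.35), ("king", "cabbage", 0.23)]

model_scores = [wv.similarity(w1, w2) for w1, w2, _ in pairs]
human_scores = [rating for _, _, rating in pairs]

rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgements: {rho:.2f}")
```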

Page 22:

Intrinsic evaluation

Page 23:

Intrinsic evaluation

Page 24:

Measuring similarity

Given 2 target words v and w, we’ll need a way to measure their similarity.

Most measures of vector similarity are based on the cosine between embeddings!
◦ High when two vectors have large values in the same dimensions.
◦ Low (in fact 0) for orthogonal vectors with zeros in complementary distribution.


Page 25:

Visualizing vectors and angles

[Figure: two-dimensional vectors for “digital”, “apricot”, and “information”, plotted against Dimension 1: ‘large’ and Dimension 2: ‘data’.]

             large   data
apricot        2       0
digital        0       1
information    1       6
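A minimal sketch (not from the slides) of the cosine measure applied to the toy counts above:

```python
import numpy as np

def cosine(v, w):
    # cos(v, w) = (v . w) / (|v| |w|)
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

# Count vectors over the two context dimensions ('large', 'data') from the slide.
apricot     = np.array([2, 0])
digital     = np.array([0, 1])
information = np.array([1, 6])

print(cosine(apricot, digital))      # 0.0   (orthogonal: no shared dimensions)
print(cosine(digital, information))  # ~0.99 (small angle: very similar contexts)
print(cosine(apricot, information))  # ~0.16
```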

Page 26:

Bias in Word Embeddings

Page 27:

More to come on bias later …

Page 28:

Embeddings are the workhorse of NLP
◦ Used as pre-initialization for language models, neural MT, classification, NER systems, …
◦ Downloaded and easily trainable
◦ Available in ~100s of languages
◦ And… contextualized word embeddings are built on top of them

Page 29:

Contextualized Word Embeddings

Page 30:

BERT - Deep Bidirectional Transformers

Page 31:

Putting it together

• Multiple (N) layers

• For encoder-decoder attention, Q: previous decoder layer, K and V: output of encoder

• For encoder self-attention, Q/K/V all come from previous encoder layer

• For decoder self-attention, allow each position to attend to all positions up to that position

• Positional encoding for word order

From last time…
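As a refresher on what these bullets describe (not code from the lecture), here is a minimal NumPy sketch of scaled dot-product attention, with an optional causal mask of the kind used in decoder self-attention:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_q, n_k)
    if causal:
        # Each position may only attend to itself and earlier positions.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V

# Toy self-attention: 4 positions, dimension 8; Q, K, V all from the same layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X, causal=True)
print(out.shape)  # (4, 8)
```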

Page 32:

Transformer

Page 33:

Training BERT
◦ BERT has two training objectives:
1. Predict missing (masked) words
2. Predict if a sentence is the next sentence

Page 34:

BERT- predict missing words
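A quick way to see the masked-word objective in action (my tooling assumption, using the Hugging Face transformers fill-mask pipeline rather than anything shown on the slides):

```python
from transformers import pipeline

# Masked-word prediction with a pretrained BERT.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for pred in unmasker("Everybody likes a cold [MASK] of beer."):
    print(pred["token_str"], round(pred["score"], 3))
```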

Page 35:

BERT - predict: is this the next sentence?

Page 36:
Page 37:

BERT on tasks

Page 38:

Take-Aways
◦ Distributional semantics: learn a word’s “meaning” by its context
◦ Simplest representation is frequency in documents
◦ Word embeddings (Word2Vec, GloVe, FastText) predict rather than use co-occurrence
◦ Similarity is often measured using cosine
◦ Intrinsic evaluation uses correlation with human judgements of word similarity
◦ Contextualized embeddings are the new stars of NLP