João Sedoc, Intro HLT class
November 4, 2019

Distributional Semantics
Nida example:
A bottle of tesgüino is on the table. Everybody likes tesgüino. Tesgüino makes you drunk. We make tesgüino out of corn.
From the context words humans can guess tesgüino means
◦ an alcoholic beverage like beer
Intuition for the algorithm:
◦ Two words are similar if they have similar word contexts.
Intuition of distributional word similarity
If we consider oculist and eye-doctor we find that, as our corpus of utterances grows, these two occur in almost the same environments. In contrast, there are many sentence environments in which oculist occurs but lawyer does not... It is a question of the relative frequency of such environments, and of what we will obtain if we ask an informant to substitute any word he wishes for oculist (not asking what words have the same meaning). These and similar tests all measure the probability of particular environments occurring with particular elements... If A and B have almost identical environments we say that they are synonyms.
–Zellig Harris (1954)
“You shall know a word by the company it keeps!” –John Firth (1957)
Distributional Hypothesis
Distributional models of meaning = vector-space models of meaning = vector semantics
Intuitions: Zellig Harris (1954):
◦ “oculist and eye-doctor … occur in almost the same environments”
◦ “If A and B have almost identical environments we say that they are synonyms.”
Firth (1957):
◦ “You shall know a word by the company it keeps!”
Intuition
Model the meaning of a word by “embedding” it in a vector space.
The meaning of a word is a vector of numbers
◦ Vector models are also called “embeddings”.
Contrast: in many computational linguistics applications, word meaning is represented by a vocabulary index (“word number 545”)
vec(“dog”) = (0.2, -0.3, 1.5, …)
vec(“bites”) = (0.5, 1.0, -0.4, …)
vec(“man”) = (-0.1, 2.3, -1.5, …)
Term-document matrix
Each cell: the count of term t in document d, tf_{t,d}
◦ Each document is a count vector in ℕ^V: a column below

            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle             1                1               8           15
soldier            2                2              12           36
fool              37               58               1            5
clown              6              117               0            0
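A minimal sketch (mine, not from the slides) of this matrix as a NumPy array, using the Shakespeare counts above:

```python
import numpy as np

docs = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
terms = ["battle", "soldier", "fool", "clown"]

# Rows = terms, columns = documents (counts copied from the table above).
M = np.array([
    [ 1,   1,  8, 15],   # battle
    [ 2,   2, 12, 36],   # soldier
    [37,  58,  1,  5],   # fool
    [ 6, 117,  0,  0],   # clown
])

# Each document is a column vector in N^|V|:
henry_v = M[:, docs.index("Henry V")]
print(dict(zip(terms, henry_v)))   # {'battle': 15, 'soldier': 36, ...}
```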
Term-document matrix
Two documents are similar if their vectors are similar: compare two columns of the matrix above.
The words in a term-document matrix
Each word is a count vector in ℕ^D: a row of the matrix above.
Two words are similar if their vectors are similar: compare two rows of the matrix above.
Brilliant insight: Use running text as implicitly supervised training data!
• A word s near fine
• Acts as gold ‘correct answer’ to the question
• “Is word w likely to show up near fine?”
• No need for hand-labeled supervision
• The idea comes from neural language modeling
• Bengio et al. (2003)
• Collobert et al. (2011)
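As a concrete illustration (my own sketch, not from the slides), sliding a window over running text yields these “free” (target, context) training pairs:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs within +/- `window` positions."""
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield target, tokens[j]

sent = "the wine was fine but the service was slow".split()
print([p for p in skipgram_pairs(sent) if p[0] == "fine"])
# [('fine', 'wine'), ('fine', 'was'), ('fine', 'but'), ('fine', 'the')]
```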
Word2vec
Popular embedding method
Very fast to train
Code available on the web
Idea: predict rather than count
◦ Instead of counting how often each word w occurs near “fine”
◦ Train a classifier on a binary prediction task:
◦ Is w likely to show up near “fine”?
◦ We don’t actually care about this task
◦ But we’ll take the learned classifier weights as the word embeddings (see the sketch below)
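A minimal training sketch, assuming the gensim library (version 4+) rather than the original C tool the slides link to; the toy corpus is hypothetical:

```python
from gensim.models import Word2Vec

corpus = [
    "the wine was fine".split(),
    "the beer was fine".split(),
    "tesguino is made from corn".split(),
]
# sg=1 selects the skip-gram classifier described above.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["fine"][:5])   # the learned embedding (first 5 dimensions)
```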
Dense embeddings you can download!
Word2vec (Mikolov et al.) https://code.google.com/archive/p/word2vec/
FastText http://www.fasttext.cc/
GloVe (Pennington, Socher, Manning) http://nlp.stanford.edu/projects/glove/
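One convenient way to load such pretrained vectors (my sketch; `glove-wiki-gigaword-100` is gensim-data's mirror of the GloVe vectors, not a link from the slides):

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # downloads ~130 MB on first use
print(glove.most_similar("fast", topn=3))     # nearest neighbors of "fast"
```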
Why vector models of meaning? Computing the similarity between words
“fast” is similar to “rapid”
“tall” is similar to “height”
Question answering:
Q: “How tall is Mt. Everest?”
Candidate A: “The official height of Mount Everest is 29029 feet”
Analogy: Embeddings capture relational meaning!
vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
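With GloVe vectors loaded as above, this arithmetic is a one-liner in gensim (a sketch, not the slides' own code): `most_similar` sums the `positive` vectors and subtracts the `negative` ones before ranking neighbors.

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
print(glove.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
# expected top answers: queen and rome (this model's vocabulary is lowercased)
```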
Evaluating similarity
Extrinsic (task-based, end-to-end) evaluation:
◦ Question answering
◦ Spell checking
◦ Essay grading
Intrinsic evaluation:
◦ Correlation between algorithm and human word similarity ratings
◦ WordSim-353: 353 noun pairs rated 0-10, e.g. sim(plane, car) = 5.77
◦ Taking TOEFL multiple-choice vocabulary tests
◦ Levied is closest in meaning to: imposed, believed, requested, correlated
Evaluating embeddings
Compare to human scores on word similarity-type tasks (a scoring sketch follows the list):
• WordSim-353 (Finkelstein et al., 2002)
• SimLex-999 (Hill et al., 2015)
• Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012)
• TOEFL dataset: Levied is closest in meaning to: imposed, believed, requested, correlated
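A minimal sketch of how such an intrinsic evaluation is scored (the plane/car rating comes from the slide; the other two pairs are hypothetical placeholders in WordSim-353's 0-10 format):

```python
import gensim.downloader as api
from scipy.stats import spearmanr

glove = api.load("glove-wiki-gigaword-100")
pairs = [("plane", "car", 5.77), ("tiger", "cat", 7.35), ("stock", "jaguar", 0.92)]

model_sims = [glove.similarity(w1, w2) for w1, w2, _ in pairs]
human_sims = [rating for _, _, rating in pairs]
rho, _ = spearmanr(model_sims, human_sims)   # rank correlation with humans
print(f"Spearman rho = {rho:.2f}")
```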
Intrinsic evaluation
Measuring similarity
Given 2 target words v and w, we’ll need a way to measure their similarity.
Most measures of vector similarity are based on the:
Cosine between embeddings!
◦ High when two vectors have large values in the same dimensions.
◦ Low (in fact 0) for orthogonal vectors with zeros in complementary distribution
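For reference, the standard cosine formula these bullets describe (not printed in this transcript):

```latex
\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{\lVert\vec{v}\rVert\,\lVert\vec{w}\rVert}
                       = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}
```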
Visualizing vectors and angles
[Figure: apricot, digital, and information plotted as vectors in two dimensions; Dimension 1: ‘large’, Dimension 2: ‘data’]
             large   data
apricot        2       0
digital        0       1
information    1       6
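A minimal sketch (mine) computing cosine on these 2-D count vectors; it matches the angles in the figure: information is close to digital and far from apricot.

```python
import numpy as np

vecs = {                       # (large, data) counts from the table above
    "apricot":     np.array([2, 0]),
    "digital":     np.array([0, 1]),
    "information": np.array([1, 6]),
}

def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

print(cosine(vecs["information"], vecs["digital"]))   # ~0.99
print(cosine(vecs["information"], vecs["apricot"]))   # ~0.16
```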
Bias in Word Embeddings
More to come on bias later …
Embeddings are the workhorse of NLP
◦ Used as pre-initialization for language models, neural MT, classification, NER systems …
◦ Downloaded and easily trainable
◦ Available in ~100s of languages
◦ And … contextualized word embeddings are built on top of them
Contextualized Word Embeddings
BERT - Deep Bidirectional Transformers
Putting it together
• Multiple (N) layers
• For encoder-decoder attention, Q: previous decoder layer, K and V: output of encoder
• For encoder self-attention, Q/K/V all come from previous encoder layer
• For decoder self-attention, allow each position to attend to all positions up to that position
• Positional encoding for word order
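A minimal NumPy sketch (mine, not the slides') of the scaled dot-product attention these bullets refer to; the causal flag implements the decoder self-attention masking, and real Transformers add learned Q/K/V projections and multiple heads:

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, optionally masked for decoder self-attention."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_query, n_key)
    if causal:
        # Each position may only attend to positions up to itself.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

x = np.random.randn(4, 8)                       # 4 positions, d_model = 8
print(attention(x, x, x, causal=True).shape)    # decoder self-attention: (4, 8)
```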
From last time …
Transformer
Training BERT
◦ BERT has two training objectives:
1. Predict missing (masked) words
2. Predict whether a sentence is the next sentence
BERT: predict missing words
BERT: predict “is this the next sentence?”
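A minimal sketch of objective 1 in action, assuming the Hugging Face transformers library (not part of the original slides):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("Tesguino is a kind of [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))   # top 3 guesses
```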
BERT on tasks
Take-Aways
◦ Distributional semantics: learn a word’s “meaning” from its context
◦ The simplest representation is frequency in documents
◦ Word embeddings (Word2Vec, GloVe, FastText) predict rather than count co-occurrences
◦ Similarity is often measured using cosine
◦ Intrinsic evaluation uses correlation with human judgements of word similarity
◦ Contextualized embeddings are the new stars of NLP