
Word Embeddings

Wei Xu (many slides from Greg Durrett)

Recall: Feedforward NNs

[Diagram: input features f(x) (n features) -> V (d x n matrix) -> z -> nonlinearity g (tanh, relu, ...) over d hidden units -> W (num_classes x d matrix) -> softmax -> P(y|x), a vector of num_classes probabilities]

P(y|x) = softmax(W g(V f(x)))

Recall: Backpropagation

[Diagram: the same network, f(x) -> V -> z -> g -> W -> softmax -> P(y|x), with d hidden units]

P(y|x) = softmax(W g(V f(x)))

Gradients flow backward: dL/dW is computed from err(root); dL/dV combines err(z) with dz/dV, where err(z) is the error signal passed back through z toward f(x)

This Lecture

‣ Training

‣ Word representations

‣ word2vec/GloVe

‣ Evaluating word embeddings

Training Tips

Batching

‣ Batching data gives speedups due to more efficient matrix operations

‣ Need to make the computation graph process a batch at the same time

def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes] (one-hot)
    ...
    probs = ffnn.forward(input)  # [batch_size, num_classes]
    loss = torch.sum(-torch.log(probs) * gold_label)  # elementwise product, not .dot, for batched tensors
    ...

‣ Batch sizes from 1-100 often work well
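Below is a minimal sketch of what the full batched update might look like once the elided pieces are filled in; `ffnn`, `optimizer`, and `make_update` mirror the snippet above but are placeholder names, not code from the slides.

```python
import torch

def make_update(ffnn, optimizer, input, gold_label):
    # input: [batch_size, num_feats]; gold_label: [batch_size, num_classes], one-hot
    probs = ffnn.forward(input)                        # [batch_size, num_classes]
    loss = torch.sum(-torch.log(probs) * gold_label)   # summed negative log-likelihood over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```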

Training Basics

‣ Basic formula: compute gradients on batch, use first-order optimization method (SGD, Adagrad, etc.)

‣ How to initialize? How to regularize? What optimizer to use?

‣ This lecture: some practical tricks. Take deep learning or optimization courses to understand this further

How does initialization affect learning?

[Diagram repeated: f(x) (n features) -> V (d x n matrix) -> z -> nonlinearity g (tanh, relu, ...) over d hidden units -> W (m x d matrix) -> softmax -> P(y|x)]

P(y|x) = softmax(W g(V f(x)))

‣ How do we initialize V and W? What consequences does this have?

‣ Nonconvex problem, so initialization matters!

‣ Nonlinear model… how does this affect things?

‣ Tanh: if cell activations are too large in absolute value, gradients are small

‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, and can break down if everything is too negative (“dead” ReLU)

How does initialization affect learning?

Krizhevsky et al. (2012)

Initialization

1) Can't use zeroes for parameters to produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, never change

‣ Can do random uniform/normal initialization with appropriate scale

‣ Xavier initializer: U[ -sqrt(6 / (fan-in + fan-out)), +sqrt(6 / (fan-in + fan-out)) ]

‣ Want variance of inputs and gradients for each layer to be the same

2) Initialize too large and cells are saturated

https://mmuratarat.github.io/2019-02-25/xavier-glorot-he-weight-init
https://arxiv.org/pdf/1502.01852v1.pdf
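Xavier/Glorot initialization is built into PyTorch; a minimal sketch for the V and W matrices above (the layer sizes here are made up for illustration):

```python
import torch.nn as nn

n, d, num_classes = 1000, 100, 5              # hypothetical feature / hidden / output sizes
V = nn.Linear(n, d)                           # the d x n matrix
W = nn.Linear(d, num_classes)                 # the num_classes x d matrix
nn.init.xavier_uniform_(V.weight)             # U[-sqrt(6/(fan-in+fan-out)), +sqrt(6/(fan-in+fan-out))]
nn.init.xavier_uniform_(W.weight)
nn.init.zeros_(V.bias)                        # biases can simply start at zero
nn.init.zeros_(W.bias)
```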

Batch Normalization

‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift + rescale each layer to have mean 0 and variance 1 over a batch (useful if net is deep)

https://medium.com/@shiyan/xavier-initialization-and-batch-normalization-my-understanding-b5b91268c25c
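In PyTorch this is one extra layer; a minimal sketch (sizes are hypothetical), with the normalization placed between the linear layer and the nonlinearity:

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(1000, 100),
    nn.BatchNorm1d(100),   # shift + rescale each hidden unit to mean 0, variance 1 over the batch
    nn.ReLU(),
    nn.Linear(100, 5),
)
```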

Dropout

‣ Probabilistically zero out parts of the network during training to prevent overfitting, use whole network at test time

Srivastava et al. (2014)

‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy

‣ Form of stochastic regularization

‣ One line in PyTorch/Tensorflow
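That one line is the nn.Dropout module; a minimal sketch (layer sizes and drop probability are assumptions):

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(1000, 100),
    nn.ReLU(),
    nn.Dropout(p=0.5),     # each hidden unit is zeroed with probability 0.5 during training
    nn.Linear(100, 5),
)

net.train()   # dropout active while training
net.eval()    # dropout disabled: the whole network is used at test time
```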

Optimizer

‣ Adam (Kingma and Ba, ICLR 2015) is very widely used

‣ Adaptive step size like Adagrad, incorporates momentum

Optimizer

‣ Wilson et al. NIPS 2017: adaptive methods can actually perform badly at test time (Adam is in pink, SGD in black)

‣ Check dev set periodically, decrease learning rate if not making progress
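A minimal sketch of this recipe, assuming a model `net` and helper functions `train_one_epoch` and `evaluate_dev` that are not part of the slides:

```python
import torch

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max",
                                                       factor=0.5, patience=2)

for epoch in range(20):
    train_one_epoch(net, optimizer)      # assumed training loop over batches
    dev_accuracy = evaluate_dev(net)     # check the dev set periodically
    scheduler.step(dev_accuracy)         # cut the learning rate if dev accuracy stops improving
```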

Four Elements of NNs

‣ Model: feedforward, RNNs, CNNs can be defined in a uniform framework

‣ Objective: many loss functions look similar, just changes the last layer of the neural network

‣ Inference: define the network, your library of choice takes care of it (mostly…)

‣ Training: lots of choices for optimization/hyperparameters

Word Representations

Word Representations

‣ Continuous model <-> expects continuous semantics from input

‣ “You shall know a word by the company it keeps” Firth (1957)

‣ Neural networks work very well at continuous data, but words are discrete

slide credit: Dan Klein, Dan Jurafsky

Discrete Word Representations

[Figure: binary Brown cluster tree, with 0/1 branch labels, over words such as good, enjoyable, great, fish, cat, dog, is, go]

‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster, not some posterior distribution like in mixture models)

‣ Maximize P(wi | wi-1) = P(ci | ci-1) P(wi | ci)

‣ Useful features for tasks like NER, not suitable for NNs

Brown et al. (1992)


‣ Brown clusters: hierarchical agglomerative hard clustering

‣ We give a very brief sketch of the algorithm here:

k: a hyperparameter; sort words by frequency

Take the top k most frequent words, put each of them in its own cluster c1, c2, c3, … ck

For i = (k + 1) … |V|:
    Create a new cluster ck+1 (we have k + 1 clusters)
    Choose two clusters from the k + 1 clusters based on Quality(C) and merge (back to k clusters)

Carry out k - 1 final merges (full hierarchy)

Running time O(|V| k^2 + n), n = # words in corpus

Learn more: Percy Liang's PhD thesis, Semi-Supervised Learning for Natural Language

Quality(C) = (1/n) Σ_{i=1..n} log e(wi | C(wi)) q(C(wi) | C(wi-1))
           = Σ_{c=1..k} Σ_{c'=1..k} p(c, c') log [ p(c, c') / (p(c) p(c')) ] + G

(the double sum is the mutual information between adjacent clusters; G is the entropy of the word distribution)
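To make the objective concrete, here is a small sketch (not Brown et al.'s algorithm) that scores a fixed clustering by the mutual-information term above; `corpus` is a token list and `C` a word-to-cluster dict, both assumed inputs. Since G does not depend on C, comparing candidate merges only needs this term.

```python
import math
from collections import Counter

def adjacent_cluster_mi(corpus, C):
    n = len(corpus) - 1                                    # number of adjacent word pairs
    pairs = Counter((C[corpus[i - 1]], C[corpus[i]]) for i in range(1, len(corpus)))
    left, right = Counter(), Counter()
    for (c1, c2), cnt in pairs.items():
        left[c1] += cnt
        right[c2] += cnt
    # sum over cluster pairs of p(c, c') log [ p(c, c') / (p(c) p(c')) ]
    return sum((cnt / n) * math.log((cnt / n) / ((left[c1] / n) * (right[c2] / n)))
               for (c1, c2), cnt in pairs.items())
```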

Discrete Word Representations


‣ Brown clusters: hierarchical agglomerative hard clustering

Discrete Word Representations

‣ Example clusters from Miller et al. 2004

word cluster features (bit string prefix)

NER in Twitter


2m 2ma 2mar 2mara 2maro 2marrow 2mor 2mora 2moro 2morow 2morr 2morro 2morrow 2moz 2mr

2mro 2mrrw 2mrw 2mw tmmrw tmo tmoro tmorrow tmoz tmr tmro tmrow tmrrow tmrrw tmrw tmrww tmw tomaro tomarow tomarro tomarrow tomm tommarow

tommarrow tommoro tommorow tommorrow tommorw tommrow tomo tomolo tomoro tomorow

tomorro tomorrw tomoz tomrw tomz

Ritter et al. (2011), Cherry & Guo (2015)

Word2vec

Word Embeddings

Botha et al. (2017)

Fed raises interest rates in order to …

[Figure: f(x) is built from emb(raises), emb(interest), emb(rates) (previous word, current word, next word), plus other words, feats, etc.]

‣ Word embeddings for each word form input

‣ Part-of-speech tagging with FFNNs

‣ What properties should these vectors have?

Sentiment Analysis

‣ Deep Averaging Networks: feedforward neural network on average of word embeddings from input

Iyyer et al. (2015)

[Figure: embedding space where good, enjoyable, and great lie close together, far from bad, dog, and is]

‣ Want a vector space where similar words have similar embeddings

the movie was great ≈ the movie was good

Word Embeddings

‣ Goal: come up with a way to produce these embeddings

‣ For each word, want “medium” dimensional vector (50-300 dims) representing it.

Word Representations


‣ Count-based: tf*idf, PPMI, …

‣ Class-based: Brown Clusters, …

‣ Distributed prediction-based embeddings: Word2vec, GloVe, FastText, …

‣ Distributed contextual embeddings: ELMo, BERT, GPT, …

‣ + many more variants: multi-sense embeddings, syntactic embeddings, …

word2vec/GloVe

Neural Probabilistic Language Model

Bengio et al. (2003)

word2vec: Continuous Bag-of-Words

‣ Predict word from context

the dog bit the man

‣ Parameters: d x |V| (one d-length context vector per voc word), |V| x d output parameters (W)

[Figure: the d-dimensional context embeddings c(dog) + c(the) (size d) are multiplied by W (size |V| x d) and fed through a softmax over |V| words; gold label = bit, no manual labeling required!]

P(w | w-1, w+1) = softmax(W (c(w-1) + c(w+1)))

Mikolov et al. (2013)
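A minimal PyTorch sketch of this CBOW setup (two-word context, full softmax; a simplification of Mikolov et al., with made-up class and variable names):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, d):
        super().__init__()
        self.context = nn.Embedding(vocab_size, d)            # c(.), the d x |V| context vectors
        self.output = nn.Linear(d, vocab_size, bias=False)    # W, the |V| x d output parameters

    def forward(self, prev_word, next_word):
        # prev_word, next_word: LongTensors of word indices, shape [batch]
        hidden = self.context(prev_word) + self.context(next_word)   # c(w-1) + c(w+1)
        return torch.log_softmax(self.output(hidden), dim=-1)        # log P(w | w-1, w+1)

# Gold labels come straight from raw text, e.g. predicting "bit" from ("dog", "the"):
# loss = nn.NLLLoss()(model(prev_ids, next_ids), center_ids)
```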

word2vec: Skip-Gram

the dog bit the man

‣ Predict one word of context from word

[Figure: the d-dimensional word embedding e(bit) is multiplied by W (size |V| x d) and fed through a softmax over |V| words; gold label = dog]

‣ Parameters: d x |V| vectors, |V| x d output parameters (W) (also usable as vectors!)

‣ Another training example: bit -> the

P(w' | w) = softmax(W e(w))

Mikolov et al. (2013)

Hierarchical Softmax

‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG

‣ Hierarchical softmax:

P(w | w-1, w+1) = softmax(W (c(w-1) + c(w+1)))
P(w' | w) = softmax(W e(w))

‣ Standard softmax: O(|V|) dot products of size d, per training instance per context word

‣ Hierarchical softmax: O(log(|V|)) dot products of size d, log(|V|) binary decisions

‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take

[Figure: binary Huffman tree over the vocabulary, with words such as “the” and “a” at the leaves; |V| x d parameters]

Mikolov et al. (2013)
http://building-babylon.net/2017/08/01/hierarchical-softmax/
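A minimal sketch of the idea (not word2vec's actual implementation): a word's probability is a product of sigmoid decisions along its root-to-leaf path in the Huffman tree. `path_nodes` and `path_signs` are hypothetical precomputed descriptions of that path.

```python
import torch
import torch.nn.functional as F

def hierarchical_softmax_log_prob(input_vec, node_embeddings, path_nodes, path_signs):
    # input_vec:        [d] embedding of the input/context word
    # node_embeddings:  [num_inner_nodes, d], one vector per inner node of the Huffman tree
    # path_nodes:       LongTensor of inner-node indices on the root-to-leaf path of the target word
    # path_signs:       FloatTensor of +1 / -1, which child is taken at each node
    scores = node_embeddings[path_nodes] @ input_vec       # log(|V|) dot products of size d
    return F.logsigmoid(path_signs * scores).sum()         # log P(word | input)
```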

Skip-Gram with Negative Sampling

‣ d x |V| vectors, d x |V| context vectors (same # of params as before)

Mikolov et al. (2013)

(bit, the) => +1    (bit, cat) => -1

(bit, a) => -1    (bit, fish) => -1

‣ Take (word, context) pairs and classify them as “real” or not. Create random negative examples by sampling from unigram distribution

words in similar contexts select for similar c vectors

P(y = 1 | w, c) = e^(w·c) / (e^(w·c) + 1)

‣ Objective = log P(y = 1 | w, c) + (1/k) Σ_{i=1..n} log P(y = 0 | wi, c), where the wi are the sampled negative words

the dog bit the man
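A minimal sketch of this objective as a PyTorch loss (negated, so it can be minimized); `w`, `c`, and `neg_w` are assumed to be the embedding of the true word, the context embedding, and a matrix of embeddings for the sampled negative words.

```python
import torch
import torch.nn.functional as F

def sgns_loss(w, c, neg_w, k):
    # w: [d] true word vector, c: [d] context vector, neg_w: [n, d] sampled negative word vectors
    pos = F.logsigmoid(torch.dot(w, c))        # log P(y = 1 | w, c)
    neg = F.logsigmoid(-(neg_w @ c)).sum()     # sum_i log P(y = 0 | w_i, c)
    return -(pos + neg / k)                    # negative of the objective above
```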

Connections with Matrix Factorization

‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

[Figure: |V| x |V| matrix of word pair counts]

        knife  dog  sword  love  like
knife     0     1     6     5     5
dog       1     0     5     5     5
sword     6     5     0     5     5
love      5     5     5     0     5
like      5     5     5     5     2

Two words are “similar” in meaning if their context vectors are similar. Similarity == relatedness

Connections with Matrix Factorization

Levy et al. (2014)

‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

[Figure: the |V| x |V| matrix of word pair counts factored as (|V| x d word vecs) x (d x |V| context vecs)]

‣ Looks almost like a matrix factorization… can we interpret it this way?

Skip-Gram as Matrix Factorization

Levy et al. (2014)

M_ij = PMI(wi, cj) - log k        (k = num negative samples)

PMI(wi, cj) = log [ P(wi, cj) / (P(wi) P(cj)) ] = log [ (count(wi, cj)/D) / ( (count(wi)/D) (count(cj)/D) ) ], where D is the total number of word-context pairs

‣ If we sample negative examples from the uniform distribution over words, the skip-gram objective exactly corresponds to factoring this |V| x |V| matrix…

‣ … and it's a weighted factorization problem (weighted by word freq)
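As an illustration of this view (an explicit construction in the spirit of Levy et al. 2014, not word2vec itself), one can build the shifted PMI matrix from counts and factor it directly; the clipping at zero below is the common positive-PMI variant.

```python
import numpy as np

def shifted_pmi_vectors(counts, k=5, d=100, eps=1e-8):
    # counts: |V| x |V| word-context co-occurrence count matrix
    counts = counts.astype(float)
    D = counts.sum()                                   # total number of word-context pairs
    p_w = counts.sum(axis=1, keepdims=True) / D
    p_c = counts.sum(axis=0, keepdims=True) / D
    pmi = np.log((counts / D + eps) / (p_w * p_c + eps))
    M = np.maximum(pmi - np.log(k), 0.0)               # M_ij = PMI(w_i, c_j) - log k, clipped at 0
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * np.sqrt(S[:d])                   # |V| x d word vectors
```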

Co-occurrence Matrix


‣ Typical problems in word-word co-occurrences:
‣ Raw frequency is not the best measure of association between words.
‣ Frequent words are often more important than rare words that only appear once or twice;
‣ But, frequent words (e.g., the) that appear in all documents are also not very useful signal.

‣ Solutions: weighting terms in word-word/word-doc co-occurrence matrix
‣ Tf*idf
‣ PPMI (Positive PMI)

Co-occurrence Matrix


‣ Tf*idf
‣ Tf: term frequency
‣ Idf: inverse document frequency

word-doc co-occurrences:

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                 0               7           17
soldier         2                80              62           89
fool           36                58               1            4
clown          20                15               2            3

tf = log10(count(t, d) + 1)

idf_i = log10(N / df_i)
(N = total number of docs in collection; df_i = number of docs that have word i)
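A minimal sketch applying these formulas to a word-document count matrix like the one above (rows = words, columns = documents); note that a word appearing in every document, like soldier here, gets idf = 0 and is downweighted to nothing.

```python
import numpy as np

def tf_idf(counts):
    counts = np.asarray(counts, dtype=float)
    tf = np.log10(counts + 1.0)                 # tf = log10(count(t, d) + 1)
    N = counts.shape[1]                         # total number of docs in collection
    df = (counts > 0).sum(axis=1)               # number of docs that have word i
    idf = np.log10(N / df)                      # idf_i = log10(N / df_i)
    return tf * idf[:, None]

weights = tf_idf([[1, 0, 7, 17],     # battle
                  [2, 80, 62, 89],   # soldier
                  [36, 58, 1, 4],    # fool
                  [20, 15, 2, 3]])   # clown
```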

GloVe (Global Vectors)

Pennington et al. (2014)

‣ Loss = Σ_{i,j} f(count(wi, cj)) ( wi^T cj + ai + bj - log count(wi, cj) )^2

‣ Also operates on counts matrix, weighted regression on the log co-occurrence matrix

‣ Constant in the dataset size (just need counts), quadratic in voc size

‣ By far the most common non-contextual word vectors used today (10000+ citations)

[Figure: |V| x |V| matrix of word pair counts]
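A minimal sketch of this loss in PyTorch (not the reference GloVe implementation); the weighting function f uses the paper's min(count/x_max, 1)^alpha form, and `counts` is assumed to be a dense |V| x |V| tensor of co-occurrence counts.

```python
import torch

def glove_loss(w, c, a, b, counts, x_max=100.0, alpha=0.75):
    # w: [V, d] word vectors, c: [V, d] context vectors, a: [V] and b: [V] bias terms
    mask = (counts > 0).float()                                # only observed co-occurrences contribute
    f = torch.clamp(counts / x_max, max=1.0) ** alpha          # f(count(w_i, c_j))
    pred = w @ c.t() + a[:, None] + b[None, :]                 # w_i . c_j + a_i + b_j
    sq_err = (pred - torch.log(counts.clamp(min=1.0))) ** 2    # (... - log count(w_i, c_j))^2
    return (f * sq_err * mask).sum()
```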

Preview: Context-dependent Embeddings

Peters et al. (2018)

‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors

‣ Context-sensitive word embeddings: depend on rest of the sentence

‣ Huge improvements across nearly all NLP tasks over word2vec & GloVe

they hit the balls / they dance at balls

‣ How to handle different word senses? One vector for balls

Evaluation

Visualization

Source: tensorflow.org

Visualization

Kulkarni et al. (2015)

Evaluating Word Embeddings

‣ What properties of language should word embeddings capture?

[Figure: embedding space where good, enjoyable, great cluster together; dog, cat, wolf, tiger cluster together; is, was cluster together; bad lies elsewhere]

‣ Similarity: similar words are close to each other

‣ Analogy:

Paris is to France as Tokyo is to ???

good is to best as smart is to ???

Word Similarity

Levy et al. (2015)

‣ Cosine Similarity:

cosine(v, w) = (v · w) / (|v| |w|) = Σ_{i=1..N} vi wi / ( sqrt(Σ_{i=1..N} vi^2) · sqrt(Σ_{i=1..N} wi^2) )
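As code, cosine similarity is a one-liner over NumPy vectors:

```python
import numpy as np

def cosine(v, w):
    v, w = np.asarray(v, dtype=float), np.asarray(w, dtype=float)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
```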

Word Similarity

Levy et al. (2015)

‣ SVD = singular value decomposition on PMI matrix

‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don't matter in practice


Hypernymy Detection

‣ Hypernyms: detective is a person, dog is an animal

‣ word2vec (SGNS) works barely better than random guessing here

‣ Do word vectors encode these relationships?

Chang et al. (2017)

Analogies

[Figure: king, queen, man, woman as points; the offset from man to woman parallels the offset from king to queen]

(king - man) + woman = queen

‣ Why would this be?

‣ woman - man captures the difference in the contexts that these occur in

king + (woman - man) = queen

‣ Dominant change: more “he” with man and “she” with woman, similar to the difference between king and queen

Analogies

Levy et al. (2015)

‣ These methods can perform well on analogies on two different datasets using two different methods

Maximizing for b2:

Add = cos(b2, a2 - a1 + b1)

Mul = cos(b2, a2) cos(b2, b1) / ( cos(b2, a1) + ε )
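A minimal sketch of the two objectives (for a1 : a2 :: b1 : b2), assuming `E` is a |V| x d embedding matrix with unit-length rows; cosines are rescaled to [0, 1] so the multiplicative form stays well-behaved, and in practice the query words themselves would be excluded from the argmax.

```python
import numpy as np

def cos_all(E, v):
    sims = E @ (v / np.linalg.norm(v))      # cosine of every word with v (rows of E are unit length)
    return (sims + 1.0) / 2.0               # rescale to [0, 1]

def analogy(E, a1, a2, b1, eps=1e-3):
    query = E[a2] - E[a1] + E[b1]
    add = E @ (query / np.linalg.norm(query))                                # cos(b2, a2 - a1 + b1)
    mul = cos_all(E, E[a2]) * cos_all(E, E[b1]) / (cos_all(E, E[a1]) + eps)  # multiplicative variant
    return int(add.argmax()), int(mul.argmax())                              # predicted b2 under Add, Mul
```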

Using Semantic Knowledge

Faruqui et al. (2015)

‣ Structure derived from a resource like WordNet

Original vector for false

Adapted vector for false

‣ Doesn't help most problems

Using Word Embeddings

‣ Approach 1: learn embeddings as parameters from your data
    ‣ Often works pretty well

‣ Approach 2: initialize using GloVe/word2vec/ELMo, keep fixed
    ‣ Faster because no need to update these parameters

‣ Approach 3: initialize using GloVe, fine-tune
    ‣ Works best for some tasks, not used for ELMo, often used for BERT
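A minimal PyTorch sketch of the three approaches, assuming `glove` is a [|V|, d] tensor of pretrained vectors (loading it from disk is omitted):

```python
import torch
import torch.nn as nn

vocab_size, d = 10000, 300                                    # hypothetical sizes

emb1 = nn.Embedding(vocab_size, d)                            # Approach 1: learn from scratch
emb2 = nn.Embedding.from_pretrained(glove, freeze=True)       # Approach 2: pretrained, kept fixed
emb3 = nn.Embedding.from_pretrained(glove, freeze=False)      # Approach 3: pretrained, fine-tuned
```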

Takeaways

‣ Lots to tune with neural networks

‣ Training: optimizer, initializer, regularization (dropout), …

‣ Hyperparameters: dimensionality of word embeddings, layers, …

‣ Word vectors: learning word -> context mappings has given way to matrix factorization approaches (constant in dataset size)

‣ Lots of pretrained embeddings work well in practice, they capture some desirable properties

‣ Even better: context-sensitive word embeddings (ELMo/BERT)

‣ Next time: RNNs and CNNs