
Word Embeddings

Wei Xu (many slides from Greg Durrett)

Recall: Feedforward NNs

[Diagram: input features f(x) (n features) -> V (d x n matrix) -> z -> nonlinearity g (tanh, relu, ...) over d hidden units -> W (num_classes x d matrix) -> softmax -> P(y|x), a vector of num_classes probabilities]

P(y|x) = softmax(W g(V f(x)))

Recall: Backpropagation

[Diagram: the same network, f(x) -> V -> z -> g -> W -> softmax -> P(y|x), with d hidden units]

P(y|x) = softmax(W g(V f(x)))

Gradients flow backward: dL/dW is computed from err(root); dL/dV combines err(z) with dz/dV, where err(z) is the error signal passed back through z toward f(x)

This Lecture

‣ Training

‣ Word representations

‣ word2vec/GloVe

‣ Evaluating word embeddings

Training Tips

Batching

‣ Batching data gives speedups due to more efficient matrix operations

‣ Need to make the computation graph process a batch at the same time

def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes] (one-hot)
    ...
    probs = ffnn.forward(input)  # [batch_size, num_classes]
    loss = torch.sum(-torch.log(probs) * gold_label)  # elementwise product, not .dot, for batched tensors
    ...

‣ Batch sizes from 1-100 often work well
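Below is a minimal sketch of what the full batched update might look like once the elided pieces are filled in; `ffnn`, `optimizer`, and `make_update` mirror the snippet above but are placeholder names, not code from the slides.

```python
import torch

def make_update(ffnn, optimizer, input, gold_label):
    # input: [batch_size, num_feats]; gold_label: [batch_size, num_classes], one-hot
    probs = ffnn.forward(input)                        # [batch_size, num_classes]
    loss = torch.sum(-torch.log(probs) * gold_label)   # summed negative log-likelihood over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```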

Training Basics

‣ Basic formula: compute gradients on batch, use first-order optimization method (SGD, Adagrad, etc.)

‣ How to initialize? How to regularize? What optimizer to use?

‣ This lecture: some practical tricks. Take deep learning or optimization courses to understand this further

How does initialization affect learning?

[Diagram repeated: f(x) (n features) -> V (d x n matrix) -> z -> nonlinearity g (tanh, relu, ...) over d hidden units -> W (m x d matrix) -> softmax -> P(y|x)]

P(y|x) = softmax(W g(V f(x)))

‣ How do we initialize V and W? What consequences does this have?

‣ Nonconvex problem, so initialization matters!

‣ Nonlinear model… how does this affect things?

‣ Tanh: if cell activations are too large in absolute value, gradients are small

‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, and can break down if everything is too negative (“dead” ReLU)

How does initialization affect learning?

Krizhevsky et al. (2012)

Initialization

1) Can't use zeroes for parameters to produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, never change

‣ Can do random uniform/normal initialization with appropriate scale

‣ Xavier initializer: U[ -sqrt(6 / (fan-in + fan-out)), +sqrt(6 / (fan-in + fan-out)) ]

‣ Want variance of inputs and gradients for each layer to be the same

2) Initialize too large and cells are saturated

https://mmuratarat.github.io/2019-02-25/xavier-glorot-he-weight-init
https://arxiv.org/pdf/1502.01852v1.pdf
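Xavier/Glorot initialization is built into PyTorch; a minimal sketch for the V and W matrices above (the layer sizes here are made up for illustration):

```python
import torch.nn as nn

n, d, num_classes = 1000, 100, 5              # hypothetical feature / hidden / output sizes
V = nn.Linear(n, d)                           # the d x n matrix
W = nn.Linear(d, num_classes)                 # the num_classes x d matrix
nn.init.xavier_uniform_(V.weight)             # U[-sqrt(6/(fan-in+fan-out)), +sqrt(6/(fan-in+fan-out))]
nn.init.xavier_uniform_(W.weight)
nn.init.zeros_(V.bias)                        # biases can simply start at zero
nn.init.zeros_(W.bias)
```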

Batch Normalization

‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift + rescale each layer to have mean 0 and variance 1 over a batch (useful if net is deep)

https://medium.com/@shiyan/xavier-initialization-and-batch-normalization-my-understanding-b5b91268c25c
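In PyTorch this is one extra layer; a minimal sketch (sizes are hypothetical), with the normalization placed between the linear layer and the nonlinearity:

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(1000, 100),
    nn.BatchNorm1d(100),   # shift + rescale each hidden unit to mean 0, variance 1 over the batch
    nn.ReLU(),
    nn.Linear(100, 5),
)
```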

Dropout

‣ Probabilistically zero out parts of the network during training to prevent overfitting, use whole network at test time

Srivastava et al. (2014)

‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy

‣ Form of stochastic regularization

‣ One line in PyTorch/Tensorflow
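That one line is the nn.Dropout module; a minimal sketch (layer sizes and drop probability are assumptions):

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(1000, 100),
    nn.ReLU(),
    nn.Dropout(p=0.5),     # each hidden unit is zeroed with probability 0.5 during training
    nn.Linear(100, 5),
)

net.train()   # dropout active while training
net.eval()    # dropout disabled: the whole network is used at test time
```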

Optimizer

‣ Adam (Kingma and Ba, ICLR 2015) is very widely used

‣ Adaptive step size like Adagrad, incorporates momentum

Optimizer

‣ Wilson et al. NIPS 2017: adaptive methods can actually perform badly at test time (Adam is in pink, SGD in black)

‣ Check dev set periodically, decrease learning rate if not making progress
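A minimal sketch of this recipe, assuming a model `net` and helper functions `train_one_epoch` and `evaluate_dev` that are not part of the slides:

```python
import torch

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max",
                                                       factor=0.5, patience=2)

for epoch in range(20):
    train_one_epoch(net, optimizer)      # assumed training loop over batches
    dev_accuracy = evaluate_dev(net)     # check the dev set periodically
    scheduler.step(dev_accuracy)         # cut the learning rate if dev accuracy stops improving
```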

Four Elements of NNs

‣ Model: feedforward, RNNs, CNNs can be defined in a uniform framework

‣ Objective: many loss functions look similar, just changes the last layer of the neural network

‣ Inference: define the network, your library of choice takes care of it (mostly…)

‣ Training: lots of choices for optimization/hyperparameters

Word Representations

Word Representations

‣ Continuous model <-> expects continuous semantics from input

‣ “You shall know a word by the company it keeps” Firth (1957)

‣ Neural networks work very well at continuous data, but words are discrete

slide credit: Dan Klein, Dan Jurafsky

Discrete Word Representations

[Figure: binary Brown cluster tree, with 0/1 branch labels, over words such as good, enjoyable, great, fish, cat, dog, is, go]

‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster, not some posterior distribution like in mixture models)

‣ Maximize P(wi | wi-1) = P(ci | ci-1) P(wi | ci)

‣ Useful features for tasks like NER, not suitable for NNs

Brown et al. (1992)


‣ Brown clusters: hierarchical agglomerative hard clustering

‣ We give a very brief sketch of the algorithm here:

k: a hyperparameter; sort words by frequency

Take the top k most frequent words, put each of them in its own cluster c1, c2, c3, … ck

For i = (k + 1) … |V|:
    Create a new cluster ck+1 (we have k + 1 clusters)
    Choose two clusters from the k + 1 clusters based on Quality(C) and merge (back to k clusters)

Carry out k - 1 final merges (full hierarchy)

Running time O(|V| k^2 + n), n = # words in corpus

Learn more: Percy Liang's PhD thesis, Semi-Supervised Learning for Natural Language

Quality(C) = (1/n) Σ_{i=1..n} log e(wi | C(wi)) q(C(wi) | C(wi-1))
           = Σ_{c=1..k} Σ_{c'=1..k} p(c, c') log [ p(c, c') / (p(c) p(c')) ] + G

(the double sum is the mutual information between adjacent clusters; G is the entropy of the word distribution)
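To make the objective concrete, here is a small sketch (not Brown et al.'s algorithm) that scores a fixed clustering by the mutual-information term above; `corpus` is a token list and `C` a word-to-cluster dict, both assumed inputs. Since G does not depend on C, comparing candidate merges only needs this term.

```python
import math
from collections import Counter

def adjacent_cluster_mi(corpus, C):
    n = len(corpus) - 1                                    # number of adjacent word pairs
    pairs = Counter((C[corpus[i - 1]], C[corpus[i]]) for i in range(1, len(corpus)))
    left, right = Counter(), Counter()
    for (c1, c2), cnt in pairs.items():
        left[c1] += cnt
        right[c2] += cnt
    # sum over cluster pairs of p(c, c') log [ p(c, c') / (p(c) p(c')) ]
    return sum((cnt / n) * math.log((cnt / n) / ((left[c1] / n) * (right[c2] / n)))
               for (c1, c2), cnt in pairs.items())
```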

Discrete Word Representations


‣ Brown clusters: hierarchical agglomerative hard clustering

Discrete Word Representations

‣ Example clusters from Miller et al. 2004

word cluster features (bit string prefix)

NER in Twitter


2m 2ma 2mar 2mara 2maro 2marrow 2mor 2mora 2moro 2morow 2morr 2morro 2morrow 2moz 2mr

2mro 2mrrw 2mrw 2mw tmmrw tmo tmoro tmorrow tmoz tmr tmro tmrow tmrrow tmrrw tmrw tmrww tmw tomaro tomarow tomarro tomarrow tomm tommarow

tommarrow tommoro tommorow tommorrow tommorw tommrow tomo tomolo tomoro tomorow

tomorro tomorrw tomoz tomrw tomz

Ritter et al. (2011), Cherry & Guo (2015)

Word2vec

Word Embeddings

Botha et al. (2017)

Fed raises interest rates in order to …

[Figure: f(x) is built from emb(raises), emb(interest), emb(rates) (previous word, current word, next word), plus other words, feats, etc.]

‣ Word embeddings for each word form input

‣ Part-of-speech tagging with FFNNs

‣ What properties should these vectors have?

Sentiment Analysis

‣ Deep Averaging Networks: feedforward neural network on average of word embeddings from input

Iyyer et al. (2015)

[Figure: embedding space where good, enjoyable, and great lie close together, far from bad, dog, and is]

‣ Want a vector space where similar words have similar embeddings

the movie was great ≈ the movie was good

Word Embeddings

‣ Goal: come up with a way to produce these embeddings

‣ For each word, want “medium” dimensional vector (50-300 dims) representing it.

Word Representations


‣ Count-based: tf*idf, PPMI, …

‣ Class-based: Brown Clusters, …

‣ Distributed prediction-based embeddings: Word2vec, GloVe, FastText, …

‣ Distributed contextual embeddings: ELMo, BERT, GPT, …

‣ + many more variants: multi-sense embeddings, syntactic embeddings, …

word2vec/GloVe

Neural Probabilistic Language Model

Bengio et al. (2003)

word2vec: Continuous Bag-of-Words

‣ Predict word from context

the dog bit the man

‣ Parameters: d x |V| (one d-length context vector per voc word), |V| x d output parameters (W)

[Figure: the d-dimensional context embeddings c(dog) + c(the) (size d) are multiplied by W (size |V| x d) and fed through a softmax over |V| words; gold label = bit, no manual labeling required!]

P(w | w-1, w+1) = softmax(W (c(w-1) + c(w+1)))

Mikolov et al. (2013)
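A minimal PyTorch sketch of this CBOW setup (two-word context, full softmax; a simplification of Mikolov et al., with made-up class and variable names):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, d):
        super().__init__()
        self.context = nn.Embedding(vocab_size, d)            # c(.), the d x |V| context vectors
        self.output = nn.Linear(d, vocab_size, bias=False)    # W, the |V| x d output parameters

    def forward(self, prev_word, next_word):
        # prev_word, next_word: LongTensors of word indices, shape [batch]
        hidden = self.context(prev_word) + self.context(next_word)   # c(w-1) + c(w+1)
        return torch.log_softmax(self.output(hidden), dim=-1)        # log P(w | w-1, w+1)

# Gold labels come straight from raw text, e.g. predicting "bit" from ("dog", "the"):
# loss = nn.NLLLoss()(model(prev_ids, next_ids), center_ids)
```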

word2vec: Skip-Gram

the dog bit the man

‣ Predict one word of context from word

[Figure: the d-dimensional word embedding e(bit) is multiplied by W (size |V| x d) and fed through a softmax over |V| words; gold label = dog]

‣ Parameters: d x |V| vectors, |V| x d output parameters (W) (also usable as vectors!)

‣ Another training example: bit -> the

P(w' | w) = softmax(W e(w))

Mikolov et al. (2013)

Hierarchical Softmax

‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG

‣ Hierarchical softmax:

P(w | w-1, w+1) = softmax(W (c(w-1) + c(w+1)))
P(w' | w) = softmax(W e(w))

‣ Standard softmax: O(|V|) dot products of size d, per training instance per context word

‣ Hierarchical softmax: O(log(|V|)) dot products of size d, log(|V|) binary decisions

‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take

[Figure: binary Huffman tree over the vocabulary, with words such as “the” and “a” at the leaves; |V| x d parameters]

Mikolov et al. (2013)
http://building-babylon.net/2017/08/01/hierarchical-softmax/
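A minimal sketch of the idea (not word2vec's actual implementation): a word's probability is a product of sigmoid decisions along its root-to-leaf path in the Huffman tree. `path_nodes` and `path_signs` are hypothetical precomputed descriptions of that path.

```python
import torch
import torch.nn.functional as F

def hierarchical_softmax_log_prob(input_vec, node_embeddings, path_nodes, path_signs):
    # input_vec:        [d] embedding of the input/context word
    # node_embeddings:  [num_inner_nodes, d], one vector per inner node of the Huffman tree
    # path_nodes:       LongTensor of inner-node indices on the root-to-leaf path of the target word
    # path_signs:       FloatTensor of +1 / -1, which child is taken at each node
    scores = node_embeddings[path_nodes] @ input_vec       # log(|V|) dot products of size d
    return F.logsigmoid(path_signs * scores).sum()         # log P(word | input)
```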

Skip-Gram with Negative Sampling

‣ d x |V| vectors, d x |V| context vectors (same # of params as before)

Mikolov et al. (2013)

(bit, the) => +1    (bit, cat) => -1

(bit, a) => -1    (bit, fish) => -1

‣ Take (word, context) pairs and classify them as “real” or not. Create random negative examples by sampling from unigram distribution

words in similar contexts select for similar c vectors

P(y = 1 | w, c) = e^(w·c) / (e^(w·c) + 1)

‣ Objective = log P(y = 1 | w, c) + (1/k) Σ_{i=1..n} log P(y = 0 | wi, c), where the wi are the sampled negative words

the dog bit the man
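A minimal sketch of this objective as a PyTorch loss (negated, so it can be minimized); `w`, `c`, and `neg_w` are assumed to be the embedding of the true word, the context embedding, and a matrix of embeddings for the sampled negative words.

```python
import torch
import torch.nn.functional as F

def sgns_loss(w, c, neg_w, k):
    # w: [d] true word vector, c: [d] context vector, neg_w: [n, d] sampled negative word vectors
    pos = F.logsigmoid(torch.dot(w, c))        # log P(y = 1 | w, c)
    neg = F.logsigmoid(-(neg_w @ c)).sum()     # sum_i log P(y = 0 | w_i, c)
    return -(pos + neg / k)                    # negative of the objective above
```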

Connections with Matrix Factorization

‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

[Figure: |V| x |V| matrix of word pair counts]

        knife  dog  sword  love  like
knife     0     1     6     5     5
dog       1     0     5     5     5
sword     6     5     0     5     5
love      5     5     5     0     5
like      5     5     5     5     2

Two words are “similar” in meaning if their context vectors are similar. Similarity == relatedness

Connections with Matrix Factorization

Levy et al. (2014)

‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

[Figure: the |V| x |V| matrix of word pair counts factored as (|V| x d word vecs) x (d x |V| context vecs)]

‣ Looks almost like a matrix factorization… can we interpret it this way?

Skip-Gram as Matrix Factorization

Levy et al. (2014)

M_ij = PMI(wi, cj) - log k        (k = num negative samples)

PMI(wi, cj) = log [ P(wi, cj) / (P(wi) P(cj)) ] = log [ (count(wi, cj)/D) / ( (count(wi)/D) (count(cj)/D) ) ], where D is the total number of word-context pairs

‣ If we sample negative examples from the uniform distribution over words, the skip-gram objective exactly corresponds to factoring this |V| x |V| matrix…

‣ … and it's a weighted factorization problem (weighted by word freq)
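As an illustration of this view (an explicit construction in the spirit of Levy et al. 2014, not word2vec itself), one can build the shifted PMI matrix from counts and factor it directly; the clipping at zero below is the common positive-PMI variant.

```python
import numpy as np

def shifted_pmi_vectors(counts, k=5, d=100, eps=1e-8):
    # counts: |V| x |V| word-context co-occurrence count matrix
    counts = counts.astype(float)
    D = counts.sum()                                   # total number of word-context pairs
    p_w = counts.sum(axis=1, keepdims=True) / D
    p_c = counts.sum(axis=0, keepdims=True) / D
    pmi = np.log((counts / D + eps) / (p_w * p_c + eps))
    M = np.maximum(pmi - np.log(k), 0.0)               # M_ij = PMI(w_i, c_j) - log k, clipped at 0
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * np.sqrt(S[:d])                   # |V| x d word vectors
```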

Co-occurrence Matrix


‣ Typical problems in word-word co-occurrences:
‣ Raw frequency is not the best measure of association between words.
‣ Frequent words are often more important than rare words that only appear once or twice;
‣ But, frequent words (e.g., the) that appear in all documents are also not very useful signal.

‣ Solutions: weighting terms in word-word/word-doc co-occurrence matrix
‣ Tf*idf
‣ PPMI (Positive PMI)

Co-occurrence Matrix


‣ Tf*idf
‣ Tf: term frequency
‣ Idf: inverse document frequency

word-doc co-occurrences:

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                 0               7           17
soldier         2                80              62           89
fool           36                58               1            4
clown          20                15               2            3

tf = log10(count(t, d) + 1)

idf_i = log10(N / df_i)
(N = total number of docs in collection; df_i = number of docs that have word i)
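A minimal sketch applying these formulas to a word-document count matrix like the one above (rows = words, columns = documents); note that a word appearing in every document, like soldier here, gets idf = 0 and is downweighted to nothing.

```python
import numpy as np

def tf_idf(counts):
    counts = np.asarray(counts, dtype=float)
    tf = np.log10(counts + 1.0)                 # tf = log10(count(t, d) + 1)
    N = counts.shape[1]                         # total number of docs in collection
    df = (counts > 0).sum(axis=1)               # number of docs that have word i
    idf = np.log10(N / df)                      # idf_i = log10(N / df_i)
    return tf * idf[:, None]

weights = tf_idf([[1, 0, 7, 17],     # battle
                  [2, 80, 62, 89],   # soldier
                  [36, 58, 1, 4],    # fool
                  [20, 15, 2, 3]])   # clown
```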

GloVe (Global Vectors)

Pennington et al. (2014)

‣ Loss = Σ_{i,j} f(count(wi, cj)) ( wi^T cj + ai + bj - log count(wi, cj) )^2

‣ Also operates on counts matrix, weighted regression on the log co-occurrence matrix

‣ Constant in the dataset size (just need counts), quadratic in voc size

‣ By far the most common non-contextual word vectors used today (10000+ citations)

[Figure: |V| x |V| matrix of word pair counts]
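A minimal sketch of this loss in PyTorch (not the reference GloVe implementation); the weighting function f uses the paper's min(count/x_max, 1)^alpha form, and `counts` is assumed to be a dense |V| x |V| tensor of co-occurrence counts.

```python
import torch

def glove_loss(w, c, a, b, counts, x_max=100.0, alpha=0.75):
    # w: [V, d] word vectors, c: [V, d] context vectors, a: [V] and b: [V] bias terms
    mask = (counts > 0).float()                                # only observed co-occurrences contribute
    f = torch.clamp(counts / x_max, max=1.0) ** alpha          # f(count(w_i, c_j))
    pred = w @ c.t() + a[:, None] + b[None, :]                 # w_i . c_j + a_i + b_j
    sq_err = (pred - torch.log(counts.clamp(min=1.0))) ** 2    # (... - log count(w_i, c_j))^2
    return (f * sq_err * mask).sum()
```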

Preview: Context-dependent Embeddings

Peters et al. (2018)

‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors

‣ Context-sensitive word embeddings: depend on rest of the sentence

‣ Huge improvements across nearly all NLP tasks over word2vec & GloVe

they hit the balls / they dance at balls

‣ How to handle different word senses? One vector for balls

Evaluation

Visualization

Source: tensorflow.org

Visualization

Kulkarni et al. (2015)

Evaluating Word Embeddings

‣ What properties of language should word embeddings capture?

[Figure: embedding space where good, enjoyable, great cluster together; dog, cat, wolf, tiger cluster together; is, was cluster together; bad lies elsewhere]

‣ Similarity: similar words are close to each other

‣ Analogy:

Paris is to France as Tokyo is to ???

good is to best as smart is to ???

Word Similarity

Levy et al. (2015)

‣ Cosine Similarity:

cosine(v, w) = (v · w) / (|v| |w|) = Σ_{i=1..N} vi wi / ( sqrt(Σ_{i=1..N} vi^2) · sqrt(Σ_{i=1..N} wi^2) )
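As code, cosine similarity is a one-liner over NumPy vectors:

```python
import numpy as np

def cosine(v, w):
    v, w = np.asarray(v, dtype=float), np.asarray(w, dtype=float)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
```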

Word Similarity

Levy et al. (2015)

‣ SVD = singular value decomposition on PMI matrix

‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don't matter in practice


Hypernymy Detection

‣ Hypernyms: detective is a person, dog is an animal

‣ word2vec (SGNS) works barely better than random guessing here

‣ Do word vectors encode these relationships?

Chang et al. (2017)

Analogies

[Figure: king, queen, man, woman as points; the offset from man to woman parallels the offset from king to queen]

(king - man) + woman = queen

‣ Why would this be?

‣ woman - man captures the difference in the contexts that these occur in

king + (woman - man) = queen

‣ Dominant change: more “he” with man and “she” with woman, similar to the difference between king and queen

Analogies

Levy et al. (2015)

‣ These methods can perform well on analogies on two different datasets using two different methods

Maximizing for b2:

Add = cos(b2, a2 - a1 + b1)

Mul = cos(b2, a2) cos(b2, b1) / ( cos(b2, a1) + ε )
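A minimal sketch of the two objectives (for a1 : a2 :: b1 : b2), assuming `E` is a |V| x d embedding matrix with unit-length rows; cosines are rescaled to [0, 1] so the multiplicative form stays well-behaved, and in practice the query words themselves would be excluded from the argmax.

```python
import numpy as np

def cos_all(E, v):
    sims = E @ (v / np.linalg.norm(v))      # cosine of every word with v (rows of E are unit length)
    return (sims + 1.0) / 2.0               # rescale to [0, 1]

def analogy(E, a1, a2, b1, eps=1e-3):
    query = E[a2] - E[a1] + E[b1]
    add = E @ (query / np.linalg.norm(query))                                # cos(b2, a2 - a1 + b1)
    mul = cos_all(E, E[a2]) * cos_all(E, E[b1]) / (cos_all(E, E[a1]) + eps)  # multiplicative variant
    return int(add.argmax()), int(mul.argmax())                              # predicted b2 under Add, Mul
```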

Using Semantic Knowledge

Faruqui et al. (2015)

‣ Structure derived from a resource like WordNet

Original vector for false

Adapted vector for false

‣ Doesn't help most problems

Using Word Embeddings

‣ Approach 1: learn embeddings as parameters from your data
    ‣ Often works pretty well

‣ Approach 2: initialize using GloVe/word2vec/ELMo, keep fixed
    ‣ Faster because no need to update these parameters

‣ Approach 3: initialize using GloVe, fine-tune
    ‣ Works best for some tasks, not used for ELMo, often used for BERT
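A minimal PyTorch sketch of the three approaches, assuming `glove` is a [|V|, d] tensor of pretrained vectors (loading it from disk is omitted):

```python
import torch
import torch.nn as nn

vocab_size, d = 10000, 300                                    # hypothetical sizes

emb1 = nn.Embedding(vocab_size, d)                            # Approach 1: learn from scratch
emb2 = nn.Embedding.from_pretrained(glove, freeze=True)       # Approach 2: pretrained, kept fixed
emb3 = nn.Embedding.from_pretrained(glove, freeze=False)      # Approach 3: pretrained, fine-tuned
```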

Takeaways

‣ Lots to tune with neural networks

‣ Training: optimizer, initializer, regularization (dropout), …

‣ Hyperparameters: dimensionality of word embeddings, layers, …

‣ Word vectors: learning word -> context mappings has given way to matrix factorization approaches (constant in dataset size)

‣ Lots of pretrained embeddings work well in practice, they capture some desirable properties

‣ Even better: context-sensitive word embeddings (ELMo/BERT)

‣ Next time: RNNs and CNNs