Lecture 6: Neural Networks (aritter.github.io/courses/5525_slides_v2/lec6-nn.pdf)
Lecture 6: Neural Networks
Alan Ritter (many slides from Greg Durrett)
This Lecture
‣ Feedforward neural networks + backpropagation
‣ Neural network basics
‣ Applications
‣ Neural network history
‣ Implementing neural networks (if time)
History: NN "dark ages"
‣ Convnets: applied to MNIST by LeCun in 1998
‣ LSTMs: Hochreiter and Schmidhuber (1997)
‣ Henderson (2003): neural shift-reduce parser, not SOTA
2008-2013: A glimmer of light…
‣ Collobert and Weston 2011: "NLP (almost) from scratch"
‣ Feedforward neural nets induce features for sequential CRFs ("neural CRF")
‣ The 2008 version was marred by bad experiments and claimed SOTA but wasn't; the 2011 version tied SOTA
‣ Socher 2011-2014: tree-structured RNNs working okay
‣ Krizhevsky et al. (2012): AlexNet for vision
2014: Stuff starts working
‣ Sutskever et al. + Bahdanau et al.: seq2seq for neural MT (LSTMs work for NLP?)
‣ Kim (2014) + Kalchbrenner et al. (2014): sentence classification / sentiment (convnets work for NLP?)
‣ Chen and Manning (2014): transition-based dependency parser (even feedforward networks work well for NLP?)
‣ 2015: explosion of neural nets for everything under the sun
Whydidn’ttheyworkbefore?
‣ Datasetstoosmall:forMT,notreallybe5erun?lyouhave1M+parallelsentences(andreallyneedalotmore)
‣Op,miza,onnotwellunderstood:goodini?aliza?on,per-featurescaling+momentum(Adagrad/Adadelta/Adam)workbestout-of-the-box
‣ Regulariza,on:dropoutispre5yhelpful
‣ Inputs:needwordrepresenta?onstohavetherightcon?nuousseman?cs
‣ Computersnotbigenough:can’trunforenoughitera?ons
Neural Net Basics
Neural Networks
‣ How can we do nonlinear classification? Kernels are too slow…
‣ Want to learn intermediate conjunctive features of the input
‣ Linear classification: argmax_y w^T f(x, y)
‣ Example: for "the movie was not all that good", a useful conjunctive feature is I[contains not & contains good]
Neural Networks: XOR
‣ Let's see how we can use neural nets to learn a simple nonlinear function
‣ Inputs: x1, x2 (generally x = (x1, …, xm))
‣ Output: y (generally y = (y1, …, yn)); here y = x1 XOR x2

x1  x2  |  y = x1 XOR x2
 0   0  |  0
 0   1  |  1
 1   0  |  1
 1   1  |  0
Neural Networks: XOR
‣ A linear function y = a1 x1 + a2 x2 cannot represent XOR; the closest it can get is "or"
‣ Add a nonlinear term: y = a1 x1 + a2 x2 + a3 tanh(x1 + x2) (tanh looks like the action potential in a neuron)
Neural Networks: XOR
‣ y = a1 x1 + a2 x2 can only act as "or"; with the nonlinear term, y = a1 x1 + a2 x2 + a3 tanh(x1 + x2), XOR becomes representable
‣ For example: y = -x1 - x2 + 2 tanh(x1 + x2)
(check: y(0,0) = 0, y(0,1) = y(1,0) = -1 + 2 tanh(1) ≈ 0.52, y(1,1) = -2 + 2 tanh(2) ≈ -0.07, matching XOR)
Neural Networks: XOR
‣ The same construction handles the sentiment example: let x1 = I[contains not] and x2 = I[contains good] for "the movie was not all that good"
y = -2x1 - x2 + 2 tanh(x1 + x2)
Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
‣ Linear model: y = w · x + b
‣ Nonlinear transformation: y = g(w · x + b); with a whole layer of hidden units, y = g(Wx + b)
‣ In the figure: Wx warps space, + b shifts it, and g applies a pointwise nonlinear transformation
Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
‣ Linear classifier vs. neural network: the neural network separates the classes … possible because we transformed the space!
Deep Neural Networks
Adopted from Chris Dyer
‣ First layer: y = g(Wx + b); the output of the first layer is the input to the second
‣ Second layer: z = g(Vy + c), i.e. z = g(Vg(Wx + b) + c)
‣ "Feedforward" computation (not recurrent)
‣ Check: what happens if there is no nonlinearity? Then z = V(Wx + b) + c = (VW)x + (Vb + c), which is no more powerful than a basic linear model
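To make the check concrete, here is a small sketch (pure Python, made-up numbers) showing that with no nonlinearity, two stacked layers z = V(Wx + b) + c collapse into the single linear map (VW)x + (Vb + c):

```python
# Sketch (not from the slides): with g = identity, two stacked layers
# collapse into one linear layer, so no nonlinearity means no extra power.

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def vadd(a, b):
    return [p + q for p, q in zip(a, b)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W = [[1.0, 2.0], [3.0, 4.0]]; b = [0.5, -0.5]
V = [[2.0, 0.0], [1.0, 1.0]]; c = [1.0, 2.0]
x = [0.3, -0.7]

# Two "layers" with no nonlinearity: z = V(Wx + b) + c
z_two_layer = vadd(matvec(V, vadd(matvec(W, x), b)), c)

# Equivalent single linear layer: z = (VW)x + (Vb + c)
z_one_layer = vadd(matvec(matmul(V, W), x), vadd(matvec(V, b), c))

assert all(abs(p - q) < 1e-9 for p, q in zip(z_two_layer, z_one_layer))
```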
Deep Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Feedforward Networks, Backpropagation
Logistic Regression with NNs
‣ Single scalar probability: P(y|x) = exp(w^T f(x, y)) / Σ_{y'} exp(w^T f(x, y'))
‣ Compute scores for all possible labels at once (returns a vector): P(y|x) = softmax([w^T f(x, y)]_{y ∈ Y})
‣ softmax exps and normalizes a given vector: softmax(p)_i = exp(p_i) / Σ_{i'} exp(p_{i'})
‣ Weight vector per class; W is [num classes x num feats]: P(y|x) = softmax(Wf(x))
‣ Now one hidden layer: P(y|x) = softmax(Wg(Vf(x)))
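A minimal sketch of the softmax above in plain Python (subtracting the max before exponentiating is a standard numerical-stability trick, not something from the slide):

```python
import math

def softmax(p):
    m = max(p)  # subtract the max for numerical stability; result is unchanged
    exps = [math.exp(pi - m) for pi in p]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
assert abs(sum(probs) - 1.0) < 1e-12  # exps and normalizes: sums to 1
assert probs[2] > probs[1] > probs[0]  # order of scores is preserved
```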
Neural Networks for Classification
P(y|x) = softmax(Wg(Vf(x)))
‣ f(x): n features
‣ V: d x n matrix mapping to d hidden units
‣ g: nonlinearity (tanh, relu, …), giving z = g(Vf(x))
‣ W: num_classes x d matrix
‣ softmax over the num_classes scores gives the vector of probabilities P(y|x)
Training Neural Networks
‣ Maximize log likelihood of training data
‣ i*: index of the gold label
‣ e_i: 1 in the ith row, zero elsewhere; dotting by this selects the ith index
z = g(Vf(x)),  P(y|x) = softmax(Wz)
L(x, i*) = log P(y = i*|x) = log(softmax(Wz) · e_{i*})
L(x, i*) = Wz · e_{i*} - log Σ_j exp(Wz) · e_j
Computing Gradients
L(x, i*) = Wz · e_{i*} - log Σ_j exp(Wz) · e_j
‣ Gradient with respect to W (row i, column j):
∂L(x, i*)/∂W_ij = z_j - P(y = i|x) z_j   if i = i*
∂L(x, i*)/∂W_ij = -P(y = i|x) z_j        otherwise
‣ Looks like logistic regression with z as the features!
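The gradient formula can be sanity-checked numerically. This sketch uses made-up values for W, z, and the gold label, and compares the analytic ∂L/∂W_ij against central finite differences of the loss:

```python
import math

z = [0.5, -1.0, 2.0]       # hidden activations (assumed values)
W = [[0.1, 0.2, -0.3],
     [0.4, -0.5, 0.6]]     # 2 classes x 3 hidden units (assumed values)
gold = 0                   # i*, index of the gold label

def loss(W):
    # L(x, i*) = (Wz) . e_{i*} - log sum_j exp((Wz)_j)
    scores = [sum(W[i][j] * z[j] for j in range(len(z))) for i in range(len(W))]
    return scores[gold] - math.log(sum(math.exp(s) for s in scores))

scores = [sum(W[i][j] * z[j] for j in range(len(z))) for i in range(len(W))]
total = sum(math.exp(s) for s in scores)
P = [math.exp(s) / total for s in scores]

eps = 1e-6
for i in range(len(W)):
    for j in range(len(z)):
        # analytic: (1[i == i*] - P(y=i|x)) * z_j, covering both cases above
        analytic = ((1.0 if i == gold else 0.0) - P[i]) * z[j]
        W[i][j] += eps
        up = loss(W)
        W[i][j] -= 2 * eps
        down = loss(W)
        W[i][j] += eps  # restore
        numeric = (up - down) / (2 * eps)
        assert abs(analytic - numeric) < 1e-5
```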
Neural Networks for Classification
P(y|x) = softmax(Wg(Vf(x)))
‣ Pipeline: f(x) → V → g → z → W → softmax → P(y|x)
‣ ∂L/∂W is computed from the hidden activations z
Computing Gradients: Backpropagation
z = g(Vf(x)): activations at the hidden layer
‣ Gradient with respect to V: apply the chain rule
L(x, i*) = Wz · e_{i*} - log Σ_j exp(Wz) · e_j
err(root) = e_{i*} - P(y|x)   (dim = m, the number of classes)
∂L(x, i*)/∂z = err(z) = W^T err(root)   (dim = d)
∂L(x, i*)/∂V_ij = (∂L(x, i*)/∂z) (∂z/∂V_ij)   [some math…]
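The error-signal identity can be checked the same way. A sketch with assumed values, comparing err(z) = W^T err(root) against finite differences of the loss with respect to z:

```python
import math

W = [[0.1, 0.2, -0.3],
     [0.4, -0.5, 0.6]]    # 2 classes x 3 hidden units (assumed values)
z = [0.5, -1.0, 2.0]      # hidden activations (assumed values)
gold = 1                  # i*, index of the gold label

def loss(z):
    scores = [sum(W[i][j] * z[j] for j in range(len(z))) for i in range(len(W))]
    return scores[gold] - math.log(sum(math.exp(s) for s in scores))

scores = [sum(W[i][j] * z[j] for j in range(len(z))) for i in range(len(W))]
total = sum(math.exp(s) for s in scores)
# err(root) = e_{i*} - P(y|x)
err_root = [(1.0 if i == gold else 0.0) - math.exp(scores[i]) / total
            for i in range(len(W))]
# err(z) = W^T err(root)
err_z = [sum(W[i][j] * err_root[i] for i in range(len(W))) for j in range(len(z))]

eps = 1e-6
for j in range(len(z)):
    z[j] += eps
    up = loss(z)
    z[j] -= 2 * eps
    down = loss(z)
    z[j] += eps  # restore
    numeric = (up - down) / (2 * eps)
    assert abs(err_z[j] - numeric) < 1e-5
```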
Backpropagation: Picture
P(y|x) = softmax(Wg(Vf(x)))
‣ err(root) flows back through W, giving ∂L/∂W and err(z)
‣ Can forget everything after z, treat it as the output, and keep backpropping
Backpropagation: Takeaways
‣ Gradients of output weights W are easy to compute: it looks like logistic regression with the hidden layer z as the feature vector
‣ Can compute the derivative of the loss with respect to z to form an "error signal" for backpropagation
‣ Easy to update parameters based on the "error signal" from the next layer; keep pushing the error signal back as backpropagation
‣ Need to remember the values from the forward computation
Applications
NLP with Feedforward Networks
Botha et al. (2017)
‣ Part-of-speech tagging with FFNNs
‣ Example: "Fed raises interest rates in order to…"
‣ f(x) is built from word embeddings: emb(raises) for the previous word, emb(interest) for the current word, emb(rates) for the next word, plus other words, feats, etc.
‣ Word embeddings for each word form the input
‣ ~1000 features here: a smaller feature vector than in sparse models, but every feature fires on every example
‣ The weight matrix learns position-dependent processing of the words
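A sketch of how such an input vector might be assembled (the word list and two-dimensional embeddings here are invented toys, not Botha et al.'s actual features):

```python
# Hypothetical toy embeddings for the words in the example sentence.
emb = {
    "raises":   [0.1, 0.3],
    "interest": [0.7, -0.2],
    "rates":    [0.0, 0.5],
}

def window_features(prev_word, curr_word, next_word):
    # Concatenation (not averaging) keeps the three positions distinct,
    # so the weight matrix can learn position-dependent processing.
    return emb[prev_word] + emb[curr_word] + emb[next_word]

f_x = window_features("raises", "interest", "rates")
assert len(f_x) == 3 * 2  # three positions, each a 2-dim embedding
```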
NLP with Feedforward Networks
‣ Hidden layer mixes these different signals and learns feature conjunctions
Botha et al. (2017)
NLP with Feedforward Networks
‣ Multilingual tagging results: Botha et al. (2017)
‣ Gillick used LSTMs; this is smaller, faster, and better
Sentiment Analysis
‣ Deep Averaging Networks: feedforward neural network on the average of word embeddings from the input
Iyyer et al. (2015)
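A toy sketch of the averaging idea (the embeddings and hidden-layer weights are invented, not Iyyer et al.'s model): average the word vectors, then feed the average through a feedforward layer.

```python
import math

# Hypothetical toy embeddings, two dimensions per word.
emb = {
    "the": [0.1, 0.0], "movie": [0.2, 0.4],
    "was": [0.0, 0.1], "good": [0.9, -0.3],
}

def average_embedding(words):
    # Average the word vectors dimension by dimension.
    vecs = [emb[w] for w in words]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(len(vecs[0]))]

avg = average_embedding("the movie was good".split())

# One hidden layer on top of the average: z = tanh(V * avg + b), toy weights.
V = [[1.0, -1.0], [0.5, 0.5]]
b = [0.0, 0.1]
z = [math.tanh(sum(V[i][d] * avg[d] for d in range(2)) + b[i]) for i in range(2)]
```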
Sentiment Analysis
[Results table: bag-of-words baselines vs. tree RNNs / CNNs / LSTMs]
Wang and Manning (2012), Kim (2014), Iyyer et al. (2015)
Coreference Resolution
‣ Feedforward networks identify coreference arcs
‣ Example: "President Obama signed…" … "He later gave a speech…": does He refer to President Obama?
Clark and Manning (2015), Wiseman et al. (2015)
Implementation Details
Computation Graphs
‣ Computing gradients is hard!
‣ Automatic differentiation: instrument code to keep track of derivatives
y = x * x  →(codegen)→  (y, dy) = (x * x, 2 * x * dx)
‣ Computation is now something we need to reason about symbolically
‣ Use a library like PyTorch or TensorFlow. This class: PyTorch
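The "instrument code to keep track of derivatives" idea can be sketched in a few lines with forward-mode dual numbers, mirroring the (y, dy) = (x * x, 2 * x * dx) codegen example above:

```python
class Dual:
    """Carries a (value, derivative) pair through the computation."""
    def __init__(self, val, dval):
        self.val, self.dval = val, dval

    def __mul__(self, other):
        # product rule: d(u*v) = u*dv + du*v
        return Dual(self.val * other.val,
                    self.val * other.dval + self.dval * other.val)

x = Dual(3.0, 1.0)  # seed dx = 1 to get dy/dx
y = x * x           # (y, dy) = (x * x, 2 * x * dx)
assert y.val == 9.0
assert y.dval == 6.0  # dy/dx = 2x = 6 at x = 3
```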
Computation Graphs in PyTorch
P(y|x) = softmax(Wg(Vf(x)))
‣ Define the forward pass for P(y|x):

class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)
        self.g = nn.Tanh()
        self.W = nn.Linear(hid, out)
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        return self.softmax(self.W(self.g(self.V(x))))
Computation Graphs in PyTorch
P(y|x) = softmax(Wg(Vf(x)))

ffnn = FFNN(inp, hid, out)

def make_update(input, gold_label):
    ffnn.zero_grad()  # clear gradient variables
    probs = ffnn.forward(input)
    loss = torch.neg(torch.log(probs)).dot(gold_label)
    loss.backward()
    optimizer.step()

‣ gold_label is e_{i*}: a one-hot vector of the label (e.g., [0, 1, 0])
Training a Model
Define a computation graph
For each epoch:
    For each batch of data:
        Compute loss on batch
        Autograd to compute gradients and take step
Decode test set
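The loop above, sketched in pure Python for a toy linear-softmax model with hand-coded gradients and invented data (a real implementation would use PyTorch autograd and an optimizer):

```python
import math

# Toy, made-up dataset of (features, gold label) pairs; "batches" of size 1.
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 1)]
W = [[0.0, 0.0], [0.0, 0.0]]  # 2 classes x 2 features
lr = 0.5

def probs_for(x):
    scores = [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]
    total = sum(math.exp(s) for s in scores)
    return [math.exp(s) / total for s in scores]

for epoch in range(50):                  # for each epoch
    for x, gold in data:                 # for each batch of data
        P = probs_for(x)                 # compute loss (via probabilities)
        for i in range(len(W)):          # gradient step; autograd would
            for j in range(len(x)):      # compute these derivatives for us
                grad = ((1.0 if i == gold else 0.0) - P[i]) * x[j]
                W[i][j] += lr * grad     # ascend the log likelihood

# "decode the test set": here we just check the model fits the training data
predictions = [max(range(2), key=lambda i: probs_for(x)[i]) for x, _ in data]
assert predictions == [g for _, g in data]
```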
Batching
‣ Batching data gives speedups due to more efficient matrix operations
‣ Need to make the computation graph process a batch at the same time
‣ Batch sizes from 1-100 often work well

def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes]
    ...
    probs = ffnn.forward(input)  # [batch_size, num_classes]
    loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)  # sum over batch
    ...
Next Time
‣ More implementation details: practical training techniques
‣ Word representations / word vectors
‣ word2vec, GloVe