CIS 519/419 Applied Machine Learning (Spring '18)
www.seas.upenn.edu/~cis519
[email protected] | http://www.cis.upenn.edu/~danroth/ | 461C, 3401 Walnut
Lecture given by Daniel Khashabi
Slides were created by Dan Roth (for CIS 519/419 at Penn or CS 446 at UIUC), Eric Eaton for CIS 519/419 at Penn, or from other authors who have made their ML slides available.
Functions Can be Made Linear
§ Data are not linearly separable in one dimension
§ Not separable if you insist on using a specific class of functions

[Figure: one-dimensional data along the x axis that is not linearly separable]
Blown-Up Feature Space
§ Data are separable in the $\langle x, x^2 \rangle$ space

[Figure: the same data plotted against $x$ and $x^2$, where a linear separator exists]
Multi-Layer Neural Network
§ Multi-layer networks were designed to overcome the computational (expressivity) limitation of a single threshold element.
§ The idea is to stack several layers of threshold elements, each layer using the output of the previous layer as input.
§ Multi-layer networks can represent arbitrary functions, but building effective learning methods for such networks was [thought to be] difficult.

[Figure: a network with Input, Hidden, and Output layers; activation flows from input to output]
Basic Units
§ Linear Unit: multiple layers of linear functions $o_j = w \cdot x$ produce linear functions. We want to represent nonlinear functions.
  § Need to do it in a way that facilitates learning.
§ Threshold units: $o_j = \text{sgn}(w \cdot x)$ are not differentiable, hence unsuitable for gradient descent.
§ The key idea was to notice that the discontinuity of the threshold element can be represented by a smooth non-linear approximation: $o_j = [1 + \exp\{-w \cdot x\}]^{-1}$
§ (Rumelhart, Hinton, Williams, 1986), (Linnainmaa, 1970); see: http://people.idsia.ch/~juergen/who-invented-backpropagation.html

[Figure: Input, Hidden, and Output layers with weights $w^1_{ij}$ and $w^2_{ij}$]
Model Neuron (Logistic)
§ Use a non-linear, differentiable output function such as the sigmoid or logistic function.
§ Net input to a unit is defined as: $net_j = \sum_i w_{ij} x_i$
§ Output of a unit is defined as: $o_j = \frac{1}{1 + e^{-(net_j - T_j)}}$

[Figure: a unit $j$ with inputs $x_1, \dots, x_7$, weights $w_{1j}, \dots, w_{7j}$, a summation node with threshold $T_j$, and output $o_j$]
Neural Networks
§ Neural Networks are functions: $NN: X \rightarrow Y$
  § where $X = [0,1]^n$ or $\{0,1\}^n$, and $Y = [0,1]$ or $\{0,1\}$
§ Robust approach to approximating real-valued, discrete-valued and vector-valued target functions.
§ Among the most effective general-purpose supervised learning methods currently known.
§ Effective especially for complex and hard-to-interpret input data such as real-world sensory data, where a lot of supervision is available.
§ The Backpropagation algorithm for neural networks has been shown successful in many practical problems
  § handwritten character recognition, speech recognition, object recognition, some NLP problems
Neural Networks
§ Neural Networks are functions: $NN: X \rightarrow Y$
  § where $X = [0,1]^n$ or $\{0,1\}^n$, and $Y = [0,1]$ or $\{0,1\}$
§ NN can be used as an approximation of a target classifier
  § In their general form, even with a single hidden layer, NN can approximate any function
  § Algorithms exist that can learn a NN representation from labeled training data (e.g., Backpropagation).
Multi-Layer Neural Networks
§ Multi-layer networks were designed to overcome the computational (expressivity) limitation of a single threshold element.
§ The idea is to stack several layers of threshold elements, each layer using the output of the previous layer as input.

[Figure: Input, Hidden, and Output layers]
Motivation for Neural Networks
§ Inspired by biological systems
  § But don't take this (as well as any other claims in the news on the "emergence" of intelligent behavior) too seriously;
§ We are currently on the rising part of a wave of interest in NN architectures, after a long downtime from the mid-'90s.
  § Better computer architectures (GPUs, parallelism)
  § A lot more data than before; in many domains, supervision is available.
§ The current surge of interest has seen very minimal algorithmic changes
Motivation for Neural Networks
§ Minimal to no algorithmic changes
§ One potentially interesting perspective:
  § Before, we looked at NN only as function approximators.
  § Now, we look at the intermediate representations generated while learning as meaningful.
  § Ideas are being developed on the value of these intermediate representations for transfer learning, etc.
§ We will present in the next two lectures a few of the basic architectures and learning algorithms, and provide some examples of applications
Neural Speed Constraints
§ Neuron "switching time" is on the order of milliseconds, compared to nanoseconds for transistors.
§ However, biological systems can perform significant cognitive tasks (vision, language understanding) in fractions of a second.
§ Even for limited abilities, current AI systems require orders of magnitude more steps.
§ The human brain has approximately $10^{10}$ neurons, each connected to $10^4$ others; it must exploit massive parallelism (but there's more...)
Basic Unit in Multi-Layer Neural Network
§ Linear Unit: $o_j = w \cdot x$; multiple layers of linear functions produce linear functions. We want to represent nonlinear functions.
§ Threshold units: $o_j = \text{sgn}(w \cdot x - T)$ are not differentiable, hence unsuitable for gradient descent.

[Figure: Input, Hidden, and Output layers]
Model Neuron (Logistic)
§ A neuron is modeled by a unit $j$ connected by weighted links $w_{ij}$ to other units $i$.
§ Use a non-linear, differentiable output function such as the sigmoid or logistic function.
§ Net input to a unit is defined as: $net_j = \sum_i w_{ij} x_i$
§ Output of a unit is defined as: $o_j = \frac{1}{1 + \exp\{-(net_j - T_j)\}}$

[Figure: unit $j$ with inputs $x_1, \dots, x_6$, weights $w_{1j}, \dots, w_{6j}$, and output $o_j$]
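As a concrete illustration, here is a minimal sketch of this unit in Python (NumPy); the function name and the example values are ours, not part of the slides:

```python
import numpy as np

def sigmoid_unit(x, w, T):
    """Logistic unit: net_j = sum_i w_ij x_i, o_j = 1 / (1 + exp(-(net_j - T_j)))."""
    net = np.dot(w, x)                 # net input to the unit
    return 1.0 / (1.0 + np.exp(-(net - T)))

x = np.array([0.5, 1.0, 0.0])          # example inputs x_1..x_3
w = np.array([0.2, -0.4, 0.7])         # weights w_1j..w_3j
print(sigmoid_unit(x, w, T=0.1))       # a value in (0, 1)
```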
History: Neural Computation
§ McCulloch and Pitts (1943) showed how linear threshold units can be used to compute logical functions
§ Can build basic logic gates
  § AND: $w_{ij} = T_j / n$
  § OR: $w_{ij} = T_j$
  § NOT: use a negative weight
§ Can build arbitrary logic circuits, finite-state machines and computers given these basic gates.
§ Can specify any Boolean function using a two-layer network (with negation)
  § DNF and CNF are universal representations
(Recall: $net_j = \sum_i w_{ij} x_i$, $\; o_j = \frac{1}{1 + \exp\{-(net_j - T_j)\}}$)
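A quick sketch of the AND/OR weight settings above, using a hard threshold unit that fires when the weighted sum reaches $T$ (the helper name and the test inputs are ours):

```python
import numpy as np

def ltu(x, w, T):
    """Linear threshold unit: fires (1) iff the weighted sum reaches the threshold T."""
    return int(np.dot(w, x) >= T)

T, n = 1.0, 3
for bits in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
    x = np.array(bits)
    and_out = ltu(x, np.full(n, T / n), T)   # AND: w_ij = T_j / n, fires only if all n inputs are on
    or_out  = ltu(x, np.full(n, T), T)       # OR:  w_ij = T_j, fires if any input is on
    print(bits, "AND:", and_out, "OR:", or_out)
```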
Representational Power
§ Any Boolean function can be represented by a two-layer network (simulate a two-layer AND-OR network).
§ Any bounded continuous function can be approximated with arbitrarily small error by a two-layer network.
§ Sigmoid functions provide a set of basis functions from which arbitrary functions can be composed.
§ Any function can be approximated to arbitrary accuracy by a three-layer network.
Quiz Time!
§ Given a neural network, how can we make predictions?
  § Given the input, calculate the output of each layer (starting from the first layer), until you get to the output.
§ What is required to fully specify a neural network?
  § The weights.
§ Why can NN predictions be quick?
  § Because many of the computations can be parallelized.
§ What makes a neural network a non-linear approximator?
  § The non-linear units.
Training a Neural Net
Widrow-Hoff Rule
§ This incremental update rule provides an approximation to the goal:
  § Find the best linear approximation of the data
$$Err(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
§ where:
  § $o_d = \sum_i w_{ij} x_i = \vec{w} \cdot \vec{x}$ is the output of the linear unit on example $d$
  § $t_d$ is the target output for example $d$
History: Learning Rules
§ Hebb (1949) suggested that if two units are both active (firing) then the weights between them should increase:
  $w_{ij} = w_{ij} + R \, o_i o_j$
  § $R$ is a constant called the learning rate
  § Supported by physiological evidence
§ Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule.
  § assumes binary output units; a single linear threshold unit
  § Led to the Perceptron Algorithm
§ See: http://people.idsia.ch/~juergen/who-invented-backpropagation.html
Perceptron Learning Rule
§ Given:
  § the target output for the output unit is $t_j$
  § the input the neuron sees is $x_i$
  § the output it produces is $o_j$
§ Update weights according to $w_{ij} \leftarrow w_{ij} + R (t_j - o_j) x_i$
  § If the output is correct, don't change the weights
  § If the output is wrong, change the weights for all inputs which are 1:
    § If the output is low (0, needs to be 1), increment the weights
    § If the output is high (1, needs to be 0), decrement the weights

[Figure: threshold unit with inputs $x_1, \dots, x_6$, weights $w_{1j}, \dots, w_{6j}$, threshold $T_j$, and output $o_j$]
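A minimal sketch of one such update for a binary threshold unit (the function name and the example are ours; the threshold is assumed fixed):

```python
import numpy as np

def perceptron_update(w, x, t, T=0.0, R=1.0):
    """One application of the rule w_ij <- w_ij + R (t_j - o_j) x_i."""
    o = int(np.dot(w, x) >= T)   # current binary output o_j
    return w + R * (t - o) * x   # no change when o == t

w = np.zeros(3)
w = perceptron_update(w, x=np.array([1.0, 0.0, 1.0]), t=1)  # output was low: weights of active inputs go up
print(w)   # [1. 0. 1.]
```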
Gradient Descent
§ We use gradient descent to determine the weight vector that minimizes $Err(\vec{w})$;
§ Fixing the set $D$ of examples, $Err$ is a function of $\vec{w}$.
§ At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface.

[Figure: the error curve $Err(w)$ over $w$, with successive iterates $w_3, w_2, w_1, w_0$ descending toward the minimum]
Summary: Single Layer Network
§ Variety of update rules
  § Multiplicative
  § Additive
§ Batch and incremental algorithms
§ Various convergence and efficiency conditions
§ There are other ways to learn linear functions
  § Linear Programming (general purpose)
  § Probabilistic Classifiers (some assumptions)
§ Key algorithms are driven by gradient descent
General Stochastic Gradient Algorithms
$$w_{t+1} = w_t - r_t \, \nabla_w Q(x_t, y_t, w_t) = w_t - r_t \, g_t$$
The loss $Q$ is a function of $x$, $y$ and $w$; $r_t$ is the learning rate and $g_t$ the gradient.

LMS: $Q((x,y), w) = \frac{1}{2}(y - w^T x)^2$ leads to the update rule (also called Widrow's Adaline):
$$w_{t+1} = w_t + r \, (y_t - w_t^T x_t) \, x_t$$
Here, even though we make binary predictions based on $\text{sgn}(w^T x)$, we do not take the sign of the dot product into account in the loss.

Another common loss function is the hinge loss: $Q((x,y), w) = \max(0, 1 - y \, w^T x)$. This leads to the perceptron update rule:
§ If $y_i \, w^T x_i > 1$ (no mistake, by a margin): no update
§ Otherwise (mistake, relative to the margin): $w_{t+1} = w_t + r \, y_t x_t$; here $g = -yx$.
Good to think about the case of Boolean examples.
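The two updates side by side, as a small sketch (function and variable names are ours):

```python
import numpy as np

def sgd_step(w, x, y, r, loss="lms"):
    """One stochastic step w_{t+1} = w_t - r * grad Q(x, y, w).

    lms:   Q = 1/2 (y - w.x)^2   -> w += r (y - w.x) x   (Widrow's Adaline)
    hinge: Q = max(0, 1 - y w.x) -> w += r y x, only when y w.x <= 1
    """
    if loss == "lms":
        return w + r * (y - np.dot(w, x)) * x
    if loss == "hinge" and y * np.dot(w, x) <= 1:  # mistake, relative to the margin
        return w + r * y * x
    return w                                       # no mistake by a margin: no update

w = np.zeros(2)
for x, y in [(np.array([1.0, 2.0]), 1), (np.array([2.0, 0.5]), -1)]:
    w = sgd_step(w, x, y, r=0.1, loss="hinge")
print(w)
```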
Summary: Single Layer Network
§ Variety of update rules
  § Multiplicative
  § Additive
§ Batch and incremental algorithms
§ Various convergence and efficiency conditions
§ There are other ways to learn linear functions
  § Linear Programming (general purpose)
  § Probabilistic Classifiers (some assumptions)
§ Key algorithms are driven by gradient descent
§ However, the representational restriction is limiting in many applications
Backpropagation Learning Rule
§ Since there could be multiple output units, we define the error as the sum over all the network output units:
$$Err(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in K} (t_{kd} - o_{kd})^2$$
  § where $D$ is the set of training examples,
  § and $K$ is the set of output units
§ This is used to derive the (global) learning rule which performs gradient descent in the weight space in an attempt to minimize the error function:
$$\Delta w_{ij} = -R \, \frac{\partial E}{\partial w_{ij}}$$

[Figure: a network with output units $o_1, \dots, o_k$ compared against a target vector such as (1, 0, 1, 0, 0)]
Learning with a Multi-Layer Perceptron
§ It's easy to learn the top layer – it's just a linear unit.
§ Given feedback (truth) at the top layer, and the activation at the layer below it, you can use the Perceptron update rule (more generally, gradient descent) to update these weights.
§ The problem is what to do with the other set of weights – we do not get feedback in the intermediate layer(s).

[Figure: Input, Hidden, and Output layers with weights $w^1_{ij}$ and $w^2_{ij}$]
Learning with a Multi-Layer Perceptron
§ The problem is what to do with the other set of weights – we do not get feedback in the intermediate layer(s).
§ Solution: If all the activation functions are differentiable, then the output of the network is also a differentiable function of the input and weights in the network.
§ Define an error function (e.g., sum of squares) that is a differentiable function of the output; this error function is then also a differentiable function of the weights.
§ We can then evaluate the derivatives of the error with respect to the weights, and use these derivatives to find weight values that minimize this error function, using gradient descent (or other optimization methods).
§ This results in an algorithm called back-propagation.

[Figure: Input, Hidden, and Output layers with weights $w^1_{ij}$ and $w^2_{ij}$]
Some facts from real analysis
First, let's get the notation right:
§ An arrow shows functional dependence of $z$ on $y$: given $y$, we can calculate $z$; for example, $z(y) = 2y^2$.
§ $\frac{dz}{dy}$ denotes the derivative of $z$ with respect to $y$.
Some facts from real analysis
§ Simple chain rule
  § If $z$ is a function of $y$, and $y$ is a function of $x$,
  § then $z$ is a function of $x$ as well.
  § Question: how to find $\frac{dz}{dx}$? Answer: $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$
We will use these facts to derive the details of the Backpropagation algorithm:
§ $z$ will be the error (loss) function. We need to know how to differentiate $z$.
§ Intermediate nodes use a logistic function (or another differentiable step function). We need to know how to differentiate it.
Some facts from real analysis
§ Multiple path chain rule: if $z$ depends on $x$ through two intermediate variables $y_1$ and $y_2$, then
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y_1}\frac{\partial y_1}{\partial x} + \frac{\partial z}{\partial y_2}\frac{\partial y_2}{\partial x}$$
(Slide credit: Richard Socher)
Some facts from real analysis
§ Multiple path chain rule, general case: if $z$ depends on $x$ through intermediate variables $y_1, \dots, y_n$, then
$$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\frac{\partial y_i}{\partial x}$$
(Slide credit: Richard Socher)
Key Intuitions Required for BP
§ Gradient Descent
  § Change the weights in the direction of the gradient to minimize the error function.
§ Chain Rule
  § Use the chain rule to calculate the gradients $\frac{\partial E}{\partial w_{ij}}$ of the intermediate weights.
§ Dynamic Programming (Memoization)
  § Memoize the intermediate gradient computations to make the updates faster.
  § This is the "back" part of "backpropagation".

[Figure: a network from input to output; the gradient flows backward]
Backpropagation: the big picture
§ Loop over instances:
  1. The forward step
    § Given the input, make predictions layer-by-layer, starting from the first layer.
  2. The backward step
    § Calculate the error in the output.
    § Update the weights layer-by-layer, starting from the final layer.
Quiz time!
§ What is the purpose of the forward step?
  § To make predictions, given an input.
§ What is the purpose of the backward step?
  § To update the weights, given an output error.
§ Why do we use the chain rule?
  § To calculate the gradient in the intermediate layers.
§ Why can backpropagation be efficient?
  § Because it can be parallelized.
Deriving the update rules
Reminder: Model Neuron (Logistic)
§ A neuron is modeled by a unit $j$ connected by weighted links $w_{ij}$ to other units $i$.
§ Use a non-linear, differentiable output function such as the sigmoid or logistic function.
§ Net input to a unit is defined as: $net_j = \sum_i w_{ij} x_i$
§ Output of a unit is defined as: $o_j = \frac{1}{1 + \exp\{-(net_j - T_j)\}}$
§ The parameters so far? The set of connection weights $w_{ij}$ and the threshold value $T_j$.

[Figure: unit $j$ with inputs $x_1, \dots, x_6$, weights $w_{1j}, \dots, w_{6j}$, and output $o_j$]
Derivatives
§ Function 1 (error): $E = \frac{1}{2} \sum_{k \in K} (t_k - o_k)^2$
  § $\frac{\partial E}{\partial o_i} = -(t_i - o_i)$
§ Function 2 (linear gate): $net_j = \sum_i w_{ij} x_i$
  § $\frac{\partial net_j}{\partial w_{ij}} = x_i$
§ Function 3 (differentiable step function): $o_i = \frac{1}{1 + \exp\{-(net_i - T)\}}$
  § $\frac{\partial o_i}{\partial net_i} = \frac{\exp\{-(net_i - T)\}}{(1 + \exp\{-(net_i - T)\})^2} = o_i (1 - o_i)$

[Figure: output units $o_1, \dots, o_k$; unit $j$ receives input from unit $i$ through weight $w_{ij}$]
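The identity for Function 3 is easy to sanity-check numerically; a tiny sketch (the sample point is ours):

```python
import numpy as np

o = lambda net, T=0.0: 1.0 / (1.0 + np.exp(-(net - T)))   # Function 3
eps, net = 1e-6, 0.3
numeric  = (o(net + eps) - o(net - eps)) / (2 * eps)      # finite-difference slope
analytic = o(net) * (1 - o(net))                          # o (1 - o)
print(abs(numeric - analytic) < 1e-8)                     # True
```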
Derivation of Learning Rule
§ The weights are updated incrementally; the error is computed for each example and the weight update is then derived:
$$E_d(\vec{w}) = \frac{1}{2} \sum_{k \in K} (t_k - o_k)^2$$
§ $w_{ij}$ influences the output only through $net_j$
§ Therefore:
$$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ij}}$$
(with $o_j = \frac{1}{1 + \exp\{-(net_j - T_j)\}}$ and $net_j = \sum_i w_{ij} x_i$)
Derivation of Learning Rule (2)
§ Weight updates of output units:
§ $w_{ij}$ influences the output only through $net_j$
§ Therefore:
$$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ij}} = -(t_j - o_j) \, o_j (1 - o_j) \, x_i$$
using $E_d(\vec{w}) = \frac{1}{2} \sum_{k \in K} (t_k - o_k)^2$, $\; net_j = \sum_i w_{ij} x_i$, $\; o_j = \frac{1}{1 + \exp\{-(net_j - T_j)\}}$, and $\frac{\partial o_j}{\partial net_j} = o_j (1 - o_j)$.
Derivation of Learning Rule (3)
§ Weights of output units:
§ $w_{ij}$ is changed by:
$$\Delta w_{ij} = R \, (t_j - o_j) \, o_j (1 - o_j) \, x_i = R \, \delta_j x_i$$
where we defined:
$$\delta_j = -\frac{\partial E_d}{\partial net_j} = (t_j - o_j) \, o_j (1 - o_j)$$
Derivation of Learning Rule (4)
§ Weights of hidden units:
§ $w_{ij}$ influences the output only through all the units whose direct input includes $j$:
$$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ij}} = \frac{\partial E_d}{\partial net_j} \, x_i = \sum_{k \in \text{downstream}(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j} \, x_i = \sum_{k \in \text{downstream}(j)} -\delta_k \, \frac{\partial net_k}{\partial net_j} \, x_i$$
(with $net_j = \sum_i w_{ij} x_i$)

[Figure: hidden unit $j$, fed by unit $i$ through $w_{ij}$, feeding downstream units $k$ with outputs $o_k$]
Derivation of Learning Rule (5)
§ Weights of hidden units:
§ $w_{ij}$ influences the output only through all the units whose direct input includes $j$:
$$\frac{\partial E_d}{\partial w_{ij}} = \sum_{k \in \text{downstream}(j)} -\delta_k \, \frac{\partial net_k}{\partial net_j} \, x_i = \sum_{k \in \text{downstream}(j)} -\delta_k \, \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j} \, x_i = \sum_{k \in \text{downstream}(j)} -\delta_k \, w_{jk} \, o_j (1 - o_j) \, x_i$$

[Figure: hidden unit $j$, fed by unit $i$ through $w_{ij}$, feeding downstream units $k$ with outputs $o_k$]
Derivation of Learning Rule (6)
§ Weights of hidden units:
§ $w_{ij}$ is changed by:
$$\Delta w_{ij} = R \, o_j (1 - o_j) \Big( \sum_{k \in \text{downstream}(j)} \delta_k \, w_{jk} \Big) \, x_i = R \, \delta_j x_i$$
§ where
$$\delta_j = o_j (1 - o_j) \sum_{k \in \text{downstream}(j)} \delta_k \, w_{jk}$$
(the minus signs from the previous slide cancel against the $-R$ in $\Delta w_{ij} = -R \, \partial E_d / \partial w_{ij}$)
§ First determine the error for the output units.
§ Then, back-propagate this error layer by layer through the network, changing the weights appropriately in each layer.
The Backpropagation Algorithm
§ Create a fully connected three-layer network. Initialize the weights.
§ Until all examples produce the correct output within $\epsilon$ (or some other criterion), for each example in the training set do:
  1. Compute the network output for this example.
  2. Compute the error between the output and the target value: for each output unit $k$, compute its error term
     $$\delta_k = (t_k - o_k) \, o_k (1 - o_k)$$
  3. For each hidden unit $j$, compute its error term:
     $$\delta_j = o_j (1 - o_j) \sum_{k \in \text{downstream}(j)} \delta_k \, w_{jk}$$
  4. Update the network weights with $\Delta w_{ij} = R \, \delta_j x_i$
End epoch
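Putting the algorithm together, here is a minimal self-contained sketch that trains a three-layer sigmoid network on XOR with the per-example updates above. It folds the thresholds $T_j$ into the weights via a constant bias input; all names and hyperparameters are ours, and with an unlucky initialization it may need a different seed or more epochs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid, n_out, R = 2, 3, 1, 0.5
W1 = rng.normal(scale=0.5, size=(n_hid, n_in + 1))   # hidden weights (+ bias column)
W2 = rng.normal(scale=0.5, size=(n_out, n_hid + 1))  # output weights (+ bias column)

# XOR as a tiny training set
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

for epoch in range(20000):
    for x, t in zip(X, T):
        # 1. forward step: compute the network output for this example
        x1 = np.append(x, 1.0)            # input plus bias
        h = sigmoid(W1 @ x1)
        h1 = np.append(h, 1.0)            # hidden plus bias
        o = sigmoid(W2 @ h1)
        # 2. error terms of output units: delta_k = (t_k - o_k) o_k (1 - o_k)
        delta_o = (t - o) * o * (1 - o)
        # 3. error terms of hidden units: delta_j = o_j (1 - o_j) sum_k delta_k w_jk
        delta_h = h * (1 - h) * (W2[:, :-1].T @ delta_o)
        # 4. weight updates: Delta w_ij = R * delta_j * x_i
        W2 += R * np.outer(delta_o, h1)
        W1 += R * np.outer(delta_h, x1)

for x in X:
    h1 = np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0)
    print(x, sigmoid(W2 @ h1))            # close to [0, 1, 1, 0]
```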
More Hidden Layers
§ The same algorithm holds for more hidden layers.

[Figure: a deeper network from input to output]
Demo time!
§ Link: https://playground.tensorflow.org/
Comments on Training
§ No guarantee of convergence; training may oscillate or reach a local minimum.
§ In practice, many large networks can be trained on large amounts of data for realistic problems.
§ Many epochs (tens of thousands) may be needed for adequate training. Large data sets may require many hours of CPU time.
§ Termination criteria: number of epochs; threshold on training set error; no decrease in error; increased error on a validation set.
§ To avoid local minima: several trials with different random initial weights, with majority or voting techniques.
Over-training Prevention
§ Running too many epochs may over-train the network and result in over-fitting (improved result on training, decreased performance on the test set).
§ Keep a hold-out validation set and test accuracy after every epoch.
§ Maintain the weights of the best-performing network on the validation set and return it when performance decreases significantly beyond that.
§ To avoid losing training data to validation:
  § Use 10-fold cross-validation to determine the average number of epochs that optimizes validation performance.
  § Train on the full data set using this many epochs to produce the final results.
Over-fitting Prevention
§ Too few hidden units prevent the system from adequately fitting the data and learning the concept.
§ Using too many hidden units leads to over-fitting.
§ A similar cross-validation method can be used to determine an appropriate number of hidden units. (general)
§ Another approach to preventing over-fitting is weight decay: all weights are multiplied by some fraction in (0,1) after every epoch.
  § Encourages smaller weights and a less complex hypothesis.
  § Equivalently: change the error function to include a term for the sum of the squares of the weights in the network. (general)
Dropout Training
§ Proposed by (Hinton et al., 2012)
§ Each time, decide whether to delete a hidden unit with some probability $p$

[Figure: a network with some hidden units dropped]
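A minimal sketch of such a dropout mask (the test-time scaling by $1-p$ is one common convention from the dropout literature; names and values are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    """Drop each hidden unit with probability p during training.

    At test time all units are kept; scaling by (1 - p) keeps the
    expected activation the same as during training.
    """
    if train:
        mask = rng.random(h.shape) >= p   # 1 = keep, 0 = drop
        return h * mask
    return h * (1 - p)

h = np.array([0.2, 0.9, 0.4, 0.7])
print(dropout(h))                 # roughly half the units zeroed out
print(dropout(h, train=False))
```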
Dropout Training
§ Dropout of 50% of the hidden units and 20% of the input units (Hinton et al., 2012)

[Figure: dropped units in the input and hidden layers]
Dropout Training
§ Model averaging effect
  § Among $2^H$ models, with shared parameters
    § $H$: the number of units in the network
  § Only a few get trained
  § Much stronger than the known regularizers
§ What about the input space?
  § Do the same thing!
Input-Output Coding
§ Appropriate coding of inputs and outputs can make the learning problem easier and improve generalization.
§ Encode each binary feature as a separate input unit.
§ For multi-valued features, include one binary unit per value rather than trying to encode the input information in fewer units.
  § It is very common today to use a distributed representation of the input – real-valued, dense representations.
§ For a disjoint categorization problem, it is best to have one output unit per category rather than encoding N categories into log N bits.

One way to do it, if you start with a collection of sparsely represented examples, is to use dimensionality reduction methods:
§ Your m examples are represented as an m × 10⁶ matrix
§ Multiply it by a random matrix of size 10⁶ × 300, say.
§ Random matrix: Normal(0, 1)
§ New representation: m × 300 dense rows
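A sketch of that random projection (the sparse dimension is shrunk from the slide's $10^6$ to keep the example's memory footprint small; everything else follows the recipe above):

```python
import numpy as np

rng = np.random.default_rng(0)

m, sparse_dim, dense_dim = 100, 10**4, 300   # slide uses 10^6; shrunk here for memory
# m sparse examples as an m x sparse_dim matrix (a few random bits on per row)
X = np.zeros((m, sparse_dim))
for row in X:
    row[rng.integers(0, sparse_dim, size=20)] = 1.0

P = rng.normal(0, 1, size=(sparse_dim, dense_dim))  # random Normal(0, 1) matrix
X_dense = X @ P                                     # new representation: m x 300 dense rows
print(X_dense.shape)                                # (100, 300)
```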
Hidden Layer Representation
§ The weight-tuning procedure sets weights that define whatever hidden-unit representation is most effective at minimizing the error.
§ Sometimes Backpropagation will define new hidden-layer features that are not explicit in the input representation, but which capture properties of the input instances that are most relevant to learning the target function.
§ Trained hidden units can be seen as newly constructed features that re-represent the examples so that they are linearly separable.
Gradient Checks are Useful!
§ They allow you to know that there are no bugs in your neural network implementation!
§ Implement your gradient.
§ Implement a finite-difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (~$10^{-4}$) and estimating derivatives:
$$f'(\theta) \approx \frac{J(\theta^+) - J(\theta^-)}{2\epsilon}, \qquad \theta^{\pm} = \theta \pm \epsilon$$
§ Compare the two and make sure they are almost the same.
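A minimal sketch of that check against a toy loss whose gradient we know in closed form (names are ours):

```python
import numpy as np

def numeric_grad(J, theta, eps=1e-4):
    """Estimate dJ/dtheta_i as (J(theta + eps e_i) - J(theta - eps e_i)) / (2 eps)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

J = lambda w: 0.5 * np.sum(w ** 2)         # toy loss with known gradient dJ/dw = w
w = np.array([0.3, -1.2, 2.0])
print(np.allclose(numeric_grad(J, w), w))  # True: analytic and numeric match
```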
Auto-associative Network
§ An auto-associative network trained with 8 inputs, 3 hidden units and 8 output nodes, where the output must reproduce the input.
§ When trained with vectors with only one bit on:

  INPUT      HIDDEN
  10000000   .89 .40 .80
  01000000   .97 .99 .71
  ...
  00000001   .01 .11 .88

§ It learned the standard 3-bit encoding for the 8-bit vectors.
§ This also illustrates the data compression aspects of learning.

[Figure: the input vector 10001000 reproduced at the output]
Sparse Auto-encoder
§ Encoding: $y = f(Wx + b)$
§ Decoding: $\hat{x} = g(W'y + b')$
§ Goal: perfect reconstruction of the input vector $x$ by the output $\hat{x}$, where $\theta = \{W, W'\}$
§ Minimize an error function $l(\hat{x}, x)$
  § For example: $l(\hat{x}, x) = \| \hat{x} - x \|^2$
§ And regularize it:
$$\min_{\theta} \sum_{x} l(\hat{x}, x) + \sum_i |w_i|$$
§ After optimization, drop the reconstruction layer and add a new layer.
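A minimal sketch of this objective (choosing the sigmoid for $f$ and $g$, and a $\lambda$ weighting on the sparsity term, are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3                                    # input size, code size
W, b = rng.normal(size=(k, d)), np.zeros(k)    # encoder parameters
W2, b2 = rng.normal(size=(d, k)), np.zeros(d)  # decoder ("reconstruction") parameters
f = g = lambda z: 1.0 / (1.0 + np.exp(-z))     # any differentiable nonlinearity

def loss(x, lam=0.01):
    y = f(W @ x + b)          # encoding:  y = f(Wx + b)
    x_hat = g(W2 @ y + b2)    # decoding:  x_hat = g(W'y + b')
    return np.sum((x_hat - x) ** 2) + lam * np.abs(W).sum()  # reconstruction + L1 sparsity

x = rng.random(d)
print(loss(x))   # minimize this over {W, W', b, b'} with gradient descent
```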
Stacking Auto-encoders
§ Add a new layer, and a reconstruction layer for it.
§ Try to tune its parameters such that it reconstructs its own input (the representation below it) well.
§ Continue this for each layer.
Beyond supervised learning
§ So far, what we had was purely supervised.
  § Initialize parameters randomly.
  § Train in supervised mode, typically using backprop.
  § Used in most practical systems (e.g., speech and image recognition).
§ Unsupervised, layer-wise pre-training + supervised classifier on top
  § Train each layer unsupervised, one after the other.
  § Train a supervised classifier on top, keeping the other layers fixed.
  § Good when very few labeled samples are available.
§ Unsupervised, layer-wise pre-training + supervised fine-tuning
  § Train each layer unsupervised, one after the other.
  § Add a classifier layer, and retrain the whole thing supervised.
  § Good when the label set is poor (e.g., pedestrian detection).
We won't talk about unsupervised pre-training here. But it's good to have this in mind, since it is an active topic of research.
NN-2
Recap: Multi-Layer Perceptrons
§ Multi-layer network
  § A global approximator
  § Different rules for training it
§ The Back-propagation
  § Forward step
  § Backpropagation of errors
§ Congrats! Now you know one of the most important algorithms in neural networks!
§ Today:
  § Convolutional Neural Networks
  § Recurrent Neural Networks

[Figure: Input, Hidden, and Output layers]
Receptive Fields
§ The receptive field of an individual sensory neuron is the particular region of the sensory space (e.g., the body surface, or the retina) in which a stimulus will trigger the firing of that neuron.
  § In the auditory system, receptive fields can correspond to volumes in auditory space.
§ Designing "proper" receptive fields for the input neurons is a significant challenge.
§ Consider a task with image inputs:
  § Receptive fields should give expressive features from the raw input to the system.
  § How would you design the receptive fields for this problem?
§ A fully connected layer:
  § Example:
    § 100 × 100 images
    § 1000 units in the layer
  § Problems:
    § $10^7$ edges!
    § Spatial correlations lost!
    § Variable-sized inputs.
(Slide credit: Marc'Aurelio Ranzato)

[Figure: an input layer fully connected to the layer above]
§ Consider a task with image inputs:
§ A locally connected layer:
  § Example:
    § 100 × 100 images
    § 1000 units in the layer
    § Filter size: 10 × 10
  § Local correlations preserved!
  § Problems:
    § $10^5$ edges
    § This parameterization is good when the input image is registered (e.g., face recognition).
    § Variable-sized inputs, again.
(Slide credit: Marc'Aurelio Ranzato)

[Figure: an input layer locally connected to the layer above]
Convolutional Layer
§ A solution:
  § Filters to capture different patterns in the input space.
  § Share parameters across different locations (assuming the input is stationary).
    § This amounts to convolutions with learned filters.
  § Filters will be learned during training.
  § The issue of variable-sized inputs will be resolved with a pooling layer.
So what is a convolution?
(Slide credit: Marc'Aurelio Ranzato)
Convolution Operator
§ The convolution operator $*$ takes two functions and gives another function.
§ In one dimension:
$$(x * h)(t) = \int x(\tau) \, h(t - \tau) \, d\tau$$
$$(x * h)[n] = \sum_m x[m] \, h[n - m]$$
§ "Convolution" is very similar to "cross-correlation", except that in convolution one of the functions is flipped.
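A quick sketch of the discrete formula; `np.convolve` flips the second sequence and slides it, while `np.correlate` does not (the example values are ours):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, 0.0, -1.0])

# (x * h)[n] = sum_m x[m] h[n - m]: np.convolve flips h and slides it over x
print(np.convolve(x, h))                  # [ 1.  2.  2.  2. -3. -4.]
# cross-correlation is the same sliding without the flip:
print(np.correlate(x, h, mode="full"))    # [-1. -2. -2. -2.  3.  4.]
```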
Convolution Operator (2)
§ Convolution in two dimensions:
  § The same idea: flip one matrix and slide it over the other matrix.
  § Example: the sharpen kernel.
Try other kernels: http://setosa.io/ev/image-kernels/
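For instance, applying the sharpen kernel with SciPy (the toy 5 × 5 image is ours):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.arange(25, dtype=float).reshape(5, 5)
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

# convolve2d flips the kernel and slides it over the image
print(convolve2d(image, sharpen, mode="same"))
```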
Convolution Operator (3)
§ Convolution in two dimensions:
  § The same idea: flip one matrix and slide it over the other matrix.

[Figure: a kernel sliding over an input matrix to produce the output]
Complexity of Convolution
§ The complexity of the convolution operator is $O(n \log n)$ for $n$ inputs.
  § It uses the Fast Fourier Transform (FFT).
§ In two dimensions, each convolution takes $O(MN \log MN)$ time, where the size of the input is $M \times N$.
(Slide credit: Marc'Aurelio Ranzato)
Convolutional Layer
§ The convolution of the input (vector/matrix) with weights (vector/matrix) results in a response vector/matrix.
§ We can have multiple filters in each convolutional layer, each producing an output.
§ If it is an intermediate layer, it can have multiple inputs!
§ One can add a nonlinearity at the output of the convolutional layer.

[Figure: a convolutional layer applying several filters to one input]
Pooling Layer
§ How to handle variable-sized inputs?
  § A layer which reduces inputs of different sizes to a fixed size.
  § Pooling.
(Slide credit: Marc'Aurelio Ranzato)
Pooling Layer
§ How to handle variable-sized inputs?
  § A layer which reduces inputs of different sizes to a fixed size.
  § Pooling.
§ Different variations (where $N(n)$ is the pooling region):
  § Max pooling: $h[n] = \max_{i \in N(n)} \tilde{h}[i]$
  § Average pooling: $h[n] = \frac{1}{n} \sum_{i \in N(n)} \tilde{h}[i]$
  § L2 pooling: $h[n] = \sqrt{\frac{1}{n} \sum_{i \in N(n)} \tilde{h}^2[i]}$
  § etc.
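A sketch of these variants reducing inputs of different lengths to the same fixed size (splitting into equal regions is our choice of neighborhood):

```python
import numpy as np

def pool(h, out_size, kind="max"):
    """Reduce a variable-length response h to a fixed number of values."""
    chunks = np.array_split(h, out_size)        # pooling regions N(n)
    if kind == "max":
        return np.array([c.max() for c in chunks])
    if kind == "avg":
        return np.array([c.mean() for c in chunks])
    if kind == "l2":
        return np.array([np.sqrt((c ** 2).mean()) for c in chunks])

for n in (7, 12):                               # inputs of different sizes ...
    h = np.random.default_rng(0).random(n)
    print(pool(h, out_size=3))                  # ... always pooled down to 3 values
```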
Convolutional Nets
§ One-stage structure: Convolution → Pooling
§ Whole system: Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label
Training a ConvNet
§ The same procedure from Back-propagation applies here.
  § Remember: in backprop we started from the error terms in the last stage, and passed them back to the previous layers, one by one.
§ Back-prop for the pooling layer:
  § Consider, for example, the case of "max" pooling.
  § This layer only routes the gradient to the input that has the highest value in the forward pass.
  § Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
§ Therefore the gradient $\delta = \frac{\partial E_d}{\partial y_i}$ at a pooling output is passed, unchanged, to the one input $x_i$ that attained the max, and is zero for the others.

[Figure: pipeline Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label, with $E_d$ flowing backward as $\delta_{\text{pool-stage}} = \partial E_d / \partial y_{\text{pool-stage}}$ and $\delta_{\text{conv-stage}} = \partial E_d / \partial y_{\text{conv-stage}}$]
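A minimal sketch of that routing for non-overlapping 1D windows (the names, and the reshape-based windowing, are ours):

```python
import numpy as np

def maxpool_forward(x, width):
    """Forward pass; remember the argmax index ("switch") of each window."""
    windows = x.reshape(-1, width)
    switches = windows.argmax(axis=1)
    return windows.max(axis=1), switches

def maxpool_backward(grad_y, switches, width):
    """Route each output gradient only to the input that won the max."""
    grad_x = np.zeros(grad_y.size * width)
    grad_x[np.arange(grad_y.size) * width + switches] = grad_y
    return grad_x

x = np.array([1.0, 3.0, 2.0, 0.0, 5.0, 4.0])
y, sw = maxpool_forward(x, width=2)        # y = [3, 2, 5], switches = [1, 0, 0]
print(maxpool_backward(np.array([0.1, 0.2, 0.3]), sw, width=2))
# [0.  0.1 0.2 0.  0.3 0. ]
```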
Training a ConvNet
§ Back-prop for the convolutional layer:
§ We derive the update rules for a 1D convolution, but the idea is the same for bigger dimensions.
§ The convolution:
$$\hat{y} = w * x \iff \hat{y}_i = \sum_{a=0}^{m-1} w_a \, x_{i-a} = \sum_{a=0}^{m-1} w_{i-a} \, x_a \quad \forall i$$
§ A differentiable nonlinearity:
$$y = f(\hat{y}) \iff y_i = f(\hat{y}_i) \quad \forall i$$
§ Updating the filter (now we have everything in this layer to update the filter):
$$\frac{\partial E_d}{\partial w_a} = \sum_{i=0}^{m-1} \frac{\partial E_d}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial w_a} = \sum_{i=0}^{m-1} \frac{\partial E_d}{\partial \hat{y}_i} \, x_{i-a}$$
§ Through the nonlinearity:
$$\frac{\partial E_d}{\partial \hat{y}_i} = \frac{\partial E_d}{\partial y_i} \frac{\partial y_i}{\partial \hat{y}_i} = \frac{\partial E_d}{\partial y_i} \, f'(\hat{y}_i)$$
§ Passing the gradient to the previous layer:
$$\delta = \frac{\partial E_d}{\partial x_a} = \sum_{i=0}^{m-1} \frac{\partial E_d}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial x_a} = \sum_{i=0}^{m-1} \frac{\partial E_d}{\partial \hat{y}_i} \, w_{i-a}$$
§ Now we can repeat this for each stage of the ConvNet.

[Figure: pipeline Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label, with $E_d$ flowing backward through the stages]
Convolutional Nets
§ Feature visualization of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013]

[Figure: visualized features for Stage 1, Stage 2, and Stage 3 of the pipeline Input Image → Stages → Fully Connected Layer → Class Label]
Demo (Teachable Machines)
§ https://teachablemachine.withgoogle.com/
ConvNet Roots
§ Fukushima (1980s) designed a network with the same basic structure but did not train it by backpropagation.
§ The first successful applications of Convolutional Networks were by Yann LeCun in the 1990s (LeNet)
  § It was used to read zip codes, digits, etc.
§ Many variants nowadays, but the core idea is the same
  § Example: a system developed at Google (GoogLeNet)
    § Compute different filters
    § Compose one big vector from all of them
    § Layer this iteratively
See more: http://arxiv.org/pdf/1409.4842v1.pdf
Depth Matters

[Slide from [Kaiming He, 2015]]
Vanishing/Exploding Gradients
§ Vanishing gradients are quite prevalent and a serious issue.
§ A real example:
  § Training a feed-forward network
  § y-axis: sum of the gradient norms
  § Earlier layers have an exponentially smaller sum of gradient norms
  § This will make training the earlier layers much slower.
§ The gradient can become very small or very large quickly, and the locality assumption of gradient descent breaks down (vanishing gradient) [Bengio et al., 1994]
Vanishing/Exploding Gradients
§ In architectures with many layers (e.g., > 10) the gradients can easily explode or vanish.
§ Many methods have been proposed to reduce the effect of vanishing gradients, although it is still a problem
  § Introduce shorter paths between long connections
  § Abandon stochastic gradient descent in favor of a much more sophisticated Hessian-Free (HF) optimization
  § Clip gradients with bigger sizes:
    Define $g = \frac{\partial E}{\partial W}$. If $\|g\| \geq threshold$, then $g \leftarrow \frac{threshold}{\|g\|} \, g$
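The clipping rule in a couple of lines (the example values are ours):

```python
import numpy as np

def clip_gradient(g, threshold):
    """If ||g|| >= threshold, rescale: g <- (threshold / ||g||) * g."""
    norm = np.linalg.norm(g)
    if norm >= threshold:
        return (threshold / norm) * g
    return g

g = np.array([30.0, 40.0])                # ||g|| = 50
print(clip_gradient(g, threshold=5.0))    # rescaled to norm 5: [3. 4.]
```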
Practical Tips
§ Before large-scale experiments, test on a small subset of the data and check that the error goes to zero.
  § A correct implementation should be able to overfit a small training set.
§ Visualize features: feature maps need to be uncorrelated and have high variance.
§ Bad training: many hidden units ignore the input and/or exhibit strong correlations.
(Figure credit: Marc'Aurelio Ranzato)
Debugging
§ Training diverges:
  § Learning rate may be too large → decrease the learning rate
  § BackProp is buggy → use numerical gradient checking
§ Loss is minimized but accuracy is low
  § Check the loss function: Is it appropriate for the task you want to solve? Does it have degenerate solutions?
§ NN is underperforming / under-fitting
  § Compute the number of parameters → if too small, make the network larger
§ NN is too slow
  § Compute the number of parameters → use a distributed framework, use a GPU, make the network smaller
Many of these points apply to many machine learning models, not just neural networks.
CNN for Vector Inputs
§ Let's study another variant of CNN for language
  § Example: sentence classification (say, spam or not spam)
§ First step: represent each word of "This is not a spam" with a vector in $\mathbb{R}^d$, and concatenate the vectors.
§ Now we can assume that the input to the system is a vector in $\mathbb{R}^{dl}$
  § where the input sentence has length $l$ ($l = 5$ in our example)
  § and each word vector has length $d$ ($d = 7$ in our example)
Convolutional Layer on Vectors
§ Think about a single convolutional layer
  § A bunch of vector filters, each defined in $\mathbb{R}^{dh}$
    § where $h$ is the number of words the filter covers
    § and $d$ is the size of the word vector
§ Find its (modified) convolution with the input vector.
§ Result of the convolution with the filter:
$$c_1 = f(w \cdot x_{1:h}), \quad c_2 = f(w \cdot x_{h+1:2h}), \quad c_3 = f(w \cdot x_{2h+1:3h}), \quad c_4 = f(w \cdot x_{3h+1:4h}), \quad \dots$$
$$c = [c_1, \dots, c_{n-h+1}]$$
§ Convolution with a filter that spans 2 words operates on all of the bigrams (vectors of two consecutive words, concatenated): "this is", "is not", "not a", "a spam".
  § Regardless of whether it is grammatical (not appealing linguistically).
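A minimal sketch of one filter sliding a word at a time over the concatenated sentence vector (choosing $\tanh$ for $f$, and random vectors in place of real word embeddings, are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, l, h = 7, 5, 2                     # word-vector size, sentence length, filter width
words = rng.random((l, d))            # one d-dimensional vector per word ("This is not a spam")
x = words.reshape(-1)                 # concatenated sentence vector in R^{dl}
w = rng.random(h * d)                 # one filter in R^{dh}
f = np.tanh                           # any nonlinearity

# c_i = f(w . x_{(i-1)h+1 : ih-ish slice}): the filter moves one word at a time,
# so a width-2 filter sees every bigram: "this is", "is not", "not a", "a spam"
c = np.array([f(w @ x[i * d : (i + h) * d]) for i in range(l - h + 1)])
print(c.shape)                        # (l - h + 1,) = (4,)
```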
Convolutional Layer on Vectors
§ The pipeline so far:
  § Get word vectors for each word of "This is not a spam".
  § Concatenate the vectors.
  § Perform convolution with each filter in a filter bank.
  § This yields a set of response vectors: one per filter (# of filters many), each of length (# words − length of filter + 1).
§ How are we going to handle the variable-sized response vectors? Pooling!
Convolutional Layer on Vectors
§ The same pipeline, now with pooling on the filter responses:
  § Get word vectors, concatenate, convolve with each filter in the filter bank, then pool each response down to a fixed size.
  § Some choices for pooling: k-max, mean, etc.
§ Now we can pass the fixed-size vector to a logistic unit (softmax), or give it to a multi-layer network (last session).
Recurrent Neural Networks
§ Prediction on chain-like input:
  § Example: POS tagging the words of a sentence

    This  is   a   sample  sentence
    DT    VBZ  DT  NN      NN

§ Issues:
  § Structure in the output: there are connections between labels.
  § Interdependence between elements of the inputs: the final decision is based on an intricate interdependence of the words on each other.
  § Variable-sized inputs: e.g., sentences differ in size.
§ How would you go about solving this task?
Recurrent Neural Networks
§ Infinite use of finite structure

[Figure: an unrolled RNN with inputs $X_0, \dots, X_3$, hidden state representations $H_0, \dots, H_3$, and outputs $Y_0, \dots, Y_3$]
Recurrent Neural Networks
§ A chain RNN:
  § Has a chain-like structure
  § Each input is replaced with its vector representation $x_t$
  § The hidden (memory) unit $h_t$ contains information about previous inputs and previous hidden units $h_{t-1}$, $h_{t-2}$, etc.
    § It is computed from the past memory and the current word. It summarizes the sentence up to that time.

[Figure: input layer $x_{t-1}, x_t, x_{t+1}$ feeding memory layer $h_{t-1}, h_t, h_{t+1}$]
Recurrent Neural Networks
§ A popular way of formalizing it:
$$h_t = f(W_h h_{t-1} + W_i x_t)$$
  § where $f$ is a nonlinear, differentiable (why?) function.
§ Outputs?
  § Many options; depending on the problem and the computational resources.

[Figure: input layer $x_{t-1}, x_t, x_{t+1}$ feeding memory layer $h_{t-1}, h_t, h_{t+1}$]
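A minimal sketch of this recurrence (the sizes, the choice of $\tanh$ for $f$, and the random input sequence are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3                                 # input size, hidden (memory) size
W_i = rng.normal(scale=0.5, size=(k, d))    # input-to-hidden weights
W_h = rng.normal(scale=0.5, size=(k, k))    # hidden-to-hidden weights
f = np.tanh                                 # nonlinear, differentiable

h = np.zeros(k)                             # h_0
for x_t in rng.random((6, d)):              # a length-6 input sequence
    h = f(W_h @ h + W_i @ x_t)              # h_t = f(W_h h_{t-1} + W_i x_t)
print(h)                                    # summary of the sequence so far
```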
Recurrent Neural Networks
§ Prediction for $x_t$, with $h_t$:
$$y_t = \text{softmax}(W_o h_t)$$
§ Some inherent issues with RNNs:
  § Recurrent neural nets cannot capture phrases without prefix context.
  § They often capture too much of the last words in the final vector.
§ A slightly more sophisticated solution: Long Short-Term Memory (LSTM) units

[Figure: the same chain, with an output layer $y_{t-1}, y_t, y_{t+1}$ above the memory layer]
Recurrent Neural Networks
§ Multi-layer feed-forward NN: a DAG
  § Just computes a fixed sequence of non-linear learned transformations to convert an input pattern into an output pattern.
§ Recurrent Neural Network: a digraph
  § Has cycles.
  § A cycle can act as a memory;
  § The hidden state of a recurrent net can carry along information about a "potentially" unbounded number of previous inputs.
  § They can model sequential data in a much more natural way.
Equivalence between RNN and Feed-forward NN
§ Assume that there is a time delay of 1 in using each connection.
§ The recurrent net is then just a layered net that keeps reusing the same weights.
(Slide credit: Geoff Hinton)

[Figure: a 3-unit recurrent net with weights $w_1, \dots, w_4$, unrolled into a layered net over time = 0, 1, 2, 3 that reuses the same weights $W_1, W_2, W_3, W_4$ at every step]
Bi-directional RNN
§ One of the issues with RNNs:
  § The hidden variables capture only one-sided context.
§ A bi-directional structure fixes this by running one chain forward and one backward.

[Figure: an RNN next to a bi-directional RNN]
Stack of Bi-directional Networks
§ Use the same idea and make your model more complex:

[Figure: stacked bi-directional recurrent layers]
Training RNNs
§ How to train such a model?
  § Generalize the same ideas from back-propagation.
§ Total output error: $E(\vec{y}, \vec{t}) = \sum_{t=1}^{T} E_t(y_t, t_t)$
§ Parameters? $W_o$, $W_i$, $W_h$, plus the vectors for the input.
(Reminder: $y_t = \text{softmax}(W_o h_t)$ and $h_t = f(W_h h_{t-1} + W_i x_t)$)
$$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$$
$$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_{t-k}} \frac{\partial h_{t-k}}{\partial W}$$
§ This is sometimes called "Backpropagation Through Time", since the gradients are propagated back through time.

[Figure: the unrolled chain with inputs $x_{t-1}, x_t, x_{t+1}$, memory layer $h_{t-1}, h_t, h_{t+1}$, and outputs $y_{t-1}, y_t, y_{t+1}$]
Recurrent Neural Network
(Reminder: $y_t = \text{softmax}(W_o h_t)$ and $h_t = f(W_h h_{t-1} + W_i x_t)$)
$$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_{t-k}} \frac{\partial h_{t-k}}{\partial W}$$
$$\frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=t-k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=t-k+1}^{t} W_h \, \text{diag}\!\left[ f'(W_h h_{j-1} + W_i x_j) \right]$$
$$\frac{\partial h_t}{\partial h_{t-1}} = W_h \, \text{diag}\!\left[ f'(W_h h_{t-1} + W_i x_t) \right], \qquad \text{diag}(a_1, \dots, a_n) = \begin{pmatrix} a_1 & & \\ & \ddots & \\ & & a_n \end{pmatrix}$$

[Figure: the same unrolled chain of inputs, memory units, and outputs]
Unsupervised RNNs
§ What to put here?
  § "He was locked up after he ______."
§ Note that:
  § This is unsupervised; you can use tons of data to train this.
  § While training the model, we train the word representations too.

[Figure: an RNN language model: the context words $x_{t-2}, x_{t-1}, x_t$ feed the memory layer $h_{t-1}, h_t, h_{t+1}$, and the output layer predicts the next word $y$]
Unsupervised RNNs
§ This would result in word representations
  § that convey information about their co-occurrence,
  § or some form of weak "semantic" similarity.
§ A big part of the progress (past 5-10 years) is partly due to discovering better ways to create unsupervised context-sensitive representations.