Machine Learning
Neural Networks: Backpropagation
Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others
This lecture
• What is a neural network?
• Predicting with a neural network
• Training neural networks: Backpropagation
• Practical concerns
Training a neural network
• Given
  – A network architecture (layout of neurons, their connectivity and activations)
  – A dataset of labeled examples: S = {(x_i, y_i)}
• The goal: Learn the weights of the neural network
• Remember: For a fixed architecture, a neural network is a function parameterized by its weights
  – Prediction: $y = NN(x, w)$
Recall: Learning as loss minimization
We have a classifier NN that is completely defined by its weights. Learn the weights by minimizing a loss $L$, perhaps with a regularizer:
$\min_w \sum_i L(NN(x_i, w), y_i)$
So far, we saw that this strategy worked for:
1. Logistic Regression
2. Support Vector Machines
3. Perceptron
4. LMS regression
All of these are linear models; each minimizes a different loss function.
Same idea for non-linear models too!
Back to our running example
[Figure: a two-layer network with inputs $x_1, x_2$, hidden units $z_1, z_2$, and output $y$]
Given an input x, how is the output predicted?
$y = w_{01}^o + w_{11}^o z_1 + w_{21}^o z_2$
$z_1 = \sigma(w_{01}^h + w_{11}^h x_1 + w_{21}^h x_2)$
$z_2 = \sigma(w_{02}^h + w_{12}^h x_1 + w_{22}^h x_2)$
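To make the prediction step concrete, here is a minimal sketch of this forward pass in Python/NumPy. The weight names (w01_h, w11_o, ...) simply mirror the notation in the equations above; the weights and inputs in the usage example are arbitrary.

```python
import numpy as np

def sigmoid(s):
    # The logistic activation used by the hidden units
    return 1.0 / (1.0 + np.exp(-s))

def forward(x1, x2, w):
    """Predict y for a single input (x1, x2), given a dict of weights w."""
    z1 = sigmoid(w["w01_h"] + w["w11_h"] * x1 + w["w21_h"] * x2)  # hidden unit 1
    z2 = sigmoid(w["w02_h"] + w["w12_h"] * x1 + w["w22_h"] * x2)  # hidden unit 2
    y = w["w01_o"] + w["w11_o"] * z1 + w["w21_o"] * z2            # linear output
    return y, z1, z2

# Example usage with arbitrary weights and inputs
w = {name: 0.1 for name in ["w01_h", "w11_h", "w21_h",
                            "w02_h", "w12_h", "w22_h",
                            "w01_o", "w11_o", "w21_o"]}
y, z1, z2 = forward(x1=1.0, x2=2.0, w=w)
```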
Back to our running example
[Figure: the same two-layer network]
Given an input x, the output $y$ (and the hidden values $z_1, z_2$) are computed exactly as on the previous slide.
Suppose the true label for this example is a number $y_i$. We can write the square loss for this example as:
$L = \frac{1}{2}(y - y_i)^2$
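As a small illustration (the numbers here are made up), the per-example squared loss is just:

```python
def squared_loss(y_pred, y_true):
    # L = 1/2 (y - y_i)^2 for a single example
    return 0.5 * (y_pred - y_true) ** 2

print(squared_loss(0.9, 1.0))  # 0.5 * (0.9 - 1.0)^2 = 0.005
```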
Learning as loss minimization
We have a classifier NN that is completely defined by its weights. Learn the weights by minimizing a loss $L$, perhaps with a regularizer:
$\min_w \sum_i L(NN(x_i, w), y_i)$
How do we solve the optimization problem?
Stochastic gradient descent
Given a training set S = {(x_i, y_i)}, x ∈ ℝ^d, the goal is $\min_w \sum_i L(NN(x_i, w), y_i)$:
1. Initialize parameters w
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      • Treat this example as the entire dataset; compute the gradient of the loss $\nabla L(NN(x_i, w), y_i)$
      • Update: $w \leftarrow w - \gamma_t \nabla L(NN(x_i, w), y_i)$
3. Return w
$\gamma_t$: learning rate, many tweaks possible.
The objective is not convex. Initialization can be important.
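The loop above, written as a Python sketch. The function `gradient_of_loss` is a placeholder for whatever computes $\nabla L(NN(x_i, w), y_i)$; for neural networks, that is exactly what backpropagation (coming up) will provide.

```python
import random

def sgd(examples, w, gradient_of_loss, num_epochs, learning_rate):
    """examples: list of (x_i, y_i) pairs; w: dict of parameters."""
    for epoch in range(num_epochs):
        random.shuffle(examples)                   # 1. shuffle the training set
        for x_i, y_i in examples:                  # 2. one example at a time
            grad = gradient_of_loss(x_i, y_i, w)   #    treat it as the entire dataset
            for name in w:                         #    w <- w - gamma_t * grad
                w[name] -= learning_rate * grad[name]
    return w
```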
Have we solved everything?
The derivative of the loss function?
If the neural network is a differentiable function, we can find the gradient $\nabla L(NN(x_i, w), y_i)$
 – Or maybe its sub-gradient
 – This is decided by the activation functions and the loss function
It was easy for SVMs and logistic regression: only one layer.
But how do we find the (sub-)gradient of a more complex function?
 – E.g.: a recent paper used a ~150-layer neural network for image classification!
We need an efficient algorithm: Backpropagation
Checkpoint: Where are we?
If we have a neural network (structure, activations and weights), we can make a prediction for an input.
If we have the true label of the input, then we can define the loss for that example.
If we can take the derivative of the loss with respect to each of the weights, we can take a gradient step in SGD.
Questions?
Reminder: Chain rule for derivatives
 – If $z$ is a function of $y$, and $y$ is a function of $x$
   • Then $z$ is a function of $x$ as well
 – Question: how to find $\frac{dz}{dx}$?
(Slide courtesy Richard Socher)
Reminder: Chain rule for derivatives
 – If $z$ = a function of $y_1$ + a function of $y_2$, and the $y_i$'s are functions of $x$
   • Then $z$ is a function of $x$ as well
 – Question: how to find $\frac{dz}{dx}$?
(Slide courtesy Richard Socher)
Reminder: Chain rule for derivatives
 – If $z$ is a sum of functions of the $y_i$'s, and the $y_i$'s are functions of $x$
   • Then $z$ is a function of $x$ as well
 – Question: how to find $\frac{dz}{dx}$?
(Slide courtesy Richard Socher)
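Written out, the answers to these questions are the two forms of the chain rule that backpropagation applies repeatedly:
$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$   (a single intermediate variable)
$\frac{dz}{dx} = \sum_i \frac{\partial z}{\partial y_i}\,\frac{dy_i}{dx}$   (a sum over intermediate variables $y_i$)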
Backpropagation
[Figure: the same two-layer network]
$L = \frac{1}{2}(y - y^*)^2$
$y = w_{01}^o + w_{11}^o z_1 + w_{21}^o z_2$
$z_1 = \sigma(w_{01}^h + w_{11}^h x_1 + w_{21}^h x_2)$
$z_2 = \sigma(w_{02}^h + w_{12}^h x_1 + w_{22}^h x_2)$
We want to compute $\frac{\partial L}{\partial w_{ij}^o}$ and $\frac{\partial L}{\partial w_{ij}^h}$ for every weight, by applying the chain rule to compute the gradient (and remembering partial computations along the way to speed things up).
Backpropagation example: Output layer
$L = \frac{1}{2}(y - y^*)^2$, with $y = w_{01}^o + w_{11}^o z_1 + w_{21}^o z_2$
$\frac{\partial L}{\partial w_{01}^o} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial w_{01}^o}$
$\frac{\partial L}{\partial y} = y - y^*$
$\frac{\partial y}{\partial w_{01}^o} = 1$
Backpropagation example: Output layer
$\frac{\partial L}{\partial w_{11}^o} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial w_{11}^o}$
$\frac{\partial L}{\partial y} = y - y^*$. We have already computed this partial derivative for the previous case: cache it to speed things up!
$\frac{\partial y}{\partial w_{11}^o} = z_1$
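A small sketch of these output-layer gradients in Python, caching $\frac{\partial L}{\partial y}$ once. The gradient for $w_{21}^o$ is not derived on the slides but follows by the same argument, with $z_2$ in place of $z_1$.

```python
def output_layer_gradients(y, y_true, z1, z2):
    dL_dy = y - y_true              # dL/dy, computed once and reused (cached)
    return {
        "w01_o": dL_dy * 1.0,       # dy/dw01_o = 1
        "w11_o": dL_dy * z1,        # dy/dw11_o = z1
        "w21_o": dL_dy * z2,        # dy/dw21_o = z2 (same argument as for w11_o)
    }
```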
Backpropagation example: Hidden layer derivatives
[Figure: the same two-layer network]
$L = \frac{1}{2}(y - y^*)^2$
$y = w_{01}^o + w_{11}^o z_1 + w_{21}^o z_2$
$z_1 = \sigma(w_{01}^h + w_{11}^h x_1 + w_{21}^h x_2)$
$z_2 = \sigma(w_{02}^h + w_{12}^h x_1 + w_{22}^h x_2)$
We want $\frac{\partial L}{\partial w_{22}^h}$.
$\frac{\partial L}{\partial w_{22}^h} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial w_{22}^h}$
$= \frac{\partial L}{\partial y}\,\frac{\partial}{\partial w_{22}^h}\left(w_{01}^o + w_{11}^o z_1 + w_{21}^o z_2\right)$
$= \frac{\partial L}{\partial y}\left(w_{11}^o \frac{\partial z_1}{\partial w_{22}^h} + w_{21}^o \frac{\partial z_2}{\partial w_{22}^h}\right)$
$z_1$ is not a function of $w_{22}^h$, so the first term is 0:
$= \frac{\partial L}{\partial y}\, w_{21}^o\, \frac{\partial z_2}{\partial w_{22}^h}$
Recall $z_2 = \sigma(w_{02}^h + w_{12}^h x_1 + w_{22}^h x_2)$; call the argument of $\sigma$ $s$. Then
$\frac{\partial L}{\partial w_{22}^h} = \frac{\partial L}{\partial y}\, w_{21}^o\, \frac{\partial z_2}{\partial s}\,\frac{\partial s}{\partial w_{22}^h}$
Each of these partial derivatives is easy:
$\frac{\partial L}{\partial y} = y - y^*$
$\frac{\partial z_2}{\partial s} = z_2(1 - z_2)$. Why? Because $z_2 = \sigma(s)$ is the logistic function we have already seen.
$\frac{\partial s}{\partial w_{22}^h} = x_2$
More important: we have already computed many of these partial derivatives, because we are proceeding from the top of the network to the bottom (i.e. backwards).
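The same derivation as Python, again reusing the cached $\frac{\partial L}{\partial y}$. The other hidden-layer weights follow the same pattern with the appropriate input and output weight.

```python
def hidden_gradient_w22(y, y_true, z2, x2, w21_o):
    dL_dy = y - y_true             # cached from the output-layer computation
    dz2_ds = z2 * (1.0 - z2)       # derivative of the logistic function at s
    ds_dw22 = x2                   # s = w02_h + w12_h*x1 + w22_h*x2
    return dL_dy * w21_o * dz2_ds * ds_dw22
```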
The Backpropagation Algorithm
The same algorithm works for multiple layers: repeated application of the chain rule for partial derivatives.
 – First perform the forward pass from the inputs to the output
 – Compute the loss
 – From the loss, proceed backwards to compute partial derivatives using the chain rule
 – Cache partial derivatives as you compute them
   • They will be used for lower layers
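Putting the pieces together for the running example: a minimal sketch of one full forward + backward pass that returns the loss and the gradient for every weight, caching the shared quantities ($\frac{\partial L}{\partial y}$ and the per-unit factors) as the algorithm prescribes. Names mirror the slide notation.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward_backward(x1, x2, y_true, w):
    # Forward pass: inputs -> hidden units -> output -> loss
    z1 = sigmoid(w["w01_h"] + w["w11_h"] * x1 + w["w21_h"] * x2)
    z2 = sigmoid(w["w02_h"] + w["w12_h"] * x1 + w["w22_h"] * x2)
    y = w["w01_o"] + w["w11_o"] * z1 + w["w21_o"] * z2
    loss = 0.5 * (y - y_true) ** 2

    # Backward pass: from the loss back towards the inputs, caching as we go
    dL_dy = y - y_true
    d1 = dL_dy * w["w11_o"] * z1 * (1.0 - z1)   # shared factor for hidden unit z1
    d2 = dL_dy * w["w21_o"] * z2 * (1.0 - z2)   # shared factor for hidden unit z2
    grads = {
        "w01_o": dL_dy, "w11_o": dL_dy * z1, "w21_o": dL_dy * z2,
        "w01_h": d1, "w11_h": d1 * x1, "w21_h": d1 * x2,
        "w02_h": d2, "w12_h": d2 * x1, "w22_h": d2 * x2,
    }
    return loss, grads
```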
Mechanizing learning
• Backpropagation gives you the gradient that will be used for gradient descent
 – SGD gives us a generic learning algorithm
 – Backpropagation is a generic method for computing partial derivatives
• A recursive algorithm that proceeds from the top of the network to the bottom
• Modern neural network libraries implement automatic differentiation using backpropagation
 – Allows easy exploration of network architectures
 – Don't have to keep deriving the gradients by hand each time
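For instance, assuming PyTorch is available, the same gradients can be obtained without deriving anything by hand: the library records the forward computation and backpropagates through it for us. This is only an illustrative sketch of the idea, not part of the original slides.

```python
import torch

x = torch.tensor([1.0, 2.0])
y_true = torch.tensor(1.0)

W_h = torch.randn(2, 2, requires_grad=True)   # hidden-layer weights
b_h = torch.randn(2, requires_grad=True)      # hidden-layer biases
w_o = torch.randn(2, requires_grad=True)      # output weights
b_o = torch.randn(1, requires_grad=True)      # output bias

z = torch.sigmoid(W_h @ x + b_h)              # hidden layer
y = w_o @ z + b_o                             # linear output
loss = 0.5 * (y - y_true) ** 2

loss.backward()                               # backpropagation, done for us
print(W_h.grad, w_o.grad)                     # dL/dW_h and dL/dw_o
```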
Stochastic gradient descent
Given a training set S = {(x_i, y_i)}, x ∈ ℝ^d, the goal is $\min_w \sum_i L(NN(x_i, w), y_i)$:
1. Initialize parameters w
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      • Treat this example as the entire dataset
      • Compute the gradient of the loss $\nabla L(NN(x_i, w), y_i)$ using backpropagation
      • Update: $w \leftarrow w - \gamma_t \nabla L(NN(x_i, w), y_i)$
3. Return w
$\gamma_t$: learning rate, many tweaks possible.
The objective is not convex. Initialization can be important.
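Finally, a self-contained sketch of the whole procedure for the running example, with the hidden layer written as a small weight matrix rather than individual scalars. The toy data, initialization scale, learning rate, and epoch count below are illustrative choices, not prescriptions from the slides.

```python
import random
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
# Toy dataset: 2-dimensional inputs with real-valued labels
S = [(rng.normal(size=2), rng.normal()) for _ in range(100)]

# Initialize parameters (small random values; initialization can matter)
W_h = rng.normal(scale=0.1, size=(2, 2)); b_h = np.zeros(2)
w_o = rng.normal(scale=0.1, size=2);      b_o = 0.0

lr, num_epochs = 0.05, 20
for epoch in range(num_epochs):
    random.shuffle(S)                             # shuffle the training set
    for x, y_true in S:
        # Forward pass
        z = sigmoid(W_h @ x + b_h)                # hidden units z1, z2
        y = w_o @ z + b_o                         # linear output
        # Backward pass (backpropagation)
        dL_dy = y - y_true
        d_hidden = dL_dy * w_o * z * (1 - z)      # one cached factor per hidden unit
        # SGD update: w <- w - gamma_t * gradient
        w_o -= lr * dL_dy * z
        b_o -= lr * dL_dy
        W_h -= lr * np.outer(d_hidden, x)
        b_h -= lr * d_hidden
```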