Machine Learning
Neural Networks: Backpropagation
Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others
This lecture
• What is a neural network?
• Predicting with a neural network
• Training neural networks: Backpropagation
• Practical concerns
Training a neural network
• Given
  – A network architecture (layout of neurons, their connectivity and activations)
  – A dataset of labeled examples: S = {(x_i, y_i)}
• The goal: Learn the weights of the neural network
• Remember: For a fixed architecture, a neural network is a function parameterized by its weights
  – Prediction: $y = NN(x, w)$
Recall: Learning as loss minimization
We have a classifier NN that is completely defined by its weights. Learn the weights by minimizing a loss $L$, perhaps with a regularizer:
$\min_w \sum_i L(NN(x_i, w), y_i)$
So far, we saw that this strategy worked for:
1. Logistic Regression
2. Support Vector Machines
3. Perceptron
4. LMS regression
All of these are linear models; each minimizes a different loss function.
Same idea for non-linear models too!
Back to our running example
[Figure: a two-layer network with inputs $x_1, x_2$, hidden units $z_1, z_2$, and output $y$]
Given an input x, how is the output predicted?
$y = w_{01}^o + w_{11}^o z_1 + w_{21}^o z_2$
$z_1 = \sigma(w_{01}^h + w_{11}^h x_1 + w_{21}^h x_2)$
$z_2 = \sigma(w_{02}^h + w_{12}^h x_1 + w_{22}^h x_2)$
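To make the prediction step concrete, here is a minimal sketch of this forward pass in Python/NumPy. The weight names (w01_h, w11_o, ...) simply mirror the notation in the equations above; the weights and inputs in the usage example are arbitrary.

```python
import numpy as np

def sigmoid(s):
    # The logistic activation used by the hidden units
    return 1.0 / (1.0 + np.exp(-s))

def forward(x1, x2, w):
    """Predict y for a single input (x1, x2), given a dict of weights w."""
    z1 = sigmoid(w["w01_h"] + w["w11_h"] * x1 + w["w21_h"] * x2)  # hidden unit 1
    z2 = sigmoid(w["w02_h"] + w["w12_h"] * x1 + w["w22_h"] * x2)  # hidden unit 2
    y = w["w01_o"] + w["w11_o"] * z1 + w["w21_o"] * z2            # linear output
    return y, z1, z2

# Example usage with arbitrary weights and inputs
w = {name: 0.1 for name in ["w01_h", "w11_h", "w21_h",
                            "w02_h", "w12_h", "w22_h",
                            "w01_o", "w11_o", "w21_o"]}
y, z1, z2 = forward(x1=1.0, x2=2.0, w=w)
```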
Back to our running example
[Figure: the same two-layer network]
Given an input x, the output $y$ (and the hidden values $z_1, z_2$) are computed exactly as on the previous slide.
Suppose the true label for this example is a number $y_i$. We can write the square loss for this example as:
$L = \frac{1}{2}(y - y_i)^2$
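As a small illustration (the numbers here are made up), the per-example squared loss is just:

```python
def squared_loss(y_pred, y_true):
    # L = 1/2 (y - y_i)^2 for a single example
    return 0.5 * (y_pred - y_true) ** 2

print(squared_loss(0.9, 1.0))  # 0.5 * (0.9 - 1.0)^2 = 0.005
```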
Learning as loss minimization
We have a classifier NN that is completely defined by its weights. Learn the weights by minimizing a loss $L$, perhaps with a regularizer:
$\min_w \sum_i L(NN(x_i, w), y_i)$
How do we solve the optimization problem?
Stochastic gradient descent
Given a training set S = {(x_i, y_i)}, x ∈ ℝ^d, the goal is $\min_w \sum_i L(NN(x_i, w), y_i)$:
1. Initialize parameters w
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      • Treat this example as the entire dataset; compute the gradient of the loss $\nabla L(NN(x_i, w), y_i)$
      • Update: $w \leftarrow w - \gamma_t \nabla L(NN(x_i, w), y_i)$
3. Return w
$\gamma_t$: learning rate, many tweaks possible.
The objective is not convex. Initialization can be important.
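The loop above, written as a Python sketch. The function `gradient_of_loss` is a placeholder for whatever computes $\nabla L(NN(x_i, w), y_i)$; for neural networks, that is exactly what backpropagation (coming up) will provide.

```python
import random

def sgd(examples, w, gradient_of_loss, num_epochs, learning_rate):
    """examples: list of (x_i, y_i) pairs; w: dict of parameters."""
    for epoch in range(num_epochs):
        random.shuffle(examples)                   # 1. shuffle the training set
        for x_i, y_i in examples:                  # 2. one example at a time
            grad = gradient_of_loss(x_i, y_i, w)   #    treat it as the entire dataset
            for name in w:                         #    w <- w - gamma_t * grad
                w[name] -= learning_rate * grad[name]
    return w
```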
Have we solved everything?
The derivative of the loss function?
If the neural network is a differentiable function, we can find the gradient $\nabla L(NN(x_i, w), y_i)$
 – Or maybe its sub-gradient
 – This is decided by the activation functions and the loss function
It was easy for SVMs and logistic regression: only one layer.
But how do we find the (sub-)gradient of a more complex function?
 – E.g.: a recent paper used a ~150-layer neural network for image classification!
We need an efficient algorithm: Backpropagation
Checkpoint: Where are we?
If we have a neural network (structure, activations and weights), we can make a prediction for an input.
If we have the true label of the input, then we can define the loss for that example.
If we can take the derivative of the loss with respect to each of the weights, we can take a gradient step in SGD.
Questions?
Reminder: Chain rule for derivatives
 – If $z$ is a function of $y$, and $y$ is a function of $x$
   • Then $z$ is a function of $x$ as well
 – Question: how to find $\frac{dz}{dx}$?
(Slide courtesy Richard Socher)
Reminder: Chain rule for derivatives
 – If $z$ = a function of $y_1$ + a function of $y_2$, and the $y_i$'s are functions of $x$
   • Then $z$ is a function of $x$ as well
 – Question: how to find $\frac{dz}{dx}$?
(Slide courtesy Richard Socher)
Reminder: Chain rule for derivatives
 – If $z$ is a sum of functions of the $y_i$'s, and the $y_i$'s are functions of $x$
   • Then $z$ is a function of $x$ as well
 – Question: how to find $\frac{dz}{dx}$?
(Slide courtesy Richard Socher)
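Written out, the answers to these questions are the two forms of the chain rule that backpropagation applies repeatedly:
$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$   (a single intermediate variable)
$\frac{dz}{dx} = \sum_i \frac{\partial z}{\partial y_i}\,\frac{dy_i}{dx}$   (a sum over intermediate variables $y_i$)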
Backpropagation
[Figure: the same two-layer network]
$L = \frac{1}{2}(y - y^*)^2$
$y = w_{01}^o + w_{11}^o z_1 + w_{21}^o z_2$
$z_1 = \sigma(w_{01}^h + w_{11}^h x_1 + w_{21}^h x_2)$
$z_2 = \sigma(w_{02}^h + w_{12}^h x_1 + w_{22}^h x_2)$
We want to compute $\frac{\partial L}{\partial w_{ij}^o}$ and $\frac{\partial L}{\partial w_{ij}^h}$ for every weight, by applying the chain rule to compute the gradient (and remembering partial computations along the way to speed things up).
Backpropagation example: Output layer
$L = \frac{1}{2}(y - y^*)^2$, with $y = w_{01}^o + w_{11}^o z_1 + w_{21}^o z_2$
$\frac{\partial L}{\partial w_{01}^o} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial w_{01}^o}$
$\frac{\partial L}{\partial y} = y - y^*$
$\frac{\partial y}{\partial w_{01}^o} = 1$
Backpropagation example: Output layer
$\frac{\partial L}{\partial w_{11}^o} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial w_{11}^o}$
$\frac{\partial L}{\partial y} = y - y^*$. We have already computed this partial derivative for the previous case: cache it to speed things up!
$\frac{\partial y}{\partial w_{11}^o} = z_1$
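A small sketch of these output-layer gradients in Python, caching $\frac{\partial L}{\partial y}$ once. The gradient for $w_{21}^o$ is not derived on the slides but follows by the same argument, with $z_2$ in place of $z_1$.

```python
def output_layer_gradients(y, y_true, z1, z2):
    dL_dy = y - y_true              # dL/dy, computed once and reused (cached)
    return {
        "w01_o": dL_dy * 1.0,       # dy/dw01_o = 1
        "w11_o": dL_dy * z1,        # dy/dw11_o = z1
        "w21_o": dL_dy * z2,        # dy/dw21_o = z2 (same argument as for w11_o)
    }
```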
Backpropagation example: Hidden layer derivatives
[Figure: the same two-layer network]
$L = \frac{1}{2}(y - y^*)^2$
$y = w_{01}^o + w_{11}^o z_1 + w_{21}^o z_2$
$z_1 = \sigma(w_{01}^h + w_{11}^h x_1 + w_{21}^h x_2)$
$z_2 = \sigma(w_{02}^h + w_{12}^h x_1 + w_{22}^h x_2)$
We want $\frac{\partial L}{\partial w_{22}^h}$.
$\frac{\partial L}{\partial w_{22}^h} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial w_{22}^h}$
$= \frac{\partial L}{\partial y}\,\frac{\partial}{\partial w_{22}^h}\left(w_{01}^o + w_{11}^o z_1 + w_{21}^o z_2\right)$
$= \frac{\partial L}{\partial y}\left(w_{11}^o \frac{\partial z_1}{\partial w_{22}^h} + w_{21}^o \frac{\partial z_2}{\partial w_{22}^h}\right)$
$z_1$ is not a function of $w_{22}^h$, so the first term is 0:
$= \frac{\partial L}{\partial y}\, w_{21}^o\, \frac{\partial z_2}{\partial w_{22}^h}$
Recall $z_2 = \sigma(w_{02}^h + w_{12}^h x_1 + w_{22}^h x_2)$; call the argument of $\sigma$ $s$. Then
$\frac{\partial L}{\partial w_{22}^h} = \frac{\partial L}{\partial y}\, w_{21}^o\, \frac{\partial z_2}{\partial s}\,\frac{\partial s}{\partial w_{22}^h}$
Each of these partial derivatives is easy:
$\frac{\partial L}{\partial y} = y - y^*$
$\frac{\partial z_2}{\partial s} = z_2(1 - z_2)$. Why? Because $z_2 = \sigma(s)$ is the logistic function we have already seen.
$\frac{\partial s}{\partial w_{22}^h} = x_2$
More important: we have already computed many of these partial derivatives, because we are proceeding from the top of the network to the bottom (i.e. backwards).
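The same derivation as Python, again reusing the cached $\frac{\partial L}{\partial y}$. The other hidden-layer weights follow the same pattern with the appropriate input and output weight.

```python
def hidden_gradient_w22(y, y_true, z2, x2, w21_o):
    dL_dy = y - y_true             # cached from the output-layer computation
    dz2_ds = z2 * (1.0 - z2)       # derivative of the logistic function at s
    ds_dw22 = x2                   # s = w02_h + w12_h*x1 + w22_h*x2
    return dL_dy * w21_o * dz2_ds * ds_dw22
```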
The Backpropagation Algorithm
The same algorithm works for multiple layers: repeated application of the chain rule for partial derivatives.
 – First perform the forward pass from the inputs to the output
 – Compute the loss
 – From the loss, proceed backwards to compute partial derivatives using the chain rule
 – Cache partial derivatives as you compute them
   • They will be used for lower layers
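Putting the pieces together for the running example: a minimal sketch of one full forward + backward pass that returns the loss and the gradient for every weight, caching the shared quantities ($\frac{\partial L}{\partial y}$ and the per-unit factors) as the algorithm prescribes. Names mirror the slide notation.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward_backward(x1, x2, y_true, w):
    # Forward pass: inputs -> hidden units -> output -> loss
    z1 = sigmoid(w["w01_h"] + w["w11_h"] * x1 + w["w21_h"] * x2)
    z2 = sigmoid(w["w02_h"] + w["w12_h"] * x1 + w["w22_h"] * x2)
    y = w["w01_o"] + w["w11_o"] * z1 + w["w21_o"] * z2
    loss = 0.5 * (y - y_true) ** 2

    # Backward pass: from the loss back towards the inputs, caching as we go
    dL_dy = y - y_true
    d1 = dL_dy * w["w11_o"] * z1 * (1.0 - z1)   # shared factor for hidden unit z1
    d2 = dL_dy * w["w21_o"] * z2 * (1.0 - z2)   # shared factor for hidden unit z2
    grads = {
        "w01_o": dL_dy, "w11_o": dL_dy * z1, "w21_o": dL_dy * z2,
        "w01_h": d1, "w11_h": d1 * x1, "w21_h": d1 * x2,
        "w02_h": d2, "w12_h": d2 * x1, "w22_h": d2 * x2,
    }
    return loss, grads
```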
Mechanizing learning
• Backpropagation gives you the gradient that will be used for gradient descent
 – SGD gives us a generic learning algorithm
 – Backpropagation is a generic method for computing partial derivatives
• A recursive algorithm that proceeds from the top of the network to the bottom
• Modern neural network libraries implement automatic differentiation using backpropagation
 – Allows easy exploration of network architectures
 – Don't have to keep deriving the gradients by hand each time
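For instance, assuming PyTorch is available, the same gradients can be obtained without deriving anything by hand: the library records the forward computation and backpropagates through it for us. This is only an illustrative sketch of the idea, not part of the original slides.

```python
import torch

x = torch.tensor([1.0, 2.0])
y_true = torch.tensor(1.0)

W_h = torch.randn(2, 2, requires_grad=True)   # hidden-layer weights
b_h = torch.randn(2, requires_grad=True)      # hidden-layer biases
w_o = torch.randn(2, requires_grad=True)      # output weights
b_o = torch.randn(1, requires_grad=True)      # output bias

z = torch.sigmoid(W_h @ x + b_h)              # hidden layer
y = w_o @ z + b_o                             # linear output
loss = 0.5 * (y - y_true) ** 2

loss.backward()                               # backpropagation, done for us
print(W_h.grad, w_o.grad)                     # dL/dW_h and dL/dw_o
```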
Stochastic gradient descent
Given a training set S = {(x_i, y_i)}, x ∈ ℝ^d, the goal is $\min_w \sum_i L(NN(x_i, w), y_i)$:
1. Initialize parameters w
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      • Treat this example as the entire dataset
      • Compute the gradient of the loss $\nabla L(NN(x_i, w), y_i)$ using backpropagation
      • Update: $w \leftarrow w - \gamma_t \nabla L(NN(x_i, w), y_i)$
3. Return w
$\gamma_t$: learning rate, many tweaks possible.
The objective is not convex. Initialization can be important.
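Finally, a self-contained sketch of the whole procedure for the running example, with the hidden layer written as a small weight matrix rather than individual scalars. The toy data, initialization scale, learning rate, and epoch count below are illustrative choices, not prescriptions from the slides.

```python
import random
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
# Toy dataset: 2-dimensional inputs with real-valued labels
S = [(rng.normal(size=2), rng.normal()) for _ in range(100)]

# Initialize parameters (small random values; initialization can matter)
W_h = rng.normal(scale=0.1, size=(2, 2)); b_h = np.zeros(2)
w_o = rng.normal(scale=0.1, size=2);      b_o = 0.0

lr, num_epochs = 0.05, 20
for epoch in range(num_epochs):
    random.shuffle(S)                             # shuffle the training set
    for x, y_true in S:
        # Forward pass
        z = sigmoid(W_h @ x + b_h)                # hidden units z1, z2
        y = w_o @ z + b_o                         # linear output
        # Backward pass (backpropagation)
        dL_dy = y - y_true
        d_hidden = dL_dy * w_o * z * (1 - z)      # one cached factor per hidden unit
        # SGD update: w <- w - gamma_t * gradient
        w_o -= lr * dL_dy * z
        b_o -= lr * dL_dy
        W_h -= lr * np.outer(d_hidden, x)
        b_h -= lr * d_hidden
```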