CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader

Transcript of Lecture 12: Perceptron and Back Propagation

Page 1: Lecture 12: Perceptron and Back Propagation

CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader

Page 2: Outline

1. Review of Classification and Logistic Regression

2. Introduction to Optimization

   – Gradient Descent

   – Stochastic Gradient Descent

3. Single Neuron Network ('Perceptron')

4. Multi-Layer Perceptron

5. Back Propagation


Page 4: Classification and Logistic Regression

Page 5: Classification

Methods centered on modeling and predicting a quantitative response variable (e.g., number of taxi pickups, number of bike rentals) are called regression methods (linear regression, Ridge, LASSO, etc.).

When the response variable is categorical, the problem is no longer a regression problem but is instead a classification problem.

The goal is to classify each observation into a category (a.k.a. class or cluster) defined by Y, based on a set of predictor variables X.

Page 6: Heart Data

Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca Thal AHD

63 1 typical 145 233 1 2 150 0 2.3 3 0.0 fixed No

67 1 asymptomatic 160 286 0 2 108 1 1.5 2 3.0 normal Yes

67 1 asymptomatic 120 229 0 2 129 1 2.6 2 2.0 reversable Yes

37 1 nonanginal 130 250 0 0 187 0 3.5 3 0.0 normal No

41 0 nontypical 130 204 0 2 172 0 1.4 1 0.0 normal No

The response variable Y is Yes/No.

Page 7: Heart Data: logistic estimation

We'd like to predict whether or not a person has heart disease, and we'd like to make this prediction, for now, based only on MaxHR.

Page 8: Logistic Regression

Logistic regression addresses the problem of estimating a probability, $P(Y=1)$, given an input $X$. The logistic regression model uses a function, called the logistic function, to model $P(Y=1)$:

$$P(Y=1) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$
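To make the shape concrete, here is a minimal sketch (not from the slides) that evaluates this logistic function for chosen coefficients; the values of beta0 and beta1 below are illustrative assumptions, not fitted values.

import numpy as np

def logistic_prob(x, beta0, beta1):
    # P(Y = 1) for a single predictor under the logistic model
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

beta0, beta1 = -10.0, 0.1   # assumed, not estimated from the Heart data
print(logistic_prob(np.array([50.0, 100.0, 150.0, 200.0]), beta0, beta1))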

Page 9: Logistic Regression

As a result, the model will predict $P(Y=1)$ with an S-shaped curve, which is the general shape of the logistic function.

$\beta_0$ shifts the curve right or left by $c = -\dfrac{\beta_0}{\beta_1}$.

$\beta_1$ controls how steep the S-shaped curve is (how quickly it rises from ½ toward ~1, or falls from ½ toward ~0).

Note: if $\beta_1$ is positive, then the predicted $P(Y=1)$ goes from zero for small values of $X$ to one for large values of $X$; if $\beta_1$ is negative, then $P(Y=1)$ has the opposite association.

Page 10: Logistic Regression

โˆ’๐›ฝ(๐›ฝ.

2๐›ฝ.

๐›ฝ.4

Page 11: Logistic Regression

$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$


Page 13: Estimating the coefficients for Logistic Regression

Find the coefficients that minimize the loss function

โ„’ ๐›ฝ(, ๐›ฝ. = โˆ’6[๐‘ฆ8 log ๐‘8 + 1 โˆ’ ๐‘ฆ8 log(1 โˆ’ ๐‘8)]๏ฟฝ

8
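As a sketch of what this loss looks like in code (assuming the predicted probabilities p have already been computed, e.g. with the logistic function above):

import numpy as np

def cross_entropy_loss(y, p, eps=1e-12):
    # negative log-likelihood of Bernoulli labels y given probabilities p
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
print(cross_entropy_loss(y, p))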

Page 14: But what is the idea?

Start with Regression or Logistic Regression

๐‘Œ = ๐‘“(๐›ฝ( + ๐›ฝ.๐‘ฅ. + ๐›ฝ0๐‘ฅ0 + ๐›ฝE๐‘ฅE + ๐›ฝF๐‘ฅF)

๐‘ฅ.

๐‘ฅ0

๐‘ฅE

๐‘ฅF

Coefficients or WeightsIntercept or Bias

f(X)= ..GHIJKL

Classification

f ๐‘‹ = ๐‘ŠN๐‘‹

Regression

๐‘ŠN = ๐‘Š(,๐‘Š., โ€ฆ ,๐‘ŠF= [๐›ฝ(, ๐›ฝ., โ€ฆ , ๐›ฝF]

Page 15: But what is the idea?

Start with all weights selected randomly. Most likely the model will perform horribly. For example, on our heart data, the model will be giving us the wrong answer.

Input: MaxHR = 200, Age = 52, Sex = Male, Chol = 152
Model output: $\hat{p} = 0.9 \rightarrow$ Yes
Correct answer: y = No

Page 16: But what is the idea?

Start with all weights selected randomly. Most likely the model will perform horribly. For example, on our heart data, the model will be giving us the wrong answer.

Input: MaxHR = 170, Age = 42, Sex = Male, Chol = 342
Model output: $\hat{p} = 0.4 \rightarrow$ No
Correct answer: y = Yes

Page 17: But what is the idea?

• Loss function: takes all of these results, averages them, and tells us how bad or good the computer (i.e., those weights) is.

• Telling the computer how bad or good it is does not help.

• You want to tell it how to change those weights so it gets better.

Loss function: $\mathcal{L}(w_0, w_1, w_2, w_3, w_4)$

For now let's only consider one weight, $\mathcal{L}(w_1)$.

Page 18: Minimizing the Loss function

Ideally we want to know the value of $w_1$ that gives the minimum of $\mathcal{L}(W)$.

To find the optimal point of a function $\mathcal{L}(W)$, we set

$$\frac{d\mathcal{L}(W)}{dW} = 0$$

and find the $W$ that satisfies that equation. Sometimes there is no explicit solution for that.

Page 19: Minimizing the Loss function

A more flexible method is:

• Start from any point
• Determine which direction to go to reduce the loss (left or right)
• Specifically, we can calculate the slope of the function at this point
• Shift to the right if the slope is negative or shift to the left if the slope is positive
• Repeat

Page 20: Minimization of the Loss Function

If the step is proportional to the slope then you avoid overshooting the minimum.

Question: What is the mathematical function that describes the slope?

Question: How do we generalize this to more than one predictor?

Question: What do you think is a good approach for telling the model how to change (what is the step size) to become better?

Page 21: Minimization of the Loss Function

If the step is proportional to the slope then you avoid overshooting the minimum.

Question: What is the mathematical function that describes the slope? The derivative.

Question: How do we generalize this to more than one predictor? Take the derivative with respect to each coefficient and do the same sequentially.

Question: What do you think is a good approach for telling the model how to change (what is the step size) to become better? More on this later.

Page 22: Let's play the Pavlos game

We know that we want to go in the opposite direction of the derivative, and we know we want to make a step proportional to the derivative.

Making a step means:

$$w_{\text{new}} = w_{\text{old}} + \text{step}$$

Opposite direction of the derivative means:

$$w_{\text{new}} = w_{\text{old}} - \lambda \frac{d\mathcal{L}}{dw}$$

Change to more conventional notation:

$$w^{(i+1)} = w^{(i)} - \lambda \frac{d\mathcal{L}}{dw}$$

where $\lambda$ is the learning rate.
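A minimal gradient-descent loop for a single weight might look like the sketch below; the quadratic example loss and the learning rate are assumptions for illustration, not from the slides.

def gradient_descent(dL_dw, w0, lam=0.1, n_steps=100):
    # iterate w <- w - lam * dL/dw starting from w0
    w = w0
    for _ in range(n_steps):
        w = w - lam * dL_dw(w)
    return w

# example: L(w) = (w - 2)^2, whose derivative is 2*(w - 2); the minimum is at w = 2
print(gradient_descent(lambda w: 2 * (w - 2), w0=10.0))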

Page 23: Gradient Descent

• Gradient descent is a first-order optimization algorithm for finding a minimum of a function.

• It is an iterative method.

• $\mathcal{L}$ is decreasing in the direction of the negative derivative.

• The magnitude of the step is controlled by the learning rate $\lambda$.

[Figure: the loss $\mathcal{L}$ versus $w$, with steps moving downhill toward the minimum.]

$$w^{(i+1)} = w^{(i)} - \lambda \frac{d\mathcal{L}}{dw}$$

Page 24: Considerations

• We still need to derive the derivatives.

• We need to know what the learning rate is, or how to set it.

• We need to avoid local minima.

• Finally, the full likelihood function includes summing up all individual 'errors'. Unless you are a statistician, this can be hundreds of thousands of examples.


Page 26: Derivatives: Memories from middle school

Page 27: Linear Regression

$$f = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$$

$$\frac{df}{d\beta_0} = 0 \;\Rightarrow\; -2\sum_i (y_i - \beta_0 - \beta_1 x_i) = 0 \;\Rightarrow\; \sum_i y_i - \beta_0 n - \beta_1 \sum_i x_i = 0 \;\Rightarrow\; \beta_0 = \bar{y} - \beta_1 \bar{x}$$

$$\frac{df}{d\beta_1} = 0 \;\Rightarrow\; 2\sum_i (y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0 \;\Rightarrow\; -\sum_i x_i y_i + \beta_0 \sum_i x_i + \beta_1 \sum_i x_i^2 = 0$$

$$-\sum_i x_i y_i + (\bar{y} - \beta_1 \bar{x})\sum_i x_i + \beta_1 \sum_i x_i^2 = 0 \;\Rightarrow\; \beta_1\left(\sum_i x_i^2 - n\bar{x}^2\right) = \sum_i x_i y_i - n\bar{x}\bar{y}$$

$$\Rightarrow\; \beta_1 = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

Page 28: Logistic Regression Derivatives

Can we do it?

Wolfram Alpha can do it for us!

We need a formalism to deal with these derivatives.

Page 29: Chain Rule

• Chain rule for computing gradients:

$$y = g(x), \qquad z = f(y) = f(g(x)), \qquad \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\frac{\partial y}{\partial x}$$

For vector-valued intermediate variables:

$$\boldsymbol{y} = g(\boldsymbol{x}), \qquad z = f(\boldsymbol{y}) = f(g(\boldsymbol{x})), \qquad \frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_i}$$

• For longer chains:

$$\frac{\partial z}{\partial x_i} = \sum_{j_1} \cdots \sum_{j_m} \frac{\partial z}{\partial y_{j_1}} \cdots \frac{\partial y_{j_m}}{\partial x_i}$$
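As a concrete single-variable instance (added here for illustration), take the activation used later in this lecture, $p = \sigma(h)$ with $h = \beta_0 + \beta_1 x$; the chain rule gives

$$\frac{\partial p}{\partial x} = \frac{\partial p}{\partial h}\,\frac{\partial h}{\partial x} = \sigma(h)\bigl(1 - \sigma(h)\bigr)\,\beta_1$$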

Page 30: Logistic Regression derivatives

For logistic regression, the negative log of the likelihood is:

$$\mathcal{L} = \sum_i \mathcal{L}_i = -\sum_i \log L_i = -\sum_i \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right]$$

$$\mathcal{L}_i = -y_i \log \frac{1}{1 + e^{-W^T X}} - (1-y_i)\log\left(1 - \frac{1}{1 + e^{-W^T X}}\right)$$

To simplify the analysis let us split it into two parts:

$$\mathcal{L}_i = \mathcal{L}_i^A + \mathcal{L}_i^B$$

So the derivative with respect to $W$ is:

$$\frac{\partial \mathcal{L}}{\partial W} = \sum_i \frac{\partial \mathcal{L}_i}{\partial W} = \sum_i \left( \frac{\partial \mathcal{L}_i^A}{\partial W} + \frac{\partial \mathcal{L}_i^B}{\partial W} \right)$$

Page 31

We want the derivative of $\mathcal{L}_i^A = -y_i \log \dfrac{1}{1+e^{-W^T X}}$ with respect to $W$. Define intermediate variables and their partial derivatives:

$\xi_1 = -W^T X$;  $\partial \xi_1/\partial W = -X$
$\xi_2 = e^{\xi_1} = e^{-W^T X}$;  $\partial \xi_2/\partial \xi_1 = e^{\xi_1} = e^{-W^T X}$
$\xi_3 = 1 + \xi_2 = 1 + e^{-W^T X}$;  $\partial \xi_3/\partial \xi_2 = 1$
$\xi_4 = 1/\xi_3 = \dfrac{1}{1+e^{-W^T X}} = p$;  $\partial \xi_4/\partial \xi_3 = -1/\xi_3^2 = -\dfrac{1}{(1+e^{-W^T X})^2}$
$\xi_5 = \log \xi_4 = \log p = \log \dfrac{1}{1+e^{-W^T X}}$;  $\partial \xi_5/\partial \xi_4 = 1/\xi_4 = 1 + e^{-W^T X}$
$\mathcal{L}_i^A = -y\,\xi_5$;  $\partial \mathcal{L}_i^A/\partial \xi_5 = -y$

By the chain rule:

$$\frac{\partial \mathcal{L}_i^A}{\partial W} = \frac{\partial \mathcal{L}_i^A}{\partial \xi_5}\frac{\partial \xi_5}{\partial \xi_4}\frac{\partial \xi_4}{\partial \xi_3}\frac{\partial \xi_3}{\partial \xi_2}\frac{\partial \xi_2}{\partial \xi_1}\frac{\partial \xi_1}{\partial W} = -y\,X\,e^{-W^T X}\,\frac{1}{1+e^{-W^T X}}$$

Page 32

We also need the derivative of $\mathcal{L}_i^B = -(1-y_i)\log\left[1 - \dfrac{1}{1+e^{-W^T X}}\right]$ with respect to $W$:

$\xi_1 = -W^T X$;  $\partial \xi_1/\partial W = -X$
$\xi_2 = e^{\xi_1} = e^{-W^T X}$;  $\partial \xi_2/\partial \xi_1 = e^{\xi_1} = e^{-W^T X}$
$\xi_3 = 1 + \xi_2 = 1 + e^{-W^T X}$;  $\partial \xi_3/\partial \xi_2 = 1$
$\xi_4 = 1/\xi_3 = \dfrac{1}{1+e^{-W^T X}} = p$;  $\partial \xi_4/\partial \xi_3 = -1/\xi_3^2 = -\dfrac{1}{(1+e^{-W^T X})^2}$
$\xi_5 = 1 - \xi_4 = 1 - \dfrac{1}{1+e^{-W^T X}}$;  $\partial \xi_5/\partial \xi_4 = -1$
$\xi_6 = \log \xi_5 = \log(1-p)$;  $\partial \xi_6/\partial \xi_5 = 1/\xi_5 = \dfrac{1+e^{-W^T X}}{e^{-W^T X}}$
$\mathcal{L}_i^B = -(1-y)\,\xi_6$;  $\partial \mathcal{L}_i^B/\partial \xi_6 = -(1-y)$

$$\frac{\partial \mathcal{L}_i^B}{\partial W} = \frac{\partial \mathcal{L}_i^B}{\partial \xi_6}\frac{\partial \xi_6}{\partial \xi_5}\frac{\partial \xi_5}{\partial \xi_4}\frac{\partial \xi_4}{\partial \xi_3}\frac{\partial \xi_3}{\partial \xi_2}\frac{\partial \xi_2}{\partial \xi_1}\frac{\partial \xi_1}{\partial W} = (1-y)\,X\,\frac{1}{1+e^{-W^T X}}$$

Page 33: Learning Rate

Page 34: Learning Rate

Trial and Error.

There are many alternative methods which address how to set or adjust the learning rate, using the derivative or second derivatives and/or the momentum. To be discussed in the next lectures on NNs.

* J. Nocedal and S. Wright, "Numerical Optimization", Springer, 1999
* TLDR: J. Bullinaria, "Learning with Momentum, Conjugate Gradient Learning", 2015

Page 35: Local and Global minima

Page 36: Local vs Global Minima

[Figure (repeated over two slides): the loss $\mathcal{L}$ versus $\boldsymbol{\theta}$, showing a local minimum and the global minimum.]

Page 38: Local vs Global Minima

No guarantee that we get the global minimum.

Question: What would be a good strategy?

Page 39: Large data

Page 40: Batch and Stochastic Gradient Descent

Instead of using all the examples for every step, use a subset of them (a batch).

For each iteration $k$, use the following loss function to derive the derivatives:

$$\mathcal{L}_k = -\sum_{i \in b_k} \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$$

which is an approximation to the full loss function

$$\mathcal{L} = -\sum_i \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$$
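A minimal sketch of this idea in code (assuming a generic gradient(w, X, y) that returns the gradient of the loss on the given examples; the batch size and learning rate are illustrative):

import numpy as np

def sgd(gradient, w, X, y, lam=0.1, batch_size=32, n_epochs=10, seed=0):
    # mini-batch SGD: each step uses the gradient of the batch loss only
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - lam * gradient(w, X[idx], y[idx])
    return w

# toy usage: per-example squared loss (w - y_i)^2, so the batch gradient is the mean of 2*(w - y_i)
grad = lambda w, Xb, yb: np.mean(2 * (w - yb))
print(sgd(grad, w=0.0, X=np.zeros(1000), y=np.ones(1000)))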

Page 41: Batch and Stochastic Gradient Descent

[Figure (animated over Pages 41–49): the loss $\mathcal{L}$ versus $\boldsymbol{\theta}$, comparing steps taken with the full likelihood against steps taken with the batch likelihood.]

Page 50: Artificial Neural Networks (ANN)

Page 51: Logistic Regression Revisited

Logistic regression, for each observation $x_i$, is a chain of simple operations:

Affine: $h_i = \beta_0 + \beta_1 x_i$
Activation: $p_i = \dfrac{1}{1 + e^{-h_i}}$
Loss: $\mathcal{L}_i(\beta) = -y_i \ln p_i - (1-y_i)\ln(1-p_i)$

and the total loss sums over the observations:

$$\mathcal{L}(\beta) = \sum_i \mathcal{L}_i(\beta)$$

In vector form: $X \rightarrow$ Affine $h = \beta_0 + \beta_1 X \rightarrow$ Activation $p = \dfrac{1}{1+e^{-h}} \rightarrow$ Loss.

Page 52: Build our first ANN

โ„’(๐›ฝ) =6โ„’8 ๐›ฝg

8Affine๐‘‹ โ„Ž = ๐›ฝ( + ๐›ฝ.๐‘‹ Activation ๐‘ =

11 + ๐‘’z๏ฟฝ LossFun

โ„’(๐›ฝ) =6โ„’8 ๐›ฝg

8Affine๐‘‹ โ„Ž = ๐›ฝN๐‘‹ Activation ๐‘ =

11 + ๐‘’z๏ฟฝ LossFun

โ„’(๐‘Š) =6โ„’8 ๐‘Šg

8Affine๐‘‹ โ„Ž = ๐‘ŠN๐‘‹ Activation ๐‘ฆ =

11 + ๐‘’z๏ฟฝ LossFun

Page 53: Build our first ANN

โ„’(๐‘Š) =6โ„’8 ๐‘Šg

8Affine๐‘‹ โ„Ž = ๐‘ŠN๐‘‹ Activation ๐‘ =

11 + ๐‘’z๏ฟฝ LossFun

๐‘‹ ๐‘Œ

Page 54: Example Using Heart Data

Slightly modified data to illustrate a concept.

Page 55: Example Using Heart Data

[Figure: the data plotted as $X$ versus $Y$.]

Page 56: Example

[Figure: $X$ versus $Y$, and $X$ versus the model output $Y'$.]

Page 57: Pavlos game #232

[Figure: a network with input $X$ feeding two units with weights $W_1$ and $W_2$, producing $h_1$ and $h_2$; their sum $h_1 + h_2$ produces the output $Y'$, which is compared with $Y$.]

Page 58: Pavlos game #232

๐‘‹ ๐‘Œ๐‘‹ ๐‘Œ

๐‘‹

W1

๐‘ŠE

W1

W2W2

โ„Ž.

โ„Ž0

๐‘ž = ๐‘ŠE.โ„Ž. +๐‘ŠE0โ„Ž0 +๐‘ŠE(

Page 59: Pavlos game #232

๐‘‹ ๐‘Œ๐‘‹ ๐‘Œ

๐‘‹

W1

๐‘ŠE

W1

W2W2

โ„Ž.

โ„Ž0

๐‘ž = ๐‘ŠE.โ„Ž. +๐‘ŠE0โ„Ž0 +๐‘ŠE(

๐‘ =1

1 + ๐‘’z๏ฟฝ

๐ฟ = โˆ’๐‘ฆln p โˆ’ 1 โˆ’ y ln(1 โˆ’ p)Need to learn W1, W2 and W3.

Page 60: Backpropagation

Page 61: Backpropagation: Logistic Regression Revisited

โ„’(๐›ฝ) =6โ„’8 ๐›ฝg

8Affine๐‘‹ โ„Ž = ๐›ฝ( + ๐›ฝ.๐‘‹ Activation ๐‘ =

11 + ๐‘’z๏ฟฝ LossFun

๐œ•โ„’๐œ•๐‘

tโ„’t๏ฟฝt๏ฟฝt๏ฟฝ

tโ„’t๏ฟฝt๏ฟฝt๏ฟฝt๏ฟฝt+

๐œ•๐‘๐œ•โ„Ž

= ๐œŽ(โ„Ž)(1 โˆ’ ๐œŽ โ„Ž )๐œ•โ„’๐œ•๐‘

= โˆ’๐‘ฆ1๐‘โˆ’ 1 โˆ’ ๐‘ฆ

11 โˆ’ ๐‘

๐œ•โ„Ž๐œ•๐›ฝ.

= ๐‘‹,๐‘‘โ„’๐‘‘๐›ฝ(

= 1

๐œ•โ„’๐œ•๐›ฝ.

= โˆ’๐‘‹๐œŽ โ„Ž 1 โˆ’ ๐œŽ โ„Ž [๐‘ฆ1๐‘+ 1 โˆ’ ๐‘ฆ

11 โˆ’ ๐‘

]

๐œ•โ„’๐œ•๐›ฝ(

= โˆ’๐œŽ โ„Ž 1 โˆ’ ๐œŽ โ„Ž [๐‘ฆ1๐‘+ 1 โˆ’ ๐‘ฆ

11 โˆ’ ๐‘

]
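These expressions translate directly into code. The sketch below evaluates them for a single observation and compares the result with the algebraically simplified form $X(p - y)$ (a standard identity, not shown on the slide); the input values are placeholders.

import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def gradients(x, y, beta0, beta1):
    # dL/dbeta0 and dL/dbeta1 for one observation, following the chain rule above
    h = beta0 + beta1 * x
    p = sigmoid(h)
    bracket = y / p - (1 - y) / (1 - p)
    dL_dbeta1 = -x * sigmoid(h) * (1 - sigmoid(h)) * bracket
    dL_dbeta0 = -sigmoid(h) * (1 - sigmoid(h)) * bracket
    return dL_dbeta0, dL_dbeta1

x, y, beta0, beta1 = 2.0, 1.0, 0.5, -0.3
g0, g1 = gradients(x, y, beta0, beta1)
p = sigmoid(beta0 + beta1 * x)
print(g1, x * (p - y))   # the two values agree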

Page 62: Backpropagation

1. Derivatives need to be evaluated at some values of $X$, $y$ and $W$.

2. But since we have an expression, we can build a function that takes $X$, $y$, $W$ as input and returns the derivatives, and then we can use gradient descent to update.

3. This approach works well but it does not generalize. For example, if the network is changed, we need to write a new function to evaluate the derivatives.

For example, this network will need a different function for the derivatives:

[Figure: a network with input $X$, two hidden units with weights $W_1$ and $W_2$, combined by $W_3$ into the output $Y$.]

Page 63: Backpropagation

For example, this larger network, with weights $W_1$ through $W_5$, will need yet another function for the derivatives:

[Figure: a network with input $X$, weights $W_1, W_2, W_3, W_4, W_5$, and output $Y$.]

Page 64: Backpropagation. Pavlos game #456

We need to find a formalism to calculate the derivatives of the loss with respect to the weights that is:

1. Flexible enough that adding a node or a layer, or changing something in the network, won't require re-deriving the functional form from scratch.

2. Exact.

3. Computationally efficient.

Hints:

1. Remember we only need to evaluate the derivatives at $X_i$, $y_i$ and the current weights $W$.

2. We should take advantage of the chain rule we learned before.

Page 65: Idea 1: Evaluate the derivative at X={3}, y=1, W=3

Variables, their partial derivatives, and their values at $X = 3$, $y = 1$, $W = 3$:

$\xi_1 = -W^T X = -9$;  $\partial \xi_1/\partial W = -X = -3$;  $d\xi_1/dW = -3$
$\xi_2 = e^{\xi_1} = e^{-9}$;  $\partial \xi_2/\partial \xi_1 = e^{-9}$;  $d\xi_2/dW = -3e^{-9}$
$\xi_3 = 1 + \xi_2 = 1 + e^{-9}$;  $\partial \xi_3/\partial \xi_2 = 1$;  $d\xi_3/dW = -3e^{-9}$
$\xi_4 = 1/\xi_3 = \dfrac{1}{1+e^{-9}} = p$;  $\partial \xi_4/\partial \xi_3 = -\dfrac{1}{(1+e^{-9})^2}$;  $d\xi_4/dW = \dfrac{3e^{-9}}{(1+e^{-9})^2}$
$\xi_5 = \log \xi_4 = \log\dfrac{1}{1+e^{-9}}$;  $\partial \xi_5/\partial \xi_4 = 1/\xi_4 = 1 + e^{-9}$;  $d\xi_5/dW = \dfrac{3e^{-9}}{1+e^{-9}}$
$\mathcal{L}_i^A = -y\,\xi_5 = -\log\dfrac{1}{1+e^{-9}}$;  $\partial \mathcal{L}_i^A/\partial \xi_5 = -y = -1$

$$\frac{\partial \mathcal{L}_i^A}{\partial W} = \frac{\partial \mathcal{L}_i^A}{\partial \xi_5}\frac{\partial \xi_5}{\partial \xi_4}\frac{\partial \xi_4}{\partial \xi_3}\frac{\partial \xi_3}{\partial \xi_2}\frac{\partial \xi_2}{\partial \xi_1}\frac{\partial \xi_1}{\partial W} = -\frac{3e^{-9}}{1+e^{-9}} \approx -0.00037018372$$

Page 66: Basic functions

We still need to derive the derivatives.


Page 67: Basic functions

Notice, though, that these are basic functions that my grandparent can do:

๐œ‰( = ๐‘‹๐œ•๐œ‰(๐œ•๐‘‹

= 1 def x0(x):return X

def derx0():return 1

๐œ‰. = โˆ’๐‘ŠN๐œ‰(๐œ•๐œ‰.๐œ•๐‘Š

= โˆ’๐‘‹ def x1(a,x):return โ€“a*X

def derx1(a,x):return -a

๐œ‰0 = e๏ฟฝ-๐œ•๐œ‰0๐œ•๐œ‰.

= ๐‘’๏ฟฝ- def x2(x):return np.exp(x)

def derx2(x):return np.exp(x)

๐œ‰E = 1 + ๐œ‰0๐œ•๐œ‰E๐œ•๐œ‰0

= 1 def x3(x):return 1+x

def derx3(x):return 1

๐œ‰F =1๐œ‰E

๐œ•๐œ‰F๐œ•๐œ‰E

= โˆ’1๐œ‰E0

def der1(x):return 1/(x)

def derx4(x):return -(1/x)**(2)

๐œ‰๏ฟฝ = log ๐œ‰F๐œ•๐œ‰๏ฟฝ๐œ•๐œ‰F

=1๐œ‰F

def der1(x):return np.log(x)

def derx5(x)return 1/x

โ„’8| = โˆ’๐‘ฆ๐œ‰๏ฟฝ๐œ•โ„’๐œ•๐œ‰๏ฟฝ

= โˆ’๐‘ฆ def der1(y,x):return โ€“y*x

def derL(y):return -y
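A sketch of how these small functions chain together (repeating the definitions so the snippet is self-contained): evaluate the variables forward, then multiply the local derivatives to recover the hand-derived result, using the example values X = 3, y = 1, W = 3 from the earlier slide.

import numpy as np

def x1(w, x):    return -w * x
def derx1(w, x): return -x          # d(xi1)/dW
def x2(v):       return np.exp(v)
def derx2(v):    return np.exp(v)
def x3(v):       return 1 + v
def derx3(v):    return 1
def x4(v):       return 1 / v
def derx4(v):    return -1 / v**2
def x5(v):       return np.log(v)
def derx5(v):    return 1 / v
def lossA(y, v):    return -y * v
def derlossA(y):    return -y

X, y, W = 3.0, 1.0, 3.0
xi1 = x1(W, X); xi2 = x2(xi1); xi3 = x3(xi2); xi4 = x4(xi3); xi5 = x5(xi4)
dL_dW = derlossA(y) * derx5(xi4) * derx4(xi3) * derx3(xi2) * derx2(xi1) * derx1(W, X)
print(lossA(y, xi5), dL_dW)   # loss value and derivative, approximately -0.00037018372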

Page 68: Putting it altogether

1. We specify the network structure

[Figure: a network with input $X$, weights $W_1$ through $W_5$, and output $Y$.]

2. We create the computational graph …

What is a computational graph?

Page 69: Computational Graph

XW๐œ‰( = ๐‘Š

ร—๐œ‰. = ๐‘ŠN๐‘‹๐œ‰.๏ฟฝ=X

๐‘’๐‘ฅ๐‘๐œ‰0 = ๐‘’z๏ฟฝ-๐œ‰0๏ฟฝ = โˆ’๐‘’z๏ฟฝ-

+๐œ‰E = 1 + ๐‘’zuK{

รท๐œ‰F =

11 + ๐‘’zuK{ Log

๐œ‰๏ฟฝ = log1

1 + ๐‘’zuK{

1

-

๐œ‰๏ฟฝ = 1 โˆ’1

1 + ๐‘’zuK{

log

๐œ‰ยก = log(1 โˆ’1

1 + ๐‘’zuK{)

1-y

ร—

๐œ‰ยข = 1 โˆ’ y log(1 โˆ’1

1 + ๐‘’zuK{)

y ร—

๐œ‰๏ฟฝ = ylog(1

1 + ๐‘’zuK{)

+

โˆ’โ„’ = ๐œ‰๏ฟฝ = ylog(1

1 + ๐‘’zuK{) + 1 โˆ’ y log(1 โˆ’

11 + ๐‘’zuK{

)

โˆ’

Computational Graph
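A minimal way to express such a graph in code (a sketch, not the course's implementation): each node stores how to compute its value and its local derivative with respect to its input, and the graph is an ordered list of nodes.

import numpy as np

class Node:
    # one operation in the chain: a value function f and its local derivative df
    def __init__(self, f, df):
        self.f, self.df = f, df

def forward_with_grad(nodes, x0):
    # evaluate the chain and accumulate d(output)/d(x0) by the chain rule
    value, grad = x0, 1.0
    for node in nodes:
        grad = node.df(value) * grad   # local derivative at the node's input, times the running product
        value = node.f(value)
    return value, grad

# the chain xi1 -> xi2 -> xi3 -> xi4 -> xi5 = log(1 / (1 + exp(-xi1)))
graph = [
    Node(lambda v: np.exp(-v), lambda v: -np.exp(-v)),  # xi2 = exp(-xi1)
    Node(lambda v: 1 + v,      lambda v: 1.0),          # xi3 = 1 + xi2
    Node(lambda v: 1 / v,      lambda v: -1 / v**2),    # xi4 = 1 / xi3
    Node(np.log,               lambda v: 1 / v),        # xi5 = log(xi4)
]
value, grad = forward_with_grad(graph, x0=9.0)  # xi1 = W^T X = 9 when W = X = 3
print(value, grad)   # grad is d(xi5)/d(xi1); multiply by d(xi1)/dW = X to get d(xi5)/dW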

Page 70: Putting it altogether

1. We specify the network structure

[Figure: the same network with input $X$, weights $W_1$ through $W_5$, and output $Y$.]

• We create the computational graph.

• At each node of the graph we build two functions: the evaluation of the variable and its partial derivative with respect to the previous variable (as shown in the table three slides back).

• Now we can either go forward or backward depending on the situation. In general, forward is easier to implement and to understand. The difference is clearer when there are multiple nodes per layer.

Page 71: Forward mode: Evaluate the derivative at X={3}, y=1, W=3

Forward mode accumulates the running product $d\xi_i/dW$ alongside the value of each variable as we move through the graph (the same table as on Page 65), ending with

$$\frac{\partial \mathcal{L}_i^A}{\partial W} = -\frac{3e^{-9}}{1+e^{-9}} \approx -0.00037018372$$

Page 72: Backward mode: Evaluate the derivative at X={3}, y=1, W=3

In backward mode we first run the graph forward and store the value of every variable:

$\xi_1 = -W^T X = -9$;  $\partial \xi_1/\partial W = -X = -3$
$\xi_2 = e^{\xi_1} = e^{-9}$;  $\partial \xi_2/\partial \xi_1 = e^{-9}$
$\xi_3 = 1 + \xi_2 = 1 + e^{-9}$;  $\partial \xi_3/\partial \xi_2 = 1$
$\xi_4 = 1/\xi_3 = \dfrac{1}{1+e^{-9}} = p$;  $\partial \xi_4/\partial \xi_3 = -\dfrac{1}{(1+e^{-9})^2}$
$\xi_5 = \log \xi_4 = \log\dfrac{1}{1+e^{-9}}$;  $\partial \xi_5/\partial \xi_4 = 1 + e^{-9}$
$\mathcal{L}_i^A = -y\,\xi_5$;  $\partial \mathcal{L}_i^A/\partial \xi_5 = -y = -1$

Store all these values; the derivative is then assembled backwards from the loss:

$$\frac{\partial \mathcal{L}_i^A}{\partial W} = \frac{\partial \mathcal{L}_i^A}{\partial \xi_5}\frac{\partial \xi_5}{\partial \xi_4}\frac{\partial \xi_4}{\partial \xi_3}\frac{\partial \xi_3}{\partial \xi_2}\frac{\partial \xi_2}{\partial \xi_1}\frac{\partial \xi_1}{\partial W}$$