CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader
Lecture 12: Perceptron and Back Propagation
Outline
1. Review of Classification and Logistic Regression
2. Introduction to Optimization
– Gradient Descent
– Stochastic Gradient Descent
3. Single Neuron Network ("Perceptron")
4. Multi-Layer Perceptron
5. Back Propagation
Classification and Logistic Regression
Classification
Methods centered around modeling and predicting a quantitative response variable (e.g., number of taxi pickups, number of bike rentals) are called regression methods (linear regression, Ridge, LASSO, etc.).
When the response variable is categorical, the problem is no longer called a regression problem but is instead labeled a classification problem.
The goal is to classify each observation into a category (a.k.a. class) defined by Y, based on a set of predictor variables X.
Heart Data
| Age | Sex | ChestPain | RestBP | Chol | Fbs | RestECG | MaxHR | ExAng | Oldpeak | Slope | Ca | Thal | AHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | typical | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | fixed | No |
| 67 | 1 | asymptomatic | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | normal | Yes |
| 67 | 1 | asymptomatic | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | reversable | Yes |
| 37 | 1 | nonanginal | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | normal | No |
| 41 | 0 | nontypical | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | normal | No |

The response variable Y is AHD (Yes/No).
Heart Data: logistic estimation
We'd like to predict whether or not a person has heart disease. And we'd like to make this prediction, for now, based only on MaxHR.
Logistic Regression
Logistic regression addresses the problem of estimating a probability, $P(y=1)$, given an input $X$. The logistic regression model uses a function, called the logistic function, to model $P(y=1)$:

$$P(Y=1) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$
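For concreteness, a small numeric sketch of this function; the coefficients $\beta_0 = -3$ and $\beta_1 = 0.03$ are made-up illustrative values, not estimates from the heart data:

```python
import numpy as np

def p_y1(x, b0=-3.0, b1=0.03):
    # P(Y = 1) = 1 / (1 + exp(-(b0 + b1 * x)))
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

for max_hr in (50, 100, 150, 200):
    print(max_hr, p_y1(max_hr))   # probability rises with x, since b1 > 0
```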
Logistic Regression
As a result, the model predicts $P(y=1)$ with an $S$-shaped curve, which is the general shape of the logistic function.
$\beta_0$ shifts the curve right or left: the midpoint sits at $c = -\frac{\beta_0}{\beta_1}$.
$\beta_1$ controls how steep the $S$-shaped curve is: the distance from $\tfrac12$ to $\approx 1$ (or from $\tfrac12$ to $\approx 0$) is about $\frac{2}{\beta_1}$.
Note: if $\beta_1$ is positive, the predicted $P(y=1)$ goes from zero for small values of $X$ to one for large values of $X$; if $\beta_1$ is negative, the association is reversed.
Logistic Regression
[Figure: logistic curves annotated with the midpoint $-\beta_0/\beta_1$, the width $2/\beta_1$, and the slope $\beta_1/4$ at the midpoint.]
Logistic Regression

$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$

[Figure, over two slides: data and the fitted logistic curve for $P(Y=1)$.]
Estimating the coefficients for Logistic Regression
Find the coefficients that minimize the loss function
$$\mathcal{L}(\beta_0, \beta_1) = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
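As a quick check, a minimal NumPy sketch of this loss (binary cross-entropy); the label and probability values are made up for illustration:

```python
import numpy as np

def cross_entropy(y, p):
    # L(beta0, beta1) = -sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])   # observed classes
p = np.array([0.9, 0.2, 0.6])   # predicted P(y=1) for each observation
print(cross_entropy(y, p))      # ~0.84; lower is better
```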
But what is the idea?
Start with regression or logistic regression:

$$p = f(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4)$$

[Diagram: inputs $x_1, x_2, x_3, x_4$ feed a single node; the coefficients are the weights and the intercept is the bias.]

Classification: $f(X) = \frac{1}{1 + e^{-W^T X}}$
Regression: $f(X) = W^T X$
where $W^T = (w_0, w_1, \dots, w_4) = [\beta_0, \beta_1, \dots, \beta_4]$.
But what is the idea?
Start with all randomly selected weights. Most likely the model will perform horribly; for example, on our heart data it will give us the wrong answer.

[Example: inputs $MaxHR = 200$, $Age = 52$, $Sex = Male$, $Chol = 152$; the model outputs $\hat{y} = 0.9 \Rightarrow$ Yes, but the correct answer is $y =$ No. Bad computer!]
But what is the idea?
Start with all randomly selected weights. Most likely the model will perform horribly; for example, on our heart data it will give us the wrong answer.

[Example: inputs $MaxHR = 170$, $Age = 42$, $Sex = Male$, $Chol = 342$; the model outputs $\hat{y} = 0.4 \Rightarrow$ No, but the correct answer is $y =$ Yes. Bad computer!]
But what is the idea?
• Loss function: takes all of these results, averages them, and tells us how bad or good the computer (i.e., those weights) is.
• Telling the computer how bad or good it is does not help.
• You want to tell it how to change those weights so it gets better.

Loss function: $\mathcal{L}(w_0, w_1, w_2, w_3, w_4)$
For now, let's consider only one weight: $\mathcal{L}(w_1)$.
Minimizing the Loss function
Ideally we want to know the value of $w_1$ that gives the minimum of $\mathcal{L}(w)$.
To find the optimal point of a function $\mathcal{L}(w)$, solve

$$\frac{d\mathcal{L}(w)}{dw} = 0$$

and find the $w$ that satisfies that equation. Sometimes there is no explicit solution.
Minimizing the Loss function
A more flexible method is:
• Start from any point.
• Determine which direction to go to reduce the loss (left or right).
• Specifically, calculate the slope of the function at this point.
• Shift to the right if the slope is negative, or shift to the left if the slope is positive.
• Repeat.
Minimization of the Loss Function
If the step is proportional to the slope, then you avoid overshooting the minimum.
Question: What is the mathematical function that describes the slope?
Question: How do we generalize this to more than one predictor?
Question: What do you think is a good approach for telling the model how to change (what is the step size) to become better?
Minimization of the Loss Function
If the step is proportional to the slope, then you avoid overshooting the minimum.
Question: What is the mathematical function that describes the slope? The derivative.
Question: How do we generalize this to more than one predictor? Take the derivative with respect to each coefficient and do the same sequentially.
Question: What do you think is a good approach for telling the model how to change (what is the step size) to become better?
Letโs play the Pavlos game
We know that we want to go in the opposite direction of the derivative, and we know we want to make a step proportional to the derivative.

Making a step means:
$$w^{\text{new}} = w^{\text{old}} + \text{step}$$

Going in the opposite direction of the derivative means:
$$w^{\text{new}} = w^{\text{old}} - \lambda \frac{d\mathcal{L}}{dw}$$

Changing to more conventional notation:
$$w^{(i+1)} = w^{(i)} - \lambda \frac{d\mathcal{L}}{dw}$$

where $\lambda$ is the learning rate.
Gradient Descent
• A first-order optimization algorithm for finding a minimum of a function.
• It is an iterative method.
• $\mathcal{L}$ decreases in the direction of the negative derivative.
• The size of the step is controlled by the magnitude of the learning rate $\lambda$.

[Figure: loss $\mathcal{L}$ versus weight $w$; steps move downhill, against the sign of the slope, toward the minimum.]

$$w^{(i+1)} = w^{(i)} - \lambda \frac{d\mathcal{L}}{dw}$$
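To make the update rule concrete, here is a minimal sketch of gradient descent in one dimension; the quadratic loss $L(w) = (w - 2)^2$ and all numbers are our own illustrative choices, not from the lecture:

```python
# A minimal 1-D gradient descent sketch (illustrative loss, not the lecture's).
def dloss_dw(w):
    # Derivative of L(w) = (w - 2)**2, whose minimum is at w = 2.
    return 2.0 * (w - 2.0)

w = 10.0                       # start from any point
lam = 0.1                      # learning rate (lambda)
for i in range(100):
    w = w - lam * dloss_dw(w)  # w^(i+1) = w^(i) - lambda * dL/dw
print(w)                       # ~2.0: converged to the minimum
```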
Considerations
โข We still need to derive the derivatives.
• We need to know what the learning rate is, or how to set it.
• We need to avoid local minima.
• Finally, the full likelihood function involves summing up all the individual "errors". Unless you are a statistician, this can be hundreds of thousands of examples.
Derivatives: Memories from middle school
Linear Regression
Minimize the squared-error loss:

$$f = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$$

Setting the derivative with respect to $\beta_0$ to zero:

$$\frac{df}{d\beta_0} = 0 \;\Rightarrow\; 2\sum_i (y_i - \beta_0 - \beta_1 x_i) = 0$$
$$\sum_i y_i - \beta_0 n - \beta_1 \sum_i x_i = 0 \;\Rightarrow\; \beta_0 = \bar{y} - \beta_1 \bar{x}$$

Setting the derivative with respect to $\beta_1$ to zero:

$$\frac{df}{d\beta_1} = 0 \;\Rightarrow\; 2\sum_i (y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0$$
$$-\sum_i x_i y_i + \beta_0 \sum_i x_i + \beta_1 \sum_i x_i^2 = 0$$

Substituting $\beta_0 = \bar{y} - \beta_1 \bar{x}$:

$$-\sum_i x_i y_i + (\bar{y} - \beta_1 \bar{x}) \sum_i x_i + \beta_1 \sum_i x_i^2 = 0$$
$$\beta_1 \left( \sum_i x_i^2 - n\bar{x}^2 \right) = \sum_i x_i y_i - n\bar{x}\bar{y}$$
$$\Rightarrow\; \beta_1 = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$
Logistic Regression Derivatives
Can we do it?
Wolfram Alpha can do it for us!
We need a formalism to deal with these derivatives.
Chain Rule
• Chain rule for computing gradients: for $y = g(x)$ and $z = f(y) = f(g(x))$,
$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$$
• For longer chains, with vector-valued intermediates $\mathbf{y} = g(\mathbf{x})$ and $z = f(\mathbf{y})$:
$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}$$
and, chaining through several intermediate layers,
$$\frac{\partial z}{\partial x_i} = \sum_{j_1} \cdots \sum_{j_m} \frac{\partial z}{\partial y_{j_1}} \cdots \frac{\partial y_{j_m}}{\partial x_i}$$
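As a quick numeric illustration (our own, with arbitrary functions $g(x) = x^2$ and $f(y) = e^{-y}$), the chain-rule product matches a finite-difference estimate of $dz/dx$:

```python
import numpy as np

def g(x): return x ** 2        # y = g(x)
def f(y): return np.exp(-y)    # z = f(y) = f(g(x))

x = 1.5
y = g(x)
dz_dy = -np.exp(-y)            # f'(y)
dy_dx = 2 * x                  # g'(x)
chain = dz_dy * dy_dx          # dz/dx via the chain rule

h = 1e-6                       # finite-difference check of dz/dx
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)
print(chain, numeric)          # the two values agree to ~1e-9
```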
Logistic Regression derivatives
For logistic regression, the negative log of the likelihood is:

$$\mathcal{L} = \sum_i \mathcal{L}_i = -\sum_i \log L_i = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

$$\mathcal{L}_i = -y_i \log \frac{1}{1 + e^{-W^T X}} - (1 - y_i) \log\left(1 - \frac{1}{1 + e^{-W^T X}}\right)$$

To simplify the analysis, let us split it into two parts:
$$\mathcal{L}_i = \mathcal{L}_i^A + \mathcal{L}_i^B$$

So the derivative with respect to $W$ is:
$$\frac{\partial \mathcal{L}}{\partial W} = \sum_i \frac{\partial \mathcal{L}_i}{\partial W} = \sum_i \left( \frac{\partial \mathcal{L}_i^A}{\partial W} + \frac{\partial \mathcal{L}_i^B}{\partial W} \right)$$
For $\mathcal{L}_i^A = -y_i \log \frac{1}{1 + e^{-W^T X}}$, break the computation into elementary steps:

| Variable | Partial derivative |
|---|---|
| $x_1 = -W^T X$ | $\frac{\partial x_1}{\partial W} = -X$ |
| $x_2 = e^{x_1} = e^{-W^T X}$ | $\frac{\partial x_2}{\partial x_1} = e^{x_1} = e^{-W^T X}$ |
| $x_3 = 1 + x_2 = 1 + e^{-W^T X}$ | $\frac{\partial x_3}{\partial x_2} = 1$ |
| $x_4 = \frac{1}{x_3} = \frac{1}{1 + e^{-W^T X}} = p$ | $\frac{\partial x_4}{\partial x_3} = -\frac{1}{x_3^2} = -\left(\frac{1}{1 + e^{-W^T X}}\right)^2$ |
| $x_5 = \log x_4 = \log p = \log \frac{1}{1 + e^{-W^T X}}$ | $\frac{\partial x_5}{\partial x_4} = \frac{1}{x_4} = 1 + e^{-W^T X}$ |
| $\mathcal{L}_i^A = -y\, x_5$ | $\frac{\partial \mathcal{L}_i^A}{\partial x_5} = -y$ |

$$\frac{\partial \mathcal{L}_i^A}{\partial W} = \frac{\partial \mathcal{L}_i^A}{\partial x_5}\,\frac{\partial x_5}{\partial x_4}\,\frac{\partial x_4}{\partial x_3}\,\frac{\partial x_3}{\partial x_2}\,\frac{\partial x_2}{\partial x_1}\,\frac{\partial x_1}{\partial W} = -y\, X\, e^{-W^T X}\, \frac{1}{1 + e^{-W^T X}}$$
For $\mathcal{L}_i^B = -(1 - y_i) \log\left[1 - \frac{1}{1 + e^{-W^T X}}\right]$, the steps $x_1$ through $x_4$ are the same, and then:

| Variable | Partial derivative |
|---|---|
| $x_5 = 1 - x_4 = 1 - \frac{1}{1 + e^{-W^T X}}$ | $\frac{\partial x_5}{\partial x_4} = -1$ |
| $x_6 = \log x_5 = \log(1 - p) = \log\left(1 - \frac{1}{1 + e^{-W^T X}}\right)$ | $\frac{\partial x_6}{\partial x_5} = \frac{1}{x_5} = \frac{1 + e^{-W^T X}}{e^{-W^T X}}$ |
| $\mathcal{L}_i^B = -(1 - y)\, x_6$ | $\frac{\partial \mathcal{L}_i^B}{\partial x_6} = -(1 - y)$ |

$$\frac{\partial \mathcal{L}_i^B}{\partial W} = \frac{\partial \mathcal{L}_i^B}{\partial x_6}\,\frac{\partial x_6}{\partial x_5}\,\frac{\partial x_5}{\partial x_4}\,\frac{\partial x_4}{\partial x_3}\,\frac{\partial x_3}{\partial x_2}\,\frac{\partial x_2}{\partial x_1}\,\frac{\partial x_1}{\partial W} = (1 - y)\, X\, \frac{1}{1 + e^{-W^T X}}$$
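As a sanity check (our addition, not on the slide): adding the two pieces simplifies to $\frac{\partial \mathcal{L}_i}{\partial W} = X\,(p - y)$, which we can verify against a finite-difference estimate:

```python
import numpy as np

def p_of(w, x):
    return 1.0 / (1.0 + np.exp(-w * x))   # scalar W and X for simplicity

def loss(w, x, y):
    p = p_of(w, x)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

w, x, y = 0.5, 2.0, 1.0
analytic = x * (p_of(w, x) - y)           # X (p - y)
h = 1e-6
numeric = (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)
print(analytic, numeric)                  # agree to ~1e-9
```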
Learning Rate
Learning Rate
Trial and Error.
There are many alternative methods that address how to set or adjust the learning rate, using first or second derivatives and/or momentum. To be discussed in the upcoming lectures on neural networks.

– J. Nocedal and S. Wright, "Numerical Optimization", Springer, 1999.
– TLDR: J. Bullinaria, "Learning with Momentum, Conjugate Gradient Learning", 2015.
Local and Global minima
Local vs Global Minima
[Figure, shown over two slides: a loss curve $\mathcal{L}$ versus $w$ with several local minima; depending on where it starts, gradient descent can end up in a local minimum.]
Local vs Global Minima
No guarantee that we get the global minimum.
Question: What would be a good strategy?
Large data
Batch and Stochastic Gradient Descent
Instead of using all the examples for every step, use a subset of them (a batch).
For each iteration $k$, use the following loss function to derive the derivatives.

Full loss:
$$\mathcal{L} = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

Batch loss:
$$\mathcal{L}_k = -\sum_{i \in b_k} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

which is an approximation to the full loss function.
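A minimal sketch of minibatch SGD for this model, with a synthetic 1-D predictor; all names and numbers here are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                    # synthetic predictor
true_b0, true_b1 = 0.5, 2.0
y = (rng.random(1000) < 1 / (1 + np.exp(-(true_b0 + true_b1 * x)))).astype(float)

b0, b1 = 0.0, 0.0                            # initial weights
lam, batch_size = 0.5, 32                    # learning rate and batch size
for k in range(5000):
    idx = rng.choice(len(x), size=batch_size, replace=False)  # the batch b_k
    p = 1 / (1 + np.exp(-(b0 + b1 * x[idx])))
    # Gradients of the batch loss L_k: dL/db0 ~ (p - y), dL/db1 ~ (p - y) x
    b0 -= lam * np.mean(p - y[idx])
    b1 -= lam * np.mean((p - y[idx]) * x[idx])
print(b0, b1)                                # roughly recovers (0.5, 2.0)
```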
Batch and Stochastic Gradient Descent

[Figure, animated over several slides: the full-likelihood loss $\mathcal{L}$ versus $w$ alongside the batch-likelihood approximation; each batch gives a slightly different loss curve, so the descent steps jitter around the full-loss path.]
Artificial Neural Networks (ANN)
Logistic Regression Revisited
Each observation flows through the same pipeline:

$$x_i \xrightarrow{\text{Affine}} h_i = \beta_0 + \beta_1 x_i \xrightarrow{\text{Activation}} p_i = \frac{1}{1 + e^{-h_i}} \xrightarrow{\text{Loss}} \mathcal{L}_i(\beta) = -y_i \ln p_i - (1 - y_i) \ln(1 - p_i)$$

and the total loss sums over observations:
$$\mathcal{L}(\beta) = \sum_i \mathcal{L}_i(\beta)$$
Build our first ANN
The same pipeline, rewritten step by step:
1. Affine $h = \beta_0 + \beta_1 X$, activation $p = \frac{1}{1 + e^{-h}}$, loss function, with $\mathcal{L}(\beta) = \sum_i \mathcal{L}_i(\beta)$.
2. Vector form: affine $h = \beta^T X$, everything else unchanged.
3. Network notation: affine $h = W^T X$, activation $p = \frac{1}{1 + e^{-h}}$, loss function, with $\mathcal{L}(W) = \sum_i \mathcal{L}_i(W)$.
Build our first ANN
$$\text{affine } h = W^T X, \qquad \text{activation } p = \frac{1}{1 + e^{-h}}, \qquad \mathcal{L}(W) = \sum_i \mathcal{L}_i(W)$$

[Diagram: input $X$ feeds a single neuron (affine plus sigmoid activation) producing $p$.]
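A minimal sketch of this single-neuron network in NumPy; the function names, the bias-padding convention, and the numbers are our own:

```python
import numpy as np

def affine(W, X):
    return W @ X                        # h = W^T X

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))    # p = 1 / (1 + e^{-h})

def loss(y, p):
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

X = np.array([1.0, 0.75])              # [1 for the bias, one scaled predictor]
W = np.array([0.1, -0.3])              # randomly chosen weights
p = sigmoid(affine(W, X))
print(p, loss(1.0, p))                 # prediction and its loss for y = 1
```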
Example Using Heart Data
Slightly modified data to illustrate a concept.
Example Using Heart Data

[Figures, over two slides: the modified heart data with the fitted single-neuron output $p$ as a function of $X$.]
Pavlos game #232
[Diagram, built up over three slides: input $X$ feeds two sigmoid units with weights $W_1$ and $W_2$, producing $h_1$ and $h_2$; a third unit with weights $W_3$ combines them.]

$$h = w_{31} h_1 + w_{32} h_2 + w_{30}$$
$$p = \frac{1}{1 + e^{-h}}$$
$$L = -y \ln p - (1 - y) \ln(1 - p)$$

We need to learn $W_1$, $W_2$ and $W_3$.
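A minimal sketch of this two-hidden-unit network; the variable names and the made-up weight values are ours, with each weight vector carrying its bias as the first entry:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2, W3):
    h1 = sigmoid(W1[0] + W1[1] * x)           # first hidden unit
    h2 = sigmoid(W2[0] + W2[1] * x)           # second hidden unit
    h = W3[0] + W3[1] * h1 + W3[2] * h2       # h = w30 + w31*h1 + w32*h2
    return sigmoid(h)                         # p

x, y = 0.7, 1.0                               # one (input, label) pair
W1, W2, W3 = [0.1, 2.0], [-0.4, 1.5], [0.0, 1.0, -1.0]
p = forward(x, W1, W2, W3)
L = -y * np.log(p) - (1 - y) * np.log(1 - p)  # cross-entropy loss
print(p, L)                                   # we need to learn W1, W2, W3
```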
Backpropagation
Backpropagation: Logistic Regression Revisited
The pipeline again: affine $h = \beta_0 + \beta_1 X$, activation $p = \frac{1}{1 + e^{-h}}$, loss $\mathcal{L}(\beta) = \sum_i \mathcal{L}_i(\beta)$.

By the chain rule:
$$\frac{\partial \mathcal{L}}{\partial \beta_1} = \frac{\partial \mathcal{L}}{\partial p}\,\frac{\partial p}{\partial h}\,\frac{\partial h}{\partial \beta_1}, \qquad \frac{\partial \mathcal{L}}{\partial \beta_0} = \frac{\partial \mathcal{L}}{\partial p}\,\frac{\partial p}{\partial h}\,\frac{\partial h}{\partial \beta_0}$$

with the pieces:
$$\frac{\partial p}{\partial h} = p(h)\,(1 - p(h)), \qquad \frac{\partial \mathcal{L}}{\partial p} = -\left[ y\,\frac{1}{p} - (1 - y)\,\frac{1}{1 - p} \right], \qquad \frac{\partial h}{\partial \beta_1} = X, \qquad \frac{\partial h}{\partial \beta_0} = 1$$

Putting it together:
$$\frac{\partial \mathcal{L}}{\partial \beta_1} = -X\, p(h)\,(1 - p(h)) \left[ \frac{y}{p} - \frac{1 - y}{1 - p} \right]$$
$$\frac{\partial \mathcal{L}}{\partial \beta_0} = -\, p(h)\,(1 - p(h)) \left[ \frac{y}{p} - \frac{1 - y}{1 - p} \right]$$
Backpropagation
1. Derivatives need to be evaluated at some values of $X$, $y$ and $W$.
2. But since we have an expression, we can build a function that takes $X$, $y$, $W$ as input and returns the derivatives, and then we can use gradient descent to update.
3. This approach works well, but it does not generalize. For example, if the network is changed, we need to write a new function to evaluate the derivatives.

For example, this network will need a different function for the derivatives:

[Diagram: input $X$ feeds two units with weights $W_1$ and $W_2$; a third unit with weights $W_3$ combines them into $p$.]
Backpropagation
63
1. Derivatives need to be evaluated at some values of X,y and W. 2. But since we have an expression, we can build a function that takes as
input X,y,W and returns the derivatives and then we can use gradient descent to update.
3. This approach works well but it does not generalize. For example if the network is changed, we need to write a new function to evaluate the derivatives.
For example this network will need a different function for the derivatives
๐
W1 ๐E
W2
๐
๐F
๐๏ฟฝ
CS109A, PROTOPAPAS, RADER
Backpropagation. Pavlos game #456
Need to find a formalism to calculate the derivatives of the loss with respect to the weights that is:
1. Flexible enough that adding a node or a layer, or changing something in the network, won't require re-deriving the functional form from scratch.
2. Exact.
3. Computationally efficient.

Hints:
1. Remember we only need to evaluate the derivatives at $X_i$, $y_i$ and the current $W$.
2. We should take advantage of the chain rule we learned before.
Idea 1: Evaluate the derivative at X = {3}, y = 1, W = 3

| Variable | Partial derivative | Value of the variable | Value of the partial derivative | $\partial \mathcal{L}^A / \partial W$ so far |
|---|---|---|---|---|
| $x_1 = -W^T X$ | $\frac{\partial x_1}{\partial W} = -X$ | $-9$ | $-3$ | $-3$ |
| $x_2 = e^{x_1}$ | $\frac{\partial x_2}{\partial x_1} = e^{x_1}$ | $e^{-9}$ | $e^{-9}$ | $-3e^{-9}$ |
| $x_3 = 1 + x_2$ | $\frac{\partial x_3}{\partial x_2} = 1$ | $1 + e^{-9}$ | $1$ | $-3e^{-9}$ |
| $x_4 = \frac{1}{x_3}$ | $\frac{\partial x_4}{\partial x_3} = -\frac{1}{x_3^2}$ | $\frac{1}{1 + e^{-9}}$ | $-\left(\frac{1}{1 + e^{-9}}\right)^2$ | $3e^{-9}\left(\frac{1}{1 + e^{-9}}\right)^2$ |
| $x_5 = \log x_4$ | $\frac{\partial x_5}{\partial x_4} = \frac{1}{x_4}$ | $\log \frac{1}{1 + e^{-9}}$ | $1 + e^{-9}$ | $3e^{-9}\,\frac{1}{1 + e^{-9}}$ |
| $\mathcal{L}^A = -y\, x_5$ | $\frac{\partial \mathcal{L}^A}{\partial x_5} = -y$ | $-\log \frac{1}{1 + e^{-9}}$ | $-1$ | $-3e^{-9}\,\frac{1}{1 + e^{-9}}$ |

$$\frac{\partial \mathcal{L}^A}{\partial W} = \frac{\partial \mathcal{L}^A}{\partial x_5}\,\frac{\partial x_5}{\partial x_4}\,\frac{\partial x_4}{\partial x_3}\,\frac{\partial x_3}{\partial x_2}\,\frac{\partial x_2}{\partial x_1}\,\frac{\partial x_1}{\partial W} \approx -0.00037018372$$
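The same computation as the table, as a forward-mode sketch in NumPy (our own variable names): each line carries a (value, derivative-with-respect-to-W) pair through one elementary step.

```python
import numpy as np

X, y, W = 3.0, 1.0, 3.0

x1, d1 = -W * X,      -X                 # x1 = -W X,   dx1/dW = -X
x2, d2 = np.exp(x1),  np.exp(x1) * d1    # x2 = e^{x1}
x3, d3 = 1 + x2,      1.0 * d2           # x3 = 1 + x2
x4, d4 = 1 / x3,      (-1 / x3**2) * d3  # x4 = 1/x3 = p
x5, d5 = np.log(x4),  (1 / x4) * d4      # x5 = log p
LA, dLA = -y * x5,    -y * d5            # L^A = -y x5

print(dLA)                               # -0.00037018372...
```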
Basic functions
We still need to derive the derivatives!

[The evaluation table from the previous slide is repeated.]
Basic functions
Notice, though, that these are basic functions that my grandparent can do:

```python
import numpy as np

# x0 = X
def x0(x):    return x
def derx0(x): return 1

# x1 = -W * x0, with dx1/dW = -x
def x1(w, x):    return -w * x
def derx1(w, x): return -x

# x2 = e^{x1}
def x2(x):    return np.exp(x)
def derx2(x): return np.exp(x)

# x3 = 1 + x2
def x3(x):    return 1 + x
def derx3(x): return 1

# x4 = 1 / x3
def x4(x):    return 1 / x
def derx4(x): return -(1 / x) ** 2

# x5 = log(x4)
def x5(x):    return np.log(x)
def derx5(x): return 1 / x

# L^A = -y * x5
def LA(y, x):  return -y * x
def derLA(y):  return -y
```
Putting it altogether
1. We specify the network structure
[Diagram: the example network with weights $W_1$ through $W_5$ and output $p$.]

2. We create the computational graph ...

What is a computational graph?
Computational Graph

$$X,\,W \;\xrightarrow{\times}\; x_1 = W^T X \;\xrightarrow{e^{-(\cdot)}}\; x_2 = e^{-W^T X} \;\xrightarrow{+1}\; x_3 = 1 + e^{-W^T X} \;\xrightarrow{1/(\cdot)}\; x_4 = \frac{1}{1 + e^{-W^T X}}$$

From $x_4$ the graph branches:
$$x_5 = \log x_4 = \log \frac{1}{1 + e^{-W^T X}}, \qquad x_9 = y\, x_5 = y \log \frac{1}{1 + e^{-W^T X}}$$
$$x_6 = 1 - x_4, \qquad x_7 = \log x_6 = \log\left(1 - \frac{1}{1 + e^{-W^T X}}\right), \qquad x_8 = (1 - y)\, x_7$$

Adding the two branches gives the (negative) loss:
$$-\mathcal{L} = y \log \frac{1}{1 + e^{-W^T X}} + (1 - y) \log\left(1 - \frac{1}{1 + e^{-W^T X}}\right)$$
Putting it altogether
1. We specify the network structure
[Diagram: the same network with weights $W_1$ through $W_5$.]

• We create the computational graph.
• At each node of the graph we build two functions: the evaluation of the variable and its partial derivative with respect to the previous variable (as shown in the table three slides back).
• Now we can either go forward or backward, depending on the situation. In general, forward is easier to implement and to understand. The difference is clearer when there are multiple nodes per layer.
Forward mode: Evaluate the derivative at X = {3}, y = 1, W = 3

[The same evaluation table as in Idea 1: starting from the input, accumulate each variable's value and the running product of partial derivatives, ending with $\partial \mathcal{L}^A / \partial W \approx -0.00037018372$.]
Backward mode: Evaluate the derivative at X = {3}, y = 1, W = 3

The same elementary steps, but now we first run a forward pass and store each variable's value, then sweep backward multiplying the stored partial derivatives, starting from the loss:

| Variable | Partial derivative | Value of the variable |
|---|---|---|
| $x_1 = -W^T X$ | $\frac{\partial x_1}{\partial W} = -X = -3$ | $-9$ |
| $x_2 = e^{x_1}$ | $\frac{\partial x_2}{\partial x_1} = e^{x_1}$ | $e^{-9}$ |
| $x_3 = 1 + x_2$ | $\frac{\partial x_3}{\partial x_2} = 1$ | $1 + e^{-9}$ |
| $x_4 = \frac{1}{x_3} = p$ | $\frac{\partial x_4}{\partial x_3} = -\frac{1}{x_3^2}$ | $\frac{1}{1 + e^{-9}}$ |
| $x_5 = \log x_4 = \log p$ | $\frac{\partial x_5}{\partial x_4} = \frac{1}{x_4} = 1 + e^{-9}$ | $\log \frac{1}{1 + e^{-9}}$ |
| $\mathcal{L}_i^A = -y\, x_5$ | $\frac{\partial \mathcal{L}_i^A}{\partial x_5} = -y$ | $-\log \frac{1}{1 + e^{-9}}$ |

$$\frac{\partial \mathcal{L}_i^A}{\partial W} = \frac{\partial \mathcal{L}_i^A}{\partial x_5}\,\frac{\partial x_5}{\partial x_4}\,\frac{\partial x_4}{\partial x_3}\,\frac{\partial x_3}{\partial x_2}\,\frac{\partial x_2}{\partial x_1}\,\frac{\partial x_1}{\partial W}$$

Store all these values.
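And the backward-mode counterpart, again as a sketch with our own variable names: a forward pass stores the values, then a single scalar accumulates the derivative from the loss back to $W$.

```python
import numpy as np

X, y, W = 3.0, 1.0, 3.0

# Forward pass: store all the values.
x1 = -W * X
x2 = np.exp(x1)
x3 = 1 + x2
x4 = 1 / x3
x5 = np.log(x4)

# Backward pass: multiply the local derivatives, loss -> W.
g = -y              # dL^A/dx5
g *= 1 / x4         # * dx5/dx4
g *= -1 / x3**2     # * dx4/dx3
g *= 1.0            # * dx3/dx2
g *= x2             # * dx2/dx1  (d e^{x1}/dx1 = e^{x1} = x2)
g *= -X             # * dx1/dW
print(g)            # -0.00037018372..., matching forward mode
```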