Page 1:

Neural Network Part 1: Multiple Layer Neural Networks

CS 760@UW-Madison

Page 2:

Goals for the lecture
you should understand the following concepts:
• perceptrons
• the perceptron training rule
• linear separability
• multilayer neural networks
• stochastic gradient descent
• backpropagation

Page 3:

Neural networks
• a.k.a. artificial neural networks, connectionist models
• inspired by interconnected neurons in biological systems
• simple processing units
  • each unit receives a number of real-valued inputs
  • each unit produces a single real-valued output

Page 4:

Perceptron

Page 5:

Perceptrons [McCulloch & Pitts, 1943; Rosenblatt, 1959; Widrow & Hoff, 1960]

input units: represent the given x

output unit: represents the binary classification

$o = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$

(convention: $x_0 = 1$)

Page 6:

The perceptron training rule

1. randomly initialize weights

2. iterate through training instances until convergence

2a. calculate the output for the given instance

$o = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$

2b. update each weight

$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta\,(y - o)\,x_i$

η is the learning rate; set to a value << 1
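To make the rule concrete, here is a minimal NumPy sketch of the training rule above (the 1/0 threshold output plus the update $w_i \leftarrow w_i + \eta(y - o)x_i$); the function name, initialization range, learning rate, and epoch count are illustrative assumptions, not part of the slides.

```python
import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=100):
    """Perceptron training rule sketch: X is (m, n) features, y in {0, 1}."""
    m, n = X.shape
    w = np.random.uniform(-0.05, 0.05, size=n)    # weights w_1..w_n
    w0 = np.random.uniform(-0.05, 0.05)           # bias weight (x_0 = 1)
    for _ in range(epochs):
        for x_d, y_d in zip(X, y):
            o = 1.0 if w0 + w @ x_d > 0 else 0.0  # threshold output
            # perceptron training rule: w_i <- w_i + eta * (y - o) * x_i
            w += eta * (y_d - o) * x_d
            w0 += eta * (y_d - o)                 # bias update uses x_0 = 1
    return w0, w

# usage sketch: the AND function (linearly separable, see the later slide)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
w0, w = perceptron_train(X, y)
```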

Page 7:

Representational power of perceptrons

perceptrons can represent only linearly separable concepts

[figure: 2-D scatter of + and − examples in the (x1, x2) feature space, separated by a line]

decision boundary given by:

$o = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$

also written as: $\mathbf{w}^T \mathbf{x} > 0$

Page 8:

Representational power of perceptrons

• in previous example, feature space was 2D so decision boundary was a line

• in higher dimensions, decision boundary is a hyperplane

Page 9:

Some linearly separable functions

AND
     x1  x2 | y
  a   0   0 | 0
  b   0   1 | 0
  c   1   0 | 0
  d   1   1 | 1

[figure: points a–d plotted in the (x1, x2) unit square; a line separates d from a, b, c]

OR
     x1  x2 | y
  a   0   0 | 0
  b   0   1 | 1
  c   1   0 | 1
  d   1   1 | 1

[figure: points a–d plotted in the (x1, x2) unit square; a line separates a from b, c, d]

Page 10:

XOR is not linearly separable

     x1  x2 | y
  a   0   0 | 0
  b   0   1 | 1
  c   1   0 | 1
  d   1   1 | 0

[figure: points a–d in the (x1, x2) unit square; no single line separates b, c from a, d]

a multilayer perceptron can represent XOR

[figure: a two-layer network of threshold units with weights of +1 and −1 computing XOR]

assume w0 = 0 for all nodes
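As a sanity check, here is a small NumPy sketch of one two-layer threshold network that computes XOR; the particular weight assignment is an assumption consistent with the ±1 weights in the slide figure (and with w0 = 0 for all nodes).

```python
import numpy as np

def step(z):
    """Threshold unit: 1 if z > 0 else 0 (all biases w0 = 0, as on the slide)."""
    return (z > 0).astype(float)

# One weight assignment consistent with the ±1 weights pictured (an assumption):
# hidden unit h1 fires on (1, 0), hidden unit h2 fires on (0, 1),
# and the output unit fires if either hidden unit fires.
W_hidden = np.array([[1.0, -1.0],    # h1 = step( x1 - x2)
                     [-1.0, 1.0]])   # h2 = step(-x1 + x2)
w_out = np.array([1.0, 1.0])         # o  = step(h1 + h2)

def xor_net(x):
    h = step(W_hidden @ x)
    return step(w_out @ h)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_net(np.array(x, dtype=float)))   # prints 0, 1, 1, 0
```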

Page 11:

Multiple Layer Neural Networks

Page 12:

Example multilayer neural network

input: two features from spectral analysis of a spoken sound

output: vowel sound occurring in the context “h__d”

figure from Huang & Lippmann, NIPS 1988

input units

hidden units

output units

Page 13:

Decision regions of a multilayer neural network

input: two features from spectral analysis of a spoken sound

output: vowel sound occurring in the context “h__d”

figure from Huang & Lippmann, NIPS 1988

Page 14:

Components

• Representations:
  • Input
  • Hidden variables

• Layers/weights:
  • Hidden layers
  • Output layer

Page 15:

Components

[figure: a network with input $x = h_0$, hidden variables $h_1, h_2, \dots, h_L$, and output $y = h_{L+1}$; the first layer maps $h_0$ to $h_1$, the output layer maps $h_L$ to $h_{L+1}$]

An $(L+1)$-layer network

Page 16:

A three-layer neural network

[figure: input $x$, hidden variables $h_1$ (first layer) and $h_2$ (second layer), and output $y$ produced by the third layer (the output layer)]

Page 17:

Input

• Represented as a vector

• Sometimes requires some preprocessing, e.g.,
  • subtract the mean
  • normalize to [-1, 1]

Page 18:

Input (feature) encoding for neural networks

nominal features are usually represented using a 1-of-k encoding

ordinal features can be represented using a thermometer encoding

real-valued features can be represented using individual input units (we may want to scale/normalize them first though)

1-of-k encoding example:
A = (1, 0, 0, 0)   C = (0, 1, 0, 0)   G = (0, 0, 1, 0)   T = (0, 0, 0, 1)

thermometer encoding example:
small = (1, 0, 0)   medium = (1, 1, 0)   large = (1, 1, 1)
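A minimal sketch of both encodings above; the function names are illustrative, not from the slides.

```python
import numpy as np

def one_hot(value, categories):
    """1-of-k encoding for a nominal feature."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def thermometer(value, ordered_levels):
    """Thermometer encoding for an ordinal feature: 1s up to and including the level."""
    vec = np.zeros(len(ordered_levels))
    vec[: ordered_levels.index(value) + 1] = 1.0
    return vec

print(one_hot("G", ["A", "C", "G", "T"]))                   # [0. 0. 1. 0.]
print(thermometer("medium", ["small", "medium", "large"]))  # [1. 1. 0.]
```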

Page 19:

Output layers

• Regression: $y = w^T h + b$
  • Linear units: no nonlinearity

[figure: output layer with a single unit producing $y$]

Page 20:

Output layers

• Multi-dimensional regression: $y = W^T h + b$
  • Linear units: no nonlinearity

[figure: output layer with multiple units producing the vector $y$]

Page 21:

Output layers

• Binary classification: $y = \sigma(w^T h + b)$
  • Corresponds to using logistic regression on $h$

[figure: output layer with a single sigmoid unit producing $y$]

Page 22:

Output layers

• Multi-class classification:
  • $y = \mathrm{softmax}(z)$ where $z = W^T h + b$
  • Corresponds to using multi-class logistic regression on $h$

[figure: output layer computing $z$ and then $y = \mathrm{softmax}(z)$]
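A minimal sketch of this output layer; the layer sizes and random values are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

# illustrative shapes (an assumption): 4 hidden units, 3 classes
rng = np.random.default_rng(0)
h = rng.normal(size=4)          # hidden representation h
W = rng.normal(size=(4, 3))     # weights, one column per class
b = np.zeros(3)

z = W.T @ h + b                 # z = W^T h + b
y = softmax(z)                  # class probabilities, sums to 1
```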

Page 23:

Hidden layers

• A neuron takes a weighted linear combination of the previous representation layer
• and outputs one value for the next layer

[figure: a neuron connecting layer $h_i$ to layer $h_{i+1}$]

Page 24:

Hidden layers

• $a = r(w^T x + b)$

• Typical activation functions $r$:
  • Threshold: $t(z) = \mathbb{I}[z \ge 0]$
  • Sigmoid: $\sigma(z) = 1/(1 + \exp(-z))$
  • Tanh: $\tanh(z) = 2\sigma(2z) - 1$

[figure: input $x$, activation $r(\cdot)$, output $a$]
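A minimal sketch of the three activation functions listed above, including a numerical check of the $\tanh(z) = 2\sigma(2z) - 1$ identity; function names are illustrative.

```python
import numpy as np

def threshold(z):
    return (z >= 0).astype(float)        # I[z >= 0]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigma(z)

def tanh_via_sigmoid(z):
    return 2.0 * sigmoid(2.0 * z) - 1.0  # tanh(z) = 2*sigma(2z) - 1

z = np.linspace(-3, 3, 7)
print(np.allclose(tanh_via_sigmoid(z), np.tanh(z)))  # True: the identity on the slide
```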

Page 25:

Hidden layers

• Problem: saturation

[figure: a sigmoid activation $r(\cdot)$; in its flat (saturated) regions the gradient is too small]

Figure borrowed from Pattern Recognition and Machine Learning, Bishop

Page 26:

Hidden layers

• Activation function ReLU (rectified linear unit)
  • $\mathrm{ReLU}(z) = \max\{z, 0\}$

Figure from Deep learning, by Goodfellow, Bengio, Courville.

Page 27:

Hidden layers

• Activation function ReLU (rectified linear unit)
  • $\mathrm{ReLU}(z) = \max\{z, 0\}$

[figure: ReLU; gradient 0 for z < 0, gradient 1 for z > 0]

Page 28:

Hidden layers

• Generalizations of ReLU: $\mathrm{gReLU}(z) = \max\{z, 0\} + \alpha \min\{z, 0\}$
  • Leaky-ReLU$(z) = \max\{z, 0\} + \alpha \min\{z, 0\}$ for a constant $\alpha \in (0, 1)$
  • Parametric-ReLU$(z)$: $\alpha$ learnable

[figure: gReLU(z) plotted against z]
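A minimal sketch of this ReLU family; the choice $\alpha = 0.01$ for the leaky variant is an assumed constant, not from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)                       # max{z, 0}

def g_relu(z, alpha):
    return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

leaky_relu = lambda z: g_relu(z, alpha=0.01)        # alpha fixed (0.01 is an assumed choice)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))         # [0.    0.    0.  1.5]
print(leaky_relu(z))   # [-0.02 -0.005 0.  1.5]
```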

Page 29:

Training Neural Networks with Gradient Descent:

Backpropagation

Page 30:

Learning in multilayer networks
• work on neural nets fizzled in the 1960s
  • single-layer networks had representational limitations (linear separability)
  • no effective methods for training multilayer networks
• revived again with the invention of the backpropagation method [Rumelhart & McClelland, 1986; also Werbos, 1975]
  • key insight: require the neural network to be differentiable; use gradient descent
  • how to determine the error signal for hidden units?

Page 31:

Learning in multilayer networks
• learning techniques nowadays
  • random initialization of the weights
  • stochastic gradient descent (can add momentum)
  • regularization techniques
    • norm constraint
    • dropout
    • batch normalization
    • data augmentation
    • early stopping
    • pretraining
    • …

Page 32:

Gradient descent in weight space

figure from Cho & Chow, Neurocomputing 1999

Given a training set we can specify an error measure that is a function of our weight vector w

This error measure defines a surface over the hypothesis (i.e. weight) space

[figure: error surface over the weight space (w1, w2)]

$D = \{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$

$E(\mathbf{w}) = \frac{1}{2}\sum_{d \in D} \left(y^{(d)} - o^{(d)}\right)^2$

Page 33:

Gradient descent in weight space

[figure: error surface over weight space (w1, w2, Error)]

on each iteration
• current weights define a point in this space
• find the direction in which the error surface descends most steeply
• take a step (i.e. update weights) in that direction

gradient descent is an iterative process aimed at finding a minimum in the error surface

Page 34:

Gradient descent in weight space

[figure: error surface over weight space (w1, w2, Error)]

calculate the gradient of E:

$\nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]$

take a step in the opposite direction:

$\Delta \mathbf{w} = -\eta \nabla E(\mathbf{w})$

Page 35:

Batch neural network training

given: network structure and a training set $D = \{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$

initialize all weights in w to small random numbers

until stopping criteria met do

    initialize the error $E \leftarrow 0$

    for each (x(d), y(d)) in the training set

        input x(d) to the network and compute output o(d)

        increment the error $E \leftarrow E + \frac{1}{2}\left(y^{(d)} - o^{(d)}\right)^2$

    calculate the gradient $\nabla E(\mathbf{w})$

    update the weights: $\mathbf{w} \leftarrow \mathbf{w} + \Delta\mathbf{w}$, where $\Delta\mathbf{w} = -\eta \nabla E(\mathbf{w})$

Page 36:

Online vs. batch training

• Standard gradient descent (batch training): calculates error gradient for the entire training set, before taking a step in weight space

• Stochastic gradient descent (online training): calculates error gradient for a single instance, then takes a step in weight space

  • much faster convergence
  • less susceptible to local minima

Page 37:

Online neural network training (stochastic gradient descent)

given: network structure and a training set $D = \{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$

initialize all weights in w to small random numbers

until stopping criteria met do

    for each (x(d), y(d)) in the training set

        input x(d) to the network and compute output o(d)

        calculate the error $E^{(d)}(\mathbf{w}) = \frac{1}{2}\left(y^{(d)} - o^{(d)}\right)^2$

        calculate the gradient $\nabla E^{(d)}(\mathbf{w}) = \left[ \frac{\partial E^{(d)}}{\partial w_0}, \frac{\partial E^{(d)}}{\partial w_1}, \dots, \frac{\partial E^{(d)}}{\partial w_n} \right]$

        update the weights: $\mathbf{w} \leftarrow \mathbf{w} + \Delta\mathbf{w}$, where $\Delta\mathbf{w} = -\eta \nabla E^{(d)}(\mathbf{w})$
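To contrast the two procedures, here is a minimal NumPy sketch of one batch epoch and one online (stochastic) epoch for a single linear output unit; the gradient used, $\partial E^{(d)}/\partial w_i = -(y^{(d)} - o^{(d)})\,x_i^{(d)}$, is the one derived on the following slides, and the function names are illustrative assumptions.

```python
import numpy as np

def predict(w0, w, x):
    return w0 + w @ x                       # linear output unit: o = w0 + sum_i w_i x_i

def sgd_epoch(w0, w, X, y, eta):
    """Online (stochastic) training: one weight update per training instance."""
    for x_d, y_d in zip(X, y):
        o = predict(w0, w, x_d)
        # dE^(d)/dw_i = -(y - o) x_i, so the step -eta * grad is +eta * (y - o) * x_i
        w = w + eta * (y_d - o) * x_d
        w0 = w0 + eta * (y_d - o)
    return w0, w

def batch_epoch(w0, w, X, y, eta):
    """Batch training: accumulate the gradient over the whole set, then take one step."""
    grad_w = np.zeros_like(w)
    grad_w0 = 0.0
    for x_d, y_d in zip(X, y):
        o = predict(w0, w, x_d)
        grad_w += -(y_d - o) * x_d
        grad_w0 += -(y_d - o)
    return w0 - eta * grad_w0, w - eta * grad_w
```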

Page 38:

Taking derivatives for one neuron

recall the chain rule from calculus:

$y = f(u), \quad u = g(x) \quad\Rightarrow\quad \frac{\partial y}{\partial x} = \frac{\partial y}{\partial u}\,\frac{\partial u}{\partial x}$

we'll make use of this as follows:

$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial o}\,\frac{\partial o}{\partial net}\,\frac{\partial net}{\partial w_i}$

where $net = w_0 + \sum_{i=1}^{n} w_i x_i$

[figure: a single neuron computing net and output o]

Page 39:

Gradient descent: simple case

Consider a simple case of a network with one linear output unit and no hidden units:

$o^{(d)} = w_0 + \sum_{i=1}^{n} w_i x_i^{(d)}$

let's learn the $w_i$'s that minimize the squared error

$E(\mathbf{w}) = \frac{1}{2}\sum_{d \in D} \left(y^{(d)} - o^{(d)}\right)^2$

batch case:
$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2}\sum_{d \in D} \left(y^{(d)} - o^{(d)}\right)^2$

online case:
$\frac{\partial E^{(d)}}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2}\left(y^{(d)} - o^{(d)}\right)^2$

Page 40:

Stochastic gradient descent: simple case

let's focus on the online case (stochastic gradient descent):

$\frac{\partial E^{(d)}}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2}\left(y^{(d)} - o^{(d)}\right)^2$

$= \left(y^{(d)} - o^{(d)}\right) \frac{\partial}{\partial w_i}\left(y^{(d)} - o^{(d)}\right)$

$= \left(y^{(d)} - o^{(d)}\right) \left(-\frac{\partial o^{(d)}}{\partial w_i}\right)$

$= -\left(y^{(d)} - o^{(d)}\right) \frac{\partial o^{(d)}}{\partial net^{(d)}}\,\frac{\partial net^{(d)}}{\partial w_i} = -\left(y^{(d)} - o^{(d)}\right) \frac{\partial net^{(d)}}{\partial w_i}$

$= -\left(y^{(d)} - o^{(d)}\right) x_i^{(d)}$

Page 41:

Gradient descent with a sigmoid

Now let's consider the case in which we have a sigmoid output unit and no hidden units:

$net^{(d)} = w_0 + \sum_{i=1}^{n} w_i x_i^{(d)}, \qquad o^{(d)} = \frac{1}{1 + e^{-net^{(d)}}}$

useful property:

$\frac{\partial o^{(d)}}{\partial net^{(d)}} = o^{(d)}\left(1 - o^{(d)}\right)$

Page 42:

Stochastic GD with sigmoid output unit

$\frac{\partial E^{(d)}}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2}\left(y^{(d)} - o^{(d)}\right)^2$

$= \left(y^{(d)} - o^{(d)}\right) \frac{\partial}{\partial w_i}\left(y^{(d)} - o^{(d)}\right)$

$= \left(y^{(d)} - o^{(d)}\right) \left(-\frac{\partial o^{(d)}}{\partial w_i}\right)$

$= -\left(y^{(d)} - o^{(d)}\right) \frac{\partial o^{(d)}}{\partial net^{(d)}}\,\frac{\partial net^{(d)}}{\partial w_i}$

$= -\left(y^{(d)} - o^{(d)}\right) o^{(d)}\left(1 - o^{(d)}\right) \frac{\partial net^{(d)}}{\partial w_i}$

$= -\left(y^{(d)} - o^{(d)}\right) o^{(d)}\left(1 - o^{(d)}\right) x_i^{(d)}$
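Here is a minimal sketch of one stochastic gradient step for a single sigmoid output unit, using the gradient just derived; the function name and learning rate are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step_sigmoid(w0, w, x, y, eta):
    """One SGD step for a single sigmoid output unit with squared error."""
    net = w0 + w @ x
    o = sigmoid(net)
    # dE/dw_i = -(y - o) * o * (1 - o) * x_i  (the expression derived above)
    delta = (y - o) * o * (1 - o)
    w = w + eta * delta * x
    w0 = w0 + eta * delta            # x_0 = 1 for the bias weight
    return w0, w
```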

Page 43:

Backpropagation

• now we’ve covered how to do gradient descent for single neurons (i.e., single-layer networks) with• linear output units• sigmoid output units

• how can we calculate for every weight in a multilayer network?

backpropagate errors from the output units to the hidden units

Page 44:

Backpropagation notation

let's consider the online case, but drop the (d) superscripts for simplicity
let's assume hidden and output units have sigmoid activation

we'll use
• subscripts on y, o, net to indicate which unit they refer to
• subscripts on w to indicate the unit a weight emanates from and goes to

[figure: weight $w_{ji}$ from unit i to unit j]

Page 45:

Backpropagation

each weight is changed by

$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = -\eta \frac{\partial E}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}} = \eta\, \delta_j o_i$

where $\delta_j = -\frac{\partial E}{\partial net_j}$

($o_i$ is $x_i$ if $i$ is an input unit)

[figure: weight $w_{ji}$ from unit i to unit j]

Page 46:

Backpropagation

each weight is changed by

$\Delta w_{ji} = \eta\, \delta_j o_i \quad\text{where}\quad \delta_j = -\frac{\partial E}{\partial net_j}$

Case 1: if j is an output sigmoid unit

$\delta_j = o_j (1 - o_j)(y_j - o_j)$

same as a single neuron with sigmoid output

Page 47:

Backpropagation

each weight is changed by

Case 2: if j is a hidden unit

$\Delta w_{ji} = \eta\, \delta_j o_i \quad\text{where}\quad \delta_j = -\frac{\partial E}{\partial net_j}$

$\delta_j = -\sum_k \frac{\partial E}{\partial net_k}\,\frac{\partial net_k}{\partial o_j}\,\frac{\partial o_j}{\partial net_j} = o_j (1 - o_j) \sum_k \delta_k w_{kj}$

(the sum is over all downstream neurons k: a sum of backpropagated contributions to the error)

[figure: hidden unit j feeding downstream units k]
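A tiny sketch of the hidden-unit delta $\delta_j = o_j(1-o_j)\sum_k \delta_k w_{kj}$; the particular numbers are illustrative assumptions, not from the slides.

```python
import numpy as np

def hidden_delta(o_j, deltas_downstream, w_downstream):
    """delta_j for a sigmoid hidden unit j: o_j (1 - o_j) * sum_k delta_k * w_kj."""
    return o_j * (1.0 - o_j) * np.dot(deltas_downstream, w_downstream)

# illustrative numbers (assumed)
o_j = 0.7                                   # activation of hidden unit j
deltas_k = np.array([0.1, -0.05])           # deltas of the two downstream units
w_kj = np.array([0.4, -0.6])                # weights from j to those units
print(hidden_delta(o_j, deltas_k, w_kj))    # 0.7 * 0.3 * (0.04 + 0.03) = 0.0147
```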

Page 48:

Backpropagation illustrated

1. calculate the error of the output units

$\delta_j = o_j (1 - o_j)(y_j - o_j)$

2. determine updates for weights going to output units

$\Delta w_{ji} = \eta\, \delta_j o_i$

Page 49:

Backpropagation illustrated

3. calculate the error for hidden units

$\delta_j = o_j (1 - o_j) \sum_k \delta_k w_{kj}$

4. determine updates for weights to hidden units using hidden-unit errors

$\Delta w_{ji} = \eta\, \delta_j o_i$

Page 50:

More Illustration of Backpropagation

Page 51:

Learning in neural network

• Again we will minimize the error ($K$ outputs):

$E = \frac{1}{2}\sum_{x \in D} E_x, \qquad E_x = \|y - a\|^2 = \sum_{c=1}^{K} (a_c - y_c)^2$

• $x$: one training point in the training set $D$
• $a_c$: the $c$-th output for the training point $x$
• $y_c$: the $c$-th element of the label indicator vector for $x$

[figure: network with inputs $x_1, x_2$ and outputs $a_1, \dots, a_K$; the label indicator vector is $y = (1, 0, \dots, 0, 0)^T$]

Page 52:

Learning in neural network

• Again we will minimize the error ($K$ outputs):

$E = \frac{1}{2}\sum_{x \in D} E_x, \qquad E_x = \|y - a\|^2 = \sum_{c=1}^{K} (a_c - y_c)^2$

• $x$: one training point in the training set $D$
• $a_c$: the $c$-th output for the training point $x$
• $y_c$: the $c$-th element of the label indicator vector for $x$
• Our variables are all the weights $w$ on all the edges

Apparent difficulty: we don't know the 'correct' output of hidden units

Page 53:

Learning in neural network

• Again we will minimize the error ($K$ outputs):

$E = \frac{1}{2}\sum_{x \in D} E_x, \qquad E_x = \|y - a\|^2 = \sum_{c=1}^{K} (a_c - y_c)^2$

• $x$: one training point in the training set $D$
• $a_c$: the $c$-th output for the training point $x$
• $y_c$: the $c$-th element of the label indicator vector for $x$
• Our variables are all the weights $w$ on all the edges

Apparent difficulty: we don't know the 'correct' output of hidden units

It turns out to be OK: we can still do gradient descent. The trick you need is the chain rule.
The algorithm is known as back-propagation.

Page 54:

Gradient (on one data point)

[figure: a 4-layer network — inputs $x_1, x_2$ in Layer (1), hidden Layers (2) and (3), outputs in Layer (4) — with error $E_x$; the weight $w_{11}^{(4)}$ feeds the first unit of Layer (4)]

want to compute $\frac{\partial E_x}{\partial w_{11}^{(4)}}$

Page 55:

Gradient (on one data point)

[figure: the same 4-layer network; the error $E_x = \|y - a\|^2$ depends on the outputs $a_1, a_2$; $w_{11}^{(4)}$ feeds output unit $a_1$]

Page 56:

Gradient (on one data point)

[figure: the same network; output unit $a_1 = g\!\left(z_1^{(4)}\right)$ where $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, and $E_x = \|y - a\|^2$]

Page 57:

Gradient (on one data point)

[figure: the same network; $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, $a_1 = g\!\left(z_1^{(4)}\right)$, $E_x = \|y - a\|^2$]

Page 58:

Gradient (on one data point)

[figure: the same network; $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, $a_1 = g\!\left(z_1^{(4)}\right)$, $E_x = \|y - a\|^2$]

By the Chain Rule:

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = \frac{\partial E_x}{\partial a_1}\,\frac{\partial a_1}{\partial z_1^{(4)}}\,\frac{\partial z_1^{(4)}}{\partial w_{11}^{(4)}}$

$\frac{\partial E_x}{\partial a_1} = 2(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$

Page 59:

Gradient (on one data point)

[figure: the same network; $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, $a_1 = g\!\left(z_1^{(4)}\right)$, $E_x = \|y - a\|^2$]

By the Chain Rule:

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, g'\!\left(z_1^{(4)}\right) \frac{\partial z_1^{(4)}}{\partial w_{11}^{(4)}}$

$\frac{\partial E_x}{\partial a_1} = 2(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$

Page 60:

Gradient (on one data point)

[figure: the same network; $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, $a_1 = g\!\left(z_1^{(4)}\right)$, $E_x = \|y - a\|^2$]

By the Chain Rule:

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, g'\!\left(z_1^{(4)}\right) a_1^{(3)}$

$\frac{\partial E_x}{\partial a_1} = 2(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$

Page 61:

Gradient (on one data point)

[figure: the same network; $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, $a_1 = g\!\left(z_1^{(4)}\right)$, $E_x = \|y - a\|^2$]

By the Chain Rule:

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, g\!\left(z_1^{(4)}\right)\left(1 - g\!\left(z_1^{(4)}\right)\right) a_1^{(3)}$

$\frac{\partial E_x}{\partial a_1} = 2(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$

Page 62:

Gradient (on one data point)

[figure: the same network; $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, $a_1 = g\!\left(z_1^{(4)}\right)$, $E_x = \|y - a\|^2$]

By the Chain Rule:

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, a_1 (1 - a_1)\, a_1^{(3)}$

$\frac{\partial E_x}{\partial a_1} = 2(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$

Page 63:

Gradient (on one data point)

[figure: the same network; $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, $a_1 = g\!\left(z_1^{(4)}\right)$, $E_x = \|y - a\|^2$]

By the Chain Rule:

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, a_1 (1 - a_1)\, a_1^{(3)}$

$\frac{\partial E_x}{\partial a_1} = 2(a_1 - y_1), \qquad \frac{\partial a_1}{\partial z_1^{(4)}} = g'\!\left(z_1^{(4)}\right)$

Can be computed from the network activations

Page 64:

Backpropagation

[figure: the same network; $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, $E_x = \|y - a\|^2$]

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, a_1 (1 - a_1)\, a_1^{(3)}$

By the Chain Rule:

$\frac{\partial E_x}{\partial z_1^{(4)}} = 2(a_1 - y_1)\, g'\!\left(z_1^{(4)}\right)$

Page 65:

Backpropagation

[figure: the same network; $z_1^{(4)} = w_{11}^{(4)} a_1^{(3)} + w_{12}^{(4)} a_2^{(3)}$, $E_x = \|y - a\|^2$]

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = 2(a_1 - y_1)\, a_1 (1 - a_1)\, a_1^{(3)}$

By the Chain Rule:

$\delta_1^{(4)} = \frac{\partial E_x}{\partial z_1^{(4)}} = 2(a_1 - y_1)\, g'\!\left(z_1^{(4)}\right)$

Page 66:

Backpropagation

[figure: the same network, highlighting $\delta_1^{(4)}$ and the activation $a_1^{(3)}$]

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = \delta_1^{(4)} a_1^{(3)}$

By the Chain Rule:

$\delta_1^{(4)} = \frac{\partial E_x}{\partial z_1^{(4)}} = 2(a_1 - y_1)\, g'\!\left(z_1^{(4)}\right)$

Page 67:

Backpropagation

[figure: the same network, highlighting $\delta_1^{(4)}$ and the activations $a_1^{(3)}, a_2^{(3)}$]

$\frac{\partial E_x}{\partial w_{11}^{(4)}} = \delta_1^{(4)} a_1^{(3)}, \qquad \frac{\partial E_x}{\partial w_{12}^{(4)}} = \delta_1^{(4)} a_2^{(3)}$

By the Chain Rule:

$\delta_1^{(4)} = \frac{\partial E_x}{\partial z_1^{(4)}} = 2(a_1 - y_1)\, g'\!\left(z_1^{(4)}\right)$

Page 68:

Backpropagation

[figure: the same network, now highlighting the second output unit $a_2$ and its weights $w_{21}^{(4)}, w_{22}^{(4)}$]

$\frac{\partial E_x}{\partial w_{21}^{(4)}} = \delta_2^{(4)} a_1^{(3)}, \qquad \frac{\partial E_x}{\partial w_{22}^{(4)}} = \delta_2^{(4)} a_2^{(3)}$

By the Chain Rule:

$\delta_2^{(4)} = \frac{\partial E_x}{\partial z_2^{(4)}} = 2(a_2 - y_2)\, g'\!\left(z_2^{(4)}\right)$

Page 69:

Backpropagation

[figure: the same network, with a $\delta_j^{(l)}$ attached to every neuron in Layers (2), (3), and (4)]

Thus, for any weight in the network:

$\frac{\partial E_x}{\partial w_{jk}^{(l)}} = \delta_j^{(l)} a_k^{(l-1)}$

$\delta_j^{(l)}$: $\delta$ of the $j$-th neuron in Layer $l$
$a_k^{(l-1)}$: activation of the $k$-th neuron in Layer $l-1$
$w_{jk}^{(l)}$: weight from the $k$-th neuron in Layer $l-1$ to the $j$-th neuron in Layer $l$
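In vectorized form, the per-example gradient for all the weights of one layer is just an outer product of that layer's deltas with the previous layer's activations; a small NumPy sketch (the particular numbers and shapes are assumptions):

```python
import numpy as np

# dE_x/dW^(l)[j, k] = delta_j^(l) * a_k^(l-1)  for every weight of layer l at once
delta_l = np.array([0.2, -0.1])        # deltas of the neurons in layer l (assumed values)
a_prev = np.array([0.5, 0.9, 0.1])     # activations of layer l-1 (assumed values)
grad_W = np.outer(delta_l, a_prev)     # grad_W[j, k] = delta_l[j] * a_prev[k]
print(grad_W.shape)                    # (2, 3): one entry per weight w_jk
```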

Page 70:

Exercise

[figure: the same network, with a $\delta_j^{(l)}$ attached to every neuron]

Show that for any bias in the network:

$\frac{\partial E_x}{\partial b_j^{(l)}} = \delta_j^{(l)}$

$\delta_j^{(l)}$: $\delta$ of the $j$-th neuron in Layer $l$
$b_j^{(l)}$: bias for the $j$-th neuron in Layer $l$, i.e., $z_j^{(l)} = \sum_k w_{jk}^{(l)} a_k^{(l-1)} + b_j^{(l)}$

Page 71:

Backpropagation of δ

[figure: the same network, with a $\delta_j^{(l)}$ attached to every neuron]

Thus, for any neuron in the network:

$\delta_j^{(l)} = \left( \sum_k \delta_k^{(l+1)} w_{kj}^{(l+1)} \right) g'\!\left(z_j^{(l)}\right)$

$\delta_j^{(l)}$: $\delta$ of the $j$-th neuron in Layer $l$
$\delta_k^{(l+1)}$: $\delta$ of the $k$-th neuron in Layer $l+1$
$g'\!\left(z_j^{(l)}\right)$: derivative of the $j$-th neuron in Layer $l$ w.r.t. its linear combination input
$w_{kj}^{(l+1)}$: weight from the $j$-th neuron in Layer $l$ to the $k$-th neuron in Layer $l+1$

Page 72:

Gradient descent with Backpropagation

1. Initialize Network with Random Weights and Biases

2. For each Training Image:

a. Compute Activations for the Entire Network

b. Compute δ for Neurons in the Output Layer using Network Activation and Desired Activation

$\delta_j^{(L)} = 2\,(a_j - y_j)\, a_j (1 - a_j)$

c. Compute δ for all Neurons in the previous Layers

$\delta_j^{(l)} = \left( \sum_k \delta_k^{(l+1)} w_{kj}^{(l+1)} \right) a_j^{(l)} \left(1 - a_j^{(l)}\right)$

d. Compute Gradient of Cost w.r.t. each Weight and Bias for the Training Image using δ

$\frac{\partial E_x}{\partial w_{jk}^{(l)}} = \delta_j^{(l)} a_k^{(l-1)}, \qquad \frac{\partial E_x}{\partial b_j^{(l)}} = \delta_j^{(l)}$

Page 73:

Gradient descent with Backpropagation

3. Average the Gradient w.r.t. each Weight and Bias over the Entire Training Set

$\frac{\partial E}{\partial w_{jk}^{(l)}} = \frac{1}{n}\sum_{x} \frac{\partial E_x}{\partial w_{jk}^{(l)}}, \qquad \frac{\partial E}{\partial b_j^{(l)}} = \frac{1}{n}\sum_{x} \frac{\partial E_x}{\partial b_j^{(l)}}$

4. Update the Weights and Biases using Gradient Descent

$w_{jk}^{(l)} \leftarrow w_{jk}^{(l)} - \eta \frac{\partial E}{\partial w_{jk}^{(l)}}, \qquad b_j^{(l)} \leftarrow b_j^{(l)} - \eta \frac{\partial E}{\partial b_j^{(l)}}$

5. Repeat Steps 2–4 till the Cost reduces below an acceptable level
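Putting steps 1–5 together, here is a minimal NumPy sketch for a network with one hidden layer of sigmoid units and squared error; the layer sizes, learning rate, epoch count, and the XOR data in the usage lines are illustrative assumptions, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, n_hidden=4, eta=0.5, epochs=2000, seed=0):
    """Steps 1-5 above for one hidden layer of sigmoid units (sizes/eta are assumed choices)."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], Y.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_hidden, n_in));  b1 = np.zeros(n_hidden)   # step 1
    W2 = rng.normal(scale=0.1, size=(n_out, n_hidden)); b2 = np.zeros(n_out)
    n = len(X)
    for _ in range(epochs):
        gW1 = np.zeros_like(W1); gb1 = np.zeros_like(b1)
        gW2 = np.zeros_like(W2); gb2 = np.zeros_like(b2)
        for x, y in zip(X, Y):                       # step 2
            a1 = sigmoid(W1 @ x + b1)                # 2a: forward pass (activations)
            a2 = sigmoid(W2 @ a1 + b2)
            d2 = 2 * (a2 - y) * a2 * (1 - a2)        # 2b: output-layer delta
            d1 = (W2.T @ d2) * a1 * (1 - a1)         # 2c: backpropagate delta
            gW2 += np.outer(d2, a1); gb2 += d2       # 2d: per-example gradients
            gW1 += np.outer(d1, x);  gb1 += d1
        W2 -= eta * gW2 / n; b2 -= eta * gb2 / n     # steps 3-4: average, then update
        W1 -= eta * gW1 / n; b1 -= eta * gb1 / n
    return W1, b1, W2, b2

# usage sketch: try to fit XOR with a small network (data assumed for illustration)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train(X, Y)
```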

Page 74:

THANK YOU

Some of the slides in these lectures have been adapted/borrowed

from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan,

Tom Dietterich, Pedro Domingos, and Mohit Gupta.