
How the backpropagation algorithm works

Srikumar Ramalingam

School of Computing

University of Utah

Reference

Most of the slides are taken from the second chapter of the online book by Michael Nielsen:

• neuralnetworksanddeeplearning.com

Introduction

• First discovered in 1970.

• First influential paper in 1986:

Rumelhart, Hinton and Williams, Learning representations by back-propagating errors, Nature, 1986.

Perceptron (Reminder)

Sigmoid neuron (Reminder)

• A sigmoid neuron takes real-valued inputs $(x_1, x_2, x_3)$ between 0 and 1 and returns a number between 0 and 1. The weights $(w_1, w_2, w_3)$ and the bias term $b$ are real numbers.

Sigmoid function
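For reference, the logistic sigmoid used throughout these slides, together with its derivative (it appears as $\sigma'(z)$ in the proofs below):

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$$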

Matrix equations for neural networks

The indices $j$ and $k$ seem a little counter-intuitive: $w_{jk}^l$ denotes the weight from the $k$th neuron in layer $l-1$ to the $j$th neuron in layer $l$.

Layer to layer relationship

• $b_j^l$ is the bias term in the $j$th neuron in the $l$th layer.

• $a_j^l$ is the activation in the $j$th neuron in the $l$th layer.

• $z_j^l$ is the weighted input to the $j$th neuron in the $l$th layer.

$$a_j^l = \sigma(z_j^l), \qquad z_j^l = \sum_k w_{jk}^l\, a_k^{l-1} + b_j^l$$

Cost function from the network

• The quadratic cost summed over the training set:

$$C = \frac{1}{2n} \sum_x \big\| y(x) - a^L(x) \big\|^2$$

Here $n$ is the number of input samples, the sum runs over each input sample $x$, $a^L(x)$ is the output activation vector for the specific training sample $x$, and $y(x)$ is the ground truth for each input.
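A minimal sketch of this cost in NumPy, assuming `outputs` and `labels` are lists of column vectors $a^L(x)$ and $y(x)$ (the names are mine):

```python
import numpy as np

def quadratic_cost(outputs, labels):
    """C = 1/(2n) * sum over x of ||y(x) - a^L(x)||^2."""
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2 for a, y in zip(outputs, labels)) / (2.0 * n)
```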

Backpropagation and stochastic gradient descent

• The goal of the backpropagation algorithm is to compute the gradients $\partial C/\partial w$ and $\partial C/\partial b$ of the cost function $C$ with respect to each and every weight and bias parameter. Note that backpropagation is only used to compute the gradients.

• Stochastic gradient descent is the training algorithm.
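For reference, the gradient-descent update that consumes these gradients, in its standard form ($\eta$ is the learning rate, not introduced on this slide):

$$w \leftarrow w - \eta\,\frac{\partial C}{\partial w}, \qquad b \leftarrow b - \eta\,\frac{\partial C}{\partial b}$$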

Assumptions on the cost function

1. We assume that the cost function can be written as the average over the cost functions from individual training samples: $C = \frac{1}{n}\sum_x C_x$. The cost function for the individual training sample is given by $C_x = \frac{1}{2}\|y(x) - a^L(x)\|^2$.

- Why do we need this assumption? Backpropagation will only allow us to compute the gradients with respect to a single training sample, as given by $\partial C_x/\partial w$ and $\partial C_x/\partial b$. We then recover $\partial C/\partial w$ and $\partial C/\partial b$ by averaging the gradients from the different training samples.

Assumptions on the cost function (continued)

2. We assume that the cost function can be written as a function of the output from the neural network. We assume that the input $x$ and its associated correct label $y(x)$ are fixed and treated as constants.

Hadamard product

• Let $s$ and $t$ be two vectors of the same dimension. The Hadamard product $s \odot t$ is given by elementwise multiplication: $(s \odot t)_j = s_j\, t_j$.

Such elementwise multiplication is also referred to as the Schur product.
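In NumPy (used in the sketches throughout), the Hadamard product of two vectors is simply the `*` operator:

```python
import numpy as np

s = np.array([1.0, 2.0, 3.0])
t = np.array([4.0, 5.0, 6.0])
print(s * t)   # elementwise (Hadamard) product: [ 4. 10. 18.]
```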

Backpropagation

• Our goal is to compute the partial derivatives $\partial C/\partial w_{jk}^l$ and $\partial C/\partial b_j^l$.

• We compute some intermediate quantities while doing so:

$$\delta_j^l = \frac{\partial C}{\partial z_j^l}$$

Four equations of the BP (backpropagation)
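The four equations, as derived on the following slides:

$$\begin{aligned}
\text{(BP1)}\quad & \delta^L = \nabla_a C \odot \sigma'(z^L) \\
\text{(BP2)}\quad & \delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l) \\
\text{(BP3)}\quad & \frac{\partial C}{\partial b_j^l} = \delta_j^l \\
\text{(BP4)}\quad & \frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\,\delta_j^l
\end{aligned}$$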

Chain Rule in differentiation

• In order to differentiate a function $z = f(g(x))$ w.r.t. $x$, we can do the following:

Let $y = g(x)$ and $z = f(y)$; then

$$\frac{dz}{dx} = \frac{dz}{dy} \times \frac{dy}{dx}$$

Chain Rule in differentiation (vector case)

Let $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $g$ map $\mathbb{R}^m$ to $\mathbb{R}^n$, and $f$ map $\mathbb{R}^n$ to $\mathbb{R}$. If $y = g(x)$ and $z = f(y)$, then

$$\frac{\partial z}{\partial x_i} = \sum_k \frac{\partial z}{\partial y_k}\,\frac{\partial y_k}{\partial x_i}$$

Chain Rule in differentiation (computation graph)

$$\frac{\partial z}{\partial x} = \sum_{\substack{j:\; x \in \mathrm{Parent}(y_j),\\ y_j \in \mathrm{Ancestor}(z)}} \frac{\partial z}{\partial y_j}\,\frac{\partial y_j}{\partial x}$$

[Figure: a computation graph in which $x$ feeds the intermediate nodes $y_1$, $y_2$, $y_3$, which in turn feed $z$.]
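A small made-up example of summing over paths: if $y_1 = 2x$, $y_2 = 3x$, and $z = y_1 y_2$, then

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y_1}\frac{\partial y_1}{\partial x} + \frac{\partial z}{\partial y_2}\frac{\partial y_2}{\partial x} = y_2 \cdot 2 + y_1 \cdot 3 = 12x,$$

which agrees with differentiating $z = 6x^2$ directly.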

BP1

Here $L$ is the last layer.

$$\delta^L = \nabla_a C \odot \sigma'(z^L)$$

where

$$\delta^L = \frac{\partial C}{\partial z^L}, \qquad \sigma'(z^L) = \frac{\partial\,\sigma(z^L)}{\partial z^L}, \qquad \nabla_a C = \frac{\partial C}{\partial a^L} = \left( \frac{\partial C}{\partial a_1^L}, \frac{\partial C}{\partial a_2^L}, \dots, \frac{\partial C}{\partial a_n^L} \right)^T$$

Proof:

$$\delta_j^L = \frac{\partial C}{\partial z_j^L} = \sum_k \frac{\partial C}{\partial a_k^L}\,\frac{\partial a_k^L}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L}\,\frac{\partial a_j^L}{\partial z_j^L},$$

since when $j \neq k$ the term $\partial a_k^L / \partial z_j^L$ vanishes ($a_k^L = \sigma(z_k^L)$ depends only on $z_k^L$). Hence

$$\delta_j^L = \frac{\partial C}{\partial a_j^L}\,\sigma'(z_j^L).$$

Thus we have

$$\delta^L = \frac{\partial C}{\partial a^L} \odot \sigma'(z^L)$$

[Figure: Layer $L-1$ and Layer $L$, with $a^L = \sigma(z^L)$ and $C = f(a^L)$.]

BP2

$$\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l)$$

Proof:

$$\delta_j^l = \frac{\partial C}{\partial z_j^l} = \sum_k \frac{\partial C}{\partial z_k^{l+1}}\,\frac{\partial z_k^{l+1}}{\partial z_j^l} = \sum_k \frac{\partial z_k^{l+1}}{\partial z_j^l}\,\delta_k^{l+1}$$

$$z_k^{l+1} = \sum_j w_{kj}^{l+1} a_j^l + b_k^{l+1} = \sum_j w_{kj}^{l+1}\,\sigma(z_j^l) + b_k^{l+1}$$

By differentiating we have:

$$\frac{\partial z_k^{l+1}}{\partial z_j^l} = w_{kj}^{l+1}\,\sigma'(z_j^l)$$

$$\delta_j^l = \sum_k w_{kj}^{l+1}\,\delta_k^{l+1}\,\sigma'(z_j^l)$$

[Figure: Layer $l$ and Layer $l+1$, with $z^{l+1} = w^{l+1} a^l + b^{l+1}$.]

BP3

$$\frac{\partial C}{\partial b_j^l} = \delta_j^l$$

Proof:

$$\frac{\partial C}{\partial b_j^l} = \sum_k \frac{\partial C}{\partial z_k^l}\,\frac{\partial z_k^l}{\partial b_j^l} = \frac{\partial C}{\partial z_j^l}\,\frac{\partial z_j^l}{\partial b_j^l} = \delta_j^l\,\frac{\partial \left( \sum_k w_{jk}^l a_k^{l-1} + b_j^l \right)}{\partial b_j^l} = \delta_j^l$$

[Figure: Layer $l-1$ and Layer $l$, with $z^l = w^l a^{l-1} + b^l$.]

BP4

$$\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\,\delta_j^l$$

Proof:

$$\frac{\partial C}{\partial w_{jk}^l} = \sum_m \frac{\partial C}{\partial z_m^l}\,\frac{\partial z_m^l}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l}\,\frac{\partial z_j^l}{\partial w_{jk}^l} = \delta_j^l\,\frac{\partial \left( \sum_k w_{jk}^l a_k^{l-1} + b_j^l \right)}{\partial w_{jk}^l} = \delta_j^l\, a_k^{l-1}$$

[Figure: Layer $l-1$ and Layer $l$, with $z^l = w^l a^{l-1} + b^l$.]

The backpropagation algorithm

The word "backpropagation" comes from the fact that we compute the error vectors $\delta_j^l$ in the backward direction.
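The slide's algorithm listing is an image, so here is a minimal sketch of the standard procedure for a single training sample, assuming the quadratic cost and the `weights`/`biases` lists, `sigmoid`, and `sigmoid_prime` from the earlier snippets (all names are mine):

```python
def backprop(x, y, weights, biases):
    """Return per-sample gradients (dC/db^l, dC/dw^l) for every layer."""
    # Forward pass: store every z^l and a^l.
    a = x
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # BP1: output error for the quadratic cost.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b = [None] * len(biases)
    nabla_w = [None] * len(weights)
    nabla_b[-1] = delta                           # BP3
    nabla_w[-1] = delta @ activations[-2].T       # BP4

    # BP2: propagate the error backwards through the earlier layers.
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta                          # BP3
        nabla_w[-l] = delta @ activations[-l - 1].T  # BP4
    return nabla_b, nabla_w
```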

Stochastic gradient descent with BP
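The listing on this slide is also an image; here is a sketch, under the same assumptions as above, of how mini-batch SGD uses the `backprop` gradients, averaging them as in assumption 1 (names are mine):

```python
import numpy as np

def sgd_step(mini_batch, weights, biases, eta):
    """One SGD update from a mini-batch of (x, y) pairs; returns new parameters."""
    sum_b = [np.zeros_like(b) for b in biases]
    sum_w = [np.zeros_like(w) for w in weights]
    for x, y in mini_batch:
        nabla_b, nabla_w = backprop(x, y, weights, biases)
        sum_b = [sb + nb for sb, nb in zip(sum_b, nabla_b)]
        sum_w = [sw + nw for sw, nw in zip(sum_w, nabla_w)]
    m = len(mini_batch)
    # Average the per-sample gradients and take a step of size eta.
    new_w = [w - (eta / m) * sw for w, sw in zip(weights, sum_w)]
    new_b = [b - (eta / m) * sb for b, sb in zip(biases, sum_b)]
    return new_w, new_b
```

Usage would look like `weights, biases = sgd_step(batch, weights, biases, eta=3.0)` for each mini-batch in an epoch.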

Gradients using finite differences

The gradient of the cost with respect to a single weight can be approximated as

$$\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}$$

Here $\epsilon$ is a small positive number and $e_j$ is the unit vector in the $j$th direction.

Conceptually very easy to implement. In order to compute this derivative w.r.t. one parameter, we need to do one forward pass; for millions of parameters we would have to do millions of forward passes.

- Backpropagation can get all the gradients in just one forward and one backward pass, and the forward and backward passes are roughly equivalent in computation.

The derivatives using finite differences would be a million times slower!
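A sketch of the finite-difference check described above, useful for verifying a backpropagation implementation; `cost_fn` is a hypothetical function of a flattened parameter vector:

```python
import numpy as np

def numerical_gradient(cost_fn, w, eps=1e-5):
    """Approximate dC/dw_j as (C(w + eps*e_j) - C(w)) / eps for every j.

    One extra cost evaluation (forward pass) per parameter, which is why this
    is vastly slower than backpropagation for large networks.
    """
    grad = np.zeros_like(w)
    base = cost_fn(w)
    for j in range(w.size):
        w_pert = w.copy()
        w_pert.flat[j] += eps
        grad.flat[j] = (cost_fn(w_pert) - base) / eps
    return grad
```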

Backpropagation – the big picture

• We are computing the rate of change of $C$ w.r.t. a weight $w$.

• To compute the total change in $C$ we need to consider all possible paths from the weight to the cost.

• Every edge between two neurons in the network is associated with a rate factor, which is just the partial derivative of one neuron's activation with respect to the other neuron's activation.

• The rate factor for a path is just the product of the rate factors of the edges in the path.

• The total change is the sum of the rate factors of all the paths from the weight to the cost.

Thank You

Chain Rule in differentiation (vector case)

Let $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $g$ map $\mathbb{R}^m$ to $\mathbb{R}^n$, and $f$ map $\mathbb{R}^n$ to $\mathbb{R}$. If $y = g(x)$ and $z = f(y)$, then

$$\frac{\partial z}{\partial x_i} = \sum_k \frac{\partial z}{\partial y_k}\,\frac{\partial y_k}{\partial x_i}, \qquad \text{or in vector form} \qquad \nabla_x z = \left( \frac{\partial y}{\partial x} \right)^T \nabla_y z.$$

Here $\frac{\partial y}{\partial x}$ is the $n \times m$ Jacobian matrix of $g$.

Source: http://math.arizona.edu/~calc/Rules.pdf
