How the backpropagation algorithm works
Reference
Most of the slides are taken from the second chapter of the online book by Michael Nielsen:
• neuralnetworksanddeeplearning.com
Introduction
• First discovered in 1970.
• First influential paper in 1986:
Rumelhart, Hinton and Williams, Learning representations by back-propagating errors, Nature, 1986.
Sigmoid neuron (Reminder)
• A sigmoid neuron takes real-valued inputs (x_1, x_2, x_3) between 0 and 1 and returns a number between 0 and 1. The weights (w_1, w_2, w_3) and the bias term b are real numbers.
Sigmoid function
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \text{output} = \sigma\Big(\sum_j w_j x_j + b\Big)
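As a concrete illustration (not from the slides), here is a minimal NumPy sketch of a single sigmoid neuron; the input values, weights, and bias are made-up examples:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Made-up inputs, weights, and bias for one sigmoid neuron.
x = np.array([0.2, 0.7, 0.1])    # inputs x1, x2, x3
w = np.array([0.5, -1.3, 2.0])   # weights w1, w2, w3
b = 0.4                          # bias term

z = np.dot(w, x) + b             # weighted input z
a = sigmoid(z)                   # output activation in (0, 1)
print(a)
```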
Matrix equations for neural networks
The indices "j" and "k" seem a little counter-intuitive: w^l_{jk} denotes the weight from the k-th neuron in layer (l-1) to the j-th neuron in layer l, so that in matrix form a^l = \sigma(w^l a^{l-1} + b^l).
Layer to layer relationship
• b^l_j is the bias term of the j-th neuron in the l-th layer.
• a^l_j is the activation of the j-th neuron in the l-th layer.
• z^l_j is the weighted input to the j-th neuron in the l-th layer.
a^l_j = \sigma(z^l_j), \qquad z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j
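A minimal NumPy sketch of the layer-to-layer relationship above; the layer sizes and the random initialization are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Illustrative network: 3 inputs -> 4 hidden neurons -> 2 outputs.
sizes = [3, 4, 2]
# weights[i][j, k] is the weight from the k-th neuron in one layer
# to the j-th neuron in the next, matching the w^l_{jk} convention above.
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

def feedforward(a):
    # a^l = sigma(z^l), with z^l = w^l a^{l-1} + b^l for each layer l.
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

print(feedforward(np.array([0.1, 0.5, 0.9])))
```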
Cost function from the network
C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2
Here n is the number of input samples, the sum runs over each input sample x, a^L(x) is the output activation vector for the specific training sample x, and y(x) is the ground truth for each input.
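A small NumPy sketch of this quadratic cost; the layout (lists of output vectors and label vectors) is an illustrative assumption:

```python
import numpy as np

def quadratic_cost(outputs, labels):
    # C = 1/(2n) * sum over samples of ||y(x) - a^L(x)||^2
    n = len(outputs)
    return sum(0.5 * np.sum((y - a) ** 2) for a, y in zip(outputs, labels)) / n

# Illustrative example with two samples.
outputs = [np.array([0.8, 0.1]), np.array([0.3, 0.9])]
labels  = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(quadratic_cost(outputs, labels))
```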
Backpropagation and stochastic gradient descent
• The goal of the backpropagation algorithm is to compute the gradients \partial C / \partial w and \partial C / \partial b of the cost function C with respect to each and every weight and bias parameter. Note that backpropagation is only used to compute the gradients.
• Stochastic gradient descent is the training algorithm.
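A hedged sketch of how the two pieces fit together: `backprop` is a placeholder for the procedure derived later (BP1-BP4), returning per-sample gradients, and the mini-batch averaging and learning rate `eta` follow the usual stochastic gradient descent update. The list-of-arrays parameter layout is an assumption:

```python
import numpy as np

def sgd_step(weights, biases, mini_batch, eta, backprop):
    # Accumulate per-sample gradients over the mini-batch, then take one descent step.
    grad_w = [np.zeros_like(w) for w in weights]
    grad_b = [np.zeros_like(b) for b in biases]
    for x, y in mini_batch:
        dgw, dgb = backprop(weights, biases, x, y)   # dC_x/dw and dC_x/db
        grad_w = [gw + d for gw, d in zip(grad_w, dgw)]
        grad_b = [gb + d for gb, d in zip(grad_b, dgb)]
    m = len(mini_batch)
    weights = [w - (eta / m) * gw for w, gw in zip(weights, grad_w)]
    biases  = [b - (eta / m) * gb for b, gb in zip(biases, grad_b)]
    return weights, biases
```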
Assumptions on the cost function
1. We assume that the cost function can be written as the average of the cost functions from individual training samples: C = \frac{1}{n} \sum_x C_x. The cost function for an individual training sample is C_x = \frac{1}{2} \| y(x) - a^L(x) \|^2.
- Why do we need this assumption? Backpropagation only allows us to compute the gradients with respect to a single training sample, \partial C_x / \partial w and \partial C_x / \partial b. We then recover \partial C / \partial w and \partial C / \partial b by averaging the gradients from the different training samples.
Assumptions on the cost function (continued)
2. We assume that the cost function can be written as a function of the outputs from the neural network. The input x and its associated correct label y(x) are fixed and treated as constants.
Hadamard product
• Let s and t be two vectors. The Hadamard product s \odot t is the elementwise product: (s \odot t)_j = s_j t_j.
Such elementwise multiplication is also referred to as the Schur product.
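In NumPy, the Hadamard product is simply elementwise `*` (illustrative values):

```python
import numpy as np

s = np.array([1.0, 2.0, 3.0])
t = np.array([4.0, 5.0, 6.0])
print(s * t)   # Hadamard (Schur) product: [ 4. 10. 18.]
```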
Backpropagation
• Our goal is to compute the partial derivatives \partial C / \partial w^l_{jk} and \partial C / \partial b^l_j.
• We compute some intermediate quantities while doing so:
\delta^l_j = \frac{\partial C}{\partial z^l_j}
Chain Rule in differentiation
• In order to differentiate a function z = f(g(x)) w.r.t. x, we can do the following:
Let y = g(x) and z = f(y). Then
\frac{dz}{dx} = \frac{dz}{dy} \times \frac{dy}{dx}
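For instance (an illustrative example, not from the slides), to differentiate z = \sin(x^2), take y = x^2 and z = \sin(y):
\frac{dz}{dx} = \frac{dz}{dy} \times \frac{dy}{dx} = \cos(y) \cdot 2x = 2x \cos(x^2)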
Chain Rule in differentiation (vector case)
Let x \in \mathbb{R}^m, y \in \mathbb{R}^n, g map from \mathbb{R}^m to \mathbb{R}^n, and f map from \mathbb{R}^n to \mathbb{R}. If y = g(x) and z = f(y), then
\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}
Chain Rule in differentiation (computation graph)
\frac{\partial z}{\partial x} = \sum_{i \,:\, x \in \mathrm{Parents}(y_i),\; y_i \in \mathrm{Ancestors}(z)} \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x}
(Figure: computation graph in which node x feeds into nodes y_1, y_2, y_3, each of which feeds into z.)
BP1
\delta^L = \nabla_a C \odot \sigma'(z^L)
Here L is the last layer,
\delta^L = \frac{\partial C}{\partial z^L}, \qquad \sigma'(z^L) = \frac{\partial \sigma(z^L)}{\partial z^L}, \qquad \nabla_a C = \frac{\partial C}{\partial a^L} = \left( \frac{\partial C}{\partial a^L_1}, \frac{\partial C}{\partial a^L_2}, \ldots, \frac{\partial C}{\partial a^L_n} \right)^T
Proof:
\delta^L_j = \frac{\partial C}{\partial z^L_j} = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}, since when k \neq j the term \frac{\partial a^L_k}{\partial z^L_j} vanishes.
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)
Thus we have
\delta^L = \nabla_a C \odot \sigma'(z^L)
(Figure: layers L-1 and L, with a^L = \sigma(z^L) and C = C(a^L).)
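A short NumPy sketch of BP1 for the quadratic cost assumed earlier (so that \nabla_a C = a^L - y(x)); the argument names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(z_L, y):
    # BP1: delta^L = grad_a C (Hadamard) sigma'(z^L); for the quadratic cost, grad_a C = a^L - y.
    a_L = sigmoid(z_L)
    return (a_L - y) * sigmoid_prime(z_L)
```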
BP2
\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l)
Proof:
\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k
z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) + b^{l+1}_k
By differentiating we have:
\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j)
\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \, \sigma'(z^l_j)
(Figure: layers l and l+1, with z^{l+1} = w^{l+1} a^l + b^{l+1}.)
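A short NumPy sketch of BP2, reusing `sigmoid_prime` from the BP1 sketch above; the argument names are illustrative:

```python
def backprop_error(w_next, delta_next, z_l):
    # BP2: delta^l = ((w^{l+1})^T delta^{l+1}) (Hadamard) sigma'(z^l).
    return (w_next.T @ delta_next) * sigmoid_prime(z_l)
```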
BP3
\frac{\partial C}{\partial b^l_j} = \delta^l_j
Proof:
\frac{\partial C}{\partial b^l_j} = \sum_k \frac{\partial C}{\partial z^l_k} \frac{\partial z^l_k}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j \, \frac{\partial \left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right)}{\partial b^l_j} = \delta^l_j
(Figure: layers l-1 and l, with z^l = w^l a^{l-1} + b^l.)
BP4
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \, \delta^l_j
Proof:
\frac{\partial C}{\partial w^l_{jk}} = \sum_m \frac{\partial C}{\partial z^l_m} \frac{\partial z^l_m}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j \, \frac{\partial \left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right)}{\partial w^l_{jk}} = \delta^l_j \, a^{l-1}_k
(Figure: layers l-1 and l, with z^l = w^l a^{l-1} + b^l.)
The backpropagation algorithm
The word "backpropagation" comes from the fact that we compute the error vectors \delta^l in the backward direction. For one training sample x, the algorithm is:
1. Set the input activation a^1 = x.
2. Feedforward: for each layer l, compute z^l = w^l a^{l-1} + b^l and a^l = \sigma(z^l).
3. Output error: compute \delta^L using BP1.
4. Backpropagate the error: compute \delta^l for l = L-1, \ldots, 2 using BP2.
5. Output the gradients \partial C_x / \partial b^l_j = \delta^l_j (BP3) and \partial C_x / \partial w^l_{jk} = a^{l-1}_k \delta^l_j (BP4).
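Putting BP1-BP4 together, a hedged NumPy sketch of the algorithm for a single training sample, assuming sigmoid activations, the quadratic cost, and the list-of-arrays parameter layout used in the earlier sketches:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    # Forward pass: store all weighted inputs z^l and activations a^l.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    grad_w = [np.zeros_like(w) for w in weights]
    grad_b = [np.zeros_like(b) for b in biases]

    # BP1: output error (quadratic cost, so grad_a C = a^L - y).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b[-1] = delta                                    # BP3
    grad_w[-1] = np.outer(delta, activations[-2])         # BP4

    # BP2: propagate the error backward through the earlier layers.
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_b[-l] = delta                                # BP3
        grad_w[-l] = np.outer(delta, activations[-l - 1]) # BP4

    return grad_w, grad_b
```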
Gradients using finite differences
\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}
Here \epsilon is a small positive number and e_j is the unit vector in the j-th direction.
• Conceptually very easy to implement. However, computing this derivative w.r.t. one parameter requires an extra forward pass, so for millions of parameters we would have to do millions of forward passes.
• Backpropagation gets all the gradients in just one forward and one backward pass, and the forward and backward passes are roughly equivalent in computation.
The derivatives using finite differences would be a million times slower!!
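A sketch of the finite-difference estimate, usable as a slow gradient check; `cost_fn` is a hypothetical function that runs a forward pass over the training data and returns C, and `eps` is the small positive number \epsilon above:

```python
import numpy as np

def finite_difference_grad(cost_fn, weights, eps=1e-5):
    # Approximates dC/dw_j ~ (C(w + eps*e_j) - C(w)) / eps,
    # which needs one extra forward pass per parameter.
    base = cost_fn(weights)
    grads = []
    for w in weights:
        g = np.zeros_like(w)
        it = np.nditer(w, flags=['multi_index'])
        for _ in it:
            idx = it.multi_index
            old = w[idx]
            w[idx] = old + eps                       # perturb one parameter
            g[idx] = (cost_fn(weights) - base) / eps
            w[idx] = old                             # restore it
        grads.append(g)
    return grads
```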
Backpropagation β the big picture
• To compute the total change in C we need to consider all possible paths from the weight to the cost.
• We are computing the rate of change of C w.r.t. a weight w.
• Every edge between two neurons in the network is associated with a rate factor, which is just the partial derivative of one neuron's activation with respect to the other neuron's activation.
• The rate factor for a path is just the product of the rate factors of the edges along the path.
• The total rate of change of C is the sum of the rate factors of all the paths from the weight to the cost.
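In symbols, this path-sum view (written here as a sketch consistent with the bullets above) reads:
\frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \cdots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}},
where the sum runs over all choices of intermediate neurons m, n, p, \ldots, q along a path from the weight w^l_{jk} to the cost C.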
Chain Rule in differentiation (vector case)
Let x \in \mathbb{R}^m, y \in \mathbb{R}^n, g map from \mathbb{R}^m to \mathbb{R}^n, and f map from \mathbb{R}^n to \mathbb{R}. If y = g(x) and z = f(y), then
\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}
In vector form,
\nabla_x z = \left( \frac{\partial y}{\partial x} \right)^T \nabla_y z
Here \frac{\partial y}{\partial x} is the n \times m Jacobian matrix of g.
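A small NumPy check of this identity on an illustrative linear map y = Ax with z = \tfrac{1}{2}\|y\|^2, where the Jacobian \partial y / \partial x is simply A:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))    # y = A x, so the Jacobian dy/dx is A (n x m = 4 x 3)
x = rng.standard_normal(3)

y = A @ x
grad_y = y                         # for z = 0.5 * ||y||^2, grad_y z = y
grad_x = A.T @ grad_y              # grad_x z = (dy/dx)^T grad_y z

# Direct computation for comparison: z = 0.5 * x^T A^T A x, so grad_x z = A^T A x.
print(np.allclose(grad_x, A.T @ A @ x))   # True
```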