INF 5860 Machine learning for image classification
Lecture: Backpropagation – learning in neural nets. Anne Solberg, March 3, 2017
Reading material
– Reading material: http://cs231n.github.io/optimization-2/
– Additional optional material:
• Lecture on backpropagation in the Coursera course on Machine Learning (Andrew Ng)
• CS 231n on YouTube: lecture 4
• http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
• http://colah.github.io/posts/2015-08-Backprop/
• http://neuralnetworksanddeeplearning.com/chap2.html
Notation – forward propagation

– Assume that the input is layer 0
– a_i^(j): activation of unit i in layer j
– Θ^(j): matrix of weights controlling the function mapping from layer j-1 to layer j
– Θ^(j) has dimension (nodes in layer j) x (nodes in layer j-1 + 1), i.e. size s_j x (s_(j-1) + 1), where s_(j-1) is the number of nodes in layer j-1 and s_j is the number of nodes in layer j

For the small example net (three inputs x_1, x_2, x_3 plus a bias x_0, three hidden units plus a bias, and one output):

a_1^(1) = g(Θ_10^(1) x_0 + Θ_11^(1) x_1 + Θ_12^(1) x_2 + Θ_13^(1) x_3)
a_2^(1) = g(Θ_20^(1) x_0 + Θ_21^(1) x_1 + Θ_22^(1) x_2 + Θ_23^(1) x_3)
a_3^(1) = g(Θ_30^(1) x_0 + Θ_31^(1) x_1 + Θ_32^(1) x_2 + Θ_33^(1) x_3)
h_Θ(x) = a_1^(2) = g(Θ_10^(2) a_0^(1) + Θ_11^(2) a_1^(1) + Θ_12^(2) a_2^(1) + Θ_13^(2) a_3^(1))
Example feed-forward computation
• Input x: 3x1 vector, plus a bias node fixed at +1
• Θ^(1): 4x4 (nof. hidden nodes in layer 1 x (nof. inputs + 1))
• Θ^(2): 2x5 (nof. classes x (nof. hidden nodes in layer 1 + 1))
• If we have N training samples, we can predict all n = 1,...,N at one time by stacking the samples as columns of a matrix X:

X = [ x_pixel1(1)  x_pixel1(2) ... x_pixel1(N)
      x_pixel2(1)  x_pixel2(2) ... x_pixel2(N)
      x_pixel3(1)  x_pixel3(2) ... x_pixel3(N) ]

z1 = Theta1.dot(X)
a1 = sigmoid(z1)
# Append 1 to a1 before computing z2
# Continue with layer 2 ...
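As a minimal NumPy sketch of this computation (the layer sizes follow the slide; the explicit bias rows and random weights are assumptions for illustration only):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N = 5                                   # number of training samples (arbitrary here)
X = np.random.randn(3, N)               # each column is one 3x1 input sample
Theta1 = np.random.randn(4, 4) * 0.01   # 4 hidden nodes x (3 inputs + 1 bias)
Theta2 = np.random.randn(2, 5) * 0.01   # 2 classes x (4 hidden nodes + 1 bias)

Xb = np.vstack([np.ones((1, N)), X])    # prepend the bias row of ones
z1 = Theta1.dot(Xb)
a1 = sigmoid(z1)

a1b = np.vstack([np.ones((1, N)), a1])  # append 1 to a1 before computing z2
z2 = Theta2.dot(a1b)
a2 = sigmoid(z2)                        # network output, shape 2 x N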
Cost function for one-vs-all neural networks
For a neural net with one-vs-all outputs:

Output: a^L = h_Θ(x) ∈ R^K
L: number of layers
s_l: number of units (without bias) in layer l

J(Θ) = -(1/m) Σ_{i=1..m} Σ_{k=1..K} [ y_k(i) log(h_Θ(X(i,:))_k) + (1 - y_k(i)) log(1 - h_Θ(X(i,:))_k) ]
       + (λ/(2m)) Σ_l Σ_i Σ_j (Θ_ji^(l))^2

J(Θ) = LossTerm + λ * RegularizationTerm
Remark: two variations are common:
– Regularize all weights, including the bias terms (sum from 0)
– Avoid regularizing the bias terms (sum from 1)
In practice, this choice does not matter much.
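A hedged sketch of this cost in NumPy. It assumes Y is a K x m one-hot label matrix, A is the K x m matrix of network outputs h_Θ(X), and Thetas is a list of weight matrices whose first column holds the bias terms (which are not regularized); these names are illustrative, not the course's reference code.

import numpy as np

def cross_entropy_cost(A, Y, Thetas, lam):
    """One-vs-all (logistic) cost with L2 regularization.
    A: K x m outputs, Y: K x m one-hot labels, Thetas: list of weight matrices."""
    m = Y.shape[1]
    loss = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    # Regularize all weights except the first (bias) column of each matrix
    reg = sum(np.sum(Theta[:, 1:] ** 2) for Theta in Thetas)
    return loss + lam / (2 * m) * reg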
Cost function for softmax neural networks
For a neural net with a softmax loss function:

Output: a^L = h_Θ(x) = [ P(y=1 | x, Θ); P(y=2 | x, Θ); ... ; P(y=K | x, Θ) ]
              = 1 / (Σ_{k=1..K} e^(θ_k^T x)) * [ e^(θ_1^T x); e^(θ_2^T x); ... ; e^(θ_K^T x) ]
L: number of layers
s_l: number of units (without bias) in layer l

J(Θ) = -(1/m) Σ_{i=1..m} Σ_{k=1..K} 1{y_i = k} log( e^(θ_k^T x_i) / Σ_{j=1..K} e^(θ_j^T x_i) )
       + (λ/(2m)) Σ_l Σ_i Σ_j (Θ_ji^(l))^2

J(Θ) = LossTerm + λ * RegularizationTerm
Remark: the same two variations are common here; see the previous slide.
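A corresponding sketch for the softmax loss, under the same assumptions about Y, Thetas and the bias column; Z is assumed to hold the K x m scores before the softmax.

import numpy as np

def softmax_cost(Z, Y, Thetas, lam):
    """Softmax loss with L2 regularization.
    Z: K x m scores, Y: K x m one-hot labels."""
    m = Y.shape[1]
    Z = Z - Z.max(axis=0, keepdims=True)                     # numerical stability
    P = np.exp(Z) / np.sum(np.exp(Z), axis=0, keepdims=True)  # class probabilities
    loss = -np.sum(Y * np.log(P)) / m
    reg = sum(np.sum(Theta[:, 1:] ** 2) for Theta in Thetas)
    return loss + lam / (2 * m) * reg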
Introduction to backpropagation and computational graphs
• We now have a network architecture and a cost function.
• A learning algorithm for the net should give us a way to change the weights in such a manner that the output is closer to the correct class labels.
• The activation function should assure that a small change in weights results in a small change in outputs.
• Backpropagation uses partial derivatives to compute the derivative of the cost function J with respect to all the weights.
[Figure: a single gate f in a computational graph, with inputs x and y, output z = f(x, y). Activations flow forward, gradients flow backward.]

During backpropagation, the node will receive the gradient ∂L/∂z of the loss with respect to its output. The gate uses the chain rule to redistribute this gradient to its inputs:
∂L/∂x = (∂L/∂z)(∂z/∂x)
∂L/∂y = (∂L/∂z)(∂z/∂y)
Green numbers: forward propagation. Red numbers: backwards propagation.
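As an illustration of this local view, a minimal sketch of a multiply gate (the class and method names are mine): the gate stores its inputs in the forward pass and multiplies the upstream gradient ∂L/∂z by the local derivatives in the backward pass.

class MultiplyGate:
    def forward(self, x, y):
        # Save the inputs; y is the local derivative dz/dx and x is dz/dy
        self.x, self.y = x, y
        return x * y

    def backward(self, dL_dz):
        # Chain rule: dL/dx = dL/dz * dz/dx, dL/dy = dL/dz * dz/dy
        dL_dx = dL_dz * self.y
        dL_dy = dL_dz * self.x
        return dL_dx, dL_dy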
A more complicated graph example
[Figure: the computational graph of f(w, x) = 1/(1 + e^-(w0*x0 + w1*x1 + w2)), evaluated step by step. Green numbers show the forward pass (output 0.73), red numbers the backward pass.]

Backward pass, starting from the output with gradient 1.0:
• 1/x gate: -1/(1.37)^2 * 1.0 = -0.53
• +1 gate: 1 * (-0.53) = -0.53
• exp gate: e^(-1) * (-0.53) = -0.20
• *(-1) gate: (-1) * (-0.20) = 0.20
• add gate: (1) * 0.20 = 0.20, distributed to both inputs
• add gate: (1) * 0.20 = 0.20, distributed to both inputs
• multiply gates:
  ∂f/∂w0 = x0 * 0.20 = -1 * 0.20 = -0.2
  ∂f/∂x0 = w0 * 0.20 = 2 * 0.20 = 0.4
  ∂f/∂w1 = x1 * 0.20 = -2 * 0.20 = -0.4
  ∂f/∂x1 = w1 * 0.20 = -3 * 0.20 = -0.6
The sigmoid gate
The last four gates together form a sigmoid, σ(z) = 1/(1 + e^-z), whose derivative is σ'(z) = (1 - σ(z))σ(z).
Output: 0.73. Derivative of the sigmoid gate: (1 - 0.73) * 0.73 = 0.20. This is the same result as above.
Forward and backward for a single neuron
Remark: an efficient implementation will store inputs and intermediates during forward, so that they are available for backprop.
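A small sketch of forward and backward for a single sigmoid neuron z = w·x + b, a = σ(z), caching the intermediates during forward as the remark suggests; the function and variable names are mine.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(w, x, b):
    z = np.dot(w, x) + b
    a = sigmoid(z)
    cache = (w, x, a)              # store what the backward pass needs
    return a, cache

def neuron_backward(dL_da, cache):
    w, x, a = cache
    dL_dz = dL_da * a * (1 - a)    # sigmoid local gradient g'(z) = g(z)(1 - g(z))
    dL_dw = dL_dz * x              # z = w.x + b  =>  dz/dw = x
    dL_dx = dL_dz * w
    dL_db = dL_dz
    return dL_dw, dL_dx, dL_db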
A more tricky example
• Stage the forward pass into simple operations that we know the derivative of (see the sketch below):
A more tricky example
• In the backwards pass: compute the derivative of all these terms:
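The expression used on the original slide is not reproduced in this transcript, so as a stand-in this sketch stages a made-up function f(x, y) = (x + σ(y))^2 into elementary operations in the forward pass and walks through them in reverse in the backward pass.

import numpy as np

x, y = 3.0, -4.0

# Forward pass, staged into operations with known derivatives
sigy = 1.0 / (1.0 + np.exp(-y))        # sigmoid of y
s = x + sigy                           # sum
f = s ** 2                             # square

# Backward pass, in reverse order of the forward pass
df_ds = 2 * s                          # d(s^2)/ds
df_dx = df_ds * 1.0                    # d(x + sigy)/dx = 1
df_dsigy = df_ds * 1.0                 # d(x + sigy)/dsigy = 1
df_dy = df_dsigy * sigy * (1 - sigy)   # sigmoid local gradient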
Patterns in backward flow
• add gate: gradient distributor
• max gate: gradient router
• mul gate: be careful
f = x*y means that df/dx = y and df/dy = x.
Remark on the multiply gate: if a gate gets one large and one small input, backprop will use the big input to cause a large change on the small input, and vice versa. This is partly why feature scaling is important.
The optimization problem
Given a loss function J and a feed-forward net with L layers, we want to minimize J using gradient descent on the weights Θ^(l).
We therefore need the derivatives of J with respect to every weight Θ_nm^(l).
Backpropagation: recursive application of the chain rule on a computational graph to compute the gradients of all input/parameters/intermediates
Implementation:
• Forward: compute the result of the node operation and save the intermediates needed for gradient computation.
• Backwards: apply the chain rule to compute the gradients of the loss function with respect to the input of each node.
A very simple net with one input
[Figure: x –(* Θ(1))→ z(1) –(sigmoid)→ a(1) –(* Θ(2))→ z(2) –(sigmoid)→ a(2)]

Assume that we want to minimize the square error between the output a(2) and the true class y:
E = 1/2 (y - a(2))^2   (mean square error in this example)
Compute the partial derivatives ∂E/∂Θ(1) and ∂E/∂Θ(2), and use gradient descent to update Θ(1) and Θ(2).
Measure of error on training data: E = 1/2 (y - a(2))^2

∂E/∂Θ(2) = (∂E/∂a(2)) (∂a(2)/∂z(2)) (∂z(2)/∂Θ(2))
∂E/∂a(2) = (a(2) - y)
∂a(2)/∂z(2) = g'(z(2)), since a(2) = g(z(2)) applies the sigmoid function g(z), so g'(z) = g(z)(1 - g(z))
∂z(2)/∂Θ(2) = a(1)
∂E/∂Θ(2) = (a(2) - y) g'(z(2)) a(1)
∂E/∂Θ(1) = (∂E/∂a(2)) (∂a(2)/∂z(2)) (∂z(2)/∂a(1)) (∂a(1)/∂z(1)) (∂z(1)/∂Θ(1))
∂z(2)/∂a(1) = Θ(2)
∂a(1)/∂z(1) = g'(z(1)), since a(1) = g(z(1)) applies the sigmoid function g(z), so g'(z) = g(z)(1 - g(z))
∂z(1)/∂Θ(1) = x
∂E/∂Θ(1) = (a(2) - y) g'(z(2)) Θ(2) g'(z(1)) x
         = (a(2) - y) g(z(2))(1 - g(z(2))) Θ(2) g(z(1))(1 - g(z(1))) x
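A sketch of these two gradient expressions and one gradient-descent step for the scalar example, assuming the squared error E = 1/2 (y - a(2))^2 and sigmoid activations; theta1, theta2 and the learning rate lr are hypothetical names.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(x, y, theta1, theta2):
    # Forward pass for the scalar net x -> z1 -> a1 -> z2 -> a2
    z1 = theta1 * x
    a1 = sigmoid(z1)
    z2 = theta2 * a1
    a2 = sigmoid(z2)
    # dE/dtheta2 = (a2 - y) g'(z2) a1
    dE_dtheta2 = (a2 - y) * a2 * (1 - a2) * a1
    # dE/dtheta1 = (a2 - y) g'(z2) theta2 g'(z1) x
    dE_dtheta1 = (a2 - y) * a2 * (1 - a2) * theta2 * a1 * (1 - a1) * x
    return dE_dtheta1, dE_dtheta2

# One gradient-descent update
x, y, theta1, theta2, lr = 0.5, 1.0, 0.1, -0.2, 0.5
g1, g2 = gradients(x, y, theta1, theta2)
theta1 -= lr * g1
theta2 -= lr * g2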
From scalars to vectors
• In the example, x was a scalar and Θ was a vector with one element per weight.
• When working with vector input, Θ for each layer will be a matrix, and we need ∂E/∂Θ^(l) for every layer l.
• Deriving the vector/matrix version of backpropagation is more tedious, but follows the same principle.
• A good source is http://neuralnetworksanddeeplearning.com/chap2.html
• We now present the vector algorithm.
Backpropagation algorithm for a single training sample (xi, yi)
For now, ignore the regularization (set λ = 0). For a 3-layer net (input = layer 0):

Let δ^(3) be the vector of δ_j^(3), j = 1,...,s_3, where s_3 is the number of nodes in layer 3 (the output layer).
Compute δ^(3) = a^(3) - y
Compute δ^(2) = (Θ^(3))^T δ^(3) ⊙ g'(z^(2))
Compute δ^(1) = (Θ^(2))^T δ^(2) ⊙ g'(z^(1))
With this notation, ∂J/∂Θ_ij^(l) = δ_i^(l) a_j^(l-1)

Note that ⊙ is the elementwise product, or Hadamard product, of two vectors.
Derivative of loss function
• In backpropagation, we need the derivative of the loss function with respect to the activation of the output layer, a_i^L.
• If we ignore the regularization term, the derivative of the logistic loss function for sample i can be shown to be (a_i^L - y_i)
  – See http://stats.stackexchange.com/questions/219241/gradient-for-logistic-loss-function
• For softmax, ignoring the regularization term, the derivative of the softmax loss is also (a_i^L - y_i)
  – See http://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function
• NOTE: a_i^L is computed differently for the two loss functions.
δ_k^(2) = a_k^(2) - ind(y = k)
δ^(1) = (Θ^(2))^T δ^(2) ⊙ g'(z^(1))

Notice that the bias nodes do not receive input from the previous layer. Thus, they should NOT be used in backpropagation.
Including the regularization term
Backpropagation update including the regularization:

D_ij^(l) = (1/m) (Δ_ij^(l) + λ Θ_ij^(l))   for j ≥ 1
D_ij^(l) = (1/m) Δ_ij^(l)                  for j = 0
∂J/∂Θ_ij^(l) = D_ij^(l)

The convention here is that we do not regularize the bias terms. Note that i is indexed from 1, and j from 0 (the j = 0 weight gets its input from the bias in the previous layer).

With the regularization term, the cost is
J(Θ) = -(1/m) Σ_{i=1..m} Σ_{k=1..K} [ y_k(i) log(h_Θ(X(i,:))_k) + (1 - y_k(i)) log(1 - h_Θ(X(i,:))_k) ] + (λ/(2m)) Σ_l Σ_i Σ_j (Θ_ji^(l))^2
     = LossTerm + λ * RegularizationTerm

Remark: softmax will have the same regularization term.
Backpropagation with a loop over training data
Training set (x_1, y_1), ..., (x_m, y_m)
Set Δ_ij^(l) = 0 for all i, j, l
for i = 1:m
    Set a^(0) = x_i
    Do forward propagation to compute a^(l), l = 1,..., L-1
    Compute δ^(L-1) = a^(L-1) - ind(y_i = k), where ind(y_i = k) is an indicator function: 1 if y_i = k and 0 otherwise
    Compute δ^(L-2), ..., δ^(1) using δ^(l) = (Θ^(l+1))^T δ^(l+1) ⊙ g'(z^(l))
    Δ_ij^(l) := Δ_ij^(l) + δ_i^(l) a_j^(l-1)
D_ij^(l) = (1/m) (Δ_ij^(l) + λ Θ_ij^(l))  if j ≠ 0
D_ij^(l) = (1/m) Δ_ij^(l)                 if j = 0
Here, ∂J/∂Θ_ij^(l) = D_ij^(l)
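A NumPy sketch of this loop for a net with one hidden layer, following the slide's conventions (input is layer 0, Θ^(l) maps layer l-1 to layer l, bias terms in the first column); the function and variable names are mine and the details are an assumption, not the course's reference implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_loop(X, Y, Theta1, Theta2, lam):
    """X: d x m inputs, Y: K x m one-hot labels.
    Theta1: s1 x (d+1), Theta2: K x (s1+1). Returns dJ/dTheta1, dJ/dTheta2."""
    m = X.shape[1]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        # Forward propagation for one sample, with bias entries prepended
        a0 = np.concatenate(([1.0], X[:, i]))
        z1 = Theta1.dot(a0)
        a1 = np.concatenate(([1.0], sigmoid(z1)))
        z2 = Theta2.dot(a1)
        a2 = sigmoid(z2)
        # Output-layer error
        delta2 = a2 - Y[:, i]
        # Hidden-layer error: backpropagate through Theta2, dropping the bias column
        delta1 = Theta2[:, 1:].T.dot(delta2) * sigmoid(z1) * (1 - sigmoid(z1))
        # Accumulate gradients as outer products delta * a^T
        Delta2 += np.outer(delta2, a1)
        Delta1 += np.outer(delta1, a0)
    # Average and add regularization, skipping the bias column
    D1 = Delta1 / m
    D2 = Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2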
Checking dimensions
• Note that in backpropagation, we use Θ^T.
• When implementing this, shape() is your best friend.
• Think of a net with one hidden layer (layer 1) with 25 nodes + bias, and an output layer with 10 nodes (10 classes).
• Θ^(2) has dimension 10x26 including the bias, and (Θ^(2))^T is 26x10. δ^(2) has dimension 10x1.
• REMARK: we can either ignore the bias terms in backpropagation, or compute δ_0^(1) also (resulting in a 26x1 vector) but later ignore the δ_0^(1) value.
  – When doing backpropagation from layer 2 to layer 1, ignore the bias (index 0 of layer 2) and backpropagate using ((Θ^(2))^T)(1:25, 0:9).
• δ^(1) = (Θ^(2))^T δ^(2) ⊙ g'(z^(1)) then has dimension [(25x10) x (10x1)] ⊙ (25x1) = 25x1.
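A tiny sanity check of these dimensions with shape(), using the hypothetical 25-node hidden layer and 10 classes from the slide.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Theta2 = np.random.randn(10, 26)   # 10 outputs x (25 hidden nodes + bias)
delta2 = np.random.randn(10, 1)    # error at the output layer
z1 = np.random.randn(25, 1)        # pre-activations of the hidden layer

# Drop the bias column of Theta2 before backpropagating to layer 1
delta1 = Theta2[:, 1:].T.dot(delta2) * sigmoid(z1) * (1 - sigmoid(z1))

print(Theta2.T.shape)   # (26, 10)
print(delta2.shape)     # (10, 1)
print(delta1.shape)     # (25, 1)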
Assumptions behind backpropagation
1. The loss function should be expressed as a sum or average over all training samples.
   – This is true for all the loss functions we have studied so far.
   – We will then be able to compute ∂J/∂Θ for a single training example, and average over all samples.
Output: h_Θ(x) ∈ R^K
L: number of layers
s_l: number of units (without bias) in layer l

J(Θ) = -(1/m) Σ_{i=1..m} Σ_{k=1..K} [ y_k(i) log(h_Θ(X(i,:))_k) + (1 - y_k(i)) log(1 - h_Θ(X(i,:))_k) ] + (λ/(2m)) Σ_l Σ_i Σ_j (Θ_ji^(l))^2
Assumptions behind backpropagation
2. The loss function must be expressed as a function of the outputs of the net.
   – This allows us to change the weights and measure how similar y_i and the output h_Θ(x) are.
Output: h_Θ(x) ∈ R^K
L: number of layers
s_l: number of units (without bias) in layer l

J(Θ) = -(1/m) Σ_{i=1..m} Σ_{k=1..K} [ y_k(i) log(h_Θ(X(i,:))_k) + (1 - y_k(i)) log(1 - h_Θ(X(i,:))_k) ] + (λ/(2m)) Σ_l Σ_i Σ_j (Θ_ji^(l))^2
Gradient checking
• When implementing backpropagation, we use gradient checking to verify the implementation.
• When the code works, we turn off gradient checking.
• But what is it?
Gradient checking: numerical estimation of the gradient
• The gradient of a function is defined as:

  dJ(θ)/dθ = lim_{ε→0} (J(θ + ε) - J(θ - ε)) / (2ε)

• When we have the cost function implemented, we can easily approximate the gradient as:

  dJ(θ)/dθ ≈ (J(θ + ε) - J(θ - ε)) / (2ε)
Procedure for gradient checking
• 'Unroll' Θ^(1), Θ^(2), ... into a 1-d vector θ = [θ_1, ..., θ_n]
• Approximate each partial derivative:

  ∂J/∂θ_1 ≈ (J(θ_1 + ε, θ_2, ..., θ_n) - J(θ_1 - ε, θ_2, ..., θ_n)) / (2ε)
  ∂J/∂θ_2 ≈ (J(θ_1, θ_2 + ε, ..., θ_n) - J(θ_1, θ_2 - ε, ..., θ_n)) / (2ε)
  ...
  ∂J/∂θ_n ≈ (J(θ_1, θ_2, ..., θ_n + ε) - J(θ_1, θ_2, ..., θ_n - ε)) / (2ε)

• Check that the difference between each partial derivative and the one from backpropagation is smaller than a threshold.
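A sketch of this procedure; J is assumed to be any cost function taking the unrolled parameter vector theta, and backprop_grad is the gradient returned by backpropagation for the same theta. The threshold and epsilon values are illustrative.

import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Central-difference approximation of dJ/dtheta for an unrolled vector theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += eps
        theta_minus = theta.copy()
        theta_minus[i] -= eps
        grad[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    return grad

def gradient_check(J, theta, backprop_grad, tol=1e-7):
    num_grad = numerical_gradient(J, theta)
    diff = np.linalg.norm(num_grad - backprop_grad) / np.linalg.norm(num_grad + backprop_grad)
    return diff < tol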
Regarding gradient checking:
• Computing the approximated gradient is computationally much slower than backpropagation:
  – Use gradient checking for a small example when debugging the backpropagation code.
  – Once it works, turn off gradient checking and proceed with training on the entire data set.
Random initialization of weights
• All weights must be initialized to small, but different, random numbers.
  – More on why next week.
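A common way to do this (a sketch, not necessarily the scheme used in the course) is to draw each weight from a zero-mean distribution with a small scale, e.g. uniform in [-eps, eps]; the layer sizes below are hypothetical.

import numpy as np

def init_weights(n_out, n_in, eps=0.12):
    # Small, different random numbers; +1 column for the bias terms
    return np.random.uniform(-eps, eps, size=(n_out, n_in + 1))

Theta1 = init_weights(25, 400)   # e.g. 400 inputs, 25 hidden nodes
Theta2 = init_weights(10, 25)    # 25 hidden nodes, 10 classes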
Training a neural network
• Choose an architecture:
  – Number of inputs: dimension of feature vector or image
  – Number of outputs: number of classes
  – 1-2 hidden layers.
    • For simplicity: use the same number of nodes in each hidden layer
  – More on practical details in the next two lectures.
Training a network
1. Randomly initialize each weight to small numbers
2. Implement forward propagation to get the output
3. Implement code to compute the cost function J(Θ)
4. Implement backprop to compute the partial derivatives:
   for i = 1:m
       Perform forward propagation and backpropagation for sample (x_i, y_i)
5. Use gradient checking to compare numerical estimates and backpropagation gradients. Afterwards, disable gradient checking.
6. Use gradient descent (or optimization methods) withbackpropagation to minimize J.
Weekly exercise:
• A detailed programming exercise, with descriptions of the operations, will be available.
• Implementing backpropagation is central to Mandatory exercise 1.
  – No solution in Python will be given, but test data with known results will be provided.
Next weeks:
• Training in practice, useful tricks
• Babysitting the training process
• Parameter updates
• Activation functions
• Weight initialization
• Preprocessing
• Evaluation
Main reading material: http://cs231n.github.io/neural-networks-3/