Neural Networks - Bishop PRML Ch. 5

41
Feed-forward Networks Network Training Error Backpropagation Applications Neural Networks Bishop PRML Ch. 5 Alireza Ghane Neural Networks Alireza Ghane / Greg Mori 1

Transcript of Neural Networks - Bishop PRML Ch. 5

Page 1: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Neural NetworksBishop PRML Ch. 5

Alireza Ghane

Neural Networks Alireza Ghane / Greg Mori 1

Page 2: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Neural Networks

• Neural networks arise from attempts to model human/animalbrains

• Many models, many claims of biological plausibility• We will focus on multi-layer perceptrons

• Mathematical properties rather than plausibility

Neural Networks Alireza Ghane / Greg Mori 2

Page 3: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Outline

Feed-forward Networks

Network Training

Error Backpropagation

Applications

Neural Networks Alireza Ghane / Greg Mori 3

Page 4: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Outline

Feed-forward Networks

Network Training

Error Backpropagation

Applications

Neural Networks Alireza Ghane / Greg Mori 4

Page 5: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Feed-forward Networks

• We have looked at generalized linear models of the form:

y(x,w) = f

M∑j=1

wjφj(x)

for fixed non-linear basis functions φ(·)

• We now extend this model by allowing adaptive basis functions,and learning their parameters

• In feed-forward networks (a.k.a. multi-layer perceptrons) we leteach basis function be another non-linear function of linearcombination of the inputs:

φj(x) = f

M∑j=1

. . .

Neural Networks Alireza Ghane / Greg Mori 5

Page 6: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Feed-forward Networks

• We have looked at generalized linear models of the form:

y(x,w) = f

M∑j=1

wjφj(x)

for fixed non-linear basis functions φ(·)

• We now extend this model by allowing adaptive basis functions,and learning their parameters

• In feed-forward networks (a.k.a. multi-layer perceptrons) we leteach basis function be another non-linear function of linearcombination of the inputs:

φj(x) = f

M∑j=1

. . .

Neural Networks Alireza Ghane / Greg Mori 6

Page 7: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Feed-forward Networks

• Starting with input x = (x1, . . . , xD), construct linearcombinations:

aj =

D∑i=1

w(1)ji xi + w

(1)j0

These aj are known as activations• Pass through an activation function h(·) to get output zj = h(aj)

• Model of an individual neuron

from Russell and Norvig, AIMA2eNeural Networks Alireza Ghane / Greg Mori 7

Page 8: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Feed-forward Networks

• Starting with input x = (x1, . . . , xD), construct linearcombinations:

aj =

D∑i=1

w(1)ji xi + w

(1)j0

These aj are known as activations• Pass through an activation function h(·) to get output zj = h(aj)

• Model of an individual neuron

from Russell and Norvig, AIMA2eNeural Networks Alireza Ghane / Greg Mori 8

Page 9: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Feed-forward Networks

• Starting with input x = (x1, . . . , xD), construct linearcombinations:

aj =

D∑i=1

w(1)ji xi + w

(1)j0

These aj are known as activations• Pass through an activation function h(·) to get output zj = h(aj)

• Model of an individual neuron

from Russell and Norvig, AIMA2eNeural Networks Alireza Ghane / Greg Mori 9

Page 10: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Activation Functions

• Can use a variety of activation functions• Sigmoidal (S-shaped)

• Logistic sigmoid 1/(1 + exp(−a)) (useful for binaryclassification)

• Hyperbolic tangent tanh• Radial basis function zj =

∑i(xi − wji)2

• Softmax• Useful for multi-class classification

• Identity• Useful for regression

• Threshold• . . .

• Needs to be differentiable for gradient-based learning (later)• Can use different activation functions in each unit

Neural Networks Alireza Ghane / Greg Mori 10

Page 11: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Feed-forward Networks

x0

x1

xD

z0

z1

zM

y1

yK

w(1)MD w

(2)KM

w(2)10

hidden units

inputs outputs

• Connect together a number of these units into a feed-forwardnetwork (DAG)

• Above shows a network with one layer of hidden units• Implements function:

yk(x,w) = h

M∑j=1

w(2)kj h

(D∑i=1

w(1)ji xi + w

(1)j0

)+ w

(2)k0

Neural Networks Alireza Ghane / Greg Mori 11

Page 12: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Outline

Feed-forward Networks

Network Training

Error Backpropagation

Applications

Neural Networks Alireza Ghane / Greg Mori 12

Page 13: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Network Training• Given a specified network structure, how do we set its

parameters (weights)?• As usual, we define a criterion to measure how well our network

performs, optimize against it• For regression, training data are (xn, t), tn ∈ R

• Squared error naturally arises:

E(w) =

N∑n=1

{y(xn,w)− tn}2

• For binary classification, this is another discriminative model,ML:

p(t|w) =

N∏n=1

ytnn {1− yn}1−tn

E(w) = −N∑n=1

{tn ln yn + (1− tn) ln(1− yn)}

Neural Networks Alireza Ghane / Greg Mori 13

Page 14: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Network Training• Given a specified network structure, how do we set its

parameters (weights)?• As usual, we define a criterion to measure how well our network

performs, optimize against it• For regression, training data are (xn, t), tn ∈ R

• Squared error naturally arises:

E(w) =

N∑n=1

{y(xn,w)− tn}2

• For binary classification, this is another discriminative model,ML:

p(t|w) =

N∏n=1

ytnn {1− yn}1−tn

E(w) = −N∑n=1

{tn ln yn + (1− tn) ln(1− yn)}

Neural Networks Alireza Ghane / Greg Mori 14

Page 15: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Network Training• Given a specified network structure, how do we set its

parameters (weights)?• As usual, we define a criterion to measure how well our network

performs, optimize against it• For regression, training data are (xn, t), tn ∈ R

• Squared error naturally arises:

E(w) =

N∑n=1

{y(xn,w)− tn}2

• For binary classification, this is another discriminative model,ML:

p(t|w) =

N∏n=1

ytnn {1− yn}1−tn

E(w) = −N∑n=1

{tn ln yn + (1− tn) ln(1− yn)}

Neural Networks Alireza Ghane / Greg Mori 15

Page 16: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Parameter Optimization

w1

w2

E(w)

wA wB wC

∇E

• For either of these problems, the error function E(w) is nasty• Nasty = non-convex• Non-convex = has local minima

Neural Networks Alireza Ghane / Greg Mori 16

Page 17: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Descent Methods

• The typical strategy for optimization problems of this sort is adescent method:

w(τ+1) = w(τ) + ∆w(τ)

• As we’ve seen before, these come in many flavours• Gradient descent∇E(w(τ))• Stochastic gradient descent∇En(w(τ))• Newton-Raphson (second order)∇2

• All of these can be used here, stochastic gradient descent isparticularly effective

• Redundancy in training data, escaping local minima

Neural Networks Alireza Ghane / Greg Mori 17

Page 18: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Descent Methods

• The typical strategy for optimization problems of this sort is adescent method:

w(τ+1) = w(τ) + ∆w(τ)

• As we’ve seen before, these come in many flavours• Gradient descent∇E(w(τ))• Stochastic gradient descent∇En(w(τ))• Newton-Raphson (second order)∇2

• All of these can be used here, stochastic gradient descent isparticularly effective

• Redundancy in training data, escaping local minima

Neural Networks Alireza Ghane / Greg Mori 18

Page 19: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Descent Methods

• The typical strategy for optimization problems of this sort is adescent method:

w(τ+1) = w(τ) + ∆w(τ)

• As we’ve seen before, these come in many flavours• Gradient descent∇E(w(τ))• Stochastic gradient descent∇En(w(τ))• Newton-Raphson (second order)∇2

• All of these can be used here, stochastic gradient descent isparticularly effective

• Redundancy in training data, escaping local minima

Neural Networks Alireza Ghane / Greg Mori 19

Page 20: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Computing Gradients

• The function y(xn,w) implemented by a network iscomplicated

• It isn’t obvious how to compute error function derivatives withrespect to weights

• Numerical method for calculating error derivatives, use finitedifferences:

∂En∂wji

≈ En(wji + ε)− En(wji − ε)2ε

• How much computation would this take with W weights in thenetwork?

• O(W ) per derivative, O(W 2) total per gradient descent step

Neural Networks Alireza Ghane / Greg Mori 20

Page 21: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Computing Gradients

• The function y(xn,w) implemented by a network iscomplicated

• It isn’t obvious how to compute error function derivatives withrespect to weights

• Numerical method for calculating error derivatives, use finitedifferences:

∂En∂wji

≈ En(wji + ε)− En(wji − ε)2ε

• How much computation would this take with W weights in thenetwork?

• O(W ) per derivative, O(W 2) total per gradient descent step

Neural Networks Alireza Ghane / Greg Mori 21

Page 22: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Computing Gradients

• The function y(xn,w) implemented by a network iscomplicated

• It isn’t obvious how to compute error function derivatives withrespect to weights

• Numerical method for calculating error derivatives, use finitedifferences:

∂En∂wji

≈ En(wji + ε)− En(wji − ε)2ε

• How much computation would this take with W weights in thenetwork?

• O(W ) per derivative, O(W 2) total per gradient descent step

Neural Networks Alireza Ghane / Greg Mori 22

Page 23: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Computing Gradients

• The function y(xn,w) implemented by a network iscomplicated

• It isn’t obvious how to compute error function derivatives withrespect to weights

• Numerical method for calculating error derivatives, use finitedifferences:

∂En∂wji

≈ En(wji + ε)− En(wji − ε)2ε

• How much computation would this take with W weights in thenetwork?

• O(W ) per derivative, O(W 2) total per gradient descent step

Neural Networks Alireza Ghane / Greg Mori 23

Page 24: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Outline

Feed-forward Networks

Network Training

Error Backpropagation

Applications

Neural Networks Alireza Ghane / Greg Mori 24

Page 25: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Error Backpropagation

• Backprop is an efficient method for computing error derivatives∂En∂wji

• O(W ) to compute derivatives wrt all weights

• First, feed training example xn forward through the network,storing all activations aj

• Calculating derivatives for weights connected to output nodes iseasy

• e.g. For linear output nodes yk =∑i wkizi:

∂En∂wki

=∂

∂wki

1

2(yn − tn)2 = (yn − tn)xni

• For hidden layers, propagate error backwards from the outputnodes

Neural Networks Alireza Ghane / Greg Mori 25

Page 26: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Error Backpropagation

• Backprop is an efficient method for computing error derivatives∂En∂wji

• O(W ) to compute derivatives wrt all weights

• First, feed training example xn forward through the network,storing all activations aj

• Calculating derivatives for weights connected to output nodes iseasy

• e.g. For linear output nodes yk =∑i wkizi:

∂En∂wki

=∂

∂wki

1

2(yn − tn)2 = (yn − tn)xni

• For hidden layers, propagate error backwards from the outputnodes

Neural Networks Alireza Ghane / Greg Mori 26

Page 27: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Error Backpropagation

• Backprop is an efficient method for computing error derivatives∂En∂wji

• O(W ) to compute derivatives wrt all weights

• First, feed training example xn forward through the network,storing all activations aj

• Calculating derivatives for weights connected to output nodes iseasy

• e.g. For linear output nodes yk =∑i wkizi:

∂En∂wki

=∂

∂wki

1

2(yn − tn)2 = (yn − tn)xni

• For hidden layers, propagate error backwards from the outputnodes

Neural Networks Alireza Ghane / Greg Mori 27

Page 28: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Chain Rule for Partial Derivatives

• A “reminder”• For f(x, y), with f differentiable wrt x and y, and x and y

differentiable wrt u and v:

∂f

∂u=

∂f

∂x

∂x

∂u+∂f

∂y

∂y

∂u

and

∂f

∂v=

∂f

∂x

∂x

∂v+∂f

∂y

∂y

∂v

Neural Networks Alireza Ghane / Greg Mori 28

Page 29: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Error Backpropagation• We can write

∂En∂wji

=∂

∂wjiEn(aj1 , aj2 , . . . , ajm)

where {ji} are the indices of the nodes in the same layer as nodej

• Using the chain rule:

∂En∂wji

=∂En∂aj

∂aj∂wji

+∑k

∂En∂ak

∂ak∂wji

where∑

k runs over all other nodes k in the same layer as nodej.

• Since ak does not depend on wji, all terms in the summation goto 0

∂En∂wji

=∂En∂aj

∂aj∂wji

Neural Networks Alireza Ghane / Greg Mori 29

Page 30: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Error Backpropagation• We can write

∂En∂wji

=∂

∂wjiEn(aj1 , aj2 , . . . , ajm)

where {ji} are the indices of the nodes in the same layer as nodej

• Using the chain rule:

∂En∂wji

=∂En∂aj

∂aj∂wji

+∑k

∂En∂ak

∂ak∂wji

where∑

k runs over all other nodes k in the same layer as nodej.

• Since ak does not depend on wji, all terms in the summation goto 0

∂En∂wji

=∂En∂aj

∂aj∂wji

Neural Networks Alireza Ghane / Greg Mori 30

Page 31: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Error Backpropagation• We can write

∂En∂wji

=∂

∂wjiEn(aj1 , aj2 , . . . , ajm)

where {ji} are the indices of the nodes in the same layer as nodej

• Using the chain rule:

∂En∂wji

=∂En∂aj

∂aj∂wji

+∑k

∂En∂ak

∂ak∂wji

where∑

k runs over all other nodes k in the same layer as nodej.

• Since ak does not depend on wji, all terms in the summation goto 0

∂En∂wji

=∂En∂aj

∂aj∂wji

Neural Networks Alireza Ghane / Greg Mori 31

Page 32: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Error Backpropagation cont.

• Introduce error δj ≡ ∂En∂aj

∂En∂wji

= δj∂aj∂wji

• Other factor is:

∂aj∂wji

=∂

∂wji

∑k

wjkzk = zi

Neural Networks Alireza Ghane / Greg Mori 32

Page 33: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Error Backpropagation cont.

• Error δj can also be computed using chain rule:

δj ≡∂En∂aj

=∑k

∂En∂ak︸︷︷︸δk

∂ak∂aj

where∑

k runs over all nodes k in the layer after node j.• Eventually:

δj = h′(aj)∑k

wkjδk

• A weighted sum of the later error “caused” by this weight

Neural Networks Alireza Ghane / Greg Mori 33

Page 34: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Error Backpropagation cont.

• Error δj can also be computed using chain rule:

δj ≡∂En∂aj

=∑k

∂En∂ak︸︷︷︸δk

∂ak∂aj

where∑

k runs over all nodes k in the layer after node j.• Eventually:

δj = h′(aj)∑k

wkjδk

• A weighted sum of the later error “caused” by this weight

Neural Networks Alireza Ghane / Greg Mori 34

Page 35: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Outline

Feed-forward Networks

Network Training

Error Backpropagation

Applications

Neural Networks Alireza Ghane / Greg Mori 35

Page 36: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Applications of Neural Networks

• Many success stories for neural networks• Credit card fraud detection• Hand-written digit recognition• Face detection• Autonomous driving (CMU ALVINN)

Neural Networks Alireza Ghane / Greg Mori 36

Page 37: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Hand-written Digit Recognition

• MNIST - standard dataset for hand-written digit recognition• 60000 training, 10000 test images

Neural Networks Alireza Ghane / Greg Mori 37

Page 38: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

LeNet-5

INPUT

32x32

Convolutions SubsamplingConvolutions

C1: feature maps 6@28x28

Subsampling

S2: f. maps6@14x14

S4: f. maps 16@5x5

C5: layer120

C3: f. maps 16@10x10

F6: layer 84

Full connection

Full connection

Gaussian connections

OUTPUT

10

• LeNet developed by Yann LeCun et al.• Convolutional neural network

• Local receptive fields (5x5 connectivity)• Subsampling (2x2)• Shared weights (reuse same 5x5 “filter”)• Breaking symmetry

Neural Networks Alireza Ghane / Greg Mori 38

Page 39: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

4!>6 3!>5 8!>2 2!>1 5!>3 4!>8 2!>8 3!>5 6!>5 7!>3

9!>4 8!>0 7!>8 5!>3 8!>7 0!>6 3!>7 2!>7 8!>3 9!>4

8!>2 5!>3 4!>8 3!>9 6!>0 9!>8 4!>9 6!>1 9!>4 9!>1

9!>4 2!>0 6!>1 3!>5 3!>2 9!>5 6!>0 6!>0 6!>0 6!>8

4!>6 7!>3 9!>4 4!>6 2!>7 9!>7 4!>3 9!>4 9!>4 9!>4

8!>7 4!>2 8!>4 3!>5 8!>4 6!>5 8!>5 3!>8 3!>8 9!>8

1!>5 9!>8 6!>3 0!>2 6!>5 9!>5 0!>7 1!>6 4!>9 2!>1

2!>8 8!>5 4!>9 7!>2 7!>2 6!>5 9!>7 6!>1 5!>6 5!>0

4!>9 2!>8

• The 82 errors made by LeNet5 (0.82% test error rate)

Neural Networks Alireza Ghane / Greg Mori 39

Page 40: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Deep Learning

• Collection of important techniques to improve performance:• Multi-layer networks (as in LeNet 5), auto-encoders for

unsupervised feature learning, sparsity, convolutional networks,parameter tying, probabilistic models for initializing parameters,hinge activation functions for steeper gradients

• ufldl.stanford.edu/eccv10-tutorial• http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

• Scalability is key, can use lots of data since stochastic gradientdescent is memory-efficient

Neural Networks Alireza Ghane / Greg Mori 40

Page 41: Neural Networks - Bishop PRML Ch. 5

Feed-forward Networks Network Training Error Backpropagation Applications

Conclusion

• Readings: Ch. 5.1, 5.2, 5.3• Feed-forward networks can be used for regression or

classification• Similar to linear models, except with adaptive non-linear basis

functions• These allow us to do more than e.g. linear decision boundaries

• Different error functions• Learning is more difficult, error function not convex

• Use stochastic gradient descent, obtain (good?) local minimum

• Backpropagation for efficient gradient computation

Neural Networks Alireza Ghane / Greg Mori 41