Artificial Neural Networks: MultiLayer Perceptrons, Backpropagation Part 3/3
Berrin Yanikoglu

Transcript of "MultiLayer Perceptrons: Backpropagation Part 3/3"
(people.sabanciuniv.edu/berrin/cs512/lectures/7-nn3-backprop.ppt.pdf)

Page 1

Artificial Neural Networks MultiLayer Perceptrons

Backpropagation Part 3/3

Berrin Yanikoglu

Page 2

Capabilities of Multilayer Perceptrons

Page 3

Multilayer Perceptron

In multilayer perceptrons there may be one or more hidden layer(s), called hidden because they are not observed from the outside.

Page 4

Multilayer Perceptron

•  Each layer may have a different number of nodes and a different activation function.
•  Commonly, the same activation function is used within one layer.
•  Typically,
   –  a sigmoid/tanh activation function is used in the hidden units, and
   –  sigmoid/tanh or linear activation functions are used in the output units, depending on the problem (classification or function approximation).
•  In feedforward networks, activations are passed only from one layer to the next (a minimal forward-pass sketch follows below).
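
A minimal sketch of such a feedforward pass, assuming (for illustration only) a network with 2 inputs, 3 tanh hidden units and 1 linear output; the sizes, weights and function names are not from the slides:

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        # Hidden layer: tanh activation; output layer: linear activation.
        h = np.tanh(W1 @ x + b1)   # hidden activations
        return W2 @ h + b2         # linear output

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # 2 inputs -> 3 hidden units
    W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # 3 hidden units -> 1 output
    print(forward(np.array([0.5, -1.0]), W1, b1, W2, b2))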

Page 5

Backpropagation

Capabilities of multilayer NNs were known, but ...
•  a learning algorithm was introduced by Werbos (1974);
•  it was made famous by Rumelhart and McClelland (mid-1980s, the PDP book);
•  this started massive research in the area.

Page 6

XOR problem

•  Learning Boolean functions: 1/0 output can be seen as a 2-class classification problem

•  XOR can be solved by a 1-hidden-layer network

Page 7

XOR problem

Weights of a 1-hidden-layer solution (the last component of each vector is the bias weight):

W11 = [ 1  1 -0.5]   (hidden node 1)
W12 = [-1 -1  1.5]   (hidden node 2)
W2  = [ 1  1 -1.5]   (output node)

[Figure: the two hidden nodes implement boundary1 and boundary2 in the input space, with + and − marking the sides of each boundary.]

Notice how each hidden node implements a decision boundary and the output node combines (ANDs) their results (a quick numeric check of these weights follows below).
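
A quick numeric check of these weights, assuming hard-limiting (step) units and 0/1 inputs; this verification is illustrative and not part of the original slides:

    import numpy as np

    step = lambda v: int(v > 0)          # hard-limiting activation

    W11 = np.array([ 1.0,  1.0, -0.5])   # hidden node 1 (last entry is the bias weight)
    W12 = np.array([-1.0, -1.0,  1.5])   # hidden node 2
    W2  = np.array([ 1.0,  1.0, -1.5])   # output node

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        x  = np.array([x1, x2, 1.0])              # append 1 for the bias
        h1 = step(W11 @ x)                        # boundary 1
        h2 = step(W12 @ x)                        # boundary 2
        y  = step(W2 @ np.array([h1, h2, 1.0]))   # output node ANDs the two boundaries
        print(x1, x2, '->', y)                    # prints 0, 1, 1, 0 (XOR)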

Page 8

Capabilities (Hardlimiting nodes)

Single layer –  Hyperplane boundaries

1-hidden layer –  Can form any convex region (possibly unbounded)

2-hidden layers –  Arbitrarily complex decision regions

Page 9

Capabilities: Decision Regions

Page 10

Capabilities

Thm. (Cybenko 1989; Hornik et al. 1989): Every bounded continuous function can be approximated arbitrarily accurately by 2 layers of weights (1 hidden layer) and sigmoidal units. All other functions can be learned by 2-hidden-layer networks.
–  Discontinuities can theoretically be tolerated for most real-life problems. Also, functions without compact support can be learned under some conditions.
–  The proof is based on Kolmogorov's theorem.

[Figure: a discontinuity vs. a continuous function with a discontinuous first derivative.]

Page 11

Performance Learning

Page 12

Performance Learning

A learning paradigm where the network adjusts its parameters (weights and biases) so as to optimize its “performance”

•  Need to define a performance index, e.g. mean square error E = (1/N) Σ_{i=1..N} (ti − oi)², where we consider the difference between the target ti and the output oi for each pattern i (a short numeric example follows after this list).

•  Search the parameter space to minimize the performance index with respect to the parameters
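
A one-line numeric illustration of this performance index, with made-up target and output values:

    import numpy as np

    t = np.array([1.0, 0.0, 1.0, 1.0])   # targets t_i
    o = np.array([0.9, 0.2, 0.7, 0.4])   # network outputs o_i
    mse = np.mean((t - o) ** 2)           # (1/N) * sum_i (t_i - o_i)^2
    print(mse)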

Page 13

Performance Optimization

Iterative minimization techniques are a form of performance learning:

•  Define E(.) as the performance index

•  Starting with an initial guess w(0), find w(n+1) at each iteration such that E(w(n+1)) < E(w(n))

•  In particular, we will see that Gradient Descent is an iterative (error) minimization technique

Page 14

Basic Optimization Algorithm

Start with an initial guess w_0 and update the guess at each stage by moving along the search direction:

  w_{k+1} = w_k + α_k p_k    or    Δw_k = (w_{k+1} − w_k) = α_k p_k

A new step in the search involves deciding on a search direction and the size of the step to take in that direction:
  p_k – search direction
  α_k – learning rate

Page 15

Gradient Descent

Also known as Steepest Descent

Page 16

Performance Optimization

Iterative minimization techniques: Gradient Descent
–  Successive adjustments to w are in the direction of steepest descent (the direction opposite to the gradient vector):

  w(n+1) = w(n) − η ∇E(w(n))

Page 17

Performance Optimization: Iterative Techniques Summary - ADVANCED

Choose the next step so that the function decreases:

  F(x_{k+1}) < F(x_k)

For small changes in x we can approximate F(x) using the Taylor series expansion:

  F(x_{k+1}) = F(x_k + Δx_k) ≈ F(x_k) + g_kᵀ Δx_k,    where g_k ≡ ∇F(x) evaluated at x = x_k

If we want the function to decrease, we must choose p_k such that:

  g_kᵀ Δx_k = α_k g_kᵀ p_k < 0

We can maximize the decrease by choosing:

  p_k = −g_k

which gives  x_{k+1} = x_k − α_k g_k

Page 18

Steepest Descent

Page 19

Example

  F(x) = x1² + 2 x1 x2 + 2 x2² + x1,    starting point x_0 = [0.5, 0.5]ᵀ

  ∇F(x) = [∂F/∂x1, ∂F/∂x2]ᵀ = [2x1 + 2x2 + 1, 2x1 + 4x2]ᵀ

  g_0 = ∇F(x) at x = x_0 = [3, 3]ᵀ,    α = 0.1

  x_1 = x_0 − α g_0 = [0.5, 0.5]ᵀ − 0.1·[3, 3]ᵀ = [0.2, 0.2]ᵀ
  x_2 = x_1 − α g_1 = [0.2, 0.2]ᵀ − 0.1·[1.8, 1.2]ᵀ = [0.02, 0.08]ᵀ
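
A short sketch reproducing the two iterations above; the function, starting point and step size are from the slide, the code itself is only an illustration:

    import numpy as np

    def F(x):                                  # F(x) = x1^2 + 2 x1 x2 + 2 x2^2 + x1
        x1, x2 = x
        return x1**2 + 2*x1*x2 + 2*x2**2 + x1

    def gradF(x):                              # gradient of F
        x1, x2 = x
        return np.array([2*x1 + 2*x2 + 1, 2*x1 + 4*x2])

    alpha = 0.1
    x = np.array([0.5, 0.5])                   # x_0
    for k in range(2):
        x = x - alpha * gradF(x)               # x_{k+1} = x_k - alpha * g_k
        print(k + 1, x, F(x))                  # gives x_1 = [0.2, 0.2], x_2 = [0.02, 0.08]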

Page 20

Two simple error surfaces (for 2 weights)

[Figure: two error surfaces, (a) and (b).]

In (a), as you move parallel to one of the axes, there is no change in error. In (b), moving diagonally, rather than parallel to the axes, brings the biggest change.

Page 21

Gradient Descent: illustration

Gradient: ∇E[w]=[∂E/∂w0,… ∂E/∂wn]

[Figure: one gradient step moves the weights from (w1, w2) to (w1 + Δw1, w2 + Δw2).]

  Δw = −η ∇E[w]

Page 22

Gradient Vector

  ∇F(x) = [ ∂F(x)/∂x1,  ∂F(x)/∂x2,  …,  ∂F(x)/∂xn ]ᵀ

Each dimension of the gradient vector is the partial derivative of the function with respect to one of the dimensions.

Page 23

Plot

[Figure: plot over the range −2 to 2 on both axes.]

Page 24

Gradient Descent for Delta Rule for Adaline (Linear Activation)

Page 25

[Figure: a single linear unit (Adaline) with inputs weighted by w0, …, wn producing output o.]

Page 26

Page 27

Page 28

Stochastic Gradient Descent

Page 29

Approximate Gradient Descent (Stochastic Backpropagation)

Normally, in gradient descent, we would need to compute how the error over all input samples (true gradient) changes with respect to a small change in a given weight.

But the common form of the gradient descent algorithm takes one input pattern, computes the error of the network on that pattern only, and updates the weights using only that information.
–  Notice that the new weights may not be better for all patterns, but we expect that if we take small steps, the updates will average out and approximate the true gradient.

Page 30

Stochastic Approximation to Steepest Descent

Instead of updating every weight only after all examples have been observed, we update on every example:

  Δwi = η (t − o) xi

Remarks:
•  Speeds up learning significantly when data sets are large
•  Standard (batch) gradient descent can be used with a larger step size
•  When there are multiple local minima, the stochastic approximation to gradient descent may avoid getting stuck in a local minimum
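
A minimal sketch of this per-example (delta rule) update for a single linear unit; the toy data and hyperparameters are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(100, 3))     # 100 patterns with 3 inputs
    t = X @ np.array([2.0, -1.0, 0.5])        # targets from a known linear map

    w, eta = np.zeros(3), 0.05
    for epoch in range(20):
        for x, target in zip(X, t):           # update after every single example
            o = w @ x                          # linear activation
            w += eta * (target - o) * x        # delta rule: dw_i = eta * (t - o) * x_i
    print(w)                                   # approaches [2.0, -1.0, 0.5]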

Page 31

Gradient Descent Backpropagation Algorithm

Derivation for General Activation Functions

Page 32

Transfer Function Derivatives

Sigmoid:
  f(n) = 1/(1 + e⁻ⁿ)
  f′(n) = d/dn [1/(1 + e⁻ⁿ)] = e⁻ⁿ/(1 + e⁻ⁿ)² = (1 − 1/(1 + e⁻ⁿ)) · (1/(1 + e⁻ⁿ)) = (1 − a)·a,  where a = f(n)

Linear:
  f(n) = n,  f′(n) = 1

Note the saturation of sigmoid nodes (the derivative is near zero for large |n|) and the efficiency of computing the derivative: it needs only the already-computed activation a.
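
A small sketch of that efficiency point, assuming the standard logistic sigmoid: once the activation a = f(n) is available from the forward pass, the derivative costs no extra exponentials:

    import numpy as np

    def sigmoid(n):
        return 1.0 / (1.0 + np.exp(-n))

    n = np.array([-5.0, 0.0, 5.0])
    a = sigmoid(n)               # forward-pass activation
    fprime = a * (1.0 - a)       # f'(n) = (1 - a) * a, reusing the activation
    print(a, fprime)             # derivative is near zero where the unit saturates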

Page 33

Stochastic Backpropagation

To calculate the partial derivative of Ep (error on pattern p) w.r.t a given weight wji, we have to consider whether this is the weight of an output or hidden node:

If wji is an output node weight, the situation is simpler:

  dEp/dwji = (dEp/doj) × (doj/dnetj) × (dnetj/dwji)

  dEp/dwji = −(tj − oj) × f′(netj) × oi

where  netj = Σ_i oi wji,   oj = f(netj),   Ep = (tp − op)²

[Figure: node i feeds output node j through weight wji; oj is the output of node j.]

Note that oi is the input to node j.
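
A minimal numeric sketch of this output-weight gradient for one pattern, assuming a sigmoid output unit; all values are illustrative:

    import numpy as np

    sigmoid = lambda n: 1.0 / (1.0 + np.exp(-n))

    o_i  = np.array([0.3, 0.8, 1.0])     # inputs o_i into output node j (last entry: bias input)
    w_ji = np.array([0.1, -0.4, 0.2])    # weights into node j
    t_j  = 1.0                           # target for node j

    net_j   = w_ji @ o_i                 # net_j = sum_i o_i * w_ji
    o_j     = sigmoid(net_j)             # o_j = f(net_j)
    f_prime = o_j * (1.0 - o_j)          # f'(net_j) for the sigmoid

    dE_dw = -(t_j - o_j) * f_prime * o_i # dEp/dw_ji = -(t_j - o_j) * f'(net_j) * o_i
    print(dE_dw)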

Page 34

The situation is more complex with a hidden node (you don't need to follow the full derivation), because basically:
•  while we know what the output of an output node should be,
•  we don't know what the output of a hidden node should be.

Page 35

Backpropagation – Hidden nodes

If wji is a hidden node weight:

  dEp/dwji = (dEp/doj) × (doj/dnetj) × (dnetj/dwji)

  dEp/dwji = (dEp/doj) × f′(netj) × oi

where  netj = Σ_i oi wji,   oj = f(netj),   Ep = (tp − op)²

[Figure: node i feeds hidden node j through weight wji; node j feeds the next-layer nodes k and k′ through weights wkj.]

Page 36

Backpropagation – Hidden nodes

If wji is a hidden node weight:

  dEp/dwji = (dEp/doj) × (doj/dnetj) × (dnetj/dwji)

  dEp/dwji = (dEp/doj) × f′(netj) × oi

where  netj = Σ_i oi wji,   oj = f(netj),   Ep = (tp − op)²

Note that as j is a hidden node, we do not know its target. Hence, dEp/doj can only be calculated through j's contribution to the derivative of E w.r.t. netk at the next-layer nodes k:

  dEp/doj = Σ_k wkj × (dEp/dnetk)

[Figure: node i feeds hidden node j through weight wji; node j feeds the next-layer nodes k and k′ through weights wkj.]
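
A sketch of how dEp/doj is accumulated from the next layer and combined with the local terms, for one sigmoid hidden node j feeding two next-layer nodes k; all numbers are illustrative:

    import numpy as np

    sigmoid = lambda n: 1.0 / (1.0 + np.exp(-n))

    # Forward quantities at hidden node j
    o_i   = np.array([0.5, -0.2, 1.0])    # inputs into node j (last entry: bias input)
    w_ji  = np.array([0.3, 0.7, -0.1])    # weights into node j
    net_j = w_ji @ o_i
    o_j   = sigmoid(net_j)

    # Quantities at the next-layer nodes k that node j feeds
    w_kj      = np.array([0.6, -0.9])     # weights from j to each node k
    dE_dnet_k = np.array([0.12, -0.05])   # dEp/dnet_k, already computed at those nodes

    dE_do_j  = w_kj @ dE_dnet_k                    # dEp/do_j = sum_k w_kj * dEp/dnet_k
    dE_dw_ji = dE_do_j * o_j * (1.0 - o_j) * o_i   # dEp/dw_ji = dEp/do_j * f'(net_j) * o_i
    print(dE_dw_ji)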

Page 37

Vanishing Gradient Problem

The use of the chain rule in computing the gradient, with sigmoid or tanh activations, causes the gradient to decrease exponentially as we move from the output towards the first layers.

Earlier layers will therefore learn at a much slower speed.
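
A tiny sketch of why the gradient shrinks, assuming sigmoid units and ignoring the weights: the sigmoid derivative is at most 0.25, so a chain of one such factor per layer decays exponentially with depth:

    import numpy as np

    sigmoid = lambda n: 1.0 / (1.0 + np.exp(-n))
    a = sigmoid(0.0)               # the derivative is largest at n = 0
    d = a * (1.0 - a)              # = 0.25, the maximum sigmoid derivative

    for depth in (1, 5, 10, 20):
        print(depth, d ** depth)   # factor contributed by `depth` sigmoid layers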

Page 38

We will cover the rest briefly:
1)  Showing what a single neuron (or, in the example, 3 of them) can do in a function approximation situation.
2)  Showing networks of different complexity solving the same problem (approximating a sinusoidal function).

Page 39

Function Approximation & Network Capabilities

Page 40

Function Approximation

Neural networks are intrinsically function approximators:
–  we can train a NN to map real-valued vectors to real-valued vectors.

The function approximation capabilities of a simple network, in response to its parameters (weights and biases), are illustrated in the next slides.

Page 41

Function Approximation: Example

Transfer functions:  f¹(n) = 1/(1 + e⁻ⁿ)  (log-sigmoid, hidden layer),   f²(n) = n  (linear, output layer)

Nominal parameter values (superscripts are layer numbers):
  w¹₁,₁ = 10,   w¹₂,₁ = 10,   b¹₁ = −10,   b¹₂ = 10
  w²₁,₁ = 1,    w²₁,₂ = 1,    b² = 0
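
A small sketch computing the nominal response from these values (a 1-2-1 network with log-sigmoid hidden units and a linear output); the code is only an illustration:

    import numpy as np

    logsig = lambda n: 1.0 / (1.0 + np.exp(-n))                 # f1

    w1 = np.array([10.0, 10.0]); b1 = np.array([-10.0, 10.0])   # hidden layer (2 units)
    w2 = np.array([1.0, 1.0]);   b2 = 0.0                       # linear output layer

    p  = np.linspace(-2, 2, 9)
    a1 = logsig(np.outer(p, w1) + b1)    # hidden activations for each input p
    a2 = a1 @ w2 + b2                    # output layer: f2(n) = n
    print(np.round(a2, 3))               # rises from about 0 to about 2 across [-2, 2]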

Page 42

Nominal Response

[Figure: output of the network with the nominal parameters, plotted for inputs from −2 to 2 (vertical axis from −1 to 3).]

Page 43

Parameter Variations

[Figure: four panels showing the network response as a parameter is varied; axes as in the nominal-response plot.]

  −1 ≤ b² ≤ 1

What would be the effect of varying the bias of the output neuron?

Page 44

Parameter Variations

[Figure: four panels showing the network response as individual parameters are varied over the ranges below; axes as in the nominal-response plot.]

  −1 ≤ w²₁,₁ ≤ 1
  −1 ≤ w²₁,₂ ≤ 1
  0 ≤ b¹₂ ≤ 20
  −1 ≤ b² ≤ 1

Page 45

Network Complexity

Page 46

Choice of Architecture

  g(p) = 1 + sin(i·π/4 · p)

1-3-1 network (1 input, 3 hidden, 1 output node)

[Figure: four panels showing the 1-3-1 network's approximation of g(p) over p ∈ [−2, 2] for i = 1, 2, 4, 8.]

Page 47

Choice of Network Architecture

  g(p) = 1 + sin(6π/4 · p)

[Figure: four panels showing approximations of g(p) over p ∈ [−2, 2] by 1-2-1, 1-3-1, 1-4-1 and 1-5-1 networks.]

Residual error decreases with O(1/M), where M is the number of hidden units.
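
A rough sketch of this kind of experiment: a 1-M-1 network (log-sigmoid hidden layer, linear output) trained with stochastic backpropagation on g(p) = 1 + sin(6π/4·p). The hyperparameters (M, learning rate, number of epochs) are illustrative choices, not taken from the slides:

    import numpy as np

    def train_1M1(M, epochs=2000, eta=0.05, seed=0):
        # Train a 1-M-1 network on g(p) = 1 + sin(6*pi/4 * p) with per-example updates.
        rng = np.random.default_rng(seed)
        p = np.linspace(-2, 2, 41)
        t = 1 + np.sin(6 * np.pi / 4 * p)                 # target function
        w1 = rng.normal(scale=0.5, size=M); b1 = rng.normal(scale=0.5, size=M)
        w2 = rng.normal(scale=0.5, size=M); b2 = 0.0
        sig = lambda n: 1.0 / (1.0 + np.exp(-n))
        for _ in range(epochs):
            for k in rng.permutation(len(p)):
                a1 = sig(w1 * p[k] + b1)                  # hidden activations
                o  = w2 @ a1 + b2                         # linear output
                e  = t[k] - o
                d2 = -2 * e                               # dE/dnet at the output (E = e^2)
                d1 = d2 * w2 * a1 * (1 - a1)              # dE/dnet at each hidden unit
                w2 -= eta * d2 * a1; b2 -= eta * d2
                w1 -= eta * d1 * p[k]; b1 -= eta * d1
        o_all = sig(np.outer(p, w1) + b1) @ w2 + b2
        return np.mean((t - o_all) ** 2)                  # residual mean squared error

    for M in (2, 3, 4, 5):                                # 1-2-1 ... 1-5-1
        print(M, train_1M1(M))                            # error generally drops as M grows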

Page 48

Convergence in Time

  g(p) = 1 + sin(π·p)

[Figure: two panels showing the network response at successive stages of training (iterations labeled 0 through 5), over p ∈ [−2, 2].]

Page 49

Generalization

Training set: {p1, t1}, {p2, t2}, …, {pQ, tQ}

  g(p) = 1 + sin(π/4 · p),   sampled at p = −2, −1.6, −1.2, …, 1.6, 2

[Figure: responses of a 1-2-1 and a 1-9-1 network trained on these samples, over p ∈ [−2, 2].]

Page 50

We will see later how to use complex models (e.g. a higher-order polynomial or an MLP with a large number of nodes) together with regularization to control model complexity.
•  E.g. keeping the weights small, so that even if we use complex models, the weights are kept in check and the overall model is not prone to overfitting.

Page 51

Next: Issues and Variations on Backpropagation