Multilayer Perceptron = FeedForward Neural Network


Page 1: Multilayer Perceptron = FeedForward Neural Network

History
Definition
Classification = feedforward operation
Learning = “backpropagation” = local optimization in the space of weights

Page 2: Pattern Classification

Figures in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

Page 3: History

● Perceptron (Rosenblatt, 1956), with its simple learning algorithm, generated a lot of excitement

● Minsky & Papert (1969) showed that even the simple XOR function cannot be learnt by a perceptron, which led to skepticism about their utility

● The problem was solved by “networks” of perceptrons (multi-layer perceptrons, feedforward neural networks)

Page 4: (figure)

Page 5: A three-layer perceptron network implementing the XOR (figure)

Page 6: Multilayer Perceptron, terminology

● Net activation of a hidden unit:

net_j = ∑_{i=1}^{d} w_ji x_i + w_j0 = ∑_{i=0}^{d} w_ji x_i ≡ w_j^t · x

where the subscript i indexes units in the input layer and j units in the hidden layer; w_ji denotes the input-to-hidden weight at hidden unit j.

● A single “bias unit” is connected to each unit other than the input units

● Each hidden unit emits an output that is a nonlinear function of its activation, that is: y_j = f(net_j)

● Each output unit similarly computes its net activation based on the hidden-unit signals:

net_k = ∑_{j=1}^{n_H} w_kj y_j + w_k0 = ∑_{j=0}^{n_H} w_kj y_j ≡ w_k^t · y

where the subscript k indexes units in the output layer and n_H denotes the number of hidden units.

Page 7:

● The function f(·), the activation function, introduces nonlinearity into a unit. The first activation function considered was the sign function (as in the perceptron):

f(net) = sgn(net) ≡ +1 if net ≥ 0, −1 if net < 0

● Unfortunately, a network with such an f is difficult to train

Page 8: Multilayer Perceptron = FF Neural net

● Is a function of the following form:

g_k(x) ≡ z_k = f( ∑_{j=1}^{n_H} w_kj f( ∑_{i=1}^{d} w_ji x_i + w_j0 ) + w_k0 ),   k = 1, …, c   (1)

● Popularized in 1986, backpropagation learning became possible for f continuous and differentiable

● We can allow the activation function in the output layer to be different from the activation function in the hidden layer, or have a different activation for each individual unit

● We assume for now that all activation functions are identical
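The following Python sketch implements the feedforward map in equation (1) for a single input vector. The function and variable names (forward, W_hidden, W_output) are illustrative, and the sigmoid is used only as one common continuous, differentiable choice of f; neither is prescribed by the slides.

import numpy as np

def f(net):
    """Sigmoid activation, one common differentiable choice for f."""
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, W_hidden, W_output):
    """Compute z_k = f( sum_j w_kj * f( sum_i w_ji x_i + w_j0 ) + w_k0 ).

    W_hidden: (n_H, d+1) input-to-hidden weights, column 0 holds the biases w_j0.
    W_output: (c, n_H+1) hidden-to-output weights, column 0 holds the biases w_k0.
    """
    x_aug = np.concatenate(([1.0], x))   # prepend the bias unit x_0 = 1
    net_j = W_hidden @ x_aug             # hidden net activations
    y = f(net_j)                         # hidden outputs y_j = f(net_j)
    y_aug = np.concatenate(([1.0], y))   # bias unit y_0 = 1 for the output layer
    net_k = W_output @ y_aug             # output net activations
    return f(net_k)                      # outputs z_k = f(net_k)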

Page 9:

● The vector of outputs is denoted z, with components z_k = f(net_k)

● In the case of c outputs (classes), we can view the network as computing c discriminant functions z_k = g_k(x) and classify the input x according to the largest discriminant function g_k(x), k = 1, …, c
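As a hypothetical usage example, classification with the forward() sketch above amounts to taking the index of the largest output; the dimensions and random weights below are placeholders, not values from the slides.

rng = np.random.default_rng(0)
d, n_H, c = 2, 3, 2                        # input features, hidden units, classes
W_hidden = rng.normal(size=(n_H, d + 1))   # random weights, for illustration only
W_output = rng.normal(size=(c, n_H + 1))
z = forward(np.array([0.5, -1.0]), W_hidden, W_output)
predicted_class = int(np.argmax(z))        # index k of the largest discriminant g_k(x)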

Page 10: (figure)

Page 11: Expressive Power of multi-layer Networks

Question: Can every decision be implemented by a three-layer network described by equation (1)?

Answer: Yes (due to A. Kolmogorov). “Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units n_H, proper nonlinearities, and weights.”

g(x) = ∑_{j=1}^{2n+1} δ_j( ∑_i β_ij(x_i) )   ∀ x ∈ I^n   (I = [0, 1]; n ≥ 2)

for properly chosen functions δ_j and β_ij

Page 12: Expressive Power of multi-layer Networks

● Each of the 2n+1 hidden units δ_j takes as input a sum of d nonlinear functions, one for each input feature x_i

● Each hidden unit emits a nonlinear function δ_j of its total input

● Unfortunately: Kolmogorov’s theorem is not constructive; it tells us very little about how to find the nonlinear functions from data, and this is the central problem in network-based pattern recognition

● Remarkably: Kolmogorov’s theorem proves that a continuous function in dimension n can be represented as a function of a sufficient number of one-dimensional functions

Page 13: (figure)

Page 14: Modes of operation of MLP

● The network has two modes of operation:

● Classification = feedforward operation. The feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to obtain the values of the output units (no cycles!)

● Learning. Supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed output and the desired output

Page 15: Learning: Backpropagation Cost function

● Backpropagation is a gradient descent method. The cost function is, for a single training pattern:

J(w) = ½ ∑_{k=1}^{c} (t_k − z_k)² = ½ ‖t − z‖²

● where t_k is the k-th target (or desired) output and z_k is the k-th computed output

● with k = 1, …, c, and w represents all the weights of the network

● Class k is represented as the vector (0, 0, …, 1, …, 0), where the 1 is in the k-th position

● For a “pure” output of the neural net, z = (0, …, 1, …, 0), J(w) is 0 or 1
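A minimal sketch of this per-pattern cost and of the one-hot class encoding; the helper names one_hot and cost are assumptions made for illustration, not names from the slides.

import numpy as np

def one_hot(k, c):
    """Target vector for class k: a 1 in position k, zeros elsewhere."""
    t = np.zeros(c)
    t[k] = 1.0
    return t

def cost(t, z):
    """J(w) = 1/2 * sum_k (t_k - z_k)^2 for one training pattern."""
    return 0.5 * np.sum((t - z) ** 2)

# For a "pure" network output the cost is 0 (correct class) or 1 (wrong class):
print(cost(one_hot(1, 3), one_hot(1, 3)))   # 0.0
print(cost(one_hot(1, 3), one_hot(2, 3)))   # 1.0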

Page 16: Multilayer Perceptron = FeedForward Neural Network€¦ · Multilayer Perceptron = FeedForward Neural Network History Definition Classification = feedforward operation Learning =

● in gradient descent, the update of weights is in the direction of the gradient

● the step size is found e.g. by line search (a search for a minimum in 1D).

● w t +1 = w t + ∆w

Patter

16

wJw∂∂

−= η∆

Learning: Update
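As a small sketch, one such update step in Python, assuming a gradient function grad_J(w) is available (a hypothetical name) and eta is the learning rate:

def gradient_step(w, grad_J, eta):
    """w_(t+1) = w_t - eta * dJ/dw, i.e. a step along the negative gradient."""
    return w - eta * grad_J(w)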

Page 17: Learning: Calculating the gradient

● Error on the hidden-to-output weights:

∂J/∂w_kj = ∂J/∂net_k · ∂net_k/∂w_kj = −δ_k · ∂net_k/∂w_kj

where the sensitivity of unit k is defined as:

δ_k ≡ −∂J/∂net_k

and describes how the overall error changes with the unit’s net activation:

δ_k = −∂J/∂net_k = −∂J/∂z_k · ∂z_k/∂net_k = (t_k − z_k) f′(net_k)

Page 18: Learning: Calculating the gradient

Since net_k = w_k^t · y, therefore:

∂net_k/∂w_kj = y_j

Conclusion: the weight update (or learning rule) for the hidden-to-output weights is:

∆w_kj = η δ_k y_j = η (t_k − z_k) f′(net_k) y_j

● Error on the input-to-hidden weights:

∂J/∂w_ji = ∂J/∂y_j · ∂y_j/∂net_j · ∂net_j/∂w_ji

Page 19: Learning: Calculating the gradient

Similarly to the preceding case, we define the sensitivity of a hidden unit. First,

∂J/∂y_j = ∂/∂y_j [ ½ ∑_{k=1}^{c} (t_k − z_k)² ] = −∑_{k=1}^{c} (t_k − z_k) ∂z_k/∂y_j = −∑_{k=1}^{c} (t_k − z_k) ∂z_k/∂net_k · ∂net_k/∂y_j = −∑_{k=1}^{c} (t_k − z_k) f′(net_k) w_kj

so the sensitivity of hidden unit j is defined as:

δ_j ≡ f′(net_j) ∑_{k=1}^{c} w_kj δ_k

which means that: “The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights w_kj, all multiplied by f′(net_j).” The learning rule for the input-to-hidden weights is:

∆w_ji = η δ_j x_i = η [ ∑_{k=1}^{c} w_kj δ_k ] f′(net_j) x_i
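Putting the learning rules of the last three slides together, the sketch below computes the per-pattern updates ∆w_kj = η δ_k y_j and ∆w_ji = η δ_j x_i with a sigmoid activation, for which f′(net) = f(net)(1 − f(net)). The function name backprop_updates and the weight-matrix layout (bias in column 0, matching the earlier forward() sketch) are assumptions made for illustration.

import numpy as np

def backprop_updates(x, t, W_hidden, W_output, eta):
    """Return (dW_hidden, dW_output) for a single training pattern (x, t)."""
    x_aug = np.concatenate(([1.0], x))
    net_j = W_hidden @ x_aug
    y = 1.0 / (1.0 + np.exp(-net_j))              # y_j = f(net_j)
    y_aug = np.concatenate(([1.0], y))
    net_k = W_output @ y_aug
    z = 1.0 / (1.0 + np.exp(-net_k))              # z_k = f(net_k)

    delta_k = (t - z) * z * (1.0 - z)             # delta_k = (t_k - z_k) f'(net_k)
    # delta_j = f'(net_j) * sum_k w_kj delta_k  (bias column of W_output excluded)
    delta_j = y * (1.0 - y) * (W_output[:, 1:].T @ delta_k)

    dW_output = eta * np.outer(delta_k, y_aug)    # delta w_kj = eta * delta_k * y_j
    dW_hidden = eta * np.outer(delta_j, x_aug)    # delta w_ji = eta * delta_j * x_i
    return dW_hidden, dW_output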

Page 20: Learning: Calculating the gradient

● Calculating the gradient for the full training set is a sum of the per-pattern gradients, since the cost function is additive:

J = ∑_{p=1}^{n} J_p

● Stopping criterion:

  - e.g., the algorithm terminates when the change in the criterion function J(w) is smaller than some preset value θ

  - or when the cross-validation error goes up

Page 21: (figure)

Page 22: Learning: Calculating the gradient

● Starting with a pseudo-random weight configuration, the stochastic backpropagation algorithm can be written as shown in the box

● The evaluation of J is stochastic to speed up the process

Begin initialize n_H, w, criterion θ, η, m ← 0
  do m ← m + 1
     x^m ← randomly chosen pattern
     w_ji ← w_ji + η δ_j x_i;  w_kj ← w_kj + η δ_k y_j
  until ‖∇J(w)‖ < θ
  return w
End
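A minimal Python sketch of this boxed loop, reusing the hypothetical backprop_updates() from the gradient slides; the gradient-norm test is approximated here by the size of the most recent update, and the max_iter safeguard is an addition, not part of the slide.

import numpy as np

def stochastic_backprop(X, T, n_H, eta, theta, max_iter=100_000, seed=0):
    """X: (n, d) training patterns, T: (n, c) one-hot targets."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    c = T.shape[1]
    W_hidden = rng.normal(scale=0.1, size=(n_H, d + 1))   # pseudo-random initial weights
    W_output = rng.normal(scale=0.1, size=(c, n_H + 1))
    for m in range(1, max_iter + 1):                       # m <- m + 1
        p = rng.integers(n)                                # x^m <- randomly chosen pattern
        dW_h, dW_o = backprop_updates(X[p], T[p], W_hidden, W_output, eta)
        W_hidden += dW_h                                   # w_ji <- w_ji + eta delta_j x_i
        W_output += dW_o                                   # w_kj <- w_kj + eta delta_k y_j
        if np.linalg.norm(dW_h) + np.linalg.norm(dW_o) < theta:
            break                                          # until ||grad J(w)|| < theta (approx.)
    return W_hidden, W_output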

Page 23: Disadvantages of MLP

● Learning by local optimization: the global minimum is not guaranteed

● Time-consuming learning, with many nested loops:
  ● repeat for different numbers of hidden units
  ● random restarts to deal with local minima
  ● iterations of the backpropagation algorithm

● Many parameters, not easy to set for inexperienced users

● The cost function (quadratic loss) is convenient, but it is not the classification error

Page 24: Advantages of MLP

● Handles well problems with multiple classes

● Handles naturally both classification and regression problems (estimating real values)

● After normalization, the output may be interpreted as an a posteriori probability: we know not only the class with the maximum response but also how reliable that decision is
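The slides do not say which normalization is meant; a softmax over the output activations is one common choice (an assumption here) that turns the outputs into values that can be read as posterior probabilities:

import numpy as np

def softmax(net_k):
    """Normalize output activations into a probability-like vector."""
    e = np.exp(net_k - np.max(net_k))    # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 0.5, -1.0]))
print(probs.argmax(), probs.max())       # most likely class and its estimated reliability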

Page 25: Thank you