Machine Learning, Chapter 6: Neural Networks
Xin-Shun Xu, MIMA Group, School of Computer Science and Technology, Shandong University
mima.sdu.edu.cn/Members/xinshunxu/Courses/ML/Chapter6.pdf

Page 1: Title slide

Page 2: Contents

Introduction

Single-Layer Perceptron Networks

Learning Rules for Single-Layer Perceptron Networks

Multilayer Perceptron

Back Propagation Learning Algorithm

Radial-Basis Function Networks

Self-Organizing Maps

Page 3: Introduction

(Artificial) neural networks are computational models that mimic the brain's learning processes.

They have the essential features of neurons and their interconnections as found in the brain.

Typically, a computer is programmed to simulate these features.

Other definitions: a neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:

Knowledge is acquired by the network from its environment through a learning process

Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.

Page 4: Introduction

A neural network is a machine learning approach inspired by the way in which the brain performs a particular learning task:

Knowledge about the learning task is given in the form of examples.

Inter-neuron connection strengths (weights) are used to store the acquired information (the training examples).

During the learning process the weights are modified in order to model the particular learning task correctly on the training examples.

Page 5: Biological Neural Systems

The brain is composed of approximately 100 billion (10^11) neurons.

[Figure: schematic drawing of two biological neurons connected by synapses, with the dendrites, synapse, and axon labeled.]

A typical neuron collects signals from other neurons through a host of fine structures called dendrites.

The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches.

At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurons.

When a neuron receives excitatory input that is sufficiently large compared with its inhibitory input, it sends a spike of electrical activity down its axon.

Learning occurs by changing the effectiveness of the synapses so that the influence of one neuron on another changes.

Page 6: Neuron

[Figure: a biological neuron.]

Page 7: Interconnections between Neurons

[Figure: interconnections between neurons.]

Page 8: Neural Networks

A NN is a machine learning approach inspired by the way in which the brain performs a particular learning task.

Various types of neurons

Various network architectures

Various learning algorithms

Various applications

Page 9: Characteristics of Neural Networks

Large scale and parallel processing

Robust

Self-adaptive and organizing

Capable of modeling non-linear relations

Hardware

Page 10: Historical Background

1943: McCulloch and Pitts proposed the first computational model of the neuron.

1949: Hebb proposed the first learning rule.

1958: Rosenblatt's work on perceptrons.

1969: Minsky and Papert exposed the limitations of the theory.

1970s: Decade of dormancy for neural networks.

1980-90s: Neural networks return (self-organization, back-propagation algorithms, etc.).

Page 11: Applications

Combinatorial Optimization

Pattern Recognition

Bioinformatics

Text processing

Natural language processing

Data Mining

Page 12: Types

Structure

Feed-forward

Feed-back

Learning method

Supervised

Unsupervised

Signal type

Continuous

Discrete

Page 13: The Neuron

The neuron is the basic information processing unit of a NN. It consists of:

1. A set of synapses, or connecting links, each link characterized by a weight: w1, w2, ..., wm.

2. An adder function (linear combiner) that computes the weighted sum of the inputs:

$u = \sum_{j=1}^{m} w_j x_j$

3. An activation function (squashing function) for limiting the amplitude of the output of the neuron:

$y = \varphi(u + b)$

Page 14: The Neuron

[Figure: model of a neuron. The input signals x1, x2, ..., xm are weighted by the synaptic weights w1, w2, ..., wm and fed to the summing function; adding the bias b gives the local field v, which the activation function φ(·) maps to the output y.]

Page 15: Bias of a Neuron

The bias b has the effect of applying an affine transformation to u:

$v = u + b, \qquad u = \sum_{j=1}^{m} w_j x_j$

v is the induced field of the neuron.

[Figure: v plotted against u is a straight line shifted vertically by the bias b.]

Page 16: Bias as Extra Input

The bias is an external parameter of the neuron. It can be modeled by adding an extra input:

$v = \sum_{j=0}^{m} w_j x_j, \qquad x_0 = +1, \quad w_0 = b$

[Figure: the same neuron model, with the bias drawn as an extra input x0 = +1 carrying the weight w0 = b.]
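To make the augmented notation concrete, here is a minimal sketch (Python with NumPy; the specific numbers are made up for illustration) of a single neuron computing v = Σ w_j x_j with x0 = +1 and then applying an activation function:

```python
import numpy as np

def neuron_output(w, x, phi=np.tanh):
    """Single neuron: v = sum_{j=0}^{m} w_j x_j with x0 = +1, output y = phi(v)."""
    x = np.concatenate(([1.0], x))   # prepend x0 = +1, so w[0] plays the role of the bias b
    v = w @ x                        # adder function (linear combiner)
    return phi(v)                    # activation (squashing) function

# illustrative numbers only
w = np.array([0.5, -1.0, 2.0])       # w[0] = b; w[1], w[2] = synaptic weights
x = np.array([0.3, 0.7])
print(neuron_output(w, x))           # tanh(0.5 - 0.3 + 1.4) = tanh(1.6)
```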

Page 17: Activation Function

1. Linear function

$f(x) = ax$

2. Step function

$f(x) = \begin{cases} a_1, & x < c \\ a_2, & x \ge c \end{cases}$

3. Ramp function (saturating linear)

$f(x) = \begin{cases} a_1, & x \le c_1 \\ x, & c_1 < x < c_2 \\ a_2, & x \ge c_2 \end{cases}$

Page 18: Activation Function

4. Logistic function

$f(x) = \dfrac{1}{1 + e^{-x}}$

5. Hyperbolic tangent

$f(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

6. Gaussian function

$f(x) = e^{-(x-\mu)^2 / 2\sigma^2}$
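A quick sketch of the six activation functions listed above, as NumPy one-liners; the parameter names (a, c, μ, σ) mirror the formulas, and the default constants are arbitrary examples:

```python
import numpy as np

def linear(x, a=1.0):                  return a * x
def step(x, a1=0.0, a2=1.0, c=0.0):    return np.where(x < c, a1, a2)
def ramp(x, lo=-1.0, hi=1.0):          return np.clip(x, lo, hi)      # saturating linear
def logistic(x):                       return 1.0 / (1.0 + np.exp(-x))
def hyperbolic_tangent(x):             return np.tanh(x)              # (e^x - e^-x)/(e^x + e^-x)
def gaussian(x, mu=0.0, sigma=1.0):    return np.exp(-(x - mu)**2 / (2 * sigma**2))

v = np.linspace(-3, 3, 7)
print(logistic(v))                     # squashes the real line into (0, 1)
```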

Page 19: Activation function

Activation values are either continuous or discrete:

Continuous: unrestricted (the whole real line, (-∞, +∞)), or restricted to an interval such as [0, 1] or [-1, +1].

Discrete: binary ({0, 1} or {-1, +1}), or multi-valued (e.g., {-1, 0, +1} or {-100, ..., +100}).

Page 20: Perceptron

In 1943, McCulloch and Pitts proposed the first single-neuron model.

Hebb proposed the theory that the learning process arises from changes of the weights between synapses.

Rosenblatt combined the two ideas and proposed the "perceptron".

The perceptron is a single-neuron model, composed of synaptic weights and a threshold.

It is the simplest and earliest neural network model, used for classification.

Page 21: Perceptron

[Figure: the perceptron: the neuron model above with the bias as an extra input and an error signal driving learning.]

$v = \sum_{j=0}^{m} w_j x_j, \qquad x_0 = 1, \quad w_0 = b$

The decision boundary is a plane that passes through the origin of the augmented input space.

Page 22: Perceptron

Given training sets T1 ⊆ C1 and T2 ⊆ C2 with elements of the form x = (x0, x1, x2, ..., xm)^T, where x1, x2, ..., xm ∈ R and x0 = 1.

Assume T1 and T2 are linearly separable.

Find w = (w0, w1, w2, ..., wm)^T such that

$\operatorname{sgn}(\mathbf{w}^T\mathbf{x}) = \begin{cases} +1, & \mathbf{x} \in T_1 \\ -1, & \mathbf{x} \in T_2 \end{cases}$

Page 23: Perceptron

[Figure: labeled samples (d = +1 and d = -1) in the (x1, x2) plane, together with candidate weight vectors w1, w2, ..., w6.]

Which w's correctly classify x?

What trick can be used?

Pages 24-29: Perceptron

Is this w ok?

[Figure sequence: candidate weight vectors w are tested against the labeled samples (d = +1, d = -1); the decision boundary w^T x = 0 is the plane perpendicular to w.]

How to adjust w when a sample x with d = +1 is misclassified (w^T x < 0)? Take

$\Delta\mathbf{w} = \eta\,\mathbf{x}$

Is this reasonable? Yes, because

$(\mathbf{w} + \Delta\mathbf{w})^T\mathbf{x} = \mathbf{w}^T\mathbf{x} + \eta\,\mathbf{x}^T\mathbf{x} > \mathbf{w}^T\mathbf{x}$

since η x^T x > 0, the response moves toward the correct (positive) side.

Symmetrically, when a sample with d = -1 is misclassified (w^T x > 0), take Δw = -ηx, which decreases w^T x.

So upon misclassification, Δw = +ηx or -ηx, according to the desired label.

Pages 30-32: Learning Rule

Upon misclassification:

$\Delta\mathbf{w} = \begin{cases} +\eta\,\mathbf{x}, & d = +1 \\ -\eta\,\mathbf{x}, & d = -1 \end{cases}$

Define the error r = d - y. Then r = +2 or -2 upon misclassification and r = 0 when there is no error, so both cases combine into the single rule

$\Delta\mathbf{w} = \eta\, r\, \mathbf{x}$

with learning rate η, error r = (d - y), and input x.

Page 33: Learning Rule

$\Delta w_i(t) = \eta\, r_i\, x_i(t), \qquad r_i = d_i - y_i$

$\Delta w_i(t) = \eta\,(d_i - y_i)\, x_i(t)$

$d_i - y_i = \begin{cases} 0, & \text{correct} \\ +2, & d_i = +1,\ y_i = -1 \ \text{(incorrect)} \\ -2, & d_i = -1,\ y_i = +1 \ \text{(incorrect)} \end{cases}$

Pages 34-35: Learning Rule

[Figure: the perceptron with the update Δw = η(d - y)x applied to each presented sample x.]

If the given training set is linearly separable, the learning process will converge in a finite number of steps.

How to prove?
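Before turning to the proof, the rule itself can be watched in action. The sketch below (toy, linearly separable data of my own choosing) runs Δw = η(d - y)x on augmented inputs until an epoch passes with no errors:

```python
import numpy as np

def train_perceptron(X, d, eta=0.5, max_epochs=100):
    """X: (p, m) samples; d: desired labels in {-1, +1}. Returns augmented weights."""
    Xa = np.hstack([np.ones((len(X), 1)), X])    # x0 = 1 absorbs the bias
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(Xa, d):
            y = 1 if w @ x >= 0 else -1          # y = sgn(w^T x)
            if y != t:                           # misclassified: w <- w + eta*(d - y)*x
                w += eta * (t - y) * x
                errors += 1
        if errors == 0:                          # finite convergence if separable
            return w
    return w

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])  # toy data
d = np.array([+1, +1, -1, -1])
print(train_perceptron(X, d))
```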

Pages 36-42: The Learning Scenario

[Figure sequence: in the input plane (x1, x2), positive samples x(1), x(2) and negative samples x(3), x(4); the weight vector is adjusted step by step, w0 → w1 → w2 → w3, and w4 = w3 once every sample is classified correctly.]

The demonstration is in augmented space. Conceptually, in augmented space, we adjust the weight vector to fit the data.

Pages 43-46: Weight Space

[Figures: the weight space (w1, w2). For a positive example x, the shaded half-space w^T x > 0 is shown; a misclassified positive example moves the weight by Δw = ηx toward it. For a negative example, the complementary region is required; a misclassified negative example moves the weight by Δw = -ηx.]

A weight in the shaded area will give correct classification for the positive example; a weight not in the shaded area will give correct classification for the negative example.

Pages 47-60: Learning Scenario in Weight Space

To correctly classify the training set, the weight must move into the shaded (feasible) area.

[Figure sequence: with positive samples x(1), x(2) and negative samples x(3), x(4), each misclassification nudges the weight, w0 → w1 → ... → w11; steps where the presented sample is already correct leave the weight unchanged (e.g., w2 = w3).]

Conceptually, in weight space, we move the weight into the feasible region.

Page 61: Least Mean Square Learning

Minimize the cost function (error function):

$E(\mathbf{w}) = \frac{1}{2}\sum_{k=1}^{p}\left(d^{(k)} - y^{(k)}\right)^2 = \frac{1}{2}\sum_{k=1}^{p}\left(d^{(k)} - \mathbf{w}^T\mathbf{x}^{(k)}\right)^2 = \frac{1}{2}\sum_{k=1}^{p}\left(d^{(k)} - \sum_{l=0}^{m} w_l x_l^{(k)}\right)^2$

Page 62: Least Mean Square Learning

Our goal is to go downhill.

[Figure: the error surface E(w) over the weights (w1, w2), and its contour map; the current weight w = (w1, w2) is marked.]

Page 63: Least Mean Square Learning

Our goal is to go downhill.

[Figure: the same error surface E(w) and contour map.]

How to find the steepest descent direction?

Page 64: Least Mean Square Learning

Gradient Operator

Let f(w) = f(w1, w2, ..., wm) be a function over R^m. Then

$df = \frac{\partial f}{\partial w_1}\,dw_1 + \frac{\partial f}{\partial w_2}\,dw_2 + \cdots + \frac{\partial f}{\partial w_m}\,dw_m$

Define

$\Delta\mathbf{w} = (dw_1, dw_2, \ldots, dw_m)^T, \qquad \nabla f = \left(\frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \ldots, \frac{\partial f}{\partial w_m}\right)^T$

so that

$df = \langle \nabla f, \Delta\mathbf{w} \rangle$

Page 65: Least Mean Square Learning

$df = \langle \nabla f, \Delta\mathbf{w} \rangle$

[Figure: a weight change Δw relative to the gradient direction ∇f.]

df positive: go uphill. df zero: plain. df negative: go downhill.

Page 66: Least Mean Square Learning

$df = \langle \nabla f, \Delta\mathbf{w} \rangle$

df positive: go uphill. df zero: plain. df negative: go downhill.

To minimize f, we choose

$\Delta\mathbf{w} = -\eta\,\nabla f$
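A tiny numeric illustration of the choice Δw = -η∇f: a few steepest-descent steps on an arbitrary quadratic bowl f(w) = w1^2 + 4 w2^2 (my own example, not from the slides):

```python
import numpy as np

def grad_f(w):                          # gradient of f(w) = w1^2 + 4*w2^2
    return np.array([2.0 * w[0], 8.0 * w[1]])

w, eta = np.array([3.0, 2.0]), 0.1
for n in range(5):
    w = w - eta * grad_f(w)             # Delta w = -eta * grad f: always downhill
    print(n, w, w[0]**2 + 4 * w[1]**2)  # f decreases at every step
```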

Page 67: Least Mean Square Learning

Minimize the cost function (error function):

$E(\mathbf{w}) = \frac{1}{2}\sum_{k=1}^{p}\left(d^{(k)} - \sum_{l=0}^{m} w_l x_l^{(k)}\right)^2$

$\frac{\partial E(\mathbf{w})}{\partial w_j} = -\sum_{k=1}^{p}\left(d^{(k)} - \sum_{l=0}^{m} w_l x_l^{(k)}\right) x_j^{(k)} = -\sum_{k=1}^{p}\left(d^{(k)} - \mathbf{w}^T\mathbf{x}^{(k)}\right) x_j^{(k)} = -\sum_{k=1}^{p}\left(d^{(k)} - y^{(k)}\right) x_j^{(k)}$

With the error $\delta^{(k)} = d^{(k)} - y^{(k)}$,

$\frac{\partial E(\mathbf{w})}{\partial w_j} = -\sum_{k=1}^{p} \delta^{(k)}\, x_j^{(k)}$

Page 68: Least Mean Square Learning

Minimize the cost function (error function):

$E(\mathbf{w}) = \frac{1}{2}\sum_{k=1}^{p}\left(d^{(k)} - \sum_{l=0}^{m} w_l x_l^{(k)}\right)^2$

$\nabla E(\mathbf{w}) = \left(\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_m}\right)^T, \qquad \frac{\partial E(\mathbf{w})}{\partial w_j} = -\sum_{k=1}^{p} \delta^{(k)} x_j^{(k)}, \quad \delta^{(k)} = d^{(k)} - y^{(k)}$

Weight Modification Rule:

$\Delta\mathbf{w} = -\eta\,\nabla E(\mathbf{w})$

Page 69: Least Mean Square Learning

Learning Modes

Batch Learning Mode:

$\Delta w_j = \eta \sum_{k=1}^{p} \delta^{(k)}\, x_j^{(k)}$

Incremental Learning Mode (one sample at a time):

$\Delta w_j = \eta\, \delta^{(k)}\, x_j^{(k)}$

where $\delta^{(k)} = d^{(k)} - y^{(k)}$.

Page 70: Perceptron

Summary

Separability: some parameter vector classifies the training set perfectly.

Convergence: if the training set is separable, the perceptron will eventually converge (binary case).

The perceptron convergence theorem.

The relation between the perceptron and the Bayes classifier.

Page 71: Multilayer Perceptron

[Figure: a multilayer perceptron: input layer (x1, x2, ..., xm), hidden layer, and output layer (y1, y2, ..., yn).]

Page 72: How an MLP Works?

Example: XOR

[Figure: the four XOR points in the (x1, x2) plane; the two classes cannot be split by a single line.]

Not linearly separable. Is a single-layer perceptron workable?

Page 73: How an MLP Works?

Example: XOR

[Figure: two decision lines L1 and L2 partition the (x1, x2) plane into the regions labeled 00, 01, and 11; a first layer of two perceptrons, with inputs x1, x2 and a constant threshold input x3, produces the corresponding outputs y1 and y2.]

Page 74: How an MLP Works?

Example: XOR

[Figure: in the transformed (y1, y2) space, the images of the XOR points become linearly separable; a single line L3 splits the two classes.]

Page 75: How an MLP Works?

Example: XOR

[Figure: a second-layer perceptron implements L3: it takes y1, y2 and a constant threshold input y3, and outputs z, completing the two-layer XOR network.]

Page 76: Parity Problem

Is the problem linearly separable?

x1 x2 x3 | output
 0  0  0 |   0
 0  0  1 |   1
 0  1  0 |   1
 0  1  1 |   0
 1  0  0 |   1
 1  0  1 |   0
 1  1  0 |   0
 1  1  1 |   1

[Figure: the eight input triples as corners of the unit cube (x1, x2, x3), colored by parity.]

Page 77: Parity Problem

Is the problem linearly separable?

[Figure: the same parity cube; no single plane separates the two classes.]

Page 78: Parity Problem

Is the problem linearly separable?

[Figure: the parity cube with the corners 000, 001, 011, and 111 highlighted.]

Page 79: Parity Problem

Is the problem linearly separable?

[Figure: three perceptrons P1, P2, P3, each reading x1, x2, x3, implement three parallel planes that slice the cube; their outputs are y1, y2, y3.]

Page 80: Parity Problem

[Figure: in the transformed (y1, y2, y3) space, a single perceptron P4 can separate the two parity classes.]

Page 81: Parity Problem

[Figure: the complete parity network: a first layer P1, P2, P3 on inputs x1, x2, x3 produces y1, y2, y3, and an output perceptron P4 combines them into z.]
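The slide gives only the figure, so the following is my reading of the construction: each first-layer perceptron Pk fires when at least k of the three inputs are on, and P4 combines y1, y2, y3 with alternating weights, which reproduces the parity function:

```python
def parity_net(x1, x2, x3):
    s = x1 + x2 + x3
    y1 = 1 if s >= 1 else 0                  # P1: at least one input on
    y2 = 1 if s >= 2 else 0                  # P2: at least two inputs on
    y3 = 1 if s >= 3 else 0                  # P3: all three inputs on
    return 1 if y1 - y2 + y3 >= 0.5 else 0   # P4: alternating weights yield parity

for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            print(x1, x2, x3, '->', parity_net(x1, x2, x3))  # matches the table
```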

Page 82: Back Propagation Learning

Learning on Output Neurons

Learning on Hidden Neurons

General Learning Rule: measure the error, then reduce that error by appropriately adjusting each of the weights in the network.

[Figure: an MLP with inputs x1, x2, ..., xm, outputs o1, o2, ..., on, and desired responses d1, d2, ..., dn.]

Page 83: Back Propagation Learning

Forward Pass:

The error is calculated from the outputs and used to update the output weights.

Backward Pass:

The error at the hidden nodes is calculated by back-propagating the error at the outputs through the new weights; the hidden weights are then updated.

Page 84: Symbols

Subscripts i, j, k denote different neurons: when j is a neuron of the hidden layer, i is on the left (input) side of j and k is on the right (output) side of j.

n is the iteration number.

E(n) is the sum of the instantaneous error energies at the nth iteration; its average is Eav.

ej(n) is the error of the jth neuron at the nth iteration.

dj(n) is the expected (desired) value of the jth neuron at the nth iteration.

Page 85: Symbols

yj(n) is the output of the jth neuron at the nth iteration; if the jth neuron is in the output layer, Oj(n) can be used.

wji(n) is the weight from i to j; its change is Δwji(n).

vj(n) is the internal state (local field) of the jth neuron.

φ(·) is the activation function of the jth neuron.

θj is the threshold of the jth neuron.

xi(n) is the ith element of the input sample.

η is the learning rate.

Page 86: Symbols

[Figure: signal-flow graph highlighting the details of output neuron k connected to hidden neuron j.]

Page 87: Back Propagation Learning

The error of the jth neuron in the output layer is:

$e_j(n) = d_j(n) - y_j(n)$   (BP-1)

The instantaneous error energy E(n) is defined as:

$E(n) = \frac{1}{2}\sum_{j \in C} e_j^2(n)$   (BP-2)

Eav is defined as (N is the number of training samples):

$E_{av} = \frac{1}{N}\sum_{n=1}^{N} E(n)$   (BP-3)

Page 88: Batch vs. On-Line Learning

In batch learning, adjustments to the synaptic weights of the multilayer perceptron are performed after the presentation of all N training samples.

A training pass in which all N samples are presented once is called one epoch of training.

So the cost function for batch learning is defined by the average error energy Eav.

Advantages

Accurate estimation of the gradient vector

Parallelization

Disadvantage

More storage requirements

Page 89: BP Learning Details

At the nth iteration, we can train the network by minimizing E(n). The local field and output of the jth neuron are given by:

$v_j(n) = \sum_{i=0}^{p} w_{ji}(n)\, y_i(n)$   (BP-4)

$y_j(n) = \varphi_j\!\left(v_j(n)\right)$   (BP-5)

Page 90: BP Learning Details

We define the gradient as:

$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial v_j(n)} \cdot \frac{\partial v_j(n)}{\partial w_{ji}(n)}$

According to BP-4,

$\frac{\partial v_j(n)}{\partial w_{ji}(n)} = \frac{\partial}{\partial w_{ji}(n)} \sum_{i=0}^{p} w_{ji}(n)\, y_i(n) = y_i(n)$

We further denote the local gradient

$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)}$

(recall BP-1 and BP-2: $e_j(n) = d_j(n) - y_j(n)$ and $E(n) = \frac{1}{2}\sum_{j \in C} e_j^2(n)$).

Page 91: BP Learning Details

Then the change of wji(n) is:

$\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n)$

where η is the learning rate; thus the weights can be updated as:

$w_{ji}(n+1) = w_{ji}(n) + \Delta w_{ji}(n) = w_{ji}(n) + \eta\, \delta_j(n)\, y_i(n)$

Page 92: BP Learning Details

When neuron j is a neuron in the output layer, according to BP-1 and BP-2 we have:

$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial v_j(n)}$

$-\frac{\partial E(n)}{\partial y_j(n)} = -\frac{\partial}{\partial y_j(n)}\,\frac{1}{2}\sum_{j \in C}\left(d_j(n) - y_j(n)\right)^2 = d_j(n) - y_j(n), \qquad \frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j'\!\left(v_j(n)\right)$

$\delta_j(n) = \left(d_j(n) - y_j(n)\right)\varphi_j'\!\left(v_j(n)\right) = e_j(n)\,\varphi_j'\!\left(v_j(n)\right)$

Page 93: BP Learning Details

When neuron j is in the hidden layer, there is no expected value for us to use. Thus, we use the error propagated back from the neurons connected to it:

Page 94: BP Learning Details

Neurons j and k are connected by wkj.

$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial v_j(n)}, \qquad \frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j'\!\left(v_j(n)\right)$

$\frac{\partial E(n)}{\partial y_j(n)} = \sum_k \frac{\partial E(n)}{\partial v_k(n)} \cdot \frac{\partial v_k(n)}{\partial y_j(n)} = -\sum_k \delta_k(n)\, w_{kj}(n)$

If k is in the output layer,

$\delta_k(n) = e_k(n)\,\varphi_k'\!\left(v_k(n)\right) = \left(d_k(n) - y_k(n)\right)\varphi_k'\!\left(v_k(n)\right)$

Then for neuron j,

$\delta_j(n) = \varphi_j'\!\left(v_j(n)\right)\sum_k \delta_k(n)\, w_{kj}(n)$

and, as before, $\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n)$.

Page 95: BP Learning Details

The previous equations show that we need a differentiable function φ(·), e.g., the sigmoid function:

$y_j(n) = \varphi\!\left(v_j(n)\right) = \frac{1}{1 + \exp\!\left(-v_j(n)\right)}$

$\varphi'\!\left(v_j(n)\right) = \frac{\partial y_j(n)}{\partial v_j(n)} = \frac{\exp\!\left(-v_j(n)\right)}{\left[1 + \exp\!\left(-v_j(n)\right)\right]^2} = y_j(n)\left[1 - y_j(n)\right]$
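Collecting BP-1 and the two delta formulas, here is a minimal one-hidden-layer sketch with sigmoid units (the shapes and random initialization are illustrative assumptions, not the slides' notation): the output delta is e·y(1 - y) and the hidden delta is y(1 - y) times the back-propagated weighted sum of output deltas.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def bp_step(x, d, W_h, W_o, eta=0.5):
    """One forward pass and one backward pass; bias handled via an appended input."""
    xa  = np.append(x, 1.0)                      # input plus bias component
    y_h = sigmoid(W_h @ xa)                      # hidden outputs y_j = phi(v_j)
    ya  = np.append(y_h, 1.0)                    # hidden outputs plus bias component
    y_o = sigmoid(W_o @ ya)                      # network outputs

    e       = d - y_o                            # BP-1: e_j = d_j - y_j
    delta_o = e * y_o * (1.0 - y_o)              # output delta: e_j * phi'(v_j)
    delta_h = y_h * (1.0 - y_h) * (W_o[:, :-1].T @ delta_o)  # back-propagated delta

    W_o += eta * np.outer(delta_o, ya)           # Delta w_ji = eta * delta_j * y_i
    W_h += eta * np.outer(delta_h, xa)
    return W_h, W_o

rng = np.random.default_rng(0)                   # illustrative 2-2-1 network
W_h, W_o = rng.normal(size=(2, 3)), rng.normal(size=(1, 3))
W_h, W_o = bp_step(np.array([1.0, 1.0]), np.array([0.0]), W_h, W_o)
```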

Page 96: BP Learning Details

[Figure: signal-flow graph of a part of the adjoint system pertaining to back-propagation of error signals.]

Page 97: Two Passes of Computation

Forward pass:

$v_j(n) = \sum_{i=0}^{p} w_{ji}(n)\, y_i(n), \qquad y_j(n) = \varphi_j\!\left(v_j(n)\right)$

Backward pass:

This pass starts at the output layer by passing the error signals leftward through the network, layer by layer, and recursively computing the delta (local gradient) for each neuron.

Page 98: Signal-flow graphical summary

Top part of the graph: forward pass.

Bottom part of the graph: backward pass.

Page 99: Learning Rate

The learning rate should not be too large or too small.

In order to avoid the danger of instability, a momentum term can be introduced into the update equation:

$\Delta w_{ji}(n) = \alpha\, \Delta w_{ji}(n-1) + \eta\, \delta_j(n)\, y_i(n)$

Page 100: Learning Rate

If we rewrite the equation as a time series with index t (running from 0 to n), it becomes:

$\Delta w_{ji}(n) = \eta \sum_{t=0}^{n} \alpha^{\,n-t}\, \delta_j(t)\, y_i(t)$

We can rewrite it as:

$\Delta w_{ji}(n) = -\eta \sum_{t=0}^{n} \alpha^{\,n-t}\, \frac{\partial E(t)}{\partial w_{ji}(t)}$

For the time series to be convergent, 0 ≤ |α| < 1.

The sign of the partial derivative can affect the speed and stability: when successive derivatives have the same sign, the momentum term accelerates descent; when they alternate in sign, it has a stabilizing, damping effect.
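As a small sketch, the momentum update above can be carried as a running Δw term (η, α, and the fake gradients below are arbitrary):

```python
import numpy as np

def momentum_update(w, prev_dw, grad, eta=0.1, alpha=0.9):
    """Delta w(n) = alpha * Delta w(n-1) - eta * dE/dw(n); returns (w, Delta w)."""
    dw = alpha * prev_dw - eta * grad        # 0 <= |alpha| < 1 keeps the series convergent
    return w + dw, dw

w, dw = np.zeros(3), np.zeros(3)
for grad in (np.array([1.0, -2.0, 0.5]), np.array([0.8, -1.5, 0.4])):  # fake gradients
    w, dw = momentum_update(w, dw, grad)
print(w)
```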

Page 101: Stopping Criteria

In general, BP cannot be shown to converge, and there are no well-defined criteria for stopping its operation.

However, there are some reasonable criteria that can be used to terminate the weight adjustments, e.g.:

When the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.

When the average squared error per epoch is sufficiently small; usually it is in the range of 0.1 to 1 percent per epoch, or as small as 0.01 percent.

Page 102: XOR Problem Revisiting

[Figure: the two-layer network used in this example: two hidden neurons with weight vectors w1, w2 and one output neuron with weight vector w3, each fed a constant threshold input of -1.]

Page 103: XOR Problem Revisiting

The weights are initialized as (the first component of each vector acts on the constant threshold input -1):

$\mathbf{w}_1(0) = (-1.2, 1, 1)^T, \quad \mathbf{w}_2(0) = (0.3, 1, 1)^T, \quad \mathbf{w}_3(0) = (0.5, 0.4, 0.8)^T, \qquad \eta = 0.5$

When the sample (1, 1) is given to the network:

$y_1 = \frac{1}{1 + \exp\!\left[-\left((-1.2)(-1) + 1 \cdot 1 + 1 \cdot 1\right)\right]} \approx 0.96$

$y_2 = \frac{1}{1 + \exp\!\left[-\left(0.3 \cdot (-1) + 1 \cdot 1 + 1 \cdot 1\right)\right]} \approx 0.84$

$z = \frac{1}{1 + \exp\!\left[-\left(0.5 \cdot (-1) + 0.4 \cdot 0.96 + 0.8 \cdot 0.84\right)\right]} \approx 0.63$

Page 104: XOR Problem Revisiting

We have, for the output neuron (desired output d = 0 for the sample (1, 1)):

$\delta_j(n) = e_j(n)\,\varphi'\!\left(v_j(n)\right) = \left[d_j(n) - O_j(n)\right] O_j(n)\left[1 - O_j(n)\right]$

$\delta_3 = (0 - 0.63) \times 0.63 \times (1 - 0.63) = -0.147$

and for the hidden neurons:

$\delta_j(n) = \varphi'\!\left(v_j(n)\right)\sum_k \delta_k(n)\, w_{kj}(n) = y_j(n)\left[1 - y_j(n)\right]\sum_k \delta_k(n)\, w_{kj}(n)$

$\delta_1 = 0.96 \times (1 - 0.96) \times (-0.147) \times 0.4 = -0.002$

$\delta_2 = 0.84 \times (1 - 0.84) \times (-0.147) \times 0.8 = -0.0158$

Page 105: XOR Problem Revisiting

Then the weights are updated as:

$\mathbf{w}_1(1) = (-1.2, 1, 1)^T + 0.5 \times (-0.002) \times (-1, 1, 1)^T = (-1.199, 0.999, 0.999)^T$

$\mathbf{w}_2(1) = (0.3, 1, 1)^T + 0.5 \times (-0.0158) \times (-1, 1, 1)^T = (0.3079, 0.992, 0.992)^T$

$\mathbf{w}_3(1) = (0.5, 0.4, 0.8)^T + 0.5 \times (-0.147) \times (-1, 0.96, 0.84)^T = (0.5735, 0.329, 0.738)^T$

Finally, after repeated training, we can have:

$\mathbf{w}_1 = (-1.198, 0.912, 1.179)^T, \quad \mathbf{w}_2 = (0.294, 0.826, 0.98)^T, \quad \mathbf{w}_3 = (0.216, 0.384, -0.189)^T$
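These hand calculations can be checked mechanically; the sketch below reproduces the forward pass and the three deltas for the sample (1, 1) with d = 0, using the constant -1 threshold input (small differences in the last digit come from the slide's intermediate rounding):

```python
import numpy as np

def f(v):
    return 1.0 / (1.0 + np.exp(-v))

w1 = np.array([-1.2, 1.0, 1.0])          # weight on the -1 input, then w11, w12
w2 = np.array([ 0.3, 1.0, 1.0])
w3 = np.array([ 0.5, 0.4, 0.8])
d  = 0.0                                 # desired output for XOR(1, 1)

x = np.array([-1.0, 1.0, 1.0])           # threshold input -1, then x1 = x2 = 1
y1, y2 = f(w1 @ x), f(w2 @ x)            # ~0.96, ~0.84
z = f(w3 @ np.array([-1.0, y1, y2]))     # ~0.63

d3 = (d - z) * z * (1 - z)               # ~-0.147
d1 = y1 * (1 - y1) * d3 * w3[1]          # ~-0.002
d2 = y2 * (1 - y2) * d3 * w3[2]          # ~-0.0158
print(y1, y2, z, d3, d1, d2)
```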

Page 106: XOR Problem Revisiting

We can revisit the problem from the view of space transformation: the hidden layer maps the points of the sample space into a new space, i.e., (0,0), (1,1), (1,0) and (0,1) are mapped to (0.768, 0.427), (0.964, 0.819), (0.892, 0.629) and (0.915, 0.665):

$y_1 = \frac{1}{1 + \exp\!\left[-(w_{11}x_1 + w_{12}x_2 + w_{10}x_0)\right]}, \qquad y_2 = \frac{1}{1 + \exp\!\left[-(w_{21}x_1 + w_{22}x_2 + w_{20}x_0)\right]}$

with the constant threshold input $x_0 = -1$.

Page 107: XOR Problem Revisiting

(0, 0) → hidden layer (0.768, 0.427) → output layer z = 0.4997

(1, 1) → hidden layer (0.964, 0.819) → output layer z = 0.4999

(1, 0) → hidden layer (0.892, 0.629) → output layer z = 0.5025

(0, 1) → hidden layer (0.915, 0.665) → output layer z = 0.5020

Page 108: Other Issues

Heuristics for BP learning

Stochastic versus batch update

Maximizing information content

Activation function

Target values

Normalizing the inputs

Initialization

Learning from hints

Learning rates.

Page 109: Stochastic versus batch update

The stochastic mode (pattern-by-pattern updating) is computationally faster than the batch mode.

Especially when the data is large and redundant, it is much better to use the stochastic mode than the batch mode.

Page 110: Maximizing information content

Every training example should be chosen on the basis that its information content is the largest possible for the task at hand.

How to choose?

Use a sample that results in the largest training error.

Use an example that is radically different from all those previously used.

Page 111: Activation function

[Figure: graph of the hyperbolic tangent function φ(v) = a tanh(bv) for a = 1.7159 and b = 2/3. The recommended target values are +1 and -1.]

$\varphi(x) = a\,\frac{e^{bx} - e^{-bx}}{e^{bx} + e^{-bx}} = a \tanh(bx)$

Page 112: Target values

The target values should be within the range of the sigmoid activation function.

Otherwise, the BP algorithm tends to drive the free parameters of the network to infinity, and thereby slows down the learning process by driving the hidden neurons into saturation.

For an activation function with limiting values ±a, choose targets offset from the limits by some ε > 0:

$d_j = a - \varepsilon \qquad \text{and} \qquad d_j = -a + \varepsilon$

Page 113: Normalizing the inputs

Each input variable should be preprocessed

Mean removal

Decorrelation

Covariance equalization

Normalization methods

Min-Max

Z-score standardization
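The two listed normalization methods as column-wise one-liners over a data matrix X (a sketch; mean removal is the centering step inside the z-score):

```python
import numpy as np

def min_max(X):                         # rescale each input variable to [0, 1]
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def z_score(X):                         # zero mean, unit variance per variable
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # toy data
print(min_max(X))
print(z_score(X))
```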

Page 114: Initialization

If the initial weights are too large, the neurons in the network will be driven into saturation.

If they are too small, the BP algorithm will operate on a very flat area around the origin of the error surface.

Page 115: Learning from hints

We can make use of any prior information that we have about the activation function or the data.

Page 116: Learning rates

The learning rate should be assigned a smaller value in the last layers than in the front layers.

Neurons with many inputs should have a smaller learning rate than neurons with few inputs.

An annealing method can be applied.

Page 117: Stopping Criteria

In general, BP cannot be shown to converge, and there are no well-defined criteria for stopping its operation.

However, there are some reasonable criteria that can be used to terminate the weight adjustments, e.g.:

When the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.

When the average squared error per epoch is sufficiently small; usually it is in the range of 0.1 to 1 percent per epoch, or as small as 0.01 percent.

Page 118: Some problems

The number of layers.

The number of hidden-layer neurons.

Kolmogorov's theorem suggests that the number of neurons in the hidden layer can be s = 2m + 1 (m is the number of neurons in the input layer).

Demos.


BP Summary

Strengths of BP learning:
Great representation power
Wide practical applicability
Easy to implement
Good generalization power

Problems of BP learning:
Learning often takes a long time to converge
The net is essentially a black box
Gradient descent only guarantees a local minimum of the error
Not every function that is representable can be learned
Generalization is not guaranteed even if the error is reduced to zero


BP Summary

There is no well-founded way to assess the quality of BP learning
Network paralysis may occur (learning is stopped)
Selection of learning parameters can only be done by trial and error
BP learning is non-incremental (to include new training samples, the network must be re-trained with all old and new samples)


Radial-Basis Functions

A radial basis function (RBF) is a real-valued function whose value depends only on the distance from the origin, so that

\phi(x) = \phi(\| x \|)

or alternatively on the distance from some other point c, called a center, so that

\phi(x) = \phi(\| x - c \|)


Interpolation problem

In its strict sense, the problem can be stated as: given a set of N different points {x_i ∈ R^{m_0}} and a corresponding set of N real numbers {d_i}, find a function F that satisfies the interpolation condition F(x_i) = d_i, i = 1, ..., N.

The Radial-Basis Function (RBF) technique consists of choosing a function F that has the form:

F(x) = \sum_{i=1}^{N} w_i \, \phi(\| x - x_i \|)

How to get the solution?


Radial-Basis Function

Writing the interpolation conditions in matrix form gives

\begin{bmatrix}
\varphi_{11} & \varphi_{12} & \cdots & \varphi_{1N} \\
\varphi_{21} & \varphi_{22} & \cdots & \varphi_{2N} \\
\vdots & \vdots & & \vdots \\
\varphi_{N1} & \varphi_{N2} & \cdots & \varphi_{NN}
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix}
=
\begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix}

where \varphi_{ji} = \phi(\| x_j - x_i \|).

Micchelli's theorem (1986) proves that if the data points are distinct, this interpolation matrix is nonsingular.
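As a numerical sketch of the interpolation just described (assuming a Gaussian basis function; the function names below are my own), one can build the matrix and solve Φw = d directly:

```python
import numpy as np

def gaussian_rbf(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))

def rbf_fit(X, d, sigma=1.0):
    """Solve Phi w = d, where Phi[j, i] = phi(||x_j - x_i||)."""
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.linalg.solve(gaussian_rbf(r, sigma), d)

def rbf_predict(X_train, w, X_new, sigma=1.0):
    r = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=-1)
    return gaussian_rbf(r, sigma) @ w

X = np.array([[0.0], [1.0], [2.0]])   # N = 3 distinct points
d = np.array([0.0, 1.0, 4.0])
w = rbf_fit(X, d)
print(rbf_predict(X, w, X))           # reproduces d at the training points
```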


Radial-Basis Functions

Multiquadrics:
\phi(r) = (r^2 + c^2)^{1/2}

Inverse multiquadrics:
\phi(r) = 1 / (r^2 + c^2)^{1/2}

Gaussian functions:
\phi(r) = \exp\!\left( -\frac{r^2}{2\sigma^2} \right)
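For reference, the three basis functions might be coded as follows (c and sigma are free parameters; the functions are vectorized over r):

```python
import numpy as np

def multiquadric(r, c=1.0):
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    return 1.0 / np.sqrt(r**2 + c**2)

def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))
```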


RBF Networks

Structure of an RBF network, based on interpolation theory.


RBF Networks

The network has three layers:

Input layer, which consists of m_0 source nodes.

Hidden layer, which consists of the same number of computation units as the number of training samples, namely N.

Output layer; there is no restriction on its size.


Modifications to RBF Networks


How to get the k centers

The centers can be computed by unsupervised learning:

K-means
SOM

A clustering algorithm can be used here; e.g., with k-means the centers u_j minimize

\min \sum_{j=1}^{K} \sum_{i \in C_j} \| x_i - u_j \|^2

A small sketch follows below.
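A minimal sketch of Lloyd's k-means algorithm for choosing the centers (the data and parameter values below are made up for illustration):

```python
import numpy as np

def kmeans(X, k, iters=100, rng=np.random.default_rng(0)):
    """Plain Lloyd's algorithm: alternate nearest-center assignment
    and center (mean) update, minimizing sum ||x_i - u_j||^2."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: nearest center for every point
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centers[None, :], axis=-1), axis=1)
        # update step: move each center to the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5.0])
print(kmeans(X, k=2))  # two centers, near (0, 0) and (5, 5)
```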


Self-Organizing Maps

Teuvo Kohonen (1982, 1984)

In biological systems, cells tuned to similar orientations tend to be physically located in proximity to one another (microelectrode studies with cats).

So, the SOM is motivated by a distinct feature of the human brain: the brain is organized in many places in such a way that different sensory inputs are represented by topologically ordered computation maps.


Self-Organizing Maps

Orientation tuning over the surface forms a kind of map, with similar tunings found close to each other: a topographic feature map.

Train a network using competitive learning to create such feature maps automatically.


SOM Clustering

Self-organizing map (SOM):
An unsupervised artificial neural network
Maps high-dimensional data into a one- or two-dimensional representation space
Similar data may be found in neighboring regions

Disadvantages:
Fixed size in terms of the number of units and their particular arrangement
Hierarchical relations between the input data are not mirrored in a straightforward manner


SOM Structure

Two-dimensional lattice of neurons, illustrated for a three-dimensional input and a four-by-four output lattice.



Features

Kohonen's algorithm creates a vector quantizer by adjusting the weights from common input nodes to M output nodes.

Continuous-valued input vectors are presented without specifying the desired output.

After learning, the weights will be organized such that topologically close nodes are sensitive to inputs that are physically similar.

Output nodes will thus be ordered in a natural manner.


Initial setup of SOM

The map consists of a set of units i arranged in a two-dimensional grid.

Each unit i is assigned a weight vector m_i of the same dimension as the input data.

The initial weight vectors are assigned random values.


Essential processes in SOM

Competition process:
Find the best match of the input vector x among the synaptic-weight vectors.

Cooperation process:
Determine the topological neighborhood centered on the winning neuron, decaying smoothly with lateral distance.

Synaptic adaptation:
Update the weights of the corresponding neurons in relation to the input vector.


Competition process

Winner selection:

Initially, pick a random input vector x(t).

Compute the unit with the highest activity level (the winner i(x)) via the minimum Euclidean distance. With

x = [x_1, x_2, \ldots, x_m]^T

w_j = [w_{j1}, w_{j2}, \ldots, w_{jm}]^T, \quad j = 1, 2, \ldots, l

the winner is

i(x) = \arg\min_j \| x - w_j \|, \quad j = 1, 2, \ldots, l

Neuron i is called the best-matching, or winning, neuron for the input vector; the snippet below shows this step in code.
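In code, the competition step reduces to a single argmin over Euclidean distances (a sketch of mine, with W holding one weight vector per row):

```python
import numpy as np

def winner(x, W):
    """Index of the best-matching neuron: argmin_j ||x - w_j||."""
    return int(np.argmin(np.linalg.norm(x - W, axis=1)))

W = np.random.default_rng(0).random((16, 3))  # l = 16 neurons, m = 3 inputs
x = np.array([0.2, 0.5, 0.9])
print(winner(x, W))
```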


Cooperative process

A neuron that is firing tends to excite the neurons in its immediate neighborhood more than those farther away from it.

Let h_{j,i} denote the topological neighborhood function centered on the winning neuron i and encompassing a set of excited neurons.

Let d_{j,i} denote the lateral distance between the winning neuron i and the excited neuron j.


Cooperative process

h_{j,i} and d_{j,i} should satisfy two distinct requirements:

h_{j,i} is symmetric about the maximum point defined by d_{j,i} = 0; in other words, it attains its maximum value at the winning neuron i, for which the distance d_{j,i} is zero.

The amplitude of the topological neighborhood h_{j,i} decreases monotonically with increasing lateral distance d_{j,i}, decaying to zero as d_{j,i} → ∞; this is a necessary condition for convergence.


Cooperative process

A good choice of h_{j,i} that satisfies these requirements is the Gaussian function:

h_{j,i} = \exp\!\left( -\frac{d_{j,i}^2}{2\sigma^2} \right)

where, for a one-dimensional lattice, d_{j,i} = |j - i|, and for a two-dimensional lattice, d_{j,i} = \| r_j - r_i \|, with r_j the position of excited neuron j in the lattice. A small coded version follows.
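A minimal coding of this neighborhood function for a two-dimensional lattice (my own sketch; lattice positions are passed as coordinate pairs):

```python
import numpy as np

def neighborhood(r_j, r_i, sigma):
    """Gaussian topological neighborhood h_ji for lattice positions r_j, r_i."""
    d2 = np.sum((np.asarray(r_j) - np.asarray(r_i))**2)
    return np.exp(-d2 / (2 * sigma**2))

print(neighborhood((0, 0), (0, 0), sigma=1.0))  # 1.0 at the winner
print(neighborhood((2, 1), (0, 0), sigma=1.0))  # decays with lateral distance
```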


Cooperative process

Another unique feature of the SOM algorithm is that the size of the topological neighborhood is permitted to shrink with time.

This requirement can be satisfied by making the width σ of the topological neighborhood function h_{j,i} decrease with time. A popular choice for the dependence of σ on the discrete time n is the exponential decay:

\sigma(n) = \sigma_0 \exp\!\left( -\frac{n}{\tau_1} \right)


Cooperative process

Correspondingly, the topological neighborhood function assumes a time-varying form of its own, as follows:

h_{j,i}(n) = \exp\!\left( -\frac{d_{j,i}^2}{2\sigma^2(n)} \right)


Adaptive process

In this stage, the synaptic weights are required to change in relation to the input vector:

w_j(n+1) = w_j(n) + \Delta w_j(n)

where the last term can be calculated using the following equation:

\Delta w_j(n) = \eta(n) \, h_{j,i(x)}(n) \, (x(n) - w_j(n))

The learning rate η(n) should also be time-varying:

\eta(n) = \eta_0 \exp\!\left( -\frac{n}{\tau_2} \right)



Weight update

Illustration of the relationship between the feature map Φ and the weight vector w_i of the winning neuron i.


Learning Process (Adaptation)


SOM Learning Summary

1. Initialization
Choose random values for the weights, or initialize them with randomly chosen input vectors.

2. Sampling
Draw a sample x from the input space.

3. Similarity matching
Find the best-matching neuron:
i(x) = \arg\min_j \| x - w_j \|, \quad j = 1, 2, \ldots, l

4. Updating
\Delta w_j(n) = \eta(n) \, h_{j,i(x)}(n) \, (x(n) - w_j(n))

5. Continuation
Return to step 2 and repeat until no noticeable changes in the feature map are observed. A full training-loop sketch follows below.
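Putting the five steps together, a compact and purely illustrative NumPy implementation of the training loop might look like this (the grid size, decay constants, and data are arbitrary choices of mine):

```python
import numpy as np

def train_som(X, grid=(4, 4), epochs=20,
              eta0=0.1, sigma0=2.0, tau1=10.0, tau2=20.0,
              rng=np.random.default_rng(0)):
    """Sketch of Kohonen SOM training: competition, cooperation, adaptation.
    Decay schedules: sigma(n) = sigma0*exp(-n/tau1), eta(n) = eta0*exp(-n/tau2),
    with n counted in epochs here."""
    rows, cols = grid
    # 1. Initialization: one random weight vector per lattice unit
    W = rng.random((rows * cols, X.shape[1]))
    # lattice position r_j of every unit
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)], float)

    for n in range(epochs):
        sigma = sigma0 * np.exp(-n / tau1)   # shrinking neighborhood width
        eta = eta0 * np.exp(-n / tau2)       # annealed learning rate
        for x in rng.permutation(X):                         # 2. Sampling
            i = np.argmin(np.linalg.norm(x - W, axis=1))     # 3. Matching
            d2 = np.sum((pos - pos[i]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))               # neighborhood
            W += eta * h[:, None] * (x - W)                  # 4. Updating
        # 5. Continuation: next epoch until the map stabilizes
    return W

X = np.random.default_rng(1).random((200, 3))
W = train_som(X)
print(W.shape)  # (16, 3): one prototype per lattice unit
```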


Neighborhoods


Applications

Optimization problems

Clustering problems

Pattern recognition

Others


Applications

Animal names and their attributes; after training, a grouping according to similarity emerges on the map (birds, peaceful animals, hunters):

               Dove Hen Duck Goose Owl Hawk Eagle Fox Dog Wolf Cat Tiger Lion Horse Zebra Cow
is    Small      1   1    1     1   1    1     0   0   0    0   1     0    0     0     0   0
      Medium     0   0    0     0   0    0     1   1   1    1   0     0    0     0     0   0
      Big        0   0    0     0   0    0     0   0   0    0   0     1    1     1     1   1
has   2 legs     1   1    1     1   1    1     1   0   0    0   0     0    0     0     0   0
      4 legs     0   0    0     0   0    0     0   1   1    1   1     1    1     1     1   1
      Hair       0   0    0     0   0    0     0   1   1    1   1     1    1     1     1   1
      Hooves     0   0    0     0   0    0     0   0   0    0   0     0    0     1     1   1
      Mane       0   0    0     0   0    0     0   0   0    1   0     0    1     1     1   0
      Feathers   1   1    1     1   1    1     1   0   0    0   0     0    0     0     0   0
likes Hunt       0   0    0     0   1    1     1   1   0    1   1     1    1     0     0   0
to    Run        0   0    0     0   0    0     0   0   1    1   0     1    1     1     1   0
      Fly        1   0    0     1   1    1     1   0   0    0   0     0    0     0     0   0
      Swim       0   0    1     1   0    0     0   0   0    0   0     0    0     0     0   0


Application to PR

Features (the feature vector counts stroke primitives in each letter):

Letter  Diagonal  Top bar  Middle bar  Bottom bar  Vertical  Upper arc  Lower arc  Left arc  Right arc
A           2        0         1           0          0         0          0         0          0
B           0        0         0           0          1         0          0         0          2
C           0        0         0           0          0         0          0         1          0
D           0        0         0           0          1         0          0         0          1
E           0        1         1           1          1         0          0         0          0
F           0        1         1           0          1         0          0         0          0
H           0        1         0           0          2         0          0         0          0
K           2        0         0           0          1         0          0         0          0
P           0        0         0           0          1         0          0         0          1
T           0        1         0           0          1         0          0         0          0
U           0        0         0           0          2         1          0         0          0


Applications

Demos


Any questions?