M L D M Machine Learningmima.sdu.edu.cn/Members/xinshunxu/Courses/ML/Chapter6.pdf · Chapter 6...
MIMA Group
Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University
Machine Learning
Chapter 6
Neural Networks
Contents
Introduction
Single-Layer Perceptron Networks
Learning Rules for Single-Layer Perceptron Networks
Multilayer Perceptron
Back Propagation Learning Algorithm
Radial-Basis Function Networks
Self-Organizing Maps
Introduction
(Artificial) neural networks are computational models that mimic the brain's learning processes.
They have the essential features of neurons and their interconnections as found in the brain.
Typically, a computer is programmed to simulate these features.
Other definitions … A neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:
Knowledge is acquired by the network from its environment through a learning process.
Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.
Introduction
A neural network is a machine learning
approach inspired by the way in which the brain
performs a particular learning task:
Knowledge about the learning task is given in the
form of examples.
Interneuron connection strengths (weights) are used
to store the acquired information (the training
examples).
During the learning process the weights are modified
in order to model the particular learning task correctly
on the training examples.
Biological Neural Systems
The brain is composed of approximately 100 billion ($10^{11}$) neurons.
[Figure: schematic drawing of two biological neurons connected by synapses, with the dendrites, synapse, and axon labeled.]
A typical neuron collects signals from other
neurons through a host of fine structures called
dendrites.
The neuron sends out spikes of electrical activity
through a long, thin strand known as an axon,
which splits into thousands of branches.
At the end of the branch, a structure called a
synapse converts the activity from the axon into
electrical effects that inhibit or excite activity in
the connected neurons.
When a neuron receives excitatory input that is
sufficiently large compared with its inhibitory
input, it sends a spike of electrical activity down
its axon.
Learning occurs by changing the effectiveness of the synapses so that the
influence of one neuron on the other changes.
[Figure: Neuron]
[Figure: Interconnections between Neurons]
Neural Networks
A NN is a machine learning approach inspired
by the way in which the brain performs a
particular learning task
Various types of neurons
Various network architectures
Various learning algorithms
Various applications
Characteristics of Neural Networks
Large scale and parallel processing
Robust
Self-adaptive and organizing
Good enough to simulate non-linear relations
Hardware
Historical Background
1943 McCulloch and Pitts proposed the first computational model of a neuron.
1949 Hebb proposed the first learning rule.
1958 Rosenblatt's work on perceptrons.
1969 Minsky and Papert exposed the limitations of the theory.
1970s Decade of dormancy for neural networks.
1980s-90s Return of neural networks (self-organization, back-propagation algorithms, etc.)
Applications
Combinatorial Optimization
Pattern Recognition
Bioinformatics
Text processing
Natural language processing
Data Mining
…
Types
Structure
Feed-forward
Feed-back
Learning method
Supervised
Unsupervised
Signal type
Continuous
Discrete
The Neuron
The neuron is the basic information processing unit of a NN. It consists of:
1 A set of synapses or connecting links, each link characterized by a weight: $w_1, w_2, \ldots, w_m$
2 An adder function (linear combiner) which computes the weighted sum of the inputs: $u = \sum_{j=1}^{m} w_j x_j$
3 An activation function (squashing function) for limiting the amplitude of the output of the neuron: $y = \varphi(u + b)$
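The three components above can be sketched in a few lines of Python. This is only an illustration: the function names, the example weights, and the choice of a logistic activation are assumptions, not part of the slides.

```python
import math

def logistic(v):
    """One possible squashing function (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(x, w, b, phi):
    """Return y = phi(u + b), where u is the weighted sum of the inputs."""
    u = sum(wj * xj for wj, xj in zip(w, x))  # adder (linear combiner)
    return phi(u + b)                          # activation limits the amplitude

# Hypothetical inputs, weights, and bias: u = 0.4*1.0 + (-0.2)*0.5 = 0.3
y = neuron_output(x=[1.0, 0.5], w=[0.4, -0.2], b=0.1, phi=logistic)
```

Passing the activation as a parameter keeps the linear combiner separate from the squashing function, mirroring the three-part description.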
The Neuron
[Figure: model of a neuron. The input signals $x_1, x_2, \ldots, x_m$ are multiplied by the synaptic weights $w_1, w_2, \ldots, w_m$ and combined with the bias $b$ by the summing function to give the local field $v$; the activation function $\varphi(\cdot)$ then produces the output $y$.]
Bias of a Neuron
Bias $b$ has the effect of applying an affine transformation to $u$:
$v = u + b$, where $u = \sum_{j=1}^{m} w_j x_j$
$v$ is the induced field of the neuron.
[Figure: plot of $v$ against $u$.]
Bias as Extra Input
Bias is an external parameter of the neuron. It can be modeled by adding an extra input $x_0 = 1$ with weight $w_0 = b$:
$v = \sum_{j=0}^{m} w_j x_j$
[Figure: model of a neuron with the extra input $x_0 = 1$ and weight $w_0$ alongside the inputs $x_1, x_2, \ldots, x_m$ and weights $w_1, w_2, \ldots, w_m$; the summing function gives the local field $v$, and the activation function $\varphi(\cdot)$ gives the output $y$.]
Activation Function
1. Linear function: $f(x) = ax$
2. Step function: $f(x) = a_1$ if $x \ge \theta$, and $f(x) = a_2$ if $x < \theta$
3. Ramp function: $f(x) = \alpha$ if $x \ge \theta$; $f(x) = x$ if $-\theta < x < \theta$; $f(x) = -\alpha$ if $x \le -\theta$
4. Logistic function: $f(x) = \dfrac{1}{1 + e^{-x}}$
5. Hyperbolic tangent: $f(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
6. Gaussian function: $f(x) = e^{-x^{2}/\sigma^{2}}$
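As a quick illustration, the three smooth activation functions can be written directly from their formulas. The function names and the default $\sigma = 1$ are assumptions made for this sketch.

```python
import math

def logistic(x):
    """Logistic function: output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def hyperbolic_tangent(x):
    """Hyperbolic tangent written out from its definition: output in (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def gaussian(x, sigma=1.0):
    """Gaussian function exp(-x^2 / sigma^2); sigma=1.0 is a hypothetical default."""
    return math.exp(-(x * x) / (sigma * sigma))
```

The logistic and hyperbolic tangent functions are both sigmoidal and differ mainly in their output range, which matters when choosing target values later.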
Activation function
Types of activation values:
continuous
real (unrestricted): $(-\infty, +\infty)$
interval: $[0, 1]$ or $[-1, +1]$
discrete
binary: $\{0, 1\}$ or $\{-1, +1\}$
multi value: $\{-1, 0, +1\}$ or $\{-100, \ldots, +100\}$
Perceptron
In 1943, McCulloch and Pitts proposed the first single neuron model.
Hebb proposed the theory that learning arises from changes in the weights of the synapses between neurons.
Rosenblatt combined the two and proposed the "Perceptron".
The perceptron is just a single-neuron model, composed of synaptic weights and a threshold.
It is the simplest and earliest neural network model, used for classification.
[Figure: perceptron with the extra input $x_0 = 1$ and weight $w_0 = b$, inputs $x_1, x_2, \ldots, x_m$, weights $w_1, w_2, \ldots, w_m$, local field $v = \sum_{j=0}^{m} w_j x_j$, activation function $\varphi(\cdot)$, output $y$, and an error signal fed back to the weights.]
A plane passes through the origin in the augmented input space.
Given training sets $T_1 \subset C_1$ and $T_2 \subset C_2$ with elements of the form $x = (x_0, x_1, x_2, \ldots, x_m)^T$, where $x_1, x_2, \ldots, x_m \in R$ and $x_0 = 1$.
Assume $T_1$ and $T_2$ are linearly separable.
Find $w = (w_0, w_1, w_2, \ldots, w_m)^T$ such that
$\mathrm{sgn}(w^T x) = +1$ if $x \in T_1$, and $\mathrm{sgn}(w^T x) = -1$ if $x \in T_2$
[Figure: positive ($d = +1$) and negative ($d = -1$) examples in the $(x_1, x_2)$ plane, with candidate weight vectors $w_1, \ldots, w_6$ and a sample $x$.]
Which w's correctly classify x?
What trick can be used?
Is this w ok?
[Figure sequence: positive and negative examples, a weight vector $w$, and a sample $x$, with $w^T x$ compared against 0.]
How to adjust w?
Try $\Delta w = \eta x$: $(w + \eta x)^T x = w^T x + \eta x^T x$, with $w^T x < 0$ and $\eta x^T x > 0$. Reasonable?
Try $\Delta w = -\eta x$: $(w - \eta x)^T x = w^T x - \eta x^T x$, with $w^T x > 0$ and $-\eta x^T x < 0$. Reasonable?
$\Delta w = ?$ Either $+\eta x$ or $-\eta x$.
Learning Rule
Upon misclassification on
a positive example ($d = +1$): $\Delta w = \eta x$
a negative example ($d = -1$): $\Delta w = -\eta x$
Define the error $r = d - y$:
$r = +2$ if $d = +1$ and $y = -1$ (incorrect)
$r = -2$ if $d = -1$ and $y = +1$ (incorrect)
$r = 0$ if $d = y$ (correct, no error)
The update can then be written as $\Delta w = \eta r x$, with learning rate $\eta$, error $r = d - y$, and input $x$.
$\Delta w_i(t) = \eta r_i x_i(t)$, with $r_i = d_i - y_i$, so that $\Delta w_i(t) = \eta (d_i - y_i) x_i(t)$
[Figure: perceptron with input $x$, output $y$, desired output $d$, and weight update $\Delta w = \eta (d - y) x$.]
If the given training set is linearly separable, the learning
process will converge in a finite number of steps.
How to prove?
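The learning rule $\Delta w = \eta (d - y) x$ can be sketched as a short training loop. This is a minimal illustration rather than the course's implementation: the toy data set (a logical AND with targets in $\{+1, -1\}$), the zero initial weights, the convention $\mathrm{sgn}(0) = +1$, and the `max_epochs` cap are all assumptions.

```python
def sgn(v):
    """Sign function with the (assumed) convention sgn(0) = +1."""
    return 1 if v >= 0 else -1

def train_perceptron(samples, eta=0.5, max_epochs=100):
    """samples: list of (x, d) with x augmented (x[0] == 1) and d in {+1, -1}."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        errors = 0
        for x, d in samples:
            y = sgn(sum(wi * xi for wi, xi in zip(w, x)))
            if y != d:
                # delta_w = eta * (d - y) * x; (d - y) is +2 or -2 on a mistake
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:  # every training example is classified correctly
            break
    return w

# Hypothetical linearly separable data: logical AND in augmented form
samples = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
w = train_perceptron(samples)
```

Because the toy data are linearly separable, the loop stops after a finite number of updates, which is the behavior the convergence statement above describes.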
The Learning Scenario
[Figure sequence: four training points $x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}$ in the $(x_1, x_2)$ plane, with $x^{(1)}$ and $x^{(2)}$ marked +. Starting from $w_0$, the weight vector is updated step by step to $w_1$, $w_2$, $w_3$, with $w_4 = w_3$, ending at the final $w$.]
The demonstration is in augmented space.
Conceptually, in augmented space, we
adjust the weight vector to fit the data.
Weight Space
[Figure: weight space $(w_1, w_2)$ with a positive example $x$, a weight vector $w$, and a shaded half-plane.]
A weight in the shaded area will give correct classification for the positive example. The corresponding update is $\Delta w = \eta x$.
[Figure: weight space $(w_1, w_2)$ with a negative example $x$, a weight vector $w$, and a shaded half-plane.]
A weight not in the shaded area will give correct classification for the negative example. The corresponding update is $\Delta w = -\eta x$.
Learning Scenario in Weight Space
[Figure sequence: weight space $(w_1, w_2)$ with the constraints from the four training examples $x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}$ and a shaded feasible area. The weight moves from $w_0$ through $w_1$, $w_2 = w_3$, $w_4, \ldots, w_{10}$ to $w_{11}$.]
To correctly classify the training set, the weight must move into the shaded area.
Conceptually, in weight space, we move the weight into the feasible region.
Least Mean Square Learning
Minimize the cost function (error function):
$$E(w) = \frac{1}{2}\sum_{k=1}^{p}\left(d^{(k)} - y^{(k)}\right)^2 = \frac{1}{2}\sum_{k=1}^{p}\left(d^{(k)} - w^T x^{(k)}\right)^2 = \frac{1}{2}\sum_{k=1}^{p}\left(d^{(k)} - \sum_{l=1}^{m} w_l x_l^{(k)}\right)^2$$
Our goal is to go downhill.
[Figure: error surface $E(w)$ over $(w_1, w_2)$ and its contour map, with a step $\Delta w$ taken from the point $(w_1, w_2)$.]
How to find the steepest descent direction?
Gradient Operator
Let $f(w) = f(w_1, w_2, \ldots, w_m)$ be a function over $R^m$.
$$df = \frac{\partial f}{\partial w_1} dw_1 + \frac{\partial f}{\partial w_2} dw_2 + \cdots + \frac{\partial f}{\partial w_m} dw_m$$
Define
$$\Delta w = (dw_1, dw_2, \ldots, dw_m)^T, \qquad \nabla f = \left(\frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \ldots, \frac{\partial f}{\partial w_m}\right)^T$$
so that $df = \langle \nabla f, \Delta w \rangle$.
[Figure: three cases for the angle between $\nabla f$ and $\Delta w$.]
$df$ positive: go uphill. $df$ zero: plain. $df$ negative: go downhill.
To minimize $f$, we choose $\Delta w = -\eta \nabla f$.
Minimize the cost function (error function):
$$E(w) = \frac{1}{2}\sum_{k=1}^{p}\left(d^{(k)} - \sum_{l=1}^{m} w_l x_l^{(k)}\right)^2$$
Differentiating with respect to $w_j$:
$$\frac{\partial E(w)}{\partial w_j} = -\sum_{k=1}^{p}\left(d^{(k)} - \sum_{l=1}^{m} w_l x_l^{(k)}\right) x_j^{(k)} = -\sum_{k=1}^{p}\left(d^{(k)} - w^T x^{(k)}\right) x_j^{(k)} = -\sum_{k=1}^{p}\left(d^{(k)} - y^{(k)}\right) x_j^{(k)}$$
With $\delta^{(k)} = d^{(k)} - y^{(k)}$:
$$\frac{\partial E(w)}{\partial w_j} = -\sum_{k=1}^{p} \delta^{(k)} x_j^{(k)}$$
The gradient of the cost function is
$$\nabla_w E(w) = \left(\frac{\partial E(w)}{\partial w_1}, \frac{\partial E(w)}{\partial w_2}, \ldots, \frac{\partial E(w)}{\partial w_m}\right)^T$$
Weight Modification Rule: $\Delta w = -\eta \nabla_w E(w)$
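Learning Modes
Batch Learning Mode: $\Delta w_j = \eta \sum_{k=1}^{p} \delta^{(k)} x_j^{(k)}$
Incremental Learning Mode: $\Delta w_j = \eta \delta^{(k)} x_j^{(k)}$
Here $\delta^{(k)} = d^{(k)} - y^{(k)}$ and $\dfrac{\partial E(w)}{\partial w_j} = -\sum_{k=1}^{p} \delta^{(k)} x_j^{(k)}$.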
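The two learning modes differ only in when the weights are changed: once per pass over all $p$ examples, or after every example. The sketch below shows both for a linear unit $y = w^T x$. The data set (targets generated from $d = 1 + 2x_1$), the learning rate, and the number of epochs are assumptions chosen so that both loops converge.

```python
def predict(w, x):
    """Linear unit: y = w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))

def lms_batch(samples, eta=0.1, epochs=500):
    """Batch mode: accumulate delta * x over all examples, then update once."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        step = [0.0] * len(w)
        for x, d in samples:
            delta = d - predict(w, x)
            step = [s + delta * xj for s, xj in zip(step, x)]
        w = [wj + eta * s for wj, s in zip(w, step)]
    return w

def lms_incremental(samples, eta=0.1, epochs=500):
    """Incremental mode: update the weights after every single example."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, d in samples:
            delta = d - predict(w, x)
            w = [wj + eta * delta * xj for wj, xj in zip(w, x)]
    return w

# Hypothetical noise-free data in augmented form (x0 = 1): d = 1 + 2 * x1
samples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w_batch = lms_batch(samples)
w_incr = lms_incremental(samples)
```

On this noise-free data both modes approach the same weights, $(1, 2)$; with noisy or redundant data their trajectories would differ.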
Perceptron
Summary
Separability: some parameters get the training set
perfectly correct
Convergence: if the training set is separable, the perceptron
will eventually converge (binary case)?
The Perceptron convergence theorem
The relation between perceptron and Bayes
classifier
Multilayer Perceptron
[Figure: multilayer perceptron with an input layer $x_1, x_2, \ldots, x_m$, a hidden layer, and an output layer $y_1, y_2, \ldots, y_n$.]
How an MLP Works?
Example: XOR
[Figure: the four XOR inputs in the $(x_1, x_2)$ plane.]
Not linearly separable.
Is a single layer perceptron workable?
[Figure: two lines $L_1$ and $L_2$ split the $(x_1, x_2)$ plane into three regions. Two hidden neurons with inputs $x_1$, $x_2$, and $x_3 = 1$ implement $L_1$ and $L_2$ and output $y_1$ and $y_2$, so that the regions receive the codes 00, 01, and 11.]
[Figure: in the $(y_1, y_2)$ plane, a single line $L_3$ separates the two classes.]
[Figure: the resulting network. The inputs $x_1$, $x_2$, and $x_3 = 1$ feed $L_1$ and $L_2$; their outputs $y_1$, $y_2$, together with $y_3 = 1$, feed $L_3$, which produces the output $z$.]
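The construction can be checked with a small hand-wired network of threshold units. The specific weights and thresholds below are assumptions picked for illustration (the slides only show the lines $L_1$, $L_2$, $L_3$ graphically); they are chosen so that the hidden codes are 00, 01, and 11, as in the figure.

```python
def step(v):
    """Threshold unit: 1 if the local field is non-negative, else 0."""
    return 1 if v >= 0 else 0

def xor_net(x1, x2):
    """Two hidden threshold units followed by one output threshold unit."""
    y1 = step(x1 + x2 - 1.5)   # hypothetical L1: fires only for (1, 1)
    y2 = step(x1 + x2 - 0.5)   # hypothetical L2: fires unless input is (0, 0)
    z = step(y2 - y1 - 0.5)    # hypothetical L3: separates code 01 from 00 and 11
    return (y1, y2), z

codes = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
```

The hidden layer maps the four inputs onto only three points, and those three points are linearly separable, which is why a single output unit suffices.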
Parity Problem
Is the problem linearly separable?
x1 x2 x3 | parity
0  0  0  | 0
0  0  1  | 1
0  1  0  | 1
0  1  1  | 0
1  0  0  | 1
1  0  1  | 0
1  1  0  | 0
1  1  1  | 1
[Figure: the eight inputs as corners of a cube in $(x_1, x_2, x_3)$ space, cut by three planes $P_1$, $P_2$, $P_3$. Three hidden neurons implement the planes and output $y_1$, $y_2$, $y_3$, so that the regions receive the codes 000, 001, 011, and 111.]
[Figure: in $(y_1, y_2, y_3)$ space, a single plane $P_4$ separates the two classes.]
[Figure: the resulting network. The inputs $x_1$, $x_2$, $x_3$ feed $P_1$, $P_2$, $P_3$; their outputs $y_1$, $y_2$, $y_3$ feed $P_4$, which produces the output $z$.]
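A hand-wired version of this network makes the idea concrete. The thresholds and output weights below are assumptions for illustration; they are chosen so that the hidden codes are 000, 001, 011, and 111, matching the labels in the figure.

```python
from itertools import product

def step(v):
    """Threshold unit: 1 if the local field is non-negative, else 0."""
    return 1 if v >= 0 else 0

def parity_net(x1, x2, x3):
    """Three parallel hidden planes count the ones; one output plane reads the code."""
    s = x1 + x2 + x3
    y1 = step(s - 2.5)              # hypothetical P1: at least three ones
    y2 = step(s - 1.5)              # hypothetical P2: at least two ones
    y3 = step(s - 0.5)              # hypothetical P3: at least one one
    z = step(y1 - y2 + y3 - 0.5)    # hypothetical P4: fires for codes 001 and 111
    return (y1, y2, y3), z

results = {bits: parity_net(*bits) for bits in product((0, 1), repeat=3)}
```

The hidden layer collapses the eight cube corners onto four codes ordered by the number of ones, and the alternating output weights pick out the odd counts.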
Back Propagation Learning
Learning on Output Neurons
Learning on Hidden Neurons
General Learning Rule
Measure error
Reduce that error by appropriately adjusting each of the weights in the network
[Figure: multilayer network with inputs $x_1, x_2, \ldots, x_m$, outputs $o_1, o_2, \ldots, o_n$, and desired outputs $d_1, d_2, \ldots, d_n$.]
Back Propagation Learning
Forward Pass:
Error is calculated from outputs
Used to update output weights
Backward Pass:
Error at hidden nodes is calculated by back
propagating the error at the outputs through the new
weights
Hidden weights updated
Symbols
Subscripts $i$, $j$, $k$ represent different neurons; when $j$ is a neuron in the hidden layer, $i$ is on the left side of $j$ and $k$ is on the right side of $j$.
$n$ is the iteration number.
$E(n)$ is the sum of the instantaneous error energy at the $n$th iteration; its average is $E_{av}$.
$e_j(n)$ is the error of the $j$th neuron at the $n$th iteration.
$d_j(n)$ is the expected value of the $j$th neuron at the $n$th iteration.
$y_j(n)$ is the output of the $j$th neuron at the $n$th iteration; if the $j$th neuron is in the output layer, $O_j(n)$ can be used instead.
$w_{ji}(n)$ is the weight from $i$ to $j$; its change is $\Delta w_{ji}(n)$.
$v_j(n)$ is the internal state of the $j$th neuron.
$\varphi_j(\cdot)$ is the activation function of the $j$th neuron.
$\theta_j$ is the threshold of the $j$th neuron.
$x_i(n)$ is the $i$th element of the input sample.
$\eta$ is the learning rate.
[Figure: signal-flow graph highlighting the details of output neuron $k$ connected to hidden neuron $j$.]
Back Propagation Learning
The error function of the $j$th neuron in the output layer is:
$e_j(n) = d_j(n) - y_j(n)$   (BP-1)
The instantaneous error $E(n)$ is defined as:
$E(n) = \dfrac{1}{2}\sum_{j \in C} e_j^2(n)$   (BP-2)
$E_{av}$ is defined as ($N$ is the number of training samples):
$E_{av} = \dfrac{1}{N}\sum_{n=1}^{N} E(n)$   (BP-3)
Batch vs. On-Line Learning
In batch learning, adjustments to the synaptic weights of the multilayer perceptron are performed after the presentation of all $N$ training samples.
One presentation of all $N$ samples is called one epoch of training.
So the cost function for batch learning is defined by the average error energy $E_{av}$.
Advantages
Accurate estimation of the gradient vector
Parallelization
Disadvantage
More storage requirements
BP Learning Details
At the $n$th iteration, we can train the network by minimizing $E(n)$. The internal state and the output of the $j$th neuron are given by:
$v_j(n) = \sum_{i=0}^{p} w_{ji}(n) y_i(n)$   (BP-4)
$y_j(n) = \varphi_j(v_j(n))$   (BP-5)
We define the gradient as:
$$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}$$
According to BP-4:
$$\frac{\partial v_j(n)}{\partial w_{ji}(n)} = \frac{\partial}{\partial w_{ji}(n)} \sum_{i=0}^{p} w_{ji}(n) y_i(n) = y_i(n)$$
We further denote
$$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)}$$
where $E(n) = \frac{1}{2}\sum_{j \in C} e_j^2(n)$ and $e_j(n) = d_j(n) - y_j(n)$.
Then the change of $w_{ji}(n)$ is:
$\Delta w_{ji}(n) = \eta \delta_j(n) y_i(n)$
where $\eta$ is the learning rate; thus the weights can be updated as:
$w_{ji}(n+1) = w_{ji}(n) + \Delta w_{ji}(n) = w_{ji}(n) + \eta \delta_j(n) y_i(n)$
When neuron $j$ is a neuron in the output layer, according to BP-1 and BP-2, we have:
$$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial}{\partial y_j(n)}\left[\frac{1}{2}\sum_{j \in C}\left(d_j(n) - y_j(n)\right)^2\right] \varphi'_j(v_j(n)) = \left(d_j(n) - y_j(n)\right) \varphi'_j(v_j(n))$$
When neuron $j$ is in the hidden layer, there is no expected value for us to use. Thus, we use the error propagated back from the neurons connected to it.
Let neurons $j$ and $k$ be connected by the weight $w_{kj}$. If $k$ is in the output layer,
$$\delta_k(n) = -\frac{\partial E(n)}{\partial v_k(n)} = \left(d_k(n) - y_k(n)\right) \varphi'_k(v_k(n))$$
Then for neuron $j$,
$$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)} \varphi'_j(v_j(n))$$
$$\frac{\partial E(n)}{\partial y_j(n)} = \sum_k \frac{\partial E(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)} = \sum_k \frac{\partial E(n)}{\partial v_k(n)} w_{kj}(n)$$
so that
$$\delta_j(n) = \varphi'_j(v_j(n)) \sum_k \delta_k(n) w_{kj}(n) = \varphi'_j(v_j(n)) \sum_k \left(d_k(n) - y_k(n)\right) \varphi'_k(v_k(n)) w_{kj}(n)$$
and, as before, $\Delta w_{ji}(n) = \eta \delta_j(n) y_i(n)$.
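The previous equations show that the activation function $\varphi(\cdot)$ must be differentiable, e.g., the sigmoid function:
$$y_j(n) = \varphi(v_j(n)) = \frac{1}{1 + \exp(-v_j(n))}, \qquad -\infty < v_j(n) < \infty$$
$$\varphi'(v_j(n)) = \frac{\partial y_j(n)}{\partial v_j(n)} = \frac{\exp(-v_j(n))}{\left[1 + \exp(-v_j(n))\right]^2} = y_j(n)\left[1 - y_j(n)\right]$$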
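Putting the output-layer rule, the hidden-layer rule, and the sigmoid derivative together gives one on-line back-propagation step. The sketch below is a minimal illustration for a network with one hidden layer. The data layout (each weight row starts with the bias weight, and the bias input is taken as $+1$), the example weights, and the helper names are assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(W_hid, W_out, x):
    """Forward pass; inputs and hidden outputs are augmented with a leading 1."""
    xa = [1.0] + list(x)
    y_hid = [sigmoid(sum(w * xi for w, xi in zip(row, xa))) for row in W_hid]
    ha = [1.0] + y_hid
    y_out = [sigmoid(sum(w * hi for w, hi in zip(row, ha))) for row in W_out]
    return xa, ha, y_out

def error(W_hid, W_out, x, d):
    """Instantaneous error energy E(n) = 1/2 * sum of squared output errors."""
    _, _, y_out = forward(W_hid, W_out, x)
    return 0.5 * sum((dk - yk) ** 2 for dk, yk in zip(d, y_out))

def backprop_step(W_hid, W_out, x, d, eta=0.5):
    xa, ha, y_out = forward(W_hid, W_out, x)
    # Output neurons: delta_k = (d_k - y_k) * y_k * (1 - y_k)
    delta_out = [(dk - yk) * yk * (1.0 - yk) for dk, yk in zip(d, y_out)]
    # Hidden neurons: delta_j = y_j * (1 - y_j) * sum_k delta_k * w_kj
    delta_hid = []
    for j, yj in enumerate(ha[1:], start=1):
        back = sum(dk * W_out[k][j] for k, dk in enumerate(delta_out))
        delta_hid.append(yj * (1.0 - yj) * back)
    # Weight change: delta_w_ji = eta * delta_j * y_i
    new_W_out = [[w + eta * dk * hi for w, hi in zip(row, ha)]
                 for row, dk in zip(W_out, delta_out)]
    new_W_hid = [[w + eta * dj * xi for w, xi in zip(row, xa)]
                 for row, dj in zip(W_hid, delta_hid)]
    return new_W_hid, new_W_out

# Hypothetical 2-2-1 network and a single training sample
W_hid = [[0.1, 0.2, -0.3], [-0.2, 0.4, 0.1]]
W_out = [[0.05, 0.3, -0.4]]
x, d = [1.0, 0.0], [1.0]
before = error(W_hid, W_out, x, d)
W_hid2, W_out2 = backprop_step(W_hid, W_out, x, d)
after = error(W_hid2, W_out2, x, d)
```

In this sketch the hidden deltas are computed from the output weights as they were before the update, so both layers use gradients evaluated at the same point.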
[Figure: signal-flow graph of a part of the adjoint system pertaining to back-propagation of error signals.]
Two Passes of Computation
Forward pass
$v_j(n) = \sum_{i=0}^{p} w_{ji}(n) y_i(n)$, $y_j(n) = \varphi_j(v_j(n))$
Backward pass
This pass starts at the output layer by passing the error signals leftward through the network, layer by layer, and recursively computing delta (the local gradient) for each neuron.
Signal-flow graphical summary
Top part of the graph: forward pass.
Bottom part of the graph: backward pass.
Learning Rate
The learning rate should not be too large or too small.
In order to avoid the danger of instability, a momentum term can be introduced into the equation:
$\Delta w_{ji}(n) = \alpha \Delta w_{ji}(n-1) + \eta \delta_j(n) y_i(n)$
If we rewrite the equation as a time series with index $t$, the equation becomes:
$$\Delta w_{ji}(n) = \eta \sum_{t=0}^{n} \alpha^{n-t} \delta_j(t) y_i(t)$$
We can rewrite it as:
$$\Delta w_{ji}(n) = -\eta \sum_{t=0}^{n} \alpha^{n-t} \frac{\partial E(t)}{\partial w_{ji}(t)}$$
For the time series to be convergent, $0 \le |\alpha| < 1$.
The sign of the partial derivative can affect the speed and stability.
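The equivalence between the recursive momentum update and its time-series form can be checked numerically for a single weight. The sketch assumes $\Delta w_{ji}(-1) = 0$; the sequence of values $\delta_j(t) y_i(t)$ and the constants are made up for illustration.

```python
def momentum_updates(terms, eta=0.1, alpha=0.9):
    """Recursive form: dw(n) = alpha * dw(n-1) + eta * terms[n], with dw(-1) = 0.

    terms[t] stands for the product delta_j(t) * y_i(t) of one weight.
    """
    dw_prev = 0.0
    updates = []
    for g in terms:
        dw = alpha * dw_prev + eta * g
        updates.append(dw)
        dw_prev = dw
    return updates

def closed_form(terms, n, eta=0.1, alpha=0.9):
    """Time-series form: dw(n) = eta * sum_{t=0}^{n} alpha^(n-t) * terms[t]."""
    return eta * sum(alpha ** (n - t) * terms[t] for t in range(n + 1))

# Hypothetical sequence of delta_j(t) * y_i(t) values
terms = [0.5, 0.4, -0.1, 0.3, 0.2]
recursive = momentum_updates(terms)
```

When consecutive terms share a sign, the weighted sum grows and the step accelerates; when the sign alternates, the terms partly cancel and the step shrinks, which is the stabilizing effect of momentum.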
Stopping Criteria
In general, the BP cannot be shown to converge,
and there are no well-defined criteria for
stopping its operation.
However, there are some reasonable criteria
that can be used to terminate the weight
adjustments, e.g.
When the Euclidean norm of the gradient vector
reaches a sufficiently small gradient threshold.
When the average squared error per epoch is
sufficiently small. Usually, it is in the range of 0.1 to 1
percent per epoch, or as small as 0.01 percent.
XOR Problem Revisiting
The weights are initialized as:
$w_1(0) = (-1.2, 1, 1)^T$, $w_2(0) = (0.3, 1, 1)^T$, $w_3(0) = (0.5, 0.4, 0.8)^T$, $\eta = 0.5$
When the sample (1,1) is given to the network:
$y_1 = \dfrac{1}{1 + \exp(-((-1.2)(-1) + 1 \times 1 + 1 \times 1))} = 0.96$
$y_2 = \dfrac{1}{1 + \exp(-(0.3 \times (-1) + 1 \times 1 + 1 \times 1))} = 0.84$
$z = \dfrac{1}{1 + \exp(-(0.5 \times (-1) + 0.4 \times 0.96 + 0.8 \times 0.84))} = 0.63$
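We have:
For an output neuron: $\delta_j(n) = e_j(n) \varphi'_j(v_j(n)) = \left[d_j(n) - O_j(n)\right] O_j(n) \left[1 - O_j(n)\right]$
For a hidden neuron: $\delta_j(n) = \varphi'_j(v_j(n)) \sum_k \delta_k(n) w_{kj}(n) = y_j(n)\left[1 - y_j(n)\right] \sum_k \delta_k(n) w_{kj}(n)$
$\delta_3 = (0 - 0.63) \times 0.63 \times (1 - 0.63) = -0.147$
$\delta_1 = 0.96 \times (1 - 0.96) \times (-0.147) \times 0.4 = -0.002$
$\delta_2 = 0.84 \times (1 - 0.84) \times (-0.147) \times 0.8 = -0.0158$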
Then the weights are updated as:
$w_1(1) = (-1.2, 1, 1)^T + 0.5 \times (-0.002) \times (-1, 1, 1)^T = (-1.199, 0.999, 0.999)^T$
$w_2(1) = (0.3, 1, 1)^T + 0.5 \times (-0.0158) \times (-1, 1, 1)^T = (0.3079, 0.992, 0.992)^T$
$w_3(1) = (0.5, 0.4, 0.8)^T + 0.5 \times (-0.147) \times (-1, 0.96, 0.84)^T = (0.5735, 0.329, 0.738)^T$
Finally, we can have:
$w_1 = (-1.198, 0.912, 1.179)^T$
$w_2 = (0.294, 0.826, 0.98)^T$
$w_3 = (0.216, 0.384, -0.189)^T$
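The first training step of this example can be reproduced in a few lines. The sketch follows the slide's arithmetic, in which the first component of every weight vector multiplies a bias input of $-1$; the variable names are assumptions. Because the code keeps full precision while the slides round intermediate values to two decimals, the results agree only to within rounding.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

eta = 0.5
w1 = [-1.2, 1.0, 1.0]   # hidden neuron 1
w2 = [0.3, 1.0, 1.0]    # hidden neuron 2
w3 = [0.5, 0.4, 0.8]    # output neuron
x = [-1.0, 1.0, 1.0]    # bias input -1, then the sample (1, 1)
d = 0.0                 # XOR target for (1, 1)

# Forward pass
y1 = sigmoid(dot(w1, x))
y2 = sigmoid(dot(w2, x))
h = [-1.0, y1, y2]
z = sigmoid(dot(w3, h))

# Backward pass: local gradients
delta3 = (d - z) * z * (1.0 - z)
delta1 = y1 * (1.0 - y1) * delta3 * w3[1]
delta2 = y2 * (1.0 - y2) * delta3 * w3[2]

# Weight updates: w(1) = w(0) + eta * delta * input
w1_new = [w + eta * delta1 * xi for w, xi in zip(w1, x)]
w2_new = [w + eta * delta2 * xi for w, xi in zip(w2, x)]
w3_new = [w + eta * delta3 * hi for w, hi in zip(w3, h)]
```

All three local gradients are negative because the network output is above the target of 0 for this sample.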
We can revisit the problem from the view of space transformation: the points in the sample space are transformed into a new space, i.e., (0,0), (1,1), (1,0) and (0,1) are mapped to (0.768,0.427), (0.964,0.819), (0.892,0.629) and (0.915,0.665).
$y_1 = \dfrac{1}{1 + \exp(-(w_{11} x_1 + w_{21} x_2 - \theta_1))}$
$y_2 = \dfrac{1}{1 + \exp(-(w_{12} x_1 + w_{22} x_2 - \theta_2))}$
(0,0) → hidden layer (0.768, 0.427) → output layer $z = 0.4997$
(1,1) → hidden layer (0.964, 0.819) → output layer $z = 0.4999$
(0,1) → hidden layer (0.892, 0.629) → output layer $z = 0.5025$
(1,0) → hidden layer (0.915, 0.665) → output layer $z = 0.5020$
Other Issues
Heuristics for BP learning
Stochastic versus batch update
Maximizing information content
Activation function
Target values
Normalizing the inputs
Initialization
Learning from hints
Learning rates.
Stochastic versus batch update
The stochastic mode (pattern-by-pattern) is
computationally faster than batch mode.
Especially when the data set is large and redundant,
it is much better to use stochastic than batch updates.
Maximizing information content
Every training example should be chosen on the
basis that its information content is the largest
possible for the task at hand.
How to choose?
Use an example that results in the largest training error.
Use an example that is radically different from all those
previously used.
Activation function
Graph of the hyperbolic tangent function $\varphi(v) = a \tanh(bv)$ for $a = 1.7159$ and $b = 2/3$. The recommended target values are +1 and -1.
$$\varphi(v) = a \, \frac{e^{bv} - e^{-bv}}{e^{bv} + e^{-bv}}$$
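One way to see why +1 and -1 are convenient targets for these constants is to evaluate the function numerically: with $a = 1.7159$ and $b = 2/3$, $\varphi(1)$ is very close to 1 and $\varphi(-1)$ is very close to -1, well inside the limits $\pm a$. The function name below is an assumption for this sketch.

```python
import math

A = 1.7159
B = 2.0 / 3.0

def scaled_tanh(v, a=A, b=B):
    """phi(v) = a * tanh(b * v)."""
    return a * math.tanh(b * v)

phi_plus = scaled_tanh(1.0)    # close to +1
phi_minus = scaled_tanh(-1.0)  # close to -1
slope_at_origin = A * B        # derivative of a * tanh(b * v) at v = 0
```

Since the targets sit inside the range $(-a, a)$ instead of at its limits, the output neurons do not need unbounded local fields to reach them.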
Target values
The target values should be within the range of the sigmoid activation function.
Otherwise, the BP algorithm tends to drive the free parameters of the network to infinity, and thereby slows down the learning process by driving the hidden neurons into saturation.
For an activation function with limiting values $\pm a$, the targets are offset from the limits by a small constant $\varepsilon$: $d_j = a - \varepsilon$ for the positive limit and $d_j = -a + \varepsilon$ for the negative limit.
Normalizing the inputs
Each input variable should be preprocessed
Mean removal
Decorrelation
Covariance equalization
Normalization methods
Min-Max
Z-score standardization
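The two normalization methods can be sketched for a single input variable as follows. The function names are assumptions, the z-score uses the population standard deviation, and both functions assume the values are not all identical (otherwise the denominators would be zero).

```python
import math

def min_max(values):
    """Min-max normalization: rescale the values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Z-score standardization: zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

scaled = min_max([2.0, 4.0, 6.0])
standardized = z_score([2.0, 4.0, 6.0, 8.0])
```

Z-score standardization also performs the mean removal listed above, while min-max normalization only fixes the range of the variable.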
Initialization
Too large
The neurons in the network will be driven into
saturation
Too small
The BP algorithm will operate on a very flat area
around the origin of the error surface.
Learning from hints
We can make use of some information that we
have about the activation function or data.
Learning rates
The learning rate should be assigned a smaller
value in the last layers than in the front layers.
Neurons with many inputs should have a smaller
learning rate than neurons with few inputs.
Annealing method can be applied.
MIMA
Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University 117
Stopping Criteria
In general, the BP algorithm cannot be shown to converge, and there are no well-defined criteria for stopping its operation.
However, there are some reasonable criteria that can be used to terminate the weight adjustments, e.g.
When the Euclidean norm of the gradient vector falls below a sufficiently small gradient threshold.
When the absolute rate of change in the average squared error per epoch is sufficiently small. Typically, a rate in the range of 0.1 to 1 percent per epoch is considered small enough; sometimes a value as small as 0.01 percent is used.
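The two criteria can be combined in a single check, sketched below. It reads the "percent per epoch" figure as the relative change in average squared error from one epoch to the next, which is an interpretation. The function name and the default thresholds are illustrative assumptions.

```python
import math

def should_stop(gradient, mse_history, grad_tol=1e-4, rate_tol=0.001):
    """Return True if either stopping criterion is met.

    gradient    : sequence of partial derivatives of the error (flattened)
    mse_history : average squared error of each epoch so far
    rate_tol    : 0.001 corresponds to 0.1 percent per epoch
    """
    # Criterion 1: Euclidean norm of the gradient is small enough.
    grad_norm = math.sqrt(sum(g * g for g in gradient))
    if grad_norm <= grad_tol:
        return True
    # Criterion 2: relative change of the average squared error is small enough.
    if len(mse_history) >= 2 and mse_history[-2] > 0:
        rate = abs(mse_history[-1] - mse_history[-2]) / mse_history[-2]
        if rate <= rate_tol:
            return True
    return False

print(should_stop([0.5, -0.2], [1.0, 0.6]))      # still improving quickly
print(should_stop([0.5, -0.2], [1.0, 0.9995]))   # error has flattened out
```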
Some problems
The layers
The number of hidden layer neurons
Kolmogorov's theorem: the number of neurons in the hidden layer can be taken as s = 2m + 1, where m is the number of neurons in the input layer
Demos
BP Summary
Strengths of BP learning
Great representation power
Wide practical applicability
Easy to implement
Good generalization power
Problems of BP learning
Learning often takes a long time to converge
The net is essentially a black box
Gradient descent approach only guarantees a local minimum error
Not every function that is representable can be learned
Generalization is not guaranteed even if the error is reduced to zero
BP Summary
No well-founded way to assess the quality of BP learning
Network paralysis may occur (learning is stopped)
Selection of learning parameters can only be done by trial-and-error
BP learning is non-incremental (to include new training samples, the network must be re-trained with all old and new samples)
Radial-Basis Functions
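A radial basis function (RBF) is a real-valued function whose value depends only on the distance from the origin, so that $\varphi(\mathbf{x}) = \varphi(\|\mathbf{x}\|)$; or alternatively on the distance from some other point c, called a center, so that $\varphi(\mathbf{x}, \mathbf{c}) = \varphi(\|\mathbf{x} - \mathbf{c}\|)$.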
Interpolation problem
In its strict sense, the problem can be stated as:
Given a set of N different points $\{\mathbf{x}_i \in \mathbb{R}^{m_0}\}$ and a corresponding set of N real numbers $d_i$, find a function F that satisfies the interpolation condition $F(\mathbf{x}_i) = d_i$, i = 1, 2, ..., N.
The Radial-Basis Function (RBF) technique consists of choosing a function F that has the form:
$$F(\mathbf{x}) = \sum_{i=1}^{N} w_i \, \varphi(\|\mathbf{x} - \mathbf{x}_i\|)$$
How to get the solution?
Radial-Basis Function
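$$\begin{bmatrix} \varphi_{11} & \varphi_{12} & \cdots & \varphi_{1N} \\ \varphi_{21} & \varphi_{22} & \cdots & \varphi_{2N} \\ \vdots & \vdots & & \vdots \\ \varphi_{N1} & \varphi_{N2} & \cdots & \varphi_{NN} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix}$$
where $\varphi_{ji} = \varphi(\|\mathbf{x}_j - \mathbf{x}_i\|)$, j, i = 1, 2, ..., N.
Micchelli's theorem (1986) proves that, for a large class of radial-basis functions, this interpolation matrix is nonsingular provided the points $\mathbf{x}_i$ are all distinct.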
Radial-Basis Functions
Multiquadrics
$$\varphi(r) = (r^2 + c^2)^{1/2}$$
Inverse multiquadrics
$$\varphi(r) = \frac{1}{(r^2 + c^2)^{1/2}}$$
Gaussian functions
$$\varphi(r) = \exp\left(-\frac{r^2}{2\sigma^2}\right)$$
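The strict interpolation problem can be solved with any of the three functions listed above by building the N-by-N matrix of pairwise values and solving the linear system for the weights. The following is a sketch under assumptions: the function names, the parameter defaults c = 1 and sigma = 1, and the four sample points are made up for illustration.

```python
import numpy as np

def multiquadric(r, c=1.0):
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    return 1.0 / np.sqrt(r**2 + c**2)

def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / (2.0 * sigma**2))

def fit_weights(X, d, phi):
    """Solve Phi w = d, where Phi[j, i] = phi(||x_j - x_i||)."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return np.linalg.solve(phi(dist), np.asarray(d, dtype=float))

def interpolant(x, X, w, phi):
    """Evaluate F(x) = sum_i w_i * phi(||x - x_i||)."""
    diff = np.asarray(X, dtype=float) - np.asarray(x, dtype=float)
    return float(phi(np.linalg.norm(diff, axis=1)) @ w)

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # N = 4 distinct points, m0 = 1
d = np.array([0.0, 1.0, 0.0, -1.0])          # desired values d_i
for phi in (multiquadric, inverse_multiquadric, gaussian):
    w = fit_weights(X, d, phi)
    residual = max(abs(interpolant(x, X, w, phi) - di) for x, di in zip(X, d))
    print(phi.__name__, residual)            # residual at the data points
```

Because the sample points are distinct, each of the three matrices is nonsingular and `np.linalg.solve` finds a unique weight vector.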
RBF Networks
Structure of an RBF network, based on
interpolation theory.
RBF Networks
The network has three layers:
Input layer, which consists of m0 source nodes.
Hidden layer, which consists of the same number of computation units as the number of training samples, namely N.
Output layer; there is no restriction on its size.
Modifications to RBF Networks
How to get the k centers
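This can be computed by unsupervised learning.
K-means
SOM
A clustering algorithm can be used here; e.g., with k-means the centers $\mathbf{u}_j$ are chosen to solve
$$\min \sum_{j=1}^{K} \sum_{C(i)=j} \|\mathbf{x}_i - \mathbf{u}_j\|^2$$
where C(i) = j means that point $\mathbf{x}_i$ is assigned to cluster j.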
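A minimal k-means sketch for picking the K centers. It uses the standard alternating scheme (assign each point to its nearest center, then move each center to the mean of its points), which lowers the cost above at every step but only reaches a local minimum. The function signature and the optional `init` argument are assumptions for illustration.

```python
import numpy as np

def k_means(X, k, init=None, n_iter=100, seed=0):
    """Return (centers u_j, assignments C(i), final cost)."""
    X = np.asarray(X, dtype=float)
    if init is None:
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        centers = np.array(init, dtype=float)
    for _ in range(n_iter):
        # Assignment step: C(i) = index of the nearest center.
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        # Update step: each center becomes the mean of its members.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Recompute assignments and cost for the final centers.
    dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dist2.argmin(axis=1)
    cost = float(dist2[np.arange(len(X)), labels].sum())
    return centers, labels, cost

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
centers, labels, cost = k_means(X, k=2, init=X[[0, 20]])
print(np.round(centers, 2), round(cost, 3))
```

The resulting `centers` would then serve as the centers of the K hidden units of the RBF network.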
Self-Organization Maps
Teuvo Kohonen (1982, 1984)
In biological systems
Cells tuned to similar orientations tend to be physically located in proximity with one another
Microelectrode studies with cats
So, SOM is motivated by a distinct feature of the human brain:
The brain is organized in many places in such a way that different sensory inputs are represented by topologically ordered computational maps.
Self-Organization Maps
Orientation tuning over the surface forms a kind of map, with similar tunings being found close to each other
Topographic feature map
Train a network using competitive learning to create feature maps automatically
SOM Clustering
Self-organizing map (SOM)
An unsupervised artificial neural network
Mapping high-dimensional data into a one- or two-dimensional representation space
Similar data may be found in neighboring regions
Disadvantages
Fixed size in terms of the number of units and their particular arrangement
Hierarchical relations between the input data are not mirrored in a straightforward manner
SOM Structure
Two-dimensional lattice of neurons, illustrated
for a three-dimensional input and four-by-four
dimensional output (all shown in blue).
SOM Structure
Features
Kohonen's algorithm creates a vector quantizer by adjusting the weights from common input nodes to M output nodes
Continuous-valued input vectors are presented without specifying the desired output
After learning, the weights will be organized such that topologically close nodes are sensitive to inputs that are physically similar
Output nodes will be ordered in a natural
manner
Initial setup of SOM
Consists of a set of units i in a two-dimensional grid
Each unit i is assigned a weight vector $\mathbf{m}_i$ of the same dimension as the input data
The initial weight vectors are assigned random values
Essential processes in SOM
Competition process
Find the best match of the input vector x with the
synaptic-weight vectors.
Cooperation process
Decide the topological neighborhood centered on the
winning neuron, and make it decay smoothly with
lateral distance.
Synaptic adaptation
The weights of corresponding neurons are updated in
relation to the input vector.
Competition process
Winner Selection
Initially, pick a random input vector x(t)
Compute the unit c with the highest activity level (the winner c(t)) using the Euclidean distance:
$$\mathbf{x} = [x_1, x_2, \ldots, x_m]^T$$
$$\mathbf{w}_j = [w_{j1}, w_{j2}, \ldots, w_{jm}]^T, \quad j = 1, 2, \ldots, l$$
$$i(\mathbf{x}) = \arg\min_j \|\mathbf{x} - \mathbf{w}_j\|, \quad j = 1, 2, \ldots, l$$
Neuron i is called the best-matching, or winning, neuron for the input vector x
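Winner selection amounts to a nearest-neighbor search over the weight vectors. A short sketch, assuming the l weight vectors are stored as the rows of a matrix `W` (a layout chosen here for convenience) and using zero-based indices:

```python
import numpy as np

def best_matching_unit(x, W):
    """Return i(x) = argmin_j ||x - w_j|| for a weight matrix W of shape (l, m)."""
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))

W = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [2.0, 0.0]])       # l = 3 neurons, m = 2 inputs
x = np.array([0.9, 1.2])
print(best_matching_unit(x, W))  # index of the row closest to x
```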
Cooperative process
A neuron that is firing tends to excite the neurons in its immediate neighborhood more than those farther away from it.
Let $h_{j,i}$ denote the topological neighborhood function centered on winning neuron i and encompassing a set of excited neurons.
Let $d_{j,i}$ denote the lateral distance between the winning neuron i and excited neuron j.
Cooperative process
$h_{j,i}$ and $d_{j,i}$ should satisfy two distinct requirements:
$h_{j,i}$ is symmetric about the maximum point defined by $d_{j,i} = 0$; in other words, it attains its maximum value at the winning neuron i, for which the distance $d_{j,i}$ is zero.
The amplitude of the topological neighborhood $h_{j,i}$ decreases monotonically with increasing lateral distance $d_{j,i}$, decaying to zero for $d_{j,i} \to \infty$; this is a necessary condition for convergence.
Cooperative process
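A good choice of $h_{j,i}$ that satisfies these requirements is the Gaussian function:
$$h_{j,i(\mathbf{x})} = \exp\left(-\frac{d_{j,i}^2}{2\sigma^2}\right)$$
For a one-dimensional lattice, $d_{j,i} = |j - i|$.
For a two-dimensional lattice, $d_{j,i} = \|\mathbf{r}_j - \mathbf{r}_i\|$, where $\mathbf{r}_j$ and $\mathbf{r}_i$ are the lattice positions of excited neuron j and winning neuron i.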
Cooperative process
Another unique feature of the SOM algorithm is
that the size of the topological neighborhood is
permitted to shrink with time.
This requirement can be satisfied by making the
width sigma of the topological neighborhood
function hji decrease with time. A popular choice
for the dependence of sigma on discrete time n
is the exponential decay:
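$$\sigma(n) = \sigma_0 \exp\left(-\frac{n}{\tau_1}\right), \quad n = 0, 1, 2, \ldots$$
where $\sigma_0$ is the initial width and $\tau_1$ is a time constant.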
Cooperative process
Correspondingly, the topological neighborhood
function assumes a time-varying form of its own,
as follows:
$$h_{j,i(\mathbf{x})}(n) = \exp\left(-\frac{d_{j,i}^2}{2\sigma^2(n)}\right), \quad n = 0, 1, 2, \ldots$$
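The shrinking Gaussian neighborhood can be sketched as follows for a two-dimensional lattice. The default values sigma0 = 2 and tau1 = 1000, the 4-by-4 grid, and the function names are assumptions made for this example.

```python
import numpy as np

def sigma(n, sigma0=2.0, tau1=1000.0):
    """Neighborhood width at discrete time n (exponential decay)."""
    return sigma0 * np.exp(-n / tau1)

def neighborhood(grid, winner, n, sigma0=2.0, tau1=1000.0):
    """Neighborhood value for every neuron j, given the winner index.

    grid : array of shape (l, 2) holding the lattice position r_j of each neuron
    """
    d2 = np.sum((grid - grid[winner]) ** 2, axis=1)   # squared lateral distances
    return np.exp(-d2 / (2.0 * sigma(n, sigma0, tau1) ** 2))

# Positions of a 4-by-4 lattice, neurons numbered row by row.
rows, cols = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
grid = np.column_stack([rows.ravel(), cols.ravel()]).astype(float)

h_early = neighborhood(grid, winner=5, n=0)      # wide neighborhood
h_late = neighborhood(grid, winner=5, n=3000)    # nearly only the winner
print(np.round(h_early.reshape(4, 4), 2))
print(np.round(h_late.reshape(4, 4), 2))
```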
Adaptive process
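In this stage, the synaptic-weight vector is required to change in relation to the input vector.
$$\mathbf{w}_j(n+1) = \mathbf{w}_j(n) + \Delta\mathbf{w}_j(n)$$
where the last term can be calculated using the following equation:
$$\Delta\mathbf{w}_j(n) = \eta(n)\, h_{j,i(\mathbf{x})}(n)\, (\mathbf{x}(n) - \mathbf{w}_j(n))$$
The learning rate η(n) should also be time varying:
$$\eta(n) = \eta_0 \exp\left(-\frac{n}{\tau_2}\right), \quad n = 0, 1, 2, \ldots$$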
Weight update
Weight update
Illustration of the relationship between feature map Φ and weight vector $\mathbf{w}_i$ of winning neuron i
Learning Process (Adaptation)
SOM Learning Summary
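1. Initialization
Choose random values for the initial weights
Or choose input vectors at random to initialize them
2. Sampling
Draw a sample x from the input space
3. Similarity matching
Find the best-matching neuron:
$$i(\mathbf{x}) = \arg\min_j \|\mathbf{x} - \mathbf{w}_j\|, \quad j = 1, 2, \ldots, l$$
4. Updating
$$\Delta\mathbf{w}_j(n) = \eta(n)\, h_{j,i(\mathbf{x})}(n)\, (\mathbf{x}(n) - \mathbf{w}_j(n))$$
5. Continuation
Continue with step 2 until no noticeable changes in the feature map are observed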
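The five steps can be put together in a compact training loop. This is a sketch under stated assumptions, not the course's reference implementation: a fixed number of steps stands in for the "no noticeable change" test, the lattice is 4 by 4, and the defaults eta0 = 0.1, sigma0 = 2, tau1 = tau2 = 1000 are illustrative values.

```python
import numpy as np

def train_som(X, rows=4, cols=4, n_steps=2000, eta0=0.1, sigma0=2.0,
              tau1=1000.0, tau2=1000.0, seed=0):
    """Train a 2-D SOM; return (weights of shape (rows*cols, m), lattice positions)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    grid = np.column_stack([r.ravel(), c.ravel()]).astype(float)
    # 1. Initialization: use randomly chosen input vectors as initial weights.
    W = X[rng.choice(len(X), size=rows * cols, replace=True)].copy()
    for n in range(n_steps):                       # 5. Continuation
        x = X[rng.integers(len(X))]                # 2. Sampling
        i = np.argmin(np.linalg.norm(W - x, axis=1))   # 3. Similarity matching
        sigma = sigma0 * np.exp(-n / tau1)         # shrinking neighborhood width
        eta = eta0 * np.exp(-n / tau2)             # decaying learning rate
        d2 = np.sum((grid - grid[i]) ** 2, axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))
        W += eta * h[:, None] * (x - W)            # 4. Updating
    return W, grid

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))           # toy data in the unit square
W, grid = train_som(X)
# Mean distance from each sample to its best-matching weight vector.
qe = np.mean(np.min(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1))
print(W.shape, round(float(qe), 3))
```

Since every update moves a weight vector part of the way toward a sample, the weights always stay inside the region spanned by the data.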
Neighborhoods
Applications
Optimization problems
Clustering problems
Pattern recognition
Others
Applications
Animal names and their attributes
A grouping according to similarity has emerged (map regions: birds, peaceful, hunters)

                  Dove  Hen   Duck  Goose Owl   Hawk  Eagle Fox   Dog   Wolf  Cat   Tiger Lion  Horse Zebra Cow
is       Small    1     1     1     1     1     1     0     0     0     0     1     0     0     0     0     0
is       Medium   0     0     0     0     0     0     1     1     1     1     0     0     0     0     0     0
is       Big      0     0     0     0     0     0     0     0     0     0     0     1     1     1     1     1
has      2 legs   1     1     1     1     1     1     1     0     0     0     0     0     0     0     0     0
has      4 legs   0     0     0     0     0     0     0     1     1     1     1     1     1     1     1     1
has      Hair     0     0     0     0     0     0     0     1     1     1     1     1     1     1     1     1
has      Hooves   0     0     0     0     0     0     0     0     0     0     0     0     0     1     1     1
has      Mane     0     0     0     0     0     0     0     0     0     1     0     0     1     1     1     0
has      Feathers 1     1     1     1     1     1     1     0     0     0     0     0     0     0     0     0
likes to Hunt     0     0     0     0     1     1     1     1     0     1     1     1     1     0     0     0
likes to Run      0     0     0     0     0     0     0     0     1     1     0     1     1     1     1     0
likes to Fly      1     0     0     1     1     1     1     0     0     0     0     0     0     0     0     0
likes to Swim     0     0     1     1     0     0     0     0     0     0     0     0     0     0     0     0
Application to PR
Features
Feature vector of each letter. Each entry counts how often the stroke occurs in the letter (bar = horizontal stroke, arc = semicircular arc).

Letter  Diagonal  Top bar  Middle bar  Bottom bar  Vertical  Upper arc  Lower arc  Left arc  Right arc
A       2         0        1           0           0         0          0          0         0
B       0         0        0           0           1         0          0          0         2
C       0         0        0           0           0         0          0          1         0
D       0         0        0           0           1         0          0          0         1
E       0         1        1           1           1         0          0          0         0
F       0         1        1           0           1         0          0          0         0
H       0         1        0           0           2         0          0          0         0
K       2         0        0           0           1         0          0          0         0
P       0         0        0           0           1         0          0          0         1
T       0         1        0           0           1         0          0          0         0
U       0         0        0           0           2         1          0          0         0
Applications
Demos
Any questions?