1
Back-Propagation
2
Objectives
A generalization of the LMS algorithm, called backpropagation, can be used to train multilayer networks.
Backpropagation is an approximate steepest descent algorithm, in which the performance index is mean square error.
In order to calculate the derivatives, we need to use the chain rule of calculus.
3
Motivation
The perceptron learning and the LMS algorithm were designed to train single-layer perceptron-like networks.
They are only able to solve linearly separable classification problems.
Parallel Distributed Processing
The multilayer perceptron, trained by the backpropagation algorithm, is currently the most widely used neural network.
4
Three-Layer Network
Network notation: $R - S^1 - S^2 - S^3$. Number of neurons in each layer: R is the number of inputs; $S^n$ is the number of neurons in layer n.
5
Pattern Classification: XOR gate
The limitations of the single-layer perceptron (Minsky & Papert, 1969)
$\left\{\mathbf{p}_1 = \begin{bmatrix}0\\0\end{bmatrix}, t_1 = 0\right\},\quad \left\{\mathbf{p}_2 = \begin{bmatrix}1\\1\end{bmatrix}, t_2 = 0\right\},\quad \left\{\mathbf{p}_3 = \begin{bmatrix}1\\0\end{bmatrix}, t_3 = 1\right\},\quad \left\{\mathbf{p}_4 = \begin{bmatrix}0\\1\end{bmatrix}, t_4 = 1\right\}$
[Figure: the four input vectors p₁, p₂, p₃, p₄ plotted in the input plane.]
6
Two-Layer XOR Network
Two-layer, 2-2-1 network
[Figure: network diagram and decision boundaries ("Individual Decisions"). Each first-layer neuron (net inputs n¹₁, n¹₂, outputs a¹₁, a¹₂) makes an individual decision, drawing one boundary through the input plane containing p₁ … p₄; the second-layer AND neuron (net input n²₁, output a²₁) combines the two decisions.]
7
Solved Problem P11.1
Design a multilayer network to distinguish these categories.
Class I: $\mathbf{p}_1$, $\mathbf{p}_2$; Class II: $\mathbf{p}_3$, $\mathbf{p}_4$ (four-element vectors with ±1 entries).
A single-layer network would need a weight matrix and bias satisfying
$\mathbf{W}\mathbf{p}_1 + b > 0,\quad \mathbf{W}\mathbf{p}_2 + b > 0$ (Class I), and $\mathbf{W}\mathbf{p}_3 + b < 0,\quad \mathbf{W}\mathbf{p}_4 + b < 0$ (Class II).
There is no hyperplane that can separate these two categories.
8
Solution of Problem P11.1
[Figure: two-layer network solving P11.1. Inputs p₁ … p₄ feed two first-layer neurons (net inputs n¹₁, n¹₂, outputs a¹₁, a¹₂); a second-layer neuron (net input n²₁, output a²₁) combines them using the AND and OR operations indicated in the figure.]
9
Function Approximation
Two-layer, 1-2-1 network:
$f^1(n) = \dfrac{1}{1+e^{-n}},\qquad f^2(n) = n$
Nominal parameter values:
$w^1_{1,1} = 10,\quad w^1_{2,1} = 10,\quad b^1_1 = -10,\quad b^1_2 = 10$
$w^2_{1,1} = 1,\quad w^2_{1,2} = 1,\quad b^2 = 0$
10
Function Approximation
The centers of the steps occur where the net input to a neuron in the first layer is zero.
The steepness of each step can be adjusted by changing the network weights.
$n^1_1 = w^1_{1,1}\,p + b^1_1 = 0 \;\Rightarrow\; p = -\dfrac{b^1_1}{w^1_{1,1}} = -\dfrac{-10}{10} = 1$
$n^1_2 = w^1_{2,1}\,p + b^1_2 = 0 \;\Rightarrow\; p = -\dfrac{b^1_2}{w^1_{2,1}} = -\dfrac{10}{10} = -1$
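The nominal response is easy to reproduce numerically. Below is a minimal sketch in plain Python (function names are mine, not from the slides) of the 1-2-1 network with the parameter values above; it shows the two steps centered at p = -1 and p = 1.

```python
import math

def logsig(n):
    # Log-sigmoid transfer function used in the hidden layer.
    return 1.0 / (1.0 + math.exp(-n))

def net_1_2_1(p):
    # Nominal parameters from the slide.
    w1 = [10.0, 10.0]    # first-layer weights w1_{1,1}, w1_{2,1}
    b1 = [-10.0, 10.0]   # first-layer biases b1_1, b1_2
    w2 = [1.0, 1.0]      # second-layer weights w2_{1,1}, w2_{1,2}
    b2 = 0.0             # second-layer bias
    a1 = [logsig(w1[i] * p + b1[i]) for i in range(2)]
    # Linear (purelin) output layer.
    return w2[0] * a1[0] + w2[1] * a1[1] + b2

# Step centers sit where a first-layer net input is zero: p = 1 and p = -1.
for p in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(p, round(net_1_2_1(p), 3))   # response rises from ~0 to ~2
```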
11
Effect of Parameter Changes
[Figure: network response a² over p ∈ [-2, 2] (vertical axis -1 to 3) as the bias b¹₂ is varied over 0, 5, 10, 15, 20.]
12
Effect of Parameter Changes
[Figure: network response as the weight w²₁,₁ is varied over -1.0, -0.5, 0.0, 0.5, 1.0; axes as above.]
13
Effect of Parameter Changes
[Figure: network response as the weight w²₁,₂ is varied over -1.0, -0.5, 0.0, 0.5, 1.0; axes as above.]
14
Effect of Parameter Changes
[Figure: network response as the bias b² is varied over -1.0, -0.5, 0.0, 0.5, 1.0; axes as above.]
15
Function Approximation
Two-layer networks, with sigmoid transfer functions in the hidden layer and linear transfer functions in the output layer, can approximate virtually any function of interest to any degree of accuracy, provided sufficiently many hidden units are available.
16
Backpropagation Algorithm
For multilayer networks, the output of one layer becomes the input to the following layer:
$\mathbf{a}^{m+1} = \mathbf{f}^{m+1}(\mathbf{W}^{m+1}\mathbf{a}^m + \mathbf{b}^{m+1}),\quad m = 0, 1, \ldots, M-1$
$\mathbf{a}^0 = \mathbf{p},\qquad \mathbf{a} = \mathbf{a}^M$
17
Performance Index
Training set: $\{\mathbf{p}_1, \mathbf{t}_1\}, \{\mathbf{p}_2, \mathbf{t}_2\}, \ldots, \{\mathbf{p}_Q, \mathbf{t}_Q\}$
Mean square error: $F(\mathbf{x}) = E[e^2] = E[(t-a)^2]$
Vector case: $F(\mathbf{x}) = E[\mathbf{e}^T\mathbf{e}] = E[(\mathbf{t}-\mathbf{a})^T(\mathbf{t}-\mathbf{a})]$
Approximate mean square error: $\hat F(\mathbf{x}) = (\mathbf{t}(k)-\mathbf{a}(k))^T(\mathbf{t}(k)-\mathbf{a}(k)) = \mathbf{e}^T(k)\mathbf{e}(k)$
Approximate steepest descent algorithm:
$w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha\dfrac{\partial \hat F}{\partial w^m_{i,j}},\qquad b^m_i(k+1) = b^m_i(k) - \alpha\dfrac{\partial \hat F}{\partial b^m_i}$
18
Chain Rule
If $f(n) = e^n$ and $n = 2w$, then $f(n(w)) = e^{2w}$, and the chain rule gives
$\dfrac{df(n(w))}{dw} = \dfrac{df(n)}{dn}\cdot\dfrac{dn(w)}{dw} = e^n \cdot 2$
Approximate mean square error:
$\hat F(\mathbf{x}) = (\mathbf{t}(k)-\mathbf{a}(k))^T(\mathbf{t}(k)-\mathbf{a}(k)) = \mathbf{e}^T(k)\mathbf{e}(k)$
Applying the chain rule to the update equations:
$w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha\dfrac{\partial \hat F}{\partial n^m_i}\dfrac{\partial n^m_i}{\partial w^m_{i,j}}$
$b^m_i(k+1) = b^m_i(k) - \alpha\dfrac{\partial \hat F}{\partial n^m_i}\dfrac{\partial n^m_i}{\partial b^m_i}$
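As a quick numerical check of the chain-rule example, the sketch below (plain Python, names mine) compares the analytic derivative $2e^{2w}$ against a central finite difference.

```python
import math

def f_of_w(w):
    # f(n(w)) = e^(2w), with f(n) = e^n and n = 2w.
    return math.exp(2.0 * w)

def analytic_derivative(w):
    # Chain rule: (df/dn)(dn/dw) = e^n * 2 = 2 e^(2w).
    return 2.0 * math.exp(2.0 * w)

w, eps = 0.3, 1e-6
numeric = (f_of_w(w + eps) - f_of_w(w - eps)) / (2.0 * eps)
print(analytic_derivative(w), numeric)  # the two values should agree closely
```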
19
Sensitivity & Gradient
The net input to the ith neuron of layer m:
$n^m_i = \sum_{j=1}^{S^{m-1}} w^m_{i,j}\,a^{m-1}_j + b^m_i,\qquad \dfrac{\partial n^m_i}{\partial w^m_{i,j}} = a^{m-1}_j,\qquad \dfrac{\partial n^m_i}{\partial b^m_i} = 1$
The sensitivity of $\hat F$ to changes in the ith element of the net input at layer m:
$s^m_i \equiv \dfrac{\partial \hat F}{\partial n^m_i}$
Gradient:
$\dfrac{\partial \hat F}{\partial w^m_{i,j}} = \dfrac{\partial \hat F}{\partial n^m_i}\dfrac{\partial n^m_i}{\partial w^m_{i,j}} = s^m_i\,a^{m-1}_j,\qquad \dfrac{\partial \hat F}{\partial b^m_i} = \dfrac{\partial \hat F}{\partial n^m_i}\dfrac{\partial n^m_i}{\partial b^m_i} = s^m_i$
20
Steepest Descent Algorithm
The steepest descent algorithm for the approximate mean square error:
$w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha\,s^m_i\,a^{m-1}_j,\qquad b^m_i(k+1) = b^m_i(k) - \alpha\,s^m_i$
Matrix form:
$\mathbf{W}^m(k+1) = \mathbf{W}^m(k) - \alpha\,\mathbf{s}^m(\mathbf{a}^{m-1})^T,\qquad \mathbf{b}^m(k+1) = \mathbf{b}^m(k) - \alpha\,\mathbf{s}^m$
where
$\mathbf{s}^m \equiv \dfrac{\partial \hat F}{\partial \mathbf{n}^m} = \begin{bmatrix}\dfrac{\partial \hat F}{\partial n^m_1} & \dfrac{\partial \hat F}{\partial n^m_2} & \cdots & \dfrac{\partial \hat F}{\partial n^m_{S^m}}\end{bmatrix}^T$
21
BP the Sensitivity
Backpropagation: a recurrence relationship in which the sensitivity at layer m is computed from the sensitivity at layer m+1.
Jacobian matrix:
$\dfrac{\partial \mathbf{n}^{m+1}}{\partial \mathbf{n}^m} = \begin{bmatrix} \dfrac{\partial n^{m+1}_1}{\partial n^m_1} & \dfrac{\partial n^{m+1}_1}{\partial n^m_2} & \cdots & \dfrac{\partial n^{m+1}_1}{\partial n^m_{S^m}} \\[2mm] \dfrac{\partial n^{m+1}_2}{\partial n^m_1} & \dfrac{\partial n^{m+1}_2}{\partial n^m_2} & \cdots & \dfrac{\partial n^{m+1}_2}{\partial n^m_{S^m}} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial n^{m+1}_{S^{m+1}}}{\partial n^m_1} & \dfrac{\partial n^{m+1}_{S^{m+1}}}{\partial n^m_2} & \cdots & \dfrac{\partial n^{m+1}_{S^{m+1}}}{\partial n^m_{S^m}} \end{bmatrix}$
22
Matrix Representation
The i,j element of the Jacobian matrix:
$\dfrac{\partial n^{m+1}_i}{\partial n^m_j} = \dfrac{\partial\left(\sum_{l=1}^{S^m} w^{m+1}_{i,l}\,a^m_l + b^{m+1}_i\right)}{\partial n^m_j} = w^{m+1}_{i,j}\,\dfrac{\partial a^m_j}{\partial n^m_j} = w^{m+1}_{i,j}\,\dfrac{\partial f^m(n^m_j)}{\partial n^m_j} = w^{m+1}_{i,j}\,\dot f^m(n^m_j)$
where $\dot f^m(n^m_j) = \dfrac{\partial f^m(n^m_j)}{\partial n^m_j}$.
Matrix form: $\dfrac{\partial \mathbf{n}^{m+1}}{\partial \mathbf{n}^m} = \mathbf{W}^{m+1}\,\dot{\mathbf{F}}^m(\mathbf{n}^m)$, where
$\dot{\mathbf{F}}^m(\mathbf{n}^m) = \begin{bmatrix}\dot f^m(n^m_1) & 0 & \cdots & 0\\ 0 & \dot f^m(n^m_2) & \cdots & 0\\ \vdots & \vdots & & \vdots\\ 0 & 0 & \cdots & \dot f^m(n^m_{S^m})\end{bmatrix}$
23
Recurrence Relation
The recurrence relation for the sensitivity
The sensitivities are propagated backward through the network from the last layer to the first layer.
$\mathbf{s}^m = \dfrac{\partial \hat F}{\partial \mathbf{n}^m} = \left(\dfrac{\partial \mathbf{n}^{m+1}}{\partial \mathbf{n}^m}\right)^T \dfrac{\partial \hat F}{\partial \mathbf{n}^{m+1}} = \dot{\mathbf{F}}^m(\mathbf{n}^m)\,(\mathbf{W}^{m+1})^T\,\dfrac{\partial \hat F}{\partial \mathbf{n}^{m+1}} = \dot{\mathbf{F}}^m(\mathbf{n}^m)\,(\mathbf{W}^{m+1})^T\,\mathbf{s}^{m+1}$
$\mathbf{s}^M \rightarrow \mathbf{s}^{M-1} \rightarrow \cdots \rightarrow \mathbf{s}^2 \rightarrow \mathbf{s}^1$
24
Backpropagation Algorithm
At the final layer:
$s^M_i = \dfrac{\partial \hat F}{\partial n^M_i} = \dfrac{\partial (\mathbf{t}-\mathbf{a})^T(\mathbf{t}-\mathbf{a})}{\partial n^M_i} = \dfrac{\partial \sum_{j=1}^{S^M}(t_j - a_j)^2}{\partial n^M_i} = -2(t_i - a_i)\,\dfrac{\partial a_i}{\partial n^M_i}$
Since $\dfrac{\partial a_i}{\partial n^M_i} = \dfrac{\partial a^M_i}{\partial n^M_i} = \dfrac{\partial f^M(n^M_i)}{\partial n^M_i} = \dot f^M(n^M_i)$, we have
$s^M_i = -2(t_i - a_i)\,\dot f^M(n^M_i)$
Matrix form: $\mathbf{s}^M = -2\,\dot{\mathbf{F}}^M(\mathbf{n}^M)(\mathbf{t}-\mathbf{a})$
25
Summary
The first step is to propagate the input forward through the network:
$\mathbf{a}^0 = \mathbf{p}$
$\mathbf{a}^{m+1} = \mathbf{f}^{m+1}(\mathbf{W}^{m+1}\mathbf{a}^m + \mathbf{b}^{m+1}),\quad m = 0, 1, \ldots, M-1$
$\mathbf{a} = \mathbf{a}^M$
The second step is to propagate the sensitivities backward through the network.
Output layer: $\mathbf{s}^M = -2\,\dot{\mathbf{F}}^M(\mathbf{n}^M)(\mathbf{t}-\mathbf{a})$
Hidden layers: $\mathbf{s}^m = \dot{\mathbf{F}}^m(\mathbf{n}^m)(\mathbf{W}^{m+1})^T\mathbf{s}^{m+1},\quad m = M-1, \ldots, 2, 1$
The final step is to update the weights and biases:
$\mathbf{W}^m(k+1) = \mathbf{W}^m(k) - \alpha\,\mathbf{s}^m(\mathbf{a}^{m-1})^T$
$\mathbf{b}^m(k+1) = \mathbf{b}^m(k) - \alpha\,\mathbf{s}^m$
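The three steps translate directly into code. Below is a minimal NumPy sketch of one iteration for an arbitrary stack of layers; the function name, the argument conventions (lists `W`, `b` of per-layer parameters and `f`, `df` of transfer functions and their derivatives), and the use of 1-D arrays are my own choices, not from the slides.

```python
import numpy as np

def bp_iteration(W, b, f, df, p, t, alpha):
    # One iteration of backpropagation for an M-layer network.
    # W, b: lists of weight matrices / bias vectors (index 0 = layer 1).
    # f, df: lists of transfer functions and their elementwise derivatives.
    M = len(W)
    # Step 1: propagate the input forward.
    a, n = [p], []
    for m in range(M):
        n.append(W[m] @ a[m] + b[m])
        a.append(f[m](n[m]))
    # Step 2: propagate the sensitivities backward.
    s = [None] * M
    s[M - 1] = -2 * df[M - 1](n[M - 1]) * (t - a[M])
    for m in range(M - 2, -1, -1):
        # Elementwise multiply by df implements the diagonal matrix F'(n^m).
        s[m] = df[m](n[m]) * (W[m + 1].T @ s[m + 1])
    # Step 3: update weights and biases.
    for m in range(M):
        W[m] -= alpha * np.outer(s[m], a[m])
        b[m] -= alpha * s[m]
    return a[M]
```

With logsig/purelin transfer functions and the initial values of slide 30, one call of this function should reproduce the updates computed on slides 31-34.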
26
BP Neural Network
[Figure: general multilayer architecture. Layer 1 receives inputs p₁, p₂, …, p_R through weights w¹₁,₁ … w¹_{S¹,R}; a typical layer m has S^m neurons, with weight w^m_{i,j} connecting neuron j of layer m-1 to neuron i of layer m; layer M produces outputs a^M₁ … a^M_{S^M}.]
27
Ex: Function Approximation
$g(p) = 1 + \sin\!\left(\dfrac{\pi}{4}\,p\right)$
1-2-1 network
[Figure: the network output a is subtracted from the target t = g(p) to form the error e.]
28
Network Architecture
1-2-1 network
[Figure: 1-2-1 network diagram with input p and output a.]
29
$g(p) = 1 + \sin\!\left(\dfrac{\pi}{4}\,p\right)$ for $-2 \le p \le 2$; the iteration below uses $p = 1$.
30
Initial Values
$\mathbf{W}^1(0) = \begin{bmatrix}-0.27\\-0.41\end{bmatrix},\quad \mathbf{b}^1(0) = \begin{bmatrix}-0.48\\-0.13\end{bmatrix},\quad \mathbf{W}^2(0) = \begin{bmatrix}0.09 & -0.17\end{bmatrix},\quad b^2(0) = 0.48$
[Figure: initial network response a² versus p ∈ [-2, 2], plotted against the sine-wave target g(p).]
31
Forward Propagation
Initial input: $\mathbf{a}^0 = \mathbf{p} = 1$
Output of the first layer:
$\mathbf{a}^1 = \mathbf{f}^1(\mathbf{W}^1\mathbf{a}^0 + \mathbf{b}^1) = \mathrm{logsig}\!\left(\begin{bmatrix}-0.27\\-0.41\end{bmatrix}1 + \begin{bmatrix}-0.48\\-0.13\end{bmatrix}\right) = \mathrm{logsig}\!\left(\begin{bmatrix}-0.75\\-0.54\end{bmatrix}\right) = \begin{bmatrix}\dfrac{1}{1+e^{0.75}}\\[2mm]\dfrac{1}{1+e^{0.54}}\end{bmatrix} = \begin{bmatrix}0.321\\0.368\end{bmatrix}$
Output of the second layer:
$a^2 = f^2(\mathbf{W}^2\mathbf{a}^1 + b^2) = \mathrm{purelin}\!\left(\begin{bmatrix}0.09 & -0.17\end{bmatrix}\begin{bmatrix}0.321\\0.368\end{bmatrix} + 0.48\right) = 0.446$
Error:
$e = t - a = \left(1 + \sin\!\left(\dfrac{\pi}{4}\,1\right)\right) - 0.446 = 1.261$
32
Transfer Func. Derivatives
$\dot f^1(n) = \dfrac{d}{dn}\left(\dfrac{1}{1+e^{-n}}\right) = \dfrac{e^{-n}}{(1+e^{-n})^2} = \left(1 - \dfrac{1}{1+e^{-n}}\right)\left(\dfrac{1}{1+e^{-n}}\right) = (1 - a^1)(a^1)$
$\dot f^2(n) = \dfrac{d}{dn}(n) = 1$
33
Backpropagation
The second layer sensitivity:
$\mathbf{s}^2 = -2\,\dot{\mathbf{F}}^2(\mathbf{n}^2)(\mathbf{t}-\mathbf{a}) = -2\,[\dot f^2(n^2)]\,e = -2(1)(1.261) = -2.522$
The first layer sensitivity:
$\mathbf{s}^1 = \dot{\mathbf{F}}^1(\mathbf{n}^1)(\mathbf{W}^2)^T\mathbf{s}^2 = \begin{bmatrix}(1-a^1_1)(a^1_1) & 0\\ 0 & (1-a^1_2)(a^1_2)\end{bmatrix}\begin{bmatrix}0.09\\-0.17\end{bmatrix}(-2.522) = \begin{bmatrix}(1-0.321)(0.321) & 0\\ 0 & (1-0.368)(0.368)\end{bmatrix}\begin{bmatrix}0.09\\-0.17\end{bmatrix}(-2.522) = \begin{bmatrix}-0.0495\\0.0997\end{bmatrix}$
34
Weight Update
Learning rate α = 0.1.
$\mathbf{W}^2(1) = \mathbf{W}^2(0) - \alpha\,\mathbf{s}^2(\mathbf{a}^1)^T = \begin{bmatrix}0.09 & -0.17\end{bmatrix} - 0.1\,[-2.522]\begin{bmatrix}0.321 & 0.368\end{bmatrix} = \begin{bmatrix}0.171 & -0.0772\end{bmatrix}$
$b^2(1) = b^2(0) - \alpha\,s^2 = 0.48 - 0.1(-2.522) = 0.732$
$\mathbf{W}^1(1) = \mathbf{W}^1(0) - \alpha\,\mathbf{s}^1(\mathbf{a}^0)^T = \begin{bmatrix}-0.27\\-0.41\end{bmatrix} - 0.1\begin{bmatrix}-0.0495\\0.0997\end{bmatrix}\begin{bmatrix}1\end{bmatrix} = \begin{bmatrix}-0.265\\-0.420\end{bmatrix}$
$\mathbf{b}^1(1) = \mathbf{b}^1(0) - \alpha\,\mathbf{s}^1 = \begin{bmatrix}-0.48\\-0.13\end{bmatrix} - 0.1\begin{bmatrix}-0.0495\\0.0997\end{bmatrix} = \begin{bmatrix}-0.475\\-0.140\end{bmatrix}$
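The numbers on slides 31-34 can be checked with a few lines of NumPy; this is a verification sketch (variable names mine), not part of the original slides.

```python
import numpy as np

logsig = lambda n: 1.0 / (1.0 + np.exp(-n))

W1 = np.array([[-0.27], [-0.41]]); b1 = np.array([[-0.48], [-0.13]])
W2 = np.array([[0.09, -0.17]]);    b2 = np.array([[0.48]])
p, alpha = 1.0, 0.1

# Forward propagation (slide 31).
a0 = np.array([[p]])
a1 = logsig(W1 @ a0 + b1)              # [0.321, 0.368]^T
a2 = W2 @ a1 + b2                      # 0.446 (purelin output)
e = (1 + np.sin(np.pi / 4 * p)) - a2   # 1.261

# Sensitivities (slides 32-33): purelin derivative is 1, logsig's is (1-a)a.
s2 = -2 * 1.0 * e                      # -2.522
s1 = (1 - a1) * a1 * (W2.T * s2)       # [-0.0495, 0.0997]^T

# Weight and bias updates (slide 34).
W2, b2 = W2 - alpha * s2 @ a1.T, b2 - alpha * s2  # [0.171, -0.0772], 0.732
W1, b1 = W1 - alpha * s1 @ a0.T, b1 - alpha * s1  # [-0.265, -0.420], [-0.475, -0.140]
print(W1.ravel(), b1.ravel(), W2.ravel(), b2.ravel())
```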
35
Exercise
For the network shown in the figure below, the initial weights and biases are chosen to be $w^1(0) = 1$, $b^1(0) = -2$, $w^2(0) = 1$, $b^2(0) = 1$. The network transfer functions are $f^1(n) = (n)^2$ and $f^2(n) = \dfrac{1}{n}$, and the input/target pair is given to be $(p = 1,\ t = 1)$. Perform one iteration of backpropagation with $\alpha = 1$.
36
Choice of Network Structure
Multilayer networks can be used to approximate almost any function, if we have enough neurons in the hidden layers.
We cannot say, in general, how many layers or how many neurons are necessary for adequate performance.
37
Illustrated Example 1
$g(p) = 1 + \sin\!\left(\dfrac{i\pi}{4}\,p\right)$ for $-2 \le p \le 2$
[Figure: responses of a trained 1-3-1 network for i = 1, 2, 4, 8; each panel plots the response over p ∈ [-2, 2] with vertical axis -1 to 3.]
38
Illustrated Example 2
$g(p) = 1 + \sin\!\left(\dfrac{6\pi}{4}\,p\right)$ for $-2 \le p \le 2$
[Figure: responses of trained 1-2-1, 1-3-1, 1-4-1 and 1-5-1 networks; axes as above.]
39
Convergence
$g(p) = 1 + \sin(\pi p)$ for $-2 \le p \le 2$
[Figure: two training runs, each showing intermediate network responses labeled 0 through 5. Left: convergence to the global minimum. Right: convergence to a local minimum.]
The numbers next to each curve indicate the sequence of iterations.
40
Generalization
In most cases the multilayer network is trained with a finite number of examples of proper network behavior:
$\{\mathbf{p}_1, \mathbf{t}_1\}, \{\mathbf{p}_2, \mathbf{t}_2\}, \ldots, \{\mathbf{p}_Q, \mathbf{t}_Q\}$
This training set is normally representative of a much larger class of possible input/output pairs.
Can the network successfully generalize what it has learned to the total population?
41
Generalization Example
$g(p) = 1 + \sin\!\left(\dfrac{\pi}{4}\,p\right)$, sampled at $p = -2, -1.6, -1.2, \ldots, 1.6, 2$
[Figure: a 1-2-1 network (left) and a 1-9-1 network (right) fitted to the 11 training points. The 1-2-1 network generalizes well; the 1-9-1 network does not.]
For a network to be able to generalize, it should have fewer parameters than there are data points in the training set.
42
Objectives
Neural networks trained in a supervised manner require a target signal to define correct network behavior.
The unsupervised learning rules give networks the ability to learn associations between patterns that occur together frequently.
Associative learning allows networks to perform useful tasks such as pattern recognition (instar) and recall (outstar).
43
What is an Association?
An association is any link between a system’s input and output such that when a pattern A is presented to the system it will respond with pattern B.
When two patterns are linked by an association, the input pattern is referred to as the stimulus and the output pattern is referred to as the response.
44
Classic Experiment
Ivan Pavlov trained a dog to salivate at the sound of a bell, by ringing the bell whenever food was presented. When the bell was repeatedly paired with the food, the dog became conditioned to salivate at the sound of the bell, even when no food was present.
B. F. Skinner trained a rat to press a bar in order to obtain a food pellet.
45
Associative Learning
Anderson and Kohonen independently developed the linear associator in the late 1960s and early 1970s.
Grossberg introduced nonlinear continuous-time associative networks during the same time period.
46
Simple Associative Network
Single-Input Hard Limit Associator
Restrict the value of p to be either 0 or 1, indicating whether a stimulus is absent or present. The output a indicates the presence or absence of the network's response:
$a = \mathrm{hardlim}(wp + b) = \mathrm{hardlim}(wp - 0.5)$
$p = \begin{cases}1, & \text{stimulus}\\ 0, & \text{no stimulus}\end{cases}\qquad a = \begin{cases}1, & \text{response}\\ 0, & \text{no response}\end{cases}$
47
Two Types of Inputs
Unconditioned Stimulus
Analogous to the food presented to the dog in Pavlov’s experiment.
Conditioned Stimulus
Analogous to the bell in Pavlov’s experiment.
The dog salivates only when the food is presented; this is an innate response that does not have to be learned.
48
Banana Associator
An unconditioned stimulus (banana shape) and a conditioned stimulus (banana smell)
Initially, the network responds to the shape of a banana, but not to its smell:
$w^0 = 1,\qquad w = 0,\qquad b = -0.5$
$p^0 = \begin{cases}1, & \text{shape detected}\\ 0, & \text{shape not detected}\end{cases}\qquad p = \begin{cases}1, & \text{smell detected}\\ 0, & \text{smell not detected}\end{cases}$
$a = \mathrm{hardlim}(w^0 p^0 + wp + b)$
49
Associative Learning
Both animals and humans tend to associate things that occur simultaneously.
If a banana smell stimulus occurs simultaneously with a banana concept response (activated by some other stimulus, such as the sight of a banana shape), the network should strengthen the connection between them, so that later it can activate its banana concept in response to the banana smell alone.
50
Unsupervised Hebb Rule
Increase the weight $w_{ij}$ between a neuron's input $p_j$ and output $a_i$ in proportion to their product:
$w_{ij}(q) = w_{ij}(q-1) + \alpha\,a_i(q)\,p_j(q)$
The Hebb rule uses only signals available within the layer containing the weight being updated, so it is a local learning rule.
Vector form: $\mathbf{W}(q) = \mathbf{W}(q-1) + \alpha\,\mathbf{a}(q)\,\mathbf{p}^T(q)$
Learning is performed in response to the training sequence $\mathbf{p}(1), \mathbf{p}(2), \ldots, \mathbf{p}(Q)$.
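In code the vector form is a single outer product; a minimal NumPy sketch (function name mine):

```python
import numpy as np

def hebb_update(W, p, a, alpha=1.0):
    # Unsupervised Hebb rule: W(q) = W(q-1) + alpha * a(q) p(q)^T.
    return W + alpha * np.outer(a, p)
```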
51
Ex: Banana Associator
Initial weights: $w^0 = 1$, $w(0) = 0$
Training sequence: $\{p^0(1) = 0,\ p(1) = 1\},\ \{p^0(2) = 1,\ p(2) = 1\},\ \{p^0(3) = 0,\ p(3) = 1\},\ \ldots$
Learning rule: $w(q) = w(q-1) + a(q)\,p(q)$, with $\alpha = 1$.
[Figure: banana associator; the sight (shape) and smell inputs feed the fruit network, which answers "Banana?".]
52
Ex: Banana Associator
First iteration (sight fails):
$a(1) = \mathrm{hardlim}(w^0 p^0(1) + w(0)\,p(1) - 0.5) = \mathrm{hardlim}(1\cdot0 + 0\cdot1 - 0.5) = 0$ (no response)
$w(1) = w(0) + a(1)\,p(1) = 0 + 0\cdot1 = 0$
Second iteration (sight works):
$a(2) = \mathrm{hardlim}(w^0 p^0(2) + w(1)\,p(2) - 0.5) = \mathrm{hardlim}(1\cdot1 + 0\cdot1 - 0.5) = 1$ (banana)
$w(2) = w(1) + a(2)\,p(2) = 0 + 1\cdot1 = 1$
53
Ex: Banana Associator
Third iteration (sight fails):
$a(3) = \mathrm{hardlim}(w^0 p^0(3) + w(2)\,p(3) - 0.5) = \mathrm{hardlim}(1\cdot0 + 1\cdot1 - 0.5) = 1$ (banana)
$w(3) = w(2) + a(3)\,p(3) = 1 + 1\cdot1 = 2$
From now on, the network is capable of responding to bananas detected by either sight or smell. Even if both detection systems suffer intermittent faults, the network will be correct most of the time.
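The three iterations can be replayed with a short script; this sketch assumes hardlim, α = 1, w⁰ = 1 and b = -0.5 as on the slides (names mine).

```python
hardlim = lambda n: 1 if n >= 0 else 0

w0, w, b = 1, 0, -0.5                 # unconditioned weight, conditioned weight, bias
sequence = [(0, 1), (1, 1), (0, 1)]   # (shape p0, smell p) for iterations 1-3

for q, (p0, p) in enumerate(sequence, start=1):
    a = hardlim(w0 * p0 + w * p + b)
    w = w + a * p                     # unsupervised Hebb rule, alpha = 1
    print(f"iteration {q}: a = {a}, w = {w}")
# prints a = 0, 1, 1 and w = 0, 1, 2, matching the slides
```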
54
Problems of Hebb Rule
Weights will become arbitrarily large
Synapses cannot grow without bound.
There is no mechanism for weights to decrease
If the inputs or outputs of a Hebb network experience any noise, every weight will grow (however slowly) until the network responds to any stimulus.
55
Hebb Rule with Decay
$\gamma$, the decay rate, is a positive constant less than one:
$\mathbf{W}(q) = \mathbf{W}(q-1) + \alpha\,\mathbf{a}(q)\,\mathbf{p}^T(q) - \gamma\,\mathbf{W}(q-1) = (1-\gamma)\,\mathbf{W}(q-1) + \alpha\,\mathbf{a}(q)\,\mathbf{p}^T(q)$
This keeps the weight matrix from growing without bound. The maximum weight value can be found by setting both $a_i$ and $p_j$ to 1:
$w^{max}_{ij} = (1-\gamma)\,w^{max}_{ij} + \alpha \;\Rightarrow\; w^{max}_{ij} = \dfrac{\alpha}{\gamma}$
The maximum weight value is determined by the decay rate $\gamma$.
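A one-line change to the Hebb sketch adds the decay term; running the sketch below (names mine) shows the weight saturating at α/γ when stimulus and response are always present.

```python
def hebb_decay_update(w, p, a, alpha=1.0, gamma=0.1):
    # Hebb rule with decay: w(q) = (1 - gamma) * w(q-1) + alpha * a * p.
    return (1 - gamma) * w + alpha * a * p

w = 0.0
for _ in range(60):
    w = hebb_decay_update(w, p=1, a=1)
print(round(w, 3))  # approaches w_max = alpha / gamma = 10
```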
56
Ex: Banana Associator
Learning rule: $w(q) = w(q-1) + a(q)\,p(q) - 0.1\,w(q-1)$, with $\alpha = 1$, $\gamma = 0.1$.
First iteration (sight fails): no response.
$a(1) = \mathrm{hardlim}(w^0 p^0(1) + w(0)\,p(1) - 0.5) = 0$
$w(1) = w(0) + a(1)\,p(1) - 0.1\,w(0) = 0 + 0\cdot1 - 0.1(0) = 0$
Second iteration (sight works): banana.
$a(2) = \mathrm{hardlim}(w^0 p^0(2) + w(1)\,p(2) - 0.5) = 1$
$w(2) = w(1) + a(2)\,p(2) - 0.1\,w(1) = 0 + 1\cdot1 - 0.1(0) = 1$
Third iteration (sight fails): banana.
$a(3) = \mathrm{hardlim}(w^0 p^0(3) + w(2)\,p(3) - 0.5) = 1$
$w(3) = w(2) + a(3)\,p(3) - 0.1\,w(2) = 1 + 1\cdot1 - 0.1(1) = 1.9$
57
Ex: Banana Associator
$w^{max}_{ij} = \dfrac{\alpha}{\gamma} = \dfrac{1}{0.1} = 10$
[Figure: weight versus iteration over 30 iterations. Left (Hebb rule): the weight grows without bound. Right (Hebb with decay): the weight saturates at 10.]
58
Problem of Hebb Rule with Decay
Associations will decay away if stimuli are not occasionally presented.
If $a_i = 0$, then $w_{ij}(q) = (1-\gamma)\,w_{ij}(q-1)$.
If $\gamma = 0.1$, this reduces to $w_{ij}(q) = 0.9\,w_{ij}(q-1)$.
The weight decays by 10% at each iteration for which $a_i = 0$ (no stimulus).
[Figure: weight decaying toward zero over 30 iterations; vertical axis 0 to 3.]
59
Instar (Recognition Network)
A neuron that has a vector input and a scalar output is referred to as an instar.
This neuron is capable of pattern recognition.
The instar is similar to the perceptron, the ADALINE, and the linear associator.
[Figure: instar; inputs p₁, p₂, …, p_R with weights w₁,₁ … w₁,R and bias b feed a single neuron with output a.]
60
Instar Operation
Input-output expression: $a = \mathrm{hardlim}(\mathbf{Wp} + b) = \mathrm{hardlim}({}_1\mathbf{w}^T\mathbf{p} + b)$
The instar is active when ${}_1\mathbf{w}^T\mathbf{p} \ge -b$, or $\|{}_1\mathbf{w}\|\,\|\mathbf{p}\|\cos\theta \ge -b$, where $\theta$ is the angle between the two vectors.
If ${}_1\mathbf{w} = \mathbf{p}$, the inner product is maximized (the angle is 0).
Assume that all input vectors have the same length (norm).
61
Vector Recognition
If $b = -\|{}_1\mathbf{w}\|\,\|\mathbf{p}\|$, then the instar will be active only when $\theta = 0$.
If $b > -\|{}_1\mathbf{w}\|\,\|\mathbf{p}\|$, then the instar will be active for a range of angles.
The larger the value of b, the more patterns there will be that can activate the instar, making it less discriminatory.
Recall the forgetting problem of the Hebb rule with decay: it requires stimuli to be repeated, or associations will be lost.
62
Instar Rule
Hebb rule: $w_{ij}(q) = w_{ij}(q-1) + \alpha\,a_i(q)\,p_j(q)$
Hebb rule with decay: $w_{ij}(q) = (1-\gamma)\,w_{ij}(q-1) + \alpha\,a_i(q)\,p_j(q)$
Instar rule: to fix the forgetting problem, a decay term proportional to $a_i(q)$ is added, so that weight decay occurs only when the instar is active ($a_i \ne 0$):
$w_{ij}(q) = w_{ij}(q-1) + \alpha\,a_i(q)\,p_j(q) - \gamma\,a_i(q)\,w_{ij}(q-1)$
If $\gamma = \alpha$:
$w_{ij}(q) = w_{ij}(q-1) + \alpha\,a_i(q)\,[p_j(q) - w_{ij}(q-1)]$
Vector form: ${}_i\mathbf{w}(q) = {}_i\mathbf{w}(q-1) + \alpha\,a_i(q)\,[\mathbf{p}(q) - {}_i\mathbf{w}(q-1)]$
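The instar rule is again one line in NumPy (function name mine); note that with a = 0 the weights are untouched, which removes the forgetting problem.

```python
import numpy as np

def instar_update(w, p, a, alpha=1.0):
    # Instar rule: w(q) = w(q-1) + alpha * a * (p(q) - w(q-1)).
    # With a = 0 the weight vector is unchanged; with a = 1 and alpha = 1
    # it jumps directly onto the input vector p.
    return w + alpha * a * (p - w)
```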
63
Graphical Representation
For the case where the instar is active ($a_i = 1$):
${}_i\mathbf{w}(q) = {}_i\mathbf{w}(q-1) + \alpha\,[\mathbf{p}(q) - {}_i\mathbf{w}(q-1)] = (1-\alpha)\,{}_i\mathbf{w}(q-1) + \alpha\,\mathbf{p}(q)$
For the case where the instar is inactive ($a_i = 0$):
${}_i\mathbf{w}(q) = {}_i\mathbf{w}(q-1)$
[Figure: when active, ᵢw(q) moves along the line from ᵢw(q-1) toward p(q).]
64
Ex: Orange Recognizer
The elements of p will be constrained to ±1 values.
$p^0 = \begin{cases}1, & \text{orange detected visually}\\ 0, & \text{orange not detected}\end{cases}\qquad \mathbf{p} = \begin{bmatrix}\text{measured shape}\\ \text{measured texture}\\ \text{measured weight}\end{bmatrix}$
$a = \mathrm{hardlim}(w^0 p^0 + \mathbf{Wp} + b)$
$b = -2$, a value slightly more positive than $-\|\mathbf{p}\|^2 = -3$.
[Figure: orange recognizer; the sight input p⁰ enters with weight w⁰ = 3, and the measured shape, texture and weight inputs p₁, p₂, p₃ enter with weights w₁,₁ … w₁,₃; the output answers "Orange?".]
65
Initialization & Training
Initial weights: $w^0 = 3$, ${}_1\mathbf{w}(0) = \begin{bmatrix}0 & 0 & 0\end{bmatrix}^T$
The instar rule ($\alpha = 1$): ${}_1\mathbf{w}(q) = {}_1\mathbf{w}(q-1) + a(q)\,[\mathbf{p}(q) - {}_1\mathbf{w}(q-1)]$
Training sequence:
$\left\{p^0(1) = 0,\ \mathbf{p}(1) = \begin{bmatrix}1\\-1\\-1\end{bmatrix}\right\},\ \left\{p^0(2) = 1,\ \mathbf{p}(2) = \begin{bmatrix}1\\-1\\-1\end{bmatrix}\right\},\ \ldots$
First iteration:
$a(1) = \mathrm{hardlim}(w^0 p^0(1) + {}_1\mathbf{w}(0)^T\mathbf{p}(1) - 2) = \mathrm{hardlim}\!\left(3\cdot0 + \begin{bmatrix}0&0&0\end{bmatrix}\begin{bmatrix}1\\-1\\-1\end{bmatrix} - 2\right) = 0$ (no response)
${}_1\mathbf{w}(1) = {}_1\mathbf{w}(0) + a(1)\,[\mathbf{p}(1) - {}_1\mathbf{w}(0)] = \begin{bmatrix}0\\0\\0\end{bmatrix}$
66
Second Training Iteration
Second iteration:
$a(2) = \mathrm{hardlim}(w^0 p^0(2) + {}_1\mathbf{w}(1)^T\mathbf{p}(2) - 2) = \mathrm{hardlim}\!\left(3\cdot1 + \begin{bmatrix}0&0&0\end{bmatrix}\begin{bmatrix}1\\-1\\-1\end{bmatrix} - 2\right) = \mathrm{hardlim}(1) = 1$ (orange)
${}_1\mathbf{w}(2) = {}_1\mathbf{w}(1) + a(2)\,[\mathbf{p}(2) - {}_1\mathbf{w}(1)] = \begin{bmatrix}0\\0\\0\end{bmatrix} + 1\left(\begin{bmatrix}1\\-1\\-1\end{bmatrix} - \begin{bmatrix}0\\0\\0\end{bmatrix}\right) = \begin{bmatrix}1\\-1\\-1\end{bmatrix}$
The network can now recognize the orange by its measurements.
67
Third Training Iteration
Third iteration (sight fails):
$a(3) = \mathrm{hardlim}(w^0 p^0(3) + {}_1\mathbf{w}(2)^T\mathbf{p}(3) - 2) = \mathrm{hardlim}\!\left(3\cdot0 + \begin{bmatrix}1&-1&-1\end{bmatrix}\begin{bmatrix}1\\-1\\-1\end{bmatrix} - 2\right) = \mathrm{hardlim}(1) = 1$ (orange)
${}_1\mathbf{w}(3) = {}_1\mathbf{w}(2) + a(3)\,[\mathbf{p}(3) - {}_1\mathbf{w}(2)] = \begin{bmatrix}1\\-1\\-1\end{bmatrix} + 1\left(\begin{bmatrix}1\\-1\\-1\end{bmatrix} - \begin{bmatrix}1\\-1\\-1\end{bmatrix}\right) = \begin{bmatrix}1\\-1\\-1\end{bmatrix}$
The orange will now be detected if either set of sensors works.
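The three orange-recognizer iterations can be replayed with the instar rule; this sketch assumes w⁰ = 3, b = -2, α = 1 and the ±1 measurement vector reconstructed above (names mine).

```python
import numpy as np

hardlim = lambda n: 1 if n >= 0 else 0
instar_update = lambda w, p, a, alpha=1.0: w + alpha * a * (p - w)

w0, b = 3.0, -2.0
w = np.zeros(3)                          # 1w(0) = [0, 0, 0]^T
orange = np.array([1.0, -1.0, -1.0])     # measured shape, texture, weight

for q, p0 in enumerate([0, 1, 0], start=1):   # sight fails, works, fails
    a = hardlim(w0 * p0 + w @ orange + b)
    w = instar_update(w, orange, a)
    print(f"iteration {q}: a = {a}, w = {w}")
# a = 0, 1, 1; w converges to the orange vector after iteration 2
```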
68
P13.5
Consider the instar network shown in slide 64. The training sequence for this network consists of the following inputs:
$\{p^0(1) = 0,\ \mathbf{p}(1)\},\ \{p^0(2) = 1,\ \mathbf{p}(2)\},\ \ldots$
where $\mathbf{p}(1) = \mathbf{p}(2)$ is a fixed two-element vector with ±1 entries. These two sets of inputs are repeatedly presented to the network until the weight matrix $\mathbf{W}$ converges.
1. Perform the first four iterations of the instar rule, with learning rate $\alpha = 0.5$. Assume that the initial weight matrix is set to all zeros.
2. Display the results of each iteration of the instar rule in graphical form.
69
Kohonen Rule
Kohonen rule: ${}_i\mathbf{w}(q) = {}_i\mathbf{w}(q-1) + \alpha\,[\mathbf{p}(q) - {}_i\mathbf{w}(q-1)],\quad \text{for } i \in X(q)$
Learning occurs when the neuron's index i is a member of the set X(q).
The Kohonen rule can be made equivalent to the instar rule by defining X(q) as the set of all i such that $a_i(q) = 1$.
The Kohonen rule allows the weights of a neuron to learn an input vector, and is therefore suitable for recognition applications.
70
Outstar (Recall Network)
The outstar network has a scalar input and a vector output.
It can perform pattern recall by associating a stimulus with a vector response.
[Figure: outstar; the scalar input p feeds S neurons (net inputs n₁ … n_S, outputs a₁ … a_S) through weights w₁,₁ … w_{S,1}.]
71
Outstar Operation
Input-output expression: $\mathbf{a} = \mathrm{satlins}(\mathbf{W}p)$
If we would like the outstar network to associate a stimulus (an input of 1) with a particular output vector $\mathbf{a}^*$, set $\mathbf{W} = \mathbf{a}^*$.
Then, if $p = 1$: $\mathbf{a} = \mathrm{satlins}(\mathbf{W}p) = \mathrm{satlins}(\mathbf{a}^*p) = \mathbf{a}^*$, and the pattern is correctly recalled.
The column of the weight matrix represents the pattern to be recalled.
72
Outstar Rule
In the instar rule, the weight decay term of the Hebb rule is proportional to the output of the network, $a_i$.
In the outstar rule, the weight decay term of the Hebb rule is proportional to the input of the network, $p_j$:
$w_{ij}(q) = w_{ij}(q-1) + \alpha\,a_i(q)\,p_j(q) - \gamma\,p_j(q)\,w_{ij}(q-1)$
If $\gamma = \alpha$:
$w_{ij}(q) = w_{ij}(q-1) + \alpha\,[a_i(q) - w_{ij}(q-1)]\,p_j(q)$
Vector form: $\mathbf{w}_j(q) = \mathbf{w}_j(q-1) + \alpha\,[\mathbf{a}(q) - \mathbf{w}_j(q-1)]\,p_j(q)$
Learning occurs whenever $p_j$ is nonzero (instead of $a_i$). When learning occurs, column $\mathbf{w}_j$ moves toward the output vector. (Complementary to the instar rule.)
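A matching NumPy sketch of the outstar rule (function name mine); the update is driven by the input p_j rather than the output.

```python
import numpy as np

def outstar_update(wj, a, pj, alpha=1.0):
    # Outstar rule: w_j(q) = w_j(q-1) + alpha * (a(q) - w_j(q-1)) * p_j(q).
    # When p_j is nonzero, column w_j moves toward the output vector a.
    return wj + alpha * (a - wj) * pj
```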
73
Ex: Pineapple Recaller
Any set of p⁰ values (±1) will be copied to a:
$\mathbf{a} = \mathrm{satlins}(\mathbf{W}^0\mathbf{p}^0 + \mathbf{W}p)$
$\mathbf{W}^0 = \begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix}$
$\mathbf{p}^0 = \begin{bmatrix}\text{measured shape}\\ \text{measured texture}\\ \text{measured weight}\end{bmatrix},\qquad p = \begin{cases}1, & \text{if a pineapple can be seen}\\ 0, & \text{otherwise}\end{cases}$
[Figure: pineapple recaller; the three measurements p⁰₁, p⁰₂, p⁰₃ enter through the identity weights w⁰₁,₁, w⁰₂,₂, w⁰₃,₃, and the sight input p enters through weights w₁,₁ … w₃,₁; the outputs a₁, a₂, a₃ are the recalled shape, texture and weight.]
74
Initialization
The outstar rule ($\alpha = 1$): $\mathbf{w}_j(q) = \mathbf{w}_j(q-1) + [\mathbf{a}(q) - \mathbf{w}_j(q-1)]\,p_j(q)$
Training sequence:
$\left\{\mathbf{p}^0(1) = \begin{bmatrix}0\\0\\0\end{bmatrix},\ p(1) = 1\right\},\ \left\{\mathbf{p}^0(2) = \mathbf{p}_{\text{pineapple}},\ p(2) = 1\right\},\ \ldots$
Pineapple measurements: $\mathbf{p}_{\text{pineapple}}$ is a fixed vector of ±1 shape, texture and weight values.
75
First Training Iteration
First iteration:
$\mathbf{a}(1) = \mathrm{satlins}(\mathbf{W}^0\mathbf{p}^0(1) + \mathbf{w}_1(0)\,p(1)) = \mathrm{satlins}\!\left(\begin{bmatrix}0\\0\\0\end{bmatrix} + \begin{bmatrix}0\\0\\0\end{bmatrix}1\right) = \begin{bmatrix}0\\0\\0\end{bmatrix}$ (no response)
$\mathbf{w}_1(1) = \mathbf{w}_1(0) + [\mathbf{a}(1) - \mathbf{w}_1(0)]\,p(1) = \begin{bmatrix}0\\0\\0\end{bmatrix} + \left(\begin{bmatrix}0\\0\\0\end{bmatrix} - \begin{bmatrix}0\\0\\0\end{bmatrix}\right)1 = \begin{bmatrix}0\\0\\0\end{bmatrix}$
76
Second Training Iteration
Second iteration:
$\mathbf{a}(2) = \mathrm{satlins}(\mathbf{W}^0\mathbf{p}^0(2) + \mathbf{w}_1(1)\,p(2)) = \mathrm{satlins}\!\left(\mathbf{p}_{\text{pineapple}} + \begin{bmatrix}0\\0\\0\end{bmatrix}1\right) = \mathbf{p}_{\text{pineapple}}$ (measurements given)
$\mathbf{w}_1(2) = \mathbf{w}_1(1) + [\mathbf{a}(2) - \mathbf{w}_1(1)]\,p(2) = \begin{bmatrix}0\\0\\0\end{bmatrix} + \left(\mathbf{p}_{\text{pineapple}} - \begin{bmatrix}0\\0\\0\end{bmatrix}\right)1 = \mathbf{p}_{\text{pineapple}}$
The network forms an association between the sight and the measurements.
77
Third Training Iteration
Third iteration (measurements fail, sight works):
$\mathbf{a}(3) = \mathrm{satlins}(\mathbf{W}^0\mathbf{p}^0(3) + \mathbf{w}_1(2)\,p(3)) = \mathrm{satlins}\!\left(\begin{bmatrix}0\\0\\0\end{bmatrix} + \mathbf{p}_{\text{pineapple}}\cdot1\right) = \mathbf{p}_{\text{pineapple}}$ (measurements recalled)
$\mathbf{w}_1(3) = \mathbf{w}_1(2) + [\mathbf{a}(3) - \mathbf{w}_1(2)]\,p(3) = \mathbf{p}_{\text{pineapple}}$
Even if the measurement system fails, the network is now able to recall the measurements of the pineapple when it sees it.
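The recaller iterations follow the same pattern; this sketch assumes W⁰ = I and α = 1, and uses an illustrative ±1 measurement vector, since the slide's exact sign pattern is not recoverable from the transcript.

```python
import numpy as np

satlins = lambda n: np.clip(n, -1.0, 1.0)
outstar_update = lambda wj, a, pj, alpha=1.0: wj + alpha * (a - wj) * pj

W0 = np.eye(3)
w1 = np.zeros(3)                          # w_1(0) = [0, 0, 0]^T
pineapple = np.array([-1.0, -1.0, 1.0])   # illustrative +-1 measurements

# (measurements p0, sight p) for iterations 1-3
sequence = [(np.zeros(3), 1), (pineapple, 1), (np.zeros(3), 1)]
for q, (p0, p) in enumerate(sequence, start=1):
    a = satlins(W0 @ p0 + w1 * p)
    w1 = outstar_update(w1, a, p)
    print(f"iteration {q}: a = {a}, w1 = {w1}")
# iteration 3 recalls the measurements from sight alone
```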