INTRODUCTION TO Machine Learning
ETHEM ALPAYDIN © The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml

Lecture Slides for CHAPTER 11: Multilayer Perceptrons
Neural Networks
Networks of processing units (neurons) with connections (synapses) between them
Large number of neurons: ~10^10
Large connectivity: ~10^5 connections per neuron
Parallel processing
Distributed computation/memory
Robust to noise, failures
Understanding the Brain
Levels of analysis (Marr, 1982):
1. Computational theory
2. Representation and algorithm
3. Hardware implementation
Reverse engineering: From hardware to theory
Parallel processing: SIMD vs MIMD
Neural net: SIMD with modifiable local memory
Learning: Update by training/experience
Perceptron
(Rosenblatt, 1962)
$$y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}$$

$$\mathbf{w} = [w_0, w_1, \ldots, w_d]^T \qquad \mathbf{x} = [1, x_1, \ldots, x_d]^T$$
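As a minimal sketch in NumPy (illustrative names, not from the slides), the perceptron output is a dot product with an input augmented by a leading 1 for the bias:

    import numpy as np

    def perceptron_output(w, x):
        # y = w^T x, with x = [1, x_1, ..., x_d]^T so that w_0 acts as the bias
        x_aug = np.concatenate(([1.0], x))
        return w @ x_aug

    # example: d = 2 inputs, w = [w_0, w_1, w_2]
    w = np.array([-0.5, 1.0, 1.0])
    print(perceptron_output(w, np.array([0.3, 0.4])))  # -0.5 + 0.3 + 0.4 = 0.2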
What a Perceptron Does
Regression: $y = wx + w_0$
Classification: $y = \mathbf{1}(wx + w_0 > 0)$
[Figure: perceptron diagrams with weights w, bias w_0 and bias unit x_0 = +1: a linear output y for regression, and a sigmoid output s for classification]
$$y = \text{sigmoid}(o) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$$
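A one-line sketch of the sigmoid (illustrative):

    import numpy as np

    def sigmoid(a):
        # squashes w^T x into (0, 1); usable as an estimate of P(C1|x)
        return 1.0 / (1.0 + np.exp(-a))

Thresholding the output at 0.5 gives the hard classification decision.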
K Outputs
Regression:

$$y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = \mathbf{w}_i^T \mathbf{x} \qquad \mathbf{y} = \mathbf{W}\mathbf{x}$$

Classification:

$$o_i = \mathbf{w}_i^T \mathbf{x} \qquad y_i = \frac{\exp o_i}{\sum_k \exp o_k} \qquad \text{choose } C_i \text{ if } y_i = \max_k y_k$$
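A hedged sketch of the K-output classification case (illustrative code, not from the slides):

    import numpy as np

    def k_outputs(W, x):
        # W is K x (d+1); returns the softmax posteriors y and the chosen class
        x_aug = np.concatenate(([1.0], x))
        o = W @ x_aug                 # o_i = w_i^T x
        e = np.exp(o - o.max())       # subtract the max for numerical stability
        y = e / e.sum()               # y_i = exp(o_i) / sum_k exp(o_k)
        return y, int(np.argmax(y))   # choose C_i if y_i = max_k y_k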
Training
Online (instances seen one by one) vs batch (whole sample) learning:
No need to store the whole sample
Problem may change in time
Wear and degradation in system components
Stochastic gradient-descent: Update after a single pattern
Generic update rule (LMS rule):
$$\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$$

Update = LearningFactor · (DesiredOutput − ActualOutput) · Input
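One application of this rule is a single line of code (a sketch; eta and the argument shapes are my assumptions):

    def lms_update(w, r, y, x, eta=0.1):
        # Delta w = eta * (desired - actual) * input
        return w + eta * (r - y) * x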
Training a Perceptron: Regression
Regression (Linear output):
$$E^t(\mathbf{w}\,|\,\mathbf{x}^t, r^t) = \frac{1}{2}\,(r^t - y^t)^2 = \frac{1}{2}\left[r^t - (\mathbf{w}^T \mathbf{x}^t)\right]^2$$

$$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$$
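A minimal online training loop for this case (an illustrative sketch; initialization range, eta, and epochs are assumptions):

    import numpy as np

    def train_regression(X, r, eta=0.01, epochs=100):
        # Stochastic gradient descent for E^t = (1/2)(r^t - w^T x^t)^2.
        # X is N x d; a leading 1 is appended to each x^t for the bias w_0.
        N, d = X.shape
        w = np.random.uniform(-0.01, 0.01, d + 1)
        Xa = np.hstack([np.ones((N, 1)), X])
        for _ in range(epochs):
            for t in np.random.permutation(N):
                y = w @ Xa[t]                  # linear output y^t
                w += eta * (r[t] - y) * Xa[t]  # Delta w_j^t = eta (r^t - y^t) x_j^t
        return w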
Classification
Single sigmoid output:

$$y^t = \text{sigmoid}(\mathbf{w}^T \mathbf{x}^t)$$

$$E^t(\mathbf{w}\,|\,\mathbf{x}^t, r^t) = -r^t \log y^t - (1 - r^t) \log(1 - y^t)$$

$$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$$

K>2 softmax outputs:

$$y_i^t = \frac{\exp \mathbf{w}_i^T \mathbf{x}^t}{\sum_k \exp \mathbf{w}_k^T \mathbf{x}^t}$$

$$E^t(\{\mathbf{w}_i\}_i\,|\,\mathbf{x}^t, \mathbf{r}^t) = -\sum_i r_i^t \log y_i^t$$

$$\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$$
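Note that the update has exactly the same form as in regression; only the output changes. A sketch for the single sigmoid output (illustrative; r holds 0/1 labels):

    import numpy as np

    def train_logistic(X, r, eta=0.1, epochs=100):
        # Online updates for the cross-entropy error with a sigmoid output;
        # only y^t = sigmoid(w^T x^t) differs from the regression loop.
        N, d = X.shape
        w = np.random.uniform(-0.01, 0.01, d + 1)
        Xa = np.hstack([np.ones((N, 1)), X])
        for _ in range(epochs):
            for t in np.random.permutation(N):
                y = 1.0 / (1.0 + np.exp(-(w @ Xa[t])))
                w += eta * (r[t] - y) * Xa[t]
        return w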
Learning Boolean AND
XOR
No w0, w1, w2 satisfy:
(Minsky and Papert, 1969)
$$\begin{aligned} w_0 &\le 0 && (x = (0,0),\ y = 0) \\ w_2 + w_0 &> 0 && (x = (0,1),\ y = 1) \\ w_1 + w_0 &> 0 && (x = (1,0),\ y = 1) \\ w_1 + w_2 + w_0 &\le 0 && (x = (1,1),\ y = 0) \end{aligned}$$

Summing the two strict inequalities gives $w_1 + w_2 + 2w_0 > 0$; since $w_0 \le 0$, this forces $w_1 + w_2 + w_0 > 0$, contradicting the last constraint. So no single perceptron can compute XOR.
Multilayer Perceptrons
(Rumelhart et al., 1986)
$$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$

$$z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$
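A sketch of this forward pass (illustrative; W and V include the bias columns):

    import numpy as np

    def mlp_forward(W, V, x):
        # One hidden layer: W is H x (d+1), V is K x (H+1).
        # z_h = sigmoid(w_h^T x); y_i = v_i^T z (linear outputs, as in regression).
        x_aug = np.concatenate(([1.0], x))
        z = 1.0 / (1.0 + np.exp(-(W @ x_aug)))
        z_aug = np.concatenate(([1.0], z))
        return V @ z_aug, z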
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
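The hand-picked weights below (my illustrative choice, not from the slides) realize each AND term with one hidden unit and the OR with the output unit; scaling the weights makes the sigmoids nearly 0/1:

    import numpy as np

    sig = lambda a: 1.0 / (1.0 + np.exp(-a))

    W = np.array([[-5.0, 10.0, -10.0],    # z1 ~ x1 AND NOT x2
                  [-5.0, -10.0, 10.0]])   # z2 ~ NOT x1 AND x2
    v = np.array([-5.0, 10.0, 10.0])      # y ~ z1 OR z2

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        z = sig(W @ np.array([1.0, x1, x2]))
        y = sig(v @ np.concatenate(([1.0], z)))
        print(x1, x2, round(float(y)))    # prints the XOR truth table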
Backpropagation
$$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$

$$z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$

$$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hj}}$$
Regression:

$$E(\mathbf{W}, \mathbf{v}\,|\,\mathcal{X}) = \frac{1}{2} \sum_t (r^t - y^t)^2$$

Forward:

$$y^t = \sum_{h=1}^{H} v_h z_h^t + v_0 \qquad z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x})$$

Backward:

$$\Delta v_h = \eta \sum_t (r^t - y^t)\, z_h^t$$

$$\Delta w_{hj} = -\eta \frac{\partial E}{\partial w_{hj}} = -\eta \sum_t \frac{\partial E}{\partial y^t}\,\frac{\partial y^t}{\partial z_h^t}\,\frac{\partial z_h^t}{\partial w_{hj}} = \eta \sum_t (r^t - y^t)\, v_h\, z_h^t (1 - z_h^t)\, x_j^t$$
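A batch sketch of these forward/backward passes for a single output (illustrative; H, eta, epochs, and initialization are my assumptions):

    import numpy as np

    def backprop_regression(X, r, H=3, eta=0.1, epochs=1000):
        # Batch gradient descent following the equations above.
        # X is N x d; r has length N.
        N, d = X.shape
        Xa = np.hstack([np.ones((N, 1)), X])
        W = np.random.uniform(-0.01, 0.01, (H, d + 1))
        v = np.random.uniform(-0.01, 0.01, H + 1)
        for _ in range(epochs):
            Z = 1.0 / (1.0 + np.exp(-(Xa @ W.T)))   # forward: z_h^t, N x H
            Za = np.hstack([np.ones((N, 1)), Z])
            y = Za @ v                              # forward: y^t
            err = r - y                             # r^t - y^t
            dv = eta * (Za.T @ err)                 # Delta v_h
            dW = eta * ((np.outer(err, v[1:]) * Z * (1 - Z)).T @ Xa)  # Delta w_hj
            v += dv
            W += dW
        return W, v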
Regression with Multiple Outputs
[Figure: MLP with inputs x_j, first-layer weights w_hj, hidden units z_h, second-layer weights v_ih, and outputs y_i]
$$E(\mathbf{W}, \mathbf{V}\,|\,\mathcal{X}) = \frac{1}{2} \sum_t \sum_i (r_i^t - y_i^t)^2$$

$$y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$$

$$\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\, z_h^t$$

$$\Delta w_{hj} = \eta \sum_t \left[\sum_i (r_i^t - y_i^t)\, v_{ih}\right] z_h^t (1 - z_h^t)\, x_j^t$$
[Figure: hidden-unit inputs w_h x + w_0, activations z_h, and weighted contributions v_h z_h plotted against the input]
Two-Class Discrimination
One sigmoid output: $y^t$ estimates $P(C_1|\mathbf{x}^t)$, and $P(C_2|\mathbf{x}^t) \equiv 1 - y^t$
$$y^t = \text{sigmoid}\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$$

$$E(\mathbf{W}, \mathbf{v}\,|\,\mathcal{X}) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log(1 - y^t) \right]$$

$$\Delta v_h = \eta \sum_t (r^t - y^t)\, z_h^t$$

$$\Delta w_{hj} = \eta \sum_t (r^t - y^t)\, v_h\, z_h^t (1 - z_h^t)\, x_j^t$$
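Relative to the regression sketch earlier, only the output nonlinearity changes; the update formulas keep the same form (an illustrative fragment reusing the array names assumed there):

    # inside the training loop of backprop_regression:
    y = 1.0 / (1.0 + np.exp(-(Za @ v)))  # sigmoid output, estimates P(C1|x^t)
    err = r - y                          # (r^t - y^t) drives both deltas as before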
K>2 Classes
$$o_i = \sum_{h=1}^{H} v_{ih} z_h + v_{i0} \qquad y_i = \frac{\exp o_i}{\sum_k \exp o_k} \equiv \hat{P}(C_i\,|\,\mathbf{x})$$

$$E(\mathbf{W}, \mathbf{V}\,|\,\mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t$$

$$\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\, z_h^t$$

$$\Delta w_{hj} = \eta \sum_t \left[\sum_i (r_i^t - y_i^t)\, v_{ih}\right] z_h^t (1 - z_h^t)\, x_j^t$$
Multiple Hidden Layers
MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks
$$z_{1h} = \text{sigmoid}(\mathbf{w}_{1h}^T \mathbf{x}) = \text{sigmoid}\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right), \quad h = 1, \ldots, H_1$$

$$z_{2l} = \text{sigmoid}(\mathbf{w}_{2l}^T \mathbf{z}_1) = \text{sigmoid}\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right), \quad l = 1, \ldots, H_2$$

$$y = \mathbf{v}^T \mathbf{z}_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0$$
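A sketch of the two-hidden-layer forward pass (illustrative; weight matrices include the bias columns):

    import numpy as np

    def mlp2_forward(W1, W2, v, x):
        # W1 is H1 x (d+1), W2 is H2 x (H1+1), v has length H2+1
        sig = lambda a: 1.0 / (1.0 + np.exp(-a))
        z1 = sig(W1 @ np.concatenate(([1.0], x)))   # first hidden layer
        z2 = sig(W2 @ np.concatenate(([1.0], z1)))  # second hidden layer
        return v @ np.concatenate(([1.0], z2))      # linear output y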
Improving Convergence
Momentum:

$$\Delta w_i^t = -\eta \frac{\partial E^t}{\partial w_i} + \alpha\, \Delta w_i^{t-1}$$

Adaptive learning rate:

$$\Delta \eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\,\eta & \text{otherwise} \end{cases}$$
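Both heuristics are one-liners in code (a sketch; a, b, eta, and alpha are my assumed defaults):

    def momentum_step(grad, prev_delta, eta=0.1, alpha=0.9):
        # Delta w^t = -eta dE/dw + alpha Delta w^{t-1}
        return -eta * grad + alpha * prev_delta

    def adapt_eta(eta, E_new, E_old, a=0.01, b=0.5):
        # increase additively while the error falls, decrease multiplicatively otherwise
        return eta + a if E_new < E_old else eta - b * eta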
Overfitting/Overtraining
Number of weights: H(d+1) + (H+1)K
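For example, d = 10 inputs, H = 5 hidden units, and K = 3 outputs give 5·(10+1) + (5+1)·3 = 55 + 18 = 73 weights.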
Structured MLP
(Le Cun et al., 1989)
Weight Sharing
Hints
Invariance to translation, rotation, size
Virtual examples
Augmented error: $E' = E + \lambda_h E_h$
If $\mathbf{x}'$ and $\mathbf{x}$ are the "same": $E_h = \left[g(\mathbf{x}|\theta) - g(\mathbf{x}'|\theta)\right]^2$
Approximation hint:
(Abu-Mostafa, 1995)
$$E_h = \begin{cases} 0 & \text{if } g(\mathbf{x}|\theta) \in [a_x, b_x] \\ \left(g(\mathbf{x}|\theta) - a_x\right)^2 & \text{if } g(\mathbf{x}|\theta) < a_x \\ \left(g(\mathbf{x}|\theta) - b_x\right)^2 & \text{if } g(\mathbf{x}|\theta) > b_x \end{cases}$$
Tuning the Network Size
Destructive
Weight decay:
Constructive
Growing networks
(Ash, 1989) (Fahlman and Lebiere, 1989)
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i} - \lambda w_i$$

$$E' = E + \frac{\lambda}{2} \sum_i w_i^2$$
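In code, weight decay is a small extra term in the update (a sketch; eta and lam are assumed defaults):

    def weight_decay_step(w, grad, eta=0.1, lam=0.01):
        # Delta w_i = -eta dE/dw_i - lambda w_i : gradient step plus decay toward 0
        return w + (-eta * grad - lam * w)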
Bayesian Learning
Consider weights $w_i$ as random variables with prior $p(w_i)$
Weight decay, ridge regression, regularization: cost = data-misfit + λ · complexity
$$p(\mathbf{w}\,|\,\mathcal{X}) = \frac{p(\mathcal{X}\,|\,\mathbf{w})\, p(\mathbf{w})}{p(\mathcal{X})}$$

$$\hat{\mathbf{w}}_{MAP} = \arg\max_{\mathbf{w}} \log p(\mathbf{w}\,|\,\mathcal{X})$$

$$\log p(\mathbf{w}\,|\,\mathcal{X}) = \log p(\mathcal{X}\,|\,\mathbf{w}) + \log p(\mathbf{w}) + C$$

$$\text{where } p(\mathbf{w}) = \prod_i p(w_i) \quad \text{and} \quad p(w_i) = c \cdot \exp\left[-\frac{w_i^2}{2(1/2\lambda)}\right]$$

$$E' = E + \lambda \|\mathbf{w}\|^2$$
Dimensionality Reduction
Learning Time
Applications:
Sequence recognition: speech recognition
Sequence reproduction: Time-series prediction
Sequence association
Network architectures:
Time-delay networks (Waibel et al., 1989)
Recurrent networks (Rumelhart et al., 1986)
Time-Delay Neural Networks
Recurrent Networks
Unfolding in Time