INTRODUCTION TO Machine Learning
ETHEM ALPAYDIN © The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml

Lecture Slides for CHAPTER 11: Multilayer Perceptrons
Neural Networks
Networks of processing units (neurons) with connections (synapses) between them
Large number of neurons: ~10^10
Large connectivity: ~10^5 connections per neuron
Parallel processing
Distributed computation/memory
Robust to noise, failures
Understanding the Brain
Levels of analysis (Marr, 1982):
1. Computational theory
2. Representation and algorithm
3. Hardware implementation
Reverse engineering: From hardware to theory
Parallel processing: SIMD vs MIMD
Neural net: SIMD with modifiable local memory
Learning: Update by training/experience
Perceptron
(Rosenblatt, 1962)
$$y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}$$

$$\mathbf{w} = [w_0, w_1, \ldots, w_d]^T \qquad \mathbf{x} = [1, x_1, \ldots, x_d]^T$$
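As a minimal sketch in NumPy (illustrative names, not from the slides), the perceptron output is a dot product with an input augmented by a leading 1 for the bias:

    import numpy as np

    def perceptron_output(w, x):
        # y = w^T x, with x = [1, x_1, ..., x_d]^T so that w_0 acts as the bias
        x_aug = np.concatenate(([1.0], x))
        return w @ x_aug

    # example: d = 2 inputs, w = [w_0, w_1, w_2]
    w = np.array([-0.5, 1.0, 1.0])
    print(perceptron_output(w, np.array([0.3, 0.4])))  # -0.5 + 0.3 + 0.4 = 0.2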
What a Perceptron Does
Regression: $y = wx + w_0$
Classification: $y = \mathbf{1}(wx + w_0 > 0)$
[Figure: perceptron diagrams with weights w, bias w_0 and bias unit x_0 = +1: a linear output y for regression, and a sigmoid output s for classification]
$$y = \text{sigmoid}(o) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$$
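A one-line sketch of the sigmoid (illustrative):

    import numpy as np

    def sigmoid(a):
        # squashes w^T x into (0, 1); usable as an estimate of P(C1|x)
        return 1.0 / (1.0 + np.exp(-a))

Thresholding the output at 0.5 gives the hard classification decision.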
K Outputs
Regression:

$$y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = \mathbf{w}_i^T \mathbf{x} \qquad \mathbf{y} = \mathbf{W}\mathbf{x}$$

Classification:

$$o_i = \mathbf{w}_i^T \mathbf{x} \qquad y_i = \frac{\exp o_i}{\sum_k \exp o_k} \qquad \text{choose } C_i \text{ if } y_i = \max_k y_k$$
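A hedged sketch of the K-output classification case (illustrative code, not from the slides):

    import numpy as np

    def k_outputs(W, x):
        # W is K x (d+1); returns the softmax posteriors y and the chosen class
        x_aug = np.concatenate(([1.0], x))
        o = W @ x_aug                 # o_i = w_i^T x
        e = np.exp(o - o.max())       # subtract the max for numerical stability
        y = e / e.sum()               # y_i = exp(o_i) / sum_k exp(o_k)
        return y, int(np.argmax(y))   # choose C_i if y_i = max_k y_k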
Training
Online (instances seen one by one) vs batch (whole sample) learning:
No need to store the whole sample
Problem may change in time
Wear and degradation in system components
Stochastic gradient-descent: Update after a single pattern
Generic update rule (LMS rule):
$$\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$$

Update = LearningFactor · (DesiredOutput − ActualOutput) · Input
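One application of this rule is a single line of code (a sketch; eta and the argument shapes are my assumptions):

    def lms_update(w, r, y, x, eta=0.1):
        # Delta w = eta * (desired - actual) * input
        return w + eta * (r - y) * x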
Training a Perceptron: Regression
Regression (Linear output):
$$E^t(\mathbf{w}\,|\,\mathbf{x}^t, r^t) = \frac{1}{2}\,(r^t - y^t)^2 = \frac{1}{2}\left[r^t - (\mathbf{w}^T \mathbf{x}^t)\right]^2$$

$$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$$
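A minimal online training loop for this case (an illustrative sketch; initialization range, eta, and epochs are assumptions):

    import numpy as np

    def train_regression(X, r, eta=0.01, epochs=100):
        # Stochastic gradient descent for E^t = (1/2)(r^t - w^T x^t)^2.
        # X is N x d; a leading 1 is appended to each x^t for the bias w_0.
        N, d = X.shape
        w = np.random.uniform(-0.01, 0.01, d + 1)
        Xa = np.hstack([np.ones((N, 1)), X])
        for _ in range(epochs):
            for t in np.random.permutation(N):
                y = w @ Xa[t]                  # linear output y^t
                w += eta * (r[t] - y) * Xa[t]  # Delta w_j^t = eta (r^t - y^t) x_j^t
        return w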
Classification
Single sigmoid output:

$$y^t = \text{sigmoid}(\mathbf{w}^T \mathbf{x}^t)$$

$$E^t(\mathbf{w}\,|\,\mathbf{x}^t, r^t) = -r^t \log y^t - (1 - r^t) \log(1 - y^t)$$

$$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$$

K>2 softmax outputs:

$$y_i^t = \frac{\exp \mathbf{w}_i^T \mathbf{x}^t}{\sum_k \exp \mathbf{w}_k^T \mathbf{x}^t}$$

$$E^t(\{\mathbf{w}_i\}_i\,|\,\mathbf{x}^t, \mathbf{r}^t) = -\sum_i r_i^t \log y_i^t$$

$$\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$$
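Note that the update has exactly the same form as in regression; only the output changes. A sketch for the single sigmoid output (illustrative; r holds 0/1 labels):

    import numpy as np

    def train_logistic(X, r, eta=0.1, epochs=100):
        # Online updates for the cross-entropy error with a sigmoid output;
        # only y^t = sigmoid(w^T x^t) differs from the regression loop.
        N, d = X.shape
        w = np.random.uniform(-0.01, 0.01, d + 1)
        Xa = np.hstack([np.ones((N, 1)), X])
        for _ in range(epochs):
            for t in np.random.permutation(N):
                y = 1.0 / (1.0 + np.exp(-(w @ Xa[t])))
                w += eta * (r[t] - y) * Xa[t]
        return w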
Learning Boolean AND
XOR
No w0, w1, w2 satisfy:
(Minsky and Papert, 1969)
$$\begin{aligned} w_0 &\le 0 && (x = (0,0),\ y = 0) \\ w_2 + w_0 &> 0 && (x = (0,1),\ y = 1) \\ w_1 + w_0 &> 0 && (x = (1,0),\ y = 1) \\ w_1 + w_2 + w_0 &\le 0 && (x = (1,1),\ y = 0) \end{aligned}$$

Summing the two strict inequalities gives $w_1 + w_2 + 2w_0 > 0$; since $w_0 \le 0$, this forces $w_1 + w_2 + w_0 > 0$, contradicting the last constraint. So no single perceptron can compute XOR.
Multilayer Perceptrons
(Rumelhart et al., 1986)
$$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$

$$z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$
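A sketch of this forward pass (illustrative; W and V include the bias columns):

    import numpy as np

    def mlp_forward(W, V, x):
        # One hidden layer: W is H x (d+1), V is K x (H+1).
        # z_h = sigmoid(w_h^T x); y_i = v_i^T z (linear outputs, as in regression).
        x_aug = np.concatenate(([1.0], x))
        z = 1.0 / (1.0 + np.exp(-(W @ x_aug)))
        z_aug = np.concatenate(([1.0], z))
        return V @ z_aug, z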
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
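The hand-picked weights below (my illustrative choice, not from the slides) realize each AND term with one hidden unit and the OR with the output unit; scaling the weights makes the sigmoids nearly 0/1:

    import numpy as np

    sig = lambda a: 1.0 / (1.0 + np.exp(-a))

    W = np.array([[-5.0, 10.0, -10.0],    # z1 ~ x1 AND NOT x2
                  [-5.0, -10.0, 10.0]])   # z2 ~ NOT x1 AND x2
    v = np.array([-5.0, 10.0, 10.0])      # y ~ z1 OR z2

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        z = sig(W @ np.array([1.0, x1, x2]))
        y = sig(v @ np.concatenate(([1.0], z)))
        print(x1, x2, round(float(y)))    # prints the XOR truth table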
Backpropagation
$$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$

$$z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$

$$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hj}}$$
Regression:

$$E(\mathbf{W}, \mathbf{v}\,|\,\mathcal{X}) = \frac{1}{2} \sum_t (r^t - y^t)^2$$

Forward:

$$y^t = \sum_{h=1}^{H} v_h z_h^t + v_0 \qquad z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x})$$

Backward:

$$\Delta v_h = \eta \sum_t (r^t - y^t)\, z_h^t$$

$$\Delta w_{hj} = -\eta \frac{\partial E}{\partial w_{hj}} = -\eta \sum_t \frac{\partial E}{\partial y^t}\,\frac{\partial y^t}{\partial z_h^t}\,\frac{\partial z_h^t}{\partial w_{hj}} = \eta \sum_t (r^t - y^t)\, v_h\, z_h^t (1 - z_h^t)\, x_j^t$$
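A batch sketch of these forward/backward passes for a single output (illustrative; H, eta, epochs, and initialization are my assumptions):

    import numpy as np

    def backprop_regression(X, r, H=3, eta=0.1, epochs=1000):
        # Batch gradient descent following the equations above.
        # X is N x d; r has length N.
        N, d = X.shape
        Xa = np.hstack([np.ones((N, 1)), X])
        W = np.random.uniform(-0.01, 0.01, (H, d + 1))
        v = np.random.uniform(-0.01, 0.01, H + 1)
        for _ in range(epochs):
            Z = 1.0 / (1.0 + np.exp(-(Xa @ W.T)))   # forward: z_h^t, N x H
            Za = np.hstack([np.ones((N, 1)), Z])
            y = Za @ v                              # forward: y^t
            err = r - y                             # r^t - y^t
            dv = eta * (Za.T @ err)                 # Delta v_h
            dW = eta * ((np.outer(err, v[1:]) * Z * (1 - Z)).T @ Xa)  # Delta w_hj
            v += dv
            W += dW
        return W, v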
Regression with Multiple Outputs
[Figure: MLP with inputs x_j, first-layer weights w_hj, hidden units z_h, second-layer weights v_ih, and outputs y_i]
$$E(\mathbf{W}, \mathbf{V}\,|\,\mathcal{X}) = \frac{1}{2} \sum_t \sum_i (r_i^t - y_i^t)^2$$

$$y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$$

$$\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\, z_h^t$$

$$\Delta w_{hj} = \eta \sum_t \left[\sum_i (r_i^t - y_i^t)\, v_{ih}\right] z_h^t (1 - z_h^t)\, x_j^t$$
[Figure: hidden-unit inputs w_h x + w_0, activations z_h, and weighted contributions v_h z_h plotted against the input]
Two-Class Discrimination
One sigmoid output: $y^t$ estimates $P(C_1|\mathbf{x}^t)$, and $P(C_2|\mathbf{x}^t) \equiv 1 - y^t$
$$y^t = \text{sigmoid}\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$$

$$E(\mathbf{W}, \mathbf{v}\,|\,\mathcal{X}) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log(1 - y^t) \right]$$

$$\Delta v_h = \eta \sum_t (r^t - y^t)\, z_h^t$$

$$\Delta w_{hj} = \eta \sum_t (r^t - y^t)\, v_h\, z_h^t (1 - z_h^t)\, x_j^t$$
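Relative to the regression sketch earlier, only the output nonlinearity changes; the update formulas keep the same form (an illustrative fragment reusing the array names assumed there):

    # inside the training loop of backprop_regression:
    y = 1.0 / (1.0 + np.exp(-(Za @ v)))  # sigmoid output, estimates P(C1|x^t)
    err = r - y                          # (r^t - y^t) drives both deltas as before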
K>2 Classes
$$o_i = \sum_{h=1}^{H} v_{ih} z_h + v_{i0} \qquad y_i = \frac{\exp o_i}{\sum_k \exp o_k} \equiv \hat{P}(C_i\,|\,\mathbf{x})$$

$$E(\mathbf{W}, \mathbf{V}\,|\,\mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t$$

$$\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\, z_h^t$$

$$\Delta w_{hj} = \eta \sum_t \left[\sum_i (r_i^t - y_i^t)\, v_{ih}\right] z_h^t (1 - z_h^t)\, x_j^t$$
Multiple Hidden Layers
MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks
$$z_{1h} = \text{sigmoid}(\mathbf{w}_{1h}^T \mathbf{x}) = \text{sigmoid}\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right), \quad h = 1, \ldots, H_1$$

$$z_{2l} = \text{sigmoid}(\mathbf{w}_{2l}^T \mathbf{z}_1) = \text{sigmoid}\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right), \quad l = 1, \ldots, H_2$$

$$y = \mathbf{v}^T \mathbf{z}_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0$$
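A sketch of the two-hidden-layer forward pass (illustrative; weight matrices include the bias columns):

    import numpy as np

    def mlp2_forward(W1, W2, v, x):
        # W1 is H1 x (d+1), W2 is H2 x (H1+1), v has length H2+1
        sig = lambda a: 1.0 / (1.0 + np.exp(-a))
        z1 = sig(W1 @ np.concatenate(([1.0], x)))   # first hidden layer
        z2 = sig(W2 @ np.concatenate(([1.0], z1)))  # second hidden layer
        return v @ np.concatenate(([1.0], z2))      # linear output y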
Improving Convergence
Momentum:

$$\Delta w_i^t = -\eta \frac{\partial E^t}{\partial w_i} + \alpha\, \Delta w_i^{t-1}$$

Adaptive learning rate:

$$\Delta \eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\,\eta & \text{otherwise} \end{cases}$$
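Both heuristics are one-liners in code (a sketch; a, b, eta, and alpha are my assumed defaults):

    def momentum_step(grad, prev_delta, eta=0.1, alpha=0.9):
        # Delta w^t = -eta dE/dw + alpha Delta w^{t-1}
        return -eta * grad + alpha * prev_delta

    def adapt_eta(eta, E_new, E_old, a=0.01, b=0.5):
        # increase additively while the error falls, decrease multiplicatively otherwise
        return eta + a if E_new < E_old else eta - b * eta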
Overfitting/Overtraining
Number of weights: H(d+1) + (H+1)K
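For example, d = 10 inputs, H = 5 hidden units, and K = 3 outputs give 5·(10+1) + (5+1)·3 = 55 + 18 = 73 weights.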
Structured MLP
(Le Cun et al., 1989)
Weight Sharing
Hints
Invariance to translation, rotation, size
Virtual examples
Augmented error: $E' = E + \lambda_h E_h$
If $\mathbf{x}'$ and $\mathbf{x}$ are the "same": $E_h = \left[g(\mathbf{x}|\theta) - g(\mathbf{x}'|\theta)\right]^2$
Approximation hint:
(Abu-Mostafa, 1995)
$$E_h = \begin{cases} 0 & \text{if } g(\mathbf{x}|\theta) \in [a_x, b_x] \\ \left(g(\mathbf{x}|\theta) - a_x\right)^2 & \text{if } g(\mathbf{x}|\theta) < a_x \\ \left(g(\mathbf{x}|\theta) - b_x\right)^2 & \text{if } g(\mathbf{x}|\theta) > b_x \end{cases}$$
Tuning the Network Size
Destructive
Weight decay:
Constructive
Growing networks
(Ash, 1989) (Fahlman and Lebiere, 1989)
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i} - \lambda w_i$$

$$E' = E + \frac{\lambda}{2} \sum_i w_i^2$$
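In code, weight decay is a small extra term in the update (a sketch; eta and lam are assumed defaults):

    def weight_decay_step(w, grad, eta=0.1, lam=0.01):
        # Delta w_i = -eta dE/dw_i - lambda w_i : gradient step plus decay toward 0
        return w + (-eta * grad - lam * w)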
Bayesian Learning
Consider weights $w_i$ as random variables with prior $p(w_i)$
Weight decay, ridge regression, regularization: cost = data-misfit + λ · complexity
$$p(\mathbf{w}\,|\,\mathcal{X}) = \frac{p(\mathcal{X}\,|\,\mathbf{w})\, p(\mathbf{w})}{p(\mathcal{X})}$$

$$\hat{\mathbf{w}}_{MAP} = \arg\max_{\mathbf{w}} \log p(\mathbf{w}\,|\,\mathcal{X})$$

$$\log p(\mathbf{w}\,|\,\mathcal{X}) = \log p(\mathcal{X}\,|\,\mathbf{w}) + \log p(\mathbf{w}) + C$$

$$\text{where } p(\mathbf{w}) = \prod_i p(w_i) \quad \text{and} \quad p(w_i) = c \cdot \exp\left[-\frac{w_i^2}{2(1/2\lambda)}\right]$$

$$E' = E + \lambda \|\mathbf{w}\|^2$$
Dimensionality Reduction
Learning Time
Applications:
Sequence recognition: speech recognition
Sequence reproduction: Time-series prediction
Sequence association
Network architectures:
Time-delay networks (Waibel et al., 1989)
Recurrent networks (Rumelhart et al., 1986)
Time-Delay Neural Networks
Recurrent Networks
Unfolding in Time