Widrow-Hoff Learning (LMS Algorithm)

Transcript of lecture slides (26 pages), annotated by Anthony S Maida.

Page 1: Widrow-Hoff Learning (LMS Algorithm)

Hagan, Demuth, Beale, De Jesus
Approximate steepest descent
Least mean square
Page 2: ADALINE Network

[Figure: single-layer linear neuron network. Input p (R x 1), weights W (S x R), bias b (S x 1), net input n, output a (S x 1); a = purelin(Wp + b).]

a = purelin(Wp + b) = Wp + b

a_i = purelin(n_i) = purelin(w_i^T p + b_i) = w_i^T p + b_i

w_i = [w_{i,1}, w_{i,2}, ..., w_{i,R}]^T

Notes (Anthony S Maida):
- ADALINE = adaptive linear neuron.
- Single-layer network with trainable linear output units.
- w_i is the incoming weight vector to output unit i; a_i is the output of unit i.
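To make the forward pass concrete, here is a minimal NumPy sketch (my own, not from the slides) of the ADALINE computation a = purelin(Wp + b), where purelin is simply the identity; the sizes and values are made up for illustration:

```python
import numpy as np

def purelin(n):
    """Linear transfer function: the output equals the net input."""
    return n

def adaline_forward(W, b, p):
    """a = purelin(W p + b) for an S x R weight matrix, S x 1 bias, R x 1 input."""
    return purelin(W @ p + b)

# Example with S = 2 output units and R = 3 inputs (values chosen arbitrarily).
W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.4, -0.6]])
b = np.array([0.1, -0.1])
p = np.array([1.0, -1.0, 0.5])
print(adaline_forward(W, b, p))   # 2-element output vector a
```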
Page 3: Two-Input ADALINE

[Figure: two-input linear neuron with weights w_{1,1}, w_{1,2} and bias b, a = purelin(Wp + b); decision boundary in the (p_1, p_2) plane where w_1^T p + b = 0, crossing the axes at p_1 = -b/w_{1,1} and p_2 = -b/w_{1,2}, with a > 0 on one side and a < 0 on the other.]

a = purelin(n) = purelin(w_1^T p + b) = w_1^T p + b

a = w_1^T p + b = w_{1,1} p_1 + w_{1,2} p_2 + b

Page 4: Mean Square Error

Training Set: {p_1, t_1}, {p_2, t_2}, ..., {p_Q, t_Q}
Input: p_q    Target: t_q

Notation: x = [w_1; b],  z = [p; 1],  so that  a = w_1^T p + b = x^T z

Mean Square Error: F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2]

Notes (Anthony S Maida):
- Labeled data; Q data items.
- x holds the trainable parameters (weights and trainable bias); z is the input with a 1 appended. Packing the bias into the weight vector simplifies notation.
- F(x) is the expected value of the squared error (target minus actual output, squared).
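As a concrete illustration of this notation (my own sketch, not from the slides), the following code packs the bias into x, appends a 1 to each input to form z, and estimates F(x) = E[(t - x^T z)^2] as an average over a training set:

```python
import numpy as np

def augment(p):
    """Form z = [p; 1] so that a = w1^T p + b = x^T z."""
    return np.append(p, 1.0)

def mse(x, P, T):
    """Estimate F(x) = E[(t - x^T z)^2] by averaging over the Q training pairs."""
    errors = [t - x @ augment(p) for p, t in zip(P, T)]
    return np.mean(np.square(errors))

# Hypothetical training set with Q = 2 items and R = 3 features.
P = [np.array([-1.0, 1.0, -1.0]), np.array([1.0, 1.0, -1.0])]
T = [-1.0, 1.0]
x = np.zeros(4)          # three weights plus the trainable bias
print(mse(x, P, T))      # F(x) at the all-zero parameter vector
```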
Page 5: Error Analysis

F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2]

F(x) = E[t^2 - 2 t x^T z + x^T z z^T x]

F(x) = E[t^2] - 2 x^T E[t z] + x^T E[z z^T] x

F(x) = c - 2 x^T h + x^T R x,  where  c = E[t^2],  h = E[t z],  R = E[z z^T]

In standard quadratic form:

F(x) = c + d^T x + (1/2) x^T A x,  with  d = -2h,  A = 2R

The mean square error for the ADALINE network is a quadratic function.

Notes (Anthony S Maida):
- Substitute the linear output, then expand the quadratic; the expected value breaks the sum into components. The goal is to minimize F(x).
- h is the cross-correlation vector between input and target; R is the input correlation matrix. Both can be difficult to calculate.
- z: inputs; x: trainable parameters.
- Converting to standard quadratic form allows further analysis; the characteristics of the quadratic depend on A.
Page 6: Stationary Point

∇F(x) = ∇(c + d^T x + (1/2) x^T A x) = d + Ax = -2h + 2Rx

Setting the gradient to zero:  -2h + 2Rx = 0,  so  x* = R^{-1} h

Hessian Matrix:  A = 2R

The correlation matrix R must be at least positive semidefinite. If there are any zero eigenvalues, the performance index will either have a weak minimum or else no stationary point; otherwise there will be a unique global minimum x*.

If R is positive definite (all eigenvalues > 0), x* = R^{-1} h is the unique global minimum.

Notes (Anthony S Maida):
- Take the gradient of the quadratic, set it to zero (Rx = h), and solve for the global minimum.
- R has only nonnegative eigenvalues; positive definiteness guarantees a unique global minimum.
- R may be expensive to compute if the data set is large.
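A small sketch (mine, with made-up data) of the closed-form result on this slide: estimate R = E[z z^T] and h = E[t z] from samples, check that R is positive definite, and solve R x* = h:

```python
import numpy as np

def stationary_point(Z, t):
    """Estimate R = E[z z^T] and h = E[t z] from samples, then solve R x* = h."""
    R = np.mean([np.outer(z, z) for z in Z], axis=0)
    h = np.mean([ti * z for z, ti in zip(Z, t)], axis=0)
    if np.min(np.linalg.eigvalsh(R)) <= 0:
        # Zero eigenvalues: weak minimum or no stationary point.
        raise ValueError("R is not positive definite; no unique minimum")
    return np.linalg.solve(R, h)   # x* = R^{-1} h

# Hypothetical data: each z already includes the appended 1 for the bias.
Z = [np.array([-1.0, 1.0, 1.0]), np.array([1.0, 1.0, 1.0]), np.array([0.5, -1.0, 1.0])]
t = [-1.0, 1.0, 0.2]
print(stationary_point(Z, t))
```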
Page 7: Approximate Steepest Descent

Approximate mean square error (one sample):

F̂(x) = (t(k) - a(k))^2 = e^2(k)

Approximate (stochastic) gradient:

∇̂F(x) = ∇e^2(k)

[∇e^2(k)]_j = ∂e^2(k)/∂w_{1,j} = 2 e(k) ∂e(k)/∂w_{1,j},   j = 1, 2, ..., R

[∇e^2(k)]_{R+1} = ∂e^2(k)/∂b = 2 e(k) ∂e(k)/∂b

Notes (Anthony S Maida):
- Design an algorithm to locate the minimum point while avoiding the calculation of h and R.
- The expectation of the squared error is replaced by the squared error at iteration k; k indexes a trial. This is also called online or incremental learning because of the trial-by-trial update.
- R is the number of input features; j varies over 1..R, and component R+1 is the bias, so the gradient vector has R+1 components in total.
- By the chain rule, the square becomes the factor 2e(k) multiplying the partial derivative of e(k).
Page 8: Approximate Gradient Calculation

∂e(k)/∂w_{1,j} = ∂[t(k) - a(k)]/∂w_{1,j} = ∂/∂w_{1,j} [t(k) - (w_1^T p(k) + b)]

∂e(k)/∂w_{1,j} = ∂/∂w_{1,j} [t(k) - (Σ_{i=1}^{R} w_{1,i} p_i(k) + b)]

∂e(k)/∂w_{1,j} = -p_j(k),   ∂e(k)/∂b = -1

∇̂F(x) = ∇e^2(k) = -2 e(k) z(k)

Notes (Anthony S Maida):
- Calculate the partial derivatives of e = t - a; each step is a substitution.
- Pack the result into vectorized form.
Page 9: LMS Algorithm

x_{k+1} = x_k - α ∇F(x)|_{x = x_k}

x_{k+1} = x_k + 2 α e(k) z(k)

w_1(k+1) = w_1(k) + 2 α e(k) p(k)

b(k+1) = b(k) + 2 α e(k)

Notes (Anthony S Maida):
- Stochastic (trial-by-trial) gradient descent, using the approximate gradient from the previous slide.
- w_1 is the first-layer weight vector; the last two lines are the weight and bias updates.
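A minimal sketch (not from the slides) of these update rules as a training loop, cycling through the data one example per trial:

```python
import numpy as np

def lms_train(P, T, alpha, epochs=10):
    """Single linear neuron trained with
    w1(k+1) = w1(k) + 2*alpha*e(k)*p(k),  b(k+1) = b(k) + 2*alpha*e(k)."""
    w1, b = np.zeros(len(P[0])), 0.0      # initialize weights and bias to zero
    for _ in range(epochs):
        for p, t in zip(P, T):            # one trial (one example) per update
            e = t - (w1 @ p + b)          # e(k) = t(k) - a(k)
            w1 = w1 + 2 * alpha * e * p
            b = b + 2 * alpha * e
    return w1, b

# Hypothetical two-pattern problem with a learning rate chosen for stability.
P = [np.array([-1.0, 1.0, -1.0]), np.array([1.0, 1.0, -1.0])]
T = [-1.0, 1.0]
print(lms_train(P, T, alpha=0.1))
```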
Page 10: Multiple-Neuron Case

w_i(k+1) = w_i(k) + 2 α e_i(k) p(k)

b_i(k+1) = b_i(k) + 2 α e_i(k)

Matrix Form:

W(k+1) = W(k) + 2 α e(k) p^T(k)

b(k+1) = b(k) + 2 α e(k)

Notes (Anthony S Maida):
- Since each output unit is independent, simply convert to matrix form.
- Initialize the weights to 0.
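The matrix-form update can be sketched the same way (my own code; the sizes are illustrative):

```python
import numpy as np

def lms_step(W, b, p, t, alpha):
    """One matrix-form LMS step:
    W(k+1) = W(k) + 2*alpha*e(k)*p(k)^T,  b(k+1) = b(k) + 2*alpha*e(k)."""
    e = t - (W @ p + b)                     # error vector, one component per output unit
    W = W + 2 * alpha * np.outer(e, p)      # e(k) p^T(k) is an S x R outer product
    b = b + 2 * alpha * e
    return W, b

# Hypothetical sizes: S = 2 output units, R = 3 inputs, weights initialized to zero.
W, b = np.zeros((2, 3)), np.zeros(2)
p, t = np.array([1.0, -1.0, 0.5]), np.array([1.0, -1.0])
W, b = lms_step(W, b, p, t, alpha=0.1)
print(W, b)
```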
Page 11: Analysis of Convergence

x_{k+1} = x_k + 2 α e(k) z(k)

E[x_{k+1}] = E[x_k] + 2 α E[e(k) z(k)]

E[x_{k+1}] = E[x_k] + 2 α { E[t(k) z(k)] - E[(x_k^T z(k)) z(k)] }

E[x_{k+1}] = E[x_k] + 2 α { E[t_k z(k)] - E[(z(k) z^T(k)) x_k] }

E[x_{k+1}] = E[x_k] + 2 α { h - R E[x_k] }

E[x_{k+1}] = [I - 2αR] E[x_k] + 2αh

For stability, the eigenvalues of this matrix must fall inside the unit circle.

Notes (Anthony S Maida):
- Stability.
Page 12: Conditions for Stability

eig([I - 2αR]) = 1 - 2αλ_i     (where λ_i is an eigenvalue of R)

Stability requires |1 - 2αλ_i| < 1 for all i. Since λ_i > 0, the condition 1 - 2αλ_i < 1 already holds.

Therefore the stability condition simplifies to

1 - 2αλ_i > -1

α < 1/λ_i  for all i

0 < α < 1/λ_max

Notes (Anthony S Maida):
- Stability condition.
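A short numerical check (mine): compute 1/λ_max for a given correlation matrix and confirm that the eigenvalues of [I - 2αR] lie inside the unit circle exactly when α is below that bound. The matrix used here is the one that appears later in these notes (slide 22):

```python
import numpy as np

def max_stable_alpha(R):
    """Upper bound 1/lambda_max on the LMS learning rate for correlation matrix R."""
    return 1.0 / np.max(np.linalg.eigvalsh(R))

def is_stable(R, alpha):
    """True if every eigenvalue of [I - 2*alpha*R] has magnitude < 1."""
    M = np.eye(R.shape[0]) - 2 * alpha * R
    return np.all(np.abs(np.linalg.eigvals(M)) < 1)

R = np.array([[0.72, -0.36], [-0.36, 0.72]])    # correlation matrix used on slide 22
print(max_stable_alpha(R))                       # 1 / 1.08, about 0.926
print(is_stable(R, 0.1), is_stable(R, 1.0))      # True, False
```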
Page 13: Steady State Response

E[x_{k+1}] = [I - 2αR] E[x_k] + 2αh

If the system is stable, then a steady state condition will be reached:

E[x_ss] = [I - 2αR] E[x_ss] + 2αh

The solution to this equation is

E[x_ss] = R^{-1} h = x*

This is also the strong minimum of the performance index.

Page 14: Example

Banana:  p_1 = [-1; 1; -1],  t_1 = -1          Apple:  p_2 = [1; 1; -1],  t_2 = 1

R = E[p p^T] = (1/2) p_1 p_1^T + (1/2) p_2 p_2^T

R = (1/2) [-1; 1; -1][-1 1 -1] + (1/2) [1; 1; -1][1 1 -1] = [1 0 0; 0 1 -1; 0 -1 1]

λ_1 = 1.0,  λ_2 = 0.0,  λ_3 = 2.0

α < 1/λ_max = 1/2.0 = 0.5

Notes (Anthony S Maida):
- Set up the apple/banana example and choose alpha to guarantee stability, using the condition from slide 12.
- Two trials per batch; compute the expectation (each pattern with probability 1/2) to obtain the input correlation matrix.
- R is positive semidefinite; find its eigenvalues. The largest eigenvalue is 2.0.
- α must be < 0.5, but the following example chose 0.2.
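This slide's numbers can be reproduced directly (a sketch using NumPy):

```python
import numpy as np

p1 = np.array([-1.0, 1.0, -1.0])   # banana, target t1 = -1
p2 = np.array([ 1.0, 1.0, -1.0])   # apple,  target t2 = +1

# Each pattern appears with probability 1/2, so R = (1/2) p1 p1^T + (1/2) p2 p2^T.
R = 0.5 * np.outer(p1, p1) + 0.5 * np.outer(p2, p2)
print(R)                            # [[1 0 0], [0 1 -1], [0 -1 1]]

lam = np.linalg.eigvalsh(R)
print(lam)                          # eigenvalues 0.0, 1.0, 2.0 (positive semidefinite)
print(1.0 / lam.max())              # alpha must be below 1/2.0 = 0.5
```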
Page 15: Iteration One

Banana:

a(0) = W(0) p(0) = W(0) p_1 = [0 0 0][-1; 1; -1] = 0

e(0) = t(0) - a(0) = t_1 - a(0) = -1 - 0 = -1

W(1) = W(0) + 2 α e(0) p^T(0)

W(1) = [0 0 0] + 2 (0.2)(-1) [-1 1 -1] = [0.4 -0.4 0.4]

Notes (Anthony S Maida):
- Start training with α = 0.2 and the weights initialized to 0; W(1) holds the new weights.
Page 16: Iteration Two

Apple:

a(1) = W(1) p(1) = W(1) p_2 = [0.4 -0.4 0.4][1; 1; -1] = -0.4

e(1) = t(1) - a(1) = t_2 - a(1) = 1 - (-0.4) = 1.4

W(2) = [0.4 -0.4 0.4] + 2 (0.2)(1.4) [1 1 -1] = [0.96 0.16 -0.16]
Page 17: Iteration Three

Banana again:

a(2) = W(2) p(2) = W(2) p_1 = [0.96 0.16 -0.16][-1; 1; -1] = -0.64

e(2) = t(2) - a(2) = t_1 - a(2) = -1 - (-0.64) = -0.36

W(3) = W(2) + 2 α e(2) p^T(2) = [1.1040 0.0160 -0.0160]

W(∞) = [1 0 0]

Notes (Anthony S Maida):
- Gradually approaching the convergence value; W(∞) is the weight vector after convergence.
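The three hand iterations above can be reproduced with a short loop (a sketch, using α = 0.2 and the banana/apple/banana presentation order from the slides):

```python
import numpy as np

p1, t1 = np.array([-1.0, 1.0, -1.0]), -1.0   # banana
p2, t2 = np.array([ 1.0, 1.0, -1.0]),  1.0   # apple
alpha = 0.2

W = np.zeros(3)                               # W(0) = [0 0 0]
for k, (p, t) in enumerate([(p1, t1), (p2, t2), (p1, t1)]):
    a = W @ p                                 # a(k) = W(k) p(k)   (no bias in this example)
    e = t - a                                 # e(k) = t(k) - a(k)
    W = W + 2 * alpha * e * p                 # W(k+1) = W(k) + 2 alpha e(k) p^T(k)
    print(k + 1, W)
# Prints W(1) = [0.4 -0.4 0.4], W(2) = [0.96 0.16 -0.16], W(3) = [1.104 0.016 -0.016]
```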
Page 18: Adaptive Filtering

Tapped Delay Line / Adaptive Filter

[Figure: a tapped delay line feeds an ADALINE. The input sequence y(k) passes through delay blocks D so that p_1(k) = y(k), p_2(k) = y(k - 1), ..., p_R(k) = y(k - R + 1); the ADALINE output is a(k) = purelin(Wp(k) + b).]

a(k) = purelin(Wp + b) = Σ_{i=1}^{R} w_{1,i} y(k - i + 1) + b

Notes (Anthony S Maida):
- The tapped delay line is a new building block; it takes an input sequence.
- The combination is a trainable finite impulse response (FIR) filter.
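A sketch (mine) of the tapped-delay-line ADALINE as an FIR filter: the input at time k is the last R samples of the sequence y, and the output is their weighted sum plus the bias. Zero-padding the samples before the start of y is my assumption, not something the slide specifies:

```python
import numpy as np

def fir_adaline(w1, b, y):
    """a(k) = sum_i w1[i] * y(k - i) + b, an R-tap FIR filter with a bias.
    Samples before the start of y are treated as zero."""
    R = len(w1)
    a = np.zeros(len(y))
    for k in range(len(y)):
        for i in range(R):
            if k - i >= 0:
                a[k] += w1[i] * y[k - i]
        a[k] += b
    return a

# Hypothetical 3-tap filter applied to a short input sequence.
y = np.array([1.0, 0.5, -0.5, -1.0, 0.0])
print(fir_adaline(w1=np.array([0.5, 0.3, 0.2]), b=0.0, y=y))
```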
Page 19: Example: Noise Cancellation

[Figure: noise-cancellation block diagram. A 60-Hz noise source v passes through the noise path filter to become the contaminating noise m, which adds to the (random) EEG signal s from the graduate student to form the contaminated signal t = s + m. The same noise source v is the input to the adaptive filter, whose output a is subtracted from t; the "error" e = t - a is the restored signal. The adaptive filter adjusts to minimize the error, and in doing so removes the 60-Hz noise from the contaminated signal.]

Notes (Anthony S Maida):
- Use an ADALINE as the adaptive filter and train it; the noise v is the filter's input.
- The EEG (electroencephalogram) signal is contaminated by 60-Hz noise; we need to restore the original signal s.
- Training drives a toward m (the part of t correlated with v), so the error e = t - a approximates s.
Page 20: Noise Cancellation Adaptive Filter

[Figure: condensed ADALINE for this example — input v(k), one delay D, two weights w_{1,1} and w_{1,2}, no bias.]

a(k) = w_{1,1} v(k) + w_{1,2} v(k - 1)

Notes (Anthony S Maida):
- Condensed diagram for this example: the filter remembers one step into the past and has no bias.
- v is the noise signal, a 60-Hz sine wave sampled at 180 Hz.
- Train so that a approximates m.
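A sketch simulating this two-weight noise canceller with the LMS rule (my own code; v and m follow the signal definitions on slide 22, and the stand-in EEG signal s is uniform on [-0.2, 0.2], the distribution implied by the integral on slide 24):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
n_steps = 400

k = np.arange(n_steps)
v = 1.2 * np.sin(2 * np.pi * k / 3)                    # 60-Hz noise sampled at 180 Hz
m = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)    # noise after the noise path filter
s = rng.uniform(-0.2, 0.2, n_steps)                    # stand-in "EEG" signal
t = s + m                                              # contaminated signal (the target)

w = np.zeros(2)                                        # [w11, w12], no bias
restored = np.zeros(n_steps)
for i in range(1, n_steps):
    z = np.array([v[i], v[i - 1]])                     # current and delayed noise samples
    a = w @ z                                          # a(k) = w11 v(k) + w12 v(k-1)
    e = t[i] - a                                       # "error" = restored signal
    w = w + 2 * alpha * e * z                          # LMS update
    restored[i] = e

print(w)                                               # approaches x* ~ [-0.30, 0.82] (slide 23)
print(np.mean((restored[-100:] - s[-100:])**2))        # small residual after adaptation
```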
Page 21: Correlation Matrix

From slide 5:  R = E[z z^T],  h = E[t z]

z(k) = [v(k); v(k - 1)]

t(k) = s(k) + m(k)

R = [E[v^2(k)]  E[v(k) v(k-1)];  E[v(k-1) v(k)]  E[v^2(k-1)]]

h = [E[(s(k) + m(k)) v(k)];  E[(s(k) + m(k)) v(k-1)]]

Notes (Anthony S Maida):
- Find the input correlation matrix R and the input/target cross-correlation vector h using expected values.
- The input vector is z(k); the target is the current signal plus the filtered noise from slide 19.
- s: EEG signal; m: filtered noise.
Page 22: Signals

v(k) = 1.2 sin(2πk/3)

m(k) = 1.2 sin(2πk/3 - 3π/4)

E[v^2(k)] = (1.2)^2 (1/3) Σ_{k=1}^{3} (sin(2πk/3))^2 = (1.2)^2 (0.5) = 0.72

E[v^2(k-1)] = E[v^2(k)] = 0.72

E[v(k) v(k-1)] = (1/3) Σ_{k=1}^{3} (1.2 sin(2πk/3))(1.2 sin(2π(k-1)/3)) = (1.2)^2 (0.5) cos(2π/3) = -0.36

R = [0.72  -0.36;  -0.36  0.72]

Notes (Anthony S Maida):
- Artificial data.
Page 23: Stationary Point

h = [E[(s(k) + m(k)) v(k)];  E[(s(k) + m(k)) v(k-1)]]

E[(s(k) + m(k)) v(k)] = E[s(k) v(k)] + E[m(k) v(k)],  where E[s(k) v(k)] = 0 (s is uncorrelated with the noise v)

E[(s(k) + m(k)) v(k-1)] = E[s(k) v(k-1)] + E[m(k) v(k-1)],  where E[s(k) v(k-1)] = 0

E[m(k) v(k)] = (1/3) Σ_{k=1}^{3} (1.2 sin(2πk/3 - 3π/4))(1.2 sin(2πk/3)) = -0.51

E[m(k) v(k-1)] = (1/3) Σ_{k=1}^{3} (1.2 sin(2πk/3 - 3π/4))(1.2 sin(2π(k-1)/3)) = 0.70

h = [-0.51; 0.70]

x* = R^{-1} h = [0.72 -0.36; -0.36 0.72]^{-1} [-0.51; 0.70] = [-0.30; 0.82]
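These expectations can be checked numerically by averaging over one period (k = 1, 2, 3) of the sampled sinusoids — a small sketch:

```python
import numpy as np

k = np.arange(1, 4)                                   # one period: k = 1, 2, 3
v  = 1.2 * np.sin(2 * np.pi * k / 3)                  # v(k)
v1 = 1.2 * np.sin(2 * np.pi * (k - 1) / 3)            # v(k-1)
m  = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)  # m(k)

R = np.array([[np.mean(v * v),  np.mean(v * v1)],
              [np.mean(v1 * v), np.mean(v1 * v1)]])   # [[0.72, -0.36], [-0.36, 0.72]]
h = np.array([np.mean(m * v), np.mean(m * v1)])       # E[s v] = 0, so only m contributes
print(R, h)                                           # h is approximately [-0.51, 0.70]
print(np.linalg.solve(R, h))                          # x* is approximately [-0.30, 0.82]
```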

Page 24: Performance Index

F(x) = c - 2 x^T h + x^T R x

c = E[t^2(k)] = E[(s(k) + m(k))^2]

c = E[s^2(k)] + 2 E[s(k) m(k)] + E[m^2(k)]        (the cross term E[s(k) m(k)] is zero)

E[s^2(k)] = (1/0.4) ∫_{-0.2}^{0.2} s^2 ds = (1/(3(0.4))) s^3 |_{-0.2}^{0.2} = 0.0133

E[m^2(k)] = (1/3) Σ_{k=1}^{3} (1.2 sin(2πk/3 - 3π/4))^2 = 0.72

c = 0.0133 + 0.72 = 0.7333

F(x*) = 0.7333 - 2(0.72) + 0.72 = 0.0133

Page 25: LMS Response

[Figure: LMS response with α = 0.1 — plots of the original and restored EEG signals over 0 to 0.5 s, and of the EEG signal minus the restored signal over the same interval.]

Notes (Anthony S Maida):
- Assessing the performance: the plots illustrate how the filter adapts to cancel the noise.
- Initially the restored signal is a poor approximation to the original signal.
- The difference s - e does not stabilize at exactly 0 because of jitter around the minimum, caused by using the approximate (stochastic) gradient.
Page 26: Echo Cancellation

[Figure: echo-cancellation block diagram — a phone, hybrid circuit, and adaptive filter at each end of the transmission line, with the adaptive filter's output subtracted (+/-) from the signal returning toward the phone.]

Notes (Anthony S Maida):
- A larger application involving transmission over phone lines.