Introduction to Radial Basis Function Networks


Page 1: Introduction to  Radial Basis Function Networks

Introduction to Radial Basis Function Networks

Page 2: Introduction to  Radial Basis Function Networks

Content
– Overview
– The Models of Function Approximator
– The Radial Basis Function Networks
– RBFN's for Function Approximation
– The Projection Matrix
– Learning the Kernels
– Bias-Variance Dilemma
– The Effective Number of Parameters
– Model Selection
– Incremental Operations

Page 3: Introduction to  Radial Basis Function Networks

Introduction to Radial Basis Function Networks

Overview

Page 4: Introduction to  Radial Basis Function Networks

Typical Applications of NN

Pattern Classification: $l = f(\mathbf{x})$, where $\mathbf{x} \in X \subseteq \mathbb{R}^n$ and $l \in C = \{1, 2, \ldots, N\}$ is a class label.

Function Approximation: $\mathbf{y} = f(\mathbf{x})$, where $\mathbf{x} \in X \subseteq \mathbb{R}^m$ and $\mathbf{y} \in Y \subseteq \mathbb{R}^n$.

Time-Series Forecasting: $x(t) = f(x_{t-1}, x_{t-2}, x_{t-3}, \ldots)$.

Page 5: Introduction to  Radial Basis Function Networks

Function Approximation

Unknown function: $f: X \to Y$, $X \subseteq \mathbb{R}^m$, $Y \subseteq \mathbb{R}^n$.

Approximator: $\hat{f}: X \to Y$.

Page 6: Introduction to  Radial Basis Function Networks

Supervised Learning

[Diagram: for $i = 1, 2, \ldots$, each input $\mathbf{x}_i$ is fed both to the unknown function, producing the target $y_i$, and to the neural network, producing the estimate $\hat{y}_i$; the error $e_i = y_i - \hat{y}_i$ drives the learning.]

Page 7: Introduction to  Radial Basis Function Networks

Neural Networks as Universal Approximators: feedforward neural networks with a single hidden layer of sigmoidal units are capable of approximating uniformly any continuous multivariate function to any desired degree of accuracy.

– Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.

Like feedforward neural networks with a single hidden layer of sigmoidal units, RBF networks can be shown to be universal approximators.

– Park, J. and Sandberg, I. W. (1991). "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, 3(2), 246-257.

– Park, J. and Sandberg, I. W. (1993). "Approximation and Radial-Basis-Function Networks," Neural Computation, 5(2), 305-316.

Page 8: Introduction to  Radial Basis Function Networks

Statistics vs. Neural Networks

Statistics              Neural Networks
model                   network
estimation              learning
regression              supervised learning
interpolation           generalization
observations            training set
parameters              (synaptic) weights
independent variables   inputs
dependent variables     outputs
ridge regression        weight decay

Page 9: Introduction to  Radial Basis Function Networks

Introduction to Radial Basis Function Networks

The Model of Function Approximator

Page 10: Introduction to  Radial Basis Function Networks

Linear Models

$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

The $\phi_i(\cdot)$ are fixed basis functions; the $w_i$ are the weights.

Page 11: Introduction to  Radial Basis Function Networks

Linear Models

$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

[Diagram: inputs $x_1, \ldots, x_n$ (feature vectors) feed $m$ hidden units $\phi_1, \ldots, \phi_m$, whose outputs are combined with weights $w_1, \ldots, w_m$ into the linearly weighted output $y$.]

The hidden units perform decomposition, feature extraction, and transformation of the input feature vectors.
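To make the linear model concrete, here is a minimal sketch in Python with NumPy (the language and all names below are our illustration, not part of the original slides) of $f(\mathbf{x}) = \sum_i w_i \phi_i(\mathbf{x})$ with a fixed set of basis functions:

```python
import numpy as np

def linear_model(x, basis_functions, w):
    """Evaluate f(x) = sum_i w_i * phi_i(x) for fixed basis functions."""
    phi = np.array([phi_i(x) for phi_i in basis_functions])
    return w @ phi

# Example: three fixed Gaussian bumps on the real line as basis functions.
centers = [-1.0, 0.0, 1.0]
basis = [lambda x, c=c: np.exp(-(x - c) ** 2) for c in centers]
w = np.array([0.5, -1.0, 2.0])          # the adjustable weights

print(linear_model(0.3, basis, w))      # f(0.3)
```

Only the weights are learned here; the basis functions stay fixed, which is what makes the model linear in its parameters.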

Page 12: Introduction to  Radial Basis Function Networks

Linear Models

$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

[Same network diagram as the previous slide.]

Can you name some bases?

Page 13: Introduction to  Radial Basis Function Networks

Example Linear Models

Polynomial:
$$f(x) = \sum_i w_i x^i, \qquad \phi_i(x) = x^i, \quad i = 0, 1, 2, \ldots$$

Fourier Series:
$$f(x) = \sum_k w_k \exp(j 2\pi k \nu_0 x), \qquad \phi_k(x) = \exp(j 2\pi k \nu_0 x), \quad k = 0, 1, 2, \ldots$$

Are they orthogonal bases?
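To make the orthogonality question concrete, this small sketch (our hypothetical illustration, not from the slides) builds both bases on a grid over $[0, 1]$ and checks their inner products numerically; the monomials are not orthogonal there, while the complex exponentials (taking $\nu_0 = 1$) are:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1000)          # sample [0, 1] densely
dx = x[1] - x[0]

# Polynomial basis: phi_i(x) = x^i, i = 0..4
poly = np.stack([x ** i for i in range(5)])
gram_poly = poly @ poly.T * dx           # approximate inner products

# Fourier basis: phi_k(x) = exp(j 2 pi k x), k = 0..4  (nu_0 = 1)
fourier = np.stack([np.exp(2j * np.pi * k * x) for k in range(5)])
gram_fourier = fourier @ fourier.conj().T * dx

print(np.round(gram_poly, 3))            # dense Gram matrix: not orthogonal
print(np.round(np.abs(gram_fourier), 3)) # ~identity: orthogonal basis
```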

Page 14: Introduction to  Radial Basis Function Networks

Single-Layer Perceptrons as Universal Approximators

$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

With a sufficient number of sigmoidal hidden units, a single-layer perceptron can be a universal approximator.

Page 15: Introduction to  Radial Basis Function Networks

Radial Basis Function Networks as Universal Approximators

$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

With a sufficient number of radial-basis-function units, an RBF network can also be a universal approximator.

Page 16: Introduction to  Radial Basis Function Networks

Non-Linear Models

$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

Here the basis functions $\phi_i(\cdot)$ themselves are adjusted by the learning process, along with the weights.

Page 17: Introduction to  Radial Basis Function Networks

Introduction to Radial Basis Function Networks

The Radial Basis Function Networks

Page 18: Introduction to  Radial Basis Function Networks

Radial Basis Functions

$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

Three ingredients define a radial function:
– Center: $\mathbf{x}_i$
– Distance measure: $r = \|\mathbf{x} - \mathbf{x}_i\|$
– Shape: $\phi_i(\mathbf{x}) = \phi(\|\mathbf{x} - \mathbf{x}_i\|)$

Page 19: Introduction to  Radial Basis Function Networks

Typical Radial Functions

Gaussian:
$$\phi(r) = e^{-r^2 / (2\sigma^2)}, \qquad \sigma > 0,\ r \in \mathbb{R}$$

Hardy Multiquadratic:
$$\phi(r) = \sqrt{r^2 + c^2}, \qquad c > 0,\ r \in \mathbb{R}$$

Inverse Multiquadratic:
$$\phi(r) = \frac{c}{\sqrt{r^2 + c^2}}, \qquad c > 0,\ r \in \mathbb{R}$$
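A minimal sketch of the three profiles in Python/NumPy (our illustration; the inverse multiquadratic is written with $c$ in the numerator, following the reconstruction above):

```python
import numpy as np

def gaussian(r, sigma=1.0):
    # phi(r) = exp(-r^2 / (2 sigma^2)), sigma > 0
    return np.exp(-r ** 2 / (2.0 * sigma ** 2))

def multiquadratic(r, c=1.0):
    # phi(r) = sqrt(r^2 + c^2), c > 0
    return np.sqrt(r ** 2 + c ** 2)

def inverse_multiquadratic(r, c=1.0):
    # phi(r) = c / sqrt(r^2 + c^2), c > 0, so phi(0) = 1
    return c / np.sqrt(r ** 2 + c ** 2)

r = np.linspace(-10.0, 10.0, 5)
print(gaussian(r, sigma=1.5))
print(multiquadratic(r), inverse_multiquadratic(r, c=2.0))
```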

Page 20: Introduction to  Radial Basis Function Networks

Gaussian Basis Function ($\sigma = 0.5, 1.0, 1.5$)

$$\phi(r) = e^{-r^2 / (2\sigma^2)}, \qquad \sigma > 0$$

[Plot: Gaussian profiles for $\sigma = 0.5$, $1.0$, and $1.5$; larger $\sigma$ gives a wider bump.]

Page 21: Introduction to  Radial Basis Function Networks

Inverse Multiquadratic

$$\phi(r) = \frac{c}{\sqrt{r^2 + c^2}}$$

[Plot: profiles over $-10 \le r \le 10$ for $c = 1, 2, 3, 4, 5$; every curve peaks at $\phi(0) = 1$, and larger $c$ gives a flatter, wider profile.]

Page 22: Introduction to  Radial Basis Function Networks

Most General RBF

$$\phi_i(\mathbf{x}) = \phi\!\left( (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right)$$

[Diagram: three kernels with centers $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, $\boldsymbol{\mu}_3$ and elliptical contours.]

The basis $\{\phi_i : i = 1, 2, \ldots\}$ is `near' orthogonal.
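A sketch of this general form with a full covariance per kernel (hypothetical helper names; we assume a Gaussian shape function $\phi(d) = e^{-d}$ for illustration):

```python
import numpy as np

def general_rbf(x, mu, Sigma, shape=np.exp):
    """phi_i(x) = phi((x - mu_i)^T Sigma_i^{-1} (x - mu_i)); here phi(d) = exp(-d)."""
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)   # Mahalanobis-style quadratic form
    return shape(-quad)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # elliptical contours
print(general_rbf(np.array([1.0, -0.5]), mu, Sigma))
```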

Page 23: Introduction to  Radial Basis Function Networks

Properties of RBF's

On-Center, Off-Surround: analogous to the localized receptive fields found in several biological structures, e.g.,
– visual cortex;
– ganglion cells.

Page 24: Introduction to  Radial Basis Function Networks

The Topology of RBF

[Diagram: inputs $x_1, \ldots, x_n$ (feature vectors) feed the hidden units, whose outputs are combined into $y_1, \ldots, y_m$; the hidden layer performs a projection, the output layer an interpolation.]

As a function approximator: $\mathbf{y} = f(\mathbf{x})$.

Page 25: Introduction to  Radial Basis Function Networks

The Topology of RBF

[Diagram: the same network viewed as a pattern classifier; hidden units respond to subclasses, output units to classes.]

Page 26: Introduction to  Radial Basis Function Networks

Introduction to Radial Basis Function Networks

RBFN’s forFunction Approximation

Page 27: Introduction to  Radial Basis Function Networks

The idea

[Plot: an unknown function of $x$ to approximate, together with the training data sampled from it.]

Page 28: Introduction to  Radial Basis Function Networks

The idea

[Plot: the same unknown function and training data, with basis functions (kernels) placed along the $x$-axis.]

Page 29: Introduction to  Radial Basis Function Networks

The idea

[Plot: the function learned as the weighted sum of the basis functions (kernels).]

$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

Page 30: Introduction to  Radial Basis Function Networks

The idea

[Plot: the learned function evaluated at a nontraining sample; the kernels interpolate between the training points.]

$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

Page 31: Introduction to  Radial Basis Function Networks

The idea

[Plot: the learned function alone, with a nontraining sample marked.]

$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

Page 32: Introduction to  Radial Basis Function Networks

Radial Basis Function Networks as Universal Approximators

$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

Training set: $\mathcal{T} = \left\{ \left( \mathbf{x}^{(k)}, y^{(k)} \right) \right\}_{k=1}^{p}$

Goal: $y^{(k)} \approx f\!\left(\mathbf{x}^{(k)}\right)$ for all $k$.

$$\min SSE = \sum_{k=1}^{p} \left( y^{(k)} - f\!\left(\mathbf{x}^{(k)}\right) \right)^2 = \sum_{k=1}^{p} \left( y^{(k)} - \sum_{i=1}^{m} w_i \,\phi_i\!\left(\mathbf{x}^{(k)}\right) \right)^2$$

Page 33: Introduction to  Radial Basis Function Networks

Learn the Optimal Weight Vector

$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

Training set: $\mathcal{T} = \left\{ \left( \mathbf{x}^{(k)}, y^{(k)} \right) \right\}_{k=1}^{p}$

Goal: $y^{(k)} \approx f\!\left(\mathbf{x}^{(k)}\right)$ for all $k$.

$$\min SSE = \sum_{k=1}^{p} \left( y^{(k)} - \sum_{i=1}^{m} w_i \,\phi_i\!\left(\mathbf{x}^{(k)}\right) \right)^2$$

Page 34: Introduction to  Radial Basis Function Networks

Regularization

Training set: $\mathcal{T} = \left\{ \left( \mathbf{x}^{(k)}, y^{(k)} \right) \right\}_{k=1}^{p}$; goal: $y^{(k)} \approx f\!\left(\mathbf{x}^{(k)}\right)$ for all $k$.

$$\min C = \sum_{k=1}^{p} \left( y^{(k)} - f\!\left(\mathbf{x}^{(k)}\right) \right)^2 + \sum_{i=1}^{m} \lambda_i w_i^2, \qquad \lambda_i \ge 0$$

If regularization is unneeded, set $\lambda_i = 0$.

Page 35: Introduction to  Radial Basis Function Networks

Learn the Optimal Weight Vector

Minimize
$$C = \sum_{k=1}^{p} \left( y^{(k)} - f\!\left(\mathbf{x}^{(k)}\right) \right)^2 + \sum_{i=1}^{m} \lambda_i w_i^2, \qquad f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

Setting each partial derivative to zero,
$$\frac{\partial C}{\partial w_j} = -2 \sum_{k=1}^{p} \left( y^{(k)} - f\!\left(\mathbf{x}^{(k)}\right) \right) \phi_j\!\left(\mathbf{x}^{(k)}\right) + 2 \lambda_j w_j = 0$$

At the optimum $f^*(\mathbf{x}) = \sum_{i=1}^{m} w_i^* \,\phi_i(\mathbf{x})$, this gives

$$\sum_{k=1}^{p} f^*\!\left(\mathbf{x}^{(k)}\right) \phi_j\!\left(\mathbf{x}^{(k)}\right) + \lambda_j w_j^* = \sum_{k=1}^{p} y^{(k)} \phi_j\!\left(\mathbf{x}^{(k)}\right)$$

Page 36: Introduction to  Radial Basis Function Networks

Learn the Optimal Weight Vector

Define
$$\boldsymbol{\phi}_j = \left( \phi_j\!\left(\mathbf{x}^{(1)}\right), \ldots, \phi_j\!\left(\mathbf{x}^{(p)}\right) \right)^T, \quad \mathbf{f}^* = \left( f^*\!\left(\mathbf{x}^{(1)}\right), \ldots, f^*\!\left(\mathbf{x}^{(p)}\right) \right)^T, \quad \mathbf{y} = \left( y^{(1)}, \ldots, y^{(p)} \right)^T$$

Then the optimality conditions become
$$\boldsymbol{\phi}_j^T \mathbf{f}^* + \lambda_j w_j^* = \boldsymbol{\phi}_j^T \mathbf{y}, \qquad j = 1, \ldots, m$$

Page 37: Introduction to  Radial Basis Function Networks

Learn the Optimal Weight Vector

Stacking the $m$ conditions $\boldsymbol{\phi}_j^T \mathbf{f}^* + \lambda_j w_j^* = \boldsymbol{\phi}_j^T \mathbf{y}$ and defining

$$\boldsymbol{\Phi} = \left[ \boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \ldots, \boldsymbol{\phi}_m \right], \qquad \boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m), \qquad \mathbf{w}^* = \left( w_1^*, w_2^*, \ldots, w_m^* \right)^T$$

we obtain
$$\boldsymbol{\Phi}^T \mathbf{f}^* + \boldsymbol{\Lambda} \mathbf{w}^* = \boldsymbol{\Phi}^T \mathbf{y}$$

Page 38: Introduction to  Radial Basis Function Networks

Learn the Optimal Weight Vector

Since $f^*\!\left(\mathbf{x}^{(k)}\right) = \sum_{i=1}^{m} w_i^* \,\phi_i\!\left(\mathbf{x}^{(k)}\right)$ for each $k$,

$$\mathbf{f}^* = \begin{pmatrix} f^*\!\left(\mathbf{x}^{(1)}\right) \\ \vdots \\ f^*\!\left(\mathbf{x}^{(p)}\right) \end{pmatrix} = \begin{pmatrix} \phi_1\!\left(\mathbf{x}^{(1)}\right) & \cdots & \phi_m\!\left(\mathbf{x}^{(1)}\right) \\ \vdots & \ddots & \vdots \\ \phi_1\!\left(\mathbf{x}^{(p)}\right) & \cdots & \phi_m\!\left(\mathbf{x}^{(p)}\right) \end{pmatrix} \begin{pmatrix} w_1^* \\ \vdots \\ w_m^* \end{pmatrix} = \boldsymbol{\Phi} \mathbf{w}^*$$

Substituting into $\boldsymbol{\Phi}^T \mathbf{f}^* + \boldsymbol{\Lambda} \mathbf{w}^* = \boldsymbol{\Phi}^T \mathbf{y}$:

$$\boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{w}^* + \boldsymbol{\Lambda} \mathbf{w}^* = \boldsymbol{\Phi}^T \mathbf{y}$$

Page 39: Introduction to  Radial Basis Function Networks

Learn the Optimal Weight Vector

$$\left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \boldsymbol{\Lambda} \right) \mathbf{w}^* = \boldsymbol{\Phi}^T \mathbf{y}$$

$$\mathbf{w}^* = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \boldsymbol{\Lambda} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y} = \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y}$$

Design matrix: $\boldsymbol{\Phi}$. Variance matrix: $\mathbf{A}^{-1} = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \boldsymbol{\Lambda} \right)^{-1}$.

Page 40: Introduction to  Radial Basis Function Networks

Summary

Training set: $\mathcal{T} = \left\{ \left( \mathbf{x}^{(k)}, y^{(k)} \right) \right\}_{k=1}^{p}$, model $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$.

$$\boldsymbol{\phi}_j = \begin{pmatrix} \phi_j\!\left(\mathbf{x}^{(1)}\right) \\ \vdots \\ \phi_j\!\left(\mathbf{x}^{(p)}\right) \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(p)} \end{pmatrix}, \quad \boldsymbol{\Phi} = \left[ \boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_m \right], \quad \boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$$

$$\mathbf{w}^* = \left( w_1^*, \ldots, w_m^* \right)^T = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \boldsymbol{\Lambda} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y} = \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y}$$
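Pulling the summary together, a minimal end-to-end sketch in Python/NumPy (an assumed implementation, with illustrative data; uniform ridge parameter $\lambda$) that builds the design matrix and solves $(\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \boldsymbol{\Lambda})\mathbf{w}^* = \boldsymbol{\Phi}^T\mathbf{y}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training set {(x^(k), y^(k))}, k = 1..p
p = 50
x = np.sort(rng.uniform(-3.0, 3.0, p))
y = np.sin(x) + rng.normal(0.0, 0.1, p)

# Gaussian kernels: centers mu_i and a common width sigma
mu = np.linspace(-3.0, 3.0, 10)
sigma = 0.7
Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2.0 * sigma ** 2))  # p x m

# Solve (Phi^T Phi + Lambda) w = Phi^T y  (here Lambda = lambda * I)
lam = 1e-3
A = Phi.T @ Phi + lam * np.eye(len(mu))
w = np.linalg.solve(A, Phi.T @ y)

f = Phi @ w                               # fitted values f* = Phi w*
print("SSE:", np.sum((y - f) ** 2))
```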

Page 41: Introduction to  Radial Basis Function Networks

Introduction to Radial Basis Function Networks

The Projection Matrix

Page 42: Introduction to  Radial Basis Function Networks

The Empirical-Error Vector

[Diagram: each training input $\mathbf{x}^{(k)}$ passes through the kernels $\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_m$ and the weights $w_1^*, \ldots, w_m^*$ to produce $\mathbf{f}^*$, while the unknown function produces $\mathbf{y}$.]

$$\mathbf{w}^* = \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y}, \qquad \mathbf{f}^* = \boldsymbol{\Phi} \mathbf{w}^*$$

Page 43: Introduction to  Radial Basis Function Networks

The Empirical-Error Vector

$$\mathbf{e} = \mathbf{y} - \mathbf{f}^* = \mathbf{y} - \boldsymbol{\Phi} \mathbf{w}^* = \mathbf{y} - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y} = \left( \mathbf{I}_p - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T \right) \mathbf{y} = \mathbf{P} \mathbf{y}$$

Page 44: Introduction to  Radial Basis Function Networks

Sum-Squared-Error

$$\mathbf{e} = \mathbf{P} \mathbf{y}, \qquad \mathbf{P} = \mathbf{I}_p - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T, \qquad \mathbf{A} = \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \boldsymbol{\Lambda}, \qquad \boldsymbol{\Phi} = \left[ \boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_m \right]$$

$$SSE = \sum_{k=1}^{p} \left( y^{(k)} - f^*\!\left(\mathbf{x}^{(k)}\right) \right)^2 = (\mathbf{P}\mathbf{y})^T (\mathbf{P}\mathbf{y}) = \mathbf{y}^T \mathbf{P}^2 \mathbf{y}$$

If $\boldsymbol{\Lambda} = \mathbf{0}$, the RBFN's learning algorithm minimizes the SSE (equivalently, the MSE).

Page 45: Introduction to  Radial Basis Function Networks

The Projection Matrix

$$\mathbf{e} = \mathbf{P} \mathbf{y}, \qquad \mathbf{P} = \mathbf{I}_p - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T, \qquad SSE = \mathbf{y}^T \mathbf{P}^2 \mathbf{y}$$

When $\boldsymbol{\Lambda} = \mathbf{0}$, $\mathbf{P}$ is a projection: the error vector $\mathbf{e} \perp \mathrm{span}(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \ldots, \boldsymbol{\phi}_m)$, and

$$\mathbf{P} \mathbf{e} = \mathbf{P} (\mathbf{P} \mathbf{y}) = \mathbf{P} \mathbf{y} = \mathbf{e}, \qquad SSE = \mathbf{y}^T \mathbf{P} \mathbf{y}$$
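These identities are easy to verify numerically; a small self-contained sketch (illustrative random design matrix, unregularized case):

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 30, 6
Phi = rng.normal(size=(p, m))             # an arbitrary design matrix
y = rng.normal(size=p)

# Unregularized case: Lambda = 0, so P is a true projection
A = Phi.T @ Phi
P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
e = P @ y                                 # empirical-error vector e = P y

print(np.allclose(P @ P, P))              # True: P^2 = P (idempotent)
print(np.allclose(Phi.T @ e, 0.0))        # True: e orthogonal to span(phi_1..phi_m)
print(np.isclose(e @ e, y @ P @ y))       # True: SSE = y^T P y when Lambda = 0
```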

Page 46: Introduction to  Radial Basis Function Networks

Introduction to Radial Basis Function Networks

Learning the Kernels

Page 47: Introduction to  Radial Basis Function Networks

RBFN's as Universal Approximators

Training set: $\mathcal{T} = \left\{ \left( \mathbf{x}^{(k)}, \mathbf{y}^{(k)} \right) \right\}_{k=1}^{p}$

[Diagram: inputs $x_1, \ldots, x_n$ feed kernels $\phi_1, \ldots, \phi_m$, connected by weights $w_{ij}$ to outputs $y_1, \ldots, y_l$.]

Kernels:
$$\phi_j(\mathbf{x}) = \exp\!\left( -\frac{\left\| \mathbf{x} - \boldsymbol{\mu}_j \right\|^2}{2 \sigma_j^2} \right)$$

Page 48: Introduction to  Radial Basis Function Networks

What to Learn?

– Weights $w_{ij}$'s
– Centers $\boldsymbol{\mu}_j$'s of the $\phi_j$'s
– Widths $\sigma_j$'s of the $\phi_j$'s
– Number of $\phi_j$'s: model selection

Page 49: Introduction to  Radial Basis Function Networks

One-Stage Learning

With $f_i\!\left(\mathbf{x}^{(k)}\right) = \sum_{j=1}^{m} w_{ij} \,\phi_j\!\left(\mathbf{x}^{(k)}\right)$, the gradient-descent updates for each pattern are

$$\Delta w_{ij} = \eta_1 \left( y_i^{(k)} - f_i\!\left(\mathbf{x}^{(k)}\right) \right) \phi_j\!\left(\mathbf{x}^{(k)}\right)$$

$$\Delta \boldsymbol{\mu}_j = \eta_2 \sum_{i=1}^{l} \left( y_i^{(k)} - f_i\!\left(\mathbf{x}^{(k)}\right) \right) w_{ij} \,\phi_j\!\left(\mathbf{x}^{(k)}\right) \frac{\mathbf{x}^{(k)} - \boldsymbol{\mu}_j}{\sigma_j^2}$$

$$\Delta \sigma_j = \eta_3 \sum_{i=1}^{l} \left( y_i^{(k)} - f_i\!\left(\mathbf{x}^{(k)}\right) \right) w_{ij} \,\phi_j\!\left(\mathbf{x}^{(k)}\right) \frac{\left\| \mathbf{x}^{(k)} - \boldsymbol{\mu}_j \right\|^2}{\sigma_j^3}$$

Page 50: Introduction to  Radial Basis Function Networks

One-Stage Learning

The simultaneous update of all three sets of parameters (weights $w_{ij}$, centers $\boldsymbol{\mu}_j$, and widths $\sigma_j$, using the update rules above) may be suitable for non-stationary environments or on-line settings.
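A per-pattern sketch of these updates for a single-output network ($l = 1$), assuming Gaussian kernels; the learning rates and initialization below are illustrative, not from the slides:

```python
import numpy as np

def one_stage_step(x_k, y_k, w, mu, sigma, eta=(0.05, 0.01, 0.01)):
    """One gradient step on pattern (x_k, y_k) for a single-output Gaussian RBFN."""
    diff = x_k - mu                               # (m, n): x^(k) - mu_j
    dist2 = np.sum(diff ** 2, axis=1)             # ||x^(k) - mu_j||^2
    phi = np.exp(-dist2 / (2.0 * sigma ** 2))     # kernel outputs
    err = y_k - w @ phi                           # y^(k) - f(x^(k))
    dw = eta[0] * err * phi                       # weight update
    dmu = eta[1] * err * (w * phi / sigma ** 2)[:, None] * diff   # center update
    dsigma = eta[2] * err * w * phi * dist2 / sigma ** 3          # width update
    return w + dw, mu + dmu, sigma + dsigma

rng = np.random.default_rng(2)
m, n = 5, 2
w = rng.normal(size=m) * 0.1
mu = rng.uniform(-1.0, 1.0, size=(m, n))
sigma = np.full(m, 0.5)
w, mu, sigma = one_stage_step(np.array([0.2, -0.3]), 1.0, w, mu, sigma)
```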

Page 51: Introduction to  Radial Basis Function Networks

Two-Stage Training

[Diagram: the same RBFN with inputs $x_1, \ldots, x_n$, kernels $\phi_1, \ldots, \phi_m$, and outputs $y_1, \ldots, y_l$.]

Step 1: determine the centers $\boldsymbol{\mu}_j$'s, widths $\sigma_j$'s, and number of $\phi_j$'s.

Step 2: determine the weights $w_{ij}$'s, e.g., using batch learning.

Page 52: Introduction to  Radial Basis Function Networks

Train the Kernels

Page 53: Introduction to  Radial Basis Function Networks

Unsupervised Training

[Plot: unlabeled data points with candidate kernel centers (+) placed among them.]

Page 54: Introduction to  Radial Basis Function Networks

Methods

Subset Selection
– Random Subset Selection
– Forward Selection
– Backward Elimination

Clustering Algorithms (see the sketch below)
– K-means
– LVQ

Mixture Models
– GMM
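As a sketch of the two-stage recipe with the clustering route (a minimal hand-rolled k-means for stage 1; in practice a library implementation and a real target would be used, and all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2)) + rng.choice([-2.0, 2.0], size=(200, 1))  # two blobs

# Stage 1: choose centers by k-means (a few Lloyd iterations)
K = 4
centers = X[rng.choice(len(X), K, replace=False)]
for _ in range(20):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = np.argmin(d2, axis=1)
    centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(K)])

# Widths, e.g., from the spread of each cluster (floored to stay positive)
widths = np.array([np.sqrt(((X[labels == k] - centers[k]) ** 2).sum(-1).mean())
                   for k in range(K)])
widths = np.maximum(widths, 1e-3)

# Stage 2: with kernels fixed, the weights come from (batch) linear least squares
y = np.sin(X[:, 0])                        # an illustrative target
Phi = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
             / (2.0 * widths[None, :] ** 2))
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```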

Page 55: Introduction to  Radial Basis Function Networks

Subset Selection

Page 56: Introduction to  Radial Basis Function Networks

Random Subset Selection

Randomly choose a subset of points from the training set as centers.

Sensitive to the initially chosen points.

Use adaptive techniques to tune
– centers
– widths
– the number of points

Page 57: Introduction to  Radial Basis Function Networks

Clustering Algorithms

Partition the data points into K clusters.

Page 58: Introduction to  Radial Basis Function Networks

Clustering Algorithms

Is such a partition satisfactory?

Page 59: Introduction to  Radial Basis Function Networks

Clustering Algorithms

How about this?

Page 60: Introduction to  Radial Basis Function Networks

Clustering Algorithms

[Plot: four clusters (1-4), each with its center marked (+).]

Page 61: Introduction to  Radial Basis Function Networks

Introduction to Radial Basis Function Networks

Bias-Variance Dilemma

Page 62: Introduction to  Radial Basis Function Networks

Goal Revisit

Ultimate goal: generalization, i.e., minimize the prediction error.

Goal of our learning procedure: minimize the empirical error.

Page 63: Introduction to  Radial Basis Function Networks

Badness of Fit

Underfitting
– A model (e.g., network) that is not sufficiently complex can fail to detect fully the signal in a complicated data set, leading to underfitting.
– Produces excessive bias in the outputs.

Overfitting
– A model (e.g., network) that is too complex may fit the noise, not just the signal, leading to overfitting.
– Produces excessive variance in the outputs.

Page 64: Introduction to  Radial Basis Function Networks

Underfitting/Overfitting Avoidance

– Model selection
– Jittering
– Early stopping
– Weight decay (regularization, ridge regression)
– Bayesian learning
– Combining networks

Page 65: Introduction to  Radial Basis Function Networks

Best Way to Avoid Overfitting

Use lots of training data, e.g.,
– 30 times as many training cases as there are weights in the network;
– for noise-free data, 5 times as many training cases as weights may be sufficient.

Don't arbitrarily reduce the number of weights for fear of underfitting.

Page 66: Introduction to  Radial Basis Function Networks

Badness of Fit

[Plot: an underfit model (left) and an overfit model (right) on the same data.]

Page 67: Introduction to  Radial Basis Function Networks

Badness of Fit

[Plot: another example of underfitting (left) vs. overfitting (right).]

Page 68: Introduction to  Radial Basis Function Networks

Bias-Variance Dilemma

Underfit: large bias, small variance.
Overfit: small bias, large variance.

However, it's not really a dilemma.

Page 69: Introduction to  Radial Basis Function Networks

Bias-Variance Dilemma

Underfit: large bias, small variance. Overfit: small bias, large variance.

More on overfitting:
– It easily leads to predictions that are far beyond the range of the training data.
– It can produce wild predictions in multilayer perceptrons, even with noise-free data.

Page 70: Introduction to  Radial Basis Function Networks

Bias-Variance Dilemma

It's not really a dilemma.

[Diagram: moving from more underfit to more overfit, bias shrinks while variance grows.]

Page 71: Introduction to  Radial Basis Function Networks

Bias-Variance Dilemma

[Diagram: a set of functions, e.g., depending on the number of hidden nodes used, around the true model; noise perturbs each training set, so the solutions obtained with training sets 1, 2, and 3 differ, each with its own bias.]

The mean of the bias = ? The variance of the bias = ?

Page 72: Introduction to  Radial Basis Function Networks

Bias-Variance Dilemma

[Diagram: the same picture summarized; the bias is the distance from the true model to the mean of the solutions, the variance is the spread of the solutions around that mean.]

Page 73: Introduction to  Radial Basis Function Networks

Model Selection

[Diagram: shrinking the set of functions, e.g., by reducing the number of hidden nodes, reduces the variance at the cost of bias.]

Reduce the effective number of parameters. Reduce the number of hidden nodes.

Page 74: Introduction to  Radial Basis Function Networks

Bias-Variance Dilemma

The data are generated by the true model $g$ with additive noise:
$$y = g(\mathbf{x}) + \varepsilon$$

The network learns an approximator $f(\mathbf{x})$.

Goal: $\min E\!\left[ \left( y - f(\mathbf{x}) \right)^2 \right]$

Page 75: Introduction to  Radial Basis Function Networks

Bias-Variance Dilemma

Goal: $\min E\!\left[ \left( y - f(\mathbf{x}) \right)^2 \right]$

$$E\!\left[ \left( y - f(\mathbf{x}) \right)^2 \right] = E\!\left[ \left( g(\mathbf{x}) + \varepsilon - f(\mathbf{x}) \right)^2 \right] = E\!\left[ \left( g(\mathbf{x}) - f(\mathbf{x}) \right)^2 \right] + 2 E\!\left[ \varepsilon \left( g(\mathbf{x}) - f(\mathbf{x}) \right) \right] + E\!\left[ \varepsilon^2 \right]$$

The cross term is $0$ and $E[\varepsilon^2]$ is a constant, so

$$\min E\!\left[ \left( y - f(\mathbf{x}) \right)^2 \right] \iff \min E\!\left[ \left( g(\mathbf{x}) - f(\mathbf{x}) \right)^2 \right]$$

Page 76: Introduction to  Radial Basis Function Networks

Bias-Variance Dilemma

Inserting the mean $E[f(\mathbf{x})]$ over training sets:

$$E\!\left[ \left( g(\mathbf{x}) - f(\mathbf{x}) \right)^2 \right] = E\!\left[ \left( g(\mathbf{x}) - E[f(\mathbf{x})] + E[f(\mathbf{x})] - f(\mathbf{x}) \right)^2 \right]$$
$$= \left( g(\mathbf{x}) - E[f(\mathbf{x})] \right)^2 + E\!\left[ \left( f(\mathbf{x}) - E[f(\mathbf{x})] \right)^2 \right] + 2 \left( g(\mathbf{x}) - E[f(\mathbf{x})] \right) E\!\left[ E[f(\mathbf{x})] - f(\mathbf{x}) \right]$$

The last expectation is $E\!\left[ E[f(\mathbf{x})] - f(\mathbf{x}) \right] = E[f(\mathbf{x})] - E[f(\mathbf{x})] = 0$, so the cross term vanishes:

$$E\!\left[ \left( y - f(\mathbf{x}) \right)^2 \right] = E\!\left[ \varepsilon^2 \right] + \left( g(\mathbf{x}) - E[f(\mathbf{x})] \right)^2 + E\!\left[ \left( f(\mathbf{x}) - E[f(\mathbf{x})] \right)^2 \right]$$

Page 77: Introduction to  Radial Basis Function Networks

Bias-Variance Dilemma

Goal: $\min E\!\left[ \left( y - f(\mathbf{x}) \right)^2 \right]$

$$E\!\left[ \left( y - f(\mathbf{x}) \right)^2 \right] = \underbrace{E\!\left[ \varepsilon^2 \right]}_{\text{noise}} + \underbrace{\left( g(\mathbf{x}) - E[f(\mathbf{x})] \right)^2}_{\text{bias}^2} + \underbrace{E\!\left[ \left( f(\mathbf{x}) - E[f(\mathbf{x})] \right)^2 \right]}_{\text{variance}}$$

The noise term cannot be minimized; we minimize both bias$^2$ and variance.
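The decomposition can be observed empirically. A sketch (our illustrative setup: true model $g(x) = \sin(\pi x)$, polynomial fits, many resampled training sets) that estimates bias$^2$ and variance at a single test point:

```python
import numpy as np

rng = np.random.default_rng(4)
g = lambda x: np.sin(np.pi * x)           # the true model
sigma_noise = 0.2
x_test = 0.3

for degree in (1, 9):                     # a simple vs. a complex model
    preds = []
    for _ in range(500):                  # many independent training sets
        x = rng.uniform(-1.0, 1.0, 20)
        y = g(x) + rng.normal(0.0, sigma_noise, 20)
        coef = np.polyfit(x, y, degree)   # least-squares polynomial fit
        preds.append(np.polyval(coef, x_test))
    preds = np.array(preds)
    bias2 = (preds.mean() - g(x_test)) ** 2          # (E[f] - g)^2
    var = preds.var()                                # E[(f - E[f])^2]
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

The simple model typically shows larger bias$^2$ and smaller variance; the complex model the reverse.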

Page 78: Introduction to  Radial Basis Function Networks

Model Complexity vs. Bias-Variance

$$E\!\left[ \left( y - f(\mathbf{x}) \right)^2 \right] = \text{noise} + \text{bias}^2 + \text{variance}$$

As the model complexity (capacity) grows, bias$^2$ shrinks while variance grows.

Page 79: Introduction to  Radial Basis Function Networks

Bias-Variance Dilemma

$$E\!\left[ \left( y - f(\mathbf{x}) \right)^2 \right] = \text{noise} + \text{bias}^2 + \text{variance}$$

[Plot: error vs. model complexity (capacity); the bias$^2$ curve falls and the variance curve rises, so their sum has a minimum at an intermediate complexity.]

Page 80: Introduction to  Radial Basis Function Networks

Example (Polynomial Fits)

$$y = x + 2 e^{-16 x^2}$$

Page 81: Introduction to  Radial Basis Function Networks

Example (Polynomial Fits)

Page 82: Introduction to  Radial Basis Function Networks

Example (Polynomial Fits)

[Plots: polynomial fits of degree 1, 5, 10, and 15 to the noisy samples.]
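A sketch reproducing the flavor of this example (the target $y = x + 2e^{-16x^2}$ plus noise follows the reconstruction above; the noise level and grid are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(-1.0, 1.0, 50)
y = x + 2.0 * np.exp(-16.0 * x ** 2) + rng.normal(0.0, 0.2, x.size)

for degree in (1, 5, 10, 15):
    coef = np.polyfit(x, y, degree)           # least-squares polynomial fit
    resid = y - np.polyval(coef, x)
    print(f"degree {degree:2d}: SSE = {np.sum(resid ** 2):.3f}")
```

The training SSE always decreases with the degree; whether the fit generalizes is a separate question, which is the point of the example.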

Page 83: Introduction to  Radial Basis Function Networks

Introduction to Radial Basis Function Networks

The Effective Number of Parameters

Page 84: Introduction to  Radial Basis Function Networks

Variance Estimation

Mean: $\mu$ (known).

Variance:
$$\hat{\sigma}^2 = \frac{1}{p} \sum_{i=1}^{p} \left( x_i - \mu \right)^2$$

In general, $\mu$ is not available.

Page 85: Introduction to  Radial Basis Function Networks

Variance Estimation

Mean:
$$\hat{x} = \frac{1}{p} \sum_{i=1}^{p} x_i$$

Variance:
$$s^2 = \hat{\sigma}^2 = \frac{1}{p-1} \sum_{i=1}^{p} \left( x_i - \hat{x} \right)^2$$

We lose 1 degree of freedom by estimating the mean.

Page 86: Introduction to  Radial Basis Function Networks

Simple Linear Regression

$$y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

[Plot: data points scattered around the regression line $y = \beta_0 + \beta_1 x$.]

Page 87: Introduction to  Radial Basis Function Networks

Simple Linear Regression

$$y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

Fit $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ by minimizing
$$SSE = \sum_{i=1}^{p} \left( y_i - \hat{y}_i \right)^2$$

Page 88: Introduction to  Radial Basis Function Networks

Mean Squared Error (MSE)

$$y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

Minimizing $SSE = \sum_{i=1}^{p} \left( y_i - \hat{y}_i \right)^2$ over the two coefficients, the variance estimate is

$$\hat{\sigma}^2 = MSE = \frac{SSE}{p - 2}$$

We lose 2 degrees of freedom, one per estimated coefficient.

Page 89: Introduction to  Radial Basis Function Networks

Variance Estimation

$$\hat{\sigma}^2 = MSE = \frac{SSE}{p - m}$$

We lose $m$ degrees of freedom, where $m$ is the number of parameters of the model.

Page 90: Introduction to  Radial Basis Function Networks

The Number of Parameters

$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

Number of degrees of freedom: $m$.

Page 91: Introduction to  Radial Basis Function Networks

The Effective Number of Parameters ($\gamma$)

$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \,\phi_i(\mathbf{x})$$

The projection matrix: $\mathbf{P} = \mathbf{I}_p - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T$, with $\mathbf{A} = \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \boldsymbol{\Lambda}$.

When $\boldsymbol{\Lambda} = \mathbf{0}$:
$$\gamma = p - \mathrm{trace}(\mathbf{P}) = m$$

Page 92: Introduction to  Radial Basis Function Networks

The Effective Number of Parameters ($\gamma$)

Facts: $\mathrm{trace}(\mathbf{A} + \mathbf{B}) = \mathrm{trace}(\mathbf{A}) + \mathrm{trace}(\mathbf{B})$ and $\mathrm{trace}(\mathbf{A}\mathbf{B}) = \mathrm{trace}(\mathbf{B}\mathbf{A})$.

Pf) With $\boldsymbol{\Lambda} = \mathbf{0}$, $\mathbf{A} = \boldsymbol{\Phi}^T \boldsymbol{\Phi}$, and

$$\mathrm{trace}(\mathbf{P}) = \mathrm{trace}\!\left( \mathbf{I}_p - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T \right) = p - \mathrm{trace}\!\left( \boldsymbol{\Phi} \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T \right) = p - \mathrm{trace}\!\left( \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T \boldsymbol{\Phi} \right) = p - \mathrm{trace}(\mathbf{I}_m) = p - m$$

Hence $\gamma = p - \mathrm{trace}(\mathbf{P}) = m$.
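A numerical check of $\gamma = p - \mathrm{trace}(\mathbf{P})$, sweeping a uniform ridge parameter (illustrative sketch; at $\lambda = 0$, $\gamma = m$, and $\gamma$ shrinks toward $0$ as $\lambda$ grows):

```python
import numpy as np

rng = np.random.default_rng(6)
p, m = 40, 8
Phi = rng.normal(size=(p, m))

for lam in (0.0, 0.1, 10.0, 1e6):
    A = Phi.T @ Phi + lam * np.eye(m)
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    gamma = p - np.trace(P)               # effective number of parameters
    print(f"lambda = {lam:>9}: gamma = {gamma:.3f}")  # m at 0, -> 0 as lam grows
```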

Page 93: Introduction to  Radial Basis Function Networks

Regularization

$$C = \underbrace{\sum_{k=1}^{p} \left( y^{(k)} - f\!\left(\mathbf{x}^{(k)}\right) \right)^2}_{\text{empirical error (SSE)}} + \underbrace{\sum_{i=1}^{m} \lambda_i w_i^2}_{\text{model's penalty term}} = \sum_{k=1}^{p} \left( y^{(k)} - \sum_{i=1}^{m} w_i \,\phi_i\!\left(\mathbf{x}^{(k)}\right) \right)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$$

The cost $C$ penalizes models with large weights.

The effective number of parameters: $\gamma = p - \mathrm{trace}(\mathbf{P})$.

Page 94: Introduction to  Radial Basis Function Networks

Regularization

$$C = \sum_{k=1}^{p} \left( y^{(k)} - f\!\left(\mathbf{x}^{(k)}\right) \right)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$$

Without the penalty ($\lambda_i = 0$), there are $m$ degrees of freedom with which to minimize the SSE (cost), and the effective number of parameters is $\gamma = m$.

Page 95: Introduction to  Radial Basis Function Networks

Regularization

$$C = \sum_{k=1}^{p} \left( y^{(k)} - f\!\left(\mathbf{x}^{(k)}\right) \right)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$$

With the penalty ($\lambda_i > 0$), the liberty to minimize the SSE is reduced, and the effective number of parameters is $\gamma < m$.

Page 96: Introduction to  Radial Basis Function Networks

Variance Estimation

$$\hat{\sigma}^2 = MSE = \frac{SSE}{p - \gamma}$$

We lose $\gamma$ degrees of freedom, where $\gamma = p - \mathrm{trace}(\mathbf{P})$ is the effective number of parameters.

Page 97: Introduction to  Radial Basis Function Networks

Variance Estimation

Since $\gamma = p - \mathrm{trace}(\mathbf{P})$,

$$\hat{\sigma}^2 = MSE = \frac{SSE}{p - \gamma} = \frac{SSE}{\mathrm{trace}(\mathbf{P})}$$

Page 98: Introduction to  Radial Basis Function Networks

Introduction to Radial Basis Function Networks

Model Selection

Page 99: Introduction to  Radial Basis Function Networks

Model Selection

Goal
– Choose the fittest model

Criteria
– Least prediction error

Main tools (estimating model fitness)
– Cross validation
– Projection matrix

Methods
– Weight decay (ridge regression)
– Pruning and growing RBFN's

Page 100: Introduction to  Radial Basis Function Networks

Empirical Error vs. Model Fitness

Ultimate goal: generalization, i.e., minimize the prediction error.

Goal of our learning procedure: minimize the empirical error. What we actually want to minimize, however, is the prediction error (MSE).

Page 101: Introduction to  Radial Basis Function Networks

Estimating Prediction Error

When you have plenty of data, use independent test sets.
– E.g., use the same training set to train different models, and choose the best model by comparing their performance on the test set.

When data is scarce, use
– Cross-Validation
– Bootstrap

Page 102: Introduction to  Radial Basis Function Networks

Cross Validation

The simplest and most widely used method for estimating prediction error.

Partition the original set in several different ways and compute an average score over the different partitions, e.g.,
– K-Fold Cross-Validation
– Leave-One-Out Cross-Validation
– Generalized Cross-Validation

Page 103: Introduction to  Radial Basis Function Networks

K-Fold CV

Split the set D of available input-output patterns into k mutually exclusive subsets D1, D2, ..., Dk.

Train and test the learning algorithm k times; each time it is trained on D \ Di and tested on Di.

Page 104: Introduction to  Radial Basis Function Networks

K-Fold CV

Available Data

Page 105: Introduction to  Radial Basis Function Networks

K-Fold CV

[Diagram: the available data split into folds D1, D2, D3, ..., Dk; in each of the k rounds a different fold serves as the test set while the remaining folds form the training set, yielding an estimate of $\sigma^2$.]
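A sketch of k-fold CV for the RBFN of these slides (fixed Gaussian kernels, ridge-regularized weights; the data, kernel placement, and fold logic are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
p = 60
x = rng.uniform(-3.0, 3.0, p)
y = np.sin(x) + rng.normal(0.0, 0.1, p)

mu, sigma, lam, k = np.linspace(-3, 3, 10), 0.7, 1e-3, 5
design = lambda x: np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))

idx = rng.permutation(p)
folds = np.array_split(idx, k)
scores = []
for i in range(k):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    Phi = design(x[train])
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(mu)), Phi.T @ y[train])
    scores.append(np.mean((y[test] - design(x[test]) @ w) ** 2))

print("CV estimate of sigma^2:", np.mean(scores))
```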

Page 106: Introduction to  Radial Basis Function Networks

Leave-One-Out CV

Split the p available input-output patterns into a training set of size p-1 and a test set of size 1.

Average the squared error on the left-out pattern over the p possible ways of making the partition.

A special case of k-fold CV.

Page 107: Introduction to  Radial Basis Function Networks

Error Variance Predicted by LOO

A special case of k-fold CV.

Available input-output patterns: $D = \left\{ (\mathbf{x}_k, y_k) : k = 1, 2, \ldots, p \right\}$.

Training sets of LOO: $D_i = D \setminus \{ (\mathbf{x}_i, y_i) \}$, $i = 1, \ldots, p$.

$f_i$: the function learned using $D_i$ as the training set.

The estimate for the variance of the prediction error using LOO averages the error-square for the left-out element:

$$\hat{\sigma}_{LOO}^2 = \frac{1}{p} \sum_{i=1}^{p} \left( y_i - f_i(\mathbf{x}_i) \right)^2$$

Page 108: Introduction to  Radial Basis Function Networks

Error Variance Predicted by LOO

$$\hat{\sigma}_{LOO}^2 = \frac{1}{p} \sum_{i=1}^{p} \left( y_i - f_i(\mathbf{x}_i) \right)^2$$

Given a model, $f_i$ is the function with the least empirical error for $D_i$. $\hat{\sigma}_{LOO}^2$ serves as an index of the model's fitness: we want to find a model that also minimizes it.

Page 109: Introduction to  Radial Basis Function Networks

Error Variance Predicted by LOO

$$\hat{\sigma}_{LOO}^2 = \frac{1}{p} \sum_{i=1}^{p} \left( y_i - f_i(\mathbf{x}_i) \right)^2$$

How do we estimate it? Are there any efficient ways?

Page 110: Introduction to  Radial Basis Function Networks

Error Variance Predicted by LOO

For the linear RBFN, the LOO estimate has a closed form in terms of the projection matrix, with no refitting required:

$$\hat{\sigma}_{LOO}^2 = \frac{1}{p} \sum_{i=1}^{p} \left( y_i - f_i(\mathbf{x}_i) \right)^2 = p^{-1} \, \mathbf{y}^T \mathbf{P} \left( \mathrm{diag}(\mathbf{P}) \right)^{-2} \mathbf{P} \mathbf{y}$$
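The closed form (as reconstructed above) can be verified against explicit refits; a sketch for the unregularized case with an illustrative random design:

```python
import numpy as np

rng = np.random.default_rng(8)
p, m = 25, 5
Phi = rng.normal(size=(p, m))
y = rng.normal(size=p)

A = Phi.T @ Phi
P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)

# Closed form: (1/p) y^T P diag(P)^{-2} P y
d = np.diag(P)
loo_closed = (P @ y / d) @ (P @ y / d) / p

# Explicit LOO: refit p times, each time leaving one pattern out
errs = []
for i in range(p):
    keep = np.arange(p) != i
    w = np.linalg.lstsq(Phi[keep], y[keep], rcond=None)[0]
    errs.append((y[i] - Phi[i] @ w) ** 2)

print(np.isclose(loo_closed, np.mean(errs)))   # True
```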

Page 111: Introduction to  Radial Basis Function Networks

Generalized Cross-Validation

$$\hat{\sigma}_{LOO}^2 = p^{-1} \, \mathbf{y}^T \mathbf{P} \left( \mathrm{diag}(\mathbf{P}) \right)^{-2} \mathbf{P} \mathbf{y}$$

Replacing $\mathrm{diag}(\mathbf{P})$ by its average value, $\frac{\mathrm{trace}(\mathbf{P})}{p} \mathbf{I}_p$, gives

$$\hat{\sigma}_{GCV}^2 = \frac{p \, \mathbf{y}^T \mathbf{P}^2 \mathbf{y}}{\left( \mathrm{trace}(\mathbf{P}) \right)^2}$$

Page 112: Introduction to  Radial Basis Function Networks

More Criteria Based on CV

GCV (Generalized CV):
$$\hat{\sigma}_{GCV}^2 = \frac{p \, \mathbf{y}^T \mathbf{P}^2 \mathbf{y}}{\left( \mathrm{trace}(\mathbf{P}) \right)^2}$$

UEV (Unbiased Estimate of Variance):
$$\hat{\sigma}_{UEV}^2 = \frac{\mathbf{y}^T \mathbf{P}^2 \mathbf{y}}{p - \gamma}$$

FPE (Final Prediction Error; Akaike's Information Criterion):
$$\hat{\sigma}_{FPE}^2 = \frac{p + \gamma}{p (p - \gamma)} \, \mathbf{y}^T \mathbf{P}^2 \mathbf{y}$$

BIC (Bayesian Information Criterion):
$$\hat{\sigma}_{BIC}^2 = \frac{p + (\ln p - 1)\gamma}{p (p - \gamma)} \, \mathbf{y}^T \mathbf{P}^2 \mathbf{y}$$

Page 113: Introduction to  Radial Basis Function Networks

More Criteria Based on CV

All four criteria are scalings of the UEV:

$$\hat{\sigma}_{FPE}^2 = \frac{p + \gamma}{p} \hat{\sigma}_{UEV}^2, \qquad \hat{\sigma}_{GCV}^2 = \frac{p}{p - \gamma} \hat{\sigma}_{UEV}^2, \qquad \hat{\sigma}_{BIC}^2 = \frac{p + (\ln p - 1)\gamma}{p} \hat{\sigma}_{UEV}^2$$

$$\hat{\sigma}_{UEV}^2 \le \hat{\sigma}_{FPE}^2 \le \hat{\sigma}_{GCV}^2 \le \hat{\sigma}_{BIC}^2$$

Page 114: Introduction to  Radial Basis Function Networks

More Criteria Based on CV

$$\hat{\sigma}_{UEV}^2 \le \hat{\sigma}_{FPE}^2 \le \hat{\sigma}_{GCV}^2 \le \hat{\sigma}_{BIC}^2$$

All of these criteria are computed from the projection matrix $\mathbf{P}$. How is $\mathbf{P}$ obtained?
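A sketch computing all four criteria from a fitted model (using the formulas as reconstructed above, with an illustrative random design):

```python
import numpy as np

rng = np.random.default_rng(9)
p, m, lam = 50, 10, 1e-2
Phi = rng.normal(size=(p, m))
y = rng.normal(size=p)

A = Phi.T @ Phi + lam * np.eye(m)
P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
gamma = p - np.trace(P)                    # effective number of parameters
yPPy = y @ P @ P @ y                       # y^T P^2 y

uev = yPPy / (p - gamma)
fpe = (p + gamma) / (p * (p - gamma)) * yPPy
gcv = p * yPPy / np.trace(P) ** 2
bic = (p + (np.log(p) - 1) * gamma) / (p * (p - gamma)) * yPPy
print(uev, fpe, gcv, bic)                  # here: uev <= fpe <= gcv <= bic
```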

Page 115: Introduction to  Radial Basis Function Networks

Regularization

$$C = \underbrace{\sum_{k=1}^{p} \left( y^{(k)} - f\!\left(\mathbf{x}^{(k)}\right) \right)^2}_{\text{empirical error (SSE)}} + \underbrace{\sum_{i=1}^{m} \lambda_i w_i^2}_{\text{model's penalty term}}$$

Standard Ridge Regression: $\lambda_i = \lambda$ for all $i$.

Page 116: Introduction to  Radial Basis Function Networks

Regularization

With $\lambda_i = \lambda$ for all $i$, the penalty becomes $\lambda \sum_{i=1}^{m} w_i^2$:

$$C = \sum_{k=1}^{p} \left( y^{(k)} - f\!\left(\mathbf{x}^{(k)}\right) \right)^2 + \lambda \sum_{i=1}^{m} w_i^2$$

This is standard ridge regression.

Page 117: Introduction to  Radial Basis Function Networks

Solution Review

$$\min C = \sum_{k=1}^{p} \left( y^{(k)} - \sum_{i=1}^{m} w_i \,\phi_i\!\left(\mathbf{x}^{(k)}\right) \right)^2 + \lambda \sum_{i=1}^{m} w_i^2$$

$$\mathbf{w}^* = \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y}, \qquad \mathbf{A} = \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{I}_m$$

$$\mathbf{P} = \mathbf{I}_p - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T \quad \text{(used to compute the model selection criteria)}$$

Page 118: Introduction to  Radial Basis Function Networks

Example

$$y(x) = (1 - x - 2x^2)\, e^{-x^2} + \varepsilon, \qquad \varepsilon \sim N(0, 0.2^2), \qquad p = 50$$

Width of RBF: $r = 0.5$.

Page 119: Introduction to  Radial Basis Function Networks

Example

[Plot: fits of the same data across regularization parameters, compared via $\hat{\sigma}_{UEV}^2$, $\hat{\sigma}_{FPE}^2$, $\hat{\sigma}_{GCV}^2$, and $\hat{\sigma}_{BIC}^2$.]

Page 120: Introduction to  Radial Basis Function Networks

Example

[Plot: the four criteria as functions of the regularization parameter.]

How do we determine the optimal regularization parameter effectively?

Page 121: Introduction to  Radial Basis Function Networks

Optimizing the Regularization Parameter

Re-estimation formula:

$$\hat{\lambda} = \frac{\mathbf{y}^T \mathbf{P}^2 \mathbf{y} \; \mathrm{trace}\!\left( \mathbf{A}^{-1} - \lambda \mathbf{A}^{-2} \right)}{\hat{\mathbf{w}}^T \mathbf{A}^{-1} \hat{\mathbf{w}} \; \mathrm{trace}(\mathbf{P})}$$

Compute $\hat{\lambda}$ from the current solution, re-solve for $\hat{\mathbf{w}}$, and repeat until convergence.
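A sketch of the fixed-point iteration implied by the re-estimation formula (as reconstructed above; the design, data, and start value are illustrative, and convergence depends on the initial setting):

```python
import numpy as np

rng = np.random.default_rng(10)
p, m = 50, 10
Phi = rng.normal(size=(p, m))
y = rng.normal(size=p)

lam = 0.01                                  # initial setting lambda_hat(0)
I_p, I_m = np.eye(p), np.eye(m)
for t in range(100):
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * I_m)
    P = I_p - Phi @ A_inv @ Phi.T
    w = A_inv @ Phi.T @ y
    num = (y @ P @ P @ y) * np.trace(A_inv - lam * A_inv @ A_inv)
    den = (w @ A_inv @ w) * np.trace(P)
    lam_new = num / den                     # re-estimation step
    if abs(lam_new - lam) < 1e-10 * max(lam, 1e-12):
        break
    lam = lam_new

print("lambda_hat:", lam)
```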

Page 122: Introduction to  Radial Basis Function Networks

Local Ridge Regression

Re-estimation formula (as above):

$$\hat{\lambda} = \frac{\mathbf{y}^T \mathbf{P}^2 \mathbf{y} \; \mathrm{trace}\!\left( \mathbf{A}^{-1} - \lambda \mathbf{A}^{-2} \right)}{\hat{\mathbf{w}}^T \mathbf{A}^{-1} \hat{\mathbf{w}} \; \mathrm{trace}(\mathbf{P})}$$

Page 123: Introduction to  Radial Basis Function Networks

Example

$$y = \sin(12x) + \varepsilon, \qquad \varepsilon \sim N(0, 0.1^2), \qquad m = p = 50$$

Width of RBF: $r = 0.05$. Effective number of parameters: $\gamma = p - \mathrm{trace}(\mathbf{P})$.

Page 124: Introduction to  Radial Basis Function Networks

Example

[Plot: $\gamma = p - \mathrm{trace}(\mathbf{P})$ and the model selection criteria as functions of the regularization parameter for the $\sin(12x)$ data.]

Page 125: Introduction to  Radial Basis Function Networks

Example

There are two local minima of the criterion as a function of $\lambda$.

Using the above re-estimation formula, the iteration gets stuck at the nearest local minimum; that is, the solution depends on the initial setting.

Page 126: Introduction to  Radial Basis Function Networks

Example

Starting from $\hat{\lambda}(0) = 0.01$:
$$\lim_{t \to \infty} \hat{\lambda}(t) \approx 0.1$$

Page 127: Introduction to  Radial Basis Function Networks

Example

Starting from $\hat{\lambda}(0) = 10^{-5}$:
$$\lim_{t \to \infty} \hat{\lambda}(t) \approx 2.1 \times 10^{-4}$$

Page 128: Introduction to  Radial Basis Function Networks

Example

[Plot: RMSE (root mean squared error) against the true function for the two solutions. In a real case, the true function is not available, so this RMSE cannot be computed.]

Page 129: Introduction to  Radial Basis Function Networks

Example

[Plot: the same RMSE comparison for the second local minimum.]

Page 130: Introduction to  Radial Basis Function Networks

Local Ridge Regression

Standard Ridge Regression:
$$\min C = \sum_{k=1}^{p} \left( y^{(k)} - \sum_{i=1}^{m} w_i \,\phi_i\!\left(\mathbf{x}^{(k)}\right) \right)^2 + \lambda \sum_{i=1}^{m} w_i^2$$

Local Ridge Regression:
$$\min C = \sum_{k=1}^{p} \left( y^{(k)} - \sum_{i=1}^{m} w_i \,\phi_i\!\left(\mathbf{x}^{(k)}\right) \right)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$$

Page 131: Introduction to  Radial Basis Function Networks

Local Ridge Regression

$$\min C = \sum_{k=1}^{p} \left( y^{(k)} - \sum_{i=1}^{m} w_i \,\phi_i\!\left(\mathbf{x}^{(k)}\right) \right)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$$

$\lambda_j \to \infty$ implies that $\phi_j(\cdot)$ can be removed.

Page 132: Introduction to  Radial Basis Function Networks

The Solutions

$$\mathbf{w}^* = \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y}$$

– Linear regression: $\mathbf{A} = \boldsymbol{\Phi}^T \boldsymbol{\Phi}$
– Standard ridge regression: $\mathbf{A} = \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{I}_m$
– Local ridge regression: $\mathbf{A} = \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \boldsymbol{\Lambda}$

$$\mathbf{P} = \mathbf{I}_p - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T \quad \text{(used to compute the model selection criteria)}$$

Page 133: Introduction to  Radial Basis Function Networks

Optimizing the Regularization Parameters

Incremental operation. Let $\mathbf{P}$ be the current projection matrix and $\mathbf{P}_j$ the projection matrix obtained by removing $\phi_j(\cdot)$ (i.e., $\lambda_j = \infty$). They are related by a rank-one update:

$$\mathbf{P} = \mathbf{P}_j - \frac{\mathbf{P}_j \boldsymbol{\phi}_j \boldsymbol{\phi}_j^T \mathbf{P}_j}{\lambda_j + \boldsymbol{\phi}_j^T \mathbf{P}_j \boldsymbol{\phi}_j}$$

$$\hat{\sigma}_{GCV}^2 = \frac{p \, \mathbf{y}^T \mathbf{P}^2 \mathbf{y}}{\left( \mathrm{trace}(\mathbf{P}) \right)^2}$$
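The rank-one relation (as reconstructed above) can be sanity-checked numerically; in this sketch $\mathbf{P}_j$ is built directly by excluding column $j$ from $\boldsymbol{\Phi}$, with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(11)
p, m, j = 20, 5, 2
Phi = rng.normal(size=(p, m))
lam = np.full(m, 0.1)                       # local ridge parameters

def proj(Phi, lam):
    A = Phi.T @ Phi + np.diag(lam)
    return np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)

P_full = proj(Phi, lam)                     # current projection matrix P
keep = np.arange(m) != j
P_j = proj(Phi[:, keep], lam[keep])         # P_j: phi_j removed (lambda_j = infinity)

phi_j = Phi[:, j]
P_rank1 = P_j - np.outer(P_j @ phi_j, P_j @ phi_j) / (lam[j] + phi_j @ P_j @ phi_j)
print(np.allclose(P_full, P_rank1))         # True if the update formula holds
```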

Page 134: Introduction to  Radial Basis Function Networks

Optimizing the Regularization Parameters

Substitute the rank-one relation between $\mathbf{P}$ and $\mathbf{P}_j$ into $\hat{\sigma}_{GCV}^2 = \frac{p \, \mathbf{y}^T \mathbf{P}^2 \mathbf{y}}{\left( \mathrm{trace}(\mathbf{P}) \right)^2}$, then solve

$$\frac{\partial \hat{\sigma}_{GCV}^2}{\partial \lambda_j} = 0 \qquad \text{subject to } \lambda_j \ge 0$$

Page 135: Introduction to  Radial Basis Function Networks

Optimizing the Regularization Parameters

Solving $\partial \hat{\sigma}_{GCV}^2 / \partial \lambda_j = 0$ subject to $\lambda_j \ge 0$ yields a closed-form optimum, expressed through scalars built from $\mathbf{y}$, $\mathbf{P}_j$, and $\boldsymbol{\phi}_j$ (such as $a = \mathbf{y}^T \mathbf{P}_j^2 \mathbf{y}$), together with $\mathrm{trace}(\mathbf{P}_j)$ and $\boldsymbol{\phi}_j^T \mathbf{P}_j \boldsymbol{\phi}_j$. Depending on how the resulting ratios compare with $\boldsymbol{\phi}_j^T \mathbf{P}_j \boldsymbol{\phi}_j$, the optimum $\hat{\lambda}_j$ is a finite nonnegative value, $0$, or $\infty$.

Page 136: Introduction to  Radial Basis Function Networks

Optimizing the Regularization Parameters

When the optimum is $\hat{\lambda}_j = \infty$, the basis function $\phi_j(\cdot)$ is removed from the network.

Page 137: Introduction to  Radial Basis Function Networks

The Algorithm

Initialize the $\lambda_i$'s, e.g., by performing standard ridge regression.

Repeat the following until the GCV converges:
– Randomly select $j$ and compute $\hat{\lambda}_j$.
– Perform local ridge regression with $\hat{\lambda}_j$.
– If the GCV is reduced, keep $\hat{\lambda}_j$; if $\hat{\lambda}_j = \infty$, remove $\phi_j(\cdot)$.

Page 138: Introduction to  Radial Basis Function Networks

References

Kohavi, R. (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," International Joint Conference on Artificial Intelligence (IJCAI).

Orr, M. J. L. (April 1996). Introduction to Radial Basis Function Networks, http://www.anc.ed.ac.uk/~mjo/intro/intro.html.

Page 139: Introduction to  Radial Basis Function Networks

Introduction to Radial Basis Function Networks

Incremental Operations