Training Radial Basis Function Networks

Transcript of lecture slides from Christian Borgelt's course "Artificial Neural Networks and Deep Learning".

Page 1: Training Radial Basis Function Networks (title slide)

Page 2: Radial Basis Function Networks: Initialization

Let $L_{\mathrm{fixed}} = \{l_1, \ldots, l_m\}$ be a fixed learning task, consisting of $m$ training patterns $l = (\vec{\imath}^{\,(l)}, \vec{o}^{\,(l)})$.

Simple radial basis function network: one hidden neuron $v_k$, $k = 1, \ldots, m$, for each training pattern:

$$\forall k \in \{1, \ldots, m\}: \quad \vec{w}_{v_k} = \vec{\imath}^{\,(l_k)}.$$

If the activation function is the Gaussian function, the radii $\sigma_k$ are chosen heuristically as

$$\forall k \in \{1, \ldots, m\}: \quad \sigma_k = \frac{d_{\max}}{\sqrt{2m}},$$

where

$$d_{\max} = \max_{l_j, l_k \in L_{\mathrm{fixed}}} d\!\left(\vec{\imath}^{\,(l_j)}, \vec{\imath}^{\,(l_k)}\right).$$
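A minimal sketch of this initialization in Python (NumPy assumed; the function name is illustrative, not from any particular library):

```python
import numpy as np

def init_simple_rbf(X):
    """Simple RBFN initialization: one hidden neuron per training
    pattern; centers are the training inputs, and all neurons share
    the heuristic Gaussian radius sigma = d_max / sqrt(2 m)."""
    m = len(X)
    centers = X.copy()                          # w_{v_k} = i^(l_k)
    diffs = X[:, None, :] - X[None, :, :]       # pairwise differences
    d = np.sqrt((diffs ** 2).sum(axis=-1))      # pairwise Euclidean distances
    sigmas = np.full(m, d.max() / np.sqrt(2 * m))
    return centers, sigmas

# biimplication example below: d_max = sqrt(2), m = 4  =>  sigma = 1/2
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
centers, sigmas = init_simple_rbf(X)
print(sigmas)                                   # [0.5 0.5 0.5 0.5]
```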

Page 3: Radial Basis Function Networks: Initialization

Initializing the connections from the hidden to the output neurons:

$$\forall u: \quad \sum_{k=1}^{m} w_{u v_k} \operatorname{out}_{v_k}^{(l)} - \theta_u = o_u^{(l)} \qquad \text{or abbreviated} \qquad \mathbf{A} \cdot \vec{w}_u = \vec{o}_u,$$

where $\vec{o}_u = \left(o_u^{(l_1)}, \ldots, o_u^{(l_m)}\right)^{\!\top}$ is the vector of desired outputs, $\theta_u = 0$, and

$$\mathbf{A} = \begin{pmatrix}
\operatorname{out}_{v_1}^{(l_1)} & \operatorname{out}_{v_2}^{(l_1)} & \cdots & \operatorname{out}_{v_m}^{(l_1)} \\
\operatorname{out}_{v_1}^{(l_2)} & \operatorname{out}_{v_2}^{(l_2)} & \cdots & \operatorname{out}_{v_m}^{(l_2)} \\
\vdots & \vdots & & \vdots \\
\operatorname{out}_{v_1}^{(l_m)} & \operatorname{out}_{v_2}^{(l_m)} & \cdots & \operatorname{out}_{v_m}^{(l_m)}
\end{pmatrix}.$$

This is a linear system of equations that can be solved by inverting the matrix $\mathbf{A}$:

$$\vec{w}_u = \mathbf{A}^{-1} \cdot \vec{o}_u.$$
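In code, the system would be set up and solved along these lines (a sketch; numerically, solving the system directly is preferable to forming $\mathbf{A}^{-1}$ explicitly):

```python
import numpy as np

def gaussian_out(X, centers, sigmas):
    """Hidden-layer activation matrix: out_{v_k} = exp(-d^2 / (2 sigma_k^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigmas[None, :] ** 2))

def init_output_weights(X, o, centers, sigmas):
    """Solve A w_u = o_u for the hidden-to-output weights (theta_u = 0)."""
    A = gaussian_out(X, centers, sigmas)        # m x m matrix A
    return np.linalg.solve(A, o)
```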

Page 4: RBFN Initialization: Example

Simple radial basis function network for the biimplication x1 ↔ x2:

x1  x2  y
 0   0  1
 1   0  0
 0   1  0
 1   1  1

[Network diagram: the inputs x1 and x2 feed four hidden neurons with centers (0,0), (1,0), (0,1), (1,1), each with radius 1/2; their outputs reach the output neuron y (threshold 0) via the connection weights w1, w2, w3, w4.]

Page 5: RBFN Initialization: Example

Simple radial basis function network for the biimplication x1 ↔ x2

$$\mathbf{A} = \begin{pmatrix}
1 & e^{-2} & e^{-2} & e^{-4} \\
e^{-2} & 1 & e^{-4} & e^{-2} \\
e^{-2} & e^{-4} & 1 & e^{-2} \\
e^{-4} & e^{-2} & e^{-2} & 1
\end{pmatrix}
\qquad
\mathbf{A}^{-1} = \frac{1}{D}\begin{pmatrix}
a & b & b & c \\
b & a & c & b \\
b & c & a & b \\
c & b & b & a
\end{pmatrix}$$

where

$D = 1 - 4e^{-4} + 6e^{-8} - 4e^{-12} + e^{-16} \approx 0.9287$
$a = 1 - 2e^{-4} + e^{-8} \approx 0.9637$
$b = -e^{-2} + 2e^{-6} - e^{-10} \approx -0.1304$
$c = e^{-4} - 2e^{-8} + e^{-12} \approx 0.0177$

$$\vec{w}_u = \mathbf{A}^{-1} \cdot \vec{o}_u = \frac{1}{D}\begin{pmatrix} a + c \\ 2b \\ 2b \\ a + c \end{pmatrix} \approx \begin{pmatrix} 1.0567 \\ -0.2809 \\ -0.2809 \\ 1.0567 \end{pmatrix}$$
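The result can be checked numerically, e.g. with this small NumPy sketch:

```python
import numpy as np

e = np.exp
A = np.array([[1,     e(-2), e(-2), e(-4)],
              [e(-2), 1,     e(-4), e(-2)],
              [e(-2), e(-4), 1,     e(-2)],
              [e(-4), e(-2), e(-2), 1    ]])
o = np.array([1., 0., 0., 1.])                 # desired outputs (biimplication)

w = np.linalg.solve(A, o)
print(np.round(w, 4))                          # [ 1.0567 -0.2809 -0.2809  1.0567]
print(np.allclose(A @ w, o))                   # True: training data reproduced exactly
```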

Page 6: RBFN Initialization: Example

Simple radial basis function network for the biimplication x1 ↔ x2

[Plots: a single basis function, the superposition of all four basis functions, and the resulting network output y over the (x1, x2) plane.]

• Initialization already yields a perfect solution of the learning task.

• Subsequent training is not necessary.

Page 7: Radial Basis Function Networks: Initialization

Normal radial basis function networks: select a subset of k training patterns as centers.

$$\mathbf{A} = \begin{pmatrix}
1 & \operatorname{out}_{v_1}^{(l_1)} & \operatorname{out}_{v_2}^{(l_1)} & \cdots & \operatorname{out}_{v_k}^{(l_1)} \\
1 & \operatorname{out}_{v_1}^{(l_2)} & \operatorname{out}_{v_2}^{(l_2)} & \cdots & \operatorname{out}_{v_k}^{(l_2)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \operatorname{out}_{v_1}^{(l_m)} & \operatorname{out}_{v_2}^{(l_m)} & \cdots & \operatorname{out}_{v_k}^{(l_m)}
\end{pmatrix}
\qquad
\mathbf{A} \cdot \vec{w}_u = \vec{o}_u$$

Compute the (Moore–Penrose) pseudo-inverse:

$$\mathbf{A}^{+} = (\mathbf{A}^{\top}\mathbf{A})^{-1}\mathbf{A}^{\top}.$$

The weights can then be computed by

$$\vec{w}_u = \mathbf{A}^{+} \cdot \vec{o}_u = (\mathbf{A}^{\top}\mathbf{A})^{-1}\mathbf{A}^{\top} \cdot \vec{o}_u.$$
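A sketch of this initialization, reusing gaussian_out from the sketch on page 3. Note that np.linalg.pinv computes the Moore–Penrose pseudo-inverse via an SVD, which is more robust numerically than forming $(\mathbf{A}^{\top}\mathbf{A})^{-1}\mathbf{A}^{\top}$ explicitly:

```python
import numpy as np

def init_output_weights_pinv(X, o, centers, sigmas):
    """Least-squares output weights for a normal RBFN. The matrix A gets
    a leading column of ones for the bias, so the solution vector is
    (-theta, w_1, ..., w_k)."""
    H = gaussian_out(X, centers, sigmas)        # m x k hidden activations
    A = np.hstack([np.ones((len(X), 1)), H])    # prepend the bias column
    return np.linalg.pinv(A) @ o
```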

Page 8: RBFN Initialization: Example

Normal radial basis function network for the biimplication x1 ↔ x2

Select two training patterns:

• $l_1 = (\vec{\imath}^{\,(l_1)}, \vec{o}^{\,(l_1)}) = ((0,0),(1))$

• $l_4 = (\vec{\imath}^{\,(l_4)}, \vec{o}^{\,(l_4)}) = ((1,1),(1))$

[Network diagram: the inputs x1 and x2 feed two hidden neurons with centers (0,0) and (1,1), each with radius 1/2; their outputs reach the output neuron y (threshold θ) via the connection weights w1 and w2.]

Page 9: RBFN Initialization: Example

Normal radial basis function network for the biimplication x1 ↔ x2

$$\mathbf{A} = \begin{pmatrix}
1 & 1 & e^{-4} \\
1 & e^{-2} & e^{-2} \\
1 & e^{-2} & e^{-2} \\
1 & e^{-4} & 1
\end{pmatrix}
\qquad
\mathbf{A}^{+} = (\mathbf{A}^{\top}\mathbf{A})^{-1}\mathbf{A}^{\top} = \begin{pmatrix}
a & b & b & a \\
c & d & d & e \\
e & d & d & c
\end{pmatrix}$$

where $a \approx -0.1810$, $b \approx 0.6810$, $c \approx 1.1781$, $d \approx -0.6688$, $e \approx 0.1594$.

Resulting weights:

$$\vec{w}_u = \begin{pmatrix} -\theta \\ w_1 \\ w_2 \end{pmatrix} = \mathbf{A}^{+} \cdot \vec{o}_u \approx \begin{pmatrix} -0.3620 \\ 1.3375 \\ 1.3375 \end{pmatrix}.$$
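Again this can be verified numerically with a few lines of NumPy:

```python
import numpy as np

e = np.exp
A = np.array([[1, 1,     e(-4)],
              [1, e(-2), e(-2)],
              [1, e(-2), e(-2)],
              [1, e(-4), 1    ]])
o = np.array([1., 0., 0., 1.])

w = np.linalg.pinv(A) @ o                      # (-theta, w_1, w_2)
print(np.round(w, 4))                          # [-0.362  1.3375  1.3375]
print(np.round(A @ w, 4))                      # [1. 0. 0. 1.]
```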

Page 10: RBFN Initialization: Example

Normal radial basis function network for the biimplication x1 ↔ x2

[Plots: the two basis functions centered at (0,0) and (1,1), and the resulting network output y over the (x1, x2) plane. Away from both centers the output approaches −θ ≈ −0.36.]

• Initialization already yields a perfect solution of the learning task.

• This is an accident, however: the linear system of equations is not over-determined, because it contains linearly dependent equations (the rows for the patterns (1,0) and (0,1) coincide).

Page 11: Radial Basis Function Networks: Initialization

How to choose the radial basis function centers?

• Use all data points as centers for the radial basis functions.

◦ Advantages: Only the radius and the output weights need to be determined; the desired output values can be achieved exactly (unless there are inconsistencies).

◦ Disadvantages: Often far too many radial basis functions; computing the weights to the output neuron via a pseudo-inverse can become infeasible.

• Use a random subset of data points as centers for the radial basis functions.

◦ Advantages: Fast; only the radius and the output weights need to be determined.

◦ Disadvantages: Performance depends heavily on the choice of data points.

• Use the result of clustering as centers for the radial basis functions, e.g.

◦ c-means clustering (on the next slides)

◦ Learning vector quantization (to be discussed later)

Page 12: RBFN Initialization: c-means Clustering

• Choose a number c of clusters to be found (user input).

• Initialize the cluster centers randomly (for instance, by randomly selecting c data points).

• Data point assignment: Assign each data point to the cluster center that is closest to it (that is, closer than any other cluster center).

• Cluster center update: Compute new cluster centers as the mean vectors of the assigned data points. (Intuitively: the center of gravity if each data point has unit weight.)

• Repeat these two steps (data point assignment and cluster center update) until the cluster centers do not change anymore.

It can be shown that this scheme must converge, that is, the update of the cluster centers cannot go on forever.
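A minimal NumPy sketch of the procedure (the function name is illustrative):

```python
import numpy as np

def c_means(X, c, rng=None, max_iter=100):
    """Plain c-means (k-means) clustering as described above."""
    rng = np.random.default_rng() if rng is None else rng
    # initialize the centers with c randomly selected data points
    centers = X[rng.choice(len(X), size=c, replace=False)].copy()
    for _ in range(max_iter):
        # assignment step: nearest cluster center for each data point
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(c)])
        if np.allclose(new_centers, centers):
            break                               # converged: centers unchanged
        centers = new_centers
    return centers, labels
```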

Page 13: c-Means Clustering: Example

Data set to cluster. Choose c = 3 clusters. (Chosen from visual inspection here; this can be difficult to determine in general.)

Initial position of the cluster centers: randomly selected data points. (Alternative methods include e.g. Latin hypercube sampling.)

Page 14: Delaunay Triangulations and Voronoi Diagrams

• Dots represent cluster centers.

• Left: Delaunay triangulation. The circle through the corners of a triangle does not contain another point.

• Right: Voronoi diagram / tessellation. The perpendicular bisectors of the Delaunay triangulation's edges form the boundaries of the regions of points that are closest to the enclosed cluster center (Voronoi cells).
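For illustration, both structures can be computed with SciPy's scipy.spatial module (a small sketch with made-up centers):

```python
import numpy as np
from scipy.spatial import Delaunay, Voronoi

centers = np.array([[0., 0.], [2., 0.], [1., 2.], [3., 2.]])
tri = Delaunay(centers)      # triangles satisfying the empty-circle property
vor = Voronoi(centers)       # cells bounded by perpendicular bisectors

print(tri.simplices)         # index triples of the Delaunay triangles
print(vor.vertices)          # corner points of the Voronoi cells
```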

Page 15: Delaunay Triangulations and Voronoi Diagrams

• Delaunay triangulation: a single triangle (shown in gray on the left).

• Voronoi diagram: the perpendicular bisectors of the triangle's edges (shown in blue on the left, in gray on the right).

Page 16: c-Means Clustering: Example

[Figures only; not reproduced in the transcript.]

Page 17: Radial Basis Function Networks: Training

Training radial basis function networks: the derivation of the update rules is analogous to that for multi-layer perceptrons.

Weights from the hidden to the output neurons.

Gradient:

$$\vec{\nabla}_{\vec{w}_u}\, e_u^{(l)} = \frac{\partial e_u^{(l)}}{\partial \vec{w}_u} = -2\left(o_u^{(l)} - \operatorname{out}_u^{(l)}\right) \vec{\operatorname{in}}_u^{(l)},$$

Weight update rule:

$$\Delta \vec{w}_u^{(l)} = -\frac{\eta_3}{2}\, \vec{\nabla}_{\vec{w}_u}\, e_u^{(l)} = \eta_3 \left(o_u^{(l)} - \operatorname{out}_u^{(l)}\right) \vec{\operatorname{in}}_u^{(l)}$$

Typical learning rate: $\eta_3 \approx 0.001$.

(Two more learning rates are needed for the center coordinates and the radii.)
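A sketch of this online update in Python, assuming a linear output neuron with $\operatorname{out}_u = \operatorname{net}_u = \vec{w}_u \cdot \vec{\operatorname{in}}_u - \theta_u$ (all names are illustrative):

```python
import numpy as np

def update_output_weights(w_u, theta_u, in_u, o_u, eta3=0.001):
    """One online step for the hidden-to-output weights:
    Delta w_u = eta3 * (o_u - out_u) * in_u, where in_u is the
    vector of hidden-neuron outputs reaching the output neuron u."""
    out_u = w_u @ in_u - theta_u            # linear output neuron
    delta = eta3 * (o_u - out_u)
    w_u = w_u + delta * in_u                # gradient step on the weights
    theta_u = theta_u - delta               # threshold: d out / d theta = -1
    return w_u, theta_u
```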

Page 18: Radial Basis Function Networks: Training

Training radial basis function networks: center coordinates (weights from the input to the hidden neurons).

Gradient:

$$\vec{\nabla}_{\vec{w}_v}\, e^{(l)} = \frac{\partial e^{(l)}}{\partial \vec{w}_v} = -2 \sum_{s \in \operatorname{succ}(v)} \left(o_s^{(l)} - \operatorname{out}_s^{(l)}\right) w_{sv}\, \frac{\partial \operatorname{out}_v^{(l)}}{\partial \operatorname{net}_v^{(l)}}\, \frac{\partial \operatorname{net}_v^{(l)}}{\partial \vec{w}_v}$$

Weight update rule:

$$\Delta \vec{w}_v^{(l)} = -\frac{\eta_1}{2}\, \vec{\nabla}_{\vec{w}_v}\, e^{(l)} = \eta_1 \sum_{s \in \operatorname{succ}(v)} \left(o_s^{(l)} - \operatorname{out}_s^{(l)}\right) w_{sv}\, \frac{\partial \operatorname{out}_v^{(l)}}{\partial \operatorname{net}_v^{(l)}}\, \frac{\partial \operatorname{net}_v^{(l)}}{\partial \vec{w}_v}$$

Typical learning rate: $\eta_1 \approx 0.02$.

Page 19: Radial Basis Function Networks: Training

Training radial basis function networks: center coordinates (weights from the input to the hidden neurons).

Special case: Euclidean distance

$$\frac{\partial \operatorname{net}_v^{(l)}}{\partial \vec{w}_v} = \left(\sum_{i=1}^{n} \left(w_{v p_i} - \operatorname{out}_{p_i}^{(l)}\right)^2\right)^{-\frac{1}{2}} \left(\vec{w}_v - \vec{\operatorname{in}}_v^{(l)}\right).$$

Special case: Gaussian activation function

$$\frac{\partial \operatorname{out}_v^{(l)}}{\partial \operatorname{net}_v^{(l)}} = \frac{\partial f_{\mathrm{act}}\!\left(\operatorname{net}_v^{(l)}, \sigma_v\right)}{\partial \operatorname{net}_v^{(l)}} = \frac{\partial}{\partial \operatorname{net}_v^{(l)}}\, e^{-\frac{\left(\operatorname{net}_v^{(l)}\right)^2}{2\sigma_v^2}} = -\frac{\operatorname{net}_v^{(l)}}{\sigma_v^2}\, e^{-\frac{\left(\operatorname{net}_v^{(l)}\right)^2}{2\sigma_v^2}}.$$

Page 20: Radial Basis Function Networks: Training

Training radial basis function networks: radii of the radial basis functions.

Gradient:

$$\frac{\partial e^{(l)}}{\partial \sigma_v} = -2 \sum_{s \in \operatorname{succ}(v)} \left(o_s^{(l)} - \operatorname{out}_s^{(l)}\right) w_{sv}\, \frac{\partial \operatorname{out}_v^{(l)}}{\partial \sigma_v}.$$

Weight update rule:

$$\Delta \sigma_v^{(l)} = -\frac{\eta_2}{2}\, \frac{\partial e^{(l)}}{\partial \sigma_v} = \eta_2 \sum_{s \in \operatorname{succ}(v)} \left(o_s^{(l)} - \operatorname{out}_s^{(l)}\right) w_{sv}\, \frac{\partial \operatorname{out}_v^{(l)}}{\partial \sigma_v}.$$

Typical learning rate: $\eta_2 \approx 0.01$.

Page 21: Radial Basis Function Networks: Training

Training radial basis function networks: radii of the radial basis functions.

Special case: Gaussian activation function

$$\frac{\partial \operatorname{out}_v^{(l)}}{\partial \sigma_v} = \frac{\partial}{\partial \sigma_v}\, e^{-\frac{\left(\operatorname{net}_v^{(l)}\right)^2}{2\sigma_v^2}} = \frac{\left(\operatorname{net}_v^{(l)}\right)^2}{\sigma_v^3}\, e^{-\frac{\left(\operatorname{net}_v^{(l)}\right)^2}{2\sigma_v^2}}.$$

(The distance function is irrelevant for the radius update, since it only enters the network input function.)
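Combining the update rules of the preceding slides, a sketch of one complete online training step for a Gaussian RBFN with Euclidean distance and a single linear output neuron (names illustrative; a sketch under these assumptions, not a general implementation):

```python
import numpy as np

def rbf_train_step(x, o, C, S, w, theta,
                   eta1=0.02, eta2=0.01, eta3=0.001):
    """One online gradient step for a Gaussian RBFN with k hidden neurons:
    C: k x n center matrix, S: k radii, w: k hidden-to-output weights,
    theta: output threshold; x: one input pattern, o: its desired output."""
    diff = C - x                                  # (w_v - in_v), k x n
    net = np.sqrt((diff ** 2).sum(axis=1))        # Euclidean distances
    out = np.exp(-net ** 2 / (2 * S ** 2))        # hidden-neuron outputs
    err = o - (w @ out - theta)                   # (o_u - out_u), linear output

    # hidden-to-output weights and threshold (learning rate eta3)
    w_new = w + eta3 * err * out
    theta_new = theta - eta3 * err

    # centers (eta1): err * w_v * (d out / d net) * (d net / d w_v)
    dout_dnet = -(net / S ** 2) * out
    with np.errstate(invalid="ignore", divide="ignore"):
        dnet_dC = np.where(net[:, None] > 0.0, diff / net[:, None], 0.0)
    C_new = C + eta1 * (err * w * dout_dnet)[:, None] * dnet_dC

    # radii (eta2): err * w_v * (d out / d sigma)
    S_new = S + eta2 * err * w * (net ** 2 / S ** 3) * out

    return C_new, S_new, w_new, theta_new
```

Note that the negative sign of the Gaussian derivative makes the center update move a center toward the input whenever increasing its output would reduce the error.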

Page 22: Radial Basis Function Networks: Generalization

Generalization of the distance function

Idea: use an anisotropic (direction-dependent) distance function.

Example: Mahalanobis distance

$$d(\vec{x}, \vec{y}) = \sqrt{(\vec{x} - \vec{y})^{\top}\, \mathbf{\Sigma}^{-1}\, (\vec{x} - \vec{y})}.$$
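A direct transcription into Python (a small sketch):

```python
import numpy as np

def mahalanobis(x, y, Sigma):
    """d(x, y) = sqrt((x - y)^T Sigma^{-1} (x - y))."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

# covariance matrix from the biimplication example below; its strong
# positive correlation makes points along the main diagonal "close"
Sigma = np.array([[9., 8.], [8., 9.]])
print(mahalanobis(np.array([0., 0.]), np.array([1., 1.]), Sigma))  # ~0.343
```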

Example: biimplication

[Network diagram and plot: the inputs x1 and x2 feed a single hidden neuron with center (1/2, 1/2), radius 1/3, and covariance matrix
$$\mathbf{\Sigma} = \begin{pmatrix} 9 & 8 \\ 8 & 9 \end{pmatrix},$$
connected with weight 1 to the output neuron y (threshold 0). Because Σ encodes a strong positive correlation of the two inputs, the elliptical region covered by the basis function stretches along the main diagonal of the unit square, on which the true patterns (0,0) and (1,1) lie.]

Page 23: Application: Recognition of Handwritten Digits

picture not available in online version

• Images of 20,000 handwritten digits (2,000 per class), split into a training and a test data set of 10,000 samples each (1,000 per class).

• Represented in a normalized fashion as 16 × 16 gray values in {0, . . . , 255}.

• The data was originally used in the StatLog project [Michie et al. 1994].

Page 24: Application: Recognition of Handwritten Digits

• Comparison of various classifiers:

◦ Nearest Neighbor (1NN)

◦ Learning Vector Quantization (LVQ)

◦ Decision Tree (C4.5)

◦ Radial Basis Function Network (RBF)

◦ Multi-Layer Perceptron (MLP)

◦ Support Vector Machine (SVM)

• Distinction by the number of RBF training phases:

◦ 1 phase: find the output connection weights, e.g. with the pseudo-inverse.

◦ 2 phases: find the RBF centers, e.g. with some clustering method, plus 1 phase.

◦ 3 phases: 2 phases plus error backpropagation training.

• Initialization of the radial basis function centers:

◦ Random choice of data points

◦ c-means clustering

◦ Learning vector quantization

◦ Decision tree (one RBF center per leaf)

Page 25: Application: Recognition of Handwritten Digits

picture not available in online version

• The 60 cluster centers (6 per class) resulting from c-means clustering. (Clustering was conducted with c = 6 for each class separately.)

• Initial cluster centers were selected randomly from the training data.

• The weights of the connections to the output neuron were computed with the pseudo-inverse method.

Page 26: Application: Recognition of Handwritten Digits

picture not available in online version

• The 60 cluster centers (6 per class) after training the radial basis function network with error backpropagation.

• Differences between the initial and the trained centers of the radial basis functions appear to be fairly small, but ...

Page 27: Application: Recognition of Handwritten Digits

picture not available in online version

• Distance matrices showing the Euclidean distances between the 60 radial basis function centers before and after training.

• Centers are sorted by class/digit: the first 6 rows/columns refer to digit 0, the next 6 rows/columns to digit 1, etc.

• Distances are encoded as gray values: darker means smaller distance.

Page 28: Application: Recognition of Handwritten Digits

picture not available in online version

• Before training (left): many distances between centers of different classes/digits are small (e.g. 2-3, 3-8, 3-9, 5-8, 5-9), which increases the chance of misclassifications.

• After training (right): only very few small distances between centers of different classes/digits; basically all small distances are between centers of the same class/digit.

Page 29: Application: Recognition of Handwritten Digits

Classification results:

Classifier                            Accuracy
Nearest Neighbor (1NN)                  97.68%
Learning Vector Quantization (LVQ)      96.99%
Decision Tree (C4.5)                    91.12%
2-Phase-RBF (data points)               95.24%
2-Phase-RBF (c-means)                   96.94%
2-Phase-RBF (LVQ)                       95.86%
2-Phase-RBF (C4.5)                      92.72%
3-Phase-RBF (data points)               97.23%
3-Phase-RBF (c-means)                   98.06%
3-Phase-RBF (LVQ)                       98.49%
3-Phase-RBF (C4.5)                      94.83%
Support Vector Machine (SVM)            98.76%
Multi-Layer Perceptron (MLP)            97.59%

• Model sizes: LVQ: 200 vectors (20 per class); C4.5: 505 leaves; c-means: 60 centers (6 per class); SVM: 10 classifiers, ≈ 4200 vectors; MLP: 1 hidden layer with 200 neurons.

• Results are medians of three training/test runs.

• Error backpropagation improves the RBF results.
