Radial Basis Function ANN, an alternative to back propagation, uses clustering of examples in the training set
Example of Radial Basis Function (RBF) network
Input vector, d dimensions
K radial basis functions
Single output
Structure used for multivariate regression or binary classification
Review: RBF network provides an alternative to back propagation. Each hidden node is associated with a cluster of input instances. The hidden layer is connected to the output by linear least squares.
Gaussians are the most frequently used radial basis function: φ_j(x) = exp(−½ (‖x − m_j‖ / s_j)²)
Clusters of input instances are parameterized by a mean and variance
With r the column vector of training targets and D the N×K matrix of basis-function outputs, D_tj = φ_j(x^t):

r = (r^1, r^2, …, r^N)ᵀ

D = [ φ_1(x^1)  φ_2(x^1)  …  φ_K(x^1)
      φ_1(x^2)  φ_2(x^2)  …  φ_K(x^2)
      ⋮
      φ_1(x^N)  φ_2(x^N)  …  φ_K(x^N) ]
Linear least squares with basis functions
Given training set X = {x^t, r^t}, t = 1…N,
and the mean and variance of K clusters of input data, construct the N×K matrix D and column vector r.
Add a column of ones to include a bias node. Solve the normal equations DᵀDw = Dᵀr for a vector w of K weights connecting hidden nodes to the output node.
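This construction can be sketched in Python with NumPy. All function and variable names below are illustrative, not from the slides, and `np.linalg.lstsq` is used rather than explicitly forming DᵀD, which solves the same least-squares problem more stably.

```python
import numpy as np

def design_matrix(X, means, widths):
    """Build the N x K matrix with entries exp(-0.5 * (||x_t - m_j|| / s_j)^2),
    plus a column of ones for the bias node."""
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # N x K
    D = np.exp(-0.5 * (dists / widths) ** 2)
    return np.hstack([D, np.ones((X.shape[0], 1))])  # bias column

def fit_rbf(X, r, means, widths):
    """Solve the least-squares problem (equivalent to D^T D w = D^T r)."""
    D = design_matrix(X, means, widths)
    w, *_ = np.linalg.lstsq(D, r, rcond=None)
    return w

# Toy usage: two fixed centers, 1-D targets
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
means = np.array([[-1.0, 0.0], [1.0, 0.0]])
widths = np.array([1.0, 1.0])
r = np.sin(X[:, 0])
w = fit_rbf(X, r, means, widths)
pred = design_matrix(X, means, widths) @ w
```

In practice the means and widths would come from a clustering step such as k-means, as described later in these slides.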
RBF networks perform best with large datasets
With large datasets, expect redundancy (i.e. multiple examples expressing the same general pattern)
In RBF network, hidden layer is a feature-space representation of the data where redundancy has been used to reduce noise.
A validation set may be helpful to determine K, the best number of clusters of input data
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
Supervised learning: mapping input to output
Unsupervised learning: find regularities in the input. Regularities reflect some probability distribution of attribute vectors, p(x^t). Discovering p(x^t) is called "density estimation". A parametric method uses MLE to find θ in p(x^t|θ).
In clustering, we look for regularities as group membership. Assume we know the number of clusters, K. Given K and dataset X, we want to find
the size of each group P(G_i) and its component density p(x|G_i)
Background on clustering
Define group labels based on nearest center
Get new trial centers
Find group labels using the geometric interpretation of a cluster as points in attribute space closer to a “center” than they are to data points not in the cluster
Define trial centers by reference vectors m_j, j = 1…K
b_i^t = 1 if ‖x^t − m_i‖ = min_j ‖x^t − m_j‖, and 0 otherwise

m_i = Σ_t b_i^t x^t / Σ_t b_i^t

Judge convergence by E({m_i}_{i=1}^K | X) = Σ_t Σ_i b_i^t ‖x^t − m_i‖²
K-Means Clustering: hard labels
K-means clustering pseudo code
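The pseudo code can be sketched as follows (illustrative names, initial centers supplied explicitly for reproducibility): assign hard labels by nearest center, then recompute each center as the mean of the instances it won.

```python
import numpy as np

def kmeans(X, K, init=None, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = init if init is not None else X[rng.choice(len(X), K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                      # hard labels b_i^t
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centers[i] for i in range(K)])  # new trial centers
        if np.allclose(new, centers):                  # converged
            break
        centers = new
    return centers, labels

# Two well-separated blobs; init near each blob
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),
               rng.normal(5, 0.3, size=(30, 2))])
centers, labels = kmeans(X, K=2, init=np.array([[1.0, 1.0], [4.0, 4.0]]))
```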
Example of pseudo code application
Example of K-means with arbitrary starting centers and convergence plot
Convergence
K-means is an example of the Expectation-Maximization (EM) approach to MLE
L(Φ | X) = Σ_t log Σ_{i=1}^k p(x^t | G_i) P(G_i)
Log likelihood of the mixture model cannot be solved analytically for Φ
Use a 2-step iterative method:
E-step: estimate labels of x^t given current knowledge of the mixture components
M-step: update component knowledge using the labels from the E-step
E - step
M - step
K-means clustering pseudo code with steps labeled
Given converged K-means centers, estimate the variance for RBFs by s² = d²_max / (2K), where d_max is the largest distance between clusters.
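This heuristic can be computed directly from a set of converged centers; the toy center values below are my own, chosen only to make the arithmetic checkable.

```python
import numpy as np

# Toy k-means centers (illustrative values)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
K = len(centers)

# Largest distance between any two centers: here the 3-4-5 triangle gives 5
pairwise = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
d_max = pairwise.max()

# Shared RBF variance from the slide's heuristic s^2 = d_max^2 / (2K)
s2 = d_max ** 2 / (2 * K)
```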
Gaussian mixture theory is another approach to getting RBFs
Application of K-means clustering to RBF-ANN
p(x^t) = Σ_{i=1}^k p(x^t | G_i) P(G_i)
X = {x^t}_t is made up of K groups (clusters)
P(G_i): proportion of X in group i
Attributes in each group are Gaussian distributed:
p(x^t | G_i) = N_d(μ_i, Σ_i)
m_i: mean of the x^t in group i
S_i: covariance matrix of the x^t in group i
Distribution of attributes is mixture of Gaussians
Gaussian Mixture Densities
Given a group label for each data point, r_i^t, MLE provides estimates of the parameters of Gaussian mixtures
where p ( x | Gi) ~ N ( μi , ∑i )
Φ = {P (Gi ), μi , ∑i }i=1 to k
p(x^t) = Σ_{i=1}^k p(x^t | G_i) P(G_i)
P(G_i) = Σ_t r_i^t / N

m_i = Σ_t r_i^t x^t / Σ_t r_i^t

S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)ᵀ / Σ_t r_i^t
Estimators
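These estimators can be sketched in NumPy, assuming hard labels are given as an integer array (names are illustrative):

```python
import numpy as np

def mixture_mle(X, labels, K):
    """MLE of P(G_i), m_i, S_i from hard labels r_i^t encoded as labels[t] = i."""
    N, d = X.shape
    P = np.zeros(K)
    means = np.zeros((K, d))
    covs = np.zeros((K, d, d))
    for i in range(K):
        mask = labels == i
        Ni = mask.sum()
        P[i] = Ni / N                      # group proportion
        means[i] = X[mask].mean(axis=0)    # group mean
        diff = X[mask] - means[i]
        covs[i] = diff.T @ diff / Ni       # group covariance (MLE, divides by Ni)
    return P, means, covs

# Tiny hand-checkable example: two groups of two points each
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
P, means, covs = mixture_mle(X, labels, K=2)
```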
p(x) = N(μ, σ²):
p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))
MLE for μ and σ²:
m = Σ_t x^t / N
s² = Σ_t (x^t − m)² / N
1D Gaussian distribution
x ~ N_d(μ, Σ):
p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
Mahalanobis distance: (x − μ)ᵀ Σ⁻¹ (x − μ), analogous to (x − m)²/s²
x − μ is a d×1 column vector, Σ is a d×d matrix, and the M-distance is a scalar
Measures distance of x from the mean in units of Σ
d denotes number of attributes
dD Gaussian distribution
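The Mahalanobis distance above can be sketched as follows (illustrative names; `np.linalg.solve` is used instead of forming the explicit inverse):

```python
import numpy as np

def mahalanobis_sq(x, mu, S):
    """Squared Mahalanobis distance (x - mu)^T S^{-1} (x - mu), a scalar."""
    diff = x - mu
    return float(diff @ np.linalg.solve(S, diff))

mu = np.zeros(2)
S = np.diag([4.0, 1.0])    # variance 4 along axis 0, variance 1 along axis 1
# A point 2 units out along the high-variance axis is only 1 "unit" away:
d2 = mahalanobis_sq(np.array([2.0, 0.0]), mu, S)
```

Note that the squared Euclidean distance of the same point is 4; dividing by the per-axis variance is what "units of Σ" means.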
• If the x_i are independent, the off-diagonals of Σ are 0
• p(x) is the product of the probabilities for each component of x
p(x) = Π_{i=1}^d p_i(x_i) = (1 / ((2π)^{d/2} Π_{i=1}^d s_i)) exp(−½ Σ_{i=1}^d ((x_i − m_i) / s_i)²)
P(G_i) = Σ_t h_i^t / N

m_i^{l+1} = Σ_t h_i^t x^t / Σ_t h_i^t

S_i^{l+1} = Σ_t h_i^t (x^t − m_i^{l+1})(x^t − m_i^{l+1})ᵀ / Σ_t h_i^t
Replace the hard labels r_i^t by soft labels h_i^t, the probability that x^t belongs to cluster i. Assume that the cluster densities p(x^t | Φ) are Gaussian; then the mixture proportions, means, and covariance matrices are estimated by the updates above,
where the h_i^t are the soft labels from the previous E-step
Gaussian mixture model by EM: soft labels
Initialize by k-means clustering. After a few iterations, use the centers m_i and the instances covered by each center to estimate the covariance matrices S_i and mixture proportions π_i
From m_i, S_i, and π_i, calculate the soft labels h_i^t by
h_i^t = π_i |S_i|^{−1/2} exp[−½ (x^t − m_i)ᵀ S_i⁻¹ (x^t − m_i)] / Σ_j π_j |S_j|^{−1/2} exp[−½ (x^t − m_j)ᵀ S_j⁻¹ (x^t − m_j)]
Calculate new proportions, centers and covariance by
Use these to calculate new soft labels
Gaussian mixture model by EM: soft labels
P(G_i) = Σ_t h_i^t / N

m_i^{l+1} = Σ_t h_i^t x^t / Σ_t h_i^t

S_i^{l+1} = Σ_t h_i^t (x^t − m_i^{l+1})(x^t − m_i^{l+1})ᵀ / Σ_t h_i^t
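The full E-step/M-step loop can be sketched as below. Two departures from the slides, both my own simplifications: initialization from k-means is replaced by fixed starting guesses, and a tiny ridge term keeps the covariances invertible.

```python
import numpy as np

def gaussian_pdf(X, mu, S):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(S)
    expo = -0.5 * np.einsum('td,de,te->t', diff, inv, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

def em_gmm(X, pis, means, covs, n_iter=50):
    K = len(pis)
    for _ in range(n_iter):
        # E-step: soft labels h_i^t
        dens = np.stack([pis[i] * gaussian_pdf(X, means[i], covs[i])
                         for i in range(K)], axis=1)
        h = dens / dens.sum(axis=1, keepdims=True)
        # M-step: proportions, means, covariances
        Ni = h.sum(axis=0)
        pis = Ni / len(X)
        means = (h.T @ X) / Ni[:, None]
        covs = np.stack([(h[:, i, None] * (X - means[i])).T @ (X - means[i]) / Ni[i]
                         + 1e-6 * np.eye(X.shape[1]) for i in range(K)])
    return pis, means, covs

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(4, 0.5, (40, 2))])
pis, means, covs = em_gmm(X, np.array([0.5, 0.5]),
                          np.array([[0.5, 0.5], [3.5, 3.5]]),
                          np.stack([np.eye(2), np.eye(2)]))
```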
K-means: hard labels, centers marked
EM Gaussian mixtures with soft labels: contours show 1 standard deviation; colors show mixture proportions
k-means hard labels
P(G_1|x) = 0.5
Data points color coded by the greater soft label. Contours show m ± s of the Gaussian densities. The dashed contour is the "separating" curve.
Gaussian mixtures, soft labels; x marks cluster mean
Outliers?
In applications of Gaussian mixtures to RBFs, correlation of attributes is ignored and the diagonal elements of the covariance matrix are equal.
In this approximation the Mahalanobis distance reduces to Euclidean distance.
P(G_i) = Σ_t h_i^t / N

m_i = Σ_t h_i^t x^t / Σ_t h_i^t

s_i² = Σ_t h_i^t ‖x^t − m_i‖² / Σ_t h_i^t
Variance parameter of radial basis function becomes a scalar
• Cluster based on similarities (distances)
• Distance measure between instances x^r and x^s
Minkowski (Lp) (Euclidean for p = 2)
City-block distance
d_m(x^r, x^s) = [ Σ_{j=1}^d (x_j^r − x_j^s)^p ]^{1/p}

d_cb(x^r, x^s) = Σ_{j=1}^d |x_j^r − x_j^s|
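Both distance measures are one-liners; in this sketch the absolute value is applied inside the Minkowski sum as well, so that odd p behaves sensibly (function names are mine):

```python
import numpy as np

def minkowski(xr, xs, p):
    """Minkowski (L_p) distance; Euclidean for p = 2, city-block for p = 1."""
    return float(np.sum(np.abs(xr - xs) ** p) ** (1.0 / p))

def city_block(xr, xs):
    """City-block (L_1) distance."""
    return float(np.sum(np.abs(xr - xs)))

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
```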
Hierarchical Clustering
• Start with N groups each with one instance and merge the two closest groups at each iteration
• Distance between two groups Gi and Gj:
• Single-link: smallest distance between all possible pairs of instances, one from each group
• Complete-link: largest distance between all possible pairs of instances, one from each group
• Average-link: distance between centroids
Single-link: d(G_i, G_j) = min over x^r ∈ G_i, x^s ∈ G_j of d(x^r, x^s)

Complete-link: d(G_i, G_j) = max over x^r ∈ G_i, x^s ∈ G_j of d(x^r, x^s)
Agglomerative Clustering
Dendrogram
At heights h between √2 and 2, the dendrogram has the 3 clusters shown on the data graph. At h > 2 the dendrogram shows 2 clusters; c, d, and f form one cluster at this distance.
Example: single-linked clusters
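A naive single-link agglomerative sketch of the procedure above: start with one group per instance and repeatedly merge the pair of groups with the smallest minimum pairwise distance, stopping at a height threshold h (the dendrogram cut). Quadratic-time and with illustrative names; real implementations use cleverer data structures.

```python
import numpy as np

def single_link(X, h):
    """Merge groups bottom-up until the closest pair is farther apart than h."""
    groups = [[i] for i in range(len(X))]
    def dist(g1, g2):  # single-link group distance: min over cross pairs
        return min(np.linalg.norm(X[r] - X[s]) for r in g1 for s in g2)
    while len(groups) > 1:
        pairs = [(dist(groups[a], groups[b]), a, b)
                 for a in range(len(groups)) for b in range(a + 1, len(groups))]
        d, a, b = min(pairs)
        if d > h:          # next merge would be above the cut height
            break
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups

# Two tight pairs far apart: a threshold of 1 leaves exactly two clusters
X = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 0.0], [10.5, 0.0]])
clusters = single_link(X, h=1.0)
```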
• Application specific
• Plot data (after PCA, for example) and check for clusters
• Add one at a time using a validation set
Choosing K (how many clusters?)