Radial Basis Function ANN, an alternative to back propagation, uses clustering of examples in the training set

Transcript of lecture slides (28 pages), based on Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e © 2010 The MIT Press (V1.0).

Page 1: Title slide.

Page 2:

Example of a Radial Basis Function (RBF) network

Input vector of d dimensions

K radial basis functions

Single output

Structure used for multivariate regression or binary classification

Page 3:

Review: the RBF network provides an alternative to back propagation. Each hidden node is associated with a cluster of input instances, and the hidden layer is connected to the output by linear least squares.

Gaussians are the most frequently used radial basis function:

$$\phi_j(x) = \exp\left(-\frac{1}{2}\,\frac{\|x - m_j\|^2}{s_j^2}\right)$$

Clusters of input instances are parameterized by a mean $m_j$ and variance $s_j^2$.

Page 4:

Linear least squares with basis functions

Given a training set $X = \{x^t, r^t\}_{t=1}^{N}$ and the mean and variance of K clusters of input data, construct the $N \times K$ matrix $D$ of basis-function activations and the column vector $r$ of targets:

$$D = \begin{pmatrix}
\phi_1(x^1) & \phi_2(x^1) & \phi_3(x^1) & \cdots & \phi_K(x^1) \\
\phi_1(x^2) & \phi_2(x^2) & \phi_3(x^2) & \cdots & \phi_K(x^2) \\
\vdots & \vdots & \vdots & & \vdots \\
\phi_1(x^N) & \phi_2(x^N) & \phi_3(x^N) & \cdots & \phi_K(x^N)
\end{pmatrix},
\qquad
r = \begin{pmatrix} r^1 \\ r^2 \\ \vdots \\ r^N \end{pmatrix}$$

Add a column of ones to $D$ to include a bias node. Solve the normal equations $D^T D w = D^T r$ for the vector $w$ of K+1 weights (K hidden-node weights plus a bias) connecting the hidden nodes to the output node.
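To make the construction concrete, here is a minimal NumPy sketch (function and variable names are illustrative, not from the slides): it builds D from given cluster means and widths, appends the bias column, and solves the least-squares problem.

```python
import numpy as np

def rbf_design_matrix(X, centers, widths):
    """Build the N x K matrix D with D[t, k] = exp(-0.5 * ||x^t - m_k||^2 / s_k^2)."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # N x K
    return np.exp(-0.5 * sq_dists / widths ** 2)

def fit_rbf_weights(X, r, centers, widths):
    """Least-squares fit of hidden-to-output weights, with a bias column of ones."""
    D = rbf_design_matrix(X, centers, widths)
    D = np.hstack([D, np.ones((len(D), 1))])  # column of ones = bias node
    # lstsq solves the same problem as the normal equations D^T D w = D^T r,
    # but is numerically safer than forming D^T D explicitly
    w, *_ = np.linalg.lstsq(D, r, rcond=None)
    return w  # K hidden-node weights plus one bias weight
```

Predictions for new inputs are then `D_new @ w`, with `D_new` built the same way (including the bias column).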

Page 5:

RBF networks perform best with large datasets

With large datasets, expect redundancy (i.e. multiple examples expressing the same general pattern)

In RBF network, hidden layer is a feature-space representation of the data where redundancy has been used to reduce noise.

A validation set may be helpful to determine K, the best number of clusters of input data.

Page 6:


Supervised learning: mapping input to output

Unsupervised learning: find regularities in the input. Regularities reflect some probability distribution of attribute vectors, $p(x^t)$. Discovering $p(x^t)$ is called "density estimation". The parametric method uses MLE to find $\theta$ in $p(x^t \mid \theta)$.

In clustering, we look for regularities as group membership. Assume we know the number of clusters, K. Given K and dataset X, we want to find the size of each group, $P(G_i)$, and its component density, $p(x \mid G_i)$.

Background on clustering

Page 7:


Find group labels using the geometric interpretation of a cluster as points in attribute space that are closer to a "center" than they are to data points not in the cluster.

Define trial centers by reference vectors $m_j$, $j = 1 \ldots K$.

Define group labels based on the nearest center:

$$b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}$$

Get new trial centers:

$$m_i = \frac{\sum_t b_i^t x^t}{\sum_t b_i^t}$$

Judge convergence by

$$E\left(\{m_i\}_{i=1}^{K} \mid X\right) = \sum_t \sum_i b_i^t \, \|x^t - m_i\|^2$$

K-Means Clustering: hard labels

Page 8:

K-means clustering pseudo code
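The pseudocode figure itself is not reproduced in this transcript; the following is a minimal NumPy sketch of the loop described on the previous page (initialization by randomly chosen instances is an assumption, and all names are illustrative):

```python
import numpy as np

def k_means(X, K, n_iters=100, rng=None):
    """Hard-label K-means: alternate nearest-center labeling and center updates."""
    rng = np.random.default_rng(rng)
    # Initialize centers with K randomly chosen training instances
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Label each instance with its nearest center (the b_i^t of the slide)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the instances it covers
        new_centers = np.array([X[labels == i].mean(axis=0) if (labels == i).any()
                                else centers[i] for i in range(K)])
        if np.allclose(new_centers, centers):
            break  # converged: the error E no longer decreases
        centers = new_centers
    return centers, labels
```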


Page 9:


Example of pseudo code application

Page 10:

Example of K-means with arbitrary starting centers and convergence plot


Page 11:

K-means is an example of the Expectation-Maximization (EM) approach to MLE

The log likelihood of the mixture model,

$$\mathcal{L}(\Phi \mid X) = \sum_t \log \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i),$$

cannot be solved analytically for $\Phi$.

Use a 2-step iterative method:

E-step: estimate the labels of $x^t$ given current knowledge of the mixture components.

M-step: update component knowledge using the labels from the E-step.

Page 12:

K-means clustering pseudo code with steps labeled (E-step and M-step)

Page 13:

Given converged K-means centers, estimate the variance for RBFs by $s^2 = d_{\max}^2 / (2K)$, where $d_{\max}$ is the largest distance between clusters.

Gaussian mixture theory is another approach to getting RBFs

Application of K-means clustering to RBF-ANN
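As a short sketch, the width heuristic applied to converged centers (assuming a centers array like the one returned by the K-means sketch above):

```python
import numpy as np

def rbf_variance(centers):
    """Heuristic s^2 = d_max^2 / (2K), d_max = largest inter-center distance."""
    K = len(centers)
    diffs = centers[:, None, :] - centers[None, :, :]
    d_max = np.sqrt((diffs ** 2).sum(axis=2)).max()
    return d_max ** 2 / (2 * K)  # one shared variance for all K basis functions
```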

Page 14:

$X = \{x^t\}_t$ is made up of K groups (clusters).

$P(G_i)$: proportion of X in group i.

Attributes in each group are Gaussian distributed: $p(x^t \mid G_i) = \mathcal{N}_d(\mu_i, \Sigma_i)$, where $\mu_i$ is the mean of the $x^t$ in group i and $\Sigma_i$ is the covariance matrix of the $x^t$ in group i.

The distribution of attributes is a mixture of Gaussians:

$$p(x^t) = \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i)$$

Gaussian Mixture Densities

Page 15:

Given a group label $r_i^t$ for each data point, MLE provides estimates of the parameters of the Gaussian mixture

$$p(x^t) = \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i),$$

where $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$ and $\Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}$:

$$\hat{P}(G_i) = \frac{\sum_t r_i^t}{N}, \qquad
m_i = \frac{\sum_t r_i^t x^t}{\sum_t r_i^t}, \qquad
S_i = \frac{\sum_t r_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t r_i^t}$$

Estimators
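A minimal sketch of these hard-label estimators (illustrative names; labels plays the role of $r_i^t$, with labels[t] == i meaning $r_i^t = 1$):

```python
import numpy as np

def mixture_mle_hard(X, labels, K):
    """MLE of mixture parameters from hard labels; assumes every group is non-empty."""
    N, d = X.shape
    priors, means, covs = np.zeros(K), np.zeros((K, d)), np.zeros((K, d, d))
    for i in range(K):
        Xi = X[labels == i]                        # instances with r_i^t = 1
        priors[i] = len(Xi) / N                    # P(G_i)
        means[i] = Xi.mean(axis=0)                 # m_i
        centered = Xi - means[i]
        covs[i] = centered.T @ centered / len(Xi)  # S_i
    return priors, means, covs
```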

Page 16:

$p(x) = \mathcal{N}(\mu, \sigma^2)$:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

MLE for $\mu$ and $\sigma^2$:

$$m = \frac{\sum_t x^t}{N}, \qquad s^2 = \frac{\sum_t (x^t - m)^2}{N}$$

1D Gaussian distribution

Page 17:

$x \sim \mathcal{N}_d(\mu, \Sigma)$:

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

Mahalanobis distance: $(x-\mu)^T \Sigma^{-1} (x-\mu)$, analogous to $(x-m)^2/s^2$ in 1D.

$x - \mu$ is a $d \times 1$ column vector, $\Sigma$ is a $d \times d$ matrix, and the M-distance is a scalar.

It measures the distance of x from the mean in units of $\Sigma$.

d denotes the number of attributes.

d-dimensional Gaussian distribution
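A small sketch of the computation (illustrative names; solving the linear system avoids forming $\Sigma^{-1}$ explicitly):

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))  # scalar

# With Sigma = s**2 * np.eye(d) this reduces to ||x - mu||**2 / s**2,
# the isotropic special case used for RBFs later in these slides.
```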

Page 18:

• If the $x_i$ are independent, the off-diagonals of $\Sigma$ are 0.

• $p(x)$ is the product of the probabilities for each component of x:

$$p(x) = \prod_{i=1}^{d} p_i(x_i) = \frac{1}{(2\pi)^{d/2} \prod_{i=1}^{d} s_i} \exp\left(-\frac{1}{2} \sum_{i=1}^{d} \frac{(x_i - m_i)^2}{s_i^2}\right)$$
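A sketch of that factorized density for the diagonal, independent-attribute case (illustrative names):

```python
import numpy as np

def diag_gaussian_pdf(x, m, s):
    """Density of N(m, diag(s^2)): a product of d independent 1D Gaussians."""
    d = len(x)
    norm_const = (2 * np.pi) ** (d / 2) * np.prod(s)
    return np.exp(-0.5 * np.sum(((x - m) / s) ** 2)) / norm_const
```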

Page 19:

Replace the hard labels $r_i^t$ by soft labels $h_i^t$, the probability that $x^t$ belongs to cluster i. Assume that the cluster densities $p(x^t \mid \Phi)$ are Gaussian; then the mixture proportions, means, and covariance matrices are estimated by

$$\pi_i = \frac{\sum_t h_i^t}{N}, \qquad
m_i = \frac{\sum_t h_i^t x^t}{\sum_t h_i^t}, \qquad
S_i = \frac{\sum_t h_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t h_i^t}$$

where the $h_i^t$ are the soft labels from the previous E-step.

Gaussian mixture model by EM: soft labels

Page 20:

Initialize by K-means clustering. After a few iterations, use the centers $m_i$ and the instances covered by each center to estimate the covariance matrices $S_i$ and mixture proportions $\pi_i$.

From $m_i$, $S_i$, and $\pi_i$, calculate the soft labels $h_i^t$ by

$$h_i^t = \frac{\pi_i\, |S_i|^{-1/2} \exp\left[-\tfrac{1}{2}(x^t - m_i)^T S_i^{-1} (x^t - m_i)\right]}
{\sum_j \pi_j\, |S_j|^{-1/2} \exp\left[-\tfrac{1}{2}(x^t - m_j)^T S_j^{-1} (x^t - m_j)\right]}$$

Calculate new proportions, centers, and covariances with the M-step formulas on the previous page, then use these to calculate new soft labels.

Gaussian mixture model by EM: soft labels

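Putting pages 19 and 20 together, a compact sketch of the EM loop (illustrative code under simplifying assumptions: fixed iteration count, no convergence test, and no guard against singular covariances; a production version would work with log densities):

```python
import numpy as np

def gmm_em(X, centers, n_iters=50):
    """EM for a Gaussian mixture, initialized from K-means centers."""
    N, d = X.shape
    K = len(centers)
    means = centers.copy()
    covs = np.stack([np.cov(X.T) for _ in range(K)])  # crude initial S_i
    priors = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: soft labels h_i^t proportional to pi_i * N(x^t | m_i, S_i);
        # the (2*pi)^(d/2) constant cancels in the normalization below
        h = np.empty((N, K))
        for i in range(K):
            diff = X - means[i]
            quad = np.einsum('nd,dk,nk->n', diff, np.linalg.inv(covs[i]), diff)
            h[:, i] = priors[i] * np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(covs[i]))
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate proportions, means, covariances from soft labels
        Ni = h.sum(axis=0)
        priors = Ni / N
        means = (h.T @ X) / Ni[:, None]
        for i in range(K):
            diff = X - means[i]
            covs[i] = (h[:, i, None] * diff).T @ diff / Ni[i]
    return priors, means, covs, h
```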

Page 21:


K-means: hard labels, centers marked

EM Gaussian mixtures with soft labels: contours show 1 standard deviation; colors show mixture proportions

Page 22:

K-means hard labels

Page 23:

Data points are color coded by their greater soft label. Contours show one standard deviation (m + s) of the Gaussian densities. The dashed contour is the "separating" curve, where P(G1|x) = 0.5.

Gaussian mixtures, soft labels; x marks each cluster mean.

Outliers?

Page 24:


In applications of Gaussian mixtures to RBFs, correlation of attributes is ignored and the diagonal elements of the covariance matrix are equal.

In this approximation the Mahalanobis distance reduces to the Euclidean distance, and the estimators become

$$\pi_i = \frac{\sum_t h_i^t}{N}, \qquad
m_i = \frac{\sum_t h_i^t x^t}{\sum_t h_i^t}, \qquad
s_i^2 = \frac{\sum_t h_i^t\, \|x^t - m_i\|^2}{\sum_t h_i^t}$$

Variance parameter of radial basis function becomes a scalar

Page 25:

• Cluster based on similarities (distances)

• Distance measure between instances $x^r$ and $x^s$:

Minkowski ($L_p$) (Euclidean for p = 2):

$$d_m(x^r, x^s) = \left(\sum_{j=1}^{d} |x_j^r - x_j^s|^p\right)^{1/p}$$

City-block distance:

$$d_{cb}(x^r, x^s) = \sum_{j=1}^{d} |x_j^r - x_j^s|$$

Hierarchical Clustering
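The two distance measures as short Python sketches (illustrative helper names):

```python
import numpy as np

def minkowski(xr, xs, p=2):
    """Minkowski (L_p) distance between two instances; Euclidean for p = 2."""
    return float((np.abs(xr - xs) ** p).sum() ** (1.0 / p))

def city_block(xr, xs):
    """City-block (L_1) distance."""
    return float(np.abs(xr - xs).sum())
```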

Page 26:

• Start with N groups, each with one instance, and merge the two closest groups at each iteration.

• Distance between two groups $G_i$ and $G_j$:

• Single-link: smallest distance between all possible pairs of instances, one from each group:

$$d(G_i, G_j) = \min_{x^r \in G_i,\; x^s \in G_j} d(x^r, x^s)$$

• Complete-link: largest distance between all possible pairs of instances:

$$d(G_i, G_j) = \max_{x^r \in G_i,\; x^s \in G_j} d(x^r, x^s)$$

• Average-link: distance between centroids.

Agglomerative Clustering
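A naive single-link agglomeration sketch (quadratic scans for clarity, illustrative only; scipy.cluster.hierarchy.linkage implements the same idea efficiently):

```python
import numpy as np

def single_link_merge_order(X):
    """Naive single-link agglomeration: repeatedly merge the two closest groups."""
    groups = [[i] for i in range(len(X))]
    merges = []
    while len(groups) > 1:
        best = (np.inf, 0, 1)
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Single-link distance: min over all cross-group instance pairs
                d = min(np.linalg.norm(X[r] - X[s])
                        for r in groups[a] for s in groups[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((groups[a], groups[b], d))
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return merges  # merge heights give the dendrogram on the next slide
```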

Page 27:


Dendrogram


At heights h between sqrt(2) and 2, the dendrogram has the 3 clusters shown on the data graph. At h > 2 the dendrogram shows 2 clusters; c, d, and f form one cluster at this distance.

Example: single-linked clusters
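To draw this kind of dendrogram, a sketch using SciPy; the data array here is a hypothetical stand-in, not the slide's actual points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical 2-D points standing in for the slide's labeled instances
X = np.array([[0, 0], [1, 0], [4, 0], [5, 0], [5, 1], [9, 0]])
Z = linkage(X, method='single')  # single-link merge heights
dendrogram(Z, labels=list('abcdef'))
plt.ylabel('merge distance h')
plt.show()
```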

Page 28:

• Application specific

• Plot data (after PCA, for example) and check for clusters

• Add clusters one at a time using a validation set

Choosing K (how many clusters?)