Radial Basis Function ANN, an alternative to back propagation, uses clustering of examples in the training set
Example of Radial Basis Function (RBF) network
Input vector, d dimensions
K radial basis functions
Single output
Structure used for multivariate regression or binary classification
Review: RBF network provides an alternative to back propagation. Each hidden node is associated with a cluster of input instances. The hidden layer is connected to the output by linear least squares.
Gaussians are the most frequently used radial basis function: φ_j(x) = exp(−½ (‖x − m_j‖ / s_j)²)
Clusters of input instances are parameterized by a mean and variance
With r the column vector of training targets and D the N×K matrix of basis-function outputs, D_tj = φ_j(x^t):

r = (r^1, r^2, …, r^N)ᵀ

D = [ φ_1(x^1)  φ_2(x^1)  …  φ_K(x^1)
      φ_1(x^2)  φ_2(x^2)  …  φ_K(x^2)
      ⋮
      φ_1(x^N)  φ_2(x^N)  …  φ_K(x^N) ]
Linear least squares with basis functions
Given training set X = {x^t, r^t}, t = 1…N,
and the mean and variance of K clusters of input data, construct the N×K matrix D and column vector r.
Add a column of ones to include a bias node. Solve the normal equations DᵀDw = Dᵀr for a vector w of K weights connecting hidden nodes to the output node.
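This construction can be sketched in Python with NumPy. All function and variable names below are illustrative, not from the slides, and `np.linalg.lstsq` is used rather than explicitly forming DᵀD, which solves the same least-squares problem more stably.

```python
import numpy as np

def design_matrix(X, means, widths):
    """Build the N x K matrix with entries exp(-0.5 * (||x_t - m_j|| / s_j)^2),
    plus a column of ones for the bias node."""
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)  # N x K
    D = np.exp(-0.5 * (dists / widths) ** 2)
    return np.hstack([D, np.ones((X.shape[0], 1))])  # bias column

def fit_rbf(X, r, means, widths):
    """Solve the least-squares problem (equivalent to D^T D w = D^T r)."""
    D = design_matrix(X, means, widths)
    w, *_ = np.linalg.lstsq(D, r, rcond=None)
    return w

# Toy usage: two fixed centers, 1-D targets
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
means = np.array([[-1.0, 0.0], [1.0, 0.0]])
widths = np.array([1.0, 1.0])
r = np.sin(X[:, 0])
w = fit_rbf(X, r, means, widths)
pred = design_matrix(X, means, widths) @ w
```

In practice the means and widths would come from a clustering step such as k-means, as described later in these slides.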
RBF networks perform best with large datasets
With large datasets, expect redundancy (i.e. multiple examples expressing the same general pattern)
In RBF network, hidden layer is a feature-space representation of the data where redundancy has been used to reduce noise.
A validation set may be helpful to determine K, the best number of clusters of input data
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
Supervised learning: mapping input to output
Unsupervised learning: find regularities in the input. Regularities reflect some probability distribution of attribute vectors, p(x^t). Discovering p(x^t) is called "density estimation". A parametric method uses MLE to find θ in p(x^t|θ).
In clustering, we look for regularities as group membership. Assume we know the number of clusters, K. Given K and dataset X, we want to find
the size of each group P(G_i) and its component density p(x|G_i)
Background on clustering
Define group labels based on nearest center
Get new trial centers
Find group labels using the geometric interpretation of a cluster as points in attribute space closer to a “center” than they are to data points not in the cluster
Define trial centers by reference vectors m_j, j = 1…K
b_i^t = 1 if ‖x^t − m_i‖ = min_j ‖x^t − m_j‖, and 0 otherwise

m_i = Σ_t b_i^t x^t / Σ_t b_i^t

Judge convergence by E({m_i}_{i=1}^K | X) = Σ_t Σ_i b_i^t ‖x^t − m_i‖²
K-Means Clustering: hard labels
K-means clustering pseudo code
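The pseudo code can be sketched as follows (illustrative names, initial centers supplied explicitly for reproducibility): assign hard labels by nearest center, then recompute each center as the mean of the instances it won.

```python
import numpy as np

def kmeans(X, K, init=None, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = init if init is not None else X[rng.choice(len(X), K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                      # hard labels b_i^t
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centers[i] for i in range(K)])  # new trial centers
        if np.allclose(new, centers):                  # converged
            break
        centers = new
    return centers, labels

# Two well-separated blobs; init near each blob
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)),
               rng.normal(5, 0.3, size=(30, 2))])
centers, labels = kmeans(X, K=2, init=np.array([[1.0, 1.0], [4.0, 4.0]]))
```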
Example of pseudo code application
Example of K-means with arbitrary starting centers and convergence plot
Convergence
K-means is an example of the Expectation-Maximization (EM) approach to MLE
L(Φ | X) = Σ_t log Σ_{i=1}^k p(x^t | G_i) P(G_i)
Log likelihood of the mixture model cannot be solved analytically for Φ
Use a 2-step iterative method:
E-step: estimate labels of x^t given current knowledge of the mixture components
M-step: update component knowledge using the labels from the E-step
E - step
M - step
K-means clustering pseudo code with steps labeled
Given converged K-means centers, estimate the variance for RBFs by s² = d²_max / (2K), where d_max is the largest distance between clusters.
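This heuristic can be computed directly from a set of converged centers; the toy center values below are my own, chosen only to make the arithmetic checkable.

```python
import numpy as np

# Toy k-means centers (illustrative values)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
K = len(centers)

# Largest distance between any two centers: here the 3-4-5 triangle gives 5
pairwise = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
d_max = pairwise.max()

# Shared RBF variance from the slide's heuristic s^2 = d_max^2 / (2K)
s2 = d_max ** 2 / (2 * K)
```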
Gaussian mixture theory is another approach to getting RBFs
Application of K-means clustering to RBF-ANN
p(x^t) = Σ_{i=1}^k p(x^t | G_i) P(G_i)
X = {x^t}_t is made up of K groups (clusters)
P(G_i): proportion of X in group i
Attributes in each group are Gaussian distributed:
p(x^t | G_i) = N_d(μ_i, Σ_i)
m_i: mean of the x^t in group i
S_i: covariance matrix of the x^t in group i
Distribution of attributes is mixture of Gaussians
Gaussian Mixture Densities
Given a group label for each data point, r_i^t, MLE provides estimates of the parameters of Gaussian mixtures
where p ( x | Gi) ~ N ( μi , ∑i )
Φ = {P (Gi ), μi , ∑i }i=1 to k
p(x^t) = Σ_{i=1}^k p(x^t | G_i) P(G_i)
P(G_i) = Σ_t r_i^t / N

m_i = Σ_t r_i^t x^t / Σ_t r_i^t

S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)ᵀ / Σ_t r_i^t
Estimators
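These estimators can be sketched in NumPy, assuming hard labels are given as an integer array (names are illustrative):

```python
import numpy as np

def mixture_mle(X, labels, K):
    """MLE of P(G_i), m_i, S_i from hard labels r_i^t encoded as labels[t] = i."""
    N, d = X.shape
    P = np.zeros(K)
    means = np.zeros((K, d))
    covs = np.zeros((K, d, d))
    for i in range(K):
        mask = labels == i
        Ni = mask.sum()
        P[i] = Ni / N                      # group proportion
        means[i] = X[mask].mean(axis=0)    # group mean
        diff = X[mask] - means[i]
        covs[i] = diff.T @ diff / Ni       # group covariance (MLE, divides by Ni)
    return P, means, covs

# Tiny hand-checkable example: two groups of two points each
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
P, means, covs = mixture_mle(X, labels, K=2)
```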
p(x) = N(μ, σ²):
p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))
MLE for μ and σ²:
m = Σ_t x^t / N
s² = Σ_t (x^t − m)² / N
1D Gaussian distribution
x ~ N_d(μ, Σ):
p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
Mahalanobis distance: (x − μ)ᵀ Σ⁻¹ (x − μ), analogous to (x − m)²/s²
x − μ is a d×1 column vector, Σ is a d×d matrix, and the M-distance is a scalar
Measures distance of x from the mean in units of Σ
d denotes number of attributes
dD Gaussian distribution
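The Mahalanobis distance above can be sketched as follows (illustrative names; `np.linalg.solve` is used instead of forming the explicit inverse):

```python
import numpy as np

def mahalanobis_sq(x, mu, S):
    """Squared Mahalanobis distance (x - mu)^T S^{-1} (x - mu), a scalar."""
    diff = x - mu
    return float(diff @ np.linalg.solve(S, diff))

mu = np.zeros(2)
S = np.diag([4.0, 1.0])    # variance 4 along axis 0, variance 1 along axis 1
# A point 2 units out along the high-variance axis is only 1 "unit" away:
d2 = mahalanobis_sq(np.array([2.0, 0.0]), mu, S)
```

Note that the squared Euclidean distance of the same point is 4; dividing by the per-axis variance is what "units of Σ" means.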
• If the x_i are independent, the off-diagonals of Σ are 0
• p(x) is the product of the probabilities for each component of x
p(x) = Π_{i=1}^d p_i(x_i) = (1 / ((2π)^{d/2} Π_{i=1}^d s_i)) exp(−½ Σ_{i=1}^d ((x_i − m_i) / s_i)²)
P(G_i) = Σ_t h_i^t / N

m_i^{l+1} = Σ_t h_i^t x^t / Σ_t h_i^t

S_i^{l+1} = Σ_t h_i^t (x^t − m_i^{l+1})(x^t − m_i^{l+1})ᵀ / Σ_t h_i^t
Replace the hard labels r_i^t by soft labels h_i^t, the probability that x^t belongs to cluster i. Assume that the cluster densities p(x^t | Φ) are Gaussian; then the mixture proportions, means, and covariance matrices are estimated by the updates above,
where the h_i^t are the soft labels from the previous E-step
Gaussian mixture model by EM: soft labels
Initialize by k-means clustering. After a few iterations, use the centers m_i and the instances covered by each center to estimate the covariance matrices S_i and mixture proportions π_i
From m_i, S_i, and π_i, calculate the soft labels h_i^t by
h_i^t = π_i |S_i|^{−1/2} exp[−½ (x^t − m_i)ᵀ S_i⁻¹ (x^t − m_i)] / Σ_j π_j |S_j|^{−1/2} exp[−½ (x^t − m_j)ᵀ S_j⁻¹ (x^t − m_j)]
Calculate new proportions, centers and covariance by
Use these to calculate new soft labels
Gaussian mixture model by EM: soft labels
P(G_i) = Σ_t h_i^t / N

m_i^{l+1} = Σ_t h_i^t x^t / Σ_t h_i^t

S_i^{l+1} = Σ_t h_i^t (x^t − m_i^{l+1})(x^t − m_i^{l+1})ᵀ / Σ_t h_i^t
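The full E-step/M-step loop can be sketched as below. Two departures from the slides, both my own simplifications: initialization from k-means is replaced by fixed starting guesses, and a tiny ridge term keeps the covariances invertible.

```python
import numpy as np

def gaussian_pdf(X, mu, S):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(S)
    expo = -0.5 * np.einsum('td,de,te->t', diff, inv, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

def em_gmm(X, pis, means, covs, n_iter=50):
    K = len(pis)
    for _ in range(n_iter):
        # E-step: soft labels h_i^t
        dens = np.stack([pis[i] * gaussian_pdf(X, means[i], covs[i])
                         for i in range(K)], axis=1)
        h = dens / dens.sum(axis=1, keepdims=True)
        # M-step: proportions, means, covariances
        Ni = h.sum(axis=0)
        pis = Ni / len(X)
        means = (h.T @ X) / Ni[:, None]
        covs = np.stack([(h[:, i, None] * (X - means[i])).T @ (X - means[i]) / Ni[i]
                         + 1e-6 * np.eye(X.shape[1]) for i in range(K)])
    return pis, means, covs

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(4, 0.5, (40, 2))])
pis, means, covs = em_gmm(X, np.array([0.5, 0.5]),
                          np.array([[0.5, 0.5], [3.5, 3.5]]),
                          np.stack([np.eye(2), np.eye(2)]))
```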
K-means: hard labels, centers marked
EM Gaussian mixtures with soft labels: contours show 1 standard deviation; colors show mixture proportions
k-means hard labels
P(G_1|x) = 0.5
Data points color coded by the greater soft label. Contours show m ± s of the Gaussian densities. The dashed contour is the "separating" curve.
Gaussian mixtures, soft labels; x marks cluster mean
Outliers?
In applications of Gaussian mixtures to RBFs, correlation of attributes is ignored and the diagonal elements of the covariance matrix are equal.
In this approximation the Mahalanobis distance reduces to Euclidean distance.
P(G_i) = Σ_t h_i^t / N

m_i = Σ_t h_i^t x^t / Σ_t h_i^t

s_i² = Σ_t h_i^t ‖x^t − m_i‖² / Σ_t h_i^t
Variance parameter of radial basis function becomes a scalar
• Cluster based on similarities (distances)
• Distance measure between instances x^r and x^s
Minkowski (Lp) (Euclidean for p = 2)
City-block distance
d_m(x^r, x^s) = [ Σ_{j=1}^d (x_j^r − x_j^s)^p ]^{1/p}

d_cb(x^r, x^s) = Σ_{j=1}^d |x_j^r − x_j^s|
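Both distance measures are one-liners; in this sketch the absolute value is applied inside the Minkowski sum as well, so that odd p behaves sensibly (function names are mine):

```python
import numpy as np

def minkowski(xr, xs, p):
    """Minkowski (L_p) distance; Euclidean for p = 2, city-block for p = 1."""
    return float(np.sum(np.abs(xr - xs) ** p) ** (1.0 / p))

def city_block(xr, xs):
    """City-block (L_1) distance."""
    return float(np.sum(np.abs(xr - xs)))

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
```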
Hierarchical Clustering
• Start with N groups each with one instance and merge the two closest groups at each iteration
• Distance between two groups Gi and Gj:
• Single-link: smallest distance between all possible pairs of instances, one from each group
• Complete-link: largest distance between all possible pairs of instances, one from each group
• Average-link: distance between centroids
Single-link: d(G_i, G_j) = min over x^r ∈ G_i, x^s ∈ G_j of d(x^r, x^s)

Complete-link: d(G_i, G_j) = max over x^r ∈ G_i, x^s ∈ G_j of d(x^r, x^s)
Agglomerative Clustering
Dendrogram
At heights h between √2 and 2, the dendrogram has the 3 clusters shown on the data graph. At h > 2 the dendrogram shows 2 clusters; c, d, and f form one cluster at this distance.
Example: single-linked clusters
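A naive single-link agglomerative sketch of the procedure above: start with one group per instance and repeatedly merge the pair of groups with the smallest minimum pairwise distance, stopping at a height threshold h (the dendrogram cut). Quadratic-time and with illustrative names; real implementations use cleverer data structures.

```python
import numpy as np

def single_link(X, h):
    """Merge groups bottom-up until the closest pair is farther apart than h."""
    groups = [[i] for i in range(len(X))]
    def dist(g1, g2):  # single-link group distance: min over cross pairs
        return min(np.linalg.norm(X[r] - X[s]) for r in g1 for s in g2)
    while len(groups) > 1:
        pairs = [(dist(groups[a], groups[b]), a, b)
                 for a in range(len(groups)) for b in range(a + 1, len(groups))]
        d, a, b = min(pairs)
        if d > h:          # next merge would be above the cut height
            break
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups

# Two tight pairs far apart: a threshold of 1 leaves exactly two clusters
X = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 0.0], [10.5, 0.0]])
clusters = single_link(X, h=1.0)
```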
• Application specific
• Plot data (after PCA, for example) and check for clusters
• Add one at a time using a validation set
Choosing K (how many clusters?)