1
Gaussian Kernel Width Exploration and Cone Cluster Labeling For Support Vector Clustering
Department of Computer Science, University of Massachusetts Lowell
Nov. 28, 2007
Sei-Hyung Lee and Karen Daniels
2
Outline
• Clustering Overview
• SVC Background and Related Work
• Selection of Gaussian Kernel Widths
• Cone Cluster Labeling
• Comparisons
• Contributions
• Future Work
3
Clustering Overview
• Clustering – discovering natural groups in data
• Clustering problems arise in
  – bioinformatics
    • patterns of gene expression
  – data mining/compression
  – pattern recognition/classification
4
Definition of Clustering
• Definition by Everitt (1974)
  – "A cluster is a set of entities which are alike, and entities from different clusters are not alike."
• If we assume that the objects to be clustered are represented as points in the measurement space, then
  – "Clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points."
Sample Clustering Taxonomy (Zaiane 1999)
• Partitioning – fixed number of clusters k
• Hierarchical
• Density-based
• Grid-based
• Model-based – Statistical (COBWEB), Neural Network (SOM)
Hybrids are also possible.
http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/ (Chapter 8)
11
Strengths and Weaknesses
• Partitioning
  – Typical strength: relatively efficient, O(ikn)
  – Weaknesses: may split large clusters and merge small clusters; finds only spherical shapes; sensitive to outliers (k-means); requires choice of k; sensitive to initial selection
• Hierarchical
  – Strength: does not require choice of k
  – Weaknesses: merge/split decisions can never be undone; requires a termination condition; does not scale well
• Density-based
  – Strength: discovers arbitrary shapes
  – Weakness: sensitive to parameters
• Grid-based
  – Strength: fast processing time
  – Weaknesses: sensitive to parameters; cannot find arbitrary shapes
• Model-based
  – Strength: exploits underlying data distribution
  – Weaknesses: distribution assumption is not always true; expensive to update; difficult for large data sets; slow
12
Comparison of Clustering Techniques

                            Scalability  Arbitrary Shape  Handle Noise  Order Dependency  High Dimension  Time Complexity
Partitional    k-means      YES          NO               NO            NO                YES             O(ikN)
               k-medoids    YES          NO               Outlier       NO                YES             O(ikN)
               CLARANS      YES          NO               Outlier       NO                NO              O(N^2)
Hierarchical   BIRCH        YES          NO               ?             NO                NO              O(N)
               CURE         YES          YES              YES           NO                NO              O(N^2 log N)
               SVC          ?            YES              YES           NO                YES             O((N - N_bsv) N_sv)
Density-based  DBSCAN       YES          YES              YES           NO                NO              O(N log N)
Grid-based     STING        YES          NO               ?             NO                NO              O(N)
Model-based    COBWEB       NO           ?                ?             YES               NO              ?

k = number of clusters, i = number of iterations, N = number of data points, N_sv = number of support vectors, N_bsv = number of bounded support vectors. SVC time is for a single combination of parameters.
13
Jain et al. Taxonomy (1999)
Distance between 2 clusters = minimum of distances between all inter-cluster pairs.
Distance between 2 clusters = maximum of distances between all inter-cluster pairs.
Cross-cutting Issues
Agglomerative vs. Divisive
Monothetic vs. Polythetic (sequential feature consideration)
Hard vs. Fuzzy
Deterministic vs. Stochastic
Incremental vs. Non-incremental
14
More Recent Clustering Surveys
• Clustering Large Datasets (Mercer 2003)
  – Hybrid Methods: e.g. Distribution-Based Clustering Algorithm for Clustering Large Spatial Datasets (Xu et al. 1998)
    • Hybrid: model-based, density-based, grid-based
• Doctoral Thesis (Lee 2005)
  – Boundary-Detecting Methods:
    • AUTOCLUST (Estivill-Castro et al. 2000) – Voronoi modeling and Delaunay triangulation
    • Random Walks (Harel et al. 2001) – Delaunay triangulation modeling and k-nearest-neighbors; random walk in a weighted graph
    • Support Vector Clustering (Ben-Hur et al. 2001) – One-class Support Vector Machine + cluster labeling
15
Overview of SVM
• Map non-linearly separable data into a feature space where they are linearly separable
• Class of hyperplanes: f(x) = ⟨ω, Φ(x)⟩ + b = 0
  where ω is the normal vector of the hyper-plane, b is the offset from the origin, and Φ is the non-linear mapping
16
Overview of SVC
• Support Vector Clustering (SVC)
  – Clustering algorithm using a (one-class) SVM
  – Able to handle arbitrary-shaped clusters
  – Able to handle outliers
  – Able to handle high dimensions, but…
  – Needs input parameters
    • For the kernel function that defines the inner product in feature space
      – e.g. the Gaussian kernel width q in K(x, y) = e^(−q‖x−y‖²)
    • Soft margin C to control outliers
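For concreteness (not part of the original slides), a minimal NumPy sketch of the Gaussian kernel matrix K(x, y) = exp(−q‖x−y‖²) used throughout SVC; the function name and the pairwise-distance formulation are my own.

```python
import numpy as np

def gaussian_kernel(X, q):
    """Kernel matrix with entries K[i, j] = exp(-q * ||x_i - x_j||^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-q * sq_dists)
```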
17
SVC Main Idea (Gaussian kernel)
• Gaussian kernel: K(x, y) = e^(−q‖x−y‖²); the images Φ(x) lie on the unit ball in feature space
• R: radius of the minimal hyper-sphere
• a: center of the sphere
• R(x): distance between Φ(x) and a
• BSV: data x outside of the sphere, R(x) > R; Num(BSV) is controlled by C
• SV: data x on the surface of the sphere, R(x) = R; Num(SV) is controlled by q
• Others: data x inside of the sphere, R(x) < R
• "Attract" the hyper-plane onto data points instead of "repel."
• Data space contours are not explicitly available.
18
Find Minimal Hyper-sphere (with BSVs)

Primal problem:
  minimize R²  subject to  ‖Φ(x_j) − a‖² ≤ R² + ξ_j  and  ξ_j ≥ 0  for all j

Lagrangian:
  L = R² − Σ_j (R² + ξ_j − ‖Φ(x_j) − a‖²) β_j − Σ_j ξ_j μ_j + C Σ_j ξ_j
  where β_j ≥ 0 and μ_j ≥ 0 are Lagrange multipliers, C is a constant, and C Σ_j ξ_j is a penalty term for BSVs.

Setting to zero the derivatives of L with respect to R, a, and ξ_j:
  ∂L/∂R = 2R − 2R Σ_j β_j = 0   ⟹  Σ_j β_j = 1          (3)
  ∂L/∂a = 0                     ⟹  a = Σ_j β_j Φ(x_j)    (4)
  ∂L/∂ξ_j = 0                   ⟹  β_j = C − μ_j         (5)

KKT conditions:
  ξ_j μ_j = 0  and  (R² + ξ_j − ‖Φ(x_j) − a‖²) β_j = 0
  (Only points on the boundary contribute.)

Wolfe dual form of L, substituting (3)–(5) and using ⟨Φ(x_i), Φ(x_j)⟩ = K(x_i, x_j):
  W = Σ_j K(x_j, x_j) β_j − Σ_{i,j} β_i β_j K(x_i, x_j)
  Maximize W to obtain the β_j's, subject to 0 ≤ β_j ≤ C and Σ_j β_j = 1.

Use β_j to classify data point x_j:
  β_j = 0, ξ_j = 0:      inside the sphere
  0 < β_j < C, ξ_j = 0:  on the surface of the sphere (SV)
  β_j = C, ξ_j > 0:      outside the sphere (BSV)
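For illustration only, a sketch of solving this Wolfe dual numerically with SciPy's general-purpose SLSQP solver; the slides do not prescribe any particular solver, and a dedicated QP/SMO routine would normally be used instead.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(K, C):
    """Maximize W(beta) = sum_j K[j,j]*beta[j] - beta^T K beta
    subject to sum(beta) = 1 and 0 <= beta[j] <= C."""
    N = K.shape[0]
    diag = np.diag(K)

    def neg_W(b):
        return -(diag @ b - b @ K @ b)

    def neg_W_grad(b):
        return -(diag - 2.0 * K @ b)      # K is symmetric

    b0 = np.full(N, 1.0 / N)              # feasible starting point
    res = minimize(neg_W, b0, jac=neg_W_grad, method="SLSQP",
                   bounds=[(0.0, C)] * N,
                   constraints=[{"type": "eq", "fun": lambda b: np.sum(b) - 1.0}])
    return res.x
```

For the Gaussian kernel every diagonal entry K[j, j] is 1, so the linear term is constant, but the general form above keeps the correspondence with the dual as written.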
19
Relationship Between Minimal Hyper-sphere and Cluster Contours

R : radius of the minimal hyper-sphere
a : center of the sphere
R(x) : distance between Φ(x) and a

  R²(x) = ‖Φ(x) − a‖²
        = Φ(x)·Φ(x) − 2 Φ(x)·a + a·a
        = K(x, x) − 2 Σ_j β_j K(x_j, x) + Σ_{i,j} β_i β_j K(x_i, x_j)

Contours: {x | R(x) = R}, the points whose images lie on the surface of the minimal sphere.

Challenge: Contour boundaries are not explicitly available.
Number of clusters increases with increasing q.
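A small sketch (my own, assuming the Gaussian kernel so that K(x, x) = 1) of evaluating R(x) from the kernel expansion above.

```python
import numpy as np

def sphere_distance(x, X, beta, q):
    """R(x): feature-space distance between Phi(x) and the sphere center a,
    computed entirely through kernel evaluations."""
    k_xX = np.exp(-q * np.sum((X - x) ** 2, axis=1))                        # K(x_j, x)
    K = np.exp(-q * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # K(x_i, x_j)
    r_sq = 1.0 - 2.0 * beta @ k_xX + beta @ K @ beta
    return np.sqrt(max(r_sq, 0.0))            # guard against tiny negative round-off
```

The sphere radius R itself can be read off as R(v) for any support vector v (a point with 0 < β_v < C).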
20
SVC High-Level Pseudo-Code

SVC(X)
  q ← initial value
  C ← initial C (= 1)
  loop
    K ← computeKernel(X, q)
    β ← solveLagrangian(K, C)
    clusterLabeling(X, β)
    if clustering result is satisfactory, exit
    choose new q and/or C
  end loop
21
Previous Work on SVC
• Tax and Duin (1999): Novelty detection using (one-class) SVM
• SVC suggested by A. Ben-Hur, V. Vapnik, et al. (2001)
  – Complete Graph
  – Support Vector Graph
• J. Yang, et al. (2002): Proximity Graph
• J. Park, et al. (2004): Spectral Graph Partitioning
• J. Lee, et al. (2005): Gradient Descent
• W. Puma-Villanueva et al. (2005): Ensembles
• S. Lee and K. Daniels (2004, 2005, 2006, 2007): Kernel width exploration and fast cluster labeling
22
Previous Work on Cluster Labeling
• Complete Graph (CG): test all pairs (xi, xj) in X
• Support Vector Graph (SVG): test all pairs (xi, xj) where xi or xj is a SV
• Proximity Graph (PG): test all pairs (xi, xj) where xi and xj are linked in a PG
23
Gradient Descent (GD)
(Figure legend: support vectors, non-SV data points, stable equilibrium points)
24
Traditional Sample Points Technique
• CG, SVG, PG, and GD use this technique.
• To decide whether xi and xj are connected, check m sample points y on the line segment between them; the pair is connected only if every sample point stays inside the minimal hyper-sphere, i.e. R(y) ≤ R (see the sketch below).
• Figure cases: ① disconnected, ② disconnected, ③ connected.
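A minimal sketch of the sample-points adjacency test, assuming a helper sphere_distance(y) that returns R(y) (for example a closure over X, β, and q from the earlier sketch); the function name and the uniform spacing of the samples are illustrative assumptions.

```python
import numpy as np

def connected(x_i, x_j, m, R, sphere_distance):
    """x_i and x_j are declared connected only if every one of m points
    sampled on the segment between them satisfies R(y) <= R."""
    for t in np.linspace(0.0, 1.0, m):
        y = (1.0 - t) * x_i + t * x_j
        if sphere_distance(y) > R:
            return False
    return True
```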
25
Problems of Sample Points Technique
• False negative: xi and xj lie in the same cluster, but the straight segment between them leaves the cluster contour, so some sample point falls outside and the pair is declared disconnected.
• False positive: xi and xj lie in different clusters, but the finitely many sample points all happen to fall inside the contours, so the pair is declared connected.
26
CG Result (C=1)
27
Problems of SVC
• Difficult to find appropriate q and C
  – no guidance for choosing q and C
  – too much trial and error
• Slow cluster labeling
  – O(N² N_sv m) time for the CG method, where m is the number of sample points on the line segment connecting any pair of data points
  – general size of a Delaunay triangulation in d dimensions = O(N^⌈d/2⌉)
• Bad performance in high dimensions
  – as the number of principal components is increased, performance degrades
28
Our q Exploration
• Lemmas
  – If q = 0, then R² = 0
  – If q = ∞, then β_i = 1/N for all i ∈ {1, …, N}
  – If q = ∞, then R² = 1 − 1/N
  – R² = 1 iff q = ∞ and N = ∞
  – If N is finite, then R² ≤ 1 − 1/N < 1
• Theorem
  – Under certain circumstances, R² is a monotonically nondecreasing function of q
• Secant-like algorithm (idea sketched below)
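The slides do not spell the secant-like algorithm out, so the following is only an illustrative sketch of the idea: step R² toward its asymptote 1 − 1/N in equal increments, using a secant estimate of the slope dR²/dq to pick the next q. It should not be read as the authors' exact procedure.

```python
import numpy as np

def secant_q_list(r2_of_q, q0, q1, n_points, N):
    """Illustrative secant-style q exploration (sketch, not the exact algorithm).
    r2_of_q(q) returns R^2 for width q, e.g. by solving the Wolfe dual."""
    qs = [q0, q1]
    r2s = [r2_of_q(q0), r2_of_q(q1)]
    for target in np.linspace(r2s[-1], 1.0 - 1.0 / N, n_points + 1)[1:]:
        slope = (r2s[-1] - r2s[-2]) / (qs[-1] - qs[-2])   # secant estimate of dR^2/dq
        if slope <= 0 or target <= r2s[-1]:
            continue                                      # curve is flat here; skip target
        qs.append(qs[-1] + (target - r2s[-1]) / slope)
        r2s.append(r2_of_q(qs[-1]))
    return qs
```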
29
q-list Length Analysis
• Estimation of q-list length ≈ lg(max_{i,j} ‖x_i − x_j‖²) − lg(min_{i≠j} ‖x_i − x_j‖²)
• Depends only on
  – spatial characteristics of the data set, and
  – not on the dimensionality of the data set or the number of data points
• 89% accuracy w.r.t. the actual q-list length for all datasets considered
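A direct sketch of the estimate above; taking the minimum over distinct (non-identical) pairs is my assumption about how zero distances are excluded.

```python
import numpy as np

def q_list_length_estimate(X):
    """Estimate: lg(max ||x_i - x_j||^2) - lg(min ||x_i - x_j||^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    max_sq = sq_dists.max()
    min_sq = sq_dists[sq_dists > 0].min()   # ignore self/duplicate distances
    return np.log2(max_sq) - np.log2(min_sq)
```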
30
Our Recent q Exploration Work
• The R² vs. q curve typically has one critical radius of curvature, at q*.
• Approximate q* to yield q̂* (without cluster labeling).
• Use q̂* as the starting q value in the sequence.
31
q Exploration Results
(Results table for data sets of various dimensions.)
• 2D: On average the actual number is
  – 32% of the estimate
  – 22% of the secant length
• Higher dimensions: On average the actual number is
  – 112% of the estimate
  – 82% of the secant length
32
2D q Exploration Results
33
Higher Dimensional q Exploration Results
34
Cone Cluster Labeling (CCL)
• Motivation: Avoid line segment sampling
• Approach:
  – Leverage geometry of the feature space.
  – For the Gaussian kernel K(x, y) = e^(−q‖x−y‖²):
    • Images of all data points are on the surface of the unit ball in feature space.
    • A hyper-sphere in data space corresponds to a cone in feature space with apex at the origin.
(Figures: sample 2D data space; low-dimensional view of the high-dimensional feature space.)
35
Cone Cluster Labeling
• Support Vector Cone P_{v_i}:
  – P: intersection between the surface of the unit ball and the minimal hypersphere in feature space
  – Covering: ∪_{v_i ∈ V} Φ⁻¹(P_{v_i}), where V is the set of support vectors
(Figure: cones for support vectors v_i and v_j, each with base angle θ, through Φ(v_i) and Φ(v_j).)
36
Cone Cluster Labeling
• cos θ = ⟨Φ(v_i), a′⟩, where a′ is the common axis direction given by the sphere center a.
• Cone base angles are all equal to θ.
• Cones have a′ in common.
• The Pythagorean Theorem holds in feature space: ‖a‖² + R² = ‖Φ(v_i)‖² = 1.
• Hence ‖a‖ = √(1 − R²) and cos θ = √(1 − R²).
• Use cos θ to derive the data space hyper-sphere radius.
37
Cone Cluster Labeling
• P′: mapping of P into the data space.
• Φ⁻¹(P_{v_i}) corresponds to a support vector hypersphere centered at v_i with radius
  Z = √(−ln(cos θ) / q) = √(−ln(√(1 − R²)) / q)
• ∪_{v_i ∈ V} S_{v_i} approximately covers P′, where S_{v_i} is the hypersphere of radius Z centered at v_i.
(Figures: P′ and Z for q = 0.003 and q = 0.137.)
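A one-line computation of Z from R² and q, following cos θ = √(1 − R²); the function name is mine.

```python
import numpy as np

def cone_radius(r_squared, q):
    """Data-space radius Z of each support-vector hyper-sphere:
    cos(theta) = sqrt(1 - R^2), and Z = sqrt(-ln(cos(theta)) / q)."""
    cos_theta = np.sqrt(1.0 - r_squared)
    return np.sqrt(-np.log(cos_theta) / q)
```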
38
Cone Cluster Labeling
ConeClusterLabeling(X, Q, V)
  for each q ∈ Q
    compute Z for q
    AdjacencyMatrix ← ConstructConnectivity(V, Z)
    Labels ← FindConnComponents(AdjacencyMatrix)
    for each x ∈ X, where x ∉ V
      idx ← find the nearest SV to x
      Labels(x) ← Labels(x_idx)
    end for
    print Labels
  end for
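A hedged Python sketch of the labeling step for a single q, assuming ConstructConnectivity(V, Z) connects two support vectors when their radius-Z hyper-spheres overlap (‖v_i − v_j‖ ≤ 2Z); that reading of the connectivity test, and the helper names, are assumptions rather than the authors' exact specification.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cone_cluster_labeling(X, sv_idx, Z):
    """CCL sketch: connect SVs whose radius-Z spheres overlap, take connected
    components as clusters, and give each non-SV the label of its nearest SV."""
    V = X[sv_idx]
    sq = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    adjacency = csr_matrix(sq <= (2.0 * Z) ** 2)          # ConstructConnectivity(V, Z)
    _, sv_labels = connected_components(adjacency, directed=False)

    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        nearest_sv = np.argmin(np.sum((V - x) ** 2, axis=1))
        labels[i] = sv_labels[nearest_sv]
    return labels
```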
39
2D CCL Results (C=1)
40
Sample Higher Dimensional CCL Results in “Heat Map” Form
(Heat maps indexed by N and d.)
• N = 12, d = 9: 3 clusters
• N = 30, d = 25: 5 clusters
• N = 205, d = 200: 5 clusters
41
Comparison – cluster labeling algorithms
                              CG                SVG               PG                     GD                          CCL
Construct Adjacency Matrix    O(N² N_sv m)      O(N N²_sv m)      O(N(log N + N_sv m))   O(m(N² i + N_sv N²_sep))    O(N²_sv)
Find Connected Components     O(N²)             O(N N_sv)         O(N²)                  O(N²_sep)                   O(N²_sv)
Non-SV Labeling               N/A               O((N − N_sv) N_sv)  O((N − N_sv) N_sv)   O(N − N_sep)                O((N − N_sv) N_sv)
TOTAL                         O(N² N_sv m)      O(N N²_sv m)      O(N² + N N_sv m)       O(m(N² i + N_sv N²_sep))    O(N N_sv)

m: the number of sample points; i: the number of iterations for convergence. Time is for a single (q, C) combination.
42
Comparisons – 2D
(Plots: Construct Adjacency Matrix, Find Connected Components, Non-SV Labeling, Total Time for Cluster Labeling.)
43
Comparisons – HD
(Plots: Construct Adjacency Matrix, Find Connected Components, Non-SV Labeling.)
44
Contributions
• Automatically generate Gaussian kernel width values
  – include appropriate width values for our test data sets
  – obtain some reasonable cluster results from the q-list
• Faster cluster labeling method
  – faster than the other SVC cluster labeling algorithms
  – good clustering quality
45
Future Work
“The presence or absence of robust, efficient parallel clustering techniques will determine the success or failure of cluster analysis in large-scale data mining applications in the future.” - Jain et al. 1999
Make SVC scalable!
46
End