Clustering
Prof. Navneet Goyal, BITS, Pilani
Density-based methods
Based on connectivity and density functions
Filter out noise, find clusters of arbitrary shape
Grid-based methods
Quantize the object space into a grid structure
Other Approaches to Clustering
Density-Based Clustering Methods
Major features:
• Discover clusters of arbitrary shape
• Handle noise
• One scan
• Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al. (SIGMOD’99)
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98)
Density-Based Spatial Clustering of Applications with Noise
Clusters are dense regions of objects separated by regions of low density (noise)
Outliers will not affect the creation of clusters
Input
– MinPts – minimum number of points in any cluster
– Eps – for each point in a cluster there must be another point in it less than this distance away
Density-Based Method: DBSCAN
• Eps-neighborhood: Points within Eps distance of a point.
• Core point: Eps-neighborhood dense enough (MinPts)
• Directly density-reachable: A point p is directly density-reachable from a point q if the distance is small (Eps) and q is a core point.
• Density-reachable: A point is density-reachable from another point if there is a path from one to the other consisting of only core points.
DBSCAN Density Concepts
Density-Based Method: DBSCAN
• Eps-neighborhood: Points within Eps distance of a point.
  N_Eps(p) = {q belongs to D | dist(p, q) <= Eps}
• Core point: Eps-neighborhood dense enough (MinPts)
• Directly density-reachable: A point p is directly density-reachable from a point q if the distance is small (Eps) and q is a core point. Formally, p is directly density-reachable from q wrt. Eps, MinPts if
  1) p belongs to N_Eps(q)
  2) core point condition: |N_Eps(q)| >= MinPts
[Figure: points p and q, with MinPts = 5 and Eps = 1 cm]
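The definitions of Eps-neighborhood, core point, and directly density-reachable can be illustrated with a minimal Python sketch. The function names and the toy point set are assumptions made for illustration; they do not come from the slides. The neighborhood includes the point itself, matching the definition of N_Eps(p).

```python
import math

def eps_neighborhood(points, p, eps):
    """Indices q with dist(p, q) <= eps (p itself is included)."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def is_core(points, p, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    return p in eps_neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)

# Hypothetical toy data: four close points, one fringe point, one far point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.4, 0.5), (5.0, 5.0)]

print(is_core(points, 3, eps=1.0, min_pts=4))   # True: 5 points within Eps
print(is_core(points, 4, eps=1.0, min_pts=4))   # False: only 2 points within Eps
print(directly_density_reachable(points, 4, 3, eps=1.0, min_pts=4))  # True
print(directly_density_reachable(points, 3, 4, eps=1.0, min_pts=4))  # False
```

The last two lines show that the relation is not symmetric: the fringe point is directly density-reachable from a core point, but not the other way round.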
Density-Based Method: DBSCAN
Density-reachable: A point is density-reachable from another point if there is a path from one to the other consisting of only core points
– A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p_1, …, p_n, p_1 = q, p_n = p such that p_(i+1) is directly density-reachable from p_i for all i in 1, …, n-1
[Figure: chain from q through p1 to p]
Density-connected
– A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
[Figure: points p and q, both density-reachable from o]
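Density-reachability is the transitive closure of direct density-reachability, so it can be sketched as a breadth-first search that only expands core points. The following is an illustrative sketch under that reading; the function names and the one-dimensional toy data are assumptions, not part of the slides.

```python
import math
from collections import deque

def neighbors(points, i, eps):
    """Indices within eps of point i (i itself is included)."""
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def reachable_from(points, q, eps, min_pts):
    """Set of indices density-reachable from q; empty if q is not a core point."""
    if len(neighbors(points, q, eps)) < min_pts:
        return set()
    seen = {q}
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        nbrs = neighbors(points, cur, eps)
        if len(nbrs) < min_pts:
            continue  # cur was reached, but the chain cannot continue through a non-core point
        for n in nbrs:
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return seen

def density_connected(points, p, q, eps, min_pts):
    """True if some point o exists from which both p and q are density-reachable."""
    for o in range(len(points)):
        reach = reachable_from(points, o, eps, min_pts)
        if p in reach and q in reach:
            return True
    return False

# Hypothetical toy data: a chain of five points and one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]

print(reachable_from(points, 2, eps=1.0, min_pts=3))        # {0, 1, 2, 3, 4}
print(reachable_from(points, 0, eps=1.0, min_pts=3))        # set(): point 0 is not core
print(density_connected(points, 0, 4, eps=1.0, min_pts=3))  # True, via o = 2
print(density_connected(points, 0, 5, eps=1.0, min_pts=3))  # False
```

The two end points of the chain are not density-reachable from each other, yet they are density-connected, which is why the cluster definition is built on density-connectedness.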
Density-Based Method: DBSCAN
DBSCAN relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN: Core, Border, and Noise Points
1. Label all points as core, border, or noise points
2. Eliminate noise points
3. Put an edge between all core points that are within ε of each other
4. Make each group of connected core points into a separate cluster
5. Assign each border point to one of its associated core points
DBSCAN: The Algorithm
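The five steps can be sketched in plain Python as follows. This is a brute-force illustration (all pairwise distances are computed, with no spatial index), and the function name, the label convention (-1 for noise), and the toy data are assumptions rather than anything prescribed by the slides.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    n = len(points)
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    # Steps 1-2: core points are identified; anything left unlabeled at the end is noise.
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    labels = [-1] * n
    cluster = 0
    # Steps 3-4: core points within eps of each other form connected groups.
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            cur = stack.pop()
            for j in nbrs[cur]:
                if core[j] and labels[j] == -1:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    # Step 5: each border point joins the cluster of one neighboring core point.
    for i in range(n):
        if not core[i]:
            for j in nbrs[i]:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels

# Hypothetical toy data: a chain, an isolated point, and a small second group.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0),
          (20, 0), (20, 1), (21, 0)]
print(dbscan(points, eps=1.0, min_pts=3))  # [0, 0, 0, 0, 0, -1, 1, 1, 1]
```

A border point that lies within ε of core points from two different clusters is assigned here to whichever core point comes first in index order; the slides only require that it go to one of its associated core points.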
DBSCAN: Core, Border and Noise Points
[Figure panels: original points; point types (core, border and noise), with Eps = 10, MinPts = 4]
Source of figure: Introduction to Data Mining by Tan et al.
When DBSCAN Works Well
[Figure panels: original points; clusters]
• Resistant to Noise
• Can handle clusters of different shapes and sizes
Source of figure: Introduction to Data Mining by Tan et al.
When DBSCAN Does NOT Work Well
[Figure panels: original points; result with (MinPts = 4, Eps = 9.75); result with (MinPts = 4, Eps = 9.92)]
• Varying densities
• High-dimensional data
Source of figure: Introduction to Data Mining by Tan et al.
• Idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
• Noise points have their kth nearest neighbor at a farther distance
• So, plot sorted distance of every point to its kth nearest neighbor
DBSCAN: Determining EPS and MinPts
[Figure: Eps = 10, MinPts = 4]
Source of figure: Introduction to Data Mining by Tan et al.
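The k-distance idea can be sketched as below: compute each point's distance to its kth nearest neighbor and sort the values. The function name and toy data are assumptions for illustration. In practice the sorted values are plotted, and a sharp increase in the curve is commonly read as a candidate Eps, with k used as MinPts.

```python
import math

def sorted_k_distances(points, k):
    """Distance from every point to its kth nearest neighbor, sorted ascending."""
    k_dists = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sorted(k_dists)

# Hypothetical toy data: four corners of a unit square plus one far-away noise point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
k_dists = sorted_k_distances(points, k=2)
print(k_dists[:4])      # [1.0, 1.0, 1.0, 1.0]: points inside the cluster
print(k_dists[4] > 10)  # True: the noise point has a much larger k-distance
```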
• Ordering Points To Identify Clustering Structure
• DBSCAN is sensitive to the choice of input parameters
• Parameter setting is done empirically
• With high dimensional data the problem is more pronounced
• Clustering structures in high dimensional data are generally not characterized by global density parameters like Eps & MinPts
• OPTICS as a solution!
OPTICS: Self Study
• Computes an augmented cluster ordering
• Ordering represents the density based clustering structure of the data
• Contains information that is equivalent to density based clustering obtained from a wide range of parameter settings
• Cluster ordering can be used to extract basic clustering information
OPTICS
• In DBSCAN, for constant MinPts, clusters with higher density (lower Eps) are completely contained in density-connected sets obtained with lower density (higher Eps)
• Extend DBSCAN to process a set of distance parameters Eps at the same time.
• For this the objects need to be processed in a specific order
• This order selects an object that is density reachable wrt lowest eps so that clusters of higher density will be finished first.
OPTICS
• 2 values need to be stored for each object:
– Core distance
– Reachability distance
• Core distance of p – smallest Eps that makes p a core object. If p is not a core object, it is undefined.
• Reachability distance of q wrt p is the greater of the core distance of p and the Euclidean distance between p & q. If p is not a core object, the reachability distance of q wrt p is undefined.
OPTICS
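The two stored values can be sketched as follows. This is an illustrative sketch, not the OPTICS implementation: the function names, the use of None for "undefined", and the toy data are assumptions. The neighborhood is taken to include the point itself, so the core distance is the distance to the MinPts-th closest point counting p.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Smallest radius that makes p a core object, or None if p is not core within eps."""
    dists = sorted(math.dist(points[p], q) for q in points)  # includes 0.0 for p itself
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]

def reachability_distance(points, q, p, eps, min_pts):
    """Reachability distance of q wrt p: max(core distance of p, dist(p, q))."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None  # undefined when p is not a core object
    return max(cd, math.dist(points[p], points[q]))

# Hypothetical toy data.
points = [(0, 0), (1, 0), (0, 2), (3, 0), (10, 10)]

print(core_distance(points, 0, eps=5.0, min_pts=3))             # 2.0
print(reachability_distance(points, 1, 0, eps=5.0, min_pts=3))  # 2.0 (core distance dominates)
print(reachability_distance(points, 3, 0, eps=5.0, min_pts=3))  # 3.0 (actual distance dominates)
print(core_distance(points, 4, eps=5.0, min_pts=3))             # None
```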
• Index-based:
– k = number of dimensions
– N = 20
– p = 75%
– M = N(1-p) = 5
– Complexity: O(kN^2)
• Core Distance
• Reachability Distance
OPTICS: Some Extension from DBSCAN
[Figure: objects p1 and p2 around a core object o in dataset D, with MinPts = 5 and ε = 3 cm]
r(p, o) = max(core-distance(o), d(o, p))
r(p1, o) = 2.8 cm, r(p2, o) = 4 cm
• Efficiency issues with DBSCAN
• Finding clusters in subspaces
• Modeling density accurately
We now look at:
• Grid-based clustering
– Partitions data space into grid cells and forms clusters from cells that are dense enough
– Efficient approach for low-dimensional data
• Subspace clustering
– Finds clusters in subsets of all dimensions
– 2^n - 1 subspaces to be searched!!!
Density-based Clustering Contd…
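The 2^n - 1 figure for subspace clustering counts every non-empty subset of the n dimensions. A small sketch, with a hypothetical helper name and dimension labels chosen only for illustration:

```python
from itertools import combinations

def nonempty_subspaces(dims):
    """All non-empty subsets of the given dimension names."""
    return [subset
            for size in range(1, len(dims) + 1)
            for subset in combinations(dims, size)]

subspaces = nonempty_subspaces(["x", "y", "z"])
print(len(subspaces))  # 7 == 2**3 - 1
print(2 ** 20 - 1)     # 1048575 subspaces for only 20 dimensions
```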
• GRIDCLUS
• STING
• CLIQUE
• WaveCluster
Grid-based Clustering
• Significant reduction in time complexity, especially for large data sets
• Number of cells << number of data points
• Instead of clustering data points, the neighborhoods surrounding the data points are clustered
Grid-based Clustering
Steps involved:
1. Creating the grid structure
2. Calculating cell density for each cell
3. Sorting of the cells according to their densities
4. Identifying cluster centers
5. Traversal of neighborhood cells
Grid-based Clustering
Algorithm:
1. Define a set of grid cells
2. Assign objects to appropriate grid cells and compute the density of each cell
3. Eliminate cells having density below a specified threshold
4. Form clusters from contiguous groups of dense cells
Grid-based Clustering
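The four-step algorithm can be sketched as follows, assuming equal-width cells (so the point count stands in for density) and treating cells that share a face as contiguous. The function names, the label convention (-1 for points in discarded cells), and the toy data are assumptions made for this illustration.

```python
import math
from collections import defaultdict

def face_neighbors(cell):
    """Cells that differ from `cell` by one step along exactly one axis."""
    for axis in range(len(cell)):
        for step in (-1, 1):
            yield cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]

def grid_cluster(points, cell_width, min_count):
    """Return one cluster label per point; -1 for points in non-dense cells."""
    # Steps 1-2: assign points to cells; only occupied cells are stored.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(c / cell_width) for c in p)
        cells[key].append(idx)
    # Step 3: keep cells whose count reaches the threshold.
    dense = {key for key, members in cells.items() if len(members) >= min_count}
    # Step 4: contiguous dense cells form one cluster.
    cell_label = {}
    cluster = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = cluster
        stack = [start]
        while stack:
            cur = stack.pop()
            for nb in face_neighbors(cur):
                if nb in dense and nb not in cell_label:
                    cell_label[nb] = cluster
                    stack.append(nb)
        cluster += 1
    labels = [-1] * len(points)
    for key, c in cell_label.items():
        for idx in cells[key]:
            labels[idx] = c
    return labels

# Hypothetical toy data.
points = [(0.1, 0.1), (0.2, 0.8), (1.1, 0.5), (1.9, 0.2),
          (5.5, 5.5), (5.6, 5.1), (3.5, 3.5)]
print(grid_cluster(points, cell_width=1.0, min_count=2))  # [0, 0, 0, 0, 1, 1, -1]
```

Because only occupied cells are kept in the dictionary, the neighbor lookup is a hash probe per candidate cell rather than a scan of the whole grid.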
• Defining Grid Cells
– Key step
– Equal width intervals along all dimensions
  • Each cell has same volume
  • Density of cell is defined as no. of points in cell
– Alternatively, equi-depth approach can be used
  • Equal number of points in each interval
  • Called equal frequency discretization
– MAFIA: subspace clustering algorithm that initially uses equal width intervals and then combines intervals of similar density
• Definition of grid has strong impact on clustering results
Grid-based Clustering
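The difference between equal-width and equal-depth intervals along one dimension can be seen in a short sketch. The function names and the sample values are hypothetical, and the equal-depth version uses a simple rank-based cut that assumes the number of values is a multiple of k.

```python
def equal_width_edges(values, k):
    """k intervals of identical width between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def equal_depth_edges(values, k):
    """k intervals holding (roughly) the same number of values each."""
    s = sorted(values)
    n = len(s)
    return [s[0]] + [s[(i * n) // k] for i in range(1, k)] + [s[-1]]

# Hypothetical values along one dimension, with one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 100]
print(equal_width_edges(values, 4))  # [1.0, 25.75, 50.5, 75.25, 100.0]
print(equal_depth_edges(values, 4))  # [1, 3, 5, 7, 100]
```

With equal width, seven of the eight values fall into the first interval; with equal depth, each interval holds two values. This is one way the grid definition can change the clustering result.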
• Density of Grid Cells
– No. of points in the cell divided by the volume of the cell
  • No. of road signs per km
  • No. of tigers in a sq. km
  • No. of molecules of a gas in cu. cm
Grid-based Clustering
Source of figure: Introduction to Data Mining by Tan et al.
• Forming Clusters from dense grid cells
– Relatively straightforward
– In the example on previous slide: 2 clusters
– Define adjacency
  • 4 or 8 adjacent cells in 2-D?
  • Efficient technique to find adjacent cells (only occupied cells are stored)
– Partially empty cells on the fringe of clusters are not dense and will be discarded
– 4 parts of the larger cluster will be lost if the threshold is 9
Grid-based Clustering
Source of figure: Introduction to Data Mining by Tan et al.
• Strengths & Limitations
– Single pass is enough to determine the cell and count of every cell
– Grid cells created only for non-empty cells
– Complexity of O(m) to O(m log m)
– Grids are rectangular
– Curse of dimensionality
– Grid cells containing just one element
Grid-based Clustering
Source of figure: Introduction to Data Mining by Tan et al.
• Clustering algorithms considered so far take into account all attributes
• Consider only a subspace of data
Subspace Clustering
Source of figure: Introduction to Data Mining by Tan et al.
Subspace Clustering
Source of figure: Introduction to Data Mining by Tan et al.
• Ensemble Clustering
• Parallelizing Clustering Algorithms to leverage a Cluster
Some Research Directions
• Similar to Ensemble Classification
• Consensus Clustering
• Obtain different clustering solutions and then reconcile them
Ensemble Clustering
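The slides do not say how the different solutions are reconciled. One common approach, shown here purely as an assumed illustration, is a co-association matrix: count how often each pair of points lands in the same cluster across runs, then group points that agree in more than a chosen fraction of runs. The function names, the 0.5 threshold, and the three example labelings are all hypothetical.

```python
def co_association(labelings):
    """Fraction of labelings in which each pair of points shares a cluster."""
    n = len(labelings[0])
    m = len(labelings)
    return [[sum(lab[i] == lab[j] for lab in labelings) / m for j in range(n)]
            for i in range(n)]

def consensus(labelings, threshold=0.5):
    """Group points whose co-association exceeds the threshold (connected components)."""
    co = co_association(labelings)
    n = len(co)
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            cur = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[cur][j] > threshold:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels

# Three hypothetical clustering runs over five points; cluster ids are arbitrary per run.
runs = [[0, 0, 1, 1, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 1, 1]]
print(consensus(runs))  # [0, 0, 1, 1, 1]
```

Comparing pairs rather than raw labels sidesteps the fact that cluster ids from different runs have no fixed correspondence.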
• Parallelize to leverage a cluster
• Two levels of parallelism
– Node Level
– Core Level
• Not Necessarily Orthogonal
• Hybrid – Non Trivial
• Programming Environment:
– MPI
– OpenMP
Parallelizing Clustering Algorithms