Review on Density-based Clustering - DBSCAN, DenClue & GRID


  • Slide 1: Clustering

    Density-based clustering

    Abraham Otero Quintana, Ph.D. Madrid, July 5th, 2010

  • Slide 2: Unsupervised Pattern Recognition (Clustering) 2/20

    Course outline: 3. Density-based clustering

    3.1. DBSCAN (Density Based Spatial Clustering of Applications with Noise)

    3.2. Grid Clustering
    3.3. DENCLUE (DENsity CLUstEring)
    3.4. More algorithms

    For an overview of these techniques please read Tan2006 and Berkhin2002 from /Docs. Some of the slides shown here are taken from the publicly available repository of the same book. Source: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

  • Slide 3

    3. Density-based clustering

    A cluster is a dense region of points, separated from other regions of high density by low-density regions.

    Used when the clusters are irregular or intertwined, and when noise and outliers are present.

    [Figure: six density-based clusters]

    Density-based clustering tries to identify those dense (highly populated) regions of the multidimensional space and separate them from other dense regions. For a review, please read Tan2006 from /Docs and Ester1996 from /Docs.

  • Slide 4

    3.1 DBSCAN: Definitions

    DBSCAN is based on the following definitions:

    A point is a core point if it has more than a specified number of points (MinPts) within a radius Eps (these points are in the interior of a cluster).

    A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point.

    A noise point is any point that is neither a core point nor a border point.
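    These definitions can be put into code. A minimal sketch (illustrative, not the authors' implementation), assuming Euclidean distance and counting a point as its own neighbor:

    ```python
    import numpy as np

    def classify_points(X, eps, min_pts):
        """Label each point as 'core', 'border' or 'noise' (DBSCAN definitions)."""
        n = len(X)
        # Pairwise Euclidean distances
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        neighbors = d <= eps                     # each point is its own neighbor
        core = neighbors.sum(axis=1) >= min_pts  # MinPts counts the point itself here
        labels = []
        for i in range(n):
            if core[i]:
                labels.append('core')
            elif neighbors[i][core].any():       # within Eps of some core point
                labels.append('border')
            else:
                labels.append('noise')
        return labels

    X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                  [0.15, 0.05], [3.0, 3.0]])
    print(classify_points(X, eps=0.2, min_pts=4))  # last point is noise
    ```
    
    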

  • Slide 5

    3.1 DBSCAN: Algorithm

    1. Classify points as core, border, or noise.
    2. Eliminate noise points.
    3. Perform clustering on the remaining points.


    Demo: http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
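    The three steps above can be sketched in code (an illustrative implementation, not the original one), assuming Euclidean distance:

    ```python
    import numpy as np

    def dbscan(X, eps, min_pts):
        """Classify points, drop noise, and cluster the rest by core-point expansion."""
        n = len(X)
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        nbrs = d <= eps
        core = nbrs.sum(axis=1) >= min_pts       # MinPts counts the point itself
        labels = -np.ones(n, dtype=int)          # -1 = noise
        cid = 0
        for i in range(n):
            if not core[i] or labels[i] != -1:
                continue
            # Grow a cluster from core point i
            labels[i] = cid
            stack = [i]
            while stack:
                j = stack.pop()
                for k in np.nonzero(nbrs[j])[0]:
                    if labels[k] == -1:
                        labels[k] = cid          # border or core point joins
                        if core[k]:
                            stack.append(k)      # only core points expand further
            cid += 1
        return labels

    X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
                  [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1], [9, 9]])
    print(dbscan(X, eps=0.2, min_pts=4))  # two clusters plus one noise point
    ```
    
    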

  • Slide 6

    3.1 DBSCAN: Example

    [Figure: original points, their classification into core, border, and noise point types (Eps = 10, MinPts = 4), and the resulting clustering]

  • Slide 7

    3.1 DBSCAN: Example

    [Figure: original points and the clusters found with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]

    Features: resistant to noise; can handle clusters of different shapes and sizes.

    But: varying densities; high-dimensional data.

    As we have seen, DBSCAN is quite insensitive to outliers and can handle non-globular shapes. However, DBSCAN is not a panacea: it is rather sensitive to varying densities and usually does not work well with high-dimensional data, since in such spaces samples are much more sparse.

  • Slide 8

    3.1 DBSCAN: Example

    Pixels are represented as 6-dimensional vectors (location + color) and segmented using DBSCAN. The full study can be seen at Ye2003 in the course CD.

  • Slide 9

    3.1 DBSCAN: Parameter determination

    For MinPts, a small number is usually employed. For two-dimensional experimental data it has been shown that 4 is the most reasonable value.

    Eps is trickier, as we have seen. A possible solution follows.

  • Slide 10

    3.1 DBSCAN: Parameter determination

    The idea is that for points in a cluster, the kth nearest neighbors are at roughly the same distance, while noise points have their kth nearest neighbor at a farther distance. So, plot the sorted distance of every point to its kth nearest neighbor.

    [Figure: sorted k-dist plot, annotated with a reasonable Eps and a reasonable MinPts for 2D data]

    This algorithm is rather simple, but it strongly depends on the parameters MinPts and Eps. MinPts is usually a low number (for 2D data it has been experimentally shown that 4 is a reasonable value). Then Eps can be easily determined by sorting the distance to the 4th closest point of every point: noise points tend to be far from all the rest.
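    The heuristic above can be sketched as follows; the data and the choice k = 4 are illustrative:

    ```python
    import numpy as np

    def sorted_kdist(X, k=4):
        """Sorted distance of every point to its k-th nearest neighbour."""
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        d.sort(axis=1)                 # column 0 is each point's distance to itself (0)
        return np.sort(d[:, k])        # k-th nearest neighbour distance, ascending

    rng = np.random.default_rng(0)
    cluster = rng.normal(0.0, 0.5, size=(50, 2))   # one tight cluster
    noise = rng.uniform(-10, 10, size=(5, 2))      # scattered noise points
    kd = sorted_kdist(np.vstack([cluster, noise]), k=4)
    # Noise points tend to sit at the steep tail of the curve; a value just
    # before the jump is a reasonable Eps.
    print(kd[:3], kd[-3:])
    ```
    
    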

  • Slide 11

    3.2 Grid clustering

    Basic algorithm:
    1. Define a set of grid cells.
    2. Compute the density of cells.
    3. Eliminate cells with a density smaller than a threshold.
    4. Form clusters from contiguous cells.

    Those wanting to know more about grid clustering, please read Hinneburg1999 from /Docs.

    The basic algorithm for grid clustering is rather simple: form clusters with contiguous dense cells. However, in this definition there are a number of ambiguous things:
    - How to define cells: regular/irregular grids, cell size (too large is not accurate, too small may leave cells empty).
    - How to define the threshold: it depends on the cell size and the dimensionality of the data.
    - What kind of adjacency is considered: for instance, in 2D, 4 or 8 neighbours.

    Grid clustering is the basic idea behind many other clustering algorithms: WaveCluster, Bang, Clique, and Mafia.
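    The four steps can be sketched on 2-D data (an illustrative implementation using a regular grid and 4-neighbour adjacency, one of the ambiguous choices mentioned above; cell size and threshold are illustrative too):

    ```python
    import numpy as np
    from collections import deque

    def grid_cluster(X, cell_size, min_count):
        # 1. Assign each point to a grid cell
        cells = {}
        for p in X:
            key = tuple((p // cell_size).astype(int))
            cells.setdefault(key, []).append(p)
        # 2-3. Compute cell densities and keep only dense cells
        dense = {k for k, pts in cells.items() if len(pts) >= min_count}
        # 4. Form clusters from contiguous dense cells (4-adjacency)
        labels, cid = {}, 0
        for start in dense:
            if start in labels:
                continue
            queue = deque([start])
            labels[start] = cid
            while queue:
                cx, cy = queue.popleft()
                for nb in [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]:
                    if nb in dense and nb not in labels:
                        labels[nb] = cid
                        queue.append(nb)
            cid += 1
        return labels

    X = np.array([[0.1, 0.1], [0.2, 0.2], [0.3, 0.1],   # dense cell (0, 0)
                  [1.1, 0.1], [1.2, 0.3], [1.3, 0.2],   # adjacent dense cell (1, 0)
                  [5.5, 5.5]])                          # sparse cell, eliminated
    print(grid_cluster(X, cell_size=1.0, min_count=3))  # both dense cells share a label
    ```
    
    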

  • Slide 12

    3.3 DENCLUE: Definitions

    Influence function:

    f^{D}_{kernel}(x) = \sum_{i=1}^{n} f_{kernel}(x, x_i) = \sum_{i=1}^{n} e^{-\frac{dist(x, x_i)^2}{2\sigma^2}}

    This algorithm estimates the local density of the input data in a way very similar to kernel probability density function estimators. The kernel, here called the influence function, is copied to each data position, yielding the density function. Local maxima of the density function are called density attractors.

    Those interested in the original paper of DENCLUE may read Hinneburg1998 from /Docs. Those interested in knowing more about probability density function (PDF) estimators, please read Raykar2002 from /Docs.
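    The density estimate above (a Gaussian influence function summed over all data points) can be sketched as follows; sigma is the user-chosen kernel width and the data are illustrative:

    ```python
    import numpy as np

    def density(x, data, sigma):
        """f_kernel^D(x) = sum_i exp(-dist(x, x_i)^2 / (2 sigma^2))"""
        d2 = np.sum((data - x) ** 2, axis=1)   # squared distances to all points
        return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

    data = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [4.0, 4.0]])
    # The density is higher near the tight group of three points than near
    # the isolated point, so a density attractor lies near the group.
    print(density(np.array([0.1, 0.1]), data, sigma=0.5),
          density(np.array([4.0, 4.0]), data, sigma=0.5))
    ```
    
    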

  • Slide 13

    3.3 DENCLUE: Clustering

    Center-defined cluster: \{\, x \mid f^{D}_{kernel}(x) \ge \xi \,\}

    Multicenter-defined cluster: \{\, x \mid f^{D}_{kernel}(x) \ge \xi \,\}; multicenter-defined clusters are a set of center-defined clusters linked by a path of significance.

    Generalizes hierarchical clustering!

    Clusters are formed by a level of significance \xi. For knowing more about the connection between DENCLUE clustering and Level Set methods, please read Yip from /Docs.

  • Slide 14

    3.3 DENCLUE: Algorithm

    1. Grid the data set (use r = \sigma, the std. dev.).
    2. Find (highly) populated cells (use a threshold = c) (shown in blue).
    3. Identify populated cells (+ nonempty cells).
    4. Find density attractor points, C*, using hill climbing:
       1. Randomly pick a point, p_i.
       2. Compute the local density (use r = 4).
       3. Pick another point, p_{i+1}, close to p_i, and compute the local density at p_{i+1}.
       4. If LocDen(p_i) < LocDen(p_{i+1}), climb.
       5. Put all points within distance \sigma/2 of the path p_i, p_{i+1}, ..., C* into a density attractor cluster called C*.
    5. Connect the density attractor clusters using a threshold, \xi, on the local densities of the attractors.
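    The hill-climbing step above can be sketched as gradient ascent on the Gaussian kernel density estimate (an illustrative implementation; the step size, tolerance, and data are assumptions, not from the slides):

    ```python
    import numpy as np

    def density(x, data, sigma):
        d2 = np.sum((data - x) ** 2, axis=1)
        return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

    def climb(x, data, sigma, step=0.1, tol=1e-6, max_iter=500):
        """Hill-climb from x to a density attractor of the kernel density."""
        x = x.astype(float)
        for _ in range(max_iter):
            # Gradient of the Gaussian kernel density estimate at x
            d2 = np.sum((data - x) ** 2, axis=1)
            w = np.exp(-d2 / (2 * sigma ** 2))
            grad = np.sum(w[:, None] * (data - x), axis=0) / sigma ** 2
            x_new = x + step * grad
            # Stop when the density no longer improves meaningfully
            if density(x_new, data, sigma) - density(x, data, sigma) < tol:
                break
            x = x_new
        return x

    data = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]])
    attractor = climb(np.array([1.0, 1.0]), data, sigma=0.5)
    print(attractor)   # converges near the centre of the four points (~[0.1, 0.1])
    ```
    
    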

  • Slide 15

    3.3 DENCLUE: Examples

    In the slide we show a couple of examples of how DENCLUE clusters data, according to the algorithm presented in Hinneburg1998.

  • Slide 16

    3.3 DENCLUE: Examples

    In the slide we show a couple of examples of how DENCLUE clusters data, according to the algorithm presented in Yip.

  • Slide 17

    3.3 DENCLUE: Examples

    In the slide we show a couple of examples of how DENCLUE clusters data, according to the algorithm presented in Yip.

  • Slide 18

    3.3 DENCLUE: Features

    - Dependence on the kernel width
    - It generalizes DBSCAN, k-means, and hierarchical clustering
    - Very efficient implementation

    DENCLUE has a few positive features; however, it is not free from drawbacks, such as its dependency on a user-defined parameter.

  • Slide 19

    3.4 DBC: More algorithms

    Generalized DBSCAN: any divergence function can be used, and points within a neighbourhood are weighted according to their similarity to the core point.

    - Fuzzy DBSCAN: fuzzy distance between fuzzy input vectors
    - DBCLASD: assumes uniform density, no parameters required
    - Recursive DBC: adaptive change of DBSCAN parameters
    - WaveCluster: uses wavelets to determine multiresolution clusters
    - Optics: equivalent to DBC with a wide range of parameters
    - Knn DBC: assigns the cluster label taking into account the k nearest neighbours
    - KerdenSOM: self-organizing structure on the density estimation
    - STING (STatistical INformation Grid): quadtree space division, very efficient
    - Information Theoretic Clustering: measures the distance between cluster distributions using information theory

    For knowing more about:
    - Generalized DBSCAN, please read Sander1998 from /Docs.
    - Fuzzy DBSCAN, please read Kriegel2005 from /Docs.
    - DBCLASD, please read Xu1998 from /Docs.
    - Recursive DBC, please read Su2001 from /Docs.
    - WaveCluster, please read Sheikholeslami1997 from /Docs.
    - Optics, please read Ankerst1999 from /Docs.
    - Knn DBC, please read Tran2003 from /Docs.
    - KerdenSOM, please read Pascual2001 from /Docs.
    - STING, please read Wang1997 from /Docs.

  • Slide 20

    Course outline: 3. Density-based clustering

    3.1. DBSCAN (Density Based Spatial Clustering of Applications with Noise)

    3.2. Grid Clustering
    3.3. DENCLUE (DENsity CLUstEring)
    3.4. More algorithms

    For an overview of these techniques please read Tan2006 and Berkhin2002 from /Docs. Some of the slides shown here are taken from the publicly available repository of the same book. Source: http://www-users.cs.umn.edu/~kumar/dmbook/index.php