Review on Density-based Clustering - DBSCAN, DenClue & GRID


  • Slide 1: Clustering

    Density-based clustering

    Abraham Otero Quintana, Ph.D. Madrid, July 5th, 2010

  • Slide 2: Unsupervised Pattern Recognition (Clustering) 2/20

    Course outline: 3. Density-based clustering

    3.1. DBSCAN (Density Based Spatial Clustering of Applications with Noise)

    3.2. Grid Clustering
    3.3. DENCLUE (DENsity CLUstEring)
    3.4. More algorithms

    For an overview of these techniques please read Tan2006 and Berkhin2002 from /Docs. Some of the slides shown here are taken from the publicly available repository of the same book. Source: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

  • Slide 3

    3. Density-based clustering

    A cluster is a dense region of points, separated from other regions of high density by low-density regions.

    Used when the clusters are irregular or intertwined, and when noise and outliers are present.

    [Figure: six density-based clusters]

    Density-based clustering tries to identify those dense (highly populated) regions of the multidimensional space and separate them from other dense regions. For a review, please read Tan2006 from /Docs and Ester1996 from /Docs.

  • Slide 4

    3.1 DBSCAN: Definitions

    DBSCAN is based on the following definitions:

    A point is a core point if it has more than a specified number of points (MinPts) within a radius Eps (these points are in the interior of a cluster).

    A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point.

    A noise point is any point that is neither a core point nor a border point.
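    These definitions can be put into code. A minimal sketch (illustrative, not the authors' implementation), assuming Euclidean distance and counting a point as its own neighbor:

    ```python
    import numpy as np

    def classify_points(X, eps, min_pts):
        """Label each point as 'core', 'border' or 'noise' (DBSCAN definitions)."""
        n = len(X)
        # Pairwise Euclidean distances
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        neighbors = d <= eps                     # each point is its own neighbor
        core = neighbors.sum(axis=1) >= min_pts  # MinPts counts the point itself here
        labels = []
        for i in range(n):
            if core[i]:
                labels.append('core')
            elif neighbors[i][core].any():       # within Eps of some core point
                labels.append('border')
            else:
                labels.append('noise')
        return labels

    X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                  [0.15, 0.05], [3.0, 3.0]])
    print(classify_points(X, eps=0.2, min_pts=4))  # last point is noise
    ```
    
    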

  • Slide 5

    3.1 DBSCAN: Algorithm

    1. Classify points as core, border, or noise.
    2. Eliminate noise points.
    3. Perform clustering on the remaining points.


    Demo: http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
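    The three steps above can be sketched in code (an illustrative implementation, not the original one), assuming Euclidean distance:

    ```python
    import numpy as np

    def dbscan(X, eps, min_pts):
        """Classify points, drop noise, and cluster the rest by core-point expansion."""
        n = len(X)
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        nbrs = d <= eps
        core = nbrs.sum(axis=1) >= min_pts       # MinPts counts the point itself
        labels = -np.ones(n, dtype=int)          # -1 = noise
        cid = 0
        for i in range(n):
            if not core[i] or labels[i] != -1:
                continue
            # Grow a cluster from core point i
            labels[i] = cid
            stack = [i]
            while stack:
                j = stack.pop()
                for k in np.nonzero(nbrs[j])[0]:
                    if labels[k] == -1:
                        labels[k] = cid          # border or core point joins
                        if core[k]:
                            stack.append(k)      # only core points expand further
            cid += 1
        return labels

    X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
                  [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1], [9, 9]])
    print(dbscan(X, eps=0.2, min_pts=4))  # two clusters plus one noise point
    ```
    
    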

  • Slide 6

    3.1 DBSCAN: Example

    [Figure: original points, their classification into core, border, and noise point types (Eps = 10, MinPts = 4), and the resulting clustering]

  • Slide 7

    3.1 DBSCAN: Example

    [Figure: original points and the clusters found with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]

    Features: resistant to noise; can handle clusters of different shapes and sizes.

    But: varying densities; high-dimensional data.

    As we have seen, DBSCAN is quite insensitive to outliers and can handle non-globular shapes. However, DBSCAN is not a panacea: it is rather sensitive to varying densities and usually does not work well with high-dimensional data, since in such spaces samples are much more sparse.

  • Slide 8

    3.1 DBSCAN: Example

    Pixels are represented as 6-dimensional vectors (location + color) and segmented using DBSCAN. The full study can be seen at Ye2003 in the course CD.

  • Slide 9

    3.1 DBSCAN: Parameter determination

    For MinPts, a small number is usually employed. For two-dimensional experimental data it has been shown that 4 is the most reasonable value.

    Eps is trickier, as we have seen. A possible solution follows.

  • Slide 10

    3.1 DBSCAN: Parameter determination

    The idea is that for points in a cluster, the kth nearest neighbors are at roughly the same distance, while noise points have their kth nearest neighbor at a farther distance. So, plot the sorted distance of every point to its kth nearest neighbor.

    [Figure: sorted k-dist plot, annotated with a reasonable Eps and a reasonable MinPts for 2D data]

    This algorithm is rather simple, but it strongly depends on the parameters MinPts and Eps. MinPts is usually a low number (for 2D data it has been experimentally shown that 4 is a reasonable value). Then Eps can be easily determined by sorting the distance to the 4th closest point of every point: noise points tend to be far from all the rest.
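    The heuristic above can be sketched as follows; the data and the choice k = 4 are illustrative:

    ```python
    import numpy as np

    def sorted_kdist(X, k=4):
        """Sorted distance of every point to its k-th nearest neighbour."""
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        d.sort(axis=1)                 # column 0 is each point's distance to itself (0)
        return np.sort(d[:, k])        # k-th nearest neighbour distance, ascending

    rng = np.random.default_rng(0)
    cluster = rng.normal(0.0, 0.5, size=(50, 2))   # one tight cluster
    noise = rng.uniform(-10, 10, size=(5, 2))      # scattered noise points
    kd = sorted_kdist(np.vstack([cluster, noise]), k=4)
    # Noise points tend to sit at the steep tail of the curve; a value just
    # before the jump is a reasonable Eps.
    print(kd[:3], kd[-3:])
    ```
    
    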

  • Slide 11

    3.2 Grid clustering

    Basic algorithm:
    1. Define a set of grid cells.
    2. Compute the density of cells.
    3. Eliminate cells with a density smaller than a threshold.
    4. Form clusters from contiguous cells.

    Those wanting to know more about grid clustering, please read Hinneburg1999 from /Docs.

    The basic algorithm for grid clustering is rather simple: form clusters with contiguous dense cells. However, in this definition there are a number of ambiguous things:
    - How to define cells: regular/irregular grids, cell size (too large is not accurate, too small may leave cells empty).
    - How to define the threshold: it depends on the cell size and the dimensionality of the data.
    - What kind of adjacency is considered: for instance, in 2D, 4 or 8 neighbours.

    Grid clustering is the basic idea behind many other clustering algorithms: WaveCluster, Bang, Clique, and Mafia.
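    The four steps can be sketched on 2-D data (an illustrative implementation using a regular grid and 4-neighbour adjacency, one of the ambiguous choices mentioned above; cell size and threshold are illustrative too):

    ```python
    import numpy as np
    from collections import deque

    def grid_cluster(X, cell_size, min_count):
        # 1. Assign each point to a grid cell
        cells = {}
        for p in X:
            key = tuple((p // cell_size).astype(int))
            cells.setdefault(key, []).append(p)
        # 2-3. Compute cell densities and keep only dense cells
        dense = {k for k, pts in cells.items() if len(pts) >= min_count}
        # 4. Form clusters from contiguous dense cells (4-adjacency)
        labels, cid = {}, 0
        for start in dense:
            if start in labels:
                continue
            queue = deque([start])
            labels[start] = cid
            while queue:
                cx, cy = queue.popleft()
                for nb in [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]:
                    if nb in dense and nb not in labels:
                        labels[nb] = cid
                        queue.append(nb)
            cid += 1
        return labels

    X = np.array([[0.1, 0.1], [0.2, 0.2], [0.3, 0.1],   # dense cell (0, 0)
                  [1.1, 0.1], [1.2, 0.3], [1.3, 0.2],   # adjacent dense cell (1, 0)
                  [5.5, 5.5]])                          # sparse cell, eliminated
    print(grid_cluster(X, cell_size=1.0, min_count=3))  # both dense cells share a label
    ```
    
    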

  • Slide 12

    3.3 DENCLUE: Definitions

    Influence function:

    f^{D}_{kernel}(x) = \sum_{i=1}^{n} f_{kernel}(x, x_i) = \sum_{i=1}^{n} e^{-\frac{dist(x, x_i)^2}{2\sigma^2}}

    This algorithm estimates the local density of the input data in a way very similar to kernel probability density function estimators. The kernel, here called the influence function, is copied to each data position, yielding the density function. Local maxima of the density function are called density attractors.

    Those interested in the original paper of DENCLUE may read Hinneburg1998 from /Docs. Those interested in knowing more about probability density function (PDF) estimators, please read Raykar2002 from /Docs.
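    The density estimate above (a Gaussian influence function summed over all data points) can be sketched as follows; sigma is the user-chosen kernel width and the data are illustrative:

    ```python
    import numpy as np

    def density(x, data, sigma):
        """f_kernel^D(x) = sum_i exp(-dist(x, x_i)^2 / (2 sigma^2))"""
        d2 = np.sum((data - x) ** 2, axis=1)   # squared distances to all points
        return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

    data = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [4.0, 4.0]])
    # The density is higher near the tight group of three points than near
    # the isolated point, so a density attractor lies near the group.
    print(density(np.array([0.1, 0.1]), data, sigma=0.5),
          density(np.array([4.0, 4.0]), data, sigma=0.5))
    ```
    
    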

  • Slide 13

    3.3 DENCLUE: Clustering

    Center-defined cluster: \{\, x \mid f^{D}_{kernel}(x) \ge \xi \,\}

    Multicenter-defined cluster: \{\, x \mid f^{D}_{kernel}(x) \ge \xi \,\}; multicenter-defined clusters are a set of center-defined clusters linked by a path of significance.

    Generalizes hierarchical clustering!

    Clusters are formed by a level of significance \xi. For knowing more about the connection between DENCLUE clustering and Level Set methods, please read Yip from /Docs.

  • Slide 14

    3.3 DENCLUE: Algorithm

    1. Grid the data set (use r = \sigma, the std. dev.).
    2. Find (highly) populated cells (use a threshold = c) (shown in blue).
    3. Identify populated cells (+ nonempty cells).
    4. Find density attractor points, C*, using hill climbing:
       1. Randomly pick a point, p_i.
       2. Compute the local density (use r = 4).
       3. Pick another point, p_{i+1}, close to p_i, and compute the local density at p_{i+1}.
       4. If LocDen(p_i) < LocDen(p_{i+1}), climb.
       5. Put all points within distance \sigma/2 of the path p_i, p_{i+1}, ..., C* into a density attractor cluster called C*.
    5. Connect the density attractor clusters using a threshold, \xi, on the local densities of the attractors.
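    The hill-climbing step above can be sketched as gradient ascent on the Gaussian kernel density estimate (an illustrative implementation; the step size, tolerance, and data are assumptions, not from the slides):

    ```python
    import numpy as np

    def density(x, data, sigma):
        d2 = np.sum((data - x) ** 2, axis=1)
        return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

    def climb(x, data, sigma, step=0.1, tol=1e-6, max_iter=500):
        """Hill-climb from x to a density attractor of the kernel density."""
        x = x.astype(float)
        for _ in range(max_iter):
            # Gradient of the Gaussian kernel density estimate at x
            d2 = np.sum((data - x) ** 2, axis=1)
            w = np.exp(-d2 / (2 * sigma ** 2))
            grad = np.sum(w[:, None] * (data - x), axis=0) / sigma ** 2
            x_new = x + step * grad
            # Stop when the density no longer improves meaningfully
            if density(x_new, data, sigma) - density(x, data, sigma) < tol:
                break
            x = x_new
        return x

    data = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]])
    attractor = climb(np.array([1.0, 1.0]), data, sigma=0.5)
    print(attractor)   # converges near the centre of the four points (~[0.1, 0.1])
    ```
    
    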

  • Slide 15

    3.3 DENCLUE: Examples

    In the slide we show a couple of examples of how DENCLUE clusters data, according to the algorithm presented in Hinneburg1998.

  • Slide 16

    3.3 DENCLUE: Examples

    In the slide we show a couple of examples of how DENCLUE clusters data, according to the algorithm presented in Yip.

  • Slide 17

    3.3 DENCLUE: Examples

    In the slide we show a couple of examples of how DENCLUE clusters data, according to the algorithm presented in Yip.

  • Slide 18

    3.3 DENCLUE: Features

    - Dependence on the kernel width
    - It generalizes DBSCAN, k-means, and hierarchical clustering
    - Very efficient implementation

    DENCLUE has a few positive features; however, it is not free from drawbacks, such as its dependency on a user-defined parameter.

  • Slide 19

    3.4 DBC: More algorithms

    Generalized DBSCAN: any divergence function can be used, and points within a neighbourhood are weighted according to their similarity to the core point.

    - Fuzzy DBSCAN: fuzzy distance between fuzzy input vectors
    - DBCLASD: assumes uniform density, no parameters required
    - Recursive DBC: adaptive change of DBSCAN parameters
    - WaveCluster: uses wavelets to determine multiresolution clusters
    - Optics: equivalent to DBC with a wide range of parameters
    - Knn DBC: assigns the cluster label taking into account the k nearest neighbours
    - KerdenSOM: self-organizing structure on the density estimation
    - STING (STatistical INformation Grid): quadtree space division, very efficient
    - Information Theoretic Clustering: measures the distance between cluster distributions using information theory

    For knowing more about:
    - Generalized DBSCAN, please read Sander1998 from /Docs.
    - Fuzzy DBSCAN, please read Kriegel2005 from /Docs.
    - DBCLASD, please read Xu1998 from /Docs.
    - Recursive DBC, please read Su2001 from /Docs.
    - WaveCluster, please read Sheikholeslami1997 from /Docs.
    - Optics, please read Ankerst1999 from /Docs.
    - Knn DBC, please read Tran2003 from /Docs.
    - KerdenSOM, please read Pascual2001 from /Docs.
    - STING, please read Wang1997 from /Docs.

  • Slide 20

    Course outline: 3. Density-based clustering

    3.1. DBSCAN (Density Based Spatial Clustering of Applications with Noise)

    3.2. Grid Clustering
    3.3. DENCLUE (DENsity CLUstEring)
    3.4. More algorithms

    For an overview of these techniques please read Tan2006 and Berkhin2002 from /Docs. Some of the slides shown here are taken from the publicly available repository of the same book. Source: http://www-users.cs.umn.edu/~kumar/dmbook/index.php