
Clustering

Prof. Navneet Goyal
BITS, Pilani

Density-based methods

Based on connectivity and density functions

Filter out noise, find clusters of arbitrary shape

Grid-based methods

Quantize the object space into a grid structure

Other Approaches to Clustering

Density-Based Clustering Methods

Major features:
• Discover clusters of arbitrary shape
• Handle noise
• One scan
• Need density parameters as termination condition

Several interesting studies:

DBSCAN: Ester et al. (KDD’96)

OPTICS: Ankerst et al. (SIGMOD’99)

DENCLUE: Hinneburg & Keim (KDD’98)

CLIQUE: Agrawal et al. (SIGMOD’98)

Density-Based Spatial Clustering of Applications with Noise

Clusters are dense regions of objects separated by regions of low density (noise)

Outliers do not affect the creation of clusters

Input:
– MinPts – minimum number of points in any cluster
– Eps – for each point in a cluster there must be another point in the cluster less than this distance away

Density-Based Method: DBSCAN

• Eps-neighborhood: Points within Eps distance of a point.

• Core point: Eps-neighborhood dense enough (MinPts)

• Directly density-reachable: A point p is directly density-reachable from a point q if the distance is small (Eps) and q is a core point.

• Density-reachable: A point is density-reachable from another point if there is a path from one to the other consisting only of core points.

DBSCAN Density Concepts

Density-Based Method: DBSCAN

Eps-neighborhood: points within Eps distance of a point.
NEps(p) = {q belongs to D | dist(p, q) <= Eps}

Core point: a point whose Eps-neighborhood is dense enough (contains at least MinPts points).

Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if

1) p belongs to NEps(q)

2) core point condition: |NEps(q)| >= MinPts

[Figure: p directly density-reachable from q; MinPts = 5, Eps = 1 cm]
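The two definitions above translate directly into code. A minimal sketch, assuming 2-D points, Euclidean distance, and illustrative function names (not from the slides):

```python
import math

def eps_neighborhood(D, p, eps):
    """NEps(p) = {q in D | dist(p, q) <= eps}; note p itself is included."""
    return [q for q in D if math.dist(p, q) <= eps]

def is_core(D, p, eps, min_pts):
    """Core-point condition: |NEps(p)| >= MinPts."""
    return len(eps_neighborhood(D, p, eps)) >= min_pts

# Three nearby points and one distant point (made-up data):
D = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(eps_neighborhood(D, (0, 0), 1.0))   # → [(0, 0), (0, 1), (1, 0)]
print(is_core(D, (0, 0), 1.0, 3))         # → True
print(is_core(D, (5, 5), 1.0, 3))         # → False
```

With these two helpers, p is directly density-reachable from q exactly when `p in eps_neighborhood(D, q, eps)` and `is_core(D, q, eps, min_pts)` holds.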

Density-Based Method: DBSCAN

Density-reachable: a point is density-reachable from another point if there is a path from one to the other consisting of core points. Formally, a point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi for all i in (1, n−1)

[Figure: p density-reachable from q via intermediate core point p1]

Density-connected

– A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.

[Figure: p and q density-connected via point o]

Density-Based Method: DBSCAN

DBSCAN relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points

Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5]

DBSCAN: Core, Border, and Noise Points

1. Label all points as core, border, or noise points

2. Eliminate noise points

3. Put an edge between all core points that are within Eps of each other

4. Make each group of connected core points into a separate cluster

5. Assign each border point to the cluster of one of its associated core points

DBSCAN: The Algorithm
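The five steps above can be sketched in Python. This minimal version (illustrative names, 2-D points, Euclidean distance) produces the same core/border/noise labelling, but expands clusters from core points by seed propagation rather than building an explicit core-core edge graph:

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (its Eps-neighborhood)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)              # None = unvisited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:           # not core: mark noise (may become border later)
            labels[i] = -1
            continue
        cluster_id += 1                        # new cluster seeded at core point i
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                           # expand via density-reachability
            j = seeds.pop()
            if labels[j] == -1:                # noise reachable from a core point -> border
                labels[j] = cluster_id
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:    # j is also core: keep growing the cluster
                seeds.extend(j_neighbors)
    return labels

# Made-up data: a tight 5-point cluster and one outlier.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
print(dbscan(pts, eps=1.5, min_pts=3))   # → [0, 0, 0, 0, 0, -1]
```

The naive neighborhood query makes this O(n²); spatial indexing brings the average case down, as noted later for OPTICS.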

DBSCAN: Core, Border and Noise Points

Original points (left); point types: core, border and noise (right)

Eps = 10, MinPts = 4

Source of figure: Introduction to Data Mining by Tan et al.

When DBSCAN Works Well

Original points (left); clusters found (right)

• Resistant to Noise

• Can handle clusters of different shapes and sizes

Source of figure: Introduction to Data Mining by Tan et al.

When DBSCAN Does NOT Work Well

[Figure: original points and clusterings found with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]

• Varying densities

• High-dimensional data

Source of figure: Introduction to Data Mining by Tan et al.

• Idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance

• Noise points have their kth nearest neighbor at a farther distance

• So, plot sorted distance of every point to its kth nearest neighbor

DBSCAN: Determining EPS and MinPts

Eps = 10, MinPts = 4

Source of figure: Introduction to Data Mining by Tan et al.
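The k-dist heuristic above is easy to compute without a plotting library: sort every point's distance to its kth nearest neighbor and look for the "knee" where the curve jumps. A sketch with illustrative names and made-up data:

```python
import math

def k_dist(points, k):
    """Sorted list of each point's distance to its k-th nearest neighbor."""
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])        # distance to the k-th nearest other point
    return sorted(out)

# Four clustered points plus one noise point; k = 2.
curve = k_dist([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], 2)
print(curve)   # cluster points sit near 1.0; the noise point jumps past 13
```

Plotting `curve` against point rank gives the sorted k-dist graph; a common choice is to take the distance at the knee as Eps and k (often MinPts − 1 or MinPts, conventions vary) as the density parameter.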

• Ordering Points To Identify Clustering Structure

• DBSCAN is sensitive to the choice of input parameters

• Parameter setting is done empirically
• The problem is more pronounced for high-dimensional data
• High-dimensional clustering structures are not generally characterized by global density parameters like Eps & MinPts

• OPTICS as a solution!

OPTICS: Self Study

• Computes an augmented cluster ordering

• Ordering represents the density based clustering structure of the data

• Contains information that is equivalent to density based clustering obtained from a wide range of parameter settings

• Cluster ordering can be used to extract basic clustering information

OPTICS

• In DBSCAN, for constant MinPts, clusters of higher density (lower Eps) are completely contained in the density-connected sets obtained with lower density (higher Eps)

• Extend DBSCAN to process a set of distance parameters (Eps values) at the same time

• For this the objects need to be processed in a specific order

• This order selects an object that is density reachable wrt lowest eps so that clusters of higher density will be finished first.

OPTICS

• 2 values need to be stored for each object:
– Core distance
– Reachability distance

• Core distance – the smallest Eps that makes p a core object. If p is not a core object, it is undefined.

• Reachability distance of q wrt p – the greater of the core distance of p and the Euclidean distance between p & q. If p is not a core object, the reachability distance of q wrt p is undefined.

OPTICS
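The two OPTICS quantities just defined can be sketched directly; here `float('inf')` stands in for "undefined", the neighborhood includes the point itself (as in NEps), and the helper names are illustrative:

```python
import math

def core_distance(D, p, eps, min_pts):
    """Smallest radius that makes p a core object wrt eps, or inf if p is not core."""
    d = sorted(math.dist(p, q) for q in D)          # includes dist(p, p) = 0
    if len(d) >= min_pts and d[min_pts - 1] <= eps:
        return d[min_pts - 1]                       # distance to MinPts-th nearest point
    return float('inf')

def reachability_distance(D, q, p, eps, min_pts):
    """max(core_distance(p), dist(p, q)); undefined (inf) if p is not core."""
    cd = core_distance(D, p, eps, min_pts)
    return max(cd, math.dist(p, q)) if cd != float('inf') else float('inf')

# Made-up data: four points on a line segment.
D = [(0, 0), (0, 1), (1, 0), (2, 0)]
print(core_distance(D, (0, 0), 5.0, 2))                 # → 1.0
print(reachability_distance(D, (2, 0), (0, 0), 5.0, 2)) # → 2.0
```

OPTICS itself then processes objects in order of increasing reachability distance, so that denser (lower-Eps) clusters are finished first, as the previous slide describes.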

• Index-based:
– k = number of dimensions
– N = 20
– p = 75%
– M = N(1 − p) = 5
– Complexity: O(kN²)
• Core Distance
• Reachability Distance

OPTICS: Some Extension from DBSCAN

[Figure: core distance and reachability distance; MinPts = 5, Eps = 3 cm; reachability distance = max(core-distance(o), d(o, p)); r(p1, o) = 2.8 cm, r(p2, o) = 4 cm]

• Efficiency issues with DBSCAN
• Finding clusters in subspaces
• Modeling density accurately

We now look at:
• Grid-based clustering
– Partitions the data space into grid cells and forms clusters from cells that are dense enough
– Efficient approach for low-dimensional data
• Subspace clustering
– Finds clusters in subsets of all dimensions
– 2^d − 1 subspaces (for d dimensions) to be searched!!!

Density-based Clustering Contd…
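The subspace explosion mentioned above is just the count of non-empty attribute subsets: 2^d − 1 for d dimensions. A tiny illustration (the attribute names are made up):

```python
from itertools import combinations

def subspaces(attrs):
    """All non-empty subsets of the attribute set, i.e. all candidate subspaces."""
    return [c for r in range(1, len(attrs) + 1)
            for c in combinations(attrs, r)]

dims = ["age", "income", "height"]
print(len(subspaces(dims)))   # → 7, i.e. 2**3 - 1
```

Algorithms like CLIQUE avoid enumerating all of these by pruning: a subspace can only contain a dense unit if all of its lower-dimensional projections do.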

• GRIDCLUS
• STING
• CLIQUE
• WaveCluster

Grid-based Clustering

• Significant reduction in time complexity, especially for large data sets

• Number of cells << number of data points

• Instead of clustering data points, the neighborhoods surrounding the data points are clustered

Grid-based Clustering

Steps involved:
1. Creating the grid structure
2. Calculating cell density for each cell
3. Sorting the cells according to their densities
4. Identifying cluster centers
5. Traversal of neighborhood cells

Grid-based Clustering

Algorithm:
1. Define a set of grid cells
2. Assign objects to appropriate grid cells and compute the density of each cell
3. Eliminate cells having density below a specified threshold
4. Form clusters from contiguous groups of dense cells

Grid-based Clustering
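The four algorithm steps above can be sketched as follows, assuming 2-D points, equal-width square cells, 4-adjacency, and a simple count threshold (all names illustrative):

```python
from collections import defaultdict

def grid_cluster(points, cell_size, min_density):
    # Steps 1-2: assign each object to a grid cell and record cell contents.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    # Step 3: eliminate cells below the density threshold.
    dense = {c for c, pts in cells.items() if len(pts) >= min_density}
    # Step 4: form clusters from contiguous (4-adjacent) groups of dense cells.
    clusters, seen = [], set()
    for c in dense:
        if c in seen:
            continue
        comp, stack = [], [c]
        seen.add(c)
        while stack:                          # flood-fill over adjacent dense cells
            x, y = stack.pop()
            comp.append((x, y))
            for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if n in dense and n not in seen:
                    seen.add(n)
                    stack.append(n)
        clusters.append([p for cell in comp for p in cells[cell]])
    return clusters

# Made-up data: three points in one cell, two in a distant cell.
pts = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (5.1, 5.1), (5.2, 5.2)]
print(len(grid_cluster(pts, 1.0, 2)))   # → 2
```

Only occupied cells are stored in the dictionary, matching the efficiency point made later; the cost of the threshold step (fringe cells being discarded) also shows up here, since sparse boundary cells never enter `dense`.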

• Defining Grid Cells
– Key step
– Equal-width intervals along all dimensions
• Each cell has the same volume
• Density of a cell is defined as the number of points in the cell
– Alternatively, an equi-depth approach can be used
• Equal number of points in each interval
• Also called equal-frequency discretization
– MAFIA: a subspace clustering algorithm that initially uses equal-width intervals and then combines intervals of similar density

• Definition of grid has strong impact on clustering results

Grid-based Clustering

• Density of Grid Cells
– Number of points in the cell divided by the volume of the cell
• No. of road signs per km
• No. of tigers per sq. km
• No. of gas molecules per cu. cm

Grid-based Clustering

Source of figure: Introduction to Data Mining by Tan et al.

• Forming clusters from dense grid cells
– Relatively straightforward
– In the example on the previous slide: 2 clusters
– Define adjacency
• 4 or 8 adjacent cells in 2-D?
• Efficient technique to find adjacent cells (only occupied cells are stored)
– Partially empty cells on the fringe of clusters are not dense and will be discarded
– 4 parts of the larger cluster will be lost if the threshold is 9

Grid-based Clustering

Source of figure: Introduction to Data Mining by Tan et al.

• Strengths & Limitations
– A single pass is enough to determine the cell of every point and the count of every cell
– Grid cells are created only for non-empty cells
– Complexity of O(m), or O(m log m) if cells must be sorted
– Grids are rectangular
– Curse of dimensionality
– Grid cells containing just one element

Grid-based Clustering

Source of figure: Introduction to Data Mining by Tan et al.

• Clustering algorithms considered so far take into account all attributes

• Consider only a subspace of data

Subspace Clustering

Source of figure: Introduction to Data Mining by Tan et al.

Subspace Clustering

Source of figure: Introduction to Data Mining by Tan et al.

• Ensemble Clustering
• Parallelizing Clustering Algorithms to leverage a Cluster

Some Research Directions

• Similar to Ensemble Classification
• Consensus Clustering
• Obtain different clustering solutions and then reconcile them

Ensemble Clustering

• Parallelize to leverage a cluster
• Two levels of parallelism
– Node Level
– Core Level
• Not necessarily orthogonal
• Hybrid – non-trivial
• Programming Environment:
– MPI
– OpenMP

Parallelizing Clustering Algorithms