Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density...


Page 1

Clustering

Prof. Navneet Goyal, BITS, Pilani

Page 2

Density-based methods

Based on connectivity and density functions

Filter out noise, find clusters of arbitrary shape

Grid-based methods

Quantize the object space into a grid structure

Other Approaches to Clustering

Page 3

Density-Based Clustering Methods

Major features:
• Discover clusters of arbitrary shape
• Handle noise
• One scan
• Need density parameters as termination condition

Several interesting studies:

DBSCAN: Ester, et al. (KDD’96)

OPTICS: Ankerst, et al (SIGMOD’99).

DENCLUE: Hinneburg & D. Keim (KDD’98)

CLIQUE: Agrawal, et al. (SIGMOD’98)

Page 4

Density-Based Spatial Clustering of Applications with Noise

Clusters are dense regions of objects separated by regions of low density (noise)

Outliers will not affect the creation of clusters

Input
– MinPts: minimum number of points in any cluster
– Eps: for each point in a cluster there must be another point in the cluster less than this distance away

Density-Based Method: DBSCAN

Page 5

• Eps-neighborhood: Points within Eps distance of a point.

• Core point: A point whose Eps-neighborhood is dense enough (contains at least MinPts points).

• Directly density-reachable: A point p is directly density-reachable from a point q if the distance between them is small (at most Eps) and q is a core point.

• Density-reachable: A point is density-reachable from another point if there is a path from one to the other in which each step is directly density-reachable, so every point on the path except possibly the last is a core point.

DBSCAN Density Concepts

Page 6

Density-Based Method: DBSCAN

• Eps-neighborhood: Points within Eps distance of a point.
  $N_{Eps}(p) = \{q \in D \mid dist(p, q) \le Eps\}$

• Core point: Eps-neighborhood dense enough (MinPts)

• Directly density-reachable: A point p is directly density-reachable from a point q if the distance is small (Eps) and q is a core point. Formally, p is directly density-reachable from q wrt. Eps, MinPts if
1) p belongs to $N_{Eps}(q)$
2) core point condition: $|N_{Eps}(q)| \ge MinPts$

(Figure: points p and q; MinPts = 5, Eps = 1 cm)
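The three definitions on this slide translate almost directly into code. The following is a minimal sketch, assuming points are tuples of coordinates, Euclidean distance, and a brute-force scan of the data set; the function names and the toy data are illustrative and not part of the slides.

```python
from math import dist

def eps_neighborhood(points, p, eps):
    """N_Eps(p): every point of the data set within eps of p (p itself included)."""
    return [q for q in points if dist(p, q) <= eps]

def is_core(points, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is directly density-reachable from q wrt. eps, min_pts."""
    return dist(p, q) <= eps and is_core(points, q, eps, min_pts)

# Hypothetical toy data: four points close together plus one far away.
D = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (3.0, 3.0)]
print(is_core(D, (0.0, 0.0), eps=1.0, min_pts=4))                      # True
print(directly_density_reachable(D, (0.5, 0.5), (0.0, 0.0), 1.0, 4))   # True
print(directly_density_reachable(D, (0.0, 0.0), (3.0, 3.0), 1.0, 4))   # False
```

Note that the relation is not symmetric in general: p can be directly density-reachable from a core point q while q is not directly density-reachable from p, because p may fail the core point condition.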

Page 7

Density-Based Method: DBSCAN

Density-reachable: A point is density-reachable from another point if there is a path from one to the other consisting of only core points.

A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points $p_1, \dots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$ for all $i$ with $1 \le i \le n-1$.

(Figure: chain from q through p1 to p)

Page 8

Density-connected

– A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.

(Figure: points p and q, each density-reachable from o)

Density-Based Method: DBSCAN
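Density-reachability and density-connectivity can be checked with a breadth-first search that only expands core points. The sketch below is an illustration under assumptions: points are referred to by index, distance is Euclidean, neighborhoods are found by brute force, and the starting point is counted as reachable from itself by convention. The names `reachable_from` and `density_connected` are hypothetical.

```python
from collections import deque
from math import dist

def neighbors(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

def reachable_from(points, start, eps, min_pts):
    """Indices of all points density-reachable from points[start]."""
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        nbrs = neighbors(points, i, eps)
        if len(nbrs) < min_pts:
            continue  # not a core point: the chain cannot continue through i
        for j in nbrs:
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return seen

def density_connected(points, p, q, eps, min_pts):
    """True if some point o has both p and q density-reachable from it."""
    return any({p, q} <= reachable_from(points, o, eps, min_pts)
               for o in range(len(points)))

# Hypothetical data: five points on a line, one isolated point.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
print(reachable_from(pts, 1, eps=1.0, min_pts=3))        # {0, 1, 2, 3, 4}
print(reachable_from(pts, 0, eps=1.0, min_pts=3))        # {0}
print(density_connected(pts, 0, 4, eps=1.0, min_pts=3))  # True
```

The two end points 0 and 4 are border points: nothing is density-reachable from either of them, yet they are density-connected through any of the core points in between. This is why clusters are defined through density-connectivity, which is symmetric, rather than density-reachability, which is not.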

Page 9

DBSCAN Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points

Discovers clusters of arbitrary shape in spatial databases with noise

(Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5)

Page 10

DBSCAN: Core, Border, and Noise Points

Page 11

1. Label all points as core, border, or noise points

2. Eliminate noise points

3. Put an edge between all core points that are within Eps of each other

4. Make each group of connected core points into a separate cluster

5. Assign each border point to the cluster of one of its associated core points

DBSCAN: The Algorithm
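The five steps above can be sketched in a few dozen lines. This is a brute-force illustration, not an optimized implementation: it assumes Euclidean distance, computes all neighborhoods in O(n^2) time without a spatial index, and resolves ties for border points by giving them to the first cluster that reaches them. The label `NOISE = -1` and the function name are assumptions made for this example.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id 0, 1, ... or NOISE."""
    n = len(points)
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    # Step 1: a core point has at least min_pts points in its Eps-neighborhood.
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    # Step 2: every point starts as noise; core and border points are relabeled below.
    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != NOISE:
            continue
        # Steps 3-4: follow the edges between core points within eps of each other.
        labels[i] = cluster
        stack = [i]
        while stack:
            c = stack.pop()  # c is always a core point
            for j in nbrs[c]:
                if labels[j] == NOISE:
                    # Step 5: a border point joins the cluster of the first
                    # core point that reaches it.
                    labels[j] = cluster
                    if core[j]:
                        stack.append(j)
        cluster += 1
    return labels

# Hypothetical data: two tight squares and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 5)]
print(dbscan(pts, eps=1.5, min_pts=4))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Points that are neither core nor within Eps of a core point keep the noise label, which corresponds to eliminating them in step 2.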

Page 12

DBSCAN: Core, Border and Noise Points

(Figure: original points, and point types core, border, and noise; Eps = 10, MinPts = 4)

Source of figure: Introduction to Data Mining by Tan et al.

Page 13

When DBSCAN Works Well

(Figure: original points and the clusters found)

• Resistant to noise

• Can handle clusters of different shapes and sizes

Source of figure: Introduction to Data Mining by Tan et al.

Page 14

When DBSCAN Does NOT Work Well

(Figure: original points, and the clusterings found with MinPts=4, Eps=9.75 and with MinPts=4, Eps=9.92)

• Varying densities

• High-dimensional data

Source of figure: Introduction to Data Mining by Tan et al.

Page 15

• Idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance

• Noise points have the kth nearest neighbor at farther distance

• So, plot sorted distance of every point to its kth nearest neighbor

DBSCAN: Determining EPS and MinPts

Eps = 10, MinPts = 4

Source of figure: Introduction to Data Mining by Tan et al.
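The sorted k-distance curve described on this slide can be computed as follows. This is a brute-force sketch assuming Euclidean distance; the function name and the sample data are illustrative, and the choice of k is left to the user.

```python
from math import dist

def sorted_k_distances(points, k):
    """Distance from every point to its k-th nearest neighbor, sorted ascending."""
    k_dists = []
    for i, p in enumerate(points):
        # Distances to all other points (the point itself is excluded).
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(others[k - 1])
    return sorted(k_dists)

# Hypothetical data: two tight squares and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 5)]
curve = sorted_k_distances(pts, k=3)
print([round(d, 2) for d in curve])
```

For this data the first eight values are all about 1.41 and the last one is about 6.4. Plotting the curve and picking Eps near the point where it rises sharply separates the points inside clusters from the noise point.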

Page 16

• Ordering Points To Identify the Clustering Structure

• DBSCAN is sensitive to the choice of input parameters

• Parameter setting is done empirically

• The problem is more pronounced for high-dimensional data

• Clustering structures in high-dimensional data are not generally characterized by global density parameters like Eps and MinPts

• OPTICS as a solution!

OPTICS: Self Study

Page 17

• Computes an augmented cluster ordering

• Ordering represents the density based clustering structure of the data

• Contains information that is equivalent to density based clustering obtained from a wide range of parameter settings

• Cluster ordering can be used to extract basic clustering information

OPTICS

Page 18

• In DBSCAN, for constant minpts, clusters with high density (lower eps) are completely contained in density connected sets obtained with lower density

• Extend DBSCAN to process a set of distance parameters eps at the same time.

• For this the objects need to be processed in a specific order

• This order selects an object that is density reachable wrt lowest eps so that clusters of higher density will be finished first.

OPTICS

Page 19

• 2 values need to be stored for each object:
– Core distance
– Reachability distance

• Core distance of p: the smallest eps that makes p a core object. If p is not a core object, it is undefined.

• Reachability distance of q wrt p: the greater of the core distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability distance between p and q is undefined.

OPTICS
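The two stored values can be computed as shown in the sketch below. It assumes Euclidean distance, a brute-force scan, and that a point counts as a member of its own neighborhood, so the core distance is the distance to the MinPts-th closest point with the point itself counted first. `None` stands in for "undefined", and the function names are hypothetical.

```python
from math import dist

def core_distance(points, p, eps, min_pts):
    """Smallest radius that makes p a core object, or None if p is not core at eps."""
    d = sorted(dist(p, q) for q in points)  # includes dist(p, p) == 0
    if len(d) < min_pts or d[min_pts - 1] > eps:
        return None
    return d[min_pts - 1]

def reachability_distance(points, q, p, eps, min_pts):
    """max(core-distance(p), dist(p, q)), or None if p is not a core object."""
    cd = core_distance(points, p, eps, min_pts)
    return None if cd is None else max(cd, dist(p, q))

# Hypothetical data set; o is the object the distances are measured from.
D = [(0, 0), (1, 0), (0, 2), (3, 0), (10, 10)]
o = (0, 0)
print(core_distance(D, o, eps=4, min_pts=3))                  # 2.0
print(reachability_distance(D, (1, 0), o, eps=4, min_pts=3))  # 2.0
print(reachability_distance(D, (3, 0), o, eps=4, min_pts=3))  # 3.0
print(core_distance(D, (10, 10), eps=4, min_pts=3))           # None
```

The point (1, 0) lies closer to o than the core distance, so its reachability distance is lifted to the core distance. The point (3, 0) lies farther away, so its reachability distance is simply its distance to o.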

Page 20

• Index-based:
  • k = number of dimensions
  • N = 20
  • p = 75%
  • M = N(1-p) = 5
– Complexity: $O(kN^2)$

• Core Distance

• Reachability Distance

OPTICS: Some Extension from DBSCAN

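(Figure: core distance of o and reachability distances of p1 and p2 wrt o in data set D; MinPts = 5, Eps = 3 cm)

Reachability distance: max(core-distance(o), d(o, p))

r(p1, o) = 2.8 cm, r(p2, o) = 4 cm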

Page 21

• Efficiency issues with DBSCAN
• Finding clusters in subspaces
• Modeling density accurately

We now look at:
• Grid-based clustering
– Partitions data space into grid cells and forms clusters from cells that are dense enough
– Efficient approach for low-dimensional data
• Subspace clustering
– Finds clusters in subsets of all dimensions
– $2^n - 1$ subspaces to be searched!!!

Density-based Clustering Contd…
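The count of subspaces to be searched is the number of non-empty subsets of the n dimensions. A short sketch, with hypothetical attribute names, makes the growth concrete:

```python
from itertools import combinations

def subspaces(dimensions):
    """Yield every non-empty subset of the given dimensions."""
    for size in range(1, len(dimensions) + 1):
        yield from combinations(dimensions, size)

dims = ["x", "y", "z"]  # hypothetical attribute names
all_subspaces = list(subspaces(dims))
print(len(all_subspaces))   # 7, i.e. 2**3 - 1
print(all_subspaces)
```

With 3 dimensions there are 7 subspaces; with 20 dimensions there are already 1,048,575, which is why exhaustive search over subspaces is impractical.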

Page 22

• GRIDCLUS
• STING
• CLIQUE
• WaveCluster

Grid-based Clustering

Page 23

• Significant reduction in time complexity, especially for large data sets

• Number of cells << number of data points

• Instead of clustering data points, the neighborhoods surrounding the data points are clustered

Grid-based Clustering

Page 24

Steps involved:
1. Creating the grid structure
2. Calculating cell density for each cell
3. Sorting of the cells according to their densities
4. Identifying cluster centers
5. Traversal of neighborhood cells

Grid-based Clustering

Page 25

Algorithm:
1. Define a set of grid cells
2. Assign objects to appropriate grid cells and compute the density of each cell
3. Eliminate cells having density below a specified threshold
4. Form clusters from contiguous groups of dense cells

Grid-based Clustering
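The four-step algorithm above can be sketched as follows. The sketch assumes equal-width cells, uses the point count of a cell as its density (all cells have the same volume), and treats two cells as contiguous when they share a face. The function name, parameters, and data are illustrative only.

```python
from collections import Counter
from math import floor

def grid_cluster(points, cell_width, threshold):
    """Return a list of clusters, each a set of dense cell coordinates."""
    # Steps 1-2: equal-width cells; count the points falling into each cell.
    counts = Counter(tuple(floor(x / cell_width) for x in p) for p in points)
    # Step 3: keep only the cells whose count reaches the threshold.
    dense = {cell for cell, c in counts.items() if c >= threshold}
    # Step 4: group dense cells that share a face (4-adjacency in 2-D).
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        group, stack = set(), [start]
        seen.add(start)
        while stack:
            cell = stack.pop()
            group.add(cell)
            for axis in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(group)
    return clusters

# Hypothetical data: two neighboring dense cells, one separate dense cell,
# and one sparse cell that falls below the threshold.
pts = [(0.1, 0.1), (0.2, 0.5), (0.9, 0.9),
       (1.1, 0.2), (1.5, 0.5), (1.8, 0.1),
       (5.1, 5.1), (5.5, 5.5), (5.9, 5.2),
       (3.5, 3.5)]
for group in grid_cluster(pts, cell_width=1.0, threshold=3):
    print(sorted(group))
```

Only occupied cells are ever stored, since the `Counter` creates an entry for a cell the first time a point falls into it.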

Page 26

• Defining Grid Cells
– Key step
– Equal width intervals along all dimensions
  • Each cell has same volume
  • Density of cell is defined as no. of points in cell
– Alternatively, equi-depth approach can be used
  • Equal number of points in each interval
  • Also called equal-frequency discretization
– MAFIA: subspace clustering algorithm that initially uses equal width intervals and then combines intervals of similar density

• Definition of grid has strong impact on clustering results

Grid-based Clustering
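The difference between equal-width and equi-depth intervals is easiest to see on one attribute with a skewed distribution. The sketch below computes interval edges both ways; the function names and the sample values are assumptions made for illustration.

```python
def equal_width_edges(values, bins):
    """Edges of `bins` intervals that all have the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]

def equal_depth_edges(values, bins):
    """Edges chosen so each interval holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[(i * n) // bins] for i in range(bins)]
    return edges + [ordered[-1]]

# Hypothetical attribute values with one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 100]
print(equal_width_edges(values, 4))  # [1.0, 25.75, 50.5, 75.25, 100.0]
print(equal_depth_edges(values, 4))  # [1, 3, 5, 7, 100]
```

With equal width, seven of the eight values fall into the first interval and two intervals stay empty. With equal depth, each interval holds two values, at the cost of intervals (and therefore cells) of very different size.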

Page 27

• Density of Grid Cells
– No. of points in the cell divided by the volume of the cell
  • No. of road signs per km
  • No. of tigers in a sq. km
  • No. of molecules of a gas in cu. cm

Grid-based Clustering
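As a small worked example of this definition, the density of a cell is its point count divided by the product of its side lengths. The numbers below are made up for illustration.

```python
from math import prod

def cell_density(count, side_lengths):
    """Number of points in a cell divided by the volume of the cell."""
    return count / prod(side_lengths)

# Hypothetical figures: 12 tigers in a 2 km x 3 km cell.
print(cell_density(12, [2.0, 3.0]))  # 2.0 tigers per sq. km
# Hypothetical figures: 5 road signs along a 10 km stretch (a 1-D cell).
print(cell_density(5, [10.0]))       # 0.5 road signs per km
```

When all cells have the same volume, as with equal-width intervals, comparing raw counts gives the same ordering as comparing densities.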

Source of figure: Introduction to Data Mining by Tan et al.

Page 28

• Forming Clusters from dense grid cells
– Relatively straightforward
– In the example on previous slide: 2 clusters
– Define adjacency
  • 4 or 8 adjacent cells in 2-D?
  • Efficient technique to find adjacent cells (only occupied cells are stored)
– Partially empty cells on the fringe of clusters are not dense and will be discarded
– 4 parts of the larger cluster will be lost if the threshold is 9

Grid-based Clustering
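The choice between 4 and 8 adjacent cells, and the lookup of neighbors when only occupied cells are stored, can be sketched as follows. The sketch assumes cells are identified by integer coordinate tuples kept in a set; the function name and the `diagonal` flag are hypothetical.

```python
from itertools import product

def adjacent_cells(cell, occupied, diagonal=False):
    """Occupied neighbors of a grid cell; diagonal=True gives 8-adjacency in 2-D."""
    found = []
    for offset in product((-1, 0, 1), repeat=len(cell)):
        changed = sum(1 for o in offset if o != 0)
        # Skip the cell itself, and skip diagonal moves unless they are allowed.
        if changed == 0 or (not diagonal and changed > 1):
            continue
        nb = tuple(c + o for c, o in zip(cell, offset))
        if nb in occupied:  # set lookup: only occupied cells are stored
            found.append(nb)
    return found

occupied = {(0, 0), (1, 0), (1, 1), (5, 5)}  # hypothetical occupied cells
print(adjacent_cells((0, 0), occupied))                 # [(1, 0)]
print(adjacent_cells((0, 0), occupied, diagonal=True))  # [(1, 0), (1, 1)]
```

Under 4-adjacency the cell (1, 1) is not a neighbor of (0, 0), so two clusters touching only at a corner stay separate; under 8-adjacency they would merge.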

Source of figure: Introduction to Data Mining by Tan et al.

Page 29

• Strengths & Limitations
– Single pass is enough to determine the cell and count of every cell
– Grid cells created only for non-empty cells
– Complexity of O(m) to O(m log m)
– Grids are rectangular
– Curse of dimensionality
– Grid cells containing just one element

Grid-based Clustering

Source of figure: Introduction to Data Mining by Tan et al.

Page 30

• Clustering algorithms considered so far take into account all attributes

• Consider only a subspace of data

Subspace Clustering

Source of figure: Introduction to Data Mining by Tan et al.

Page 31

Subspace Clustering

Source of figure: Introduction to Data Mining by Tan et al.

Page 32

• Ensemble Clustering
• Parallelizing Clustering Algorithms to leverage a Cluster

Some Research Directions

Page 33

• Similar to Ensemble Classification
• Consensus Clustering
• Obtain different clustering solutions and then reconcile them

Ensemble Clustering
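The slide does not say how the solutions are reconciled. One common option, shown here purely as an illustration and not as the method intended by the slides, is a co-association approach: count how often each pair of objects lands in the same cluster, then link pairs that agree in more than a chosen fraction of the solutions. The function names and the `agreement` threshold are assumptions.

```python
def co_association(labelings):
    """Fraction of clusterings in which each pair of objects shares a cluster."""
    n = len(labelings[0])
    m = len(labelings)
    return [[sum(lab[i] == lab[j] for lab in labelings) / m for j in range(n)]
            for i in range(n)]

def consensus(labelings, agreement=0.5):
    """Link objects that co-occur in more than `agreement` of the clusterings."""
    co = co_association(labelings)
    n = len(co)
    labels = [-1] * n  # -1 means "not yet assigned"
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            a = stack.pop()
            for b in range(n):
                if labels[b] == -1 and co[a][b] > agreement:
                    labels[b] = cluster
                    stack.append(b)
        cluster += 1
    return labels

# Three hypothetical clustering solutions for five objects. Cluster ids are
# arbitrary per solution, so only co-membership is compared, never the ids.
solutions = [[0, 0, 1, 1, 1],
             [1, 1, 0, 0, 0],
             [0, 0, 0, 1, 1]]
print(consensus(solutions))  # [0, 0, 1, 1, 1]
```

The third solution puts object 2 with objects 0 and 1, but it is outvoted by the other two, so the consensus keeps object 2 with objects 3 and 4.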

Page 34

• Parallelize to leverage a cluster
• Two levels of parallelism
– Node Level
– Core Level
• Not Necessarily Orthogonal
• Hybrid – Non Trivial
• Programming Environment:
– MPI
– OpenMP

Parallelizing Clustering Algorithms