DBSCAN School of Electrical Engineering, University of Belgrade Department of Computer Engineering...

Post on 13-Dec-2015


DBSCAN

School of Electrical Engineering, University of Belgrade

Department of Computer Engineering

Data Mining algorithm

Professor: Dr Veljko Milutinović

Student: Milan Micić, 2011/3323

milan.z.micic@gmail.com

Content

• Introduction

• The DBSCAN basic idea

• Algorithm

• DBSCAN on R

• Example

• Disadvantages

• Advantages

• References


Introduction

• Data clustering algorithms

• Used in machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics

• Hierarchical, centroid-based, distribution-based, density-based, etc.


DBSCAN basic idea


• Density-Based Spatial Clustering of Applications with Noise

• University of Munich, 1996

• Derived from the natural human approach to clustering

• Input parameters

• Size of the epsilon neighborhood: ε

• Minimum number of points required in an ε-neighborhood: MinPts

• For a point to seed a cluster, its neighborhood of radius ε has to contain at least MinPts points
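The density condition above can be sketched in a few lines of Python. This is only an illustration, assuming Euclidean distance, points stored as tuples, and a neighborhood that includes the point itself; the names `region_query` and `is_core` are hypothetical helpers, not part of any library.

```python
import math

def region_query(points, p, eps):
    """Indices of all points within distance eps of points[p] (p itself included)."""
    return [i for i, q in enumerate(points) if math.dist(points[p], q) <= eps]

def is_core(points, p, eps, min_pts):
    """True if the eps-neighborhood of points[p] holds at least min_pts points."""
    return len(region_query(points, p, eps)) >= min_pts

# Assumed toy data: four points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
dense = is_core(points, 0, eps=0.2, min_pts=4)     # dense region
isolated = is_core(points, 4, eps=0.2, min_pts=4)  # isolated point
```

The linear scan in `region_query` costs O(n) per query; a spatial index would be needed to do better.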

DBSCAN basic idea


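• Directly density-reachable, p1 from p2

• p1 belongs to the ε-neighborhood of p2

• p2's ε-neighborhood contains at least MinPts points

• Density-reachable, pn from p0

• There exists a chain of points p0, p1, ..., pn in which each pi+1 is directly density-reachable from pi

• Core point: its ε-neighborhood contains at least MinPts points

• Border point: not a core point, but lies in the ε-neighborhood of a core point

• Noise point: neither a core point nor a border point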
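A minimal sketch of how the three point types could be told apart, assuming Euclidean distance and a neighborhood that counts the point itself. The function names `neighbors` and `classify` and the toy data are made up for illustration.

```python
import math

def neighbors(points, i, eps):
    """Indices of points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    hoods = [neighbors(points, i, eps) for i in range(len(points))]
    core = {i for i, hood in enumerate(hoods) if len(hood) >= min_pts}
    labels = []
    for i, hood in enumerate(hoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in hood):
            labels.append("border")   # directly density-reachable from a core point
        else:
            labels.append("noise")
    return labels

# Assumed toy data: a centre with three neighbours, plus one distant point.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (9, 9)]
labels = classify(points, eps=1.1, min_pts=4)
```

Here only the centre has four points in its neighborhood, so its three neighbours are border points and the distant point is noise.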

Algorithm

DBSCAN(D, eps, MinPts)
    C = 0
    for each unvisited point P in dataset D
        mark P as visited
        N = regionQuery(P, eps)
        if sizeof(N) < MinPts
            mark P as NOISE
        else
            C = next cluster
            expandCluster(P, N, C, eps, MinPts)

expandCluster(P, N, C, eps, MinPts)
    add P to cluster C
    for each point P' in N
        if P' is not visited
            mark P' as visited
            N' = regionQuery(P', eps)
            if sizeof(N') >= MinPts
                N = N joined with N'
        if P' is not yet member of any cluster
            add P' to cluster C

• Complexity with indexing structure: O(n*log(n))
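The pseudocode can be turned into a small runnable Python sketch. It assumes Euclidean distance, a brute-force `region_query` (so this version is O(n^2), not O(n*log(n))), and the arbitrary convention that label 0 means noise and clusters are numbered from 1. The function and variable names are illustrative only.

```python
import math

NOISE = 0  # assumed convention: 0 = noise, clusters are 1, 2, ...

def region_query(points, i, eps):
    """Brute-force neighborhood search: indices within eps of points[i]."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)    # None = not assigned yet
    visited = [False] * len(points)
    cluster = 0
    for p in range(len(points)):
        if visited[p]:
            continue
        visited[p] = True
        hood = region_query(points, p, eps)
        if len(hood) < min_pts:
            labels[p] = NOISE        # may later be claimed as a border point
            continue
        cluster += 1                 # start the next cluster
        labels[p] = cluster
        queue = list(hood)           # plays the role of the growing set N
        k = 0
        while k < len(queue):
            q = queue[k]
            k += 1
            if not visited[q]:
                visited[q] = True
                q_hood = region_query(points, q, eps)
                if len(q_hood) >= min_pts:
                    queue.extend(q_hood)   # N = N joined with N'
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = cluster
    return labels

# Assumed toy data: two tight squares and one far outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With these values the two squares form clusters 1 and 2 and the outlier keeps the noise label.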

DBSCAN on R

• FPC (Flexible Procedures for Clustering) package

• GNU General Public License

• Various methods for clustering and cluster validation

• Interface functions for many methods implemented in the R language

• DBSCAN: O(n^2)

• dbscan(x,0.2,showplot=2)

• dbscan Pts=600 MinPts=5 eps=0.2

         0  1  2  3  4  5  6  7  8  9 10 11
seed     0 50 53 51 52 51 54 54 54 53 51  1
border  28  4  4  8  5  3  3  4  3  4  6  4
total   28 54 57 59 57 54 57 58 57 57 57  5

Example


• Astronomy task

• Identifying celestial objects by capturing the radiation they emit

• The captured signal contains noise (from the sensors and from diffuse emission by the atmosphere and space itself)

• Noise elimination: constrain the relevant intensity by a known threshold

• In this case, only pixels whose intensity is less than 50 (and which are consequently darker) are considered
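The thresholding step might look like the following sketch, assuming the image is a plain list of rows of integer intensities. The function name `select_pixels` and the sample values are invented for illustration; only the threshold of 50 comes from the slide.

```python
def select_pixels(image, threshold=50):
    """Return (row, col) coordinates of pixels with intensity below threshold."""
    return [(r, c)
            for r, row in enumerate(image)
            for c, value in enumerate(row)
            if value < threshold]

# Assumed 3x4 toy image with a few dark pixels.
image = [
    [200,  40,  45, 210],
    [190,  30, 220, 230],
    [205, 215, 225,  10],
]
coords = select_pixels(image)
```

The resulting coordinate list is the kind of point set a density-based clustering step would then receive.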

Example

• DBSCAN algorithm applied to individual pixels

• Links the pixels of a complete emission area together

• Each generated cluster defines a celestial entity

• ε = 5, MinPts = 5: 64 clusters and 224 outliers found


Disadvantages

• Appropriate values for the parameters ε and MinPts have to be chosen

• Numerous experiments indicate that MinPts = 4 works best

• Has difficulty clustering datasets with large differences in density

• "Curse of dimensionality"

• Affects every algorithm based on the Euclidean distance when applied to high-dimensional data sets
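One common way to look for a suitable ε is the sorted k-distance heuristic described in the original DBSCAN paper: compute each point's distance to its k-th nearest neighbor, sort the values in descending order, and look for the "knee". The sketch below assumes Euclidean distance, brute-force search, and k = 4; the name `k_distances` and the data are hypothetical.

```python
import math

def k_distances(points, k=4):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)

# Assumed toy data: a 3x3 grid plus one distant outlier.
points = [(x, y) for x in range(3) for y in range(3)] + [(20, 20)]
kd = k_distances(points, k=4)
```

In this data the outlier produces one very large value at the front of the list, while the grid points all lie at 2.0 or below; a value of ε just above that drop would separate the grid from the outlier.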


Advantages


• Has a notion of noise

• Requires just two parameters

• Does not require the number of clusters in the data to be known a priori

• Can find arbitrarily shaped clusters

• Even clusters completely surrounded by a different cluster

• Mostly insensitive to the ordering of the points in the database

• Only border points might swap cluster membership

References


• Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Institute for Computer Science, University of Munich, 1996;

• Mehmed Kantardzic: “Data Mining: Concepts, Models, Methods, and Algorithms”, 2011;

• Wikibooks: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Density-Based_Clustering;

• Wiki: http://en.wikipedia.org/wiki/DBSCAN

Thank you for your attention!

Questions
