DBSCAN School of Electrical Engineering, University of Belgrade Department of Computer Engineering...

DBSCAN School of Electrical Engineering, University of Belgrade Department of Computer Engineering Data Mining algorithm Professor Dr Veljko Milutinović Student Milan Micić 2011/3323 [email protected]

DBSCAN

School of Electrical Engineering, University of Belgrade

Department of Computer Engineering

Data Mining algorithm

Professor: Dr Veljko Milutinović

Student: Milan Micić, 2011/3323

[email protected]

Content

• Introduction
• The DBSCAN basic idea
• Algorithm
• DBSCAN on R
• Example
• Advantages
• Disadvantages
• References

Introduction

• Data clustering algorithms
• Used in machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics
• Hierarchical, centroid-based, distribution-based, density-based, etc.

DBSCAN basic idea

• Density-Based Spatial Clustering of Applications with Noise
• Munich, 1996
• Derived from the natural human approach to clustering
• Input parameters
  • The size of the epsilon neighborhood – ε
  • Minimum number of points required in an ε-neighborhood – MinPts
• The neighborhood of a given radius ε has to contain at least a minimum number of points, MinPts
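The density condition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `region_query` and `is_dense` are hypothetical, points are tuples of floats, the distance is Euclidean, and a point counts as a member of its own neighborhood.

```python
from math import dist  # Euclidean distance, Python 3.8+

def region_query(points, p, eps):
    """Indices of all points within distance eps of points[p] (p itself included)."""
    return [i for i, q in enumerate(points) if dist(points[p], q) <= eps]

def is_dense(points, p, eps, min_pts):
    """True if the eps-neighborhood of points[p] holds at least min_pts points."""
    return len(region_query(points, p, eps)) >= min_pts

# Four points packed closely together and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

print(region_query(points, 0, eps=0.2))          # [0, 1, 2, 3]
print(is_dense(points, 0, eps=0.2, min_pts=4))   # True
print(is_dense(points, 4, eps=0.2, min_pts=4))   # False
```

The linear scan makes each query O(n); an indexing structure would replace only the body of `region_query`.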

DBSCAN basic idea

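• Directly density-reachable, p1 from p2
  • p1 belongs to the ε-neighborhood of p2
  • p2's ε-neighborhood contains at least MinPts points
• Density-reachable, pn from p0
  • There exists a chain of points p0, p1, ..., pn, where pi+1 is directly density-reachable from pi
• Core, border, and noise points
  • Core: the ε-neighborhood contains at least MinPts points
  • Border: not a core point, but inside the ε-neighborhood of a core point
  • Noise: neither core nor border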
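As a rough illustration of the three point types, the sketch below labels every point as core, border, or noise. The function name `classify_points` is hypothetical, and the sketch assumes Euclidean distance and that a point counts toward its own neighborhood.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    # eps-neighborhood of every point, computed by brute force.
    neighbors = [[j for j, q in enumerate(points) if dist(p, q) <= eps]
                 for p in points]
    core = {i for i, n in enumerate(neighbors) if len(n) >= min_pts}
    kinds = []
    for i, n in enumerate(neighbors):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in n):
            kinds.append("border")   # directly density-reachable from a core point
        else:
            kinds.append("noise")
    return kinds

# Four points on a line, one unit apart, plus a distant outlier.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(classify_points(points, eps=1.0, min_pts=3))
# ['border', 'core', 'core', 'border', 'noise']
```

The two end points of the line have only two points in their neighborhoods, so they are border points attached to the core points next to them.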

Algorithm

DBSCAN(D, eps, MinPts)
    C = 0
    for each unvisited point P in dataset D
        mark P as visited
        N = regionQuery(P, eps)
        if sizeof(N) < MinPts
            mark P as NOISE
        else
            C = next cluster
            expandCluster(P, N, C, eps, MinPts)

expandCluster(P, N, C, eps, MinPts)
    add P to cluster C
    for each point P' in N
        if P' is not visited
            mark P' as visited
            N' = regionQuery(P', eps)
            if sizeof(N') >= MinPts
                N = N joined with N'
        if P' is not yet member of any cluster
            add P' to cluster C

• Complexity with indexing structure: O(n*log(n))
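The pseudocode can be turned into a short runnable sketch. The version below is one possible Python rendering, not a reference implementation: the label 0 for noise and the brute-force `region_query` are assumptions made for brevity. Because every query scans all points, this version runs in O(n²); the O(n log n) bound needs an indexing structure behind `region_query`.

```python
from math import dist

NOISE = 0  # assumed label for noise; clusters are numbered from 1

def region_query(points, p, eps):
    """Indices of all points within eps of points[p], by linear scan."""
    return [i for i, q in enumerate(points) if dist(points[p], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: NOISE or a cluster number >= 1."""
    labels = [None] * len(points)      # None = not assigned yet
    visited = [False] * len(points)
    cluster = 0
    for p in range(len(points)):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # C = next cluster
        labels[p] = cluster
        # expandCluster: the list grows while it is being walked.
        queue = list(neighbors)
        i = 0
        while i < len(queue):
            q = queue[i]
            i += 1
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(points, q, eps)
                if len(q_neighbors) >= min_pts:
                    queue.extend(q_neighbors)   # N = N joined with N'
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = cluster
    return labels

# Two well-separated squares and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
# [1, 1, 1, 1, 2, 2, 2, 2, 0]
```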

DBSCAN on R

• fpc: Flexible Procedures for Clustering
• GNU General Public License
• Various methods for clustering and cluster validation
• Interface functions for many methods implemented in the R language
• DBSCAN: O(n²)
• dbscan(x, 0.2, showplot=2)
• dbscan Pts=600 MinPts=5 eps=0.2

          0   1   2   3   4   5   6   7   8   9  10  11
seed      0  50  53  51  52  51  54  54  54  53  51   1
border   28   4   4   8   5   3   3   4   3   4   6   4
total    28  54  57  59  57  54  57  58  57  57  57   5

• Column 0 holds the noise points; columns 1 to 11 are the clusters found
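One way to read the printed summary is to check that its rows add up. The sketch below is written in Python rather than R, with the numbers copied from the output above. It assumes that "seed" counts core points and that column 0 is the noise column, as in fpc's cluster coding.

```python
# Per-column counts copied from the dbscan summary (columns 0..11).
seed   = [0, 50, 53, 51, 52, 51, 54, 54, 54, 53, 51, 1]
border = [28, 4, 4, 8, 5, 3, 3, 4, 3, 4, 6, 4]
total  = [28, 54, 57, 59, 57, 54, 57, 58, 57, 57, 57, 5]

# Each total is the number of seed points plus the number of border points.
rows_consistent = all(s + b == t for s, b, t in zip(seed, border, total))

# Summing over all columns gives back the number of input points (Pts=600).
n_points = sum(total)

# Column 0 is the noise column; the remaining columns are clusters.
n_noise = total[0]
n_clusters = len(total) - 1

print(rows_consistent, n_points, n_noise, n_clusters)  # True 600 28 11
```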

Example

• Astronomy task
• Identifying celestial objects by capturing the radiation they emit
• The captured signal contains noise (from sensors, diffuse emission from the atmosphere and space itself)
• Noise elimination method: constrain the relevant intensity by a known threshold
• In this case, only pixels whose intensity is less than 50 (and consequently darker) are considered
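The thresholding step can be sketched as follows. The image here is a small made-up grid of intensities, and `select_pixels` is a hypothetical helper; only the threshold of 50 comes from the example.

```python
def select_pixels(image, threshold=50):
    """Return (row, col) coordinates of pixels with intensity below threshold."""
    return [(r, c)
            for r, row in enumerate(image)
            for c, value in enumerate(row)
            if value < threshold]

# Toy 3x3 image; intensities are invented for illustration.
image = [
    [200,  40,  45],
    [210,  30, 220],
    [ 10, 205, 215],
]

print(select_pixels(image))  # [(0, 1), (0, 2), (1, 1), (2, 0)]
```

The returned coordinates are the points that would then be passed to the clustering step.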

Example

• DBSCAN algorithm applied to individual pixels
• Links together a complete emission area
• Each generated cluster defines a celestial entity
• ε = 5, MinPts = 5: 64 clusters and 224 outliers found

Disadvantages

• Choosing appropriate parameters ε and MinPts
  • Numerous experiments indicate that MinPts = 4 works best
• Clustering datasets with large differences in density
• “Curse of dimensionality”
  • Affects every algorithm based on the Euclidean distance for high-dimensional data sets
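The effect of dimensionality on Euclidean distance can be illustrated with a small experiment. This is a sketch under assumptions not taken from the slides: points are drawn uniformly from the unit cube, and "relative contrast" (the gap between the farthest and nearest neighbor, divided by the nearest distance) is used as the measure. As the contrast shrinks, a single radius ε separates dense from sparse regions less and less well.

```python
import random
from math import dist

def relative_contrast(dim, n=200, seed=0):
    """(max - min) / min over distances from one random query point to n random points."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    query = [rng.random() for _ in range(dim)]
    dists = [dist(query, [rng.random() for _ in range(dim)]) for _ in range(n)]
    return (max(dists) - min(dists)) / min(dists)

# The contrast typically drops sharply as the dimension grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```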

Advantages

• Has a notion of noise
• Requires just two parameters
• Does not require the number of clusters in the data to be known a priori
• Can find arbitrarily shaped clusters
  • Even clusters completely surrounded by a different cluster
• Mostly insensitive to the ordering of the points in the database
  • Only border points might swap cluster membership

References

• Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Institute for Computer Science, University of Munich, 1996
• Mehmed Kantardzic: “Data Mining: Concepts, Models, Methods, and Algorithms”, 2011
• Wikibooks: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Density-Based_Clustering
• Wikipedia: http://en.wikipedia.org/wiki/DBSCAN

Thank you for your attention!

Questions

Milan Micić

[email protected]