Post on 13-Dec-2015
DBSCAN
School of Electrical Engineering, University of Belgrade
Department of Computer Engineering
Data Mining algorithm
Professor
Dr Veljko Milutinović
Student
Milan Micić, 2011/3323
milan.z.micic@gmail.com
Content
• Introduction
• The DBSCAN basic idea
• Algorithm
• DBSCAN on R
• Example
• Advantages
• Disadvantages
• References
Introduction
• Data clustering algorithms
• Used in machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics
• Hierarchical, centroid-based, distribution-based, density-based, etc.
DBSCAN basic idea
• Density-Based Spatial Clustering of Applications with Noise
• Munich, 1996
• Derived from the natural human approach to clustering
• Input parameters
  • The size of the epsilon neighborhood: ε
  • Minimum number of points in a cluster: MinPts
• Neighborhood of a given radius ε has to contain at least a minimum number of points MinPts
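The density condition above can be illustrated with a minimal Python sketch. The function names `region_query` and `satisfies_density`, the sample points, and the parameter values are assumptions chosen for illustration; the neighborhood is taken to include the point itself and to use Euclidean distance.

```python
from math import dist  # Euclidean distance, Python 3.8+

def region_query(points, p, eps):
    """Return every point within distance eps of p (p itself included)."""
    return [q for q in points if dist(p, q) <= eps]

def satisfies_density(points, p, eps, min_pts):
    """True if the eps-neighborhood of p holds at least min_pts points."""
    return len(region_query(points, p, eps)) >= min_pts

# Hypothetical data: four points packed closely together and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

dense = satisfies_density(points, (0.0, 0.0), eps=0.2, min_pts=4)
sparse = satisfies_density(points, (5.0, 5.0), eps=0.2, min_pts=4)
print(dense, sparse)
```

The brute-force scan costs O(n) per query; an indexing structure would replace the list comprehension.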
• Directly density-reachable, p1 from p2
  • p1 belongs to the ε-neighborhood of p2
  • The ε-neighborhood of p2 contains at least MinPts points
• Density-reachable, pn from p0
  • There exists a chain of points p1, ..., pn-1 such that pi+1 is directly density-reachable from pi
• Core, border and noise points
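The definitions above could be sketched in Python as follows. This is an illustration under stated assumptions, not the original authors' code: the helper names, the one-dimensional sample data, and the parameter values are hypothetical, and a border point is taken to be a non-core point that lies in the neighborhood of some core point.

```python
from collections import deque
from math import dist

def neighbors(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

def is_core(points, i, eps, min_pts):
    return len(neighbors(points, i, eps)) >= min_pts

def density_reachable(points, src, dst, eps, min_pts):
    """True if points[dst] is density-reachable from points[src]."""
    if not is_core(points, src, eps, min_pts):
        return False  # a chain can only start at a core point
    seen = {src}
    queue = deque([src])
    while queue:
        i = queue.popleft()
        for j in neighbors(points, i, eps):
            if j == dst:
                return True
            # Only core points can extend the chain further.
            if j not in seen and is_core(points, j, eps, min_pts):
                seen.add(j)
                queue.append(j)
    return False

def classify(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    core = [is_core(points, i, eps, min_pts) for i in range(len(points))]
    labels = []
    for i in range(len(points)):
        if core[i]:
            labels.append("core")
        elif any(core[j] for j in neighbors(points, i, eps)):
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical points on a line, plus one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
kinds = classify(points, eps=1.0, min_pts=3)
forward = density_reachable(points, 1, 3, eps=1.0, min_pts=3)
backward = density_reachable(points, 3, 1, eps=1.0, min_pts=3)
print(kinds, forward, backward)
```

The last two calls show that density-reachability is not symmetric: a border point can be reached from a core point, but no chain starts at a border point.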
Algorithm
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = regionQuery(P, eps)
      if sizeof(N) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, N, C, eps, MinPts)
expandCluster(P, N, C, eps, MinPts)
   add P to cluster C
   for each point P' in N
      if P' is not visited
         mark P' as visited
         N' = regionQuery(P', eps)
         if sizeof(N') >= MinPts
            N = N joined with N'
      if P' is not yet member of any cluster
         add P' to cluster C
• Complexity with indexing structure: O(n*log(n))
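The pseudocode above could be translated into runnable Python roughly as follows. This is a sketch under assumptions: `region_query` is a brute-force scan (so the whole run is O(n²), not the indexed O(n*log(n)) case), clusters are numbered from 0, noise is labelled -1, and the sample points and parameters are hypothetical.

```python
from math import dist

NOISE = -1  # assumed label for points that belong to no cluster

def region_query(points, i, eps):
    """Indices of all points within eps of points[i]; O(n) per call."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

def expand_cluster(points, labels, visited, i, neighbors, cluster, eps, min_pts):
    labels[i] = cluster
    queued = set(neighbors)
    k = 0
    while k < len(neighbors):  # the list grows while it is being scanned
        j = neighbors[k]
        if not visited[j]:
            visited[j] = True
            more = region_query(points, j, eps)
            if len(more) >= min_pts:
                # "N = N joined with N'", skipping indices already queued
                for m in more:
                    if m not in queued:
                        queued.add(m)
                        neighbors.append(m)
        if labels[j] is None or labels[j] == NOISE:
            labels[j] = cluster  # not yet a member of any cluster
        k += 1

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    visited = [False] * len(points)
    cluster = -1
    for i in range(len(points)):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
        else:
            cluster += 1
            expand_cluster(points, labels, visited, i, neighbors,
                           cluster, eps, min_pts)
    return labels

# Hypothetical data: two compact groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

A point first marked as noise can still be absorbed into a cluster later, which is why `expand_cluster` overwrites the noise label.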
DBSCAN on R
• FPC - Flexible Procedures for Clustering
• GNU General Public License
• Various methods for clustering and cluster validation
• Interface functions for many methods implemented in language R
• DBSCAN: O(n²)
• dbscan(x,0.2,showplot=2)
dbscan Pts=600 MinPts=5 eps=0.2
        0  1  2  3  4  5  6  7  8  9 10 11
seed    0 50 53 51 52 51 54 54 54 53 51  1
border 28  4  4  8  5  3  3  4  3  4  6  4
total  28 54 57 59 57 54 57 58 57 57 57  5
Example
• Astronomy task
• Identifying celestial objects by capturing the radiation they emit
• Captured noise (from sensors, diffuse emission from the atmosphere and space itself)
• Elimination method: constrain the relevant intensity by a known threshold
• In this case, only pixels whose intensity is less than 50 (and consequently darker) are considered
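The thresholding step might look like the sketch below. The image is assumed to be a list of rows of integer intensities; the function name `threshold_pixels` and the tiny sample image are hypothetical, and only the limit of 50 comes from the slide.

```python
def threshold_pixels(image, limit=50):
    """Return (row, col) coordinates of pixels whose intensity is below limit."""
    return [(r, c)
            for r, row in enumerate(image)
            for c, value in enumerate(row)
            if value < limit]

# Hypothetical 3x4 grayscale image; values below 50 are the dark pixels.
image = [
    [200, 180,  30, 210],
    [190,  20,  10, 220],
    [250, 240, 230,  45],
]
candidates = threshold_pixels(image)
print(candidates)
```

The resulting coordinate list is the kind of point set that a clustering step can then take as input.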
• DBSCAN algorithm applied to individual pixels
• Linking together a complete emission area
• Each generated cluster defines a celestial entity
• ε = 5, MinPts = 5: 64 clusters and 224 outliers found
Disadvantages
• Choosing appropriate parameters ε and MinPts
  • Numerous experiments indicate that MinPts = 4 works best
• Clustering data sets with large differences in density
• “Curse of dimensionality”
  • Affects every algorithm based on Euclidean distance on high-dimensional data sets
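One way to see the dimensionality problem is to measure how the nearest and farthest Euclidean distances from a query point converge as the dimension grows. The sketch below is an illustration only: the uniform random data, the sample size, the seed, and the name `relative_contrast` are assumptions, not part of the original material.

```python
import random
from math import dist

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [dist(query, [rng.random() for _ in range(dim)])
                 for _ in range(n_points)]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

low = relative_contrast(2)     # low-dimensional data
high = relative_contrast(500)  # high-dimensional data
print(low, high)
```

When the contrast shrinks, almost all points look equally far apart, so a single radius ε no longer separates dense regions from sparse ones.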
Advantages
• Has a notion of noise
• Requires just two parameters
• Does not require the number of clusters in the data a priori
• Can find arbitrarily shaped clusters
  • Even clusters completely surrounded by a different cluster
• Mostly insensitive to the ordering of the points in the database
  • Only border points might swap cluster membership
References
• Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Institute for Computer Science, University of Munich, 1996;
• Mehmed Kantardzic: “Data Mining: Concepts, Models, Methods, and Algorithms”, 2011;
• Wikibooks: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Density-Based_Clustering;
• Wiki: http://en.wikipedia.org/wiki/DBSCAN