
DBSCAN++: Towards fast and scalable density clustering

Jennifer Jang 1 Heinrich Jiang 2

Abstract

DBSCAN is a classical density-based clustering procedure with tremendous practical relevance. However, DBSCAN implicitly needs to compute the empirical density for each sample point, leading to a quadratic worst-case time complexity, which is too slow on large datasets. We propose DBSCAN++, a simple modification of DBSCAN which only requires computing the densities for a chosen subset of points. We show empirically that, compared to traditional DBSCAN, DBSCAN++ can provide not only competitive performance but also added robustness in the bandwidth hyperparameter while taking a fraction of the runtime. We also present statistical consistency guarantees showing the trade-off between computational cost and estimation rates. Surprisingly, up to a certain point, we can enjoy the same estimation rates while lowering computational cost, showing that DBSCAN++ is a sub-quadratic algorithm that attains minimax optimal rates for level-set estimation, a quality that may be of independent interest.

1. Introduction

Density-based clustering algorithms such as Mean Shift (Cheng, 1995) and DBSCAN (Ester et al., 1996) have made a large impact on a wide range of areas in data analysis, including outlier detection, computer vision, and medical imaging. As data volumes rise, non-parametric unsupervised procedures are becoming ever more important in understanding large datasets. Thus, there is an increasing need to establish more efficient versions of these algorithms. In this paper, we focus on improving the classical DBSCAN procedure.

It was long believed that DBSCAN had a runtime of O(n log n) until it was proven to be O(n^2) in the worst case by Gan and Tao (2015). They showed that while DBSCAN can run in O(n log n) when the dimension is at most 2, it quickly starts to exhibit quadratic behavior in high dimensions and/or when n becomes large. In fact, we show in Figure 1 that even with a simple mixture of 3-dimensional Gaussians, DBSCAN already starts to show quadratic behavior.

1 Uber  2 Google Research. Correspondence to: Jennifer Jang <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

The quadratic runtime for these density-based procedures can be seen from the fact that they implicitly must compute density estimates for each data point, which is linear time in the worst case for each query. In the case of DBSCAN, such queries are proximity-based. There has been much work done in using space-partitioning data structures such as KD-Trees (Bentley, 1975) and Cover Trees (Beygelzimer et al., 2006) to improve query times, but these structures are all still linear in the worst case. Another line of work that has had practical success is in approximate nearest neighbor methods (e.g. Indyk and Motwani (1998); Datar et al. (2004)), which have sub-linear queries, but such methods come with few approximation guarantees.

DBSCAN proceeds by computing the empirical densities for each sample point and then designating points whose densities are above a threshold as core-points. Then, a neighborhood graph of the core-points is constructed, and the clusters are assigned based on the connected components.

In this paper, we present DBSCAN++, a step towards a fast and scalable DBSCAN. DBSCAN++ is based on the observation that we only need to compute the density estimates for a subset of m of the n data points, where m can be much smaller than n, in order to cluster properly. To choose these m points, we provide two simple strategies: uniform and greedy K-center-based sampling. The resulting procedure has O(mn) worst-case runtime.

We show that with this modification, we still maintain statistical consistency guarantees, and we characterize the trade-off between computational cost and estimation rates. Interestingly, up to a certain point, we can enjoy the same minimax-optimal estimation rates attained by DBSCAN while making m (instead of the larger n) empirical density queries, thus leading to a sub-quadratic procedure. In some cases, we saw that our method of limiting the number of core points can act as a regularization, thus reducing the sensitivity of classical DBSCAN to its parameters.


Figure 1. Runtime (seconds) vs. dataset size to cluster a mixture of four 3-dimensional Gaussians. Using Gaussian mixtures, we see that DBSCAN starts to show quadratic behavior as the dataset gets large. After 10^6 points, DBSCAN ran too slowly and was terminated after 3 hours. This is with only 3 dimensions.

We show on both simulated datasets and real datasets that DBSCAN++ runs in a fraction of the time compared to DBSCAN, while giving competitive performance and consistently producing more robust clustering scores across hyperparameter settings.

2. Related Works

There has been much work done on finding faster variants of DBSCAN. We can only highlight some of these works here. One approach is to speed up the nearest neighbor queries that DBSCAN uses (Huang and Bian, 2009; Vijayalaksmi and Punithavalli, 2012; Kumar and Reddy, 2016), including with approximate nearest neighbor methods (Wu et al., 2007). Another approach is to find a set of "leader" points that still preserve the structure of the original data set and then identify clusters based on the clustering of these "leader" points (Geng et al., 2000; Viswanath and Pinkesh, 2006; Viswanath and Babu, 2009). Our approach of finding core points is similar but is simpler and comes with theoretical guarantees. Liu (2006) modified DBSCAN by selecting clustering seeds among the unlabeled core points in an orderly manner in order to reduce computation time in regions that have already been clustered. Other heuristics include (Borah and Bhattacharyya, 2004; Zhou et al., 2000b; Patwary et al., 2012; Kryszkiewicz and Lasek, 2010).

There are also numerous approaches based on parallel computing such as (Xu et al., 1999; Zhou et al., 2000a; Arlia and Coppola, 2001; Brecheisen et al., 2006; Chen et al., 2010; Patwary et al., 2012; Götz et al., 2015), including map-reduce based approaches (Fu et al., 2011; He et al., 2011; Dai and Lin, 2012; Noticewala and Vaghela, 2014). Then there are distributed approaches to DBSCAN where data is partitioned across different locations and there may be communication cost constraints (Januzaj et al., 2004b;a; Liu et al., 2012; Neto et al., 2015; Lulli et al., 2016). It is also worth mentioning Andrade et al. (2013), who presented a GPU implementation of DBSCAN that can be over 100x faster than sequential DBSCAN. In this paper, we assume a single processor, although extending our approach to the parallel or distributed settings could be a future research direction.

We now discuss the theoretical work done for DBSCAN. Despite the practical significance of DBSCAN, its statistical properties have only been explored recently (Sriperumbudur and Steinwart, 2012; Jiang, 2017a; Wang et al., 2017; Steinwart et al., 2017). Such analyses make use of recent developments in topological data analysis to show that DBSCAN estimates the connected components of a level-set of the underlying density.

It turns out there is a long history of estimating the level-sets of the density function (Hartigan, 1975; Tsybakov et al., 1997; Singh et al., 2009; Rigollet et al., 2009; Rinaldo and Wasserman, 2010; Chaudhuri and Dasgupta, 2010; Steinwart, 2011; Balakrishnan et al., 2013; Chaudhuri et al., 2014; Jiang, 2017b; Chen et al., 2017). However, most of these methods have little practical value (some are unimplementable), and DBSCAN is one of the only practical methods that is able to attain the strongest guarantees, including finite-sample Hausdorff minimax optimal rates. In fact, the only previous method shown to attain such guarantees was the impractical histogram-based method of Singh et al. (2009), until Jiang (2017a) showed that DBSCAN attained almost identical guarantees. This paper shows that DBSCAN++ can attain similar guarantees while being sub-quadratic in computational complexity, and it characterizes the precise trade-off in estimation rates for further computational speedup.

3. Algorithm

We have n i.i.d. samples X = {x1, ..., xn} drawn from a distribution F over R^D. We now define core-points, which are essentially points with high empirical density defined with respect to the two hyperparameters of DBSCAN, minPts and ε. The latter is also known as the bandwidth.

Definition 1. Let ε > 0 and minPts be a positive integer. Then x ∈ X is a core-point if |B(x, ε) ∩ X| ≥ minPts, where B(x, ε) := {x′ : |x − x′| ≤ ε}.

In other words, a core-point is a sample point that has at least minPts sample points within its ε-radius neighborhood.

DBSCAN (Ester et al., 1996) is shown as Algorithm 1, which is in a more concise but equivalent form to the original version (see Jiang (2017a)). It creates a graph G with core-points as vertices and edges connecting core-points that are at distance at most ε apart. The final clusters are represented by the connected components in this graph, along with the non-core-points that are close to such a connected component. The remaining points are designated as noise points and are left unclustered. Noise points can be seen as outliers.

Algorithm 1 DBSCAN
Inputs: X, ε, minPts
C ← core-points in X given ε and minPts
G ← initialize empty graph
for c ∈ C do
    Add an edge (and possibly a vertex or vertices) in G from c to all points in X ∩ B(c, ε)
end for
return connected components of G.
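Below is a minimal Python sketch of Algorithm 1 as written above; it is an illustration under the stated definitions, not the original implementation. It assumes X is an (n, d) NumPy array and uses scikit-learn's KDTree for the ε-radius queries and SciPy for the connected components.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import KDTree

def dbscan(X, eps, min_pts):
    n = len(X)
    tree = KDTree(X)
    neighborhoods = tree.query_radius(X, r=eps)        # B(x, eps) for every sample
    core = np.flatnonzero([len(nb) >= min_pts for nb in neighborhoods])
    if len(core) == 0:
        return np.full(n, -1)                          # everything is noise

    # Edges of G: from each core point c to all points in X ∩ B(c, eps).
    rows = np.concatenate([np.full(len(neighborhoods[c]), c) for c in core])
    cols = np.concatenate([neighborhoods[c] for c in core])
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

    # Clusters are the connected components of G; points that appear in no edge
    # (no core point within eps of them) are left unclustered as noise (-1).
    _, labels = connected_components(adj, directed=False)
    touched = np.zeros(n, dtype=bool)
    touched[rows] = True
    touched[cols] = True
    return np.where(touched, labels, -1)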

Figure 2. Core-points from a mixture of three 2D Gaussians. Each point marked with a triangle represents a core-point and the shaded area its ε-neighborhood. The total ε-radii area of DBSCAN++ core-points provides adequate coverage of the dataset. The K-center initialization produces an even more efficient covering. The points that are not covered will be designated as outliers. This illustrates that a strategically selected subset of core points can produce a reasonable ε-neighborhood cover for clustering.

3.1. Uniform Initialization

DBSCAN++, shown in Algorithm 2, proceeds as follows: First, it chooses a subset S of m uniformly sampled points from the dataset X. Then, it computes the empirical density of points in S w.r.t. the entire dataset; that is, a point x ∈ S is a core-point if |B(x, ε) ∩ X| ≥ minPts. From here, DBSCAN++ builds a similar neighborhood graph G of the core-points in S and finds the connected components in G. Finally, it clusters the rest of the unlabeled points to their closest core-points. Thus, since it only recovers a fraction of the core-points, it requires expensive density estimation queries on only m of the n samples. However, the intuition, as shown in Figure 2, is that a smaller sample of core-points can still provide adequate coverage of the dataset and lead to a reasonable clustering.
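The following is a minimal Python sketch of the uniform-initialization procedure just described (Algorithm 2 gives the pseudocode). It is an illustration under the description above rather than the authors' implementation; it assumes NumPy, SciPy, and scikit-learn, and it labels points farther than ε from every chosen core-point as noise, matching the noise-point definition used later.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import KDTree

def dbscan_pp(X, m, eps, min_pts, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    tree = KDTree(X)

    # Step 1: uniformly sample the subset S of m points.
    S = rng.choice(n, size=m, replace=False)

    # Step 2: core-points of S, with densities computed w.r.t. the full dataset.
    counts = tree.query_radius(X[S], r=eps, count_only=True)
    core = S[counts >= min_pts]
    if len(core) == 0:
        return np.full(n, -1)

    # Step 3: neighborhood graph over the core-points (edges between core-points
    # within eps of each other) and its connected components.
    core_tree = KDTree(X[core])
    nb = core_tree.query_radius(X[core], r=eps)
    rows = np.concatenate([np.full(len(b), i) for i, b in enumerate(nb)])
    cols = np.concatenate(nb)
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(len(core), len(core)))
    _, core_labels = connected_components(adj, directed=False)

    # Step 4: assign every remaining point to its closest core-point's cluster;
    # points farther than eps from all core-points are kept as noise (-1).
    dist, idx = core_tree.query(X, k=1)
    labels = core_labels[idx.ravel()].copy()
    labels[dist.ravel() > eps] = -1
    return labels

For the K-center variant described next, the uniform sample in Step 1 would simply be replaced by the output of the greedy initialization in Algorithm 3.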

3.2. K-Center Initialization

Instead of uniformly choosing the subset of m points at random, we can use K-center (Gonzalez, 1985; Har-Peled, 2011), which aims at finding the subset of size m that minimizes the maximum distance of any point in X to its closest point in that subset.

Algorithm 2 DBSCAN++
Inputs: X, m, ε, minPts
S ← sample m points from X
C ← all core-points in S w.r.t. X, ε, and minPts
G ← empty graph
for c ∈ C do
    Add an edge (and possibly a vertex or vertices) in G from c to all points in X ∩ B(c, ε)
end for
return connected components of G.

In other words, K-center tries to find the most efficient covering of the sample points. We use the greedy initialization method for approximating K-center (Algorithm 3), which repeatedly picks the farthest point from any point currently in the set. This process continues until we have a total of m points. This method gives a 2-approximation to the K-center problem.

Algorithm 3 Greedy K-center Initialization
Input: X, m
S ← {x1}
for i from 1 to m − 1 do
    Add argmax_{x∈X} min_{s∈S} |x − s| to S
end for
return S.
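A short Python sketch of this greedy initialization follows; the incremental distance update is an assumed implementation detail that keeps the overall cost at O(mn), consistent with Section 3.3.

import numpy as np

def greedy_k_center(X, m):
    centers = [0]                                    # start from x1
    dist = np.linalg.norm(X - X[0], axis=1)          # distance to nearest chosen center
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))                   # farthest point from the current set S
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(centers)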

3.3. Time Complexity

DBSCAN++ has a time complexity of O(nm). Choosing the set S takes linear time for the uniform initialization method and O(mn) for the greedy K-center approach (Gonzalez, 1985). The next step is to find the core-points. We use a KD-Tree to query for the points within the ε-radius ball of each point in S. Each such query takes O(n) in the worst case, and doing so for m sampled points leads to a cost of O(nm). Constructing the graph takes O(mn) time, and running a depth-first search on the graph recovers the connected components in O(nm), since the graph has at most O(nm) edges.

The last step is to cluster the remaining points to the nearest core point. We once again use a KD-Tree, which takes O(m) per query for each of the O(n) points, leading to a time complexity of O(nm) as well. Thus, the time complexity of DBSCAN++ is O(nm).
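As a rough worked example of the savings: with n = 10^6 and m = 10^3, the O(nm) bound corresponds to on the order of 10^9 distance computations in the worst case, versus on the order of n^2 = 10^12 for quadratic DBSCAN.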

4. Theoretical Analysis

In this section, we show that DBSCAN++ is a consistent estimator of the density level-sets. It was recently shown by Jiang (2017a) that DBSCAN does this with finite-sample guarantees. We extend this analysis to show that our modified DBSCAN++ also has statistical consistency guarantees, and we show the trade-off between speed and convergence rate.

Definition 2 (Level-set). The λ-level-set of f is defined as Lf(λ) := {x ∈ X : f(x) ≥ λ}.

Our results are for the setting in which the density level λ is known, and they give insight into how to tune the parameters based on the desired density level.

4.1. Regularity Assumptions

We have n i.i.d. samples X = {x1, ..., xn} drawn from a distribution F over R^D. We take f to be the density of F over the uniform measure on R^D.

Assumption 1. f is continuous and has compact support X ⊆ R^D.

Much of the results will depend on the behavior of level-set boundaries. Thus, we require sufficient drop-off at the boundaries as well as separation between the connected components (CCs) at a particular level-set.

Define the following shorthands for the distance from a point to a set and the neighborhood around a set.

Definition 3. d(x, A) := inf_{x′∈A} |x − x′|, and B(C, r) := {x ∈ X : d(x, C) ≤ r}.

Assumption 2 (β-regularity of level-sets). Let 0 < β < ∞. There exist Ĉ, Č, rc > 0 such that the following holds for all x ∈ B(Lf(λ), rc) \ Lf(λ):

    Č · d(x, Lf(λ))^β ≤ λ − f(x) ≤ Ĉ · d(x, Lf(λ))^β.

Remark 1. We can choose any 0 < β < ∞. The β-regularity condition is a standard assumption in level-set analyses; see Singh et al. (2009). The higher the β, the smoother the density is around the boundary of the level-set, and thus the less salient the boundary is. This makes it more difficult to recover the level-set.

4.2. Hyperparameter Settings

In this section, we state the hyperparameter settings in terms of n, the sample size, and the desired density level λ, in order for the statistical consistency guarantees to hold. Define Cδ,n = 16 log(2/δ) √(log n), where δ, 0 < δ < 1, is a confidence parameter which will be used later (i.e. guarantees will hold with probability at least 1 − δ). Then set

    ε = ( minPts / ( n · vD · (λ − λ · Cδ,n^2 / √minPts) ) )^(1/D),

where vD is the volume of the unit ball in R^D and minPts satisfies

    Cl · (log n)^2 ≤ minPts ≤ Cu · (log n)^(2D/(2+D)) · n^(2β/(2β+D)),

and Cl and Cu are positive constants depending on δ and f.
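The following small Python sketch simply evaluates the ε setting above for given n, D, λ, minPts, and δ; it is illustrative only, since the constants Cl and Cu (and hence an admissible minPts) are not specified explicitly by the theory.

import math

def unit_ball_volume(D):
    # v_D: volume of the unit ball in R^D.
    return math.pi ** (D / 2) / math.gamma(D / 2 + 1)

def epsilon_setting(n, D, lam, min_pts, delta):
    C_dn = 16 * math.log(2 / delta) * math.sqrt(math.log(n))
    threshold = lam - lam * C_dn ** 2 / math.sqrt(min_pts)
    assert threshold > 0, "minPts must be large enough for the threshold to be positive"
    return (min_pts / (n * unit_ball_volume(D) * threshold)) ** (1 / D)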

4.3. Level-set estimation result

We give the estimation rate under the Hausdorff metric.

Definition 4 (Hausdorff Distance).

    dHaus(A, A′) = max{ sup_{x∈A} d(x, A′), sup_{x′∈A′} d(x′, A) }.

Theorem 1. Suppose Assumptions 1 and 2 hold, and assume the parameter settings of the previous section. There exist Cl and C sufficiently large and Cu sufficiently small such that the following holds with probability at least 1 − δ. Let L̂f(λ) be the core-points returned by Algorithm 2 under uniform initialization or greedy K-center initialization. Then,

    dHaus(L̂f(λ), Lf(λ)) ≤ C · ( Cδ,n^(2/β) · minPts^(−1/(2β)) + Cδ,n^(1/D) · ( √(log m) / m )^(1/D) ).

Proof. There are two quantities to bound: (i) max_{x∈L̂f(λ)} d(x, Lf(λ)), which ensures that the estimated core-points are not far from the true level-set Lf(λ), and (ii) sup_{x∈Lf(λ)} d(x, L̂f(λ)), which ensures that the estimated core-points provide a good covering of the level-set.

The bound for (i) follows by the main result of Jiang (2017a). This is because DBSCAN++'s estimated core-points are a subset of those of the original DBSCAN procedure. Thus, max_{x∈L̂f(λ)} d(x, Lf(λ)) is at most the corresponding maximum over the core-points returned by the original DBSCAN, and that quantity is bounded by O(Cδ,n^(2/β) · minPts^(−1/(2β))) by Jiang (2017a).

We now turn to the other direction and bound sup_{x∈Lf(λ)} d(x, L̂f(λ)). Let x ∈ Lf(λ).

Suppose we use the uniform initialization. Define r0 := ( 2 Cδ,n √(D log m) / (m · vD · λ) )^(1/D). Then, we have

    ∫_X f(z) · 1[z ∈ B(x, r0)] dz ≥ vD r0^D (λ − Ĉ r0^β) ≥ vD r0^D λ/2 = Cδ,n √(D log m) / m,

where the first inequality holds from Assumption 2, the second inequality holds for n sufficiently large, and the last holds from the conditions on minPts.

By the uniform ball convergence rates of Lemma 7 of Chaudhuri and Dasgupta (2010), we have that with high probability, there exists a sample point x′ ∈ S such that |x − x′| ≤ r0. This is because the ball B(x, r0) contains sufficiently high true mass to be guaranteed a sample point in S. Moreover, this guarantee holds with high probability uniformly over x ∈ X. Next, we show that x′ is a core-point. This follows by Lemma 8 of Jiang (2017a), which shows that any sample point x ∈ Lf(λ) satisfies |B(x, ε) ∩ X| ≥ minPts. Thus, x′ ∈ L̂f(λ). Hence, sup_{x∈Lf(λ)} d(x, L̂f(λ)) ≤ r0, as desired.

Now suppose we use the greedy K-center initialization. Define the attained K-center objective

    τ := max_{x∈X} min_{s∈S} d(s, x),

and the optimal K-center objective

    τ_opt := min_{S′⊆X, |S′|=m} max_{x∈X} min_{s∈S′} d(s, x).

It is known that the greedy K-center initialization is a 2-approximation (see Gonzalez (1985); Har-Peled (2011)), thus

    τ ≤ 2 τ_opt ≤ 2 r0,

where the last inequality follows with high probability since the optimal K-center objective can only be better than the objective attained by sampling the m centers uniformly. Then, we have

    sup_{x∈Lf(λ)} min_{s∈S} d(s, x) ≤ max_{x∈X} min_{s∈S} d(s, x) + dHaus(Lf(λ), X ∩ Lf(λ)) ≤ τ + r0 ≤ 3 r0.

The argument then proceeds in the same way as with uniform initialization but with an extra constant factor, as desired.

Remark 2. When taking minPts to the maximum allowed rate,

    minPts ≈ n^(2β/(2β+D)),

we obtain the error rate (ignoring log factors)

    dHaus(L̂f(λ), Lf(λ)) ≲ n^(−1/(2β+D)) + m^(−1/D).

The first term matches the known lower bound for level-set estimation established in Theorem 4 of Tsybakov et al. (1997). The second term is the trade-off for computing the empirical densities for only m of the points. In particular, if we take

    m ≳ n^(D/(2β+D)),

then the first term dominates, and we thus have dHaus(L̂f(λ), Lf(λ)) ≲ n^(−1/(2β+D)), the minimax optimal rate for level-set estimation. This leads to the following result.

Corollary 1. Suppose the conditions of Theorem 1 hold and set m ≈ n^(D/(2β+D)). Then, Algorithm 2 is a minimax optimal estimator (up to logarithmic factors) of the density level-set with sub-quadratic runtime of O(n^(2−2β/(2β+D))).

Dataset             n       D     c     m
(A) iris            150     4     3     3
(B) wine            178     13    3     5
(C) spam            1401    57    2     793
(D) images          210     19    7     24
(E) MNIST           60000   20    10    958
(F) Libras          360     90    15    84
(G) mobile          2000    20    4     112
(H) zoo             101     16    7     8
(I) seeds           210     19    7     6
(J) letters         20000   16    26    551
(K) phonemes        4509    256   5     396
(L) fashion MNIST   60000   784   10    5674
(M) celeb-a         10000   40    3     3511

Figure 3. Summary of datasets used, including dataset size (n), number of features (D), number of clusters (c), and the subsample size (m) used by both DBSCAN++ uniform and K-center.

4.4. Estimating the connected components

The previous section shows that the core-points returned by DBSCAN++ recover the density level-set. The more interesting question is about the actual clustering: that is, whether DBSCAN++ can recover the connected components of the density level-set individually, and whether there is a 1:1 correspondence between the clusters returned by DBSCAN++ and the connected components.

It turns out that to obtain such a result, we need a minor modification of the procedure: after determining the core points, instead of using the ε cutoff to connect points into the same cluster, we must use a higher cutoff. In fact, any constant value would do as long as it is sufficiently smaller than the pairwise distances between the connected components. For the original DBSCAN algorithm, many analyses make this same modification, which is known as pruning false clusters in the literature (see Kpotufe and von Luxburg (2011); Jiang (2017a)). The same analysis carries over to our modification, and we omit it here. We note that pruning does not change the final estimation rates but may change the initial sample size required.

4.5. Outlier detection

One important application of DBSCAN is outlier detection (Breunig et al., 2000; Çelik et al., 2011; Thang and Kim, 2011). Datapoints not assigned to clusters are noise points and can be considered outliers. This is because the noise points are the low-density points away from the clusters and thus tend to be points with few similar representatives in the dataset. We show that the noise points DBSCAN++ finds are similar to the noise points discovered by DBSCAN. We first give a simple result showing that every DBSCAN noise point is also one that DBSCAN++ finds (Lemma 1). Then, Figure 4 (Left) shows that the number of noise points of DBSCAN++ quickly converges to that of DBSCAN as the ratio m/n increases, which, combined with Lemma 1, shows that the noise points DBSCAN++ returns closely approximate those returned by DBSCAN for m/n sufficiently high.

Lemma 1 (Noise points). For any dataset, if N0 and N1 are the noise points found by DBSCAN and DBSCAN++ respectively, then as long as they have the same setting of ε and minPts, we have that N0 ⊆ N1.

Proof. Noise points are those that are farther than ε away from any core point. The result follows since the DBSCAN++ core points are a subset of those of DBSCAN.

5. Experiments

5.1. Dataset and setup

We ran DBSCAN++ with uniform and K-center initializations and compared both to DBSCAN on 11 real datasets as described in Figure 3. We used Phonemes (Friedman et al., 2001), a dataset of log-periodograms of spoken phonemes, and MNIST, a sub-sample of the MNIST handwriting recognition dataset after running PCA down to 20 dimensions. The rest of the datasets are standard UCI or Kaggle datasets used for clustering. We evaluate performance via two widely-used clustering scores computed against the ground truth: the adjusted Rand index (Hubert and Arabie, 1985) and the adjusted mutual information score (Vinh et al., 2010). We fixed minPts = 10 for all procedures throughout the experiments.
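For reference, both scores are available in scikit-learn; the toy labels below are only for illustration.

from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]   # ground-truth cluster assignments (toy example)
y_pred = [0, 0, 1, 2, 2, 2]   # labels returned by a clustering procedure

ari = adjusted_rand_score(y_true, y_pred)
ami = adjusted_mutual_info_score(y_true, y_pred)
print(f"adjusted Rand index: {ari:.3f}, adjusted mutual information: {ami:.3f}")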

5.2. Trade-off between accuracy and speed

The theoretical results suggest that up to a certain point, only computing empirical densities for a subset of the sample points should not noticeably impact the clustering performance. Past that point, we begin to see a trade-off. We confirm this empirically in Figure 4 (Right), which shows that indeed past a certain threshold of m/n, the clustering scores remain high. Only when the sub-sample is too small do we begin to see a significant trade-off in clustering scores. This shows that DBSCAN++ can save considerable computational cost while maintaining clustering performance similar to DBSCAN.

We further demonstrate this point by applying these procedures to image segmentation, where segmentation is done by clustering the image's pixels (with each pixel represented as a 5-dimensional vector consisting of (x, y) position and RGB color). Figure 5 shows that DBSCAN++ provides a very similar segmentation as DBSCAN in a fraction of the time. While this is just a simple qualitative example, it serves to show the applicability of DBSCAN++ to a potentially wide range of problems.
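A sketch of this pixel featurization follows (assumed preprocessing, not necessarily the exact pipeline used here; the filename is hypothetical and the dbscan_pp call refers to the illustrative sketch from Section 3).

import numpy as np
from PIL import Image

img = np.asarray(Image.open("skater.png").convert("RGB"), dtype=float)  # hypothetical file
h, w, _ = img.shape
ys, xs = np.mgrid[0:h, 0:w]
# Each pixel becomes a 5-dimensional vector (x, y, R, G, B).
features = np.column_stack([xs.ravel(), ys.ravel(), img.reshape(-1, 3)])
# Hyperparameters from Figure 5: eps = 60, m/n = 0.3, minPts = 10.
# labels = dbscan_pp(features, m=int(0.3 * len(features)), eps=60, min_pts=10)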

Figure 4. Each row corresponds to a dataset. See Figure 3 for dataset descriptions. Left (Outlier Detection): The number of outliers (i.e. noise points) returned by DBSCAN++ as a function of m/n, compared to that of DBSCAN. We see that the set of DBSCAN++ outliers quickly approaches that of DBSCAN, showing that DBSCAN++ remains effective at outlier detection compared to DBSCAN, especially when m/n is sufficiently high. Right (Clustering Performance): We plot the clustering accuracy and runtimes for eight real datasets as a function of the ratio m/n. As expected, the runtime increases approximately linearly in this ratio, but the clustering scores consistently attain high values when m/n is sufficiently large. Interestingly, sometimes we attain higher scores with lower values of m/n, thus giving both better runtime and accuracy.


Figure 5. Figure skater Yuzuru Hanyu at the 2018 Olympics. DBSCAN was initiated with hyperparameters ε = 8 and minPts = 10, and DBSCAN++ with ε = 60, m/n = 0.3, and minPts = 10. DBSCAN++ with K-center initialization recovers similar clusters (designated by the purple boundaries) in the 988 × 750 image as DBSCAN in far less time: 7.38 seconds versus 44.18 seconds. The speedup becomes more significant on higher-resolution images.

5.3. Robustness to Hyperparameters

In Figure 6, we plot each algorithm's performance across a wide range of its hyperparameters. The table in Figure 7 shows the best scores and runtimes for each dataset and algorithm. For these experiments, we chose m = p · n^(D/(D+4)), where 0 < p < 1 was chosen by validating over just 3 values, as explained in Figure 7. We found that the K-center initialization required smaller p due to its ability to find a good covering of the space and thus more efficiently choose the sample points to query for the empirical density.

The results in Figure 6 show that DBSCAN++ with uniform initialization gives competitive performance compared to DBSCAN but with robustness across a much wider range of ε. In fact, in a number of cases, DBSCAN++ was even better than DBSCAN under optimal tuning. DBSCAN++ with K-center initialization further improves on the clustering results of DBSCAN++ for most of the datasets. Pruning the core-points as DBSCAN++ does may act as a regularizer, reducing the algorithm's dependence on the precise setting of its parameters.

An explanation of why DBSCAN++ adds robustness across ε follows. When tuning DBSCAN with respect to ε, we found that DBSCAN often performed optimally on only a narrow range of ε. Because ε controls both the designation of points as core-points as well as the connectivity of the core-points, small changes could produce significantly different clusterings.

Figure 6. Clustering performance over a range of hyperparameter settings. Experimental results on the datasets described in Figure 3. Each row corresponds to a single dataset and each column corresponds to a clustering score. For each dataset and clustering score, we plot the scores for DBSCAN++ with uniform and K-center sampling vs. DBSCAN across a wide range of settings for ε (x-axis).


         DBSCAN   Uniform           K-Center
(A) ARI  0.5681   0.6163 (±0.01)    0.6634
    AMI  0.5768   0.6449 (±0.01)    0.7301
(B) ARI  0.2851   0.3254 (±0.01)    0.3694
    AMI  0.3587   0.3605 (±0.00)    0.4148
(C) ARI  0.2851   0.3254 (±0.01)    0.3694
    AMI  0.3587   0.3605 (±0.00)    0.4148
(D) ARI  0.2922   0.2701 (±0.01)    0.3853
    AMI  0.4938   0.4289 (±0.01)    0.5600
(E) ARI  0.0844   0.1097 (±0.00)    0.1416
    AMI  0.1743   0.3774 (±0.00)    0.3152
(F) ARI  0.0939   0.1380 (±0.00)    0.2095
    AMI  0.2170   0.3033 (±0.00)    0.4461
(G) ARI  0.0551   0.1741 (±0.00)    0.1091
    AMI  0.2123   0.2585 (±0.00)    0.2418
(H) ARI  0.6846   0.6729 (±0.01)    0.7340
    AMI  0.6347   0.6356 (±0.00)    0.7456
(I) ARI  0.4041   0.4991 (±0.02)    0.4402
    AMI  0.4403   0.4843 (±0.02)    0.5057
(J) ARI  0.0623   0.0488 (±0.00)    0.0901
    AMI  0.3823   0.3956 (±0.00)    0.3841
(K) ARI  0.5101   0.5541 (±0.01)    0.5364
    AMI  0.6475   0.6259 (±0.01)    0.6452

Figure 7. Highest scores for each dataset and algorithm. The first row of each pair is the adjusted Rand index (ARI) and the second row the adjusted mutual information (AMI). The highest score of each row is in green and the second highest in orange. The standard error over 10 runs is given in parentheses for DBSCAN++ with uniform initialization; the other two algorithms are deterministic. Each algorithm was tuned across a range of ε with minPts = 10. For both DBSCAN++ algorithms, we used p values of 0.1, 0.2, or 0.3. DBSCAN++ uniform performs better than DBSCAN on 17 out of 22 metrics, while DBSCAN++ K-center performs better than DBSCAN on 21 out of 22 metrics.

In contrast, DBSCAN++ suffers less from the hyper-connectivity of the core-points until ε is very large. It turns out that only processing a subset of the core-points not only reduces the runtime of the algorithm, but also provides the practical benefit of reducing tenuous connections between connected components that are actually far apart. This way, DBSCAN++ is much less sensitive to changes in ε and reaches its saturation point (where there is only one cluster) only at very large ε.

Performance under optimal tuning is often not available in practice, and this is especially the case in unsupervised problems like clustering where the ground truth is not assumed to be known. Thus, not only should procedures produce accurate clusterings in the best setting, it may be even more important for procedures to be precise, easy to tune, and reasonable across a wide range of their hyperparameter settings. This added robustness (not to mention the speedup) may make DBSCAN++ a more practical method.

        DBSCAN           Uniform           K-Center
(A)     3.07 (±0.08)     1.52 (±0.09)      2.55 (±0.34)
(B)     2.04 (±0.07)     1.31 (±0.07)      0.79 (±0.02)
(C)     3308 (±26.4)     225.86 (±6.8)     442.69 (±2.0)
(D)     4.88 (±0.09)     1.51 (±0.05)      1.32 (±0.04)
(E)     1.5e5 (±0.17)    3.5e3 (±39.23)    7.0e3 (±41.1)
(F)     37.63 (±0.38)    8.20 (±0.22)      9.84 (±0.06)
(G)     67.05 (±0.63)    11.41 (±0.21)     15.23 (±0.32)
(H)     1.07 (±0.03)     0.78 (±0.03)      0.81 (±0.03)
(I)     1.75 (±0.04)     0.91 (±0.03)      0.97 (±0.09)
(J)     1.0e5 (±76.43)   5.2e3 (±17.48)    1.5e3 (±36.4)
(K)     1.2e4 (±160)     1.9e3 (±32.45)    1.9e3 (±30.4)
(L)     3.9e9 (±4.3e4)   7.4e8 (±4.1e3)    3.6e8 (±307)
(M)     4.1e9 (±6.2e4)   3.1e8 (±411)      2.3e8 (±1.1e3)

Figure 8. Runtimes (milliseconds) and standard errors for each dataset and algorithm. DBSCAN++ using both uniform and K-center initializations performs reasonably well within a fraction of the runtime of DBSCAN. The larger the dataset, the less time DBSCAN++ requires compared to DBSCAN, showing that DBSCAN++ scales much better in practice.

This is especially true on large datasets, where it may be costly to iterate over many hyperparameter settings.

5.4. Performance under optimal tuning

Under optimal tuning of each algorithm, DBSCAN++ consistently outperforms DBSCAN in both clustering scores and runtime. We see in Figure 7 that DBSCAN++ with the uniform initialization consistently outperforms DBSCAN, and that DBSCAN++ with K-center initialization consistently outperforms both of the other algorithms. Figure 8 shows that DBSCAN++ indeed gives a speed advantage over DBSCAN for the runs that attained the optimal performance. These results thus suggest that not only is DBSCAN++ faster, it can also achieve better clusterings.

6. Conclusion

In this paper, we presented DBSCAN++, a modified version of DBSCAN that only computes the density estimates for a subset of m of the n points in the original dataset. We established statistical consistency guarantees which show the trade-off between computational cost and estimation rates, and we proved that, interestingly, up to a certain point we can enjoy the same estimation rates while reducing computational cost. We also demonstrated this finding empirically. We then showed empirically that not only can DBSCAN++ scale considerably better than DBSCAN, its performance is competitive in accuracy and consistently more robust across their mutual bandwidth hyperparameter. Such robustness can be highly desirable in practice where optimal tuning is costly or unavailable.


References

Guilherme Andrade, Gabriel Ramos, Daniel Madeira, Rafael Sachetto, Renato Ferreira, and Leonardo Rocha. G-DBSCAN: A GPU accelerated algorithm for density-based clustering. Procedia Computer Science, 18:369–378, 2013.

Domenica Arlia and Massimo Coppola. Experiments in parallel clustering with DBSCAN. In European Conference on Parallel Processing, pages 326–331. Springer, 2001.

Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. Cluster trees on manifolds. In Advances in Neural Information Processing Systems, pages 2679–2687, 2013.

Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.

Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning, pages 97–104. ACM, 2006.

B. Borah and D. K. Bhattacharyya. An improved sampling-based DBSCAN for large spatial databases. In Intelligent Sensing and Information Processing, 2004. Proceedings of International Conference on, pages 92–96. IEEE, 2004.

Stefan Brecheisen, Hans-Peter Kriegel, and Martin Pfeifle. Parallel density-based clustering of complex objects. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 179–188. Springer, 2006.

Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: identifying density-based local outliers. In ACM SIGMOD Record, volume 29, pages 93–104. ACM, 2000.

Mete Çelik, Filiz Dadaser-Çelik, and Ahmet Sakir Dokuz. Anomaly detection in temperature data using DBSCAN algorithm. In Innovations in Intelligent Systems and Applications (INISTA), 2011 International Symposium on, pages 91–95. IEEE, 2011.

Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 343–351, 2010.

Kamalika Chaudhuri, Sanjoy Dasgupta, Samory Kpotufe, and Ulrike von Luxburg. Consistent procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory, 60(12):7900–7912, 2014.

Min Chen, Xuedong Gao, and HuiFei Li. Parallel DBSCAN with priority R-tree. In Information Management and Engineering (ICIME), 2010 The 2nd IEEE International Conference on, pages 508–511. IEEE, 2010.

Yen-Chi Chen, Christopher R. Genovese, and Larry Wasserman. Density level sets: Asymptotics, inference, and visualization. Journal of the American Statistical Association, pages 1–13, 2017.

Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.

Bi-Ru Dai and I-Chang Lin. Efficient map/reduce-based DBSCAN algorithm with optimized data partition. In 2012 IEEE Fifth International Conference on Cloud Computing, pages 59–66. IEEE, 2012.

Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pages 253–262. ACM, 2004.

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.

Yan Xiang Fu, Wei Zhong Zhao, and Hui Fang Ma. Research on parallel DBSCAN algorithm design based on MapReduce. In Advanced Materials Research, volume 301, pages 1133–1138. Trans Tech Publications, 2011.

Junhao Gan and Yufei Tao. DBSCAN revisited: mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 519–530. ACM, 2015.

ZHOU Shui Geng, ZHOU Ao Ying, CAO Jing, and HU Yun Fa. A fast density based clustering algorithm [J]. Journal of Computer Research and Development, 11:001, 2000.

Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.

Markus Götz, Christian Bodenstein, and Morris Riedel. HPDBSCAN: highly parallel DBSCAN. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments, page 2. ACM, 2015.

Sariel Har-Peled. Geometric Approximation Algorithms. Number 173. American Mathematical Society, 2011.

John A. Hartigan. Clustering Algorithms, volume 209. Wiley, New York, 1975.

Yaobin He, Haoyu Tan, Wuman Luo, Huajian Mao, Di Ma, Shengzhong Feng, and Jianping Fan. MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, pages 473–480. IEEE, 2011.

Ming Huang and Fuling Bian. A grid and density based fast spatial clustering algorithm. In Artificial Intelligence and Computational Intelligence, 2009. AICI'09. International Conference on, volume 4, pages 260–263. IEEE, 2009.

Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.

Eshref Januzaj, Hans-Peter Kriegel, and Martin Pfeifle. DBDC: Density based distributed clustering. In International Conference on Extending Database Technology, pages 88–105. Springer, 2004a.

Eshref Januzaj, Hans-Peter Kriegel, and Martin Pfeifle. Scalable density-based distributed clustering. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 231–244. Springer, 2004b.

Heinrich Jiang. Density level set estimation on manifolds with DBSCAN. In International Conference on Machine Learning, pages 1684–1693, 2017a.

Heinrich Jiang. Uniform convergence rates for kernel density estimation. In International Conference on Machine Learning, pages 1694–1703, 2017b.

Samory Kpotufe and Ulrike von Luxburg. Pruning nearest neighbor cluster trees. arXiv preprint arXiv:1105.0540, 2011.

Marzena Kryszkiewicz and Piotr Lasek. TI-DBSCAN: Clustering with DBSCAN by means of the triangle inequality. In International Conference on Rough Sets and Current Trends in Computing, pages 60–69. Springer, 2010.

K. Mahesh Kumar and A. Rama Mohan Reddy. A fast DBSCAN clustering algorithm by accelerating neighbor searching using groups method. Pattern Recognition, 58:39–48, 2016.

Bing Liu. A fast density-based clustering algorithm for large databases. In Machine Learning and Cybernetics, 2006 International Conference on, pages 996–1000. IEEE, 2006.

Jinfei Liu, Joshua Zhexue Huang, Jun Luo, and Li Xiong. Privacy preserving distributed DBSCAN clustering. In Proceedings of the 2012 Joint EDBT/ICDT Workshops, pages 177–185. ACM, 2012.

Alessandro Lulli, Matteo Dell'Amico, Pietro Michiardi, and Laura Ricci. NG-DBSCAN: scalable density-based clustering for arbitrary data. Proceedings of the VLDB Endowment, 10(3):157–168, 2016.

Antonio Cavalcante Araujo Neto, Ticiana Linhares Coelho da Silva, Victor Aguiar Evangelista de Farias, José Antonio F. Macêdo, and Javam de Castro Machado. G2P: A partitioning approach for processing DBSCAN with MapReduce. In International Symposium on Web and Wireless Geographical Information Systems, pages 191–202. Springer, 2015.

Maitry Noticewala and Dinesh Vaghela. MR-IDBSCAN: Efficient parallel incremental DBSCAN algorithm using MapReduce. International Journal of Computer Applications, 93(4), 2014.

Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, and Alok Choudhary. A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 62. IEEE Computer Society Press, 2012.

Philippe Rigollet, Régis Vert, et al. Optimal rates for plug-in estimators of density level sets. Bernoulli, 15(4):1154–1178, 2009.

Alessandro Rinaldo and Larry Wasserman. Generalized density clustering. The Annals of Statistics, pages 2678–2722, 2010.

Aarti Singh, Clayton Scott, Robert Nowak, et al. Adaptive Hausdorff estimation of density level sets. The Annals of Statistics, 37(5B):2760–2782, 2009.

Bharath Sriperumbudur and Ingo Steinwart. Consistency and rates for clustering with DBSCAN. In Artificial Intelligence and Statistics, pages 1090–1098, 2012.

Ingo Steinwart. Adaptive density level set clustering. In Proceedings of the 24th Annual Conference on Learning Theory, pages 703–738, 2011.

Ingo Steinwart, Bharath K. Sriperumbudur, and Philipp Thomann. Adaptive clustering using kernel density estimators. arXiv preprint arXiv:1708.05254, 2017.

Tran Manh Thang and Juntae Kim. The anomaly detection by using DBSCAN clustering with multiple parameters. In Information Science and Applications (ICISA), 2011 International Conference on, pages 1–5. IEEE, 2011.

Alexandre B. Tsybakov et al. On nonparametric estimation of density level sets. The Annals of Statistics, 25(3):948–969, 1997.

S. Vijayalaksmi and M. Punithavalli. A fast approach to clustering datasets using DBSCAN and pruning algorithms. International Journal of Computer Applications, 60(14), 2012.

Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct):2837–2854, 2010.

P. Viswanath and V. Suresh Babu. Rough-DBSCAN: A fast hybrid density based clustering method for large data sets. Pattern Recognition Letters, 30(16):1477–1488, 2009.

P. Viswanath and Rajwala Pinkesh. l-DBSCAN: A fast hybrid density based clustering method. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 1, pages 912–915. IEEE, 2006.

Daren Wang, Xinyang Lu, and Alessandro Rinaldo. Optimal rates for cluster tree estimation using kernel density estimators. arXiv preprint arXiv:1706.03113, 2017.

Yi-Pu Wu, Jin-Jiang Guo, and Xue-Jie Zhang. A linear DBSCAN algorithm based on LSH. In 2007 International Conference on Machine Learning and Cybernetics, volume 5, pages 2608–2614. IEEE, 2007.

Xiaowei Xu, Jochen Jäger, and Hans-Peter Kriegel. A fast parallel clustering algorithm for large spatial databases. In High Performance Data Mining, pages 263–290. Springer, 1999.

Aoying Zhou, Shuigeng Zhou, Jing Cao, Ye Fan, and Yunfa Hu. Approaches for scaling DBSCAN algorithm to large spatial databases. Journal of Computer Science and Technology, 15(6):509–526, 2000a.

Shuigeng Zhou, Aoying Zhou, Wen Jin, Ye Fan, and Weining Qian. FDBSCAN: a fast DBSCAN algorithm. Ruan Jian Xue Bao, 11(6):735–744, 2000b.