Rehashing Kernel Evaluation in High Dimensions


Transcript of Rehashing Kernel Evaluation in High Dimensions

Page 1: Rehashing Kernel Evaluation in High Dimensions

Rehashing Kernel Evaluation in High Dimensions

Paris Siminelakis * 1 Kexin Rong * 1 Peter Bailis 1 Moses Charikar 1 Philip Levis 1

Abstract

Kernel methods are effective but do not scale well to large scale data, especially in high dimensions where the geometric data structures used to accelerate kernel evaluation suffer from the curse of dimensionality. Recent theoretical advances have proposed fast kernel evaluation algorithms leveraging hashing techniques with worst-case asymptotic improvements. However, these advances are largely confined to the theoretical realm due to concerns such as super-linear preprocessing time and diminishing gains in non-worst-case datasets. In this paper, we close the gap between theory and practice by addressing these challenges via provable and practical procedures for adaptive sample size selection, preprocessing time reduction, and refined variance bounds that quantify the data-dependent performance of random sampling and hashing-based kernel evaluation methods. Our experiments show that these new tools offer up to a 10x improvement in evaluation time on a range of synthetic and real-world datasets.

1. Introduction

Kernel methods are a class of non-parametric methods used for a variety of tasks including density estimation, regression, clustering and distribution testing (Muandet et al., 2017). They have a long and influential history in statistical learning (Scholkopf et al., 2002) and scientific computing (Buhmann, 2003). However, the scalability challenges during both training and inference limit their applicability to large scale high-dimensional datasets: a larger training set improves accuracy but incurs a quadratic increase in overall evaluation time.

In this paper, we focus on the problem of fast approximate evaluation of the Kernel Density Estimate.

*Equal contribution. 1Stanford University, Stanford, California, US. Correspondence to: Paris Siminelakis <[email protected]>, Kexin Rong <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Figure 1 (panels: (a) kernel, (b) difficult case, (c) simple case). (a) For radially decreasing kernels (e.g. Gaussian), points close to the query have higher kernel values than those that are far. (b) Random sampling does not perform well when a small number of close points contribute significantly to the query density. (c) Random sampling performs well when most points have similar kernel values (distance from query).

Given a dataset $P = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$, a kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to [0, 1]$, and a vector of non-negative weights $u \in \mathbb{R}^n$, the weighted Kernel Density Estimate (KDE) at $q \in \mathbb{R}^d$ is given by $\mathrm{KDE}_P^u(q) := \sum_{i=1}^{n} u_i k(q, x_i)$.

Our goal is to, after some preprocessing, efficiently estimate the KDE at a query point with (1 ± ε) multiplicative accuracy under a small failure probability δ. This problem is well-studied (Greengard & Rokhlin, 1987; Gray & Moore, 2003; March et al., 2015), but in high dimensions only recently have novel importance sampling algorithms, called Hashing-Based-Estimators (HBE), demonstrated provable improvements over random sampling (RS) under worst-case assumptions (Charikar & Siminelakis, 2017).
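For reference, a minimal NumPy sketch (ours, not the paper's code) of the quantity being approximated: the exact weighted KDE under the Gaussian kernel with the bandwidth sigma folded in as in Section 3, at O(nd) cost per query.

```python
import numpy as np

def gaussian_kernel(q, X, sigma=1.0):
    # k(q, x) = exp(-||q - x||^2 / sigma^2), applied row-wise to X
    sq_dists = np.sum((X - q) ** 2, axis=1)
    return np.exp(-sq_dists / sigma ** 2)

def exact_kde(q, X, u, sigma=1.0):
    # Weighted KDE: sum_i u_i * k(q, x_i); O(n d) per query.
    return float(np.dot(u, gaussian_kernel(q, X, sigma)))

# Toy usage: n points in d dimensions, uniform weights u = 1/n.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
u = np.full(1000, 1.0 / 1000)
q = rng.normal(size=20)
mu = exact_kde(q, X, u)   # the quantity mu = KDE_P^u(q) to be approximated
```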

HBE at its core uses a hash function h (a randomized space partition) to construct for any q an unbiased estimator of $\mu := \mathrm{KDE}_P^u(q)$.¹ (Charikar & Siminelakis, 2017) showed that given a hash function with evaluation time T and collision probability $\mathbb{P}[h(x) = h(y)] = \Theta(\sqrt{k(x, y)})$, one can get a (1 ± ε) approximation to μ ≥ τ in time $\tilde{O}(T/(\epsilon^2\sqrt{\mu}))$ and space $\tilde{O}(n/(\epsilon^2\sqrt{\tau}))$, improving over random sampling that requires time $O(1/(\epsilon^2\mu))$ in the worst case.

HBE improves upon RS by being able to better sample points with larger weights (kernel values). In particular, RS does not perform well (Figure 1(b)) when a significant portion of the density comes from a few points with large weights (close to the query). HBE samples uniformly from a set of biased samples (hash buckets where the query is mapped to) that has a higher probability of containing high-weight points.

¹The value μ ∈ (0, 1] can be seen as a margin for the decision surface $\sum_{i=1}^{n} u_i k(q, x_i) \ge 0$.

Page 2: Rehashing Kernel Evaluation in High Dimensions


For radially decreasing kernels (Figure 1(a)), the biased sample can be produced via Locality Sensitive Hashing (LSH) (Indyk & Motwani, 1998). To obtain m such biased samples for estimating the query density, the scheme requires building m independent hash tables on the entire dataset. The runtime mentioned above is achieved by setting $m = O(1/(\epsilon^2\sqrt{\mu}))$. In practice, one cares about values of $\mu \ge 1/\sqrt{n}$, which is a lower bound on the order of statistical fluctuations when P consists of n i.i.d. samples from some smooth distribution (Sriperumbudur et al., 2016).

Limitations. Despite progress on the worst-case query time, the idea of using hashing for kernel evaluation, introduced independently in (Spring & Shrivastava, 2017) and (Charikar & Siminelakis, 2017), has largely been confined to the theoretical realm due to a few practical concerns. First, straightforward implementations of the proposed approach require a large amount of preprocessing time and memory (e.g. $O(n^{5/4})$ for $\tau = 1/\sqrt{n}$) to create the requisite number of hash tables. Second, RS can outperform HBE on certain datasets (Figure 1(c)), a fact that is not captured by the worst-case theoretical bounds. Third, to implement this approach, (Charikar & Siminelakis, 2017) use an adaptive procedure to estimate the sufficient sample size m. For this procedure to work, the estimators need to satisfy a uniform polynomial decay of the variance as a function of μ which, for the popular Gaussian kernel, only the hashing scheme of (Andoni & Indyk, 2006) (with a significant $\exp(O(\log^{2/3} n))$ runtime and memory overhead per hash table) was shown to satisfy. These issues call the applicability of HBE into question and, to the best of our knowledge, the method has not previously been shown to empirically improve upon competing methods.

1.1. Contributions

In this paper, we close the gap between theory and practice by making the following contributions:

1. Overhead reduction via sketching. HBE requires super-linear preprocessing time and memory, a result of having to hash the entire dataset once for each sample. To reduce this overhead, we develop new theoretical results on sketching the KDE by a weighted subset of points using hashing and non-uniform sampling (Section 6). Our Hashing-Based-Sketch (HBS) is able to better sample sparse regions of space, while still having variance close to that of a random sketch, leading to better performance on low-density points. Applying HBE on top of the sketch results in considerable gains while maintaining accuracy.

2. Data-dependent diagnostics. Despite worst-case theoretical guarantees, RS outperforms HBE on certain datasets in practice. To quantify this phenomenon, we introduce an inequality that bounds the variance of unbiased estimators that include HBE and RS (Section 4). We propose a diagnostic procedure utilizing the variance bounds that, for a given dataset, chooses at runtime with minimal overhead the better of the two approaches, and does so without invoking HBE; our evaluation shows that the procedure is both accurate and efficient. The new variance bound also allows us to simplify and recover theoretical results of (Charikar & Siminelakis, 2017).

3. A practical algorithm for the Gaussian kernel. We design a simplified version of the adaptive procedure of (Charikar & Siminelakis, 2017) that estimates the sufficient sample size m in parallel with estimating the density μ, and provide an improved analysis that results in an order of magnitude speedup in the query time (Section 5). More importantly, our new analysis shows that slightly weaker conditions (non-uniform but bounded polynomial decay) on the variance of the estimators are sufficient for the procedure to work. By proving that HBE based on the hashing scheme of (Datar et al., 2004) satisfies the condition, we give the first practical algorithm that provably improves upon RS for the Gaussian kernel in high dimensions. Previously, it was not known how to use this hashing scheme within an adaptive procedure for this purpose.

We perform extensive experimental evaluations of our methods on a variety of synthetic benchmarks as well as real-world datasets. Together, our evaluation against other state-of-the-art competing methods on kernel evaluation shows:

• HBE outperforms all competing methods for synthetic "worst-case" instances with multi-scale structure and dimensions d ≥ 10, as well as for "structured" instances with a moderate number of clusters (Section 7.1).

• For real-world datasets, HBE is always competitive with alternative methods and is faster on many datasets, by up to 10x over the next best method (Section 7.2).

In summary, our theoretical and experimental results constitute an important step toward making Hashing-Based-Estimators practical and, in turn, improving the scalability of kernel evaluation on large high-dimensional datasets.

2. Related Work

Fast Multipole Methods and Dual Tree Algorithms. Historically, the problem of KDE evaluation was first studied in low dimensions (d ≤ 3) in the scientific computing literature, resulting in the Fast Multipole Method (FMM) (Greengard & Rokhlin, 1987). This influential line of work is based on a hierarchical spatial decomposition and runs in $O(2^d)$ time per evaluation point. Motivated by applications in statistical learning (Gray & Moore, 2001), the problem was revisited in the 2000s and analogs of the Fast Multipole Method (Gray & Moore, 2003), known as Dual Tree Algorithms (Lee et al., 2006), were proposed that aimed to capture latent underlying low-dimensional structure (Ram et al., 2009). Unfortunately, when such low-dimensional structure is absent, these methods in general have near-linear $O(n^{1-o(1)})$ runtime. Thus, under general assumptions, the only method that was known (Lee & Gray, 2009) to provably accelerate kernel evaluation in high dimensions was simple uniform random sampling (RS).

Hashing-Based-Estimators. The framework of HBE has recently been extended to apply to more general kernels (Charikar & Siminelakis, 2018), and hashing-based ideas have been used to design faster algorithms for "smooth" kernels (Backurs et al., 2018). Besides kernel evaluation, researchers have also applied hashing techniques to get practical improvements in outlier detection (Luo & Shrivastava, 2018a), gradient estimation (Chen et al., 2018), non-parametric clustering (Luo & Shrivastava, 2018b) and range-counting (Wu et al., 2018).

Approximate Skeletonization. ASKIT (March et al., 2015) represents a different line of work aimed at addressing the deficiencies of FMM and Dual-Tree methods. ASKIT also produces a hierarchical partition of points, but then uses linear algebraic techniques to approximate the contribution of points to each part via a combination of uniform samples and nearest neighbors. This makes the method robust to the dimension and mostly dependent on the number of points.

Core-sets. The problem of sketching the KDE for all points has been studied under the name of Core-sets or ε-samples (Phillips, 2013). The methods are very effective in low dimensions d ≤ 3 (Zheng et al., 2013), but become impractical to implement in higher dimensions for large datasets due to their Ω(n²) computational requirements (e.g. (Chen et al., 2010)). The recent paper (Phillips & Tai, 2018) presents an up-to-date summary.

3. Preliminaries

In this section, we present basic definitions and give a self-contained presentation of Hashing-Based-Estimators that summarizes the parts of (Charikar & Siminelakis, 2017) upon which we build. All material beyond Section 3 is new.

Notation. For a set $S \subset [n]$ and numbers $u_1, \dots, u_n$, let $u_S := \sum_{i \in S} u_i$. Let $\Delta_n := \{u \in \mathbb{R}^n_+ : \|u\|_1 = 1\}$ denote the n-dimensional simplex. Throughout the paper we assume that we want to approximate $\mathrm{KDE}_P^u$ with $u \in \Delta_n$. We can handle the general case $u \in \mathbb{R}^n$ by treating the positive and negative parts of u separately.

Kernels. All our theoretical results, unless explicitly stated, apply to general non-negative kernels. For concreteness and in our experiments, we focus on the Gaussian kernel $\exp(-\|x-y\|^2/\sigma^2)$ and the Laplace kernel $\exp(-\|x-y\|/\sigma)$ (Belkin et al., 2018), and typically suppress the dependence on the bandwidth σ > 0 (equivalently, we can rescale our points).

3.1. Multiplicative Approximation & Relative Variance

Definition 1. A random variable Z is called an (ε, δ)-approximation to μ if $\mathbb{P}[|Z - \mu| \ge \epsilon\mu] \le \delta$.

Given access to an unbiased estimator $\mathbb{E}[Z] = \mu$, our goal is to output an (ε, δ)-approximation to μ. Towards that end, the main quantity to control is the relative variance.

Definition 2. For a non-negative random variable Z we define the relative variance as $\mathrm{RelVar}[Z] := \frac{\mathrm{Var}[Z]}{\mathbb{E}[Z]^2} \le \frac{\mathbb{E}[Z^2]}{\mathbb{E}[Z]^2}$.

The relative variance captures the fluctuations of the random variable at the scale of the expectation. This is made precise in the following lemma, which combines Chebyshev's and the Paley-Zygmund inequality.

Lemma 1. For a non-negative random variable Z and parameters t > 0, θ ∈ [0, 1], we have:

$$\mathbb{P}[Z \ge (t+1)\cdot\mathbb{E}[Z]] \le \frac{1}{t^2}\cdot\mathrm{RelVar}[Z], \quad (1)$$

$$\mathbb{P}[Z > (1-\theta)\mathbb{E}[Z]] \ge \frac{1}{1 + \frac{1}{\theta^2}\cdot\mathrm{RelVar}[Z]}. \quad (2)$$

As RelVar[Z] decreases, the upper bound in (1) decreases while the lower bound in (2) increases, increasing our overall confidence that Z is an accurate estimate of the mean. Thus, if one can construct a random variable Z with $\mathbb{E}[Z] = \mu$ and small relative variance $\mathrm{RelVar}[Z] = O(\epsilon^2)$, Lemma 1 shows that Z is an (ε, O(1))-approximation to μ. In fact, by Chernoff bounds, one can use the median-trick to boost the probability of success (Supplementary material).
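A minimal sketch of this standard recipe (our illustration, not code from the paper): average enough i.i.d. copies to drive the relative variance below ε²/6, then take the median of L independent averages to boost the success probability.

```python
import numpy as np

def median_of_means(sample_fn, m, L, rng):
    """Median of L independent means, each over m i.i.d. samples.

    sample_fn(size, rng) must return `size` i.i.d. draws of an unbiased
    estimator Z of mu. Choosing m so that RelVar of each mean is at most
    eps^2/6 makes every mean an (eps, O(1))-approximation, and taking the
    median over L copies drives the failure probability down exponentially
    in L (the median-trick mentioned above).
    """
    means = [np.mean(sample_fn(m, rng)) for _ in range(L)]
    return float(np.median(means))
```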

3.2. Hashing-Based-Estimators

HBE uses hashing to create a data structure that, after some preprocessing, can produce unbiased estimators for the KDE at query time (Figure 2) with low variance. Let $\mathcal{H}$ be a set of functions and ν a distribution over $\mathcal{H}$. We denote by $h \sim \mathcal{H}_\nu$ a random function $h \in \mathcal{H}$ sampled from ν and refer to $\mathcal{H}_\nu$ as a hashing scheme.

Definition 3. Given a hashing scheme $\mathcal{H}_\nu$, we define the collision probability between two elements $x, y \in \mathbb{R}^d$ as $p(x, y) := \mathbb{P}_{h \sim \mathcal{H}_\nu}[h(x) = h(y)]$.

Pre-processing: given dataset P and hashing scheme $\mathcal{H}_\nu$:

• Sample m hash functions $h_t \overset{\mathrm{i.i.d.}}{\sim} \mathcal{H}_\nu$ for $t \in [m]$.

• Create hash tables $H_t := h_t(P)$ for $t \in [m]$ by evaluating the hash functions on P.

Page 4: Rehashing Kernel Evaluation in High Dimensions

Rehashing Kernel Evaluation in High Dimensions

Figure 2. Given a dataset, the HBE approach samples a number of hash functions and populates a separate hash table for each hash function. At query time, for each hash table, we sample a point at random from the hash bucket that the query maps to.

The m hash tables allow us to produce at most m independent samples to estimate the KDE for each query.

Query process: for a query $q \in \mathbb{R}^d$ and hash table index $t \in [m]$ (a Python sketch of the full pre-processing and query loop is given after this list):

• Let $H_t(q) := \{i \in [n] : h_t(x_i) = h_t(q)\}$ denote the hash bucket (potentially empty) that q maps to.

• If $H_t(q)$ is not empty, return $Z_{h_t} := \frac{k(X_t, q)}{p(X_t, q)}\, u_{H_t(q)}$, where $X_t$ is sampled from $H_t(q)$ with probability proportional to $u_i$; otherwise return 0.

By averaging many samples produced by the hash tables we get accurate estimates. The salient properties of the estimators $Z_{h_t}$ are captured in the following lemma.

Lemma 2. Assuming that $p(x_i, q) > 0$ for all $i \in [n]$, then

$$\mathbb{E}[Z_h] = \sum_{i=1}^{n} u_i k(q, x_i), \quad (3)$$

$$\mathbb{E}[Z_h^2] = \sum_{i,j=1}^{n} k^2(q, x_i)\, u_i \frac{\mathbb{P}[i, j \in H(q)]}{p^2(q, x_i)}\, u_j. \quad (4)$$

As we see in (4), the second moment depends on the hashing scheme through the ratio $\mathbb{P}[i, j \in H(q)]/p^2(q, x_i)$. Thus, we hope to reduce the variance by selecting the hashing scheme appropriately. The difficulty lies in that the variance depends on the positions of points in the whole dataset. (Charikar & Siminelakis, 2017) introduced a property of HBE that allowed them to obtain provable bounds on the variance.

Definition 4. Given a kernel k, a hashing scheme is called (β, M)-scale-free for β ∈ [0, 1] and M ≥ 1 if: $\frac{1}{M}\cdot k(x, y)^\beta \le p(x, y) \le M \cdot k(x, y)^\beta$.

Thus, one needs to design a hashing scheme that "adapts" to the kernel function. Next, we present a specific family of hashing schemes that can be used for kernel evaluation under the Exponential and Gaussian kernels.

3.3. HBE via Euclidean LSH

In the context of solving Approximate Nearest Neighbor search in Euclidean space, Datar et al. (Datar et al., 2004) introduced the following hashing scheme, Euclidean LSH (eLSH), that uses two parameters w > 0 (width) and κ ≥ 1 (power). First, hash functions of the form $h_i(x) := \lceil \frac{g_i^\top x}{w} + b_i \rceil$ map a d-dimensional vector x into a set of integers ("buckets") by applying a random projection ($g_i \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$), a random shift ($b_i \overset{\mathrm{i.i.d.}}{\sim} U[0, 1]$) and quantizing the result (width w > 0). A concatenation (power) of κ such functions gives the final hash function.

They showed that the collision probability for two points $x, y \in \mathbb{R}^d$ at distance $\|x - y\| = c \cdot w$ equals $p_1^{\kappa}(c)$, where

$$p_1(c) := 1 - 2\Phi(-c^{-1}) - \sqrt{\tfrac{2}{\pi}}\, c\, \Big(1 - \exp\Big(-\tfrac{c^{-2}}{2}\Big)\Big). \quad (5)$$

In order to evaluate the hashing scheme on a dataset P, one needs space and pre-processing time $O(d\kappa \cdot n)$. By picking w large enough, i.e. for small c, one can show the following bounds on the collision probability.

Lemma 3 ((Charikar & Siminelakis, 2017)). For $\delta \le \frac{1}{2}$ and $c \le \min\{\delta, 1/\sqrt{\log(1/\delta)}\}$:

$$e^{-\sqrt{2/\pi}\,\delta} \le \frac{p_1(c)}{e^{-\sqrt{2/\pi}\,c}} \le e^{\sqrt{2/\pi}\,\delta^3}.$$

Taking appropriate powers κ, one can then use Euclidean LSH to create collision probabilities appropriate for the Exponential and Gaussian kernels.

Theorem 1 ((Charikar & Siminelakis, 2017)). Let $\mathcal{H}_1(w, \kappa)$ denote the eLSH hashing scheme with width w and power κ ≥ 1. Let $R = \max_{x, y \in P \cup \{q\}} \|x - y\|$. Then for the

• Laplace kernel $k(x, y) = e^{-\|x-y\|}$: setting $\kappa_e = \lceil\sqrt{2/\pi}\,\log(\tfrac{1}{\tau})\,R\rceil$ and $w_e = \sqrt{2/\pi}\,\tfrac{1}{\beta}\kappa_e$ results in a $(\beta, \sqrt{e})$-scale-free hashing scheme.

• Gaussian kernel $k(x, y) = e^{-\|x-y\|^2}$: setting $r_t := \tfrac{1}{2}\sqrt{t\log(1+\gamma)}$, $\kappa_t := 3\lceil r_t R\rceil^2$, $w_t = \sqrt{2/\pi}\,\tfrac{\kappa_t}{r_t}$ results in an estimator with relative variance

$$\mathrm{RelVar}[Z_t] \le V_t(\mu) := \frac{4e^{3/2}}{\mu}\, e^{\,r_t^2 - r_t\sqrt{\log(1/\mu)}}. \quad (6)$$

4. Refined Variance Bounds and Diagnostics

To understand how many samples are needed to estimate the density through RS or HBE, we need bounds for expressions of the type (4). In particular, for a sequence of numbers $w_1, \dots, w_n \in [0, 1]$, e.g. $w_i = k(q, x_i)$, and $u \in \Delta_n$ such that $\mu = \sum_{i \in [n]} u_i w_i$, we want to bound:

$$\sup_{u \in \Delta_n,\ u^\top w = \mu}\ \sum_{i, j \in [n]} w_i^2 (u_i V_{ij} u_j) \quad (7)$$

Page 5: Rehashing Kernel Evaluation in High Dimensions


where $V \in \mathbb{R}^{n\times n}_+$ is a non-negative matrix. Our bound will be decomposed into the contributions $\mu_\ell$ from subsets of indices $S_\ell \subseteq [n]$ where the weights (kernel values) have bounded range. Specifically, let $\mu \le \lambda \le L \le 1$ and define: $S_1 = \{i \in [n] : L \le w_i \le 1\}$, $S_2 = \{i \in [n]\setminus S_1 : \lambda \le w_i \le L\}$, $S_3 = \{i \in [n]\setminus(S_2 \cup S_1) : \mu \le w_i \le \lambda\}$, $S_4 = \{i \in [n] : w_i < \mu\}$, as well as $\mu_\ell = \sum_{i \in S_\ell} u_i w_i \le \mu$ for $\ell \in [4]$. The intuition behind the definition of the sets is that for radially decreasing kernels, they correspond to spherical annuli around the query.

Lemma 4. For non-negative weights $w_1, \dots, w_n$, vector $u \in \Delta_n$ and sets $S_1, \dots, S_4 \subseteq [n]$ as above it holds that

$$\sum_{i,j\in[n]} w_i^2 u_i V_{ij} u_j \le \sum_{\ell\in[3],\,\ell'\in[3]}\Big(\sup_{i\in S_\ell, j\in S_{\ell'}} V_{ij}\frac{w_i}{w_j}\Big)\mu_\ell\mu_{\ell'} + u_{S_4}\sum_{\ell\in[3]}\Big(\sup_{i\in S_\ell, j\in S_4} V_{ij}\frac{w_i}{\mu}\Big)\mu_\ell\,\mu + \Big(\sup_{i\in S_4, j\in[n]} V_{ij} w_i\Big)\cdot\mu_4 \quad (8)$$

where $u_S := \sum_{j\in S} u_j \le 1$.

This simple lemma allows us to get strong bounds for scale-free hashing schemes, simplifying results of (Charikar & Siminelakis, 2017) and extending them to β < 1/2.

Theorem 2. Let $\mathcal{H}_\nu$ be a (β, M)-scale-free hashing scheme and $Z_h$ the corresponding HBE. Then $\mathrm{RelVar}[Z_h] \le V_{\beta,M}(\mu) := 3M^3/\mu^{\max\{\beta, 1-\beta\}}$.

Proof. Letting $w_i = k(q, x_i)$, we start from (4) and use the fact that $\mathbb{P}[i, j \in H(q)] \le \min\{p(x_i, q), p(x_j, q)\}$ and that the hashing scheme is (β, M)-scale-free, to arrive at $\mathbb{E}[Z_h^2] \le \sum_{i,j\in[n]} w_i^2 u_i V_{ij} u_j$ with $V_{ij} = M^3\frac{\min\{w_i^\beta, w_j^\beta\}}{w_i^{2\beta}}$. Let $S_1, \dots, S_4$ be the sets defined for $\lambda = L = \mu$. Then $S_2 = S_3 = \emptyset$. By Lemma 4 we get

$$\mathbb{E}[Z_h^2] \le M^3\Big(\sup_{i,j\in S_1}\frac{w_i^{1-2\beta}}{w_j^{1-\beta}}\,\mu_1^2 + \sup_{i\in S_4} w_i^{1-\beta}\,\mu_4 + u_{S_4}\sup_{i\in S_1, j\in S_4} w_i^{1-2\beta} w_j^{\beta}\,\mu_1\Big)$$
$$\le M^3\Big(\frac{1}{\mu^{1-\beta}}\mu_1^2 + u_{S_4}\frac{\sup_{i\in S_1} w_i^{1-2\beta}}{\mu^{1-\beta}}\,\mu\mu_1 + \mu^{1-\beta}\mu_4\Big)$$
$$\le M^3\Big(\frac{1}{\mu^{1-\beta}} + \frac{1}{\mu^{\max\{\beta, 1-\beta\}}} + \frac{1}{\mu^{\beta}}\Big)\mu^2.$$

Noting that $\mathbb{E}[Z_h] = \mu$ and $\mu \le 1$ concludes the proof.

Remark 1. It is easy to see that random sampling is a (trivial) (0, 1)-scale-free hashing scheme, hence its relative variance is bounded by $\frac{3}{\mu}$.

Remark 2. For any $(\frac{1}{2}, M)$-scale-free hashing scheme the relative variance is bounded by $\frac{3M^3}{\sqrt{\mu}}$.
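As a concrete illustration (ours, not from the paper), the bounds in Remarks 1 and 2 translate via Chebyshev's inequality (Lemma 1) into required sample sizes; the helper names below are hypothetical.

```python
import math

def samples_needed(rel_var_bound, eps, delta=1.0 / 3.0):
    # Averaging m i.i.d. copies divides the relative variance by m, so
    # Chebyshev gives failure probability <= rel_var_bound / (m * eps^2).
    return math.ceil(rel_var_bound / (delta * eps ** 2))

def rs_samples(mu, eps):
    # Random sampling: RelVar <= 3 / mu  (Remark 1).
    return samples_needed(3.0 / mu, eps)

def hbe_samples(mu, eps, M=1.0):
    # (1/2, M)-scale-free HBE: RelVar <= 3 M^3 / sqrt(mu)  (Remark 2).
    return samples_needed(3.0 * M ** 3 / math.sqrt(mu), eps)

# Example: mu = 1e-4, eps = 0.1 gives about 9e6 samples for RS
# versus about 9e4 for a (1/2, 1)-scale-free HBE.
```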

4.1. Diagnostics for unbiased estimators

The inequality provides a means to quantify the dataset-dependent performance of RS and HBE: by setting the coefficients $V_{ij}$ appropriately, Lemma 4 bounds the variance of RS ($V_{ij} = 1$) and HBE ($V_{ij} = \frac{\min\{p(q,x_i),\,p(q,x_j)\}}{p(q,x_i)^2}$). However, evaluating the bound over the whole dataset is no cheaper than evaluating the methods directly.

Instead, we evaluate the upper bound on a "representative sample" $S_0$ in place of [n], using the sets $S_\ell = S_0 \cap S_\ell$ for $\ell \in [4]$. Given a set $S_0$, define the estimator $\mu_\ell = \sum_{j \in S_\ell} u_j w_j$. For ε ∈ (0, 1), let:

$$\lambda_\epsilon := \arg\max_{\mu_0 \le \lambda \le 1}\Big\{\mu_3 \le \tfrac{1}{2}(\epsilon\mu_0 - \mu_4)\Big\}, \quad (9)$$

$$L_\epsilon := \arg\min_{\mu_0 \le L \le 1}\Big\{\mu_1 \le \tfrac{1}{2}(\epsilon\mu_0 - \mu_4)\Big\}. \quad (10)$$

Since Lemma 4 holds for all $\mu \le \lambda \le L \le 1$, the values $L_\epsilon, \lambda_\epsilon$ complete the definition of the four sets on $S_0$, which we use to evaluate the upper bound. Finally, we produce a representative sample $S_0$ by running our adaptive procedure (Section 5) with Random Sampling. The procedure returns a (random) set $S_0$ such that $\mu_0$ is an (ε, O(1))-approximation to μ for any given query.

Algorithm 1 Data-dependent Diagnostic

1: Input: set P, threshold τ ∈ (0, 1), accuracy ε, T ≥ 1, collision probability p(x, y) of the hashing scheme $\mathcal{H}_\nu$.
2: for t = 1, . . . , T do
3:   q ← Random(P)                                  ▷ For each random query
4:   (S_0, μ_0) ← AMR(Z_RS(q), ε, τ)                ▷ Algorithm 2
5:   Set λ_ε, L_ε using (9) and (10)
6:   V_RS ← r.h.s. of (8) for S_0 and V_ij = 1
7:   V_H ← r.h.s. of (8) for S_0 and V_ij = min{p(q, x_i), p(q, x_j)}/p(q, x_i)^2
8:   rV_RS(t) ← V_RS / max{μ_0, τ}^2
9:   rV_HBE(t) ← V_H / max{μ_0, τ}^2
10: Output: (mean_T(rV_RS), mean_T(rV_HBE))
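As a simplified illustration of the idea (ours, not Algorithm 1 itself), one can evaluate the un-grouped quantity that Lemma 4 upper-bounds directly on a small representative sample, with $V_{ij} = 1$ for RS and $V_{ij} = \min\{p_i, p_j\}/p_i^2$ for HBE; the function below is a hypothetical sketch under that assumption.

```python
import numpy as np

def diagnostic_proxy(q, S0, u0, kernel, collision_prob):
    """Empirical relative-variance proxies for RS and HBE on a sample S0.

    Evaluates sum_{i,j} w_i^2 u_i V_ij u_j / mu^2 directly on the sample
    (u0 renormalized to sum to 1), i.e. the quantity that Lemma 4 bounds.
    The paper's Algorithm 1 instead uses the grouped bound (8); this is
    only a simplified proxy.  Smaller values suggest fewer samples needed.
    """
    u = u0 / u0.sum()
    w = np.array([kernel(q, x) for x in S0])
    p = np.array([collision_prob(q, x) for x in S0])
    mu = float(np.dot(u, w))
    # RS: V_ij = 1, so the double sum collapses to sum_i w_i^2 u_i.
    rs = float(np.sum(w ** 2 * u)) / mu ** 2
    # HBE: V_ij = min(p_i, p_j) / p_i^2.
    V = np.minimum.outer(p, p) / (p[:, None] ** 2)
    hbe = float(np.sum((w ** 2 * u)[:, None] * V * u[None, :])) / mu ** 2
    return rs, hbe
```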

5. Adaptive estimation of the mean

Towards our goal of producing an (ε, O(1))-approximation to μ for a given query q, the arguments in Section 3.1 reduce this task to constructing an unbiased estimator Z of low relative variance $\mathrm{RelVar}[Z] \le \frac{\epsilon^2}{6}$. Given a basic estimator $Z_0$ and a bound V(μ) on its relative variance, one can always create such an estimator by taking the mean of $O(V(\mu)/\epsilon^2)$ independent samples. The question that remains in order to implement this approach is how to deal with the fact that μ, and thus V(μ), is unknown?

We handle this by providing an improvement to the adaptive estimation procedure of (Charikar & Siminelakis, 2017).

Page 6: Rehashing Kernel Evaluation in High Dimensions


The procedure makes a guess of the density and uses the guess to set the number of samples. Subsequently, the procedure performs a consistency check to evaluate whether the guess was correct up to a small constant and, if not, revises the guess and increases the sample size. The procedure applies generically to the following class of estimators (whose variance varies polynomially with μ) that we introduce.

5.1. (α, β, γ)-regular estimators

Definition 5. For α, β ∈ (0, 2] and γ ≥ 1, an estimator $\mathcal{Z}$ is (α, β, γ)-regular if for some constant C ≥ 1 and all $t \in [T]$, there exist a random variable $Z_t$ and a function $V_t : (0, 1] \to \mathbb{R}_{++}$ such that

(A) $\mathbb{E}[Z_t] = \mu$ and $\mathrm{RelVar}[Z_t] \le V_t(\mu)$ for all $\mu \in (0, 1]$,

(B) $1 \le \frac{V_t(y)}{V_t(x)} \le \big(\frac{x}{y}\big)^{2-\alpha}$ for all $x \ge y > 0$,

(C) $V_t(\mu_{t+1}) \le C\mu_t^{-\beta}$ with $\mu_t := (1+\gamma)^{-t}$.

We write $Z_t \sim \mathcal{Z}(t, \gamma)$ to denote a sample from such an estimator at level $t \in [T]$.

At a high level, regular estimators generalize the properties of scale-free estimators relevant to adaptively estimating the mean. Property (B) affects the success probability whereas (C) affects the running time and space requirements. This level of generality will be necessary to analyze the hashing scheme designed for the Gaussian kernel.

Regular Estimators via HBE. For any $t \in [T]$, given a collection of hash tables $\{H_t^{(i)}\}_{i\in[m_t]}$ created by evaluating $m_t$ i.i.d. hash functions with collision probability $p_t(x, y)$ on a set P and a fixed kernel k(x, y), we will denote by

$$\mathcal{Z}(t, \gamma) \leftarrow \mathrm{HBE}_{k, p_t}\big(\{H_t^{(i)}\}_{i\in[m_t]}\big) \quad (11)$$

a data structure at level t that, for any query q, is able to produce up to $m_t$ i.i.d. unbiased random variables $Z_t^{(i)}(q)$ for the density μ according to Section 3.2. The union of such data structures for $t \in [T]$ will be denoted by $\mathcal{Z}$.

Before proceeding with the description of the estimation procedure, we show that (β, M)-scale-free HBE (that include random sampling) satisfy the above definition.

Theorem 3. Given a (β, M)-scale-free hashing scheme with $\beta \ge \frac{1}{2}$, the corresponding estimator is $(2-\beta, \beta, \gamma)$-regular with constant $C = 3M^3(1+\gamma)^\beta$.

The proof follows easily from Theorem 2 by using the same estimator for all $t \in [T]$ and $V_t(\mu) = V(\mu) = \frac{3M^3}{\mu^\beta}$.

5.2. Adaptive Mean Relaxation

For regular estimators we propose the following procedure (Algorithm 2). In the supplementary material we analyze

Algorithm 2 Adaptive Mean Relaxation (AMR)

1: Input: (α, β, γ)-regular estimator $\mathcal{Z}$, accuracy ε ∈ (0, 1), threshold τ ∈ (0, 1).
2: T ← ⌈log_{1+γ}(1/(ετ))⌉ + 1
3: for t = 1, . . . , T do                      ▷ For each level
4:   μ_t ← (1 + γ)^{-t}                        ▷ Current guess of the mean
5:   m_t ← ⌈(6/ε^2) V_t(μ_{t+1})⌉               ▷ Sufficient samples at level t
6:   Z_t^{(i)} ∼ Z(t, γ), i.i.d. samples for i ∈ [m_t]
7:   Z̄_t ← mean{Z_t^{(1)}, . . . , Z_t^{(m_t)}}
8:   if Z̄_t ≥ μ_t then                         ▷ Consistency check
9:     return Z̄_t
10: return 0                                    ▷ In this case μ ≤ ετ

the probability that this procedure successfully estimates the mean, as well as the number of samples it uses, and obtain the following result.

Theorem 4. Given an (α, β, γ)-regular estimator $\mathcal{Z}$, the AMR procedure outputs a number Z such that

$$\mathbb{P}[|Z - \mu| \le \epsilon\cdot\max\{\mu, \tau\}] \ge \frac{2}{3} - O_{\gamma,\alpha}(\epsilon^2)$$

and with the same probability uses $O_\gamma\big(\frac{1}{\epsilon^2}\frac{1}{\mu^\beta}\big)$ samples.

Using the bounds on the relative variance in Section 4, we get that any (β ≥ 1/2, M)-scale-free estimator can be used to estimate the density μ using $O(1/(\epsilon^2\mu^\beta))$ samples.
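A compact Python rendering of the AMR loop of Algorithm 2, under the assumption that a level-t sampler and a variance bound V_t are available as in Definition 5 (a sketch, not the released implementation):

```python
import math
import numpy as np

def amr(sample_level, V, eps, tau, gamma=1.0):
    """Adaptive Mean Relaxation: halt at the first level whose empirical
    mean passes the consistency check Z_t >= mu_t = (1 + gamma)^(-t).

    sample_level(t, m) must return m i.i.d. samples of the level-t
    estimator Z(t, gamma); V(t, mu) bounds its relative variance.
    """
    T = math.ceil(math.log(1.0 / (eps * tau), 1.0 + gamma)) + 1
    for t in range(1, T + 1):
        mu_t = (1.0 + gamma) ** (-t)
        m_t = math.ceil(6.0 / eps ** 2 * V(t, (1.0 + gamma) ** (-(t + 1))))
        Z_t = float(np.mean(sample_level(t, m_t)))
        if Z_t >= mu_t:          # consistency check
            return Z_t
    return 0.0                   # density below eps * tau
```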

5.3. Regular estimator for Gaussian Kernel

We next show a construction of an HBE for the Gaussian kernel (Algorithm 3) that results in a regular estimator.

Algorithm 3 Gaussian HBE (Gauss-HBE)

1: Input: γ ≥ 1, τ ∈ (0, 1), ε ∈ (0, 1), dataset P.
2: T ← ⌈log_{1+γ}(1/(ετ))⌉ + 1, R ← √(log(1/(ετ)))
3: for t = 1, . . . , T do
4:   m_t ← ⌈(6/ε^2) V_t((1 + γ)^{-(t+1)})⌉       ▷ see (6)
5:   for i = 1, . . . , m_t do
6:     H_t^{(i)} ← eLSH(w_t, κ_t, P)             ▷ see Theorem 1
7:   p_t(x, y) := p_1^{κ_t}(‖x − y‖/w_t)         ▷ see (5)
8:   k(x, y) := e^{−‖x−y‖^2}
9:   Z_Gauss(t, γ) ← HBE_{k, p_t}({H_t^{(i)}}_{i∈[m_t]})
10: Output: Z_Gauss

Theorem 5. $\mathcal{Z}_{\mathrm{Gauss}}$ is $(1, \frac{3}{4}, \gamma)$-regular and takes preprocessing time/space bounded by $O_{d,\kappa_T,\gamma}(\epsilon^{-3+\frac{1}{4}}\tau^{-\frac{3}{4}}\cdot n)$.

The proof (given in the supplementary material) is based on using (6) to show that Definition 5 is satisfied with an appropriate selection of constants. We also note that (6) can be derived using Lemma 4.

Page 7: Rehashing Kernel Evaluation in High Dimensions


6. Sketching the Kernel Density Estimate

In this section, we introduce an approach based on hashing to create a "sketch" of the KDE that we can evaluate using HBE or other methods.

Definition 6. Let (u, P) be a set of weights and points. We call (w, S) an (ε, δ, τ)-sketch iff for any $q \in \mathbb{R}^d$:

$$\mathbb{E}\big[|\mathrm{KDE}_S^w(q) - \mathrm{KDE}_P^u(q)|^2\big] \le \epsilon^2\tau\delta\cdot\mathrm{KDE}_P^u(q). \quad (12)$$

Let $\mu = \mathrm{KDE}_P^u(q)$; using Chebyshev's inequality it is immediate that any (ε, δ, τ)-sketch satisfies, for any $q \in \mathbb{R}^d$: $\mathbb{P}[|\mathrm{KDE}_S^w(q) - \mu| \ge \epsilon\max\{\tau, \mu\}] \le \delta$.

Remark 3. It is easy to see that one can construct such a sketch by random sampling $m \ge \frac{1}{\epsilon^2\delta}\frac{1}{\tau}$ points.
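For reference, a minimal sketch (ours) of the trivial random-sampling construction of Remark 3, assuming u ∈ Δ_n and NumPy:

```python
import numpy as np

def random_sketch(X, u, eps, delta, tau, rng):
    """Uniform-importance sketch: m >= 1 / (eps^2 * delta * tau) points
    drawn i.i.d. with probability u_i, each given weight 1/m, so that the
    KDE of the sketch is an unbiased estimate of KDE_P^u with variance
    at most mu / m <= eps^2 * delta * tau * mu, as required by (12)."""
    m = int(np.ceil(1.0 / (eps ** 2 * delta * tau)))
    idx = rng.choice(len(X), size=m, p=u)
    return X[idx], np.full(m, 1.0 / m)
```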

Hashing-Based-Sketch (HBS). The scheme we propose samples a random point by first sampling a hash bucket $H_i$ with probability $\propto u_{H_i}^{\gamma}$ and then sampling a point j from the bucket with probability $\propto u_j$. The weights are chosen such that the sample is an unbiased estimate of the density. This scheme interpolates between uniform over buckets and uniform over the whole dataset as we vary γ ∈ [0, 1]. We pick γ* so that the variance is controlled, with the additional property that any non-trivial bucket $u_{H_i} \ge \tau$ will have a representative in the sketch with some probability. This last property is useful in estimating low-density points (e.g. for outlier detection (Gan & Bailis, 2017)). For any hash table H and a vector $u \in \Delta_n$ (simplex), let $B = B(H)$ denote the number of buckets and $u_{\max} = u_{\max}(H) := \max\{u_{H_i} : i \in [B]\}$ the maximum weight of any hash bucket of H. The precise definition of our Hashing-Based-Sketch is given in Algorithm 4.

Theorem 6. Let H be the hash function sampled by the HBS procedure. For $\epsilon > 0$ and $\delta \in \big[e^{-\frac{6}{\epsilon^2}\frac{u_{\max}}{n\tau}},\, e^{-\frac{6}{\epsilon^2}}\big)$, let:

$$\gamma^* = \Bigg(1 - \frac{\log\big(\frac{\epsilon^2}{6}\log(1/\delta)\big)}{\log\big(\frac{u_{\max}}{\tau}\big)}\Bigg)^{\mathbb{I}\big[B \le (\frac{1}{2})^{\frac{1}{6}}\frac{1}{\tau}\big]}, \quad (13)$$

$$m = \frac{6}{\epsilon^2}\frac{1}{\tau}(B\,u_{\max})^{1-\gamma^*} < \frac{\log(\frac{1}{\delta})}{\tau}. \quad (14)$$

Then $(S_m, w)$ is an $(\epsilon, \frac{1}{6}, \tau)$-sketch and, if $B \le (\frac{1}{2})^{\frac{1}{6}}\frac{1}{\tau}$, any hash bucket with weight at least τ will have non-empty intersection with $S_m$ with probability at least 1 − δ.

7. Experiments

In this section, we evaluate the performance of hashing-based methods on kernel evaluation on real and synthetic datasets, as well as test the diagnostic procedure's ability to predict dataset-dependent estimator performance.²

²Source code available at: http://github.com/kexinrong/rehashing

Algorithm 4 Hashing-Based-Sketch (HBS)

1: Input: set P, sketch size m, hashing scheme $\mathcal{H}_\nu$, threshold τ ∈ (0, 1), u ∈ Δ_n
2: Sample h ∼ $\mathcal{H}_\nu$ and create hash table H = h(P).
3: Set γ according to (13)
4: S_m ← ∅, w ← 0 · 1_m, B ← B(H)
5: for j = 1, . . . , m do
6:   Sample hash bucket H_i with probability ∝ u_{H_i}^γ
7:   Sample a point X_j from H_i with probability ∝ u_j
8:   S_m ← S_m ∪ {X_j}
9:   w_j(γ, m) ← (u_{H_i}/m) · (∑_{i'=1}^{B} u_{H_{i'}}^γ) / u_{H_i}^γ
10: Output: (S_m, w)
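A minimal Python sketch of Algorithm 4 (ours, not the released implementation); for simplicity γ is passed as a parameter here, whereas the paper sets γ = γ* via equation (13).

```python
import numpy as np
from collections import defaultdict

def hbs(X, u, m, gamma, sample_hash, rng):
    """Hashing-Based-Sketch: hash once, sample buckets with probability
    proportional to u_{H_i}^gamma, sample points within a bucket
    proportionally to u_j, and weight each sample so that the sketch KDE
    remains an unbiased estimate of KDE_P^u."""
    h = sample_hash()
    table = defaultdict(list)
    for i, x in enumerate(X):
        table[h(x)].append(i)
    buckets = list(table.values())
    bucket_u = np.array([u[b].sum() for b in buckets])        # u_{H_i}
    bucket_p = bucket_u ** gamma / np.sum(bucket_u ** gamma)
    S, w = [], []
    for _ in range(m):
        i = rng.choice(len(buckets), p=bucket_p)
        b = buckets[i]
        j = rng.choice(b, p=u[b] / u[b].sum())
        S.append(X[j])
        # w_j = (u_{H_i} / m) * sum_{i'} u_{H_{i'}}^gamma / u_{H_i}^gamma
        w.append(bucket_u[i] / m * np.sum(bucket_u ** gamma) / bucket_u[i] ** gamma)
    return np.array(S), np.array(w)
```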

7.1. Experimental Setup

Baselines and Evaluation Metrics. As baselines for density estimation (using $u = \frac{1}{n}\mathbf{1}$), we compare against ASKIT (March et al., 2016), a tree-based method FigTree (Morariu et al., 2009), and random sampling (RS). We used the open-sourced libraries for FigTree and ASKIT; for comparison, all implementations are in C++ and we report results on a single core. We tune each set of parameters via binary search to guarantee an average relative error of at most 0.1. By default, the reported query time and relative error are averaged over 10K random queries that we evaluate exactly as ground truth. The majority of query densities in the evaluation are larger than $\tau = 10^{-4}$ (roughly $\frac{1}{10\sqrt{n}}$).

Synthetic Benchmarks. To evaluate algorithms under generic scenarios, we need benchmarks with diverse structure. The variance bound suggests that having a large number of points far from the query is hard for HBE, whereas having a few points close to the query is hard for RS. In addition, the performance of space-partitioning methods depends mostly on how clustered vs. spread-out the datasets are, with the latter being harder. A fair benchmark should therefore include all of the above regimes.

We propose a procedure that generates a d-dimensional dataset where, for D random directions, clusters of points are placed on s different distance scales, such that i) each distance scale contributes equally to the kernel density at the origin, ii) the outermost scale has n points per direction, iii) the kernel density around the origin is approximately μ. By picking D ≫ n the instances become more random-like, while for D ≪ n they become more clustered. Using this procedure, we create two families of instances:

• "worst-case": we take the union of two datasets generated for D = 10, n = 50K and D = 5K, n = 100, while keeping s = 4, μ = 10^-3 fixed and varying d.

• D-structured: we set N = 500K, s = 4, μ = 10^-3, d = 100 and vary D while keeping nD = N.

Page 8: Rehashing Kernel Evaluation in High Dimensions


Table 1. Comparison of preprocessing time and average query time on real-world datasets. Bold numbers (here: best per row) correspond to the best result. The sources of the datasets are: MSD (Bertin-Mahieux et al., 2011), GloVe (Pennington et al., 2014), SVHN (Netzer et al., 2011), TMY3 (Hendron & Engebrecht, 2010), covtype (Blackard & Dean, 1999), TIMIT (Garofolo, 1993).

Columns: Dataset, Description, n, d; Preprocess Time in seconds (HBE, FigTree, ASKIT, RS); Average Query Time in ms (HBE, FigTree, ASKIT, RS).

TMY3      | energy    | 1.8M | 8    | 15, 28, 838, 0      | 0.8, 3.0, 28.4, 55.0
census    | census    | 2.5M | 68   | 66, 4841, 1456, 0   | 2.8, 1039.9, 103.7, 27.5
covertype | forest    | 581K | 54   | 22, 31, 132, 0      | 1.9, 23.0, 47.0, 149.4
TIMIT     | speech    | 1M   | 440  | 298, 1579, 439, 0   | 24.3, 1531.8, 169.8, 32.8
ALOI      | image     | 108K | 128  | 24, 70, 14, 0       | 6.3, 53.5, 5.4, 21.2
SVHN      | image     | 630K | 3072 | 2184, >10^5, 877, 0 | 67.9, >10^4, 43.0, 370.1
MSD       | song      | 463K | 90   | 55, 2609, 107, 0    | 5.2, 326.9, 8.1, 2.1
GloVe     | embedding | 400K | 100  | 98, 5603, 76, 0     | 19.0, 656.5, 86.1, 5.0

Figure 3. Predicted relative variance upper bound (y-axis, log scale) for RS and HBE on real datasets (x-axis: acoustic, mnist, susy, hep, higgs, MSD, GloVe, TIMIT, SVHN, TMY3, ALOI, census, covertype, home, shuttle, skin, corel, ijcnn, sensorless, poker, cadata, codrna). RS exhibits lower variance for datasets on the left, HBE for those on the right. Overall, the diagnostic procedure correctly predicts the performance on all but one dataset (SVHN).

In the supplementary material we include a precise description of the procedure as well as example scatter plots.

7.2. Results

Synthetic datasets. We evaluate the four algorithms for kernel density estimation on worst-case instances with d ∈ [10, 500] and on D-structured instances with D ∈ [1, 100K]. For "worst-case" instances, while all methods experience increased query time with increased dimension, HBE consistently outperforms other methods for dimensions ranging from 10 to 500 (Figure 4, left). For the D-structured instances (Figure 4, right), FigTree and RS dominate the two extremes where the dataset is highly "clustered" (D ≪ n) or "scattered" on a single scale (D ≫ n), while HBE achieves the best performance for datasets in between. ASKIT's runtime is relatively unaffected by the number of clusters but is outperformed by other methods.

Figure 4. Synthetic experiments: average query time (s, log scale) vs. number of dimensions (left, worst-case instances with d ∈ {10, 50, 100, 200, 500}) and vs. number of clusters D (right, D-structured instances with D from 10^0 to 10^5), for FigTree, ASKIT, RS and HBE.

Real-world datasets. We repeat the above experiments on eight large real-world datasets from various domains. We z-normalize each dataset dimension, and tune the bandwidth based on Scott's rule (Scott, 2015). We exclude the small percentage of queries whose density is below τ. Table 1 reports the preprocessing time as well as the average query time for each method. Overall, HBE achieves the best average query time on four datasets; we focus on query time since it dominates the runtime given enough queries. With sketching, HBE also exhibits comparable preprocessing overhead to FigTree and ASKIT. While the performance of FigTree and ASKIT degrades with increased dimension and dataset size respectively, HBE remains competitive across the board.

Diagnosis. To assess the accuracy of the diagnostic procedure, we compare the predicted variance of RS and HBE on an extended set of datasets (Figure 3). RS exhibits lower variances (smaller error with the same number of samples) for datasets on the left. Overall, our proposed diagnosis correctly identified the better estimator for all but one dataset. Notice that although RS has better sampling efficiency on TIMIT, HBE ends up having better query time, since the frequent cache misses induced by the large dataset dimension offset the additional computation cost of HBE. For all datasets in Table 1, the diagnosis costs between 8% and 59% of the setup time of HBE, indicating that running the diagnosis is cheaper than running even a single query with the HBE.

Supplementary material. We include an experimental comparison between HBS and competing methods (Chen et al., 2010; Cortes & Scott, 2017), where it is consistently the best method for low-density queries while having similar performance for random queries. We also show how to use the diagnostic procedure to produce dataset visualizations.

Page 9: Rehashing Kernel Evaluation in High Dimensions


Acknowledgments

We thank Leslie Greengard for valuable conversations and encouragement in earlier stages of this work. We also thank Edward Gan, Daniel Kang, Sahaana Suri, and Kai-Sheng Tai for feedback on an early draft of this paper. The first author has been partially supported by an Onassis Foundation Scholarship. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project (Ant Financial, Facebook, Google, Intel, Microsoft, NEC, SAP, Teradata, and VMware), as well as Toyota Research Institute, Keysight Technologies, Northrop Grumman, and Hitachi. We also acknowledge the support of the Intel/NSF CPS Security grant No. 1505728, the Stanford Secure Internet of Things Project and its supporters (VMware, Ford, NXP, Google), and the Stanford System X Alliance.

References

Andoni, A. and Indyk, P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science (FOCS'06), 47th Annual IEEE Symposium on, pp. 459-468. IEEE, 2006.

Backurs, A., Charikar, M., Indyk, P., and Siminelakis, P. Efficient density evaluation for smooth kernels. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 615-626. IEEE, 2018.

Belkin, M., Ma, S., and Mandal, S. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.

Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.

Blackard, J. A. and Dean, D. J. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24:131-151, 1999.

Buhmann, M. D. Radial basis functions: theory and implementations, volume 12. Cambridge University Press, 2003.

Charikar, M. and Siminelakis, P. Hashing-Based-Estimators for kernel density in high dimensions. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pp. 1032-1043. IEEE, 2017.

Charikar, M. and Siminelakis, P. Multi-resolution hashing for fast pairwise summations. arXiv preprint arXiv:1807.07635, 2018.

Chen, B., Xu, Y., and Shrivastava, A. LSH-sampling breaks the computational chicken-and-egg loop in adaptive stochastic gradient estimation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, 2018. URL https://openreview.net/forum?id=Hk8zZjRLf.

Chen, Y., Welling, M., and Smola, A. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI'10, pp. 109-116, Arlington, Virginia, United States, 2010. AUAI Press. ISBN 978-0-9749039-6-5.

Cortes, E. C. and Scott, C. Sparse approximation of a kernel mean. Trans. Sig. Proc., 65(5):1310-1323, March 2017. ISSN 1053-587X.

Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253-262. ACM, 2004.

Gan, E. and Bailis, P. Scalable kernel density classification via threshold-based pruning. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 945-959. ACM, 2017.

Garofolo, J. S. TIMIT acoustic phonetic continuous speech corpus. Linguistic Data Consortium, 1993.

Gray, A. G. and Moore, A. W. N-body problems in statistical learning. In Advances in Neural Information Processing Systems, pp. 521-527, 2001.

Gray, A. G. and Moore, A. W. Nonparametric density estimation: Toward computational tractability. In Proceedings of the 2003 SIAM International Conference on Data Mining, pp. 203-211. SIAM, 2003.

Greengard, L. and Rokhlin, V. A fast algorithm for particle simulations. Journal of Computational Physics, 73(2):325-348, 1987.

Hendron, R. and Engebrecht, C. Building America research benchmark definition: Updated December 2009, 2010.

Indyk, P. and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604-613. ACM, 1998.

Lee, D. and Gray, A. G. Fast high-dimensional kernel summations using the Monte Carlo multipole method. In Advances in Neural Information Processing Systems, pp. 929-936, 2009.

Lee, D., Moore, A. W., and Gray, A. G. Dual-tree fast Gauss transforms. In Advances in Neural Information Processing Systems, pp. 747-754, 2006.

Luo, C. and Shrivastava, A. Arrays of (locality-sensitive) Count Estimators (ACE): Anomaly detection on the edge. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pp. 1439-1448. International World Wide Web Conferences Steering Committee, 2018a.

Luo, C. and Shrivastava, A. Scaling-up split-merge MCMC with locality sensitive sampling (LSS). arXiv preprint arXiv:1802.07444, 2018b.

March, W., Xiao, B., and Biros, G. ASKIT: Approximate skeletonization kernel-independent treecode in high dimensions. SIAM Journal on Scientific Computing, 37(2):A1089-A1110, 2015.

March, W., Xiao, B., Yu, C., and Biros, G. ASKIT: An efficient, parallel library for high-dimensional kernel summations. SIAM Journal on Scientific Computing, 38(5):S720-S749, 2016.

Morariu, V. I., Srinivasan, B. V., Raykar, V. C., Duraiswami, R., and Davis, L. S. Automatic online tuning for fast Gaussian summation. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 21, pp. 1113-1120. Curran Associates, Inc., 2009.

Muandet, K., Fukumizu, K., Sriperumbudur, B., Scholkopf, B., et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1-141, 2017.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, 2014.

Phillips, J. M. ε-samples for kernels. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1622-1632. SIAM, 2013.

Phillips, J. M. and Tai, W. M. Near-optimal coresets of kernel density estimates. In LIPIcs-Leibniz International Proceedings in Informatics, volume 99. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

Ram, P., Lee, D., March, W., and Gray, A. G. Linear-time algorithms for pairwise statistical problems. In Advances in Neural Information Processing Systems, pp. 1527-1535, 2009.

Scholkopf, B., Smola, A. J., Bach, F., et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.

Scott, D. W. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.

Spring, R. and Shrivastava, A. A new unbiased and efficient class of LSH-based samplers and estimators for partition function computation in log-linear models. arXiv preprint arXiv:1703.05160, 2017.

Sriperumbudur, B. et al. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839-1893, 2016.

Wu, X., Charikar, M., and Natchu, V. Local density estimation in high dimensions. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5296-5305. PMLR, 2018.

Zheng, Y., Jestes, J., Phillips, J. M., and Li, F. Quality and efficiency for kernel density estimates in large data. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 433-444. ACM, 2013.

Page 11: Rehashing Kernel Evaluation in High Dimensions

Supplementary material for: Rehashing Kernel Evaluation in High Dimensions

Paris Siminelakis * 1 Kexin Rong * 1 Peter Bailis 1 Moses Charikar 1 Philip Levis 1

Outline. In the first two sections, we give proofs for all our formal results while restating them for convenience. In Section 2, we give the precise description of our Hashing-Based-Sketch and its theoretical analysis. In Section 3, we present our diagnostic and visualization procedures in more detail. In Section 4, we explain our design decisions and the procedure for generating a fair synthetic benchmark. In Section 5, we provide additional setup details and results for the experimental evaluation.

1. Proofs

1.1. Basic inequalities

We first state without proof some well known inequalities that we will use in the proofs.

Lemma 1 (Chebyshev's and Paley-Zygmund inequalities). For a non-negative random variable Z and parameters t > 0, θ ∈ [0, 1], we have

$$\mathbb{P}[Z \ge (t+1)\cdot\mathbb{E}[Z]] \le \frac{1}{t^2}\cdot\mathrm{RelVar}[Z], \quad (1)$$

$$\mathbb{P}[Z > (1-\theta)\mathbb{E}[Z]] \ge \frac{1}{1 + \frac{1}{\theta^2}\cdot\mathrm{RelVar}[Z]}. \quad (2)$$

Theorem 1 (Chernoff bounds). Let $X = \sum_{i=1}^{n} X_i$, where $X_i = 1$ with probability $p_i$ and $X_i = 0$ with probability $1 - p_i$, and all $X_i$ are independent. Let $v = \mathbb{E}[X] = \sum_{i=1}^{n} p_i$. Then for δ > 0

$$\mathbb{P}[X \ge (1+\delta)v] \le e^{-\frac{\delta^2}{2+\delta}v}, \quad (3)$$

$$\mathbb{P}[X \le (1-\delta)v] \le e^{-\frac{1}{2}\delta^2 v}. \quad (4)$$

1.2. Median-trick to boost success probability

The median-trick is based on concentration of sums of independent binary random variables. If we define binary


random variables appropriately, we can obtain bounds for the concentration of the median of i.i.d. random variables around their expectation.

Lemma 2. Let $Z_1, \dots, Z_L$ be $L \ge 1$ i.i.d. copies of a non-negative random variable with $\mathrm{RelVar}[Z] \le \frac{\epsilon^2}{6}$; then:

$$\mathbb{P}\big[\mathrm{median}\{Z_1, \dots, Z_L\} \ge (1+\epsilon)\mathbb{E}[Z]\big] \le e^{-\frac{L}{6}},$$

$$\mathbb{P}\big[\mathrm{median}\{Z_1, \dots, Z_L\} \le (1-\epsilon)\mathbb{E}[Z]\big] \le e^{-\frac{L}{4}}.$$

Proof of Lemma 2. Let

$$X_i = \mathbb{I}[Z_i \ge (1+\epsilon)\mathbb{E}[Z]], \qquad Y_i = \mathbb{I}[Z_i \le (1-\epsilon)\mathbb{E}[Z]].$$

By Lemma 1, we have that

$$a_i = \mathbb{E}[X_i] \le \frac{1}{\epsilon^2}\frac{\epsilon^2}{6} \le \frac{1}{6}, \qquad b_i = \mathbb{E}[Y_i] \le \frac{1}{7}.$$

We get the following upper bounds:

$$\mathbb{P}\big[\mathrm{median}\{Z_1, \dots, Z_L\} \ge (1+\epsilon)\mathbb{E}[Z]\big] \le \mathbb{P}\Big[\sum_{i=1}^{L} X_i \ge \frac{L}{2}\Big],$$

$$\mathbb{P}\big[\mathrm{median}\{Z_1, \dots, Z_L\} \le (1-\epsilon)\mathbb{E}[Z]\big] \le \mathbb{P}\Big[\sum_{i=1}^{L} Y_i \ge \frac{L}{2}\Big],$$

which along with Chernoff bounds gives our result. We only show the first inequality as the second one follows similarly. Let $A = \sum_{i=1}^{L} a_i \le L/6$; the first event is bounded by $\exp\Big(-\frac{(\frac{L}{2A}-1)^2}{2+(\frac{L}{2A}-1)}A\Big) \le \exp(-L/6)$.

1.3. Moments of Hashing-Based-Estimators

Lemma 3. Assuming that $p(x_i, q) > 0$ for all $i \in [n]$, then

$$\mathbb{E}[Z_h] = \sum_{i=1}^{n} u_i k(q, x_i), \quad (5)$$

$$\mathbb{E}[Z_h^2] = \sum_{i,j=1}^{n} k^2(q, x_i)\, u_i \frac{\mathbb{P}[i, j \in H(q)]}{p^2(q, x_i)}\, u_j. \quad (6)$$

Page 12: Rehashing Kernel Evaluation in High Dimensions


Proof of Lemma 3. We start with the expectation:

$$\mathbb{E}_{h,X}\Big[\frac{k(q,X)}{p(q,X)}u_{H(q)}\Big] = \mathbb{E}_h\Big[\mathbb{E}_X\Big[\frac{k(q,X)}{p(q,X)}\Big]u_{H(q)}\Big] = \mathbb{E}_h\Big[\sum_{i\in H(q)}\frac{u_i}{u_{H(q)}}\frac{k(q,x_i)}{p(q,x_i)}u_{H(q)}\Big]$$
$$= \sum_{i=1}^{n} u_i\,\mathbb{E}\big[\mathbb{I}[h(x_i)=h(q)]\big]\frac{k(x_i,q)}{p(x_i,q)} = \sum_{i=1}^{n} u_i k(x_i,q).$$

We proceed with the second moment:

$$\mathbb{E}_{h,X}\Big[\frac{k^2(q,X)}{p^2(q,X)}u^2_{H(q)}\Big] = \mathbb{E}_h\Big[\mathbb{E}_X\Big[\frac{k^2(q,X)}{p^2(q,X)}\Big]u^2_{H(q)}\Big] = \mathbb{E}_h\Big[\sum_{i\in H(q)}\frac{u_i}{u_{H(q)}}\frac{k^2(q,x_i)}{p^2(q,x_i)}u^2_{H(q)}\Big]$$
$$= \mathbb{E}_h\Big[\sum_{i\in H(q)} u_i\frac{k^2(q,x_i)}{p^2(q,x_i)}u_{H(q)}\Big] = \mathbb{E}_h\Big[\sum_{i,j\in H(q)} u_i u_j\frac{k^2(q,x_i)}{p^2(q,x_i)}\Big] = \sum_{i,j=1}^{n} k^2(x_i,q)\,u_i\frac{\mathbb{P}[i,j\in H(q)]}{p^2(x_i,q)}u_j.$$

1.4. Refined Variance bound

Here, we derive our new inequality bounding the variance of HBE and RS. Let $\mu \le \lambda \le L \le 1$ and define:

$$S_1 = \{i \in [n] : L \le w_i \le 1\} \quad (7)$$
$$S_2 = \{i \in [n]\setminus S_1 : \lambda \le w_i \le L\} \quad (8)$$
$$S_3 = \{i \in [n]\setminus(S_2\cup S_1) : \mu \le w_i \le \lambda\} \quad (9)$$
$$S_4 = \{i \in [n] : w_i < \mu\} \quad (10)$$

as well as $\mu_\ell = \sum_{i\in S_\ell} u_i w_i \le \mu$. The intuition behind the definition of the sets is that for radially decreasing kernels they correspond to spherical annuli around the query (Figure 1).

Lemma 4. For non-negative weights $w_1, \dots, w_n$, vector $u \in \Delta_n$ and sets $S_1, \dots, S_4 \subseteq [n]$ as above it holds that

$$\sum_{i,j\in[n]} w_i^2 u_i V_{ij} u_j \le \sum_{\ell\in[3],\,\ell'\in[3]}\Big(\sup_{i\in S_\ell, j\in S_{\ell'}} V_{ij}\frac{w_i}{w_j}\Big)\mu_\ell\mu_{\ell'} + u_{S_4}\sum_{\ell\in[3]}\Big(\sup_{i\in S_\ell, j\in S_4} V_{ij}\frac{w_i}{\mu}\Big)\mu_\ell\,\mu + \Big(\sup_{i\in S_4, j\in[n]} V_{ij} w_i\Big)\mu_4 \quad (11)$$

where $u_S := \sum_{j\in S} u_j \le 1$.

Figure 1. Depiction of the sets that appear in Lemma 4

Proof of Lemma 4. First we observe that $S_1 \uplus S_2 \uplus S_3 \uplus S_4 = [n]$ forms a partition:

$$\sum_{i,j\in[n]} u_i u_j V_{ij} w_i^2 = \sum_{\ell,\ell'\in[3]}\,\sum_{i\in S_\ell, j\in S_{\ell'}} u_i u_j V_{ij} w_i^2 + \sum_{\ell\in[3]}\,\sum_{i\in S_\ell, j\in S_4} u_i u_j V_{ij} w_i^2 + \sum_{i\in S_4, j\in[n]} u_i u_j V_{ij} w_i^2. \quad (12)$$

For the first three sets we have bounds on the ratio $\frac{w_i}{w_j}$, whereas for the last set we have a bound on $w_i$. We utilize these as follows:

$$\sum_{i\in S_\ell, j\in S_{\ell'}} V_{ij}\frac{w_i}{w_j}\,u_i w_i\,u_j w_j \le \sup_{i\in S_\ell, j\in S_{\ell'}}\Big(V_{ij}\frac{w_i}{w_j}\Big)\sum_{i\in S_\ell} w_i u_i \sum_{j\in S_{\ell'}} w_j u_j,$$

$$\sum_{i\in S_\ell, j\in S_4} V_{ij}\frac{w_i}{\mu}\,w_i u_i\,u_j\mu \le u_{S_4}\sup_{i\in S_\ell, j\in S_4}\Big(V_{ij}\frac{w_i}{\mu}\Big)\,\mu\sum_{i\in S_\ell} w_i u_i,$$

$$\sum_{i\in S_4, j\in[n]} V_{ij} w_i\,u_i u_j w_i \le \sup_{i\in S_4, j\in[n]}\big(V_{ij} w_i\big)\,\|u\|_1\sum_{i\in S_4} w_i u_i.$$

Identifying the $\mu_\ell$ in the above expressions and substituting the bounds into (12) completes the proof.

1.5. Adaptive procedure

Theorem 2. Given an (α, β, γ)-regular estimator $\mathcal{Z}$, the AMR procedure outputs a number Z such that

$$\mathbb{P}[|Z - \mu| \le \epsilon\cdot\max\{\mu, \tau\}] \ge \frac{2}{3} - O_{\gamma,\alpha}(\epsilon^2)$$

and with the same probability uses $O_\gamma\big(\frac{1}{\epsilon^2}\frac{1}{\mu^\beta}\big)$ samples.

Proof of Theorem 2. Recall that $\mu_t = (1+\gamma)^{-t}$ and let $t_0 := t_0(\mu) \in \mathbb{Z}$ be such that:

$$\mu_{t_0+1} \le \mu \le \mu_{t_0}. \quad (13)$$

We consider two cases: $t_0 < T$ or $t_0 \ge T$.

Page 13: Rehashing Kernel Evaluation in High Dimensions


Case I ($t_0 < T$). In this case, we want to show that with constant probability our algorithm does not terminate before $t_0$ and not after $t_0 + 1$.

Let $\bar{Z}_t$ be the mean of $m_t$ i.i.d. samples $Z_t^{(i)} \sim \mathcal{Z}(t, \gamma)$ with mean $\mathbb{E}[Z_t^{(i)}] = \mu$ and $\mathrm{RelVar}[Z_t^{(i)}] \le V_t(\mu)$. Then,

$$\mathrm{RelVar}[\bar{Z}_t] \le \frac{\epsilon^2}{6}\frac{V_t(\mu)}{V_t(\mu_{t+1})}. \quad (14)$$

Let $A_0$ be the event that the algorithm terminates before $t_0$.

$$\mathbb{P}[A_0] = \mathbb{P}[\exists t < t_0,\ \bar{Z}_t \ge \mu_t] \quad (15)$$
$$\le \sum_{t<t_0}\mathbb{P}\Big[\bar{Z}_t \ge \Big(\frac{\mu_t}{\mu}\Big)\mu\Big] \quad (16)$$
$$\le \frac{\epsilon^2}{6}\sum_{t=1}^{t_0-1}\frac{\mu^2}{(\mu_t-\mu)^2}\frac{V_t(\mu)}{V_t(\mu_{t+1})} \quad (17)$$
$$\le \frac{\epsilon^2}{6}\sum_{t=1}^{t_0-1}\frac{\mu^2}{(\mu_t-\mu)^2}\Big(\frac{\mu_{t+1}}{\mu}\Big)^{2-\alpha}, \quad (18)$$

where in (16) we use the union bound, in (17) the first part of Lemma 1, and in (18) property (B) of a regular estimator. In the next three inequalities we use (13), $t \le t_0 - 1$ and $\sum_{t=0}^{s} x^t \le (1-x)^{-1}$ for $x < 1$:

$$\mathbb{P}[A_0] \le \frac{\epsilon^2}{6}\sum_{t=1}^{t_0-1}\frac{1}{(1-\frac{\mu_{t_0}}{\mu_t})^2}\frac{\mu_{t+1}^2}{\mu_t^2}\Big(\frac{\mu_{t_0}}{\mu_{t+1}}\Big)^{\alpha} \quad (19)$$
$$\le \frac{\epsilon^2}{6}\frac{1}{\gamma^2\mu_{t_0}^{\alpha}}\sum_{t=1}^{t_0-1}(1+\gamma)^{-\alpha(t_0-t-1)} \quad (20)$$
$$\le \frac{\epsilon^2}{6}\frac{1}{\gamma^2\mu_{t_0}^{\alpha}}\frac{1}{1-(1+\gamma)^{-\alpha}}. \quad (21)$$

Furthermore, let A_1 be the event that the algorithm terminates at some level t > t_0 + 1.

P[A_1] = P[∀ t ≤ t_0 + 1 : Z_t < μ_t]   (22)
  ≤ P[Z_{t_0+1} < μ_{t_0+1}]   (23)
  = 1 − P[Z_{t_0+1} ≥ μ_{t_0+1}].   (24)

Using the second part of Lemma 1 (Paley–Zygmund),

P[Z_{t_0+1} ≥ μ_{t_0+1}] ≥ 1 / (1 + ((γ+1)^2/γ^2) (ε^2/6) · V_{t_0+1}(μ)/V_{t_0+1}(μ_{t_0+1}))   (25)
  ≥ (1 + ((γ+1)^2/γ^2) (ε^2/6))^{−1}.   (26)

Therefore, P[A_1] ≤ 1 − (1 + ((γ+1)^2/γ^2) (ε^2/6))^{−1} ≤ ((γ+1)^2/γ^2) (ε^2/6). Finally, let t* be the (random) level where the algorithm terminates and A_2 be the event that |Z_{t*} − μ| > εμ. If any of the three events happens, we say that the procedure fails. We can bound the failure probability by:

P[F] = P[A_0 ∨ A_1 ∨ A_2]
  = P[(A_0 ∨ A_1 ∨ A_2) ∧ A_0] + P[(A_0 ∨ A_1 ∨ A_2) ∧ A_0^c]
  ≤ P[A_0] + P[A_1 ∧ A_0^c] + P[A_2 ∧ A_0^c].   (27)

To bound the last term we use

P[A_2 ∧ A_0^c] = P[A_2 ∧ A_0^c ∧ A_1] + P[A_2 ∧ A_0^c ∧ A_1^c] ≤ P[A_1] + P[A_2 ∧ A_0^c ∧ A_1^c],

and

P[A_2 ∧ A_0^c ∧ A_1^c] = ∑_{t∈{t_0, t_0+1}} P[|Z_t − μ| > εμ ∧ t* = t]
  ≤ ∑_{t∈{t_0, t_0+1}} P[|Z_t − μ| > εμ]
  ≤ (1/ε^2) ∑_{t∈{t_0, t_0+1}} RelVar[Z_t]
  ≤ (1/ε^2) ∑_{t∈{t_0, t_0+1}} (ε^2/6) · V_t(μ)/V_t(μ_{t+1}).

By definition μ ≥ μ_{t+1} for all t ≥ t_0, thus by property (B):

P[A_2 ∧ A_0^c ∧ A_1^c] ≤ 2/6 = 1/3.   (28)

Hence, the overall failure probability is bounded by:

P[F] ≤ P[A_0] + 2 P[A_1] + P[A_2 ∧ A_0^c ∧ A_1^c]
  ≤ (ε^2/6) · 1/(γ^2 μ_{t_0}^α) · 1/(1 − (1+γ)^{−α}) + 2 (γ+1)^2/γ^2 · ε^2/6 + 1/3.

When the algorithm succeeds, the total number of samples is bounded by

∑_{t=1}^{t_0+1} ⌈(6/ε^2) V_t(μ_{t+1})⌉ ≤ (t_0 + 1) + (6C/ε^2) ∑_{t=1}^{t_0+1} (1+γ)^{β(t+1)}
  ≤ (t_0 + 1) + (6C/ε^2) (1+γ)^{2β} (1+γ)^{β t_0}/γ
  ≤ (t_0 + 1) + (6C/ε^2) · ((1+γ)^β/γ) · 1/μ^β.

Case II (t_0 ≥ T). In this case μ ≤ μ_T ≤ ετ/(1+γ). By the same arguments as in the case t_0 < T, the probability that the algorithm terminates before t_0 is at most (ε^2/(6γ^2 μ_{t_0}^α)) · 1/(1 − (1+γ)^{−α}). If the condition Z_T ≥ μ_T is satisfied, then:

P[|Z_T − μ| > εμ] ≤ (1/ε^2) RelVar[Z_T] ≤ 1/6.   (29)

If Z_T < μ_T, then:

|0 − μ| ≤ μ ≤ μ_T ≤ ετ/(1+γ) ≤ ε max{μ, τ}.   (30)


Conclusion. Thus, overall if Z is the output of AMR:

P[|Z − μ| > ε max{μ, τ}] ≤ (ε^2/6) · 1/(γ^2 μ_{t_0}^α) · 1/(1 − (1+γ)^{−α}) + 2 (γ+1)^2/γ^2 · ε^2/6 + 1/3.

As we see in the above expression, the failure probability is dominated by the 1/3 term. For example, for γ = 1, ε = 0.2, α = 1 we have that the extra term is less than 0.0667.
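For concreteness, the adaptive procedure analyzed above can be sketched as follows. This is a minimal sketch, assuming access to a caller-supplied sampler draw(t) that returns one unbiased estimate of μ at level t and to the relative-variance bound V(t, x) of the (α, β, γ)-regular estimator; the function names are placeholders rather than the interface of our implementation.

import math

def amr(draw, V, eps, tau, gamma=1.0):
    """Adaptive procedure: geometrically decrease the guess mu_t = (1+gamma)^-t
    until the empirical mean at level t exceeds mu_t, then return it."""
    # last level T is chosen so that mu_T <= eps * tau / (1 + gamma)
    T = math.ceil(math.log((1 + gamma) / (eps * tau)) / math.log(1 + gamma))
    total = 0
    for t in range(1, T + 1):
        mu_t = (1 + gamma) ** (-t)
        mu_next = (1 + gamma) ** (-(t + 1))
        m_t = math.ceil(6.0 / eps ** 2 * V(t, mu_next))   # samples at level t
        Z_t = sum(draw(t) for _ in range(m_t)) / m_t
        total += m_t
        if Z_t >= mu_t:
            return Z_t, total      # accept: mu is likely in [mu_{t+1}, mu_t]
    return 0.0, total              # density below eps * tau: report 0

The loop mirrors the proof: with constant probability it stops at level t_0 or t_0 + 1, where the per-level sample size ⌈(6/ε^2) V_t(μ_{t+1})⌉ already guarantees relative variance ε^2/6 for the true μ.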

1.6. Regular estimator for Gaussian Kernel

Theorem 3. Z_Gauss is (1, 3/4, γ)-regular and takes preprocessing time/space bounded by O_{d,κ_T,γ}(ε^{−3+1/4} τ^{−3/4} · n).

Proof of Theorem 3. By Lemma 3 and Theorem 1 (Section 2.3 in the main paper), (A) holds with V_t(μ) := (4e^{3/2}/μ) e^{r_t^2 − r_t √(log(1/μ))}. Moreover, since for all x ≥ y > 0

V_t(y)/V_t(x) = (x/y) e^{−r_t(√(log(1/y)) − √(log(1/x)))} ≤ (x/y)^{2−1}   (31)

and

V_t'(x) = −(4e^{3/2}/x) e^{r_t^2 − r_t √(log(1/x))} (1/x + r_t/(2√(log(1/x)))) < 0,

property (B) holds with α = 1. Finally,

V_t(μ_{t+1}) = 4e^{3/2} e^{(1/4 − (1/2)√((t+1)/t) + (1 + 1/t)) · t log(1+γ)}   (32)
  = 4e^{3/2} (1/μ_t)^{1/4 − (1/2)√((t+1)/t) + (1 + 1/t)}   (33)
  ≤ 4e^{3/2} (1+γ)^{1 − 1/√2} · (1/μ_t)^{3/4},   (34)

and consequently (C) holds with β = 3/4. Finally, the estimator uses at most O((1/ε^2) V_T(μ_{T+1})) hash tables, each taking preprocessing time/space O_{d,κ_T,γ}(n).

2. Sketching

For any hash table H and a vector u ∈ Δ_n (the simplex), let B = B(H) denote the number of buckets and u_max = u_max(H) := max{u_{H_i} : i ∈ [B]} the maximum weight of any hash bucket of H. The precise definition of our Hashing-Based-Sketch is given below in Algorithm 1.

For a fixed H, we can obtain the following bounds on the first two moments of our sketch (S_m, w).

Lemma 5 (Moments). For the sketch (S_m, w) produced by the HBS procedure it holds that

E[KDE^w_{S_m} | H] = KDE^u_P(q),
Var[KDE^w_{S_m} | H] ≤ (1/m) (B u_max)^{1−γ} ∑_{i=1}^{n} k^2(x_i, q) u_i.

Algorithm 1 Hashing-Based-Sketch (HBS)

1: Input: set P, size m, hashing scheme H_ν, threshold τ ∈ (0, 1), u ∈ Δ_n
2: Sample h ∼ H_ν and create hash table H = h(P).
3: Set γ according to (35)
4: S_m ← ∅, w ← 0 · 1_m, B ← B(H)
5: for j = 1, ..., m do
6:   Sample hash bucket H_i with probability ∝ u_{H_i}^γ
7:   Sample a point X_j from H_i with probability ∝ u_j
8:   S_m ← S_m ∪ {X_j}
9:   w_j(γ, m) ← (u_{H_i}/m) · (∑_{i'=1}^{B} u_{H_{i'}}^γ) / u_{H_i}^γ
10: Output: (S_m, w)
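A minimal NumPy sketch of Algorithm 1, assuming a caller-supplied hash function h that maps a point to a bucket id and a value of γ already chosen via (35); the names and data layout below are illustrative only.

import numpy as np

def hbs(X, u, h, gamma, m, rng=np.random.default_rng(0)):
    """Hashing-Based-Sketch: sample m points, biasing bucket selection by u_H^gamma."""
    u = np.asarray(u, dtype=float)
    buckets = {}                                   # bucket id -> list of point indices
    for i in range(len(X)):
        buckets.setdefault(h(X[i]), []).append(i)
    ids = list(buckets)
    uH = np.array([u[buckets[b]].sum() for b in ids])        # bucket weights u_{H_i}
    p_bucket = uH ** gamma / (uH ** gamma).sum()
    S, w = [], []
    for _ in range(m):
        b = rng.choice(len(ids), p=p_bucket)                 # bucket ~ u_H^gamma
        members = buckets[ids[b]]
        j = rng.choice(members, p=u[members] / uH[b])        # point within bucket ~ u_j
        S.append(int(j))
        w.append(uH[b] / m * (uH ** gamma).sum() / uH[b] ** gamma)   # weight from line 9
    return np.array(S), np.array(w)

The KDE estimate at a query q is then ∑_j w_j k(x_{S_j}, q), which Lemma 5 shows is unbiased for KDE^u_P(q).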

The above analysis shows that the sketch is always unbiased and that the variance depends on the hash function H only through (B u_max)^{1−γ} ≥ 1. We postpone the proof of this lemma until after showing how it implies the following theorem.

Theorem 4. Let H be the hash function sampled by the HBS procedure. For ε > 0 and δ ∈ [e^{−(6/ε^2)(u_max/τ)}, e^{−6/ε^2}), let:

γ_* = 1 − (log((ε^2/6) log(1/δ)) / log(u_max/τ)) · I[B ≤ (1/2)^{1/6} (1/τ)],   (35)

m = (6/ε^2) (1/τ) (B u_max)^{1−γ_*} < log(1/δ)/τ.   (36)

Then (S_m, w) is an (ε, 1/6, τ)-sketch, and if B ≤ (1/2)^{1/6} (1/τ), any hash bucket with weight at least τ will have non-empty intersection with S_m with probability at least 1 − δ.

Proof of Theorem 4. Given a hash bucket with weight at least τ, the probability that we sample a point from that bucket is at least:

ρ ≥ τ^γ/B^{1−γ} = τ · 1/(Bτ)^{1−γ}.   (37)

The probability that we see no point from the bucket after m independent samples is less than (1 − ρ)^m ≤ e^{−m τ^γ/B^{1−γ}}. For m ≥ (log(1/δ)/τ) (Bτ)^{1−γ} this probability is at most δ. On the other hand, by Lemma 5, if m ≥ (6/ε^2)(1/τ)(B u_max)^{1−γ} we have Var[KDE^w_{S_m}] ≤ (ε^2/6) μτ. The case B > (1/2)^{1/6}(1/τ) is trivial as γ_* = 1. For B ≤ (1/2)^{1/6}(1/τ) we have u_max ≥ 1/B ≥ τ · 2^{1/6}.

We set γ to make the two lower bounds on m equal:

(6/ε^2)(1/τ)(B u_max)^{1−γ} = (log(1/δ)/τ)(Bτ)^{1−γ}   (38)
⟺ (u_max/τ)^{1−γ} = (ε^2/6) log(1/δ)   (39)
⟺ γ = 1 − log((ε^2/6) log(1/δ)) / log(u_max/τ).   (40)


This is strictly less than one for (ε^2/6) log(1/δ) > 1 ⟹ δ < e^{−6/ε^2}, and more than zero for δ ≥ e^{−(6/ε^2)(u_max/τ)}. Since u_max ≥ τ · 2^{1/6} the two inequalities are consistent. Furthermore,

m = (6/(τ ε^2)) · (B u_max)^{1−γ_*}   (41)
  = (6/(τ ε^2)) · (B u_max)^{log((ε^2/6) log(1/δ)) / log(u_max/τ)}   (42)
  = (6/(τ ε^2)) · e^{log((ε^2/6) log(1/δ)) · log(B u_max) / log(u_max/τ)}   (43)
  ≤ (6/(τ ε^2)) · ((ε^2/6) log(1/δ))^{1 − (1/6) log 2 / log(u_max/τ)}   (44)
  < log(1/δ)/τ.   (45)

Remark 1. Observe that log(1/δ)/τ is the number of samples that random sampling would require in order to have the same property for any bucket with u_{H_i} ≥ τ. When γ_* < 1, our scheme always uses fewer samples, by a factor of ((ε^2/6) log(1/δ))^{log(Bτ)/log(u_max/τ)} < 1.

Thus, our sketch will have variance similar to that of random sampling in dense regions of the space, but will have better performance in relatively "sparse" regions.
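To make (35)–(36) concrete, the small sketch below computes γ_* and the sketch size m from B, u_max, τ, ε and δ; the numbers in the usage line are purely illustrative and must respect the admissible range of δ in Theorem 4.

import math

def hbs_parameters(B, u_max, tau, eps, delta):
    """gamma_* and sketch size m from (35)-(36)."""
    if B > 0.5 ** (1.0 / 6.0) / tau:
        gamma = 1.0          # too many buckets: fall back to plain weighted sampling
    else:
        gamma = 1.0 - math.log(eps ** 2 * math.log(1.0 / delta) / 6.0) / math.log(u_max / tau)
    m = 6.0 / (eps ** 2 * tau) * (B * u_max) ** (1.0 - gamma)
    return gamma, m

# e.g. B = 1000, u_max = 0.01, tau = 1e-4, eps = 0.8, delta = 1e-5
# gives gamma_* ~ 0.96 and m ~ 1.04e5 < log(1/delta)/tau ~ 1.15e5
print(hbs_parameters(1000, 0.01, 1e-4, 0.8, 1e-5))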

2.1. Proof of Lemma 5

Proof of Lemma 5. Let I be the random hash bucket and X_I the corresponding random point; then for a single point:

E[KDE^{w_1}_{X_I}] = E_I[E_{X_I}[(u_{H_I}/m) · (∑_{i'=1}^{B} u_{H_{i'}}^γ / u_{H_I}^γ) · k(X_I, q)]]
  = E_I[∑_{j∈H_I} (u_{H_I}/m) (∑_{i'=1}^{B} u_{H_{i'}}^γ / u_{H_I}^γ) k(x_j, q) · (u_j/u_{H_I})]
  = (1/m) E_I[(∑_{i'=1}^{B} u_{H_{i'}}^γ / u_{H_I}^γ) ∑_{j∈H_I} k(x_j, q) u_j]
  = (1/m) ∑_{i∈[B]} ∑_{j∈H_i} k(x_j, q) u_j
  = (1/m) KDE^u_P(q).

The first part follows by linearity of expectation. Similarly,

E[(KDE^w_{S_m})^2] ≤ ∑_{j=1}^{m} E[(KDE^{w_j}_{X_j})^2] + (KDE^u_P(q))^2.

By linearity we only have to bound the first term:

E[(KDE^{w_1}_{X_I})^2] = E_I[E_{X_I}[((u_{H_I}/m)(∑_{i'=1}^{B} u_{H_{i'}}^γ / u_{H_I}^γ) k(X_I, q))^2]]
  = E_I[∑_{j∈H_I} ((u_{H_I}/m)(∑_{i'=1}^{B} u_{H_{i'}}^γ / u_{H_I}^γ) k(x_j, q))^2 (u_j/u_{H_I})]
  = E_I[((∑_{i'=1}^{B} u_{H_{i'}}^γ)/(m u_{H_I}^γ))^2 u_{H_I} ∑_{j∈H_I} k^2(x_j, q) u_j]
  = ((∑_{i'=1}^{B} u_{H_{i'}}^γ)/m^2) ∑_{i∈[B]} u_{H_i}^{1−γ} ∑_{j∈H_i} k^2(x_j, q) u_j
  ≤ ((∑_{i'=1}^{B} u_{H_{i'}}^γ)/m^2) u_max^{1−γ} ∑_{i∈[B]} ∑_{j∈H_i} k^2(x_j, q) u_j
  ≤ ((B u_max)^{1−γ}/m^2) ∑_{j=1}^{n} k^2(x_j, q) u_j.

The last inequality follows by applying Hölder's inequality with p = 1/γ and q = 1/(1−γ), and the fact that u ∈ Δ_n.

3. Diagnostic and Visualization procedures

In this section, we show how our refined variance bounds, along with the adaptive procedure, lead to a diagnostic procedure estimating the variance of RS and HBE, as well as to a visualization procedure that gives intuition about the "local structure" of the queries in a given dataset.

3.1. Diagnostic procedure

In order to go beyond worst-case bounds (typically expressed as a function of the query density μ) and provide dataset-specific bounds on the variance of different methods (RS and HBE), we use Lemma 4. By setting the coefficients V_{ij} appropriately, we can bound the variance of RS (V_{ij} = 1) and HBE (V_{ij} = min{p(q, x_i), p(q, x_j)}/p(q, x_i)^2). Unfortunately, evaluating the bound (11) directly over the whole dataset for a single query is no cheaper than evaluating the methods directly.

At a high level, our diagnostic procedure gets around this by evaluating the upper bound for each query q on a "representative sample" S_0(q) instead of P. By doing this for a number T of random queries picked uniformly from the dataset P, we get an estimate of the average relative variance for different methods.

Specifically, given τ ∈ (0, 1) and ε ∈ (0, 1), for a single query let S_0 be the random set produced by the AMR procedure (Algorithm 1, Section 4.2) called with random sampling, define the sets S_0 ∩ S_ℓ for ℓ ∈ [4] and their corresponding "densities" μ_ℓ = ∑_{i∈S_0∩S_ℓ} u_i w_i, and let

λ_ε := arg max_{μ_0 ≤ λ ≤ 1} {μ_3 ≤ (1/2)(ε μ_0 − μ_4)}   (46)
L_ε := arg min_{μ_0 ≤ L ≤ 1} {μ_1 ≤ (1/2)(ε μ_0 − μ_4)}   (47)

be such that μ_2 ≥ (1 − ε) μ_0, i.e. most of the mass is captured by the set S_2 (which is a spherical annulus for kernels that are decreasing with distance). Since Lemma 4 holds for all μ ≤ λ ≤ L ≤ 1, L_ε and λ_ε complete the definition of the four sets S_1, ..., S_4, which we use to evaluate (11); we denote by V_method(q) the corresponding bound with V_{ij} set according to a given estimator, e.g. method ∈ {RS, HBE}. Below we give the procedure in pseudo-code.

Algorithm 2 Diagnostic

1: Input: set P, threshold τ ∈ (1/n, 1), accuracy ε, T ≥ 1, collision probability p(x, y) of the hashing scheme H_ν.
2: for t = 1, ..., T do
3:   q ← Random(P)   ▷ For each random query
4:   (S_0, μ_0) ← AMR(Z_RS(q), ε, τ)
5:   Set λ_ε, L_ε using (46) and (47)
6:   Let V_RS be the r.h.s. of (11) for S_0 and V_{ij} = 1.
7:   Let V_HBE be the r.h.s. of (11) for S_0 and

       V_{ij} = min{p(q, x_i), p(q, x_j)} / p(q, x_i)^2   (48)

8:   rV_RS(t) ← V_RS / max{μ_0, τ}^2
9:   rV_HBE(t) ← V_HBE / max{μ_0, τ}^2
10: Output: (mean_T(rV_RS), mean_T(rV_HBE))

Remark 2. We only show the procedure for choosing between RS and HBE with a specific hashing scheme. The same procedure can be used to evaluate a multitude of hashing schemes and select the best one for a given dataset.
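The diagnostic is easy to express on top of the variance_bound sketch from Section 1.4. The version below is a simplified sketch: amr_rs(q) is assumed to return the representative sample together with its kernel values w, weights u and the density estimate μ_0, choose_thresholds is a placeholder for the selection rule (46)–(47), and p(q, x) is the collision probability of the chosen hashing scheme; none of these names come from the released code.

import numpy as np

def diagnostic(queries, amr_rs, choose_thresholds, p, tau, eps):
    """Average relative variance of RS and HBE over random queries (Algorithm 2)."""
    rV_rs, rV_hbe = [], []
    for q in queries:
        S0, w, u, mu0 = amr_rs(q)                      # representative sample for q
        lam, L = choose_thresholds(w, u, mu0, eps)     # placeholder for (46)-(47)
        V_rs = variance_bound(w, u, lambda i, j: 1.0, mu0, lam, L)
        V_hbe = variance_bound(
            w, u,
            lambda i, j: min(p(q, S0[i]), p(q, S0[j])) / p(q, S0[i]) ** 2,
            mu0, lam, L)
        rV_rs.append(V_rs / max(mu0, tau) ** 2)
        rV_hbe.append(V_hbe / max(mu0, tau) ** 2)
    return float(np.mean(rV_rs)), float(np.mean(rV_hbe))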

3.2. Visualization procedure

We can use the information from our diagnostics to visualize what the dataset looks like by aggregating local information over random queries. For a set S, let r_S = min_{i∈S} log(1/k(x_i, q)) and R_S = max_{j∈S} log(1/k(x_j, q)). The basis of our visualization is the following fact:

Lemma 6. Let X be a random sample from S; then E[k^2(X, q)] ≤ exp(R_S − r_S) · μ_S^2.

Proof of Lemma 6. We have that e^{−R_S} ≤ k(x_i, q) ≤ e^{−r_S}, therefore

E[k^2(X, q)] = ∑_{i∈S} k^2(x_i, q) u_i   (49)
  ≤ e^{−r_S} ∑_{i∈S} k(x_i, q) u_i · (μ_S/μ_S)   (50)
  ≤ e^{R_S − r_S} μ_S^2,   (51)

where in the last step we used μ_S ≥ e^{−R_S}.

Thus, if we plot an annulus of width w_S = R_S − r_S, then e^{w_S} is an estimate of the relative variance for RS. The visualization procedure, given a sequence of T pairs (λ_t, L_t) for t ∈ [T] (produced by the diagnostic procedure), plots overlapping annuli around the origin representing the queries. Since the ratio max_{i,j∈S} k(x_i, q)/k(x_j, q) is often referred to as the condition number of the set S, we call our procedure the Log-Condition plot.

Algorithm 3 Log-Condition Plot

1: Input: (λ_t, L_t)_{t∈[T]}.
2: for t = 1, ..., T do   ▷ For each query
3:   r_t ← log(1/L_t)
4:   R_t ← log(1/λ_t)
5:   draw 2D-annulus(r_t, R_t)
6: Output: figure with overlapping annuli.

Remark 3. In the specific case of the Laplace (exponential) kernel, the radii we are plotting correspond to actual distances.
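Algorithm 3 amounts to drawing one translucent annulus per query. A short matplotlib sketch under these conventions (illustrative only; it uses a polar fill rather than explicit 2D-annulus patches):

import numpy as np
import matplotlib.pyplot as plt

def log_condition_plot(pairs, color="C0", alpha=0.05):
    """pairs: sequence of (lambda_t, L_t) from the diagnostic; one annulus per query."""
    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    theta = np.linspace(0.0, 2.0 * np.pi, 200)
    for lam, L in pairs:
        r_t, R_t = np.log(1.0 / L), np.log(1.0 / lam)   # inner and outer radii
        ax.fill_between(theta, r_t, R_t, color=color, alpha=alpha)
    return fig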

4. Synthetic benchmarks

In this section, we introduce a general procedure to create tunable synthetic datasets that exhibit different local structure around the query. We then show how to use this procedure as a building block to create two different families of instances with specific characteristics, aimed at testing kernel density evaluation methods.

4.1. (μ, D, n, s, d, σ)-Instance

Since the problem of kernel density evaluation is query dependent and the kernel typically depends only on the distance, we shall always assume that the query point is at the origin q = 0 ∈ R^d.

We further assume that the kernel is an invertible function of the distance, K(r) ∈ [0, 1], and let K^{-1}(μ) ∈ [0, ∞) be the inverse function. For example, the exponential kernel is given by K(r) = e^{−r} and its inverse is K^{-1}(μ) = log(1/μ).

Figure 2. A (μ = 0.01, D = 3, s = 4, d = 2, σ = 0.05)-instance. Each of the D = 3 directions is coded with a different color.

The dataset is created with points lying in D different directions and s distance scales (equally spaced between 0 and R = K^{-1}(μ)) such that the contribution from each direction and scale to the kernel density at the origin is equal. To achieve this, the number of points n_j placed at the j-th distance scale r_j is given by

n_j := ⌊n μ / K(r_j)⌋.   (52)

The reasoning behind this design choice is to ensure diversity in the distance scales that matter in the problem, so as not to favor a particular class of methods (e.g. random sampling, nearest-neighbor based). Also, placing points along the same direction makes the instance more difficult for HBE, as the variance in (6) increases with the ratio P[h(i)=h(j)=h(q)] / P[h(i)=h(q)]^2, which expresses how correlated the values h(i), h(j), h(q) are. We give an example visualization of such a dataset in 2 dimensions in Figure 2. The detailed procedure is described below (Algorithm 4).

Algorithm 4 (μ, D, n, s, d, σ)-Instance

1: Input: μ ∈ [1/n, 1], D ≥ 1, n ≥ 1, s ≥ 2, d ≥ 1, σ ≥ 0, kernel K, inverse K^{-1}.
2: R ← K^{-1}(μ), r_0 ← K^{-1}(1), P ← ∅.
3: for j = 0, ..., s − 1 do
4:   r_{j+1} ← ((R − r_0)/(s − 1)) · j + r_0   ▷ distances for each direction
5:   n_{j+1} ← ⌊n μ / K(r_{j+1})⌋   ▷ points at each distance
6: for i = 1, ..., D do
7:   v_i ← g_i/‖g_i‖ with g_i ∼ N(0, I_d)   ▷ random direction
8:   for j = 1, ..., s do   ▷ For each distance scale
9:     for ℓ = 1, ..., n_j do   ▷ generate a "cluster"
10:      g_{ijℓ} ∼ N(0, I_d)
11:      x_{ijℓ} ← r_j v_i + (σ r_j/√d) g_{ijℓ}
12:      P ← P ∪ {x_{ijℓ}}
13: Output: Set of points P
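A NumPy transcription of Algorithm 4, shown with the Gaussian kernel K(r) = e^{−r^2} as an example (any invertible radial kernel and its inverse can be passed in); the noise scaling on line 11 is implemented as we read it, σ r_j/√d per coordinate.

import numpy as np

def synthetic_instance(mu, D, n, s, d, sigma,
                       K=lambda r: np.exp(-r ** 2),
                       K_inv=lambda m: np.sqrt(np.log(1.0 / m)),
                       seed=0):
    """Generate a (mu, D, n, s, d, sigma)-instance around the query q = 0; returns an array."""
    rng = np.random.default_rng(seed)
    R, r0 = K_inv(mu), K_inv(1.0)
    radii = r0 + (R - r0) * np.arange(s) / (s - 1)          # s distance scales
    counts = np.floor(n * mu / K(radii)).astype(int)        # points per scale, eq. (52)
    points = []
    for _ in range(D):
        g = rng.standard_normal(d)
        v = g / np.linalg.norm(g)                           # random direction
        for r_j, n_j in zip(radii, counts):
            noise = rng.standard_normal((n_j, d)) * sigma * r_j / np.sqrt(d)
            points.append(r_j * v + noise)                  # "cluster" at distance r_j
    return np.vstack(points)

For example, the instance in Figure 2 corresponds to μ = 0.01, D = 3, s = 4, d = 2, σ = 0.05.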

Figure 3. The two families of instances for d = 2: (a) "worst-case" instance, (b) D-structured instance.

Remark 4. If D ≪ n this class of instances becomes highly structured, with a small number of tightly knit "clusters" (Figure 2). One would expect space-partitioning methods to perform well in this case. At the same time, by Lemma 4, these instances are the ones that maximize the variance of both HBE (s ≥ 1) and RS (s > 1).

Remark 5. On the other hand, if D ≫ n the instances become spread out (especially in high dimensions). Such instances are ideal for sampling-based methods when s = 1, and difficult for space-partitioning methods.

Based on the above remarks we propose the following subclasses of instances.

4.2. "Worst-case" instance

In order to create an instance that is hard for all methods, we take a union of the two extremes D ≪ n and D ≫ n. We call such instances "worst-case" as there does not seem to be a single type of structure that one can exploit, and they realize the worst-case variance bounds for both HBE and RS. In particular, if we want to generate an instance with N points, we first set D to a small constant and n = Θ(N), and take the union of such a dataset with another using D = Ω(N^{1−o(1)}) and n = O(N^{o(1)}). An example of such a dataset is given in Figure 3(a).

4.3. D-structured instance

"Worst-case" instances are meant to be difficult for any kernel evaluation method. In order to create instances with more varied structure, we use our basic method to create a single-parameter family of instances by fixing N, μ, σ, s, d and setting n = N/D. We call this family of instances D-structured. As one increases D, two things happen:

• The number of directions (clusters) increases.

• n = N/D decreases and hence certain distance scales disappear. By (52), if n μ < K(r_j), equivalently D > D_j := N μ/K(r_j), then distance scale j has no points assigned to it.

Hence, for this family, when D ≪ N/D ⟺ D ≪ √N the instances are highly structured and we expect space-partitioning methods to perform well.


Table 1. Preprocessing time (init) and total query time (query) on 10K random queries for additional datasets. All runtime measurements are reported in seconds.

Dataset   Time    RS     HBE   ASKIT    FigTree
higgs     init    0      141   25505    > 1 day
          query   6      18    1966     > 1 day
hep       init    0      138   23421    > 20 hours
          query   6      11    1581     > 1 day
susy      init    0      67    5326     3245
          query   18     12    > 9756   5392
home      init    0      11    237      7
          query   2369   17    376      33
mnist     init    0      211   14       437
          query   168    389   ?        1823

Table 2. Preprocessing time (in seconds) for the clustering test.

n      D        HBE   FigTree   ASKIT
500K   1        192   2         113
50K    10       20    3         105
5K     100      16    16        105
500    1000     19    174       104
50     10000    39    1516      102
5      100000   334   0.3       101

On the other extreme, as D increases and different distance scales start to die out (Figure 3(b)), the performance of random sampling keeps improving until there is only one (the outer) distance scale left, where random sampling is extremely efficient. On the other hand, HBEs have roughly similar performance at the two extremes, as both correspond to worst-case datasets for scale-free estimators with β = 1/2, and show slight improvement in between (1 ≪ D ≪ n). This picture is confirmed by our experiments.

5. Experiments

5.1. Datasets

We provide detailed descriptions of the datasets, as well as the bandwidth used for the kernel density evaluation, in Table 3. We also include specifications for the additional datasets acquired from LIBSVM (Chang & Lin, 2011) and the UCI Machine Learning Repository (Dheeru & Karra Taniskidou, 2017) that were used to evaluate the accuracy of the diagnostic procedure.

We selected the top eight datasets in Table 3 for density evaluation in the main paper as they are the largest, most complex datasets in our collection.

Figure 4. Results from the synthetic experiment (repeat of results in the main paper for easy reference): average query time (in seconds) as a function of the number of dimensions (left; FigTree, HBE, RS) and the number of clusters (right; FigTree, ASKIT, RS, HBE).

We provide additional density evaluation results in Table 1 for datasets with sizes or dimensions comparable to the ones reported in the main paper. For higgs and hep, FigTree failed to finish the evaluation within a day. Given the performance of RS on these datasets, we do not expect FigTree to achieve better performance even if the query returns successfully. For mnist, we were not able to get ASKIT to achieve relative error below 1 even after trying parameters spanning a few orders of magnitude; this is potentially caused by the high dimensionality and sparsity of this dataset.

5.2. Synthetic Experiment

For the clustering test, we set μ = 0.001, s = 4, d = 100, σ = 0.01, N = 500K. The varying parameters are the number of clusters (D) and the number of points per cluster (n). We report preprocessing time (in seconds) for all methods in Table 2. The ordering of methods according to preprocessing time largely follows that of query time.

As discussed in Section 4.3, for the D-structured instances, as the number of points per cluster n decreases, smaller distance scales start to disappear due to (52). Let D_i be the threshold such that for D > D_i there are no points at scale i. The corresponding numbers for our experiment are roughly D_1 = 500 (at distance 0), D_2 = 1077, D_3 = 10K, D_4 = N = 500K. In particular, only a single distance scale remains when D = 100K > D_3, a setup in which RS is orders of magnitude more efficient than the alternative methods (Figure 4, right).
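These thresholds follow directly from (52); the quick check below assumes the Gaussian kernel K(r) = e^{−r^2} and s = 4 distance scales equally spaced in [0, K^{-1}(μ)], which reproduces the quoted numbers up to rounding.

import numpy as np

N, mu, s = 500_000, 0.001, 4
R = np.sqrt(np.log(1.0 / mu))                  # K^{-1}(mu) for the Gaussian kernel
radii = R * np.arange(s) / (s - 1)             # 0, R/3, 2R/3, R
D_thresholds = N * mu / np.exp(-radii ** 2)    # D_i = N * mu / K(r_i)
print(np.round(D_thresholds))                  # approx. [500, 1077, 10772, 500000]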

5.3. Sketching Experiment

In this section, we evaluate the quality of the proposed hashing-based sketch. As baselines, we compare against sparse kernel approximation (SKA) (Cortes & Scott, 2017), the kernel herding algorithm (Herding) (Chen et al., 2010), and uniform sampling. To control for differences in complexity (Table 4), we compare the approximation error achieved by sketches of the same size (s) under the same compute budget (2n, where n is the dataset size). We describe the detailed setup below, including the modifications necessary to meet the computational constraints.


Table 3. Specifications of real-world datasets.

Dataset      N      d     σ      Description
census       2.5M   68    3.46   Samples from the 1990 US census.
TMY3         1.8M   8     0.43   Hourly energy load profiles for US reference buildings.
TIMIT        1M     440   10.97  Speech data for acoustic-phonetic studies. First 1M data points used.
SVHN         630K   3072  28.16  Google Street View house numbers. Raw pixel values of 32x32 images.
covertype    581K   54    2.25   Cartographic variables for predicting forest cover type.
MSD          463K   90    4.92   Audio features of popular songs.
GloVe        400K   100   4.99   Pre-trained word vectors from Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab).
ALOI         108K   128   3.89   Color image collection of 1000 small objects. Each image is represented by a 128-dimensional SIFT feature vector.
higgs        11M    28    3.41   Signatures of Higgs bosons from Monte Carlo simulations.
hep          10.5M  27    3.36   Signatures of high energy physics particles (Monte Carlo simulations).
susy         5M     18    2.24   Signatures of supersymmetric particles (Monte Carlo simulations).
home         969K   10    0.53   Home gas sensor measurements.
skin         245K   3     0.24   Skin Segmentation dataset.
ijcnn        142K   22    0.90   IJCNN 2001 Neural Network Competition.
acoustic     79K    50    1.15   Vehicle classification in distributed sensor networks.
mnist        70K    784   11.15  28x28 images of handwritten digits.
corel        68K    32    1.04   Image dataset, with color histograms as features.
sensorless   59K    48    2.29   Dataset for sensorless drive diagnosis.
codrna       59K    8     1.13   Detection of non-coding RNAs.
shuttle      44K    9     0.62   Space shuttle flight sensors.
poker        25K    10    1.37   Poker hand dataset.
cadata       21K    8     0.62   California housing prices.

Table 4. Overview of algorithm complexity and parameter choices for the sketching experiment (n: dataset size, s: sketch size, T: number of hash tables, m: sample size for herding).

Algorithm   Complexity            Parameters
HBS         O(n'T + s)            n' = 2n/5, T = 5
SKA         O(n_c s_c + s_c^3)    s_c = n^{1/3}, n_c = n^{2/3}
Herding     O(n_h m + n_h s)      m = s, n_h = n/s

HBS. For HBS, we used 5 hash tables, each hashing a subset of (2/5)n points in the dataset. In practice, we found that varying this small constant on the number of hash tables does not have a noticeable impact on performance.

SKA. SKA (Algorithm 5) produces the sketch by greedily finding s points in the dataset that minimize the maximum distance. The associated weights are given by solving an equation that involves the kernel matrix of the selected points. SKA's complexity O(ns + s^3) is dominated by the matrix inversion used to solve the kernel matrix equation. To ensure that SKA is able to match the sketch size of alternative methods under the compute budget of 2n, we augment SKA with random samples when necessary:

• If the target sketch size is smaller than n^{1/3} (s < n^{1/3}), we use SKA to produce a sketch of size s from a subsample of n/s data points.

• For s > n^{1/3}, we use SKA to produce a sketch of size s_c = n^{1/3} from a subsample of n_c = n^{2/3} data points. We match the difference in sketch size by taking an additional s − s_c random samples from the remaining n − n_c data points that were not used for the SKA sketch. The final estimate is a weighted average between the SKA sketch and the uniform sketch: 1/s_c for SKA and (1 − 1/s_c) for uniform, where the weights are determined by the size of the two sketches.

The modification uses SKA as a form of regularization on random samples. Since SKA iteratively selects points that are farthest away from the current set, the resulting sketch is helpful in predicting the "sparser" regions of the space. These sparser regions, in turn, are the ones that survive in the n^{2/3} random sample of the dataset (sub-sampling with probability n^{−1/3}); therefore SKA naturally includes points from "sparse" clusters of size Ω(n^{1/3}) in the original set.

Herding. The kernel Herding algorithm (Algorithm 6) first estimates the density of the dataset via random sampling; the sketch is then produced by iteratively selecting points with maximum residual density. The algorithm has a complexity of O(nm + ns), where m stands for the sample size used to produce the initial density estimates.

To keep Herding under the same 2n compute budget, we downsample the dataset to size n_h = n/s, and use m = s samples to estimate the initial density. This means that the larger the sketch size, the less accurate the initial density estimate. As a result, we observe degrading performance at larger sketch sizes s = Ω(√n).

Algorithm 5 Sparse Kernel Approximation (SKA)

1: Input: set P, kernel k, size s.
2: S = {x_1, ..., x_s} ← Greedy-kcenter(P, s)
3: K ∈ R^{s×s} with K_{ij} ← k(x_i, x_j) for x_i, x_j ∈ S.
4: y ∈ R^s with y_i ← KDE^w_P(x_i) for x_i ∈ S.
5: Let w be a solution to Kw = y.
6: Output: (S, w)
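A self-contained NumPy sketch of Algorithm 5, with the farthest-point (greedy k-center) step and a brute-force weighted KDE written out explicitly; P is assumed to be an (n, d) array and kernel a vectorized function of the distance, and in practice the linear solve may need regularization.

import numpy as np

def ska(P, kernel, s, u=None):
    """Sparse Kernel Approximation: greedy k-center points, weights from solving K w = y."""
    n = len(P)
    u = np.full(n, 1.0 / n) if u is None else np.asarray(u, dtype=float)
    centers = [0]                                         # farthest-point traversal
    dists = np.linalg.norm(P - P[0], axis=1)
    for _ in range(s - 1):
        centers.append(int(np.argmax(dists)))
        dists = np.minimum(dists, np.linalg.norm(P - P[centers[-1]], axis=1))
    S = P[centers]
    K = kernel(np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1))   # s x s kernel matrix
    y = np.array([np.sum(u * kernel(np.linalg.norm(P - x, axis=1))) for x in S])
    w = np.linalg.solve(K, y)                             # may be ill-conditioned in practice
    return S, w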

Algorithm 6 Approximate Kernel Herding (AKH)

1: Input: set P, kernel k, size s, samples m.
2: for i = 1, ..., |P| do
3:   P_i ← Random(P, m)   ▷ random set of m points
4:   d_i ← KDE_{P_i}(x_i)   ▷ estimate of the density
5: S_0 ← ∅   ▷ initialization
6: for t = 1, ..., s do
7:   j* ← arg max_{i∈[n]} {d_i − KDE_{S_{t−1}}(x_i)}   ▷ greedy
8:   S_t ← S_{t−1} ∪ {x_{j*}}   ▷ add point to the set
9: Output: (S_s, (1/s) 1_s)   ▷ return the sketch
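A corresponding NumPy sketch of Algorithm 6 under the same conventions; the residual update keeps a running (1/s)-weighted density of the selected points, which is one natural reading of the KDE_{S_{t−1}} term and is stated here as an assumption rather than the exact normalization used in our experiments.

import numpy as np

def akh(P, kernel, s, m, rng=np.random.default_rng(0)):
    """Approximate Kernel Herding: greedily pick points with maximum residual density."""
    n = len(P)
    sample = P[rng.choice(n, size=m, replace=False)]
    # initial density estimates from m random points (O(nm) kernel evaluations)
    d = np.array([np.mean(kernel(np.linalg.norm(sample - x, axis=1))) for x in P])
    residual = d.copy()
    S = []
    for _ in range(s):
        j = int(np.argmax(residual))                       # greedy step of line 7
        S.append(j)
        residual -= kernel(np.linalg.norm(P - P[j], axis=1)) / s
    return P[S], np.full(s, 1.0 / s)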

Results. Figure 5 reports the relative error on random queries (left), i.e. uniformly random points from the dataset, and on low-density queries (right), i.e. uniformly random points of the dataset with density around a threshold τ. Uniform, SKA and HBS achieve similar mean error on random queries, while the latter two have improved performance on low-density queries, with HBS performing slightly better than SKA. By design, HBS has performance similar to random sampling on average, but performs better in relatively "sparse" regions due to the theoretical guarantee that buckets with weight at least τ are sampled with high probability. Combining SKA with random samples initially results in performance degradation, but eventually acts as a form of regularization, improving upon random sampling. Kernel Herding is competitive only for a small number of points in the sketch.

Overhead reduction. In Table 5, we report the estimated preprocessing runtime reduction enabled by HBS for the density estimation results reported in Table 1 of the main paper. The estimates are calculated by dividing the number of data points hashed according to the original HBE procedure by the number of data points hashed after enabling HBS.

Figure 5. Sketching results on random (left) and low-density (right) queries for TMY3, covertype, ALOI, home, census and MSD: log(mean relative error) as a function of sketch size for Uniform, Herding, SKA and HBS.


Table 5. Estimated overhead reduction enabled by HBS.

Dataset            census   TMY3   TIMIT   SVHN   covertype   MSD    GloVe   ALOI
Reduction (est.)   958×     755×   821×    659×   554×        607×   595×    303×

5.4. Visualizations of real-world data sets

Our Log-Condition plots use circles with radius r to represent points with weights roughly e^{−r} (roughly at distance √r for the Gaussian kernel). The visualizations are generated by plotting overlapping annuli around the origin, each representing a random query from the dataset, such that the width of the annulus roughly corresponds to the log of the relative variance of random sampling.

We observe two distinctive types of visualizations. Datasets like census exhibit dense inner circles, meaning that a small number of points close to the query contribute significantly towards the density. To estimate the density accurately, one must sample from these small clusters, which HBE does better than RS. In contrast, datasets like MSD exhibit more weight on the outer circles, meaning that a large number of "far" points is the main source of density. Random sampling has a good chance of seeing these "far" points and therefore tends to perform better on such datasets. The top two plots in Figure 6 amplify these observations on synthetic datasets with highly clustered/scattered structures. RS performs better for all datasets in the right column except for SVHN.

References

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chen, Y., Welling, M., and Smola, A. Super-samples from Kernel Herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI'10, pp. 109–116, Arlington, Virginia, United States, 2010. AUAI Press. ISBN 978-0-9749039-6-5.

Cortes, E. C. and Scott, C. Sparse Approximation of a Kernel Mean. Trans. Sig. Proc., 65(5):1310–1323, March 2017. ISSN 1053-587X.

Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Figure 6. Visualizations of datasets (Log-Condition plots for # cluster=1, # cluster=100k, census, MSD, ALOI, GloVe, TMY3, TIMIT, covertype and SVHN). The top row shows two extreme cases of a highly clustered (# cluster=1) versus a highly scattered (# cluster=100k) dataset. RS performs better for all datasets in the right column except for SVHN.