arXiv:1803.08777v1 [cs.DS] 23 Mar 2018



Data Streams with Bounded Deletions

Rajesh Jayaram

Carnegie Mellon University

[email protected]

David P. Woodruff

Carnegie Mellon University

[email protected]

Abstract

Two prevalent models in the data stream literature are the insertion-only and turnstile models. Unfortunately, many important streaming problems require a $\Theta(\log(n))$ multiplicative factor more space for turnstile streams than for insertion-only streams. This complexity gap often arises because the underlying frequency vector $f$ is very close to 0, after accounting for all insertions and deletions to items. Signal detection in such streams is difficult, given the large number of deletions.

In this work, we propose an intermediate model which, given a parameter $\alpha \ge 1$, lower bounds the norm $\|f\|_p$ by a $1/\alpha$-fraction of the $L_p$ mass of the stream had all updates been positive. Here, for a vector $f$, $\|f\|_p = (\sum_{i=1}^n |f_i|^p)^{1/p}$, and the value of $p$ we choose depends on the application. This gives a fluid medium between insertion-only streams (with $\alpha = 1$) and turnstile streams (with $\alpha = \mathrm{poly}(n)$), and allows for analysis in terms of $\alpha$.

We show that for streams with this $\alpha$-property, for many fundamental streaming problems we can replace a $O(\log(n))$ factor in the space usage for algorithms in the turnstile model with a $O(\log(\alpha))$ factor. This is true for identifying heavy hitters, inner product estimation, $L_0$ estimation, $L_1$ estimation, $L_1$ sampling, and support sampling. For each problem, we give matching or nearly matching lower bounds for $\alpha$-property streams. We note that in practice, many important turnstile data streams are in fact $\alpha$-property streams for small values of $\alpha$. For such applications, our results represent significant improvements in efficiency for all the aforementioned problems.


Contents

1 Introduction
  1.1 Our Contributions
  1.2 Our Techniques
  1.3 Preliminaries

2 Frequency Estimation via Sampling
  2.1 Count-Sketch Sampling Simulator
  2.2 Inner Products

3 L1 Heavy Hitters

4 L1 Sampling
  4.1 The L1 Sampler

5 L1 Estimation
  5.1 Strict Turnstile L1 Estimation
  5.2 General Turnstile L1 Estimator

6 L0 Estimation
  6.1 Review of Unbounded Deletion Case
  6.2 Dealing With Small L0
  6.3 The Algorithm for α-Property Streams
  6.4 Our Constant Factor L0 Estimator for α-Property Streams

7 Support Sampling

8 Lower Bounds

9 Conclusion

A Sketch of L2 heavy hitters algorithm

1 Introduction

Data streams have become increasingly important in modern applications, where the sheer size of a dataset imposes stringent restrictions on the resources available to an algorithm. Examples of such datasets include internet traffic logs, sensor networks, financial transaction data, database logs, and scientific data streams (such as huge experiments in particle physics, genomics, and astronomy). Given their prevalence, there is a substantial body of literature devoted to designing extremely efficient one-pass algorithms for important data stream problems. We refer the reader to [9, 50] for surveys of these algorithms and their applications.

Formally, a data stream is given by an underlying vector $f \in \mathbb{R}^n$, called the frequency vector, which is initialized to $0^n$. The frequency vector then receives a stream of $m$ updates of the form $(i_t, \Delta_t) \in [n] \times \{-M,\dots,M\}$ for some $M > 0$ and $t \in [m]$. The update $(i_t, \Delta_t)$ causes the change $f_{i_t} \leftarrow f_{i_t} + \Delta_t$. For simplicity, we make the common assumption that $\log(mM) = O(\log(n))$, though our results generalize naturally to arbitrary $n, m$ [12].
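As an illustrative rendering of this update model in Python (all parameter values and names here are ours, chosen only for concreteness):

import random

n, m, M = 1000, 10000, 5
f = [0] * n  # the frequency vector, initialized to 0^n

def process_update(i, delta):
    # the update (i_t, Delta_t) causes f_{i_t} <- f_{i_t} + Delta_t
    f[i] += delta

for _ in range(m):
    process_update(random.randrange(n), random.randint(-M, M))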

Two well-studied models in the data stream literature are the insertion-only model and the turnstile model. In the former model, it is required that $\Delta_t > 0$ for all $t \in [m]$, whereas in the latter $\Delta_t$ can be any integer in $\{-M,\dots,M\}$. It is known that there are significant differences between these models. For instance, identifying an index $i \in [n]$ for which $|x_i| > \frac{1}{10}\sum_{j=1}^n |x_j|$ can be accomplished with only $O(\log(n))$ bits of space in the insertion-only model [10], but requires $\Omega(\log^2(n))$ bits in the turnstile model [38]. This $\log(n)$ gap between the complexity in the two models occurs in many other important streaming problems.

Due to this "complexity gap", it is perhaps surprising that no intermediate streaming model had been systematically studied in the literature before. For motivation on the usefulness of such a model, we note that nearly all of the lower bounds for turnstile streams involve inserting a large number of items before deleting nearly all of them [41, 38, 39, 37]. This behavior seems unlikely in practice, as the resulting norm $\|f\|_p$ becomes arbitrarily small regardless of the size of the stream. In this paper, we introduce a new model which avoids the lower bounds for turnstile streams by lower bounding the norm $\|f\|_p$. We remark that while similar notions of bounded deletion streams have been considered for their practical applicability [26] (see also [21], where a bound on the maximum number of edges that could be deleted in a graph stream was useful), to the best of our knowledge there is no comprehensive theoretical study of data stream algorithms in this setting.

Formally, we define the insertion vector $I \in \mathbb{R}^n$ to be the frequency vector of the substream of positive updates ($\Delta_t \ge 0$), and the deletion vector $D \in \mathbb{R}^n$ to be the entry-wise absolute value of the frequency vector of the substream of negative updates. Then $f = I - D$ by definition. Our model is as follows.

Definition 1. For $\alpha \ge 1$ and $p \ge 0$, a data stream $f$ satisfies the $L_p$ $\alpha$-property if $\|I + D\|_p \le \alpha\|f\|_p$.

For $p = 1$, the definition simply asserts that the final $L_1$ norm of $f$ must be no less than a $1/\alpha$ fraction of the total weight of updates in the stream, $\sum_{t=1}^m |\Delta_t|$. For strict turnstile streams, this is equivalent to the number of deletions being less than a $(1 - 1/\alpha)$ fraction of the number of insertions; hence, a bounded deletion stream.

For $p = 0$, the $\alpha$-property simply states that $\|f\|_0$, the number of non-zero coordinates at the end of the stream, is no smaller than a $1/\alpha$ fraction of the number of distinct elements seen in the stream (known as the $F_0$ of the stream). Importantly, note that in both cases this constraint need only hold at the time of query, and not necessarily at every point in the stream.
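A small Python sketch that computes the smallest $\alpha$ for which a given update sequence satisfies the $L_1$ and $L_0$ $\alpha$-properties at query time (the helper name is ours, for illustration only):

from collections import defaultdict

def alpha_values(updates):
    """updates: list of (i, delta) pairs. Returns (alpha_L1, alpha_L0)."""
    I = defaultdict(int)  # insertion vector
    D = defaultdict(int)  # deletion vector (absolute values)
    for i, delta in updates:
        if delta >= 0:
            I[i] += delta
        else:
            D[i] += -delta
    keys = set(I) | set(D)
    f = {i: I[i] - D[i] for i in keys}
    l1_f = sum(abs(v) for v in f.values())       # ||f||_1
    l1_ID = sum(I.values()) + sum(D.values())    # ||I + D||_1
    F0 = len(keys)                                # distinct elements seen
    L0 = sum(1 for v in f.values() if v != 0)     # non-zero coordinates
    alpha_l1 = l1_ID / l1_f if l1_f > 0 else float('inf')
    alpha_l0 = F0 / L0 if L0 > 0 else float('inf')
    return alpha_l1, alpha_l0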

Observe that for $\alpha = 1$ we recover the insertion-only model, whereas for $\alpha = mM$ in the $L_1$ case, or $\alpha = n$ in the $L_0$ case, we recover the turnstile model (with the minor exception of streams with $\|f\|_p = 0$). So $\alpha$-property streams are a natural parameterized intermediate model between the insertion-only and turnstile models.

Problem | Turnstile L.B. | α-Property U.B. | Citation | Notes
--- | --- | --- | --- | ---
$\epsilon$-Heavy Hitters | $\Omega(\epsilon^{-1}\log^2(n))$ | $O(\epsilon^{-1}\log(n)\log(\alpha))$ | [38] | Strict-turnstile, succeeds w.h.p.
$\epsilon$-Heavy Hitters | $\Omega(\epsilon^{-1}\log^2(n))$ | $O(\epsilon^{-1}\log(n)\log(\alpha))$ | [38] | General-turnstile, $\delta = O(1)$
Inner Product | $\Omega(\epsilon^{-1}\log(n))$ | $O(\epsilon^{-1}\log(\alpha))$ | Theorem 21 | General-turnstile
$L_1$ Estimation | $\Omega(\log(n))$ | $O(\log(\alpha))$ | Theorem 16 | Strict-turnstile
$L_1$ Estimation | $\Omega(\epsilon^{-2}\log(n))$ | $O(\epsilon^{-2}\log(\alpha) + \log(n))$ | [39] | General-turnstile
$L_0$ Estimation | $\Omega(\epsilon^{-2}\log(n))$ | $O(\epsilon^{-2}\log(\alpha) + \log(n))$ | [39] | General-turnstile
$L_1$ Sampling | $\Omega(\log^2(n))$ | $O(\log(n)\log(\alpha))$ | [38] | General-turnstile (*)
Support Sampling | $\Omega(\log^2(n))$ | $O(\log(n)\log(\alpha))$ | [41] | Strict-turnstile

Figure 1: The best known lower bounds (L.B.) for classic data stream problems in the turnstile model, along with the upper bounds (U.B.) for $\alpha$-property streams from this paper. The notes specify whether a U.B./L.B. pair applies to the strict or general turnstile model. For simplicity, we have suppressed $\log\log(n)$ and $\log(1/\epsilon)$ terms, and all results are for $\delta = O(1)$ failure probability, unless otherwise stated. (*) $L_1$ sampling note: strong $\alpha$-property, with $\epsilon = \Theta(1)$ for both U.B. and L.B.

For clarity, we use the term unbounded deletion stream to refer to a (general or strict) turnstile stream which does not satisfy the $\alpha$-property for any $\alpha = o(n)$.

For many applications of turnstile data streaming algorithms, the streams in question are in fact $\alpha$-property streams for small values of $\alpha$. For instance, in network traffic monitoring it is useful to estimate differences between network traffic patterns across distinct time intervals or routers [50]. If $f^1_i, f^2_i$ represent the number of packets sent between the $i$-th [source, destination] IP address pair in the first and second intervals (or routers), then the stream in question is $f^1 - f^2$. In realistic systems, the traffic behavior will not be identical across days or routers, and even differences as small as 0.1% in overall traffic behavior (i.e., $\|f^1 - f^2\|_1 > .001\|f^1 + f^2\|_1$) will result in $\alpha < 1000$ (which is significantly smaller than the theoretical universe size of $n \approx 2^{256}$ potential IP address pairs in IPv6).

A similar case for small $\alpha$ can be made for differences between streams whenever these differences are not arbitrarily small. This includes applications in streams of financial transactions, sensor network data, and telecommunication call records [34, 20], as well as for identifying newly trending search terms, detecting DDoS attacks, and estimating the spread of malicious worms [51, 24, 48, 44, 58].

A setting in which $\alpha$ is likely even smaller is database analytics. For instance, an important tool for database synchronization is Remote Differential Compression (RDC) [55, 3], which allows similar files to be compared between a client and server by transmitting only the differences between them. For files given by large data streams, one can feed these differences back into sketches of the file to complete the synchronization. Even if as much as half of the file must be resynchronized between client and server (an inordinately large fraction for typical RDC applications), streaming algorithms with $\alpha = 2$ would suffice to recover the data.

For $L_0$, there are important applications of streams with bounded ratio $\alpha = F_0/L_0$. For example, $L_0$ estimation is applicable to networks of cheap moving sensors, such as those monitoring wildlife


movement or water-flow patterns [34]. In such networks, some degree of clustering is expected (in regions of abundant food, or points of water-flow accumulation), and these clusters will be consistently occupied by sensors, resulting in a bounded ratio of inactive to active regions. Furthermore, in networking one often needs to estimate the number of distinct IP addresses with active network connections at a given time [27, 50]. Here we also observe some degree of clustering on high-activity IPs, with persistently active IPs likely resulting in an $L_0$ to $F_0$ ratio much larger than $1/n$ (where $n$ is the universe size of IP addresses).

In many of the above applications, $\alpha$ can even be regarded as a constant when compared with $n$. For such applications, the space improvements detailed in Figure 1 are considerable, and reduce the space complexity of the problems nearly or exactly to the known upper bounds for insertion-only streams [10, 49, 40, 38].

Finally, we remark that in some cases it is not unreasonable to assume that the magnitude of every coordinate is bounded by some fraction of the updates to it. For instance, in the case of RDC it seems likely that none of the files would be totally removed. We summarize this stronger guarantee as the strong $\alpha$-property.

Definition 2. For $\alpha \ge 1$, a data stream $f$ satisfies the strong $\alpha$-property if $I_i + D_i \le \alpha|f_i|$ for all $i \in [n]$.

Note that this property is strictly stronger than the $L_p$ $\alpha$-property for any $p \ge 0$. In particular, it forces $f_i \neq 0$ whenever $i$ is updated in the stream. In this paper, however, we focus primarily on the more general $\alpha$-property of Definition 1, and use $\alpha$-property to refer to Definition 1 unless explicitly stated otherwise. Nevertheless, we show that our lower bounds for $L_p$ heavy hitters, $L_1$ estimation, $L_1$ sampling, and inner product estimation all hold even for the more restricted strong $\alpha$-property streams.

1.1 Our Contributions

We show that for many well-studied streaming problems, we can replace a $\log(n)$ factor in algorithms for general turnstile streams with a $\log(\alpha)$ factor for $\alpha$-property streams. This is a significant improvement for small values of $\alpha$. Our upper bound results, along with the lower bounds for the unbounded deletion case, are given in Figure 1. Several of our results come from the introduction of a new data structure, CSSampSim (Section 2), which produces point queries for the frequencies $f_i$ with small additive error. Our improvements from CSSampSim and other $L_1$ problems are the result of new sampling techniques for $\alpha$-property streams.

While sampling of streams has been studied in many papers, most have been in the context of insertion-only streams (see, e.g., [15, 18, 19, 17, 23, 31, 42, 45, 57]). Notable examples of the use of sampling in the presence of deletions in a stream are [16, 30, 33, 28, 29]. We note that these works are concerned with unbiased estimators and do not provide the $(1\pm\epsilon)$-approximate relative error guarantees with small space that we obtain. They are also concerned with unbounded deletion streams, whereas our algorithms exploit the $\alpha$-property of the underlying stream to obtain considerable savings.

In addition to upper bounds, we give matching or nearly matching lower bounds in the $\alpha$-property setting for all the problems we consider. In particular, for the $L_1$-related problems (heavy hitters, $L_1$ estimation, $L_1$ sampling, and inner product estimation), we show that these lower bounds hold even for the stricter case of strong $\alpha$-property streams.

We also demonstrate that for general turnstile streams, obtaining a constant approximation of the $L_1$ still requires $\Omega(\log(n))$ bits for $\alpha$-property streams. For streams with unbounded deletions, there is an $\Omega(\epsilon^{-2}\log(n))$ lower bound for $(1\pm\epsilon)$-approximation [39]. Although we cannot remove this $\log n$ factor for $\alpha$-property streams, we are able to show an upper bound of $\tilde{O}(\epsilon^{-2}\log\alpha + \log n)$ bits of space for strong $\alpha$-property streams, where the $\tilde{O}$ notation hides $\log(1/\epsilon) + \log\log n$ factors. We thus separate the dependence of $\epsilon^{-2}$ and $\log n$ in the space complexity, illustrating an additional benefit of $\alpha$-property streams, and show a matching lower bound for strong $\alpha$-property streams.

1.2 Our Techniques

Our results for the $L_1$ streaming problems in Sections 2 to 5 are built on the observation that for $\alpha$-property streams, the number of insertions and deletions made to a given $i \in [n]$ are both upper bounded by $\alpha\|f\|_1$. We can then think of sampling updates to $i$ as a biased coin flip with bias at least $1/2 + (\alpha\|f\|_1)^{-1}$. By sampling $\mathrm{poly}(\alpha/\epsilon)$ stream updates and scaling up, we can recover the difference between the number of insertions and deletions up to an additive $\epsilon\|f\|_1$ error.

To exploit this fact, in Section 2 we introduce a data structure CSSampSim, inspired by the classic Countsketch of [14], which simulates running each row of Countsketch on a small uniform sample of stream updates. Our data structure does not correspond to a valid instantiation of Countsketch on any stream, since we sample different stream updates for different rows of Countsketch. Nevertheless, we show via a Bernstein inequality that our data structure obtains the Countsketch guarantee plus an $\epsilon\|f\|_1$ additive error, with only a logarithmic dependence on $\epsilon$ in the space. This results in more efficient algorithms for the $L_1$ heavy hitters problem (Section 3), and is also used in our $L_1$ sampling algorithm (Section 4). We are able to argue that the counters used in our algorithms can be represented with far fewer than $\log n$ bits because we sample a very small number of stream updates.

Additionally, we demonstrate that sampling $\mathrm{poly}(\alpha/\epsilon)$ updates preserves the inner product between $\alpha$-property streams $f, g$ up to an additive $\epsilon\|f\|_1\|g\|_1$ error. Then, by hashing the sampled universe down modulo a sufficiently large prime, we show that the inner product remains preserved, allowing us to estimate it in $O(\epsilon^{-1}\log(\alpha))$ space (Section 2.2).

Our algorithm for $L_1$ estimation (Section 5) utilizes our biased coin observation to show that sampling will recover the $L_1$ of a strict turnstile $\alpha$-property stream. To carry out the sampling in $o(\log(n))$ space, we give an alternate analysis of the well-known Morris counting algorithm, with better space but worse error bounds. This allows us to obtain a rough estimate of the position in the stream, so that elements can be sampled with the correct probability. For $L_1$ estimation in general turnstile streams, we analyze a virtual stream which corresponds to scaling our input stream by Cauchy random variables, argue that it still has the $\alpha$-property, and apply our sampling analysis for $L_1$ estimation to it.

Our results for the $L_0$ streaming problems in Sections 6 and 7 mainly exploit the $\alpha$-property in sub-sampling algorithms. Namely, many data structures for $L_0$ streaming problems subsample the universe $[n]$ at $\log(n)$ levels, corresponding to $\log(n)$ possible thresholds which could be $O(1)$-approximations of the $L_0$. If, however, an $O(1)$-approximation were known in advance, we could immediately subsample to this level and remove the $\log(n)$ factor from the space bound. For $\alpha$-property streams, we note that the non-decreasing values $F_0^t$ of $F_0$ after $t$ updates must be bounded in the interval $[L_0^t, O(\alpha)L_0]$. Thus, by employing an $O(1)$ estimator $R^t$ of $F_0^t$, we show that it suffices to subsample to only the $O(\log(\alpha/\epsilon))$ levels which are closest to $\log(R^t)$ at time $t$, from which our space improvement follows.

1.3 Preliminaries

If $g$ is any function of the updates of the stream, for $t \in [m]$ we write $g^t$ to denote the value of $g$ after the updates $(i_1,\Delta_1),\dots,(i_t,\Delta_t)$. For Sections 2 to 5, it will suffice to assume $\Delta_t \in \{-1,1\}$ for all $t \in [m]$. For general updates, we can implicitly consider them to be several consecutive updates in $\{-1,1\}$, and our analysis will hold in this expanded stream. This implicit expanding of updates is only necessary for our algorithms which sample updates with some probability $p$. If an update with $|\Delta_t| > 1$ arrives, we update our data structures with the value $\mathrm{sign}(\Delta_t)\cdot\mathrm{Bin}(|\Delta_t|, p)$, where $\mathrm{Bin}(n,p)$ is the binomial random variable on $n$ trials with success probability $p$, which has the same effect. In this unit-update setting, the $L_1$ $\alpha$-property reduces to $m \le \alpha\|f^m\|_1$, so this is the definition which we will use for the $L_1$ $\alpha$-property. We use the term unbounded deletion stream to refer to streams without the $\alpha$-property (or equivalently, streams that have the $\alpha = \mathrm{poly}(n)$-property and can also have all-0 frequencies). For Sections 2 to 5 we will consider only the $L_1$ $\alpha$-property, and thus drop the $L_1$ in these sections for simplicity; for Sections 6 and 7 we will consider only the $L_0$ $\alpha$-property, and similarly drop the $L_0$ there.
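A minimal Python sketch of this binomial update expansion (numpy's binomial sampler stands in for $\mathrm{Bin}(n,p)$; the helper name is ours):

import numpy as np

rng = np.random.default_rng()

def sample_update(delta, p):
    # An update (i, delta) with |delta| > 1 behaves as |delta| unit updates,
    # each kept with probability p: the sampled count is Bin(|delta|, p),
    # with the sign of delta reattached.
    return int(np.sign(delta)) * int(rng.binomial(abs(delta), p))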

We call a vector $y \in \mathbb{R}^n$ $k$-sparse if it has at most $k$ non-zero entries. For a given vector $f \in \mathbb{R}^n$, let $\mathrm{Err}^k_p(f) = \min_{y\ k\text{-sparse}} \|f - y\|_p$. In other words, $\mathrm{Err}^k_p(f)$ is the $p$-norm of $f$ with the $k$ heaviest entries removed. We call the argument minimizer of $\|f - y\|_p$ the best $k$-sparse approximation to $f$. Finally, we use the term with high probability (w.h.p.) to describe events that occur with probability $1 - n^{-c}$, where $c$ is a constant. Events occur with low probability (w.l.p.) if they are the complement of a w.h.p. event. We will often ignore the precise constant $c$ when it only factors into our memory usage as a constant.

2 Frequency Estimation via Sampling

In this section, we develop many of the tools needed for answering approximate queries about $\alpha$-property streams. Primarily, we develop a data structure CSSampSim, inspired by the classic Countsketch of [14], that computes frequency estimates of items in the stream by sampling. This data structure will immediately result in an improved heavy hitters algorithm in Section 3, and is at the heart of our $L_1$ sampler in Section 4. Throughout this section, we write $\alpha$-property to refer to the $L_1$ $\alpha$-property.

Firstly, for $\alpha$-property streams, the following observation is crucial to many of our results: given a fixed item $i \in [n]$, by sampling at least $\mathrm{poly}(\alpha/\epsilon)$ stream updates we can preserve $f_i$ (after scaling) up to an additive $\epsilon\|f\|_1$ error.

Lemma 1 (Sampling Lemma). Let $f$ be the frequency vector of a general turnstile stream with the $\alpha$-property, and let $f^*$ be the frequency vector of a uniformly sampled substream scaled up by $\frac{1}{p}$, where each update is sampled uniformly with probability $p > \alpha^2\epsilon^{-3}\log(\delta^{-1})/m$. Then with probability at least $1-\delta$, for $i \in [n]$ we have

$|f^*_i - f_i| < \epsilon\|f\|_1$

Moreover, we have $\sum_{i=1}^n f^*_i = \sum_{i=1}^n f_i \pm \epsilon\|f\|_1$.

Proof. Assume we sample each update to the stream independently with probability $p$. Let $f^+_i, f^-_i$ be the number of insertions and deletions of element $i$ respectively, so $f_i = f^+_i - f^-_i$. Let $X^+_j$ indicate that the $j$-th insertion to item $i$ is sampled. First, if $\epsilon\|f\|_1 < f^+_i$, then by Chernoff bounds:

$\Pr\Big[\Big|\frac{1}{p}\sum_{j=1}^{f^+_i} X^+_j - f^+_i\Big| \ge \epsilon\|f\|_1\Big] \le 2\exp\Big(-\frac{pf^+_i(\epsilon\|f\|_1)^2}{3(f^+_i)^2}\Big) \le \exp(-p\epsilon^3 m/\alpha^2)$

where the last inequality holds because $f^+_i \le m \le \alpha\|f\|_1$. Taking $p \ge \alpha^2\log(1/\delta)/(\epsilon^3 m)$ gives the desired probability $\delta$. Now if $\epsilon\|f\|_1 \ge f^+_i$, then $\Pr[\frac{1}{p}\sum_{j=1}^{f^+_i} X^+_j \ge f^+_i + \epsilon\|f\|_1] \le \exp(-pf^+_i\epsilon\|f\|_1/f^+_i) \le \exp(-p\epsilon m/\alpha) \le \delta$ for the same value of $p$. Applying the same argument to $f^-_i$, we obtain $f^*_i = f^+_i - f^-_i \pm 2\epsilon\|f\|_1 = f_i \pm 2\epsilon\|f\|_1$ as needed after rescaling $\epsilon$. For the final statement, we can consider all updates to the stream as being made to a single element $i$, and then simply apply the same argument given above.
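A small empirical illustration of the Sampling Lemma in Python (the stream parameters are ours and chosen only to exhibit the effect):

import random

def sampled_estimate(updates, p):
    """Uniformly sample unit updates with probability p and rescale by 1/p."""
    est = {}
    for i, delta in updates:  # assume delta in {-1, +1}
        if random.random() < p:
            est[i] = est.get(i, 0) + delta / p
    return est

# a stream with alpha ~ 2: 3m/4 insertions and m/4 deletions to item 0
m = 200000
updates = [(0, 1)] * (3 * m // 4) + [(0, -1)] * (m // 4)
random.shuffle(updates)
est = sampled_estimate(updates, p=0.01)
print(est.get(0, 0.0), "vs true value", m // 2)  # close up to ~eps * ||f||_1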


CSSampSim (CSSS)

Input: sensitivity parameters $k \ge 1$, $\epsilon \in (0,1)$.

1. Set $S = \Theta(\frac{\alpha^2}{\epsilon^2}T^2\log(n))$, where $T \ge 4/\epsilon^2 + \log(n)$.
2. Instantiate a $d \times 6k$ Countsketch table $A$, for $d = O(\log(n))$. For each table entry $a_{ij} \in A$, store two values $a^+_{ij}$ and $a^-_{ij}$, both initialized to 0.
3. Select 4-wise independent hash functions $h_i : [n] \to [6k]$ and $g_i : [n] \to \{1,-1\}$, for $i \in [d]$.
4. Set $p \leftarrow 0$, and start a $\log(n)$-bit counter to store the position in the stream.
5. On update $(i_t, \Delta_t)$:
   (a) If $t = 2^rS + 1$ for any $r \ge 1$, then for every entry $a_{ij} \in A$ set $a^+_{ij} \leftarrow \mathrm{Bin}(a^+_{ij}, 1/2)$, $a^-_{ij} \leftarrow \mathrm{Bin}(a^-_{ij}, 1/2)$, and $p \leftarrow p + 1$.
   (b) Sample $(i_t, \Delta_t)$ with probability $2^{-p}$. If sampled, then for $i \in [d]$:
       i. if $\Delta_t g_i(i_t) > 0$, set $a^+_{i,h_i(i_t)} \leftarrow a^+_{i,h_i(i_t)} + \Delta_t g_i(i_t)$,
       ii. else set $a^-_{i,h_i(i_t)} \leftarrow a^-_{i,h_i(i_t)} + |\Delta_t g_i(i_t)|$.
6. On query for $f_j$: return $y^*_j = \mathrm{median}\{2^p g_i(j)\cdot(a^+_{i,h_i(j)} - a^-_{i,h_i(j)}) \mid i \in [d]\}$.

Figure 2: Our data structure to simulate running Countsketch on a uniform sample of the stream.
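To make Figure 2 concrete, the following is a minimal Python simulation of CSSS. It is an illustrative sketch under our own wiring (the class name, the degree-3 polynomial hash helper, and the use of ordinary Python integers in place of small bit-bounded counters are all ours), not a bit-level implementation of the stated space bound:

import random
from statistics import median

MERSENNE = (1 << 61) - 1  # prime modulus for the hash family

def fourwise(out_range):
    # degree-3 polynomial over a prime field: a 4-wise independent family
    coeffs = [random.randrange(MERSENNE) for _ in range(4)]
    def h(x):
        v = 0
        for c in coeffs:
            v = (v * x + c) % MERSENNE
        return v % out_range
    return h

class CSSS:
    """Simulates Countsketch on a uniform sample of the stream (Figure 2)."""
    def __init__(self, k, d, S):
        self.k, self.d, self.S = 6 * k, d, S
        self.pos = [[0] * self.k for _ in range(d)]  # the a^+ counters
        self.neg = [[0] * self.k for _ in range(d)]  # the a^- counters
        self.h = [fourwise(self.k) for _ in range(d)]
        bits = [fourwise(2) for _ in range(d)]
        self.g = [lambda x, b=b: 1 - 2 * b(x) for b in bits]  # signs in {+1,-1}
        self.p = 0  # current sampling probability is 2^{-p}
        self.t = 0  # position in the stream

    def update(self, i, delta):  # delta in {-1, +1}
        self.t += 1
        if self.t == (1 << (self.p + 1)) * self.S + 1:
            # stream length doubled: binomially halve every counter, halve rate
            for table in (self.pos, self.neg):
                for row in table:
                    for j in range(self.k):
                        row[j] = sum(random.random() < 0.5 for _ in range(row[j]))
            self.p += 1
        if random.random() < 2.0 ** (-self.p):
            for r in range(self.d):
                v = delta * self.g[r](i)
                if v > 0:
                    self.pos[r][self.h[r](i)] += v
                else:
                    self.neg[r][self.h[r](i)] += -v

    def query(self, j):
        return median((2 ** self.p) * self.g[r](j) *
                      (self.pos[r][self.h[r](j)] - self.neg[r][self.h[r](j)])
                      for r in range(self.d))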


2.1 Count-Sketch Sampling Simulator

We now describe the Countsketch algorithm of [14], which is a simple yet classic algorithm in the data stream literature. For a parameter $k$ and $d = O(\log(n))$, it creates a $d \times 6k$ matrix $A$ initialized to 0, and for every row $i \in [d]$ it selects hash functions $h_i : [n] \to [6k]$ and $g_i : [n] \to \{1,-1\}$ from 4-wise independent uniform hash families. On every update $(i_t,\Delta_t)$, it hashes this update into every row $i \in [d]$ by updating $a_{i,h_i(i_t)} \leftarrow a_{i,h_i(i_t)} + g_i(i_t)\Delta_t$. It then outputs $y^*$, where $y^*_j = \mathrm{median}\{g_i(j)a_{i,h_i(j)} \mid i \in [d]\}$. The guarantee of one row of the Countsketch is as follows.
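For comparison, a compact sketch of this classic Countsketch (reusing the fourwise helper defined in the CSSS sketch above; a sketch only, with no attention to space optimization):

from statistics import median

class CountSketch:
    def __init__(self, k, d):
        self.cols = 6 * k
        self.A = [[0] * self.cols for _ in range(d)]
        self.h = [fourwise(self.cols) for _ in range(d)]
        bits = [fourwise(2) for _ in range(d)]
        self.g = [lambda x, b=b: 1 - 2 * b(x) for b in bits]

    def update(self, i, delta):
        # hash the update into every row, with its sign
        for r in range(len(self.A)):
            self.A[r][self.h[r](i)] += self.g[r](i) * delta

    def query(self, j):
        return median(self.g[r](j) * self.A[r][self.h[r](j)]
                      for r in range(len(self.A)))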

Lemma 2. Let $f \in \mathbb{R}^n$ be the frequency vector of any general turnstile stream hashed into a $d \times 6k$ Countsketch table. Then with probability at least 2/3, for a given $j \in [n]$ and row $i \in [d]$ we have $|g_i(j)a_{i,h_i(j)} - f_j| < k^{-1/2}\mathrm{Err}^k_2(f)$. It follows that if $d = O(\log(n))$, then with high probability, for all $j \in [n]$ we have $|y^*_j - f_j| < k^{-1/2}\mathrm{Err}^k_2(f)$, where $y^*$ is the estimate of Countsketch. The space required is $O(k\log^2(n))$ bits.

We now introduce a data structure which simulates running Countsketch on a uniform sample of the stream of size $\mathrm{poly}(\alpha\log(n)/\epsilon)$. The full data structure is given in Figure 2. Note that for a fixed row, each update is chosen with probability at least $2^{-p} \ge S/(2m) = \Omega(\alpha^2T^2\log(n)/(\epsilon^2 m))$. We will use this fact to apply Lemma 1 with $\epsilon' = \epsilon/T$ and $\delta = 1/\mathrm{poly}(n)$. The parameter $T$ will be $\mathrm{poly}(\log(n)/\epsilon)$, and we introduce it as a new symbol purely for clarity.

Now the updates to each row in CSSS are sampled independently from the other rows, thus CSSS may not represent running Countsketch on a single valid sample. However, each row independently


contains the result of running a row of Countsketch on a valid sample of the stream. Since the Countsketch guarantee holds with probability 2/3 for each row, and we simply take the median of $O(\log(n))$ rows to obtain high probability, it follows that the output of CSSS will still satisfy an additive error bound w.h.p. if each row also satisfies that error bound with probability 2/3.

By Lemma 1 with sensitivity parameter $\epsilon/T$, we know that we preserve the weight of all items $f_i$ with $|f_i| \ge \frac{1}{T}\|f\|_1$ up to a $(1\pm\epsilon)$ factor w.h.p. after sampling and rescaling. For all smaller elements, however, we obtain error additive in $\epsilon\|f\|_1/T$. This gives rise to a natural division of the coordinates of $f$. Let $\mathrm{big} \subset [n]$ be the set of $i$ with $|f_i| \ge \frac{1}{T}\|f\|_1$, and let $\mathrm{small} \subset [n]$ be the complement. Let $E \subset [n]$ be the top $k$ heaviest coordinates in $f$, and let $s \in \mathbb{R}^n$ be a fixed sample vector of $f$ after rescaling by $p^{-1}$. Since $\mathrm{Err}^k_2(s) = \min_{y\ k\text{-sparse}}\|s-y\|_2$, it follows that $\mathrm{Err}^k_2(s) \le \|s_{\mathrm{big}\setminus E}\|_2 + \|s_{\mathrm{small}}\|_2$. Furthermore, by Lemma 1, $\|s_{\mathrm{big}\setminus E}\|_2 \le (1+\epsilon)\|f_{\mathrm{big}\setminus E}\|_2 \le (1+\epsilon)\mathrm{Err}^k_2(f)$. So it remains to upper bound $\|s_{\mathrm{small}}\|_2^2$, which we do in the following technical lemma.

The intuition for the lemma is that $\|f_{\mathrm{small}}\|_2$ is maximized when all the $L_1$ weight is concentrated in $T$ elements, thus $\|f_{\mathrm{small}}\|_2 \le (T(\|f\|_1/T)^2)^{1/2} = \|f\|_1/T^{1/2}$. By the $\alpha$-property, we know that the number of insertions made to the elements of small is bounded by $\alpha\|f\|_1$. Thus, computing the variance of $\|s_{\mathrm{small}}\|_2^2$ and applying Bernstein's inequality, we obtain a similar upper bound for $\|s_{\mathrm{small}}\|_2$.

Lemma 3. If $s$ is the rescaled frequency vector resulting from uniformly sampling with probability $p \ge S/(2m)$, where $S = \Omega(\alpha^2T^2\log(n))$ for $T = \Omega(\log(n))$, from a general turnstile stream $f$ with the $\alpha$-property, then we have $\|s_{\mathrm{small}}\|_2 \le 2T^{-1/2}\|f\|_1$ with high probability.

Proof. Fix $f \in \mathbb{R}^n$, and assume that $\|f\|_1 > S$ (otherwise we could just sample all the updates, and our counters in Countsketch would never exceed $\log(\alpha S)$). For any $i \in [n]$, let $f_i = f^+_i - f^-_i$, so that $f = f^+ - f^-$, and let $s_i$ be our rescaled sample of $f_i$. By definition, for each $i \in \mathrm{small}$ we have $f_i < \frac{1}{T}\|f\|_1$. Thus the quantity $\|f_{\mathrm{small}}\|_2^2$ is maximized by having $T$ coordinates equal to $\frac{1}{T}\|f\|_1$, so $\|f_{\mathrm{small}}\|_2^2 \le T(\frac{\|f\|_1}{T})^2 \le \frac{\|f\|_1^2}{T}$. Now note that if we condition on the fact that $s_i \le \frac{2}{T}\|f\|_1$ for all $i \in \mathrm{small}$, which occurs with probability greater than $1 - n^{-5}$ by Lemma 1, then since $E[\sum_i |s_i|^4] = O(n^4)$, all of the following expectations change by a factor of at most $(1 \pm 1/n)$ under this conditioning. Thus we can safely ignore the conditioning and analyze $E[\|s_{\mathrm{small}}\|_2^2]$ and $E[\|s_{\mathrm{small}}\|_4^4]$ without it, but use it when applying Bernstein's inequality later.

We have $|s_i| = |\frac{1}{p}\sum_{j=1}^{f^+_i+f^-_i} X_{ij}|$, where $X_{ij}$ is an indicator random variable that is $\pm 1$ if the $j$-th update to $f_i$ is sampled (the sign depending on whether the update was an insertion or a deletion), and 0 otherwise, and $p \ge S/(2m)$ is the probability of sampling an update. Then $E[|s_i|] = E[|\frac{1}{p}\sum_{j=1}^{f^+_i+f^-_i} X_{ij}|] = |f_i|$. Furthermore, we have

$E[s_i^2] = \frac{1}{p^2}E\Big[\Big(\sum_{j=1}^{f^+_i+f^-_i} X_{ij}\Big)^2\Big] = \frac{1}{p^2}\Big(\sum_{j=1}^{f^+_i+f^-_i} E[X_{ij}^2] + \sum_{j_1\neq j_2} E[X_{ij_1}X_{ij_2}]\Big) = \frac{1}{p}(f^+_i+f^-_i) + \big(f^+_i(f^+_i-1) + f^-_i(f^-_i-1) - 2f^+_if^-_i\big)$

Substituting $f^-_i = f^+_i - f_i$ in part of the above equation gives $E[s_i^2] = \frac{1}{p}(f^+_i+f^-_i) + f_i^2 + f_i - 2f^+_i \le \frac{1}{p}(f^+_i+f^-_i) + 2f_i^2$. So $E[\sum_{i\in\mathrm{small}} s_i^2] \le \frac{1}{p}(\|f^+_{\mathrm{small}}\|_1 + \|f^-_{\mathrm{small}}\|_1) + 2\|f_{\mathrm{small}}\|_2^2$, which is at most $\frac{\alpha}{p}\|f\|_1 + 2\frac{\|f\|_1^2}{T}$ by the $\alpha$-property of $f$ and the upper bound on $\|f_{\mathrm{small}}\|_2^2$. Now $E[s_i^4] = \frac{1}{p^4}E[(\sum_{j=1}^{f^+_i+f^-_i} X_{ij})^4]$, which we can write as

$\frac{1}{p^4}E\Big[\sum_j X_{ij}^4 + 4\sum_{j_1\neq j_2} X_{ij_1}^3X_{ij_2} + 12\sum_{j_1,j_2,j_3\ \mathrm{distinct}} X_{ij_1}^2X_{ij_2}X_{ij_3} + \sum_{j_1,j_2,j_3,j_4\ \mathrm{distinct}} X_{ij_1}X_{ij_2}X_{ij_3}X_{ij_4}\Big] \quad (1)$

We analyze Equation (1) term by term. First note that $E[\sum_j X_{ij}^4] = p(f^+_i+f^-_i)$, and $E[\sum_{j_1\neq j_2} X_{ij_1}^3X_{ij_2}] = p^2\big(f^+_i(f^+_i-1) + f^-_i(f^-_i-1) - 2f^+_if^-_i\big)$. Substituting $f^-_i = f^+_i - f_i$, we obtain $E[\sum_{j_1\neq j_2} X_{ij_1}^3X_{ij_2}] \le 2p^2f_i^2$. For the third term we have $E[\sum_{j_1\neq j_2\neq j_3} X_{ij_1}^2X_{ij_2}X_{ij_3}] = p^3\big(f^+_i(f^+_i-1)(f^+_i-2) + f^-_i(f^-_i-1)(f^-_i-2) + f^+_if^-_i(f^-_i-1) + f^-_if^+_i(f^+_i-1) - 2f^+_if^-_i(f^+_i-1) - 2f^+_if^-_i(f^-_i-1)\big)$, which after the same substitution is upper bounded by $10p^3\max\{f^+_i, |f_i|\}f_i^2 \le 10p^3\alpha\|f\|_1f_i^2$, where the last inequality follows from the $\alpha$-property of $f$. Finally, the last term is $E[\sum_{j_1\neq j_2\neq j_3\neq j_4} X_{ij_1}X_{ij_2}X_{ij_3}X_{ij_4}] = p^4\big(f^+_i(f^+_i-1)(f^+_i-2)(f^+_i-3) + f^-_i(f^-_i-1)(f^-_i-2)(f^-_i-3) + 6f^+_i(f^+_i-1)f^-_i(f^-_i-1) - 4(f^+_i(f^+_i-1)(f^+_i-2)f^-_i + f^-_i(f^-_i-1)(f^-_i-2)f^+_i)\big)$. Making the same substitution allows us to bound this above by $p^4(24f_i^4 - 12f_if^+_i + 12(f^+_i)^2) \le p^4(36f_i^4 + 24(f^+_i)^2)$.

Now $f$ satisfies the $\alpha$-property, so $\|f^+_{\mathrm{small}}\|_1 + \|f^-_{\mathrm{small}}\|_1 \le \alpha\|f\|_1$; summing the bounds from the last paragraph over all $i \in \mathrm{small}$, we obtain

$E[\|s_{\mathrm{small}}\|_4^4] \le \frac{1}{p^3}\alpha\|f\|_1 + \frac{8}{p^2}\|f_{\mathrm{small}}\|_2^2 + \frac{120\alpha}{p}\|f_{\mathrm{small}}\|_2^2\|f\|_1 + 36\|f_{\mathrm{small}}\|_4^4 + 24\alpha^2\|f\|_1^2$

Now $\frac{1}{p} \le \frac{2m}{S} \le \frac{2\alpha}{S}\|f\|_1$ by the $\alpha$-property. Applying this with the fact that $\|f_{\mathrm{small}}\|_2^2$ is maximized at $\frac{\|f\|_1^2}{T}$, and similarly $\|f_{\mathrm{small}}\|_4^4$ is maximized at $T(\|f\|_1/T)^4 = \|f\|_1^4/T^3$, we have that $E[\|s_{\mathrm{small}}\|_4^4]$ is at most

$\frac{8\alpha^4}{S^3}\|f\|_1^4 + \frac{36\alpha^2}{S^2T}\|f\|_1^4 + \frac{240\alpha^2}{ST}\|f\|_1^4 + \frac{36}{T^3}\|f\|_1^4 + 24\alpha^2\|f\|_1^2$

Since we have $\|f\|_1 > S > \alpha^2T^2$, the above expectation is upper bounded by $\frac{300}{T^3}\|f\|_1^4$. We now apply Bernstein's inequality. Using the conditioning from earlier, we have the upper bound $s_i \le 2\|f\|_1/T$; plugging this into Bernstein's inequality yields:

$\Pr\Big[\big|\|s_{\mathrm{small}}\|_2^2 - E[\|s_{\mathrm{small}}\|_2^2]\big| > \frac{\|f\|_1^2}{T}\Big] \le \exp\Big(-\frac{\|f\|_1^4/(2T^2)}{300\|f\|_1^4/T^3 + 2\|f\|_1^3/(3T^2)}\Big) \le \exp(-T/604)$

Finally, $T = \Omega(\log(n))$, so the above probability is $\mathrm{poly}(\frac{1}{n})$ for $T = c\log(n)$ and a sufficiently large constant $c$. Since the expectation $E[\|s_{\mathrm{small}}\|_2^2]$ is at most $\frac{\alpha}{p}\|f\|_1 + 2\|f\|_1^2/T \le 3\|f\|_1^2/T$, it follows that $\|s_{\mathrm{small}}\|_2 \le \frac{2\|f\|_1}{\sqrt{T}}$ with high probability, which is the desired result.
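For reference, the form of Bernstein's inequality used in the last step (a standard fact, stated here for completeness): if $Z_1,\dots,Z_N$ are independent random variables with $|Z_i - E[Z_i]| \le M$ almost surely, then

\[ \Pr\Big[\Big|\sum_{i=1}^N Z_i - E\Big[\sum_{i=1}^N Z_i\Big]\Big| > t\Big] \;\le\; \exp\Big(\frac{-t^2/2}{\sum_{i=1}^N \mathrm{Var}(Z_i) + Mt/3}\Big) \]

Above it is applied with $Z_i = s_i^2$ for $i \in \mathrm{small}$, $t = \|f\|_1^2/T$, and $\sum_i \mathrm{Var}(Z_i) \le E[\|s_{\mathrm{small}}\|_4^4] \le 300\|f\|_1^4/T^3$.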

Applying the result of Lemma 3, along with the bound on $\mathrm{Err}^k_2(s)$ from the previous paragraphs, we obtain the following corollary.

Corollary 1. With high probability, if $s$ is as in Lemma 3, then $\mathrm{Err}^k_2(s) \le (1+\epsilon)\mathrm{Err}^k_2(f) + 2T^{-1/2}\|f\|_1$.

Now we analyze the error from CSSS. Observe that each row in CSSS contains the result of hashing a uniform sample into $6k$ buckets. Let $s^i \in \mathbb{R}^n$ be the frequency vector, after scaling by $1/p$, of the sample hashed into the $i$-th row of CSSS, and let $y^i \in \mathbb{R}^n$ be the estimate of $s^i$ taken from the $i$-th row of CSSS. Let $\sigma : [n] \to [O(\log(n))]$ be such that $\sigma(i)$ is the row from which Countsketch returns its estimate for $f_i$, meaning $y^*_i = y^{\sigma(i)}_i$.


Theorem 1. For $\epsilon > 0$ and $k > 1$, with high probability, when run on a general turnstile stream $f \in \mathbb{R}^n$ with the $\alpha$-property, CSSS with $6k$ columns and $O(\log(n))$ rows returns $y^*$ such that, for every $i \in [n]$, we have

$|y^*_i - f_i| \le 2\Big(\frac{1}{k^{1/2}}\mathrm{Err}^k_2(f) + \epsilon\|f\|_1\Big)$

It follows that if $y$ is the best $k$-sparse approximation to $y^*$, then $\mathrm{Err}^k_2(f) \le \|f - y\|_2 \le 5(k^{1/2}\epsilon\|f\|_1 + \mathrm{Err}^k_2(f))$ with high probability. The space required is $O(k\log(n)\log(\alpha\log(n)/\epsilon))$ bits.

Proof. Set $S = \Theta(\frac{\alpha^2}{\epsilon^2}T^2\log(n))$ as in Figure 2, and set $T = 4/\epsilon^2 + O(\log(n))$. Fix any $i \in [n]$. Now CSSS samples updates uniformly with probability $p > S/(2m)$, so applying Lemma 1 to our sample $s^j_i$ of $f_i$ for each row $j$ and union bounding, with high probability we have $s^j_i = f_i \pm \frac{\epsilon}{T}\|f\|_1 = s^q_i$ for all rows $j, q$. Then by the Countsketch guarantee of Lemma 2, for each row $q$ we have $y^q_i = f_i \pm (\frac{\epsilon}{T}\|f\|_1 + k^{-1/2}\max_j \mathrm{Err}^k_2(s^j))$ with probability 2/3. Thus $y^*_i = y^{\sigma(i)}_i = f_i \pm (\frac{\epsilon}{T}\|f\|_1 + k^{-1/2}\max_j \mathrm{Err}^k_2(s^j))$ with high probability. Now noting that $\sqrt{T} \ge 2/\epsilon$, we apply Corollary 1 to $\mathrm{Err}^k_2(s^j)$ and union bound over all $j \in [O(\log(n))]$ to obtain $\max_j \mathrm{Err}^k_2(s^j) \le (1+\epsilon)\mathrm{Err}^k_2(f) + \epsilon\|f\|_1$ w.h.p., and union bounding over all $i \in [n]$ gives $|y^*_i - f_i| \le 2(\frac{1}{k^{1/2}}\mathrm{Err}^k_2(f) + \epsilon\|f\|_1)$ for all $i \in [n]$ with high probability.

For the second claim, note that $\mathrm{Err}^k_2(f) \le \|f - f'\|_2$ for any $k$-sparse vector $f'$, from which the first inequality follows. Now if the top $k$ coordinates are the same in $y^*$ as in $f$, then $\|f - y\|_2$ is at most $\mathrm{Err}^k_2(f)$ plus the $L_2$ error from estimating the top $k$ elements, which is at most $(4k(\epsilon\|f\|_1 + k^{-1/2}\mathrm{Err}^k_2(f))^2)^{1/2} \le 2(k^{1/2}\epsilon\|f\|_1 + \mathrm{Err}^k_2(f))$. In the worst case, the top $k$ coordinates of $y^*$ are disjoint from the top $k$ in $f$. Applying the triangle inequality, the error is at most the error on the top $k$ coordinates of $y^*$ plus the error on the top $k$ in $f$. Thus $\|f - y\|_2 \le 5(k^{1/2}\epsilon\|f\|_1 + \mathrm{Err}^k_2(f))$, as required.

For the space bound, note that the Countsketch table has $O(k\log(n))$ entries, each of which stores two counters which hold $O(S)$ samples in expectation. So the counters never exceed $\mathrm{poly}(S) = \mathrm{poly}(\frac{\alpha}{\epsilon}\log(n))$ w.h.p. by Chernoff bounds, and so can be stored using $O(\log(\alpha\log(n)/\epsilon))$ bits each (we can simply terminate if a counter gets too large).

We now address how the error term of Theorem 1 can be estimated so as to bound the potential error. This will be necessary for our $L_1$ sampling algorithm. We first state the following well-known fact about norm estimation [56], and give a proof for completeness.

Lemma 4. Let $R \in \mathbb{R}^k$ be a row of the Countsketch matrix with $k$ columns, run on a stream with frequency vector $f$. Then with probability 99/100, we have $\sum_{i=1}^k R_i^2 = (1 \pm O(k^{-1/2}))\|f\|_2^2$.

Proof. Let $\mathbb{1}(E)$ be the indicator function that is 1 if the event $E$ is true, and 0 otherwise. Let $h : [n] \to [k]$ and $g : [n] \to \{1,-1\}$ be 4-wise independent hash functions which specify the row of Countsketch. Then $E[\sum_{i=1}^k R_i^2] = E[\sum_{j=1}^k (\sum_{i=1}^n \mathbb{1}(h(i)=j)g(i)f_i)^2] = E[\sum_{i=1}^n f_i^2] + E[\sum_{j=1}^k \sum_{i_1 \neq i_2} \mathbb{1}(h(i_1)=j)\mathbb{1}(h(i_2)=j)g(i_1)g(i_2)f_{i_1}f_{i_2}]$. By the 2-wise independence of $g$, the second quantity is 0, and we are left with $\|f\|_2^2$. By a similar computation using the full 4-wise independence, we can show that $\mathrm{Var}(\sum_{i=1}^k R_i^2) = 2(\|f\|_2^4 - \|f\|_4^4)/k \le 2\|f\|_2^4/k$. Then by Chebyshev's inequality, we obtain $\Pr[|\sum_{i=1}^k R_i^2 - \|f\|_2^2| > 10\sqrt{2}\|f\|_2^2/\sqrt{k}] < 1/100$, as needed.

Lemma 5. For $k > 1$ and $\epsilon \in (0,1)$, given an $\alpha$-property stream $f$, there is an algorithm that can produce an estimate $v$ such that $\mathrm{Err}^k_2(f) \le v \le 45k^{1/2}\epsilon\|f\|_1 + 20\mathrm{Err}^k_2(f)$ with high probability. The space required is the space needed to run two instances of CSSS with parameters $k, \epsilon$.

Proof. By Lemma 4, the $L_2$ norm of row $i$ of CSSS with a constant number of columns will be a $(1 \pm 1/2)$ approximation of $\|s^i\|_2$ with probability 99/100, where $s^i$ is the scaled up sampled vector corresponding to row $i$. Our algorithm is to run two copies of CSSS on the entire stream, say CSSS1 and CSSS2, with $k$ columns and sensitivity $\epsilon$. At the end of the stream we compute $y^*$ and $y$ from CSSS1, where $y$ is the best $k$-sparse approximation to $y^*$. We then feed $-y$ into CSSS2. The resulting $L_2$ of the $i$-th row of CSSS2 is $(1\pm 1/2)\|s^i_2 - y\|_2$ (after rescaling of $s^i_2$) with probability 99/100, where $s^i_2$ is the sample corresponding to the $i$-th row of CSSS2.

Now let $T = 4/\epsilon^2$ as in Theorem 1, and let $s^i$ be the sample corresponding to the $i$-th row of CSSS1. Then for any $i$, $\|s^i_{\mathrm{small}}\|_2 \le 2T^{-1/2}\|f\|_1$ w.h.p. by Lemma 3, and the fact that $\|f_{\mathrm{small}}\|_2 \le T^{-1/2}\|f\|_1$ follows from the definition of small. Furthermore, by Lemma 1, we know w.h.p. that $|s^i_j - f_j| < \epsilon|f_j|$ for all $j \in \mathrm{big}$, and thus $\|s^i_{\mathrm{big}} - f_{\mathrm{big}}\|_2 \le \epsilon\|f\|_1$. Then by the reverse triangle inequality we have $|\|s^i - y\|_2 - \|f - y\|_2| \le \|s^i - f\|_2 \le \|s^i_{\mathrm{small}} - f_{\mathrm{small}}\|_2 + \|s^i_{\mathrm{big}} - f_{\mathrm{big}}\|_2 \le 5\epsilon\|f\|_1$. Thus $\|s^i - y\|_2 = \|f - y\|_2 \pm 5\epsilon\|f\|_1$ for all rows $i$ w.h.p., so the $L_2$ of the $i$-th row of CSSS2 at the end of the algorithm is a value $v_i$ such that $\frac{1}{2}(\|f-y\|_2 - 5\epsilon\|f\|_1) \le v_i \le 2(\|f-y\|_2 + 5\epsilon\|f\|_1)$ with probability 9/10 (note that the bounds do not depend on $i$). Taking $v = 2\cdot\mathrm{median}(v_1,\dots,v_{O(\log(n))}) + 5\epsilon\|f\|_1$, it follows that $\|f-y\|_2 \le v \le 4\|f-y\|_2 + 25\epsilon\|f\|_1$ with high probability. Applying the upper and lower bounds on $\|f-y\|_2$ given in Theorem 1 yields the desired result. The space required is the space needed to run two instances of CSSS, as stated.
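To make the two-sketch procedure concrete, here is a rough Python sketch built on the hypothetical CSSS class from Section 2.1. The function name, the re-feeding of $-y$ as unit updates, and the assumption that $\|f\|_1$ (l1) is known or estimated separately are all our illustrative simplifications:

import math
from statistics import median

def residual_estimate(stream, k, d, S, eps, n, l1):
    # run two copies of CSSS over the entire stream
    cs1, cs2 = CSSS(k, d, S), CSSS(k, d, S)
    for (i, delta) in stream:
        cs1.update(i, delta)
        cs2.update(i, delta)
    # y = best k-sparse approximation to the estimate y* from cs1
    ystar = {i: cs1.query(i) for i in range(n)}
    topk = sorted(range(n), key=lambda i: -abs(ystar[i]))[:k]
    # feed -y into cs2 (expanded into unit updates for this sketch)
    for i in topk:
        v = int(round(ystar[i]))
        for _ in range(abs(v)):
            cs2.update(i, -1 if v > 0 else 1)
    # median over rows of the rescaled row L2 norms of cs2
    rows = []
    for r in range(cs2.d):
        sq = sum((2 ** cs2.p * (cs2.pos[r][j] - cs2.neg[r][j])) ** 2
                 for j in range(cs2.k))
        rows.append(math.sqrt(sq))
    return 2 * median(rows) + 5 * eps * l1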

2.2 Inner Products

Given two streams $f, g \in \mathbb{R}^n$, the problem of estimating the inner product $\langle f,g\rangle = \sum_{i=1}^n f_ig_i$ has attracted considerable attention in the streaming community for its usage in estimating the size of join and self-join relations for databases [5, 53, 52]. For unbounded deletion streams, to obtain an $\epsilon\|f\|_1\|g\|_1$-additive error approximation, the best known result requires $O(\epsilon^{-1}\log(n))$ bits with a constant probability of success [22]. We show in Theorem 21 that $\Omega(\epsilon^{-1}\log(\alpha))$ bits are required even when $f, g$ are strong $\alpha$-property streams. This also gives a matching $\Omega(\epsilon^{-1}\log(n))$ lower bound for the unbounded deletion case.

In Theorem 2, we give a matching upper bound for $\alpha$-property streams, up to $\log\log(n)$ and $\log(1/\epsilon)$ terms. We first prove the following technical lemma, which shows that inner products are preserved under a sufficiently large sample.

Lemma 6. Let $f, g \in \mathbb{R}^n$ be two $\alpha$-property streams with lengths $m_f, m_g$ respectively. Let $f', g' \in \mathbb{R}^n$ be unscaled uniform samples of $f$ and $g$, sampled with probability $p_f \ge s/m_f$ and $p_g \ge s/m_g$ respectively, where $s = \Omega(\alpha^2/\epsilon^2)$. Then with probability 99/100, we have $\langle p_f^{-1}f', p_g^{-1}g'\rangle = \langle f, g\rangle \pm \epsilon\|f\|_1\|g\|_1$.

Proof. We have $E[\langle p_f^{-1}f', p_g^{-1}g'\rangle] = \sum_i E[p_f^{-1}f'_i]E[p_g^{-1}g'_i] = \langle f,g\rangle$. Now the $(f'_ig'_i)$'s are independent, and thus we need only compute $\mathrm{Var}(p_f^{-1}p_g^{-1}f'_ig'_i)$. We have $\mathrm{Var}(p_f^{-1}p_g^{-1}f'_ig'_i) = (p_f^{-1}p_g^{-1})^2E[(f'_i)^2]E[(g'_i)^2] - (f_ig_i)^2$. Let $X_{i,j}$ be $\pm 1$ if the $j$-th update to $f_i$ is sampled, where the sign indicates whether the update was an insertion or deletion. Let $f^+, f^-$ be the insertion and deletion vectors of $f$ respectively (see Definition 1), and let $F = f^+ + f^-$. Then $f = f^+ - f^-$, and $f'_i = \sum_{j=1}^{f^+_i+f^-_i} X_{ij}$. We have

$E[(f'_i)^2/p_f^2] = E\Big[p_f^{-2}\Big(\sum_{j=1}^{f^+_i+f^-_i} X_{ij}^2 + \sum_{j_1\neq j_2} X_{ij_1}X_{ij_2}\Big)\Big] \le p_f^{-1}F_i + (f^+_i)^2 + (f^-_i)^2 - 2f^+_if^-_i = p_f^{-1}F_i + f_i^2$

Now define $g^+, g^-$ and $G = g^+ + g^-$ similarly. Note that the $\alpha$-property states that $m_f = \|F\|_1 \le \alpha\|f\|_1$ and $m_g = \|G\|_1 \le \alpha\|g\|_1$. Moreover, it gives $p_f^{-1} \le m_f/s \le \epsilon^2\|f\|_1/\alpha$ and $p_g^{-1} \le m_g/s \le \epsilon^2\|g\|_1/\alpha$. Then $\mathrm{Var}(\langle p_f^{-1}f', p_g^{-1}g'\rangle) = \sum_{i=1}^n \mathrm{Var}(p_f^{-1}p_g^{-1}f'_ig'_i) \le \sum_{i=1}^n (f_i^2p_g^{-1}G_i + g_i^2p_f^{-1}F_i + p_f^{-1}p_g^{-1}F_iG_i)$. First note that $\sum_{i=1}^n p_g^{-1}f_i^2G_i \le p_g^{-1}\|G\|_1\|f\|_1^2 \le \epsilon^2\|g\|_1^2\|f\|_1^2$, and similarly $\sum_{i=1}^n p_f^{-1}g_i^2F_i \le \epsilon^2\|g\|_1^2\|f\|_1^2$. Finally, we have $\sum_{i=1}^n p_f^{-1}p_g^{-1}F_iG_i \le p_f^{-1}p_g^{-1}\|F\|_1\|G\|_1 \le \epsilon^4\|f\|_1^2\|g\|_1^2$, so the variance is at most $3\epsilon^2\|f\|_1^2\|g\|_1^2$. Then by Chebyshev's inequality:

$\Pr\big[\big|\langle p_f^{-1}f', p_g^{-1}g'\rangle - \langle f,g\rangle\big| > 30\epsilon\|f\|_1\|g\|_1\big] \le 1/100$

and rescaling $\epsilon$ by a constant gives the desired result.
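An illustrative brute-force check of Lemma 6 in Python (this simulates the sampling directly; it is not the small-space algorithm):

import random

def sampled_inner_product(f_updates, g_updates, pf, pg, n):
    # unscaled samples f', g', rescaled by 1/p_f and 1/p_g as they accumulate
    fp, gp = [0.0] * n, [0.0] * n
    for i, d in f_updates:
        if random.random() < pf:
            fp[i] += d / pf
    for i, d in g_updates:
        if random.random() < pg:
            gp[i] += d / pg
    return sum(fp[i] * gp[i] for i in range(n))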

Our algorithm obtains a sample of the size needed for Lemma 6 by sampling in exponentially increasing intervals. Next, we hash the sampled universe down modulo a sufficiently large prime to avoid collisions in the samples, and then run an inner product estimator in the smaller universe. Note that this hashing is not pairwise independent, as that would require the space to be at least $\log n$ bits; rather, the hashing just has the property that it preserves distinctness with good probability. We now prove a fact that we will need for our desired complexity.

Lemma 7. Given a $\log(n)$-bit integer $x$, the value $x \pmod p$ can be computed using only $\log\log(n) + \log(p)$ bits of space.

Proof. We initialize a counter $c \leftarrow 0$. Let $x_1, x_2, \dots, x_{\log(n)}$ be the bits of $x$, where $x_1$ is the least significant bit. We set $y_1 = 1$, and at every step $t \in [\log(n)]$ of our algorithm we store $y_t, y_{t-1}$, where we compute $y_t$ as $y_t = 2y_{t-1} \pmod p$. This can be done in $O(\log(p))$ space. Our algorithm then takes $\log(n)$ steps, where on the $t$-th step it checks if $x_t = 1$; if so, it updates $c \leftarrow c + y_t \pmod p$, and otherwise leaves $c$ unchanged. At the end we have $c = \sum_{t=1}^{\log(n)} 2^{t-1}x_t \pmod p = x \pmod p$, as desired; $c$ never exceeds $2p$, and can be stored using $\log(p)$ bits of space. The only other value stored is the index $t \in [\log(n)]$, which can be stored using $\log\log(n)$ bits, as stated.
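A Python rendering of the procedure in Lemma 7, keeping only the counter, the running power of two, and the step index as working state (function name ours):

def mod_p_streaming(bits, p):
    """bits: iterator over the bits of x, least significant first.
    Computes x mod p using O(log p) bits of working state plus the index."""
    c, y = 0, 1  # y = 2^{t-1} mod p at step t
    for b in bits:
        if b:
            c = (c + y) % p
        y = (2 * y) % p
    return c

# x = 0b101101 = 45, p = 7  ->  45 mod 7 = 3
print(mod_p_streaming([1, 0, 1, 1, 0, 1], 7))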

We now recall the Countsketch data structure (see Section 2.1). Countsketch picks a 2-wise independent hash function $h : [n] \to [k]$ and a 4-wise independent hash function $\sigma : [n] \to \{1,-1\}$, and creates a vector $A \in \mathbb{R}^k$. Then on every update $(i_t,\Delta_t)$ to a stream $f \in \mathbb{R}^n$, it updates $A_{h(i_t)} \leftarrow A_{h(i_t)} + \sigma_{i_t}\Delta_t$, where $\sigma_i = \sigma(i)$. Countsketch can be used to estimate inner products. In the following lemma, we show that if uniform samples of $\alpha$-property streams $f, g$ are fed into two such instances of Countsketch, denoted $A$ and $B$, then $\langle A,B\rangle$ is a good approximation of $\langle f,g\rangle$ after rescaling $A$ and $B$ by the inverse of the sampling probability.

Lemma 8. Let $f', g'$ be uniformly sampled rescaled vectors of general-turnstile $\alpha$-property streams $f, g$ with lengths $m_f, m_g$ and sampling probabilities $p_f, p_g$ respectively. Suppose that $p_f \ge s/m_f$ and $p_g \ge s/m_g$, where $s = \Omega(\alpha^2\log^7(n)\epsilon^{-10})$. Then let $A \in \mathbb{R}^k$ be the Countsketch vector run on $f'$, and let $B \in \mathbb{R}^k$ be the Countsketch vector run on $g'$, where $A, B$ share the same hash function $h$ and $k = \Theta(1/\epsilon)$. Then

$\sum_{i=1}^k A_iB_i = \langle f,g\rangle \pm \epsilon\|f\|_1\|g\|_1$

with probability 11/13.

Proof. Set $k = 100^2/\epsilon$. Let $Y_{ij}$ indicate $h(i) = h(j)$, and let $X_{ij} = \sigma_i\sigma_jf'_ig'_jY_{ij}$. We have $\sum_{i=1}^k A_iB_i = \langle f',g'\rangle + \sum_{i\neq j} X_{ij}$. Let $\epsilon_0 = \Theta(\epsilon^3/\log^2(n))$ and let $T = \Theta(\log(n)/\epsilon^2)$, so we can write $s = \Omega(\alpha^2\log(n)T^2/\epsilon_0^2)$ (this will align our notation with that of Lemma 3). Define $\mathcal{F} = \{i \in [n] : |f_i| \ge \|f\|_1/T\}$ and $\mathcal{G} = \{i \in [n] : |g_i| \ge \|g\|_1/T\}$. We now bound the term $|\sum_{i\neq j, (i,j)\in\mathcal{F}\times\mathcal{G}} X_{ij}| \le \sum_{(i,j)\in\mathcal{F}\times\mathcal{G}} Y_{ij}|f'_ig'_j|$.

Now by Lemma 1 and union bounding over all $i \in [n]$, we have $\|f' - f\|_\infty \le \epsilon_0\|f\|_1/T$ and $\|g' - g\|_\infty \le \epsilon_0\|g\|_1/T$ with high probability, and we condition on this now. Thus for every $(i,j) \in \mathcal{F}\times\mathcal{G}$, we have $f'_i = (1\pm\epsilon_0)f_i$ and $g'_j = (1\pm\epsilon_0)g_j$. It follows that $\sum_{(i,j)\in\mathcal{F}\times\mathcal{G}} Y_{ij}|f'_ig'_j| \le 2\sum_{(i,j)\in\mathcal{F}\times\mathcal{G}} Y_{ij}|f_i||g_j|$ (using only $\epsilon_0 < 1/4$). Since $E[Y_{ij}] = 1/k$, we have

$E\Big[\sum_{(i,j)\in\mathcal{F}\times\mathcal{G}} Y_{ij}|f_i||g_j|\Big] \le \frac{1}{k}\|f\|_1\|g\|_1$

By Markov's inequality with $k = 100^2/\epsilon$, it follows that $|\sum_{i\neq j,(i,j)\in\mathcal{F}\times\mathcal{G}} X_{ij}| \le 2\sum_{(i,j)\in\mathcal{F}\times\mathcal{G}} Y_{ij}|f_i||g_j| \le \epsilon\|f\|_1\|g\|_1$ with probability greater than $1 - (1/1000 + O(n^{-c})) > 99/100$, where the $O(n^{-c})$ comes from conditioning on the high probability events of Lemma 1. Call this event $\mathcal{E}_1$.

Now let $\mathcal{F}^C = [n]\setminus\mathcal{F}$ and $\mathcal{G}^C = [n]\setminus\mathcal{G}$. Let $\mathcal{A} \subset [n]^2$ be the set of all $(i,j)$ with $i \neq j$ and such that either $i \in \mathcal{F}^C$ or $j \in \mathcal{G}^C$. Now consider the variables $\{X_{ij}\}_{i<j}$, and let $X_{ij}, X_{pq}$ be two such distinct variables. Then $E[X_{ij}X_{pq}] = E[f'_ig'_jf'_pg'_qY_{ij}Y_{pq}\sigma_i\sigma_j\sigma_p\sigma_q]$. Now the variables $\sigma$ are independent from $Y_{ij}Y_{pq}$, which are determined by $h$. Since $i<j$ and $p<q$ and $(i,j)\neq(p,q)$, it follows that one of $i, j, p, q$ is unique among them. WLOG it is $i$, so by the 4-wise independence of $\sigma$ we have $E[f'_ig'_jf'_pg'_qY_{ij}Y_{pq}\sigma_i\sigma_j\sigma_p\sigma_q] = E[f'_ig'_jf'_pg'_qY_{ij}Y_{pq}\sigma_j\sigma_p\sigma_q]E[\sigma_i] = 0 = E[X_{ij}]E[X_{pq}]$. Thus the variables $\{X_{ij}\}_{i<j}$ (and symmetrically $\{X_{ij}\}_{i>j}$) are uncorrelated, so

$\mathrm{Var}\Big(\sum_{i<j,(i,j)\in\mathcal{A}} X_{ij}\Big) = \sum_{i<j,(i,j)\in\mathcal{A}} \mathrm{Var}(X_{ij}) \le \sum_{(i,j)\in\mathcal{A}} (f'_ig'_j)^2/k$

Since $E[\sum_{i<j,(i,j)\in\mathcal{A}} X_{ij}] = 0$, by Chebyshev's inequality with $k = 100^2/\epsilon$, we have $|\sum_{i<j,(i,j)\in\mathcal{A}} X_{ij}| \le (\epsilon\sum_{(i,j)\in\mathcal{A}}(f'_ig'_j)^2)^{1/2}$ with probability 99/100. So by the union bound and a symmetric argument for $j > i$, we have $|\sum_{(i,j)\in\mathcal{A}} X_{ij}| \le |\sum_{i<j,(i,j)\in\mathcal{A}} X_{ij}| + |\sum_{i>j,(i,j)\in\mathcal{A}} X_{ij}| \le 2(\epsilon\sum_{(i,j)\in\mathcal{A}}(f'_ig'_j)^2)^{1/2}$ with probability $1 - 2/100 = 49/50$. Call this event $\mathcal{E}_2$. We have

$\sum_{(i,j)\in\mathcal{A}} (f'_ig'_j)^2 = \sum_{i\neq j,\ (i,j)\in\mathcal{F}\times\mathcal{G}^C} (f'_ig'_j)^2 + \sum_{i\neq j,\ (i,j)\in\mathcal{F}^C\times\mathcal{G}} (f'_ig'_j)^2 + \sum_{i\neq j,\ (i,j)\in\mathcal{F}^C\times\mathcal{G}^C} (f'_ig'_j)^2$

Now the last term is at most $\|f'_{\mathcal{F}^C}\|_2^2\|g'_{\mathcal{G}^C}\|_2^2$, which is at most $16\|f\|_1^2\|g\|_1^2/T^2 \le \epsilon^4\|f\|_1^2\|g\|_1^2$ w.h.p. by Lemma 3 applied to both $f$ and $g$ and union bounding (note that $\mathcal{F}^C, \mathcal{G}^C$ are exactly the set small in Lemma 3 for their respective vectors). We hereafter condition on this w.h.p. event that $\|f'_{\mathcal{F}^C}\|_2^2 \le 4\|f\|_1^2/T$ and $\|g'_{\mathcal{G}^C}\|_2^2 \le 4\|g\|_1^2/T$ (note $T > 1/\epsilon^2$).

Now as noted earlier, w.h.p. we have $f'_i = (1\pm\epsilon_0)f_i$ and $g'_j = (1\pm\epsilon_0)g_j$ for $i\in\mathcal{F}$ and $j\in\mathcal{G}$. Thus $\|f'_\mathcal{F}\|_2^2 = (1\pm O(\epsilon_0))\|f_\mathcal{F}\|_2^2 < 2\|f\|_1^2$ and $\|g'_\mathcal{G}\|_2^2 \le 2\|g\|_1^2$. Now the term $\sum_{i\neq j,(i,j)\in\mathcal{F}\times\mathcal{G}^C}(f'_ig'_j)^2$ is at most $\sum_{i\in\mathcal{F}}(f'_i)^2\|g'_{\mathcal{G}^C}\|_2^2 \le O(\epsilon^2)\|f'_\mathcal{F}\|_2^2\|g\|_1^2 \le O(\epsilon^2)\|f\|_1^2\|g\|_1^2$. Applying a symmetric argument, we obtain $\sum_{i\neq j,(i,j)\in\mathcal{F}^C\times\mathcal{G}}(f'_ig'_j)^2 \le O(\epsilon^2)\|f\|_1^2\|g\|_1^2$ with high probability. Thus each of the three terms is $O(\epsilon^2)\|f\|_1^2\|g\|_1^2$, so

$\sum_{(i,j)\in\mathcal{A}} (f'_ig'_j)^2 = O(\epsilon^2)\|f\|_1^2\|g\|_1^2$

with high probability. Call this event $\mathcal{E}_3$. Now by the union bound, $\Pr[\mathcal{E}_1\cap\mathcal{E}_2\cap\mathcal{E}_3] > 1 - (1/50 + 1/100 + O(n^{-c})) > 24/25$. Conditioned on this, the error term $|\sum_{i\neq j} X_{ij}| \le |\sum_{i\neq j,(i,j)\in\mathcal{F}\times\mathcal{G}} X_{ij}| + |\sum_{(i,j)\in\mathcal{A}} X_{ij}|$ is at most $\epsilon\|f\|_1\|g\|_1 + 2(\epsilon\sum_{(i,j)\in\mathcal{A}}(f'_ig'_j)^2)^{1/2} = \epsilon\|f\|_1\|g\|_1 + 2(O(\epsilon^3)\|f\|_1^2\|g\|_1^2)^{1/2} = O(\epsilon)\|f\|_1\|g\|_1$, as desired. Thus with probability 24/25 we have $\sum_{i=1}^k A_iB_i = \langle f',g'\rangle \pm O(\epsilon)\|f\|_1\|g\|_1$. By Lemma 6, noting that in the notation of this lemma the vectors $f', g'$ have already been scaled by $p_f^{-1}$ and $p_g^{-1}$, we have $\langle f',g'\rangle = \langle f,g\rangle \pm \epsilon\|f\|_1\|g\|_1$ with probability 99/100. Altogether, with probability $1 - (1/25 + 1/100) > 11/13$ we have $\sum_{i=1}^k A_iB_i = \langle f,g\rangle \pm O(\epsilon)\|f\|_1\|g\|_1$, which is the desired result after a constant rescaling of $\epsilon$.

We now apply Lemma 8 to complete the proof of our main theorem.

Theorem 2. Given two general-turnstile stream vectors $f, g$ with the $\alpha$-property, there is a one-pass algorithm which with probability 11/13 produces an estimate $\mathrm{IP}(f,g)$ such that $\mathrm{IP}(f,g) = \langle f,g\rangle \pm O(\epsilon)\|f\|_1\|g\|_1$, using $O(\epsilon^{-1}\log(\frac{\alpha\log(n)}{\epsilon}))$ bits of space.

Proof. Let $s = \Theta(\alpha^2\log^7(n)/\epsilon^{10})$, and let $I_r = [s^r, s^{r+2}]$. Then for every $r = 1,2,\dots,\log_s(n)$, we choose a random prime $P \in [D, D^3]$, where $D = 100s^4$. We then do the following for the stream $f$ (and apply the same procedure for $g$). For every interval $I_r$ and time step $t \in I_r$, we sample the $t$-th update $(i_t,\Delta_t)$ to $f$ with probability $s^{-r}$ (we assume $\Delta_t \in \{1,-1\}$; bigger updates are expanded into multiple such unit updates). If sampled, we let $i'_t$ be the $\log(|P|)$-bit identity obtained by taking $i_t$ modulo $P$. We then feed the unscaled update $(i'_t, \Delta_t)$ into an instance of Countsketch with $k = O(1/\epsilon)$ buckets. Call this instance $\mathrm{CS}^f_r$. At any time step $t$, we only store the instances $\mathrm{CS}^f_r$ and $\mathrm{CS}^f_{r+1}$ such that $t \in I_r \cap I_{r+1}$. At the end of the stream the time step is $m_f$; fix $r$ such that $m_f \in I_r \cap I_{r+1}$.

Now let $f' \in \mathbb{R}^n$ be the (scaled up by $s^r$) sample of $f$ taken in $I_r$. Let $f'' \in \mathbb{R}^P$ be the (unscaled) stream on $|P|$ items that is actually fed into $\mathrm{CS}^f_r$. Then $\mathrm{CS}^f_r$ is run on the stream $f''$, which has a universe of $|P|$ items. Let $F \in \mathbb{R}^k$ be the Countsketch vector from the instance $\mathrm{CS}^f_r$.

Let $\bar{f}$ be the frequency vector $f$ restricted to the suffix of the stream $s^r, s^r+1, \dots, m_f$ (these were the updates that were being sampled from while running $\mathrm{CS}_r$). Since $m_f \in I_{r+1}$, we have $m_f \ge s^{r+1}$, so $\|f^{(s^r)}\|_1 \le m_f/s < \epsilon\|f\|_1$ (by the $\alpha$-property), meaning the $L_1$ mass of the prefix $1,2,\dots,s^r$ of the stream is an $\epsilon$ fraction of the whole stream $L_1$, so removing it changes the $L_1$ by at most $\epsilon\|f\|_1$. It follows that $\|f - \bar{f}\|_1 \le O(\epsilon)\|f\|_1$, and thus $\|\bar{f}\|_1 = (1\pm O(\epsilon))\|f\|_1$. If we let $\bar{g}$ be defined analogously by replacing $f$ with $g$ in the previous paragraphs, we obtain $\|g - \bar{g}\|_1 \le O(\epsilon)\|g\|_1$ and $\|\bar{g}\|_1 = (1\pm O(\epsilon))\|g\|_1$ as well.

Now with high probability we sampled fewer than $2s^{r+2}/s^r = 2s^2$ distinct identities when creating $f'$, so $f'$ is $2s^2$-sparse. Let $J \subset [n]\times[n]$ be the set of pairs of indices $i, j$ with $f'_i, f'_j$ non-zero. Then $|J| < 2s^4$. Let $Q_{i,j}$ be the event that $i - j = 0 \pmod P$. For this to happen, $P$ must divide the difference. By standard results on the density of primes, there are $s^8$ primes in $[D, D^3]$, and since $|i-j| \le n$ it follows that $|i - j|$ has at most $\log(n)$ prime factors. So $\Pr[Q_{i,j}] < \log(n)/s^8$. Let $Q = \cup_{(i,j)\in J} Q_{i,j}$; then by the union bound, $\Pr[Q] < s^{-3}$. It follows that no two sampled identities collide when being hashed to the universe of $P$ elements, with probability $1 - 1/s^3$.

Let $\mathrm{supp}(f') \subset [n]$ be the support of $f'$ (the non-zero indices of $f'$). Conditioned on $Q$ not occurring, we have $p_f^{-1}f''_{i \bmod P} = f'_i$ for all $i \in \mathrm{supp}(f')$, and $f''_i = 0$ if $i \neq j \pmod P$ for every $j \in \mathrm{supp}(f')$. Thus there is a bijection between $\mathrm{supp}(f')$ and $\mathrm{supp}(f'')$, and the values of the respective coordinates of $f', f''$ are equal under this bijection. Let $g', g'', \bar{g}, m_g, p_g$ be defined analogously to $f', f'', \bar{f}, m_f, p_f$ by replacing $f$ with $g$ in the past paragraphs, and let $G \in \mathbb{R}^k$ be the Countsketch vector obtained by running Countsketch on $g''$, just as we did to obtain $F \in \mathbb{R}^k$.

Conditioned on no collisions in the samples for both $f''$ and $g''$ (call these events $Q_f$ and $Q_g$ respectively), which together hold with probability $1 - O(s^{-3})$ by a union bound, the non-zero entries of $p_f^{-1}f''$ and $f'$, and of $p_g^{-1}g''$ and $g'$, are identical. Thus the scaled up Countsketch vectors $p_f^{-1}F$ and $p_g^{-1}G$ obtained from running Countsketch on $f'', g''$ are identical in distribution to those obtained from running Countsketch on $f', g'$. This holds because Countsketch hashes the non-zero coordinates 4-wise independently into the $k$ buckets $A, B \in \mathbb{R}^k$. Thus conditioned on $Q_f, Q_g$, we can assume that $p_f^{-1}F$ and $p_g^{-1}G$ are the result of running Countsketch on $f'$ and $g'$. We claim that $p_f^{-1}p_g^{-1}\langle F,G\rangle$ is the desired estimator.

Now recall that $f'$ is a uniform sample of $\bar{f}$, as is $g'$ of $\bar{g}$. Then applying the Countsketch error of Lemma 8, we have $p_f^{-1}p_g^{-1}\langle F,G\rangle = \langle\bar{f},\bar{g}\rangle \pm \epsilon\|\bar{f}\|_1\|\bar{g}\|_1 = \langle\bar{f},\bar{g}\rangle \pm O(\epsilon)\|f\|_1\|g\|_1$ with probability 11/13. Now since $\|f-\bar{f}\|_1 \le O(\epsilon)\|f\|_1$ and $\|g-\bar{g}\|_1 \le O(\epsilon)\|g\|_1$, we have $\langle\bar{f},\bar{g}\rangle = \langle f,g\rangle \pm \sum_{i=1}^n(\epsilon g_i\|f\|_1 + \epsilon f_i\|g\|_1 + \epsilon^2\|f\|_1\|g\|_1) = \langle f,g\rangle \pm O(\epsilon)\|f\|_1\|g\|_1$. Thus

$p_f^{-1}p_g^{-1}\langle F,G\rangle = \langle f,g\rangle \pm O(\epsilon)\|f\|_1\|g\|_1$

as required. This last fact holds deterministically using only the $\alpha$-property, so the probability of success is 11/13, as stated.

For the space, at most $2s^2$ samples were sent to each of $f'', g''$ with high probability. Thus the length of each stream was at most $\mathrm{poly}(\alpha\log(n)/\epsilon)$, and each stream had $P = \mathrm{poly}(\alpha\log(n)/\epsilon)$ items. Thus each counter in $A, B \in \mathbb{R}^k$ can be stored with $O(\log(\alpha\log(n)/\epsilon))$ bits, so storing $A, B$ requires $O(\epsilon^{-1}\log(\alpha\log(n)/\epsilon))$ bits. Note we can safely terminate if too many samples are sent and the counters become too large, as this happens with probability $O(\mathrm{poly}(1/n))$. The 4-wise independent hash function $h : [P] \to [k]$ used to create $A, B$ requires $O(\log(\alpha\log(n)/\epsilon))$ bits.

Next, by Lemma 7, the space required to hash the $\log(n)$-bit identities down to $[P]$ is $\log(P) + \log\log(n)$, which is dominated by the space for Countsketch. Finally, we can assume that $s$ is a power of two, so $p_f^{-1}, p_g^{-1} = \mathrm{poly}(s) = 2^q$ can be stored by just storing the exponent $q$, which takes $\log\log(n)$ bits. To sample, we then flip $\log(n)$ coins sequentially, and keep a counter to check if the number of heads reaches $q$ before a tail is seen.

3 L1 Heavy Hitters

As an application of the Countsketch sampling algorithm presented in the last section, we givean improved upper bound for the classic L1 Ē«-heavy hitters problem in the Ī±-property setting.Formally, given Ē« āˆˆ (0, 1), the L1 Ē«-heavy hitters problem asks to return a subset of [n] thatcontains all items i such that |fi| ā‰„ Ē«ā€–fā€–1, and no items j such that |fj| < (Ē«/2)ā€–fā€–1.

The heavy hitters problem is one of the most well-studied problems in the data stream lit-erature. For general turnstile unbounded deletion streams, there is a known lower bound ofĪ©(Ē«āˆ’1 log(n) log(Ē«n)) (see [8], in the language of compressed sensing, and [38]), and the Counts-ketch of [14] gives a matching upper bound (assuming Ē«āˆ’1 = o(n)). In the insertion only case,however, the problem can be solved using O(Ē«āˆ’1 log(n)) bits [10], and for the strictly harderL2 heavy hitters problem (where ā€–fā€–1 is replaced with ā€–fā€–2 in the problem definition), thereis an O(Ē«āˆ’2 log(1/Ē«) log(n))-bit algorithm [11]. In this section, we beat the lower bounds for un-bounded deletion streams in the Ī±-property case. We first run a subroutine to obtain a valueR = (1Ā± 1/8)ā€–fā€–1 with probability 1āˆ’ Ī“. To do this, we use the following algorithm from [39].

Fact 1 ([39]). There is an algorithm which gives a (1Ā± Ē«) multiplicative approximation with prob-ability 1āˆ’ Ī“ of the value ā€–fā€–1 using space O(Ē«āˆ’2 log(n) log(1/Ī“)).

16

Page 17: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

Next, we run an instance of CSSS with parameters k = 32/Ē« and Ē«/32 to obtain our estimate yāˆ—

of f . This requires space O(Ē«āˆ’1 log(n) log(Ī± log(n)Ē« )), and by Theorem 1 gives an estimate yāˆ— āˆˆ Rn

such that |yāˆ—i āˆ’ fi| < 2(āˆš

Ē«/32Errk2(f) + Ē«ā€–fā€–1/32) for all i āˆˆ [n] with high probability. We thenreturn all items i with |yāˆ—i | ā‰„ 3Ē«R/4. Since the top 1/Ē« elements do not contribute to Errk2(f), thequantity is maximized by having k elements with weight ā€–fā€–1/k, so Errk2(f) ā‰¤ kāˆ’1/2ā€–fā€–1. Thusā€–yāˆ—i āˆ’ fā€–āˆž < (Ē«/8)ā€–fā€–1.

Given this, it follows that for any i āˆˆ [n] if |fi| ā‰„ Ē«ā€–fā€–1, then |yāˆ—i | > (7Ē«/8)ā€–fā€–1 > (3Ē«/4)R.Similarly if |fi| < (Ē«/2)ā€–fā€–1, then |yāˆ—i | < (5Ē«/8)ā€–fā€–1 < (3Ē«/4)R. So our algorithm correctlydistinguishes Ē« heavy hitters from items with weight less than Ē«/2. The probability of failure is

O(nāˆ’c) from CSSS and Ī“ for estimating R, and the space required is O(Ē«āˆ’1 log(n) log(Ī± log(n)Ē« )) for

running CSSS and O(log(n) log(1/Ī“)) to obtain the estimate R. This gives the following theorem.

Theorem 3. Given Ē« āˆˆ (0, 1), there is an algorithm that solves the Ē«-heavy hitters problem for

general turnstile Ī±-property streams with probability 1 āˆ’ Ī“ using space O(Ē«āˆ’1 log(n) log(Ī± log(n)Ē« ) +

log(n) log(1/Ī“)).

Now note for strict turnstile streams, we can compute R = ā€–fā€–1 exactly with probability 1using an O(log(n))-bit counter. Since the error bounds from CSSS holds with high probability, weobtain the following result.

Theorem 4. Given Ē« āˆˆ (0, 1), there is an algorithm that solves the Ē«-heavy hitters problem forstrict turnstile Ī±-property streams with high probability using space O(Ē«āˆ’1 log(n) log(Ī± log(n)/Ē«)).

4 L1 Sampling

Another problem of interest is the problem of designing Lp samplers. First introduced by Mon-emizadeh and Woodruff in [47], it has since been observed that Lp samplers lead to alternativealgorithms for many important streaming problems, such as heavy hitters, Lp estimation, andfinding duplicates [7, 47, 38].

Formally, given a data stream frequency vector f , the problem of returning an Ē«-approximaterelative error uniform Lp sampler is to design an algorithm that returns an index i āˆˆ [n] such that

Pr[i = j] = (1Ā± Ē«) |fj|p

ā€–fā€–pp

for every j āˆˆ [n]. An approximate Lp sampler is allowed to fail with some probability Ī“, howeverin this case it must not output any index. For the case of p = 1, the best known upper bound isO(Ē«āˆ’1 log(Ē«āˆ’1) log2(n) log(Ī“āˆ’1)) bits of space, and there is also an Ī©(log2(n)) lower bound for Lp

samplers with Ē« = O(1) for any p [38]. In this section, using the data structure CSSS of Section 2,we will design an L1 sampler for strict-turnstile strong L1 Ī±-property streams using O(Ē«āˆ’1 log(Ē«āˆ’1)

log(n) log(Ī± log(n)Ē« ) log(Ī“āˆ’1)) bits of space. Throughout the section we use Ī±-property to refer to the

L1 Ī±-property.

4.1 The L1 Sampler

Our algorithm employs the technique of precision sampling in a similar fashion as in the L1 samplerof [38]. The idea is to scale every item fi by 1/ti where ti āˆˆ [0, 1] is a uniform random variable, and

return any index i such that zi = |fi|/ti > 1Ē«ā€–fā€–1, since this occurs with probability exactly Ē« |fi|

ā€–fā€–1 .

17

Page 18: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

Ī±L1Sampler: L1 sampling algorithm for strict-turnstile strong Ī±-property streams:Initialization

1. Instantiate CSSS with k = O(log(Ē«āˆ’1)) columns and parameter Ē«ā€² = Ē«3/ log2(n).2. Select k = O(log(1Ē« ))-wise independent uniform scaling factors ti āˆˆ [0, 1] for i āˆˆ [n].3. Run CSSS on scaled input z where zi = fi/ti.4. Keep log(n)-bit counters r, q to store r = ā€–fā€–1 and q = ā€–zā€–1.

Recovery1. Compute estimate yāˆ— via CSSS.

2. Via algorithm of Lemma 5, compute v such that Errk2(z) ā‰¤ v ā‰¤ 45 k1/2Ē«3

log2(n)ā€–zā€–1 + 20Errk2(z).

3. Find i with |yāˆ—i | maximal.

4. If v > k1/2r + 45 k1/2Ē«3

log2(n)q, or |yāˆ—i | < max1Ē« r,

(c/2)Ē«2

log2(n)q where the constant c > 0 is as in

Proposition 1, output FAIL, otherwise output i and tiyāˆ—i as the estimate for fi.

Figure 3: Our L1 sampling algorithm with sucsess probability Ī˜(Ē«)

One can then run a traditional Countsketch on the scaled stream z to determine when an elementpasses this threshold.

In this section, we will adapt this idea to strong Ī±-property streams (Definition 2). The necessityof the strong Ī±-property arises from the fact that if f has the strong Ī±-property, then any coordinate-wise scaling z of f still has the Ī±-property with probability 1. Thus the stream z given by zi = fi/tihas the Ī±-property (in fact, it again has the strong Ī±-property, but we will only need the fact thatz has the Ī±-property). Our full L1 sampler is given in Figure 3.

By running CSSS to find the heavy hitters of z, we introduce error additive in O(Ē«ā€²ā€–zā€–1) =O(Ē«3/ log2(n)ā€–zā€–1), but as we will see the heaviest item in z is an Ī©(Ē«2/ log2(n)) heavy hitter withprobability 1 āˆ’ O(Ē«) conditioned on an arbitrary value of ti, so this error will only be an O(Ē«)fraction of the weight of the maximum weight element. Note that we use the term c-heavy hitterfor c āˆˆ (0, 1) to denote an item with weight at least cā€–zā€–1. Our algorithm then attempts to returnan item zi which crosses the threshold ā€–fā€–1/Ē«, and we will be correct in doing so if the tail errorErrk2(z) from CSSS is not too great.

To determine if this is the case, since we are in the strict turnstile case we can compute r = ā€–fā€–1and q = ā€–zā€–1 exactly by keeping a log(n)-bit counter (note however that we will only need constantfactor approximations for these). Next, using the result of Lemma 5 we can accurately estimateErrk2(z), and abort if it is too large in Recovery Step 4 of Figure 3. If the conditions of this stephold, we will be guaranteed that if i is the maximal element, then yāˆ—i = (1 Ā± O(Ē«))zi. This allowsus to sample Ē«-approximately, as well as guarantee that our estimate of zi has relative error Ē«. Wenow begin our analysis our L1 sampler. First, the proof of the following fact can be found in [38].

Lemma 9. Given that the values of ti are k = log(1/Ē«)-wise independent, then conditioned onan arbitrary fixed value t = tl āˆˆ [0, 1] for a single l āˆˆ [n], we have Pr[20Errk2(z) > k1/2ā€–fā€–1] =O(Ē«+ nāˆ’c).

The following proposition shows that the Ē«/ log2(n) term in the additive error of our CSSS willbe an Ē« fraction of the maximal element with high probability.

Proposition 1. There exists some constant c > 0 such that conditioned on an arbitrary fixed valuet = tl āˆˆ [0, 1] for a single l āˆˆ [n], if j is such that |zj | is maximal, then with probability 1āˆ’O(Ē«) wehave |zj | ā‰„ cĒ«2/ log2(n)ā€–zā€–1.

18

Page 19: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

Proof. Let f ā€²i = fi, zā€²i = zi for all i 6= l, and let f ā€²l = 0 = zā€²l. Let It = i āˆˆ [n] \ l | zi āˆˆ

[ā€–fā€²ā€–1

2t+1 ,ā€–f ā€²ā€–12t ). Then Pr[i āˆˆ It, i 6= l] = 2t|fi|/ā€–f ā€²ā€–1, so E[|It|] = 2t and by Markovā€™s inequality

Pr[|It| > log(n)Ē«āˆ’12t] < Ē«/ log(n). By the union boundāˆ‘

tāˆˆ[log(n)]āˆ‘

iāˆˆIt |zi| ā‰¤ log2(n)ā€–f ā€²ā€–1/Ē«with probability 1 āˆ’ Ē«. Call this event E . Now for i 6= l let Xi indicate zi ā‰„ Ē«ā€–f ā€²ā€–/2. ThenVar(Xi) = (2/Ē«)|fi|/ā€–f ā€²ā€–1 āˆ’ ((2/Ē«)|fi|/ā€–f ā€²ā€–1)2 < E[Xi], and so pairwise independence of the ti isenough to conclude that Var(

āˆ‘

i 6=lXi) < E[āˆ‘

i 6=lXi] = 2/Ē«. So by Chebyshevā€™s inequality,

Pr[

āˆ£

āˆ£

āˆ‘

i 6=l

Xi āˆ’ 2Ē«āˆ’1āˆ£

āˆ£ > Ē«āˆ’1]

< 2Ē«

so there is at least one and at most 3/Ē« items in zā€² with weight Ē«ā€–f ā€²ā€–1/2 with probability 1āˆ’O(Ē«).Call these ā€œheavy itemsā€. By the union bound, both this and E occur with probability 1 āˆ’ O(Ē«).Now the largest heavy item will be greater than the average of them, and thus it will be an Ē«/3-heavy hitter among the heavy items. Moreover, we have shown that the non-heavy items in zā€² haveweight at most log2(n)ā€–f ā€²ā€–1/Ē« with probability 1 āˆ’ Ē«, so it follows that the maximum item in zā€²

will have weight at least Ē«2/(2 log2(n))ā€–zā€–1 with probability 1āˆ’O(Ē«).Now if zl is less than the heaviest item in zā€², then that item will still be an Ē«2/(4 log2(n)) heavy

hitter. If zl is greater, then zl will be an Ē«2/(4 log2(n)) heavy hitter, which completes the proofwith c = 1/4.

Lemma 10. The probability that Ī±L1Sampler outputs the index i āˆˆ [n] is (Ē«Ā±O(Ē«2)) |fi|ā€–fā€–1 +O(nāˆ’c).

The relative error of the estimate of fi is O(Ē«) with high probability.

Proof. Ideally, we would like to output i āˆˆ [n] with |zi| ā‰„ Ē«āˆ’1r, as this happens if ti ā‰¤ Ē«|fi|/r, whichoccurs with probability precisely Ē«|fi|/r. We now examine what could go wrong and cause us tooutput i when this condition is not met or vice-versa. We condition first on the fact that v satisfiesErrk2(z) ā‰¤ v ā‰¤ (45k1/2Ē«3/ log2(n))ā€–zā€–1 + 20Errk2(z) as in Lemma 5 with parameter Ē«ā€² = Ē«3/ log2(n),and on the fact that |yāˆ—j āˆ’ zj | ā‰¤ 2(kāˆ’1/2Errk2(z) + (Ē«3/ log2(n))ā€–zā€–1) for all j āˆˆ [n] as detailed inTheorem 1, each of which occur with high probability.

The first type of error is if zi ā‰„ Ē«āˆ’1ā€–fā€–1 but the algorithm fails and we do not output i.First, condition on the fact that 20Errk2(z) < k1/2ā€–fā€–1 and that |zjāˆ— | ā‰„ cĒ«2/ log2(n)ā€–zā€–1 wherejāˆ— āˆˆ [n] is such that |zjāˆ— | is maximal, which by the union bound using Lemma 9 and Proposition1 together hold with probability 1 āˆ’ O(Ē«) conditioned on any value of ti. Call this conditionalevent E . Since v ā‰¤ 20Errk2(z) + (45k1/2Ē«3/ log2(n))ā€–zā€–1 w.h.p., it follows that conditioned on Ewe have v ā‰¤ k1/2ā€–fā€–1 + (45k1/2Ē«3/ log2(n))ā€–zā€–1. So the algorithm does not fail due to the firstcondition in Recovery Step 4. of Figure 3. Since v ā‰„ Errk2(z) w.h.p, we now have 1

k1/2Errk2(z) ā‰¤

ā€–fā€–1 +45 Ē«3

log2(n)ā€–zā€–1, and so |yāˆ—j āˆ’ zj | ā‰¤ 2( 1

k1/2Errk2(z) +

Ē«3

log2(n)ā€–zā€–1) ā‰¤ 2ā€–fā€–1 +92 Ē«3

log2(n)ā€–zā€–1 for all

j āˆˆ [n].The second way we output FAIL when we should not have is if |yāˆ—i | < ((c/2)Ē«2/ log2(n))ā€–zā€–1 but

|zi| ā‰„ Ē«āˆ’1ā€–fā€–1. Now E gives us that |zjāˆ— | ā‰„ cĒ«2/ log2(n)ā€–zā€–1 where |zjāˆ— | is maximal in z, and since

|yāˆ—i | was maximal in yāˆ—, it follows that |yāˆ—i | ā‰„ |yāˆ—jāˆ—| > |zjāˆ— | āˆ’ (2ā€–fā€–1 + 92 Ē«3

log2(n)ā€–zā€–1). But |zjāˆ— | is

maximal, so |zjāˆ— | ā‰„ |zi| ā‰„ Ē«āˆ’1ā€–fā€–1. The two lower bounds on |zjāˆ— | give us (2ā€–fā€–1+92 Ē«3

log2(n)ā€–zā€–1) =

O(Ē«)|zjāˆ— | < |zjāˆ— |/2, so |yāˆ—i | ā‰„ |zjāˆ— |/2 ā‰„ ((c/2)Ē«2/ log2(n))ā€–zā€–1. So conditioned on E , this type offailure can never occur. Thus the probability that we output FAIL for either of the last two reasonswhen |zi| ā‰„ Ē«āˆ’1ā€–fā€–1 is O(Ē«2 |fi|

ā€–fā€–1 ) as needed. So we can now assume that yāˆ—i > ((c/2)Ē«2/ log2(n))ā€–zā€–1.Given this, if an index i was returned we must have yāˆ—i >

1Ē« r =

1Ē«ā€–fā€– and yāˆ—i > ((c/2)Ē«2/ log2(n))ā€–zā€–1.

These two facts together imply that our additive error from CSSS is at most O(Ē«)|yāˆ—i |, and thus atmost O(Ē«)|zi|, so |yāˆ—i āˆ’ zi| ā‰¤ O(Ē«)|zi|.

19

Page 20: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

With this in mind, another source of error is if we output i with yāˆ—i ā‰„ Ē«āˆ’1r but zi < Ē«āˆ’1r, so weshould not have output i. This can only happen if zi is close to the threshold rĒ«

āˆ’1. Since the additiveerror from our Countsketch is O(Ē«)|zi|, it must be that ti lies in the interval |fi|

r (1/Ē« + O(1))āˆ’1 ā‰¤ti ā‰¤ |fi|

r (1/Ē«āˆ’O(1))āˆ’1, which occurs with probability O(1)1/Ē«2āˆ’O(1)

|fi|r = O(Ē«2 |fi|

ā€–fā€–1 ) as needed.

Finally, an error can occur if we should have output i because zi ā‰„ Ē«āˆ’1ā€–fā€–1, but we output

another index iā€² 6= i because yāˆ—iā€² > yāˆ—i . This can only occur if tiā€² < (1/Ē«āˆ’O(1))āˆ’1 |fiā€² |r , which occurs

with probability O(Ē«|fiā€² |ā€–fā€–1 ). By the union bound, the probability that such an iā€² exists is O(Ē«),

and by pairwise independence this bound holds conditioned on the fact that zi > Ē«āˆ’1r. So theprobability of this type of error is O(Ē«2 |fi|

ā€–fā€–1 ) as needed.

Altogether, this gives the stated Ē« |fi|ā€–fā€–1 (1 Ā± O(Ē«)) + O(nāˆ’c) probability of outputting i, where

the O(nāˆ’c) comes from conditioning on the high probability events. For the O(Ē«) relative errorestimate, if we return an index i we have shown that our additive error from CSSS was at mostO(Ē«)|zi|, thus tiyāˆ—i = (1Ā±O(Ē«))tizi = (1Ā±O(Ē«))fi as needed.

Theorem 5. For Ē«, Ī“ > 0, there is an O(Ē«)-relative error one-pass L1 sampler for Ī±-propertystreams which also returns an O(Ē«)-relative error approximation of the returned item. The algorithmoutputs FAIL with probability at most Ī“, and the space is O(1Ē« log(

1Ē« ) log(n) log(Ī± log(n)/Ē«) log(1Ī“ )).

Proof. By the last lemma, it follows that the prior algorithm fails with probability at most 1 āˆ’Ē« + O(nāˆ’c). Conditioned on the fact that an index i is output, the probability that i = j is

(1 Ā± O(Ē«)) |fi|ā€–fā€–1 + O(nāˆ’c). By running O(1/Ē« log(1/Ī“)) copies of this algorithm in parallel and

returning the first index returned by the copies, we obtain an O(Ē«) relative error sampler withfailure probability at most Ī“. The O(Ē«) relative error estimation of fi follows from Lemma 10.

For space, note that CSSS requires O(k log(n) log(Ī± log(n)/Ē«) = O(log(n) log(1/Ē«) log(Ī± log(n)Ē« ))

bits of space, which dominates the cost of storing r, q and the cost of computing v via Lemma 5,as well as the cost of storing the randomness to compute k-wise independent scaling factors ti.Running O(1/Ē« log(1/Ī“)) copies in parallel gives the stated space bound.

Remark 1. Note that the only action taken by our algorithm which requires more space in thegeneral turnstile case is the L1 estimation step, obtaining r, q in step 2 of the Recovery in Figure3. Note that r, q, need only be constant factor approximations in our proof, and such constantfactor approximations can be obtained with high probability using O(log2(n)) bits (see Fact 1).

This gives an O(1Ē« log(1Ē« ) log(n) log(

Ī± log(n)Ē« ) + log2(n))-bit algorithm for the general turnstile case.

5 L1 estimation

We now consider the well-studied L1 estimation problem in the Ī±-property setting (in this section wewrite Ī±-property to refer to the L1 Ī±-property). We remark that in the general turnstile unboundeddeletion setting, an O(1) estimation of ā€–fā€–1 can be accomplished inO(log(n)) space [39]. We show inSection 8, however, that even for Ī± as small as 3/2, estimating ā€–fā€–1 in general turnstile Ī±-propertystreams still requires Ī©(log(n))-bits of space. Nevertheless, in 5.2 we show that for Ī±-propertygeneral turnstile streams there is a O(Ē«āˆ’2 log(Ī±) + log(n)) bits of space algorithm, where O hideslog(1/Ē«) and log log(n) terms, thereby separating the Ē«āˆ’2 and log n factors. Furthermore, we showa nearly matching lower bound of Ī©( 1

Ē«2 log(Ē«2Ī±)) for the problem (Theorem 14).

20

Page 21: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

Ī±L1Estimator: Input (Ē«, Ī“) to estimate L1 of an Ī±-property strict turnstile stream.1. Initialization: Set sā† O(Ī±2Ī“āˆ’1 log3(n)/Ē«2), and initialize Morris-Counter vt with param-

eter Ī“ā€². Define Ij = [sj, sj+2].2. Processing: on update ut, for each j such that vt āˆˆ Ij , sample ut with probability sāˆ’j.3. For each update ut sampled while vt āˆˆ Ij, keep counters c+j , c

āˆ’j initialized to 0. Store all

positive updates sampled in c+j , and (the absolute value of) all negative updates sampled in

cāˆ’j .

4. if vt /āˆˆ Ij for any j, delete the counters c+j , cāˆ’j .

5. Return: sāˆ’jāˆ—(c+jāˆ— āˆ’ cāˆ’jāˆ—) for which jāˆ— is such that c+jāˆ— , cāˆ’jāˆ— have existed the longest (the stored

counters which have been receiving updates for the most time steps).

Figure 4: L1 Estimator for strict turnstile Ī±-property streams.

5.1 Strict Turnstile L1 Estimation

Now for strict-turnstile Ī±-property streams, we show that the problem can be solved with O(log(Ī±))-bits. Ideally, to do so we would sample poly(Ī± log(n)/Ē«) updates uniformly from the stream andapply Lemma 1. To do this without knowing the length of the stream in advance, we sample inexponentially increasing intervals, throwing away a prefix of the stream. At any given time, we willsample at two different rates in two overlapping intervals, and we will return the estimate givenby the sample corresponding to the interval from which we have sampled from the longest upontermination. We first give a looser analysis of the well known Morris counting algorithm.

Lemma 11. There is an algorithm, Morris-Counter, that given Ī“ āˆˆ (0, 1) and a sequence of mevents, produces non-decreasing estimates vt of t such that

Ī“/(12 log(m))t ā‰¤ vt ā‰¤ 1/Ī“t

for a fixed t āˆˆ [m] with probability 1āˆ’ Ī“. The algorithm uses O(log log(m)) bits of space.

Proof. The well known Morris Counter algorithm is as follows. We initialize a counter v0 = 0, andon each update t we set vt = vtāˆ’1 + 1 with probability 1/2vt , otherwise vt = vtāˆ’1. The estimateof t at time t is 2vt āˆ’ 1. It is easily shown that E[2vt ] = t + 1, and thus by Markovā€™s bound,Pr[2vt āˆ’ 1 > t/Ī“] < Ī“.

For the lower bound consider any interval Ei = [2i, 2i+1]. Now suppose our estimate of t = 2i

is less than 6(2iĪ“/ log(n)). Then we expect to sample at least 3 log(n)/Ī“ updates in Ei at 1/2 thecurrent rate of sampling (note that the sampling rate would decrease below this if we did samplemore than 2 updates). Then by Chernoff bounds, with high probability w.r.t. n and Ī“ we sampleat least two updates. Thus our relative error decreases by a factor of at least 2 by the time wereach the interval Ei+1. Union bounding over all log(m) = O(log(n)) intervals, the estimate neverdrops below Ī“/12 log(n)t for all t āˆˆ [m] w.h.p. in n and Ī“. The counter is O(log log(n)) bits withthe same probability, which gives the stated space bound.

Our full L1 estimation algorithm is given in Figure 4. Note that the value sāˆ’jāˆ— can be returnedsymbolically by storing s and jāˆ—, without explicitly computing the entire value. Also observe thatwe can assume that s is a power of 2 by rescaling, and sample with probability sāˆ’i by flippinglog(s)i fair coins sequentially and sampling only if all are heads, which requires O(log log(n)) bitsof space.

21

Page 22: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

Theorem 6. The algorithm Ī±L1Estimator gives a (1Ā±Ē«) approximation of the value ā€–fā€–1 of a strictturnstile stream with the Ī±-property with probability 1āˆ’Ī“ using O(log(Ī±/Ē«)+log(1/Ī“)+log(log(n))))bits of space.

Proof. Let Ļˆ = 12 log2(m)/Ī“. By the union bound on Lemma 11, with probability 1 āˆ’ Ī“ theMorris counter vt will produce estimates vt such that t/Ļˆ ā‰¤ vt ā‰¤ Ļˆt for all points t = si/Ļˆand t = Ļˆsi for i = 1, . . . , log(m)/ log(s). Conditioned on this, Ij will be initialized by time Ļˆsj

and not deleted before sj+2/Ļˆ for every j = 1, 2, . . . , log(n)/ log(s). Thus, upon termination, theoldest set of counters c+jāˆ— , c

āˆ’jāˆ— must have been receiving samples from an interval of size at least

māˆ’ 2Ļˆm/s, with sampling probability sāˆ’jāˆ— ā‰„ s/(2Ļˆm). Since s ā‰„ 2ĻˆĪ±2/Ē«2, it follows by Lemma1 that sāˆ’jāˆ—(c+jāˆ— āˆ’ cāˆ’jāˆ—) =

āˆ‘ni=1 fi Ā± Ē«ā€–fā€–1 w.h.p., where f is the frequency vector of all updates

after time tāˆ— and tāˆ— is the time step where c+jāˆ— , cāˆ’jāˆ— started receiving updates. By correctness of

our Morris counter, we know that tāˆ— < 2Ļˆm/s < Ē«ā€–fā€–1, where the last inequality follows from thesize of s and the the Ī±-property, so the number of updates we missed before initializing c+jāˆ— , c

āˆ’jāˆ—

is at most Ē«ā€–fā€–1. Since f is a strict turnstile stream,āˆ‘n

i=1 fi = ā€–fā€–1 Ā± tāˆ— = (1 Ā± O(Ē«))ā€–fā€–1 and

ā€–fā€–1 = (1Ā±O(Ē«))ā€–fā€–1. After rescaling of Ē« we obtain sāˆ’jāˆ—(c+jāˆ— āˆ’ cāˆ’jāˆ—) = (1Ā± Ē«)ā€–fā€–1 as needed.For space, conditioned on the success of the Morris counter, which requires O(log log(n))-bits

of space, we never sample from an interval Ij for more than Ļˆsj+2 steps, and thus the maximumexpected number of samples is Ļˆs2, and is at most s3 with high probability by Chernoff bounds.Union bounding over all intervals, we never have more than s2 samples in any interval with highprobability. At any given time we store counters for at most 2 intervals, so the space required isO(log(s) + log log(m)) = O(log(Ī±/Ē«) + log(1/Ī“) + log log(n)) as stated.

Remark 2. Note that if an update āˆ†t to some coordinate it arrives with |āˆ†t| > 1, our algorithmmust implicitly expand āˆ†t to updates in āˆ’1, 1 by updating the counters by Sign(āˆ†t)Ā·Bin(|āˆ†t|, sāˆ’j)for some j. Note that computing this requires O(log(|āˆ†t|)) bits of working memory, which is poten-tially larger than O(log(Ī± log(n)/Ē«)). However, if the updates are streamed to the algorithm usingO(log(|āˆ†t|)) bits then it is reasonable to allow the algorithm to have at least this much workingmemory. Once computed, this working memory is no longer needed and does not factor into thespace complexity of maintaining the sketch of Ī±L1Estimator.

5.2 General Turnstile L1 Estimator

In [39], an O(Ē«āˆ’2 log(n))-bit algorithm is given for general turnstile L1 estimation. We show howmodifications to this algorithm can result in improved algorithms for Ī±-property streams. We statetheir algorithm in Figure 5, along with the results given in [39]. Here D1 is the distribution of a1-stable random variable. In [35, 39], the variables X = tan(Īø) are used, where Īø is drawn uniformlyfrom [āˆ’Ļ€

2 ,Ļ€2 ]. We refer the reader to [35] for a further discussion of p-stable distributions.

Lemma 12 ( A.6 [39]). The entries of A,Aā€² can be generated to precision Ī“ = Ī˜(Ē«/m) usingO(k log(n/Ē«)) bits.

Theorem 7 (Theorem 2.2 [39]). The algorithm above can be implemented using precision Ī“ inthe variables Ai,j, A

ā€²i,j , and thus precision Ī“ in the entries yi, y

ā€²i, such that the output L satisfies

L = (1Ā± Ē«)ā€–fā€–1 with probability 3/4, where Ī“ = Ī˜(Ē«/m). In this setting, we have yā€²med = Ī˜(1)ā€–f ||1,and

āˆ£

āˆ£

āˆ£

(1

r

rāˆ‘

i=1

cos(yiyā€²med

))

āˆ’ eāˆ’(

ā€–fā€–1yā€²med

)āˆ£

āˆ£

āˆ£ā‰¤ O(Ē«)

.

22

Page 23: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

1. Initialization: Generate random matrices A āˆˆ RrƗn and A āˆˆ Rrā€²Ć—n of variables drawnfrom D1, where r = Ī˜(1/Ē«2) and rā€² = Ī˜(1). The variables Aij are k-wise independent, fork = Ī˜(log(1/Ē«)/ log log(1/Ē«)) , and the variables Aā€²

ij are kā€²-wise independent for kā€² = Ī˜(1).For i 6= iā€², the seeds used to generate the variables Ai,jnj=1 and Aiā€²,jnj=1 are pairwiseindependent

2. Processing: Maintain vectors y = Af and yā€² = Aā€²f .3. Return: Let yā€²med = median|yā€²i|r

ā€²

i=1. Output L = yā€²med

(

āˆ’ ln(

1r

āˆ‘ri=1 cos(

yiyā€²med

)))

Figure 5: L1 estimator of [39] for general turnstile unbounded deletion streams.

We demonstrate that this algorithm can be implemented with reduced space complexity forĪ±-property streams by sampling to estimate the values yi, y

ā€²i. We first prove an alternative version

of our earlier sampling Lemma.

Lemma 13. Suppose a sequence of I insertions and D deletions are made to a single item, andlet m = I + D be the total number of updates. Then if X is the result of sampling updates withprobability p = Ī©(Ī³āˆ’3 log(n)/m), then with high probability

X = (I āˆ’D)Ā± Ī³m

Proof. Let X+ be the positive samples and Xāˆ’ be the negative samples. First suppose I > Ē«m,

then Pr[|pāˆ’1Xāˆ’ āˆ’ I| > Ī³m] < 2 exp(

āˆ’ Ī³2pI3

)

< exp(

āˆ’ Ī³3pm3

)

= 1/poly(n). Next, if I < Ī³m, wehave Pr[pāˆ’1X+ > 2Ī³m] < exp

(

āˆ’ (Ī³mp/3)

< 1/poly(n) as needed. A similar bound shows thatXāˆ’ = D Ā±O(Ī³)m, thus X = X+ āˆ’Xāˆ’ = I āˆ’D Ā±O(Ī³m) as desired after rescaling Ī³.

Theorem 8. There is an algorithm that, given a general turnstile Ī±-property stream f , produces

an estimate L = (1Ā±O(Ē«))ā€–fā€–1 with probability 2/3 using O(Ē«āˆ’2 log(Ī± log(n)/Ē«) +log( 1

Ē«) log(n)

log log( 1Ē«)) bits

of space.

Proof. Using Lemma 12, we can generate the matrices A,Aā€² usingO(k log(n)/Ē«) = O(log(1Ē« ) log(n/Ē«)/ log log(1Ē« )) bits of space, with precision Ī“ = Ī˜(Ē«/m). Every time an update (it,āˆ†t) arrives, wecompute the update Ī·i = āˆ†tAi,it to yi for i āˆˆ [r], and the update Ī·ā€²iā€² = āˆ†tA

ā€²iā€²,it

to yā€²iā€² for iā€² āˆˆ [rā€²].

Let X āˆ¼ D1 for a 1-stable distribution D1. We think of yi and yā€²iā€² as streams on one variable,

and we will sample from them and apply Lemma 13. We condition on the success of the algorithmof Theorem 7, which occurs with probability 3/4. Conditioned on this, this estimator L of Figure5 is a (1Ā± Ē«) approximation, so we need only show we can estimate L in small space.

Now the number of updates to yi isāˆ‘n

q=1 |Aiq|Fq, where Fq = Iq+Dq is the number of updates

to fq. Conditioned on maxjāˆˆ[n] |Aij | = O(n2), which occurs with probability 1 āˆ’ O(1/n) by aunion bound, E[|Aiq|] = Ī˜(log(n)) (see, e.g., [35]). Then E[

āˆ‘nq=1 |Aiq|Fq] = O(log(n))ā€–Fā€–1 is the

expected number of updates to yi, so by Markovā€™s inequalityāˆ‘n

q=1 |Aiq|Fq = O(log(n)/Ē«2ā€–Fā€–1)with probability 1āˆ’ 1/(100(r + rā€²)), and so with probability 99/100, by a union bound this holdsfor all i āˆˆ [r], iā€² āˆˆ [rā€²]. We condition on this now.

Our algorithm then is as follows. We then scale āˆ†i up by Ī“āˆ’1 to make each update Ī·i, Ī·ā€²i an

integer, and sample updates to each yi with probability p = Ī©(Ē«āˆ’30 log(n)/m) and store the result

in a counter ci. Note that scaling by Ī“āˆ’1 only blows up the size of the stream by a factor of m/Ē«.Furthermore, if we can (1Ā± Ē«) approximation the L1 of this stream, scaling our estimate down by

23

Page 24: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

a factor of Ī“ gives a (1Ā± Ē«) approximation of the actual L1, so we can assume from now on that allupdates are integers.

Let yi = pāˆ’1ci. We know by Lemma 13 that yi = yiĀ±Ē«0(āˆ‘n

q=1 |Aiq|Fq) = yiĀ±O(Ē«0 log(n)ā€–Fā€–1/Ē«2)with high probability. Setting Ē«0 = Ē«3/(Ī± log(n)), we have |yi āˆ’ yi| < Ē«/Ī±ā€–Fā€–1 ā‰¤ Ē«ā€–fā€–1 by the Ī±-property for all i āˆˆ [r]. Note that we can deal with the issue of not knowing the length of thestream by sampling in exponentially increasing intervals [s1, s3], [s2, s4], . . . of size s = poly(Ī±/Ē«0)as in Figure 4, throwing out a Ē«0 fraction of the stream. Since our error is an additive Ē«0 fractionof the length of the stream already, our error does not change. We run the same routine to obtainestimates yā€²i of y

ā€²i with the same error guarantee, and output

Lā€² = yā€²med

(

āˆ’ ln(1

r

rāˆ‘

i=1

cos(yiyā€²med

))

)

where yā€²med = median(yiā€²). By Theorem 7, we have that yā€²med = Ī˜(ā€–fā€–1), thus yā€²med = yā€²medĀ± Ē«ā€–fā€–1

since |yā€²i āˆ’ yā€²i| < Ē«ā€–fā€–1 for all i. Using the fact that yā€²med = Ī˜(ā€–fā€–1) by Theorem 7, we have:

Lā€² = (yā€²med Ā± Ē«ā€–fā€–1)(

āˆ’ ln(1

r

rāˆ‘

i=1

cos(yi

yā€²med(1Ā±O(Ē«))Ā± Ē«ā€–fā€–1yā€²med(1Ā±O(Ē«))

))

)

= (yā€²med Ā± Ē«ā€–fā€–1)(

āˆ’ ln(1

r

rāˆ‘

i=1

cos(yiyā€²med

)Ā±O(Ē«))

)

Where the last equation follows from the angle formula cos(Ī½ + Ī²) = cos(Ī½) cos(Ī²) āˆ’ sin(Ī½) sin(Ī²)

and the Taylor series expansion of sin and cos. Next, since |(

1r

āˆ‘ri=1 cos(

yiyā€²med

))

āˆ’ eāˆ’(

ā€–fā€–1yā€²med

)| < O(Ē«)

and yā€²med = Ī˜(ā€–fā€–1) by Theorem 7, it follows that(

1r

āˆ‘ri=1 cos(

yiyā€²med

))

= Ī˜(1), so using the fact that

ln(1Ā±O(Ē«)) = Ī˜(Ē«), this is (yā€²medĀ± Ē«ā€–fā€–1)(

āˆ’ ln(

1r

āˆ‘ri=1 cos(

yiyā€²med

))

Ā±O(Ē«))

, which is LĀ±O(Ē«)ā€–fā€–1,where L is the output of Figure 5, which satisfies L = (1Ā± Ē«)ā€–fā€–1 by Theorem 7. It follows that ourestimate Lā€² satisfies L = (1Ā±O(Ē«))ā€–fā€–1, which is the desired result. Note that we only conditionedon the success of Figure 5, which occurs with probability 3/4, and the bound on the number ofupdates to every yi, y

ā€²i, which occurs with probability 99/100, and high probability events, by the

union bound our result holds with probability 2/3 as needed.For space, generating the entries of A,Aā€² requires O(log(1Ē« ) log(n/Ē«)/ log log(

1Ē« )) bits as noted,

which dominates the cost of storing Ī“. Moreover, every counter yi, yā€²i is at most poly(p

āˆ‘nq=1 |Aiq|Fq) =

poly(Ī± log(n)/Ē«) with high probability, and can thus be stored with O(log(Ī± log(n)/Ē«)) bits eachby storing a counter and separately storing p (which is the same for every counter). As there areO(1/Ē«2) counters, the total space is as stated.

6 L0 Estimation

The problem of estimating the support size of a stream is known as L0 estimation. In otherwords, this is L0 = |i āˆˆ [n] | fi 6= 0|. L0 estimation is a fundamental problem for networktraffic monitoring, query optimization, and database analytics [54, 1, 25]. The problem also hasapplications in detecting DDoS attacks [4] and port scans [24].

For general turnstile streams, Kane, Nelson, and Woodruff gave an O(Ē«āˆ’2 log(n)(log(Ē«āˆ’1) +log log(n)))-bit algorithm with constant probability of success [40], which nearly matches the knownlower bound of Ī©(Ē«āˆ’2 log(Ē«2n)) [39]. For insertion only streams, they also demonstrated an O(Ē«āˆ’2+

24

Page 25: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

L0Estimator: L0 estimation algorithmInput: Ē« > 0

1. Set K = 1/Ē«2, and instantiate log(n)ƗK matrix B to 0.2. Fix random h1 āˆˆ H2([n], 0, . . . , n āˆ’ 1), h2 āˆˆ H2([n], [K

3]), h3 āˆˆ Hk([K3], [K]), and h4 āˆˆ

H2([K3], [K]), for k = Ī©(log(Ē«āˆ’1)/ log log(Ē«āˆ’1)).

3. Randomly choose prime p āˆˆ [D,D3], for D = 100K log(mM), and vector ~u āˆˆ FKp .

4. On Update (i,āˆ†): set

Blsb(h1(i)),h3(h2(i)) ā†(

Blsb(h1(i)),h3(h2(i)) +āˆ† Ā· ~uh4(h2(i))

)

(mod p)

5. Return: run RoughL0Estimator to obtain R āˆˆ [L0, 110L0]. Set T = |j āˆˆ [K] | Biāˆ—,j 6= 0|,where iāˆ— = max0, log(16R/K). Return estimate L0 =

32RK

ln(1āˆ’T/K)ln(1āˆ’1/K) .

Figure 6: L0 Estimation Algorithm of [40]

log(n)) upper bound. In this section we show that the ideas of [40] can be adapted to yield moreefficient algorithms for general turnstile L0 Ī±-property streams. For the rest of the section, we willsimply write Ī±-property to refer to the L0 Ī±-property.

The idea of the algorithm stems from the observation that if A = Ī˜(K), then the numberof non-empty bins after hashing A balls into K bins is well concentrated around its expectation.Treating this expectation as a function of A and inverting it, one can then recover A with goodprobability. By treating the (non-zero) elements of the stream as balls, we can hash the universedown into K = 1/Ē«2 bins and recover L0 if L0 = Ī˜(K). The primary challenge will be to ensurethis last condition. In order to do so, we subsample the elements of the stream at log(n) levels, andsimultaneously run an O(1) estimator R of the L0. To recover a (1 Ā± Ē«) approximation, we use Rto index into the level of subsampling corresponding to a substream with Ī˜(K) non-zero elements.We then invert the number of non-empty bins and scale up by a factor to account for the degreeof subsampling.

6.1 Review of Unbounded Deletion Case

For sets U, V and integer k, let Hk(U, V ) denote some k-wise independent hash family of functionsmapping U into V . Assuming that |U |, |V | are powers of 2, such hash functions can be representedusing O(k log(|U | + |V |)) bits [13] (without loss of generality we assume n, Ē«āˆ’1 are powers of 2for the remainder of the section). For x āˆˆ Zā‰„0, we write lsb(x) to denote the (0-based index of)the least significant bit of x written in binary. For instance, lsb(6) = 1 and lsb(5) = 0. We setlsb(0) = log(n). In order to fulfill the algorithmic template outlined above, we need to obtain aconstant factor approximation R to L0. This is done using the following result which can be foundin [40].

Lemma 14. Given a fixed constant Ī“ > 0, there is an algorithm, RoughL0Estimator, that withprobability 1āˆ’Ī“ outputs a value R = L0 satisfying L0 ā‰¤ R ā‰¤ 110L0, using space O(log(n) log log(n)).

The main algorithm then subsamples the stream at log(n) levels. This is accomplished bychoosing a hash function h1 : [n] ā†’ 0, . . . , n āˆ’ 1, and subsampling an item i at level lsb(h1(i)).Then at each level of subsampling, the updates to the subsampled items are hashed intoK = 1

Ē«2 binsk = Ī©(log(Ē«āˆ’1/ log(log(Ē«āˆ’1))))-wise independently. The entire data structure is then representedby a log(n) Ɨ K matrix B. The matrix B is stored modulo a sufficiently large prime p, and the

25

Page 26: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

updates to the rows are scaled via a random linear function to reduce the probability that deletionsto one item cancel with insertions to another, resulting in false negatives in the number of bucketshit by items from the support. At the termination of the algorithm, we count the number T ofnon-empty bins in the iāˆ—-th row of B, where iāˆ— = max0, log(16RK ). We then return the value

L0 = 32RK

ln(1āˆ’T/K)ln(1āˆ’1/K) . The full algorithm is given in Figure 6. First, the following Lemma can be

found in [40].

Lemma 15. There exists a constant Ē«0 such that the following holds. Let A balls be mapped intoK = 1/Ē«2 bins using a random h āˆˆ Hk([A], [K]) for k = c log(1/Ē«)/ log log(1/Ē«) for a sufficientlylarge constant c. Let X be a random variable which counts the number of non-empty bins after allballs are mapped, and assume 100 ā‰¤ A ā‰¤ K/20 and Ē« ā‰¤ Ē«0. Then E[X] = K(1āˆ’ (1āˆ’Kāˆ’1)A) and

Pr[āˆ£

āˆ£X āˆ’ E[X]āˆ£

āˆ£ ā‰¤ 8Ē«E[X]] ā‰„ 4/5

Let A be the log(n) Ɨ K bit matrix such that Ai,j = 1 iff there is at least one v āˆˆ [n] withfv 6= 0 such that lsb(h1(v)) = i and h3(h2(v)) = j. In other words, Ai,j is an indicator bit which is1 if an element from the support of f is hashed to the entry Bi,j in the above algorithm. Clearlyif Bi,j 6= 0, then Ai,j 6= 0. However, the other direction may not always hold. The proofs of thefollowing facts and lemmas can be found in [40]. However we give them here for completeness.

Fact 2. Let t, r > 0 be integers. Pick h āˆˆ H2([r], [t]). For any S āŠ‚ [r], E[āˆ‘s

i=1

(|hāˆ’1(i)āˆ©S|2

)

] ā‰¤|S|2/(2t).

Proof. Let Xi,j be an indicator variable which is 1 if h(i) = j. Utilizing linearity of expectation,

the desired expectation is then tāˆ‘

i<iā€² E[Xi,1]E[Xiā€²,1] = t(|S|

2

)

1t2ā‰¤ |S|2

2t .

Fact 3. Let Fq be a finite field and v āˆˆ Fdq be a non-zero vector. Then, if w āˆˆ Fd

q is selectedrandomly, we have Pr[v Ā· w = 0] = 1/q where v Ā· w is the inner product over Fq.

Proof. The set of vectors orthogonal to v is a linear subspace V āŠ‚ Fdq of dimension d āˆ’ 1, and

therefore contains qdāˆ’1 points. Thus Pr[w āˆˆ V ] = 1/q as needed.

Lemma 16 (Lemma 6 of [40]). Assuming that L0 ā‰„ K/32, with probability 3/4, for all j āˆˆ [K]we have Aiāˆ—,j = 0 if and only if Biāˆ—,j = 0. Moreover, the space required to store each Bi,j isO(log log(n) + log(1/Ē«)).

Proof. The space follows by the choice of p āˆˆ O(D3), and thus it suffices to bound the probabilitythat Biāˆ—,j = 0 when Aiāˆ—,j 6= 0. Define Iiāˆ— = j āˆˆ [n] | lsb(h1(j)) = iāˆ—, fj 6= 0. This is the set of non-zero coordinates of f which are subsampled to row iāˆ— of B. Now conditioned on R āˆˆ [L0, 110L0],which occurs with arbitrarily large constant probability Ī“ = Ī˜(1), we have E[|Iiāˆ— |] ā‰¤ K/32, andusing the pairwise independence of h1 we have that Var(|Iiāˆ— |) < E[|Iiāˆ— |]. So by Chebyshevā€™sinequality Pr[|Iiāˆ— | ā‰¤ K/20] = 1āˆ’O(1/K), which we now condition on. Given this, since the rangeof h2 has sizeK

3, the indices of Iiāˆ— are perfectly hashed by h2 with probability 1āˆ’O(1/K) = 1āˆ’o(1),an event we call Q and condition on occurring.

Since we choose a prime p āˆˆ [D,D3], with D = 100K log(mM), for mM larger than someconstant, by standard results on the density of primes there are at least (K2 log2(mM)) primesin the interval [D,D3]. Since each fj has magnitude at most mM and thus has at most log(mM)prime factors, it follows that fj 6= 0 (mod p) with probability 1 āˆ’ O(1/K2) = 1 āˆ’ o(1). Unionbounding over all j āˆˆ Iiāˆ— , it follows that p does not divide |fj | for any j āˆˆ Iiāˆ— with probability1āˆ’ o(1/K) = 1āˆ’ o(1). Call this event Qā€² and condition on it occurring. Also let Qā€²ā€² be the eventthat h4(h2(j)) 6= h4(h2(j

ā€²)) for any distinct j, jā€² āˆˆ Iiāˆ— such that h3(h2(j)) = h3(h2(jā€²)).

26

Page 27: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

To bound Pr[Ā¬Qā€²ā€²], let Xj,jā€² indicate h3(h2(j)) = h3(h2(jā€²)), and let X =

āˆ‘

j<jā€² Xj,jā€². By

Fact 2 with r = K3, t = K and |S| = |Iiāˆ— | < K/20, we have that E[X] ā‰¤ K/800. Let Z =(j, jā€²) | h3(h2(j)) = h3(h2(j

ā€²)). For (j, jā€²) āˆˆ Z, let Yj,jā€² indicate h4(h2(j)) = h4(h2(jā€²)), and let

Y =āˆ‘

(j,jā€²)āˆˆZ Yj,jā€². By the pairwise independence of h4 and our conditioning on Q, we have E[Y ] =āˆ‘

(j,jā€²)āˆˆZ Pr[h4(h2(j)) = h4(h2(jā€²))] = |Z|/K = X/K. Now conditioned on X < 20E[X] = K/40

which occurs with probability 19/20 by Markovā€™s inequality, we have E[Y ] ā‰¤ 1/40, so Pr[Y ā‰„ 1] ā‰¤1/40. So we have shown that Qā€²ā€² holds with probability (19/20)(39/40) ā‰„ 7/8.

Now for each j āˆˆ [K] such that Aiāˆ—,j = 1, we can view Biāˆ—,j as the dot product of the vectorv, which is f restricted to the coordinates of Iiāˆ— that hashed to j, with a random vector w āˆˆ Fp

which is u restricted to coordinates of Iiāˆ— that hashed to j. Conditioned on Qā€² it follows that v isnon-zero, and conditioned on Qā€²ā€² it follows that w is indeed random. So by Fact 3 with q = p, unionbounding over all K counters Biāˆ—,j, we have that Biāˆ—,j 6= 0 whenever Aiāˆ—,j 6= 0 with probability1 āˆ’ K/p ā‰„ 99/100. Altogether, the success probability is then (7/8)(99/100) āˆ’ o(1) ā‰„ 3/4 asdesired.

Theorem 9. Assuming that L0 > K/32, the value returned by L0Estimator is a (1 Ā± Ē«) approxi-mation of the L0 using space O(Ē«āˆ’2 log(n)(log(1Ē« )+log(log(n)) log(1Ī“ )), with 3/4 success probability.

Proof. By Lemma 16, we have shown that TA = |j āˆˆ [K] | Aiāˆ—,j 6= 0| = T with probability 3/4,

where T is as in Figure 6. So it suffices to show that LA0 = 32R

Kln(1āˆ’TA/K)ln(1āˆ’1/K) is a (1Ā±Ē«) approximation.

Condition on the event E where R āˆˆ [L0, 110L0], which occurs with large constant probabilityĪ“ = Ī˜(1). Let Iiāˆ— = j āˆˆ [n] | lsb(h1(j)) = iāˆ—, fj 6= 0. Then E[|Iiāˆ— |] = L0/2

iāˆ—+1 = L0K/(32R)(assuming L0 > K/32) and Var(|Iiāˆ— |) < E[|Iiāˆ— |] by the pairwise independence of h1. ThenK/3520 ā‰¤E[|Iiāˆ— |] ā‰¤ K/32 by E , and by Chebyshevā€™s inequality K/4224 ā‰¤ |Iiāˆ— | ā‰¤ K/20 with probability1 āˆ’ O(1/K) = 1 āˆ’ o(1). Call this event E ā€², and condition on it. We then condition on the eventE ā€²ā€² that the indices of Iiāˆ— are perfectly hashed, meaning they do not collide with each other inany bucket, by h2. Given E ā€², then by the pairwise independence of h2 the event E ā€²ā€² occurs withprobability 1āˆ’O(1/K) as well.

Conditioned on E ā€²āˆ§E ā€²ā€², it follows that TA is a random variable counting the number of bins hit byat least one ball under a k-wise independent hash function, where there are C = |Iiāˆ— | balls, K bins,and k = Ī©(log(K/Ē«)/ log log(K/Ē«)). Then by Lemma 15, we have TA = (1Ā± 8Ē«)K(1āˆ’ (1āˆ’ 1/K)C)with probability 4/5. So

ln(1āˆ’ TA/K) = ln((1āˆ’ 1/K)C Ā± 8Ē«(1 āˆ’ (1āˆ’ 1/K)C))

Since we condition on the fact that K/4244 ā‰¤ C ā‰¤ K/32, it follows that (1 āˆ’ 1/K)C = Ī˜(1), sothe above is ln((1 Ā± O(Ē«))(1 āˆ’ 1/K)C) = C ln(1 āˆ’ 1/K) Ā± O(Ē«), and since ln(1 + x) = O(|x|) for|x| < 1/2, we have LA

0 = 32RCK + O(Ē«R). Now the latter term is O(Ē«L0), since R = Ī˜(L0), so it

suffices to show the concentration of C. Now since Var(C) ā‰¤ E[C] by pairwise independence of

h1, so Chebyshevā€™s inequality gives Pr[āˆ£

āˆ£C āˆ’ L0K/(32R)āˆ£

āˆ£ ā‰„ c/āˆšK] < E[C]

(c2/K)E[C]2ā‰¤ (16c )

2 and this

probability can be made arbitrarily small big increasing c, so set c such that the probability is 1/100.Note that 1/

āˆšK = Ē«, so it follows that C = (1 Ā± O(Ē«))L0K/(32R). From this we conclude that

LA0 = (1Ā±O(Ē«))L0. By the union bound the events E āˆ§E ā€²āˆ§E ā€²ā€² occur with arbitrarily large constant

probability, say 99/100, and conditioned on this we showed that LA0 = (1 Ā± Ē«)L0 with probability

4/5āˆ’1/100 = 79/100. Finally, LA0 = L0 with probability 3/4, and so together we obtain the desired

result with probability 1āˆ’ 21/100 āˆ’ 1/100 āˆ’ 1/4 = 53/100. Running this algorithm O(1) times inparallel and outputting the median gives the desired probability.

27

Page 28: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

Ī±L0Estimator: L0 estimator for Ī±-property streams.1. Initialize instance of L0Estimator, constructing only the top 2 log(4Ī±/Ē«) rows of B. Let all

parameters and hash functions be as in L0Estimator.2. Initialize Ī±StreamRoughL0Est to obtain a value Lt

0 āˆˆ [Lt0, 8Ī±L0] for all t āˆˆ [m], and set

Lt0 = maxLt

0, 8 log(n)/ log log(n)3. Update the matrix B as in Figure 6, but only store the rows with index i such that i =

log(16Lt0/K)Ā± 2 log(4Ī±/Ē«).

4. Return: run Ī±StreamConstL0Est to obtain R āˆˆ [L0, 100L0], and set T = |j āˆˆ [K] | Biāˆ—,j 6=0|, where iāˆ— = log(16R/K). Return estimate L0 =

32RK

ln(1āˆ’T/K)ln(1āˆ’1/K) .

Figure 7: Our L0 estimation algorithm for Ī±-property streams with L0 > K/32

6.2 Dealing With Small L0

In the prior section it was assumed that L0 ā‰„ K/32 = Ē«āˆ’2/32. We handle the estimation when thisis not the case the same way as [40]. We consider two cases. First, if L0 ā‰¤ 100 we can perfectlyhash the elements into O(1) buckets and recovery the L0 exactly with large constant probabilityby counting the number of nonzero buckets, as each non-zero item will be hashed to its own bucketwith good probability (see Lemma 21).

Now for K/32 > L0 > 100, a similar algorithm as in the last section is used, except we use onlyone row of B and no subsampling. In this case, we set K ā€² = 2K, and create a vector Bā€² of lengthK ā€². We then run the algorithm of the last section, but update Bj instead of Bi,j every time Bi,j isupdated. In other words, Bā€²

j is the j-th column of B collapsed, so the updates to all items in [n]are hashed into a bucket of Bā€². Let I = i āˆˆ [n] | fi 6= 0. Note that the only fact about iāˆ— thatthe proof of Lemma 16 uses was that E[|Iiāˆ— |] < K/32, and since I = L0 < K/32, this is still thecase. Thus by the the same argument given in Lemma 16, with probability 3/4 we can recover abitvector A from B satisfying Aj = 1 iff there is some v āˆˆ [n] with fv 6= 0 and h3(h2(v)) = j. Thenif TA is the number of non-zero bits of A, it follows by a similar argument as in Theorem 9 thatLā€²0 = ln(1 āˆ’ TA/K ā€²)/ ln(1 āˆ’ 1/K ā€²) = (1 Ā± Ē«)L0 for 100 < L0 < K/32. So if Lā€²

0 > K ā€²/32 = K/16,we return the output of the algorithm from the last section, otherwise we return Lā€²

0. The spacerequired to store B is O(Ē«āˆ’2(log log(n) + log(1/Ē«))), giving the following Lemma.

Lemma 17. Let Ē« > 0 be given and let Ī“ > 0 be a fixed constant. Then there is a subroutine usingO(Ē«āˆ’2(log(Ē«āˆ’1) + log log(n)) + log(n)) bits of space which with probability 1 āˆ’ Ī“ either returns a(1Ā± Ē«) approximation to L0, or returns LARGE, with the guarantee that L0 > Ē«āˆ’2/16.

6.3 The Algorithm for Ī±-Property Streams

We will give a modified version of the algorithm in Figure 6 for L0 Ī± property streams. Ouralgorithm is given in Figure 7. We note first that the return value of the unbounded deletionalgorithm only depends on the row iāˆ— = log(16R/K), and so we need only ensure that this rowis stored. Our L0 Ī±-property implies that if Lt

0 is the L0 value at time t, then we must haveLm0 = L0 ā‰„ 1/Ī±Lt

0. So if we can obtain an O(Ī±) approximation Rt to Lt0, then at time t we

need only maintain and sample the rows of the matrix with index within c log(Ī±/Ē«) distance ofit = log(16Rt/K), for some small constant c.

By doing this, the output of our algorithm will then be the same as the output of L0Estimatorwhen run on the suffix of the stream beginning at the time when we first begin sampling to therow iāˆ—. Since we begin sampling to this row when the current Lt

0 is less than an Ē« fraction of the

28

Page 29: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

final L0, it will follow that the L0 of this suffix will be an Ē«-relative error approximation of the L0

of the entire stream. Thus by the correctness of L0Estimator, the output of our algorithm will bea (1Ā± Ē«)2 approximation.

To obtain an O(Ī±) approximation to Lt0 at all points t in the stream, we employ another

algorithm of [40], which gives an O(1) estimation of the F0 value at all points in the stream,where F0 = |i āˆˆ [n] | f ti 6= 0 for some t āˆˆ [m]|. So by definition for any time t āˆˆ [m] we haveF t0 ā‰¤ F0 = ā€–I +Dā€–0 ā‰¤ Ī±ā€–fā€–0 = Ī±L0 by the Ī±-property, and also by definition F t

0 ā‰„ Lt0 at all times

t. These two facts together imply that [F t0 , 8F

t0 ] āŠ† [Lt

0, 8Ī±L0].

Lemma 18 ([40]). There is an algorithm, RoughF0Est, that with probability 1āˆ’ Ī“ outputs non de-

creasing estimates F0tsuch that F0

t āˆˆ [F t0 , 8F

t0 ] for all t āˆˆ [m] such that F t

0 ā‰„ max8, log(n)/ log log(n),where m is the length of the stream. The spacedrequired is O(log(n) log(1Ī“ ))-bits.

Corollary 2. There is an algorithm, Ī±StreamRoughL0Est, that with probability 1 āˆ’ Ī“ on an Ī±-

deletion stream outputs non-decreasing estimates L0tsuch that L0

t āˆˆ [Lt0, 8Ī±L0] for all t āˆˆ [m] such

that F t0 ā‰„ max8, log(n)/ log log(n), where m is the length of the stream. The space required is

O(log(n) log(1Ī“ )) bits.

Note that the approximation is only promised for t such that F t0 ā‰„ max8, log(n)/ log log(n).

To handle this, we give a subroutine which produces the L0 exactly for F0 < 8 log(n)/ log log(n)using O(log(n)) bits. Our main algorithm will assume that F0 > 8 log(n)/ log log(n), and initialize

its estimate of L00 to be L0

0 = 8 log(n)/ log log(n), where Lt0 āˆˆ [Lt

0.8Ī±L0] is the estimate producedby Ī±StreamRoughL0Est.

Lemma 19. Given c ā‰„ 1, there is an algorithm that with probability 49/50 returns the L0 exactlyif F0 ā‰¤ c, and returns LARGE if F0 > c. The space required is O(c log(c) + c log log(n) + log(n))bits

Proof. The algorithm chooses a random hash function h āˆˆ H2([n], [C]) for some C = Ī˜(c2). Everytime an update (it,āˆ†t) arrives, the algorithm hashes the identity it and keeps a counter, initializedto 0, for each identity h(it) seen. The counter for h(it) is incremented by all updates āˆ†Ļ„ such thath(iĻ„ ) = h(it). Furthermore, all counters are stored mod p, where p is a random prime picked inthe interval [P,P 3] for P = 1002c log(mM). Finally, if at any time the algorithm has more thanc counters stored, it returns LARGE. Otherwise, the algorithm reports the number of non-zerocounters at the end of the stream.

To prove correctness, first note that at most F0 items will ever be seen in the stream by definition.Suppose F0 ā‰¤ c. By the pairwise independence of h, and scaling C by a sufficiently large constantfactor, with probability 99/100 none of the F0 ā‰¤ c identities will be hashed to the same bucket.Condition on this now. Let I āŠ‚ [n] be the set of non-zero indices of f . Our algorithm will correctlyreport |I| if p does not divide fi for any i āˆˆ I. Now for mM larger then some constant, by standardresults on the density of primes there are at least 100c2 log2(mM) primes in the interval [P,P 3].Since each fi has magnitude at most mM , and thus at most log(mM) prime factors, it follows thatp does not divide fi with probability 1āˆ’ 1/(100c2). Since |I| = L0 < F0 ā‰¤ c, union bounding overall i āˆˆ I, it follows that p āˆ¤ fi for all i āˆˆ I with probability 99/100. Thus our algorithm succeedswith probability 1āˆ’ (1/100 + 1/100) > 49/50.

If F0 > c then conditioned on no collisions for the first c+ 1 distinct items seen in the stream,which again occurs with probability 99/100 for sufficiently large C = Ī˜(c2), the algorithm willnecessarily see c+1 distinct hashed identities once the (c+1)-st item arrives, and correctly returnLARGE.

29

Page 30: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

For the space, each hashed identity requires O(log(c)) bits to store, and each counter requiresO(log(P )) = log(c log(n)) bits to store. There are at most c pairs of identities and counters, andthe hash function h can be stored using O(log(n)) bits, giving the stated bound.

Finally, to remove the log(n) log log(n) memory overhead of running the RoughL0Estimator

procedure to determine the row iāˆ—, we show that the exact same O(1) approximation of the finalL0 can be obtained using O(log(Ī± log(n)) log(log(n))+ log(n)) bits of space for Ī±-property streams.We defer the proof of the following Lemma to Section 6.4

Lemma 20. Given a fixed constant Ī“, there is an algorithm, Ī±StreamConstL0Est that with proba-bility 1āˆ’ Ī“ when run on an Ī± property stream outputs a value R = L0 satisfying L0 ā‰¤ R ā‰¤ 100L0,using space O(log(Ī±) log log(n) + log(n)).

Theorem 10. There is an algorithm that gives a (1Ā± Ē«) approximation of the L0 value of a generalturnstile stream with the Ī±-property, using space O( 1

Ē«2log(Ī±Ē« )(log(

1Ē« ) + log log(n)) + log(n)), with

2/3 success probability.

Proof. The case of L0 < K/32 can be handled by Lemma 17 with probability 49/50, thus wecan assume L0 ā‰„ K/32. Now if F0 < 8 log(n)/ log log(n), we can use Lemma 19 with c =8 log(n)/ log log(n) to compute the L0 exactly. Conditioning on the success of Lemma 19, which oc-curs with probability 49/50, we will know whether or not F0 < 8 log(n)/ log log(n), and can correctlyoutput the result corresponding to the correct algorithm. So we can assume that F0 > 8 log(n)/ loglog(n). Then by the Ī±-property, it follows that L0 > 8 log(n)/(Ī± log log(n)).

Let tāˆ— be the time step when our algorithm initialized and began sampling to row iāˆ—. Conditionedon the success of Ī±StreamRoughL0Est and Ī±StreamConstL0Est, which by the union bound togetheroccur with probability 49/50 for constant Ī“, we argue that tāˆ— exists

First, we know that iāˆ— = log(16R/K) > log(16L0/K) > log(16 Ā· (8 log(n)/(log log(n)))/K) āˆ’log(Ī±) by the success of Ī±StreamConstL0Est. Moreover, at the start of the algorithm we ini-tialize all rows with indices i = log(16 Ā· (8 log log(n)/ log(n))/K) Ā± 2 log(4Ī±/Ē«), so if iāˆ— < log(16 Ā·(8 log log(n)/ log(n))/K) + 2 log(4Ī±/Ē«) then we initialize iāˆ— at the very beginning (time tāˆ— = 0).Next, if iāˆ— > log(16 Ā· (8 log log(n)/ log(n))/L) + 2 log(4Ī±/Ē«), then we initialize iāˆ— at the first time

t when Lt0 ā‰„ R(Ē«/(4Ī±))2. We know by termination that L0

m āˆˆ [L0, 8Ī±L0] since F0 > 8 log(n)/ loglog(n) and therefore by the end of the stream Ī±StreamRoughL0Est will give its promised approxi-mation. So our final estimate satisfies L0

m ā‰„ L0m ā‰„ L0 ā‰„ R/8 > R(Ē«/(4Ī±))2. Thus iāˆ— will always

be initialized at some time tāˆ—.Now because the estimates Lt

0 are non-decreasing and we have L0m āˆˆ [L0, 8Ī±L0], it follows that

L0t< 8Ī±L0 for all t. Then, since Lt

0 < max8Ī±L0, 8 log(n)/ log log(n) < L0(4Ī±/Ē«)2, it follows that

at the termination of our algorithm the row iāˆ— was not deleted, and will therefore be stored at theend of the stream.

Now, at time tāˆ— āˆ’ 1 right before row iāˆ— was initialized we have Ltāˆ—āˆ’10 ā‰¤ Ltāˆ—āˆ’1

0 < R(Ē«/(4Ī±))2,and since R < 110L0 we have Ltāˆ—āˆ’1

0 /L0 ā‰¤ O(Ē«2). It follows that the L0 value of the streamsuffix starting at the time step tāˆ— is a value L0 such that L0 = (1Ā±O(Ē«2))Lm

0 . Since our algorithmproduces the same output as running L0Estimator on this suffix, we obtain a (1Ā±Ē«) approximationof L0 by the proof of Theorem 9 with probability 3/4, which in turn is a (1 Ā± Ē«)2 approximationof the actual L0, so the desired result follows after rescaling Ē«. Thus the probability of success is1āˆ’ (3/50 + 1/4) > 2/3.

For space, note that we only ever store O(log(Ī±/Ē«)) rows of the matrix B, each with entriesof value at most the prime p āˆˆ O((K log(n))3), and thus storing all rows of the matrix requiresO(1/Ē«2 log(Ī±/Ē«)(log(1/Ē«) + log log(n))) bits. The space required to run Ī±StreamConstL0Est is

30

Page 31: arXiv:1803.08777v1 [cs.DS] 23 Mar 2018

an additional additive O(log(Ī±) log log(n) + log(n)) bits. The cost of storing the hash functionsh1, h2, h3, h4 is O(log(n) + log2(1/Ē«)) which dominates the cost of running Ī±StreamRoughL0Est.Along with the cost of storing the matrix, this dominates the space required to run the smallL0 algorithm of Lemma 17 and the small F0 algorithm of Lemma 19 with c on the order ofO(log(n)/ log log(n)). Putting these together yields the stated bound.

6.4 Our Constant Factor L0 Estimator for Ī±-Property Streams

In this section we prove Lemma 20. Our algorithm Ī±StreamConstL0Est is a modification of theRoughL0Estimator of [40], which gives the same approximation for turnstile streams. Their algo-rithm subsamples the stream at log(n) levels, and our improvement comes from the observationthat for Ī±-property streams we need only consider O(log(Ī±)) levels at a time. Both algorithms uti-lize the following lemma, which states that if the L0 is at most some small constant c, then it canbe computed exactly using O(c2 log log(mM)) space. The lemma follows from picking a randomprime p = Ī˜(log(mM) log log(mM)) and pairwise independently hashing the universe into [Ī˜(c2)]buckets. Each bucket is a counter which contains the sum of frequencies modulo p of updates tothe universe which land in that bucket. The L0 estimate of the algorithm is the total number ofnon-zero counters. The maximum estimate is returned after O(log(1/Ī·)) trials.

Lemma 21 (Lemma 8, [40]). There is an algorithm which, given the promise that L0 ā‰¤ c, outputsL0 exactly with probability at least 1āˆ’Ī· using O(c2 log log(n)) space, in addition to needing to storeO(log(1/Ī·)) pairwise independent hash functions mapping [n] onto [c2].

We now describe the whole algorithm RoughL0Estimator along with our modifications to it.The algorithm is very similar to the main algorithm of Figure 7. First, a random hash functionh : [n]ā†’ [n] is chosen from a pairwise independent family. For each 0 ā‰¤ j ā‰¤ log(n), a substream Sjis created which consists of the indices i āˆˆ [n] with lsb(h(i)) = j. For any t āˆˆ [m], let Stā†’j denotethe substream of S restricted to the updates t, t+1, . . . ,m, and similarly let Ltā†’

0 denote the L0 ofthe stream suffix t, t+ 1, . . . ,m. Let L0(S) denote the L0 of the substream S.

We then initialize the algorithm RoughĪ±StreamL0-Estimator, which by Corollary 2 gives non-decreasing estimates Lt

0 āˆˆ [Lt0, 8Ī±L0] at all times t such that F0 > 8 log(n)/ log log(n) with proba-

bility 99/100. Let Lt0 = maxLt

0, 8 log(n)/ log log(n), and let Ut āŠ‚ [log(n)] denote the set of indices

i such that i = log(Lt0)Ā± 2 log(Ī±/Ē«), for some constant Ē« later specified.

Then at time t, for each Sj with j āˆˆ Ut, we run an instantiation of Lemma 21 with c = 132 andĪ· = 1/16 on Sj , and all instantiations share the same O(log(1/Ī·)) hash functions h1, . . . , hĪ˜(log(1/Ī·)).If j āˆˆ Ut but j /āˆˆ Ut+1, then we throw away all data structures related to Sj at time t+1. Similarly,if j enters Ut at time t, we initialize a new instantiation of Lemma 21 for Sj at time t.

To obtain the final L0 estimate for the entire stream, the largest value j ∈ U_m with j < log(2L̃^m_0) such that Bj declares L0(Sj) > 8 is found. Then the L0 estimate is L̃0 = 20000/99 · 2^j, and if no such j exists the estimate is L̃0 = 50. Note that the main difference between our algorithm and RoughL0Estimator is that RoughL0Estimator sets U_t = [log(n)] for all t ∈ [m], so our proof of Lemma 20 will follow along the lines of [40].
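The only bookkeeping our modification adds on top of RoughL0Estimator is the maintenance of the window U_t of live levels; the following is a minimal sketch of that step (hypothetical names; make_structure stands in for instantiating Lemma 21 on Sj):

```python
import math

def update_window(U, structures, L0_est, n, alpha, eps, make_structure):
    """Keep only levels j with |j - log2(L0_est)| <= 2*log2(alpha/eps),
    creating structures for levels entering the window and discarding those
    that leave. Since the estimate L0_est is non-decreasing, a level that
    falls below the window never re-enters, and only O(log(alpha/eps))
    structures are live at any time."""
    center = math.log2(max(L0_est, 1))
    w = 2 * math.log2(alpha / eps)
    new_U = {j for j in range(int(math.log2(n)) + 1) if abs(j - center) <= w}
    for j in new_U - U:
        structures[j] = make_structure(j)   # level j just entered the window
    for j in U - new_U:
        del structures[j]                   # level j left the window for good
    return new_U
```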

Proof of Lemma 20. The space required to store the hash function h is O(log(n)), and each of the O(log(1/η)) = O(1) hash functions h_i takes log(n) bits to store. The remaining space to store a single Bj is O(log log(n)) by Lemma 21, and thus storing all Bj's for j ∈ U_t at any time t requires at most O(|U_t| log log(n)) = O(log(α) log log(n)) bits (since ε = O(1)), giving the stated bound.

We now argue correctness. First, for F0 ≤ 8 log(n)/ log log(n), we can run the algorithm of Lemma 19 to produce the L0 exactly using less space than stated above. So we condition on the success of this algorithm, which occurs with probability 49/50, and assume F0 > 8 log(n)/ log log(n). This gives L0 > 8 log(n)/(α log log(n)) by the α-property, and it follows that L̃^m_0 ≤ 8αL0. Now for any t ∈ [m], E[L0(S^{t→}_j)] = L^{t→}_0/2^{j+1} if j < log(n), and E[L0(S^{t→}_j)] = L^{t→}_0/n if j = log(n). At the end of the algorithm we have all data structures stored for Bj's with j ∈ U_m. Now let j ∈ U_m be such that j < log(2L̃^m_0), and observe that Bj will be initialized at a time t_j such that L̃^{t_j}_0 > 2L̃^m_0(ε/α)², which clearly occurs before the algorithm terminates. If j = log(8 log(n)/ log log(n)) ± 2 log(α/ε), then j ∈ U_0 so t_j = 0, and otherwise t_j is such that L^{t_j}_0 ≤ L̃^{t_j}_0 ≤ (ε/α)²2^j ≤ (ε/α)²(L̃^m_0) < (8ε²/α)L0. So L^{t_j}_0/L0 = O(ε²). This means that when Bj was initialized, the value L^{t_j}_0 was at most an O(ε²) fraction of the final L0, from which it follows that L^{t_j→}_0 = (1 ± ε)L0 after rescaling ε. Thus the expected output of Bj (if it has not been deleted by termination) for j < log(2L0) is E[L0(S^{t_j→}_j)] = (1 ± ε)L0/2^{j+1}.

Now let j* be the largest j satisfying E[L0(S^{t_j→}_j)] ≥ 1. Then j* < log(2L0) < log(2L̃^m_0), and observe that 1 ≤ E[L0(S^{t_{j*}→}_{j*})] ≤ 2(1 + ε) (since the expectations decrease geometrically with constant (1 ± ε)/2). Then for any log(2L̃^m_0) > j > j*, by Markov's inequality we have Pr[L0(S^{t_j→}_j) > 8] < (1 + ε)/(8 · 2^{j−j*−1}). By the union bound, the probability that any such j ∈ (j*, log(2L̃^m_0)) has L0(S^{t_j→}_j) > 8 is at most ((1 + ε)/8) Σ_{j=j*+1}^{log(2L̃^m_0)} 2^{−(j−j*−1)} ≤ (1 + ε)/4.

Now let j** < j* be the largest j such that E[L0(S^{t_j→}_j)] ≥ 50. Since the expectations are geometrically decreasing by a factor of 2 (up to a factor of 1 ± ε), we have 100(1 + ε) ≥ E[L0(S^{t_{j**}→}_{j**})] ≥ 50, and by the pairwise independence of h we have Var[L0(S^{t_{j**}→}_{j**})] ≤ E[L0(S^{t_{j**}→}_{j**})], so by Chebyshev's inequality we have

Pr[ |L0(S^{t_{j**}→}_{j**}) − E[L0(S^{t_{j**}→}_{j**})]| < 3·√(E[L0(S^{t_{j**}→}_{j**})]) ] > 8/9

Then assuming this holds and setting ε = 1/100, we have

L0(S^{t_{j**}→}_{j**}) > 50 − 3√50 > 28

L0(S^{t_{j**}→}_{j**}) < 100(1 + ε) + 3·√(100(1 + ε)) < 132

What we have shown is that for every log(2L̃^m_0) > j > j*, with probability at least (3/4)(1 − ε/3) we will have L0(S^{t_j→}_j) ≤ 8. Since we only consider returning 2^j for j ∈ U_m with j < log(2L̃^m_0), it follows that we will not return 2^j for any j > j*. In addition, we have shown that with probability 8/9 we will have 28 < L0(S^{t_{j**}→}_{j**}) < 132, and by our choice of c = 132 and η = 1/16, it follows that B_{j**} will output the exact value L0(S^{t_{j**}→}_{j**}) > 8 with probability at least 1 − (1/9 + 1/16) > 13/16 by Lemma 21. Hence, noting that ε = 1/100, with probability 1 − (3/16 + (1 + ε)/4) = 14/25 we output 2^j for some j** ≤ j ≤ j* for which j ∈ U_m. Observe that since U_m contains all indices i = log(L̃^m_0) ± 2 log(α/ε), and along with the fact that L0 < L̃^m_0 < 8αL0, it follows that all j ∈ [j**, j*] will be in U_m at termination for a sufficiently large constant ε = O(1). Now since (1 + 1/100)L0/2 > 2^{j*} > (1 − 1/100)L0/4, and (1 + 1/100)L0/100 > 2^{j**} > (1 − 1/100)L0/200, it follows that (99/100)L0/200 < 2^j < 99L0/200, and thus 20000/99 · 2^j ∈ [L0, 100L0] as desired. If such a j** does not exist then L0 < 50 and 50 ∈ [L0, 100L0]. Note that because of the α-property, unless the stream is empty (m = 0) we must have L0 ≥ 1, so our approximation is always within the correct range. Finally, if F0 ≤ 8 log(n)/ log log(n), then with probability 49/50 Lemma 19 produces the L0 exactly, and for larger F0 we output the result of the algorithm just stated. This brings the overall success probability to 14/25 − 1/50 > 13/25. Running O(log(1/δ)) = O(1) copies of the algorithm and returning the median, we can amplify the probability 13/25 to 1 − δ.


α-SupportSampler: support sampling algorithm for α-property streams.

Initialization:
1. Set s ← 205k, and initialize a linear sketch function J : R^n → R^q for q = O(s) via Lemma 22.
2. Select a random h ∈ H_2([n], [n]), set I_j = {i ∈ [n] | h(i) ≤ 2^j}, and set ε = 1/48.

Processing:
1. Run αStreamRoughL0Est of Corollary 2 with δ = 1/12 to obtain non-decreasing R^t ∈ [L^t_0, 8αL0].
2. Let B_t = {j ∈ [log(n)] | j = log(ns/(3R^t)) ± 2 log(α/ε) or j ≥ log(ns log log(n)/(24 log(n)))}.
3. Let t_j ∈ [m] be the first time t such that j ∈ B_t (if one exists). Let t'_j be the first time step t' > t_j such that j ∉ B_{t'} (if one exists).
4. For all j ∈ B_t, maintain the linear sketch x_j = J(f^{t_j:t'_j}|_{I_j}).

Recovery:
1. For each j ∈ B_m at the end of the stream, attempt to invert x_j into f^{t_j:m}|_{I_j} via Lemma 22. Return all strictly positive coordinates of all successfully recovered f^{t_j:m}|_{I_j}'s.

Figure 8: Our support sampling algorithm for α-property streams.


7 Support Sampling

The problem of support sampling asks, given a stream vector f ∈ R^n and a parameter k ≥ 1, to return a set U ⊂ [n] of size at least min{k, ‖f‖0} such that for every i ∈ U we have f_i ≠ 0. Support samplers are crucially needed as subroutines for many dynamic graph streaming algorithms, such as connectivity, bipartiteness, minimum spanning trees, min-cut, cut sparsifiers, spanners, and spectral sparsifiers [2]. They have also been applied to solve maximum matching [43], as well as hyperedge connectivity [32]. A more comprehensive study of their usefulness in dynamic graph applications can be found in [41].

For strict turnstile streams, an Ω(k log²(n/k)) lower bound is known [41], and for general turnstile streams there is an O(k log²(n)) algorithm [38]. In this section we demonstrate that for L0 α-property streams in the strict turnstile case, more efficient support samplers exist. For the rest of the section, we write α-property to refer to the L0 α-property, and we use the notation defined at the beginning of Section 6.1.

First consider the following template for the unbounded deletion case (as in [38]). First, we subsample the set of items [n] at log(n) levels, where at level j, the set I_j ⊆ [n] is subsampled with expected size |I_j| = 2^j. Let f|_{I_j} be the vector f restricted to the coordinates of I_j (and 0 elsewhere). Then for each I_j, the algorithm creates a small sketch x_j of the vector f|_{I_j}. If f|_{I_j} is sparse, we can use techniques from sparse recovery to recover f|_{I_j} and report all the non-zero coordinates. We first state the following well known result which we utilize for this recovery.

Lemma 22 ([38]). Given 1 ≤ s ≤ n, there is a linear sketch J : R^n → R^q with q = O(s) and a recovery algorithm such that, for any f ∈ R^n, if f is s-sparse then the recovery algorithm returns f on input J(f), and otherwise it returns DENSE with high probability. The space required is O(s log(n)) bits.
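Sketches of this type are typically assembled from 1-sparse recovery primitives; the following minimal sketch shows the s = 1 special case with a fingerprint test (illustrative only, not the exact construction of [38]; the names are hypothetical):

```python
import random

class OneSparseRecovery:
    """Maintains A = sum_i f_i, B = sum_i i*f_i, and a fingerprint
    C = sum_i f_i * r^i (mod p). If f is exactly 1-sparse with support {i},
    then A = f_i, B = i*f_i, and C = f_i * r^i mod p, so i = B/A; the
    fingerprint check rejects dense vectors with high probability."""

    def __init__(self, n, p=2**61 - 1):
        self.n, self.p = n, p
        self.r = random.randrange(1, p)     # random evaluation point
        self.A, self.B, self.C = 0, 0, 0

    def update(self, i, delta):             # item i in [1..n], signed update
        self.A += delta
        self.B += i * delta
        self.C = (self.C + delta * pow(self.r, i, self.p)) % self.p

    def recover(self):
        if self.A == 0 and self.B == 0 and self.C == 0:
            return "ZERO"
        if self.A != 0 and self.B % self.A == 0:
            i = self.B // self.A
            if 1 <= i <= self.n and \
               self.C == (self.A * pow(self.r, i, self.p)) % self.p:
                return (i, self.A)          # the single non-zero coordinate
        return "DENSE"
```

An s-sparse recovery structure in the spirit of Lemma 22 can then be assembled by hashing [n] into O(s) buckets, each equipped with such a 1-sparse recoverer.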


Next, observe that for streams with the L0 α-property, the value F^t_0 is at least L^t_0 and at most αL0 for every t ∈ [m]. Therefore, if we are given an estimate of F^t_0, we show that it suffices to subsample at only O(log(α)) levels at a time. In order to estimate F^t_0 we utilize the estimator αStreamRoughL0Est from Corollary 2 of Section 6. For t' ≥ t, let f^{t:t'} ∈ R^n be the frequency vector of the stream of updates t to t'. We use the notation given in our full algorithm in Figure 8. Notice that since R^t is non-decreasing, once j is removed from B_t at time t'_j it will never enter again. So at the time of termination, we have x_j = J(f^{t_j:m}|_{I_j}) for all j ∈ B_m.

Theorem 11. Given a strict turnstile stream f with the L0 α-property and k ≥ 1, the algorithm α-SupportSampler outputs a set U ⊂ [n] such that f_i ≠ 0 for every i ∈ U, and such that with probability 1 − δ we have |U| ≥ min{k, ‖f‖0}. The space required is O(k log(n) log(δ⁻¹)(log(α) + log log(n))) bits.

Proof. First condition on the success of αStreamRoughL0Est, which occurs with probability 11/12. Set i* = min(⌈log(ns/(3L0))⌉, log(n)). We first argue that t_{i*} exists. Now x_{i*} would be initialized as soon as R^t ≥ L0(ε/α)², but R^t ≥ L^t_0, so this holds before the termination of the algorithm. Furthermore, for x_{i*} to have been deleted by the end of the algorithm we would need R^t > (α/ε)²L0, but we know R^t < 8αL0, so this can never happen. Finally, if L0 ≤ F0 < 8 log(n)/ log log(n) and our αStreamRoughL0Est fails, then note i* ≥ ⌈log(ns log log(n)/(24 log(n)))⌉, so level i* is stored for the entire algorithm.

Now we have L^{t_{i*}}_0 ≤ R^{t_{i*}} < (ε/α)L0, and thus L^{t_{i*}}_0/L0 < ε. Since f is a turnstile stream, it follows that the number of strictly positive coordinates in f^{t_{i*}:m} is at least L0 − L^{t_{i*}}_0 and at most L0. Thus there are (1 ± ε)L0 strictly positive coordinates in f^{t_{i*}:m}. By the same argument, we have ‖f^{t_{i*}:m}‖0 = (1 ± ε)L0.

Let X_i indicate the event that f^{t_{i*}:m}_i|_{I_{i*}} ≠ 0, and let X = Σ_i X_i. Using the pairwise independence of h, the X_i's with f^{t_{i*}:m}_i ≠ 0 are pairwise independent, so we obtain Var(X) < E[X] = ‖f^{t_{i*}:m}‖0 E[|I_{i*}|]/n. First assume L0 > s. Then ns/(3L0) ≤ E[|I_{i*}|] < 2ns/(3L0), so for ε < 1/48 we have E[X] ∈ [15s/48, 33s/48]. Then √(E[X]) < (1/8)E[X], so by Chebyshev's inequality Pr[|X − E[X]| > (1/4)E[X]] < 1/4, and thus ‖f^{t_{i*}:m}|_{I_{i*}}‖0 ≤ (15/16)s with probability 3/4. In this case, f^{t_{i*}:m}|_{I_{i*}} is s-sparse, so we recover the vector w.h.p. by Lemma 22. Now if L0 ≤ s then F0 < αs, so the index i' = log(n) will be stored for the entire algorithm. Thus x_{i'} = J(f), and since f is s-sparse we recover f exactly w.h.p. and return all non-zero elements. So we can assume that L0 > s.

It suffices now to show that there are at least k strictly positive coordinates in f^{t_{i*}:m}|_{I_{i*}}. Since the number of strictly positive coordinates in f^{t_{i*}:m} is also (1 ± ε)L0, letting X'_i indicate f^{t_{i*}:m}_i|_{I_{i*}} > 0 and using the same inequalities as in the last paragraph, it follows that there are at least s/15 > k strictly positive coordinates with probability 3/4. Since the stream is a strict turnstile stream, every strictly positive coordinate of a suffix of the stream must be in the support of f, so we successfully return at least k coordinates from the support with probability at least 1 − (1/4 + 1/4 + 1/12 + 1/12) = 1/3. Running O(log(δ⁻¹)) copies in parallel and setting U to be the union of all coordinates returned, it follows that with probability 1 − δ at least min{k, ‖f‖0} distinct coordinates will be returned.

For memory, for each of the O(log(δ⁻¹)) copies we subsample at O(log(α) + log log(n)) different levels, and at each level we sketch a vector into O(k) coordinates (each of which takes log(n) bits to store), which gives our desired bound. This dominates the additive O(log(n)) bits needed to run αStreamRoughL0Est.


8 Lower Bounds

We now show matching or nearly matching lower bounds for all problems we have considered. Our lower bounds all follow via reductions from one-way randomized communication complexity. We consider both the public coin model, where Alice and Bob are given access to infinitely many shared random bits, as well as the private coin model, where they do not have shared randomness.

We first state the communication complexity problems we will be reducing from. The first such problem is the Augmented-Indexing problem (Ind), which is defined as follows. Alice is given a vector y ∈ {0,1}^n, and Bob is given an index i* ∈ [n] together with the values y_{i*+1}, . . . , y_n. Alice then sends a message M to Bob, from which Bob must output the bit y_{i*} correctly. A correct protocol for Ind is one for which Bob correctly outputs y_{i*} with probability at least 2/3. The communication cost of a correct protocol is the maximum size of the message M that the protocol specifies to deliver. This problem has a well known lower bound of Ω(n) (see [46] or [39]).

Lemma 23 (Miltersen et al. [46]). The one-way communication cost of any protocol for Augmented-Indexing (Ind) in the public coin model that succeeds with probability at least 2/3 is Ω(n).

We now present the second communication complexity result which we will use for our reductions. The problem Equality is defined as follows. Alice is given y ∈ {0,1}^n and Bob is given x ∈ {0,1}^n, and they are required to decide whether x = y. This problem has a well known Ω(log(n))-bit lower bound when shared randomness is not allowed (see e.g. [6], where it is used).

Lemma 24. The one-way communication complexity of Equality in the private coin model with success probability 2/3 is Ω(log(n)).

We begin with the hardness of the heavy hitters problem in the strict turnstile setting. Our hardness result holds not just for α-property streams, but even for the special case of strong α-property streams (Definition 2). The result matches our upper bound for general α-property streams from Theorem 4 up to log log(n) and log(ε⁻¹) terms.

Theorem 12. For p ≥ 1 and ε ∈ (0, 1), any one-pass Lp heavy hitters algorithm for strong L1 α-property streams in the strict turnstile model which returns a set containing all i ∈ [n] such that |f_i| ≥ ε‖f‖_p and no i ∈ [n] such that |f_i| < (ε/2)‖f‖_p with probability at least 2/3 requires Ω(ε⁻ᵖ log(nεᵖ) log(α)) bits.

Proof. Suppose there is such an algorithm which succeeds with probability at least 2/3, and consider an instance of augmented indexing. Alice receives y ∈ {0,1}^d, and Bob gets i* ∈ [d] and y_j for j > i*. Set D = 6, let X be the set of all subsets of [n] with ⌊1/(2ε)ᵖ⌋ elements, and set d = log_6(α/4)⌊log(|X|)⌋. Alice divides y into r = log_6(α/4) contiguous chunks y^1, y^2, . . . , y^r, each containing ⌊log(|X|)⌋ bits. She uses y^j as an index into X to determine a subset x_j ⊂ [n] with |x_j| = ⌊1/(2ε)ᵖ⌋. Thinking of each x_j as a binary vector in R^n, Alice defines the vector v ∈ R^n by

v = (αD + 1)x_1 + (αD² + 1)x_2 + · · · + (αD^r + 1)x_r

She then creates a stream and inserts the necessary items so that the current frequency vector of the stream is v. She then sends the state of her heavy hitters algorithm to Bob, who wants to know y_{i*} ∈ y^j for some j = j(i*). Knowing y_{i*+1}, . . . , y_d already, he can compute u = αD^{j+1}x_{j+1} + αD^{j+2}x_{j+2} + · · · + αD^r x_r. Bob then continues the stream, subtracting off u, resulting in a final frequency vector of f = v − u. He then runs his heavy hitters algorithm to obtain a set S ⊂ [n] of heavy hitters.
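For concreteness, the following is a minimal sketch of the two vector constructions in this reduction (hypothetical names; blocks[j-1] holds the subset x_j ⊂ [n] that Alice decodes from her j-th chunk):

```python
def alice_vector(blocks, n, alpha, D=6):
    """Alice's frequency vector v = sum_j (alpha*D^j + 1) * x_j,
    with each x_j viewed as a binary indicator vector in R^n."""
    v = [0] * n
    for j, x in enumerate(blocks, start=1):   # j = 1 .. r
        w = alpha * D ** j + 1
        for i in x:
            v[i] += w                          # items may lie in several x_j
    return v

def bob_deletion_vector(blocks, n, alpha, j_star, D=6):
    """Bob knows blocks j > j_star and subtracts u = sum_{j > j_star}
    alpha*D^j * x_j, leaving each such item with frequency exactly 1."""
    u = [0] * n
    for j, x in enumerate(blocks, start=1):
        if j > j_star:
            for i in x:
                u[i] += alpha * D ** j
    return u
```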


We now argue that if correct, his algorithm must produce S = x_j. Note that some items in [n] may belong to multiple sets x_i, and would then be inserted as part of multiple x_i's, so we must deal with this possibility. For p ≥ 1, the weight ‖f‖_p^p is maximized by having x_1 = x_2 = · · · = x_j, and thus

‖f‖_p^p ≤ (1/(2ε)ᵖ)(Σ_{i=1}^{j} (αD^i + 1))^p ≤ ε⁻ᵖ(αD^{j+1}/10 + 1)^p ≤ ε⁻ᵖαᵖD^{jp}

and for every k ∈ x_j we have |f_k|^p ≥ (αD^j + 1)^p ≥ εᵖ‖f‖_p^p as desired. Furthermore, ‖f‖_p^p ≥ |x_j|αᵖD^{jp} = ⌊1/(2ε)ᵖ⌋αᵖD^{jp}, and the weight of any element k' ∈ [n] \ x_j is at most (Σ_{i=1}^{j−1} (αD^i + 1))^p ≤ (αD^j/5 + (j − 1))^p < αᵖD^{jp}/4ᵖ for α ≥ 3, so f_{k'} will not be an ε/2-heavy hitter. Thus, if correct, Bob's heavy hitters algorithm will return S = x_j. So Bob obtains S = x_j, and thus recovers y^j which indexed x_j, and can compute the relevant bit y_{i*} ∈ y^j and return this value successfully. Hence Bob solves Ind with probability at least 2/3.

Now observe that at the end of the stream each coordinate has frequency at least 1, and received fewer than 3α² updates (assuming updates have magnitude 1). Thus this stream on [n] items has the strong (3α²)-property. Additionally, the frequencies of the stream are always non-negative, so the stream is a strict turnstile stream. It follows by Lemma 23 that any heavy hitters algorithm for strict turnstile strong α-property streams requires Ω(d) = Ω(log(√(α/3)) log(|X|)) = Ω(ε⁻ᵖ log(α) log(nεᵖ)) bits, as needed.

Next, we demonstrate the hardness of estimating the L1 norm in the α-property setting. First, we show that the problem of L1 estimation in the general turnstile model requires Ω(log(n)) bits even for α-property streams with α = O(1). We also give a lower bound of Ω(ε⁻² log(α)) bits for general turnstile L1 estimation of strong α-property streams.

Theorem 13. For any α ≥ 3/2, any algorithm that produces an estimate L̃1 ∈ (1 ± 1/16)‖f‖_1 of a general turnstile stream f with the L1 α-property with probability 2/3 requires Ω(log(n)) bits of space.

Proof. Let G be a family of t = 2^{Ω(n/2)} = 2^{n'} subsets of [n/2], each of size n/8, such that no two sets have more than n/16 elements in common. As noted in [6], the existence of G follows from standard results in coding theory, and can be derived via a simple counting argument. We now reduce from Equality, where Alice has y ∈ {0,1}^{n'} and Bob has x ∈ {0,1}^{n'}. Alice can use y to index into G to obtain a subset s_y ⊂ [n/2], and similarly Bob obtains s_x ⊂ [n/2] via x. Let y', x' ∈ {0,1}^n be the characteristic vectors of s_y, s_x respectively, padded with n/2 0's at the end.

Now Alice creates a stream f on n elements by first inserting y', and then inserting the vector v where v_i = 1 for i > n/2 and 0 otherwise. She then sends the state of her streaming algorithm to Bob, who deletes x' from the stream. Now if x = y, then x' = y' and ‖f‖_1 = ‖y' + v − x'‖_1 = n/2. On the other hand, if x ≠ y, then each of s_x, s_y has at least n/16 elements in one and not in the other. Thus ‖y' − x'‖_1 ≥ n/8, so ‖f‖_1 ≥ 5n/8. Thus a streaming algorithm that produces L̃1 ∈ (1 ± 1/16)‖f‖_1 with probability 2/3 can distinguish between the two cases, and therefore solve Equality, giving an Ω(log(n')) lower bound. Since n' = Ω(n), it follows that such an L1 estimator requires Ω(log(n)) bits as stated.

Finally, note that at most 3n/4 unit increments were made to f, and ‖f‖_1 ≥ n/2, so f indeed has the α = 3/2 property.


Theorem 14. Any algorithm that produces an estimate L̃ ∈ (1 ± ε)‖f‖_1 with probability 11/12 of a general turnstile stream f ∈ R^n with the strong L1 α-property requires Ω(ε⁻² log(ε²α)) bits of space.

To prove Theorem 14, we first define the following communication complexity problem.

Definition 3. In the Gap-Ham communication complexity problem, Alice is given x ∈ {0,1}^n and Bob is given y ∈ {0,1}^n. Bob is promised that either ‖x − y‖_1 < n/2 − √n (NO instance) or that ‖x − y‖_1 > n/2 + √n (YES instance), and must decide which instance holds.

Our proof of Theorem 14 will use the following reduction; it is similar to the lower bound proof for unbounded deletion streams in [39].

Theorem 15 ([36], [59] Section 4.3). There is a reduction from Ind to Gap-Ham such that deciding Gap-Ham with probability at least 11/12 implies a solution to Ind with probability at least 2/3. Furthermore, in this reduction the parameter n in Ind is within a constant factor of that of the reduced Gap-Ham instance.

We are now ready to prove Theorem 14.

Proof. Set k = ⌊1/ε²⌋ and t = ⌊log(αε²)⌋. The reduction is from Ind. Alice receives x ∈ {0,1}^{kt}, and Bob obtains i* ∈ [kt] and x_j for j > i*. Alice conceptually breaks her string x up into t contiguous blocks b_i of size k. Bob's index i* lies in some block b_{j*} for j* = j*(i*), and Bob knows all the bits of the blocks b_i for i > j*. Alice then applies the reduction of Theorem 15 on each block b_i separately to obtain new vectors y_i of length ck for i ∈ [t], where c ≥ 1 is some small constant. Let β = c²ε⁻²α. Alice then creates a stream f on ckt items by inserting the update ((i, j), β2^i + 1) for all (i, j) such that (y_i)_j = 1. Here we are using (i, j) ∈ [t] × [ck] to index into [ckt]. Alice then computes the value v_i = ‖y_i‖_1 for i ∈ [t], and sends v_1, . . . , v_t to Bob, along with the state of the streaming algorithm run on f.

Upon receiving this, since Bob knows the bits of x after position i*, Bob can run the same reductions on the blocks b_i as Alice did to obtain y_i for i > j*. He then can make the deletions ((i, j), −β2^i) for all (i, j) such that i > j* and (y_i)_j = 1, leaving these coordinates to be f_{(i,j)} = 1. Bob then performs the reduction from Ind to Gap-Ham specifically on the block b_{j*} to obtain a vector y(B) of length ck, such that deciding whether ‖y(B) − y_{j*}‖_1 > ck/2 + √(ck) or ‖y(B) − y_{j*}‖_1 < ck/2 − √(ck) with probability 11/12 will allow Bob to solve the instance of Ind on block b_{j*} with index i* in b_{j*}. Then for each i such that y(B)_i = 1, Bob makes the update ((j*, i), −β2^{j*}) to the stream f. He then runs an L1 approximation algorithm to obtain L̃ = (1 ± ε)‖f‖_1 with probability 11/12. Let A be the number of indices i such that y(B)_i > (y_{j*})_i, let B be the number of indices such that y(B)_i < (y_{j*})_i, let C be the number of indices such that y(B)_i = 1 = (y_{j*})_i, and let D be the number of indices (i, j) with i > j* such that (y_i)_j = 1. Then we have

‖f‖_1 = β2^{j*}A + (β2^{j*} + 1)B + C + D + Σ_{i<j*} v_i(β2^i + 1)

Let Z = C + D + B, and note that Z < ckt < β. Let η = Σ_{i<j*} v_i(β2^i + 1). Bob can compute η exactly knowing the values v_1, . . . , v_t. Rearranging terms, we have ‖y(B) − y_{j*}‖_1 = (‖f‖_1 − Z − η)/(β2^{j*}) = (‖f‖_1 − η)/(β2^{j*}) ± 1. Recall that Bob must decide whether ‖y(B) − y_{j*}‖_1 > ck/2 + √(ck) or ‖y(B) − y_{j*}‖_1 < ck/2 − √(ck). Thus, in order to solve this instance of Gap-Ham it suffices to obtain an additive √(ck)/8 approximation of ‖f‖_1/(β2^{j*}). Now note that

‖f‖_1/(β2^{j*}) ≤ ckt(β2^{j*})⁻¹ + (β2^{j*})⁻¹ Σ_{i=1}^{j*} β2^i · ck ≤ 1 + 4ck

where the ckt < β term comes from the fact that every coordinate that is ever inserted will have magnitude at least 1 at the end of the stream. Taking ε' = √(ck)/(40ck) = 1/(40√(ck)) = O(ε), it follows that a (1 ± ε') approximation of ‖f‖_1 gives a √(ck)/8 additive approximation of ‖f‖_1/(β2^{j*}) as required. Thus suppose there is an algorithm A that can obtain such a (1 ± ε') approximation with probability 11/12. By the reduction of Theorem 15 and the hardness of Ind in Lemma 23, it follows that the protocol just described requires Ω(kt) bits of space. Since the only information communicated between Alice and Bob other than the state of the streaming algorithm was the set v_1, . . . , v_t, which can be sent with O(t log(k)) = o(kt) bits, it follows that A must have used Ω(kt) = Ω(ε⁻² log(αε²)) bits of space, which is the desired lower bound.

Now note that every coordinate i that was updated in the stream had final magnitude |f_i| ≥ 1. Furthermore, no item was inserted more than β2^t + 1 < α²c² + 1 times, thus the stream has the strong O(α²)-property. We have proven that any algorithm that gives a (1 ± ε) approximation of ‖f‖_1, where f is a strong α-property stream, with probability 11/12 requires Ω(ε⁻² log(ε²√α)) = Ω(ε⁻² log(ε²α)) bits, which completes the proof.

We now give a matching lower bound for L1 estimation of strict turnstile strong α-property streams. This exactly matches our upper bound of Theorem 6, which holds for the more general α-property setting.

Theorem 16. For ε ∈ (0, 1/2) and α < n, any algorithm which gives an ε-relative error approximation of the L1 of a strong L1 α-property stream in the strict turnstile setting with probability at least 2/3 must use Ω(log(α) + log(1/ε) + log log(n)) bits.

Proof. The reduction is from Ind. Alice, given x ∈ {0,1}^t where t = log_{10}(α/4), constructs a stream u ∈ R^t such that u_i = α10^i x_i + 1. She then sends the state of the stream u to Bob who, given j ∈ [t] and x_{j+1}, . . . , x_t, subtracts off v, where v_i = α10^i x_i for i ≥ j + 1 and 0 otherwise. He then runs the L1 estimation algorithm on u − v, and obtains the value L̃ such that L̃ = (1 ± ε)L1 with probability 2/3. We argue that if L̃ = (1 ± ε)L1 (for ε < 1/2), then L̃ > (1 − ε)(α10^j) > α10^j/2 iff x_j = 1. If x_j = 1 then (u_j − v_j) = α10^j + 1, and if L̃ ≥ (1 − ε)L1 the result follows. If x_j = 0, then the total L1 is at most α/4 + α Σ_{i=1}^{j−1} 10^i < α10^j/9 + α/4 < α10^j/3, so L̃ < (1 + ε)L1 < α10^j/2, as needed to solve Ind. Note that each coordinate has frequency at least 1 at the end of the stream, and no coordinate received more than α² updates. Thus the stream has the strong α²-property. By Lemma 23, it follows that any one-pass algorithm for constant factor L1 estimation of a strict turnstile strong α-property stream requires Ω(log(√α)) = Ω(log(α)) bits.

Finally, note that in the restricted insertion-only case (i.e. α = 1), estimating the L1 norm means estimating the value m given only the promise that m ≤ M = poly(n). There are log(M)/ε powers of (1 + ε) that could potentially be a (1 ± ε) approximation of m, so representing the solution requires log(log(M)/ε) = O(log log(n) + log(1/ε)) bits of space, which gives the rest of the stated lower bound.

We now prove a lower bound for L0 estimation. Our lower bound matches our upper bound of Theorem 10 up to log log(n) and log(1/ε) multiplicative factors, and a log(n) additive term. To do so, we use the following theorem of [39], which rests on a one-way two-party communication complexity lower bound.

Theorem 17 (Theorem A.2, [39]). Any one pass algorithm that gives a (1 ± ε) multiplicative approximation of the L0 of a strict turnstile stream with probability at least 11/12 requires Ω(ε⁻² log(ε²n)) bits of space.


Setting n = O(α) will give us:

Theorem 18. For ε ∈ (0, 1/2) and α < n/ log(n), any one pass algorithm that gives a (1 ± ε) multiplicative approximation of the L0 of an L0 α-property stream in the strict turnstile setting with probability at least 11/12 must use Ω(ε⁻² log(ε²α) + log log(n)) bits of space.

Proof. We can first construct the same stream used in the communication complexity lower bound of Theorem 17 on n = α − 1 elements, and then allow Bob to insert a final dummy element at the end of the stream with frequency 1. The stream now has α elements, and the L0 of the resulting stream, call it R, is exactly 1 larger than the initial L0 (which we will now refer to as L0). Moreover, this stream has the L0 α-property, since the final frequency vector is non-zero and there are α items in the stream. If we then obtained an estimate R̃ = (1 ± ε)R = (1 ± ε)(L0 + 1), then the original L0 = 0 if R̃ < (1 + ε). Otherwise R̃ − 1 = (1 ± O(ε))L0, and a constant factor rescaling of ε gives the desired approximation of the initial L0. By Theorem 17, such an approximation requires Ω(ε⁻² log(ε²α)) bits of space, as needed. The Ω(log log(n)) lower bound follows from the proof of Lemma 16, replacing the upper bound M ≥ m with n.

Next, we give lower bounds for L1 sampling and support sampling. Our lower bound for L1 samplers holds in the more restricted strong α-property setting, and for such streams we show that even samplers which return an index from a distribution with variation distance at most 1/6 from the L1 distribution |f_i|/‖f‖_1 require Ω(log(n) log(α)) bits. Taking ε = o(1), this bound matches our upper bound from Theorem 5 up to log log(n) terms. For α = o(n), our support sampling lower bound matches our upper bound in Theorem 11.

Theorem 19. Any one pass L1 sampler of a strong L1 α-property stream f in the strict turnstile model whose output distribution has variation distance at most 1/6 from the L1 distribution |f_i|/‖f‖_1, and which succeeds with probability 2/3, requires Ω(log(n) log(α)) bits of space.

Proof. Consider the same strict turnstile strong O(α²)-property stream constructed by Alice and Bob in Theorem 12, with ε = 1/2. Then X is the set of all subsets of [n] with exactly one item, so |X| = n. If i* ∈ [d] is Bob's index, then let j = j(i*) be the block such that y_{i*} ∈ y^j. The block y^j has log(|X|) = log(n) bits, and Bob's goal is to determine the set x_j ⊂ [n] of exactly one item which is indexed by y^j. The single item k ∈ x_j will be a 1/2-heavy hitter, and no other item will be a 1/4-heavy hitter, so Bob can run O(1) parallel L1 samplers and find the item k' that is returned the greatest number of times by his samplers. If his samplers function as specified, having at most 1/6 variation distance from the L1 distribution, then k' = k with large constant probability, and Bob can recover the bit representation x_j of k, from which he recovers the bit y_{i*} ∈ y^j as needed. Since Alice's string had length Ω(log(α) log(|X|)) = Ω(log(α) log(n)), we obtain the stated lower bound.

Theorem 20. Any one pass support sampler that outputs an arbitrary i ∈ [n] such that f_i ≠ 0 of an L0 α-property stream, with failure probability at most 1/3, requires Ω(log(n/α) log(α)) bits of space.

Proof. The reduction is again from Ind. Alice receives y ∈ {0,1}^d for d = ⌊log(n/α) log(α/4)⌋, and breaks it into blocks y^1, . . . , y^{log(α/4)}, each of size ⌊log(n/α)⌋. She then initializes a stream vector f ∈ R^n, and breaks f into ⌊4n/α⌋ blocks of size α/4, say B_1, . . . , B_{⌊4n/α⌋}. She uses y^i as an index into a block B_j for j = j(i), and then inserts 2^i distinct items into block B_j, each exactly once, and sends the state of her algorithm over to Bob. Bob wants to determine y_{i*} for his fixed i* ∈ [d]. Let j be such that y_{i*} ∈ y^j, and let B_k be the block indexed by y^j. Bob knows y^{j+1}, . . . , y^{log(α/4)}, and can delete the corresponding items from f that were inserted by Alice. At the end, the block B_k has 2^j items in it, and the total number of distinct items in the stream is less than 2^{j+1}. Moreover, no other block has more than 2^{j−1} items.

Now suppose Bob had access to an algorithm that would produce a uniformly random non-zero item at the end of the stream, and that would report FAIL with probability at most 1/3. He could then run O(1) such algorithms, and pick the block B_{k'} such that more than 4/10 of the returned indices are in B_{k'}. If his algorithms are correct, we then must have B_{k'} = B_k with large constant probability, from which he can recover y^j and his bit y_{i*}, thus solving Ind and giving the Ω(d) = Ω(log(n/α) log(α)) lower bound by Lemma 23.

We now show how such a random index from the support of f can be obtained using only a support sampler. Alice and Bob can use public randomness to agree on a uniformly random permutation π : [n] → [n], which gives a random relabeling of the items in the stream. Then, Alice creates the same stream as before, but using the relabeling and inserting the items in a randomized order into the stream, using separate randomness from that used by the streaming algorithm. In other words, instead of inserting i ∈ [n] as before, Alice inserts π(i) at a random position in the stream. Bob then receives the state of the streaming algorithm, and similarly deletes the items he would have deleted before, but under the relabeling π and in a randomized order.

Let i_1, . . . , i_r ∈ [n] be the items inserted by Alice that were not deleted by Bob, ordered by the order in which they were inserted into the stream. If Bob were then to run a support sampler on this stream, he would obtain an arbitrary i = g(i_1, . . . , i_r) ∈ {i_1, . . . , i_r}, where g is a (possibly randomized) function of the ordering of the sequence i_1, . . . , i_r. The randomness used by the streaming algorithm is separate from the randomness which generated the relabeling π and the randomness which determined the ordering of the items inserted and deleted from the stream. Thus, even conditioned on the randomness of the streaming algorithm, any ordering and labeling of the surviving items i_1, . . . , i_r is equally likely. In particular, i is equally likely to be any of i_1, . . . , i_r. It follows that π⁻¹(i) is a uniformly random element of the support of f, which is precisely what Bob needed to solve Ind, completing the proof.
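A minimal sketch of this randomization step (hypothetical names; support_sampler stands in for an arbitrary support sampling algorithm run on the update sequence):

```python
import random

def uniform_support_sample(alice_items, bob_items, n, support_sampler, rng):
    """Relabel items by a shared random permutation pi and randomize the
    insertion/deletion order, so that the arbitrary index returned by the
    support sampler is uniform over the surviving items after undoing pi."""
    pi = list(range(n))
    rng.shuffle(pi)                                  # shared public coins
    inserts = [(pi[i], +1) for i in alice_items]
    rng.shuffle(inserts)                             # Alice: random order
    deletes = [(pi[i], -1) for i in bob_items]
    rng.shuffle(deletes)                             # Bob: random order
    sampled = support_sampler(inserts + deletes)     # some surviving label
    inverse = {label: i for i, label in enumerate(pi)}
    return inverse[sampled]                          # undo the relabeling
```

The key point the sketch makes explicit is that the shuffles use randomness (rng) separate from any randomness inside support_sampler.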

Finally, we show that estimating inner products even for strong α-property streams requires Ω(ε⁻¹ log(α)) bits of space. Setting α = n, we obtain an Ω(ε⁻¹ log(n)) lower bound for unbounded deletion streams, which our upper bound beats for small α.

Theorem 21. Any one pass algorithm that runs on two strong L1 α-property streams f, g in the strict turnstile setting and computes a value IP(f, g) such that IP(f, g) = 〈f, g〉 ± ε‖f‖_1‖g‖_1 with probability 2/3 requires Ω(ε⁻¹ log(α)) bits of space.

Proof. The reduction is from Ind. Alice has y ∈ {0,1}^d, where y is broken up into log_{10}(α)/4 blocks of size 1/(8ε); the indices of the i-th block are called B_i. Bob wants to learn y_{i*} and is given y_j for j > i*; let j* be such that i* ∈ B_{j*}. Alice creates a stream f on d items: if i ∈ B_j, then Alice inserts the items needed to make f_i = b_i 10^j + 1, where b_i = α if y_i = 0, and b_i = 2α otherwise. She creates a second stream g = 0⃗ ∈ R^d, and sends the state of her streaming algorithms over to Bob. For every y_i ∈ B_j = B_{j(i)} that Bob knows, he subtracts off b_i 10^j, leaving f_i = 1. He then sets g_{i*} = 1, and obtains IP(f, g) via his estimator. We argue that with probability 2/3, IP(f, g) ≥ 3α10^{j*}/2 iff y_{i*} = 1. Note that the error is always at most

ε‖f‖_1‖g‖_1 = ε‖f‖_1 ≤ ε( d + (8ε)⁻¹(2α10^{j*}) + Σ_{j<j*} Σ_{i∈B_j} 2α10^j ) ≤ ε( (8ε)⁻¹ log_{10}(α)/4 + (4ε)⁻¹α10^{j*} + ε⁻¹α10^{j*}/32 ) < α10^{j*}/3

Now since 〈f, g〉 = (y_{i*} + 1)α10^{j*} + 1, if y_{i*} = 0 then, if the inner product algorithm succeeds with probability 2/3, we must have IP(f, g) ≤ α10^{j*} + 1 + α10^{j*}/3 < 3α10^{j*}/2, and similarly if y_{i*} = 1 we have IP(f, g) ≥ 2α10^{j*} + 1 − α10^{j*}/3 > 3α10^{j*}/2, as needed. So Bob can recover y_{i*} with probability 2/3 and solve Ind, giving an Ω(d) lower bound via Lemma 23. Since each item in f received at most 5α² updates and had final frequency 1, this stream has the strong 5α²-property, and g was insertion only. Thus obtaining such an estimate of the inner product between strong α-property streams requires Ω(ε⁻¹ log(√α)) = Ω(ε⁻¹ log(α)) bits, as stated.

9 Conclusion

We have shown that for bounded deletion streams, many important L0 and L1 streaming problems can be solved more efficiently. For L1, the fact that the f_i's are approximately preserved under sampling poly(α) updates paves the way for our results, whereas for L0 we utilize the fact that O(log(α)) levels of subsampling are sufficient for several algorithms. Interestingly, it is unclear whether improved algorithms for α-property streams exist for L2 problems, such as L2 heavy hitters or L2 estimation. The difficulty stems from the fact that ‖f‖_2 is not preserved in any form under sampling, and thus L2 guarantees seem to require different techniques than those used in this paper.

However, we note that by utilizing the heavy hitters algorithm of [11], one can solve the general turnstile L2 heavy hitters problem for α-property streams in O(α² log(n) log(α)) space. A proof is sketched in Appendix A. Clearly a polynomial dependence on α is not desirable; however, for the applications in which α is a constant this still represents a significant improvement over the Ω(log²(n)) lower bound for turnstile streams. We leave it as an open question whether the optimal dependence on α can be made logarithmic.

Additionally, it is possible that problems in dynamic geometric data streams (see [34]) would benefit from bounding the number of deletions in the streams in question. We note that several of these geometric problems reduce directly to problems in the data stream model studied here, to which our results directly apply.

Finally, it would be interesting to see whether analogous models with lower bounds on the "signal size" of the input would result in improved algorithms for other classes of sketching algorithms. In particular, determining the appropriate analogue of the α-property for linear algebra problems, such as low-rank approximation and regression, could be a fruitful direction for further research.

References

[1] Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. The Aqua approximate query answering system. In ACM SIGMOD Record, volume 28, pages 574–576. ACM, 1999.

[2] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Analyzing graph structure via linear measurements. In Proceedings of the Twenty-third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '12, pages 459–467, Philadelphia, PA, USA, 2012. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=2095116.2095156.

[3] Miklos Ajtai, Randal Chilton Burns, Ronald Fagin, and Larry Joseph Stockmeyer. System and method for differential compression of data from a plurality of binary sources, April 16 2002. US Patent 6,374,250.

[4] Aditya Akella, Ashwin Bharambe, Mike Reiter, and Srinivasan Seshan. Detecting DDoS attacks on ISP networks.

[5] Noga Alon, Phillip B. Gibbons, Yossi Matias, and Mario Szegedy. Tracking join and self-join sizes in limited storage. In Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 10–20. ACM, 1999.

[6] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-eighth Annual ACM Symposium on Theory of Computing, pages 20–29. ACM, 1996.

[7] Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms from precision sampling. arXiv preprint arXiv:1011.1263, 2010.

[8] Khanh Do Ba, Piotr Indyk, Eric Price, and David P. Woodruff. Lower bounds for sparse recovery. In Proceedings of the Twenty-first Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1190–1197. SIAM, 2010.

[9] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 1–16. ACM, 2002.

[10] Arnab Bhattacharyya, Palash Dey, and David P. Woodruff. An optimal algorithm for l1-heavy hitters in insertion streams and related problems. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 385–400. ACM, 2016.

[11] Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P. Woodruff. BPTree: an l2 heavy hitters algorithm using constant memory. arXiv preprint arXiv:1603.00759, 2016.

[12] Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, and David P. Woodruff. Beating CountSketch for heavy hitters in insertion streams. In Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing, pages 740–753. ACM, 2016.

[13] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143–154, 1979.

[14] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Automata, Languages and Programming, pages 784–784, 2002.

[15] Edith Cohen. Stream sampling for frequency cap statistics. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 159–168. ACM, 2015.

[16] Edith Cohen, Graham Cormode, and Nick Duffield. Don't let the negatives bring you down: sampling from streams of signed updates. ACM SIGMETRICS Performance Evaluation Review, 40(1):343–354, 2012.

[17] Edith Cohen, Graham Cormode, and Nick G. Duffield. Structure-aware sampling: Flexible and accurate summarization. PVLDB, 4(11):819–830, 2011. URL: http://www.vldb.org/pvldb/vol4/p819-cohen.pdf.

[18] Edith Cohen, Nick Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. Stream sampling for variance-optimal estimation of subset sums. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1255–1264. Society for Industrial and Applied Mathematics, 2009.

[19] Edith Cohen, Nick G. Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. Algorithms and estimators for summarization of unaggregated data streams. J. Comput. Syst. Sci., 80(7):1214–1244, 2014. URL: https://doi.org/10.1016/j.jcss.2014.04.009.

[20] Graham Cormode, Theodore Johnson, Flip Korn, Shan Muthukrishnan, Oliver Spatscheck, and Divesh Srivastava. Holistic UDAFs at streaming speeds. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 35–46. ACM, 2004.

[21] Graham Cormode, Hossein Jowhari, Morteza Monemizadeh, and S. Muthukrishnan. The sparse awakens: Streaming algorithms for matching size estimation in sparse graphs. arXiv preprint arXiv:1608.03118, 2016.

[22] Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

[23] Cristian Estan and George Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst., 21(3):270–313, 2003. URL: http://doi.acm.org/10.1145/859716.859719.

[24] Cristian Estan, George Varghese, and Mike Fisk. Bitmap algorithms for counting active flows on high speed links. In Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, pages 153–166. ACM, 2003.

[25] Schkolnick Finkelstein, Mario Schkolnick, and Paolo Tiberio. Physical database design for relational databases. ACM Transactions on Database Systems (TODS), 13(1):91–128, 1988.

[26] M. Garofalakis, J. Gehrke, and R. Rastogi. Data Stream Management: Processing High-Speed Data Streams. Data-Centric Systems and Applications. Springer Berlin Heidelberg, 2016. URL: https://books.google.com/books?id=qiSpDAAAQBAJ.

[27] Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi. Data stream management: A brave new world. pages 1–9, 01 2016.

[28] Rainer Gemulla, Wolfgang Lehner, and Peter J. Haas. A dip in the reservoir: Maintaining sample synopses of evolving datasets. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 595–606, 2006. URL: http://dl.acm.org/citation.cfm?id=1164179.

[29] Rainer Gemulla, Wolfgang Lehner, and Peter J. Haas. Maintaining bernoulli samples over evolving multisets. In Proceedings of the Twenty-Sixth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 11-13, 2007, Beijing, China, pages 93–102, 2007. URL: http://doi.acm.org/10.1145/1265530.1265544.

[30] Rainer Gemulla, Wolfgang Lehner, and Peter J. Haas. Maintaining bounded-size sample synopses of evolving datasets. VLDB J., 17(2):173–202, 2008. URL: https://doi.org/10.1007/s00778-007-0065-y.

[31] Phillip B. Gibbons and Yossi Matias. New sampling-based summary statistics for improving approximate query answers. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, pages 331–342, 1998. URL: http://doi.acm.org/10.1145/276304.276334.

[32] Sudipto Guha, Andrew McGregor, and David Tench. Vertex and hyperedge connectivity in dynamic graph streams. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 241–247. ACM, 2015.

[33] Peter J. Haas. Data-stream sampling: Basic techniques and results. In Data Stream Management - Processing High-Speed Data Streams, pages 13–44. 2016. URL: https://doi.org/10.1007/978-3-540-28608-0_2.

[34] Piotr Indyk. Algorithms for dynamic geometric problems over data streams. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, pages 373–380. ACM, 2004.

[35] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307–323, 2006.

[36] Thathachar S. Jayram, Ravi Kumar, and D. Sivakumar. The one-way communication complexity of hamming distance. Theory of Computing, 4(1):129–135, 2008.

[37] Thathachar S. Jayram and David P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Transactions on Algorithms (TALG), 9(3):26, 2013.

[38] Hossein Jowhari, Mert Saglam, and Gabor Tardos. Tight bounds for lp samplers, finding duplicates in streams, and related problems. In Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '11, pages 49–58, New York, NY, USA, 2011. ACM. URL: http://doi.acm.org/10.1145/1989284.1989289.

[39] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. On the exact space complexity of sketching and streaming small norms. In Proceedings of the Twenty-first Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1161–1178. SIAM, 2010.

[40] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 41–52. ACM, 2010.

[41] Michael Kapralov, Jelani Nelson, Jakub Pachocki, Zhengyu Wang, David P. Woodruff, and Mobin Yahyazadeh. Optimal lower bounds for universal relation, and for samplers and finding duplicates in streams. arXiv preprint arXiv:1704.00633, 2017.

[42] Donald Ervin Knuth. The Art of Computer Programming, Volume II: Seminumerical Algorithms, 3rd Edition. Addison-Wesley, 1998. URL: http://www.worldcat.org/oclc/312898417.

[43] Christian Konrad. Maximum matching in turnstile streams. In Algorithms - ESA 2015, pages 840–852. Springer, 2015.

[44] Anukool Lakhina, Mark Crovella, and Christophe Diot. Mining anomalies using traffic feature distributions. In ACM SIGCOMM Computer Communication Review, volume 35, pages 217–228. ACM, 2005.

[45] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. PVLDB, 5(12):1699, 2012. URL: http://vldb.org/pvldb/vol5/p1699_gurmeetsinghmanku_vldb2012.pdf.

[46] Peter Bro Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson. On data structures and asymmetric communication complexity. In Proceedings of the Twenty-seventh Annual ACM Symposium on Theory of Computing, pages 103–111. ACM, 1995.

[47] Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error lp-sampling with applications. In Proceedings of the Twenty-first Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1143–1160. SIAM, 2010.

[48] David Moore, 2001. URL: http://www.caida.org/research/security/code-red/.

[49] Robert Morris. Counting large numbers of events in small registers. Communications of the ACM, 21(10):840–842, 1978.

[50] Shanmugavelayutham Muthukrishnan et al. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2):117–236, 2005.

[51] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13(4):277–298, 2005.

[52] Florin Rusu and Alin Dobra. Pseudo-random number generation for sketch-based estimations. ACM Transactions on Database Systems (TODS), 32(2):11, 2007.

[53] Florin Rusu and Alin Dobra. Sketches for size of join estimation. ACM Transactions on Database Systems (TODS), 33(3):15, 2008.

[54] Amit Shukla and Prasad Deshpande. Storage estimation for multidimensional aggregates in the presence of hierarchies.

[55] Dan Teodosiu, Nikolaj Bjorner, Yuri Gurevich, Mark Manasse, and Joe Porkka. Optimizing file replication over limited-bandwidth networks using remote differential compression. 2006.

[56] Mikkel Thorup and Yin Zhang. Tabulation based 4-universal hashing with applications to second moment estimation.

[57] Jeffrey Scott Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, 1985. URL: http://doi.acm.org/10.1145/3147.3165.

[58] Arno Wagner and Bernhard Plattner. Entropy based worm and anomaly detection in fast IP networks. In Enabling Technologies: Infrastructure for Collaborative Enterprise, 2005. 14th IEEE International Workshops on, pages 172–177. IEEE, 2005.

[59] David Paul Woodruff. Efficient and private distance approximation in the communication and streaming models. PhD thesis, Massachusetts Institute of Technology, 2007.

A Sketch of the L2 Heavy Hitters Algorithm

We first note that the BPTree algorithm of [11] solves the L2 heavy hitters problem in space O(ε⁻² log(n) log(1/ε)) with probability 2/3. Recall that the L2 version of the problem asks to return a set S ⊂ [n] containing all i such that |f_i| ≥ ε‖f‖_2 and no j such that |f_j| < (ε/2)‖f‖_2. Also recall that the α-property states that ‖I + D‖_2 ≤ α‖f‖_2, where I is the vector of the stream restricted to the positive updates and D is the entry-wise absolute value of the vector of the stream restricted to the negative updates. It follows that if i ∈ [n] is an ε-heavy hitter, then ‖I + D‖_2 ≤ α‖f‖_2 ≤ (α/ε)|f_i| ≤ (α/ε)|I_i + D_i|. So in the insertion-only stream I + D, where every update is positive, the item i must be an ε/α-heavy hitter.

Using this observation, we can solve the problem as follows. First run BPTree with ε' = ε/α to obtain a set S ⊂ [n] containing all i such that |I_i + D_i| ≥ (ε/α)‖I + D‖_2 and no j such that |I_j + D_j| < (ε/2α)‖I + D‖_2. It follows that |S| = O(α²/ε²). In parallel we run an instance of CountSketch on the original stream f with O(1/ε²) rows and O(log(α/ε)) columns. For a fixed i, this gives an estimate of |f_i| with additive error (ε/4)‖f‖_2 with probability 1 − O(ε/α). At the end, we then query the CountSketch for every i ∈ S, and return only those items with estimated weight at least (3ε/4)‖f‖_2. Union bounding over all |S| queries, with constant probability the CountSketch will correctly identify all i ∈ S that are ε-heavy hitters, and will throw out all items in S with weight less than (ε/2)‖f‖_2. Since all ε-heavy hitters are in S, this solves the problem. The space required to run BPTree is O(α²ε⁻² log(n) log(α/ε)), which dominates the cost of the CountSketch.
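For reference, a minimal, self-contained sketch of a CountSketch table in the role it plays above (hypothetical class and parameter names; Python's built-in hash merely stands in for pairwise independent hash and sign functions):

```python
import random

class CountSketch:
    """rows x cols table; the estimate of f_i is the median over rows of
    sign_r(i) * table[r][bucket_r(i)]. With cols = O(1/eps^2) the per-row
    error is O(eps)*||f||_2, and taking the median over O(log(alpha/eps))
    rows drives the per-item failure probability down to O(eps/alpha)."""

    def __init__(self, rows, cols, seed=0):
        rng = random.Random(seed)
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]
        self.seeds = [(rng.randrange(1 << 30), rng.randrange(1 << 30))
                      for _ in range(rows)]

    def _hash(self, r, i):
        a, b = self.seeds[r]
        bucket = hash((a, i)) % self.cols            # bucket index
        sign = 1 if hash((b, i)) % 2 == 0 else -1    # random sign
        return bucket, sign

    def update(self, i, delta):
        for r in range(self.rows):
            bucket, sign = self._hash(r, i)
            self.table[r][bucket] += sign * delta

    def estimate(self, i):
        vals = sorted(sign * self.table[r][bucket]
                      for r in range(self.rows)
                      for bucket, sign in [self._hash(r, i)])
        return vals[len(vals) // 2]                  # median over rows
```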
