An Optimal Algorithm for the Distinct Elements...

33
An Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department of Mathematics Stanford University [email protected] February 21st, 2014 Joint work with Jelani Neslon (Harvard) and David Woodruff (IBM) D. Kane (Stanford) Distinct Elements February 2014 1 / 30

Transcript of An Optimal Algorithm for the Distinct Elements...

Page 1: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

An Optimal Algorithm for the Distinct ElementsProblem

Daniel M. Kane

Department of MathematicsStanford University

[email protected]

February 21st, 2014

Joint work with Jelani Neslon (Harvard) and David Woodruff (IBM)

D. Kane (Stanford) Distinct Elements February 2014 1 / 30

Page 2: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Streaming Algorithms

Algorithm to analyze

Data passing through a server OR

Data in a large database

In either case, there may be too much data to store a full (separate) copyof everything. Want an algorithm that:

Uses little space

Little processing time per item

Makes only one pass through the data

D. Kane (Stanford) Distinct Elements February 2014 2 / 30

Page 3: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Distinct Elements Problem

Distinct elements problem: Given a stream S of elements of[n] := 1, 2, . . . , n, return the number of distinct elements in S .Example:

S = 1, 7, 2, 7, 23, 1, 7

Distinct elements are 1, 2, 7, 23. Return 4.Applications:

Detecting Denial of Service Attacks[Akella, Bharambe, Reiter, Seshan]

Query planning[Selinger, Astrahan, Chamberlin, Lorie]

Data Integration[Brown, Haas, Myllymaki, Pirahesh, Reinwald, Sismanis]

D. Kane (Stanford) Distinct Elements February 2014 3 / 30

Page 4: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Approximation

Returning |S | exactly is too much to hope for.

Compare |S |,|S ∪ x| to determine if x ∈ S

Would essentially need to store all elements of the stream

Requires Ω(n) space

Instead return answer F so that F = |S |(1± ε) for some small ε > 0.

D. Kane (Stanford) Distinct Elements February 2014 4 / 30

Page 5: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Randomization

Deterministic algorithms also do not work:

Let Si be subsets that overlap at most 2/3rds of their elements.

Need to distinguish stream SiSi from SiSj .

Need to store enough information to tell which Si you have.

Requires Ω(n) space.

Thus, to obtain reasonable space, we will need to use a randomizedalgorithm that we will only require to be correct with probability 2/3 (thiscan be improved to 1− δ by running O(log(δ−1)) copies of the algorithmin parallel).

D. Kane (Stanford) Distinct Elements February 2014 5 / 30

Page 6: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Problem Statement

Problem

Find a randomized algorithm that given a stream S of elements in [n]:

Makes only a single pass over the stream

Uses s space

Uses t time per update

Returns a value in (|S |(1− ε), |S |(1 + ε)) with probability at least 2/3rds

D. Kane (Stanford) Distinct Elements February 2014 6 / 30

Page 7: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Past Work

Paper Space Time Notes

[FM85] O(log(n)) - Const. ε, rand. oracle

[AMS99] O(log(n)) O(log(n)) Const. ε

[GT01] O(ε−2 log(n)) O(ε−2)

[BJKST02] O(ε−2 log(n)) O(log(ε−1))

[BJKST02] O(ε−2 + log(n)) O(ε−2 log log(n))

[KNW10] O(ε−2 + log(n)) O(1) Optimal, This talk

D. Kane (Stanford) Distinct Elements February 2014 7 / 30

Page 8: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Lower Bounds

Space lower bounds of Ω(ε−2 + log(n)) were proven by Indyk andWoodruff (later extended to cover all ε > n−1/2 by Woodruff).Idea: Given a stream ST need to distinguish the cases:

S and T overlap on a 1/2 + ε-fraction of their entries

S and T overlap on a 1/2− ε-fraction of their entries

This reduces it to the one-way communication complexity of theGap-Hamming Problem.

D. Kane (Stanford) Distinct Elements February 2014 8 / 30

Page 9: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Basic Idea: Subsampling

The basic idea behind the majority of algorithms is due to Flajolet andMartin.

Let A be a random subset of [n] of size m

Consider |S ∩ A|I Has expectation |S|mnI Is 0 with 5/6 probability if |S | < n

6mI Is positive with 5/6 probability if |S | > 2n

m

Idea: Keep track of whether or not S contains any element of A for anumber of random A of different sizes.

D. Kane (Stanford) Distinct Elements February 2014 9 / 30

Page 10: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Implementation

Let h : [n]→ [n] be a random hash function.Let lsb(x) be the location of the least significant bit of x which is a 1.

Note that Ak := x ∈ [n] : lsb(h(x)) > k is a random subset of [n]of size about n2−k

Algorithm:

Maintain a counter r = maxx∈S(lsb(h(x)))

With 2/3rds probability log(|S |) = r ± 2

Thus 2r gives an approximation of |S | likely good to within a factorof 4

Requires O(log log(n)) bits of space

D. Kane (Stanford) Distinct Elements February 2014 10 / 30

Page 11: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Reducing Error: Statistics

To get a better approximation to |S |, you can run multiple copies of this inparallel:

Take K ≈ ε−2

Have K hash functions hi and record ri = maxx∈S(lsb(hi (x)))

Pick some r∗ ≈ log(|S |) (perhaps given by the value of an extra ri )

Pr(ri ≤ r∗) ≈ exp(−|S |2−r∗)

Let T = #i : ri ≤ r∗Estimator:

|S | ≈ 2r∗ log

(K

T

)

D. Kane (Stanford) Distinct Elements February 2014 11 / 30

Page 12: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Improving Runtime: One Bin per Element

In the previous algorithm, we needed to update the values of K ≈ ε−2 binsper update, causing the update time to be about ε−2. We would like toimprove this to constant time per update.Idea: Each update effects only one bin

K ≈ ε−2, have K counters Ci

Hash functions h : [n]→ [n] and g : [n]→ [K ]

When you see x , replace Cg(x) with the maximum of its current valueand lsb(h(x))

So Ci holds the value maxx∈S :g(x)=i lsb(h(x)).

D. Kane (Stanford) Distinct Elements February 2014 12 / 30

Page 13: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Analysis: Balls and Bins

Pick r∗ so that 2r∗ ≈ |S |/KHow many Ci store a value of more than r∗?

Let A := x ∈ S : lsb(h(x)) > r∗|A| = |S |2−r∗(1± ε) with reasonable probability

We need to count the number of distinct values that g takes on A

Question: |A| ≈ K balls are randomly thrown into K bins. What can wesay about the distribution of T , the number of bins with at least one ballin them?

D. Kane (Stanford) Distinct Elements February 2014 13 / 30

Page 14: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Balls and Bins

Lemma

E[T ] = K

(1−

(1− 1

K

)|A|).

Furthermore, if 100 < |A| < K/20,

Var(T ) <4|A|2

K.

Thus, if |A| ≈ K ≈ ε−2,

T = K

(1−

(1− 1

K

)|A|)(1± ε)

Or

|A| =log(1− T/K )

log(1− 1/K )(1± ε)

Which gives us an estimator for |S | = |A|2r∗(1± ε).D. Kane (Stanford) Distinct Elements February 2014 14 / 30

Page 15: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Improving Space: Store Count Differences

The previous algorithm needs to maintain K ≈ ε−2 counters of log log(n)bits each, which is more space than we would like. On the other hand,most counters store values very close to r∗.Idea: Instead store the differences between Ci and r∗

With high probability the number of counters differing from r∗ bymore than b will be O(K2−b)

Total space necessary is O(K ) = O(ε−2)

Need a Variable Bitlength Array (VBA) with constant time updates(one has been developed by Blandford and Blelloch)

Problem: Need 2r∗ to be a good approximation to |S |/K throughoutentire computation, not just at the end.

D. Kane (Stanford) Distinct Elements February 2014 15 / 30

Page 16: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

RoughEstimator

New algorithm: RoughEstimator maintain an approximation to |S | so thatwith 90% probability is correct to within a constant factor for the entirestream.

Same basic idea as above

K ≈ log2/3(n)

Hash functions h : [n]→ [n], g : [n]→ [K ]

Maintain Ci = maxx∈S:g(x)=i (lsb(h(x)))

I Total space O(log2/3(n) log log(n))

Let r∗ be the median of the Ci

|S | = Θ(2r∗K ) with probability 1− O(K−2)

Sufficient to verify correctness when |S | is a power 2

Correct at all such times with probability 1− O(log(n)−1/3)

D. Kane (Stanford) Distinct Elements February 2014 16 / 30

Page 17: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Space Efficient Version

Run RoughEstimator in parallelI Maintain r∗ with 2r∗ ≈ |S |/K

Hash functions h : [n]→ [n], g : [n]→ [K ]

Maintain Ci = maxx∈S:g(x)=i (lsb(h(x)))− r∗ in VBA

Total space O(K + log(n)) = O(ε−2 + log(n))

Constant time updates unless r∗ changes

Deamortize decrements to the Ci when r∗ increases

D. Kane (Stanford) Distinct Elements February 2014 17 / 30

Page 18: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Derandomization

Thus, we have an algorithm which uses O(ε−2 + log(n)) space and O(1)update time given access to random functions h and g . Unfortunately, wedon’t have access to these. We will need to store a description of any hashfunctions that we use, which will add to our space bound.We need to derandomize this algorithm. Namely, we need to find lessrandom families of functions (which can be described in a small number ofbits) that are still good enough for our analysis to work.

D. Kane (Stanford) Distinct Elements February 2014 18 / 30

Page 19: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

k-Independence

We will be able to do our derandomization with the help of k-wiseindependent families of hash functions. Recall:

Definition

A family H of hash functions h : U → V is k-wise independent if for anydistinct u1, . . . , uk ∈ U and any v1, . . . , vk ∈ V

Prh∈H(h(ui ) = vi for all i) =1

|V |k.

Or in other words, h acts like a completely random function whenrestricted to any k elements of U.

There are known (approximately) k-wise independent families of hashfunctions which can be stored in O(k log(|U||V |)) bits and evaluated inconstant time.

D. Kane (Stanford) Distinct Elements February 2014 19 / 30

Page 20: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Analysis

Recall that the analysis of both our main algorithm and RoughEstimatordepended on understanding the statistic

T = #i : ∃x ∈ S , lsb(h(x)) > r∗, g(x) = i.

Or in other words, letting

A = x ∈ S : lsb(h(x)) > r∗,

T = |g(A)|.Enough to show that |A| = 2−r∗ |S |(1± ε) and that

T = K(

1−(1− 1

K

)|A|)(1± ε) with large probability.

D. Kane (Stanford) Distinct Elements February 2014 20 / 30

Page 21: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

2-Wise Independent h

We claim that choosing h from a 2-wise independent family is sufficient tocontrol the size of A.

E[|A|] = |S |2−r∗

Var(|A|) = O(|S |2−r∗)

Both of these still hold under 2-independence

The bounds on |A| follow from Chebyshev’s inequality

D. Kane (Stanford) Distinct Elements February 2014 21 / 30

Page 22: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

k-Wise Independent Balls and Bins

Want to control T = |g(A)| under k-wise independence of g

Under full independence had E[T ] = K (1− (1− 1/K )|A|) andVar(T ) = O(|A|2/K ), which was enough

Need to show the above under k-wise independence

D. Kane (Stanford) Distinct Elements February 2014 22 / 30

Page 23: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

k-Wise Independent Balls and Bins

Need to show that the expected size of g(A) is approximately preserved byk-wise independence.

Lemma

Let A be a finite set and K a positive integer. Let g , g ′ : A→ [K ] befunctions with g random and g ′ chosen from a k-wise independent family.Let T = |g(A)|,T ′ = |g ′(A)|, then

∣∣E[T ]− E[T ′]∣∣ ≤ K

(|A|K

)k

/k!

Thus k-wise independence will be sufficient to preserve our expectationsfor k ≈ log(ε−1).

D. Kane (Stanford) Distinct Elements February 2014 23 / 30

Page 24: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Considering Each Bin

Let Ti be the indicator random variable for the event that i is in theimage of g , and T ′i the indicator of the event that i is in the image ofg ′

T =∑K

i=1 Ti ,T′ =

∑Ki=1 T

′i

Enough to show that

∣∣E[Ti ]− E[T ′i ]∣∣ ≤ ( |A|

K

)k

/k!

D. Kane (Stanford) Distinct Elements February 2014 24 / 30

Page 25: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Approximate Inclusion-Exclusion

For each x ∈ A, let Bx be the event that g ′(x) = i .Then T ′i is the indicator function of the event B =

∨x∈A Bx .

Theorem (Inclusion-Exclusion)

Pr(B) =

|A|∑t=1

(−1)t+1∑

x1,...,xt⊂A

Pr(Bx1 ∧ Bx2 ∧ . . . ∧ Bxt ).

D. Kane (Stanford) Distinct Elements February 2014 25 / 30

Page 26: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Approximate Inclusion-Exclusion

For each x ∈ A, let Bx be the event that g ′(x) = i .Then T ′i is the indicator function of the event B =

∨x∈A Bx .

Theorem (Approximate Inclusion-Exclusion)

If m is odd,

Pr(B) ≤m∑t=1

(−1)t+1∑

x1,...,xt⊂A

Pr(Bx1 ∧ Bx2 ∧ . . . ∧ Bxt ).

If m is even,

Pr(B) ≥m∑t=1

(−1)t+1∑

x1,...,xt⊂A

Pr(Bx1 ∧ Bx2 ∧ . . . ∧ Bxt ).

D. Kane (Stanford) Distinct Elements February 2014 25 / 30

Page 27: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Bounds on Pr(B)

Thus, E[T ′i ] = Pr(B) is bounded between

k−1∑t=1

(−1)t+1∑

x1,...,xt⊂A

Pr(Bx1 ∧ Bx2 ∧ . . . ∧ Bxt ),

andk∑

t=1

(−1)t+1∑

x1,...,xt⊂A

Pr(Bx1 ∧ Bx2 ∧ . . . ∧ Bxt ).

Both of these bounds are determined by k-independence, and they differ by

∑x1,...,xk⊂A

Pr(Bx1 ∧ Bx2 ∧ . . . ∧ Bxk ) =

(|A|k

)K−k ≤

(|A|K

)k

/k!

Since g is also k-wise independent, E[Ti ] and E[T ′i ] differ by at most thismuch.

D. Kane (Stanford) Distinct Elements February 2014 26 / 30

Page 28: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Composition

Unfortunately, there may not be enough space to store anlog(ε−1)-wise independent hash function from [n] to [K ].

To fix this let g1 : [n]→ [K 3] be 2-independent and g2 : [K 3]→ [K ]be log(ε−1)-wise independent and let g = g2 g1.

I With probability 1− 1/K , |g1(A)| = |A|I If this holds, our analysis for |g2(g1(A))| works

D. Kane (Stanford) Distinct Elements February 2014 27 / 30

Page 29: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Final Algorithm: RoughEstimator

Initialize:I Let K ≈ log2/3(n), k ≈ log(log(n))I Let h : [n]→ [n] be 2-independent, g1 : [n]→ [K 3] be 2-independent

and g2 : [K 3]→ [K ] k-independent, g = g2 g1I Initialize K counters of log log(n) bits each Ci at 0

Update(x) : Cg(x) ← max(Cg(x), lsb(h(x)))

Query: Return 2median(Ci )K .

D. Kane (Stanford) Distinct Elements February 2014 28 / 30

Page 30: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Final Algorithm

Initialize:I Let K ≈ ε−2, k ≈ log(ε−1)I Let h : [n]→ [n] be 2-independent, g1 : [n]→ [K 3] be 2-independent

and g2 : [K 3]→ [K ] k-independent, g = g2 g1I Initialize a copy of RoughEstimatorI Initialize a VBA storing C1, . . . ,CK , all initially 0. Initialize r = 0.

Update(x) :I Cg(x) ← max(Cg(x), lsb(h(x))− r)I Update(RoughEstimator)I Let r ′ ← blog(Query(RoughEstimator)/K )cI If r ′ 6= r : Ci ← Ci + r − r ′, r ← r ′

Query:I Let T = #i : Ci > 0I Return

2r log(1− T/K )

log(1− 1/K )

D. Kane (Stanford) Distinct Elements February 2014 29 / 30

Page 31: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

Conclusions

We have produced a streaming algorithm to compute approximations tothe number of distinct elements in a data stream with asymptoticallyoptimal space and update time. Similar techniques also yield a nearlyoptimal algorithm for L0 moment estimation (where updates can removecopies elements from the stream as well as add them), although thecorrect space bound for that problem remails open.

D. Kane (Stanford) Distinct Elements February 2014 30 / 30

Page 32: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

A. Akella, A. Bharambe, M. Reiter, and S. Seshan Detecting DDoSattacks on ISP networks In Proc. MPDS, 2003.

N. Alon, Y. Matias, and M. Szegedy The Space Complexity ofApproximating the Frequency Moments J. Comput. Syst. Sci.,58(1):137–147, 1999.

Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, and L. TrevisanCounting distinct elements in a data stream In Proc. RANDOM,pages 1–10, 2002.

D. K. Blandford and G. E. Blelloch Compact dictionaries forvariable-length keys and data with applications ACM Trans. Alg.,4(2), 2008.

P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y.Sismanis Toward automated large-scale information integration anddiscovery In Data Management in a Connected World, pages 161–180,2005.

D. Kane (Stanford) Distinct Elements February 2014 30 / 30

Page 33: An Optimal Algorithm for the Distinct Elements Problemcseweb.ucsd.edu/~dakane/DistinctElementsTalk.pdfAn Optimal Algorithm for the Distinct Elements Problem Daniel M. Kane Department

P. Flajolet and G. N. Martin Probabilistic counting algorithms for database applications J. Comput. Syst. Sci., 31(2):182–209, 1985.

P. B. Gibbons and S. Tirthapura Estimating simple functions on theunion of data streams In Proc. SPAA, pages 281–291, 2001.

P. Indyk and D. P. Woodruff Tight lower bounds for the distinctelements problem In Proc. FOCS, pages 283–, 2003.

Daniel M. Kane, Jelani Nelson, David P. Woodruff An OptimalAlgorithm for the Distinct Elements Problem, Symposium onPrinciples of Database Systems (PODS) 2010.

P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T.G. Price Access path selection in a relational database managementsystem In SIGMOD, pages 23–34, 1979.

D. P. Woodruff Optimal space lower bounds for all frequencymoments In Proc. SODA, pages 167–175, 2004.

D. Kane (Stanford) Distinct Elements February 2014 30 / 30