
Page 1

The Average Case Complexity of Counting Distinct Elements

David Woodruff

IBM Almaden

Page 2

Problem Description

• Given a data stream of n insertions of records, count the number F0 of distinct records

• One pass over the data stream

• Algorithms must use a small amount of memory and have fast update time
– too expensive to store the set of distinct records
– implies algorithms must be randomized and must settle for an approximate solution: output F ∈ [(1-ε)F0, (1+ε)F0] with constant probability
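For orientation (our own illustration, not part of the talk), here is the exact, memory-hungry baseline that the streaming constraints rule out; the function name is ours:

```python
# Exact F0 by storing every distinct record: one pass and O(1) updates,
# but the set grows with F0, which is exactly what we cannot afford.
def exact_f0(stream):
    seen = set()
    for record in stream:
        seen.add(record)
    return len(seen)

print(exact_f0([3, 1, 4, 1, 5, 9, 2, 6, 5, 3]))  # prints 7
```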

Page 3

The Data Stream

• How is the data in the stream organized?
– Usually, worst-case order is assumed

• In this case, Θ(1/ε²) bits are necessary and sufficient to estimate F0
– As the quality of approximation improves to, say, ε = 1%, the quadratic dependence is a major drawback
– Sometimes a random ordering can be assumed

• Suppose we are mining salaries, and they are ordered alphabetically by surname. If there is no correlation between salary and surname, then the stream of salaries is ordered randomly

• The backing sample architecture assumes data randomly ordered by design (Gibbons, Matias, Poosala)

• This model is referred to as the Random-Order Model (Guha, McGregor)

• Unfortunately, even in this model, we still need Ω(1/ε²) bits to estimate F0 (Chakrabarti, Cormode, McGregor). Intuitively, this is because the data itself is still worst-case.

Page 4

Random Data Model

• In an attempt to bypass the Ω(1/ε²) bound, we propose to study the case when the data comes from an underlying distribution
– Problem 3 of Muthukrishnan's book: "Provide improved estimates for Lp sums including distinct element estimation if the input stream has statistical properties such as being Zipfian."

• There is a distribution defined by probabilities p_i, 1 ≤ i ≤ m, with Σ_i p_i = 1.

• The next item in the stream is chosen independently of previous items, and is i with probability p_i.

• We call this the Random-Data Model.

• The Random-Data Model is contained in the Random-Order Model.
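As a minimal sketch (our own illustration; the helper name and parameter values are assumptions), a stream in the Random-Data Model can be generated as follows:

```python
import random

def random_data_stream(p, n, rng=random):
    """Draw n i.i.d. samples from the distribution on {1, ..., m}
    given by probabilities p = [p_1, ..., p_m], which sum to 1."""
    items = list(range(1, len(p) + 1))
    return rng.choices(items, weights=p, k=n)

# The case studied later: uniform over a d-element subset of [m].
m, d, n = 1000, 100, 500
p = [1.0 / d] * d + [0.0] * (m - d)
stream = random_data_stream(p, n)
print(len(set(stream)))  # F0 of the realized stream
```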

Page 5

Random Data Model

• This model for F0 was implicitly studied before:

– by Motwani and Vassilvitskii, when the distribution is Zipfian; this distribution is useful for estimating WebGraph statistics

– sampling-based algorithms used in practice impose distributional assumptions, without which they have poor performance (Charikar et al.)

– the Generalized Inverse Gaussian Poisson (GIGP) model studies sampling-based estimators for uniform and Zipfian distributions

– by Guha and McGregor, for estimating the density function of an unknown distribution, which is useful in learning theory

Page 6

Further Restriction

• We focus on the case when each (nonzero) probability p_i = 1/d for an unknown value of d (so the distribution is uniform over a subset of [m])

• Captures the setting of sampling with replacement from a set of unknown cardinality

• For a certain range of d, we show that one can beat the space lower bound that holds for adversarial data and randomly-ordered data

• For another choice of d, we show the lower bound for adversarial and randomly-ordered data also applies in this setting

• The distribution is fairly robust, in the sense that other distributions with a few heavy items and remaining items that are approximately uniform have the same properties as above

Page 7

Our Upper Bound

• 1-pass algorithm using an expected O(d(log 1/ε)/(nε²) + log m) bits of space, whenever d ≥ 1/ε² and d ≤ n. The per-item processing time is constant.

• Recall the distribution is uniform over a d-element subset of [m], and we see n samples from it, so this is a typical setting of parameters.

• Notice that for n even slightly larger than d, the algorithm does much better than the Ω(1/ε²) lower bound in other data stream models.

• One can show that, for every known algorithm for F0 in the adversarial model with its particular space/time tradeoff, our algorithm is either better in space or better in time.

Page 8

Our Lower Bound

• Our main technical result is that if n and d are Θ(1/ε²), then even estimating F0 in the random data model requires Ω(1/ε²) bits of space

• This lower bound subsumes previous lower bounds, showing that even for a natural (random) choice of data, the problem is hard

• Our choice of distribution for showing the lower bound was used in subsequent work by Chakrabarti and Brody, where it turned out to be useful for establishing an Ω(1/ε²) lower bound for constant-pass algorithms for estimating F0

Page 9

Techniques: Upper Bound

• Very simple observation:
– Since d ≤ n, each item should have frequency about n/d in the stream
– If n/d is at least Ω(1/ε²), we can just compute n and the frequency of the first item in the stream to get a (1+ε)-approximation to d
– Using a balls-and-bins occupancy bound of Kamath et al., a good estimate of d implies a good estimate of F0

• If n/d is less than 1/ε², we could instead store the first O(1/ε²) distinct items (hashed appropriately for small space), treat them as a set S, and count the number of items in the remaining part of the stream that land in S
– Correct, but unnecessary if d is much less than n
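A minimal sketch of the first observation (our own illustration, assuming the uniform-over-d model with n/d ≥ 1/ε²):

```python
def estimate_d_by_first_frequency(stream):
    """Under the uniform model the first item appears ~ n/d times,
    so n divided by that frequency concentrates around d."""
    first, count, n = None, 0, 0
    for item in stream:
        n += 1
        if first is None:
            first = item
        if item == first:
            count += 1
    return n / count  # relative error ~ sqrt(d/n) <= eps once n/d >= 1/eps^2
```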

Page 10

Techniques: Upper Bound

• Instead, record the first item x in the stream, and find the position i of the second occurrence of x in the stream.

• The second occurrence should appear at roughly the d-th position in the stream, so i provides a constant-factor approximation to d

• Since n = Ω(d), position i should be in the first half of the stream with large constant probability.

• Now store the first i/(nε²) distinct stream elements in the second half of the stream, treat them as a set S, and count the remaining items in the stream that occur in S.
– Good enough for a (1+ε)-approximation
– Space is O(log n + (i log m)/(nε²)) ≈ O(log n + (d log m)/(nε²))
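A minimal, illustrative sketch of this two-phase scheme (the helper name, tie-breaking, and constants are our assumptions, not the paper's exact procedure):

```python
def estimate_d_two_phase(stream, n, eps):
    """Phase 1: i = position of the second occurrence of the first item,
    a constant-factor proxy for d. Phase 2: collect ~ i/(n*eps^2) distinct
    items of the second half as S and count later hits on S."""
    first = stream[0]
    i = next(t for t in range(1, n) if stream[t] == first)  # 2nd occurrence

    target = max(1, int(i / (n * eps * eps)))  # |S|, up to constants
    S, hits, later = set(), 0, 0
    for item in stream[n // 2:]:
        if len(S) < target:
            S.add(item)
        else:
            later += 1
            hits += item in S
    # Each sample from the uniform support lands in S w.p. |S|/d, so:
    return len(S) * later / max(hits, 1)  # estimate of d; F0 then follows
                                          # via the occupancy bound
```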

Page 11

Techniques: Upper Bound

• Space is O(log n + (d log m)/(nε²)), but we can do better.

• For each j in [m], sample j independently with probability 1/(iε²). In expectation, the new distribution is uniform over d/(iε²) items. If j is sampled, say that j survives.

• Go back to the previous step: store the first i/(nε²) distinct surviving stream elements in the second half of the stream, treat them as a set S, and count the remaining items in the stream that occur in S.
– Since only Θ(1/ε²) items survive, we can store S with only (i log 1/ε)/(nε²) bits by hashing item IDs down to a range of size, say, 1/ε⁵

– We estimate the distribution's support size in the sub-sampled stream, which is roughly d/(iε²). We can get a (1+ε)-approximation to this quantity provided it is at least Ω(1/ε²), which it is with high probability. Then scale by iε² to estimate d, and thus F0 by the previous reasoning.

– Constant update time
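A minimal sketch of the survival and fingerprinting steps, using a cryptographic hash as a stand-in for the algorithm's (unspecified) hash functions; the names and constants are ours:

```python
import hashlib

def _h(seed, j):
    """64-bit pseudorandom value for item j."""
    digest = hashlib.sha256(f"{seed}:{j}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def survives(j, i, eps):
    """Item j survives independently with probability ~ 1/(i * eps^2)."""
    rate = min(1.0, 1.0 / (i * eps * eps))
    return _h("sample", j) / 2.0**64 < rate

def fingerprint(j, eps):
    """Hash surviving IDs into a range of size ~ 1/eps^5, so each of the
    ~ 1/eps^2 stored survivors costs O(log 1/eps) bits instead of log m."""
    return _h("fp", j) % int(1 / eps**5)
```

Only surviving items are fingerprinted and placed in S; the support size of the sub-sampled stream is then estimated as on the slide and rescaled by iε².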

Page 12

Techniques: Lower Bound

Consider the uniform distribution μ on [d] = {1, 2, …, d}, where d and n are Θ(1/ε²).

Consider a stream of n samples from μ, where we choose n so that 1 - (1 - 1/d)^(n/2) ∈ [1/3, 2/3]

Let X be the characteristic vector of the first n/2 stream samples on the universe [d]. So X_i = 1 if and only if item i occurs in these samples.

Let Y be the characteristic vector of the second n/2 stream samples on the universe [d]. So Y_i = 1 if and only if item i occurs in these samples.

Let wt(X), wt(Y) be the number of ones in the vectors X, Y.

We consider a communication game. Alice is given X and wt(Y), while Bob is given Y and wt(X), and they want to know whether

Δ(X,Y) ≥ wt(X) + wt(Y) - 2·wt(X)·wt(Y)/d,

where Δ denotes Hamming distance.

Alice and Bob should solve this problem with large probability, where the probability is over the choice of X and Y, and their coin tosses (which can be fixed).
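A small simulation of this input distribution (our own sketch; the parameter choices are arbitrary):

```python
import random

def game_instance(d, n, rng):
    """X and Y are characteristic vectors of two independent halves
    of n/2 uniform samples each from [d]."""
    X, Y = [0] * d, [0] * d
    for _ in range(n // 2):
        X[rng.randrange(d)] = 1
        Y[rng.randrange(d)] = 1
    return X, Y

d = 400
n = 2 * int(0.693 * d)   # makes 1 - (1 - 1/d)^(n/2) close to 1/2
X, Y = game_instance(d, n, random.Random(0))
wx, wy = sum(X), sum(Y)
delta = sum(x != y for x, y in zip(X, Y))  # Hamming distance
tau = wx + wy - 2 * wx * wy / d            # the game's threshold
print(delta >= tau)                        # the bit to be computed
```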

Page 13

Techniques: Lower Bound

Strategy:

(1) show the space complexity S of the streaming algorithm is at least the one-way communication complexity CC of the game

(2) lower bound CC

Theorem: S ≥ CC

Proof: Alice runs the streaming algorithm on a random stream a_X generated from her characteristic vector X. Alice transmits the state to Bob.

Bob continues the computation of the streaming algorithm on a random stream a_Y generated from his characteristic vector Y.

At the end, the algorithm estimates the F0 of a stream whose elements are in the support of X or Y (or both).

Notice that the two halves of the stream are independent in the random data model, so the stream generated has the right distribution

Page 14

Techniques: Lower Bound

We show that the estimate of F0 can be used to solve the communication game. Recall that the communication game is to decide whether

Δ(X,Y) ≥ wt(X) + wt(Y) - 2·wt(X)·wt(Y)/d

Note that the quantity on the right is the expected value of Δ(X,Y).

Some Lemmas:

(1) Pr[d/4 ≤ wt(X), wt(Y) ≤ 3d/4] = 1 - o(1)

(2) Consider the variable X', distributed as X but conditioned on wt(X) = k, and the variable Y', distributed as Y but conditioned on wt(Y) = r.
– X' and Y' are uniform over weight-k and weight-r bit strings, respectively
– Choose k, r to be integers in [d/4, 3d/4]

Then for any constant δ > 0, there is a constant α > 0 such that Pr[|Δ(X',Y') - E[Δ(X',Y')]| ≥ α√d] ≥ 1 - δ

This follows from the standard deviation of the hypergeometric distribution.
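For completeness, a short worked derivation (standard, not on the slides) of the expectation that matches the game's threshold:

```latex
% Each coordinate of X' is 1 with probability k/d and of Y' with
% probability r/d, independently across the two strings, so
\begin{align*}
\Pr[X'_\ell \ne Y'_\ell]
  &= \tfrac{k}{d}\bigl(1-\tfrac{r}{d}\bigr)
   + \bigl(1-\tfrac{k}{d}\bigr)\tfrac{r}{d}
   = \tfrac{k}{d} + \tfrac{r}{d} - \tfrac{2kr}{d^2},\\
\mathbb{E}[\Delta(X',Y')]
  &= \sum_{\ell=1}^{d}\Pr[X'_\ell \ne Y'_\ell]
   = k + r - \tfrac{2kr}{d}.
\end{align*}
```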

Page 15

Techniques: Lower Bound

• Δ(X,Y) = 2·F0(a_X ∘ a_Y) - wt(X) - wt(Y)

• Note that Bob has wt(X) and wt(Y), so he can compute τ = wt(X) + wt(Y) - 2·wt(X)·wt(Y)/d

• If F is the output of the streaming algorithm, Bob simply computes 2F - wt(X) - wt(Y) and checks if it is greater than τ

• F ∈ [(1-ε)F0, (1+ε)F0] with large constant probability

• 2F - wt(X) - wt(Y) ∈ [2F0 - wt(X) - wt(Y) - 2εF0, 2F0 - wt(X) - wt(Y) + 2εF0]

= [Δ(X,Y) - 2εF0, Δ(X,Y) + 2εF0]

= [Δ(X,Y) - Θ(εd), Δ(X,Y) + Θ(εd)], since F0 = Θ(d)

= [Δ(X,Y) - Θ(√d), Δ(X,Y) + Θ(√d)], since ε = Θ(1/√d)
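A quick numeric check of the identity and Bob's statistic, continuing the earlier simulation sketch (X, Y, wx, wy, delta, tau as defined there; eps here is an arbitrary stand-in):

```python
# Distinct elements of the combined stream = support of X or Y (or both).
F0 = sum(1 for x, y in zip(X, Y) if x or y)
assert delta == 2 * F0 - wx - wy      # the identity on this slide

eps = d ** -0.5                       # the regime d = Theta(1/eps^2)
F = (1 + eps) * F0                    # a worst-case (1+eps)-approximation
bob = 2 * F - wx - wy                 # Bob's derived statistic
print(bob > tau, delta >= tau)        # agree unless Delta is within
                                      # Theta(sqrt(d)) of tau
```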

Page 16

Techniques: Lower Bound

• By our lemmas, and using that E[Δ(X,Y)] = τ, with large constant probability either Δ(X,Y) > τ + α√d or Δ(X,Y) < τ - α√d

• By the previous slide, Bob's value 2F - wt(X) - wt(Y) is in the range

[Δ(X,Y) - Θ(√d), Δ(X,Y) + Θ(√d)]

• If Δ(X,Y) > τ + α√d, then 2F - wt(X) - wt(Y) > τ + α√d - Θ(√d) ≥ τ

• If Δ(X,Y) < τ - α√d, then 2F - wt(X) - wt(Y) < τ - α√d + Θ(√d) ≤ τ

(choosing the constant in ε small enough that the Θ(√d) error term is at most α√d)

• So Bob can use the output of the streaming algorithm to solve the communication problem, hence S ≥ CC

Page 17

Lower Bounding CC

Communication game: Alice is given X and wt(Y), while Bob is given Y and wt(X), and they want to know whether Δ(X,Y) ≥ wt(X) + wt(Y) - 2·wt(X)·wt(Y)/d

Here X, Y are distributed as independent characteristic vectors of streams of n/2 samples each, where

1 - (1 - 1/d)^(n/2) ∈ [1/3, 2/3]

As before, Pr[d/4 ≤ wt(X), wt(Y) ≤ 3d/4] = 1 - o(1)

By averaging, a correct protocol is also correct with large probability for fixed weights i and j in [d/4, 3d/4], so we can assume X is a random string of Hamming weight i, and Y a random string of Hamming weight j.

Page 18

Lower Bounding CC

[Figure: a 0/1 matrix with rows indexed by the strings X such that wt(X) = i and columns indexed by the strings Y such that wt(Y) = j]

With large probability, the message M that Alice sends has the property that many different X cause Alice to send M. Say such an M is large.

Since i, j ∈ [d/4, 3d/4], one can show that the fraction of 1s in each row is in [1/2 - o(1), 1/2 + o(1)]

Recall that an entry is 1 if and only if Δ(X,Y) ≥ i + j - 2ij/d

We show that for any large M, the fraction of 1s in most of the columns is in, say, [1/10, 9/10]. Then Bob doesn’t know what to output.

Since each row is roughly balanced, the expected fraction of 1s in each column is in [1/2-o(1), 1/2+o(1)].

Page 19

Lower Bounding CC

[Figure: a 0/1 matrix whose first rows are identical, so several columns are almost entirely 1s or entirely 0s even though every row is balanced]

Since each row is roughly balanced, the expected fraction of 1s in each column is in [1/2-o(1), 1/2+o(1)].

But the variance could be huge, e.g., the matrix may look like this, in which case Bob can easily output the answer.

Page 20

Lower Bounding CC

• We can show this doesn't happen using the second-moment method.

• Let V_Y be the fraction of 1s in the column indexed by Y.

• Let S be the set of X which cause Alice to send a large message M.

• Consider a random Y. Let C_u = 1 if and only if Δ(u,Y) > τ. So V = V_Y = (1/|S|) Σ_{u ∈ S} C_u

• Var[V] = (1/|S|²) Σ_{u,v ∈ S} (E[C_u C_v] - E[C_u]E[C_v])

• Notice that E[C_u]E[C_v] ∈ [1/4 - o(1), 1/4 + o(1)],

while E[C_u C_v] = Pr[Δ(u,Y) > τ | Δ(v,Y) > τ] · Pr[Δ(v,Y) > τ] ≈ Pr[Δ(u,Y) > τ | Δ(v,Y) > τ] · 1/2.

• Since S is a large set, most pairs u, v ∈ S have Hamming distance in [d/5, 4d/5].
– A technical lemma shows that Pr[Δ(u,Y) > τ | Δ(v,Y) > τ] is then a constant strictly less than 1.

• Hence, E[C_u C_v] is a constant strictly less than 1/2, and since this happens for most pairs u, v, we get that Var[V] is a constant strictly less than 1/4.
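An illustrative Monte-Carlo sanity check of this variance bound for a random (rather than adversarial) set S; all parameters are arbitrary choices of ours:

```python
import random

rng = random.Random(1)
d, i, j, trials = 200, 100, 100, 500
tau = i + j - 2 * i * j / d

def rand_weight(w):
    """Uniform 0/1 string of length d with exactly w ones."""
    v = [1] * w + [0] * (d - w)
    rng.shuffle(v)
    return v

S = [rand_weight(i) for _ in range(50)]  # stand-in for a large message's set
Vs = []
for _ in range(trials):
    Y = rand_weight(j)
    Vs.append(sum(sum(a != b for a, b in zip(u, Y)) > tau for u in S) / len(S))
mean = sum(Vs) / trials
var = sum((v - mean) ** 2 for v in Vs) / trials
print(mean, var)  # mean near 1/2, variance well below 1/4
```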

Page 21

Lower Bounding CC

The fraction of 1s in a random column is just V = (1/|S|) Σ_{u ∈ S} C_u

Let κ be a small positive constant. By Chebyshev's inequality,

Pr[|V - 1/2| > 1/2 - κ] ≤ Pr[|V - E[V]| > 1/2 - κ - o(1)] ≤ Var[V]/(1/2 - κ)² + o(1)

But we showed Var[V] is a constant strictly less than 1/4, so this probability is a constant strictly less than 1 for small enough κ.

Hence, for a random column Y, the fraction of 1s is in [κ, 1 - κ] with constant probability.

It follows that with some constant probability Bob outputs the wrong answer.

Hence, most of Alice's messages must be small, so one can show that there must be 2^Ω(d) of them, so the communication is Ω(d) = Ω(1/ε²).

Page 22

Conclusions

• Introduced the random data model, and studied F0-estimation under distributions uniform over a subset of d items

• For a certain range of d, we showed that one can beat the space lower bound that holds for adversarial data and randomly-ordered data

• For another choice of d, we showed that the lower bound for adversarial and randomly-ordered data also applies in this setting

• Are there other natural distributions that admit more space-efficient algorithms in this model? Are there other useful ways of bypassing the Ω(1/ε²) lower bound?