Tight Bounds for Distributed Functional Monitoring David Woodruff IBM Almaden Qin Zhang Aarhus University MADALGO Based on a paper in STOC, 2012

Page 1:

Tight Bounds for Distributed Functional Monitoring

David Woodruff

IBM Almaden

Qin Zhang

Aarhus University

MADALGO

Based on a paper in STOC, 2012

Page 2:

k-party Number-In-Hand Model

[Diagram: players $P_1, P_2, P_3, P_4, \ldots, P_k$ hold inputs $x_1, x_2, x_3, x_4, \ldots, x_k$, respectively]

Goals:

- compute a function $f(x_1, \ldots, x_k)$

- minimize communication complexity

- Player-to-player communication

- Protocol transcript always determines who speaks next

Page 3:

k-party Number-In-Hand Model

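[Diagram: a coordinator $C$ connected to players $P_1, P_2, P_3, \ldots, P_k$, who hold inputs $x_1, x_2, x_3, \ldots, x_k$]

Convenient to introduce a “coordinator” $C$

All communication goes through the coordinator

Communication is only affected by a factor of 2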
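The factor of 2 can be seen with a small cost count. The sketch below is illustrative only: the `Message` tuple and both function names are hypothetical, and it assumes the coordinator can infer each message's recipient from the transcript so far, so no addressing bits are counted.

```python
from typing import List, Tuple

# (sender, receiver, payload_bits); a hypothetical record of one message.
Message = Tuple[int, int, int]

def point_to_point_cost(messages: List[Message]) -> int:
    """Total bits when players talk to each other directly."""
    return sum(bits for _, _, bits in messages)

def coordinator_cost(messages: List[Message]) -> int:
    """Total bits when every message is sent to C and then forwarded by C.

    Each payload crosses two links (sender -> C, C -> receiver),
    so the cost is exactly doubled under the stated assumption.
    """
    return sum(2 * bits for _, _, bits in messages)

if __name__ == "__main__":
    msgs = [(1, 2, 5), (2, 3, 7)]
    print(point_to_point_cost(msgs), coordinator_cost(msgs))
```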

Page 4:

Model Motivation

• Data distributed and stored in the cloud

– Impractical to put data on a single device

• Sensor networks

– Communication is power-intensive

• Network routers

– Bandwidth limitations

• Distributed functional monitoring

Authors: Can, Cormode, Huang, Muthukrishnan, Patt-Shamir, Shafrir, Tirthapura, Wang, Yi, Zhao, …

Page 5:

k-Party Number-In-Hand Model

[Diagram: a coordinator $C$ connected to players $P_1, P_2, P_3, \ldots, P_k$, who hold inputs $x_1, x_2, x_3, \ldots, x_k$]

Which functions do we care about?

- $\forall i,\ x_i \in \{0, 1, \ldots, n\}^n$

- $x = x_1 + x_2 + \cdots + x_k$

- $f(x) = |x|_p = \left(\sum_{j} x(j)^p\right)^{1/p}$, where $x(j)$ is the $j$-th coordinate of $x$

- $|x|_0$ is the number of non-zero coordinates

- Talk will focus on $|x|_0$ and $|x|_2$

For distributed databases:

$|x|_0$ is the number of distinct elements

$|x|_2^2$ is known as the self-join size

$|x|_2$ is useful for regression and low-rank approximation

Important for applications that the $x_i$ are non-negative
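As a concrete illustration of these definitions, the following sketch sums the players' vectors and evaluates the norms. The function names are hypothetical, and the centralized computation is only meant to show what is being estimated, not how a protocol would compute it.

```python
from typing import List

def sum_vector(inputs: List[List[int]]) -> List[int]:
    """x = x_1 + x_2 + ... + x_k, coordinate by coordinate."""
    return [sum(column) for column in zip(*inputs)]

def norm0(x: List[int]) -> int:
    """|x|_0: number of non-zero coordinates (distinct elements)."""
    return sum(1 for v in x if v != 0)

def norm_p(x: List[int], p: float) -> float:
    """|x|_p = (sum_j x(j)^p)^(1/p) for p > 0."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def self_join_size(x: List[int]) -> int:
    """|x|_2^2, computed exactly with integer arithmetic."""
    return sum(v * v for v in x)

if __name__ == "__main__":
    players = [[1, 0, 2, 0], [0, 0, 1, 0], [3, 0, 0, 0]]
    x = sum_vector(players)          # [4, 0, 3, 0]
    print(norm0(x), norm_p(x, 2), self_join_size(x))
```

With the three example vectors above, the sum is (4, 0, 3, 0), so there are 2 distinct elements, the self-join size is 25, and the Euclidean norm is 5.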

Page 6:

Randomized Communication Complexity

• What is the randomized communication cost of $f$?

• i.e., the minimal cost of a protocol which, for every set of inputs, fails to compute $f$ with probability < 1/3

• $\Omega(n)$ cost for computing $|x|_0$ and $|x|_2$ exactly

• Reduction from 2-Player Set-Disjointness (DISJ)

– Alice has a set $S \subseteq [n]$

– Bob has a set $T \subseteq [n]$

– Either $|S \cap T| = 0$ or $|S \cap T| = 1$

– $|S \cap T| = 1 \Rightarrow \mathrm{DISJ}(S,T) = 1$, $|S \cap T| = 0 \Rightarrow \mathrm{DISJ}(S,T) = 0$

– [KS, R] $\Omega(n)$ communication

• Prohibitive
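One standard way to see why exact norms inherit the DISJ bound is sketched below; it may differ in detail from the reduction the slide has in mind. Alice and Bob use the characteristic vectors of $S$ and $T$ as inputs, and the set sizes $|S|$, $|T|$ (cheap to exchange) recover the intersection size from either norm. All function names are hypothetical.

```python
from typing import List, Set

def char_vector(s: Set[int], n: int) -> List[int]:
    """Characteristic vector of s, a subset of {1, ..., n}."""
    return [1 if j in s else 0 for j in range(1, n + 1)]

def disj(s: Set[int], t: Set[int]) -> int:
    """DISJ under the promise |S ∩ T| ∈ {0, 1}."""
    size = len(s & t)
    assert size in (0, 1), "promise violated"
    return size

def disj_via_norm0(s: Set[int], t: Set[int], n: int) -> int:
    """|x|_0 = |S| + |T| - |S ∩ T| for x = char(S) + char(T)."""
    x = [a + b for a, b in zip(char_vector(s, n), char_vector(t, n))]
    support = sum(1 for v in x if v != 0)
    return len(s) + len(t) - support

def disj_via_norm2_squared(s: Set[int], t: Set[int], n: int) -> int:
    """|x|_2^2 = |S| + |T| + 2 |S ∩ T| for the same x."""
    x = [a + b for a, b in zip(char_vector(s, n), char_vector(t, n))]
    squared = sum(v * v for v in x)
    return (squared - len(s) - len(t)) // 2
```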

Page 7:

Approximate Answers

Compute a relation with probability > 2/3:

$f(x) \in (1 \pm \varepsilon)\,|x|_0$

$f(x) \in (1 \pm \varepsilon)\,|x|_2$

What is the randomized communication cost as a function of k, ε, and n?

Will ignore log(nk/ε) factors

Understanding dependence on ε is critical, e.g., ε<.01

Page 8:

Previous Results

• $|x|_0$: $\Omega(k + \varepsilon^{-2})$ and $O(k \cdot \varepsilon^{-2})$

• $|x|_2$: $\Omega(k + \varepsilon^{-2})$ and $O(k \cdot \varepsilon^{-2})$

Page 9:

Our Results

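• $|x|_0$: lower bound improved from $\Omega(k + \varepsilon^{-2})$ to $\Omega(k \cdot \varepsilon^{-2})$, matching the $O(k \cdot \varepsilon^{-2})$ upper bound

• $|x|_2$: lower bound improved from $\Omega(k + \varepsilon^{-2})$ to $\Omega(k \cdot \varepsilon^{-2})$, matching the $O(k \cdot \varepsilon^{-2})$ upper bound

First lower bounds to depend on the product of $k$ and $\varepsilon^{-2}$

Implications for data streams:

- First tight space lower bound for estimating the number of distinct elements without using the Gap-Hamming Problem

- Improves lower bound for estimation of $|x|_p$, $p > 2$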

Page 10:

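Previous Lower Bounds

• Lower bounds for $|x|_0$ and $|x|_2$

• [CMY] $\Omega(k)$

• [ABC] $\Omega(\varepsilon^{-2})$

• Reduction from Gap-Orthogonality (GAP-ORT)

• $P_1$, $P_2$ have $u, v \in \{0,1\}^{\varepsilon^{-2}}$, respectively

• $|\Delta(u, v) - 1/(2\varepsilon^2)| < 1/\varepsilon$ or $|\Delta(u, v) - 1/(2\varepsilon^2)| > 2/\varepsilon$

• [CR, S] $\Omega(\varepsilon^{-2})$ communication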
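To make the GAP-ORT promise concrete, here is a minimal sketch that classifies a pair of bit vectors. It assumes $\Delta$ denotes Hamming distance and that $1/\varepsilon$ is an integer passed as `inv_eps`; the function names and the "near"/"far" labels are hypothetical.

```python
from typing import List, Optional

def hamming_distance(u: List[int], v: List[int]) -> int:
    """Number of coordinates in which u and v differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

def gap_ort_case(u: List[int], v: List[int], inv_eps: int) -> Optional[str]:
    """Classify (u, v) against the GAP-ORT promise.

    Returns "near" if |Δ(u,v) - 1/(2ε²)| < 1/ε,
            "far"  if |Δ(u,v) - 1/(2ε²)| > 2/ε,
            None   if the pair falls in the don't-care gap.
    Everything is scaled by 2 so that only integers are compared.
    """
    m = inv_eps ** 2                      # vector length 1/ε²
    assert len(u) == m and len(v) == m
    twice_deviation = abs(2 * hamming_distance(u, v) - m)
    if twice_deviation < 2 * inv_eps:
        return "near"
    if twice_deviation > 4 * inv_eps:
        return "far"
    return None
```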

Page 11:

Talk Outline

• Lower Bounds

– $|x|_0$

– $|x|_2$

Page 12:

Lower Bound for $|x|_0$

• Improve bound to optimal $\Omega(k \cdot \varepsilon^{-2})$

• Study a simpler problem: k-GAP-THRESH

– Each player $P_i$ holds a bit $Z_i$

– $Z_i$ are i.i.d. Bernoulli($\beta$)

– Decide if $\sum_{i=1}^{k} Z_i > \beta k + (\beta k)^{1/2}$ or $\sum_{i=1}^{k} Z_i < \beta k - (\beta k)^{1/2}$

Otherwise don’t care

• Rectangle property: for any correct protocol transcript $\tau$, $Z_1, Z_2, \ldots, Z_k$ are independent conditioned on $\tau$
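The k-GAP-THRESH problem can be sketched as a sampler plus a reference answer. This is only an illustration of the problem statement: the function names are hypothetical, and the return convention (1 for the upper case, 0 for the lower case, `None` for don't care) is an assumption.

```python
import math
import random
from typing import List, Optional

def sample_instance(k: int, beta: float, rng: random.Random) -> List[int]:
    """Draw Z_1, ..., Z_k i.i.d. Bernoulli(beta)."""
    return [1 if rng.random() < beta else 0 for _ in range(k)]

def gap_thresh_answer(z: List[int], beta: float) -> Optional[int]:
    """Reference answer for k-GAP-THRESH on the bits z."""
    mean = beta * len(z)
    deviation = math.sqrt(mean)
    total = sum(z)
    if total > mean + deviation:
        return 1          # sum is above βk + (βk)^(1/2)
    if total < mean - deviation:
        return 0          # sum is below βk - (βk)^(1/2)
    return None           # don't-care region

if __name__ == "__main__":
    rng = random.Random(0)
    z = sample_instance(1000, 0.1, rng)
    print(sum(z), gap_thresh_answer(z, 0.1))
```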

Page 13:

Rectangle Property of Communication

• Let $r$ be the randomness of $C, P_1, \ldots, P_k$

• For any fixed $r$, the set $S$ of inputs giving rise to a transcript $\tau$ is a combinatorial rectangle: $S = S_1 \times S_2 \times \cdots \times S_k$

• If the input distribution is a product distribution, then conditioned on $\tau$ and $r$, the inputs are independent

• Since this holds for every $r$, the inputs are independent conditioned on $\tau$
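The rectangle property can be checked by brute force on a toy example. The sketch below uses a made-up deterministic two-party protocol (two parties rather than $k$, purely for brevity) and verifies that the inputs producing each transcript form a product set; the protocol and all names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, List, Tuple

def transcript(x: int, y: int) -> Tuple[int, int, int]:
    """A toy 3-message protocol on 2-bit inputs x and y."""
    m1 = x & 1             # Alice sends the low bit of x
    m2 = (y >> 1) ^ m1     # Bob's reply depends on y and on m1
    m3 = (x >> 1) & m2     # Alice's reply depends on x and on m2
    return (m1, m2, m3)

def is_rectangle(pairs: Iterable[Tuple[int, int]]) -> bool:
    """True if the set of (x, y) pairs equals X' × Y' for some X', Y'."""
    pairs = set(pairs)
    xs = {x for x, _ in pairs}
    ys = {y for _, y in pairs}
    return pairs == set(product(xs, ys))

def all_transcripts_are_rectangles(domain: Iterable[int]) -> bool:
    """Group all inputs by transcript and test each group."""
    groups: Dict[Tuple[int, int, int], List[Tuple[int, int]]] = defaultdict(list)
    for x, y in product(list(domain), repeat=2):
        groups[transcript(x, y)].append((x, y))
    return all(is_rectangle(group) for group in groups.values())
```

Each message depends only on the sender's own input and the messages so far, which is what forces every group to be a product set.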

Page 14:

k-GAP-THRESH

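[Diagram: a coordinator $C$ connected to players $P_1, P_2, P_3, \ldots, P_k$, who hold bits $Z_1, Z_2, Z_3, \ldots, Z_k$]

• The $Z_i$ are i.i.d. Bernoulli($\beta$)

• Coordinator wants to decide if: $\sum_{i=1}^{k} Z_i > \beta k + (\beta k)^{1/2}$ or $\sum_{i=1}^{k} Z_i < \beta k - (\beta k)^{1/2}$

• By independence of the $Z_i \mid \tau$, equivalent to $C$ having “noisy” independent copies of the $Z_i$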

Page 15:

A Key Lemma

• Lemma: For any protocol $\Pi$ which succeeds w.pr. > .99, the transcript $\tau$ is such that w.pr. > 1/2, for at least $k/2$ different $i$, $H(Z_i \mid \tau) < H(.01\beta)$

• Proof: Suppose $\tau$ does not satisfy this

– With large probability, $\beta k - O((\beta k)^{1/2}) < \mathbb{E}\left[\sum_{i=1}^{k} Z_i \mid \tau\right] < \beta k + O((\beta k)^{1/2})$

– Since the $Z_i$ are independent given $\tau$, $\sum_{i=1}^{k} Z_i \mid \tau$ is a sum of independent Bernoullis

– Since most $H(Z_i \mid \tau)$ are large, by anti-concentration, both events occur with constant probability:

$\sum_{i=1}^{k} Z_i \mid \tau > \beta k + (\beta k)^{1/2}$, $\sum_{i=1}^{k} Z_i \mid \tau < \beta k - (\beta k)^{1/2}$

So $\Pi$ can’t succeed with large probability
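The anti-concentration step can be illustrated numerically. The sketch below computes the exact distribution of a sum of independent Bernoullis and the mass in both tails; the list `probs` stands in for the conditional probabilities $\Pr[Z_i = 1 \mid \tau]$ and is a hypothetical input, as are the function names.

```python
import math
from typing import List, Tuple

def sum_distribution(probs: List[float]) -> List[float]:
    """Exact pmf of a sum of independent Bernoulli(p_i), by dynamic programming."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            nxt[s] += mass * (1.0 - p)   # this bit is 0
            nxt[s + 1] += mass * p       # this bit is 1
        dist = nxt
    return dist

def tail_probabilities(probs: List[float], beta: float) -> Tuple[float, float]:
    """Return (Pr[sum < βk - (βk)^(1/2)], Pr[sum > βk + (βk)^(1/2)])."""
    mean = beta * len(probs)
    deviation = math.sqrt(mean)
    dist = sum_distribution(probs)
    lower = sum(m for s, m in enumerate(dist) if s < mean - deviation)
    upper = sum(m for s, m in enumerate(dist) if s > mean + deviation)
    return lower, upper

if __name__ == "__main__":
    # Transcript reveals nothing: every bit is still Bernoulli(0.25).
    print(tail_probabilities([0.25] * 400, 0.25))
    # Transcript reveals every bit: the sum is fixed at 120.
    print(tail_probabilities([1.0] * 120 + [0.0] * 280, 0.25))
```

When the bits stay uncertain, both tails carry constant mass, so no single answer is correct with high probability; when the bits are fully determined, all the mass sits in one tail.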

Page 16:

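Composition Idea

[Diagram: the coordinator $C$ runs a separate DISJ instance with each of the players $P_1, P_2, P_3, \ldots, P_k$; the outputs are $Z_1, Z_2, Z_3, \ldots, Z_k$]

The input to $P_i$ in k-GAP-THRESH, denoted $Z_i$, is the output of a 2-party Disjointness (DISJ) instance between $C$ and $P_i$

- Let $S$ be a random set of size $1/(4\varepsilon^2)$ from $\{1, 2, \ldots, 1/\varepsilon^2\}$

- For each $i$, if $Z_i = 1$, then choose $T_i$ of size $1/(4\varepsilon^2)$ so that $\mathrm{DISJ}(S, T_i) = 1$, else choose $T_i$ so that $\mathrm{DISJ}(S, T_i) = 0$

- Distributional complexity of solving DISJ with probability $1 - \beta/100$, when $\mathrm{DISJ}(S,T) = 1$ with probability $\beta$, is $\Omega(1/\varepsilon^2)$ [R]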
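The composed input distribution can be sketched as a sampler. The slide only fixes the intersection size of each $T_i$ with $S$; drawing the remaining elements uniformly from outside $S$ is an assumption made here, as are the function name and the requirement that `inv_eps` (standing for $1/\varepsilon$) is an even integer.

```python
import random
from typing import List, Set, Tuple

def sample_composed_instance(
    k: int, beta: float, inv_eps: int, rng: random.Random
) -> Tuple[Set[int], List[int], List[Set[int]]]:
    """Sample S for the coordinator and (Z_i, T_i) for each player."""
    m = inv_eps ** 2                 # universe {1, ..., 1/ε²}
    size = m // 4                    # |S| = |T_i| = 1/(4ε²)
    universe = list(range(1, m + 1))
    s = set(rng.sample(universe, size))
    outside = [e for e in universe if e not in s]
    zs: List[int] = []
    ts: List[Set[int]] = []
    for _ in range(k):
        z = 1 if rng.random() < beta else 0
        if z == 1:
            # exactly one element shared with S, so DISJ(S, T_i) = 1
            shared = rng.choice(sorted(s))
            t = {shared} | set(rng.sample(outside, size - 1))
        else:
            # no element shared with S, so DISJ(S, T_i) = 0
            t = set(rng.sample(outside, size))
        zs.append(z)
        ts.append(t)
    return s, zs, ts
```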

Page 17:

Putting it All Together

• Key Lemma $\Rightarrow$ For most $i$, $H(Z_i \mid \tau) < H(.01\beta)$

• Since $H(Z_i) = H(\beta)$ for all $i$, for most $i$ protocol $\Pi$ solves $\mathrm{DISJ}(X, Y_i)$ with probability $\ge 1 - \beta/100$

• For most $i$, the communication between $C$ and $P_i$ is $\Omega(\varepsilon^{-2})$

– Otherwise, $C$ could simulate the other players without any communication and contradict the lower bound for $\mathrm{DISJ}(X, Y_i)$

• Total communication is $\Omega(k \cdot \varepsilon^{-2})$

• Can show a reduction to estimating $|x|_0$

Page 18:

Reduction to $|x|_0$

• Think of $C$ as a player

• $C$’s input vector $x_C$ is the characteristic vector of the set $[1/\varepsilon^2] \setminus S$

• $P_i$’s input vector $x_i$ is the characteristic vector of the set $T_i$

• When $|T_i \cap S| = 1$, the support of $x = x_C + \sum_i x_i$ usually increases by 1

• Choose $\beta = \Theta(1/(\varepsilon^2 k))$ so that $\sum_{i=1}^{k} Z_i = \beta k \pm (\beta k)^{1/2} = 1/\varepsilon^2 \pm 1/\varepsilon$
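A small worked instance shows how the support size tracks the number of intersecting players. In the sketch below, `m` stands for $1/\varepsilon^2$ and the function names are hypothetical; the identity it relies on is $|x|_0 = m - |S| + |\bigcup_i (T_i \cap S)|$, since every element outside $S$ is already covered by $x_C$.

```python
from typing import List, Set

def char_vector(s: Set[int], m: int) -> List[int]:
    """Characteristic vector of s, a subset of {1, ..., m}."""
    return [1 if j in s else 0 for j in range(1, m + 1)]

def norm0_of_instance(s: Set[int], ts: List[Set[int]], m: int) -> int:
    """|x|_0 for x = x_C + sum_i x_i as defined in the reduction."""
    x = char_vector(set(range(1, m + 1)) - s, m)   # coordinator's vector x_C
    for t in ts:
        x = [a + b for a, b in zip(x, char_vector(t, m))]
    return sum(1 for v in x if v != 0)

if __name__ == "__main__":
    m = 16
    s = {1, 2, 3, 4}
    ts = [
        {1, 5, 6, 7},      # hits S in element 1
        {8, 9, 10, 11},    # disjoint from S
        {2, 5, 8, 12},     # hits S in element 2
        {1, 13, 14, 15},   # hits S in element 1 again (a collision)
    ]
    print(norm0_of_instance(s, ts, m))   # 12 + |{1, 2}| = 14
```

Three players intersect $S$ here but the support grows by only two, which is why the slide says "usually". With $\beta = 1/(\varepsilon^2 k)$ exactly, $\beta k = 1/\varepsilon^2$ and $(\beta k)^{1/2} = 1/\varepsilon$, so the two cases of k-GAP-THRESH differ by an additive $\Theta(1/\varepsilon)$ on a support of size $\Theta(1/\varepsilon^2)$, that is, a relative gap of order $\varepsilon$.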

Page 19:

Talk Outline

• Lower Bounds

– $|x|_0$

– $|x|_2$

Page 20:

Lower Bound for Euclidean Norm

• Improve $\Omega(k + \varepsilon^{-2})$ bound to optimal $\Omega(k \cdot \varepsilon^{-2})$

• Use Gap-Orthogonality (GAP-ORT(X, Y))

– GAP-ORT(X, Y) = 1

– Alice, Bob have $X, Y \in \{0,1\}^{\varepsilon^{-2}}$

– Decide: $|\Delta(X, Y) - 1/(2\varepsilon^2)| < 1/\varepsilon$ or $|\Delta(X, Y) - 1/(2\varepsilon^2)| > 2/\varepsilon$

– Consider uniform distribution on $X, Y$

• [KLLRX, CKW] For any protocol $\Pi$ that solves GAP-ORT with constant probability,

$I(X, Y; \Pi) = H(X, Y) - H(X, Y \mid \Pi) = \Omega(1/\varepsilon^2)$

Page 21:

Information Implications

• By chain rule,

$I(X, Y ; \Pi) = \sum_{i=1}^{1/\varepsilon^2} I(X_i, Y_i ; \Pi \mid X_{<i}, Y_{<i}) = \Omega(\varepsilon^{-2})$

• For most $i$, $I(X_i, Y_i ; \Pi \mid X_{<i}, Y_{<i}) = \Omega(1)$
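The chain rule for mutual information can be sanity-checked on a tiny joint distribution. The sketch below is a simplified stand-in: it uses two uniform bits $X_1, X_2$ and a "message" $M = X_1 \wedge X_2$ instead of the pairs $(X_i, Y_i)$ and transcript $\Pi$ from the slide, and all helper names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Joint = Dict[Tuple[int, ...], float]

def marginal(joint: Joint, idx: List[int]) -> Joint:
    """Marginal pmf of the coordinates listed in idx."""
    out: Joint = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def entropy(joint: Joint, idx: List[int]) -> float:
    """Shannon entropy (in bits) of the coordinates listed in idx."""
    return -sum(p * math.log2(p) for p in marginal(joint, idx).values() if p > 0)

def cond_mi(joint: Joint, a: List[int], b: List[int], c: List[int]) -> float:
    """I(A ; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))

# Coordinates: 0 -> X1, 1 -> X2, 2 -> M = X1 AND X2.
joint: Joint = {(x1, x2, x1 & x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

total = cond_mi(joint, [0, 1], [2], [])                            # I(X1, X2 ; M)
by_parts = cond_mi(joint, [0], [2], []) + cond_mi(joint, [1], [2], [0])
```

Here `total` and `by_parts` agree, which is the two-term case of the sum on the slide.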

Page 22:

XOR DISJ

We compose GAP-ORT with a variant of k-Party DISJ

[Diagram: players $P_1, \ldots, P_{k/2}, P_{k/2+1}, \ldots, P_k$ hold sets $T_1, \ldots, T_{k/2}, T_{k/2+1}, \ldots, T_k \subseteq [n]$]

• Choose random $j \in [n]$ and random $S \in \{00, 10, 01, 11\}$:

$S = 00$: $j$ doesn’t occur in any $T_i$

$S = 10$: $j$ occurs only in $T_1, \ldots, T_{k/2}$

$S = 01$: $j$ occurs only in $T_{k/2+1}, \ldots, T_k$

$S = 11$: $j$ occurs in $T_1, \ldots, T_k$

• Every $j' \neq j$ occurs in at most one set $T_i$

• Output equals 1 if $S \in \{10, 01\}$, otherwise output is 0

• $I(\Pi ; T_1, \ldots, T_k \mid j, S, D) = \Omega(k)$ for any $\Pi$ for which $I(\Pi ; S) = \Omega(1)$
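An XOR DISJ instance can be sketched as a sampler. The slide does not say how the elements $j' \neq j$ are distributed (that is presumably the role of the auxiliary variable $D$), so assigning each one uniformly to a single player or to no player is an assumption made here; the function name is hypothetical and `k` is assumed even.

```python
import random
from typing import List, Set, Tuple

def sample_xor_disj(
    n: int, k: int, rng: random.Random
) -> Tuple[int, str, List[Set[int]], int]:
    """Sample (j, S, [T_1, ..., T_k], output) for one XOR DISJ instance."""
    j = rng.randrange(1, n + 1)
    s = rng.choice(["00", "10", "01", "11"])
    sets: List[Set[int]] = [set() for _ in range(k)]
    half = k // 2
    if s[0] == "1":                      # first bit: j in T_1, ..., T_{k/2}
        for i in range(half):
            sets[i].add(j)
    if s[1] == "1":                      # second bit: j in T_{k/2+1}, ..., T_k
        for i in range(half, k):
            sets[i].add(j)
    for other in range(1, n + 1):
        if other == j:
            continue
        owner = rng.randrange(-1, k)     # -1 means "in no set"
        if owner >= 0:
            sets[owner].add(other)
    output = int(s[0]) ^ int(s[1])       # XOR of the two bits of S
    return j, s, sets, output
```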

Page 23:

GAP-ORT + XOR DISJ

• Take $1/\varepsilon^2$ independent copies of XOR DISJ

– $T^i = (T^i_1, \ldots, T^i_k)$, $j^i$, $S^i$, $D^i$ are variables for the $i$-th instance

• Is the number of outputs equal to 1 about $1/(2\varepsilon^2) \pm 1/\varepsilon$ or about $1/(2\varepsilon^2) \pm 2/\varepsilon$?

[Diagram: a stack of $1/\varepsilon^2$ XOR DISJ instances]

Page 24:

Intuitive Proof

• GAP-ORT is “embedded” inside of GAP-ORT + XOR DISJ

Output is XOR of bits in $S$

• Implies for any correct protocol $\Pi$:

For most $i$, $I(S^i ; \Pi \mid S^{<i}) = \Omega(1)$

• Implies via a direct sum:

For most $i$, $I(\Pi ; T^i \mid j, S, D, T^{<i}) = \Omega(k)$

• Implies via the chain rule:

$I(\Pi ; T^1, \ldots, T^{1/\varepsilon^2} \mid j, S, D) = \Omega(k/\varepsilon^2)$

• Implies communication is $\Omega(k/\varepsilon^2)$
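The last two steps can be written as one chain of inequalities. This is a sketch under two assumptions not stated on the slide: "most $i$" is read as a constant fraction of the $1/\varepsilon^2$ instances, and $|\Pi|$ (a symbol introduced here) denotes the transcript length in bits, which is at least the transcript's entropy.

```latex
\begin{align*}
|\Pi| \;\ge\; H(\Pi \mid j, S, D)
  &\;\ge\; I(\Pi ; T^1, \ldots, T^{1/\varepsilon^2} \mid j, S, D) \\
  &\;=\; \sum_{i=1}^{1/\varepsilon^2} I(\Pi ; T^i \mid j, S, D, T^{<i}) \\
  &\;\ge\; \Omega(1/\varepsilon^2) \cdot \Omega(k) \;=\; \Omega(k/\varepsilon^2).
\end{align*}
```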

Page 25:

Conclusions

• Tight communication lower bounds for estimating $|x|_0$ and $|x|_2$

• Techniques imply tight lower bounds for empirical entropy, heavy hitters, quantiles

• Other results:

– Model in which the $x_i$ undergo poly($n$) additive updates to their coordinates

– Coordinator continually maintains a $(1+\varepsilon)$-approximation

– Improve $k^2/\mathrm{poly}(\varepsilon)$ to $k/\mathrm{poly}(\varepsilon)$ communication for $|x|_2$