Norm Estimation with Applications: A Survey
David Woodruff
IBM Almaden
Outline
1. The streaming model
2. Norm estimation
   a. Problems
   b. Results
   c. Upper bounds
   d. Lower bounds
3. Open questions
Data Stream Model [FM, AMS]
• Model
  • A large object x, modeled as a vector
    • Could be a graph, matrix, set of points, etc.
  • x = (x_1, x_2, …, x_n) starts off as 0^n
  • Stream of m updates (j_1, v_1), …, (j_m, v_m)
    • Update (j, v) causes the change x_j = x_j + v
    • v ∈ {−M, −M+1, …, M}
  • Order and number of updates are arbitrary
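As a concrete illustration of the model (not one of the survey's algorithms), the updates can be replayed against an explicit vector; the whole point of streaming is to avoid storing x like this:

```python
# Minimal illustration of the turnstile data stream model:
# x starts as the zero vector and receives additive updates (j, v).
def apply_stream(n, updates):
    """Apply a stream of updates (j, v) to x = 0^n and return x."""
    x = [0] * n
    for j, v in updates:
        x[j] += v  # update (j, v) causes x_j = x_j + v
    return x

# A stream may update coordinates in any order, any number of times.
stream = [(2, 5), (0, 3), (2, -1), (4, 7), (0, 2)]
x = apply_stream(5, stream)
print(x)  # -> [5, 0, 4, 0, 7]
```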
Application – IP session data
Source     Destination  Bytes  Duration  Protocol
18.6.7.1   19.7.3.2     40K    28        http
10.6.2.3   12.3.4.8     20K    18        ftp
11.1.0.6   11.6.8.2     58K    22        http
12.3.1.5   14.7.0.1     30K    32        http
…          …            …      …         …
AT&T collects 100+ GBs of NetFlow data every day
Application – IP Session Data
• AT & T needs to process massive stream of network data
• Traffic estimation: What fraction of network IP addresses are active?
  → Distinct elements computation
• Traffic analysis: What are the 100 IP addresses with the most traffic?
  → Frequent items computation
• Security/Denial of Service: Are there any IP addresses witnessing a spike in traffic?
  → Skewness computation
Algorithm Goals
• Space complexity: minimize the memory used by the streaming algorithm
  • n, m, and M are large
• Pass complexity: minimize the number of passes over the data
  • In many cases, only 1 pass is possible
• Computation: minimize the time spent per stream update
  • Ideally constant time
Outline
1. The streaming model
2. Norm estimation
   a. Problems
   b. Results
   c. Upper bounds
   d. Lower bounds
3. Open questions
Vector Norm Estimation
• Problem – lp-norms
  • Compute (Σ_{j=1..n} |x_j|^p)^(1/p) = |x|_p
  • p = 0 counts the number of non-zero entries of x
  • p = 1 is the Manhattan norm
  • p = 2 is the Euclidean norm
  • p = 3 measures the skewness
  • p = 4 measures the kurtosis
  • p = ∞ is the maximum norm
Applications by norm:
- p = 0: estimating the number of distinct elements; query planning and optimization
- p = 1: measuring distances between distributions; embedding other metrics into it (EMD, edit distance, etc.)
- p = 2: geometric problems (clustering, nearest neighbor, etc.); databases (self-join size)
- p = 3: testing distribution skewness (easier than via the l1 norm); detecting Denial of Service attacks; Ely Porat: "I know that Google is interested in compressed sensing with lp guarantees for p > 2"
- p = 4: use high accuracy for estimating |x|_4; the Long-Term Capital Management hedge fund was bailed out in the late 90s because it underestimated kurtosis
- p = ∞: finding the most frequent items
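To keep the definitions concrete, here is a direct (non-streaming) computation of the lp-norms above; the streaming challenge is to approximate these without ever storing x:

```python
def lp_norm(x, p):
    """Compute |x|_p for p = 0 (count of nonzeros), p in (0, inf), or p = inf."""
    if p == 0:
        return sum(1 for v in x if v != 0)   # number of non-zero entries
    if p == float('inf'):
        return max(abs(v) for v in x)        # maximum norm
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3, 0, -4, 0, 12]
print(lp_norm(x, 0))             # -> 3 (non-zero entries)
print(lp_norm(x, 1))             # -> 19.0 (Manhattan: 3+4+12)
print(lp_norm(x, 2))             # -> 13.0 (Euclidean: sqrt(9+16+144))
print(lp_norm(x, float('inf')))  # -> 12 (maximum)
```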
Other Applications of lp-Norms
• lp for p ∈ (0,1)
  – Entropy estimation [HNO]
    – Entropy = Σ_j q_j log(1/q_j), where q_j = |x_j|/|x|_1
    – Estimate |x|_p for p ∈ (0,1)
• lp for p ∈ [1, ∞)
  – Regression: min_x |Ax−b|_p
    – b_i = A_i x + Noise_i
  – p = 1 is used to ignore outliers!
  – p = ∞ is used to find outliers!
  – General p allows tuning
• Private norm estimation [FIMNSW, IW, MM, W]
Matrix Norms
• Operator norms of an n x d matrix A
  • Compute |A|_p = max_x |Ax|_p/|x|_p
  • p = 1 is the maximum l1-norm of a column
  • p = 2 is the spectral norm
  • p = ∞ is the maximum l1-norm of a row
• Entrywise norms
  • Compute |A|_p = (Σ_{i,j} |A_ij|^p)^(1/p)
  • p = 2 is the Frobenius norm, also denoted |A|_F
• Schatten norms
  – p = 1 is the nuclear norm

Applications:
- Numerical linear algebra: approximate matrix product; low-rank approximation
- Optimization: minimize rank(X) subject to A(X) = B
Mixed Norms
• Mixed norm of an n x d matrix A
  • Compute lp(lq(A)) = (Σ_{i=1..n} |A_i|_q^p)^(1/p)
• Sum-norm
  • lp(X(A)) = (Σ_{i=1..n} |A_i|_X^p)^(1/p)

Applications:
- lp(l0(A)) is useful for multigraphs [CM]
- lp(l2(A)) is used in k-median, k-means, and generalizations
- Earthmover distance [ABIW]
- l1-regression [SW]
Outline
1. The streaming model
2. Norm estimation
   a. Problems
   b. Results
   c. Upper bounds
   d. Lower bounds
3. Open questions
Initial Observations
Any exact computation
• of a vector norm requires Ω(n) space
• of a matrix norm requires Ω(nd) space
Any deterministic computation
• of a vector norm requires Ω(n) space
• of a matrix norm requires Ω(nd) space
How do we cope?
• Output an estimate Φ with |x|_p ≤ Φ ≤ (1+ε)|x|_p
• Allow randomness and a small probability δ of error
Vector Norm Estimation
- Use O*(f) to denote f · poly(log(n/δ)/ε)
- Assume n, m, M are polynomially related

Rough bounds (algorithms are 1-pass; lower bounds hold for O*(1)-pass algorithms):

                 Space            Update time
lp, p ∈ [0,2]    O*(1)            O*(1)        [I]
lp, p > 2        Θ*(n^(1-2/p))    O*(1)        [IW, SS, BJKS]
Vector Norm Estimation
Refined bounds for δ = 1/100:
• p = 0: O(ε^(-2) log(n) (log 1/ε + log log n)) space, O(1) time; Ω(ε^(-2) log n) space [KNW]
• p ∈ (0,2): O(ε^(-2) log n) space, O(log^2(1/ε) log log(1/ε)) time; Ω(ε^(-2) log n) space [KNPW]
• p = 2: O(ε^(-2) log n) space, O(1) time; Ω(ε^(-2) log n) space [AMS, KNW, TZ]
• p > 2: O(ε^(-2) n^(1-2/p) log^2 n / min(log n, ε^(4/p-2))) space, O(log n) time; Ω(n^(1-2/p) log n + ε^(-2) + n^(1-2/p) ε^(-2/p)) space [G, JW, BJKS]
For general δ, the space bounds get multiplied by log 1/δ [JW]
Mixed Norms [CM, JW, AKO, BIKW, MW]
[Figure: complexity of estimating lp(lq(A)) for an n x d matrix A, as a function of p and q. Depending on the region of the (p, q) plane, the complexity ranges from easy (O*(1)) through n^(1-1/p), d^(1-2/q), and n^(1-q/p), up to n^(1-2/p) d^(1-2/q).]
Matrix Norms
Operator norms
• |A|_1 in Θ*(d) space
• |A|_2 in O*(d^2) space, 1 pass
• |A|_∞ in Θ*(n) space
Entrywise norms
• Space same as for vectors, e.g., |A|_F in O*(1) space
Schatten norms
• |A|_p = (Σ_{i=1..n} σ_i^p)^(1/p); doable in Θ*(d) space if n = d, A is the Laplacian of a graph, and no negative values occur in the stream [KL]
Outline
1. The streaming model
2. Norm estimation
   a. Problems
   b. Results
   c. Upper bounds
   d. Lower bounds
3. Open questions
Vector Norm Estimation
• Can estimate the lp-norm for every p ≥ 0 with the same data structure (with different parameters)! [IW]
• Optimal in space and time up to O*(1) factors
• More generally: obtain entire histogram of the values
Histogramming
• Let S_i = {j : (1+ε)^i ≤ |x_j| < (1+ε)^(i+1)}
• The sizes |S_i| summarize the coordinate values of x
  – Small histogram: only O(log(n)/ε) distinct i
  – Many, many applications
• |x|_p^p ≈ Σ_i |S_i| · (1+ε)^(ip)
• Find a data structure for estimating the |S_i|
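The histogram identity can be sanity-checked offline: rounding each nonzero |x_j| down to the nearest power of (1+ε) under-counts |x|_p^p by at most a (1+ε)^p factor. A small Python check (illustrative only, not part of the streaming algorithm):

```python
import math

def histogram_estimate(x, p, eps):
    """Estimate |x|_p^p from the class sizes |S_i|,
    where S_i = {j : (1+eps)^i <= |x_j| < (1+eps)^(i+1)}."""
    sizes = {}
    for v in x:
        if v != 0:  # zero entries belong to no class
            i = math.floor(math.log(abs(v), 1 + eps))
            sizes[i] = sizes.get(i, 0) + 1
    return sum(s * (1 + eps) ** (i * p) for i, s in sizes.items())

x, p, eps = [1.0, 3.5, -2.0, 8.0, 0.0], 2, 0.1
exact = sum(abs(v) ** p for v in x)
est = histogram_estimate(x, p, eps)
# Rounding down under-estimates by at most a (1+eps)^p factor:
assert est <= exact + 1e-9
assert exact <= est * (1 + eps) ** p + 1e-9
```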
Three Ideas
1. Sign vector σ ∈ {−1,1}^n
   • For any fixed x, |⟨σ, x⟩| ≈ |x|_2
2. Bucketing
   • Given r buckets b_1, …, b_r, randomly hash the coordinates of x into the buckets
   • Let x(b_k) be the restriction of x to bucket k
   • E[|x(b_k)|_2^2] = |x|_2^2 / r
3. Subsampling
   • For j = 1, 2, …, log n
     – Randomly sample a set T_j of 2^j coordinates of x
     – Let x(T_j) be the restriction of x to the coordinates in T_j
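Idea 1 can be seen in isolation: for a random sign vector σ, E[⟨σ,x⟩^2] = |x|_2^2, so averaging ⟨σ,x⟩^2 over independent trials estimates |x|_2^2. A toy illustration (the trial count and seed are arbitrary choices, not parameters from the survey):

```python
import random

def ams_l2_squared_estimate(x, trials, seed=0):
    """Average <sigma, x>^2 over random sign vectors sigma in {-1,+1}^n.
    Each trial is an unbiased estimator of |x|_2^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        dot = sum(v * rng.choice((-1, 1)) for v in x)
        total += dot * dot
    return total / trials

x = [3.0, -1.0, 2.0, 5.0, -2.0]
true = sum(v * v for v in x)          # |x|_2^2 = 43
est = ams_l2_squared_estimate(x, trials=5000)
assert abs(est - true) / true < 0.2   # crude concentration check
```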
The Data Structure
For j = 1, …, log n:
1. Choose a random set T_j of 2^j coordinates of x
2. Randomly hash the coordinates of x(T_j) into r buckets
3. For each bucket b_k, maintain ⟨σ_j, x(T_j)(b_k)⟩, where σ_j ∈ {−1, 1}^n
That’s all folks!
Space ≈ r
Time ≈ 1
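A minimal Python sketch of the update path of this data structure. The seeded per-coordinate hashing below is an illustrative stand-in for the limited-independence hash functions and exact subsampling probabilities of the actual algorithms:

```python
import random

class LevelSketch:
    """For each level j = 1..log n: subsample ~2^j coordinates, hash the
    subsampled coordinates into r buckets, and keep <sigma_j, x(T_j)(b_k)>
    for each bucket. Space ~ r * log n counters; each stream update touches
    at most one counter per level."""
    def __init__(self, n, r):
        self.n, self.r = n, r
        self.levels = max(1, n.bit_length())
        self.counters = [[0.0] * r for _ in range(self.levels)]

    def _hashes(self, j, coord):
        # Seeded hash of (level, coordinate): deterministically decides
        # membership in T_j, the bucket, and the sign sigma_j[coord].
        h = random.Random(j * 2654435761 + coord)
        in_Tj = h.random() < min(1.0, 2.0 ** (j + 1) / self.n)
        return in_Tj, h.randrange(self.r), h.choice((-1, 1))

    def update(self, coord, value):
        """Process stream update (coord, value): x_coord += value."""
        for j in range(self.levels):
            in_Tj, bucket, sign = self._hashes(j, coord)
            if in_Tj:
                self.counters[j][bucket] += sign * value

sk = LevelSketch(n=1024, r=16)
for coord, value in [(3, 5), (100, -2), (3, 1), (777, 4)]:
    sk.update(coord, value)
```

Because the hashes are deterministic per (level, coordinate), updating a coordinate by +v and then −v cancels exactly, as required for turnstile streams.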
Why it Works
• Suppose |S_i| (1+ε)^(ip) ≥ ε^2 |x|_p^p / log n
  (If not, then S_i contributes little to |x|_p^p and can be ignored)
• Consider j so that 2^j |S_i| / n ≈ 1
• |x(T_j)|_p^p ≈ 2^j |x|_p^p / n
• If k ∈ S_i ∩ T_j, then |x_k|^p ≥ ε^2 |x(T_j)|_p^p / log n,
  or |x_k| ≥ ε^(2/p) |x(T_j)|_p / log^(1/p) n
• For p ≤ 2: |x_k| ≥ ε^(2/p) |x(T_j)|_2 / log^(1/p) n
• For p > 2: |x_k| ≥ ε^(2/p) |x(T_j)|_2 / (n^(1/2-1/p) log^(1/p) n)
Wrapping Up
• For each S_i, look at the appropriate level j of subsampling to find S_i ∩ T_j
• E[|S_i ∩ T_j|] = |S_i| · 2^j / n
• Scale by n/2^j to estimate |S_i|
• Output Σ_i |S_i| · (1+ε)^(ip)
An Aside
• We obtain samples from each S_i for which |S_i| · (1+ε)^(ip) ≥ ε^2 |x|_p^p / log n
Sampling algorithm
1. Choose S_i with probability |S_i| · (1+ε)^(ip) / |x|_p^p
2. Output a sample from S_i
• Chooses a k ∈ [n] with probability ≈ |x_k|^p / |x|_p^p
  – almost – known as lp-sampling [MW]
  – useful in sublinear-time algorithms for minimum enclosing ball and classification [CHW]
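An offline caricature of lp-sampling, for intuition: pick index k with probability exactly |x_k|^p / |x|_p^p. The streaming samplers of [MW] achieve this only approximately and in small space; the code below just illustrates the target distribution:

```python
import random

rng = random.Random(0)

def lp_sample(x, p):
    """Return index k with probability |x_k|^p / |x|_p^p (offline version)."""
    weights = [abs(v) ** p for v in x]
    total = sum(weights)
    t = rng.random() * total
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if t < acc:
            return k
    return len(x) - 1  # guard against floating-point round-off

x, p = [3.0, 0.0, -4.0, 1.0], 2
counts = [0] * len(x)
for _ in range(10000):
    counts[lp_sample(x, p)] += 1
# empirical frequencies should be close to [9/26, 0, 16/26, 1/26]
```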
Mixed Norms [JW]
• l_p^p(l_q(A)) = Σ_j (Σ_k |A_jk|^q)^(p/q)
• S_i = {j : (1+ε)^i ≤ Σ_k |A_jk|^q < (1+ε)^(i+1)}
Algorithm
1. lq-sample from A, treated as a vector
2. Use the row indices of the samples to estimate |S_i|
Matrix Norms
• Spectral norm of an n x d matrix A
  – |A|_2 = max_{unit x} |Ax|_2
• Compute S·A, where S is an O*(d) x n matrix of random signs
  – |SAx|_2 ≈ |Ax|_2 for all x
• Output max_{unit x} |SAx|_2
• Can do faster [AC]
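A hedged sketch of this approach in plain Python: a random sign matrix S with k rows, scaled by 1/√k, approximately preserves |Ax|_2 for all x, so the spectral norm of S·A tracks that of A. The dimensions, the choice k = 60, and the power-iteration stand-in for the maximization are illustrative choices, not the construction of [AC]:

```python
import random

def spectral_norm(A, iters=500, seed=0):
    """Estimate |A|_2 (top singular value) via power iteration on A^T A.
    A is given as a list of rows."""
    n, d = len(A), len(A[0])
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(d)) for i in range(n)]   # A v
        u = [sum(A[i][j] * w[i] for i in range(n)) for j in range(d)]   # A^T (A v)
        norm_u = sum(t * t for t in u) ** 0.5
        v = [t / norm_u for t in u]
    w = [sum(A[i][j] * v[j] for j in range(d)) for i in range(n)]
    return sum(t * t for t in w) ** 0.5

def sign_sketch(A, k, seed=1):
    """Return (1/sqrt(k)) * S A for a k x n random sign matrix S."""
    n, d = len(A), len(A[0])
    rng = random.Random(seed)
    S = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
    return [[sum(S[i][t] * A[t][j] for t in range(n)) / k ** 0.5
             for j in range(d)] for i in range(k)]

rng = random.Random(2)
A = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]   # 200 x 3
exact = spectral_norm(A)
approx = spectral_norm(sign_sketch(A, k=60))  # sketch has only 60 rows
assert abs(approx - exact) / exact < 0.5      # crude check of the embedding
```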
Outline
1. The streaming model
2. Norm estimation
   a. Problems
   b. Results
   c. Upper bounds
   d. Lower bounds
3. Open questions
1-Round Communication Complexity
Alice holds x; Bob holds y and asks: what is f(x,y)?
• Alice sends a single message M(x) to Bob
• Bob outputs a function of M(x) and y
• Bob’s output should equal f(x,y) with constant probability (over the randomness of the protocol)
• The communication cost CC(f) is |M(x)|, maximized over inputs and random bits
Reduction to Streaming
Alice holds x and Bob holds y. Alice runs streaming algorithm A on a stream s(x), sends the state of A to Bob, and Bob continues running A on a stream s(y).
If you can solve f(x,y) from A(s(x) ∘ s(y)), then the space of A is at least CC(f)
Canonical Indexing Problem
Alice holds x ∈ {0,1}^n; Bob holds i ∈ {1, 2, …, n} and asks: what is x_i?
CC(Indexing) = Ω(n)
Ω(1/ε^2) Bound
Alice holds x ∈ {−ε, ε}^(1/ε^2); Bob holds y = e_i and asks: what is |x−y|_p?
|x−y|_p^p = (1/ε^2 − 1)ε^p + (1 − x_i)^p
Solves Indexing for p ≥ 2, so an Ω(1/ε^2) bound
For p < 2, see Amit’s talk
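The displayed identity is easy to verify numerically: every coordinate j ≠ i contributes ε^p to |x−y|_p^p, and coordinate i contributes (1 − x_i)^p, which reveals the bit x_i. A quick check with illustrative parameters:

```python
import random

eps, p = 0.25, 2               # so n = 1/eps^2 = 16
n = int(1 / eps ** 2)
rng = random.Random(0)
x = [rng.choice((-eps, eps)) for _ in range(n)]   # Alice's input
i = 7
y = [0.0] * n
y[i] = 1.0                                        # Bob's input: e_i

lhs = sum(abs(x[j] - y[j]) ** p for j in range(n))
rhs = (n - 1) * eps ** p + (1 - x[i]) ** p
assert abs(lhs - rhs) < 1e-12
# (1 - x[i])^p differs between x[i] = eps and x[i] = -eps,
# so a good estimate of |x - y|_p recovers the bit x_i.
```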
Ω(n^(1-2/p)) Bound for p ≥ 2 [SS, BJKS]
Alice holds x ∈ {1, 2, …, n}^n; Bob holds y ∈ {1, 2, …, n}^n and asks: what is |x−y|_p?
Promise: either all i satisfy x_i − y_i ∈ {0,1}, or there is a j for which x_j − y_j ≥ n^(1/p)
Communication is Ω(n^(1-2/p))
The proof bounds the information the message reveals about the input:
for every block of n^(2/p) coordinates, the message reveals 1 bit of information
Outline
1. The streaming model
2. Norm estimation
   a. Problems
   b. Results
   c. Upper bounds
   d. Lower bounds
3. Open questions
lp-Norms in Other Models
- sliding window, time-decayed, out-of-order
- read/write streams, annotations
- distributed functional monitoring
- compressed sensing
A Universal Data Structure
For j = 1, …, log n:
1. Choose a random set T_j of 2^j coordinates of x
2. Randomly hash the coordinates of x(T_j) into r buckets
3. For each bucket b_k, maintain ⟨σ_j, x(T_j)(b_k)⟩, where σ_j ∈ {−1, 1}^n
In what sense is this data structure optimal for all functions of the form Σ_i f(x_i)?
Good progress on this [BO], but still open
Other Norms
• Earthmover distance (EMD)
  – Given n green and n blue points in O(1) dimensions
    [Figure: example point sets with EMD(·, ·) = 6 + 3√2]
  – Output a (1+ε)-approximation to the min-cost perfect matching
  – O(n) space upper bound, Ω(log n) lower bound
  – Some progress [ABIW]
The Future
We’ve made progress. Improving ε and log n factors is important in practice.
Future themes?
- more complicated norms and problems from optimization
- emphasis on sketching for improving time
Bibliography
• [ABIW] Andoni, Do Ba, Indyk, W, FOCS, 2009.
• [AC] Ailon, Chazelle, STOC, 2006.
• [AMS] Alon, Matias, Szegedy, STOC, 1996.
• [AKO] Andoni, Krauthgamer, Onak, preprint.
• [BJKS] Bar-Yossef et al., FOCS, 2002.
• [BO] Braverman, Ostrovsky, STOC, 2010.
• [CHW] Clarkson, Hazan, W, FOCS, 2010.
• [CM] Cormode, Muthukrishnan, PODS, 2005.
• [FIMNSW] Feigenbaum et al., ICALP, 2001.
• [FM] Flajolet, Martin, FOCS, 1983.
• [G] Ganguly, preprint.
• [HNO] Harvey, Nelson, Onak, FOCS, 2008.
• [I] Indyk, FOCS, 2000.
• [IW] Indyk, W, STOC, 2005.
• [JW] Jayram, W, FOCS, 2009.
• [JW] Jayram, W, SODA, 2011.
• [KNW] Kane, Nelson, W, SODA, 2010.
• [KNPW] Kane, Nelson, Porat, W, STOC, 2011.
• [MM] Madeira, Muthukrishnan, FSTTCS, 2009.
• [MW] Monemizadeh, W, SODA, 2010.
• [SS] Saks, Sun, STOC, 2002.
• [SW] Sohler, W, STOC, 2011.
• [W] W, STOC, 2011.