Subspace Embeddings for the L1 norm with Applications to Robust Regression and Hyperplane Fitting
Christian Sohler (TU Dortmund), David Woodruff (IBM Almaden)
Outline
Massive data sets
Regression analysis
Our results
Our techniques
Concluding remarks
Massive data sets
Examples: Internet traffic logs, financial data, etc.
Algorithms: want nearly linear time or less, usually at the cost of a randomized approximation
Regression analysis
Regression: statistical method to study dependencies between variables in the presence of noise.
Regression analysis
Linear Regression: statistical method to study linear dependencies between variables in the presence of noise.
Regression analysis
Linear Regression: statistical method to study linear dependencies between variables in the presence of noise.
Example: Ohm's law V = R ∙ I
Regression analysis
Linear Regression: statistical method to study linear dependencies between variables in the presence of noise.
Example: Ohm's law V = R ∙ I. Find the linear function that best fits the data.
Regression analysis
Linear Regression: statistical method to study linear dependencies between variables in the presence of noise.
Standard setting: one measured variable b, a set of predictor variables a_1, …, a_d
Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + ε, where ε is noise and the x_i are model parameters we want to learn
Can assume x_0 = 0
Now consider n measured variables
Regression analysis
Matrix form
Input: an n x d matrix A and a vector b = (b_1, …, b_n)
n is the number of observations; d is the number of predictor variables
Output: x* so that Ax* and b are close
Consider the over-constrained case, when n ≫ d. Can assume that A has full column rank.
Regression analysis
Least squares method: find x* that minimizes Σ_i (b_i - <A_{i*}, x>)², where A_{i*} is the i-th row of A. Has certain desirable statistical properties.
Method of least absolute deviation (l1-regression): find x* that minimizes Σ_i |b_i - <A_{i*}, x>|. The cost is less sensitive to outliers than least squares.
Regression analysis
Geometry of regression: we want to find an x that minimizes |Ax-b|_1
The product Ax can be written as
A_{*1} x_1 + A_{*2} x_2 + ... + A_{*d} x_d
where A_{*i} is the i-th column of A
This is a linear d-dimensional subspace, so the problem is equivalent to computing the point of the column space of A nearest to b in the l1-norm
Regression analysis
Solving l1-regression via linear programming
Minimize (1,…,1) ∙ (α⁺ + α⁻)
Subject to: Ax + α⁺ - α⁻ = b, α⁺, α⁻ ≥ 0
Generic linear programming gives poly(nd) time
Best known algorithm is nd^5 log n + poly(d/ε) [Clarkson]
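As a concrete illustration (not part of the original talk), here is a minimal sketch of the LP above in Python, assuming scipy.optimize.linprog as the generic LP solver; the toy data with a single gross outlier also illustrates the robustness claim from the previous slide.

```python
# Minimal sketch (not from the talk): l1-regression via the LP above, using
# scipy.optimize.linprog.  Variables: x (free), alpha_plus >= 0, alpha_minus >= 0.
# Minimize sum(alpha_plus + alpha_minus) subject to A x + alpha_plus - alpha_minus = b.
import numpy as np
from scipy.optimize import linprog

def l1_regression(A, b):
    n, d = A.shape
    c = np.concatenate([np.zeros(d), np.ones(2 * n)])       # cost 0 on x, 1 on slacks
    A_eq = np.hstack([A, np.eye(n), -np.eye(n)])             # A x + a+ - a- = b
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)      # x free, slacks nonnegative
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=bounds, method="highs")
    return res.x[:d]

# Toy data with one gross outlier: the l1 fit moves far less than least squares.
rng = np.random.default_rng(0)
A = np.column_stack([np.ones(50), rng.uniform(0, 10, size=50)])
b = A @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(50)
b[0] += 100.0
print("l1 fit:        ", l1_regression(A, b))
print("least squares: ", np.linalg.lstsq(A, b, rcond=None)[0])
```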
Our Results
A (1+ε)-approximation algorithm for the l1-regression problem
Time complexity is nd^1.376 + poly(d/ε) (Clarkson's is nd^5 log n + poly(d/ε))
First 1-pass streaming algorithm with small space (poly(d log n /ε) bits)
Similar results for hyperplane fitting
Outline
Massive data sets
Regression analysis
Our results
Our techniques
Concluding remarks
Our Techniques
Notice that for any d x d change of basis matrix U,
min_{x in R^d} |Ax-b|_1 = min_{x in R^d} |AUx-b|_1
Our Techniques
Notice that for any y ∈ R^d,
min_{x in R^d} |Ax-b|_1 = min_{x in R^d} |Ax-b+Ay|_1
We call b-Ay the "residual", denoted b', and so
min_{x in R^d} |Ax-b|_1 = min_{x in R^d} |Ax-b'|_1
Rough idea behind algorithm of Clarkson
1. Compute a poly(d)-approximation: find y such that |Ay-b|_1 ≤ poly(d) · min_{x in R^d} |Ax-b|_1, and let b' = b-Ay be the residual. (Takes nd^5 log n time.)
2. Compute a well-conditioned basis: find a basis U so that for all x in R^d, |x|_1/poly(d) ≤ |AUx|_1 ≤ poly(d) · |x|_1; then min_{x in R^d} |Ax-b|_1 = min_{x in R^d} |AUx - b'|_1. (Takes nd^5 log n time.)
3. Sample rows from the well-conditioned basis and the residual of the poly(d)-approximation: sample poly(d/ε) rows of AU◦b' proportional to their l1-norm. (Takes nd time.)
4. Solve l1-regression on the sample, obtaining vector x, and output x. (Takes poly(d/ε) time; now generic linear programming is efficient.)
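A minimal sketch (my own, not from the talk) of the sampling step in this outline: rows of AU◦b' are drawn with probability proportional to their l1-norms and rescaled so the sampled objective is an unbiased estimate of the full one; the sample size s is a stand-in for the poly(d/ε) bound.

```python
# Sketch (not from the talk): step 3 of the outline.  Sample s rows of AU◦b'
# with probability proportional to their l1-norms and rescale by 1/(s * p_i),
# so the sampled l1 cost is an unbiased estimate of |AUx - b'|_1.
import numpy as np

def sample_rows(AU, b_prime, s, rng=None):
    rng = rng or np.random.default_rng()
    M = np.column_stack([AU, b_prime])
    norms = np.abs(M).sum(axis=1)              # l1-norm of each row of AU◦b'
    p = norms / norms.sum()                    # sampling probabilities
    idx = rng.choice(len(p), size=s, replace=True, p=p)
    scale = 1.0 / (s * p[idx])                 # importance-sampling weights
    return AU[idx] * scale[:, None], b_prime[idx] * scale
```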
Our Techniques
Suffices to show how to quickly compute
1. A poly(d)-approximation
2. A well-conditioned basis
Our main theorem
Theorem: There is a probability space over (d log d) x n matrices R such that for any
n x d matrix A, with probability at least 99/100 we have for all x:
|Ax|_1 ≤ |RAx|_1 ≤ d log d ∙ |Ax|_1
The embedding is linear, is independent of A, and preserves the lengths of an infinite number of vectors
Application of our main theorem
Computing a poly(d)-approximation
Compute RA and Rb
Solve x' = argmin_x |RAx-Rb|_1
The main theorem applied to A◦b implies x' is a (d log d)-approximation
RA, Rb have d log d rows, so can solve l1-regression efficiently
Time is dominated by computing RA, a single matrix-matrix product
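A hedged sketch of how this step could look in code (not the authors' implementation): draw R using the Cauchy construction of R revealed later in the talk, form RA and Rb, and hand the small problem to the LP-based l1_regression helper from the earlier sketch.

```python
# Sketch (not the authors' code): poly(d)-approximation by sketch-and-solve.
# R uses the i.i.d. Cauchy construction described later in the talk, and
# l1_regression is the LP-based helper from the earlier sketch.
import numpy as np

def poly_d_approximation(A, b, rng=None):
    rng = rng or np.random.default_rng()
    n, d = A.shape
    r = max(1, int(d * np.log(d + 1)))         # ~ d log d rows
    R = rng.standard_cauchy((r, n))
    RA, Rb = R @ A, R @ b                      # one small matrix-matrix product (plus R b)
    return l1_regression(RA, Rb)               # d log d - approximate solution
```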
Application of our main theorem
Computing a well-conditioned basis
1. Compute RA
2. Compute U so that RAU is orthonormal (in the l2-sense)
3. Output AU
AU is well-conditioned because:
|AUx|_1 ≤ |RAUx|_1 ≤ (d log d)^{1/2} |RAUx|_2 = (d log d)^{1/2} |x|_2 ≤ (d log d)^{1/2} |x|_1, and
|AUx|_1 ≥ |RAUx|_1/(d log d) ≥ |RAUx|_2/(d log d) = |x|_2/(d log d) ≥ |x|_1/(d^{3/2} log d)
Life is really simple!
Time dominated by computing RA and AU, two matrix-matrix products
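A sketch of these three steps (my reading, not the authors' code), again assuming the Cauchy construction of R described later: a thin QR factorization RA = QT gives RAU = Q orthonormal for U = T^{-1}.

```python
# Sketch (not the authors' code): well-conditioned basis via the embedding.
# A thin QR factorization RA = Q T (Q orthonormal) gives RAU = Q for U = T^{-1}.
import numpy as np

def well_conditioned_basis(A, rng=None):
    rng = rng or np.random.default_rng()
    n, d = A.shape
    r = max(1, int(d * np.log(d + 1)))
    R = rng.standard_cauchy((r, n))
    RA = R @ A                                 # 1. compute RA
    Q, T = np.linalg.qr(RA)                    # 2. RA = Q T, so RA T^{-1} is orthonormal
    U = np.linalg.inv(T)
    return A @ U, U                            # 3. output AU (and the change of basis U)
```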
Application of our main theorem
It follows that we get an nd^1.376 + poly(d/ε) time algorithm for (1+ε)-approximate l1-regression
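For concreteness, a hedged end-to-end sketch assembling the previous sketches (poly_d_approximation, well_conditioned_basis, sample_rows and l1_regression are the helpers defined above; the sample size s is a placeholder for poly(d/ε)).

```python
# Sketch (not from the talk): the previous sketches assembled end to end.
# poly_d_approximation, well_conditioned_basis, sample_rows and l1_regression
# are the helpers defined above; s stands in for the poly(d/eps) sample size.
import numpy as np

def approximate_l1_regression(A, b, s=2000, rng=None):
    rng = rng or np.random.default_rng()
    y = poly_d_approximation(A, b, rng)        # poly(d)-approximate solution
    b_res = b - A @ y                          # residual b'
    AU, U = well_conditioned_basis(A, rng)     # well-conditioned basis
    SA, Sb = sample_rows(AU, b_res, s, rng)    # l1-norm row sampling
    x = l1_regression(SA, Sb)                  # small l1-regression on the sample
    return U @ x + y                           # undo the change of basis and the shift
```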
What’s left?
We should prove our main theorem.
Theorem: There is a probability space over (d log d) x n matrices R such that for any
n x d matrix A, with probability at least 99/100 we have for all x:
|Ax|_1 ≤ |RAx|_1 ≤ d log d ∙ |Ax|_1
R is simple: the entries of R are i.i.d. Cauchy random variables
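A quick empirical sanity check (not from the talk): build R with i.i.d. Cauchy entries, scaled by 1/(d log d) as in the proof that follows, and compare |RAx|_1 with |Ax|_1 for a few random x.

```python
# Quick empirical check (not from the talk): Cauchy sketch, scaled by 1/(d log d)
# as in the proof below.  The ratio |RAx|_1 / |Ax|_1 should typically fall
# between 1 and d log d (the talk suppresses constant factors).
import numpy as np

rng = np.random.default_rng(1)
n, d = 10000, 5
r = max(1, int(d * np.log(d)))                 # ~ d log d rows
A = rng.standard_normal((n, d))
R = rng.standard_cauchy((r, n)) / r            # i.i.d. Cauchy entries, scaled
RA = R @ A
for _ in range(5):
    x = rng.standard_normal(d)
    print("|RAx|_1 / |Ax|_1 =", np.abs(RA @ x).sum() / np.abs(A @ x).sum())
```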
Cauchy random variables
pdf(z) = 1/(π(1+z²)) for z in (-∞, ∞)
Infinite expectation and variance
1-stable: if z_1, z_2, …, z_n are i.i.d. Cauchy, then for a ∈ R^n,
a_1·z_1 + a_2·z_2 + … + a_n·z_n ~ |a|_1·z, where z is Cauchy
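A small simulation (my own, not from the talk) of the 1-stability property: the quantiles of a_1·z_1 + … + a_n·z_n should match those of |a|_1·z for an independent Cauchy z.

```python
# Small simulation (not from the talk) of 1-stability: the quantiles of
# sum_i a_i z_i match those of |a|_1 * z for an independent Cauchy z.
import numpy as np

rng = np.random.default_rng(2)
a = np.array([0.5, -2.0, 3.0, 1.5])
trials = 200_000
combo = rng.standard_cauchy((trials, a.size)) @ a       # a_1 z_1 + ... + a_n z_n
scaled = np.abs(a).sum() * rng.standard_cauchy(trials)  # |a|_1 * z
for q in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"q={q:.2f}: {np.quantile(combo, q):8.3f} vs {np.quantile(scaled, q):8.3f}")
```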
Proof of main theorem
By 1-stability, for every row r of R,
<r, Ax> ~ |Ax|_1 · Z,
where Z is a Cauchy random variable
So RAx ~ (|Ax|_1 · Z_1, …, |Ax|_1 · Z_{d log d}), where Z_1, …, Z_{d log d} are i.i.d. Cauchy
Hence |RAx|_1 = |Ax|_1 · Σ_i |Z_i|
The |Z_i| are half-Cauchy
Σ_i |Z_i| = Ω(d log d) with probability 1 - exp(-d), by a Chernoff bound
An ε-net argument on {Ax | |Ax|_1 = 1} shows |RAx|_1 = Ω(d log d) · |Ax|_1 for all x
Scale R by 1/(d log d)
But Σ_i |Z_i| is heavy-tailed
Proof of main theorem
Σ_i |Z_i| is heavy-tailed, so |RAx|_1 = |Ax|_1 · Σ_i |Z_i| / (d log d) may be large
Each |Z_i| has c.d.f. asymptotic to 1 - Θ(1/z) for z in [0, ∞)
No problem!
We know there exists a well-conditioned basis of A; we can assume the basis vectors are A_{*1}, …, A_{*d}
|RA_{*i}|_1 ~ |A_{*i}|_1 · Σ_j |Z_j| / (d log d)
With constant probability, Σ_i |RA_{*i}|_1 = O(log d) · Σ_i |A_{*i}|_1
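A small Monte Carlo illustration (not from the talk) of this constant-probability claim, assuming a Gaussian A and the scaled Cauchy sketch: the ratio Σ_i |RA_{*i}|_1 / Σ_i |A_{*i}|_1 typically comes out around log d.

```python
# Monte Carlo illustration (not from the talk) of the constant-probability
# claim, assuming a Gaussian A and the scaled Cauchy sketch: the ratio
# sum_i |RA_{*i}|_1 / sum_i |A_{*i}|_1 typically comes out around log d.
import numpy as np

rng = np.random.default_rng(3)
n, d = 5000, 10
r = max(1, int(d * np.log(d)))
A = rng.standard_normal((n, d))
ratios = [np.abs((rng.standard_cauchy((r, n)) / r) @ A).sum() / np.abs(A).sum()
          for _ in range(50)]                  # 50 independent draws of R
print("median ratio:", np.median(ratios), " log d =", np.log(d))
```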
Proof of main theorem
Suppose Σ_i |RA_{*i}|_1 = O(log d) · Σ_i |A_{*i}|_1 for a well-conditioned basis A_{*1}, …, A_{*d}
We will use the Auerbach basis, which always exists: for all x, |x|_∞ ≤ |Ax|_1, and Σ_i |A_{*i}|_1 = d
I don't know how to compute such a basis, but it doesn't matter!
So Σ_i |RA_{*i}|_1 = O(d log d)
|RAx|_1 ≤ Σ_i |RA_{*i} x_i|_1 ≤ |x|_∞ · Σ_i |RA_{*i}|_1 = |x|_∞ · O(d log d) ≤ O(d log d) · |Ax|_1
Q.E.D.
Main Theorem
Theorem
There is a probability space over (d log d) x n matrices R such that for any n x d matrix A, with probability at least 99/100 we have for all x:
|Ax|_1 ≤ |RAx|_1 ≤ d log d ∙ |Ax|_1
Outline
Massive data sets
Regression analysis
Our results
Our techniques
Concluding remarks
Regression for data streams
Streaming algorithm, given additive updates to entries of A and b:
Pick a random matrix R according to the distribution of the main theorem
Maintain RA and Rb during the stream
Find x' that minimizes |RAx'-Rb|_1 using linear programming
Compute U so that RAU is orthonormal
The hard part is sampling rows from AU◦b' proportional to their norm: we do not know U, b' until the end of the stream
Surprisingly, there is still a way to do this in a single pass, by treating U, x' as formal variables and plugging them in at the end
Uses a noisy sampling data structure (omitted from this talk)
Entries of R do not need to be independent
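A sketch (my own assumptions about the update format, not the authors' code) of the easy part of the streaming algorithm: with additive updates (i, j, δ) to A and (i, δ) to b, maintaining RA and Rb only touches column i of R. R is stored explicitly here for simplicity; for the poly(d log n / ε)-bit space bound one would exploit that the entries of R need not be independent and generate them pseudorandomly.

```python
# Sketch (with assumed update format, not the authors' code): maintain RA and Rb
# under additive stream updates.  An update (i, j, delta) means A[i, j] += delta,
# and (i, delta) means b[i] += delta; only column i of R is touched.  R is stored
# explicitly here for simplicity; the small-space bound relies on the fact that
# the entries of R need not be independent, so R could be generated pseudorandomly.
import numpy as np

class L1RegressionSketch:
    def __init__(self, n, d, rng=None):
        rng = rng or np.random.default_rng()
        r = max(1, int(d * np.log(d + 1)))     # ~ d log d rows
        self.R = rng.standard_cauchy((r, n))
        self.RA = np.zeros((r, d))
        self.Rb = np.zeros(r)

    def update_A(self, i, j, delta):
        self.RA[:, j] += delta * self.R[:, i]

    def update_b(self, i, delta):
        self.Rb += delta * self.R[:, i]

    def solve(self):
        return l1_regression(self.RA, self.Rb) # LP helper from the earlier sketch
```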
Hyperplane Fitting
Given n points in R^d, find the hyperplane minimizing the sum of l1-distances of the points to the hyperplane
Reduces to d invocations of l1-regression
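One natural way to realize this reduction in code (my reading, not necessarily the authors' exact reduction): for each coordinate j, fit that coordinate as an l1-regression on the remaining coordinates plus an offset, and keep the best of the d fits.

```python
# Sketch of one way to realize the reduction (not necessarily the authors' exact
# one): for each coordinate j, fit column j as an l1-regression on the remaining
# coordinates plus an offset, and keep the cheapest of the d fits.
import numpy as np

def l1_hyperplane_fit(P):
    n, d = P.shape
    best = None
    for j in range(d):
        A = np.column_stack([np.delete(P, j, axis=1), np.ones(n)])
        x = l1_regression(A, P[:, j])          # LP helper from the earlier sketch
        cost = np.abs(A @ x - P[:, j]).sum()
        if best is None or cost < best[0]:
            best = (cost, j, x)
    return best                                # (cost, coordinate fixed to 1, coefficients)
```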
Conclusion
Main results
Efficient algorithms for l1-regression and hyperplane fitting
nd^1.376 time improves the previous nd^5 log n running time for l1-regression
First oblivious subspace embedding for l1