Page 1

Sketching as a Tool for Numerical Linear Algebra

David Woodruff

IBM Almaden

Page 2

Talk Outline

Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression

Low Rank Approximation
  - Sketching to speed up Low Rank Approximation

Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions

Page 3

Regression

Linear Regression: a statistical method to study linear dependencies between variables in the presence of noise.

Example: Ohm's law, V = R ∙ I. Find the linear function that best fits the data.

Page 4

Regression

Standard Setting: One measured variable b. A set of predictor variables a1, …, ad.

Assumption: b = x0 + a1 x1 + … + ad xd + ε

ε is assumed to be noise and the xi are model parameters we want to learn. Can assume x0 = 0.

Now consider n observations of b.

Page 5

Regression analysis

Matrix form

Input: n x d matrix A and a vector b = (b1, …, bn). Here n is the number of observations and d is the number of predictor variables.

Output: x* so that Ax* and b are close

Consider the over-constrained case, when n ≫ d

Assume that A has full column rank

Page 6

Regression analysis

Least Squares Method: Find x* that minimizes |Ax-b|22 = Σi (bi – <Ai*, x>)2

Ai* is the i-th row of A

Certain desirable statistical properties. Closed form solution: x* = (ATA)-1 ATb

Method of least absolute deviation (l1-regression): Find x* that minimizes |Ax-b|1 = Σi |bi – <Ai*, x>|

Cost is less sensitive to outliers than least squares. Can solve via linear programming.

Time complexities are at least n·d2; we want better!
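Both baselines are a few lines with standard tools. The following minimal sketch (ours, not the talk's) computes the least squares solution with a stable solver and the l1 solution via the usual linear-programming reformulation; the sizes n, d and the data are made up for illustration:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 300, 10                                   # made-up sizes
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Least squares: closed form x* = (A^T A)^{-1} A^T b; lstsq is the stable route.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# l1-regression as an LP: minimize sum(t) subject to -t <= Ax - b <= t.
c = np.concatenate([np.zeros(d), np.ones(n)])    # variables are [x, t]
A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + n))
x_l1 = res.x[:d]
```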

Page 7

Talk Outline

Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression

Low Rank Approximation
  - Sketching to speed up Low Rank Approximation

Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions

Page 8

Sketching to solve least squares regression

How to find an approximate solution x to minx |Ax-b|2 ?

Goal: output x' for which |Ax'-b|2 ≤ (1+ε) minx |Ax-b|2 with high probability

Draw S from a k x n random family of matrices, for a value k << n

Compute S*A and S*b

Output the solution x' to minx |(SA)x-(Sb)|2

Page 9

How to choose the right sketching matrix S?

Recall: output the solution x' to minx |(SA)x-(Sb)|2

Lots of matrices work

S is a d/ε2 x n matrix of i.i.d. Normal random variables

Computing S*A may be slow…
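For concreteness, here is a minimal numpy version of the sketch-and-solve recipe with a dense Gaussian sketch of k = d/ε2 rows, as on this slide; the function name and the 1/√k scaling are our own choices:

```python
import numpy as np

def sketch_and_solve_l2(A, b, eps=0.5, rng=None):
    """Approximate argmin_x |Ax-b|_2 by solving the k x d sketched problem."""
    if rng is None:
        rng = np.random.default_rng(1)
    n, d = A.shape
    k = int(d / eps ** 2)                        # k << n rows
    S = rng.standard_normal((k, n)) / np.sqrt(k) # i.i.d. Normal entries
    xp, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return xp                                    # |A xp - b|_2 <= (1+eps) min_x |Ax-b|_2 w.h.p.
```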

Page 10

How to choose the right sketching matrix S? [S]

S is a Johnson Lindenstrauss Transform

S = P*H*D

D is a diagonal matrix with +1, -1 entries on the diagonal

H is the Hadamard transform

P just chooses a random (small) subset of rows of H*D

S*A can be computed much faster
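A rough implementation of this construction, assuming n is a power of 2 so the fast Walsh-Hadamard transform applies; the 1/√k normalization is a common convention, not something fixed by the slide:

```python
import numpy as np

def fwht(X):
    """In-place fast Walsh-Hadamard transform along the first axis (length a power of 2)."""
    h, n = 1, X.shape[0]
    while h < n:
        for i in range(0, n, 2 * h):
            a = X[i:i + h].copy()
            X[i:i + h] = a + X[i + h:i + 2 * h]
            X[i + h:i + 2 * h] = a - X[i + h:i + 2 * h]
        h *= 2
    return X

def srht(A, k, rng=None):
    """S*A for S = P*H*D: random signs, Hadamard transform, then k sampled rows."""
    if rng is None:
        rng = np.random.default_rng(2)
    n = A.shape[0]                               # assumed to be a power of 2 here
    D = rng.choice([-1.0, 1.0], size=n)          # diagonal matrix of random signs
    HDA = fwht(D[:, None] * A)                   # H*(D*A) in O(n d log n) time
    P = rng.choice(n, size=k, replace=False)     # P picks k random rows of H*D*A
    return HDA[P] / np.sqrt(k)                   # scaling keeps |SAx| ~ |Ax| on average
```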

Page 11

Even faster sketching matrices [CW]

[Example CountSketch matrix: each column has a single ±1 entry in a random row]

CountSketch matrix

Define a k x n matrix S, for k = Õ(d2/ε2)

S is really sparse: single randomly chosen non-zero entry per column

Surprisingly, this works!
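Because each column of S has a single nonzero, S*A can be computed in O(nnz(A)) time by streaming over the rows of A. A minimal sketch of this (our code, following the slide's description):

```python
import numpy as np

def countsketch(A, k, rng=None):
    """Compute S*A for a CountSketch S in O(nnz(A)) time, streaming over rows of A."""
    if rng is None:
        rng = np.random.default_rng(3)
    n = A.shape[0]
    rows = rng.integers(0, k, size=n)            # random target row for each column of S
    signs = rng.choice([-1.0, 1.0], size=n)      # random sign for each column of S
    SA = np.zeros((k, A.shape[1]))
    np.add.at(SA, rows, signs[:, None] * A)      # each row of A lands in one signed bucket
    return SA
```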

Page 12

Simpler and Sharper Proofs [MM, NN, N]

Let B = [A, b] be an n x (d+1) matrix

Let U be an orthonormal basis for the columns of B

Suffices to show |SUx|2 = 1 ± ε for all unit x

Implies |S(Ax-b)|2 = (1 ± ε) |Ax-b|2 for all x

SU is a (d+1)2/ε2 x (d+1) matrix

Suffices to show |UTSTSU – I|2 ≤ |UTSTSU – I|F ≤ ε

Matrix product result: |CSTSD – CD|F2 ≤ [1/(# rows of S)] · |C|F2 |D|F2

Set C = UT and D = U. Then |U|F2 = d+1 and (# rows of S) = (d+1)2/ε2, so the matrix product result gives |UTSTSU – I|F2 ≤ ε2/(d+1)2 · (d+1)2 = ε2

|SBx|2 = (1±ε) |Bx|2 for all x: S is called a subspace embedding
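One can check the subspace embedding property numerically: with k = (d+1)2/ε2 CountSketch rows, all singular values of SU should fall in [1-ε, 1+ε]. A small self-contained experiment (our parameter choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eps = 5000, 5, 0.5
B = rng.standard_normal((n, d + 1))              # stands in for B = [A, b]
U = np.linalg.qr(B)[0]                           # orthonormal basis for the columns of B

k = int((d + 1) ** 2 / eps ** 2)                 # CountSketch row count from the slide
rows = rng.integers(0, k, size=n)                # apply CountSketch in O(nnz(U)) time
signs = rng.choice([-1.0, 1.0], size=n)
SU = np.zeros((k, d + 1))
np.add.at(SU, rows, signs[:, None] * U)

sv = np.linalg.svd(SU, compute_uv=False)
print(sv.min(), sv.max())                        # expect all values in [1 - eps, 1 + eps]
```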

Page 13

Talk Outline

Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression

Low Rank Approximation
  - Sketching to speed up Low Rank Approximation

Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions

Page 14

Sketching to solve l1-regression

How to find an approximate solution x to minx |Ax-b|1 ?

Goal: output x' for which |Ax'-b|1 ≤ (1+ε) minx |Ax-b|1 with high probability

Natural attempt: Draw S from a k x n random family of matrices, for a value k << n

Compute S*A and S*b

Output the solution x' to minx |(SA)x-(Sb)|1

Turns out this does not work

Page 15

Sketching to solve l1-regression [SW]

Why doesn't outputting the solution x' to minx |(SA)x-(Sb)|1 work?

Don't know of k x n matrices S with small k for which, if x' is the solution to minx |(SA)x-(Sb)|1, then

|Ax'-b|1 ≤ (1+ε) minx |Ax-b|1

with high probability

Instead: can find an S so that

|Ax'-b|1 ≤ (d log d) minx |Ax-b|1

S is a matrix of i.i.d. Cauchy random variables. Property: |Ax-b|1 ≤ |S(Ax-b)|1 ≤ (d log d) |Ax-b|1

Page 16

Cauchy random variables

Cauchy random variables not as nice as Normal (Gaussian) random variables

They don’t have a mean and have infinite variance

Ratio of two independent Normal random variables is Cauchy

If a and b are scalars and C1 and C2 are independent Cauchys, then a·C1 + b·C2 ~ (|a|+|b|)·C for a Cauchy C
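Since Cauchys have no mean, the 1-stability property is easiest to see through medians. A quick numerical illustration (our own, with arbitrary a and b):

```python
import numpy as np

rng = np.random.default_rng(5)
a, b, m = 3.0, -4.0, 10 ** 6                     # arbitrary scalars, many samples
C1 = rng.standard_cauchy(m)
C2 = rng.standard_cauchy(m)
C = rng.standard_cauchy(m)

# median(|standard Cauchy|) = 1, so both medians should approach |a| + |b| = 7.
print(np.median(np.abs(a * C1 + b * C2)))
print(np.median(np.abs((abs(a) + abs(b)) * C)))
```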

Page 17

Sketching to solve l1-regression

Main Idea: Let B = [A, b]. Compute a QR-factorization of S*B

Q has orthonormal columns and Q*R = S*B

B*R-1 is a "well-conditioning" of B. Note SBR-1 = Q has orthonormal columns, so |SBR-1ei|2 = 1 and |SBR-1x|2 = |x|2. Two properties follow:

(1) Σi=1d |BR-1ei|1 ≤ Σi=1d |SBR-1ei|1 ≤ (d log d)^(1/2) Σi=1d |SBR-1ei|2 ≤ d (d log d)^(1/2)

(2) For all x: |x|2 = |SBR-1x|2 ≤ |SBR-1x|1 ≤ (d log d) |BR-1x|1, using |y|2 ≤ |y|1

These two properties make importance sampling work!

Page 18

Importance Sampling

Want to estimate Σi=1n yi by sampling, for yi ≥ 0

Suppose we sample yi with probability pi

T = Σi=1n δ(yi sampled) · yi/pi

E[T] = Σi=1n pi · yi/pi = Σi=1n yi

Var[T] ≤ Σi=1n pi · (yi/pi)2 = Σi=1n yi2/pi ≤ (Σi=1n yi) · maxi yi/pi

Bound maxi yi/pi by ε2 (Σi=1n yi)

For us, yi = |(Ax-b)i|, and this holds if pi = |eiT BR-1|1 · poly(d/ε)!
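A toy run of this estimator, with inclusion probabilities proportional to the yi themselves (the constant 50 is an arbitrary oversampling factor):

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.random(1000)                             # y_i >= 0
p = np.minimum(1.0, 50 * y / y.sum())            # inclusion probabilities, capped at 1
sampled = rng.random(y.size) < p                 # delta(y_i sampled)
T = np.sum(y[sampled] / p[sampled])              # E[T] = sum_i y_i, whatever the p_i
print(T, y.sum())
```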

Page 19

Importance Sampling

Sample poly(d/ε) rows of B*R-1, where the i-th row is sampled with probability proportional to its 1-norm

T is a diagonal matrix with Ti,i = 0 if row i is not sampled, and Ti,i = 1/Pr[row i sampled] otherwise

Then |TBx|1 = (1 ± ε) |Bx|1 for all x; to get the bound for all x, use Bernstein's inequality and a net argument

Solve the regression problem on the (reweighted) samples!

Page 20

Sketching to solve l1-regression [MM]

Most expensive operation is computing S*A where S is the matrix of i.i.d. Cauchy random variables

All other operations are in the “smaller space”

Can speed this up by choosing S as follows:

S = (CountSketch matrix) · diag(C1, C2, …, Cn), where C1, …, Cn are i.i.d. Cauchy random variables

Page 21

Further sketching improvements [WZ]

Can show that fewer sampled rows are needed in later steps if S is instead chosen as follows:

Instead of a diagonal of Cauchy random variables, use a diagonal of reciprocals of exponential random variables

S = (CountSketch matrix) · diag(1/E1, 1/E2, …, 1/En), where E1, …, En are i.i.d. exponential random variables

Uses max-stability of exponentials [Andoni]: maxi yi/Ei ~ |y|1/E for i.i.d. exponentials Ei and an exponential E

For recent work on fast sampling-based algorithms, see Richard's talk!
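The max-stability fact is easy to verify numerically; in the check below (ours), the two quantities have the same distribution, so their medians should match:

```python
import numpy as np

rng = np.random.default_rng(7)
y = np.array([5.0, 1.0, 3.0, 1.0])               # |y|_1 = 10
m = 10 ** 6
lhs = np.max(y / rng.exponential(size=(m, y.size)), axis=1)   # max_i y_i / E_i
rhs = y.sum() / rng.exponential(size=m)                       # |y|_1 / E
print(np.median(lhs), np.median(rhs))            # same distribution, so medians agree
```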

Page 22

Talk Outline

Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression

Low Rank Approximation
  - Sketching to speed up Low Rank Approximation

Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions

Page 23

Low rank approximation

A is an n x d matrix

Typically A is well-approximated by a low rank matrix; e.g., it is only high rank because of noise.

Want to output a rank k matrix A’, so that

|A-A'|F ≤ (1+ε) |A-Ak|F,

w.h.p., where Ak = argmin over rank-k matrices B of |A-B|F

(For matrix C, |C|F = (Σi,j Ci,j2)1/2 )

Page 24

Solution to low-rank approximation [S]

Given an n x d input matrix A, compute S*A using a sketching matrix S with k << n rows. S*A takes random linear combinations of the rows of A.


Project rows of A onto SA, then find best rank-k approximation to points inside of SA.
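A numpy sketch of this algorithm under illustrative parameter choices (a Gaussian S with O(k/ε) rows, and d assumed larger than the sketch size; the talk's fastest versions would use CountSketch instead):

```python
import numpy as np

def sketched_low_rank(A, k, eps=0.5, rng=None):
    """Rank-k approximation of A via a row sketch S*A; constants are illustrative."""
    if rng is None:
        rng = np.random.default_rng(8)
    n, d = A.shape
    k_rows = k + int(k / eps)                    # O(k/eps) rows, assumed < d
    S = rng.standard_normal((k_rows, n))         # S*A takes random combinations of rows of A
    SA = S @ A
    Q = np.linalg.qr(SA.T)[0]                    # orthonormal basis for rowspace(SA), d x k_rows
    P = A @ Q                                    # project each row of A onto that subspace
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    P_k = (U[:, :k] * s[:k]) @ Vt[:k]            # best rank-k approximation inside the subspace
    return P_k @ Q.T                             # rank-k matrix A' close to A in Frobenius norm
```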

Page 25

Low Rank Approximation Idea

Regression problem minX |Ak X – A|F

Solution is X = I, and minimum is |Ak – A|F

This is a generalized regression problem!

If S is a subspace embedding for the column space of Ak, and also, for any matrices B, C: |BSTSC – BC|F2 ≤ [1/(# rows of S)] · |B|F2 |C|F2,

then if X' is the minimizer of minX |SAkX – SA|F,

|Ak X' – A|F ≤ (1+ε) minX |AkX – A|F = (1+ε) |Ak – A|F

But the minimizer X' = (SAk)- SA (a pseudoinverse solution) is in the row span of SA!

S can be a matrix of i.i.d. Normals, a Fast Johnson Lindenstrauss matrix, or a CountSketch matrix

Page 26

Caveat: projecting the points onto SA is slow

Current algorithm:

1. Compute S*A (easy)

2. Project each of the rows onto S*A

3. Find best rank-k approximation of projected points inside of rowspace of S*A (easy)

Bottleneck is step 2

[CW] It turns out you can approximate the projection by sketching for generalized regression again: minX |X(SA) – A|F2

Page 27

Talk Outline

Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression

Low Rank Approximation
  - Sketching to speed up Low Rank Approximation

Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions

Page 28

M-Estimators and Robust Regression

Solve minx |Ax-b|M, where M: R → R≥0 and |y|M = Σi=1n M(yi)

Least squares and l1-regression are special cases

Huber function, given a parameter c:

M(y) = y2/(2c) for |y| ≤ c

M(y) = |y| - c/2 otherwise

Enjoys the smoothness properties of l2 and the robustness properties of l1

[CW15] For M-estimators with at least linear and at most quadratic growth, can get O(1)-approximation in nnz(A) + poly(d) time
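As an illustration of the objective (not of the [CW15] algorithm, which is sketch-based), here is the Huber loss from this slide minimized with a generic smooth solver:

```python
import numpy as np
from scipy.optimize import minimize

def huber(y, c):
    """The Huber function from this slide, applied entrywise."""
    return np.where(np.abs(y) <= c, y ** 2 / (2 * c), np.abs(y) - c / 2)

def huber_regression(A, b, c=1.0):
    """Minimize sum_i M((Ax - b)_i); a generic solver, not the nnz-time algorithm."""
    obj = lambda x: huber(A @ x - b, c).sum()
    return minimize(obj, np.zeros(A.shape[1])).x
```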

Page 29

CUR Decompositions

[BW14] Can find a CUR decomposition in O(nnz(A) log n) + n*poly(k/ε) time with O(k/ε) columns, O(k/ε) rows, and rank(U) = k

Page 30

Open Questions

Recent monograph from NOW Publishers: D. Woodruff, "Sketching as a Tool for Numerical Linear Algebra"

Other types of low rank approximation:

(Spectral) How quickly can we find a rank k matrix A', so that

|A-A'|2 ≤ (1+ε) |A-Ak|2,

w.h.p., where Ak = argmin over rank-k matrices B of |A-B|2

(Robust) How quickly can we find a rank k matrix A', so that

|A-A'|1 ≤ (1+ε) |A-Ak|1,

w.h.p., where Ak = argmin over rank-k matrices B of |A-B|1

For other questions, regarding Schatten norms and communication-efficiency, see the reference above. Thanks!