Transcript of Length Squared Sampling in Matrices (chiru/datascience/nov_2.pdf)

Length Squared Sampling in Matrices

November 2, 2017


Page 2: Length Squared Sampling in Matriceschiru/datascience/nov_2.pdf · Agenda: Sampling to deal with “big” matrices Our definition of “Big data ” Doesn’t fit into RAM, Obvious

Agenda: Sampling to deal with “big” matrices

Our definition of “big data”: doesn’t fit into RAM.
Obvious thought to deal with a big matrix: sample some rows and compute only on the sampled rows.
Uniform random sampling won’t do: maybe only an o(1) fraction of rows/columns have significant entries.
This lecture: two problems on matrices:
(Approximate) matrix multiplication in “near-linear” time.
Compressed representation of a matrix. Uses the Matrix Multiplication Theorem twice in a curious way.
Both will sample rows with probability proportional to length squared.
Will use length squared sampling later also for SVD.
Throughout: the algorithm tosses coins (“randomized algorithm”); the data does not. [Data: worst case, not average case.]
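As a minimal sketch (not from the slides; assuming NumPy, with a hypothetical toy matrix) of what “probability proportional to length squared” means for rows:

```python
import numpy as np

# Hypothetical toy matrix: one dominant row, two near-zero rows.
A = np.array([[10.0, 0.0],
              [0.1, 0.0],
              [0.0, 0.1]])

# Length squared sampling picks row i with probability |A(i,:)|^2 / ||A||_F^2.
row_norms_sq = (A ** 2).sum(axis=1)
p = row_norms_sq / row_norms_sq.sum()
print(p)  # almost all the probability mass sits on the dominant row
```

Uniform sampling would give each row probability 1/3, while length squared sampling concentrates on the rows that carry the matrix’s mass.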


Matrix Multiplication

Matrix multiplication: given A_{m×n} and B_{n×q}, find AB. The exact (naïve) algorithm takes O(mnq) time; better divide-and-conquer algorithms exist. Can we do linear time O(mn + nq) approximately? What if A, B cannot be stored in RAM?

Can we just take a sample of A, B and multiply to get an approximation to the product?
Sample some entries of A, B? We need compatible dimensions to multiply!
Sample some rows/columns of A, B? Any rows/columns?

AB = (1st column of A)(1st row of B) + (2nd column of A)(2nd row of B) + ··· . Check.
The r.h.s. is a sum of n quantities (which happen to be matrices). Sample r of these n quantities and hope/prove that their sum (times n/r) is a good estimate.
Try a uniformly random T ⊂ {1, 2, ..., n} with |T| = r. Does this work for any A, B?
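The rank-one decomposition of AB above can be verified numerically. A small sketch, assuming NumPy (the random matrices are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))   # m x n
B = rng.standard_normal((5, 3))   # n x q

# AB as a sum of n rank-one matrices: (i-th column of A)(i-th row of B).
outer_sum = sum(np.outer(A[:, i], B[i, :]) for i in range(A.shape[1]))
assert np.allclose(outer_sum, A @ B)
```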


Matrix Multiplication-contd.

AB = (1st column of A)(1st row of B) + (2nd column of A)(2nd row of B) + ···, i.e.,

AB = ∑_{i=1}^{n} A(:, i) B(i, :).

Uniform sampling does not work. General non-uniform sampling: probabilities of picking columns p_1, p_2, ..., p_n ≥ 0, summing to 1.
Pick j ∈ {1, 2, ..., n} with Prob(j) = p_j, and let X = A(:, j) B(j, :). X is a matrix-valued random variable. We really want to take r i.i.d. copies of j (and of X) and average them, but it is enough to compute the mean and variance of one X.
What is E(X) (entry-wise expectation)? E(X) = ∑_{j=1}^{n} p_j A(:, j) B(j, :). How do we make it unbiased?

Step back: we want to estimate the sum of n real numbers a_1, a_2, ..., a_n. Pick a j ∈ {1, 2, ..., n} with probabilities p_1, p_2, ..., p_n. How do we scale the picked a_j so that it is an unbiased estimator of the sum?
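The scaling question has the classical importance-sampling answer (used on the next slide): return a_j/p_j, which is unbiased for the sum whatever the positive p_j are. A sketch with made-up numbers, assuming NumPy:

```python
import numpy as np

a = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
p = np.array([0.5, 0.1, 0.2, 0.1, 0.1])   # any positive probs summing to 1

# E[a_j / p_j] = sum_j p_j * (a_j / p_j) = sum_j a_j, so a_j / p_j is an
# unbiased estimator of the sum for any choice of positive p_j.
expectation = (p * (a / p)).sum()
assert np.isclose(expectation, a.sum())

# Empirical check with many i.i.d. draws.
rng = np.random.default_rng(1)
draws = rng.choice(len(a), size=100_000, p=p)
estimate = (a[draws] / p[draws]).mean()
assert abs(estimate - a.sum()) < 0.3
```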


Pick a_j from a_1, a_2, ..., a_n with probabilities p_1, p_2, ..., p_n. Then E(a_j/p_j) = ∑_{j=1}^{n} a_j.

Likewise E((1/p_j) A(:, j) B(j, :)) = AB. Better have every p_j > 0.

What about the error? Try writing down the variance of one entry, say the (i, j)-th entry. First try the second moment:

∑_{l=1}^{n} p_l (1/p_l^2) A_il^2 B_lj^2 = ∑_{l=1}^{n} (1/p_l) A_il^2 B_lj^2.

How do we bound it for different (i, j)?


Variance of the Matrix

Simple idea (but without it, very complicated): bound the sum of the variances of the entries of X. Let Var(X) denote ∑_{i,j} Var(X_ij).

Var(X) = ∑_{i=1}^{m} ∑_{j=1}^{q} Var(X_ij) ≤ ∑_{i,j} E(X_ij^2) ≤ ∑_{i,j} ∑_{l} p_l (1/p_l^2) a_il^2 b_lj^2.

Exchange the order of summation:

∑_l (1/p_l) ∑_i a_il^2 ∑_j b_lj^2 = ∑_l (1/p_l) |A(:, l)|^2 |B(l, :)|^2.

What is the best choice of the p_j? The one which minimizes the variance of X.
It suffices to minimize the second moment, since E(X) does not depend on the p_l (unbiasedness)!
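The bound ∑_l (1/p_l)|A(:,l)|^2 |B(l,:)|^2 can be checked against a Monte Carlo estimate of Var(X). A sketch assuming NumPy, with arbitrary test matrices and probabilities:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 2))
p = np.full(5, 0.2)                 # any positive sampling probabilities

# Exact Var(X) for X = A(:,j)B(j,:)/p_j: the second moment bound
# sum_l (1/p_l)|A(:,l)|^2 |B(l,:)|^2, minus ||AB||_F^2.
col = np.linalg.norm(A, axis=0)
row = np.linalg.norm(B, axis=1)
second_moment = ((col ** 2) * (row ** 2) / p).sum()
var_X = second_moment - np.linalg.norm(A @ B, 'fro') ** 2

# Monte Carlo check: draw X many times, sum the entry-wise sample variances.
draws = rng.choice(5, size=20_000, p=p)
samples = np.stack([np.outer(A[:, j], B[j, :]) / p[j] for j in draws])
mc_var = samples.var(axis=0).sum()
assert var_X <= second_moment            # the slide's bound holds
assert abs(mc_var - var_X) / var_X < 0.2  # simulation agrees within 20%
```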


Probabilities which minimize variance

Choose the p_k to minimize ∑_k (1/p_k) |A(:, k)|^2 |B(k, :)|^2.

Calculus: if a_1, a_2, ..., a_n are any positive reals, the p_1, p_2, ..., p_n minimizing ∑_l a_l/p_l subject to p_l ≥ 0 and ∑_l p_l = 1 are proportional to √a_l. Check.

Best choice: p_k ∝ |A(:, k)| |B(k, :)|, i.e., p_k = |A(:, k)| |B(k, :)| / ∑_l |A(:, l)| |B(l, :)|.

In the important special case B = A^T, this gives p_k = |A(:, k)|^2 / ||A||_F^2, where ||A||_F^2 = ∑_{i,j} A_ij^2 (||A||_F is the Frobenius norm of A).
Pick columns of A with probabilities proportional to the squared lengths of the columns: length squared sampling.

With these probabilities, we have

Var(X) ≤ ∑_k [ |A(:,k)|^2 |B(k,:)|^2 / (|A(:,k)| |B(k,:)|) ] · ∑_{l=1}^{n} |A(:, l)| |B(l, :)| = (∑_l |A(:, l)| |B(l, :)|)^2 ≤ ||A||_F^2 ||B||_F^2. [Why?]

Another set of probabilities (really length squared), p_k = |A(:, k)|^2 / ||A||_F^2, also gives the bound ||A||_F^2 ||B||_F^2.

Length Squared Sampling in Matrices November 2, 2017 7 / 1


Approximately Length Squared

Suppose $p_k \ge c\,|A(:,k)|^2/\|A\|_F^2$ for some $c \in \Omega(1)$. It may be possible to find such $p_k$ more easily than finding the exact lengths. [E.g., by sampling.]

Still, $X = A(:,k)B(k,:)/p_k$ is an unbiased estimator of $AB$ (in fact for any valid $p_k$!).

Now $\mathrm{Var}(X) \le \sum_k \frac{|A(:,k)|^2\,|B(k,:)|^2}{p_k} \le \frac{1}{c}\,\|A\|_F^2\,\|B\|_F^2$. We lose only a factor of $1/c$.

Length Squared Sampling in Matrices November 2, 2017 8 / 1


Reducing Variance

We saw that for the r.v. $X$: $E(X) = AB$ and $\mathrm{Var}(X) \le \|A\|_F^2\,\|B\|_F^2$. Good enough?

But $\|AB\|_F \le \|A\|_F\,\|B\|_F$, and equality can hold. So even in the best case, the error can be as large as $\|AB\|_F$ itself! No good.

What is a general method of reducing the variance? Take $s$ i.i.d. copies of $X$ and average them. The variance is cut down by a factor of $s$.

Length Squared Sampling in Matrices November 2, 2017 9 / 1
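The averaging claim is easy to verify numerically; a scalar sketch (illustrative only; an exponential r.v. with known variance 1):

```python
import numpy as np

rng = np.random.default_rng(1)

# X ~ Exponential(1), so Var(X) = 1.
s, trials = 16, 200000
samples = rng.exponential(1.0, size=(trials, s))

var_single = samples[:, 0].var()         # empirical Var(X), about 1
var_of_avg = samples.mean(axis=1).var()  # about Var(X)/s = 1/16

print(var_single, var_of_avg)
```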


Matrix Multiplication Theorem

$A$ is $m \times n$; $B$ is $n \times q$. $AB \approx C\tilde{B}$, where

$C = \Big[\frac{1}{s\,p_{j_1}} A(:,j_1) \,\Big|\, \frac{1}{s\,p_{j_2}} A(:,j_2) \,\Big|\, \cdots \,\Big|\, \frac{1}{s\,p_{j_s}} A(:,j_s)\Big]$, where $j_1, j_2, \ldots, j_s$ are picked in i.i.d. trials according to probabilities $p_j$, $j = 1, 2, \ldots, n$, satisfying $p_j \ge c\,|A(:,j)|^2/\|A\|_F^2$ for all $j$. $\tilde{B}$ is the $s \times q$ matrix of the corresponding rows of $B$. (The $1/s$ makes $C\tilde{B}$ the average of $s$ unbiased one-column estimates.)

$E\big(\|AB - C\tilde{B}\|_F^2\big) \le \frac{\|A\|_F^2\,\|B\|_F^2}{cs}$. This implies $E\big(\|AB - C\tilde{B}\|_F\big) \le \frac{\|A\|_F\,\|B\|_F}{\sqrt{cs}}$.

In words: the Frobenius-norm error goes down as $1/\sqrt{s}$.

Length Squared Sampling in Matrices November 2, 2017 10 / 1
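The theorem is easy to exercise numerically. A small numpy sketch (illustrative; Gaussian test matrices and exact length-squared probabilities, so $c = 1$):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, q = 30, 200, 25
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, q))
p = np.sum(A**2, axis=0) / np.sum(A**2)  # exact length-squared probs (c = 1)

def sampled_product(s, rng):
    """C @ B_tilde: column A(:, j_t) scaled by 1/(s p_{j_t}), paired with row B(j_t, :)."""
    js = rng.choice(n, size=s, p=p)
    C = A[:, js] / (s * p[js])  # m x s
    return C @ B[js, :]         # (m x s) @ (s x q)

AB = A @ B
# Mean Frobenius error over a few trials, for small vs. large s.
errs = [np.mean([np.linalg.norm(sampled_product(s, rng) - AB, "fro")
                 for _ in range(20)])
        for s in (10, 1000)]
print(errs)  # error shrinks roughly like 1/sqrt(s)
```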


Big Data: Implement in 2 passes

“Big data” = cannot be held in RAM.

Do one pass through $A, B$ to compute all the probabilities $p_k$.

With the $p_k$ on hand, toss coins to decide which set of $s$ columns of $A$ (and corresponding rows of $B$) to sample.

Make a second pass through $A, B$ and pull out the sample.

Multiply the sample in RAM and return the result.

For error $\le \varepsilon\,\|A\|_F\,\|B\|_F$ in expectation, $s \ge c/\varepsilon^2$ suffices. For $\varepsilon \in \Omega(1)$, $s \in O(1)$.

If $s \in O(1)$: the two passes take time linear in $mn + nq$ (one read of the data each), while the RAM needed is only $O(n + s(m+q))$, for the probabilities plus the sample.

Length Squared Sampling in Matrices November 2, 2017 11 / 1
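A toy version of the two-pass scheme (illustrative: the `columns` generator stands in for streaming the matrix from disk, and the matrices are small enough to check against the exact product):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, q = 20, 100, 15
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, q))

def columns(M):
    """Stand-in for streaming M from disk, one column at a time."""
    for j in range(M.shape[1]):
        yield M[:, j]

# Pass 1 over A: accumulate squared column lengths, then normalize to p_k.
sq = np.array([c @ c for c in columns(A)])
p = sq / sq.sum()

# Coin tosses: decide which s columns of A (and rows of B) to pull next pass.
s = 50
js = sorted(rng.choice(n, size=s, p=p))

# Pass 2: pull out only the sample -- O(s(m+q)) numbers held in RAM.
C = np.stack([A[:, j] / (s * p[j]) for j in js], axis=1)
B_tilde = np.stack([B[j] for j in js], axis=0)

approx = C @ B_tilde
rel = np.linalg.norm(approx - A @ B, "fro") / np.linalg.norm(A @ B, "fro")
print(rel)  # O(1) here: random Gaussian matrices are a hard case for sampling
```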


Problems solved by length squared and its cousins

Matrix multiplication.

Sketch (compressed representation) of a matrix. (Discussed next.)

Principal Component Analysis (SVD). (Coming.)

Tensor optimization.

Graph sparsification.

Length Squared Sampling in Matrices November 2, 2017 12 / 1


Sketch of a large Matrix

$A$ is an $m \times n$ matrix; $m, n$ large.

Will show: $A$ can be approximated given just a random sample of rows of $A$ and a random sample of columns of $A$, provided the sampling is length-squared. (Not known for other probabilities.)

Can we sketch (approximate) a matrix by a sample of rows alone? No. The sample tells us nothing about the unsampled rows.

Say rank$(A) = k \ll m, n$. If the rows are in “general position”, a sample of $100k$ rows should pin down the row space. But we still don't know, for an unsampled row, which linear combination of the sampled rows it is.

A sample of $O(k)$ columns should yield this information.

Will rigorously prove the error bound without assuming $A$ is low rank.

Length Squared Sampling in Matrices November 2, 2017 13 / 1


[Figure: $A \approx C \cdot U \cdot R$. $C$ = column sample of $A$; $R$ = row sample of $A$; $U$ is computed from $C$ and $R$.]

Length Squared Sampling in Matrices November 2, 2017 14 / 1


Using CUR - an example

A large corpus of documents. Each doc is a word-frequency vector, forming a column of a large word-doc matrix $A$.

A new document $v$ comes in. We want its similarity to each doc in the corpus. If similarity = dot product, then we want $v^T A$.

Problem: preprocess $A$ so that at “query time”, given $v$, we can find an approximation to $v^T A$ fast. But we must bound the error for EVERY $v$: say we want to find $u$ so that $|u - v^T A| \le \delta |v|$.

Will set $u = v^T CUR$. Fast: compute $v^T C$, then multiply by $U$, then by $R$.

Want $\max_v \big|v^T(CUR - A)\big| / |v| \le \delta$. The maximum has a name: the spectral norm of $A - CUR$. So we want $\|A - CUR\|_2 \le \delta$.

Will show $E\big(\|A - CUR\|_2^2\big) \le \|A\|_F^2 / s^{1/3}$, where $s$ = number of sampled columns. Number of sampled rows = $r$.

Length Squared Sampling in Matrices November 2, 2017 15 / 1
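Why the query is fast: multiply left to right, so no $m \times n$ matrix is ever formed. A shape-level numpy sketch (the $C$, $U$, $R$ here are random placeholders with CUR-like shapes, not an actual sketch of some $A$):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, s, r = 1000, 800, 20, 15
C = rng.standard_normal((m, s))   # column sample: m x s
U = rng.standard_normal((s, r))   # linking matrix: s x r
R = rng.standard_normal((r, n))   # row sample: r x n
v = rng.standard_normal(m)

# Query time, cost O(ms + sr + rn): never materialize CUR.
u = ((v @ C) @ U) @ R

# Same vector, but this forms the m x n matrix CUR first.
u_slow = v @ (C @ (U @ R))

print(np.allclose(u, u_slow))
```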

Page 75: Length Squared Sampling in Matriceschiru/datascience/nov_2.pdf · Agenda: Sampling to deal with “big” matrices Our definition of “Big data ” Doesn’t fit into RAM, Obvious

Using CUR- an example

Large Corpus of documents. Each doc is a word-frequency vector.Forms a column of large word-doc matrix A.New Document v comes in. Want its similarity to each doc incorpus. If similarity = dot product, then, want vT A.

Problem: Preprocess A, so that at “query time” given v, can findapproximate vT A fast. But must bound error for EVERY v. Say wewant to find u so that |u− vT A| ≤ δ|v|.Will set u = vT CUR. Fast: Do vT C then times U then times R.Want

Maxv

∣∣∣vT (CUR − A)∣∣∣ /|v| ≤ δ.

The maximum has a name - Spectral norm of A− CUR. So, want||A− CUR||2 ≤ δ. Will show E

(||A− CUR||22

)≤ ||A||

2F

s1/3 , where s =number of sampled columns. Number of sampled rows = r .

Length Squared Sampling in Matrices November 2, 2017 15 / 1

Page 76: Length Squared Sampling in Matriceschiru/datascience/nov_2.pdf · Agenda: Sampling to deal with “big” matrices Our definition of “Big data ” Doesn’t fit into RAM, Obvious

Using CUR- an example

Large Corpus of documents. Each doc is a word-frequency vector.Forms a column of large word-doc matrix A.New Document v comes in. Want its similarity to each doc incorpus. If similarity = dot product, then, want vT A.Problem: Preprocess A, so that at “query time” given v, can findapproximate vT A fast. But must bound error for EVERY v. Say wewant to find u so that |u− vT A| ≤ δ|v|.

Will set u = vT CUR. Fast: Do vT C then times U then times R.Want

Maxv

∣∣∣vT (CUR − A)∣∣∣ /|v| ≤ δ.

The maximum has a name - Spectral norm of A− CUR. So, want||A− CUR||2 ≤ δ. Will show E

(||A− CUR||22

)≤ ||A||

2F

s1/3 , where s =number of sampled columns. Number of sampled rows = r .

Length Squared Sampling in Matrices November 2, 2017 15 / 1

Page 77: Length Squared Sampling in Matriceschiru/datascience/nov_2.pdf · Agenda: Sampling to deal with “big” matrices Our definition of “Big data ” Doesn’t fit into RAM, Obvious

Using CUR- an example

Large Corpus of documents. Each doc is a word-frequency vector.Forms a column of large word-doc matrix A.New Document v comes in. Want its similarity to each doc incorpus. If similarity = dot product, then, want vT A.Problem: Preprocess A, so that at “query time” given v, can findapproximate vT A fast. But must bound error for EVERY v. Say wewant to find u so that |u− vT A| ≤ δ|v|.Will set u = vT CUR. Fast: Do vT C then times U then times R.

WantMaxv

∣∣∣vT (CUR − A)∣∣∣ /|v| ≤ δ.

The maximum has a name - Spectral norm of A− CUR. So, want||A− CUR||2 ≤ δ. Will show E

(||A− CUR||22

)≤ ||A||

2F

s1/3 , where s =number of sampled columns. Number of sampled rows = r .

Length Squared Sampling in Matrices November 2, 2017 15 / 1


Idea

Write A = AI. Pretend to multiply A by I by sampling s columns of A. Proved: error ≤ ||A||_F ||I||_F / √s = ||A||_F √n / √s.
Needs s ≥ n to get error ≤ ||A||_F. Useless. Why?
Assume RR^T is invertible. [True if A is not degenerate. Why?] P = R^T (RR^T)^{-1} R acts as the identity on the row space V of R:

(1) x ∈ V ⇒ x^T = y^T R for some y. So Px = R^T (RR^T)^{-1} R R^T y = R^T y = x.
(2) If x ∈ V^⊥, then Rx = 0, so Px = R^T (RR^T)^{-1} R x = 0.

Instead of the pretend AI, do the pretend AP.
Will prove two things which together imply ||A − CUR|| is small:

(a) ||A − AP||_2 is small, from (1) and (2).
(b) C = length-squared sample of columns of A. The corresponding rows of P can be written as UR. [Hint: P ends in R. Note: R's rows do not correspond to columns of C.] So ||AP − CUR|| is small.
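Claims (1) and (2) about P can be checked numerically. A minimal numpy sketch, with a random matrix standing in for the sampled-row matrix R (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.standard_normal((5, 40))          # r = 5 "sampled rows", n = 40
P = R.T @ np.linalg.inv(R @ R.T) @ R      # P = R^T (R R^T)^{-1} R

# (1) any x in the row space of R, i.e. x = R^T y, is fixed by P
y = rng.standard_normal(5)
x = R.T @ y
assert np.allclose(P @ x, x)

# (2) any x orthogonal to the row space is sent to 0; build one by
# subtracting off the projected component of a random vector
z = rng.standard_normal(40)
x_perp = z - P @ z
assert np.allclose(P @ x_perp, np.zeros(40))
```

Together the two assertions say P is the orthogonal projection onto the row space of R.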


Proofs

Proposition: A ≈ AP. I.e., E(||A − AP||_2^2) is at most (1/√r) ||A||_F^2.

Recall: ||A − AP||_2^2 = max_{x : |x| = 1} |(A − AP)x|^2.
If x is in the row space V of R, then Px = x, so (A − AP)x = 0.
Every vector is the sum of a vector in V plus a vector in V^⊥. So the max is attained at some x ∈ V^⊥, where Px = 0 and (A − AP)x = Ax.
For such x, Rx = 0, so |Ax|^2 = x^T A^T A x = x^T (A^T A − R^T R) x ≤ ||A^T A − R^T R||_2 |x|^2 ≤ ||A^T A − R^T R||_2.
Suffices to prove E(||A^T A − R^T R||_2^2) ≤ ||A||_F^4 / r.
Matrix Multiplication Theorem! Why? Pretend we are multiplying A^T by A by picking columns of A^T (i.e., rows of A) by length-squared sampling...
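The last step is the Matrix Multiplication Theorem applied to the product A^T · A. A small numpy experiment illustrates the bound: the spectral error of R^T R against A^T A falls as r grows (the 1/sqrt(r·p_i) row scaling is assumed, as is standard for this estimator; dimensions and seeds are arbitrary):

```python
import numpy as np

def length_squared_sample_rows(A, r, seed=None):
    """r rows of A, row i picked i.i.d. with p_i = |a_i|^2 / ||A||_F^2,
    each scaled by 1/sqrt(r * p_i) so that E[R^T R] = A^T A."""
    rng = np.random.default_rng(seed)
    p = np.sum(A * A, axis=1) / np.sum(A * A)
    idx = rng.choice(A.shape[0], size=r, p=p)
    return A[idx, :] / np.sqrt(r * p[idx, None])

A = np.random.default_rng(0).standard_normal((5000, 30))
fro2 = np.linalg.norm(A) ** 2
errs = {}
for r in (50, 500):
    R = length_squared_sample_rows(A, r, seed=1)
    errs[r] = np.linalg.norm(A.T @ A - R.T @ R, 2)  # spectral norm
    print(r, errs[r] / fro2)  # shrinks roughly like 1/sqrt(r)
```

Note the theorem bounds the Frobenius error by ||A||_F^2/√r in expectation, and spectral norm ≤ Frobenius norm, which is what the proposition uses.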


Proof - II

Lemma: AP ≈ CUR.

C is a length-squared sample of columns of A.
Want to pick the corresponding rows of P = R^T (RR^T)^{-1} R. These can be written as UR for some U.
Error: E(||AP − CUR||_F^2) ≤ ||A||_F^2 ||P||_F^2 / s, by the Matrix Multiplication Theorem.
Bound ||P||_F: P has rank r and is the identity on an r-dimensional subspace. Prove that any such P has ||P||_F^2 = r.
Putting it together, we get

    E(||A − CUR||_2^2) ≤ ||A||_F^2 (1/√r + r/s).

Optimal choice: r = s^{2/3}.
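The ||P||_F^2 = r step can be verified directly: P is symmetric and idempotent, so ||P||_F^2 = trace(P^T P) = trace(P) = rank(P) = r. A quick numpy check with a random matrix standing in for R:

```python
import numpy as np

rng = np.random.default_rng(0)
r, n = 6, 50
R = rng.standard_normal((r, n))
P = R.T @ np.linalg.inv(R @ R.T) @ R

# P is a rank-r orthogonal projection: its eigenvalues are r ones and
# n - r zeros, so ||P||_F^2 equals r exactly (up to floating point).
print(np.linalg.norm(P, 'fro') ** 2)   # → 6.0 (up to floating point)
assert np.isclose(np.linalg.norm(P, 'fro') ** 2, r)
```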


CUR Theorem

Hypothesis: A is an m × n matrix; r and s are positive integers. C is an m × s matrix of s columns of A picked according to length-squared sampling, and R is an r × n matrix of r rows of A picked according to length-squared sampling.
Conclusion: From C and R we can find an s × r matrix U so that

    E(||A − CUR||_2^2) ≤ ||A||_F^2 (2/√r + 2r/s) = ||A||_F^2 O(1/s^{1/3}),

choosing r = s^{2/3}.
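The whole construction fits in a short numpy sketch. The factorization of the sampled rows of P as UR follows the hint on the "Idea" slide (each sampled row of P is a sampled column of R, transposed, times (RR^T)^{-1} R); the exact form of U below is my reading of that hint, and the standard 1/sqrt(k·p) scalings are assumed throughout. Dimensions are arbitrary, with r ≈ s^{2/3}:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s, r = 300, 400, 200, 34          # r ≈ s^(2/3)
A = rng.standard_normal((m, n))

fro2 = np.sum(A * A)                    # ||A||_F^2
pc = np.sum(A * A, axis=0) / fro2       # column probabilities
pr = np.sum(A * A, axis=1) / fro2       # row probabilities

jc = rng.choice(n, size=s, p=pc)        # sampled column indices
ir = rng.choice(m, size=r, p=pr)        # sampled row indices
C = A[:, jc] / np.sqrt(s * pc[jc])              # m x s
R = A[ir, :] / np.sqrt(r * pr[ir, None])        # r x n

# Rows of P = R^T (R R^T)^{-1} R at the sampled column indices, with C's
# scaling, factor as U R where U = Phi (R R^T)^{-1}:
Phi = R[:, jc].T / np.sqrt(s * pc[jc])[:, None]  # s x r
U = Phi @ np.linalg.inv(R @ R.T)                 # s x r

err2 = np.linalg.norm(A - C @ U @ R, 2) ** 2
print(err2 / fro2)  # expectation bound: 2/sqrt(r) + 2*r/s ≈ 0.68 here
```

C U R is exactly the length-squared estimator of AP from the proof, so its error is governed by the two lemmas above.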
