Linear Sketching and Applications to Distributed Computation

Cameron Musco

November 7, 2014

Overview

Linear Sketches

- Johnson-Lindenstrauss Lemma
- Heavy-Hitters

Applications

- k-Means Clustering
- Spectral Sparsification

Linear Sketching

- Randomly choose Π ∼ D

Linear Sketching

- Oblivious: Π chosen independently of x.
- Composable: Π(x + y) = Πx + Πy.

Linear Sketching

- Streaming algorithms with polylog(n) space.
- Frequency moments, heavy hitters, entropy estimation.

As the updates x0, x1, ..., xn stream by, maintain only the sketch:

Πx0 → Π(x0 + x1) → ... → Π(x0 + x1 + ... + xn)
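As a quick illustration of that composability, here is a minimal numpy sketch (the dense Gaussian Π is only a stand-in for the structured sketches discussed below; the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 200                              # ambient dimension, sketch dimension

Pi = rng.standard_normal((m, n)) / np.sqrt(m)   # random linear sketch

# A stream of updates x_0, x_1, ..., x_T (insertions and deletions are both fine).
updates = [rng.standard_normal(n) for _ in range(50)]

# Maintain only the m-dimensional sketch, never the full vector.
sketch = np.zeros(m)
for x_t in updates:
    sketch += Pi @ x_t                          # linearity: Pi(x + y) = Pi x + Pi y

x_total = np.sum(updates, axis=0)
print(np.allclose(sketch, Pi @ x_total))        # True: same as sketching the aggregate
```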

Linear Sketching

Two Main Tools

- Johnson-Lindenstrauss sketches (randomized dimensionality reduction, subspace embeddings, etc.)
- Heavy-hitters sketches (sparse recovery, compressive sensing, ℓp sampling, graph sketching, etc.)

Johnson-Lindenstrauss Lemma

- Low-dimensional embedding: n → m = O(log(1/δ)/ε²)
- ‖Πx‖₂² ≈ε ‖x‖₂².

$$\frac{1}{\sqrt{m}}\begin{bmatrix} \mathcal{N}(0,1) & \cdots & \mathcal{N}(0,1) \\ \vdots & & \vdots \\ \mathcal{N}(0,1) & \cdots & \mathcal{N}(0,1) \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \frac{1}{\sqrt{m}}\begin{bmatrix} \mathcal{N}(0,\ x_1^2 + \cdots + x_n^2) \\ \vdots \\ \mathcal{N}(0,\ x_1^2 + \cdots + x_n^2) \end{bmatrix}$$

Johnson-Lindenstrauss Lemma

- Low-dimensional embedding: n → m = O(log(1/δ)/ε²)
- ‖Πx‖₂² ≈ε ‖x‖₂².

$$\frac{1}{\sqrt{m}}\begin{bmatrix} \mathcal{N}(0,1) & \cdots & \mathcal{N}(0,1) \\ \vdots & & \vdots \\ \mathcal{N}(0,1) & \cdots & \mathcal{N}(0,1) \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \frac{1}{\sqrt{m}}\begin{bmatrix} \mathcal{N}(0,\ \|x\|_2^2) \\ \vdots \\ \mathcal{N}(0,\ \|x\|_2^2) \end{bmatrix}$$

$$\|\Pi x\|_2^2 = \frac{1}{m}\sum_{i=1}^{m} \mathcal{N}(0,\ \|x\|_2^2)^2 \approx_\epsilon \|x\|_2^2$$
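A minimal numpy check of this norm-preservation guarantee (the constant in m is illustrative, not the tight one):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, delta = 10_000, 0.2, 0.01
m = int(np.ceil(4 * np.log(1 / delta) / eps**2))   # m = O(log(1/delta)/eps^2)

x = rng.standard_normal(n)
Pi = rng.standard_normal((m, n)) / np.sqrt(m)      # dense Gaussian JL matrix

ratio = np.linalg.norm(Pi @ x) ** 2 / np.linalg.norm(x) ** 2
print(m, ratio)                                    # ratio lands in roughly [1 - eps, 1 + eps]
```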

Johnson-Lindenstrauss Lemma

That's it! Just basic statistics.

- Sparse constructions.
- ±1 entries can replace the Gaussians.
- Small sketch representation, i.e. a small random seed (otherwise storing Π takes more space than storing x).

Heavy-Hitters

- Count sketch, sparse recovery, ℓp sampling, point query, graph sketching, sparse Fourier transform
- Simple idea: hashing

Heavy-Hitters

- Random signs to deal with negative entries
- Repeat many times, 'decode' heavy buckets

Each repetition hashes the n coordinates into 1/ε buckets with random signs:

$$\begin{bmatrix} 1 & 0 & 0 & \cdots & -1 & 0 \\ \vdots & & & & & \vdots \\ 0 & -1 & 0 & \cdots & 0 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} c_1 \\ \vdots \\ c_{1/\epsilon} \end{bmatrix}$$

A second, independent hashing gives another bucket vector $(\ldots, c_5, \ldots, c_{1/\epsilon})^\top$, and so on.

Heavy-Hitters

- Random signs to deal with negative entries
- Repeat many times, 'decode' heavy buckets

The repetitions give bucket vectors $(c_1, \ldots, c_{1/\epsilon})$, $(\ldots, c_5, \ldots, c_{1/\epsilon})$, $(\ldots, c_{20}, \ldots, c_{1/\epsilon})$, ...

- h1(2) = 1, h2(2) = 5, h3(2) = 20, h4(2) = 1/ε → recover x2 whenever $\frac{x_2^2}{\|x\|_2^2} \ge \epsilon$, i.e. whenever its square is at least an ε fraction of the squared norm (one bucket's worth, since each repetition has 1/ε buckets).
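A minimal count-sketch in numpy along these lines (bucket and repetition counts are illustrative; real implementations derive the hashes and signs from a short seed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_buckets, n_reps = 10_000, 64, 9          # ~1/eps buckets, a few repetitions

# Hash functions and random signs (stored explicitly here for simplicity).
buckets = rng.integers(0, n_buckets, size=(n_reps, n))
signs = rng.choice([-1.0, 1.0], size=(n_reps, n))

def sketch(x):
    """Count sketch of x: n_reps rows of n_buckets signed bucket sums."""
    C = np.zeros((n_reps, n_buckets))
    for r in range(n_reps):
        np.add.at(C[r], buckets[r], signs[r] * x)
    return C

def estimate(C, i):
    """Median-across-repetitions estimate of x_i from its buckets."""
    return np.median([signs[r, i] * C[r, buckets[r, i]] for r in range(n_reps)])

# A vector with a couple of heavy coordinates buried in noise.
x = rng.standard_normal(n)
x[7] += 200.0
x[42] -= 150.0

C = sketch(x)
for i in [7, 42, 100]:
    # Heavy entries come back approximately; small ones are lost in bucket noise.
    print(i, round(x[i], 2), round(estimate(C, i), 2))
```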

Heavy-Hitters

- Just a random encoding
- polylog(n) space to recover entries holding a 1/log(n) fraction of the total (squared) norm
- Basically the best we could hope for.
- Random scalings give ℓp sampling.

Application 1: k-means Clustering

- Assign points to k clusters
- k is fixed
- Minimize distance to centroids: $\sum_{i=1}^{n} \|a_i - \mu_{C(i)}\|_2^2$

Application 1: k-means Clustering
Lloyd's algorithm, aka the 'k-means algorithm'

- Initialize random clusters
- Compute centroids
- Assign each point to the closest centroid
- Repeat
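A compact numpy version of Lloyd's iterations, for concreteness (plain k-means with random initialization; no k-means++ seeding or convergence test):

```python
import numpy as np

def lloyd(A, k, n_iters=25, seed=0):
    """Lloyd's algorithm on the rows of A (n points in d dimensions)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    centroids = A[rng.choice(n, size=k, replace=False)]        # random initialization
    for _ in range(n_iters):
        # Assign each point to its closest centroid.
        dists = ((A[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids (keep the old centroid if a cluster empties out).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = A[labels == j].mean(axis=0)
    cost = ((A - centroids[labels]) ** 2).sum()
    return labels, centroids, cost

A = np.random.default_rng(1).standard_normal((500, 20))
labels, centroids, cost = lloyd(A, k=5)
print(cost)
```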

Application 1: k-means Clustering

What if data is distributed?

Application 1: k-means Clustering

- At each iteration, each server sends out its new local centroids
- Adding them together gives the new global centroids
- O(sdk) communication per iteration
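One standard way to realize the "add the local centroids together" step is for each server to send per-cluster sums and counts (i.e., count-weighted centroids); the sketch below assumes that convention, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
s, d, k = 4, 10, 3                               # servers, dimension, clusters
local_data = [rng.standard_normal((200, d)) for _ in range(s)]
centroids = rng.standard_normal((k, d))          # current global centroids

def local_message(A, centroids):
    """Each server assigns its points and sends k weighted sums + k counts: O(dk) numbers."""
    dists = ((A[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    sums = np.zeros_like(centroids)
    counts = np.zeros(centroids.shape[0])
    for j in range(centroids.shape[0]):
        sums[j] = A[labels == j].sum(axis=0)
        counts[j] = (labels == j).sum()
    return sums, counts

# The coordinator adds the s messages together -> new global centroids.
msgs = [local_message(A, centroids) for A in local_data]
total_sums = sum(m[0] for m in msgs)
total_counts = sum(m[1] for m in msgs)
new_centroids = total_sums / np.maximum(total_counts, 1)[:, None]
print(new_centroids.shape)                       # (k, d), after O(s * d * k) total communication
```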

Application 1: k-means Clustering

Can we do better?

- Balcan et al. 2013: O((kd + sk)d)
- Locally computable O(kd + sk)-sized coreset.
- All data is aggregated and k-means is performed on a single server.

Can we do even better?

- O((kd + sk)d) → O((kd′ + sk)d′) for d′ ≪ d.
- Liang et al. 2013, Balcan et al. 2014: O((kd′ + sk)d′ + sdk) with d′ = O(k/ε²), via distributed PCA.
- CEMMP '14: improved O(k/ε²) to ⌈k/ε⌉.

Application 1: k-means Clustering

Can we do better?

- O(sdk) is inherent in communicating O(k) singular vectors of dimension d to s servers.
- Apply Johnson-Lindenstrauss!
- Goal: minimize distance to centroids: $\sum_{i=1}^{n} \|a_i - \mu_{C(i)}\|_2^2$
- Equivalently, all pairwise distances within clusters: $\sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{(j,l) \in C_i} \|a_j - a_l\|_2^2$

Application 1: k-means Clustering

- $\|a_i - a_j\|_2^2 \approx_\epsilon \|(a_i - a_j)\Pi\|_2^2 = \|a_i\Pi - a_j\Pi\|_2^2$
- O(n²) distance vectors, so set the failure probability to δ = 1/(100n²).
- Π needs O(log(1/δ)/ε²) = O(log n/ε²) dimensions

Application 1: k-means Clustering

Immediately distributes: we just need to share the randomness specifying Π.
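A minimal sketch of that idea, assuming every server derives the same Π from a shared seed (names and sizes here are illustrative):

```python
import numpy as np

SHARED_SEED = 1234                  # the only thing servers need to agree on
d, d_prime = 1000, 60               # original and reduced dimension (illustrative)

def project(A_local, seed=SHARED_SEED):
    """Every server builds the same Pi from the shared seed and projects its own rows."""
    rng = np.random.default_rng(seed)
    Pi = rng.standard_normal((d, d_prime)) / np.sqrt(d_prime)
    return A_local @ Pi             # pairwise distances preserved up to (1 +/- eps)

rng = np.random.default_rng(0)
servers = [rng.standard_normal((500, d)) for _ in range(4)]
projected = [project(A) for A in servers]        # done locally, no data exchanged

# Any distributed k-means protocol (e.g. the coreset approach above) now runs
# on d_prime-dimensional points instead of d-dimensional ones.
print(projected[0].shape)           # (500, d_prime)
```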

Application 1: k-means Clustering

Our Paper [Cohen, Elder, Musco, Musco, Persu '14]

- Show that Π only needs O(k/ε²) columns
- Almost completely removes the dependence on input size!
- O(k³ + sk² + log d) communication; the log d gets swallowed in the word size.

Application 1: k-means Clustering

Highest-Level Idea for how this works

- Show that the cost of projecting the columns of AΠ onto any k-dimensional subspace approximates the cost of projecting the columns of A onto that subspace.
- Note that k-means can actually be viewed as a column projection problem (see the formulation below).
- k-means clustering is 'constrained' PCA.
- Lots of applications aside from k-means clustering.
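To make the 'constrained PCA' view concrete, here is the standard reformulation (notation assumed here: A ∈ R^{n×d} has the points as rows, and X_C is the n×k normalized cluster-indicator matrix with X_C(i, j) = 1/√|C_j| when point i is in cluster j, so X_C X_C^⊤ A replaces each row of A by its cluster centroid):

$$\sum_{i=1}^{n} \|a_i - \mu_{C(i)}\|_2^2 \;=\; \|A - X_C X_C^\top A\|_F^2$$

So k-means minimizes $\|A - X X^\top A\|_F^2$ over the constrained set of rank-k cluster-indicator projections, while unconstrained minimization over all rank-k orthonormal X is exactly PCA / best rank-k approximation.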

Application 1: k-means Clustering

Open Questions

- (9 + ε)-approximation with only O(log k) dimensions! What is the right answer?
- We use O(kd + sk)-sized coresets as a black box and just reduce d. Can we use our linear-algebraic understanding to improve the coreset constructions themselves? It feels like we should be able to.
- These algorithms should be practical. I think testing them out would be useful, for both k-means and PCA.
- Other problems (spectral clustering, SVM, what do people actually do?)

Application 2: Spectral Sparsification

General Idea

- Approximate a dense graph with a much sparser graph.
- Reduce O(n²) edges → O(n log n) edges

Application 2: Spectral Sparsification

Cut Sparsification (Benczur, Karger '96)

- Preserve every cut value to within a (1 ± ε) factor

Applications: minimum cut, sparsest cut, etc.

Graph Sparsification

Cut Sparsification (Benczur, Karger ’96)

- Let $B \in \mathbb{R}^{\binom{n}{2} \times n}$ be the vertex-edge incidence matrix for a graph G.
- Let $x \in \{0,1\}^n$ be an "indicator vector" for some cut.

$\|Bx\|_2^2$ = cut value

Example (4 vertices, edges e12, e13, e23, e24; non-edges get all-zero rows):

          v1   v2   v3   v4
    e12    1   -1    0    0
    e13    1    0   -1    0
    e14    0    0    0    0
    e23    0    1   -1    0
    e24    0    1    0   -1
    e34    0    0    0    0

With $x = (1, 1, 0, 0)^\top$ (cutting {v1, v2} from {v3, v4}), $Bx = (0, 1, 0, 1, 1, 0)^\top$ and $\|Bx\|_2^2 = 3$, the number of edges crossing the cut.
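A quick numpy check of this identity on the example above (nothing beyond the slide's own numbers):

```python
import numpy as np

# Vertex-edge incidence matrix for the 4-vertex example
# (rows ordered e12, e13, e14, e23, e24, e34; non-edges are all-zero rows).
B = np.array([
    [1, -1,  0,  0],   # e12
    [1,  0, -1,  0],   # e13
    [0,  0,  0,  0],   # e14 (not in the graph)
    [0,  1, -1,  0],   # e23
    [0,  1,  0, -1],   # e24
    [0,  0,  0,  0],   # e34 (not in the graph)
], dtype=float)

x = np.array([1, 1, 0, 0], dtype=float)     # cut {v1, v2} vs {v3, v4}

print(B @ x)                                # [0. 1. 0. 1. 1. 0.]
print(np.linalg.norm(B @ x) ** 2)           # 3.0 = number of crossing edges

# Same quantity through the Laplacian L = B^T B:
L = B.T @ B
print(x @ L @ x)                            # 3.0
```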

Graph Sparsification

Cut Sparsification (Benczur, Karger '96). So, $\|Bx\|_2^2$ = cut value.

Goal: find some $\tilde{B}$ such that, for all $x \in \{0,1\}^n$,

$$(1 - \epsilon)\|Bx\|_2^2 \le \|\tilde{B}x\|_2^2 \le (1 + \epsilon)\|Bx\|_2^2$$

- $x^\top B^\top B x \approx x^\top \tilde{B}^\top \tilde{B} x$.
- $L = B^\top B$ is the graph Laplacian.

Graph Sparsification

Spectral Sparsification (Spielman, Teng '04)

Goal: find some $\tilde{B}$ such that, for all $x \in \mathbb{R}^n$ (not just $\{0,1\}^n$),

$$(1 - \epsilon)\|Bx\|_2^2 \le \|\tilde{B}x\|_2^2 \le (1 + \epsilon)\|Bx\|_2^2$$

Applications: anything cut sparsifiers can do, Laplacian linear system solves, computing effective resistances, spectral clustering, calculating random walk properties, etc.

Graph Sparsification

How are sparsifiers constructed?

Randomly sample edges (i.e., rows from B).

Graph Sparsification

How are sparsifiers constructed?

Sampling probabilities:

- Connectivity for cut sparsifiers [Benczur, Karger '96], [Fung, Hariharan, Harvey, Panigrahi '11].
- Effective resistances (i.e., statistical leverage scores) for spectral sparsifiers [Spielman, Srivastava '08].

Actually oversample: by (effective resistance) × O(log n). Gives sparsifiers with O(n log n) edges, reducing from O(n²).
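A minimal numpy sketch of this sampling rule (illustrative only: it computes exact effective resistances via the Laplacian pseudoinverse, which a real algorithm would only approximate, and the oversampling constant is kept small so the size reduction is visible):

```python
import numpy as np

def incidence(n, edges):
    """Edge-vertex incidence matrix B: one (+1, -1) row per edge."""
    B = np.zeros((len(edges), n))
    for r, (i, j) in enumerate(edges):
        B[r, i], B[r, j] = 1.0, -1.0
    return B

rng = np.random.default_rng(0)
n = 200
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < 0.5]

B = incidence(n, edges)
L = B.T @ B
Lpinv = np.linalg.pinv(L)

# Effective resistance of each edge: tau_e = b_e^T L^+ b_e.
taus = np.sum((B @ Lpinv) * B, axis=1)

# Oversample: keep edge e with prob. ~ tau_e * C log n, reweight kept rows by 1/sqrt(p_e).
C_over = 2.0                                # theory wants a larger constant (~1/eps^2)
p = np.minimum(1.0, C_over * np.log(n) * taus)
keep = rng.random(len(edges)) < p
B_tilde = B[keep] / np.sqrt(p[keep])[:, None]
L_tilde = B_tilde.T @ B_tilde

x = rng.standard_normal(n)
print(len(edges), int(keep.sum()))          # dense vs. sparsified edge count
print((x @ L_tilde @ x) / (x @ L @ x))      # should be close to 1
```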

Application 2: Spectral Sparsification

Highest Level Idea Of Our Approach

Application 2: Spectral Sparsification

Why?

- Semi-streaming model with insertions and deletions
- Near-optimal oblivious graph compression
- Distributed graph computations

Application 2: Spectral Sparsification
Distributed Graph Computation

- Trinity, Pregel, Giraph

Application 2: Spectral Sparsification

- Naive sharing of my data: O(|Vi| n)
- With sketching: O(|Vi| log^c n)

Application 2: Spectral Sparsification
Alternatives to Sketching?

- Simulate message-passing algorithms over the nodes; this is what's done in practice.

Application 2: Spectral Sparsification

Alternatives to Sketching?

- Koutis '14 gives a distributed algorithm for spectral sparsification
- Iteratively computes O(log n) spanners (alternatively, low-stretch trees) to upper bound effective resistances and sample edges.
- Combinatorial and local

Application 2: Spectral Sparsification

- Cost per spanner: O(log² n) rounds, O(m log n) messages, O(log n) message size.
- If simulating, each server sends O(δ(Vi) log n) per round.
- O(δ(Vi) log n) beats our bound of O(|Vi| log n) iff δ(Vi) ≤ |Vi|.
- But in that case, just keep all your outgoing edges and sparsify locally! At worst this adds n edges to the final sparsifier.

Application 2: Spectral Sparsification

Moral of That Story?

- I'm not sure.
- Sparsifiers are very strong. Could we do better for other problems?
- Can we reduce the communication of simulated distributed protocols using sparsifiers?
- What other things can sketches be applied to? The biggest open question is distances: spanners, etc.

Sketching a Sparsifier

We are still going to sample by effective resistance.

- Treat the graph as a resistor network where each edge has resistance 1.
- Flow 1 unit of current from node i to node j and measure the voltage drop between them.

Sketching a Sparsifier

Using the standard V = IR equations:

If $x_e$ has +1 at one endpoint of e and −1 at the other (e.g. $x_e = (1, 0, 0, -1, 0)^\top$), then e's effective resistance is $\tau_e = x_e^\top L^{-1} x_e$.

Sketching a Sparsifier

Effective resistance of edge e is $\tau_e = x_e^\top L^{-1} x_e$.

Alternatively, $\tau_e$ is the eth entry of the vector $BL^{-1}x_e$, AND

$$\tau_e = x_e^\top L^{-1} x_e = x_e^\top (L^{-1})^\top B^\top B L^{-1} x_e = \|BL^{-1}x_e\|_2^2$$
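A quick numpy check of these identities (using the pseudoinverse in place of $L^{-1}$, since the Laplacian is singular; the random graph is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < 0.3]

B = np.zeros((len(edges), n))
for r, (i, j) in enumerate(edges):
    B[r, i], B[r, j] = 1.0, -1.0

L = B.T @ B
Lpinv = np.linalg.pinv(L)                      # pseudoinverse stands in for L^{-1}

e = 0                                          # pick an edge; x_e is the matching row of B
x_e = B[e]

tau_quadratic = x_e @ Lpinv @ x_e              # x_e^T L^+ x_e
tau_entry = (B @ Lpinv @ x_e)[e]               # e-th entry of B L^+ x_e
tau_norm = np.linalg.norm(B @ Lpinv @ x_e)**2  # ||B L^+ x_e||_2^2

print(tau_quadratic, tau_entry, tau_norm)      # all three agree
```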

Sketching a Sparsifier

We just need two more ingredients for $BL^{-1}x_e$:

ℓ2 Heavy Hitters [GLPS10]:

- Sketch a poly(n)-length vector in polylog(n) space.
- Extract any element whose square is an O(1/log n) fraction of the vector's squared norm.

Coarse Sparsifier:

- $\tilde{L}$ such that $x^\top \tilde{L} x = (1 \pm \text{constant})\, x^\top L x$

Sketching a Sparsifier

Putting it all together: recovering edges from $BL^{-1}x_e$

1. Sketch (Π_heavy-hitters)B in n log^c n space.
2. Compute (Π_heavy-hitters)BL^{-1}.
3. For every possible edge e, compute (Π_heavy-hitters)BL^{-1}x_e.
4. Extract the heavy hitters from that vector and check whether the eth entry is one of them.

$$\frac{\big(BL^{-1}x_e(e)\big)^2}{\|BL^{-1}x_e\|_2^2} \approx \frac{\tau_e^2}{\tau_e} = \tau_e$$

So, as long as $\tau_e \ge \Omega(1/\log n)$, we will recover the edge!
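A stripped-down numpy illustration of why the high-resistance edges are exactly the ℓ2 heavy hitters of these vectors (it skips the actual sketching and works with $BL^{+}x_e$ directly, so it is a sanity check of the ratio above rather than the space-efficient algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < 0.2]

B = np.zeros((len(edges), n))
for r, (i, j) in enumerate(edges):
    B[r, i], B[r, j] = 1.0, -1.0

Lpinv = np.linalg.pinv(B.T @ B)
M = B @ Lpinv                       # the real algorithm only ever stores a sketch Pi @ B

threshold = 1.0 / np.log(n)
recovered = []
for e in range(len(edges)):
    v = M @ B[e]                    # v = B L^+ x_e
    ratio = v[e] ** 2 / (v @ v)     # fraction of squared norm on the e-th entry (= tau_e)
    if ratio > threshold:
        recovered.append(e)

taus = np.sum(M * B, axis=1)        # exact effective resistances, for comparison
print(len(recovered), int(np.sum(taus > threshold)))   # the two counts should agree
```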

Sketching a Sparsifier

How about edges with lower effective resistance? Sketch $BL^{-1}x_e$ on subsampled copies of the graph.

Sketching a Sparsifier

How about edges with lower effective resistance in $BL^{-1}x_e$?

- First level: τe > 1/log n, with probability 1.
- Second level: τe > 1/(2 log n), with probability 1/2.
- Third level: τe > 1/(4 log n), with probability 1/4.
- Fourth level: τe > 1/(8 log n), with probability 1/8.
- ...

So, we can sample every edge by (effective resistance) × O(log n).
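A back-of-the-envelope version of that claim (not the actual algorithm: it simply assigns each edge the level at which its scaled-up resistance would cross the ~1/log n heavy-hitter threshold, under the rough assumption that resistances grow by a factor of 2^t after subsampling at rate 2^-t):

```python
import numpy as np

log_n = np.log(1000)                                # illustrative n
taus = np.array([0.5, 0.1, 0.02, 0.004, 0.0008])    # effective resistances of some edges

# Level t keeps each edge with probability 2^-t; edge e becomes "heavy" at the first
# level where tau_e * 2^t exceeds ~1/log n, i.e. t >= -log2(tau_e * log n).
levels = np.maximum(0, np.ceil(-np.log2(taus * log_n))).astype(int)
recovery_prob = 0.5 ** levels                       # chance the edge survives to its level

print(levels)
print(recovery_prob)                                # within a factor of 2 of tau_e * log n
print(np.minimum(1.0, taus * log_n))
```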

Sparsifier Chain

Final Piece [Li, Miller, Peng '12]

- We needed a constant-error spectral sparsifier to get our (1 ± ε) sparsifier.
