Transcript of Talwalkar MLconf talk
Divide-and-Conquer Matrix Factorization
Ameet Talwalkar, UC Berkeley
November 15th, 2013
Collaborators: Lester Mackey [2], Michael I. Jordan [1], Yadong Mu [3], Shih-Fu Chang [3]
[1] UC Berkeley, [2] Stanford University, [3] Columbia University
Three Converging Trends
✦ Big Data
✦ Distributed Computing
✦ Machine Learning
Goal: Extend ML to the Big Data Setting
Challenge: ML not developed with scalability in mind
✦ Does not naturally scale / leverage distributed computing
Our approach: Divide-and-conquer
✦ Apply existing base algorithms to subsets of data and combine
✓ Build upon existing suites of ML algorithms
✓ Preserve favorable algorithm properties
✓ Naturally leverage distributed computing
✦ E.g.,
✦ Matrix factorization (DFC) [MTJ, NIPS11; TMMFJ, ICCV13]
✦ Assessing estimator quality (BLB) [KTSJ, ICML12; KTSJ, JRSS13; KTASJ, KDD13]
✦ Genomic variant calling [BTTJPYS13, submitted; CTZFJP13, submitted]
Matrix Completion
Goal: Recover a matrix from a subset of its entries
Can we do this at scale?
✦ Netflix: 30M users, 100K+ videos
✦ Facebook: 1B users
✦ Pandora: 70M active users, 1M songs
✦ Amazon: millions of users and products
✦ ...
Reducing Degrees of Freedom
✦ Problem: Impossible without additional information
✦ mn degrees of freedom
✦ Solution: Assume a small number of factors determine preference
✦ O(m + n) degrees of freedom
✦ Linear storage costs
[Figure: an m x n matrix factors as (m x r) times (r x n), i.e., 'low-rank']
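The storage claim above is easy to check numerically: a rank-r matrix kept as two factors needs r(m + n) numbers instead of mn (a minimal sketch; the dimensions below are arbitrary).

```python
import numpy as np

m, n, r = 1000, 500, 10

# A rank-r matrix stored as two factors instead of all m*n entries.
A = np.random.randn(m, r)   # m x r factor
B = np.random.randn(r, n)   # r x n factor
M = A @ B                   # full m x n matrix, rank <= r

full_storage = M.size                # mn entries
factored_storage = A.size + B.size   # r(m + n) entries
print(full_storage, factored_storage)  # 500000 vs 15000
```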
Bad Sampling
✦ Problem: We have no rating information about some rows/columns
✦ Solution: Assume the Ω̃(r(n+m)) observed entries are drawn uniformly at random
Bad Information Spread [Candès and Recht, 2009]
✦ Problem: Other ratings don't inform us about the missing rating (bad spread of information)
✦ Solution: Assume incoherence with the standard basis
Matrix Completion
[Figure: In = Low-rank + 'noise']
Goal: Recover a matrix from a subset of its entries, assuming
✦ low-rank, incoherent
✦ uniform sampling
✦ Nuclear-norm heuristic
+ strong theoretical guarantees
+ good empirical results
− very slow computation
Goal: Scale MC algorithms and preserve guarantees
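For intuition, the heuristic's two basic ingredients can be sketched in a few lines of numpy (an illustration, not the exact solver from the talk): the nuclear norm itself, and the singular-value soft-thresholding step that nuclear-norm solvers apply repeatedly. Each application costs a full SVD, which is the source of the slow computation noted above.

```python
import numpy as np

def nuclear_norm(X):
    """Sum of singular values: the convex surrogate for rank."""
    return np.linalg.svd(X, compute_uv=False).sum()

def svd_shrink(X, tau):
    """Soft-threshold the singular values of X by tau -- the core (and
    costly) step inside iterative nuclear-norm solvers."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

For example, `svd_shrink(np.eye(3), 0.4)` shrinks each unit singular value of the identity to 0.6.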
Divide-Factor-Combine (DFC) [MTJ, NIPS11]
✦ D step: Divide input matrix into submatrices
✦ F step: Factor in parallel using a base MC algorithm
✦ C step: Combine submatrix estimates
Advantages:
✦ Submatrix factorization is much cheaper and easily parallelized
✦ Minimal communication between parallel jobs
✦ Retains comparable recovery guarantees (with proper choice of division / combination strategies)
DFC-Proj
✦ D step: Randomly partition observed entries into t submatrices
✦ F step: Complete the submatrices in parallel
✦ Reduced cost: expect a t-fold speedup per iteration
✦ Parallel computation: pay the cost of one cheaper MC
✦ C step: Project onto a single low-dimensional column space
✦ Roughly, share information across sub-solutions
✦ Minimal cost: linear in n, quadratic in the rank of the sub-solutions
✦ Ensemble: Project onto the column space of each sub-solution and average
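The three steps can be sketched as follows. This is a minimal illustration with hypothetical names: `base_mc` stands in for any base MC solver, the division is over columns for simplicity, and the C step projects onto the column space of the first sub-solution.

```python
import numpy as np

def dfc_proj(M_obs, mask, t, base_mc, rank):
    """Sketch of DFC-Proj: split columns into t groups, complete each
    submatrix with a base MC solver (parallelizable in practice), then
    combine by projecting onto one sub-solution's column space."""
    n = M_obs.shape[1]
    groups = np.array_split(np.random.permutation(n), t)
    # F step: complete each column submatrix independently.
    subs = [base_mc(M_obs[:, g], mask[:, g]) for g in groups]
    # C step: column space of the first sub-solution, truncated to `rank`.
    U = np.linalg.svd(subs[0], full_matrices=False)[0][:, :rank]
    L = np.empty_like(M_obs, dtype=float)
    for g, S in zip(groups, subs):
        L[:, g] = U @ (U.T @ S)  # project each sub-solution onto span(U)
    return L
```

As a sanity check, on a fully observed rank-1 matrix with the identity as the "base solver", the projection step reassembles the matrix exactly, since every sub-solution's columns already lie in the shared column space.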
Does It Work? Yes, with high probability.
Theorem: Assume:
✦ L0 is low-rank and incoherent,
✦ Ω̃(r(n+m)) entries are sampled uniformly at random,
✦ the nuclear norm heuristic is the base algorithm.
Then L̂ = L0 with (slightly less) high probability.
✦ Noisy setting: (2 + ε) approximation of the original bound
✦ Can divide into an increasing number of subproblems (t → ∞) when the number of observed entries is ω̃(r²(n+m))
DFC Noisy Recovery
✦ Noisy recovery relative to the base algorithm (n = 10K, r = 10)
[Figure: MC RMSE vs. % revealed entries for Proj-10%, Proj-Ens-10%, and Base-MC]
DFC Speedup
✦ Speedup over APG for random matrices with 4% of entries revealed and r = 0.001n
[Figure: MC time (s) vs. m for Proj-10%, Proj-Ens-10%, and Base-MC]
Matrix Completion
Netflix Prize:
✦ 100 million ratings in {1, ..., 5}
✦ 18K movies, 480K users
✦ Issues: full-rank; noisy, non-uniform observations

Method          Netflix Error   Netflix Time
Nuclear Norm    0.8433          2653.1s
DFC, t=4        0.8436          689.5s
DFC, t=10       0.8484          289.7s
DFC-Ens, t=4    0.8411          689.5s
DFC-Ens, t=10   0.8433          289.7s
Robust Matrix Factorization
[Chandrasekaran, Sanghavi, Parrilo, and Willsky, 2009; Candès, Li, Ma, and Wright, 2011; Zhou, Li, Wright, Candès, and Ma, 2010]
[Figure comparing the three models:
Matrix Completion: In = Low-rank + 'noise'
Principal Component Analysis: In = Low-rank + 'noise'
Robust Matrix Factorization: In = Low-rank + Sparse Outliers + 'noise']
Video Surveillance
✦ Goal: separate foreground from background
✦ Store video as a matrix
✦ Low-rank = background
✦ Outliers = movement
[Figure: Original Frame; Nuclear Norm (342.5s); DFC-5% (24.2s); DFC-0.5% (5.2s)]
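The store-video-as-matrix idea can be illustrated with a toy stand-in for the solvers above (a rank-1 SVD fit on synthetic data, not the nuclear-norm or DFC methods from the talk): flatten each frame into a column, treat the best low-rank fit as the static background, and flag large residuals as movement.

```python
import numpy as np

# Synthetic "video": each column is a flattened frame of a static scene,
# with an object briefly brightening one pixel.
h, w, frames = 4, 5, 30
background = np.linspace(0.0, 1.0, h * w)        # static scene, flattened
V = np.tile(background[:, None], (1, frames))    # (pixels x frames) matrix
V[7, 10:15] += 5.0                               # "object" passes pixel 7

U, s, Vt = np.linalg.svd(V, full_matrices=False)
low_rank = s[0] * np.outer(U[:, 0], Vt[0])       # rank-1 background estimate
outliers = np.abs(V - low_rank) > 1.0            # movement mask (boolean)
```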
Subspace Segmentation
[Liu, Lin, and Yu, 2010]
[Figure comparing the three models:
Matrix Completion: In = Low-rank + 'noise'
Principal Component Analysis: In = Low-rank + 'noise'
Subspace Segmentation: In = Low-rank + 'noise']
Motivation: Face images
[Figure: face images (In) = Low-rank + 'noise', via Principal Component Analysis]
✦ Model images of one person via one low-dimensional subspace
Motivation: Face images (Subspace Segmentation)
[Figure: face images (In) = Low-rank + 'noise', via Subspace Segmentation]
✦ Model images of five people via five low-dimensional subspaces
✦ Recover subspaces to cluster images
✦ Nuclear norm heuristic provably recovers the subspaces
✦ Guarantees are preserved with DFC [TMMFJ, ICCV13]
✦ Toy Experiment: Identify images corresponding to the same person (10 people, 640 images)
✦ DFC Results: Linear speedup, state-of-the-art accuracy
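To make "recover subspaces to cluster images" concrete, here is a toy sketch with synthetic one-dimensional subspaces and a plain inner-product affinity (not the nuclear-norm solver the slide refers to): columns from the same subspace have nonzero inner products, columns from different subspaces do not, so the affinity matrix is block diagonal and clusters fall out of its support.

```python
import numpy as np

rng = np.random.default_rng(0)
# 4 "images" of person 1 in span(e1), 4 of person 2 in span(e2), in R^5.
X1 = np.outer([1, 0, 0, 0, 0], rng.uniform(1, 2, 4))
X2 = np.outer([0, 1, 0, 0, 0], rng.uniform(1, 2, 4))
X = np.hstack([X1, X2])                # (5 x 8) data matrix

A = np.abs(X.T @ X) > 1e-9             # affinity support: block diagonal
labels = A[0].astype(int)              # columns connected to column 0
```

Here `labels` assigns 1 to the four columns sharing column 0's subspace and 0 to the rest; with noisy, higher-dimensional subspaces one would instead run spectral clustering on the affinity.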
Video Event Detection
✦ Input: videos, some of which are associated with events
✦ Goal: predict events for unlabeled videos
✦ Idea:
✦ Featurize each video
✦ Learn video clusters via the nuclear norm heuristic
✦ Given labeled nodes and cluster structure, make predictions
Can do this at scale with DFC!
DFC Summary
✦ DFC: distributed framework for matrix factorization
✦ Similar recovery guarantees
✦ Significant speedups
✦ DFC applied to 3 classes of problems:
✦ Matrix completion
✦ Robust matrix factorization
✦ Subspace recovery
✦ Extend DFC to other MF methods, e.g., ALS, SGD?
Big Data and Distributed Computing are valuable resources, but ...
✦ Challenge 1: ML not developed with scalability in mind
✦ Divide-and-Conquer (e.g., DFC)
✦ Challenge 2: ML not developed with ease-of-use in mind
✦ MLbase (www.mlbase.org)