
Divide-and-Conquer Matrix Factorization

Ameet Talwalkar, UC Berkeley

November 15th, 2013

Collaborators: Lester Mackey², Michael I. Jordan¹, Yadong Mu³, Shih-Fu Chang³

¹UC Berkeley   ²Stanford University   ³Columbia University

Three Converging Trends

✦ Big Data
✦ Distributed Computing
✦ Machine Learning

Goal: Extend ML to the Big Data Setting

Challenge: ML not developed with scalability in mind
✦ Does not naturally scale / leverage distributed computing

Our approach: Divide-and-conquer
✦ Apply existing base algorithms to subsets of data and combine
  ✓ Build upon existing suites of ML algorithms
  ✓ Preserve favorable algorithm properties
  ✓ Naturally leverage distributed computing
✦ E.g.,
  ✦ Matrix factorization (DFC) [MTJ, NIPS11; TMMFJ, ICCV13]
  ✦ Assessing estimator quality (BLB) [KTSJ, ICML12; KTSJ, JRSS13; KTASJ, KDD13]
  ✦ Genomic variant calling [BTTJPYS13, submitted; CTZFJP13, submitted]

Matrix Completion

Goal: Recover a matrix from a subset of its entries

Can we do this at scale?
✦ Netflix: 30M users, 100K+ videos
✦ Facebook: 1B users
✦ Pandora: 70M active users, 1M songs
✦ Amazon: Millions of users and products
✦ ...

Reducing Degrees of Freedom

✦ Problem: Impossible without additional information
  ✦ mn degrees of freedom for an m × n matrix

✦ Solution: Assume a small number of factors determines preference
  ✦ 'Low-rank': the m × n matrix is the product of an m × r and an r × n matrix
  ✦ O(m + n) degrees of freedom
  ✦ Linear storage costs
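To make the storage claim concrete, here is a minimal numpy sketch (the sizes are illustrative, not from the talk) contrasting the mn parameters of a dense matrix with the r(m + n) parameters of its rank-r factorization:

```python
import numpy as np

m, n, r = 10_000, 5_000, 10          # illustrative sizes

# A rank-r preference matrix is fully determined by two thin factors.
U = np.random.randn(m, r)            # m x r user factors
V = np.random.randn(r, n)            # r x n item factors

# Any single entry can be reconstructed on demand, without ever
# materializing the full m x n matrix.
entry = U[42] @ V[:, 7]

print("dense parameters:   ", m * n)        # 50,000,000
print("low-rank parameters:", r * (m + n))  # 150,000 -- linear in m + n
```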

Bad Sampling

✦ Problem: We have no rating information about some rows or columns
✦ Solution: Assume Ω̃(r(n+m)) observed entries drawn uniformly at random

Bad Information Spread [Candes and Recht, 2009]

✦ Problem: Other ratings don't inform us about the missing rating (bad spread of information)
✦ Solution: Assume incoherence with the standard basis
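Incoherence can be checked numerically. In this hedged sketch (the helper name and constants are mine, not from the talk), the coherence of a column space with orthonormal basis U is (m/r)·max_i ‖row_i(U)‖²; it is small for "spread-out" subspaces and equals m/r when the subspace aligns with standard basis vectors:

```python
import numpy as np

def coherence(U):
    """(m / r) * max_i ||row_i(U)||^2 for U with orthonormal columns."""
    m, r = U.shape
    return (m / r) * np.max(np.sum(U ** 2, axis=1))

rng = np.random.default_rng(0)
m, n, r = 500, 400, 5                 # illustrative sizes

# Random low-rank matrices are incoherent: information spreads well.
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
U = np.linalg.svd(L, full_matrices=False)[0][:, :r]
print("random subspace:", coherence(U))      # small (grows only logarithmically)

# A subspace aligned with standard basis vectors is maximally coherent:
# a few rows carry all the information, so most ratings tell us nothing.
spiky = np.eye(m)[:, :r]
print("spiky subspace: ", coherence(spiky))  # m / r, the worst case
```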

Matrix Completion

In = Low-rank + ‘noise’

Goal: Recover a matrix from a subset of its entries, assuming
✦ low-rank, incoherent
✦ uniform sampling

✦ Nuclear-norm heuristic
  + strong theoretical guarantees
  + good empirical results
  − very slow computation

Goal: Scale MC algorithms and preserve guarantees
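The talk doesn't pin down one solver; a minimal soft-impute-style sketch of the nuclear-norm heuristic (parameters illustrative) shows where the cost goes: every iteration needs a full SVD of an m × n matrix, which is the "very slow computation" above.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: the prox operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete(M_obs, mask, tau=5.0, iters=200):
    """Nuclear-norm matrix completion, soft-impute style (a sketch).

    M_obs: observed entries (zeros elsewhere); mask: True where observed.
    tau and iters are illustrative, not tuned values from the talk.
    """
    Z = np.zeros_like(M_obs)
    for _ in range(iters):
        # Keep observed entries, fill the rest with the current estimate,
        # then shrink toward low rank. One full SVD per iteration.
        Z = svt(np.where(mask, M_obs, Z), tau)
    return Z
```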

Divide-Factor-Combine (DFC) [MTJ, NIPS11]

✦ D step: Divide input matrix into submatrices
✦ F step: Factor in parallel using a base MC algorithm
✦ C step: Combine submatrix estimates

Advantages:
✦ Submatrix factorization is much cheaper and easily parallelized
✦ Minimal communication between parallel jobs
✦ Retains comparable recovery guarantees (with proper choice of division / combination strategies)

DFC-Proj

✦ D step: Randomly partition observed entries into t submatrices

✦ F step: Complete the submatrices in parallel
  ✦ Reduced cost: Expect t-fold speedup per iteration
  ✦ Parallel computation: Pay the cost of one cheaper MC

✦ C step: Project onto a single low-dimensional column space
  ✦ Roughly, share information across sub-solutions
  ✦ Minimal cost: linear in n, quadratic in the rank of the sub-solutions

✦ Ensemble: Project onto the column space of each sub-solution and average (both variants are sketched below)
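Putting the three steps together, here is a hedged sketch of DFC-Proj built on the `complete` solver above. It assumes the D step partitions columns (consistent with the column-projection C step; the paper also analyzes other division strategies), and the C step applies the projector C₁C₁⁺ onto the column space of the first sub-solution:

```python
import numpy as np

def dfc_proj(M_obs, mask, t, base_mc, ensemble=False):
    """Divide-Factor-Combine with column projection (a sketch)."""
    rng = np.random.default_rng(0)
    blocks = np.array_split(rng.permutation(M_obs.shape[1]), t)

    # F step: complete each column submatrix -- trivially parallelizable,
    # and each SVD now involves only ~n/t columns.
    subs = [base_mc(M_obs[:, b], mask[:, b]) for b in blocks]

    def project(anchor):
        # Orthogonal projection of every sub-solution onto the
        # column space of one anchor sub-solution.
        U, s, _ = np.linalg.svd(anchor, full_matrices=False)
        U = U[:, s > 1e-8 * s[0]]          # keep the numerical column space
        L = np.empty_like(M_obs)
        for b, S in zip(blocks, subs):
            L[:, b] = U @ (U.T @ S)
        return L

    if ensemble:   # DFC-Proj-Ens: average over every anchor's projection
        return sum(project(S) for S in subs) / t
    return project(subs[0])                # C step: single column space
```

Usage would look like `dfc_proj(M_obs, mask, t=4, base_mc=complete)`; each `base_mc` call is independent, so the F step maps directly onto parallel workers with no communication until the cheap C step.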

Does It Work?

Yes, with high probability.

Theorem: Assume
✦ L0 is low-rank and incoherent,
✦ Ω̃(r(n+m)) entries are sampled uniformly at random,
✦ the nuclear norm heuristic is the base algorithm.

Then L̂ = L0 with (slightly less) high probability.

✦ Noisy setting: (2 + ε) approximation of the original bound
✦ Can divide into an increasing number of subproblems (t → ∞) when the number of observed entries is ω̃(r²(n+m))

DFC Noisy Recovery

[Figure: MC RMSE vs. % revealed entries for Proj-10%, Proj-Ens-10%, and Base-MC]

✦ Noisy recovery relative to base algorithm (n = 10K, r = 10)

DFC Speedup

✦ Speedup over APG for random matrices with 4% of entries revealed and r = 0.001n

[Figure: MC time (s) vs. m for Proj-10%, Proj-Ens-10%, and Base-MC]

Matrix Completion: Netflix Prize

✦ 100 million ratings in {1, ..., 5}
✦ 18K movies, 480K users
✦ Issues: Full-rank; noisy, non-uniform observations

Method          Error    Time
Nuclear Norm    0.8433   2653.1s
DFC, t=4        0.8436   689.5s
DFC, t=10       0.8484   289.7s
DFC-Ens, t=4    0.8411   689.5s
DFC-Ens, t=10   0.8433   289.7s

Robust Matrix Factorization
[Chandrasekaran, Sanghavi, Parrilo, and Willsky, 2009; Candes, Li, Ma, and Wright, 2011; Zhou, Li, Wright, Candes, and Ma, 2010]

✦ Matrix Completion: In = Low-rank + ‘noise’
✦ Principal Component Analysis: In = Low-rank + ‘noise’
✦ Robust Matrix Factorization: In = Low-rank + Sparse Outliers + ‘noise’
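The talk again uses the nuclear-norm heuristic as the base algorithm. A minimal augmented-Lagrangian sketch of the underlying convex program (principal component pursuit: min ‖L‖* + λ‖S‖₁ s.t. M = L + S, with the standard λ and μ choices from Candes et al., 2011) looks like this; `svt` is the thresholding helper from the completion sketch above:

```python
import numpy as np

def soft(X, tau):
    """Entrywise soft-thresholding: the prox operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(M, iters=100):
    """Principal component pursuit via augmented Lagrangian (a sketch)."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))          # standard regularization weight
    mu = m * n / (4.0 * np.abs(M).sum())    # standard step-size choice
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                    # dual variable
    for _ in range(iters):
        L = svt(M - S + Y / mu, 1.0 / mu)   # low-rank component update
        S = soft(M - L + Y / mu, lam / mu)  # sparse-outlier update
        Y += mu * (M - L - S)               # push toward M = L + S
    return L, S
```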

Video Surveillance

✦ Goal: separate foreground from background
✦ Store video as matrix
  ✦ Low-rank = background
  ✦ Outliers = movement

[Figure: Original frame vs. recovered backgrounds -- Nuclear Norm (342.5s), DFC-5% (24.2s), DFC-0.5% (5.2s)]
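To see how the video example maps onto the `rpca` sketch above, each frame is flattened into one column of the matrix. A tiny synthetic illustration (all shapes and data are made up for this sketch):

```python
import numpy as np

h, w, T = 48, 64, 30
rng = np.random.default_rng(1)
frames = np.repeat(rng.random((1, h, w)), T, axis=0)              # static background
frames[:, 20:28, 10:18] += (rng.random(T) > 0.5)[:, None, None]   # sparse "movement"

M = frames.reshape(T, -1).T                 # one flattened frame per column
L, S = rpca(M)                              # from the sketch above
background = L[:, 0].reshape(h, w)          # low-rank part recovers the background
foreground = np.abs(S[:, 0]).reshape(h, w)  # sparse part flags the motion
```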

Subspace Segmentation [Liu, Lin, and Yu, 2010]

✦ Same template as matrix completion and PCA: In = Low-rank + ‘noise’

Motivation: Face Images

✦ Principal Component Analysis: model images of one person via one low-dimensional subspace

✦ Subspace Segmentation: model images of five people via five low-dimensional subspaces
  ✦ Recover subspaces → cluster images

✦ Nuclear norm heuristic provably recovers the subspaces
✦ Guarantees are preserved with DFC [TMMFJ, ICCV13]

✦ Toy experiment: identify images corresponding to the same person (10 people, 640 images)
✦ DFC results: linear speedup, state-of-the-art accuracy (see the clustering sketch below)
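In the noiseless case, the nuclear-norm program behind the LRR reference (min ‖Z‖* s.t. X = XZ) has the closed-form minimizer Z = V_r V_rᵀ from the skinny SVD of X, and spectral clustering of the resulting affinity recovers the subspaces. A hedged sketch on synthetic data (all sizes and helpers are mine, not from the talk):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
d, r, per, k = 100, 3, 40, 5     # ambient dim, subspace dim, pts each, "people"

# Points drawn from k independent r-dimensional subspaces ("face images").
X = np.hstack([rng.standard_normal((d, r)) @ rng.standard_normal((r, per))
               for _ in range(k)])

# Noiseless LRR solution in closed form: Z = Vr Vr^T (shape interaction matrix).
_, s, Vt = np.linalg.svd(X, full_matrices=False)
Vr = Vt[s > 1e-8 * s[0]].T
W = np.abs(Vr @ Vr.T)
W = W + W.T                       # symmetric affinity between points

# Spectral clustering on the affinity groups images by person.
dinv = 1.0 / np.sqrt(W.sum(axis=1))
vals, vecs = np.linalg.eigh(dinv[:, None] * W * dinv[None, :])
emb = vecs[:, -k:]                                 # top-k eigenvectors
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # row-normalize
_, labels = kmeans2(emb, k, seed=0)                # cluster id per image
```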

Video Event Detection

✦ Input: videos, some of which are associated with events
✦ Goal: predict events for unlabeled videos
✦ Idea:
  ✦ Featurize each video
  ✦ Learn video clusters via nuclear norm heuristic
  ✦ Given labeled nodes and cluster structure, make predictions (a toy sketch follows)

Can do this at scale with DFC!
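The talk doesn't spell out the prediction step; one simple instantiation is a majority vote over labeled videos within each recovered cluster. A toy sketch (all arrays and the helper name are illustrative):

```python
import numpy as np

def predict_from_clusters(cluster_ids, labels):
    """Propagate event labels to unlabeled videos by per-cluster majority vote.

    cluster_ids: cluster assignment per video (e.g. from the
                 subspace-segmentation sketch above).
    labels:      event id per video, or -1 if unlabeled.
    """
    preds = labels.copy()
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        known = labels[members & (labels >= 0)]
        if known.size:  # assign the cluster's majority event to unlabeled videos
            preds[members & (labels < 0)] = np.bincount(known).argmax()
    return preds

cluster_ids = np.array([0, 0, 0, 1, 1, 1])
labels      = np.array([2, -1, 2, -1, 3, -1])       # -1 = unlabeled video
print(predict_from_clusters(cluster_ids, labels))   # -> [2 2 2 3 3 3]
```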

DFC Summary

✦ DFC: distributed framework for matrix factorization
  ✦ Similar recovery guarantees
  ✦ Significant speedups

✦ DFC applied to 3 classes of problems:
  ✦ Matrix completion
  ✦ Robust matrix factorization
  ✦ Subspace recovery

✦ Extend DFC to other MF methods, e.g., ALS, SGD?

Big Data and Distributed Computing are valuable resources, but ...

✦ Challenge 1: ML not developed with scalability in mind
  Divide-and-Conquer (e.g., DFC)

✦ Challenge 2: ML not developed with ease-of-use in mind
  MLbase: www.mlbase.org