Transcript of Talwalkar MLconf talk
Divide-and-Conquer Matrix Factorization
Ameet Talwalkar, UC Berkeley
November 15th, 2013
Collaborators: Lester Mackey [2], Michael I. Jordan [1], Yadong Mu [3], Shih-Fu Chang [3]
[1] UC Berkeley, [2] Stanford University, [3] Columbia University
Three Converging Trends
✦ Big Data
✦ Distributed Computing
✦ Machine Learning
Goal: Extend ML to the Big Data Setting
Challenge: ML not developed with scalability in mind
✦ Does not naturally scale / leverage distributed computing
Our approach: Divide-and-conquer
✦ Apply existing base algorithms to subsets of data and combine
✓ Build upon existing suites of ML algorithms
✓ Preserve favorable algorithm properties
✓ Naturally leverage distributed computing
✦ E.g.,
✦ Matrix factorization (DFC) [MTJ, NIPS11; TMMFJ, ICCV13]
✦ Assessing estimator quality (BLB) [KTSJ, ICML12; KTSJ, JRSS13; KTASJ, KDD13]
✦ Genomic variant calling [BTTJPYS13, submitted; CTZFJP13, submitted]
Matrix Completion
Goal: Recover a matrix from a subset of its entries
Can we do this at scale?
✦ Netflix: 30M users, 100K+ videos
✦ Facebook: 1B users
✦ Pandora: 70M active users, 1M songs
✦ Amazon: millions of users and products
✦ ...
Reducing Degrees of Freedom
✦ Problem: Impossible without additional information
✦ mn degrees of freedom
✦ Solution: Assume a small number of factors determine preference
✦ O(m + n) degrees of freedom
✦ Linear storage costs
[Figure: an m x n matrix factors as (m x r) times (r x n), i.e., 'low-rank']
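The storage claim above is easy to check numerically: a rank-r matrix kept as two factors needs r(m + n) numbers instead of mn (a minimal sketch; the dimensions below are arbitrary).

```python
import numpy as np

m, n, r = 1000, 500, 10

# A rank-r matrix stored as two factors instead of all m*n entries.
A = np.random.randn(m, r)   # m x r factor
B = np.random.randn(r, n)   # r x n factor
M = A @ B                   # full m x n matrix, rank <= r

full_storage = M.size                # mn entries
factored_storage = A.size + B.size   # r(m + n) entries
print(full_storage, factored_storage)  # 500000 vs 15000
```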
Bad Sampling
✦ Problem: We have no rating information about some rows/columns
✦ Solution: Assume the Ω̃(r(n+m)) observed entries are drawn uniformly at random
Bad Information Spread [Candès and Recht, 2009]
✦ Problem: Other ratings don't inform us about the missing rating (bad spread of information)
✦ Solution: Assume incoherence with the standard basis
Matrix Completion
[Figure: In = Low-rank + 'noise']
Goal: Recover a matrix from a subset of its entries, assuming
✦ low-rank, incoherent
✦ uniform sampling
✦ Nuclear-norm heuristic
+ strong theoretical guarantees
+ good empirical results
− very slow computation
Goal: Scale MC algorithms and preserve guarantees
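For intuition, the heuristic's two basic ingredients can be sketched in a few lines of numpy (an illustration, not the exact solver from the talk): the nuclear norm itself, and the singular-value soft-thresholding step that nuclear-norm solvers apply repeatedly. Each application costs a full SVD, which is the source of the slow computation noted above.

```python
import numpy as np

def nuclear_norm(X):
    """Sum of singular values: the convex surrogate for rank."""
    return np.linalg.svd(X, compute_uv=False).sum()

def svd_shrink(X, tau):
    """Soft-threshold the singular values of X by tau -- the core (and
    costly) step inside iterative nuclear-norm solvers."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

For example, `svd_shrink(np.eye(3), 0.4)` shrinks each unit singular value of the identity to 0.6.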
Divide-Factor-Combine (DFC) [MTJ, NIPS11]
✦ D step: Divide input matrix into submatrices
✦ F step: Factor in parallel using a base MC algorithm
✦ C step: Combine submatrix estimates
Advantages:
✦ Submatrix factorization is much cheaper and easily parallelized
✦ Minimal communication between parallel jobs
✦ Retains comparable recovery guarantees (with proper choice of division / combination strategies)
DFC-Proj
✦ D step: Randomly partition observed entries into t submatrices
✦ F step: Complete the submatrices in parallel
✦ Reduced cost: expect a t-fold speedup per iteration
✦ Parallel computation: pay the cost of one cheaper MC
✦ C step: Project onto a single low-dimensional column space
✦ Roughly, share information across sub-solutions
✦ Minimal cost: linear in n, quadratic in the rank of the sub-solutions
✦ Ensemble: Project onto the column space of each sub-solution and average
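The three steps can be sketched as follows. This is a minimal illustration with hypothetical names: `base_mc` stands in for any base MC solver, the division is over columns for simplicity, and the C step projects onto the column space of the first sub-solution.

```python
import numpy as np

def dfc_proj(M_obs, mask, t, base_mc, rank):
    """Sketch of DFC-Proj: split columns into t groups, complete each
    submatrix with a base MC solver (parallelizable in practice), then
    combine by projecting onto one sub-solution's column space."""
    n = M_obs.shape[1]
    groups = np.array_split(np.random.permutation(n), t)
    # F step: complete each column submatrix independently.
    subs = [base_mc(M_obs[:, g], mask[:, g]) for g in groups]
    # C step: column space of the first sub-solution, truncated to `rank`.
    U = np.linalg.svd(subs[0], full_matrices=False)[0][:, :rank]
    L = np.empty_like(M_obs, dtype=float)
    for g, S in zip(groups, subs):
        L[:, g] = U @ (U.T @ S)  # project each sub-solution onto span(U)
    return L
```

As a sanity check, on a fully observed rank-1 matrix with the identity as the "base solver", the projection step reassembles the matrix exactly, since every sub-solution's columns already lie in the shared column space.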
Does It Work? Yes, with high probability.
Theorem: Assume:
✦ L0 is low-rank and incoherent,
✦ Ω̃(r(n+m)) entries are sampled uniformly at random,
✦ the nuclear norm heuristic is the base algorithm.
Then L̂ = L0 with (slightly less) high probability.
✦ Noisy setting: (2 + ε) approximation of the original bound
✦ Can divide into an increasing number of subproblems (t → ∞) when the number of observed entries is ω̃(r²(n+m))
DFC Noisy Recovery
✦ Noisy recovery relative to the base algorithm (n = 10K, r = 10)
[Figure: MC RMSE vs. % revealed entries for Proj-10%, Proj-Ens-10%, and Base-MC]
DFC Speedup
✦ Speedup over APG for random matrices with 4% of entries revealed and r = 0.001n
[Figure: MC time (s) vs. m for Proj-10%, Proj-Ens-10%, and Base-MC]
Matrix Completion
Netflix Prize:
✦ 100 million ratings in {1, ..., 5}
✦ 18K movies, 480K users
✦ Issues: full-rank; noisy, non-uniform observations

Method          Netflix Error   Netflix Time
Nuclear Norm    0.8433          2653.1s
DFC, t=4        0.8436          689.5s
DFC, t=10       0.8484          289.7s
DFC-Ens, t=4    0.8411          689.5s
DFC-Ens, t=10   0.8433          289.7s
Robust Matrix Factorization
[Chandrasekaran, Sanghavi, Parrilo, and Willsky, 2009; Candès, Li, Ma, and Wright, 2011; Zhou, Li, Wright, Candès, and Ma, 2010]
[Figure comparing the three models:
Matrix Completion: In = Low-rank + 'noise'
Principal Component Analysis: In = Low-rank + 'noise'
Robust Matrix Factorization: In = Low-rank + Sparse Outliers + 'noise']
Video Surveillance
✦ Goal: separate foreground from background
✦ Store video as a matrix
✦ Low-rank = background
✦ Outliers = movement
[Figure: Original Frame; Nuclear Norm (342.5s); DFC-5% (24.2s); DFC-0.5% (5.2s)]
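The store-video-as-matrix idea can be illustrated with a toy stand-in for the solvers above (a rank-1 SVD fit on synthetic data, not the nuclear-norm or DFC methods from the talk): flatten each frame into a column, treat the best low-rank fit as the static background, and flag large residuals as movement.

```python
import numpy as np

# Synthetic "video": each column is a flattened frame of a static scene,
# with an object briefly brightening one pixel.
h, w, frames = 4, 5, 30
background = np.linspace(0.0, 1.0, h * w)        # static scene, flattened
V = np.tile(background[:, None], (1, frames))    # (pixels x frames) matrix
V[7, 10:15] += 5.0                               # "object" passes pixel 7

U, s, Vt = np.linalg.svd(V, full_matrices=False)
low_rank = s[0] * np.outer(U[:, 0], Vt[0])       # rank-1 background estimate
outliers = np.abs(V - low_rank) > 1.0            # movement mask (boolean)
```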
Subspace Segmentation
[Liu, Lin, and Yu, 2010]
[Figure comparing the three models:
Matrix Completion: In = Low-rank + 'noise'
Principal Component Analysis: In = Low-rank + 'noise'
Subspace Segmentation: In = Low-rank + 'noise']
Motivation: Face images
[Figure: face images (In) = Low-rank + 'noise', via Principal Component Analysis]
✦ Model images of one person via one low-dimensional subspace
Motivation: Face images (Subspace Segmentation)
[Figure: face images (In) = Low-rank + 'noise', via Subspace Segmentation]
✦ Model images of five people via five low-dimensional subspaces
✦ Recover subspaces to cluster images
✦ Nuclear norm heuristic provably recovers the subspaces
✦ Guarantees are preserved with DFC [TMMFJ, ICCV13]
✦ Toy Experiment: Identify images corresponding to the same person (10 people, 640 images)
✦ DFC Results: Linear speedup, state-of-the-art accuracy
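To make "recover subspaces to cluster images" concrete, here is a toy sketch with synthetic one-dimensional subspaces and a plain inner-product affinity (not the nuclear-norm solver the slide refers to): columns from the same subspace have nonzero inner products, columns from different subspaces do not, so the affinity matrix is block diagonal and clusters fall out of its support.

```python
import numpy as np

rng = np.random.default_rng(0)
# 4 "images" of person 1 in span(e1), 4 of person 2 in span(e2), in R^5.
X1 = np.outer([1, 0, 0, 0, 0], rng.uniform(1, 2, 4))
X2 = np.outer([0, 1, 0, 0, 0], rng.uniform(1, 2, 4))
X = np.hstack([X1, X2])                # (5 x 8) data matrix

A = np.abs(X.T @ X) > 1e-9             # affinity support: block diagonal
labels = A[0].astype(int)              # columns connected to column 0
```

Here `labels` assigns 1 to the four columns sharing column 0's subspace and 0 to the rest; with noisy, higher-dimensional subspaces one would instead run spectral clustering on the affinity.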
Video Event Detection
✦ Input: videos, some of which are associated with events
✦ Goal: predict events for unlabeled videos
✦ Idea:
✦ Featurize each video
✦ Learn video clusters via the nuclear norm heuristic
✦ Given labeled nodes and cluster structure, make predictions
Can do this at scale with DFC!
DFC Summary
✦ DFC: distributed framework for matrix factorization
✦ Similar recovery guarantees
✦ Significant speedups
✦ DFC applied to 3 classes of problems:
✦ Matrix completion
✦ Robust matrix factorization
✦ Subspace recovery
✦ Extend DFC to other MF methods, e.g., ALS, SGD?
Big Data and Distributed Computing are valuable resources, but ...
✦ Challenge 1: ML not developed with scalability in mind
✦ Divide-and-Conquer (e.g., DFC)
✦ Challenge 2: ML not developed with ease-of-use in mind
✦ MLbase (www.mlbase.org)