
Divide-and-Conquer Matrix Factorization

Ameet Talwalkar, UC Berkeley

November 15th, 2013

Collaborators: Lester Mackey², Michael I. Jordan¹, Yadong Mu³, Shih-Fu Chang³

¹UC Berkeley   ²Stanford University   ³Columbia University

Three Converging Trends

✦ Big Data
✦ Distributed Computing
✦ Machine Learning

Goal: Extend ML to the Big Data Setting

Challenge: ML not developed with scalability in mind
✦ Does not naturally scale / leverage distributed computing

Our approach: Divide-and-conquer
✦ Apply existing base algorithms to subsets of data and combine
  ✓ Build upon existing suites of ML algorithms
  ✓ Preserve favorable algorithm properties
  ✓ Naturally leverage distributed computing
✦ E.g.,
  ✦ Matrix factorization (DFC) [MTJ, NIPS11; TMMFJ, ICCV13]
  ✦ Assessing estimator quality (BLB) [KTSJ, ICML12; KTSJ, JRSS13; KTASJ, KDD13]
  ✦ Genomic variant calling [BTTJPYS13, submitted; CTZFJP13, submitted]

Matrix Completion

Goal: Recover a matrix from a subset of its entries

Can we do this at scale?
✦ Netflix: 30M users, 100K+ videos
✦ Facebook: 1B users
✦ Pandora: 70M active users, 1M songs
✦ Amazon: Millions of users and products
✦ ...

Reducing Degrees of Freedom

✦ Problem: Impossible without additional information
  ✦ mn degrees of freedom for an m × n matrix

✦ Solution: Assume a small number of factors determines preference
  ✦ 'Low-rank': the m × n matrix is the product of an m × r and an r × n matrix
  ✦ O(m + n) degrees of freedom
  ✦ Linear storage costs
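To make the storage claim concrete, here is a minimal numpy sketch (the sizes are illustrative, not from the talk) contrasting the mn parameters of a dense matrix with the r(m + n) parameters of its rank-r factorization:

```python
import numpy as np

m, n, r = 10_000, 5_000, 10          # illustrative sizes

# A rank-r preference matrix is fully determined by two thin factors.
U = np.random.randn(m, r)            # m x r user factors
V = np.random.randn(r, n)            # r x n item factors

# Any single entry can be reconstructed on demand, without ever
# materializing the full m x n matrix.
entry = U[42] @ V[:, 7]

print("dense parameters:   ", m * n)        # 50,000,000
print("low-rank parameters:", r * (m + n))  # 150,000 -- linear in m + n
```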

Bad Sampling

✦ Problem: We have no rating information about some rows or columns
✦ Solution: Assume Ω̃(r(n+m)) observed entries drawn uniformly at random

Bad Information Spread [Candes and Recht, 2009]

✦ Problem: Other ratings don't inform us about the missing rating (bad spread of information)
✦ Solution: Assume incoherence with the standard basis
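Incoherence can be checked numerically. In this hedged sketch (the helper name and constants are mine, not from the talk), the coherence of a column space with orthonormal basis U is (m/r)·max_i ‖row_i(U)‖²; it is small for "spread-out" subspaces and equals m/r when the subspace aligns with standard basis vectors:

```python
import numpy as np

def coherence(U):
    """(m / r) * max_i ||row_i(U)||^2 for U with orthonormal columns."""
    m, r = U.shape
    return (m / r) * np.max(np.sum(U ** 2, axis=1))

rng = np.random.default_rng(0)
m, n, r = 500, 400, 5                 # illustrative sizes

# Random low-rank matrices are incoherent: information spreads well.
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
U = np.linalg.svd(L, full_matrices=False)[0][:, :r]
print("random subspace:", coherence(U))      # small (grows only logarithmically)

# A subspace aligned with standard basis vectors is maximally coherent:
# a few rows carry all the information, so most ratings tell us nothing.
spiky = np.eye(m)[:, :r]
print("spiky subspace: ", coherence(spiky))  # m / r, the worst case
```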

Matrix Completion

In = Low-rank + ‘noise’

Goal: Recover a matrix from a subset of its entries, assuming
✦ low-rank, incoherent
✦ uniform sampling

✦ Nuclear-norm heuristic
  + strong theoretical guarantees
  + good empirical results
  − very slow computation

Goal: Scale MC algorithms and preserve guarantees
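The talk doesn't pin down one solver; a minimal soft-impute-style sketch of the nuclear-norm heuristic (parameters illustrative) shows where the cost goes: every iteration needs a full SVD of an m × n matrix, which is the "very slow computation" above.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: the prox operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete(M_obs, mask, tau=5.0, iters=200):
    """Nuclear-norm matrix completion, soft-impute style (a sketch).

    M_obs: observed entries (zeros elsewhere); mask: True where observed.
    tau and iters are illustrative, not tuned values from the talk.
    """
    Z = np.zeros_like(M_obs)
    for _ in range(iters):
        # Keep observed entries, fill the rest with the current estimate,
        # then shrink toward low rank. One full SVD per iteration.
        Z = svt(np.where(mask, M_obs, Z), tau)
    return Z
```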

Divide-Factor-Combine (DFC) [MTJ, NIPS11]

✦ D step: Divide input matrix into submatrices
✦ F step: Factor in parallel using a base MC algorithm
✦ C step: Combine submatrix estimates

Advantages:
✦ Submatrix factorization is much cheaper and easily parallelized
✦ Minimal communication between parallel jobs
✦ Retains comparable recovery guarantees (with proper choice of division / combination strategies)

DFC-Proj

✦ D step: Randomly partition observed entries into t submatrices

✦ F step: Complete the submatrices in parallel
  ✦ Reduced cost: Expect t-fold speedup per iteration
  ✦ Parallel computation: Pay the cost of one cheaper MC

✦ C step: Project onto a single low-dimensional column space
  ✦ Roughly, share information across sub-solutions
  ✦ Minimal cost: linear in n, quadratic in the rank of the sub-solutions

✦ Ensemble: Project onto the column space of each sub-solution and average (both variants are sketched below)
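Putting the three steps together, here is a hedged sketch of DFC-Proj built on the `complete` solver above. It assumes the D step partitions columns (consistent with the column-projection C step; the paper also analyzes other division strategies), and the C step applies the projector C₁C₁⁺ onto the column space of the first sub-solution:

```python
import numpy as np

def dfc_proj(M_obs, mask, t, base_mc, ensemble=False):
    """Divide-Factor-Combine with column projection (a sketch)."""
    rng = np.random.default_rng(0)
    blocks = np.array_split(rng.permutation(M_obs.shape[1]), t)

    # F step: complete each column submatrix -- trivially parallelizable,
    # and each SVD now involves only ~n/t columns.
    subs = [base_mc(M_obs[:, b], mask[:, b]) for b in blocks]

    def project(anchor):
        # Orthogonal projection of every sub-solution onto the
        # column space of one anchor sub-solution.
        U, s, _ = np.linalg.svd(anchor, full_matrices=False)
        U = U[:, s > 1e-8 * s[0]]          # keep the numerical column space
        L = np.empty_like(M_obs)
        for b, S in zip(blocks, subs):
            L[:, b] = U @ (U.T @ S)
        return L

    if ensemble:   # DFC-Proj-Ens: average over every anchor's projection
        return sum(project(S) for S in subs) / t
    return project(subs[0])                # C step: single column space
```

Usage would look like `dfc_proj(M_obs, mask, t=4, base_mc=complete)`; each `base_mc` call is independent, so the F step maps directly onto parallel workers with no communication until the cheap C step.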

Does It Work?

Yes, with high probability.

Theorem: Assume
✦ L0 is low-rank and incoherent,
✦ Ω̃(r(n+m)) entries are sampled uniformly at random,
✦ the nuclear norm heuristic is the base algorithm.

Then L̂ = L0 with (slightly less) high probability.

✦ Noisy setting: (2 + ε) approximation of the original bound
✦ Can divide into an increasing number of subproblems (t → ∞) when the number of observed entries is ω̃(r²(n+m))

DFC Noisy Recovery

[Figure: MC RMSE vs. % revealed entries for Proj-10%, Proj-Ens-10%, and Base-MC]

✦ Noisy recovery relative to base algorithm (n = 10K, r = 10)

DFC Speedup

✦ Speedup over APG for random matrices with 4% of entries revealed and r = 0.001n

[Figure: MC time (s) vs. m for Proj-10%, Proj-Ens-10%, and Base-MC]

Matrix Completion: Netflix Prize

✦ 100 million ratings in {1, ..., 5}
✦ 18K movies, 480K users
✦ Issues: Full-rank; noisy, non-uniform observations

Method          Error    Time
Nuclear Norm    0.8433   2653.1s
DFC, t=4        0.8436   689.5s
DFC, t=10       0.8484   289.7s
DFC-Ens, t=4    0.8411   689.5s
DFC-Ens, t=10   0.8433   289.7s

Robust Matrix Factorization
[Chandrasekaran, Sanghavi, Parrilo, and Willsky, 2009; Candes, Li, Ma, and Wright, 2011; Zhou, Li, Wright, Candes, and Ma, 2010]

✦ Matrix Completion: In = Low-rank + ‘noise’
✦ Principal Component Analysis: In = Low-rank + ‘noise’
✦ Robust Matrix Factorization: In = Low-rank + Sparse Outliers + ‘noise’
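The talk again uses the nuclear-norm heuristic as the base algorithm. A minimal augmented-Lagrangian sketch of the underlying convex program (principal component pursuit: min ‖L‖* + λ‖S‖₁ s.t. M = L + S, with the standard λ and μ choices from Candes et al., 2011) looks like this; `svt` is the thresholding helper from the completion sketch above:

```python
import numpy as np

def soft(X, tau):
    """Entrywise soft-thresholding: the prox operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(M, iters=100):
    """Principal component pursuit via augmented Lagrangian (a sketch)."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))          # standard regularization weight
    mu = m * n / (4.0 * np.abs(M).sum())    # standard step-size choice
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                    # dual variable
    for _ in range(iters):
        L = svt(M - S + Y / mu, 1.0 / mu)   # low-rank component update
        S = soft(M - L + Y / mu, lam / mu)  # sparse-outlier update
        Y += mu * (M - L - S)               # push toward M = L + S
    return L, S
```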

Video Surveillance

✦ Goal: separate foreground from background
✦ Store video as matrix
  ✦ Low-rank = background
  ✦ Outliers = movement

[Figure: Original frame vs. recovered backgrounds -- Nuclear Norm (342.5s), DFC-5% (24.2s), DFC-0.5% (5.2s)]
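To see how the video example maps onto the `rpca` sketch above, each frame is flattened into one column of the matrix. A tiny synthetic illustration (all shapes and data are made up for this sketch):

```python
import numpy as np

h, w, T = 48, 64, 30
rng = np.random.default_rng(1)
frames = np.repeat(rng.random((1, h, w)), T, axis=0)              # static background
frames[:, 20:28, 10:18] += (rng.random(T) > 0.5)[:, None, None]   # sparse "movement"

M = frames.reshape(T, -1).T                 # one flattened frame per column
L, S = rpca(M)                              # from the sketch above
background = L[:, 0].reshape(h, w)          # low-rank part recovers the background
foreground = np.abs(S[:, 0]).reshape(h, w)  # sparse part flags the motion
```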

Subspace Segmentation [Liu, Lin, and Yu, 2010]

✦ Same template as matrix completion and PCA: In = Low-rank + ‘noise’

Motivation: Face Images

✦ Principal Component Analysis: model images of one person via one low-dimensional subspace

✦ Subspace Segmentation: model images of five people via five low-dimensional subspaces
  ✦ Recover subspaces → cluster images

✦ Nuclear norm heuristic provably recovers the subspaces
✦ Guarantees are preserved with DFC [TMMFJ, ICCV13]

✦ Toy experiment: identify images corresponding to the same person (10 people, 640 images)
✦ DFC results: linear speedup, state-of-the-art accuracy (see the clustering sketch below)
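In the noiseless case, the nuclear-norm program behind the LRR reference (min ‖Z‖* s.t. X = XZ) has the closed-form minimizer Z = V_r V_rᵀ from the skinny SVD of X, and spectral clustering of the resulting affinity recovers the subspaces. A hedged sketch on synthetic data (all sizes and helpers are mine, not from the talk):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
d, r, per, k = 100, 3, 40, 5     # ambient dim, subspace dim, pts each, "people"

# Points drawn from k independent r-dimensional subspaces ("face images").
X = np.hstack([rng.standard_normal((d, r)) @ rng.standard_normal((r, per))
               for _ in range(k)])

# Noiseless LRR solution in closed form: Z = Vr Vr^T (shape interaction matrix).
_, s, Vt = np.linalg.svd(X, full_matrices=False)
Vr = Vt[s > 1e-8 * s[0]].T
W = np.abs(Vr @ Vr.T)
W = W + W.T                       # symmetric affinity between points

# Spectral clustering on the affinity groups images by person.
dinv = 1.0 / np.sqrt(W.sum(axis=1))
vals, vecs = np.linalg.eigh(dinv[:, None] * W * dinv[None, :])
emb = vecs[:, -k:]                                 # top-k eigenvectors
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # row-normalize
_, labels = kmeans2(emb, k, seed=0)                # cluster id per image
```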

Video Event Detection

✦ Input: videos, some of which are associated with events
✦ Goal: predict events for unlabeled videos
✦ Idea:
  ✦ Featurize each video
  ✦ Learn video clusters via nuclear norm heuristic
  ✦ Given labeled nodes and cluster structure, make predictions (a toy sketch follows)

Can do this at scale with DFC!
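The talk doesn't spell out the prediction step; one simple instantiation is a majority vote over labeled videos within each recovered cluster. A toy sketch (all arrays and the helper name are illustrative):

```python
import numpy as np

def predict_from_clusters(cluster_ids, labels):
    """Propagate event labels to unlabeled videos by per-cluster majority vote.

    cluster_ids: cluster assignment per video (e.g. from the
                 subspace-segmentation sketch above).
    labels:      event id per video, or -1 if unlabeled.
    """
    preds = labels.copy()
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        known = labels[members & (labels >= 0)]
        if known.size:  # assign the cluster's majority event to unlabeled videos
            preds[members & (labels < 0)] = np.bincount(known).argmax()
    return preds

cluster_ids = np.array([0, 0, 0, 1, 1, 1])
labels      = np.array([2, -1, 2, -1, 3, -1])       # -1 = unlabeled video
print(predict_from_clusters(cluster_ids, labels))   # -> [2 2 2 3 3 3]
```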

DFC Summary

✦ DFC: distributed framework for matrix factorization
  ✦ Similar recovery guarantees
  ✦ Significant speedups

✦ DFC applied to 3 classes of problems:
  ✦ Matrix completion
  ✦ Robust matrix factorization
  ✦ Subspace recovery

✦ Extend DFC to other MF methods, e.g., ALS, SGD?

Big Data and Distributed Computing are valuable resources, but ...

✦ Challenge 1: ML not developed with scalability in mind
  Divide-and-Conquer (e.g., DFC)

✦ Challenge 2: ML not developed with ease-of-use in mind
  MLbase: www.mlbase.org