Stochastic Subsampling for Massive Matrix Factorization
Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion
Inria Parietal, Université Paris-Saclay
April 13, 2018
Matrix factorization
X ∈ ℝ^{p×n} ≈ D A, with D ∈ ℝ^{p×k} and A ∈ ℝ^{k×n}
Flexible tool for unsupervised data analysis
The dataset has lower underlying complexity than its apparent size
How to scale it to very large datasets? (Brain imaging: 4 TB; hyperspectral imaging: 100 GB)
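To make the low-complexity idea concrete, here is a toy illustration (mine, not from the talk): a matrix built as a rank-k product has far fewer degrees of freedom than its p × n size suggests, and a truncated SVD recovers a rank-k factorization exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 200, 1000, 10

# A dataset whose underlying complexity (rank k) is much lower than its
# apparent size p x n: X = D A exactly.
D = rng.standard_normal((p, k))
A = rng.standard_normal((k, n))
X = D @ A

# A truncated SVD recovers a rank-k factorization X = (U_k S_k) V_k.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * s[:k]) @ Vt[:k]
```

Here X has p·n = 200,000 entries but only k(p + n) = 12,000 factor parameters.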
Arthur Mensch Stochastic Subsampling for Matrix Factorization 1 / 24
![Page 3: Stochastic Subsampling for Massive Matrix Factorization · - Dictionary update Stream columns - Code com-putation Online matrix factorization Alternate-minimization (dim.) Iteration](https://reader033.fdocuments.in/reader033/viewer/2022060300/5f081ee87e708231d4207117/html5/thumbnails/3.jpg)
Example: resting-state fMRI
Resting-state data analysis:
Input: 3D brain images across time for 900 subjects, X ∈ ℝ^{p×n}, n = 5·10⁶, p = 2·10⁵
Goal: extract representative sparse brain components D
Functional networks: correlated brain activity in localized areas (e.g., auditory, visual, motor cortex)
[Figure: data matrix (voxels × time) = k spatial maps × k time courses]
Other examples
Computer vision:
- Patches of an image
- Decompose onto a dictionary
- Sparse loadings
Collaborative filtering:
- Incomplete user-item rating matrix
- Decompose into a low-rank factorization
Designing new efficient algorithms
[Figure: data matrix (voxels × time) = k spatial maps × k time courses]
X is large (5 TB) in both the number of samples n and the sample dimension p
New stochastic algorithms that scale in both directions
Formalism and methods
Non-convex matrix factorization:
  min_{D∈C, A∈ℝ^{k×n}} ‖X − D A‖_F² + λΩ(A)
Constraints on the dictionary D: each column d^(j) in B₂ or B₁ (the ℓ2 or ℓ1 ball)
Penalty on the code A: ℓ1, ℓ2 (+ non-negativity)
Naive resolution:
Alternating minimization: use the full X at each iteration
Slow: single-iteration cost in O(np)
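The naive scheme can be sketched as follows. This is a minimal illustration with an ℓ2 penalty on A so that both steps have closed forms (the talk's ℓ1 penalty and constraints would need proximal or projected steps); function and variable names are mine.

```python
import numpy as np

def alternating_minimization(X, k, n_iter=50, lam=0.1, seed=0):
    """Naive resolution: every iteration touches the full matrix X, hence
    an O(np) per-iteration cost. l2 penalty on A, atoms in the l2 ball."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    A = np.zeros((k, n))
    for _ in range(n_iter):
        # code update: ridge regression of every column of X on D
        A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)
        # dictionary update: ridge regression of X on A ...
        D = np.linalg.solve(A @ A.T + lam * np.eye(k), A @ X.T).T
        # ... followed by projection of each atom onto the l2 ball
        D /= np.maximum(1.0, np.linalg.norm(D, axis=0))
    return D, A
```

Each iteration forms D.T @ X and A @ X.T, which is exactly the O(np) bottleneck the online algorithm removes.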
Online matrix factorization [Mairal et al., 2010]
Scaling in n:
Stream (x_t) and update (D_t) at each iteration t
Single-iteration cost in O(p)
Convergence in a few epochs → large speed-up
[Figure: stream a column, compute its code, update the dictionary]
Use case:
Large n, regular p, e.g., image patches: p = 256, n ≈ 10⁶, 1 GB
Low-rank factorization / sparse coding
Scaling-up for massive matrices
Out-of-the-box online algorithm?
[Figure: stream, compute, update over the full sample dimension]
Limited time budget?
Need to accommodate large p
235 h run time: 1 full epoch; 10 h run time: 1/24 epoch
Scaling-up in both directions
[Figure: alternate minimization vs. online matrix factorization (stream columns) vs. the proposed algorithm (stream columns and subsample rows). Each iteration t: data access, code computation, dictionary update; columns seen at t and t+1, rows updated at t.]
Online dictionary learning: details
We learn the left-side factor: D* is a solution of
  min_{D∈C} (1/n) ∑_{i=1}^n ‖x^(i) − D α^(i)(D)‖₂²
  where α^(i)(D) = argmin_{α∈ℝ^k} ‖x^(i) − D α‖₂² + λΩ(α)
Expected risk minimization problem, with (i_t)_t sampled uniformly:
  min_{D∈C} E[f_t(D)],  f_t(D) ≜ ‖x^(i_t) − D α^(i_t)(D)‖₂²
How to minimize it?
A majorization-minimization algorithm
Expected risk minimization: x_t ≜ x^(i_t)
  min_{D∈C} f(D) = E[f_t(D)],  f_t(D) ≜ ‖x_t − D α_t(D)‖₂²
At iteration t, we build a pointwise majorizing surrogate:
  α_t = argmin_{α∈ℝ^k} ‖x_t − D_{t−1} α‖₂² + λΩ(α)
  g_t(D) = ‖x_t − D α_t‖₂² ≥ f_t(D),  g_t(D_{t−1}) = f_t(D_{t−1})
We minimize the aggregated surrogate and obtain D_t:
  ḡ_t(D) ≜ (1/t) ∑_{s=1}^t g_s(D) ≥ (1/t) ∑_{s=1}^t f_s(D) ≜ f̄_t(D)
A majorization-minimization algorithm
The aggregated surrogate is a simple quadratic (up to a constant):
  ḡ_t(D) = Tr(Dᵀ D C_t − Dᵀ B_t)
with parameters that can be updated online:
  C_t = (1/t) ∑_{s=1}^t α_s α_sᵀ,  B_t = (1/t) ∑_{s=1}^t x_s α_sᵀ
The surrogate is minimized using block coordinate descent.
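The running averages C_t and B_t admit an O(k² + pk) per-sample update, which is what makes the surrogate cheap to maintain. A sketch (the helper name is mine; the parameter names follow the slides):

```python
import numpy as np

def update_stats(C, B, x, alpha, t):
    """Running-average update of the surrogate parameters after seeing
    sample x (dim p) with code alpha (dim k) at iteration t >= 1:
    C_t = (1 - 1/t) C_{t-1} + (1/t) alpha alpha^T, and similarly for B_t."""
    w = 1.0 / t
    C *= 1.0 - w
    C += w * np.outer(alpha, alpha)
    B *= 1.0 - w
    B += w * np.outer(x, alpha)
    return C, B
```

The recursion with weight 1/t reproduces the exact averages C_t and B_t from the slide without storing past samples.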
Convergence guarantees (informal)
(D_t)_t converges a.s. towards a critical point of the expected risk
  min_{D∈C} f(D) = E[f_t(D)]
Major lemma: ḡ_t is a tighter and tighter surrogate,
  ḡ_t(D_{t−1}) − f̄_t(D_{t−1}) → 0,
and the algorithm is asymptotically a majorization-minimization algorithm.
Algorithm design: summary
Online dictionary learning [Mairal et al., 2010]
1. Compute code – O(p):
   α_t = argmin_{α∈ℝ^k} ‖x_t − D_{t−1} α‖₂² + λΩ(α)
2. Update surrogate – O(p):
   ḡ_t(D) = (1/t) ∑_{s=1}^t ‖x_s − D α_s‖₂² = Tr(Dᵀ D C_t − Dᵀ B_t)
3. Minimize surrogate – O(p):
   D_t = argmin_{D∈C} ḡ_t(D) = argmin_{D∈C} Tr(Dᵀ D C_t − Dᵀ B_t)
Access to x_t → the algorithm is in O(p) (complexity depends on p)
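Putting the three steps together, the online loop can be sketched compactly under simplifying assumptions: Ω = ℓ2² so step 1 is a closed-form solve, atoms constrained to the ℓ2 ball, one block-coordinate pass per iteration. The function name and parameters are mine; this is an illustration, not the reference modl implementation.

```python
import numpy as np

def omf(X, k, n_iter, lam=0.1, seed=0):
    """Minimal online matrix factorization loop (steps 1-3 of the slides)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)
    C, B = np.zeros((k, k)), np.zeros((p, k))
    for t in range(1, n_iter + 1):
        x = X[:, rng.integers(n)]               # stream one column
        # 1. code computation (closed-form ridge solve)
        alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
        # 2. surrogate update: running averages C_t, B_t
        w = 1.0 / t
        C = (1 - w) * C + w * np.outer(alpha, alpha)
        B = (1 - w) * B + w * np.outer(x, alpha)
        # 3. surrogate minimization: one block coordinate descent pass,
        #    projecting each atom onto the l2 ball
        for j in range(k):
            if C[j, j] > 1e-12:
                d = D[:, j] + (B[:, j] - D @ C[:, j]) / C[j, j]
                D[:, j] = d / max(1.0, np.linalg.norm(d))
    return D
```

Every step touches one column of X and the k×k / p×k statistics, so the per-iteration cost is linear in p, as stated above.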
Stochastic subsampling
How to reduce the single-iteration cost O(p)?
Sample a masking matrix M_t:
- diagonal matrix with rescaled Bernoulli coefficients, E[rank M_t] = q
- x_t → M_t x_t, with E[M_t x_t] = x_t
- use only M_t x_t in the algorithm's computations
- noisy updates, but a single iteration costs O(q)
[Figure: stream, then subsample]
Subsampled Online Matrix Factorization (SOMF)
Adapt the 3 parts of the algorithm to obtain O(q) complexity:
1. Code computation  2. Surrogate update  3. Surrogate minimization
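Sampling the mask is straightforward; a sketch (helper name mine). Keeping each coordinate with probability q/p and rescaling the survivors by p/q makes M_t x_t an unbiased estimate of x_t with E[rank M_t] = q:

```python
import numpy as np

def sample_mask(p, q, rng):
    """Sample the diagonal of a masking matrix M_t: each of the p coordinates
    is kept with probability q/p and rescaled by p/q, so that
    E[m * x] = x and the expected number of kept coordinates is q."""
    keep = rng.random(p) < q / p
    return keep * (p / q)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
m = sample_mask(x.size, 500, rng)
# Downstream computations only touch the ~q nonzero coordinates of m * x.
```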
1. Code computation
Linear regression with random sampling:
  α_t = argmin_{α∈ℝ^k} ‖M_t(x_t − D_{t−1} α)‖₂² + λΩ(α)
an approximate (sketched) solution of
  α_t = argmin_{α∈ℝ^k} ‖x_t − D_{t−1} α‖₂² + λΩ(α)
(M_t)_t introduces errors in the computation of (α_t)_t
The error can be controlled and reduced
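With Ω = ℓ2² the masked regression has a closed form restricted to the q selected rows, which is enough to see where the O(q) cost comes from. An illustrative helper (name mine; the talk's ℓ1 case needs an iterative solver, and SOMF controls the induced error further):

```python
import numpy as np

def compute_code(D, x, keep, lam):
    """Sketched code: argmin_a ||M(x - D a)||^2 + lam ||a||^2, where M
    selects the rows flagged in the boolean mask `keep`. Only O(q k) data
    is touched instead of O(p k)."""
    Dq, xq = D[keep], x[keep]
    k = D.shape[1]
    return np.linalg.solve(Dq.T @ Dq + lam * np.eye(k), Dq.T @ xq)
```

When the mask keeps every row, this reduces exactly to the full (unmasked) regression.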
3. Surrogate minimization
Original OMF: block coordinate descent with projection onto C,
  min_{D∈C} ḡ_t(D):  d^(j) ← proj_C [ d^(j) − (1/C_t[j,j]) (D c_t^(j) − b_t^(j)) ]
SOMF: freeze the rows not selected by M_t,
  min_{D∈C, P_t^⊥ D = P_t^⊥ D_{t−1}} ḡ_t(D)
[Figure: freeze rows]
Reduces to a block coordinate descent in ℝ^{q×k}!
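The frozen-row pass can be sketched as follows (helper name mine; the projection onto C is omitted for brevity, whereas the actual algorithm enforces the atom constraints as well):

```python
import numpy as np

def update_dictionary_rows(D, C, B, keep, eps=1e-12):
    """One block coordinate descent pass on the quadratic surrogate,
    restricted to the rows flagged in the boolean mask `keep` (length p);
    all other rows of D are frozen. The pass costs O(q k^2) instead of
    O(p k^2). Update per atom j: d_j <- d_j + (b_j - D c_j) / C[j, j]."""
    idx = np.flatnonzero(keep)
    Dq = D[idx]                     # fancy indexing copies; written back below
    for j in range(D.shape[1]):
        if C[j, j] > eps:
            Dq[:, j] += (B[idx, j] - Dq @ C[:, j]) / C[j, j]
    D[idx] = Dq
    return D
```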
Theoretical analysis
Convergence theorem (informal)
f(D_t) converges with probability one, and every limit point D_∞ of (D_t)_t is a stationary point of f: for all D ∈ C,
  ∇f(D_∞, D − D_∞) ≥ 0   (directional derivative)
Proof: control the perturbations introduced by the surrogate approximation and the partial minimization, relative to the online matrix factorization algorithm.
Results: up to 12× speed-up
Resting-state fMRI
Online dictionary learning: 235 h run time for 1 full epoch; a 10 h run time covers only 1/24 epoch.
Proposed method: 10 h run time, 1/2 epoch with reduction r = 12.
Qualitatively, usable maps are obtained 10× faster.
Hyperspectral imaging
[Figure: components 1–3 for each setting]
OMF (r = 1): 14 h, 841k patches; 177 s, 3k patches
SOMF (r = 24): 179 s, 87k patches
Given a fixed time budget, SOMF atoms are more focal and less noisy.
Collaborative filtering
M_t x_t: the movie ratings from user t
Compared with coordinate descent for the MMMF loss (no hyperparameters)
[Figure: test RMSE vs. time (100 s – 1000 s) on Netflix (140M), for coordinate descent, proposed (full projection), proposed (partial projection)]

Dataset     Test RMSE (CD)  Test RMSE (MODL)  Speed-up
ML 1M       0.872           0.866             ×0.75
ML 10M      0.802           0.799             ×3.7
NF (140M)   0.938           0.934             ×6.8

Outperforms coordinate descent beyond 10M ratings
Same prediction performance
Speed-up of 6.8× on Netflix
Conclusion
New efficient algorithm with many potential use cases
Subsampled mini-batches at each iteration
Python package: github.com/arthurmensch/modl
Perspectives:
Efficient heuristics and an adaptive subsampling ratio
Is this kind of approach transposable to the SGD setting?
Publications
[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60.

[Mensch et al., 2016] Mensch, A., Mairal, J., Thirion, B., and Varoquaux, G. (2016). Dictionary learning for massive matrix factorization. In 33rd International Conference on Machine Learning (ICML).

[Mensch et al., 2017] Mensch, A., Mairal, J., Thirion, B., and Varoquaux, G. (2017). Stochastic subsampling for factorizing huge matrices. arXiv:1701.05363.