Stochastic Subsampling for Massive Matrix Factorization
Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion
Inria Parietal, Université Paris-Saclay
April 13, 2018
Matrix factorization
X ∈ R^{p×n} = D A, with D ∈ R^{p×k} and A ∈ R^{k×n}
Flexible tool for unsupervised data analysis
The dataset has lower underlying complexity than its apparent size
How to scale it to very large datasets? (Brain imaging: 4 TB; hyperspectral imaging: 100 GB)
Example: resting-state fMRI
Resting-state data analysis:
Input: 3D brain images across time for 900 subjects
X ∈ R^{p×n}, n = 5·10^6, p = 2·10^5
Goal: Extract representative sparse brain components D
Functional networks: correlated brain activity in localized areas (e.g., auditory, visual, motor cortex)
[Figure: the (voxels × time) data matrix factors into k spatial maps × their time courses]
Other examples
Computer vision:
Patches of an image
Decompose onto a dictionary
Sparse loadings
Collaborative filtering:
Incomplete user-item rating matrix
Decompose into a low-rank factorization
Designing new efficient algorithms
[Figure: same (voxels × time) = (k spatial maps) × (time courses) factorization as above]
X is large (5 TB) in both the number of samples n and the sample dimension p
New stochastic algorithms that scale in both directions
Formalism and methods
Non-convex matrix factorization:
min_{D ∈ C, A ∈ R^{k×n}} ‖X − D A‖_F² + λ Ω(A)
Constraints on the dictionary D: each column d^{(j)} in the ℓ2 ball B2 or the ℓ1 ball B1
Penalty on the code A: ℓ1 or ℓ2 (+ optional non-negativity)
Naive resolution (a sketch follows below):
Alternating minimization: use the full matrix X at each iteration
Slow: a single iteration costs O(np)
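To make the naive baseline concrete, here is a minimal sketch of alternating minimization, assuming an ℓ2 (ridge) penalty on the code so both updates have closed forms; the names and the ridge simplification are illustrative, not the paper's setting. Each iteration touches the full X, hence the O(np) cost.

import numpy as np

def naive_alternating_mf(X, k, n_iter=50, lam=1.0, seed=0):
    """Naive alternating minimization for X ~ D A (illustrative sketch).

    A ridge penalty on A gives closed-form updates; every iteration
    uses the full matrix X, hence the O(np) per-iteration cost."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)          # columns on the l2 sphere
    for _ in range(n_iter):
        # Code update: ridge regression over all n samples at once
        A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)
        # Dictionary update: least squares, then project columns on the l2 ball
        D = np.linalg.solve(A @ A.T + 1e-12 * np.eye(k), A @ X.T).T
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D, A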
Online matrix factorization [Mairal et al., 2010]
Scaling in n:
Stream the samples (x_t) and update the dictionary (D_t) at each step t
Single iteration cost in O(p)
Convergence in a few epochs → large speed-up
[Figure: stream a column x_t, compute its code, update the dictionary]
Use case:
Large n, moderate p; e.g., image patches: p = 256, n ≈ 10^6, ≈ 1 GB
Low-rank factorization / sparse coding
Scaling-up for massive matrices
Out-of-the-box online algorithm?
[Figure: stream a column, compute its code, update the dictionary]
Limited time budget? The algorithm must accommodate large p:
235 h run time → 1 full epoch
10 h run time → 1/24 epoch
Scaling-up in both directions
[Diagram comparing the three operations per iteration, data access, code computation, and dictionary update, across methods: alternating minimization touches everything; online matrix factorization streams columns (samples, dim. n); the proposed algorithm additionally subsamples rows (features, dim. p)]
[At iteration t: some columns are seen at t, some at t+1, some unseen; only a subset of the dictionary rows is updated at t]
Online dictionary learning: details
We learn the left-side factor: D* is the solution of
min_{D ∈ C} (1/n) Σ_{i=1}^n ‖x^{(i)} − D α^{(i)}(D)‖₂²,  where α^{(i)}(D) = argmin_{α ∈ R^k} ‖x^{(i)} − D α‖₂² + λ Ω(α)
This is an expected-risk minimization problem, with (i_t)_t sampled uniformly:
min_{D ∈ C} E[f_t(D)],  f_t(D) ≜ ‖x^{(i_t)} − D α^{(i_t)}(D)‖₂²
How to minimize it? (A sketch of the inner code problem follows.)
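For concreteness, when Ω is the ℓ1 norm, the inner problem defining α^{(i)}(D) is a Lasso. A minimal sketch using scikit-learn (an assumption for illustration; the deck does not prescribe a solver), keeping in mind that sklearn scales the quadratic loss by 1/(2p):

import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(x, D, lam):
    """alpha(D) = argmin_alpha ||x - D alpha||_2^2 + lam ||alpha||_1 (sketch).

    sklearn's Lasso minimizes (1/(2p)) ||x - D alpha||_2^2 + a ||alpha||_1,
    so its regularization parameter matches lam only up to that 1/(2p) factor."""
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=1000)
    lasso.fit(D, x)          # D: (p, k) design matrix, x: (p,) target
    return lasso.coef_       # code alpha in R^k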
A majorization-minimization algorithm
Expected risk minimization, writing x_t ≜ x^{(i_t)}:
min_{D ∈ C} f(D) = E[f_t(D)],  f_t(D) ≜ ‖x_t − D α_t(D)‖₂²
At iteration t, we build a pointwise majorizing surrogate:
α_t = argmin_{α ∈ R^k} ‖x_t − D_{t−1} α‖₂² + λ Ω(α)
g_t(D) = ‖x_t − D α_t‖₂² ≥ f_t(D),  g_t(D_{t−1}) = f_t(D_{t−1})
(g_t majorizes f_t because it freezes the code at α_t instead of re-minimizing over it)
We minimize the aggregated surrogate
ḡ_t(D) ≜ (1/t) Σ_{s=1}^t g_s(D) ≥ (1/t) Σ_{s=1}^t f_s(D) ≜ f̄_t(D)
and obtain D_t.
A majorization-minimization algorithm
The aggregated surrogate is a simple quadratic,
ḡ_t(D) = Tr(DᵀD C_t − Dᵀ B_t),
with parameters that can be updated online (see the update rule below):
C_t = (1/t) Σ_{s=1}^t α_s α_sᵀ,  B_t = (1/t) Σ_{s=1}^t x_s α_sᵀ
The surrogate is minimized using block coordinate descent.
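These running averages never need to be recomputed from their definitions: they admit rank-one online updates, a direct consequence of the definitions above, which is what makes the surrogate maintainable in a streaming setting:

C_t = (1 − 1/t) C_{t−1} + (1/t) α_t α_tᵀ,  B_t = (1 − 1/t) B_{t−1} + (1/t) x_t α_tᵀ

Updating C_t costs O(k²) and updating B_t costs O(pk), independently of n.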
Convergence guarantees (informal)
(D_t)_t converges a.s. towards a critical point of the expected risk
min_{D ∈ C} f(D) = E[f_t(D)]
Major lemma: ḡ_t is a tighter and tighter surrogate,
ḡ_t(D_{t−1}) − f̄_t(D_{t−1}) → 0,
so the algorithm is asymptotically a majorization-minimization algorithm.
Algorithm design: summary
Online dictionary learning [Mairal et al., 2010]:
1. Compute code – O(p):
α_t = argmin_{α ∈ R^k} ‖x_t − D_{t−1} α‖₂² + λ Ω(α)
2. Update surrogate – O(p):
ḡ_t(D) = (1/t) Σ_{s=1}^t ‖x_s − D α_s‖₂² = Tr(DᵀD C_t − Dᵀ B_t)
3. Minimize surrogate – O(p):
D_t = argmin_{D ∈ C} ḡ_t(D) = argmin_{D ∈ C} Tr(DᵀD C_t − Dᵀ B_t)
Every step accesses the full sample x_t → the whole iteration is O(p). (A minimal sketch of the full loop follows.)
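Putting the three steps together, a minimal sketch of the loop, assuming a ridge code penalty for a closed-form step 1 (the actual algorithm handles sparse penalties) and a single block coordinate descent pass per sample:

import numpy as np

def omf(X, k, n_iter, lam=1.0, seed=0):
    """Online matrix factorization, minimal sketch (ridge code penalty).

    Streams one column of X per iteration; every step below is O(p)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)              # columns on the l2 sphere
    C, B = np.zeros((k, k)), np.zeros((p, k))   # surrogate parameters
    for t in range(1, n_iter + 1):
        x = X[:, rng.integers(n)]               # stream one sample
        # 1. Compute code (closed form thanks to the ridge penalty)
        alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
        # 2. Update the surrogate parameters online
        w = 1.0 / t
        C = (1 - w) * C + w * np.outer(alpha, alpha)
        B = (1 - w) * B + w * np.outer(x, alpha)
        # 3. Minimize the surrogate: one block coordinate descent pass
        for j in range(k):
            D[:, j] -= (D @ C[:, j] - B[:, j]) / max(C[j, j], 1e-12)
            D[:, j] /= max(np.linalg.norm(D[:, j]), 1.0)   # project on l2 ball
    return D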
Stochastic subsampling
How to reduce the single-iteration cost O(p)?
Sample a masking matrix M_t:
a diagonal matrix with rescaled Bernoulli coefficients, E[rank M_t] = q
x_t → M_t x_t, with E[M_t x_t] = x_t
use only M_t x_t in the algorithm's computations
noisy updates, but a single iteration costs O(q)
[Figure: stream a column, then subsample its rows]
Subsampled Online Matrix Factorization (SOMF): adapt the three parts of the algorithm to obtain O(q) complexity (a sketch of the masking step follows the list):
1. Code computation
2. Surrogate update
3. Surrogate minimization
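A minimal sketch of the masking step, assuming i.i.d. rescaled Bernoulli entries on the diagonal of M_t (the function name is illustrative):

import numpy as np

def sample_mask(x, q, rng):
    """Draw M_t and return (M_t x, selected coordinates), a sketch.

    M_t is diagonal; entry i equals p/q with probability q/p and 0
    otherwise, so E[M_t x] = x and E[rank M_t] = q."""
    p = x.shape[0]
    selected = rng.random(p) < q / p
    mx = np.zeros_like(x, dtype=float)
    mx[selected] = x[selected] * (p / q)   # rescale to stay unbiased
    return mx, selected

Downstream steps then touch only the ≈ q selected coordinates, which is where the O(q) iteration cost comes from.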
1. Code computation
Linear regression with random sampling:
α_t = argmin_{α ∈ R^k} ‖M_t(x_t − D_{t−1} α)‖₂² + λ Ω(α),
an approximate (sketched) solution of
α_t = argmin_{α ∈ R^k} ‖x_t − D_{t−1} α‖₂² + λ Ω(α)
The masks (M_t)_t introduce errors into the computed codes (α_t)_t
This error can be controlled and reduced (see the sketch below)
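A minimal sketch of the masked regression above, assuming an ℓ2 penalty; the actual SOMF code step is more careful and averages Gram-matrix estimates across iterations to keep this error controlled:

import numpy as np

def masked_code(x, D, selected, lam):
    """Solve the regression restricted to the q rows selected by M_t.

    Sketch with an l2 penalty; cost is O(q k^2) instead of O(p k^2)."""
    Dq, xq = D[selected], x[selected]      # (q, k) and (q,) subproblem
    k = D.shape[1]
    G = Dq.T @ Dq + lam * np.eye(k)
    return np.linalg.solve(G, Dq.T @ xq)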
3. Surrogate minimization
Original OMF: block coordinate descent over the columns of D, with projection onto C:
min_{D ∈ C} ḡ_t(D):  d^{(j)} ← proj_C ( d^{(j)} − (1/(C_t)_{j,j}) (D c_t^{(j)} − b_t^{(j)}) ),
where c_t^{(j)} and b_t^{(j)} are the j-th columns of C_t and B_t
SOMF: freeze the rows of D not selected by M_t:
min_{D ∈ C, P_t^⊥ D = P_t^⊥ D_{t−1}} ḡ_t(D)
This reduces to a block coordinate descent in R^{q×k}! (See the sketch below.)
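A minimal sketch of the frozen-row update, under the same illustrative conventions as above; only the q selected rows of each column move, while the ball projection still uses the full column norm:

import numpy as np

def masked_dict_update(D, C, B, selected):
    """One block coordinate descent pass on the surrogate, SOMF-style:
    rows of D not selected by M_t stay frozen, so each column update
    costs O(q k) instead of O(p k)."""
    k = D.shape[1]
    for j in range(k):
        grad = D[selected] @ C[:, j] - B[selected, j]   # restricted gradient
        D[selected, j] -= grad / max(C[j, j], 1e-12)
        norm = np.linalg.norm(D[:, j])                  # full-column projection
        if norm > 1.0:
            D[:, j] /= norm
    return D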
Theoretical analysis
Convergence theorem (informal)
f(D_t) converges with probability one, and every limit point D_∞ of (D_t)_t is a stationary point of f: for all D ∈ C,
∇f(D_∞; D − D_∞) ≥ 0
Proof: control the two perturbations (surrogate approximation, partial minimization) introduced on top of the online matrix factorization algorithm
Results: up to 12× speed-up
Resting-state fMRI
Online dictionary learning:
235 h run time → 1 full epoch
10 h run time → 1/24 epoch
Proposed method:
10 h run time → 1/2 epoch, with reduction r = 12
Qualitatively, usable maps are obtained 10× faster.
Hyperspectral imaging
[Figure: components 1–3 learned in each setting]
OMF, full run: 14 h, 841k patches
r = 1 (no reduction), short budget: 177 s, 3k patches
SOMF, r = 24: 179 s, 87k patches
SOMF atoms are more focal and less noisy for a given time budget.
Collaborative filtering
M_t x_t: the movie ratings from user t (the observed entries naturally play the role of the mask)
Baseline: coordinate descent on the MMMF loss (no hyperparameters)
[Figure: test RMSE vs. wall-clock time (100 s to 1000 s) on Netflix (140M ratings): coordinate descent vs. the proposed method with full and partial projection]

Dataset      Test RMSE (CD)   Test RMSE (MODL)   Speed-up
ML 1M        0.872            0.866              ×0.75
ML 10M       0.802            0.799              ×3.7
NF (140M)    0.938            0.934              ×6.8

Outperforms coordinate descent beyond 10M ratings
Same prediction performance
Speed-up of 6.8× on Netflix
Conclusion
New efficient algorithm with many potential use cases
Subsamples mini-batches at each iteration
Python package: github.com/arthurmensch/modl
Perspectives:
Efficient heuristics and an adaptive subsampling ratio
Can this kind of approach be transposed to the SGD setting?
Publications
[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60.

[Mensch et al., 2016] Mensch, A., Mairal, J., Thirion, B., and Varoquaux, G. (2016). Dictionary learning for massive matrix factorization. In Proceedings of the 33rd International Conference on Machine Learning (ICML).

[Mensch et al., 2017] Mensch, A., Mairal, J., Thirion, B., and Varoquaux, G. (2017). Stochastic subsampling for factorizing huge matrices. arXiv:1701.05363.