Stochastic Subsampling for Massive Matrix Factorization
Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion
Inria Parietal, Université Paris-Saclay
April 13, 2018
Matrix factorization
X ∈ ℝ^{p×n} ≈ D A, with D ∈ ℝ^{p×k} and A ∈ ℝ^{k×n}
Flexible tool for unsupervised data analysis
The dataset has lower underlying complexity than its apparent size
How to scale it to very large datasets? (Brain imaging: 4 TB; hyperspectral imaging: 100 GB)
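To make the low-complexity idea concrete, here is a toy illustration (mine, not from the talk): a matrix built as a rank-k product has far fewer degrees of freedom than its p × n size suggests, and a truncated SVD recovers a rank-k factorization exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 200, 1000, 10

# A dataset whose underlying complexity (rank k) is much lower than its
# apparent size p x n: X = D A exactly.
D = rng.standard_normal((p, k))
A = rng.standard_normal((k, n))
X = D @ A

# A truncated SVD recovers a rank-k factorization X = (U_k S_k) V_k.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * s[:k]) @ Vt[:k]
```

Here X has p·n = 200,000 entries but only k(p + n) = 12,000 factor parameters.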
Arthur Mensch Stochastic Subsampling for Matrix Factorization 1 / 24
![Page 3: Stochastic Subsampling for Massive Matrix Factorization · - Dictionary update Stream columns - Code com-putation Online matrix factorization Alternate-minimization (dim.) Iteration](https://reader033.fdocuments.in/reader033/viewer/2022060300/5f081ee87e708231d4207117/html5/thumbnails/3.jpg)
Example: resting-state fMRI
Resting-state data analysis:
Input: 3D brain images across time for 900 subjects, X ∈ ℝ^{p×n}, n = 5·10⁶, p = 2·10⁵
Goal: extract representative sparse brain components D
Functional networks: correlated brain activity in localized areas (e.g., auditory, visual, motor cortex)
[Figure: data matrix (voxels × time) = k spatial maps × k time courses]
Other examples
Computer vision:
- Patches of an image
- Decompose onto a dictionary
- Sparse loadings
Collaborative filtering:
- Incomplete user-item rating matrix
- Decompose into a low-rank factorization
Designing new efficient algorithms
[Figure: data matrix (voxels × time) = k spatial maps × k time courses]
X is large (5 TB) in both the number of samples n and the sample dimension p
New stochastic algorithms that scale in both directions
Formalism and methods
Non-convex matrix factorization:
  min_{D∈C, A∈ℝ^{k×n}} ‖X − D A‖_F² + λΩ(A)
Constraints on the dictionary D: each column d^(j) in B₂ or B₁ (the ℓ2 or ℓ1 ball)
Penalty on the code A: ℓ1, ℓ2 (+ non-negativity)
Naive resolution:
Alternating minimization: use the full X at each iteration
Slow: single-iteration cost in O(np)
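The naive scheme can be sketched as follows. This is a minimal illustration with an ℓ2 penalty on A so that both steps have closed forms (the talk's ℓ1 penalty and constraints would need proximal or projected steps); function and variable names are mine.

```python
import numpy as np

def alternating_minimization(X, k, n_iter=50, lam=0.1, seed=0):
    """Naive resolution: every iteration touches the full matrix X, hence
    an O(np) per-iteration cost. l2 penalty on A, atoms in the l2 ball."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    A = np.zeros((k, n))
    for _ in range(n_iter):
        # code update: ridge regression of every column of X on D
        A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)
        # dictionary update: ridge regression of X on A ...
        D = np.linalg.solve(A @ A.T + lam * np.eye(k), A @ X.T).T
        # ... followed by projection of each atom onto the l2 ball
        D /= np.maximum(1.0, np.linalg.norm(D, axis=0))
    return D, A
```

Each iteration forms D.T @ X and A @ X.T, which is exactly the O(np) bottleneck the online algorithm removes.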
Online matrix factorization [Mairal et al., 2010]
Scaling in n:
Stream (x_t) and update (D_t) at each iteration t
Single-iteration cost in O(p)
Convergence in a few epochs → large speed-up
[Figure: stream a column, compute its code, update the dictionary]
Use case:
Large n, regular p, e.g., image patches: p = 256, n ≈ 10⁶, 1 GB
Low-rank factorization / sparse coding
Scaling-up for massive matrices
Out-of-the-box online algorithm?
[Figure: stream, compute, update over the full sample dimension]
Limited time budget?
Need to accommodate large p
235 h run time: 1 full epoch; 10 h run time: 1/24 epoch
Scaling-up in both directions
[Figure: alternate minimization vs. online matrix factorization (stream columns) vs. the proposed algorithm (stream columns and subsample rows). Each iteration t: data access, code computation, dictionary update; columns seen at t and t+1, rows updated at t.]
Online dictionary learning: details
We learn the left-side factor: D* is a solution of
  min_{D∈C} (1/n) ∑_{i=1}^n ‖x^(i) − D α^(i)(D)‖₂²
  where α^(i)(D) = argmin_{α∈ℝ^k} ‖x^(i) − D α‖₂² + λΩ(α)
Expected risk minimization problem, with (i_t)_t sampled uniformly:
  min_{D∈C} E[f_t(D)],  f_t(D) ≜ ‖x^(i_t) − D α^(i_t)(D)‖₂²
How to minimize it?
A majorization-minimization algorithm
Expected risk minimization: x_t ≜ x^(i_t)
  min_{D∈C} f(D) = E[f_t(D)],  f_t(D) ≜ ‖x_t − D α_t(D)‖₂²
At iteration t, we build a pointwise majorizing surrogate:
  α_t = argmin_{α∈ℝ^k} ‖x_t − D_{t−1} α‖₂² + λΩ(α)
  g_t(D) = ‖x_t − D α_t‖₂² ≥ f_t(D),  g_t(D_{t−1}) = f_t(D_{t−1})
We minimize the aggregated surrogate and obtain D_t:
  ḡ_t(D) ≜ (1/t) ∑_{s=1}^t g_s(D) ≥ (1/t) ∑_{s=1}^t f_s(D) ≜ f̄_t(D)
A majorization-minimization algorithm
The aggregated surrogate is a simple quadratic (up to a constant):
  ḡ_t(D) = Tr(Dᵀ D C_t − Dᵀ B_t)
with parameters that can be updated online:
  C_t = (1/t) ∑_{s=1}^t α_s α_sᵀ,  B_t = (1/t) ∑_{s=1}^t x_s α_sᵀ
The surrogate is minimized using block coordinate descent.
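The running averages C_t and B_t admit an O(k² + pk) per-sample update, which is what makes the surrogate cheap to maintain. A sketch (the helper name is mine; the parameter names follow the slides):

```python
import numpy as np

def update_stats(C, B, x, alpha, t):
    """Running-average update of the surrogate parameters after seeing
    sample x (dim p) with code alpha (dim k) at iteration t >= 1:
    C_t = (1 - 1/t) C_{t-1} + (1/t) alpha alpha^T, and similarly for B_t."""
    w = 1.0 / t
    C *= 1.0 - w
    C += w * np.outer(alpha, alpha)
    B *= 1.0 - w
    B += w * np.outer(x, alpha)
    return C, B
```

The recursion with weight 1/t reproduces the exact averages C_t and B_t from the slide without storing past samples.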
Convergence guarantees (informal)
(D_t)_t converges a.s. towards a critical point of the expected risk
  min_{D∈C} f(D) = E[f_t(D)]
Major lemma: ḡ_t is a tighter and tighter surrogate,
  ḡ_t(D_{t−1}) − f̄_t(D_{t−1}) → 0,
and the algorithm is asymptotically a majorization-minimization algorithm.
Algorithm design: summary
Online dictionary learning [Mairal et al., 2010]
1. Compute code – O(p):
   α_t = argmin_{α∈ℝ^k} ‖x_t − D_{t−1} α‖₂² + λΩ(α)
2. Update surrogate – O(p):
   ḡ_t(D) = (1/t) ∑_{s=1}^t ‖x_s − D α_s‖₂² = Tr(Dᵀ D C_t − Dᵀ B_t)
3. Minimize surrogate – O(p):
   D_t = argmin_{D∈C} ḡ_t(D) = argmin_{D∈C} Tr(Dᵀ D C_t − Dᵀ B_t)
Access to x_t → the algorithm is in O(p) (complexity depends on p)
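Putting the three steps together, the online loop can be sketched compactly under simplifying assumptions: Ω = ℓ2² so step 1 is a closed-form solve, atoms constrained to the ℓ2 ball, one block-coordinate pass per iteration. The function name and parameters are mine; this is an illustration, not the reference modl implementation.

```python
import numpy as np

def omf(X, k, n_iter, lam=0.1, seed=0):
    """Minimal online matrix factorization loop (steps 1-3 of the slides)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)
    C, B = np.zeros((k, k)), np.zeros((p, k))
    for t in range(1, n_iter + 1):
        x = X[:, rng.integers(n)]               # stream one column
        # 1. code computation (closed-form ridge solve)
        alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
        # 2. surrogate update: running averages C_t, B_t
        w = 1.0 / t
        C = (1 - w) * C + w * np.outer(alpha, alpha)
        B = (1 - w) * B + w * np.outer(x, alpha)
        # 3. surrogate minimization: one block coordinate descent pass,
        #    projecting each atom onto the l2 ball
        for j in range(k):
            if C[j, j] > 1e-12:
                d = D[:, j] + (B[:, j] - D @ C[:, j]) / C[j, j]
                D[:, j] = d / max(1.0, np.linalg.norm(d))
    return D
```

Every step touches one column of X and the k×k / p×k statistics, so the per-iteration cost is linear in p, as stated above.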
Stochastic subsampling
How to reduce the single-iteration cost O(p)?
Sample a masking matrix M_t:
- diagonal matrix with rescaled Bernoulli coefficients, E[rank M_t] = q
- x_t → M_t x_t, with E[M_t x_t] = x_t
- use only M_t x_t in the algorithm's computations
- noisy updates, but a single iteration costs O(q)
[Figure: stream, then subsample]
Subsampled Online Matrix Factorization (SOMF)
Adapt the 3 parts of the algorithm to obtain O(q) complexity:
1. Code computation  2. Surrogate update  3. Surrogate minimization
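Sampling the mask is straightforward; a sketch (helper name mine). Keeping each coordinate with probability q/p and rescaling the survivors by p/q makes M_t x_t an unbiased estimate of x_t with E[rank M_t] = q:

```python
import numpy as np

def sample_mask(p, q, rng):
    """Sample the diagonal of a masking matrix M_t: each of the p coordinates
    is kept with probability q/p and rescaled by p/q, so that
    E[m * x] = x and the expected number of kept coordinates is q."""
    keep = rng.random(p) < q / p
    return keep * (p / q)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
m = sample_mask(x.size, 500, rng)
# Downstream computations only touch the ~q nonzero coordinates of m * x.
```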
1. Code computation
Linear regression with random sampling:
  α_t = argmin_{α∈ℝ^k} ‖M_t(x_t − D_{t−1} α)‖₂² + λΩ(α)
an approximate (sketched) solution of
  α_t = argmin_{α∈ℝ^k} ‖x_t − D_{t−1} α‖₂² + λΩ(α)
(M_t)_t introduces errors in the computation of (α_t)_t
The error can be controlled and reduced
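With Ω = ℓ2² the masked regression has a closed form restricted to the q selected rows, which is enough to see where the O(q) cost comes from. An illustrative helper (name mine; the talk's ℓ1 case needs an iterative solver, and SOMF controls the induced error further):

```python
import numpy as np

def compute_code(D, x, keep, lam):
    """Sketched code: argmin_a ||M(x - D a)||^2 + lam ||a||^2, where M
    selects the rows flagged in the boolean mask `keep`. Only O(q k) data
    is touched instead of O(p k)."""
    Dq, xq = D[keep], x[keep]
    k = D.shape[1]
    return np.linalg.solve(Dq.T @ Dq + lam * np.eye(k), Dq.T @ xq)
```

When the mask keeps every row, this reduces exactly to the full (unmasked) regression.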
3. Surrogate minimization
Original OMF: block coordinate descent with projection onto C,
  min_{D∈C} ḡ_t(D):  d^(j) ← proj_C [ d^(j) − (1/C_t[j,j]) (D c_t^(j) − b_t^(j)) ]
SOMF: freeze the rows not selected by M_t,
  min_{D∈C, P_t^⊥ D = P_t^⊥ D_{t−1}} ḡ_t(D)
[Figure: freeze rows]
Reduces to a block coordinate descent in ℝ^{q×k}!
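The frozen-row pass can be sketched as follows (helper name mine; the projection onto C is omitted for brevity, whereas the actual algorithm enforces the atom constraints as well):

```python
import numpy as np

def update_dictionary_rows(D, C, B, keep, eps=1e-12):
    """One block coordinate descent pass on the quadratic surrogate,
    restricted to the rows flagged in the boolean mask `keep` (length p);
    all other rows of D are frozen. The pass costs O(q k^2) instead of
    O(p k^2). Update per atom j: d_j <- d_j + (b_j - D c_j) / C[j, j]."""
    idx = np.flatnonzero(keep)
    Dq = D[idx]                     # fancy indexing copies; written back below
    for j in range(D.shape[1]):
        if C[j, j] > eps:
            Dq[:, j] += (B[idx, j] - Dq @ C[:, j]) / C[j, j]
    D[idx] = Dq
    return D
```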
Theoretical analysis
Convergence theorem (informal)
f(D_t) converges with probability one, and every limit point D_∞ of (D_t)_t is a stationary point of f: for all D ∈ C,
  ∇f(D_∞, D − D_∞) ≥ 0   (directional derivative)
Proof: control the perturbations introduced by the surrogate approximation and the partial minimization, relative to the online matrix factorization algorithm.
Results: up to 12× speed-up
Resting-state fMRI
Online dictionary learning: 235 h run time for 1 full epoch; a 10 h run time covers only 1/24 epoch.
Proposed method: 10 h run time, 1/2 epoch with reduction r = 12.
Qualitatively, usable maps are obtained 10× faster.
Hyperspectral imaging
[Figure: components 1–3 for each setting]
OMF (r = 1): 14 h, 841k patches; 177 s, 3k patches
SOMF (r = 24): 179 s, 87k patches
Given a fixed time budget, SOMF atoms are more focal and less noisy.
Collaborative filtering
M_t x_t: the movie ratings from user t
Compared with coordinate descent for the MMMF loss (no hyperparameters)
[Figure: test RMSE vs. time (100 s – 1000 s) on Netflix (140M), for coordinate descent, proposed (full projection), proposed (partial projection)]

Dataset     Test RMSE (CD)  Test RMSE (MODL)  Speed-up
ML 1M       0.872           0.866             ×0.75
ML 10M      0.802           0.799             ×3.7
NF (140M)   0.938           0.934             ×6.8

Outperforms coordinate descent beyond 10M ratings
Same prediction performance
Speed-up of 6.8× on Netflix
Conclusion
New efficient algorithm with many potential use cases
Subsampled mini-batches at each iteration
Python package: github.com/arthurmensch/modl
Perspectives:
Efficient heuristics and an adaptive subsampling ratio
Is this kind of approach transposable to the SGD setting?
Publications
[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60.

[Mensch et al., 2016] Mensch, A., Mairal, J., Thirion, B., and Varoquaux, G. (2016). Dictionary learning for massive matrix factorization. In 33rd International Conference on Machine Learning (ICML).

[Mensch et al., 2017] Mensch, A., Mairal, J., Thirion, B., and Varoquaux, G. (2017). Stochastic subsampling for factorizing huge matrices. arXiv:1701.05363.