Dictionary Learning for Massive Matrix Factorization
Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion
Inria/CEA Parietal, Inria Thoth
June 20, 2016
Matrix factorization
[Diagram: X (p × n) = D (p × k) × A (k × n)]
X ∈ R^{p×n}, D ∈ R^{p×k}, A ∈ R^{k×n}: X = DA
Flexible tool for unsupervised data analysis
The dataset has lower underlying complexity than its apparent size suggests
How to scale it to very large datasets? (Brain imaging, 2 TB)
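To make the shapes concrete, a minimal NumPy sketch (the sizes p, n, k are hypothetical toy values, not those of the fMRI data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): p features, n samples, rank k << min(p, n).
p, n, k = 50, 200, 5

D = rng.standard_normal((p, k))   # dictionary, p x k
A = rng.standard_normal((k, n))   # codes, k x n

# X = D A has rank at most k: the apparent p x n size hides a
# k-dimensional underlying structure.
X = D @ A
```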
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 1 / 19
Matrix factorization
[Diagram: X (p × n) = D (p × k) × A (k × n)]
Low-rank factorization: k < p
[Diagram: X (p × n) = D (p × k) × A (k × n)]
...with optional sparse factors
→ interpretable data (fMRI, genetics, topic modeling)
[Diagram: X (p × n) = D (p × k) × A (k × n)]
Overcomplete dictionary learning: k ≫ p, sparse A [Olshausen and Field, 1997]
Formalism and methods
Non-convex formulation
min_{D ∈ C, A ∈ R^{k×n}} ‖X − DA‖₂² + λΩ(A)
Constraints on D
Penalty on A (ℓ₁, ℓ₂)
Naive resolution
Alternated minimization: uses the full X at each iteration
Very slow: a single iteration costs O(pn)
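The naive alternated scheme can be sketched as follows (a toy sketch only: sizes, the ridge code in place of a general Ω, and the unit-ball constraint C are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k, lam = 30, 100, 5, 0.1           # toy sizes and penalty (hypothetical)
X = rng.standard_normal((p, k)) @ rng.standard_normal((k, n))

D = rng.standard_normal((p, k))          # random initial dictionary
for _ in range(50):
    # A-step: ridge code, closed form when Omega = ||.||_2^2.
    A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)
    # D-step: least squares in D, then project columns onto the unit l2 ball.
    D = np.linalg.solve(A @ A.T + 1e-12 * np.eye(k), A @ X.T).T
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)

# Every sweep touches the full X: O(p n) work per iteration, hence "very slow".
A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)
loss = np.linalg.norm(X - D @ A) ** 2 + lam * np.linalg.norm(A) ** 2
```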
Online matrix factorization
Stream (xt), update D at each t [Mairal et al., 2010]
Single iteration in O(p), a few epochs
[Diagram: samples x_t streamed from X (p × n); D (p × k), codes α_t]
Large n, moderate p, e.g. image patches: p = 256, n ≈ 10⁶, 1 GB
Both (sparse) low-rank factorization / sparse coding
Scaling-up for massive matrices
Functional MRI (HCP dataset)
Brain “movies” : space × time
Extract k sparse networks
p = 2·10⁵, n = 2·10⁶, 2 TB
Way larger than vision problems
Unusual setting: data is large in both directions
Also useful in collaborative filtering
[Diagram: X (voxels × time) = D (k spatial maps) × A (time courses)]
Scaling-up for massive matrices
Out-of-the-box online algorithm ?
[Diagram: samples x_t streamed from X (p × n); D (p × k), codes α_t]
Limited time budget? Need to accommodate large p
235 h run time: 1 full epoch
10 h run time: 1/24 epoch
Scaling-up in both directions
[Diagram: batch → online — stream samples x_t from X, handling large n]
[Diagram: online → double online — stream subsampled M_t x_t, handling large p]
Online learning + partial random access to samples
Scaling-up in both directions
[Diagram: online → double online — stream subsampled M_t x_t, handling large p]
Low-distortion lemma [Johnson and Lindenstrauss, 1984]
Randomized linear algebra [Halko et al., 2009]
Sketching for data reduction [Pilanci and Wainwright, 2014]
Algorithm design
Online dictionary learning [Mairal et al., 2010]
1 Compute code – O(p)
   α_t = argmin_{α ∈ R^k} ‖x_t − D_{t−1}α‖₂² + λΩ(α)
2 Update surrogate – O(p)
   g_t(D) = (1/t) Σ_{i=1}^t ‖x_i − Dα_i‖₂²
3 Minimize surrogate – O(p)
   D_t = argmin_{D ∈ C} g_t(D) = argmin_{D ∈ C} Tr(D⊤D A_t − D⊤B_t)
Single-sample x_t access → O(p) algorithm (complexity depends on p, not n)
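The three steps above can be sketched in NumPy (a minimal toy sketch, not the paper's implementation: the sizes, the ridge code standing in for a general Ω, and the synthetic stream are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, lam = 40, 5, 0.1                   # toy sizes and penalty (hypothetical)
D_true = rng.standard_normal((p, k))     # ground truth generating the stream

D = rng.standard_normal((p, k))
D /= np.linalg.norm(D, axis=0)           # columns in the unit l2 ball
A_s = np.zeros((k, k))                   # surrogate statistic: avg of alpha alpha^T
B_s = np.zeros((p, k))                   # surrogate statistic: avg of x alpha^T

for t in range(1, 501):
    x = D_true @ rng.standard_normal(k)  # stream one sample x_t
    # 1. Code: l2 (ridge) code, which has a closed form.
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
    # 2. Surrogate update: running averages A_t and B_t.
    A_s += (np.outer(alpha, alpha) - A_s) / t
    B_s += (np.outer(x, alpha) - B_s) / t
    # 3. Surrogate minimization: one pass of block coordinate descent on D,
    #    projecting each column back onto the unit l2 ball.
    for j in range(k):
        if A_s[j, j] > 1e-12:
            D[:, j] += (B_s[:, j] - D @ A_s[:, j]) / A_s[j, j]
            D[:, j] /= max(np.linalg.norm(D[:, j]), 1.0)
```

Each iteration touches one column of X, so the cost per step is O(pk²) rather than O(pnk) for a full batch sweep.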
Introducing subsampling
Iteration cost in O(p): can we reduce it?
x_t → M_t x_t,   p → rk M_t = s
Use only M_t x_t in the algorithm's computations: complexity in O(s)
[Diagram: stream subsampled M_t x_t — handle large p]
Our contribution
Adapt the 3 parts of the algorithm to obtain O(s) complexity:
1 Code computation
2 Surrogate update
3 Surrogate minimization
[Szabo et al., 2011]: dictionary learning with missing values – O(p)
1. Code computation
Linear regression with random sampling
αt = argmin_{α ∈ R^k} ‖M_t(x_t − D_{t−1}α)‖₂² + λ (rk M_t / p) Ω(α)
approximate solution of
α_t = argmin_{α ∈ R^k} ‖x_t − D_{t−1}α‖₂² + λΩ(α)
Valid in high dimension, with incoherent features:
D⊤M_tD ≈ (s/p) D⊤D,   D⊤M_tx_t ≈ (s/p) D⊤x_t
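The Gram-matrix approximation underlying this step can be checked numerically (a sketch under assumptions: Gaussian columns as a stand-in for "incoherent features", toy sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, s = 10_000, 5, 1_000               # keep s of p features (hypothetical sizes)
D = rng.standard_normal((p, k)) / np.sqrt(p)   # incoherent, near-orthonormal columns
mask = rng.choice(p, size=s, replace=False)    # rows kept by the mask M_t

# Rescaled subsampled Gram matrix approximates the full one:
# D^T M_t D ~ (s/p) D^T D, so (p/s) D^T M_t D ~ D^T D.
G_full = D.T @ D
G_sub = (p / s) * (D[mask].T @ D[mask])
rel_err = np.linalg.norm(G_sub - G_full) / np.linalg.norm(G_full)
```

The relative error shrinks like 1/√s, which is why the masked code is a good proxy for the full one in high dimension.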
2. Surrogate update
Original algorithm: A_t and B_t used in the dictionary update
A_t = (1/t) Σ_{i=1}^t α_i α_i⊤ — same as in the online algorithm
B_t = (1 − 1/t) B_{t−1} + (1/t) x_t α_t⊤ = (1/t) Σ_{i=1}^t x_i α_i⊤ — forbidden: requires the full x_t
Partial update of B_t at each iteration:
B_t = (Σ_{i=1}^t M_i)^{−1} Σ_{i=1}^t M_i x_i α_i⊤
Only M_t B_t is updated
Behaves like E_x[x α⊤] for large t
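The partial update of B can be sketched as follows (a toy sketch: the random codes and mask sizes are assumptions, the point is that only s of p rows are touched per step):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, s = 100, 5, 25                     # hypothetical sizes: keep s of p rows
B = np.zeros((p, k))                     # running estimate of E[x alpha^T]
counts = np.zeros(p)                     # per-row visit counts: diag of sum_i M_i

for t in range(200):
    x = rng.standard_normal(p)           # sample x_t (only masked rows are read)
    alpha = rng.standard_normal(k)       # code alpha_t, computed elsewhere
    rows = rng.choice(p, size=s, replace=False)   # mask M_t
    counts[rows] += 1
    # Only the masked rows of B are touched: O(s k) work per step, not O(p k).
    B[rows] += (np.outer(x[rows], alpha) - B[rows]) / counts[rows, None]
```

Each row of B is a running average over the iterations where that row was observed, matching the (Σ M_i)^{−1} Σ M_i x_i α_i⊤ formula row by row.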
3. Surrogate minimization
Original algorithm: block coordinate descent with projection on C
min_{D ∈ C} g_t(D):   D_j ← Proj_{C_j}( D_j − (1/(A_t)_{j,j}) (D(A_t)_j − (B_t)_j) )
Updating the full D at iteration t is forbidden
Cautious update
Leave the dictionary unchanged for unseen features (I − M_t):
min_{D ∈ C, (I−M_t)D = (I−M_t)D_{t−1}} g_t(D)
O(s) update in block coordinate descent:
D_j ← Proj_{C_j^r}( D_j − (1/(A_t)_{j,j}) M_t(D(A_t)_j − (B_t)_j) )
ℓ₁ ball: C_j^r = {D ∈ C : ‖M_t D‖₁ ≤ ‖M_t D_{t−1}‖₁}
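One masked block-coordinate pass can be sketched as follows (a toy sketch under assumptions: random SPD statistics standing in for real A_t, B_t, and an ℓ₂ ball in place of the slide's ℓ₁ ball):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, s = 100, 5, 20                     # hypothetical sizes
D = rng.standard_normal((p, k))
D /= np.linalg.norm(D, axis=0)
M = rng.standard_normal((k, k))
A = M @ M.T + 0.1 * np.eye(k)            # SPD surrogate statistic A_t
B = rng.standard_normal((p, k))          # surrogate statistic B_t
rows = rng.choice(p, size=s, replace=False)   # mask M_t: the rows seen at step t

D_old = D.copy()
for j in range(k):
    # Block coordinate step on column j, restricted to the masked rows;
    # unseen rows (I - M_t) stay frozen at their previous value.
    D[rows, j] += (B[rows, j] - D[rows] @ A[:, j]) / A[j, j]
    # Project the masked part so its norm does not exceed the previous one
    # (l2 ball here for simplicity; the slides use the l1 ball for sparsity).
    nrm = np.linalg.norm(D[rows, j])
    ref = np.linalg.norm(D_old[rows, j])
    if nrm > ref > 0:
        D[rows, j] *= ref / nrm
```

The work per column is O(sk), so a full pass over D costs O(sk²) instead of O(pk²).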
Resting-state fMRI
HCP dataset
One brain image per second
200 sessions, n = 2·10⁶, p = 2·10⁵
[Diagram: X (voxels × time) = D (k spatial maps) × A (time courses)]
Sparse decomposition: k = 70, C = B₁^k (ℓ₁ balls), Ω = ‖·‖₂²
Validation
Increase the reduction factor r = p/s
Objective function on a test set vs CPU time
Resting-state fMRI
Online dictionary learning
235 h run time: 1 full epoch
10 h run time: 1/24 epoch
Proposed method
10 h run time: 1/2 epoch, reduction r = 12
Qualitatively, usable maps are obtained 10× faster
Resting-state fMRI
[Plot: objective value on test set (×10⁸) vs CPU time (0.1 h to 100 h); original online algorithm (no reduction) vs reduction factors r = 4, 8, 12]
Speed-up close to the reduction factor r = p/s
Resting-state fMRI
[Plot: objective value on test set (×10⁸) vs records seen (100 to 4000), λ = 10⁻⁴; no reduction (original alg.) vs r = 4, 8, 12]
Convergence speed vs number of seen records
Information is acquired faster
Collaborative filtering
M_t x_t: movie ratings from user t
vs. coordinate descent for the MMMF loss (no hyperparameters)
[Plot: test RMSE vs CPU time (100 s to 1000 s) on Netflix (140M); coordinate descent vs proposed (full / partial projection)]
Dataset      Test RMSE (CD)   Test RMSE (MODL)   Speed-up
ML 1M        0.872            0.866              ×0.75
ML 10M       0.802            0.799              ×3.7
NF (140M)    0.938            0.934              ×6.8
Outperform coordinate descent beyond 10M ratings
Same prediction performance
Speed-up 6.8× on Netflix
Conclusion
Take-home message
Loading stochastic subsets of sample streams can drastically accelerate online matrix factorization
[Diagram: stream subsampled M_t x_t]
Reduce CPU (+ I/O) load at each iteration
cf. gradient descent vs. SGD
An order of magnitude speed-up on two different problems
Python package http://github.com/arthurmensch/modl
Heuristic at contribution time
A follow-up algorithm has convergence guarantees
Questions? (Poster #41 this afternoon)
Bibliography I
[Halko et al., 2009] Halko, N., Martinsson, P.-G., and Tropp, J. A. (2009).
Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.
arXiv:0909.4061 [math].
[Johnson and Lindenstrauss, 1984] Johnson, W. B. and Lindenstrauss, J. (1984).
Extensions of Lipschitz mappings into a Hilbert space.
Contemporary mathematics, 26(189-206):1.
[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010).
Online learning for matrix factorization and sparse coding.
The Journal of Machine Learning Research, 11:19–60.
[Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997).
Sparse coding with an overcomplete basis set: A strategy employed by V1?
Vision Research, 37(23):3311–3325.
Bibliography II
[Pilanci and Wainwright, 2014] Pilanci, M. and Wainwright, M. J. (2014).
Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares.
arXiv:1411.0347 [cs, math, stat].
[Szabo et al., 2011] Szabo, Z., Poczos, B., and Lorincz, A. (2011).
Online group-structured dictionary learning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2865–2872. IEEE.
[Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012).
Scalable coordinate descent approaches to parallel matrix factorization for recommender systems.
In Proceedings of the International Conference on Data Mining, pages 765–774. IEEE.
Appendix
Collaborative filtering
Streaming incomplete data
M_t is imposed by user t
Data stream: M_t x_t, the movies rated by user t
Proposed by [Szabo et al., 2011], with O(p) complexity
Validation: test RMSE (rating prediction) vs CPU time
Baseline: coordinate descent solver [Yu et al., 2012], solving the related loss
Σ_{t=1}^n ( ‖M_t(x_t − Dα_t)‖₂² + λ‖α_t‖₂² ) + λ‖D‖₂²
Fastest solver available apart from SGD – no hyperparameters
Our method is not sensitive to hyperparameters
Algorithm
Our algorithm
1 Code computation
   α_t = argmin_{α ∈ R^k} ‖M_t(x_t − D_{t−1}α)‖₂² + λ (rk M_t / p) Ω(α)
2 Surrogate aggregation
   A_t = (1/t) Σ_{i=1}^t α_i α_i⊤
   B_t = B_{t−1} + (Σ_{i=1}^t M_i)^{−1} (M_t x_t α_t⊤ − M_t B_{t−1})
3 Surrogate minimization
   M_t D_j ← Proj_{C_j^r}( M_t D_j − (1/(A_t)_{j,j}) M_t(D(A_t)_j − (B_t)_j) )

Original online MF
1 Code computation
   α_t = argmin_{α ∈ R^k} ‖x_t − D_{t−1}α‖₂² + λΩ(α)
2 Surrogate aggregation
   A_t = (1/t) Σ_{i=1}^t α_i α_i⊤
   B_t = B_{t−1} + (1/t) (x_t α_t⊤ − B_{t−1})
3 Surrogate minimization
   D_j ← Proj_{C_j}( D_j − (1/(A_t)_{j,j}) (D(A_t)_j − (B_t)_j) )