
Test of Time Award

Online Dictionary Learning for Sparse Coding
Online Learning for Matrix Factorization and Sparse Coding

Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro

International Conference on Machine Learning, 2019

[Photographs of Francis Bach, Jean Ponce, and Guillermo Sapiro.]

What are these papers about?

They are dealing with matrix factorization,

X (m × n)  ≈  D (m × p)  ×  A (p × n),

when one factor is sparse, or the other one, or both; or when one factor is not only sparse but also admits a particular structure; or when one factor admits a particular structure (e.g., piecewise constant) but is not sparse.

What are these papers about?

In these papers, data matrices have many columns,

X (m × n)  ≈  D (m × p)  ×  A (p × n)   with n → +∞,

or an infinite number of columns, or columns that are streamed online.
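To make the shapes concrete, here is a minimal NumPy sketch (the sizes m, n, p, the number of nonzeros per column, and the noise level are arbitrary and chosen only for illustration): the data matrix X is approximated by a dictionary D times a code matrix A whose columns are sparse.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 64, 10_000, 256              # signal size, number of signals, dictionary size

D = rng.standard_normal((m, p))        # dictionary: one atom per column
A = np.zeros((p, n))                   # codes: most entries are zero (the sparse factor)
rows = rng.integers(0, p, size=(5, n))
A[rows, np.arange(n)] = rng.standard_normal((5, n))   # a handful of nonzeros per column

X = D @ A + 0.01 * rng.standard_normal((m, n))        # data matrix, X ≈ D A
print(X.shape, D.shape, A.shape)       # (64, 10000) (64, 256) (256, 10000)
```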

Formulation(s)

X = [x1, x2, . . . , xn] is a data matrix.

We may call D = [d1, . . . , dp] a dictionary.

A = [α1, . . . , αn] carries the decomposition coefficients of X onto D.

Interpretation as signal/data decomposition

X ≈ DA  ⇐⇒  ∀ i, xi ≈ Dαi = ∑_{j=1}^{p} αi[j] dj.

Generic formulation

min_{D∈𝒟} (1/n) ∑_{i=1}^{n} L(xi, D)   with   L(x, D) := min_{α∈𝒜} (1/2) ‖x − Dα‖² + λψ(α).

Generic formulation / stochastic case

min_{D∈𝒟} E_x[L(x, D)]   with   L(x, D) := min_{α∈𝒜} (1/2) ‖x − Dα‖² + λψ(α).

Which formulations does it cover?

                                     𝒟                                    𝒜       ψ
non-negative matrix factorization    R₊^{m×p}                             R₊^p    0
sparse coding                        {D : ∀ j, ‖dj‖ ≤ 1}                  R^p     ‖·‖₁
non-negative sparse coding           {D : ∀ j, ‖dj‖ ≤ 1}                  R₊^p    ‖·‖₁
structured sparse coding             {D : ∀ j, ‖dj‖ ≤ 1}                  R^p     ‖·‖₁ + Ω(·)
≈ sparse PCA                         {D : ∀ j, ‖dj‖₂² + ‖dj‖₁ ≤ 1}        R^p     ‖·‖₁
. . .                                . . .                                . . .   . . .

[Paatero and Tapper, '94], [Olshausen and Field, '96], [Hoyer, 2002], [Mairal et al., 2011], [Zou et al., 2004].
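To make the inner problem concrete: for the sparse coding row of the table (ψ = ‖·‖₁, 𝒜 = R^p), evaluating L(x, D) means solving a Lasso problem. Below is a minimal sketch using proximal gradient descent (ISTA); the fixed iteration count and step size are simplifications, and swapping the soft-thresholding step for another proximal operator would cover the other rows of the table.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(x, D, lam, n_iter=200):
    """Approximately solve min_alpha 0.5*||x - D alpha||^2 + lam*||alpha||_1 with ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the quadratic term
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

def loss(x, D, lam):
    """The value L(x, D) of the generic formulation, with an l1 penalty."""
    alpha = sparse_code(x, D, lam)
    return 0.5 * np.sum((x - D @ alpha) ** 2) + lam * np.abs(alpha).sum()
```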

The sparse coding context

Sparse coding was introduced by Olshausen and Field, '96. It was the first time (together with ICA, see [Bell and Sejnowski, '97]) that a simple unsupervised learning principle led to various sorts of "Gabor-like" filters when trained on natural image patches.

The sparse coding context

Remember that we can play with various structured sparsity-inducing penalties:

[Jenatton et al., 2010], [Kavukcuoglu et al., 2009], [Mairal et al., 2011], [Hyvärinen and Hoyer, 2001].

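As one concrete example of such a penalty (a minimal sketch, assuming non-overlapping groups): the proximal operator of the group-Lasso penalty ψ(α) = ∑_g ‖α_g‖₂ shrinks or zeroes out whole groups of coefficients at once, which is what produces structured sparsity patterns.

```python
import numpy as np

def prox_group_lasso(alpha, groups, lam):
    """Proximal operator of lam * sum_g ||alpha_g||_2 (block soft-thresholding)."""
    out = alpha.copy()
    for g in groups:                           # each g is a list of coefficient indices
        norm_g = np.linalg.norm(out[g])
        scale = max(0.0, 1.0 - lam / norm_g) if norm_g > 0 else 0.0
        out[g] = scale * out[g]                # shrink the whole group, or kill it entirely
    return out

alpha = np.array([0.9, -0.1, 0.05, 1.5, -2.0, 0.02])
groups = [[0, 1, 2], [3, 4], [5]]
print(prox_group_lasso(alpha, groups, lam=0.5))   # the last group is set exactly to zero
```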

Sparsity and simplicity principles

Timeline (1921–2010):

1921: Wrinch and Jeffreys' simplicity principle.

1952: Markowitz's portfolio selection.

1960s and 70s: best subset selection in statistics.

1990s: the wavelet era in signal processing.

1996: Olshausen and Field's dictionary learning method.

1994–1996: the Lasso (Tibshirani) and basis pursuit (Chen and Donoho).

2004: compressed sensing (Candès, Romberg and Tao).

2006: Elad and Aharon's image denoising method.

Context of 2009

Many success stories of dictionary learning in image processing:

image denoising, inpainting, demosaicing, super-resolution . . .

[Elad and Aharon, 2006], [Mairal et al., 2008], [Yang et al., 2008] . . .

Also success stories in computer vision for modeling local features:

dictionary learning on top of SIFT wins the PASCAL VOC'09 challenge.

another variant wins the ImageNet 2010 challenge.

[Yang et al., 2009], [Lin et al., 2010] . . .

Matrix factorization becomes a key technique for unsupervised data modeling:

recommender systems (the Netflix prize) and social networks.

document clustering.

genomic pattern discovery.

. . .

[Koren et al., 2009b], [Ma et al., 2008], [Xu et al., 2003], [Brunet et al., 2004] . . .

Context of 2009

Classical approach for matrix factorization: alternating minimization,

min_{D∈𝒟, A∈𝒜} (1/2) ‖X − DA‖²_F + λψ(A),

which requires loading all the data at every iteration (batch optimization).

Meanwhile, Léon Bottou is advocating stochastic optimization for machine learning, which makes the risk minimization point of view relevant:

min_{D∈𝒟} (1/n) ∑_{i=1}^{n} L(xi, D)   with   L(x, D) := min_{α∈𝒜} (1/2) ‖x − Dα‖² + λψ(α).

See Léon's tutorial at NIPS'07, or the NeurIPS'18 test of time award [Bottou and Bousquet, 2008].
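A minimal sketch of that batch scheme is below; the sparse coding step is plain ISTA on all columns at once, and the dictionary step is a regularized least-squares fit followed by projection of every atom onto ‖dj‖ ≤ 1, which is one common simplification among several possible block updates. The point it illustrates is that every outer iteration needs the full matrices X and A.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def project_columns(D):
    """Project each atom onto the constraint set {d : ||d|| <= 1}."""
    return D / np.maximum(np.linalg.norm(D, axis=0), 1.0)

def batch_dictionary_learning(X, p, lam, n_outer=30, n_inner=100, seed=0):
    """Alternating minimization of (1/2)||X - D A||_F^2 + lam*||A||_1 over D and A."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    D = project_columns(rng.standard_normal((m, p)))
    A = np.zeros((p, n))
    for _ in range(n_outer):
        # Sparse coding step (ISTA on every column simultaneously): reads the whole of X.
        L = max(np.linalg.norm(D, 2) ** 2, 1e-8)
        for _ in range(n_inner):
            A = soft_threshold(A - D.T @ (D @ A - X) / L, lam / L)
        # Dictionary step: least-squares fit of D to (X, A), then projection of the atoms.
        G = A @ A.T + 1e-8 * np.eye(p)         # small ridge keeps the system well-posed
        D = project_columns(np.linalg.solve(G, A @ X.T).T)
    return D, A
```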

What we did

We started experimenting with SGD, and tuning the step-size turned out to be painful. Can we then design an algorithm that would be as fast as SGD, but more practical?

Idea 1: If we knew the optimal codes αi* for all the xi's in advance, the problem would become

min_{D∈𝒟} (1/2) trace(DᵀD B) − trace(Dᵀ C)   with   B = (1/n) ∑_{i=1}^{n} αi* (αi*)ᵀ   and   C = (1/n) ∑_{i=1}^{n} xi (αi*)ᵀ,

which yields parameter-free block coordinate descent rules for updating D.

Idea 2: Build appropriate matrices B and C in an online fashion.

What about theory?

We could provide guarantees of convergence to stationary points, even though the problem is non-convex, stochastic, constrained, and non-smooth.

[Neal and Hinton, '98], [Mairal, 2013], [Mensch, 2018].
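Putting the two ideas together, here is a minimal sketch of an online update loop in that spirit. It is an illustration rather than the exact algorithm of the paper, which also uses mini-batches and further practical refinements and iterates the dictionary update until convergence. The `sparse_code` argument stands for any solver of the ℓ1 subproblem, such as the ISTA routine sketched earlier.

```python
import numpy as np

def dictionary_update(D, B, C):
    """One pass of block coordinate descent over the atoms of D for the surrogate
    min_D 0.5*trace(D^T D B) - trace(D^T C), subject to ||d_j|| <= 1 (no step size)."""
    for j in range(D.shape[1]):
        if B[j, j] > 0:
            u = D[:, j] + (C[:, j] - D @ B[:, j]) / B[j, j]   # exact minimizer in d_j
            D[:, j] = u / max(1.0, np.linalg.norm(u))         # projection onto the ball
    return D

def online_dictionary_learning(stream, D0, lam, sparse_code):
    """Process columns x one at a time; sparse_code(x, D, lam) is any l1 solver."""
    D = D0.copy()
    m, p = D.shape
    B, C = np.zeros((p, p)), np.zeros((m, p))
    for t, x in enumerate(stream, start=1):
        alpha = sparse_code(x, D, lam)            # encode the new point with the current D
        B += (np.outer(alpha, alpha) - B) / t     # running average of alpha alpha^T
        C += (np.outer(x, alpha) - C) / t         # running average of x alpha^T
        D = dictionary_update(D, B, C)            # refresh D from the sufficient statistics
    return D
```

The dictionary update needs no step size: given B and C, each atom is minimized in closed form, which is precisely the parameter-free block coordinate descent rule of Idea 1.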

Reasons for impact: How did it help other fields?

A timely context (≈ luck)

Datasets were becoming larger and larger, and there was suddenly a need for more scalable matrix factorization methods.

A combination of mathematics and engineering?

an efficient software package: the SPAMS toolbox.

robustness to hyper-parameters: a default setting that works (most of the time) in practice.

(try it with pip install spams in Python, or download the R/Matlab packages).

Flexibility in the constraints/penalty design

allowing the method to be used in unexpected contexts.

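For readers who want to try this kind of method without installing SPAMS: scikit-learn also ships an online dictionary learning estimator in the same spirit. A minimal usage sketch follows; note that scikit-learn stores signals as rows rather than columns, and the parameter values below are arbitrary.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 64))      # 5000 signals of dimension 64, one per row

dl = MiniBatchDictionaryLearning(n_components=128, alpha=0.5,
                                 batch_size=256, random_state=0)
codes = dl.fit_transform(X)              # sparse codes, shape (5000, 128)
D = dl.components_                       # learned dictionary, shape (128, 64)

print(codes.shape, D.shape, np.mean(codes != 0))   # fraction of nonzero coefficients
```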

Connection with neural networks

A cheap way to obtain a sparse code β from x and D is

β = relu(Dᵀx − λ),

versus

α ∈ argmin_{α∈𝒜} (1/2) ‖x − Dα‖² + λ‖α‖₁.

Then, not surprisingly, for dictionary learning:

end-to-end feature learning is feasible.

one can design convolutional and multilayer models.

sparse decomposition algorithms perform neural-network-like operations (LISTA).

[Mairal et al., 2012], [Zeiler and Fergus, 2010], [Gregor and LeCun, 2010].
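A minimal sketch of that comparison (the dictionary, signal, and hyper-parameters are arbitrary): starting from zero, the one-shot relu code is essentially the first ISTA iteration, up to scaling and the handling of negative coefficients, and unrolling a few more iterations with learned weights is the LISTA idea of turning sparse coding into a small feed-forward network.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def relu_code(x, D, lam):
    """One-shot code: beta = relu(D^T x - lam)."""
    return np.maximum(D.T @ x - lam, 0.0)

def ista_code(x, D, lam, n_iter=20):
    """A few ISTA iterations for min_alpha 0.5*||x - D alpha||^2 + lam*||alpha||_1."""
    L = np.linalg.norm(D, 2) ** 2
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        alpha = soft_threshold(alpha + D.T @ (x - D @ alpha) / L, lam / L)
    return alpha

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
x = rng.standard_normal(64)

beta, alpha = relu_code(x, D, 0.5), ista_code(x, D, 0.5)
print(np.mean(beta != 0), np.mean(alpha != 0))   # both codes are sparse
```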

Thoughts

Is Wrinch and Jeffreys' simplicity principle still relevant?

Simplicity is a key to interpretability and to model/hypothesis selection.

The next form will probably be different from ℓ1. Which one?

Simplicity is not enough: various forms of robustness and stability are also needed.

[Yu and Kumbier, 2019].