
Sketched Learning from Random Features Moments

Nicolas Keriven

École Normale Supérieure (Paris)

CFM-ENS chair in Data Science

(thesis with Rémi Gribonval at Inria Rennes)

ISMP, July 6th 2018


Compressive learning

(Diagram: data → compression into a linear sketch → learning)

• Sketched learning: first compress the data into a linear sketch [Cormode 2011], then learn from the sketch alone
• Examples of sketches: hash tables, count sketches, histograms…
• Advantages: one-pass, streaming, distributed compression, data privacy…
• In this talk: unsupervised learning


How-to: build a sketch

What is a sketch? Any linear sketch = a vector of empirical moments: $z = \frac{1}{n}\sum_{i=1}^n \Phi(x_i)$ for some feature function $\Phi$.

What is contained in a sketch?

• $\Phi(x) = x$: mean
• $\Phi(x) = $ monomials: moments
• $\Phi(x) = $ bin indicators: histogram
• Proposed: kernel random features [Rahimi 2007] (random projections + a non-linearity)

Questions:

• What information is preserved by the sketching?
• How can this information be retrieved?
• What is a sufficient number of features?

Intuition: sketching as a linear embedding

- Assumption: the data $x_1,\dots,x_n$ are drawn i.i.d. from a distribution $\pi$
- Linear operator: $\mathcal{A}(\pi) = \mathbb{E}_{x\sim\pi}\,\Phi(x)$
- "Noisy" linear measurement: $z = \mathcal{A}(\pi) + e$, where the noise $e$ is small by concentration of the empirical mean

A dimensionality-reducing, random, linear embedding: Compressive Sensing?
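To make the construction concrete, here is a minimal Python sketch of the random-Fourier-feature case, under the common assumption (not specified on this slide) that the frequencies are drawn from a Gaussian distribution; the function and variable names are illustrative only.

import numpy as np

def compute_sketch(X, Omega):
    """Random Fourier feature sketch: z = (1/n) sum_i exp(i * Omega @ x_i).

    X: (n, d) data matrix; Omega: (m, d) random frequency matrix.
    Returns an m-dimensional complex vector of empirical moments.
    """
    projections = X @ Omega.T            # random projections, shape (n, m)
    features = np.exp(1j * projections)  # non-linearity
    return features.mean(axis=0)         # average over the dataset

rng = np.random.default_rng(0)
n, d, m = 10_000, 2, 50
X = rng.normal(size=(n, d))                 # toy dataset
Omega = rng.normal(scale=1.0, size=(m, d))  # assumption: Gaussian frequencies (a common choice)
z = compute_sketch(X, Omega)                # size m, independent of n

Because the sketch is an average, sketches of disjoint data chunks can be merged by a weighted average, which is what makes the streaming and distributed settings immediate.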


Example of applications [Keriven 2016, 2017]

Retrieving a mixture of Diracs from a sketch = k-means

Application: spectral clustering for MNIST classification
- Twice as fast as k-means
- 4 orders of magnitude more memory-efficient

Retrieving GMMs from a sketch

Application: speaker verification [Reynolds 2000]. Error rates:
• EM on 300 000 samples: 29.53
• 20 kB sketch computed on a 50 GB database: 28.96

In this talk

Q: theoretical guarantees?

Inspired by Compressive Sensing:
• 1: with the Restricted Isometry Property (RIP)
• 2: with dual certificates

Outline

Information-preservation guarantees: a RIP analysis
(joint work with R. Gribonval, G. Blanchard, Y. Traonmilin)

Total variation regularization: a dual certificate analysis

Conclusion, outlooks

Recall: Linear inverse problem

True distribution: $\pi$
Sketch: $z = \mathcal{A}(\pi) + e$

• The estimation problem is a linear inverse problem on measures
• Extremely ill-posed!
• Feasibility? (information preservation: could the best possible algorithm recover the distribution?)


Information preservation guarantees

Model set of "simple" distributions (e.g. GMMs)

Goal: prove the existence of a decoder robust to noise and stable to modeling error (an "instance-optimal" decoder, here a Generalized Method of Moments).

Key property: the Lower Restricted Isometry Property (LRIP).

New goal: find/construct models and operators that satisfy the LRIP (w.h.p.)
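For reference, the LRIP and the instance-optimality it yields can be written as follows; this is a sketch of the statements in [Gribonval et al. 2017], with $\mathfrak{S}$ the model set, $d$ a task-driven metric, and $C, C'$ constants as in that paper.

\forall \pi, \pi' \in \mathfrak{S}: \quad d(\pi, \pi') \le C \, \| \mathcal{A}(\pi) - \mathcal{A}(\pi') \|_2 \qquad \text{(LRIP on the model set)}

d(\pi, \Delta(z)) \le C' \big( d(\pi, \mathfrak{S}) + \| e \|_2 \big) \qquad \text{(instance-optimal decoder } \Delta\text{)}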


Proving the LRIP

Goal: LRIP on the whole model set.

Construction of the sketching operator:
- Kernel mean embedding [Gretton 2006, Borgwardt 2006]
- Random features [Rahimi 2007]

Pointwise LRIP, by concentration of the random features.

Extension to the full LRIP: via covering numbers (compactness) of the normalized secant set, a subset of a unit ball (in infinite dimension) that depends only on the model; see the definition below.
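For concreteness, the normalized secant set of the model can be written as below, following the standard definition in the compressive learning literature (the notation is an assumption, not a quote from the slide):

\mathcal{S} = \left\{ \frac{\pi - \pi'}{d(\pi, \pi')} \;:\; \pi, \pi' \in \mathfrak{S}, \ \pi \neq \pi' \right\}

Proving the LRIP then amounts to lower-bounding $\|\mathcal{A}\mu\|_2$ uniformly over $\mu \in \mathcal{S}$.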


Main result [Keriven 2016]

Main hypothesis: the normalized secant set has finite covering numbers.
- In classic, finite-dimensional Compressive Sensing: known
- Here, in infinite dimension: technical

Result: for a sketch size m ≳ (pointwise concentration) × (dimensionality of the model), w.h.p. the decoding error is bounded by a modeling-error term plus an empirical-noise term, neither of which depends on the sketch size m.
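The shape of the guarantee, in the notation introduced above (a paraphrase of the bound in [Gribonval et al. 2017]):

d(\pi, \Delta(z)) \;\lesssim\; \underbrace{d(\pi, \mathfrak{S})}_{\text{modeling error}} \;+\; \underbrace{\Big\| \tfrac{1}{n}\textstyle\sum_{i} \Phi(x_i) - \mathcal{A}(\pi) \Big\|_2}_{\text{empirical noise, } O(1/\sqrt{n}) \text{ w.h.p.}}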


Application

(No assumption is made on the data in either case.)

k-means with mixtures of Diracs

Hypotheses:
- separated centroids
- bounded domain for the centroids

Sketch:
- adjusted random Fourier features (for technical reasons)

Result:
- guarantee w.r.t. the usual k-means cost (SSE)
- sketch size: of the order of $k^2 d$ up to logarithmic factors, as stated in [Gribonval et al. 2017]

GMM with known covariance

Hypotheses:
- sufficiently separated means
- bounded domain for the means

Sketch:
- Fourier features

Result:
- guarantee with respect to the log-likelihood
- sketch size: of the order of $k^2 d$ up to logarithmic factors, as stated in [Gribonval et al. 2017]

Compared to the Generalized Method of Moments: different guarantees.
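To illustrate what "retrieving a mixture of Diracs from a sketch" means in practice, here is a minimal Python experiment that matches the sketch of a k-Dirac mixture to the empirical sketch by non-convex least squares with random restarts. This is an illustrative stand-in, not the algorithm used in the cited papers; the uniform-weight simplification and all names are assumptions.

import numpy as np
from scipy.optimize import minimize

def sketch_of_diracs(weights, centers, Omega):
    """Sketch of a weighted Dirac mixture: sum_k w_k * exp(i * Omega @ c_k)."""
    return np.exp(1j * centers @ Omega.T).T @ weights  # shape (m,)

def fit_centroids(z, Omega, k, d, rng, restarts=5):
    """Moment matching on the sketch: non-convex least squares over the centroids."""
    weights = np.full(k, 1.0 / k)  # simplification: uniform, fixed weights
    def loss(theta):
        centers = theta.reshape(k, d)
        residual = sketch_of_diracs(weights, centers, Omega) - z
        return float(np.sum(np.abs(residual) ** 2))
    best = min((minimize(loss, rng.normal(size=k * d)) for _ in range(restarts)),
               key=lambda res: res.fun)  # keep the best random restart
    return best.x.reshape(k, d)

rng = np.random.default_rng(0)
d, k, m, n = 2, 3, 60, 5000
true_centers = rng.uniform(-3, 3, size=(k, d))
X = true_centers[rng.integers(k, size=n)] + 0.1 * rng.normal(size=(n, d))
Omega = rng.normal(size=(m, d))             # random Fourier frequencies
z = np.exp(1j * X @ Omega.T).mean(axis=0)   # empirical sketch (size m << n)
centers_hat = fit_centroids(z, Omega, k, d, rng)
print(np.sort(centers_hat, axis=0))         # compare to np.sort(true_centers, axis=0)

Note that only the m-dimensional sketch z enters the recovery: the dataset X can be discarded (or streamed) once the sketch is computed.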

Outline

Information-preservation guarantees: a RIP analysis

Total variation regularization: a dual certificate analysis
(joint work with C. Poon, G. Peyré)

Conclusion, outlooks


Total Variation regularization

Previously (RIP analysis), the minimization was moment matching over the model:
• must know the number of components
• non-convex!

Convex relaxation ("super-resolution"): the Beurling-LASSO (BLASSO) [De Castro 2015], an optimization over Radon measures regularized by the total variation norm (the "L1 norm" of measures); see the formulation below.

Questions:
• Is the recovered measure sparse?
• Does it have the right number of components?
• Does it recover the true measure?
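The BLASSO in its standard form, as in the cited works ($\mathcal{M}(\mathcal{X})$ denotes the Radon measures on the parameter space $\mathcal{X}$, and $\lambda > 0$ is a regularization parameter):

\min_{\mu \in \mathcal{M}(\mathcal{X})} \ \frac{1}{2}\, \| \mathcal{A}\mu - z \|_2^2 \;+\; \lambda \, |\mu|(\mathcal{X})

Here $|\mu|(\mathcal{X})$ is the total variation norm, which promotes sparse (discrete) measures just as the L1 norm promotes sparse vectors.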


Dual certificates

Dual certificate analysis: exhibit a function $\eta = \mathcal{A}^* p$ ($p$ = Lagrange multiplier) such that:
• $\eta$ interpolates the signs of the weights on the true support
• $|\eta| < 1$ otherwise

Step 1: study the full (limit) kernel [Candès 2013], assuming the components are sufficiently separated.
Step 2: bound the deviations of the random-feature approximation from the full kernel.

(Figure: certificates for increasing numbers of features, m = 10, 20, 50.)


Results for separated GMM

Assumption: the data are actually drawn from a GMM.

1: Ideal scaling in sparsity (in progress…)
• Not necessarily the right number of components, but:
• the mass of the recovered measure is concentrated around the true components
• (weak) robustness to modeling error
• Proof: an infinite-dimensional golfing scheme (new)

2: Minimal-norm certificate [Duval, Peyré 2015] (in progress…)
• When n is high enough: sparse solution, with the right number of components
• Proof: adaptation of [Tang, Recht 2013]

Outline

Information-preservation guarantees: a RIP analysis

Total variation regularization: a dual certificate analysis

Conclusion, outlooks

Sketch learning

• Sketching: streaming, distributed learning
• An original view on data compression and generalized moments
• Combines random features and kernel mean embeddings with infinite-dimensional Compressive Sensing


Summary, outlooks

• RIP analysis
  • Information-preservation guarantees
  • Fine control on noise, modeling error (instance-optimal decoder) and recovery metrics
  • Necessary and sufficient conditions

• Dual certificate analysis
  • Convex minimization
  • In some cases, automatically recovers the right number of components

• Outlooks
  • Algorithms for TV minimization
  • Other features (not necessarily random…)
  • Other "sketched" learning tasks
  • Multilayer sketches?

Thank you!

References:

• Gribonval, Blanchard, Keriven, Traonmilin. Compressive Statistical Learning with Random Feature Moments. 2017. arXiv:1706.07180
• Keriven. Sketching for Large-Scale Learning of Mixture Models. PhD thesis. tel-01620815
• Poon, Keriven, Peyré. A Dual Certificates Analysis of Compressive Off-the-Grid Recovery. 2018. arXiv:1802.08464
• Code, applications: nkeriven.github.io