Transcript of A Kernel Perspective for Regularizing Deep Neural Networks

Page 1:

A Kernel Perspective for Regularizing Deep Neural Networks

Julien Mairal

Inria Grenoble

Imaging and Machine Learning, IHP, 2019

Julien Mairal A Kernel Perspective for Regularizing NN 1/1

Page 2:

Publications

Theoretical Foundations

A. Bietti and J. Mairal. Invariance and Stability of Deep Convolutional Representations. NIPS. 2017.

A. Bietti and J. Mairal. Group Invariance, Stability to Deformations, and Complexity of Deep Convolutional Representations. JMLR. 2019.

Practical aspects

A. Bietti, G. Mialon, D. Chen, and J. Mairal. A Kernel Perspective for Regularizing Deep Neural Networks. arXiv. 2019.

Julien Mairal A Kernel Perspective for Regularizing NN 2/1

Page 3:

Convolutional Neural Networks

Short Introduction and Current Challenges

Julien Mairal A Kernel Perspective for Regularizing NN 3/1

Page 4:

Learning a predictive model

The goal is to learn a prediction function f : Rp → R given labeled training data (xi, yi)i=1,...,n with xi in Rp, and yi in R:

min_{f∈F}  (1/n) ∑_{i=1}^n L(yi, f(xi))  +  λΩ(f),

where the sum is the empirical risk (data fit) and λΩ(f) is the regularization term.
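For concreteness, a minimal sketch of this objective in code (the model, loss, and penalty Ω here are illustrative placeholders, not part of the talk):

```python
import torch

def regularized_empirical_risk(model, loss_fn, penalty, inputs, targets, lam):
    """(1/n) sum_i L(y_i, f(x_i)) + lam * Omega(f): a generic sketch of the objective."""
    data_fit = loss_fn(model(inputs), targets)   # empirical risk (mean reduction), data fit
    return data_fit + lam * penalty(model)       # plus regularization

# example choices: loss_fn = torch.nn.MSELoss(); penalty = weight decay
weight_decay = lambda m: sum(p.pow(2).sum() for p in m.parameters())
```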

Julien Mairal A Kernel Perspective for Regularizing NN 4/1

Page 5:

Convolutional Neural Networks

The goal is to learn a prediction function f : Rp → R given labeled training data (xi, yi)i=1,...,n with xi in Rp, and yi in R:

min_{f∈F}  (1/n) ∑_{i=1}^n L(yi, f(xi))  +  λΩ(f),

where the sum is the empirical risk (data fit) and λΩ(f) is the regularization term.

What is specific to multilayer neural networks?

The “neural network” space F is explicitly parametrized by:

f(x) = σk(Wkσk–1(Wk–1 . . . σ2(W2σ1(W1x)) . . .)).

Linear operations are either unconstrained (fully connected) or share parameters (e.g., convolutions).

Finding the optimal W1,W2, . . . ,Wk yields a non-convex optimization problem in huge dimension.

Julien Mairal A Kernel Perspective for Regularizing NN 5/1

Page 6:

Convolutional Neural Networks

Picture from LeCun et al. [1998]

What are the main features of CNNs?

they capture compositional and multiscale structures in images;

they provide some invariance;

they model local stationarity of images at several scales;

they are state-of-the-art in many fields.

Julien Mairal A Kernel Perspective for Regularizing NN 6/1

Page 7:

Convolutional Neural Networks

The keywords: multi-scale, compositional, invariant, local features. Picture from Y. LeCun’s tutorial:

Julien Mairal A Kernel Perspective for Regularizing NN 7/1

Page 8:

Convolutional Neural Networks

Picture from Olah et al. [2017]:

Julien Mairal A Kernel Perspective for Regularizing NN 8/1


Page 10:

Convolutional Neural Networks: Challenges

What are current high-potential problems to solve?

1 lack of stability (see next slide).

2 learning with few labeled data.

3 learning with no supervision (see Tab. from Bojanowski and Joulin, 2017).

Julien Mairal A Kernel Perspective for Regularizing NN 9/1

Page 11:

Convolutional Neural Networks: Challenges

Illustration of instability. Picture from Kurakin et al. [2016].

Figure: Adversarial examples are generated by computer; then printed on paper; a new picture taken on a smartphone fools the classifier.

Julien Mairal A Kernel Perspective for Regularizing NN 10/1

Page 12:

Convolutional Neural Networks: Challenges

min_{f∈F}  (1/n) ∑_{i=1}^n L(yi, f(xi))  +  λΩ(f),

where the sum is the empirical risk (data fit) and λΩ(f) is the regularization term.

The issue of regularization

today, heuristics are used (DropOut, weight decay, early stopping)...

...but they are not sufficient.

how to control variations of prediction functions?

|f(x)− f(x′)| should be small if x and x′ are “similar”.

what does it mean for x and x′ to be “similar”?

what should be a good regularization function Ω?

Julien Mairal A Kernel Perspective for Regularizing NN 11/1

Page 13:

Deep Neural Networks

from a Kernel Perspective

Julien Mairal A Kernel Perspective for Regularizing NN 12/1

Page 14:

A kernel perspective

Recipe

Map data x to high-dimensional space, Φ(x) in H (RKHS), with Hilbertian geometry (projections, barycenters, angles, . . . , exist!).

predictive models f in H are linear forms in H: f(x) = 〈f,Φ(x)〉H.

Learning with a positive definite kernel K(x, x′) = 〈Φ(x),Φ(x′)〉H.

[Scholkopf and Smola, 2002, Shawe-Taylor and Cristianini, 2004]...

Julien Mairal A Kernel Perspective for Regularizing NN 13/1

Page 15:

A kernel perspective

Recipe

Map data x to high-dimensional space, Φ(x) in H (RKHS), with Hilbertian geometry (projections, barycenters, angles, . . . , exist!).

predictive models f in H are linear forms in H: f(x) = 〈f,Φ(x)〉H.

Learning with a positive definite kernel K(x, x′) = 〈Φ(x),Φ(x′)〉H.

What is the relation with deep neural networks?

It is possible to design an RKHS H where a large class of deep neural networks live [Mairal, 2016].

f(x) = σk(Wkσk–1(Wk–1 . . . σ2(W2σ1(W1x)) . . .)) = 〈f,Φ(x)〉H.

This is the construction of “convolutional kernel networks”.

[Scholkopf and Smola, 2002, Shawe-Taylor and Cristianini, 2004]...

Julien Mairal A Kernel Perspective for Regularizing NN 13/1

Page 16:

A kernel perspective

Recipe

Map data x to high-dimensional space, Φ(x) in H (RKHS), with Hilbertian geometry (projections, barycenters, angles, . . . , exist!).

predictive models f in H are linear forms in H: f(x) = 〈f,Φ(x)〉H.

Learning with a positive definite kernel K(x, x′) = 〈Φ(x),Φ(x′)〉H.

Why do we care?

Φ(x) is related to the network architecture and is independent of training data. Is it stable? Does it lose signal information?

f is a predictive model. Can we control its stability?

|f(x)− f(x′)| ≤ ‖f‖H‖Φ(x)− Φ(x′)‖H.

‖f‖H controls both stability and generalization!

Julien Mairal A Kernel Perspective for Regularizing NN 13/1

Page 17:

Summary of the results from Bietti and Mairal [2019]

Multi-layer construction of the RKHS H
Contains CNNs with smooth homogeneous activation functions.

Signal representation: Conditions for

Signal preservation of the multi-layer kernel mapping Φ.

Stability to deformations and non-expansiveness for Φ.

Constructions to achieve group invariance.

On learning

Bounds on the RKHS norm ‖.‖H to control stability and generalization of a predictive model f.

|f(x)− f(x′)| ≤ ‖f‖H ‖Φ(x)− Φ(x′)‖H.   [Mallat, 2012]

Julien Mairal A Kernel Perspective for Regularizing NN 14/1


Page 20:

Smooth homogeneous activation functions

z ↦ ReLU(w⊤z)  =⇒  z ↦ ‖z‖ σ(w⊤z/‖z‖).
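For intuition, a minimal sketch of the homogenized activation z ↦ ‖z‖ σ(⟨w, z⟩/‖z‖); softplus is used as an assumed stand-in for a smooth ReLU-like σ, which is not the exact choice behind the kernel construction:

```python
import torch
import torch.nn.functional as F

def srelu(z, w, eps=1e-8):
    """Homogeneous smooth activation: z -> ||z|| * sigma(⟨w, z⟩/||z||), sigma = softplus (a stand-in)."""
    norm = z.norm(dim=1, keepdim=True).clamp_min(eps)   # ||z||
    cos = (z @ w).unsqueeze(1) / norm                   # ⟨w, z⟩/||z||
    return norm * F.softplus(cos)

# compare with the plain ReLU of the linear form
z, w = torch.randn(5, 3), torch.randn(3)
print(F.relu(z @ w))
print(srelu(z, w).squeeze(1))
```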

[Figure: left, x ↦ σ(x) for ReLU vs. the smoothed sReLU; right, x ↦ |x| σ(wx/|x|) for sReLU with w ∈ {0, 0.5, 1, 2}, compared with ReLU (w = 1).]

Julien Mairal A Kernel Perspective for Regularizing NN 15/1

Page 21:

A kernel perspective: regularization

Assume we have an RKHS H for deep networks:

min_{f∈H}  (1/n) ∑_{i=1}^n L(yi, f(xi)) + (λ/2) ‖f‖²H.

‖.‖H encourages smoothness and stability w.r.t. the geometry induced by the kernel (which itself depends on the choice of architecture).

Problem

Multilayer kernels developed for deep networks are typically intractable.

One solution [Mairal, 2016]

do kernel approximations at each layer, which leads to non-standard CNNs called convolutional kernel networks (CKNs); not the subject of this talk.

Julien Mairal A Kernel Perspective for Regularizing NN 16/1


Page 24:

A kernel perspective: regularization

Consider a classical CNN parametrized by θ, which lives in the RKHS:

min_{θ∈Rp}  (1/n) ∑_{i=1}^n L(yi, fθ(xi)) + (λ/2) ‖fθ‖²H.

This is different from CKNs since fθ admits a classical parametrization.

Problem

‖fθ‖H is intractable...

One solution [Bietti et al., 2019]

use approximations (lower- and upper-bounds), based on mathematical properties of ‖.‖H.

This is the subject of this talk.

Julien Mairal A Kernel Perspective for Regularizing NN 17/1


Page 27:

Construction of the RKHS for continuous signals

Initial map x0 in L2(Ω,H0)

x0 : Ω→ H0: continuous input signal

u ∈ Ω = Rd: location (d = 2 for images).

x0(u) ∈ H0: input value at location u (H0 = R3 for RGB images).

Building map xk in L2(Ω,Hk) from xk−1 in L2(Ω,Hk−1)

xk : Ω→ Hk: feature map at layer k

xk = AkMkPkxk−1.

Pk: patch extraction operator, extracts a small patch of the feature map xk−1 around each point u (Pkxk−1(u) is a patch centered at u).

Mk: non-linear mapping operator, maps each patch to a new Hilbert space Hk with a pointwise non-linear function ϕk(·).

Ak: (linear) pooling operator at scale σk.
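A rough discrete sketch of one layer xk = AkMkPkxk−1 (simplifying assumptions throughout: a random linear map followed by tanh stands in for the kernel mapping ϕk, Gaussian pooling is done without subsampling; this is not the actual CKN implementation):

```python
import math
import torch
import torch.nn.functional as F

def gaussian_1d(sigma):
    r = int(3 * sigma)
    t = torch.arange(-r, r + 1, dtype=torch.float32)
    g = torch.exp(-t ** 2 / (2 * sigma ** 2))
    return g / g.sum()

def layer(x, patch_size=3, sigma=2.0, out_dim=32):
    """One-layer sketch: x of shape (N, C, H, W) -> feature map of shape (N, out_dim, H, W)."""
    N, C, H, W = x.shape
    # P_k: extract a patch around every location (zero padding at the border)
    patches = F.unfold(x, patch_size, padding=patch_size // 2)      # (N, C*k*k, H*W)
    norms = patches.norm(dim=1, keepdim=True).clamp_min(1e-8)       # patch norms ||z||
    # M_k: homogeneous pointwise map z -> ||z|| g(Wk z / ||z||); Wk and g = tanh are illustrative
    Wk = torch.randn(out_dim, patches.shape[1]) / math.sqrt(patches.shape[1])
    feats = norms * torch.tanh(torch.einsum('op,npl->nol', Wk, patches / norms))
    feats = feats.view(N, out_dim, H, W)
    # A_k: separable Gaussian pooling at scale sigma (no subsampling in this sketch)
    g = gaussian_1d(sigma)
    kh = g.view(1, 1, -1, 1).repeat(out_dim, 1, 1, 1)
    kw = g.view(1, 1, 1, -1).repeat(out_dim, 1, 1, 1)
    feats = F.conv2d(feats, kh, padding=(g.numel() // 2, 0), groups=out_dim)
    feats = F.conv2d(feats, kw, padding=(0, g.numel() // 2), groups=out_dim)
    return feats
```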

Julien Mairal A Kernel Perspective for Regularizing NN 18/1


Page 31:

Construction of the RKHS for continuous signals

[Figure: one layer of the construction.]
xk–1 : Ω → Hk–1, with values xk–1(u) ∈ Hk–1.
Patch extraction: Pkxk–1(v) ∈ Pk.
Kernel mapping: MkPkxk–1(v) = ϕk(Pkxk–1(v)) ∈ Hk, so MkPkxk–1 : Ω → Hk.
Linear pooling: xk := AkMkPkxk–1 : Ω → Hk, with xk(w) = AkMkPkxk–1(w) ∈ Hk.

Julien Mairal A Kernel Perspective for Regularizing NN 19/1

Page 32:

Construction of the RKHS for continuous signals

Assumption on x0

x0 is typically a discrete signal acquired with a physical device.

Natural assumption: x0 = A0x, with x the original continuous signal, A0 a local integrator with scale σ0 (anti-aliasing).

Multilayer representation

Φn(x) = AnMnPnAn−1Mn−1Pn−1 · · · A1M1P1x0 ∈ L2(Ω,Hn).

σk grows exponentially in practice (i.e., fixed with subsampling).

Prediction layer

e.g., linear f(x) = 〈w,Φn(x)〉.
“linear kernel” K(x, x′) = 〈Φn(x),Φn(x′)〉 = ∫_Ω 〈xn(u), x′n(u)〉 du.

Julien Mairal A Kernel Perspective for Regularizing NN 20/1


Page 35:

Practical Regularization Strategies

Julien Mairal A Kernel Perspective for Regularizing NN 21/1

Page 36:

A kernel perspective: regularization

Another point of view: consider a classical CNN parametrized by θ, which lives in the RKHS:

min_{θ∈Rp}  (1/n) ∑_{i=1}^n L(yi, fθ(xi)) + (λ/2) ‖fθ‖²H.

Upper-bounds

‖fθ‖H ≤ ω(‖Wk‖, ‖Wk–1‖, . . . , ‖W1‖)   (spectral norms),

where the Wj’s are the convolution filters. The bound suggests controlling the spectral norm of the filters.
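For instance, one might monitor or penalize these spectral norms with a few steps of power iteration; a minimal sketch (reshaping each filter tensor into a matrix, as below, is only a crude proxy for the spectral norm of the full convolution operator, and the hinge-type penalty is an assumed form, not the paper's exact method):

```python
import torch

def spectral_norm_estimate(W, n_iter=20):
    """Largest singular value of a weight tensor (reshaped to 2D) estimated by power iteration."""
    M = W.reshape(W.shape[0], -1)
    v = torch.randn(M.shape[1], device=M.device)
    v = v / v.norm()
    for _ in range(n_iter):
        u = M @ v
        u = u / u.norm().clamp_min(1e-12)
        v = M.t() @ u
        v = v / v.norm().clamp_min(1e-12)
    return u @ (M @ v)   # approximately sigma_max(M)

def spectral_penalty(model, tau=1.0):
    """Sum of squared hinges on the spectral norms, sum_l max(0, ||W_l|| - tau)^2."""
    pen = 0.0
    for p in model.parameters():
        if p.dim() >= 2:
            pen = pen + torch.relu(spectral_norm_estimate(p) - tau) ** 2
    return pen
```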

[Cisse et al., 2017, Miyato et al., 2018, Bartlett et al., 2017]...

Julien Mairal A Kernel Perspective for Regularizing NN 22/1

Page 37:

A kernel perspective: regularization

Another point of view: consider a classical CNN parametrized by θ, which lives in the RKHS:

min_{θ∈Rp}  (1/n) ∑_{i=1}^n L(yi, fθ(xi)) + (λ/2) ‖fθ‖²H.

Lower-bounds

‖f‖H = sup_{‖u‖H≤1} 〈f, u〉H ≥ sup_{u∈U} 〈f, u〉H   for U ⊆ BH(1).

We design a set U that leads to a tractable approximation, but it requires some knowledge about the properties of H, Φ.

Julien Mairal A Kernel Perspective for Regularizing NN 23/1

Page 38:

A kernel perspective: regularization

Adversarial penalty

We know that Φ is non-expansive and f(x) = 〈f,Φ(x)〉. Then,

U = {Φ(x+ δ)− Φ(x) : x ∈ X , ‖δ‖2 ≤ 1}

leads to λ‖f‖²δ = sup_{x∈X , ‖δ‖2≤λ} f(x+ δ)− f(x).

The resulting strategy is related to adversarial regularization (but it is decoupled from the loss term and does not use labels).

min_{θ∈Rp}  (1/n) ∑_{i=1}^n L(yi, fθ(xi)) + sup_{x∈X , ‖δ‖2≤λ} fθ(x+ δ)− fθ(x).
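A minimal sketch of this penalty, approximating the inner sup over δ with a few projected gradient steps on a batch of (possibly unlabeled) inputs; the scalar output of f, the step size, and the number of steps are illustrative assumptions:

```python
import torch

def adversarial_norm_penalty(f, x, eps=1.0, steps=5):
    """Approximate sup over ||delta||_2 ≤ eps of f(x + delta) - f(x); f returns one scalar per example."""
    step_size = 2.5 * eps / steps
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        gain = (f(x + delta) - f(x)).sum()
        grad, = torch.autograd.grad(gain, delta)
        with torch.no_grad():
            g = grad.flatten(1)
            g = g / g.norm(dim=1, keepdim=True).clamp_min(1e-12)   # l2-normalized ascent direction
            delta += step_size * g.view_as(delta)
            n = delta.flatten(1).norm(dim=1, keepdim=True).clamp_min(1e-12)
            delta *= torch.clamp(eps / n, max=1.0).view(-1, *([1] * (x.dim() - 1)))  # project onto l2 ball
    return (f(x + delta.detach()) - f(x)).mean()
```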

[Madry et al., 2018]

Julien Mairal A Kernel Perspective for Regularizing NN 24/1

Page 39:

A kernel perspective: regularization

Adversarial penalty

We know that Φ is non-expansive and f(x) = 〈f,Φ(x)〉. Then,

U = {Φ(x+ δ)− Φ(x) : x ∈ X , ‖δ‖2 ≤ 1}

leads to λ‖f‖²δ = sup_{x∈X , ‖δ‖2≤λ} f(x+ δ)− f(x).

The resulting strategy is related to adversarial regularization (but it is decoupled from the loss term and does not use labels). Compare with adversarial regularization:

min_{θ∈Rp}  (1/n) ∑_{i=1}^n sup_{‖δ‖2≤λ} L(yi, fθ(xi + δ)).

[Madry et al., 2018]

Julien Mairal A Kernel Perspective for Regularizing NN 24/1

Page 40:

A kernel perspective: regularization

Gradient penalties

We know that Φ is non-expansive and f(x) = 〈f,Φ(x)〉. Then,

U = {Φ(x+ δ)− Φ(x) : x ∈ X , ‖δ‖2 ≤ 1}

leads to ‖∇f‖ = sup_{x∈X} ‖∇f(x)‖2.

Related penalties have been used to stabilize the training of GANs, and gradients of the loss function have been used to improve robustness.
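A minimal sketch of such a gradient penalty, assuming f returns one scalar per example (e.g., a chosen logit or a margin):

```python
import torch

def gradient_penalty(f, x):
    """Average of ||grad_x f(x)||_2^2 over a batch, differentiable w.r.t. the model parameters."""
    x = x.detach().clone().requires_grad_(True)
    out = f(x).sum()
    grad, = torch.autograd.grad(out, x, create_graph=True)  # keep the graph so the penalty can be trained
    return grad.flatten(1).norm(dim=1).pow(2).mean()
```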

[Gulrajani et al., 2017, Roth et al., 2017, 2018, Drucker and Le Cun, 1991, Lyu et al., 2015, Simon-Gabriel et al., 2018]

Julien Mairal A Kernel Perspective for Regularizing NN 25/1

Page 41:

A kernel perspective: regularization

Adversarial deformation penalties

We know that Φ is stable to deformations and f(x) = 〈f,Φ(x)〉. Then,

U = {Φ(Lτx)− Φ(x) : x ∈ X , τ small deformation}

leads to ‖f‖²τ = sup_{x∈X , τ small deformation} f(Lτx)− f(x).

This is related to data augmentation and tangent propagation.
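A crude sketch of such a penalty, replacing the sup over τ with one random small affine deformation per batch; the deformation family, the magnitudes, and the squared form are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def random_small_affine(x, max_shift=0.05, max_rot_deg=5.0):
    """Apply a random small rotation + translation L_tau to a batch of images (N, C, H, W)."""
    n, dev = x.shape[0], x.device
    ang = torch.empty(n, device=dev).uniform_(-max_rot_deg, max_rot_deg) * torch.pi / 180
    shift = torch.empty(n, 2, device=dev).uniform_(-max_shift, max_shift)
    theta = torch.zeros(n, 2, 3, device=dev)
    theta[:, 0, 0], theta[:, 0, 1], theta[:, 0, 2] = ang.cos(), -ang.sin(), shift[:, 0]
    theta[:, 1, 0], theta[:, 1, 1], theta[:, 1, 2] = ang.sin(), ang.cos(), shift[:, 1]
    grid = F.affine_grid(theta, list(x.shape), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def deformation_penalty(f, x):
    """Penalize (f(L_tau x) - f(x))^2 for a random small tau; f returns one scalar per example."""
    return (f(random_small_affine(x)) - f(x)).pow(2).mean()
```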

[Engstrom et al., 2017, Simard et al., 1998]

Julien Mairal A Kernel Perspective for Regularizing NN 26/1

Page 42:

Experiments with Few labeled Samples

Table: Accuracies on CIFAR10 with 1 000 examples for standard architectures VGG-11 and ResNet-18. With / without data augmentation.

Method                 | 1k VGG-11      | 1k ResNet-18
No weight decay        | 50.70 / 43.75  | 45.23 / 37.12
Weight decay           | 51.32 / 43.95  | 44.85 / 37.09
SN projection          | 54.14 / 46.70  | 47.12 / 37.28
PGD-ℓ2                 | 51.25 / 44.40  | 45.80 / 41.87
grad-ℓ2                | 55.19 / 43.88  | 49.30 / 44.65
‖f‖²δ penalty          | 51.41 / 45.07  | 48.73 / 43.72
‖∇f‖² penalty          | 54.80 / 46.37  | 48.99 / 44.97
PGD-ℓ2 + SN proj       | 54.19 / 46.66  | 47.47 / 41.25
grad-ℓ2 + SN proj      | 55.32 / 46.88  | 48.73 / 42.78
‖f‖²δ + SN proj        | 54.02 / 46.72  | 48.12 / 43.56
‖∇f‖² + SN proj        | 55.24 / 46.80  | 49.06 / 44.92

Julien Mairal A Kernel Perspective for Regularizing NN 27/1

Page 43:

Experiments with Few labeled Samples

Table: Accuracies with 300 or 1 000 examples from MNIST, using deformations. (∗) indicates that random deformations were included as training examples.

Method                        | 300 VGG | 1k VGG
Weight decay                  | 89.32   | 94.08
SN projection                 | 90.69   | 95.01
grad-ℓ2                       | 93.63   | 96.67
‖f‖²δ penalty                 | 94.17   | 96.99
‖∇f‖² penalty                 | 94.08   | 96.82
Weight decay (∗)              | 92.41   | 95.64
grad-ℓ2 (∗)                   | 95.05   | 97.48
‖Dτf‖² penalty                | 94.18   | 96.98
‖f‖²τ penalty                 | 94.42   | 97.13
‖f‖²τ + ‖∇f‖²                 | 94.75   | 97.40
‖f‖²τ + ‖f‖²δ                 | 95.23   | 97.66
‖f‖²τ + ‖f‖²δ (∗)             | 95.53   | 97.56
‖f‖²τ + ‖f‖²δ + SN proj       | 95.20   | 97.60
‖f‖²τ + ‖f‖²δ + SN proj (∗)   | 95.40   | 97.77

Julien Mairal A Kernel Perspective for Regularizing NN 28/1

Page 44:

Experiments with Few labeled Samples

Table: AUROC50 for protein homology detection tasks using a CNN, with or without data augmentation (DA).

Method              | No DA | DA
No weight decay     | 0.446 | 0.500
Weight decay        | 0.501 | 0.546
SN proj             | 0.591 | 0.632
PGD-ℓ2              | 0.575 | 0.595
grad-ℓ2             | 0.540 | 0.552
‖f‖²δ               | 0.600 | 0.608
‖∇f‖²               | 0.585 | 0.611
PGD-ℓ2 + SN proj    | 0.596 | 0.627
grad-ℓ2 + SN proj   | 0.592 | 0.624
‖f‖²δ + SN proj     | 0.630 | 0.644
‖∇f‖² + SN proj     | 0.603 | 0.625

Note: statistical tests have been conducted for all of these experiments (see paper).

Julien Mairal A Kernel Perspective for Regularizing NN 29/1


Page 46:

Adversarial Robustness: Trade-offs

[Figure: two panels of adversarial test accuracy vs. standard test accuracy for ℓ2 perturbations, with εtest = 0.1 (left) and εtest = 1.0 (right); methods: PGD-ℓ2, grad-ℓ2, ‖f‖²δ, ‖∇f‖², PGD-ℓ2 + SN proj, SN proj, SN pen (SVD), clean.]

Figure: Robustness trade-off curves of different regularization methods for VGG11 on CIFAR10. Each plot shows test accuracy vs adversarial test accuracy. Different points on a curve correspond to training with different regularization strengths.

Julien Mairal A Kernel Perspective for Regularizing NN 30/1

Page 47:

Conclusions from this work on regularization

What the kernel perspective brings us

gives a unified perspective on many regularization principles.

useful both for generalization and robustness.

related to robust optimization.

Future work

regularization based on kernel approximations.

semi-supervised learning to exploit unlabeled data.

relation with implicit regularization.

Julien Mairal A Kernel Perspective for Regularizing NN 31/1

Page 48:

Invariance and Stability to Deformations

(probably for another time)

Julien Mairal A Kernel Perspective for Regularizing NN 32/1

Page 49:

A signal processing perspective
plus a bit of harmonic analysis

consider images defined on a continuous domain Ω = Rd.

τ : Ω→ Ω: C1-diffeomorphism.

Lτx(u) = x(u− τ(u)): action operator.

much richer group of transformations than translations.

[Figure: invariance to translations (two-dimensional group R2) vs. translations and deformations — patterns are translated and deformed.]

[Mallat, 2012, Allassonniere, Amit, and Trouve, 2007, Trouve and Younes, 2005]...

Julien Mairal A Kernel Perspective for Regularizing NN 33/1

Page 50:

A signal processing perspective
plus a bit of harmonic analysis

consider images defined on a continuous domain Ω = Rd.

τ : Ω→ Ω: C1-diffeomorphism.

Lτx(u) = x(u− τ(u)): action operator.

much richer group of transformations than translations.

[Figure: invariance to translations (two-dimensional group R2) vs. translations and deformations — patterns are translated and deformed.]

relation with deep convolutional representations

stability to deformations studied for wavelet-based scattering transform.

[Mallat, 2012, Bruna and Mallat, 2013, Sifre and Mallat, 2013]...

Julien Mairal A Kernel Perspective for Regularizing NN 33/1

Page 51:

A signal processing perspective
plus a bit of harmonic analysis

consider images defined on a continuous domain Ω = Rd.

τ : Ω→ Ω: C1-diffeomorphism.

Lτx(u) = x(u− τ(u)): action operator.

much richer group of transformations than translations.

Definition of stability

Representation Φ(·) is stable [Mallat, 2012] if:

‖Φ(Lτx)− Φ(x)‖ ≤ (C1‖∇τ‖∞ + C2‖τ‖∞)‖x‖.

‖∇τ‖∞ = supu ‖∇τ(u)‖ controls deformation.

‖τ‖∞ = supu |τ(u)| controls translation.

C2 → 0: translation invariance.

Julien Mairal A Kernel Perspective for Regularizing NN 33/1

Page 52:

Construction of the RKHS for continuous signals

[Figure: recap of the one-layer construction — patch extraction Pkxk–1(v) ∈ Pk, kernel mapping MkPkxk–1(v) = ϕk(Pkxk–1(v)) ∈ Hk, and linear pooling xk := AkMkPkxk–1 : Ω → Hk.]

Julien Mairal A Kernel Perspective for Regularizing NN 34/1

Page 53:

Patch extraction operator Pk

Pkxk–1(u) := (v ∈ Sk ↦ xk–1(u+ v)) ∈ Pk = H_{k–1}^{Sk}.

[Figure: patch extraction — xk–1 : Ω → Hk–1, with Pkxk–1(v) ∈ Pk.]

Sk: patch shape, e.g. box.

Pk is linear, and preserves the norm: ‖Pkxk–1‖ = ‖xk–1‖.
(Norm of a map: ‖x‖² = ∫_Ω ‖x(u)‖² du < ∞ for x in L2(Ω,H).)

Julien Mairal A Kernel Perspective for Regularizing NN 35/1

Page 54:

Non-linear pointwise mapping operator Mk

MkPkxk–1(u) := ϕk(Pkxk–1(u)) ∈ Hk.

[Figure: non-linear mapping — xk–1 : Ω → Hk–1, Pkxk–1(v) ∈ Pk, and MkPkxk–1(v) = ϕk(Pkxk–1(v)) ∈ Hk, so MkPkxk–1 : Ω → Hk.]

Julien Mairal A Kernel Perspective for Regularizing NN 36/1

Page 55:

Non-linear pointwise mapping operator Mk

MkPkxk–1(u) := ϕk(Pkxk–1(u)) ∈ Hk.

ϕk : Pk → Hk pointwise non-linearity on patches.

We assume non-expansivity

‖ϕk(z)‖ ≤ ‖z‖ and ‖ϕk(z)− ϕk(z′)‖ ≤ ‖z − z′‖.

Mk then satisfies, for x, x′ ∈ L2(Ω,Pk)

‖Mkx‖ ≤ ‖x‖ and ‖Mkx−Mkx′‖ ≤ ‖x− x′‖.

Julien Mairal A Kernel Perspective for Regularizing NN 37/1

Page 56:

ϕk from kernels

Kernel mapping of homogeneous dot-product kernels:

Kk(z, z′) = ‖z‖ ‖z′‖ κk(〈z, z′〉 / (‖z‖ ‖z′‖)) = 〈ϕk(z), ϕk(z′)〉.

κk(u) = ∑_{j=0}^∞ bj u^j with bj ≥ 0, κk(1) = 1.

‖ϕk(z)‖ = Kk(z, z)^{1/2} = ‖z‖ (norm preservation).

‖ϕk(z)− ϕk(z′)‖ ≤ ‖z − z′‖ if κ′k(1) ≤ 1 (non-expansiveness).

Examples

κexp(〈z, z′〉) = e^{〈z,z′〉−1} = e^{−½‖z−z′‖²} (if ‖z‖ = ‖z′‖ = 1).

κinv-poly(〈z, z′〉) = 1/(2−〈z, z′〉).
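For instance, κexp gives the following kernel between batches of patches (a small sketch; the clamping is only a numerical safeguard):

```python
import torch

def k_exp(z, zp, eps=1e-8):
    """K(z, z') = ||z|| ||z'|| exp(⟨z, z'⟩/(||z|| ||z'||) - 1) for rows of z (n, d) and zp (m, d)."""
    nz = z.norm(dim=1, keepdim=True).clamp_min(eps)     # (n, 1)
    nzp = zp.norm(dim=1, keepdim=True).clamp_min(eps)   # (m, 1)
    cos = (z @ zp.t()) / (nz * nzp.t())                 # cosine similarities, (n, m)
    return nz * nzp.t() * torch.exp(cos - 1.0)

# norm preservation sanity check: K(z, z) has ||z||^2 on its diagonal
z = torch.randn(4, 10)
assert torch.allclose(k_exp(z, z).diagonal(), (z * z).sum(dim=1), atol=1e-5)
```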

[Schoenberg, 1942, Scholkopf, 1997, Smola et al., 2001, Cho and Saul, 2010, Zhang et al., 2016, 2017, Daniely et al., 2016, Bach, 2017, Mairal, 2016]...

Julien Mairal A Kernel Perspective for Regularizing NN 38/1


Page 58:

Pooling operator Ak

xk(u) = AkMkPkxk–1(u) = ∫_{Rd} hσk(u− v) MkPkxk–1(v) dv ∈ Hk.

[Figure: pooling — xk–1 : Ω → Hk–1, MkPkxk–1 : Ω → Hk, and linear pooling xk := AkMkPkxk–1 : Ω → Hk, with xk(w) = AkMkPkxk–1(w) ∈ Hk.]

Julien Mairal A Kernel Perspective for Regularizing NN 39/1

Page 59:

Pooling operator Ak

xk(u) = AkMkPkxk–1(u) = ∫_{Rd} hσk(u− v) MkPkxk–1(v) dv ∈ Hk.

hσk : pooling filter at scale σk.

hσk(u) := σk^{−d} h(u/σk) with h(u) Gaussian.

linear, non-expansive operator: ‖Ak‖ ≤ 1 (operator norm).

Julien Mairal A Kernel Perspective for Regularizing NN 39/1

Page 60:

Recap: Pk, Mk, Ak

[Figure: recap of the three operators — patch extraction Pk, kernel mapping Mk, linear pooling Ak (same diagram as before).]

Julien Mairal A Kernel Perspective for Regularizing NN 40/1

Page 61:

Invariance, definitions

τ : Ω→ Ω: C1-diffeomorphism with Ω = Rd.

Lτx(u) = x(u− τ(u)): action operator.

Much richer group of transformations than translations.

[Figure: invariance to translations (two-dimensional group R2) vs. translations and deformations — patterns are translated and deformed.]

[Mallat, 2012, Bruna and Mallat, 2013, Sifre and Mallat, 2013]...

Julien Mairal A Kernel Perspective for Regularizing NN 41/1

Page 62:

Invariance, definitions

τ : Ω→ Ω: C1-diffeomorphism with Ω = Rd.

Lτx(u) = x(u− τ(u)): action operator.

Much richer group of transformations than translations.

Definition of stability

Representation Φ(·) is stable [Mallat, 2012] if:

‖Φ(Lτx)− Φ(x)‖ ≤ (C1‖∇τ‖∞ + C2‖τ‖∞)‖x‖.

‖∇τ‖∞ = supu ‖∇τ(u)‖ controls deformation.

‖τ‖∞ = supu |τ(u)| controls translation.

C2 → 0: translation invariance.

[Mallat, 2012, Bruna and Mallat, 2013, Sifre and Mallat, 2013]...

Julien Mairal A Kernel Perspective for Regularizing NN 41/1

Page 63:

Warmup: translation invariance

Representation

Φn(x) := AnMnPnAn–1Mn–1Pn–1 · · · A1M1P1A0x.

How to achieve translation invariance?

Translation: Lcx(u) = x(u− c).

Equivariance: all operators commute with the translation Lc.

‖Φn(Lcx)− Φn(x)‖ = ‖LcΦn(x)− Φn(x)‖ ≤ ‖LcAn −An‖ · ‖MnPnΦn–1(x)‖ ≤ ‖LcAn −An‖ ‖x‖.

Mallat [2012]: ‖LτAn −An‖ ≤ (C2/σn) ‖τ‖∞ (operator norm).

Scale σn of the last layer controls translation invariance.

Julien Mairal A Kernel Perspective for Regularizing NN 42/1


Page 66:

Warmup: translation invariance

Representation

Φn(x) := AnMnPnAn–1Mn–1Pn–1 · · · A1M1P1A0x.

How to achieve translation invariance?

Translation: Lcx(u) = x(u− c).

Equivariance: all operators commute with the translation Lc.

‖Φn(Lcx)− Φn(x)‖ = ‖LcΦn(x)− Φn(x)‖ ≤ ‖LcAn −An‖ · ‖MnPnΦn–1(x)‖ ≤ ‖LcAn −An‖ ‖x‖.

Mallat [2012]: ‖LcAn −An‖ ≤ (C2/σn) c (operator norm).

Scale σn of the last layer controls translation invariance.

Julien Mairal A Kernel Perspective for Regularizing NN 42/1

Page 67:

Stability to deformations

Representation

Φn(x) := AnMnPnAn–1Mn–1Pn–1 · · · A1M1P1A0x.

How to achieve stability to deformations?

Patch extraction Pk and pooling Ak do not commute with Lτ !

‖[Ak, Lτ ]‖ ≤ C1‖∇τ‖∞ [from Mallat, 2012].

But: [Pk, Lτ ] is unstable at high frequencies!

Adapt to current layer resolution, patch size controlled by σk–1:

‖[PkAk–1, Lτ ]‖ ≤ C1,κ ‖∇τ‖∞   with   sup_{u∈Sk} |u| ≤ κ σk–1.

C1,κ grows as κ^{d+1} =⇒ more stable with small patches (e.g., 3×3, VGG et al.).

Julien Mairal A Kernel Perspective for Regularizing NN 43/1


Page 73:

Stability to deformations: final result

Theorem

If ‖∇τ‖∞ ≤ 1/2,

‖Φn(Lτx)− Φn(x)‖ ≤ ( C1,κ (n+ 1) ‖∇τ‖∞ + (C2/σn) ‖τ‖∞ ) ‖x‖.

translation invariance: large σn.

stability: small patch sizes.

signal preservation: subsampling factor ≈ patch size.

=⇒ needs several layers.

requires additional discussion to make stability non-trivial.

related work on stability [Wiatowski and Bolcskei, 2017]

Julien Mairal A Kernel Perspective for Regularizing NN 44/1


Page 75:

Beyond the translation group

Can we achieve invariance to other groups?

Group action: Lgx(u) = x(g−1u) (e.g., rotations, reflections).

Feature maps x(u) defined on u ∈ G (G: locally compact group).

Recipe: Equivariant inner layers + global pooling in last layer

Patch extraction:

Px(u) = (x(uv))v∈S .

Non-linear mapping: equivariant because pointwise!

Pooling (µ: left-invariant Haar measure):

Ax(u) = ∫_G x(uv) h(v) dµ(v) = ∫_G x(v) h(u^{−1}v) dµ(v).

related work [Sifre and Mallat, 2013, Cohen and Welling, 2016, Raj et al., 2016]...

Julien Mairal A Kernel Perspective for Regularizing NN 45/1


Page 77:

Group invariance and stability

Previous construction is similar to Cohen and Welling [2016] for CNNs.

A case of interest: the roto-translation group

G = R2 ⋊ SO(2) (mix of translations and rotations).

Stability with respect to the translation group.

Global invariance to rotations (only global pooling at final layer).

Inner layers: only pool on the translation group.
Last layer: global pooling on rotations.
Cohen and Welling [2016]: pooling on rotations in inner layers hurts performance on Rotated MNIST.

Julien Mairal A Kernel Perspective for Regularizing NN 46/1

Page 78:

Discretization and signal preservation: example in 1D

Discrete signal xk in `2(Z, Hk) vs continuous ones xk in L2(R,Hk).

xk: subsampling factor sk after pooling with scale σk ≈ sk:

xk[n] = AkMkPkxk–1[nsk].

Claim: We can recover xk−1 from xk if factor sk ≤ patch size.

How? Recover patches with linear functions (contained in Hk)

〈fw, MkPkxk−1(u)〉 = fw(Pkxk−1(u)) = 〈w, Pkxk−1(u)〉,

and Pkxk−1(u) = ∑_{w∈B} 〈fw, MkPkxk−1(u)〉 w.

Warning: no claim that recovery is practical and/or stable.

Julien Mairal A Kernel Perspective for Regularizing NN 47/1


Page 82:

Discretization and signal preservation: example in 1D

[Figure: 1D pipeline — xk−1 → patch extraction Pkxk−1(u) ∈ Pk → dot-product kernel mapping MkPkxk−1 → linear pooling AkMkPkxk−1 → downsampling → xk; recovery with linear measurements gives Akxk−1, and deconvolution recovers xk−1.]

Julien Mairal A Kernel Perspective for Regularizing NN 48/1

Page 83:

RKHS of patch kernels Kk

Kk(z, z′) = ‖z‖ ‖z′‖ κ(〈z, z′〉 / (‖z‖ ‖z′‖)),   κ(u) = ∑_{j=0}^∞ bj u^j.

What does the RKHS contain?

RKHS contains homogeneous functions: f : z ↦ ‖z‖ σ(〈g, z〉/‖z‖).

Smooth activations: σ(u) = ∑_{j=0}^∞ aj u^j with aj ≥ 0.

Norm: ‖f‖²Hk ≤ C²σ(‖g‖²) = ∑_{j=0}^∞ (a²j/bj) ‖g‖^{2j} < ∞.

Homogeneous version of [Zhang et al., 2016, 2017]

Julien Mairal A Kernel Perspective for Regularizing NN 49/1


Page 86:

RKHS of patch kernels Kk

Examples:

σ(u) = u (linear): C²σ(λ²) = O(λ²).

σ(u) = u^p (polynomial): C²σ(λ²) = O(λ^{2p}).

σ ≈ sin, sigmoid, smooth ReLU: C²σ(λ²) = O(e^{cλ²}).

[Figure: left, x ↦ σ(x) for ReLU vs. sReLU; right, x ↦ |x| σ(wx/|x|) for sReLU with w ∈ {0, 0.5, 1, 2}, compared with ReLU (w = 1).]

Julien Mairal A Kernel Perspective for Regularizing NN 50/1

Page 87:

Constructing a CNN in the RKHS HK

Some CNNs live in the RKHS: “linearization” principle

f(x) = σk(Wkσk–1(Wk–1 . . . σ2(W2σ1(W1x)) . . .)) = 〈f,Φ(x)〉H.

Consider a CNN with filters $W_k^{ij}(u)$, $u \in S_k$.

k: layer; i: index of filter; j: index of input channel.

“Smooth homogeneous” activations σ.

The CNN can be constructed hierarchically in HK.

Norm:

$\|f_\sigma\|^2 \;\le\; \|W_{n+1}\|_2^2 \; C_\sigma^2\!\big(\|W_n\|_2^2 \; C_\sigma^2\big(\|W_{n-1}\|_2^2 \; C_\sigma^2(\cdots)\big)\big).$

For linear layers (σ(u) = u), this reduces to the product of spectral norms:


$\|f_\sigma\|^2 \;\le\; \|W_{n+1}\|_2^2 \cdot \|W_n\|_2^2 \cdot \|W_{n-1}\|_2^2 \cdots \|W_1\|_2^2.$
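As a rough illustration, written for this transcript, the sketch below evaluates both bounds for concrete weight matrices: spectral norms are computed with numpy, the weights are random stand-ins, and C²σ is modeled as an exponential (the smooth-ReLU-like regime of the earlier slide); all of these choices are assumptions, not part of the talk.

import numpy as np

# Evaluate the two norm bounds on this slide for concrete weight matrices:
#   general: ||f||^2 <= ||W_{n+1}||_2^2 * C_sigma^2(||W_n||_2^2 * C_sigma^2(...))
#   linear : ||f||^2 <= prod_i ||W_i||_2^2

def spectral_norm_sq(W):
    return np.linalg.norm(W, ord=2) ** 2        # largest singular value, squared

def C_sigma_sq(lambda_sq, c=0.5):
    return np.exp(c * lambda_sq)                # exp(c * lambda^2) regime (modeling choice)

def norm_bound(weights, last_layer):
    """weights = [W_1, ..., W_n] (inner layers), last_layer = W_{n+1}."""
    bound = spectral_norm_sq(weights[0])
    for W in weights[1:]:
        bound = spectral_norm_sq(W) * C_sigma_sq(bound)
    return spectral_norm_sq(last_layer) * C_sigma_sq(bound)

rng = np.random.default_rng(0)
Ws = [0.1 * rng.standard_normal((32, 32)) for _ in range(3)]
w_out = 0.1 * rng.standard_normal((1, 32))
print("general bound:", norm_bound(Ws, w_out))
print("linear-layer bound:",
      spectral_norm_sq(w_out) * np.prod([spectral_norm_sq(W) for W in Ws]))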


Link with generalization

Direct application of classical generalization bounds

Simple bound on Rademacher complexity for linear/kernel methods:

$F_B = \{ f \in \mathcal{H}_K : \|f\| \le B \} \;\Longrightarrow\; \mathrm{Rad}_N(F_B) \le O\!\left(\frac{BR}{\sqrt{N}}\right).$

Leads to a margin bound $O\!\left(\frac{\|f_N\|\, R}{\gamma \sqrt{N}}\right)$ for a learned CNN $f_N$ with margin (confidence) γ > 0.

Related to recent generalization bounds for neural networks based on the product of spectral norms [e.g., Bartlett et al., 2017, Neyshabur et al., 2018].

[see, e.g., Boucheron et al., 2005, Shalev-Shwartz and Ben-David, 2014]...
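For completeness, here is the standard route from the Rademacher bound to the margin bound above, stated up to constants (see, e.g., Boucheron et al., 2005); R is taken to bound the feature norms ‖Φ(xi)‖, which is the usual convention rather than something defined on the slide. With probability at least 1 − δ, every f with ‖f‖ ≤ B satisfies

\[
\Pr\big(y f(x) \le 0\big) \;\le\; \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{y_i f(x_i) < \gamma\} \;+\; \frac{2}{\gamma}\,\mathrm{Rad}_N(F_B) \;+\; O\!\left(\sqrt{\frac{\log(1/\delta)}{N}}\right),
\]

and plugging in $\mathrm{Rad}_N(F_B) \le O(BR/\sqrt{N})$ with $B = \|f_N\|$ yields the $O(\|f_N\| R / (\gamma\sqrt{N}))$ margin term.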


Conclusions from the work on invariance and stability

Study of generic properties of signal representation

Deformation stability with small patches, adapted to resolution.

Signal preservation when subsampling ≤ patch size.

Group invariance by changing patch extraction and pooling.

Applies to learned models

Same quantity ‖f‖ controls stability and generalization.

“Higher capacity” is needed to discriminate small deformations.

Questions:

How does SGD control capacity in CNNs?

What about networks with no pooling layers? ResNet?


References I

Stéphanie Allassonnière, Yali Amit, and Alain Trouvé. Towards a coherent statistical framework for dense deformable template estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(1):3–29, 2007.

Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research (JMLR), 18:1–38, 2017.

Peter Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.

Alberto Bietti and Julien Mairal. Group invariance, stability to deformations, and complexity of deep convolutional representations. Journal of Machine Learning Research, 2019.

Alberto Bietti, Grégoire Mialon, Dexiong Chen, and Julien Mairal. A kernel perspective for regularizing deep neural networks. arXiv, 2019.


References II

Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(8):1872–1886, 2013.

Y. Cho and L. K. Saul. Large-margin classification in infinite neural networks. Neural Computation, 22(10):2678–2697, 2010.

Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the International Conference on Machine Learning (ICML), 2017.

Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning (ICML), 2016.


References III

Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253–2261, 2016.

Harris Drucker and Yann Le Cun. Double backpropagation increasing generalization performance. In International Joint Conference on Neural Networks (IJCNN), 1991.

Logan Engstrom, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv preprint arXiv:1712.02779, 2017.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NIPS), 2017.

Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.


References IV

Chunchuan Lyu, Kaizhu Huang, and Hai-Ning Liang. A unified gradient regularization family for adversarial examples. In IEEE International Conference on Data Mining (ICDM), 2015.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

J. Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2016.

Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.


References V

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. 2017.

Anant Raj, Abhishek Kumar, Youssef Mroueh, P. Thomas Fletcher, and Bernhard Schölkopf. Local group invariant representations via orbit embeddings. arXiv preprint arXiv:1612.01988, 2016.

Kevin Roth, Aurélien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems (NIPS), 2017.

Kevin Roth, Aurélien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Adversarially robust training through structured gradient regularization. arXiv preprint arXiv:1805.08736, 2018.

I. Schoenberg. Positive definite functions on spheres. Duke Math. J., 1942.


References VI

B. Schölkopf. Support Vector Learning. PhD thesis, Technische Universität Berlin, 1997.

Bernhard Schölkopf and Alexander J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.

Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

John Shawe-Taylor and Nello Cristianini. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2004.

Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

Patrice Y. Simard, Yann A. LeCun, John S. Denker, and Bernard Victorri. Transformation invariance in pattern recognition–tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, pages 239–274. Springer, 1998.


References VII

Carl-Johann Simon-Gabriel, Yann Ollivier, Bernhard Schölkopf, Léon Bottou, and David Lopez-Paz. Adversarial vulnerability of neural networks increases with input dimension. arXiv preprint arXiv:1802.01421, 2018.

Alex J. Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the International Conference on Machine Learning (ICML), 2000.

Alex J. Smola, Zoltán L. Óvári, and Robert C. Williamson. Regularization with dot-product kernels. In Advances in Neural Information Processing Systems, pages 308–314, 2001.

Alain Trouvé and Laurent Younes. Local geometry of deformable templates. SIAM Journal on Mathematical Analysis, 37(1):17–59, 2005.

Thomas Wiatowski and Helmut Bölcskei. A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory, 2017.

C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2001.


References VIII

Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In International Conference on Machine Learning (ICML), 2008.

Y. Zhang, P. Liang, and M. J. Wainwright. Convexified convolutional neural networks. In International Conference on Machine Learning (ICML), 2017.

Yuchen Zhang, Jason D. Lee, and Michael I. Jordan. ℓ1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning (ICML), 2016.


ϕk from kernel approximations: CKNs [Mairal, 2016]

Approximate ϕk(z) by projection (Nyström approximation) onto

F = Span(ϕk(z1), . . . , ϕk(zp)).

Figure: Nyström approximation: the points ϕ(x) and ϕ(x′) in the Hilbert space H are projected onto the finite-dimensional subspace F.

[Williams and Seeger, 2001, Smola and Scholkopf, 2000, Zhang et al., 2008]...


Leads to tractable, p-dimensional representation ψk(z).

Norm is preserved, and projection is non-expansive:

$\|\psi_k(z) - \psi_k(z')\| = \|\Pi_k \varphi_k(z) - \Pi_k \varphi_k(z')\| \le \|\varphi_k(z) - \varphi_k(z')\| \le \|z - z'\|.$

Anchor points z1, . . . , zp (≈ filters) can be learned from data (K-means or backprop).
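A minimal numpy sketch of such a Nyström map, written for this transcript: ψ(z) = K_ZZ^(−1/2) kZ(z) is one standard way to parameterize the projection so that ⟨ψ(z), ψ(z′)⟩ equals the kernel between the projected points. The exponential dot-product kernel, the random anchor points (instead of K-means or backprop), and the small ridge term are our example choices.

import numpy as np

# Nystrom feature map psi(z) = K_ZZ^{-1/2} k_Z(z), i.e., coordinates of the
# projection onto F = Span(phi(z_1), ..., phi(z_p)) in an orthonormal basis.

def patch_kernel(Z, Zp, alpha=1.0, eps=1e-12):
    nz = np.linalg.norm(Z, axis=1, keepdims=True) + eps
    nzp = np.linalg.norm(Zp, axis=1, keepdims=True) + eps
    cos = (Z @ Zp.T) / (nz * nzp.T)
    return (nz * nzp.T) * np.exp(alpha * (cos - 1.0))      # homogeneous dot-product kernel

def nystrom_map(anchors, reg=1e-6):
    K_zz = patch_kernel(anchors, anchors)
    vals, vecs = np.linalg.eigh(K_zz + reg * np.eye(len(anchors)))
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T        # (K_ZZ + reg I)^{-1/2}
    return lambda Z: patch_kernel(Z, anchors) @ inv_sqrt    # psi: (n, d) -> (n, p)

rng = np.random.default_rng(0)
anchors = rng.standard_normal((16, 9))      # p = 16 anchor patches of dimension 9
psi = nystrom_map(anchors)

# The Gram matrix of psi approximates the exact kernel; the gap is the
# projection error (zero exactly when the new points lie in F).
Z = rng.standard_normal((5, 9))
print(np.abs(psi(Z) @ psi(Z).T - patch_kernel(Z, Z)).max())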


Convolutional kernel networks in practice.

Figure: one CKN layer in practice. Patches x, x′ of the input map I0 are mapped via the kernel trick to ϕ1(x), ϕ1(x′) in the Hilbert space H1, projected onto the subspace F1 to obtain ψ1(x), ψ1(x′), which form the intermediate map M1; linear pooling then produces the next map I1.


Discussion

The norm ‖Φ(x)‖ is of the same order as (or close to) ‖x‖.

The kernel representation is non-expansive but not contractive:

$\sup_{x, x' \in L^2(\Omega, \mathcal{H}_0)} \frac{\|\Phi(x) - \Phi(x')\|}{\|x - x'\|} = 1.$
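A quick numeric sanity check of this statement, written for this transcript with the exponential dot-product kernel used in the sketches above (for which κ(1) = 1): feature-space distances follow from the kernel alone, the ratio stays below 1 on random pairs, and it reaches 1 along the radial direction x′ = t·x.

import numpy as np

# ||Phi(x) - Phi(x')||^2 = K(x, x) + K(x', x') - 2 K(x, x') for the
# homogeneous dot-product kernel with kappa(u) = exp(u - 1).

def K(x, xp, eps=1e-12):
    nx, nxp = np.linalg.norm(x) + eps, np.linalg.norm(xp) + eps
    return nx * nxp * np.exp(x @ xp / (nx * nxp) - 1.0)

def ratio(x, xp):
    feat_dist_sq = K(x, x) + K(xp, xp) - 2 * K(x, xp)
    return np.sqrt(max(feat_dist_sq, 0.0)) / np.linalg.norm(x - xp)

rng = np.random.default_rng(0)
pairs = [(rng.standard_normal(10), rng.standard_normal(10)) for _ in range(1000)]
print("max ratio over random pairs:", max(ratio(x, xp) for x, xp in pairs))   # <= 1
x = rng.standard_normal(10)
print("ratio along the radial direction:", ratio(x, 1.01 * x))                # ~ 1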
