Transcript of "A Quest for a Universal Model for Signals: From Sparsity to ConvNets"
Yaniv Romano, The Electrical Engineering Department, Technion – Israel Institute of Technology
Joint work with
Jeremias Sulam Vardan Papyan Prof. Michael Elad
1
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP/2007-2013), ERC Grant Agreement ERC-SPARSE-320649
A Quest for a Universal Model for Signals:
From Sparsity to ConvNets
Local Sparsity

Signal = Dictionary (learned) × Sparse vector
Local Sparsity
Independent patch-processing
Local Sparsity
Independent patch-processing
Convolutional Sparse Coding
Independent patch-processing
Local Sparsity
Multi-Layer Convolutional Sparse Coding
Convolutional Sparse Coding
Independent patch-processing
Local Sparsity
Convolutional Sparse Coding
Independent patch-processing
Local Sparsity
Convolutional Neural Networks
Multi-Layer Convolutional Sparse Coding
Forward pass
Multi-Layer Convolutional Sparse Coding
Extension of the classical sparse theory to a multi-layer setting
The forward pass is a sparse-coding algorithm, serving our model
Convolutional Sparse Coding
Convolutional Neural Networks
Independent patch-processing
Local Sparsity
Our Story Begins with…
9
Image Denoising
Many (thousands of) image denoising algorithms can be cast as the minimization of an energy function of the form
10
Original Image 𝐗 + White Gaussian Noise 𝐄 = Noisy Image 𝐘

$$\min_{\mathbf{X}} \ \underbrace{\tfrac{1}{2}\|\mathbf{X}-\mathbf{Y}\|_2^2}_{\text{Relation to measurements}} + \underbrace{G(\mathbf{X})}_{\text{Prior or regularization}}$$
Leading Image Denoising Methods…
are built upon powerful patch-based local models. Popular local models: GMM, Sparse-Representation, Example-based, Low-rank, Field-of-Experts & Neural networks
11
Working locally allows us to learn the model,
e.g. dictionary learning
Why is it so Popular?

• It does come up in many applications
• Other inverse problems can be recast as iterative denoising [Zoran & Weiss '11] [Venkatakrishnan, Bouman & Wohlberg '13] [Brifman, Romano & Elad '16]
• In a recent work we show that a denoiser 𝑓(𝐗) can form a regularizer: Regularization by Denoising (RED) [Romano, Elad & Milanfar '16]
• It is the simplest inverse problem revealing the limitations of the model…
12
$$\min_{\mathbf{X}} \ \underbrace{\tfrac{1}{2}\|\mathbf{H}\mathbf{X}-\mathbf{Y}\|_2^2}_{\text{Relation to measurements}} + \underbrace{G(\mathbf{X})}_{\text{Prior}}$$

RED regularizer:

$$\min_{\mathbf{X}} \ \tfrac{1}{2}\|\mathbf{H}\mathbf{X}-\mathbf{Y}\|_2^2 + \frac{\rho}{2}\,\mathbf{X}^T\big(\mathbf{X}-f(\mathbf{X})\big)$$

Under simple conditions on 𝑓(𝐱):

$$\nabla_{\mathbf{X}} G(\mathbf{X}) = \mathbf{X} - f(\mathbf{X})$$
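To make the RED idea concrete, here is a minimal NumPy sketch (not from the talk) of minimizing the RED objective by gradient descent, using the fact that the regularizer's gradient is ρ(𝐗 − 𝑓(𝐗)). The median-filter denoiser, the identity degradation 𝐇, the step size, and the toy signal are all illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import median_filter

def red_gradient_descent(Y, H, f, rho=0.1, step=0.1, n_iters=200):
    """Minimize 0.5*||H X - Y||^2 + (rho/2) * X^T (X - f(X)) by gradient descent.

    Under the RED conditions on f, the regularizer's gradient is rho*(X - f(X)),
    so every step only applies H, H^T and the plug-in denoiser f."""
    X = H.T @ Y  # crude initialization
    for _ in range(n_iters):
        grad = H.T @ (H @ X - Y) + rho * (X - f(X))
        X = X - step * grad
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N = 256
    H = np.eye(N)                                       # denoising: H is the identity
    X_true = np.cumsum(rng.standard_normal(N)) / 10.0   # smooth-ish toy 1D signal
    Y = H @ X_true + 0.1 * rng.standard_normal(N)
    f = lambda x: median_filter(x, size=5)              # stand-in denoiser (illustrative)
    X_hat = red_gradient_descent(Y, H, f)
    print("input  MSE:", np.mean((Y - X_true) ** 2))
    print("output MSE:", np.mean((X_hat - X_true) ** 2))
```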
The Sparse-Land Model
13

• Assumes that every patch is a linear combination of a few columns, called atoms, taken from a matrix called a dictionary
• The operator 𝐑i extracts the i-th 𝑛-dimensional patch from 𝐗 ∈ ℝ^N: 𝐑i𝐗 = 𝛀𝛄i, where 𝛀 is the 𝑛 × 𝑚 dictionary (𝑚 > 𝑛) and 𝛄i is sparse
• Sparse coding:

$$(\mathbf{P}_0): \ \min_{\boldsymbol{\gamma}_i} \|\boldsymbol{\gamma}_i\|_0 \ \text{ s.t. } \ \mathbf{R}_i\mathbf{X} = \boldsymbol{\Omega}\boldsymbol{\gamma}_i$$
Patch Denoising
14

Given a noisy patch 𝐑i𝐘, solve

$$(\mathbf{P}_0^\epsilon): \ \hat{\boldsymbol{\gamma}}_i = \arg\min_{\boldsymbol{\gamma}_i} \|\boldsymbol{\gamma}_i\|_0 \ \text{ s.t. } \ \|\mathbf{R}_i\mathbf{Y} - \boldsymbol{\Omega}\boldsymbol{\gamma}_i\|_2 \le \epsilon$$

Clean patch: 𝛀𝛄̂i

𝐏0 and 𝐏0^ϵ are hard to solve; they are approximated by:
• Greedy methods such as Orthogonal Matching Pursuit (OMP) or Thresholding
• Convex relaxations such as Basis Pursuit (BP):

$$(\mathbf{P}_1^\epsilon): \ \min_{\boldsymbol{\gamma}_i} \|\boldsymbol{\gamma}_i\|_1 + \xi\|\mathbf{R}_i\mathbf{Y} - \boldsymbol{\Omega}\boldsymbol{\gamma}_i\|_2^2$$
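As a concrete illustration of the greedy route, here is a minimal OMP sketch for a single patch (a plain-NumPy toy, not the code behind the talk); the dictionary size, sparsity level, and stopping rule are assumptions.

```python
import numpy as np

def omp(Omega, y, k_max, eps=1e-6):
    """Orthogonal Matching Pursuit: greedily pick atoms of Omega (columns assumed
    l2-normalized) and re-fit the coefficients by least squares at every step."""
    residual = y.copy()
    support = []
    gamma = np.zeros(Omega.shape[1])
    for _ in range(k_max):
        if np.linalg.norm(residual) <= eps:
            break
        corr = Omega.T @ residual
        support.append(int(np.argmax(np.abs(corr))))
        coeffs, *_ = np.linalg.lstsq(Omega[:, support], y, rcond=None)
        gamma[:] = 0.0
        gamma[support] = coeffs
        residual = y - Omega @ gamma
    return gamma

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, k = 64, 128, 4                       # patch size, number of atoms, sparsity
    Omega = rng.standard_normal((n, m))
    Omega /= np.linalg.norm(Omega, axis=0)     # normalize the atoms
    gamma_true = np.zeros(m)
    gamma_true[rng.choice(m, k, replace=False)] = rng.standard_normal(k)
    y = Omega @ gamma_true + 0.01 * rng.standard_normal(n)
    gamma_hat = omp(Omega, y, k_max=k)
    print("recovered support:", np.flatnonzero(gamma_hat))
```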
Recall K-SVD Denoising [Elad & Aharon, ‘06]
15
Flow: Noisy Image → (starting from an Initial Dictionary) Denoise each patch using OMP → Update the dictionary using K-SVD → Reconstructed Image

• Despite its simplicity, this is a very well-performing algorithm
• We refer to this framework as "patch averaging"
• A modification of this method leads to state-of-the-art results [Mairal, Bach, Ponce, Sapiro, Zisserman '09]
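The overall "patch averaging" flow can be sketched as follows; this is a toy version under stated assumptions (a fixed dictionary rather than K-SVD updates, and a generic `sparse_code` routine, e.g. the OMP sketch above with a noise-dependent stopping rule).

```python
import numpy as np

def patch_average_denoise(Y, Omega, sparse_code, patch=8):
    """Toy patch-averaging denoiser for a 2D image Y: sparse-code every overlapping
    patch independently over Omega, put the patches back, and average the overlaps."""
    H, W = Y.shape
    out = np.zeros_like(Y, dtype=float)
    weight = np.zeros_like(Y, dtype=float)
    for i in range(H - patch + 1):
        for j in range(W - patch + 1):
            p = Y[i:i + patch, j:j + patch].ravel()
            dc = p.mean()                          # remove the patch mean (DC)
            gamma = sparse_code(Omega, p - dc)     # e.g. lambda Om, v: omp(Om, v, k_max=4)
            clean = Omega @ gamma + dc
            out[i:i + patch, j:j + patch] += clean.reshape(patch, patch)
            weight[i:i + patch, j:j + patch] += 1.0
    return out / weight
```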
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of…
16
What is ?
The Global-Local Gap: Efficient Independent Local Patch Processing vs. The Global Need to Model the Entire Image
𝐘
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
17
𝐑i𝐘 𝛀𝛄𝐢
What is ?
𝐗 𝐘
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
18
𝐑i𝐘 𝛀𝛄i
Orthogonal (when using the OMP)
What is ?
𝐗 𝐘
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
19
𝐑i𝐘 𝛀𝛄i
Orthogonal (when using the OMP)
The orthogonality is lost due to patch-averaging
What is ?
𝐗 𝐘
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
20
𝐑i𝐘 𝛀𝛄i
Orthogonal (when using the OMP)
What is ?
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
21
Denoise
What is ?
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
22
Denoise
Previous Result
Strengthen
What is ?
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
23
Denoise
Previous Result
Strengthen – Operate
What is ?
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
24
Denoise
Previous Result
Strengthen – Operate – Subtract
What is ?
25
Relation to Game Theory: Encourage overlapping patches to reach a consensus.
Relation to Graph Theory: Minimizing a Laplacian regularization functional:

$$\mathbf{X}^{k+1} = f(\mathbf{Y} + \rho\mathbf{X}^{k}) - \rho\mathbf{X}^{k}$$

$$\mathbf{X}^{*} = \arg\min_{\mathbf{X}} \|\mathbf{X} - f(\mathbf{Y})\|_2^2 + \rho\,\mathbf{X}^{T}\big(\mathbf{X} - f(\mathbf{X})\big)$$
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
Why boosting? Because it is guaranteed to be able to improve any denoiser
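To make the SOS update above concrete, here is a minimal sketch (not from the talk) of the Strengthen-Operate-Subtract iteration around a generic denoiser; the Gaussian-filter denoiser, ρ, and the toy image are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sos_boosting(Y, denoise, rho=1.0, n_iters=10):
    """SOS boosting: Strengthen the signal (Y + rho*X), Operate the denoiser,
    Subtract the previous estimate: X_{k+1} = f(Y + rho*X_k) - rho*X_k."""
    X = denoise(Y)                                # plain denoising as initialization
    for _ in range(n_iters):
        X = denoise(Y + rho * X) - rho * X
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.outer(np.sin(np.linspace(0, 3, 64)), np.cos(np.linspace(0, 3, 64)))
    noisy = clean + 0.2 * rng.standard_normal(clean.shape)
    f = lambda z: gaussian_filter(z, sigma=1.0)   # stand-in denoiser (illustrative)
    boosted = sos_boosting(noisy, f, rho=1.0, n_iters=5)
    print("plain   MSE:", np.mean((f(noisy) - clean) ** 2))
    print("boosted MSE:", np.mean((boosted - clean) ** 2))
```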
What is ?
26
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]
What is ?
27
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]
A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]
What is ?
28
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]
A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]
Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]
• Con-Patch
Self-Similarity feature
What is ?
29
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]
A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]
Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]
• Con-Patch
• RAISR (Upscaling)
Edge Features: Direction, Strength, Consistency
What is ?
30
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]
A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]
Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]
• Con-Patch
• RAISR (Upscaling)
Edge Features: Direction, Strength, Consistency
What is ?
31
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]
A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]
Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]
Enforcing the local model on the final patches (EPLL)
• Sparse EPLL [Sulam & Elad ‘15]
K-SVD EPLL Noisy
What is ?
32
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]
A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]
Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]
Enforcing the local model on the final patches (EPLL)
• Sparse EPLL [Sulam & Elad ‘15]
• Image Synthesis [Ren, Romano & Elad ‘17]
What is ?
33
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]
A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]
Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]
Enforcing the local model on the final patches (EPLL)
• Sparse EPLL [Sulam & Elad ‘15]
• Image Synthesis [Ren, Romano & Elad ‘17]
What is ?
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]
A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]
Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]
Enforcing the local model on the final patches (EPLL)
• Sparse EPLL [Sulam & Elad ‘15]
• Image Synthesis [Ren, Romano & Elad ‘17]
…
What is ?
34
• Many researchers kept revisiting this algorithm with a feeling that key features are still lacking
• Here is what WE thought of… Whitening the residual image [Romano & Elad ‘13]
SOS-Boosting [Romano & Elad ’15]
Exploiting self-similarities [Ram & Elad ‘13] [Romano, Protter & Elad ‘14]
A multi-scale treatment [Ophir, Lustig & Elad ‘11] [Sulam, Ophir & Elad ‘14] [Papyan & Elad ‘15]
Leveraging the context of the patch [Romano & Elad ‘16] [Romano, Isidoro & Milanfar ‘16]
Enforcing the local model on the final patches (EPLL)
• Sparse EPLL [Sulam & Elad ‘15]
• Image Synthesis [Ren, Romano & Elad ‘17]
…
What is ?
35
Missing a theoretical backbone!
Why can we use a local prior to solve a global problem?
Now, Our Story Takes a Surprising Turn
36
37
Convolutional Sparse Coding
[LeCun, Bottou, Bengio and Haffner ‘98] [Lewicki & Sejnowski ‘99] [Hashimoto & Kurata, ‘00] [Mørup, Schmidt & Hansen ’08]
[Zeiler, Krishnan, Taylor & Fergus ‘10] [Jarrett, Kavukcuoglu, Gregor, LeCun ‘11] [Heide, Heidrich & Wetzstein ‘15] [Gu, Zuo, Xie, Meng, Feng & Zhang ‘15] …
Working Locally Thinking Globally: Theoretical Guarantees for Convolutional Sparse Coding
Vardan Papyan, Jeremias Sulam and Michael Elad
Convolutional Dictionary Learning via Local Processing Vardan Papyan, Yaniv Romano, Jeremias Sulam, and Michael Elad
Intuitively…
38
𝐗
[Figure: 𝐗 written as a sum of shifted and scaled copies of the first filter and the second filter]
Convolutional Sparse Coding (CSC)
39
[Figure: the signal equals a sum of filters, each convolved with its own sparse feature map]
Convolutional Sparse Representation
40

$$\mathbf{X} = \mathbf{D}\boldsymbol{\Gamma}, \qquad \mathbf{R}_i\mathbf{X} = \boldsymbol{\Omega}\boldsymbol{\gamma}_i$$

• Each 𝑛-dimensional patch 𝐑i𝐗 equals the stripe-dictionary 𝛀 (of size 𝑛 × (2𝑛−1)𝑚, built from shifts of the local dictionary 𝐃L) times the stripe vector 𝛄i of length (2𝑛−1)𝑚
• Adjacent representations 𝛄i, 𝛄i+1 overlap, as they skip by 𝑚 items as we sweep through the patches of 𝐗
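To make the convolutional structure tangible, here is a small sketch (an assumption-laden toy that uses circular shifts for simplicity) building the global dictionary 𝐃 from 𝑚 local filters of length 𝑛, so that 𝐗 = 𝐃𝚪; extracting 𝑛 consecutive rows of 𝐃 then reveals the stripe structure described above.

```python
import numpy as np

def build_conv_dictionary(filters, N):
    """Build the global convolutional dictionary D (size N x N*m) whose columns are
    all circular shifts of the m local filters (each of length n, zero-padded to N)."""
    n, m = filters.shape
    D = np.zeros((N, N * m))
    for j in range(m):
        atom = np.zeros(N)
        atom[:n] = filters[:, j]
        for shift in range(N):
            D[:, shift * m + j] = np.roll(atom, shift)   # m columns per spatial location
    return D

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, N = 5, 2, 32
    filters = rng.standard_normal((n, m))
    D = build_conv_dictionary(filters, N)
    Gamma = np.zeros(N * m)
    Gamma[[7, 40, 55]] = [1.0, -2.0, 0.5]                # a sparse global representation
    X = D @ Gamma                                        # the resulting global signal
    print(D.shape, X.shape)
```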
CSC Relation to Our Story
• A clear global model: every patch has a sparse representation w.r.t. the same local dictionary 𝛀, just as we have assumed
• No notion of disagreement on the patch overlaps
• Related to the current common practice of patch averaging (𝐑i^T puts the patch 𝛀𝛄i back in the i-th location of the global vector):

$$\mathbf{X} = \mathbf{D}\boldsymbol{\Gamma} = \frac{1}{n}\sum_i \mathbf{R}_i^T\boldsymbol{\Omega}\boldsymbol{\gamma}_i$$

• What about the pursuit? "Patch averaging": independent sparse coding for each patch; CSC: should seek all the representations together
• Is there a bridge between the two? We'll come back to this later…
• What about the theory? Until recently, little was known regarding the theoretical aspects of CSC
41
Classical Sparse Theory (Noiseless)
42

$$(\mathbf{P}_0): \ \min_{\boldsymbol{\Gamma}} \|\boldsymbol{\Gamma}\|_0 \ \text{ s.t. } \ \mathbf{X} = \mathbf{D}\boldsymbol{\Gamma}$$

Definition (Mutual Coherence): $\mu(\mathbf{D}) = \max_{i \ne j} |\mathbf{d}_i^T\mathbf{d}_j|$

Theorem [Donoho & Elad '03]: For a signal 𝐗 = 𝐃𝚪, if $\|\boldsymbol{\Gamma}\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right)$ then this solution is necessarily the sparsest.

Theorem [Tropp '04], [Donoho & Elad '03]: OMP and BP are guaranteed to recover the true sparse code assuming that $\|\boldsymbol{\Gamma}\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right)$.
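For completeness, a tiny sketch (toy dimensions are assumptions) of computing μ(𝐃) and the resulting classical sparsity bound:

```python
import numpy as np

def mutual_coherence(D):
    """mu(D) = max_{i != j} |d_i^T d_j| over l2-normalized columns."""
    Dn = D / np.linalg.norm(D, axis=0)
    G = np.abs(Dn.T @ Dn)
    np.fill_diagonal(G, 0.0)
    return G.max()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D = rng.standard_normal((64, 128))
    mu = mutual_coherence(D)
    print("mu(D)          :", round(mu, 3))
    print("sparsity bound :", 0.5 * (1 + 1 / mu))   # ||Gamma||_0 must stay below this
```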
CSC: The Need for a Theoretical Study
43

• Assuming that 𝑚 = 2 and 𝑛 = 64, we have that [Welch '74] μ(𝐃) ≥ 0.063
• As a result, uniqueness and success of pursuits is guaranteed as long as

$$\|\boldsymbol{\Gamma}\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right) \le \frac{1}{2}\left(1 + \frac{1}{0.063}\right) \approx 8$$

• Fewer than 8 non-zeros GLOBALLY are allowed!!! This is a very pessimistic result!
• Repeating the above for the noisy case leads to even worse performance predictions
• Bottom line: classic SparseLand theory cannot provide good explanations for the CSC model
Moving to Local Sparsity: Stripes
44

$$\|\boldsymbol{\Gamma}\|_{0,\infty}^s = \max_i \|\boldsymbol{\gamma}_i\|_0$$

$\|\boldsymbol{\Gamma}\|_{0,\infty}^s$ is low ⟺ all stripes 𝛄i are sparse ⟺ every patch has a sparse representation over 𝛀

$$(\mathbf{P}_{0,\infty}): \ \min_{\boldsymbol{\Gamma}} \|\boldsymbol{\Gamma}\|_{0,\infty}^s \ \text{ s.t. } \ \mathbf{X} = \mathbf{D}\boldsymbol{\Gamma}$$

• If 𝚪 is locally sparse [Papyan, Sulam & Elad '16]:
  - The solution of 𝐏0,∞ is necessarily unique
  - The global OMP and BP are guaranteed to recover it
• This result poses a local constraint for a global guarantee, and as such, the guarantees scale linearly with the dimension of the signal
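A minimal sketch (with assumed conventions: 1D signal, 𝑚 coefficients per spatial location, circular stripe extraction) of evaluating the ℓ0,∞ stripe norm defined above:

```python
import numpy as np

def l0_inf_stripes(Gamma, N, m, n):
    """max over locations i of the number of non-zeros in the stripe gamma_i,
    i.e. the (2n-1)*m coefficients that can influence the i-th patch."""
    coeffs = Gamma.reshape(N, m)                 # m coefficients per spatial location
    worst = 0
    for i in range(N):
        idx = np.arange(i - n + 1, i + n) % N    # the 2n-1 locations around i (circular)
        worst = max(worst, int(np.count_nonzero(coeffs[idx, :])))
    return worst

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, m, n = 64, 2, 5
    Gamma = np.zeros(N * m)
    Gamma[rng.choice(N * m, size=6, replace=False)] = 1.0
    print("||Gamma||_{0,inf}^s =", l0_inf_stripes(Gamma, N, m, n))
    print("||Gamma||_0         =", int(np.count_nonzero(Gamma)))
```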
From Ideal to Noisy Signals
45

• In practice, 𝐘 = 𝐃𝚪 + 𝐄, where 𝐄 is due to noise or model deviations

$$(\mathbf{P}_{0,\infty}^\epsilon): \ \min_{\boldsymbol{\Gamma}} \|\boldsymbol{\Gamma}\|_{0,\infty}^s \ \text{ s.t. } \ \|\mathbf{Y} - \mathbf{D}\boldsymbol{\Gamma}\|_2 \le \epsilon$$

• If 𝚪 is locally sparse and the noise is bounded [Papyan, Sulam & Elad '16]:
  - The solution of 𝐏0,∞^ϵ is stable: how close is 𝚪̂ to 𝚪? The true and estimated representations are close
  - The solution obtained via global OMP/BP is stable
Global Pursuit via Local Processing
46

$$(\mathbf{P}_1^\epsilon): \ \boldsymbol{\Gamma}^{BP} = \arg\min_{\boldsymbol{\Gamma}} \tfrac{1}{2}\|\mathbf{Y} - \mathbf{D}\boldsymbol{\Gamma}\|_2^2 + \xi\|\boldsymbol{\Gamma}\|_1$$

$$\mathbf{X} = \mathbf{D}\boldsymbol{\Gamma} = \sum_i \mathbf{R}_i^T\mathbf{s}_i = \sum_i \mathbf{R}_i^T\mathbf{D}_L\boldsymbol{\alpha}_i$$

where 𝐃L is the 𝑛 × 𝑚 local dictionary, and the 𝐬i = 𝐃L𝛂i are slices, not patches
Global Pursuit via Local Processing (2)
47

$$(\mathbf{P}_1^\epsilon): \ \boldsymbol{\Gamma}^{BP} = \arg\min_{\boldsymbol{\Gamma}} \tfrac{1}{2}\|\mathbf{Y} - \mathbf{D}\boldsymbol{\Gamma}\|_2^2 + \xi\|\boldsymbol{\Gamma}\|_1$$

Using variable splitting 𝐬i = 𝐃L𝛂i:

$$\min_{\{\mathbf{s}_i,\boldsymbol{\alpha}_i\}} \ \tfrac{1}{2}\Big\|\mathbf{Y} - \sum_i \mathbf{R}_i^T\mathbf{s}_i\Big\|_2^2 + \xi\sum_i\|\boldsymbol{\alpha}_i\|_1 \ \ \text{ s.t. } \ \mathbf{s}_i = \mathbf{D}_L\boldsymbol{\alpha}_i$$

• These two problems are equivalent, and convex w.r.t. their variables
• The new formulation targets the local slices and their sparse representations
• Can be solved via ADMM: replace the constraint with a penalty
Slice-Based Pursuit
48

Local Sparse Coding: $\boldsymbol{\alpha}_i = \arg\min_{\boldsymbol{\alpha}_i} \frac{\rho}{2}\|\mathbf{s}_i - \mathbf{D}_L\boldsymbol{\alpha}_i + \mathbf{u}_i\|_2^2 + \xi\|\boldsymbol{\alpha}_i\|_1$

Slice Reconstruction: $\mathbf{p}_i = \frac{1}{\rho}\mathbf{R}_i\mathbf{Y} + \mathbf{D}_L\boldsymbol{\alpha}_i - \mathbf{u}_i$

Slice Aggregation: $\hat{\mathbf{X}} = \sum_j \mathbf{R}_j^T\mathbf{p}_j$

Local Laplacian: $\mathbf{s}_i = \mathbf{p}_i - \frac{1}{\rho + n}\mathbf{R}_i\hat{\mathbf{X}}$

Dual Variable Update: $\mathbf{u}_i = \mathbf{u}_i + \mathbf{s}_i - \mathbf{D}_L\boldsymbol{\alpha}_i$

Comment: One iteration of this procedure amounts to… the very same patch-averaging algorithm we started with
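The scheme above can be sketched in a few lines of NumPy; this is a toy 1D version under stated assumptions (circular patch extraction, and the local sparse-coding step solved approximately by a few inner ISTA iterations), mirroring the update equations on the slide rather than reproducing the authors' implementation.

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def slice_based_pursuit(Y, DL, xi=0.1, rho=1.0, n_iters=20, inner=30):
    """Sketch of the slice-based ADMM pursuit (1D, circular patches): local sparse
    coding, slice reconstruction, aggregation, local Laplacian, dual update."""
    N = Y.size
    n, m = DL.shape
    alpha = np.zeros((N, m))
    s = np.zeros((N, n))
    u = np.zeros((N, n))
    R = lambda z, i: z[np.arange(i, i + n) % N]          # extract the i-th patch/slice
    step = 1.0 / np.linalg.norm(DL, 2) ** 2              # ISTA step for the local Lasso
    for _ in range(n_iters):
        for i in range(N):                               # local sparse coding (inner ISTA)
            target, a = s[i] + u[i], alpha[i]
            for _ in range(inner):
                a = soft(a - step * DL.T @ (DL @ a - target), step * xi / rho)
            alpha[i] = a
        p = np.array([R(Y, i) / rho + DL @ alpha[i] - u[i] for i in range(N)])
        X = np.zeros(N)                                  # slice aggregation
        for i in range(N):
            X[np.arange(i, i + n) % N] += p[i]
        for i in range(N):                               # local Laplacian + dual update
            s[i] = p[i] - R(X, i) / (rho + n)
            u[i] = u[i] + s[i] - DL @ alpha[i]
    X_hat = np.zeros(N)                                  # final signal: sum of slices
    for i in range(N):
        X_hat[np.arange(i, i + n) % N] += DL @ alpha[i]
    return X_hat, alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    DL = rng.standard_normal((8, 16)); DL /= np.linalg.norm(DL, axis=0)
    Y = rng.standard_normal(64)
    X_hat, alpha = slice_based_pursuit(Y, DL, xi=0.2, rho=1.0, n_iters=5, inner=20)
    print(X_hat.shape, int(np.count_nonzero(alpha)))
```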
49
Two Comments About this Scheme
Patches Slices
Patches Slices
The Proposed Scheme can be used for Dictionary (𝐃L) Learning
We work with Slices and not Patches
Slice-based DL algorithm using standard patch-based tools, leading
to a faster and simpler method, compared to existing methods
Patches extracted from natural images, and their corresponding slices. Observe how the slices are
far simpler, and contained by their corresponding patches
[Wohlberg, 2016] Ours
50
Two Comments About this Scheme
Patches Slices
Patches Slices
The Proposed Scheme can be used for Dictionary (𝐃L) Learning
We work with Slices and not Patches
Slice-based DL algorithm using standard patch-based tools, leading
to a faster and simpler method, compared to existing methods
Patches extracted from natural images, and their corresponding slices. Observe how the slices are
far simpler, and contained by their corresponding patches
[Wohlberg, 2016] Ours
Heide et al., Wohlberg
51
Going Deeper
Convolutional Neural Networks Analyzed via Convolutional Sparse Coding
Joint work with Vardan Papyan and Michael Elad
CSC and CNN
• There is an analogy between CSC and CNN:
Convolutional structure, data driven models, ReLU is a sparsifying operator, and more…
• We propose a principled way to analyze CNN
• But first, a short review of CNN…
52
CNN
Convolutional Neural Networks
SparseLand Sparse Representation Theory
*Our analysis holds true for fully connected networks as well
The Underlying Idea
Modeling data sources enables a theoretical analysis of algorithms’ performance
CNN
53
[Figure: a two-layer CNN, with input 𝐘 of length N, m₁ filters of size n₀ in 𝐖1, and m₂ filters of size n₁ in 𝐖2]

$$\text{ReLU}(z) = \max(0, z)$$
[LeCun, Bottou, Bengio and Haffner ‘98] [Krizhevsky, Sutskever & Hinton ‘12] [Simonyan & Zisserman ‘14] [He, Zhang, Ren & Sun ‘15]
Mathematically...
54

$$f(\mathbf{Y}, \{\mathbf{W}_i\}, \{\mathbf{b}_i\}) = \text{ReLU}\big(\mathbf{b}_2 + \mathbf{W}_2^T\,\text{ReLU}(\mathbf{b}_1 + \mathbf{W}_1^T\mathbf{Y})\big)$$

where 𝐘 ∈ ℝ^N, 𝐖1^T ∈ ℝ^{Nm₁×N}, 𝐛1 ∈ ℝ^{Nm₁}, 𝐖2^T ∈ ℝ^{Nm₂×Nm₁}, 𝐛2 ∈ ℝ^{Nm₂}, and the output 𝐙2 ∈ ℝ^{Nm₂}
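In plain NumPy the two-layer forward pass above reads as follows; the random dense weights and toy sizes are assumptions used only to fix the shapes (the slide's footnote notes that the analysis also covers fully connected networks).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward_pass(Y, W1, b1, W2, b2):
    """f(Y) = ReLU(b2 + W2^T ReLU(b1 + W1^T Y)), exactly as written above."""
    Z1 = relu(b1 + W1.T @ Y)
    Z2 = relu(b2 + W2.T @ Z1)
    return Z2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, m1, m2 = 16, 3, 5                    # toy dimensions
    Y = rng.standard_normal(N)
    W1 = rng.standard_normal((N, N * m1))   # W1^T maps R^N -> R^(N*m1)
    b1 = rng.standard_normal(N * m1)
    W2 = rng.standard_normal((N * m1, N * m2))
    b2 = rng.standard_normal(N * m2)
    print(forward_pass(Y, W1, b1, W2, b2).shape)   # (N*m2,)
```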
Training Stage of CNN
55

• Consider the task of classification
• Given a set of signals {𝐘j}j and their corresponding labels {h(𝐘j)}j, the CNN learns an end-to-end mapping:

$$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\},\mathbf{U}} \ \sum_j \ell\Big(h(\mathbf{Y}_j),\, \mathbf{U},\, f\big(\mathbf{Y}_j, \{\mathbf{W}_i\}, \{\mathbf{b}_i\}\big)\Big)$$

(true label, classifier 𝐔, output of last layer)
Back to CSC
56
𝐗 ∈ ℝ^N, 𝐃1 ∈ ℝ^{N×Nm₁}, 𝚪1 ∈ ℝ^{Nm₁}, 𝐃2 ∈ ℝ^{Nm₁×Nm₂}, 𝚪2 ∈ ℝ^{Nm₂}

Convolutional sparsity (CSC) assumes an inherent structure is present in natural signals. We propose to impose the same structure on the representations themselves.

Multi-Layer CSC (ML-CSC)
Intuition: From Atoms to Molecules
57
𝐗 ∈ ℝ^N, 𝐃1 ∈ ℝ^{N×Nm₁}, 𝚪1 ∈ ℝ^{Nm₁}, 𝐃2 ∈ ℝ^{Nm₁×Nm₂}, 𝚪2 ∈ ℝ^{Nm₂}

• Columns in 𝐃1 are convolutional atoms
• Columns in 𝐃2 combine the atoms in 𝐃1, creating more complex structures: each atom of the dictionary 𝐃1𝐃2 is a superposition of the atoms of 𝐃1
• The size of the effective atoms increases throughout the layers (receptive field)
Intuition: From Atoms to Molecules
58
𝐗 ∈ ℝ^N, 𝐃1 ∈ ℝ^{N×Nm₁}, 𝚪1 ∈ ℝ^{Nm₁}, 𝐃2 ∈ ℝ^{Nm₁×Nm₂}, 𝚪2 ∈ ℝ^{Nm₂}

• One could chain the multiplication of all the dictionaries into one effective dictionary 𝐃eff = 𝐃1𝐃2𝐃3 ⋯ 𝐃K, and then 𝐗 = 𝐃eff 𝚪K as in SparseLand
• However, a key property in this model is the sparsity of each representation (the feature maps)
A Small Taste: Model Training (MNIST)
59
𝐃1𝐃2𝐃3 (28×28), 𝐃1𝐃2 (15×15), 𝐃1 (7×7)

MNIST Dictionary:
• D1: 32 filters of size 7×7, with stride of 2 (dense)
• D2: 128 filters of size 5×5×32 with stride of 1 (99.09% sparse)
• D3: 1024 filters of size 7×7×128 (99.89% sparse)
A Small Taste: Pursuit
60
Γ1
Γ2
Γ3
Γ0
Y
99.51% sparse (5 nnz)
99.52% sparse (30 nnz)
94.51 % sparse (213 nnz)
A Small Taste: Pursuit
61
Γ1
Γ2
Γ3
Γ0
Y
99.41 % sparse (6 nnz) 99.25 % sparse
(47 nnz)
92.20 % sparse (302 nnz)
A Small Taste: Model Training (CIFAR)
62
𝐃1𝐃2𝐃3 (32×32), 𝐃1𝐃2 (13×13), 𝐃1 (5×5×3)

CIFAR Dictionary:
• D1: 64 filters of size 5×5×3, stride of 2 (dense)
• D2: 256 filters of size 5×5×64, stride of 2 (82.99% sparse)
• D3: 1024 filters of size 5×5×256 (90.66% sparse)
Deep Coding Problem (𝐃𝐂𝐏)
63

Noiseless Pursuit:

$$\text{Find } \{\boldsymbol{\Gamma}_j\}_{j=1}^K \ \text{ s.t. } \quad \mathbf{X} = \mathbf{D}_1\boldsymbol{\Gamma}_1,\ \|\boldsymbol{\Gamma}_1\|_{0,\infty}^s \le \lambda_1;\quad \boldsymbol{\Gamma}_1 = \mathbf{D}_2\boldsymbol{\Gamma}_2,\ \|\boldsymbol{\Gamma}_2\|_{0,\infty}^s \le \lambda_2;\quad \ldots;\quad \boldsymbol{\Gamma}_{K-1} = \mathbf{D}_K\boldsymbol{\Gamma}_K,\ \|\boldsymbol{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$$

Noisy Pursuit:

$$\text{Find } \{\boldsymbol{\Gamma}_j\}_{j=1}^K \ \text{ s.t. } \quad \|\mathbf{Y} - \mathbf{D}_1\boldsymbol{\Gamma}_1\|_2 \le \mathcal{E}_0,\ \|\boldsymbol{\Gamma}_1\|_{0,\infty}^s \le \lambda_1;\quad \|\boldsymbol{\Gamma}_1 - \mathbf{D}_2\boldsymbol{\Gamma}_2\|_2 \le \mathcal{E}_1,\ \|\boldsymbol{\Gamma}_2\|_{0,\infty}^s \le \lambda_2;\quad \ldots;\quad \|\boldsymbol{\Gamma}_{K-1} - \mathbf{D}_K\boldsymbol{\Gamma}_K\|_2 \le \mathcal{E}_{K-1},\ \|\boldsymbol{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$$
Deep Learning Problem (𝐃𝐋𝐏)
64

Supervised Dictionary Learning (task-driven dictionary learning [Mairal, Bach, Sapiro & Ponce '12]):

$$\min_{\{\mathbf{D}_i\}_{i=1}^K,\,\mathbf{U}} \ \sum_j \ell\Big(h(\mathbf{Y}_j),\, \mathbf{U},\, \mathbf{DCP}^{\star}\big(\mathbf{Y}_j, \{\mathbf{D}_i\}\big)\Big)$$

(true label, classifier 𝐔, deepest representation obtained from the DCP)

Unsupervised Dictionary Learning:

$$\text{Find } \{\mathbf{D}_i\}_{i=1}^K \ \text{ s.t. for } j = 1,\ldots,J: \quad \|\mathbf{Y}_j - \mathbf{D}_1\boldsymbol{\Gamma}_1^j\|_2 \le \mathcal{E}_0,\ \|\boldsymbol{\Gamma}_1^j\|_{0,\infty}^s \le \lambda_1;\quad \|\boldsymbol{\Gamma}_1^j - \mathbf{D}_2\boldsymbol{\Gamma}_2^j\|_2 \le \mathcal{E}_1,\ \|\boldsymbol{\Gamma}_2^j\|_{0,\infty}^s \le \lambda_2;\quad \ldots;\quad \|\boldsymbol{\Gamma}_{K-1}^j - \mathbf{D}_K\boldsymbol{\Gamma}_K^j\|_2 \le \mathcal{E}_{K-1},\ \|\boldsymbol{\Gamma}_K^j\|_{0,\infty}^s \le \lambda_K$$
ML-CSC: The Simplest Pursuit
65

• The simplest pursuit algorithm (single-layer case) is the THR algorithm, which operates on a given input signal 𝐘 = 𝐃𝚪 + 𝐄 (with 𝚪 sparse) by:

$$\hat{\boldsymbol{\Gamma}} = \mathcal{P}_\beta(\mathbf{D}^T\mathbf{Y})$$

• Restricting the coefficients to be nonnegative does not restrict the expressiveness of the model
• ReLU = Soft Nonnegative Thresholding
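A tiny numerical check of the "ReLU = soft nonnegative thresholding" claim (toy sizes assumed): the nonnegative soft-thresholding operator 𝒫β applied to 𝐃^T𝐘 coincides with a ReLU layer whose bias is −β.

```python
import numpy as np

def soft_nonneg_threshold(v, beta):
    """P_beta(v): keep only entries above beta and shrink them by beta; zero otherwise."""
    return np.maximum(v - beta, 0.0)

def relu(z):
    return np.maximum(0.0, z)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 32, 64
    D = rng.standard_normal((n, m))
    y = rng.standard_normal(n)
    beta = 0.5
    gamma_thr = soft_nonneg_threshold(D.T @ y, beta)   # THR pursuit: P_beta(D^T y)
    gamma_relu = relu(D.T @ y - beta)                  # one ReLU layer with bias -beta
    print(np.allclose(gamma_thr, gamma_relu))          # True
```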
Consider this for Solving the DCP
66

$$\mathbf{DCP}_\lambda^{\mathcal{E}}: \ \text{Find } \{\boldsymbol{\Gamma}_j\}_{j=1}^K \ \text{ s.t. } \ \|\mathbf{Y} - \mathbf{D}_1\boldsymbol{\Gamma}_1\|_2 \le \mathcal{E}_0,\ \|\boldsymbol{\Gamma}_1\|_{0,\infty}^s \le \lambda_1;\quad \|\boldsymbol{\Gamma}_1 - \mathbf{D}_2\boldsymbol{\Gamma}_2\|_2 \le \mathcal{E}_1,\ \|\boldsymbol{\Gamma}_2\|_{0,\infty}^s \le \lambda_2;\quad \ldots;\quad \|\boldsymbol{\Gamma}_{K-1} - \mathbf{D}_K\boldsymbol{\Gamma}_K\|_2 \le \mathcal{E}_{K-1},\ \|\boldsymbol{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$$

• Layered thresholding (LT): estimate 𝚪1 via the THR algorithm, then estimate 𝚪2 via the THR algorithm:
$$\hat{\boldsymbol{\Gamma}}_2 = \mathcal{P}_{\beta_2}\big(\mathbf{D}_2^T\,\mathcal{P}_{\beta_1}(\mathbf{D}_1^T\mathbf{Y})\big)$$

• Forward pass of CNN:
$$f(\mathbf{Y}) = \text{ReLU}\big(\mathbf{b}_2 + \mathbf{W}_2^T\,\text{ReLU}(\mathbf{b}_1 + \mathbf{W}_1^T\mathbf{Y})\big)$$

Correspondence: ReLU ↔ Soft Non-Negative THR; Bias 𝐛 ↔ Thresholds 𝛽; Weights 𝐖 ↔ Dictionary 𝐃; Forward pass 𝑓(⋅) ↔ Layered Soft NN THR (𝐃𝐂𝐏*)
Consider this for Solving the DCP
67

• Layered thresholding (LT): $\hat{\boldsymbol{\Gamma}}_2 = \mathcal{P}_{\beta_2}\big(\mathbf{D}_2^T\,\mathcal{P}_{\beta_1}(\mathbf{D}_1^T\mathbf{Y})\big)$
• Forward pass of CNN: $f(\mathbf{Y}) = \text{ReLU}\big(\mathbf{b}_2 + \mathbf{W}_2^T\,\text{ReLU}(\mathbf{b}_1 + \mathbf{W}_1^T\mathbf{Y})\big)$

The layered (soft nonnegative) thresholding and the forward pass algorithm are the very same thing!
Consider this for Solving the DLP
68

• DLP (supervised), in SparseLand language (the deepest representation is estimated via the layered THR algorithm, i.e. Layered Soft NN THR, 𝐃𝐂𝐏*); the thresholds for the DCP should also be learned:

$$\min_{\{\mathbf{D}_i\}_{i=1}^K,\,\mathbf{U}} \ \sum_j \ell\Big(h(\mathbf{Y}_j),\, \mathbf{U},\, \mathbf{DCP}^{\star}\big(\mathbf{Y}_j, \{\mathbf{D}_i\}\big)\Big)$$

• CNN training, in CNN language (the forward pass 𝑓(⋅)):

$$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\},\mathbf{U}} \ \sum_j \ell\Big(h(\mathbf{Y}_j),\, \mathbf{U},\, f\big(\mathbf{Y}_j, \{\mathbf{W}_i\}, \{\mathbf{b}_i\}\big)\Big)$$
Consider this for Solving the DLP
69

• DLP (supervised): $\min_{\{\mathbf{D}_i\},\,\mathbf{U}} \sum_j \ell\big(h(\mathbf{Y}_j), \mathbf{U}, \mathbf{DCP}^{\star}(\mathbf{Y}_j, \{\mathbf{D}_i\})\big)$
• CNN training: $\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\},\mathbf{U}} \sum_j \ell\big(h(\mathbf{Y}_j), \mathbf{U}, f(\mathbf{Y}_j, \{\mathbf{W}_i\}, \{\mathbf{b}_i\})\big)$

The problems solved by the training stage of CNN and by the DLP are equivalent as well, assuming that the DCP is approximated via the layered thresholding algorithm.

* Recall that for the ML-CSC, there exists an unsupervised avenue for training the dictionaries that has no simple parallel in CNN
Theoretical Questions
70

[Diagram: a signal 𝐗 is generated by the ML-CSC model M (𝐗 = 𝐃1𝚪1, 𝚪1 = 𝐃2𝚪2, …, 𝚪K−1 = 𝐃K𝚪K, each 𝚪i is ℓ0,∞-sparse), contaminated to give 𝐘, and the representations {𝚪̂i}i=1..K are estimated by solving 𝐃𝐂𝐏λℰ via Layered Thresholding (Forward Pass) or other algorithms]
Theoretical Path: Possible Questions
• Having established the importance of the ML-CSC model and its associated pursuit, the DCP problem, we now turn to its analysis
• The main questions we aim to address:
I. Uniqueness of the solution (set of representations) to the 𝐃𝐂𝐏λ?
II. Stability of the solution to the 𝐃𝐂𝐏λℰ problem?
III. Stability of the solution obtained via the hard and soft layered THR algorithms (forward pass)?
IV. Limitations of this (very simple) algorithm, and alternative pursuits?
V. Algorithms for training the dictionaries {𝐃i}i=1..K, vs. CNN?
VI. New insights on how to operate on signals via CNN?
71
Uniqueness of 𝐃𝐂𝐏λ
72

$$\mathbf{DCP}_\lambda: \ \text{Find a set of representations satisfying} \ \ \mathbf{X} = \mathbf{D}_1\boldsymbol{\Gamma}_1,\ \|\boldsymbol{\Gamma}_1\|_{0,\infty}^s \le \lambda_1;\quad \boldsymbol{\Gamma}_1 = \mathbf{D}_2\boldsymbol{\Gamma}_2,\ \|\boldsymbol{\Gamma}_2\|_{0,\infty}^s \le \lambda_2;\quad \ldots;\quad \boldsymbol{\Gamma}_{K-1} = \mathbf{D}_K\boldsymbol{\Gamma}_K,\ \|\boldsymbol{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$$

Is this set unique? Recall the Mutual Coherence [Donoho & Elad '03]: $\mu(\mathbf{D}) = \max_{i \ne j} |\mathbf{d}_i^T\mathbf{d}_j|$

Theorem: If a set of solutions $\{\boldsymbol{\Gamma}_i\}_{i=1}^K$ is found for $\mathbf{DCP}_\lambda$ such that
$$\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s = \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right)$$
then these are necessarily the unique solution to this problem.

The feature maps CNN aims to recover are unique.
Stability of 𝐃𝐂𝐏λℰ
73

$$\mathbf{DCP}_\lambda^{\mathcal{E}}: \ \text{Find a set of representations satisfying} \ \ \|\mathbf{Y} - \mathbf{D}_1\boldsymbol{\Gamma}_1\|_2 \le \mathcal{E}_0,\ \|\boldsymbol{\Gamma}_1\|_{0,\infty}^s \le \lambda_1;\quad \|\boldsymbol{\Gamma}_1 - \mathbf{D}_2\boldsymbol{\Gamma}_2\|_2 \le \mathcal{E}_1,\ \|\boldsymbol{\Gamma}_2\|_{0,\infty}^s \le \lambda_2;\quad \ldots;\quad \|\boldsymbol{\Gamma}_{K-1} - \mathbf{D}_K\boldsymbol{\Gamma}_K\|_2 \le \mathcal{E}_{K-1},\ \|\boldsymbol{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$$

• Suppose that we manage to solve the 𝐃𝐂𝐏λℰ and find a feasible set of representations {𝚪̂i}i=1..K satisfying all the conditions
• Is this set stable? The question we pose is: how close is 𝚪̂i to 𝚪i?
Stability of 𝐃𝐂𝐏λℰ
74

Theorem: If the true representations $\{\boldsymbol{\Gamma}_i\}_{i=1}^K$ satisfy
$$\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right)$$
then the set of solutions $\{\hat{\boldsymbol{\Gamma}}_i\}_{i=1}^K$ obtained by solving this problem (somehow) with $\mathcal{E}_0 = \|\mathbf{E}\|_2^2$ and $\mathcal{E}_i = 0$ ($i \ge 1$) must obey
$$\|\hat{\boldsymbol{\Gamma}}_i - \boldsymbol{\Gamma}_i\|_2^2 \le 4\|\mathbf{E}\|_2^2 \prod_{j=1}^{i} \frac{1}{1 - (2\lambda_j - 1)\,\mu(\mathbf{D}_j)}$$

Observe this annoying effect of error magnification as we dive into the model.

The problem CNN aims to solve is stable under certain conditions.
Local Noise Assumption
• Our analysis relied on the local sparsity of the underlying solution 𝚪, which was enforced through the ℓ0,∞ norm
• In what follows, we present stability guarantees that will also depend on the local energy in the noise vector 𝐄
• This will be enforced via the ℓ2,∞ norm, defined as:
75
$$\|\mathbf{E}\|_{2,\infty}^p = \max_i \|\mathbf{R}_i\mathbf{E}\|_2$$
Stability of Layered-THR
76

Theorem: If
$$\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\cdot\frac{|\Gamma_i^{\min}|}{|\Gamma_i^{\max}|}\right) - \frac{1}{\mu(\mathbf{D}_i)}\cdot\frac{\varepsilon_L^{i-1}}{|\Gamma_i^{\max}|}$$
then the layered hard THR (with the proper thresholds) will find the correct supports and
$$\|\hat{\boldsymbol{\Gamma}}_i^{LT} - \boldsymbol{\Gamma}_i\|_{2,\infty}^p \le \varepsilon_L^i$$
where we have defined $\varepsilon_L^0 = \|\mathbf{E}\|_{2,\infty}^p$ and
$$\varepsilon_L^i = \|\boldsymbol{\Gamma}_i\|_{0,\infty}^p \cdot \Big(\varepsilon_L^{i-1} + \mu(\mathbf{D}_i)\big(\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s - 1\big)\,|\Gamma_i^{\max}|\Big)$$

* Least-Squares update of the non-zeros?

The stability of the forward pass is guaranteed if the underlying representations are locally sparse and the noise is locally bounded.
Limitations of the Forward Pass
• The stability analysis reveals several inherent limitations of the forward pass (a.k.a. layered THR) algorithm:
• Even in the noiseless case, the forward pass is incapable of recovering the perfect solution of the DCP problem
• Its success depends on the ratio |Γi^min| / |Γi^max|; this is a direct consequence of relying on a simple thresholding operator
• The distance between the true sparse vector and the estimated one increases exponentially as a function of the layer depth
• Next, we propose a new algorithm that attempts to solve some of these problems
77
Special Case – Sparse Dictionaries
• Throughout the theoretical study we assumed that the representations in the different layers are ℓ0,∞-sparse
• Do we know of a simple example of a set of dictionaries {𝐃i}i=1..K and their corresponding signals 𝐗 that will obey this property?
• Assuming the dictionaries are sparse:

$$\|\boldsymbol{\Gamma}_j\|_{0,\infty}^s \le \|\boldsymbol{\Gamma}_K\|_{0,\infty}^s \prod_{i=j+1}^{K} \|\mathbf{D}_i\|_0$$

where ‖𝐃i‖0 is the maximal number of non-zeros in an atom of 𝐃i
• In the context of CNN, the above happens if a sparsity-promoting regularization, such as the ℓ1, is employed on the filters

78
Better Pursuit?
79

$$\mathbf{DCP}_\lambda: \ \text{Find a set of representations satisfying} \ \ \mathbf{X} = \mathbf{D}_1\boldsymbol{\Gamma}_1,\ \|\boldsymbol{\Gamma}_1\|_{0,\infty}^s \le \lambda_1;\quad \ldots;\quad \boldsymbol{\Gamma}_{K-1} = \mathbf{D}_K\boldsymbol{\Gamma}_K,\ \|\boldsymbol{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$$

• So far we proposed the Layered THR:
$$\hat{\boldsymbol{\Gamma}}_K = \mathcal{P}_{\beta_K}\big(\mathbf{D}_K^T \cdots \mathcal{P}_{\beta_2}(\mathbf{D}_2^T\,\mathcal{P}_{\beta_1}(\mathbf{D}_1^T\mathbf{X}))\big)$$
• The motivation is clear: getting close to what CNN use
• However, this is the simplest and weakest pursuit known in the field of sparsity. Can we offer something better?
Layered Basis Pursuit (Noiseless)
80

• We can propose a Layered Basis Pursuit algorithm (related to deconvolutional networks [Zeiler, Krishnan, Taylor & Fergus '10]):

$$\boldsymbol{\Gamma}_1^{LBP} = \arg\min_{\boldsymbol{\Gamma}_1} \|\boldsymbol{\Gamma}_1\|_1 \ \text{ s.t. } \ \mathbf{X} = \mathbf{D}_1\boldsymbol{\Gamma}_1$$

$$\boldsymbol{\Gamma}_2^{LBP} = \arg\min_{\boldsymbol{\Gamma}_2} \|\boldsymbol{\Gamma}_2\|_1 \ \text{ s.t. } \ \boldsymbol{\Gamma}_1^{LBP} = \mathbf{D}_2\boldsymbol{\Gamma}_2$$
Guarantee for Success of Layered BP
81

Theorem: If a set of representations $\{\boldsymbol{\Gamma}_i\}_{i=1}^K$ of the Multi-Layer CSC model satisfy
$$\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right)$$
then the Layered BP is guaranteed to find them.

• As opposed to prior work on CNN, we can do far more than just propose an algorithm; we can analyze its conditions for success
• Consequences:
  - The Layered BP can retrieve the underlying representations in the noiseless case, a task in which the forward pass fails
  - The Layered BP's success does not depend on the ratio |Γi^min| / |Γi^max|
Layered Basis Pursuit (Noisy)
82

Similarly to the noiseless case, this can also be solved via the Layered Basis Pursuit algorithm, but in a Lagrangian form:

$$\boldsymbol{\Gamma}_1^{LBP} = \arg\min_{\boldsymbol{\Gamma}_1} \tfrac{1}{2}\|\mathbf{Y} - \mathbf{D}_1\boldsymbol{\Gamma}_1\|_2^2 + \xi_1\|\boldsymbol{\Gamma}_1\|_1$$

$$\boldsymbol{\Gamma}_2^{LBP} = \arg\min_{\boldsymbol{\Gamma}_2} \tfrac{1}{2}\|\boldsymbol{\Gamma}_1^{LBP} - \mathbf{D}_2\boldsymbol{\Gamma}_2\|_2^2 + \xi_2\|\boldsymbol{\Gamma}_2\|_1$$
Stability of Layered BP
83

Theorem: Assuming that
$$\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s \le \frac{1}{3}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right)$$
then for correctly chosen $\{\lambda_i\}_{i=1}^K$ we are guaranteed that:
1. The support of $\hat{\boldsymbol{\Gamma}}_i^{LBP}$ is contained in that of $\boldsymbol{\Gamma}_i$
2. The error is bounded: $\|\hat{\boldsymbol{\Gamma}}_i^{LBP} - \boldsymbol{\Gamma}_i\|_{2,\infty}^p \le \varepsilon_L^i$, where $\varepsilon_L^i = 7.5^i\,\|\mathbf{E}\|_{2,\infty}^p \prod_{j=1}^{i} \|\boldsymbol{\Gamma}_j\|_{0,\infty}^p$
3. Every entry in $\boldsymbol{\Gamma}_i$ greater than $\varepsilon_L^i / \|\boldsymbol{\Gamma}_i\|_{0,\infty}^p$ will be found
Layered Iterative Thresholding
84

Layered BP: $\boldsymbol{\Gamma}_j^{LBP} = \arg\min_{\boldsymbol{\Gamma}_j} \tfrac{1}{2}\|\hat{\boldsymbol{\Gamma}}_{j-1}^{LBP} - \mathbf{D}_j\boldsymbol{\Gamma}_j\|_2^2 + \xi_j\|\boldsymbol{\Gamma}_j\|_1$

Layered Iterative Soft-Thresholding:
$$\boldsymbol{\Gamma}_j^{t} = \mathcal{S}_{\xi_j/c_j}\left(\boldsymbol{\Gamma}_j^{t-1} + \frac{1}{c_j}\mathbf{D}_j^T\big(\hat{\boldsymbol{\Gamma}}_{j-1} - \mathbf{D}_j\boldsymbol{\Gamma}_j^{t-1}\big)\right)$$
* with $c_i > 0.5\,\lambda_{\max}(\mathbf{D}_i^T\mathbf{D}_i)$

• Note that our suggestion implies that groups of layers share the same dictionaries
• Can be seen as a recurrent neural network [Gregor & LeCun '10]
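A minimal sketch of the layered iterative soft-thresholding idea (dense toy dictionaries, a fixed number of inner iterations, and the step constant set from the spectral norm are all assumptions): each layer runs ISTA on its Lagrangian basis-pursuit problem, feeding its estimate to the next layer, starting from 𝚪̂0 = 𝐘.

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def layered_ista(Y, dictionaries, xis, n_iters=100):
    """Layered Iterative Soft-Thresholding:
    Gamma_j^t = S_{xi_j/c_j}(Gamma_j^{t-1} + (1/c_j) D_j^T (Gamma_{j-1} - D_j Gamma_j^{t-1}))."""
    estimates = []
    prev = Y
    for D, xi in zip(dictionaries, xis):
        c = np.linalg.norm(D, 2) ** 2          # any c_j >= ||D_j||_2^2 makes ISTA converge
        Gamma = np.zeros(D.shape[1])
        for _ in range(n_iters):
            Gamma = soft(Gamma + D.T @ (prev - D @ Gamma) / c, xi / c)
        estimates.append(Gamma)
        prev = Gamma                           # feed the estimate to the next layer
    return estimates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D1 = rng.standard_normal((64, 128)); D1 /= np.linalg.norm(D1, axis=0)
    D2 = rng.standard_normal((128, 256)); D2 /= np.linalg.norm(D2, axis=0)
    Gamma2 = np.zeros(256); Gamma2[rng.choice(256, 4, replace=False)] = 1.0
    Y = D1 @ (D2 @ Gamma2) + 0.01 * rng.standard_normal(64)
    G1, G2 = layered_ista(Y, [D1, D2], xis=[0.05, 0.05])
    print(int(np.count_nonzero(G1)), int(np.count_nonzero(G2)))
```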
This Talk
85
Independent patch-processing
Local Sparsity
We described the limitations of patch-based processing and ways to overcome some of these
This Talk
86
Independent patch-processing
Local Sparsity
Convolutional Sparse Coding
We presented a theoretical study of the CSC and a practical algorithm that works locally
This Talk
87
Independent patch-processing
Local Sparsity
Convolutional Sparse Coding
Convolutional Neural
Networks
We mentioned several interesting connections between CSC and CNN and this led us to …
This Talk
88
Multi-Layer Convolutional Sparse Coding
Independent patch-processing
Local Sparsity
Convolutional Sparse Coding
Convolutional Neural
Networks
… propose and analyze a multi-layer extension of CSC, shown to be tightly connected to CNN
This Talk
89
Extension of the classical sparse theory to a multi-layer setting
Multi-Layer Convolutional Sparse Coding
Independent patch-processing
Local Sparsity
Convolutional Sparse Coding
Convolutional Neural
Networks
A novel interpretation and theoretical understanding of CNN
The ML-CSC was shown to enable a theoretical study of CNN, along with new insights
This Talk
90
Extension of the classical sparse theory to a multi-layer setting
Multi-Layer Convolutional Sparse Coding
Independent patch-processing
Local Sparsity
Convolutional Sparse Coding
Convolutional Neural
Networks
A novel interpretation and theoretical understanding of CNN
The underlying idea: Modeling the data source in order to be able to theoretically analyze algorithms' performance
Questions?
91