
Optimization Models
EECS 127 / EECS 227AT

Laurent El Ghaoui

EECS Department, UC Berkeley

Spring 2020


LECTURE 26

Implicit Deep Learning

The Matrix is everywhere. It is all around us.

Morpheus


Outline

1 Implicit Rules

2 Link with Neural Nets

3 Well-Posedness

4 Robustness Analysis

5 Training Implicit Models

6 Take-Aways


Collaborators

Joint work with:

Armin Askari, Fangda Gu, Bert Travacca, Alicia Tsai (UC Berkeley);

Mert Pilanci (Stanford);

Emmanuel Vallod, Stefano Proto (www.sumup.ai).


Implicit prediction rule

Equilibrium equation: x = φ(Ax + Bu)

Prediction: y(u) = Cx + Du

Input u ∈ R^p, predicted output y(u) ∈ R^q, hidden “state” vector x ∈ R^n.

Model parameter matrix:

M = ( A  B
      C  D ).

Activation: vector map φ : R^n → R^n, e.g. the ReLU φ(·) = max(·, 0), acting componentwise on vectors.
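For concreteness, a minimal NumPy sketch of this prediction rule, computing the equilibrium by fixed-point iteration (it assumes the model is well-posed, a notion made precise below; the function names and tolerance are illustrative, not from the slides):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def implicit_predict(A, B, C, D, u, tol=1e-8, max_iter=1000):
    """Return y(u) = C x + D u, where x solves x = relu(A x + B u).

    Assumes well-posedness, so the fixed-point iteration converges
    to the unique equilibrium state x.
    """
    x = np.zeros(A.shape[0])
    for _ in range(max_iter):
        x_next = relu(A @ x + B @ u)
        if np.max(np.abs(x_next - x)) < tol:
            break
        x = x_next
    return C @ x + D @ u
```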


Deep neural nets as implicit models

Figure: A neural network.
Figure: An implicit model.

Implicit models are more general: they allow loops in the network graph.


Example

Fully connected, feedforward neural network:

y(u) = W_L x_L,   x_{l+1} = φ_l(W_l x_l), l = 0, . . . , L − 1,   x_0 = u.

Implicit model:

( A  B )   (  0    W_{L−1}  · · ·    0      0  )
( C  D ) = (  ⋮             ⋱       ⋮      ⋮  )
           (  0      0      · · ·   W_1    0  )
           (  0      0      · · ·    0    W_0 )
           ( W_L     0      · · ·    0     0  ),

x = ( x_L, . . . , x_1 ),   φ(z) = ( φ_L(z_L), . . . , φ_1(z_1) ).

The equilibrium equation x = φ(Ax + Bu) is easily solved via backward substitution (forward pass).
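To make the construction concrete, here is a NumPy sketch (our illustration; the helper name and layout choices are ours) that assembles (A, B, C, D) from a list of weights [W_0, ..., W_L], following the block structure above:

```python
import numpy as np

def feedforward_to_implicit(Ws):
    """Assemble (A, B, C, D) from weights [W_0, ..., W_L] of the
    feedforward net y = W_L x_L, x_{l+1} = phi(W_l x_l), x_0 = u.
    The state x = (x_L, ..., x_1) is stacked top to bottom."""
    L = len(Ws) - 1                    # number of hidden states x_1, ..., x_L
    p = Ws[0].shape[1]                 # input dimension
    q = Ws[L].shape[0]                 # output dimension
    dims = [Ws[l].shape[0] for l in range(L - 1, -1, -1)]  # sizes of x_L, ..., x_1
    n = sum(dims)
    offs = np.cumsum([0] + dims)       # block offsets within x

    A = np.zeros((n, n))
    for i in range(L - 1):             # W_{L-1}, ..., W_1 on the block superdiagonal
        A[offs[i]:offs[i+1], offs[i+1]:offs[i+2]] = Ws[L - 1 - i]
    B = np.zeros((n, p))
    B[offs[L-1]:offs[L], :] = Ws[0]    # bottom block: x_1 = phi(W_0 u)
    C = np.zeros((q, n))
    C[:, offs[0]:offs[1]] = Ws[L]      # y = W_L x_L
    D = np.zeros((q, p))
    return A, B, C, D
```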


Example: ResNet20

Figure: The A matrix for ResNet20.

20-layer network; implicit model of order n ∼ 180000.

Convolutional layers have blocks with Toeplitz structure.

Residual connections appear as lines.


Neural networks as implicit models

Framework covers most neural network architectures:

Neural nets have strictly upper triangular matrix A.

Equilibrium equation solved by substitution, i.e. “forward pass”.

State vector x contains all the hidden features.

Activation φ can be different for each component or block of x.

Covers CNNs, recurrent neural networks (RNNs), (Bi-)LSTMs, attention, transformers, etc.


Related concept: state-space models

The so-called “state-space” models for dynamical systems use the same idea to represent high-order differential equations...

Linear, time-invariant (LTI) dynamical system:

ẋ = Ax + Bu,   y = Cx + Du

Figure: LTI system


Well-posedness

The matrix A ∈ R^{n×n} is said to be well-posed for φ if, for every b ∈ R^n, a solution x ∈ R^n to the equation

x = φ(Ax + b)

exists and is unique.

Figure: The equation has two solutions or none, depending on sgn(b).

Figure: Solution is unique for every b.


Perron-Frobenius theory [1]

A square matrix P with non-negative entries admits a real eigenvalue λ_PF ≥ 0 with a non-negative eigenvector v ≠ 0:

P v = λ_PF v.

The value λ_PF dominates all the other eigenvalues: for any other (complex) eigenvalue µ ∈ C, we have |µ| ≤ λ_PF.

Figure: A web link matrix.

Google’s PageRank search engine relies on computing the Perron-Frobenius eigenvector of the web link matrix.
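As a quick illustration (ours, not from the slides), the Perron-Frobenius eigenpair of a non-negative matrix can be estimated by power iteration:

```python
import numpy as np

def pf_eigenpair(P, iters=500):
    """Estimate the Perron-Frobenius eigenpair of a non-negative square
    matrix P by power iteration (for irreducible P the iteration
    converges to (lambda_PF, v))."""
    v = np.ones(P.shape[0]) / P.shape[0]
    for _ in range(iters):
        w = P @ v
        s = np.sum(w)
        if s == 0:                    # e.g. strictly triangular P: lambda_PF = 0
            return 0.0, v
        v = w / s                     # keep v non-negative, summing to 1
    lam = np.sum(P @ v) / np.sum(v)   # eigenvalue estimate
    return lam, v
```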


PF Sufficient condition for well-posedness

Fact: Assume that φ is componentwise non-expansive (e.g., φ = ReLU):

∀ u, v ∈ R^n : |φ(u) − φ(v)| ≤ |u − v|.

Then the matrix A is well-posed for φ if the non-negative matrix |A| satisfies

λ_PF(|A|) < 1,

in which case the solution can be found via the fixed-point iterations:

x(t + 1) = φ(Ax(t) + b), t = 0, 1, 2, . . .

Covers neural networks: there |A| is strictly upper triangular, thus λ_PF(|A|) = 0.
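A small numerical check of this fact (an illustrative sketch; the 0.9 scaling is an arbitrary choice that enforces λ_PF(|A|) < 1):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
lam = np.max(np.abs(np.linalg.eigvals(np.abs(A))))   # lambda_PF(|A|)
A *= 0.9 / lam                                       # rescale so lambda_PF(|A|) = 0.9 < 1
b = rng.standard_normal(5)

x = np.zeros(5)
for _ in range(1000):                                # x(t+1) = relu(A x(t) + b)
    x = np.maximum(A @ x + b, 0)

print(np.max(np.abs(x - np.maximum(A @ x + b, 0))))  # ~0: x is the unique equilibrium
```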


Proof: existence

We have

|x(t + 1) − x(t)| = |φ(Ax(t) + b) − φ(Ax(t − 1) + b)| ≤ |A| |x(t) − x(t − 1)|,

which implies that, for every t, τ ≥ 0:

|x(t + τ) − x(t)| ≤ ∑_{k=t}^{t+τ} |A|^k |x(1) − x(0)| ≤ |A|^t ∑_{k=0}^{τ} |A|^k |x(1) − x(0)| ≤ |A|^t w,

where

w := ∑_{k=0}^{+∞} |A|^k |x(1) − x(0)| = (I − |A|)^{−1} |x(1) − x(0)|,

since, due to λ_PF(|A|) < 1, I − |A| is invertible, and the series above converges.

Since lim_{t→∞} |A|^t = 0, we obtain that x(t) is a Cauchy sequence, hence it has a limit, x_∞. By continuity of φ we further obtain that x_∞ = φ(Ax_∞ + b), which establishes the existence of a solution.


Proof: unicity

To prove unicity, consider two solutions x_1, x_2 ∈ R^n to the equation. Using the hypotheses of the theorem, we have, for any k ≥ 1:

|x_1 − x_2| ≤ |A| |x_1 − x_2| ≤ · · · ≤ |A|^k |x_1 − x_2|.

The fact that |A|^k → 0 as k → +∞ then establishes unicity.


Norm condition

More conservative condition: ‖A‖_∞ < 1, where

λ_PF(|A|) ≤ ‖A‖_∞ := max_i ∑_j |A_ij|.

Under the previous PF condition for well-posedness:

we can always rescale the model so that ‖A‖_∞ < 1, without altering the prediction rule;

the scaling is related to the PF eigenvector of |A|.

Hence during training we may simply use the norm condition.
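A sketch of the rescaling idea (our illustration; it assumes φ is positively homogeneous, like the ReLU, and that |A| has a strictly positive PF eigenvector):

```python
import numpy as np

def rescale_to_norm_condition(A, B, C):
    """Similarity scaling by the PF eigenvector s of |A|: with S = diag(s),
    replace (A, B, C) by (S^{-1} A S, S^{-1} B, C S). For positively
    homogeneous phi (e.g. ReLU) the prediction rule is unchanged, and the
    new A has ||A||_inf = lambda_PF(|A|) < 1."""
    eigvals, eigvecs = np.linalg.eig(np.abs(A))
    i = np.argmax(np.abs(eigvals))
    s = np.abs(np.real(eigvecs[:, i]))    # PF eigenvector of |A|, non-negative
    s = np.maximum(s, 1e-12)              # guard against (near-)zero entries
    s_inv = 1.0 / s
    return (A * np.outer(s_inv, s),       # S^{-1} A S, written elementwise
            B * s_inv[:, None],           # S^{-1} B
            C * s[None, :])               # C S
```

With this change of variables x → S^{-1} x, the equilibrium and prediction equations are preserved, which is why training can work with the simpler norm condition.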


Composing implicit models: cascade connection

Figure: A cascade connection.

The class of implicit models is closed under the following connections:

Cascade

Parallel and sum

Multiplicative

Feedback


Robustness analysis

Goal: analyze the impact of input perturbations on the state and outputs.

Motivations:

Diagnose a given (implicit) model.

Generate adversarial attacks.

Defense: modify the training problem so as to improve robustness properties.


Why does it matter?

Changing a few carefully chosen pixels in a test image can cause a classifier to mis-categorize the image (Kwiatkowska et al., 2019).


Robustness analysis

Input is unknown-but-bounded: u ∈ U, with

U := { u0 + δ ∈ R^p : |δ| ≤ σ_u },

u0 ∈ R^p is a “nominal” input;

σ_u ∈ R^p_+ is a measure of componentwise uncertainty around it.

Assume (sufficient condition for) well-posedness:

φ componentwise non-expansive;

λ_PF(|A|) < 1.

Nominal prediction:

x0 = φ(Ax0 + Bu0), y(u0) = Cx0 + Du0.


Component-wise bounds on the state and output

Fact: If λ_PF(|A|) < 1, then I − |A| is invertible, and

|y(u) − y(u0)| ≤ S |u − u0|,

where

S := |C| (I − |A|)^{−1} |B| + |D|

is a “sensitivity matrix” of the implicit model.

Figure: Sensitivity matrix of a classification network with 10 outputs (each image is a row).
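Computing the sensitivity matrix takes a few lines of NumPy (a sketch under the assumption λ_PF(|A|) < 1, so the inverse exists):

```python
import numpy as np

def sensitivity_matrix(A, B, C, D):
    """S = |C| (I - |A|)^{-1} |B| + |D|, valid when lambda_PF(|A|) < 1,
    giving the componentwise bound |y(u) - y(u0)| <= S |u - u0|."""
    n = A.shape[0]
    M = np.linalg.solve(np.eye(n) - np.abs(A), np.abs(B))
    return np.abs(C) @ M + np.abs(D)
```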


Generate a sparse attack on a targeted output

Attack method:

select the output to attack based on the rows (classes) of the sensitivity matrix;

select the top k entries in the chosen row;

randomly alter the corresponding pixels.

Changing k = 1 (top) or k = 2 (middle, bottom) pixels, images are wrongly classified, and accuracy decreases from 99% to 74%.
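A sketch of this attack (illustrative; it reuses the sensitivity matrix S from the previous slide, and the ±σ perturbation scheme is our choice):

```python
import numpy as np

def sparse_attack(S, u0, target, k, sigma=0.5, rng=None):
    """Perturb the k pixels to which output `target` is most sensitive
    (largest entries of row `target` of the sensitivity matrix S)."""
    rng = rng or np.random.default_rng()
    idx = np.argsort(S[target])[-k:]                   # top-k sensitive pixels
    u = u0.copy()
    u[idx] += sigma * rng.choice([-1.0, 1.0], size=k)  # random alteration
    return u
```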


Generate a sparse bounded attack on a targeted output

Target a specific output with sparse attacks:

U := { u0 + δ ∈ R^p : |δ| ≤ σ_u, Card(δ) ≤ k },

with k ≤ n. Solve a linear program, with c related to the chosen target:

max_{x, u}  c^T x :  x ≥ Ax + Bu,  x ≥ 0,  |x − x0| ≤ σ_x,  |u − u0| ≤ σ_u,  ‖diag(σ_u)^{−1}(u − u0)‖_1 ≤ k.

Changing k = 100 pixels by a tiny amount (σ_u = 0.1), target images are wrongly classified by a network with 99% nominal accuracy.
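A possible transcription of this LP in CVXPY (a sketch under the notation above; σ_x is the componentwise state bound from the sensitivity analysis, and the last constraint is the ℓ1 relaxation of the cardinality bound):

```python
import cvxpy as cp
import numpy as np

def lp_attack(A, B, c, x0, u0, sigma_x, sigma_u, k):
    """Sparse bounded attack as a linear program: maximize c^T x over the
    relaxed equilibrium set, with an l1 proxy for Card(u - u0) <= k."""
    n, p = A.shape[0], B.shape[1]
    x, u = cp.Variable(n), cp.Variable(p)
    constraints = [
        x >= A @ x + B @ u,                      # relaxed ReLU equilibrium
        x >= 0,
        cp.abs(x - x0) <= sigma_x,
        cp.abs(u - u0) <= sigma_u,
        cp.norm(cp.multiply(1.0 / sigma_u, u - u0), 1) <= k,
    ]
    problem = cp.Problem(cp.Maximize(c @ x), constraints)
    problem.solve()
    return u.value, x.value
```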


Training problem: setup

Inputs: U = [u1, . . . , um], with m data points ui ∈ R^p, i ∈ [m].

Outputs: Y = [y1, . . . , ym], with m responses yi ∈ R^q, i ∈ [m].

Predictions: with X = [x1, . . . , xm] ∈ R^{n×m} the matrix of hidden feature vectors, and φ acting columnwise,

Ŷ = CX + DU,   X = φ(AX + BU).


Training problem: constrained formulation

min_{X, A, B, C, D}  L(Y, Ŷ) + π(A, B, C, D)

s.t.  Ŷ = CX + DU,  X = φ(AX + BU),  ‖A‖_∞ ≤ κ.

Constraint on A with κ < 1 ensures well-posedness.

π(·) is a (convex) penalty, e.g. one that encourages robustness:

π(A, B, C, D) ∝ (1/2) · (‖B‖_∞² + ‖C‖_∞²) / (1 − ‖A‖_∞) + ‖D‖_∞.

May also incorporate penalties to encourage sparsity, low-rank structure, etc., e.g.

∑_{i ∈ [p]} ‖B e_i‖_∞

encourages entire columns of B to be zero, for feature selection.
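For concreteness, a NumPy sketch of these penalties (our transcription; ‖·‖_∞ on matrices denotes the induced max-row-sum norm):

```python
import numpy as np

def inf_norm(M):
    """Induced matrix infinity norm: maximum absolute row sum."""
    return np.max(np.sum(np.abs(M), axis=1))

def robustness_penalty(A, B, C, D):
    """Penalty proportional to the robustness-motivated expression above."""
    return 0.5 * (inf_norm(B) ** 2 + inf_norm(C) ** 2) / (1 - inf_norm(A)) + inf_norm(D)

def feature_selection_penalty(B):
    """sum_i ||B e_i||_inf: sum of columnwise max-abs values, which
    encourages entire columns of B to be zero."""
    return np.sum(np.max(np.abs(B), axis=0))
```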


Projected (sub) gradient

SGD can be adapted to the problem:

Differentiating through the equilibrium equation is possible.

Need to deal with the constraint of well-posedness via projection.

Projection onto the constraint ‖A‖_∞ ≤ κ can be done extremely fast using (vectorized) bisection, solving for each row of A in parallel; see the sketch after this list.

Can extend to Frank-Wolfe methods, which are suited to seeking sparsemodels.
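A sketch of that projection (our implementation of the idea: ‖A‖_∞ ≤ κ holds iff every row of A lies in an ℓ1 ball of radius κ, and projecting a row onto that ball reduces to bisection on a soft-threshold level):

```python
import numpy as np

def project_row_l1(a, kappa, iters=50):
    """Project a row vector a onto the l1 ball {x : ||x||_1 <= kappa},
    by bisection on the soft-threshold level t."""
    if np.sum(np.abs(a)) <= kappa:
        return a
    lo, hi = 0.0, np.max(np.abs(a))
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        if np.sum(np.maximum(np.abs(a) - t, 0.0)) > kappa:
            lo = t                    # threshold too small: shrink more
        else:
            hi = t
    return np.sign(a) * np.maximum(np.abs(a) - hi, 0.0)

def project_inf_norm(A, kappa):
    """Project A onto {A : ||A||_inf <= kappa}: rows are independent,
    so each one is projected onto an l1 ball of radius kappa."""
    return np.vstack([project_row_l1(row, kappa) for row in A])
```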


Example: traffic sign data set

Figure: Test accuracy and test loss versus training epochs, for the implicit model and a neural network, on the traffic sign data set.


Take-aways

Implicit models are more general than standard neural networks.

Well-posedness is a key property that can be enforced via norm or eigenvalueconditions.

Models can be composed together in modular fashion.

The notationally very simple framework allows for rigorous analyses of robustness, model compression, architecture optimization, etc.

The corresponding training problem is amenable to SGD methods.


Towards a general theory?


References

[1] Stephen Boyd. Perron-Frobenius theory. Lecture slides for EE363, Stanford University, 2008.

[2] Geir E. Dullerud and Fernando Paganini. A Course in Robust Control Theory: A Convex Approach, volume 36. Springer Science & Business Media, 2013.

[3] L. El Ghaoui, F. Gu, B. Travacca, A. Askari, and A. Tsai. Implicit deep learning. Submitted to ICML; preliminary version at https://arxiv.org/abs/1908.06315, February 2020.
