
Optimization Models
EECS 127 / EECS 227AT

Laurent El Ghaoui

EECS Department, UC Berkeley

Spring 2020


LECTURE 26

Implicit Deep Learning

The Matrix is everywhere. It is all around us.

Morpheus


Outline

1 Implicit Rules

2 Link with Neural Nets

3 Well-Posedness

4 Robustness Analysis

5 Training Implicit Models

6 Take-Aways


Collaborators

Joint work with:

Armin Askari, Fangda Gu, Bert Travacca, Alicia Tsai (UC Berkeley);

Mert Pilanci (Stanford);

Emmanuel Vallod, Stefano Proto (www.sumup.ai).


Implicit prediction rule

Equilibrium equation: x = φ(Ax + Bu)

Prediction: y(u) = Cx + Du

Input u ∈ R^p, predicted output y(u) ∈ R^q, hidden “state” vector x ∈ R^n.

Model parameter matrix:

M = ( A  B
      C  D ).

Activation: vector map φ : R^n → R^n, e.g. the ReLU φ(·) = max(·, 0), acting componentwise on vectors.
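For concreteness, a minimal NumPy sketch of this prediction rule, computing the equilibrium by fixed-point iteration (it assumes the model is well-posed, a notion made precise below; the function names and tolerance are illustrative, not from the slides):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def implicit_predict(A, B, C, D, u, tol=1e-8, max_iter=1000):
    """Return y(u) = C x + D u, where x solves x = relu(A x + B u).

    Assumes well-posedness, so the fixed-point iteration converges
    to the unique equilibrium state x.
    """
    x = np.zeros(A.shape[0])
    for _ in range(max_iter):
        x_next = relu(A @ x + B @ u)
        if np.max(np.abs(x_next - x)) < tol:
            break
        x = x_next
    return C @ x + D @ u
```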


Deep neural nets as implicit models

Figure: A neural network.
Figure: An implicit model.

Implicit models are more general: they allow loops in the network graph.


Example

Fully connected, feedforward neural network:

y(u) = W_L x_L,   x_{l+1} = φ_l(W_l x_l), l = 0, . . . , L − 1,   x_0 = u.

Implicit model:

( A  B )   (  0    W_{L−1}  · · ·    0      0  )
( C  D ) = (  ⋮             ⋱       ⋮      ⋮  )
           (  0      0      · · ·   W_1    0  )
           (  0      0      · · ·    0    W_0 )
           ( W_L     0      · · ·    0     0  ),

x = ( x_L, . . . , x_1 ),   φ(z) = ( φ_L(z_L), . . . , φ_1(z_1) ).

The equilibrium equation x = φ(Ax + Bu) is easily solved via backward substitution (forward pass).
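To make the construction concrete, here is a NumPy sketch (our illustration; the helper name and layout choices are ours) that assembles (A, B, C, D) from a list of weights [W_0, ..., W_L], following the block structure above:

```python
import numpy as np

def feedforward_to_implicit(Ws):
    """Assemble (A, B, C, D) from weights [W_0, ..., W_L] of the
    feedforward net y = W_L x_L, x_{l+1} = phi(W_l x_l), x_0 = u.
    The state x = (x_L, ..., x_1) is stacked top to bottom."""
    L = len(Ws) - 1                    # number of hidden states x_1, ..., x_L
    p = Ws[0].shape[1]                 # input dimension
    q = Ws[L].shape[0]                 # output dimension
    dims = [Ws[l].shape[0] for l in range(L - 1, -1, -1)]  # sizes of x_L, ..., x_1
    n = sum(dims)
    offs = np.cumsum([0] + dims)       # block offsets within x

    A = np.zeros((n, n))
    for i in range(L - 1):             # W_{L-1}, ..., W_1 on the block superdiagonal
        A[offs[i]:offs[i+1], offs[i+1]:offs[i+2]] = Ws[L - 1 - i]
    B = np.zeros((n, p))
    B[offs[L-1]:offs[L], :] = Ws[0]    # bottom block: x_1 = phi(W_0 u)
    C = np.zeros((q, n))
    C[:, offs[0]:offs[1]] = Ws[L]      # y = W_L x_L
    D = np.zeros((q, p))
    return A, B, C, D
```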


Example: ResNet20

Figure: The A matrix for ResNet20.

20-layer network; implicit model of order n ∼ 180000.

Convolutional layers have blocks with Toeplitz structure.

Residual connections appear as lines.


Neural networks as implicit models

Framework covers most neural network architectures:

Neural nets have strictly upper triangular matrix A.

Equilibrium equation solved by substitution, i.e. “forward pass”.

State vector x contains all the hidden features.

Activation φ can be different for each component or block of x.

Covers CNNs, recurrent neural networks (RNNs), (Bi-)LSTMs, attention, transformers, etc.


Related concept: state-space models

The so-called “state-space” models for dynamical systems use the same idea to represent high-order differential equations...

Linear, time-invariant (LTI) dynamical system:

ẋ = Ax + Bu,   y = Cx + Du

Figure: LTI system


Well-posedness

The matrix A ∈ R^{n×n} is said to be well-posed for φ if, for every b ∈ R^n, a solution x ∈ R^n to the equation

x = φ(Ax + b)

exists and is unique.

Figure: The equation has two solutions or none, depending on sgn(b).

Figure: Solution is unique for every b.


Perron-Frobenius theory [1]

A square matrix P with non-negative entries admits a real eigenvalue λ_PF ≥ 0 with a non-negative eigenvector v ≠ 0:

P v = λ_PF v.

The value λ_PF dominates all the other eigenvalues: for any other (complex) eigenvalue µ ∈ C, we have |µ| ≤ λ_PF.

Figure: A web link matrix.

Google’s PageRank search engine relies on computing the Perron-Frobenius eigenvector of the web link matrix.
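As a quick illustration (ours, not from the slides), the Perron-Frobenius eigenpair of a non-negative matrix can be estimated by power iteration:

```python
import numpy as np

def pf_eigenpair(P, iters=500):
    """Estimate the Perron-Frobenius eigenpair of a non-negative square
    matrix P by power iteration (for irreducible P the iteration
    converges to (lambda_PF, v))."""
    v = np.ones(P.shape[0]) / P.shape[0]
    for _ in range(iters):
        w = P @ v
        s = np.sum(w)
        if s == 0:                    # e.g. strictly triangular P: lambda_PF = 0
            return 0.0, v
        v = w / s                     # keep v non-negative, summing to 1
    lam = np.sum(P @ v) / np.sum(v)   # eigenvalue estimate
    return lam, v
```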


PF Sufficient condition for well-posedness

Fact: Assume that φ is componentwise non-expansive (e.g., φ = ReLU):

∀ u, v ∈ R^n : |φ(u) − φ(v)| ≤ |u − v|.

Then the matrix A is well-posed for φ if the non-negative matrix |A| satisfies

λ_PF(|A|) < 1,

in which case the solution can be found via the fixed-point iterations:

x(t + 1) = φ(Ax(t) + b), t = 0, 1, 2, . . .

Covers neural networks: there |A| is strictly upper triangular, thus λ_PF(|A|) = 0.
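A small numerical check of this fact (an illustrative sketch; the 0.9 scaling is an arbitrary choice that enforces λ_PF(|A|) < 1):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
lam = np.max(np.abs(np.linalg.eigvals(np.abs(A))))   # lambda_PF(|A|)
A *= 0.9 / lam                                       # rescale so lambda_PF(|A|) = 0.9 < 1
b = rng.standard_normal(5)

x = np.zeros(5)
for _ in range(1000):                                # x(t+1) = relu(A x(t) + b)
    x = np.maximum(A @ x + b, 0)

print(np.max(np.abs(x - np.maximum(A @ x + b, 0))))  # ~0: x is the unique equilibrium
```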


Proof: existence

We have

|x(t + 1) − x(t)| = |φ(Ax(t) + b) − φ(Ax(t − 1) + b)| ≤ |A| |x(t) − x(t − 1)|,

which implies that, for every t, τ ≥ 0:

|x(t + τ) − x(t)| ≤ ∑_{k=t}^{t+τ} |A|^k |x(1) − x(0)| ≤ |A|^t ∑_{k=0}^{τ} |A|^k |x(1) − x(0)| ≤ |A|^t w,

where

w := ∑_{k=0}^{+∞} |A|^k |x(1) − x(0)| = (I − |A|)^{−1} |x(1) − x(0)|,

since, due to λ_PF(|A|) < 1, I − |A| is invertible, and the series above converges.

Since lim_{t→∞} |A|^t = 0, we obtain that x(t) is a Cauchy sequence, hence it has a limit, x_∞. By continuity of φ we further obtain that x_∞ = φ(Ax_∞ + b), which establishes the existence of a solution.


Proof: unicity

To prove unicity, consider two solutions x_1, x_2 ∈ R^n to the equation. Using the hypotheses of the theorem, we have, for any k ≥ 1:

|x_1 − x_2| ≤ |A| |x_1 − x_2| ≤ · · · ≤ |A|^k |x_1 − x_2|.

The fact that |A|^k → 0 as k → +∞ then establishes unicity.


Norm condition

More conservative condition: ‖A‖_∞ < 1, where

λ_PF(|A|) ≤ ‖A‖_∞ := max_i ∑_j |A_ij|.

Under the previous PF condition for well-posedness:

we can always rescale the model so that ‖A‖_∞ < 1, without altering the prediction rule;

the scaling is related to the PF eigenvector of |A|.

Hence during training we may simply use the norm condition.
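A sketch of the rescaling idea (our illustration; it assumes φ is positively homogeneous, like the ReLU, and that |A| has a strictly positive PF eigenvector):

```python
import numpy as np

def rescale_to_norm_condition(A, B, C):
    """Similarity scaling by the PF eigenvector s of |A|: with S = diag(s),
    replace (A, B, C) by (S^{-1} A S, S^{-1} B, C S). For positively
    homogeneous phi (e.g. ReLU) the prediction rule is unchanged, and the
    new A has ||A||_inf = lambda_PF(|A|) < 1."""
    eigvals, eigvecs = np.linalg.eig(np.abs(A))
    i = np.argmax(np.abs(eigvals))
    s = np.abs(np.real(eigvecs[:, i]))    # PF eigenvector of |A|, non-negative
    s = np.maximum(s, 1e-12)              # guard against (near-)zero entries
    s_inv = 1.0 / s
    return (A * np.outer(s_inv, s),       # S^{-1} A S, written elementwise
            B * s_inv[:, None],           # S^{-1} B
            C * s[None, :])               # C S
```

With this change of variables x → S^{-1} x, the equilibrium and prediction equations are preserved, which is why training can work with the simpler norm condition.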


Composing implicit models: cascade connection

Figure: A cascade connection.

The class of implicit models is closed under the following connections:

Cascade

Parallel and sum

Multiplicative

Feedback


Robustness analysis

Goal: analyze the impact of input perturbations on the state and outputs.

Motivations:

Diagnose a given (implicit) model.

Generate adversarial attacks.

Defense: modify the training problem so as to improve robustness properties.


Why does it matter?

Changing a few carefully chosen pixels in a test image can cause a classifier to mis-categorize the image (Kwiatkowska et al., 2019).


Robustness analysis

Input is unknown-but-bounded: u ∈ U, with

U := { u0 + δ ∈ R^p : |δ| ≤ σ_u },

u0 ∈ R^p is a “nominal” input;

σ_u ∈ R^p_+ is a measure of componentwise uncertainty around it.

Assume (sufficient condition for) well-posedness:

φ componentwise non-expansive;

λ_PF(|A|) < 1.

Nominal prediction:

x0 = φ(Ax0 + Bu0), y(u0) = Cx0 + Du0.


Component-wise bounds on the state and output

Fact: If λ_PF(|A|) < 1, then I − |A| is invertible, and

|y(u) − y(u0)| ≤ S |u − u0|,

where

S := |C| (I − |A|)^{−1} |B| + |D|

is a “sensitivity matrix” of the implicit model.

Figure: Sensitivity matrix of a classification network with 10 outputs (each image is a row).
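Computing the sensitivity matrix takes a few lines of NumPy (a sketch under the assumption λ_PF(|A|) < 1, so the inverse exists):

```python
import numpy as np

def sensitivity_matrix(A, B, C, D):
    """S = |C| (I - |A|)^{-1} |B| + |D|, valid when lambda_PF(|A|) < 1,
    giving the componentwise bound |y(u) - y(u0)| <= S |u - u0|."""
    n = A.shape[0]
    M = np.linalg.solve(np.eye(n) - np.abs(A), np.abs(B))
    return np.abs(C) @ M + np.abs(D)
```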


Generate a sparse attack on a targeted output

Attack method:

select the output to attack based on the rows (classes) of the sensitivity matrix;

select the top k entries in the chosen row;

randomly alter the corresponding pixels.

Changing k = 1 (top) or k = 2 (middle, bottom) pixels, images are wrongly classified, and accuracy decreases from 99% to 74%.
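A sketch of this attack (illustrative; it reuses the sensitivity matrix S from the previous slide, and the ±σ perturbation scheme is our choice):

```python
import numpy as np

def sparse_attack(S, u0, target, k, sigma=0.5, rng=None):
    """Perturb the k pixels to which output `target` is most sensitive
    (largest entries of row `target` of the sensitivity matrix S)."""
    rng = rng or np.random.default_rng()
    idx = np.argsort(S[target])[-k:]                   # top-k sensitive pixels
    u = u0.copy()
    u[idx] += sigma * rng.choice([-1.0, 1.0], size=k)  # random alteration
    return u
```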


Generate a sparse bounded attack on a targeted output

Target a specific output with sparse attacks:

U := { u0 + δ ∈ R^p : |δ| ≤ σ_u, Card(δ) ≤ k },

with k ≤ n. Solve a linear program, with c related to the chosen target:

max_{x, u}  c^T x :  x ≥ Ax + Bu,  x ≥ 0,  |x − x0| ≤ σ_x,  |u − u0| ≤ σ_u,  ‖diag(σ_u)^{−1}(u − u0)‖_1 ≤ k.

Changing k = 100 pixels by a tiny amount (σ_u = 0.1), target images are wrongly classified by a network with 99% nominal accuracy.
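A possible transcription of this LP in CVXPY (a sketch under the notation above; σ_x is the componentwise state bound from the sensitivity analysis, and the last constraint is the ℓ1 relaxation of the cardinality bound):

```python
import cvxpy as cp
import numpy as np

def lp_attack(A, B, c, x0, u0, sigma_x, sigma_u, k):
    """Sparse bounded attack as a linear program: maximize c^T x over the
    relaxed equilibrium set, with an l1 proxy for Card(u - u0) <= k."""
    n, p = A.shape[0], B.shape[1]
    x, u = cp.Variable(n), cp.Variable(p)
    constraints = [
        x >= A @ x + B @ u,                      # relaxed ReLU equilibrium
        x >= 0,
        cp.abs(x - x0) <= sigma_x,
        cp.abs(u - u0) <= sigma_u,
        cp.norm(cp.multiply(1.0 / sigma_u, u - u0), 1) <= k,
    ]
    problem = cp.Problem(cp.Maximize(c @ x), constraints)
    problem.solve()
    return u.value, x.value
```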


Training problem: setup

Inputs: U = [u1, . . . , um], with m data points ui ∈ R^p, i ∈ [m].

Outputs: Y = [y1, . . . , ym], with m responses yi ∈ R^q, i ∈ [m].

Predictions: with X = [x1, . . . , xm] ∈ R^{n×m} the matrix of hidden feature vectors, and φ acting columnwise,

Ŷ = CX + DU,   X = φ(AX + BU).


Training problem: constrained formulation

min_{X, A, B, C, D}  L(Y, Ŷ) + π(A, B, C, D)

s.t.  Ŷ = CX + DU,  X = φ(AX + BU),  ‖A‖_∞ ≤ κ.

Constraint on A with κ < 1 ensures well-posedness.

π(·) is a (convex) penalty, e.g. one that encourages robustness:

π(A, B, C, D) ∝ (1/2) · (‖B‖_∞² + ‖C‖_∞²) / (1 − ‖A‖_∞) + ‖D‖_∞.

May also incorporate penalties to encourage sparsity, low-rank structure, etc., e.g.

∑_{i ∈ [p]} ‖B e_i‖_∞

encourages entire columns of B to be zero, for feature selection.
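For concreteness, a NumPy sketch of these penalties (our transcription; ‖·‖_∞ on matrices denotes the induced max-row-sum norm):

```python
import numpy as np

def inf_norm(M):
    """Induced matrix infinity norm: maximum absolute row sum."""
    return np.max(np.sum(np.abs(M), axis=1))

def robustness_penalty(A, B, C, D):
    """Penalty proportional to the robustness-motivated expression above."""
    return 0.5 * (inf_norm(B) ** 2 + inf_norm(C) ** 2) / (1 - inf_norm(A)) + inf_norm(D)

def feature_selection_penalty(B):
    """sum_i ||B e_i||_inf: sum of columnwise max-abs values, which
    encourages entire columns of B to be zero."""
    return np.sum(np.max(np.abs(B), axis=0))
```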


Projected (sub) gradient

SGD can be adapted to the problem:

Differentiating through the equilibrium equation is possible.

Need to deal with the constraint of well-posedness via projection.

Projection onto the constraint ‖A‖_∞ ≤ κ can be done extremely fast using (vectorized) bisection, solving for each row of A in parallel; see the sketch after this list.

Can extend to Frank-Wolfe methods, which are suited to seeking sparsemodels.
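A sketch of that projection (our implementation of the idea: ‖A‖_∞ ≤ κ holds iff every row of A lies in an ℓ1 ball of radius κ, and projecting a row onto that ball reduces to bisection on a soft-threshold level):

```python
import numpy as np

def project_row_l1(a, kappa, iters=50):
    """Project a row vector a onto the l1 ball {x : ||x||_1 <= kappa},
    by bisection on the soft-threshold level t."""
    if np.sum(np.abs(a)) <= kappa:
        return a
    lo, hi = 0.0, np.max(np.abs(a))
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        if np.sum(np.maximum(np.abs(a) - t, 0.0)) > kappa:
            lo = t                    # threshold too small: shrink more
        else:
            hi = t
    return np.sign(a) * np.maximum(np.abs(a) - hi, 0.0)

def project_inf_norm(A, kappa):
    """Project A onto {A : ||A||_inf <= kappa}: rows are independent,
    so each one is projected onto an l1 ball of radius kappa."""
    return np.vstack([project_row_l1(row, kappa) for row in A])
```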


Example: traffic sign data set

Figure: Test accuracy and test loss versus training epochs, for the implicit model and a neural network, on the traffic sign data set.


Take-aways

Implicit models are more general than standard neural networks.

Well-posedness is a key property that can be enforced via norm or eigenvalueconditions.

Models can be composed together in modular fashion.

The notationally very simple framework allows for rigorous analyses of robustness, model compression, architecture optimization, etc.

The corresponding training problem is amenable to SGD methods.


Towards a general theory?


References

[1] Stephen Boyd. Perron-Frobenius theory. Lecture slides for EE363, Stanford University, 2008.

[2] Geir E. Dullerud and Fernando Paganini. A Course in Robust Control Theory: A Convex Approach, volume 36. Springer Science & Business Media, 2013.

[3] L. El Ghaoui, F. Gu, B. Travacca, A. Askari, and A. Tsai. Implicit deep learning. Submitted to ICML; preliminary version at https://arxiv.org/abs/1908.06315, February 2020.
