University of Maryland
OPTIMIZATION PROBLEMS OVERVIEW
Lecture 1 - CMSC764 / AMSC607 SP18
WHAT IS OPTIMIZATION?
A: Minimizing things
In college you learned: set the derivative to zero.
Sounds easy.
convex!
BUT THEN…
What if there's no closed-form solution?
What if the problem has 1 BILLION dimensions?
What if the problem is non-convex?
What if the function has no derivative?
What if there are constraints?
What if the objective function has a BILLION terms?
Does this ever really happen?
MODEL FITTING PROBLEMS
BASIC OPTIMIZATION PROBLEMS: MODEL FITTING
• model parameters: w
• training data / inputs: dᵢ; label data / outputs: bᵢ
• loss function: min_w Σᵢ ℓ(dᵢ, w, bᵢ)
• Example: linear model f(dᵢ, w) = dᵢᵀw = bᵢ
• least-squares loss: ℓ(dᵢ, w, bᵢ) = (dᵢᵀw − bᵢ)², i.e. min_w ‖Dw − b‖²
BASIC OPTIMIZATION PROBLEMS: MODEL FITTING
penalized regressions
• general form: min_w J(w) + Σᵢ ℓ(dᵢ, w, bᵢ)
• with the least-squares loss ℓ(dᵢ, w, bᵢ) = (dᵢᵀw − bᵢ)²
• ridge penalty ("prior"): min_w ‖w‖₂² + ‖Dw − b‖²
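The two objectives above fit in a few lines of numpy. A minimal sketch — the synthetic data, seed, and penalty weight are my choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((100, 5))                 # rows are the training vectors d_i
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
b = D @ w_true + 0.1 * rng.standard_normal(100)   # noisy labels b_i

# Least squares: minimize ||Dw - b||^2
w_ls = np.linalg.lstsq(D, b, rcond=None)[0]

# Ridge penalty: minimize lam*||w||_2^2 + ||Dw - b||^2,
# via the normal equations (D^T D + lam*I) w = D^T b
lam = 0.1
w_ridge = np.linalg.solve(D.T @ D + lam * np.eye(5), D.T @ b)
```

With this much data and this little noise, both recover w_true closely; the penalty starts to matter when D is ill-conditioned or under-determined, which is exactly where the lecture goes next.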
EIGENVALUE DECOMPOSITION
• Spectral theorem: symmetric matrices have a complete, orthogonal set of eigenvectors
A = VΛVᵀ — a change of basis
EIGENVALUE DECOMPOSITION
• The action of the matrix is described by its eigenvalues
x = α₁e₁ + α₂e₂ + α₃e₃
Ax = A(α₁e₁ + α₂e₂ + α₃e₃)
  = α₁Ae₁ + α₂Ae₂ + α₃Ae₃
  = α₁λ₁e₁ + α₂λ₂e₂ + α₃λ₃e₃
MATRIX INVERSE
• The action of the inverse is described by the reciprocal eigenvalues
Ax = α₁λ₁e₁ + α₂λ₂e₂ + α₃λ₃e₃
A⁻¹x = α₁λ₁⁻¹e₁ + α₂λ₂⁻¹e₂ + α₃λ₃⁻¹e₃
ESTIMATION PROBLEM
Suppose λ₁ = 1, λ₂ = 0.1, λ₃ = 0.01
Ax = b = α₁e₁ + 0.1α₂e₂ + 0.01α₃e₃
I tell you b̂ = b + η (noise)
Can you recover x?
(do on board)
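The board exercise replays directly in numpy: with eigenvalues 1, 0.1, 0.01 as above, tiny noise in b̂ is amplified by up to cond(A) = 100 when we solve for x. The specific noise vector below is my own illustrative choice:

```python
import numpy as np

A = np.diag([1.0, 0.1, 0.01])            # symmetric matrix, eigenvalues 1, 0.1, 0.01
x = np.array([1.0, 1.0, 1.0])
b = A @ x

eta = np.array([1e-3, -1e-3, 1e-3])      # small noise, about 0.2% of ||b||
b_hat = b + eta                          # "I tell you b_hat = b + noise"
x_hat = np.linalg.solve(A, b_hat)        # naive recovery

rel_err_b = np.linalg.norm(b_hat - b) / np.linalg.norm(b)
rel_err_x = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
```

The component of the noise along the small-eigenvalue direction gets multiplied by 1/0.01 = 100, so the relative error in x is over an order of magnitude worse than the relative error in b.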
Does this ever really happen?
CONDITION NUMBER
• Ratio of largest to smallest singular value:
κ = σ_max/σ_min = |λ_max/λ_min| = ‖A‖‖A⁻¹‖
Why are these definitions the same for symmetric matrices?
What is the condition number of our problem?
• κ controls how large the relative error ‖x̂ − x‖/‖x‖ can be compared to the perturbation ‖b − b̂‖/‖b‖ when we solve Ax = b vs. Ax = b̂
DOES THIS EVER HAPPEN IN REAL LIFE?
• No. The situation is never this good.
• Common in optimization: regularization and IPMs (interior-point methods)
• Example: convolution
[figure: kernel * signal = output]
CONVOLUTION MATRIX
[figure: 600×600 convolution matrix]
Condition number: 3,500,000
CONVOLUTION: 2D
[figure: 2D convolution of an image with a blur kernel K]
DEBLURRING
[figure: deblurring by applying K⁻¹, with and without noise; relative difference 0.0092]
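To see where a number like 3,500,000 comes from, one can build a small circulant convolution matrix and check its conditioning directly. A sketch under my own choice of kernel and size, using the fact that the eigenvalues of a circulant matrix are the DFT of its first column:

```python
import numpy as np

n = 64
t = np.arange(n)
# Gaussian blur kernel, wrapped circularly around the signal boundary
kernel = np.exp(-0.5 * (np.minimum(t, n - t) / 2.0) ** 2)
kernel /= kernel.sum()

# Circulant convolution matrix: column i is the kernel rolled by i
K = np.stack([np.roll(kernel, i) for i in range(n)], axis=1)

# Eigenvalues of a circulant matrix = DFT of its first column,
# so the condition number is max|fft| / min|fft|
eigs = np.abs(np.fft.fft(K[:, 0]))
cond = eigs.max() / eigs.min()
```

Even this mild blur at n = 64 yields a condition number in the hundreds of millions: blurring kills high frequencies, so undoing it amplifies them enormously.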
UNDER-DETERMINED SYSTEMS
• Another problem: what if the matrix isn't even full rank?
• A ∈ ℝ^{M×N} with M < N, and b = Ax + η
• If the error is bounded (‖η‖ ≤ ε), solve
minimize ‖x‖ subject to ‖Ax − b‖ ≤ ε
• "Occam's razor": prefer the simplest x that explains the data
GEOMETRIC INTERPRETATION
[figure: a line of solutions in ℝᴺ]
All points on the red line satisfy y = Ax!
We pick the point with the smallest ℓ₂ norm!
RIDGE REGRESSION
• If the error is bounded (‖η‖ ≤ ε), with b = Ax + η, solve
minimize ‖x‖ subject to ‖Ax − b‖ ≤ ε
• This is equivalent to
minimize λ‖x‖² + ‖b − Ax‖²
for some value of λ. Why? Show on board.
RIDGE REGRESSION
minimize λ‖x‖² + ‖b − Ax‖²
Closed-form solution!
x = (AᵀA + λI)⁻¹Aᵀb
What does this do to the condition number?
New condition number:
(σ²_max + λ) / (σ²_min + λ)
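The new condition number is easy to sanity-check numerically; the matrix sizes, singular-value spread, and λ below are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(2)
# a matrix whose singular values are spread over several orders of magnitude
A = rng.standard_normal((50, 20)) @ np.diag(np.logspace(0, -4, 20))
b = rng.standard_normal(50)

# ridge solution x = (A^T A + lam*I)^{-1} A^T b
lam = 1e-2
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(20), A.T @ b)

s = np.linalg.svd(A, compute_uv=False)
cond_normal_eq = (s.max() / s.min()) ** 2                  # cond of A^T A
cond_ridge = (s.max() ** 2 + lam) / (s.min() ** 2 + lam)   # new condition number
```

Since (σ²_max + λ)/(σ²_min + λ) < σ²_max/σ²_min whenever λ > 0 and σ_min < σ_max, the regularized system is always better conditioned than the raw normal equations.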
TIKHONOV REGULARIZATION
minimize λ‖x‖² + ‖b − Ax‖²
• Has many names ("ridge regression" in statistics)
• Advantage: the new problem is easier to solve
• Improved condition number (less noise sensitivity)
• The parameter λ can be set:
  • empirically (e.g. cross-validation)
  • using noise bounds + theory (BIC, etc.)
BAYESIAN INTERPRETATION
• Model data formation: write distribution of data given parameters
• Observe data from random process
• Use Bayes rule: write distribution of parameters given data
• Find “most likely” parameters given data
• Optional: uncertainty quantification / confidence
BAYESIAN INTERPRETATION
• Assumptions:
  • We know the expected signal power: E{xᵢ²} = E
  • Linear measurement model: b = Ax + η
  • Noise is i.i.d. Gaussian: η ~ N(0, σ)
• MAP (maximum a-posteriori) estimate:
maximize_x p(x|b)
which will turn out to be
minimize λ‖x‖² + ‖b − Ax‖²
BAYESIAN INTERPRETATION
maximize_x p(x|b)
Bayes' Rule: p(X|Y) ∝ p(Y|X) p(X)
maximize_x p(x|b) ∝ p(b|x) p(x)
model: b ~ N(Ax, σ)
prior: x ~ N(0, √E)
BAYESIAN INTERPRETATION
maximize_x p(x|b) = p(b|x) p(x)
prior: p(x) = (2π)^{−n/2} e^{−‖x‖²/(2E)}
probability of data: p(b|x) = Πᵢ (1/√(2π)) e^{−(bᵢ − (Ax)ᵢ)²/(2σ²)} = (2π)^{−m/2} e^{−‖b−Ax‖²/(2σ²)}
Dropping constants:
maximize exp(−‖b−Ax‖²/σ²) · exp(−‖x‖²/E)
NEGATIVE LOG-LIKELIHOOD
maximize_x p(x|b) = p(b|x) p(x)
maximize exp(−‖b−Ax‖²/σ²) · exp(−‖x‖²/E)
Take logs:
maximize −‖b−Ax‖²/σ² − ‖x‖²/E
Equivalently, minimize the NLL:
minimize (σ²/E)‖x‖² + ‖b − Ax‖²
SPARSE PRIORS
• Priors add information to the problem
• Ridge/Tikhonov priors require a lot of assumptions
• A prior is only good when its assumptions are true!
• A very general prior : sparsity
WHAT IS SPARSITY?
• The signal has very few non-zeros: small ℓ₀ "norm"
[figure: a sparse vector — amplitude vs. vector index]
OTHER NOTIONS OF SPARSITY
• "Low density" signals — rapid decay of coefficients
[figure: magnitude vs. DCT index, showing fast decay]
• Fast decay = small weak ℓ_p norm
DENSE SIGNAL / SPARSE REPRESENTATION: AUDIO
[figure: audio signal and its DCT]
• Sounds are produced by vibrating objects
• Energy is concentrated at the resonance frequencies of the object
• These are defined by eigenvalues of the Laplacian of the vibrating surface
DENSE SIGNAL / SPARSE REPRESENTATION: AUDIO
[figure: echolocation chirp of a brown bat — Gabor transform (STFT), time vs. frequency]
• The bat hears the convolution of its signal with the environment
• Chirps generate well-conditioned convolution matrices
source: Christoph Studer
DENSE SIGNAL / SPARSE REPRESENTATION: IMAGES
• Approximately piecewise constant
• High correlations between adjacent pixels within objects
• High variation across edges
[figure: wavelet transform of a natural image]
Credit: Mark Davenport
NEURAL EVENTS
• Neural potentials: convolution of a spike train with a kernel
[figure: real recording, spiking events, kernel]
COMPRESSED SENSING
y = Ax, where A is a "nice" M×N matrix with M < N
• Assume x has only k ≪ N non-zero entries
• If x is sufficiently sparse, then one can uniquely recover x from y
MACHINE LEARNING: OVERFITTING
Ax = b: features/data A, model parameters x, labels b
• Happens when you can design the measurement matrix
• More features = better fit
EXAMPLE: POLYNOMIAL FITTING
[figure: n = 20 noisy data points drawn from a polynomial]
What degree is best?
EXAMPLE: POLYNOMIAL FITTING
[figure: the n = 20 data with a fitted curve, and model fitting error vs. degree — the error only decreases as the degree grows toward 17]
EXAMPLE: POLYNOMIAL FITTING
[figure: fit with degree = 17]
EXAMPLE: POLYNOMIAL FITTING
[figure: the degree-17 fit next to the original data]
Fit the SIGNAL, not the NOISE!
bad "out of sample" error
WHY DID THIS HAPPEN?
[figure: the n = 20 data and the model fitting error vs. degree curve again, with degree 17 marked]
WHY DID THIS HAPPEN?
[figure: model fitting error vs. degree]
κ = 3e16
We know we're overfitting just from the condition number.
Why?
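The blow-up is easy to reproduce with a Vandermonde "measurement matrix". The underlying cubic, noise level, and seed below are mine, but the effect is the one on the slides: the fitting error only decreases with degree while the condition number explodes.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
t = np.sort(rng.uniform(0, 1, n))
b = t ** 3 - t + 0.05 * rng.standard_normal(n)   # noisy samples of a cubic

conds, resids = {}, {}
for degree in (3, 17):
    V = np.vander(t, degree + 1)                 # polynomial "measurement matrix"
    coeffs, *_ = np.linalg.lstsq(V, b, rcond=None)
    conds[degree] = np.linalg.cond(V)            # conditioning of the fit
    resids[degree] = np.linalg.norm(V @ coeffs - b)
```

The degree-17 column space contains the degree-3 one, so its residual is necessarily smaller — yet its enormous condition number warns that the extra fit is noise, not signal.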
SPARSE RECOVERY PROBLEMS
minimize ‖x‖₀ subject to Ax = b
"Fat" matrix A (underdetermined), dense measurements b, sparse solution x
COMPLEXITY ALERT: NP-complete
(reductions to subset cover and SAT)
Nonetheless, greedy methods can solve it:
• Orthogonal Matching Pursuit (OMP)
• Stagewise methods: StOMP, CoSaMP, etc.
The ℓ₀ penalty is used to control over-fitting.
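OMP itself is only a few lines: greedily pick the column most correlated with the current residual, then re-fit by least squares on the chosen support. A minimal sketch on a toy problem of my own making, not a reference implementation:

```python
import numpy as np

def omp(A, b, k):
    """Greedy orthogonal matching pursuit: pick k columns of A to explain b."""
    m, n = A.shape
    support, residual, coeffs = [], b.copy(), np.zeros(0)
    for _ in range(k):
        # column most correlated with the residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        support.append(j)
        # least-squares fit on the chosen support, then update the residual
        coeffs, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support] @ coeffs
    x = np.zeros(n)
    x[support] = coeffs
    return x

rng = np.random.default_rng(4)
A = rng.standard_normal((40, 100))
A /= np.linalg.norm(A, axis=0)          # normalize columns
x_true = np.zeros(100)
x_true[[5, 37, 80]] = [2.0, -1.5, 1.0]  # 3-sparse ground truth
b = A @ x_true
x_rec = omp(A, b, 3)
```

With 40 random measurements of a 3-sparse vector (and no noise), OMP typically recovers the support exactly and drives the residual to zero.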
L0 IS NOT CONVEX
minimize ‖x‖₀ subject to Ax = b
Convex relaxation:
minimize |x| subject to Ax = b
WHY USE L1?
• L1 is the "tightest" convex relaxation of L0
SPARSE OPTIMIZATION PROBLEMS
Basis Pursuit: minimize |x| subject to Ax = b
Basis Pursuit Denoising: minimize λ|x| + ½‖Ax − b‖²
Lasso: minimize ½‖Ax − b‖² subject to |x| ≤ λ
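Basis pursuit denoising can be attacked with iterative soft-thresholding (ISTA), a simple proximal-gradient sketch; the problem size, λ, seed, and iteration count below are my own choices:

```python
import numpy as np

def ista(A, b, lam, iters=500):
    """ISTA for: minimize lam*|x|_1 + 0.5*||Ax - b||^2."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - b)              # gradient of the smooth part
        z = x - g / L                      # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((30, 60))
x_true = np.zeros(60)
x_true[[3, 25]] = [1.5, -2.0]
b = A @ x_true + 0.01 * rng.standard_normal(30)
x_hat = ista(A, b, lam=0.1)
```

The soft-threshold step sets small entries exactly to zero, which is what makes the ℓ₁ penalty produce genuinely sparse solutions.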
BAYESIAN LAND!
Basis Pursuit Denoising: minimize λ|x| + ½‖Ax − b‖²
What prior is this?
p(x) = e^{−|x|}
Laplace distribution — "robust to outliers"
HOW TO SET LAMBDA?
minimize out-of-sample error
hold out 30% of the training data as a test set
minimize λ|x| + ½‖Ax − b‖²
CROSS VALIDATION
minimize out-of-sample error
solve minimize λ|x| + ½‖Ax − b‖² on the training data,
and choose λ to minimize the test error
CROSS VALIDATION
minimize out-of-sample error
[figure: training and test error vs. λ — idealistic curves, no sampling noise]
CROSS VALIDATION
K-fold CV · leave-one-out CV · random sampling CV
AFTER MODEL SELECTION
de-biasing:
  minimize λ|x| + ½‖A_tr x − b_tr‖²  (fit on the training data; this selects a support C)
  minimize ½‖A_all x − b_all‖² subject to x ∈ C  (re-fit on all data, without the shrinkage of the ℓ₁ penalty)
ensemble learning:
  minimize λ|x| + ½‖A₁x − b₁‖²
  minimize λ|x| + ½‖A₂x − b₂‖²
  ...
  minimize λ|x| + ½‖A_K x − b_K‖²
These methods work best on small problems!
CO-SPARSITY
• Sometimes the signal is sparse under a transform Φ (analysis formulation):
minimize λ|Φx| + ½‖Ax − b‖²
• When the transform is invertible, we can instead use the synthesis formulation:
minimize λ|v| + ½‖AΦ⁻¹v − b‖²
• Otherwise, use the analysis formulation
• The thing inside the L1 norm is sparse!
EXAMPLE: IMAGE PROCESSING

LINEAR FILTER
Signal: 2 3 1 1 4
Filter/stencil: [−1 1 0]
Output: 1

LINEAR FILTER
Signal: 2 3 1 1 4
Filter/stencil: [−1 1 0]
Output: 1 −2

LINEAR FILTER
Signal: 2 3 1 1 4
Filter/stencil: [−1 1 0]
Output: 1 −2 0
Now what??
CIRCULANT BOUNDARY
Signal: 2 3 1 1 4
Filter/stencil: [−1 1 0]
At the end of the signal, wrap around to the start:
Output: 1 −2 0 3 −2
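The circulant filter above is one line of numpy, and writing it as a matrix makes the linearity explicit:

```python
import numpy as np

x = np.array([2.0, 3.0, 1.0, 1.0, 4.0])   # the signal from the slides

# stencil [-1 1 0]: output_i = x_{i+1} - x_i, wrapping circularly at the end
out = np.roll(x, -1) - x

# the same filter as multiplication by a circulant difference matrix
n = len(x)
C = np.roll(np.eye(n), 1, axis=1) - np.eye(n)
```

Both give the output 1, −2, 0, 3, −2 computed step by step on the slides, with the last entry coming from the wrap-around difference 2 − 4.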
IMAGE GRADIENT
Stencil [−1 1 0]
Neumann:
(u₁ − u₀, u₂ − u₁, u₃ − u₂, u₄ − u₃)ᵀ
Circulant:
(u₁ − u₀, u₂ − u₁, u₃ − u₂, u₄ − u₃, u₀ − u₄)ᵀ
TOTAL VARIATION
• Discrete gradient operator: ∇x = (x₂ − x₁, x₃ − x₂, x₄ − x₃)
• Linear filter with stencil (−1, 1)
TV(x) = Σᵢ |xᵢ₊₁ − xᵢ| = |∇x|
Is this a norm?
TV IN 2D
(∇x)ᵢⱼ = (xᵢ₊₁,ⱼ − xᵢ,ⱼ , xᵢ,ⱼ₊₁ − xᵢ,ⱼ)
Anisotropic: |(∇x)ᵢⱼ| = |xᵢ₊₁,ⱼ − xᵢ,ⱼ| + |xᵢ,ⱼ₊₁ − xᵢ,ⱼ|
Isotropic: ‖(∇x)ᵢⱼ‖ = √((xᵢ₊₁,ⱼ − xᵢⱼ)² + (xᵢ,ⱼ₊₁ − xᵢⱼ)²)
For an image I: gx = Dx I, gy = Dy I
TV_iso(x) = |∇x| = Σᵢⱼ √((xᵢ₊₁,ⱼ − xᵢⱼ)² + (xᵢ,ⱼ₊₁ − xᵢⱼ)²)
TV_an(x) = |∇x| = Σᵢⱼ |xᵢ₊₁,ⱼ − xᵢⱼ| + |xᵢ,ⱼ₊₁ − xᵢⱼ|
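Both TV variants are direct to compute with np.diff. This sketch also checks the intuition that a piecewise-constant image has small TV while a noisy one does not (the image sizes and noise level are my own):

```python
import numpy as np

def tv_aniso(x):
    """Anisotropic TV: sum |x_{i+1,j} - x_{ij}| + |x_{i,j+1} - x_{ij}|."""
    dx = np.diff(x, axis=0)
    dy = np.diff(x, axis=1)
    return np.abs(dx).sum() + np.abs(dy).sum()

def tv_iso(x):
    """Isotropic TV: sum sqrt((x_{i+1,j}-x_{ij})^2 + (x_{i,j+1}-x_{ij})^2)."""
    dx = np.diff(x, axis=0)[:, :-1]    # crop so both differences align
    dy = np.diff(x, axis=1)[:-1, :]
    return np.sqrt(dx ** 2 + dy ** 2).sum()

# a piecewise-constant image has small TV; a noisy one has large TV
flat = np.ones((16, 16))
flat[:, 8:] = 2.0                       # one vertical edge of height 1
noisy = flat + 0.5 * np.random.default_rng(6).standard_normal((16, 16))
```

The flat image pays only for its single edge, while the noise pays on every pixel pair — which is exactly why TV penalties denoise without destroying edges.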
IMAGE RESTORATION
[figure: original and noisy images]
TOTAL VARIATION DENOISING
λ|∇u| + ½‖u − f‖²  vs.  λ‖∇u‖² + ½‖u − f‖²
[figure: the two denoised results, with a 1D slice through each]
TV IMAGING PROBLEMS
Denoising (ROF): minimize λ|∇u| + ½‖u − f‖²
Deblurring: minimize λ|∇u| + ½‖Ku − f‖²
TVL1: minimize λ|∇u| + |u − f|
MACHINE LEARNING TOPICS
LINEAR CLASSIFIERS
• feature vectors: vectors in ℝⁿ containing descriptions of objects — training data d₁, d₂, …
• labels: +1/−1 labels indicating which "class" each vector lies in
LINEAR CLASSIFIERS
• goal: learn a line dᵀw = 0 that separates the two object classes
[figure: two point clouds d₁, d₂, …, d₁₀ in the plane, with candidate separating lines]
• what line is best?
SUPPORT VECTOR MACHINE
SVM: choose the line with maximum "margin"
dᵀw = 0, with dᵀw = 1 and dᵀw = −1 on the margin boundaries
margin width = 1/‖w‖
SUPPORT VECTOR MACHINE
hinge loss: penalize points that lie within the margin
For class +1… Σᵢ h(dᵢᵀw)
For class −1… Σᵢ h(−dᵢᵀw)
In general… Σᵢ h(yᵢ · dᵢᵀw)
SUPPORT VECTOR MACHINE
In general… Σᵢ h(yᵢ · dᵢᵀw); short-hand: h(D̂w)
combined objective, with margin width 1/‖w‖:
minimize ½‖w‖² + h(D̂w)
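A subgradient-descent sketch of (a rescaled version of) this objective — the regularization weight, step size, and toy data are my own, and real SVM solvers are far more sophisticated:

```python
import numpy as np

def svm_train(D, y, lam=0.01, lr=0.1, steps=500):
    """Subgradient descent on 0.5*lam*||w||^2 + (1/n) sum_i h(y_i * d_i^T w),
    with hinge loss h(z) = max(0, 1 - z)."""
    n, p = D.shape
    w = np.zeros(p)
    for _ in range(steps):
        margins = y * (D @ w)
        viol = margins < 1                  # points inside (or past) the margin
        g = lam * w                         # gradient of the penalty
        if np.any(viol):
            # subgradient of the hinge: -y_i d_i for each violating point
            g = g - (y[viol][:, None] * D[viol]).sum(axis=0) / n
        w -= lr * g
    return w

rng = np.random.default_rng(7)
D = np.vstack([rng.standard_normal((50, 2)) + 2.0,
               rng.standard_normal((50, 2)) - 2.0])
y = np.hstack([np.ones(50), -np.ones(50)])
w = svm_train(D, y)
train_acc = np.mean(np.sign(D @ w) == y)
```

On two well-separated Gaussian clouds, even this crude solver finds a w that classifies nearly all training points correctly.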
[figure: the logistic function]
LOGISTIC REGRESSION
f(x) = eˣ / (1 + eˣ)
P[y = 1] = e^{dᵀw} / (1 + e^{dᵀw})
P[y = −1] = 1 − e^{dᵀw}/(1 + e^{dᵀw}) = 1 / (1 + e^{dᵀw})
LIKELIHOOD FUNCTIONS
probability of observing the label given the data:
P[y = 1] = e^{dᵀw} / (1 + e^{dᵀw})
P[y = −1] = 1 / (1 + e^{dᵀw})
case: observed y = 1 → NLL(d) = log(1 + e^{dᵀw}) − dᵀw
case: observed y = −1 → NLL(d) = log(1 + e^{dᵀw})
[figure: the two NLL curves as functions of dᵀw]
shorthand:
ℓ(x, y = 1) = log(1 + eˣ) − x
ℓ(x, y = 0) = log(1 + eˣ)
LOGISTIC OBJECTIVE
probability of observing the label given the data — shorthand:
ℓ(x, y = 1) = log(1 + eˣ) − x
ℓ(x, y = 0) = log(1 + eˣ)
f(w) = Σ_{yᵢ=1} ℓ(dᵢᵀw) + Σ_{yᵢ=−1} ℓ(−dᵢᵀw)
f(w) = L(D̂w), where L(z) = log(1 + e⁻ᶻ) = log(1 + eᶻ) − z
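The objective L(D̂w) and its gradient in numpy, using np.logaddexp for a numerically stable log(1 + e⁻ᶻ); the toy data and step size are my own choices:

```python
import numpy as np

def logistic_loss(w, D, y):
    """f(w) = sum_i L(y_i * d_i^T w), with L(z) = log(1 + e^{-z})."""
    z = y * (D @ w)
    return np.sum(np.logaddexp(0.0, -z))    # stable log(1 + exp(-z))

def logistic_grad(w, D, y):
    z = y * (D @ w)
    s = 1.0 / (1.0 + np.exp(z))             # sigmoid of -z
    return -(D.T @ (y * s))

rng = np.random.default_rng(8)
D = np.vstack([rng.standard_normal((50, 2)) + 1.5,
               rng.standard_normal((50, 2)) - 1.5])
y = np.hstack([np.ones(50), -np.ones(50)])

w = np.zeros(2)
for _ in range(300):                        # plain gradient descent
    w -= 0.005 * logistic_grad(w, D, y)
```

Unlike the hinge loss, this objective is smooth everywhere, so plain gradient descent applies directly.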
NEURAL NETWORKS
[figure: network mapping dᵢ through layers W₁, W₂, W₃ to yᵢ]
ARTIFICIAL NEURON
output = f(Σᵢ wᵢzᵢ)
NEURAL NETWORKS
f(f(f(dᵢW₁)W₂)W₃) = yᵢ
TRAINING
f(f(f(dᵢW₁)W₂)W₃) = yᵢ
least-squares loss: ℓ(z, y) = (z − y)²
min_{W₁,W₂,W₃} Σᵢ ‖f(f(f(dᵢW₁)W₂)W₃) − yᵢ‖²
"cross entropy"/log-likelihood loss:
ℓ(z, y) = −log(z) if y = 1; −log(1 − z) if y = −1
min_{W₁,W₂,W₃} Σᵢ ℓ(f(f(f(dᵢW₁)W₂)W₃), yᵢ)
MULTI-CLASS NETWORK
target label yᵢ: "one-hot" representation, e.g. (0, 0, 1)
network outputs for inputs (d¹ᵢ, d²ᵢ, d³ᵢ): e.g. (.75, −.1, 1.5) — interpretable?
MULTI-CLASS NETWORK
probability interpretation: (.09, .01, .9)
SOFTMAX
Given outputs (z₁, z₂, z₃) = (.75, −.1, 1.5):
• make non-negative: e^{z₁}, e^{z₂}, e^{z₃}
• compute the normalization: Σᵢ e^{zᵢ}
• probabilities: e^{z₁}/Σᵢe^{zᵢ}, e^{z₂}/Σᵢe^{zᵢ}, e^{z₃}/Σᵢe^{zᵢ} → (.09, .01, .9)
CROSS-ENTROPY LOSS
predicted probabilities (.09, .01, .9), target label yᵢ = (0, 0, 1)
NLL = −log(.9)
ℓ(x, y) = −Σ_k y_k log(x_k)
"cross entropy loss" is the negative log likelihood
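The softmax and cross-entropy steps on these slides, in numpy. Subtracting the max before exponentiating is the standard numerical-stability trick (it cancels in the ratio); the probabilities printed on the slide are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(p, y_onehot):
    """NLL of the one-hot target under predicted probabilities p."""
    return -np.sum(y_onehot * np.log(p))

z = np.array([0.75, -0.1, 1.5])    # network outputs from the slides
p = softmax(z)
y = np.array([0.0, 0.0, 1.0])      # one-hot target label
loss = cross_entropy(p, y)         # reduces to -log(p[2])
```

Because the target is one-hot, the whole sum collapses to the negative log probability assigned to the correct class — exactly the NLL on the slide.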