University of Maryland
OPTIMIZATION PROBLEMS OVERVIEW
Lecture 1 - CMSC764 / AMSC607 SP18
WHAT IS OPTIMIZATION?
A: Minimizing things
In college you learned: set the derivative to zero.
Sounds easy.
convex!
BUT THEN…
What if there's no closed-form solution?
What if the problem has 1 BILLION dimensions?
What if the problem is non-convex?
What if the function has no derivative?
What if there are constraints?
What if the objective function has a BILLION terms?
Does this ever really happen?
MODEL FITTING PROBLEMS
BASIC OPTIMIZATION PROBLEMS: MODEL FITTING
• model parameters: w
• training data / inputs: dᵢ; label data / outputs: bᵢ
• loss function: min_w Σᵢ ℓ(dᵢ, w, bᵢ)
• Example: linear model f(dᵢ, w) = dᵢᵀw = bᵢ
• least-squares loss: ℓ(dᵢ, w, bᵢ) = (dᵢᵀw − bᵢ)², i.e. min_w ‖Dw − b‖²
BASIC OPTIMIZATION PROBLEMS: MODEL FITTING
penalized regressions
• general form: min_w J(w) + Σᵢ ℓ(dᵢ, w, bᵢ)
• with the least-squares loss ℓ(dᵢ, w, bᵢ) = (dᵢᵀw − bᵢ)²
• ridge penalty ("prior"): min_w ‖w‖₂² + ‖Dw − b‖²
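The two objectives above fit in a few lines of numpy. A minimal sketch — the synthetic data, seed, and penalty weight are my choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((100, 5))                 # rows are the training vectors d_i
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
b = D @ w_true + 0.1 * rng.standard_normal(100)   # noisy labels b_i

# Least squares: minimize ||Dw - b||^2
w_ls = np.linalg.lstsq(D, b, rcond=None)[0]

# Ridge penalty: minimize lam*||w||_2^2 + ||Dw - b||^2,
# via the normal equations (D^T D + lam*I) w = D^T b
lam = 0.1
w_ridge = np.linalg.solve(D.T @ D + lam * np.eye(5), D.T @ b)
```

With this much data and this little noise, both recover w_true closely; the penalty starts to matter when D is ill-conditioned or under-determined, which is exactly where the lecture goes next.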
EIGENVALUE DECOMPOSITION
• Spectral theorem: symmetric matrices have a complete, orthogonal set of eigenvectors
A = VΛVᵀ — a change of basis
EIGENVALUE DECOMPOSITION
• The action of the matrix is described by its eigenvalues
x = α₁e₁ + α₂e₂ + α₃e₃
Ax = A(α₁e₁ + α₂e₂ + α₃e₃)
  = α₁Ae₁ + α₂Ae₂ + α₃Ae₃
  = α₁λ₁e₁ + α₂λ₂e₂ + α₃λ₃e₃
MATRIX INVERSE
• The action of the inverse is described by the reciprocal eigenvalues
Ax = α₁λ₁e₁ + α₂λ₂e₂ + α₃λ₃e₃
A⁻¹x = α₁λ₁⁻¹e₁ + α₂λ₂⁻¹e₂ + α₃λ₃⁻¹e₃
ESTIMATION PROBLEM
Suppose λ₁ = 1, λ₂ = 0.1, λ₃ = 0.01
Ax = b = α₁e₁ + 0.1α₂e₂ + 0.01α₃e₃
I tell you b̂ = b + η (noise)
Can you recover x?
(do on board)
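The board exercise replays directly in numpy: with eigenvalues 1, 0.1, 0.01 as above, tiny noise in b̂ is amplified by up to cond(A) = 100 when we solve for x. The specific noise vector below is my own illustrative choice:

```python
import numpy as np

A = np.diag([1.0, 0.1, 0.01])            # symmetric matrix, eigenvalues 1, 0.1, 0.01
x = np.array([1.0, 1.0, 1.0])
b = A @ x

eta = np.array([1e-3, -1e-3, 1e-3])      # small noise, about 0.2% of ||b||
b_hat = b + eta                          # "I tell you b_hat = b + noise"
x_hat = np.linalg.solve(A, b_hat)        # naive recovery

rel_err_b = np.linalg.norm(b_hat - b) / np.linalg.norm(b)
rel_err_x = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
```

The component of the noise along the small-eigenvalue direction gets multiplied by 1/0.01 = 100, so the relative error in x is over an order of magnitude worse than the relative error in b.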
Does this ever really happen?
CONDITION NUMBER
• Ratio of largest to smallest singular value:
κ = σ_max/σ_min = |λ_max/λ_min| = ‖A‖‖A⁻¹‖
Why are these definitions the same for symmetric matrices?
What is the condition number of our problem?
• κ controls how large the relative error ‖x̂ − x‖/‖x‖ can be compared to the perturbation ‖b − b̂‖/‖b‖ when we solve Ax = b vs. Ax = b̂
DOES THIS EVER HAPPEN IN REAL LIFE?
• No. The situation is never this good.
• Common in optimization: regularization and IPMs (interior-point methods)
• Example: convolution
[figure: kernel * signal = output]
CONVOLUTION MATRIX
[figure: 600×600 convolution matrix]
Condition number: 3,500,000
CONVOLUTION: 2D
[figure: 2D convolution of an image with a blur kernel K]
DEBLURRING
[figure: deblurring by applying K⁻¹, with and without noise; relative difference 0.0092]
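To see where a number like 3,500,000 comes from, one can build a small circulant convolution matrix and check its conditioning directly. A sketch under my own choice of kernel and size, using the fact that the eigenvalues of a circulant matrix are the DFT of its first column:

```python
import numpy as np

n = 64
t = np.arange(n)
# Gaussian blur kernel, wrapped circularly around the signal boundary
kernel = np.exp(-0.5 * (np.minimum(t, n - t) / 2.0) ** 2)
kernel /= kernel.sum()

# Circulant convolution matrix: column i is the kernel rolled by i
K = np.stack([np.roll(kernel, i) for i in range(n)], axis=1)

# Eigenvalues of a circulant matrix = DFT of its first column,
# so the condition number is max|fft| / min|fft|
eigs = np.abs(np.fft.fft(K[:, 0]))
cond = eigs.max() / eigs.min()
```

Even this mild blur at n = 64 yields a condition number in the hundreds of millions: blurring kills high frequencies, so undoing it amplifies them enormously.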
UNDER-DETERMINED SYSTEMS
• Another problem: what if the matrix isn't even full rank?
• A ∈ ℝ^{M×N} with M < N, and b = Ax + η
• If the error is bounded (‖η‖ ≤ ε), solve
minimize ‖x‖ subject to ‖Ax − b‖ ≤ ε
• "Occam's razor": prefer the simplest x that explains the data
GEOMETRIC INTERPRETATION
[figure: a line of solutions in ℝᴺ]
All points on the red line satisfy y = Ax!
We pick the point with the smallest ℓ₂ norm!
RIDGE REGRESSION
• If the error is bounded (‖η‖ ≤ ε), with b = Ax + η, solve
minimize ‖x‖ subject to ‖Ax − b‖ ≤ ε
• This is equivalent to
minimize λ‖x‖² + ‖b − Ax‖²
for some value of λ. Why? Show on board.
RIDGE REGRESSION
minimize λ‖x‖² + ‖b − Ax‖²
Closed-form solution!
x = (AᵀA + λI)⁻¹Aᵀb
What does this do to the condition number?
New condition number:
(σ²_max + λ) / (σ²_min + λ)
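The new condition number is easy to sanity-check numerically; the matrix sizes, singular-value spread, and λ below are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(2)
# a matrix whose singular values are spread over several orders of magnitude
A = rng.standard_normal((50, 20)) @ np.diag(np.logspace(0, -4, 20))
b = rng.standard_normal(50)

# ridge solution x = (A^T A + lam*I)^{-1} A^T b
lam = 1e-2
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(20), A.T @ b)

s = np.linalg.svd(A, compute_uv=False)
cond_normal_eq = (s.max() / s.min()) ** 2                  # cond of A^T A
cond_ridge = (s.max() ** 2 + lam) / (s.min() ** 2 + lam)   # new condition number
```

Since (σ²_max + λ)/(σ²_min + λ) < σ²_max/σ²_min whenever λ > 0 and σ_min < σ_max, the regularized system is always better conditioned than the raw normal equations.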
TIKHONOV REGULARIZATION
minimize λ‖x‖² + ‖b − Ax‖²
• Has many names ("ridge regression" in statistics)
• Advantage: the new problem is easier to solve
• Improved condition number (less noise sensitivity)
• The parameter λ can be set:
  • empirically (e.g. cross-validation)
  • using noise bounds + theory (BIC, etc.)
BAYESIAN INTERPRETATION
• Model data formation: write distribution of data given parameters
• Observe data from random process
• Use Bayes rule: write distribution of parameters given data
• Find “most likely” parameters given data
• Optional: uncertainty quantification / confidence
BAYESIAN INTERPRETATION
• Assumptions:
  • We know the expected signal power: E{xᵢ²} = E
  • Linear measurement model: b = Ax + η
  • Noise is i.i.d. Gaussian: η ~ N(0, σ)
• MAP (maximum a-posteriori) estimate:
maximize_x p(x|b)
which will turn out to be
minimize λ‖x‖² + ‖b − Ax‖²
BAYESIAN INTERPRETATION
maximize_x p(x|b)
Bayes' Rule: p(X|Y) ∝ p(Y|X) p(X)
maximize_x p(x|b) ∝ p(b|x) p(x)
model: b ~ N(Ax, σ)
prior: x ~ N(0, √E)
BAYESIAN INTERPRETATION
maximize_x p(x|b) = p(b|x) p(x)
prior: p(x) = (2π)^{−n/2} e^{−‖x‖²/(2E)}
probability of data: p(b|x) = Πᵢ (1/√(2π)) e^{−(bᵢ − (Ax)ᵢ)²/(2σ²)} = (2π)^{−m/2} e^{−‖b−Ax‖²/(2σ²)}
Dropping constants:
maximize exp(−‖b−Ax‖²/σ²) · exp(−‖x‖²/E)
NEGATIVE LOG-LIKELIHOOD
maximize_x p(x|b) = p(b|x) p(x)
maximize exp(−‖b−Ax‖²/σ²) · exp(−‖x‖²/E)
Take logs:
maximize −‖b−Ax‖²/σ² − ‖x‖²/E
Equivalently, minimize the NLL:
minimize (σ²/E)‖x‖² + ‖b − Ax‖²
SPARSE PRIORS
• Priors add information to the problem
• Ridge/Tikhonov priors require a lot of assumptions
• A prior is only good when its assumptions are true!
• A very general prior : sparsity
WHAT IS SPARSITY?
• The signal has very few non-zeros: small ℓ₀ "norm"
[figure: a sparse vector — amplitude vs. vector index]
OTHER NOTIONS OF SPARSITY
• "Low density" signals — rapid decay of coefficients
[figure: magnitude vs. DCT index, showing fast decay]
• Fast decay = small weak ℓ_p norm
DENSE SIGNAL / SPARSE REPRESENTATION: AUDIO
[figure: audio signal and its DCT]
• Sounds are produced by vibrating objects
• Energy is concentrated at the resonance frequencies of the object
• These are defined by eigenvalues of the Laplacian of the vibrating surface
DENSE SIGNAL / SPARSE REPRESENTATION: AUDIO
[figure: echolocation chirp of a brown bat — Gabor transform (STFT), time vs. frequency]
• The bat hears the convolution of its signal with the environment
• Chirps generate well-conditioned convolution matrices
source: Christoph Studer
DENSE SIGNAL / SPARSE REPRESENTATION: IMAGES
• Approximately piecewise constant
• High correlations between adjacent pixels within objects
• High variation across edges
[figure: wavelet transform of a natural image]
Credit: Mark Davenport
NEURAL EVENTS
• Neural potentials: convolution of a spike train with a kernel
[figure: real recording, spiking events, kernel]
COMPRESSED SENSING
y = Ax, where A is a "nice" M×N matrix with M < N
• Assume x has only k ≪ N non-zero entries
• If x is sufficiently sparse, then one can uniquely recover x from y
MACHINE LEARNING: OVERFITTING
Ax = b: features/data A, model parameters x, labels b
• Happens when you can design the measurement matrix
• More features = better fit
EXAMPLE: POLYNOMIAL FITTING
[figure: n = 20 noisy data points drawn from a polynomial]
What degree is best?
EXAMPLE: POLYNOMIAL FITTING
[figure: the n = 20 data with a fitted curve, and model fitting error vs. degree — the error only decreases as the degree grows toward 17]
EXAMPLE: POLYNOMIAL FITTING
[figure: fit with degree = 17]
EXAMPLE: POLYNOMIAL FITTING
[figure: the degree-17 fit next to the original data]
Fit the SIGNAL, not the NOISE!
bad "out of sample" error
WHY DID THIS HAPPEN?
[figure: the n = 20 data and the model fitting error vs. degree curve again, with degree 17 marked]
WHY DID THIS HAPPEN?
[figure: model fitting error vs. degree]
κ = 3e16
We know we're overfitting just from the condition number.
Why?
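The blow-up is easy to reproduce with a Vandermonde "measurement matrix". The underlying cubic, noise level, and seed below are mine, but the effect is the one on the slides: the fitting error only decreases with degree while the condition number explodes.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
t = np.sort(rng.uniform(0, 1, n))
b = t ** 3 - t + 0.05 * rng.standard_normal(n)   # noisy samples of a cubic

conds, resids = {}, {}
for degree in (3, 17):
    V = np.vander(t, degree + 1)                 # polynomial "measurement matrix"
    coeffs, *_ = np.linalg.lstsq(V, b, rcond=None)
    conds[degree] = np.linalg.cond(V)            # conditioning of the fit
    resids[degree] = np.linalg.norm(V @ coeffs - b)
```

The degree-17 column space contains the degree-3 one, so its residual is necessarily smaller — yet its enormous condition number warns that the extra fit is noise, not signal.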
SPARSE RECOVERY PROBLEMS
minimize ‖x‖₀ subject to Ax = b
"Fat" matrix A (underdetermined), dense measurements b, sparse solution x
COMPLEXITY ALERT: NP-complete
(reductions to subset cover and SAT)
Nonetheless, greedy methods can solve it:
• Orthogonal Matching Pursuit (OMP)
• Stagewise methods: StOMP, CoSaMP, etc.
The ℓ₀ penalty is used to control over-fitting.
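OMP itself is only a few lines: greedily pick the column most correlated with the current residual, then re-fit by least squares on the chosen support. A minimal sketch on a toy problem of my own making, not a reference implementation:

```python
import numpy as np

def omp(A, b, k):
    """Greedy orthogonal matching pursuit: pick k columns of A to explain b."""
    m, n = A.shape
    support, residual, coeffs = [], b.copy(), np.zeros(0)
    for _ in range(k):
        # column most correlated with the residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        support.append(j)
        # least-squares fit on the chosen support, then update the residual
        coeffs, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support] @ coeffs
    x = np.zeros(n)
    x[support] = coeffs
    return x

rng = np.random.default_rng(4)
A = rng.standard_normal((40, 100))
A /= np.linalg.norm(A, axis=0)          # normalize columns
x_true = np.zeros(100)
x_true[[5, 37, 80]] = [2.0, -1.5, 1.0]  # 3-sparse ground truth
b = A @ x_true
x_rec = omp(A, b, 3)
```

With 40 random measurements of a 3-sparse vector (and no noise), OMP typically recovers the support exactly and drives the residual to zero.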
L0 IS NOT CONVEX
minimize ‖x‖₀ subject to Ax = b
Convex relaxation:
minimize |x| subject to Ax = b
WHY USE L1?
• L1 is the "tightest" convex relaxation of L0
SPARSE OPTIMIZATION PROBLEMS
Basis Pursuit: minimize |x| subject to Ax = b
Basis Pursuit Denoising: minimize λ|x| + ½‖Ax − b‖²
Lasso: minimize ½‖Ax − b‖² subject to |x| ≤ λ
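Basis pursuit denoising can be attacked with iterative soft-thresholding (ISTA), a simple proximal-gradient sketch; the problem size, λ, seed, and iteration count below are my own choices:

```python
import numpy as np

def ista(A, b, lam, iters=500):
    """ISTA for: minimize lam*|x|_1 + 0.5*||Ax - b||^2."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - b)              # gradient of the smooth part
        z = x - g / L                      # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((30, 60))
x_true = np.zeros(60)
x_true[[3, 25]] = [1.5, -2.0]
b = A @ x_true + 0.01 * rng.standard_normal(30)
x_hat = ista(A, b, lam=0.1)
```

The soft-threshold step sets small entries exactly to zero, which is what makes the ℓ₁ penalty produce genuinely sparse solutions.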
BAYESIAN LAND!
Basis Pursuit Denoising: minimize λ|x| + ½‖Ax − b‖²
What prior is this?
p(x) = e^{−|x|}
Laplace distribution — "robust to outliers"
HOW TO SET LAMBDA?
minimize out-of-sample error
hold out 30% of the training data as a test set
minimize λ|x| + ½‖Ax − b‖²
CROSS VALIDATION
minimize out-of-sample error
solve minimize λ|x| + ½‖Ax − b‖² on the training data,
and choose λ to minimize the test error
CROSS VALIDATION
minimize out-of-sample error
[figure: training and test error vs. λ — idealistic curves, no sampling noise]
CROSS VALIDATION
K-fold CV · leave-one-out CV · random sampling CV
AFTER MODEL SELECTION
de-biasing:
  minimize λ|x| + ½‖A_tr x − b_tr‖²  (fit on the training data; this selects a support C)
  minimize ½‖A_all x − b_all‖² subject to x ∈ C  (re-fit on all data, without the shrinkage of the ℓ₁ penalty)
ensemble learning:
  minimize λ|x| + ½‖A₁x − b₁‖²
  minimize λ|x| + ½‖A₂x − b₂‖²
  ...
  minimize λ|x| + ½‖A_K x − b_K‖²
These methods work best on small problems!
CO-SPARSITY
• Sometimes the signal is sparse under a transform Φ (analysis formulation):
minimize λ|Φx| + ½‖Ax − b‖²
• When the transform is invertible, we can instead use the synthesis formulation:
minimize λ|v| + ½‖AΦ⁻¹v − b‖²
• Otherwise, use the analysis formulation
• The thing inside the L1 norm is sparse!
EXAMPLE: IMAGE PROCESSING

LINEAR FILTER
Signal: 2 3 1 1 4
Filter/stencil: [−1 1 0]
Output: 1

LINEAR FILTER
Signal: 2 3 1 1 4
Filter/stencil: [−1 1 0]
Output: 1 −2

LINEAR FILTER
Signal: 2 3 1 1 4
Filter/stencil: [−1 1 0]
Output: 1 −2 0
Now what??
CIRCULANT BOUNDARY
Signal: 2 3 1 1 4
Filter/stencil: [−1 1 0]
At the end of the signal, wrap around to the start:
Output: 1 −2 0 3 −2
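The circulant filter above is one line of numpy, and writing it as a matrix makes the linearity explicit:

```python
import numpy as np

x = np.array([2.0, 3.0, 1.0, 1.0, 4.0])   # the signal from the slides

# stencil [-1 1 0]: output_i = x_{i+1} - x_i, wrapping circularly at the end
out = np.roll(x, -1) - x

# the same filter as multiplication by a circulant difference matrix
n = len(x)
C = np.roll(np.eye(n), 1, axis=1) - np.eye(n)
```

Both give the output 1, −2, 0, 3, −2 computed step by step on the slides, with the last entry coming from the wrap-around difference 2 − 4.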
IMAGE GRADIENT
Stencil [−1 1 0]
Neumann:
(u₁ − u₀, u₂ − u₁, u₃ − u₂, u₄ − u₃)ᵀ
Circulant:
(u₁ − u₀, u₂ − u₁, u₃ − u₂, u₄ − u₃, u₀ − u₄)ᵀ
TOTAL VARIATION
• Discrete gradient operator: ∇x = (x₂ − x₁, x₃ − x₂, x₄ − x₃)
• Linear filter with stencil (−1, 1)
TV(x) = Σᵢ |xᵢ₊₁ − xᵢ| = |∇x|
Is this a norm?
TV IN 2D
(∇x)ᵢⱼ = (xᵢ₊₁,ⱼ − xᵢ,ⱼ , xᵢ,ⱼ₊₁ − xᵢ,ⱼ)
Anisotropic: |(∇x)ᵢⱼ| = |xᵢ₊₁,ⱼ − xᵢ,ⱼ| + |xᵢ,ⱼ₊₁ − xᵢ,ⱼ|
Isotropic: ‖(∇x)ᵢⱼ‖ = √((xᵢ₊₁,ⱼ − xᵢⱼ)² + (xᵢ,ⱼ₊₁ − xᵢⱼ)²)
For an image I: gx = Dx I, gy = Dy I
TV_iso(x) = |∇x| = Σᵢⱼ √((xᵢ₊₁,ⱼ − xᵢⱼ)² + (xᵢ,ⱼ₊₁ − xᵢⱼ)²)
TV_an(x) = |∇x| = Σᵢⱼ |xᵢ₊₁,ⱼ − xᵢⱼ| + |xᵢ,ⱼ₊₁ − xᵢⱼ|
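Both TV variants are direct to compute with np.diff. This sketch also checks the intuition that a piecewise-constant image has small TV while a noisy one does not (the image sizes and noise level are my own):

```python
import numpy as np

def tv_aniso(x):
    """Anisotropic TV: sum |x_{i+1,j} - x_{ij}| + |x_{i,j+1} - x_{ij}|."""
    dx = np.diff(x, axis=0)
    dy = np.diff(x, axis=1)
    return np.abs(dx).sum() + np.abs(dy).sum()

def tv_iso(x):
    """Isotropic TV: sum sqrt((x_{i+1,j}-x_{ij})^2 + (x_{i,j+1}-x_{ij})^2)."""
    dx = np.diff(x, axis=0)[:, :-1]    # crop so both differences align
    dy = np.diff(x, axis=1)[:-1, :]
    return np.sqrt(dx ** 2 + dy ** 2).sum()

# a piecewise-constant image has small TV; a noisy one has large TV
flat = np.ones((16, 16))
flat[:, 8:] = 2.0                       # one vertical edge of height 1
noisy = flat + 0.5 * np.random.default_rng(6).standard_normal((16, 16))
```

The flat image pays only for its single edge, while the noise pays on every pixel pair — which is exactly why TV penalties denoise without destroying edges.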
IMAGE RESTORATION
[figure: original and noisy images]
TOTAL VARIATION DENOISING
λ|∇u| + ½‖u − f‖²  vs.  λ‖∇u‖² + ½‖u − f‖²
[figure: the two denoised results, with a 1D slice through each]
TV IMAGING PROBLEMS
Denoising (ROF): minimize λ|∇u| + ½‖u − f‖²
Deblurring: minimize λ|∇u| + ½‖Ku − f‖²
TVL1: minimize λ|∇u| + |u − f|
MACHINE LEARNING TOPICS
LINEAR CLASSIFIERS
• feature vectors: vectors in ℝⁿ containing descriptions of objects — training data d₁, d₂, …
• labels: +1/−1 labels indicating which "class" each vector lies in
LINEAR CLASSIFIERS
• goal: learn a line dᵀw = 0 that separates the two object classes
[figure: two point clouds d₁, d₂, …, d₁₀ in the plane, with candidate separating lines]
• what line is best?
SUPPORT VECTOR MACHINE
SVM: choose the line with maximum "margin"
dᵀw = 0, with dᵀw = 1 and dᵀw = −1 on the margin boundaries
margin width = 1/‖w‖
SUPPORT VECTOR MACHINE
hinge loss: penalize points that lie within the margin
For class +1… Σᵢ h(dᵢᵀw)
For class −1… Σᵢ h(−dᵢᵀw)
In general… Σᵢ h(yᵢ · dᵢᵀw)
SUPPORT VECTOR MACHINE
In general… Σᵢ h(yᵢ · dᵢᵀw); short-hand: h(D̂w)
combined objective, with margin width 1/‖w‖:
minimize ½‖w‖² + h(D̂w)
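A subgradient-descent sketch of (a rescaled version of) this objective — the regularization weight, step size, and toy data are my own, and real SVM solvers are far more sophisticated:

```python
import numpy as np

def svm_train(D, y, lam=0.01, lr=0.1, steps=500):
    """Subgradient descent on 0.5*lam*||w||^2 + (1/n) sum_i h(y_i * d_i^T w),
    with hinge loss h(z) = max(0, 1 - z)."""
    n, p = D.shape
    w = np.zeros(p)
    for _ in range(steps):
        margins = y * (D @ w)
        viol = margins < 1                  # points inside (or past) the margin
        g = lam * w                         # gradient of the penalty
        if np.any(viol):
            # subgradient of the hinge: -y_i d_i for each violating point
            g = g - (y[viol][:, None] * D[viol]).sum(axis=0) / n
        w -= lr * g
    return w

rng = np.random.default_rng(7)
D = np.vstack([rng.standard_normal((50, 2)) + 2.0,
               rng.standard_normal((50, 2)) - 2.0])
y = np.hstack([np.ones(50), -np.ones(50)])
w = svm_train(D, y)
train_acc = np.mean(np.sign(D @ w) == y)
```

On two well-separated Gaussian clouds, even this crude solver finds a w that classifies nearly all training points correctly.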
[figure: the logistic function]
LOGISTIC REGRESSION
f(x) = eˣ / (1 + eˣ)
P[y = 1] = e^{dᵀw} / (1 + e^{dᵀw})
P[y = −1] = 1 − e^{dᵀw}/(1 + e^{dᵀw}) = 1 / (1 + e^{dᵀw})
LIKELIHOOD FUNCTIONS
probability of observing the label given the data:
P[y = 1] = e^{dᵀw} / (1 + e^{dᵀw})
P[y = −1] = 1 / (1 + e^{dᵀw})
case: observed y = 1 → NLL(d) = log(1 + e^{dᵀw}) − dᵀw
case: observed y = −1 → NLL(d) = log(1 + e^{dᵀw})
[figure: the two NLL curves as functions of dᵀw]
shorthand:
ℓ(x, y = 1) = log(1 + eˣ) − x
ℓ(x, y = 0) = log(1 + eˣ)
LOGISTIC OBJECTIVE
probability of observing the label given the data — shorthand:
ℓ(x, y = 1) = log(1 + eˣ) − x
ℓ(x, y = 0) = log(1 + eˣ)
f(w) = Σ_{yᵢ=1} ℓ(dᵢᵀw) + Σ_{yᵢ=−1} ℓ(−dᵢᵀw)
f(w) = L(D̂w), where L(z) = log(1 + e⁻ᶻ) = log(1 + eᶻ) − z
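The objective L(D̂w) and its gradient in numpy, using np.logaddexp for a numerically stable log(1 + e⁻ᶻ); the toy data and step size are my own choices:

```python
import numpy as np

def logistic_loss(w, D, y):
    """f(w) = sum_i L(y_i * d_i^T w), with L(z) = log(1 + e^{-z})."""
    z = y * (D @ w)
    return np.sum(np.logaddexp(0.0, -z))    # stable log(1 + exp(-z))

def logistic_grad(w, D, y):
    z = y * (D @ w)
    s = 1.0 / (1.0 + np.exp(z))             # sigmoid of -z
    return -(D.T @ (y * s))

rng = np.random.default_rng(8)
D = np.vstack([rng.standard_normal((50, 2)) + 1.5,
               rng.standard_normal((50, 2)) - 1.5])
y = np.hstack([np.ones(50), -np.ones(50)])

w = np.zeros(2)
for _ in range(300):                        # plain gradient descent
    w -= 0.005 * logistic_grad(w, D, y)
```

Unlike the hinge loss, this objective is smooth everywhere, so plain gradient descent applies directly.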
NEURAL NETWORKS
[figure: network mapping dᵢ through layers W₁, W₂, W₃ to yᵢ]
ARTIFICIAL NEURON
output = f(Σᵢ wᵢzᵢ)
NEURAL NETWORKS
f(f(f(dᵢW₁)W₂)W₃) = yᵢ
TRAINING
f(f(f(dᵢW₁)W₂)W₃) = yᵢ
least-squares loss: ℓ(z, y) = (z − y)²
min_{W₁,W₂,W₃} Σᵢ ‖f(f(f(dᵢW₁)W₂)W₃) − yᵢ‖²
"cross entropy"/log-likelihood loss:
ℓ(z, y) = −log(z) if y = 1; −log(1 − z) if y = −1
min_{W₁,W₂,W₃} Σᵢ ℓ(f(f(f(dᵢW₁)W₂)W₃), yᵢ)
MULTI-CLASS NETWORK
target label yᵢ: "one-hot" representation, e.g. (0, 0, 1)
network outputs for inputs (d¹ᵢ, d²ᵢ, d³ᵢ): e.g. (.75, −.1, 1.5) — interpretable?
MULTI-CLASS NETWORK
probability interpretation: (.09, .01, .9)
SOFTMAX
Given outputs (z₁, z₂, z₃) = (.75, −.1, 1.5):
• make non-negative: e^{z₁}, e^{z₂}, e^{z₃}
• compute the normalization: Σᵢ e^{zᵢ}
• probabilities: e^{z₁}/Σᵢe^{zᵢ}, e^{z₂}/Σᵢe^{zᵢ}, e^{z₃}/Σᵢe^{zᵢ} → (.09, .01, .9)
CROSS-ENTROPY LOSS
predicted probabilities (.09, .01, .9), target label yᵢ = (0, 0, 1)
NLL = −log(.9)
ℓ(x, y) = −Σ_k y_k log(x_k)
"cross entropy loss" is the negative log likelihood
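The softmax and cross-entropy steps on these slides, in numpy. Subtracting the max before exponentiating is the standard numerical-stability trick (it cancels in the ratio); the probabilities printed on the slide are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(p, y_onehot):
    """NLL of the one-hot target under predicted probabilities p."""
    return -np.sum(y_onehot * np.log(p))

z = np.array([0.75, -0.1, 1.5])    # network outputs from the slides
p = softmax(z)
y = np.array([0.0, 0.0, 1.0])      # one-hot target label
loss = cross_entropy(p, y)         # reduces to -log(p[2])
```

Because the target is one-hot, the whole sum collapses to the negative log probability assigned to the correct class — exactly the NLL on the slide.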