
Constrained optimization

DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html

Carlos Fernandez-Granda

Compressed sensing

Convex constrained problems

Analyzing optimization-based methods

Magnetic resonance imaging

2D DFT (magnitude) 2D DFT (log. of magnitude)

Magnetic resonance imaging

Data: Samples from spectrum

Problem: Sampling is time consuming (annoying, kids move . . . )

Images are compressible (sparse in wavelet basis)

Can we recover compressible signals from less data?

2D wavelet transform


Full DFT matrix


Regular subsampling


Random subsampling


Toy example

Regular subsampling

Random subsampling

Linear inverse problems

Linear inverse problem

A~x = ~y

Linear measurements, A ∈ Rn×m:

~y[i] = 〈Ai:, ~x〉, 1 ≤ i ≤ n

Aim: Recover the signal ~x ∈ Rm from the data ~y ∈ Rn

We need n ≥ m, otherwise the problem is underdetermined

If n < m there are infinitely many solutions ~x + ~w, where ~w ∈ null(A)

Sparse recovery

Aim: Recover sparse ~x from linear measurements

A~x = ~y

When is the problem well posed?

There shouldn’t be two sparse vectors ~x1 and ~x2 such that A~x1 = A~x2

Spark

The spark of a matrix is the smallest number of columns that are linearly dependent

Let ~y := A~x∗, where A ∈ Rn×m, ~y ∈ Rn and ~x∗ ∈ Rm is a sparse vector with s nonzero entries

The vector ~x∗ is the only vector with sparsity s consistent with the data, i.e. it is the solution of

min_~x ||~x||_0 subject to A~x = ~y

for any choice of ~x∗ if and only if

spark(A) > 2s

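As an illustration (not from the slides), here is a minimal brute-force sketch, assuming numpy, that computes the spark of a small matrix by checking the rank of every subset of columns; it is only practical for tiny examples, and checking the spark is NP-hard in general, as noted later in these slides.

```python
import itertools
import numpy as np

def spark(A, tol=1e-10):
    """Smallest number of linearly dependent columns of A (brute force)."""
    n_cols = A.shape[1]
    for k in range(1, n_cols + 1):
        for cols in itertools.combinations(range(n_cols), k):
            # The k chosen columns are linearly dependent iff their rank is < k
            if np.linalg.matrix_rank(A[:, list(cols)], tol=tol) < k:
                return k
    return np.inf  # all columns are linearly independent

# Any 3 of these 4 columns are independent, but 4 vectors in R^3 must be
# dependent, so the spark is 4 and 1-sparse vectors are identifiable (4 > 2*1)
A = np.array([[1.0, 0.0, 1.0, 2.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0]])
print(spark(A))
```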

Proof

Equivalent statements

▶ For any ~x∗, ~x∗ is the only vector with sparsity s consistent with the data

▶ For any pair of distinct s-sparse vectors ~x1 and ~x2

A(~x1 − ~x2) ≠ ~0

▶ For any pair of subsets of s indices T1 and T2

A_{T1∪T2} ~α ≠ ~0 for any nonzero ~α ∈ R^{|T1∪T2|}

▶ All submatrices with at most 2s columns have no nonzero vectors in their null space

▶ All submatrices with at most 2s columns are full rank


Restricted-isometry property

Robust version of spark

If two s-sparse vectors ~x1, ~x2 are far, then A~x1, A~x2 should be far

The linear operator should preserve distances (be an isometry) when restricted to act upon sparse vectors

Restricted-isometry property

A satisfies the restricted isometry property (RIP) with constant κ_s if

(1 − κ_s) ||~x||_2 ≤ ||A~x||_2 ≤ (1 + κ_s) ||~x||_2

for any s-sparse vector ~x

If A satisfies the RIP for sparsity level 2s then for any s-sparse ~x1, ~x2

||~y1 − ~y2||_2 = ||A(~x1 − ~x2)||_2 ≥ (1 − κ_2s) ||~x1 − ~x2||_2


Regular subsampling

Figure: Correlation of column 20 with the other columns (regular subsampling)

Random subsampling

Figure: Correlation of column 20 with the other columns (random subsampling)

Restricted-isometry property

Deterministic matrices tend not to satisfy the RIP

It is NP-hard to check whether the spark or RIP conditions hold

Random matrices satisfy the RIP with high probability

We prove it for Gaussian iid matrices; the ideas in the proof for random Fourier matrices are similar

Restricted-isometry property for Gaussian matrices

Let A ∈ Rm×n be a random matrix with iid standard Gaussian entries

(1/√m) A satisfies the RIP for a constant κ_s with probability 1 − C2/n as long as

m ≥ (C1 s / κ_s²) log(n/s)

for two fixed constants C1, C2 > 0

Proof

For a fixed support T of size s, the bounds follow from bounds on the singular values of Gaussian matrices

Figure: Singular values σ_i/√m of an m × s Gaussian matrix, s = 100, for m/s = 2, 5, 10, 20, 50, 100, 200

Figure: Singular values σ_i/√m of an m × s Gaussian matrix, s = 1000, for m/s = 2, 5, 10, 20, 50, 100, 200
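A minimal sketch of the experiment behind these figures (not from the slides), assuming numpy: draw an m × s Gaussian matrix for several ratios m/s and look at how far the scaled singular values σ_i/√m are from 1.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 100
for ratio in [2, 5, 10, 20, 50, 100, 200]:
    m = ratio * s
    A = rng.standard_normal((m, s))               # iid standard Gaussian entries
    sing_vals = np.linalg.svd(A, compute_uv=False) / np.sqrt(m)
    # As m/s grows, the scaled singular values concentrate around 1,
    # which is what makes (1/sqrt(m)) A act as a near-isometry on R^s
    print(f"m/s = {ratio:4d}: sigma/sqrt(m) in [{sing_vals.min():.3f}, {sing_vals.max():.3f}]")
```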

Proof

For a fixed submatrix the singular values are bounded by

√m (1 − κ_s) ≤ σ_s ≤ σ_1 ≤ √m (1 + κ_s)

with probability at least

1 − 2 (12/κ_s)^s exp(−m κ_s² / 32)

For any vector ~x with support T

√(1 − κ_s) ||~x||_2 ≤ (1/√m) ||A~x||_2 ≤ √(1 + κ_s) ||~x||_2

Union bound

For any events S1, S2, . . . , Sn in a probability space

P(∪_i Si) ≤ Σ_{i=1}^n P(Si)

Proof

Number of different supports of size s:

(n choose s) ≤ (en/s)^s

By the union bound

√(1 − κ_s) ||~x||_2 ≤ (1/√m) ||A~x||_2 ≤ √(1 + κ_s) ||~x||_2

holds for any s-sparse vector ~x with probability at least

1 − 2 (en/s)^s (12/κ_s)^s exp(−m κ_s² / 32)

= 1 − exp( log 2 + s + s log(n/s) + s log(12/κ_s) − m κ_s² / 32 )

≥ 1 − C2/n

as long as m ≥ (C1 s / κ_s²) log(n/s)


Sparse recovery via ℓ1-norm minimization

ℓ0-“norm” minimization is intractable

(As usual) we can minimize the ℓ1 norm instead; the estimate ~x_ℓ1 is the solution to

min_~x ||~x||_1 subject to A~x = ~y

Minimum ℓ2-norm solution (regular subsampling)

Figure: True Signal vs. Minimum L2 Solution

Minimum ℓ1-norm solution (regular subsampling)

Figure: True Signal vs. Minimum L1 Solution

Minimum ℓ2-norm solution (random subsampling)

Figure: True Signal vs. Minimum L2 Solution

Minimum ℓ1-norm solution (random subsampling)

Figure: True Signal vs. Minimum L1 Solution
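These toy comparisons are easy to reproduce. A minimal sketch (not from the slides), assuming numpy and cvxpy; the matrix, sparsity level and sizes are illustrative choices.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m, s = 64, 24, 4                            # signal length, measurements, sparsity
x_true = np.zeros(n)
support = rng.choice(n, s, replace=False)
x_true[support] = rng.standard_normal(s)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # random Gaussian measurements
y = A @ x_true

# Minimum l2-norm solution (pseudoinverse)
x_l2 = np.linalg.pinv(A) @ y

# Minimum l1-norm solution: convex program solved with cvxpy
x = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == y]).solve()
x_l1 = x.value

print("l2-solution error:", np.linalg.norm(x_l2 - x_true))
print("l1-solution error:", np.linalg.norm(x_l1 - x_true))
```

With enough random measurements the ℓ1 solution is typically much closer to the sparse signal than the ℓ2 one, mirroring the figures above.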

Geometric intuition

Figure: ℓ2 norm vs. ℓ1 norm

Sparse recovery via ℓ1-norm minimization

If the signal is sparse in a transform domain, ~x∗ = W~c∗, then we solve

min_~c ||~c||_1 subject to AW~c = ~y

If we want to recover the original ~c∗ then AW should satisfy the RIP

However, we might be fine with any ~c′ such that W~c′ = ~x∗


Regular subsampling

Minimum ℓ2-norm solution (regular subsampling)

Minimum ℓ1-norm solution (regular subsampling)

Random subsampling

Minimum ℓ2-norm solution (random subsampling)

Minimum ℓ1-norm solution (random subsampling)

Compressed sensing

Convex constrained problems

Analyzing optimization-based methods

Convex sets

A convex set S is any set such that for any ~x , ~y ∈ S and θ ∈ (0, 1)

θ~x + (1− θ) ~y ∈ S

The intersection of convex sets is convex

Convex vs nonconvex

Figure: a nonconvex set and a convex set

Epigraph

Figure: a function f and its epigraph epi(f)

A function is convex if and only if its epigraph is convex

Projection onto convex set

The projection of any vector ~x onto a non-empty closed convex set S

P_S(~x) := arg min_{~y∈S} ||~x − ~y||_2

exists and is unique

Proof

Assume there are two distinct projections ~y1 ≠ ~y2

Consider

~y′ := (~y1 + ~y2)/2

~y′ belongs to S (why?)

〈~x − ~y′, ~y1 − ~y′〉 = 〈~x − (~y1 + ~y2)/2, ~y1 − (~y1 + ~y2)/2〉

= 〈(~x − ~y1)/2 + (~x − ~y2)/2, (~x − ~y1)/2 − (~x − ~y2)/2〉

= (1/4) ( ||~x − ~y1||_2^2 − ||~x − ~y2||_2^2 )

= 0

By Pythagoras' theorem

||~x − ~y1||_2^2 = ||~x − ~y′||_2^2 + ||~y1 − ~y′||_2^2

= ||~x − ~y′||_2^2 + ||(~y1 − ~y2)/2||_2^2

> ||~x − ~y′||_2^2

which contradicts the assumption that ~y1 is a projection of ~x onto S

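Projections onto simple convex sets can often be computed by a short algorithm. As an illustration (not from the slides), here is a standard sorting-based sketch, assuming numpy, for projecting onto the ℓ1-norm ball, a set that plays a central role later in this lecture.

```python
import numpy as np

def project_l1_ball(x, radius=1.0):
    """Euclidean projection of x onto {z : ||z||_1 <= radius}."""
    if np.abs(x).sum() <= radius:
        return x.copy()                        # already inside the ball
    u = np.sort(np.abs(x))[::-1]               # sorted magnitudes, descending
    cumsum = np.cumsum(u)
    # Largest index rho with u[rho] > (cumsum[rho] - radius) / (rho + 1)
    rho = np.nonzero(u * (np.arange(len(x)) + 1) > cumsum - radius)[0][-1]
    theta = (cumsum[rho] - radius) / (rho + 1.0)
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

x = np.array([2.0, -1.5, 0.3, 0.0])
p = project_l1_ball(x, radius=1.0)
print(p, np.abs(p).sum())    # x is outside the ball, so the projection lands on the boundary
```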

Convex combination

Given n vectors ~x1, ~x2, . . . , ~xn ∈ Rn,

~x := Σ_{i=1}^n θi ~xi

is a convex combination of ~x1, ~x2, . . . , ~xn if

θi ≥ 0, 1 ≤ i ≤ n, and Σ_{i=1}^n θi = 1

Convex hull

The convex hull of S is the set of convex combinations of points in S

The ℓ1-norm ball is the convex hull of the intersection between the ℓ0-“norm” ball and the ℓ∞-norm ball

ℓ1-norm ball

B_ℓ1 ⊆ C(B_ℓ0 ∩ B_ℓ∞)

Let ~x ∈ B_ℓ1

Set θi := |~x[i]| for 1 ≤ i ≤ n and θ0 := 1 − Σ_{i=1}^n θi

Σ_{i=0}^n θi = 1 by construction, θi ≥ 0 and

θ0 = 1 − Σ_{i=1}^n θi = 1 − ||~x||_1 ≥ 0 because ~x ∈ B_ℓ1

~x ∈ C(B_ℓ0 ∩ B_ℓ∞) because

~x = Σ_{i=1}^n θi sign(~x[i]) ~ei + θ0 ~0


C(B_ℓ0 ∩ B_ℓ∞) ⊆ B_ℓ1

Let ~x ∈ C(B_ℓ0 ∩ B_ℓ∞), then

~x = Σ_{i=1}^m θi ~yi

||~x||_1 ≤ Σ_{i=1}^m θi ||~yi||_1   by the triangle inequality

≤ Σ_{i=1}^m θi ||~yi||_∞   since each ~yi only has one nonzero entry

≤ Σ_{i=1}^m θi

≤ 1


Convex optimization problem

f0, f1, . . . , fm, h1, . . . , hp : Rn → R

minimize f0(~x)
subject to fi(~x) ≤ 0, 1 ≤ i ≤ m,
hi(~x) = 0, 1 ≤ i ≤ p

Definitions

▶ A feasible vector is a vector that satisfies all the constraints

▶ A solution is any vector ~x∗ such that f0(~x) ≥ f0(~x∗) for all feasible vectors ~x

▶ If a solution exists, f0(~x∗) is the optimal value or optimum of the problem

Convex optimization problem

The optimization problem is convex if

▶ f0 is convex

▶ f1, . . . , fm are convex

▶ h1, . . . , hp are affine, i.e. hi(~x) = ~ai^T ~x + bi for some ~ai ∈ Rn and bi ∈ R

Linear program

minimize ~a^T ~x
subject to ~ci^T ~x ≤ di, 1 ≤ i ≤ m,
A~x = ~b

ℓ1-norm minimization as an LP

The optimization problem

minimize ||~x||_1 subject to A~x = ~b

can be recast as the LP

minimize Σ_{i=1}^m ~t[i]
subject to ~t[i] ≥ ~ei^T ~x
~t[i] ≥ −~ei^T ~x
A~x = ~b
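A minimal sketch of this recast (not from the slides), assuming numpy and scipy: the LP variable is the concatenation of ~x and ~t, the constraints ~t[i] ≥ ±~x[i] become inequality rows, and A~x = ~b is the equality block.

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(A, b):
    """Solve min ||x||_1 subject to A x = b via the LP recast."""
    m, n = A.shape
    I = np.eye(n)
    # Variable z = [x, t]; the objective sums the entries of t
    c = np.concatenate([np.zeros(n), np.ones(n)])
    # t[i] >= x[i]  ->  x[i] - t[i] <= 0 ;  t[i] >= -x[i]  ->  -x[i] - t[i] <= 0
    A_ub = np.block([[I, -I], [-I, -I]])
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([A, np.zeros((m, n))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
                  bounds=[(None, None)] * (2 * n))
    return res.x[:n]

A = np.random.default_rng(1).standard_normal((10, 30))
x_true = np.zeros(30); x_true[[3, 17]] = [1.0, -2.0]
print(np.round(l1_min(A, A @ x_true), 3))   # often recovers the 2-sparse vector
```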

Proof

Solution to the ℓ1-norm min. problem: ~x_ℓ1

Solution to the linear program: (~x_lp, ~t_lp)

Set ~t_ℓ1[i] := |~x_ℓ1[i]|, so that (~x_ℓ1, ~t_ℓ1) is feasible for the linear program

||~x_ℓ1||_1 = Σ_{i=1}^m ~t_ℓ1[i]

≥ Σ_{i=1}^m ~t_lp[i]   by optimality of ~t_lp

≥ ||~x_lp||_1

so ~x_lp is a solution to the ℓ1-norm min. problem


Proof

Set ~t_ℓ1[i] := |~x_ℓ1[i]|

Σ_{i=1}^m ~t_ℓ1[i] = ||~x_ℓ1||_1

≤ ||~x_lp||_1   by optimality of ~x_ℓ1

≤ Σ_{i=1}^m ~t_lp[i]

so (~x_ℓ1, ~t_ℓ1) is a solution to the linear program


Quadratic program

For a positive semidefinite matrix Q ∈ Rn×n

minimize ~x^T Q ~x + ~a^T ~x
subject to ~ci^T ~x ≤ di, 1 ≤ i ≤ m,
A~x = ~b

ℓ1-norm regularized least squares as a QP

The optimization problem

minimize ||A~x − ~y||_2^2 + α ||~x||_1

can be recast as the QP

minimize ~x^T A^T A ~x − 2~y^T A~x + α Σ_{i=1}^n ~t[i]
subject to ~t[i] ≥ ~ei^T ~x
~t[i] ≥ −~ei^T ~x
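As a quick illustration (not from the slides), assuming numpy and cvxpy with illustrative sizes and regularization parameter, the problem can be posed directly and handed to a QP-capable solver.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))
x_true = np.zeros(200); x_true[[5, 80, 140]] = [1.0, -0.5, 2.0]
y = A @ x_true + 0.05 * rng.standard_normal(50)

alpha = 1.0
x = cp.Variable(200)
# cvxpy handles the l1 term with auxiliary variables, in the spirit of the epigraph recast above
problem = cp.Problem(cp.Minimize(cp.sum_squares(A @ x - y) + alpha * cp.norm1(x)))
problem.solve()
print("nonzero entries (|x[i]| > 1e-3):", np.flatnonzero(np.abs(x.value) > 1e-3))
```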

Lagrangian

The Lagrangian of a canonical optimization problem is

L(~x, ~α, ~ν) := f0(~x) + Σ_{i=1}^m ~α[i] fi(~x) + Σ_{j=1}^p ~ν[j] hj(~x),

where ~α ∈ Rm, ~ν ∈ Rp are called Lagrange multipliers or dual variables

If ~x is feasible and ~α[i] ≥ 0 for 1 ≤ i ≤ m

L(~x, ~α, ~ν) ≤ f0(~x)

Lagrange dual function

The Lagrange dual function of the problem is

l(~α, ~ν) := inf_{~x∈Rn} f0(~x) + Σ_{i=1}^m ~α[i] fi(~x) + Σ_{j=1}^p ~ν[j] hj(~x)

Let p∗ be an optimum of the optimization problem

l(~α, ~ν) ≤ p∗

as long as ~α[i] ≥ 0 for 1 ≤ i ≤ m

Dual problem

The dual problem of the (primal) optimization problem is

maximize l (~α, ~ν)

subject to ~α[i ] ≥ 0, 1 ≤ i ≤ m.

The dual problem is always convex, even if the primal isn’t!

Maximum/supremum of convex functions

Pointwise maximum of m convex functions f1, . . . , fm

fmax(~x) := max_{1≤i≤m} fi(~x)

is convex

Pointwise supremum of a family of convex functions indexed by a set I

fsup(~x) := sup_{i∈I} fi(~x)

is convex

Proof

For any 0 ≤ θ ≤ 1 and any ~x, ~y ∈ Rn,

fsup(θ~x + (1 − θ)~y) = sup_{i∈I} fi(θ~x + (1 − θ)~y)

≤ sup_{i∈I} ( θ fi(~x) + (1 − θ) fi(~y) )   by convexity of the fi

≤ θ sup_{i∈I} fi(~x) + (1 − θ) sup_{j∈I} fj(~y)

= θ fsup(~x) + (1 − θ) fsup(~y)


Weak duality

If p∗ is a primal optimum and d∗ a dual optimum

d∗ ≤ p∗

Strong duality

For convex problems

d∗ = p∗

under very weak conditions

LPs: The primal optimum is finite

General convex programs (Slater’s condition):

There exists a point that is strictly feasible

fi(~x) < 0, 1 ≤ i ≤ m

ℓ1-norm minimization

The dual problem of

min_~x ||~x||_1 subject to A~x = ~y

is

max_~ν ~y^T~ν subject to ||A^T~ν||_∞ ≤ 1

Proof

Lagrangian: L(~x, ~ν) = ||~x||_1 + ~ν^T(~y − A~x)

Lagrange dual function:

l(~ν) := inf_{~x∈Rn} ||~x||_1 − (A^T~ν)^T~x + ~ν^T~y

What if |(A^T~ν)[i]| > 1 for some i? We can send ~x[i] → ±∞ so that l(~ν) → −∞

What if ||A^T~ν||_∞ ≤ 1? Then

(A^T~ν)^T~x ≤ ||~x||_1 ||A^T~ν||_∞ ≤ ||~x||_1

so the infimum is attained at ~x = ~0 and l(~ν) = ~ν^T~y


Strong duality

The solution ~ν∗ to

max_~ν ~y^T~ν subject to ||A^T~ν||_∞ ≤ 1

satisfies

(A^T~ν∗)[i] = sign(~x∗[i]) for all ~x∗[i] ≠ 0

for all solutions ~x∗ to the primal problem

min_~x ||~x||_1 subject to A~x = ~y

Dual solution

Figure: True Signal Sign vs. Minimum L1 Dual Solution

Proof

By strong duality

||~x∗||_1 = ~y^T~ν∗ = (A~x∗)^T~ν∗ = (~x∗)^T(A^T~ν∗) = Σ_{i=1}^m (A^T~ν∗)[i] ~x∗[i]

By Hölder's inequality

||~x∗||_1 ≥ Σ_{i=1}^m (A^T~ν∗)[i] ~x∗[i]

with equality if and only if

(A^T~ν∗)[i] = sign(~x∗[i]) for all ~x∗[i] ≠ 0

Another algorithm for sparse recovery

Aim: Find the nonzero locations of a sparse vector ~x from ~y = A~x

Insight: We have access to the inner product of ~x with A^T~w for any ~w

~y^T~w = (A~x)^T~w = ~x^T(A^T~w)

Idea: Maximize ~y^T~w while bounding the magnitude of the entries of A^T~w by 1

The entries of A^T~w where ~x is nonzero should saturate to 1 or −1

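A minimal numerical sketch of this idea (not from the slides), assuming numpy and cvxpy with illustrative dimensions: solve max ~y^T~w subject to ||A^T~w||_∞ ≤ 1 and read a candidate support off the entries of A^T~w that saturate at ±1; with enough measurements they typically coincide with the true support.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, s = 20, 60, 3
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
support = rng.choice(n, s, replace=False)
x_true[support] = rng.standard_normal(s)
y = A @ x_true

w = cp.Variable(m)
cp.Problem(cp.Maximize(y @ w), [cp.norm_inf(A.T @ w) <= 1]).solve()

corr = A.T @ w.value
estimated = np.flatnonzero(np.abs(corr) > 1 - 1e-3)   # entries (numerically) saturating at +/-1
print("true support:     ", np.sort(support))
print("saturated entries:", estimated)
```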

Compressed sensing

Convex constrained problems

Analyzing optimization-based methods

Analyzing optimization-based methods

Best case scenario: Primal solution has closed form

Otherwise: Use dual solution to characterize primal solution

Minimum `2-norm solution

Let A ∈ Rm×n be a full rank matrix such that m < n

For any ~y ∈ Rm the solution to the optimization problem

arg min_~x ||~x||_2 subject to A~x = ~y

is

~x∗ := V S^{−1} U^T ~y = A^T (A A^T)^{−1} ~y

where A = U S V^T is the SVD of A

Proof

~x = P_row(A) ~x + P_row(A)⊥ ~x

Since A is full rank, P_row(A) ~x = V~c for some vector ~c ∈ Rm

A~x = A P_row(A) ~x = U S V^T V ~c = U S ~c

so A~x = ~y is equivalent to U S ~c = ~y, i.e. ~c = S^{−1} U^T ~y


Proof

For all feasible vectors ~x

P_row(A) ~x = V S^{−1} U^T ~y

By Pythagoras' theorem, minimizing ||~x||_2 is equivalent to minimizing

||~x||_2^2 = ||P_row(A) ~x||_2^2 + ||P_row(A)⊥ ~x||_2^2

which is smallest when P_row(A)⊥ ~x = ~0, so ~x∗ = V S^{−1} U^T ~y
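A quick numerical check of this closed form (not from the slides), assuming numpy: the SVD expression, A^T(AA^T)^{−1}~y and the pseudoinverse all give the same minimum ℓ2-norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 20
A = rng.standard_normal((m, n))                  # full rank with probability one
y = rng.standard_normal(m)

U, s_vals, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ y) / s_vals)              # V S^{-1} U^T y
x_normal = A.T @ np.linalg.solve(A @ A.T, y)     # A^T (A A^T)^{-1} y
x_pinv = np.linalg.pinv(A) @ y

print(np.allclose(x_svd, x_normal), np.allclose(x_svd, x_pinv))
print("constraint satisfied:", np.allclose(A @ x_svd, y))
```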

Regular subsampling

Minimum ℓ2-norm solution (regular subsampling)

Figure: True Signal vs. Minimum L2 Solution

Regular subsampling

A := (1/√2) [ F_{m/2}  F_{m/2} ]

where F_{m/2}^∗ F_{m/2} = I and F_{m/2} F_{m/2}^∗ = I

Write ~x := [ ~x_up ; ~x_down ] (stacked)

Regular subsampling

~x_ℓ2 = arg min_{A~x=~y} ||~x||_2 = A^∗ (A A^∗)^{−1} ~y

= (1/√2) [ F_{m/2}^∗ ; F_{m/2}^∗ ] ( (1/√2) [ F_{m/2}  F_{m/2} ] (1/√2) [ F_{m/2}^∗ ; F_{m/2}^∗ ] )^{−1} (1/√2) [ F_{m/2}  F_{m/2} ] [ ~x_up ; ~x_down ]

= (1/2) [ F_{m/2}^∗ ; F_{m/2}^∗ ] ( (1/2) ( F_{m/2} F_{m/2}^∗ + F_{m/2} F_{m/2}^∗ ) )^{−1} ( F_{m/2} ~x_up + F_{m/2} ~x_down )

= (1/2) [ F_{m/2}^∗ ; F_{m/2}^∗ ] I^{−1} ( F_{m/2} ~x_up + F_{m/2} ~x_down )

= (1/2) [ F_{m/2}^∗ ( F_{m/2} ~x_up + F_{m/2} ~x_down ) ; F_{m/2}^∗ ( F_{m/2} ~x_up + F_{m/2} ~x_down ) ]

= (1/2) [ ~x_up + ~x_down ; ~x_up + ~x_down ]
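The aliasing in this derivation can be checked numerically. A minimal sketch (not from the slides), assuming numpy and using the unitary DFT matrix for F_{m/2}: the minimum ℓ2-norm solution simply averages the two halves of the signal.

```python
import numpy as np

k = 8                                               # m/2
F = np.fft.fft(np.eye(k)) / np.sqrt(k)              # unitary k-point DFT matrix
A = np.hstack([F, F]) / np.sqrt(2)                  # regular subsampling model

rng = np.random.default_rng(0)
x = rng.standard_normal(2 * k)
x_up, x_down = x[:k], x[k:]
y = A @ x

# Minimum l2-norm solution A^*(A A^*)^{-1} y (here A A^* = I)
x_l2 = A.conj().T @ np.linalg.solve(A @ A.conj().T, y)

aliased = np.concatenate([(x_up + x_down) / 2, (x_up + x_down) / 2])
print(np.allclose(x_l2, aliased))                   # True: the two halves get averaged
```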


Minimum ℓ1-norm solution

Problem: arg min_{A~x=~y} ||~x||_1 does not have a closed-form solution

Instead we can use a dual variable to certify optimality

Dual solution

The solution ~ν∗ to

max_~ν ~y^T~ν subject to ||A^T~ν||_∞ ≤ 1

satisfies

(A^T~ν∗)[i] = sign(~x∗[i]) for all ~x∗[i] ≠ 0

where ~x∗ is a solution to the primal problem

min_~x ||~x||_1 subject to A~x = ~y

Dual certificate

If there exists a vector ~ν ∈ Rm such that

(A^T~ν)[i] = sign(~x∗[i]) if ~x∗[i] ≠ 0

|(A^T~ν)[i]| < 1 if ~x∗[i] = 0

then ~x∗ is the unique solution to the primal problem

min_~x ||~x||_1 subject to A~x = ~y

as long as the submatrix A_T, whose columns are indexed by the support T of ~x∗, is full rank

Proof 1

~ν is feasible for the dual problem, so for any primal feasible ~x

||~x||_1 ≥ ~y^T~ν = (A~x∗)^T~ν = (~x∗)^T(A^T~ν) = Σ_{i∈T} ~x∗[i] sign(~x∗[i]) = ||~x∗||_1

so ~x∗ must be a solution


Proof 2

A^T~ν is a subgradient of the ℓ1 norm at ~x∗

For any other feasible vector ~x

||~x||_1 ≥ ||~x∗||_1 + (A^T~ν)^T(~x − ~x∗) = ||~x∗||_1 + ~ν^T(A~x − A~x∗) = ||~x∗||_1


Random subsampling

Minimum ℓ1-norm solution (random subsampling)

Figure: True Signal vs. Minimum L1 Solution

Exact sparse recovery via ℓ1-norm minimization

Assumption: There exists a signal ~x∗ ∈ Rn with s nonzeros such that

A~x∗ = ~y

for a random A ∈ Rm×n (random Fourier, Gaussian iid, . . . )

Exact recovery: If the number of measurements satisfies

m ≥ C′ s log n

the solution of the problem

minimize ||~x||_1 subject to A~x = ~y

is the original signal with probability at least 1 − 1/n

Proof

Show that a dual certificate always exists

We need

A_T^T ~ν = sign(~x∗_T)   (s constraints)

||A_{T^c}^T ~ν||_∞ < 1

Idea: Impose A_T^T ~ν = sign(~x∗_T) and minimize ||A_{T^c}^T ~ν||_∞

Problem: No closed-form solution

How about minimizing the ℓ2 norm instead?

Proof of exact recovery

Prove that a dual certificate exists for any s-sparse ~x∗

Dual certificate candidate: solution of

minimize ||~v||_2 subject to A_T^T ~v = sign(~x∗_T)

Closed-form solution: ~ν_ℓ2 := A_T (A_T^T A_T)^{−1} sign(~x∗_T)

A_T^T A_T is invertible with high probability

We need to prove that A^T ~ν_ℓ2 satisfies ||(A^T ~ν_ℓ2)_{T^c}||_∞ < 1

Dual certificate

Figure: Sign Pattern vs. Dual Function

Proof of exact recovery

To control (A^T~ν_ℓ2)_{T^c}, we need to bound

A_i^T (A_T^T A_T)^{−1} sign(~x∗_T)   for i ∈ T^c

Let ~w := (A_T^T A_T)^{−1} sign(~x∗_T)

|A_i^T ~w| can be bounded using independence

The result then follows from the union bound
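To close, a minimal sketch of this construction (not from the slides), assuming numpy with illustrative dimensions: build ~ν_ℓ2 for a Gaussian A and a random support, then check that the on-support entries of A^T~ν_ℓ2 equal the signs and that the off-support entries stay below 1 in magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s = 60, 200, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)

T = rng.choice(n, s, replace=False)              # support of the sparse signal
signs = rng.choice([-1.0, 1.0], s)               # sign(x*_T)

A_T = A[:, T]
# Least-squares certificate candidate: nu = A_T (A_T^T A_T)^{-1} sign(x*_T)
nu = A_T @ np.linalg.solve(A_T.T @ A_T, signs)

corr = A.T @ nu
off_support = np.delete(corr, T)
print("on-support entries equal the signs:", np.allclose(corr[T], signs))
print("max off-support magnitude:", np.abs(off_support).max())   # should be < 1
```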