Gradient Based Optimization Algorithms
Marc Teboulle
School of Mathematical Sciences, Tel Aviv University
Based on joint works with
Amir Beck (Technion), Jérôme Bolte (Toulouse I), Ronny Luss (IBM, New York), Shoham Sabach (Göttingen), Ron Shefi (Tel Aviv)
EPFL - Lausanne, February 4, 2014
Marc Teboulle (Tel Aviv University) Gradient Based Optimization Algorithms
Opening Remark and Credit

More than 380 years ago..... In 1629, Fermat suggested the following:

Given f, solve for x:

$$\left.\frac{f(x+d)-f(x)}{d}\right|_{d=0} = 0$$

...We can hardly expect to find a more general method to get the maximum or minimum points on a curve.....
Pierre de Fermat
First-Order/Gradient Methods
First-order methods are iterative methods that only exploit information on the objective function and its gradient (subgradient).

A main drawback: they can be very slow at producing high-accuracy solutions....
But... they also share many advantages:
- Require minimal information, e.g., (f, f').
- Often lead to very simple and "cheap" iterative schemes.
- Suitable for large-scale problems when high accuracy is not crucial. [In many large-scale applications, the data is anyway corrupted or known only roughly..]
Polynomial versus Gradient Methods
Convex problems are polynomially solvable within ε accuracy:

Running Time ≤ Poly(problem's size, # of accuracy digits).

Theoretically: this means that large-scale problems can be solved to high accuracy with polynomial methods, such as IPM.
Practically: running time is dimension-dependent and grows nonlinearly with the problem's dimension. For IPM, which are Newton-type methods: ∼O(n³).

Thus, a "single iteration" of IPM can last forever!
Gradient-Based Algorithms
Widely used in applications....
- Clustering Analysis: the k-means algorithm
- Neuro-computing: the backpropagation algorithm
- Statistical Estimation: the EM (Expectation-Maximization) algorithm
- Machine Learning: SVM, regularized regression, etc.
- Signal and Image Processing: sparse recovery, denoising and deblurring schemes, Total Variation minimization...
- Matrix minimization problems
...and much more...
Goals and Outline
Building simple and efficient First-Order Methods (FOM), exploiting problem structures/data information, and analyzing their complexity and rate of convergence.
Outline
- Problem Formulations: convex and nonconvex models
- Algorithms: derivation and analysis
- Applications: linear inverse problems, image processing, dimension reduction [PCA, NMF]
Algorithms - Historical Development
Old schemes for general problems "resuscitated" and "improved" by exploiting special structures and data info.
- Fixed point methods [Babylonian times!/Heron for square roots, Picard, Banach, Weisfield '34]
- Coordinate descent/Alternating minimization [Gauss-Seidel 1798]
- Gradient/subgradient methods [Cauchy 1846, Shor '61, Rosen '63, Frank-Wolfe '56, Polyak '62]
- Stochastic gradients [Robbins and Monro '51]
- Proximal algorithms [Martinet '70, Rockafellar '76, Passty '79]
- Penalty/barrier methods [Courant '49, Fiacco-McCormick '66]
- Augmented Lagrangians and splitting [Arrow-Hurwicz '58, Hestenes-Powell '69, Goldstein-Treyakov '72, Rockafellar '74, Mercier-Lions '79, Fortin-Glowinski '76, Bertsekas '82]
- Extragradient methods for VI [Korpelevich '76, Konnov '80]
- Optimal/fast gradient schemes [Nemirovski-Yudin '81, Nesterov '83]
Gradient-Based Methods for Convex Composite Models: Smooth + Nonsmooth
A Useful Convex Optimization Model
$$(\mathrm{M}) \qquad \min\,\{F(x) = f(x) + g(x) : x \in \mathbb{E}\}$$

- E is a finite-dimensional Euclidean space.
- f: E → R is convex and smooth of type C^{1,1}, i.e.,

$$\exists\, L(f) > 0 : \quad \|\nabla f(x) - \nabla f(y)\| \le L(f)\,\|x - y\|, \quad \forall\, x, y.$$

- g: E → (−∞, ∞] is convex, nonsmooth, and extended-valued (allowing constraints).
- We assume that (M) is solvable, i.e., X* := argmin F ≠ ∅, and for x* ∈ X*, we set F* := F(x*).

The model (M) already carries structural information, and it is rich enough to recover various classes of smooth/nonsmooth convex minimization problems.
Special Cases of the General Model
- g = 0: smooth unconstrained convex minimization, $\min_x f(x)$.
- g = δ(·|C): constrained smooth convex minimization, $\min_x \{f(x) : x \in C\}$.
- f = 0: general convex optimization, $\min_x g(x)$.
Anecdote: The World’s Simplest Impossible problem
[From C. Moler (1990)]
Problem: The average of two numbers is 3. What are the numbers?

Typical answers: (2,4), (1,5), (−3,9)...... These already ask for "structure": ..equal distance from the average.. integer numbers..
- Why not (2.71828, 3.28172)!?
- A nice one: (3,3)....it has "minimal norm" and it is unique!
- Simplest: (6,0) or (0,6)?... A sparse one! .... but here uniqueness is lost!..

This simple problem captures the essence of many ill-posed/underdetermined problems in applications.

Additional requirements/constraints have to be specified to make it a reasonable mathematical/computational task; they often lead to convex optimization models.
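The "minimal norm" answer (3,3) is exactly the least-norm solution of the underdetermined system "average = 3". A minimal sketch, assuming NumPy:

```python
import numpy as np

# Underdetermined system: the average of x1 and x2 is 3,
# i.e. A @ x = b with A = [[0.5, 0.5]], b = [3].
A = np.array([[0.5, 0.5]])
b = np.array([3.0])

# The pseudo-inverse picks the minimum-norm solution: (3, 3).
x_min_norm = np.linalg.pinv(A) @ b
print(x_min_norm)  # -> [3. 3.]
```

Any other answer, e.g. the sparse (6,0), requires adding a different selection criterion to the model.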
Linear Inverse Problems
Problem: Find x ∈ C ⊂ E which "best" solves A(x) ≈ b, where A: E → F and b (the observable output) are known.

Approach: via Regularization Models
• g(x) is a "regularizer" (one function, or a sum of functions)
• d(b, A(x)) is some "proximity" measure from b to A(x)

$$\min\,\{g(x) : A(x) = b,\; x \in C\}$$
$$\min\,\{g(x) : d(b, A(x)) \le \varepsilon,\; x \in C\}$$
$$\min\,\{d(b, A(x)) : g(x) \le \delta,\; x \in C\}$$
$$\min\,\{d(b, A(x)) + \lambda g(x) : x \in C\} \quad (\lambda > 0)$$

• Intensive research activity over the last 50 years... Now even more, with sparse optimization problems..
• The choices of g(·) and d(·, ·) depend on the application at hand. Nonsmooth regularizers are particularly useful.
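For instance, with d(b, A(x)) = ½‖Ax − b‖² and the smooth regularizer g(x) = ½‖x‖², the fourth (penalized) model has the classical Tikhonov closed-form solution. A minimal sketch, assuming NumPy; the data here is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 10))   # underdetermined: 5 equations, 10 unknowns
b = rng.standard_normal(5)
lam = 0.1                          # regularization weight lambda > 0

# min_x 0.5*||A x - b||^2 + 0.5*lam*||x||^2  <=>  (A^T A + lam*I) x = A^T b
x = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ b)

# Optimality check: the gradient of the objective vanishes at x.
grad = A.T @ (A @ x - b) + lam * x
print(np.linalg.norm(grad))        # ~ 0 (up to floating-point error)
```

Nonsmooth choices of g (next slides) have no such closed form, which is where prox-based schemes enter.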
Image Processing
g = λ‖L(·)‖₁ — ℓ₁-regularized convex problem:

$$\min_x \{f(x) + \lambda\|Lx\|_1\}; \qquad f(x) := d(b, Ax)$$

L — identity, differential operator, wavelet.

g = TV(·) — TV-based regularization. Grid points (i, j) with the same intensity as their neighbors [ROF 1992]:

$$\min_x \{f(x) + \lambda\,\mathrm{TV}(x)\}$$

1-dim: $\mathrm{TV}(x) = \sum_i |x_i - x_{i+1}|$
2-dim:
isotropic: $\mathrm{TV}(x) = \sum_i \sum_j \sqrt{(x_{i,j} - x_{i+1,j})^2 + (x_{i,j} - x_{i,j+1})^2}$
anisotropic: $\mathrm{TV}(x) = \sum_i \sum_j \left(|x_{i,j} - x_{i+1,j}| + |x_{i,j} - x_{i,j+1}|\right)$

More on this soon...
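The 2-D TV sums above translate directly into array differences. A minimal sketch, assuming NumPy; the isotropic sum is restricted to indices where both differences exist, one common boundary convention:

```python
import numpy as np

def tv_aniso(x):
    """Anisotropic 2-D TV: sum of |x[i,j]-x[i+1,j]| + |x[i,j]-x[i,j+1]|."""
    return np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()

def tv_iso(x):
    """Isotropic 2-D TV: sum of sqrt((x[i,j]-x[i+1,j])^2 + (x[i,j]-x[i,j+1])^2)."""
    dx = np.diff(x, axis=0)[:, :-1]   # vertical differences
    dy = np.diff(x, axis=1)[:-1, :]   # horizontal differences
    return np.sqrt(dx**2 + dy**2).sum()

img = np.array([[0., 0., 1.],
                [0., 1., 1.],
                [1., 1., 1.]])
print(tv_aniso(img))   # -> 4.0
```

Both variants are convex but nonsmooth, so they fit the g-part of model (M) rather than the smooth f-part.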
Building Gradient-Based Schemes: Basic Old Idea
Pick an adequate approximate model.

1. Linearize + regularize: given some y, approximate f(x) + g(x) via

$$q(x, y) = f(y) + \langle x - y, \nabla f(y)\rangle + \frac{1}{2t}\|x - y\|^2 + g(x), \quad (t > 0)$$

that is, leaving the nonsmooth part g(·) untouched.

2. Linearize only + use info on C: e.g., C compact, g := δ_C:

$$q(x, y) = f(y) + \langle x - y, \nabla f(y)\rangle$$

Solve, "somehow", the resulting approximate model:

$$x^{k+1} = \operatorname*{argmin}_x\, q(x, x^k), \quad k = 0, 1, \ldots$$

Similarly for subgradient schemes, e.g., using γ(y) ∈ ∂g(y).
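With g ≡ 0, the minimizer of q(·, y) has a closed form: setting ∇_x q = ∇f(y) + (x − y)/t = 0 gives x = y − t∇f(y), the classical gradient step. A minimal numerical sketch of iterating the approximate model, assuming NumPy and an illustrative quadratic:

```python
import numpy as np

# Quadratic test function: f(x) = 0.5 x^T H x - c^T x, grad f(x) = H x - c
H = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, 1.0])
grad = lambda x: H @ x - c

t = 1.0 / np.linalg.eigvalsh(H).max()  # stepsize t = 1/L(f)

# Iterating x^{k+1} = argmin_x q(x, x^k) = x^k - t * grad(x^k)  (the g = 0 case)
x = np.zeros(2)
for _ in range(200):
    x = x - t * grad(x)

x_star = np.linalg.solve(H, c)         # exact minimizer: H x = c
print(np.allclose(x, x_star))          # True
```

The composite case, where g is kept untouched, is worked out on the next slide.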
Example: The Proximal Gradient Method

The derivation of the proximal gradient method is similar to that of the (sub)gradient projection method.

For any L ≥ L(f), and a given iterate x^k:

$$Q_L(x, x^k) := f(x^k) + \langle x - x^k, \nabla f(x^k)\rangle + \frac{L}{2}\|x - x^k\|^2 + \underbrace{g(x)}_{\text{untouched}}$$

Algorithm:

$$x^{k+1} := \operatorname*{argmin}_x\, Q_L(x, x^k) = \operatorname*{argmin}_x \left\{ g(x) + \frac{L}{2}\left\| x - \left(x^k - \tfrac{1}{L}\nabla f(x^k)\right) \right\|^2 \right\} = \mathrm{prox}_{\frac{1}{L}g}\left(x^k - \frac{1}{L}\nabla f(x^k)\right),$$

where the prox operator is

$$\mathrm{prox}_h(z) := \operatorname*{argmin}_u \left\{ h(u) + \frac{1}{2}\|u - z\|^2 \right\}.$$
The Proximal Gradient Method for (M)
The proximal gradient method with a constant stepsize rule.

Proximal Gradient Method (with Constant Stepsize)
Input: L = L(f), a Lipschitz constant of ∇f.
Step 0. Take x^0 ∈ E.
Step k. (k ≥ 1) Compute the prox of g at a gradient step for f:

$$x^k = \mathrm{prox}_{\frac{1}{L}g}\left(x^{k-1} - \frac{1}{L}\nabla f(x^{k-1})\right)$$

The Lipschitz constant L(f) is not always known or easily computable; this issue is resolved with an easy backtracking procedure.

1. Can we compute prox_{g/L}(·) efficiently?
2. What is the global rate of convergence of PGM?
First, we illustrate some special cases of ProxGrad.
Special Cases of Proximal Gradient
The prox-grad method: $x^{k+1} = \mathrm{prox}_{\frac{1}{L}g}\left(x^k - \frac{1}{L}\nabla f(x^k)\right)$.

g ≡ 0 ⇒ the gradient method:
$$x^{k+1} = x^k - \frac{1}{L}\nabla f(x^k)$$

g = δ(·|C) ⇒ the gradient projection method:
$$x^{k+1} = P_C\left(x^k - \frac{1}{L}\nabla f(x^k)\right)$$

f = 0 ⇒ the proximal minimization method:
$$x^{k+1} = \mathrm{prox}_{\frac{1}{L}g}(x^k)$$
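Beyond these special cases, taking f(x) = ½‖Ax − b‖² and g = λ‖·‖₁ turns the prox step into the shrinkage operator, giving the well-known ISTA iteration. A minimal sketch, assuming NumPy; the data is random and purely illustrative:

```python
import numpy as np

def prox_grad_l1(A, b, lam, n_iter=500):
    """ISTA: proximal gradient for min_x 0.5*||A x - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of grad f: ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - (1.0 / L) * (A.T @ (A @ x - b))                # gradient step on f
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)*||.||_1
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 50))
x_true = np.zeros(50)
x_true[:3] = [2.0, -1.5, 1.0]                  # sparse ground truth (illustrative)
b = A @ x_true

x_hat = prox_grad_l1(A, b, lam=0.1)
obj = lambda v: 0.5 * np.linalg.norm(A @ v - b) ** 2 + 0.1 * np.abs(v).sum()
print(obj(x_hat) < obj(np.zeros(50)))          # True: descent from the zero start
```

Each iteration costs two matrix-vector products plus a componentwise shrinkage, which is what makes the scheme "cheap".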
Some Calculus Rules for Computing prox_{tg}

$$\mathrm{prox}_{tg}(x) = \operatorname*{argmin}_u \left\{ g(u) + \frac{1}{2t}\|u - x\|^2 \right\}.$$

- g(u) = δ_C(u):  prox = Π_C(x), the projection onto C.
- g(u) = σ_C(u), the support function:  prox = x − Π_{tC}(x).
- g(u) = d_C(u), the distance to C:  prox = x + t(Π_C(x) − x)/d_C(x) if d_C(x) > t; Π_C(x) otherwise.
- g(u) = ‖Au − b‖²/2, A ∈ R^{m×n}:  prox = (I + tAᵀA)⁻¹(x + tAᵀb).
- g(u) = ‖u‖₁ (shrinkage):  prox_j = sgn(x_j) max{|x_j| − t, 0}.
- g(u) = ‖u‖ (Euclidean norm):  prox = (1 − t/‖x‖)x if ‖x‖ > t; 0 otherwise.
- g(U) = ‖U‖_* (nuclear norm), U ∈ R^{m×n} (m ≥ n), where σ₁(U) ≥ σ₂(U) ≥ … are the singular values of U and ‖U‖_* = Σ_j σ_j(U): with the singular value decomposition U = P diag(σ)Qᵀ, prox = P diag(s)Qᵀ with shrinkage s_j = max{σ_j − t, 0}.
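Two rows of this table are easy to check numerically: the ℓ₁ shrinkage and the indicator/projection row. A minimal sketch, assuming NumPy and taking the nonnegative orthant as C:

```python
import numpy as np

def prox_l1(x, t):
    """prox of t*||.||_1: componentwise soft thresholding (shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_indicator_nonneg(x, t):
    """prox of the indicator of C = {u : u >= 0} is the projection onto C
    (the parameter t plays no role for an indicator)."""
    return np.maximum(x, 0.0)

x = np.array([1.5, -0.2, 0.7, -3.0])
print(prox_l1(x, 1.0))                # entries shrink toward 0 by t; small ones vanish
print(prox_indicator_nonneg(x, 1.0))  # negative entries are projected to 0
```

Both operators cost O(n), which is what makes prox-grad practical for these regularizers.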
Rate of Convergence of Prox-Grad for Convex (M)

Theorem [Rate of Convergence of Prox-Grad]. Let {x_k} be the sequence generated by the proximal gradient method. Then for any optimal solution x*,

  F(x_k) − F(x*) ≤ L||x_0 − x*||² / (2k).

Thus, to solve (M), the proximal gradient method converges at a sublinear rate in function values: the number of iterations needed to reach F(x_k) − F(x*) ≤ ε is O(1/ε).

There is no need to know the Lipschitz constant (backtracking can be used).
Toward a Faster Scheme

For Prox-Grad and gradient methods: a rate of O(1/k)... rather slow.
For subgradient methods: a rate of O(1/√k)... even worse.

Can we find a faster method? The answer is YES!

Idea: start from an old algorithm of Nesterov (1983), designed for minimizing a smooth convex function and proven to be an "optimal" first-order method (Yudin-Nemirovsky (80)). But here our problem (M) is nonsmooth. Yet, we can derive an algorithm faster than Prox-Grad for the general problem.
A Fast Prox-Grad Algorithm [Beck-Teboulle (2009)]

An equally simple algorithm as Prox-Grad, here with constant step size.

Input: L = L(f), a Lipschitz constant of ∇f.
Step 0. Take y_1 = x_0 ∈ E, t_1 = 1.
Step k (k ≥ 1). Compute

  x_k = argmin_{x∈E} { g(x) + (L/2)||x − (y_k − (1/L)∇f(y_k))||² }
      ≡ prox_{g/L}( y_k − (1/L)∇f(y_k) ),   the same main computation as Prox-Grad,

  (•)  t_{k+1} = (1 + √(1 + 4t_k²)) / 2,

  (••) y_{k+1} = x_k + ((t_k − 1)/t_{k+1}) (x_k − x_{k−1}).

The additional computation in (•) and (••) is clearly marginal.
Knowledge of L(f) is not necessary; a backtracking line search can be used.
With g = 0 this is the smooth "optimal" gradient method of Nesterov (83); with t_k ≡ 1 we recover PGM.
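The scheme condenses to a few lines. A sketch for an assumed instance f(x) = ½||Ax − b||², g(x) = λ||x||_1 (so prox_{g/L} is soft-thresholding at level λ/L):

```python
import numpy as np

def fista(A, b, lam, n_iter=200):
    """Fast prox-grad (FISTA) sketch for min 0.5||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2                # Lipschitz constant of grad f
    x_prev = np.zeros(A.shape[1])
    y, t = x_prev.copy(), 1.0
    for _ in range(n_iter):
        z = y - A.T @ (A @ y - b) / L            # gradient step at y_k
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox_{g/L}
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0      # step (*)
        y = x + ((t - 1.0) / t_next) * (x - x_prev)            # step (**)
        x_prev, t = x, t_next
    return x_prev
```

Removing the momentum (taking t_k ≡ 1, so y_{k+1} = x_k) reduces this to plain Prox-Grad.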
An O(1/k²) Global Rate of Convergence for (M)

Theorem. Let {x_k} be generated by FPG. Then for any k ≥ 1,

  F(x_k) − F(x*) ≤ 2L(f)||x_0 − x*||² / (k + 1)².

The number of iterations needed to reach F(x̃) − F* ≤ ε is ~ O(1/√ε).

This improves Prox-Grad by a square-root factor, and on the practical side the theoretical rate is achieved.

Many computational studies have confirmed the efficiency of FPG for solving various optimization models in applications, in signal/image recovery and in machine learning: e.g., image denoising/deblurring, nuclear-norm regularization, matrix completion, multi-task learning, matrix classification, etc.

Some examples....
Back to Application: Linear Inverse Problems

Problem: find x ∈ C ⊂ E which "best" solves A(x) ≈ b, A : E → F, where b (the observable output) and A (the blurring operator) are known.

Approach: via regularization models, where
• g(x) is a "regularizer" (one function, or a sum of functions);
• d(b, A(x)) is some "proximity" measure from b to A(x):

  min { g(x) : A(x) = b, x ∈ C }
  min { g(x) : d(b, A(x)) ≤ ε, x ∈ C }
  min { d(b, A(x)) : g(x) ≤ δ, x ∈ C }
  min { d(b, A(x)) + λ g(x) : x ∈ C }   (λ > 0)

Nonsmooth regularizers are particularly useful, the l1 norm being a "starring" one, as it promotes sparsity. The squared l2 proximity ||b − A(x)||² will be the smooth part.
An Example: l1 Regularization

  min_x { ||Ax − b||² + λ||x||_1 } ≡ min_x { f(x) + g(x) }.

The proximal map of g(x) = λ||x||_1 is simply

  prox_{tg}(y) = argmin_u { (1/2t)||u − y||² + λ||u||_1 } = T_{λt}(y),

where T_α : R^n → R^n is the shrinkage or soft-threshold operator:

  T_α(x)_i = (|x_i| − α)_+ sgn(x_i).     (1)

Prox-Grad method = ISTA, the Iterative Shrinkage/Thresholding Algorithm.
Other names in the signal processing literature include, for example: thresholded Landweber method, iterative denoising, deconvolution algorithms... [Chambolle (98); Figueiredo-Nowak (03, 05); Daubechies et al. (04); Elad et al. (06); Hale et al. (07)...]
Fast Prox-Grad = FISTA.
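Written out, ISTA for this model is just a repeated gradient step followed by T. A minimal sketch (note that here f(x) = ||Ax − b||² without the ½, so ∇f(x) = 2Aᵀ(Ax − b) and L(f) = 2λ_max(AᵀA)):

```python
import numpy as np

def ista(A, b, lam, n_iter=500):
    """ISTA sketch for min ||Ax - b||^2 + lam*||x||_1."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2          # L(f) for f = ||Ax - b||^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - 2.0 * A.T @ (A @ x - b) / L      # gradient step, stepsize 1/L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # T_{lam/L}
    return x
```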
A Numerical Example: l1 Image Deblurring

  min_x { ||Ax − b||² + λ||x||_1 }

Comparing ISTA versus FISTA on problems where:
• the dimension d is, e.g., d = 256 × 256 = 65,536 and/or 512 × 512 = 262,144;
• the d × d matrix A is dense (Gaussian blurring times the inverse of a two-stage Haar wavelet transform);
• all problems are solved with fixed λ and Gaussian noise.
Example: l1 Deblurring of the Cameraman

[Figure: two panels: original; blurred and noisy.]
1000 Iterations of ISTA versus 200 of FISTA

[Figure: two panels: ISTA, 1000 iterations; FISTA, 200 iterations.]
Original Versus Deblurring via FISTA

[Figure: two panels: original; FISTA, 1000 iterations.]
Function Value Errors F(x_k) − F(x*)

[Figure: semilog plot of F(x_k) − F(x*) versus iteration (0 to 10,000), vertical scale 10^{-8} to 10^1, for ISTA and FISTA.]
Example 2: l1 versus TV Regularization

Main difference between l1 and TV regularization:
• prox of l1: simple and explicit (shrinkage/soft threshold).
• prox of TV: the TV-denoising subproblem requires an iterative method. With g = TV,

  x_{k+1} = D( x_k − (2/L)Aᵀ(Ax_k − b), 2λ/L ),

  where D(w, t) = argmin_x { ||x − w||² + 2t TV(x) }.

Here the prox operation is a TV-based denoising problem with no analytic expression. It can still be solved very efficiently via a smooth dual formulation attacked by a fast gradient method [Beck-Teboulle, 2009], which can be seen as an acceleration of [Chambolle 04, 05].

Important to note:
• The fast proximal gradient method is implementable if the prox operation can be computed efficiently.
• It is not a method for solving general nonsmooth convex problems. More on this soon...
MFISTA: Monotone FISTA

FISTA is not a monotone method. This is problematic when the prox is not computed exactly.

Input: L ≥ L(f), an upper bound on the Lipschitz constant of ∇f.
Step 0. Take y_1 = x_0 ∈ E, t_1 = 1.
Step k (k ≥ 1). Compute

  z_k = prox_{g/L}( y_k − (1/L)∇f(y_k) ),
  t_{k+1} = (1 + √(1 + 4t_k²)) / 2,
  x_k = argmin{ F(x) : x ∈ {z_k, x_{k−1}} },
  y_{k+1} = x_k + (t_k/t_{k+1})(z_k − x_k) + ((t_k − 1)/t_{k+1})(x_k − x_{k−1}).

With the same rate of convergence as FPG!
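A sketch of MFISTA for an assumed l1 instance (f(x) = ½||Ax − b||², g = λ||·||_1); the only changes from FISTA are the best-of-two choice of x_k and the extra momentum term:

```python
import numpy as np

def mfista(A, b, lam, n_iter=200):
    """Monotone FISTA sketch for F(x) = 0.5||Ax - b||^2 + lam*||x||_1."""
    F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])                     # x holds x_{k-1}
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        w = y - A.T @ (A @ y - b) / L
        z = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # z_k = prox step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        x_new = z if F(z) <= F(x) else x                       # monotone choice of x_k
        y = x_new + (t / t_next) * (z - x_new) \
            + ((t - 1.0) / t_next) * (x_new - x)               # momentum update
        x, t = x_new, t_next
    return x
```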
Lena and 3 Reconstructions – N = 100 Iterations

[Figure: panels: blurred and noisy; ISTA (F_100 = 0.606); MFISTA (F_100 = 0.466).]
Extension: Gradient Schemes with Non-Euclidean Distances

• All previous schemes were based on the squared Euclidean distance for measuring the proximity of two points in E.
• It is useful to exploit the geometry of the constraint set X.
• This is done by selecting a "distance-like" function.

Typical example: Bregman-type distances, based on a strongly convex kernel ψ:

  D_ψ(x, y) = ψ(x) − ψ(y) − ⟨x − y, ∇ψ(y)⟩.

The advantage of a non-Euclidean distance that adequately exploits the constraints is that it allows one to:
1. simplify the prox computation for the given constraint set;
2. often improve the constant in the complexity bound.

Mirror descent algorithms, extragradient-like methods, Lagrangians, smoothing... Nemirovsky-Yudin (80), Teboulle (92), Beck-Teboulle (03), Nemirovsky (04), Nesterov (05), Auslender-Teboulle (05)...
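As a concrete example of the kernel idea (a standard fact, not specific to this talk): with ψ(x) = Σ_j x_j log x_j on the unit simplex, the Bregman prox of a linear term has a closed multiplicative form, which is exactly a mirror-descent step:

```python
import numpy as np

def entropic_prox_step(x, grad, t):
    """argmin_u { <grad, u> + (1/t) * D_psi(u, x) } over the unit simplex,
    with psi(u) = sum_j u_j log u_j: u_j is proportional to x_j * exp(-t * grad_j)."""
    w = x * np.exp(-t * grad)
    return w / w.sum()

# Toy use (assumed): minimize <c, x> over the simplex by repeated steps.
c = np.array([1.0, 0.3, 2.0])
x = np.ones(3) / 3
for _ in range(100):
    x = entropic_prox_step(x, c, t=1.0)
# mass concentrates on the cheapest coordinate
```

No projection subroutine is needed: the multiplicative update stays on the simplex automatically, which is the point of matching the kernel to the constraint set.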
Convex Nonsmooth Composite: Lagrangian-Based Methods
Nonsmooth Convex with Separable Objective

  (P)  p* = inf{ φ(x) ≡ f(x) + g(Ax) : x ∈ R^n },

where f, g are both nonsmooth and A : R^n → R^m is a given linear map.

Problem (P) is equivalent, via the standard variable-splitting trick, to

  (P)  p* = inf{ f(x) + g(z) : Ax = z, x ∈ R^n, z ∈ R^m }.

Rockafellar ('76) has shown that the Proximal Point Algorithm, applied to the dual and primal-dual formulations of (P), produces:
• the Method of Multipliers (augmented Lagrangian method);
• the Proximal Method of Multipliers (PMM).

Largely ignored over the last 20 years... Recent strong revival in sparse optimization: compressive sensing, image processing, etc. A very nice recent survey with many machine learning applications: [Boyd et al. 2011].
The PMM - Rockafellar (76)

PMM (the Proximal Method of Multipliers): generate (x_k, z_k) and a dual multiplier y_k via

  (x_{k+1}, z_{k+1}) ∈ argmin_{x,z} { f(x) + g(z) + ⟨y_k, Ax − z⟩ + (c/2)||Ax − z||² + q_k(x, z) },
  y_{k+1} = y_k + c(Ax_{k+1} − z_{k+1}).

The augmented Lagrangian (c > 0):

  L_c(x, z, y) := f(x) + g(z) + ⟨y, Ax − z⟩ + (c/2)||Ax − z||²,

where the first three terms form the usual Lagrangian, and

  q_k(x, z) := ½( ||x − x_k||²_{M1} + ||z − z_k||²_{M2} )

is the additional primal proximal term. The choice of M1 ∈ S^n_+ and M2 ∈ S^m_+ is used to conveniently describe/analyze several variants of the PMM. Taking M1 = M2 ≡ 0 recovers the Method of Multipliers (PPA on the dual).
Proximal Method of Multipliers: Key Difficulty

Main computational step in PMM: minimize, with respect to (x, z), the proximal augmented Lagrangian

  f(x) + g(z) + ⟨y_k, Ax − z⟩ + (c/2)||Ax − z||² + q_k(x, z).

The quadratic coupling term ||Ax − z||² destroys the separability between x and z, preventing separate minimization in (x, z). In many applications, separate minimization is often much easier.

Strategies to overcome this difficulty:
• Approximate minimization: linearize the quadratic term ||Ax − z||² with respect to (x, z).
• Alternating minimization: à la "Gauss-Seidel" in (x, z).
• A mixture of the above: partial linearization with respect to one variable, combined with alternating minimization over the other.

These result in various interesting schemes; e.g., the last one can be shown to recover the efficient primal-dual method of [Chambolle-Pock, 2010].
A Prototype: Alternating Direction Proximal Method of Multipliers

Eliminate the (x, z) coupling via alternating minimization steps.

Glowinski-Marocco (75), Gabay-Mercier (76), Fortin-Glowinski (83), Eckstein-Bertsekas (91)...: the so-called Alternating Direction Method of Multipliers (ADM), based on the Method of Multipliers, i.e., M1 = M2 ≡ 0.

(AD-PMM) Alternating Direction Proximal Method of Multipliers
1. Start with any (x_0, z_0, y_0) ∈ R^n × R^m × R^m and c > 0.
2. For k = 0, 1, ..., generate the sequence {x_k, z_k, y_k} as follows:

  x_{k+1} ∈ argmin_x { f(x) + (c/2)||Ax − z_k + c^{−1}y_k||² + ½||x − x_k||²_{M1} },
  z_{k+1} = argmin_z { g(z) + (c/2)||Ax_{k+1} − z + c^{−1}y_k||² + ½||z − z_k||²_{M2} },
  y_{k+1} = y_k + c(Ax_{k+1} − z_{k+1}).
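With M1 = M2 ≡ 0 and the assumed instance f(x) = ½||Bx − b||², g = λ||·||_1, A = I (a lasso-type splitting, chosen for illustration), the three steps become a linear solve, a shrinkage, and a multiplier update:

```python
import numpy as np

def adm_lasso(B, b, lam, c=1.0, n_iter=300):
    """ADM sketch (AD-PMM with M1 = M2 = 0) for
    min 0.5||Bx - b||^2 + lam*||z||_1  s.t.  x = z."""
    n = B.shape[1]
    x = z = y = np.zeros(n)
    M = np.linalg.inv(B.T @ B + c * np.eye(n))   # factor once for the x-step
    Btb = B.T @ b
    for _ in range(n_iter):
        x = M @ (Btb + c * z - y)                # x-minimization (linear solve)
        w = x + y / c
        z = np.sign(w) * np.maximum(np.abs(w) - lam / c, 0.0)  # z-min: shrinkage
        y = y + c * (x - z)                      # multiplier update
    return z
```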
Global Rate of Convergence Results

For AD-PMM and many other variants [Shefi-Teboulle (2013)].

Let (x*, z*, y*) be a saddle point for the Lagrangian l associated with (P). Then for all N ≥ 1:

  l(x̄_N, z̄_N, y) − p* ≤ C²(x*, z*, y)/N,   for all y ∈ R^m.

In particular:
• f(x̄_N) + g(z̄_N) + r||Ax̄_N − z̄_N|| − p* ≤ c₁/N;
• residual norm: ||Ax̄_N − z̄_N|| ≤ c₂/√N;
• original primal objective (when g is Lipschitz continuous): φ(x̄_N) − φ(x*) ≤ c₃/N.

The c_i are positive constants (as usual, in terms of distances to an optimal solution). For any sequence {x_k, z_k, y_k} and any N ≥ 1, the ergodic sequences are

  x̄_N := (1/N) Σ_{k=0}^{N−1} x_{k+1},   z̄_N := (1/N) Σ_{k=0}^{N−1} z_{k+1},   ȳ_N := (1/N) Σ_{k=0}^{N−1} y_{k+1}.
Non-Convex Smooth Models
Sparse PCA

Principal Component Analysis solves

  max{ xᵀAx : ||x||_2 = 1, x ∈ R^n }   (A ⪰ 0),

while Sparse Principal Component Analysis solves

  max{ xᵀAx : ||x||_2 = 1, ||x||_0 ≤ k, x ∈ R^n },   k ∈ (1, n] a sparsity level,

where ||x||_0 counts the number of nonzero entries of x.

Issues:
1. Maximizing a convex objective.
2. The hard nonconvex constraint ||x||_0 ≤ k.

Possible approaches:
1. SDP convex relaxations [D'Aspremont et al. 2008].
2. Approximation/modified formulations: many proposed approaches.
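While the feasible set {x : ||x||_2 = 1, ||x||_0 ≤ k} is nonconvex, projecting onto it is easy: keep the k largest-magnitude entries and renormalize. A small sketch of this building block (only the projection, not a full sparse-PCA method):

```python
import numpy as np

def project_sparse_sphere(x, k):
    """Euclidean projection onto {u : ||u||_2 = 1, ||u||_0 <= k}:
    zero out all but the k largest-magnitude entries, then renormalize.
    Assumes the retained entries are not all zero."""
    u = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]      # indices of the k largest |x_j|
    u[idx] = x[idx]
    return u / np.linalg.norm(u)
```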
Sparse PCA via Penalization/Relaxation/Approximation

The problem of interest is the difficult sparse PCA problem as is:

  max{ xᵀAx : ||x||_2 = 1, ||x||_0 ≤ k, x ∈ R^n }.

The literature has focused on solving various modifications:
• l0-penalized PCA: max{ xᵀAx − s||x||_0 : ||x||_2 = 1 }, s > 0;
• relaxed l1-constrained PCA: max{ xᵀAx : ||x||_2 = 1, ||x||_1 ≤ √k };
• relaxed l1-penalized PCA: max{ xᵀAx − s||x||_1 : ||x||_2 = 1 };
• approximation-penalized: max{ xᵀAx − s g_p(|x|) : ||x||_2 = 1 }, with g_p(x) ≈ ||x||_0;
• SDP convex relaxations: max{ tr(AX) : tr(X) = 1, X ⪰ 0, ||X||_1 ≤ k }.

SDP relaxations are often too computationally expensive for large problems, and no algorithm gives bounds on the optimal value of the original problem. Even when "simple", these algorithms solve modifications:
• they do not solve the original problem of interest;
• they require an unknown penalty parameter s to be tuned.
Quick Highlight of Simple Algorithms for "Modified Problems"
Each scheme is a threshold-and-normalize iteration (Type / Iteration / Per-iteration complexity / References):

- l1-constrained: x^{j+1} is a soft-threshold of a matrix–vector product, normalized, with threshold λ_j chosen so that the l1 constraint holds. O(n²), O(mn). Witten et al. (2009)
- l1-constrained: x^{j+1}_i = sgn((Axʲ)_i)(|(Axʲ)_i| − s_j)₊ / √(Σ_h (|(Axʲ)_h| − s_j)₊²), where s_j is the (k+1)-largest entry of the vector |Axʲ|. O(n²), O(mn). Sigg–Buhmann (2008)
- l0-penalized: z^{j+1} = Σ_i [sgn((b_iᵀzʲ)² − s)]₊ (b_iᵀzʲ) b_i / ‖Σ_i [sgn((b_iᵀzʲ)² − s)]₊ (b_iᵀzʲ) b_i‖₂. O(mn). Shen–Huang (2008), Journée et al. (2010)
- l0-penalized: x^{j+1}_i = sgn(2(Axʲ)_i)(|2(Axʲ)_i| − s·φ′ₚ(|xʲ_i|))₊ / √(Σ_h (|2(Axʲ)_h| − s·φ′ₚ(|xʲ_h|))₊²). O(n²). Sriperumbudur et al. (2010)
- l1-penalized: y^{j+1} = argmin_y {Σ_i ‖b_i − xʲyᵀb_i‖₂² + λ‖y‖₂² + s‖y‖₁}, then x^{j+1} = (Σ_i b_i b_iᵀ)y^{j+1} / ‖(Σ_i b_i b_iᵀ)y^{j+1}‖₂. Zou et al. (2006)
- l1-penalized: z^{j+1} = Σ_i (|b_iᵀzʲ| − s)₊ sgn(b_iᵀzʲ) b_i / ‖Σ_i (|b_iᵀzʲ| − s)₊ sgn(b_iᵀzʲ) b_i‖₂. O(mn). Shen–Huang (2008), Journée et al. (2010)
A Plethora of Models/Algorithms Revisited [Luss-Teboulle (2013)]

All the previously listed algorithms have been derived from various disparate approaches and motivations for solving modifications of SPCA: expectation maximization; majorization–minimization techniques; DC programming; alternating minimization; etc.
1. Are all these algorithms different? Is there any connection?
2. Is it possible to tackle the difficult sparse PCA problem "as is"?

Very recently we have shown that:

All the previously listed algorithms are particular realizations of a "father algorithm": ConGradU (based on the well-known conditional gradient algorithm), and ConGradU CAN be applied directly to the original problem!
Maximizing a Convex Function over a Compact Nonconvex Set

Classic conditional gradient algorithm [Frank-Wolfe '56, Polyak '63, Dunn '79, ...] solves max{F(x) : x ∈ C}, with F ∈ C¹ and C convex compact:

x⁰ ∈ C, pʲ = argmax{⟨x − xʲ, ∇F(xʲ)⟩ : x ∈ C},
x^{j+1} = xʲ + α_j(pʲ − xʲ), with α_j ∈ (0, 1] a stepsize.
Here: F is convex, possibly nonsmooth; C is compact but nonconvex.

The idea goes back to Mangasarian (1996), developed for C a polyhedral set.

ConGradU – Conditional Gradient with Unit Step Size

x⁰ ∈ C, x^{j+1} ∈ argmax{⟨x − xʲ, F′(xʲ)⟩ : x ∈ C}

Notes:
1. F is not assumed to be differentiable; F′(x) is a subgradient of F at x.
2. Useful when max{⟨x − xʲ, F′(xʲ)⟩ : x ∈ C} is easy to solve.
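The scheme above can be sketched in a few lines of numpy. The test problem below (maximizing the convex function x ↦ xᵀAx/2 over the unit sphere, where the linear oracle is g ↦ g/‖g‖₂ and ConGradU reduces to the classical power method) is an illustrative assumption, not part of the slides:

```python
import numpy as np

def congrad_u(linear_oracle, subgrad, x0, iters=200):
    """ConGradU sketch: x_{j+1} in argmax{<x - x_j, F'(x_j)> : x in C}.

    The constant shift -<x_j, F'(x_j)> does not affect the argmax, so
    linear_oracle(g) only needs to maximize <x, g> over C.
    """
    x = x0
    for _ in range(iters):
        x = linear_oracle(subgrad(x))
    return x

# Hypothetical test problem: F(x) = x^T A x / 2 (convex) over the unit
# sphere C = {x : ||x||_2 = 1}.  The oracle is g -> g/||g||_2, so
# ConGradU reduces to the power method for the leading eigenvector.
A = np.diag([5.0, 2.0, 1.0, 0.5, 0.1])
x = congrad_u(linear_oracle=lambda g: g / np.linalg.norm(g),
              subgrad=lambda v: A @ v,
              x0=np.ones(5) / np.sqrt(5.0))
```

On this toy matrix the iterates converge to the leading eigenvector e₁, illustrating why the method is attractive whenever the linear oracle over C is cheap.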
Solving the Original l0-constrained PCA via ConGradU

Applying ConGradU directly to max{xᵀAx : ‖x‖₂ = 1, ‖x‖₀ ≤ k, x ∈ ℝⁿ} results in

x^{j+1} = argmax{(xʲ)ᵀAx : ‖x‖₂ = 1, ‖x‖₀ ≤ k} = T_k(Axʲ) / ‖T_k(Axʲ)‖₂,
T_k(a) := argmin_x {‖x − a‖₂² : ‖x‖₀ ≤ k}.

Despite the hard constraint, T_k is easy to compute: (T_k(a))_i = a_i for the k largest entries (in absolute value) of a, and (T_k(a))_i = 0 otherwise.

Convergence: every limit point of {xʲ} is a stationary point.
Complexity: O(kn) or O(mn).

Thus, the original problem can be solved using ConGradU with the same complexity as when it is applied to the modifications! Penalized/modified problems require tuning an unknown tradeoff penalty parameter, which can be very computationally expensive and is not needed here.
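The truncation operator T_k and the resulting iteration have a direct numpy transcription. This is a sketch: the tie-breaking rule inside T_k and the diagonal toy matrix are illustrative assumptions, not from the slides:

```python
import numpy as np

def T_k(a, k):
    """T_k(a): keep the k largest-magnitude entries of a, zero out the rest."""
    out = np.zeros_like(a)
    keep = np.argsort(np.abs(a))[-k:]   # indices of the k largest |a_i|
    out[keep] = a[keep]
    return out

def sparse_pca_congradu(A, k, x0, iters=100):
    """ConGradU applied to the original problem
    max{x^T A x : ||x||_2 = 1, ||x||_0 <= k}:
    x_{j+1} = T_k(A x_j) / ||T_k(A x_j)||_2."""
    x = x0
    for _ in range(iters):
        g = T_k(A @ x, k)
        x = g / np.linalg.norm(g)
    return x

# Toy example (an assumption): for a diagonal A, the k-sparse maximizer
# is supported on the largest diagonal entries.
A = np.diag([5.0, 2.0, 1.0, 0.5, 0.1])
x = sparse_pca_congradu(A, k=2, x0=np.ones(5) / np.sqrt(5.0))
```

Every iterate is feasible for the original problem (unit norm and at most k nonzeros), which is exactly what the modified formulations cannot guarantee.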
ConGradU for a General Class of Problems

(G) max_x {f(x) + g(|x|) : x ∈ C}

f : ℝⁿ → ℝ is convex, C ⊆ ℝⁿ is a compact set, and g : ℝⁿ₊ → ℝ is convex, differentiable, and monotonically decreasing.

Particularly useful for handling approximate l0-penalized problems.

ConGradU applied to (G) produces the following simple weighted l1-norm maximization problem:

x⁰ ∈ C, x^{j+1} = argmax{⟨aʲ, x⟩ − Σ_i wʲ_i |x_i| : x ∈ C}, j = 0, 1, ...,

where wʲ := −g′(|xʲ|) > 0 and aʲ := f′(xʲ) ∈ ℝⁿ.

For penalized/approximately penalized SPCA, C is the unit ball, and the above admits a closed-form solution:

x^{j+1} = S_{wʲ}(f′(xʲ)) / ‖S_{wʲ}(f′(xʲ))‖, j = 0, 1, ...; S_w(a) := (|a| − w)₊ sgn(a) (soft threshold).
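The closed-form step above is two lines of numpy. The numerical check at the end is an illustrative assumption:

```python
import numpy as np

def soft_threshold(a, w):
    """S_w(a) = (|a| - w)_+ * sgn(a), applied elementwise."""
    return np.maximum(np.abs(a) - w, 0.0) * np.sign(a)

def congradu_penalized_step(grad_f, w):
    """One ConGradU step for (approximately) penalized SPCA, where C is
    the unit ball: x_{j+1} = S_{w_j}(f'(x_j)) / ||S_{w_j}(f'(x_j))||_2."""
    s = soft_threshold(grad_f, w)
    return s / np.linalg.norm(s)

# Entries with |a_i| <= w_i are zeroed; the rest shrink toward 0.
a = np.array([3.0, -1.0, 0.5])
step = congradu_penalized_step(a, w=np.array([1.0, 1.0, 1.0]))
```

This single step is exactly the building block that the tabulated l1/l0-penalized algorithms repeat, which is how they all fall out of ConGradU.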
Non-Convex and NonSmooth
A Nonsmooth Nonconvex Optimization Model

(M) minimize_{x,y} Ψ(x, y) := f(x) + g(y) + H(x, y)

Assumption
(i) f : ℝⁿ → (−∞, +∞] and g : ℝᵐ → (−∞, +∞] are proper and lsc functions.
(ii) H : ℝⁿ × ℝᵐ → ℝ is a C¹ function.
(iii) The partial gradients of H are Lipschitz continuous: H(·, y) ∈ C^{1,1}_{L₁(y)} and H(x, ·) ∈ C^{1,1}_{L₂(x)}.

NO convexity is assumed in the objective or the constraints (the latter built in through the extended-valued f and g).
The choice of two blocks of variables is only for the sake of simplicity of exposition.
The optimization model (M) covers many applications: signal/image processing, machine learning, etc. Vast literature...
Goal

Derive a simple scheme.
Prove that the whole sequence {zᵏ}_{k∈ℕ} := {(xᵏ, yᵏ)}_{k∈ℕ} converges to a critical point of Ψ.
Exploit partial smoothness: blend alternating minimization with proximal-gradient steps.
Exploit further data information: build a general recipe and algorithmic framework to prove convergence for a broad class of nonconvex nonsmooth problems.
J. Bolte, S. Sabach, M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, Series A. Just published online.
The Algorithm: Proximal Alternating Linearized Minimization (PALM)

1. Initialization: start with any (x⁰, y⁰) ∈ ℝⁿ × ℝᵐ.
2. For each k = 0, 1, ... generate a sequence {(xᵏ, yᵏ)}_{k∈ℕ}:
2.1. Take γ₁ > 1, set c_k = γ₁L₁(yᵏ), and compute
x^{k+1} ∈ prox^f_{c_k}(xᵏ − (1/c_k)∇ₓH(xᵏ, yᵏ)).
2.2. Take γ₂ > 1, set d_k = γ₂L₂(x^{k+1}), and compute
y^{k+1} ∈ prox^g_{d_k}(yᵏ − (1/d_k)∇_yH(x^{k+1}, yᵏ)).

Main computational step: the prox of a "nonconvex" function.

An interesting example will be given later. (More in our paper...)
Convergence of PALM: If the Data [f, g, H] is Semi-Algebraic

Theorem (Bolte-Sabach-Teboulle (2013))
Let {zᵏ}_{k∈ℕ} be a sequence generated by PALM. The following assertions hold.
(i) The sequence {zᵏ}_{k∈ℕ} has finite length, that is, Σ_{k=1}^{∞} ‖z^{k+1} − zᵏ‖ < ∞.
(ii) The sequence {zᵏ}_{k∈ℕ} converges to a critical point z* = (x*, y*) of Ψ.

Are there many semi-algebraic functions?
How do we prove this convergence result?
An Informal General Convergence Proof Recipe

Let Ψ : ℝᴺ → (−∞, +∞] be a proper, lsc, and bounded-from-below function.

(P) inf{Ψ(z) : z ∈ ℝᴺ}.

Suppose A is a generic algorithm which generates a sequence {zᵏ}_{k∈ℕ} via:

z⁰ ∈ ℝᴺ, z^{k+1} ∈ A(zᵏ), k = 0, 1, ....

Goal: prove that the whole sequence {zᵏ}_{k∈ℕ} converges to a critical point of Ψ.
The Recipe

Basically, the "recipe" consists of three main steps.

(i) Sufficient decrease property: find a positive constant ρ₁ such that

ρ₁‖z^{k+1} − zᵏ‖² ≤ Ψ(zᵏ) − Ψ(z^{k+1}), ∀k = 0, 1, ....

(ii) A subgradient lower bound for the iterates gap: assume that {zᵏ}_{k∈ℕ} is bounded, and find another positive constant ρ₂ such that

‖wᵏ‖ ≤ ρ₂‖zᵏ − z^{k−1}‖, wᵏ ∈ ∂Ψ(zᵏ), ∀k = 0, 1, ....

These two steps are typical of any descent-type algorithm, but lead ONLY to convergence of limit points.

Does the whole sequence {zᵏ}_{k∈ℕ} converge to a critical point of problem (M)?
The Third Main Step of our Recipe

(iii) The Kurdyka-Łojasiewicz property: assume that Ψ is a KL function. Use this property to prove that the generated sequence {zᵏ}_{k∈ℕ} is a Cauchy sequence, and thus converges!

We stress that this general recipe singles out the 3 main ingredients at play in analyzing many other optimization algorithms in the nonconvex and nonsmooth setting. KL is the key to proving that the sequence generated by algorithm A is Cauchy... a rare event for optimization schemes!

The remaining questions:
What is a KL function? [Łojasiewicz (68), Kurdyka (98), Bolte et al. (06, 07, 10)]
Are there many KL functions?
Are there Many Functions Satisfying KL?

YES! Semi-algebraic functions.

Theorem (Bolte-Daniilidis-Lewis (2006))
Let σ : ℝᵈ → (−∞, +∞] be a proper and lsc function. If σ is semi-algebraic then it satisfies the KL property at any point of dom σ.

Recall: semi-algebraic sets and functions
(i) A semi-algebraic subset of ℝᵈ is a finite union of sets

{x ∈ ℝᵈ : p_i(x) = 0, q_j(x) < 0, i ∈ I, j ∈ J},

where p_i, q_j : ℝᵈ → ℝ are real polynomial functions and I, J are finite.
(ii) A function σ is semi-algebraic if its graph is a semi-algebraic set.
There is a Wealth of Semi-Algebraic Functions!

Operations preserving the semi-algebraic property:
- Finite sums and products of semi-algebraic functions.
- Composition of semi-algebraic functions.
- Sup/inf-type functions, e.g., sup{g(u, v) : v ∈ C} is semi-algebraic when g is a semi-algebraic function and C a semi-algebraic set.

Some semi-algebraic sets/functions "starring" in optimization:
- Real polynomial functions.
- Indicator functions of semi-algebraic sets.
- In matrix theory: the cone of PSD matrices, constant-rank matrices, Stiefel manifolds...
- The function x → dist(x, S)² is semi-algebraic whenever S is a nonempty semi-algebraic subset of ℝⁿ.
- ‖·‖₀ is semi-algebraic.
- ‖·‖_p is semi-algebraic whenever p > 0 is rational.
Application to a Broad Class of Matrix Factorization Problems

Given A ∈ ℝ^{m×n} and r ≪ min{m, n}, find X ∈ ℝ^{m×r} and Y ∈ ℝ^{r×n} such that

A ≈ XY, X ∈ K_{m,r} ∩ F, Y ∈ K_{r,n} ∩ G,

where
K_{p,q} = {M ∈ ℝ^{p×q} : M ≥ 0},
F = {X ∈ ℝ^{m×r} : R₁(X) ≤ α},
G = {Y ∈ ℝ^{r×n} : R₂(Y) ≤ β}.

Here R₁ and R₂ are lsc functions and α, β ∈ ℝ₊ are given parameters. R₁ (R₂) is often used to describe additional features of X (Y).

(MF) covers a very large number of problems in applications...
The Optimization Approach

We adopt the constrained nonconvex nonsmooth formulation

(MF) min{d(A, XY) : X ∈ K_{m,r} ∩ F, Y ∈ K_{r,n} ∩ G},

where d : ℝ^{m×n} × ℝ^{m×n} → ℝ₊ is a proximity function measuring the quality of the approximation, with d(U, V) = 0 if and only if U = V.

Another way, the penalized version: the "hard" constraints are candidates to be penalized, with µ₁, µ₂ > 0 penalty parameters:

(P-MF) min{µ₁R₁(X) + µ₂R₂(Y) + d(A, XY) : X ∈ K_{m,r}, Y ∈ K_{r,n}}.

Note: the penalty approach requires tuning the unknown penalty parameters, which can be a difficult issue.

Both formulations fit our general nonsmooth nonconvex model (M) with obvious identifications for H, f, g. We now illustrate on two important models with semi-algebraic data.
Marc Teboulle (Tel Aviv University) Gradient Based Optimization Algorithms
Model I – Nonnegative Matrix Factorization Problems

Let the proximity measure be defined via the Frobenius norm:

    d(A, XY) := H(X, Y) = (1/2) ‖A − XY‖²_F,   with   F ≡ R^{m×r}, G ≡ R^{r×n}.

The problem (MF) then reduces to the so-called Nonnegative Matrix Factorization (NMF) problem:

    min { (1/2) ‖A − XY‖²_F : X ≥ 0, Y ≥ 0 }.

H is a real polynomial function, hence semi-algebraic.
X → H(X, Y) (for fixed Y) and Y → H(X, Y) (for fixed X) are C^{1,1}, with
    L_1(Y) ≡ ‖YY^T‖_F,   L_2(X) ≡ ‖X^T X‖_F.
H is C² on bounded subsets.

Thus we can PALM it! The two computational steps reduce to projection onto the nonnegative cone of matrices, which is trivial:

    P_+(U) := argmin { ‖U − V‖²_F : V ∈ R^{m×n}, V ≥ 0 } = max{0, U}.
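The two computational steps above are easy to put in code. Below is a minimal numpy sketch (the function names are our own) of one alternating pass of projected-gradient updates for the NMF objective, using the blockwise Lipschitz constants L_1(Y) = ‖YY^T‖_F and L_2(X) = ‖X^T X‖_F to set the step sizes and P_+ as the projection; it illustrates the structure, not the authors' exact implementation.

```python
import numpy as np

def project_nonneg(U):
    """P_+(U) = max{0, U}: entrywise projection onto the nonnegative cone."""
    return np.maximum(U, 0.0)

def nmf_block_gradient_steps(A, X, Y):
    """One alternating pass of projected-gradient updates for
    min (1/2)||A - XY||_F^2 s.t. X >= 0, Y >= 0, with step sizes 1/L_1, 1/L_2
    given by the blockwise Lipschitz constants (assumes X, Y are nonzero)."""
    L1 = np.linalg.norm(Y @ Y.T, 'fro')      # L_1(Y) = ||Y Y^T||_F
    X = project_nonneg(X - (X @ Y - A) @ Y.T / L1)   # grad_X H = (XY - A) Y^T
    L2 = np.linalg.norm(X.T @ X, 'fro')      # L_2(X) = ||X^T X||_F
    Y = project_nonneg(Y - X.T @ (X @ Y - A) / L2)   # grad_Y H = X^T (XY - A)
    return X, Y
```

Since 1/L_1 and 1/L_2 are valid descent step sizes for each block, repeating this pass monotonically decreases the objective.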
Model II – Sparse Constraints in NMF Problems

Now consider in NMF the overall sparsity measure of a matrix, defined by

    R_1(X) = ‖X‖_0 := Σ_i ‖x_i‖_0   (x_i the i-th column of X);   R_2(Y) = ‖Y‖_0.

To apply PALM, all we need is to compute the prox of f := δ_{X ≥ 0} + δ_{‖X‖_0 ≤ s}. It turns out that this can be done simply!

Proposition (Proximal map formula for f). Let U ∈ R^{m×n}. Then

    prox_f(U) = argmin { (1/2) ‖X − U‖²_F : X ≥ 0, ‖X‖_0 ≤ s } = T_s(P_+(U)),

where

    T_s(U) := argmin_{V ∈ R^{m×n}} { ‖U − V‖²_F : ‖V‖_0 ≤ s }.

Computing T_s simply requires determining the s-th largest of the mn entries, which can be done in O(mn) time, and then zeroing out the proper entries in one more pass over the mn entries.
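The proposition reduces the prox to a projection followed by hard thresholding. A minimal numpy sketch (the function name is our own; `np.argpartition`'s introselect provides the O(mn) selection mentioned on the slide):

```python
import numpy as np

def prox_nonneg_sparse(U, s):
    """Prox of f = delta_{X >= 0} + delta_{||X||_0 <= s}, i.e. T_s(P_+(U)):
    project entrywise onto the nonnegative cone, then keep the s largest
    entries and zero out the rest (assumes 1 <= s <= U.size)."""
    P = np.maximum(U, 0.0)                    # P_+(U)
    flat = P.ravel()
    keep = np.argpartition(flat, -s)[-s:]     # indices of the s largest, O(mn)
    out = np.zeros_like(flat)
    out[keep] = flat[keep]                    # one more pass zeroes the rest
    return out.reshape(P.shape)
```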
PALM for Sparse NMF

1. Initialization: select random nonnegative X^0 ∈ R^{m×r} and Y^0 ∈ R^{r×n}.
2. For each k = 0, 1, ..., generate a sequence {(X^k, Y^k)}_{k∈N}:

2.1. Take γ_1 > 1, set c_k = γ_1 ‖Y^k (Y^k)^T‖_F and compute

    U^k = X^k − (1/c_k)(X^k Y^k − A)(Y^k)^T;   X^{k+1} ∈ prox^{R_1}_{c_k}(U^k) = T_α(P_+(U^k)).

2.2. Take γ_2 > 1, set d_k = γ_2 ‖(X^{k+1})^T X^{k+1}‖_F and compute

    V^k = Y^k − (1/d_k)(X^{k+1})^T (X^{k+1} Y^k − A);   Y^{k+1} ∈ prox^{R_2}_{d_k}(V^k) = T_β(P_+(V^k)).

Applying our main theorem, we get the global convergence result:

Let {(X^k, Y^k)}_{k∈N} be a sequence generated by PALM-Sparse NMF. If inf_{k∈N} { ‖X^k‖_F, ‖Y^k‖_F } > 0, then {(X^k, Y^k)}_{k∈N} converges to a critical point (X*, Y*) of the Sparse NMF problem.
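The two steps above can be sketched end to end. This is a hedged numpy sketch, not the authors' code: the parameters `s_x`, `s_y` stand in for the sparsity levels α and β, and the small floor on c_k, d_k is our own safeguard against a zero block, in the spirit of the theorem's assumption inf{‖X^k‖_F, ‖Y^k‖_F} > 0.

```python
import numpy as np

def hard_threshold_nonneg(U, s):
    """T_s(P_+(U)): project entrywise onto >= 0, then keep the s largest
    entries and zero out the rest (assumes 1 <= s <= U.size)."""
    P = np.maximum(U, 0.0)
    flat = P.ravel()
    keep = np.argpartition(flat, -s)[-s:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(P.shape)

def palm_sparse_nmf(A, r, s_x, s_y, iters=200, gamma1=1.1, gamma2=1.1, seed=0):
    """Sketch of PALM for Sparse NMF following steps 2.1-2.2 of the slide."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = rng.random((m, r))                   # random nonnegative X^0
    Y = rng.random((r, n))                   # random nonnegative Y^0
    for _ in range(iters):
        # Step 2.1: gradient step on X with c_k = gamma1 * ||Y Y^T||_F, then prox
        ck = gamma1 * max(np.linalg.norm(Y @ Y.T, 'fro'), 1e-10)
        U = X - (X @ Y - A) @ Y.T / ck
        X = hard_threshold_nonneg(U, s_x)
        # Step 2.2: gradient step on Y with d_k = gamma2 * ||X^T X||_F, then prox
        dk = gamma2 * max(np.linalg.norm(X.T @ X, 'fro'), 1e-10)
        V = Y - X.T @ (X @ Y - A) / dk
        Y = hard_threshold_nonneg(V, s_y)
    return X, Y
```

Every iterate stays nonnegative with at most `s_x` (resp. `s_y`) nonzeros, so each step costs only a matrix multiply plus an O(mn) selection.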
Concluding Remarks on FOM

- Motivates new model formulations and novel/refined algorithms which exploit special structures.
- Powerful for constructing cheap iterations.
- Efficient algorithms for many applied optimization models with structure.
- Further research is needed on simple and efficient schemes that can cope with the curse of dimensionality and with nonconvex/nonsmooth settings.
THANK YOU FOR YOUR ATTENTION!