Gradient Based Optimization Algorithms
Marc Teboulle
School of Mathematical Sciences, Tel Aviv University
Based on joint works with
Amir Beck (Technion), Jérôme Bolte (Toulouse I), Ronny Luss (IBM, New York), Shoham Sabach (Göttingen), Ron Shefi (Tel Aviv)
EPFL - Lausanne, February 4, 2014
Marc Teboulle (Tel Aviv University) Gradient Based Optimization Algorithms
Opening Remark and Credit

More than 380 years ago..... In 1629, Fermat suggested the following:

Given f, solve for x:

$$\left.\frac{f(x+d)-f(x)}{d}\right|_{d=0} = 0$$

...We can hardly expect to find a more general method to get the maximum or minimum points on a curve.....
Pierre de Fermat
First-Order/Gradient Methods
First-order methods are iterative methods that only exploit information on the objective function and its gradient (subgradient).

A main drawback: they can be very slow at producing high-accuracy solutions....
But... they also share many advantages:
- Require minimal information, e.g., (f, f').
- Often lead to very simple and "cheap" iterative schemes.
- Suitable for large-scale problems when high accuracy is not crucial. [In many large-scale applications, the data is anyway corrupted or known only roughly..]
Polynomial versus Gradient Methods
Convex problems are polynomially solvable within ε accuracy:

Running Time ≤ Poly(problem's size, # of accuracy digits).

Theoretically: this means that large-scale problems can be solved to high accuracy with polynomial methods, such as IPM.
Practically: running time is dimension-dependent and grows nonlinearly with the problem's dimension. For IPM, which are Newton-type methods: ∼O(n³).

Thus, a "single iteration" of IPM can last forever!
Gradient-Based Algorithms
Widely used in applications....
- Clustering Analysis: the k-means algorithm
- Neuro-computing: the backpropagation algorithm
- Statistical Estimation: the EM (Expectation-Maximization) algorithm
- Machine Learning: SVM, regularized regression, etc.
- Signal and Image Processing: sparse recovery, denoising and deblurring schemes, Total Variation minimization...
- Matrix minimization problems
...and much more...
Goals and Outline
Building simple and efficient First-Order Methods (FOM), exploiting problem structures/data information, and analyzing their complexity and rate of convergence.
Outline
- Problem Formulations: convex and nonconvex models
- Algorithms: derivation and analysis
- Applications: linear inverse problems, image processing, dimension reduction [PCA, NMF]
Algorithms - Historical Development
Old schemes for general problems "resuscitated" and "improved" by exploiting special structures and data info.
- Fixed point methods [Babylonian times!/Heron for square roots, Picard, Banach, Weisfield '34]
- Coordinate descent/Alternating minimization [Gauss-Seidel 1798]
- Gradient/subgradient methods [Cauchy 1846, Shor '61, Rosen '63, Frank-Wolfe '56, Polyak '62]
- Stochastic gradients [Robbins and Monro '51]
- Proximal algorithms [Martinet '70, Rockafellar '76, Passty '79]
- Penalty/barrier methods [Courant '49, Fiacco-McCormick '66]
- Augmented Lagrangians and splitting [Arrow-Hurwicz '58, Hestenes-Powell '69, Goldstein-Treyakov '72, Rockafellar '74, Mercier-Lions '79, Fortin-Glowinski '76, Bertsekas '82]
- Extragradient methods for VI [Korpelevich '76, Konnov '80]
- Optimal/fast gradient schemes [Nemirovski-Yudin '81, Nesterov '83]
Gradient-Based Methods for Convex Composite Models: Smooth + Nonsmooth
A Useful Convex Optimization Model
$$(\mathrm{M}) \qquad \min\,\{F(x) = f(x) + g(x) : x \in \mathbb{E}\}$$

- E is a finite-dimensional Euclidean space.
- f: E → R is convex and smooth of type C^{1,1}, i.e.,

$$\exists\, L(f) > 0 : \quad \|\nabla f(x) - \nabla f(y)\| \le L(f)\,\|x - y\|, \quad \forall\, x, y.$$

- g: E → (−∞, ∞] is convex, nonsmooth, and extended-valued (allowing constraints).
- We assume that (M) is solvable, i.e., X* := argmin F ≠ ∅, and for x* ∈ X*, we set F* := F(x*).

The model (M) already carries structural information, and it is rich enough to recover various classes of smooth/nonsmooth convex minimization problems.
Special Cases of the General Model
- g = 0: smooth unconstrained convex minimization, $\min_x f(x)$.
- g = δ(·|C): constrained smooth convex minimization, $\min_x \{f(x) : x \in C\}$.
- f = 0: general convex optimization, $\min_x g(x)$.
Anecdote: The World’s Simplest Impossible problem
[From C. Moler (1990)]
Problem: The average of two numbers is 3. What are the numbers?

Typical answers: (2,4), (1,5), (−3,9)...... These already ask for "structure": ..equal distance from the average.. integer numbers..
- Why not (2.71828, 3.28172)!?
- A nice one: (3,3)....it has "minimal norm" and it is unique!
- Simplest: (6,0) or (0,6)?... A sparse one! .... but here uniqueness is lost!..

This simple problem captures the essence of many ill-posed/underdetermined problems in applications.

Additional requirements/constraints have to be specified to make it a reasonable mathematical/computational task; they often lead to convex optimization models.
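The "minimal norm" answer (3,3) is exactly the least-norm solution of the underdetermined system "average = 3". A minimal sketch, assuming NumPy:

```python
import numpy as np

# Underdetermined system: the average of x1 and x2 is 3,
# i.e. A @ x = b with A = [[0.5, 0.5]], b = [3].
A = np.array([[0.5, 0.5]])
b = np.array([3.0])

# The pseudo-inverse picks the minimum-norm solution: (3, 3).
x_min_norm = np.linalg.pinv(A) @ b
print(x_min_norm)  # -> [3. 3.]
```

Any other answer, e.g. the sparse (6,0), requires adding a different selection criterion to the model.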
Linear Inverse Problems
Problem: Find x ∈ C ⊂ E which "best" solves A(x) ≈ b, where A: E → F and b (the observable output) are known.

Approach: via Regularization Models
• g(x) is a "regularizer" (one function, or a sum of functions)
• d(b, A(x)) is some "proximity" measure from b to A(x)

$$\min\,\{g(x) : A(x) = b,\; x \in C\}$$
$$\min\,\{g(x) : d(b, A(x)) \le \varepsilon,\; x \in C\}$$
$$\min\,\{d(b, A(x)) : g(x) \le \delta,\; x \in C\}$$
$$\min\,\{d(b, A(x)) + \lambda g(x) : x \in C\} \quad (\lambda > 0)$$

• Intensive research activity over the last 50 years... Now even more, with sparse optimization problems..
• The choices of g(·) and d(·, ·) depend on the application at hand. Nonsmooth regularizers are particularly useful.
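For instance, with d(b, A(x)) = ½‖Ax − b‖² and the smooth regularizer g(x) = ½‖x‖², the fourth (penalized) model has the classical Tikhonov closed-form solution. A minimal sketch, assuming NumPy; the data here is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 10))   # underdetermined: 5 equations, 10 unknowns
b = rng.standard_normal(5)
lam = 0.1                          # regularization weight lambda > 0

# min_x 0.5*||A x - b||^2 + 0.5*lam*||x||^2  <=>  (A^T A + lam*I) x = A^T b
x = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ b)

# Optimality check: the gradient of the objective vanishes at x.
grad = A.T @ (A @ x - b) + lam * x
print(np.linalg.norm(grad))        # ~ 0 (up to floating-point error)
```

Nonsmooth choices of g (next slides) have no such closed form, which is where prox-based schemes enter.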
Image Processing
g = λ‖L(·)‖₁ — ℓ₁-regularized convex problem:

$$\min_x \{f(x) + \lambda\|Lx\|_1\}; \qquad f(x) := d(b, Ax)$$

L — identity, differential operator, wavelet.

g = TV(·) — TV-based regularization. Grid points (i, j) with the same intensity as their neighbors [ROF 1992]:

$$\min_x \{f(x) + \lambda\,\mathrm{TV}(x)\}$$

1-dim: $\mathrm{TV}(x) = \sum_i |x_i - x_{i+1}|$
2-dim:
isotropic: $\mathrm{TV}(x) = \sum_i \sum_j \sqrt{(x_{i,j} - x_{i+1,j})^2 + (x_{i,j} - x_{i,j+1})^2}$
anisotropic: $\mathrm{TV}(x) = \sum_i \sum_j \left(|x_{i,j} - x_{i+1,j}| + |x_{i,j} - x_{i,j+1}|\right)$

More on this soon...
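The 2-D TV sums above translate directly into array differences. A minimal sketch, assuming NumPy; the isotropic sum is restricted to indices where both differences exist, one common boundary convention:

```python
import numpy as np

def tv_aniso(x):
    """Anisotropic 2-D TV: sum of |x[i,j]-x[i+1,j]| + |x[i,j]-x[i,j+1]|."""
    return np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()

def tv_iso(x):
    """Isotropic 2-D TV: sum of sqrt((x[i,j]-x[i+1,j])^2 + (x[i,j]-x[i,j+1])^2)."""
    dx = np.diff(x, axis=0)[:, :-1]   # vertical differences
    dy = np.diff(x, axis=1)[:-1, :]   # horizontal differences
    return np.sqrt(dx**2 + dy**2).sum()

img = np.array([[0., 0., 1.],
                [0., 1., 1.],
                [1., 1., 1.]])
print(tv_aniso(img))   # -> 4.0
```

Both variants are convex but nonsmooth, so they fit the g-part of model (M) rather than the smooth f-part.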
Building Gradient-Based Schemes: Basic Old Idea
Pick an adequate approximate model.

1. Linearize + regularize: given some y, approximate f(x) + g(x) via

$$q(x, y) = f(y) + \langle x - y, \nabla f(y)\rangle + \frac{1}{2t}\|x - y\|^2 + g(x), \quad (t > 0)$$

that is, leaving the nonsmooth part g(·) untouched.

2. Linearize only + use info on C: e.g., C compact, g := δ_C:

$$q(x, y) = f(y) + \langle x - y, \nabla f(y)\rangle$$

Solve, "somehow", the resulting approximate model:

$$x^{k+1} = \operatorname*{argmin}_x\, q(x, x^k), \quad k = 0, 1, \ldots$$

Similarly for subgradient schemes, e.g., using γ(y) ∈ ∂g(y).
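With g ≡ 0, the minimizer of q(·, y) has a closed form: setting ∇_x q = ∇f(y) + (x − y)/t = 0 gives x = y − t∇f(y), the classical gradient step. A minimal numerical sketch of iterating the approximate model, assuming NumPy and an illustrative quadratic:

```python
import numpy as np

# Quadratic test function: f(x) = 0.5 x^T H x - c^T x, grad f(x) = H x - c
H = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, 1.0])
grad = lambda x: H @ x - c

t = 1.0 / np.linalg.eigvalsh(H).max()  # stepsize t = 1/L(f)

# Iterating x^{k+1} = argmin_x q(x, x^k) = x^k - t * grad(x^k)  (the g = 0 case)
x = np.zeros(2)
for _ in range(200):
    x = x - t * grad(x)

x_star = np.linalg.solve(H, c)         # exact minimizer: H x = c
print(np.allclose(x, x_star))          # True
```

The composite case, where g is kept untouched, is worked out on the next slide.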
Example: The Proximal Gradient Method

The derivation of the proximal gradient method is similar to that of the (sub)gradient projection method.

For any L ≥ L(f), and a given iterate x^k:

$$Q_L(x, x^k) := f(x^k) + \langle x - x^k, \nabla f(x^k)\rangle + \frac{L}{2}\|x - x^k\|^2 + \underbrace{g(x)}_{\text{untouched}}$$

Algorithm:

$$x^{k+1} := \operatorname*{argmin}_x\, Q_L(x, x^k) = \operatorname*{argmin}_x \left\{ g(x) + \frac{L}{2}\left\| x - \left(x^k - \tfrac{1}{L}\nabla f(x^k)\right) \right\|^2 \right\} = \mathrm{prox}_{\frac{1}{L}g}\left(x^k - \frac{1}{L}\nabla f(x^k)\right),$$

where the prox operator is

$$\mathrm{prox}_h(z) := \operatorname*{argmin}_u \left\{ h(u) + \frac{1}{2}\|u - z\|^2 \right\}.$$
The Proximal Gradient Method for (M)
The proximal gradient method with a constant stepsize rule.

Proximal Gradient Method (with Constant Stepsize)
Input: L = L(f), a Lipschitz constant of ∇f.
Step 0. Take x^0 ∈ E.
Step k. (k ≥ 1) Compute the prox of g at a gradient step for f:

$$x^k = \mathrm{prox}_{\frac{1}{L}g}\left(x^{k-1} - \frac{1}{L}\nabla f(x^{k-1})\right)$$

The Lipschitz constant L(f) is not always known or easily computable; this issue is resolved with an easy backtracking procedure.

1. Can we compute prox_{g/L}(·) efficiently?
2. What is the global rate of convergence of PGM?
First, we illustrate some special cases of ProxGrad.
Special Cases of Proximal Gradient
The prox-grad method: $x^{k+1} = \mathrm{prox}_{\frac{1}{L}g}\left(x^k - \frac{1}{L}\nabla f(x^k)\right)$.

g ≡ 0 ⇒ the gradient method:
$$x^{k+1} = x^k - \frac{1}{L}\nabla f(x^k)$$

g = δ(·|C) ⇒ the gradient projection method:
$$x^{k+1} = P_C\left(x^k - \frac{1}{L}\nabla f(x^k)\right)$$

f = 0 ⇒ the proximal minimization method:
$$x^{k+1} = \mathrm{prox}_{\frac{1}{L}g}(x^k)$$
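Beyond these special cases, taking f(x) = ½‖Ax − b‖² and g = λ‖·‖₁ turns the prox step into the shrinkage operator, giving the well-known ISTA iteration. A minimal sketch, assuming NumPy; the data is random and purely illustrative:

```python
import numpy as np

def prox_grad_l1(A, b, lam, n_iter=500):
    """ISTA: proximal gradient for min_x 0.5*||A x - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of grad f: ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - (1.0 / L) * (A.T @ (A @ x - b))                # gradient step on f
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)*||.||_1
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 50))
x_true = np.zeros(50)
x_true[:3] = [2.0, -1.5, 1.0]                  # sparse ground truth (illustrative)
b = A @ x_true

x_hat = prox_grad_l1(A, b, lam=0.1)
obj = lambda v: 0.5 * np.linalg.norm(A @ v - b) ** 2 + 0.1 * np.abs(v).sum()
print(obj(x_hat) < obj(np.zeros(50)))          # True: descent from the zero start
```

Each iteration costs two matrix-vector products plus a componentwise shrinkage, which is what makes the scheme "cheap".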
Some Calculus Rules for Computing prox_{tg}

$$\mathrm{prox}_{tg}(x) = \operatorname*{argmin}_u \left\{ g(u) + \frac{1}{2t}\|u - x\|^2 \right\}.$$

- g(u) = δ_C(u):  prox = Π_C(x), the projection onto C.
- g(u) = σ_C(u), the support function:  prox = x − Π_{tC}(x).
- g(u) = d_C(u), the distance to C:  prox = x + t(Π_C(x) − x)/d_C(x) if d_C(x) > t; Π_C(x) otherwise.
- g(u) = ‖Au − b‖²/2, A ∈ R^{m×n}:  prox = (I + tAᵀA)⁻¹(x + tAᵀb).
- g(u) = ‖u‖₁ (shrinkage):  prox_j = sgn(x_j) max{|x_j| − t, 0}.
- g(u) = ‖u‖ (Euclidean norm):  prox = (1 − t/‖x‖)x if ‖x‖ > t; 0 otherwise.
- g(U) = ‖U‖_* (nuclear norm), U ∈ R^{m×n} (m ≥ n), where σ₁(U) ≥ σ₂(U) ≥ … are the singular values of U and ‖U‖_* = Σ_j σ_j(U): with the singular value decomposition U = P diag(σ)Qᵀ, prox = P diag(s)Qᵀ with shrinkage s_j = max{σ_j − t, 0}.
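Two rows of this table are easy to check numerically: the ℓ₁ shrinkage and the indicator/projection row. A minimal sketch, assuming NumPy and taking the nonnegative orthant as C:

```python
import numpy as np

def prox_l1(x, t):
    """prox of t*||.||_1: componentwise soft thresholding (shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_indicator_nonneg(x, t):
    """prox of the indicator of C = {u : u >= 0} is the projection onto C
    (the parameter t plays no role for an indicator)."""
    return np.maximum(x, 0.0)

x = np.array([1.5, -0.2, 0.7, -3.0])
print(prox_l1(x, 1.0))                # entries shrink toward 0 by t; small ones vanish
print(prox_indicator_nonneg(x, 1.0))  # negative entries are projected to 0
```

Both operators cost O(n), which is what makes prox-grad practical for these regularizers.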
Rate of Convergence of Prox-Grad for Convex (M)

Theorem [Rate of Convergence of Prox-Grad]. Let {x_k} be the sequence generated by the proximal gradient method. Then for any optimal solution x*,

  F(x_k) − F(x*) ≤ L||x_0 − x*||² / (2k).

Thus, to solve (M), the proximal gradient method converges at a sublinear rate in function values: the number of iterations needed to reach F(x_k) − F(x*) ≤ ε is O(1/ε).

There is no need to know the Lipschitz constant (backtracking can be used).
Toward a Faster Scheme

For Prox-Grad and gradient methods: a rate of O(1/k)... rather slow.
For subgradient methods: a rate of O(1/√k)... even worse.

Can we find a faster method? The answer is YES!

Idea: start from an old algorithm of Nesterov (1983), designed for minimizing a smooth convex function and proven to be an "optimal" first-order method (Yudin-Nemirovsky (80)). But here our problem (M) is nonsmooth. Yet, we can derive an algorithm faster than Prox-Grad for the general problem.
A Fast Prox-Grad Algorithm [Beck-Teboulle (2009)]

An equally simple algorithm as Prox-Grad, here with constant step size.

Input: L = L(f), a Lipschitz constant of ∇f.
Step 0. Take y_1 = x_0 ∈ E, t_1 = 1.
Step k (k ≥ 1). Compute

  x_k = argmin_{x∈E} { g(x) + (L/2)||x − (y_k − (1/L)∇f(y_k))||² }
      ≡ prox_{g/L}( y_k − (1/L)∇f(y_k) ),   the same main computation as Prox-Grad,

  (•)  t_{k+1} = (1 + √(1 + 4t_k²)) / 2,

  (••) y_{k+1} = x_k + ((t_k − 1)/t_{k+1}) (x_k − x_{k−1}).

The additional computation in (•) and (••) is clearly marginal.
Knowledge of L(f) is not necessary; a backtracking line search can be used.
With g = 0 this is the smooth "optimal" gradient method of Nesterov (83); with t_k ≡ 1 we recover PGM.
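The scheme condenses to a few lines. A sketch for an assumed instance f(x) = ½||Ax − b||², g(x) = λ||x||_1 (so prox_{g/L} is soft-thresholding at level λ/L):

```python
import numpy as np

def fista(A, b, lam, n_iter=200):
    """Fast prox-grad (FISTA) sketch for min 0.5||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2                # Lipschitz constant of grad f
    x_prev = np.zeros(A.shape[1])
    y, t = x_prev.copy(), 1.0
    for _ in range(n_iter):
        z = y - A.T @ (A @ y - b) / L            # gradient step at y_k
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox_{g/L}
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0      # step (*)
        y = x + ((t - 1.0) / t_next) * (x - x_prev)            # step (**)
        x_prev, t = x, t_next
    return x_prev
```

Removing the momentum (taking t_k ≡ 1, so y_{k+1} = x_k) reduces this to plain Prox-Grad.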
An O(1/k²) Global Rate of Convergence for (M)

Theorem. Let {x_k} be generated by FPG. Then for any k ≥ 1,

  F(x_k) − F(x*) ≤ 2L(f)||x_0 − x*||² / (k + 1)².

The number of iterations needed to reach F(x̃) − F* ≤ ε is ~ O(1/√ε).

This improves Prox-Grad by a square-root factor, and on the practical side the theoretical rate is achieved.

Many computational studies have confirmed the efficiency of FPG for solving various optimization models in applications, in signal/image recovery and in machine learning: e.g., image denoising/deblurring, nuclear-norm regularization, matrix completion, multi-task learning, matrix classification, etc.

Some examples....
Back to Application: Linear Inverse Problems

Problem: find x ∈ C ⊂ E which "best" solves A(x) ≈ b, A : E → F, where b (the observable output) and A (the blurring operator) are known.

Approach: via regularization models, where
• g(x) is a "regularizer" (one function, or a sum of functions);
• d(b, A(x)) is some "proximity" measure from b to A(x):

  min { g(x) : A(x) = b, x ∈ C }
  min { g(x) : d(b, A(x)) ≤ ε, x ∈ C }
  min { d(b, A(x)) : g(x) ≤ δ, x ∈ C }
  min { d(b, A(x)) + λ g(x) : x ∈ C }   (λ > 0)

Nonsmooth regularizers are particularly useful, the l1 norm being a "starring" one, as it promotes sparsity. The squared l2 proximity ||b − A(x)||² will be the smooth part.
An Example: l1 Regularization

  min_x { ||Ax − b||² + λ||x||_1 } ≡ min_x { f(x) + g(x) }.

The proximal map of g(x) = λ||x||_1 is simply

  prox_{tg}(y) = argmin_u { (1/2t)||u − y||² + λ||u||_1 } = T_{λt}(y),

where T_α : R^n → R^n is the shrinkage or soft-threshold operator:

  T_α(x)_i = (|x_i| − α)_+ sgn(x_i).     (1)

Prox-Grad method = ISTA, the Iterative Shrinkage/Thresholding Algorithm.
Other names in the signal processing literature include, for example: thresholded Landweber method, iterative denoising, deconvolution algorithms... [Chambolle (98); Figueiredo-Nowak (03, 05); Daubechies et al. (04); Elad et al. (06); Hale et al. (07)...]
Fast Prox-Grad = FISTA.
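Written out, ISTA for this model is just a repeated gradient step followed by T. A minimal sketch (note that here f(x) = ||Ax − b||² without the ½, so ∇f(x) = 2Aᵀ(Ax − b) and L(f) = 2λ_max(AᵀA)):

```python
import numpy as np

def ista(A, b, lam, n_iter=500):
    """ISTA sketch for min ||Ax - b||^2 + lam*||x||_1."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2          # L(f) for f = ||Ax - b||^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - 2.0 * A.T @ (A @ x - b) / L      # gradient step, stepsize 1/L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # T_{lam/L}
    return x
```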
A Numerical Example: l1 Image Deblurring

  min_x { ||Ax − b||² + λ||x||_1 }

Comparing ISTA versus FISTA on problems where:
• the dimension d is, e.g., d = 256 × 256 = 65,536 and/or 512 × 512 = 262,144;
• the d × d matrix A is dense (Gaussian blurring times the inverse of a two-stage Haar wavelet transform);
• all problems are solved with fixed λ and Gaussian noise.
Example: l1 Deblurring of the Cameraman

[Figure: two panels: original; blurred and noisy.]
1000 Iterations of ISTA versus 200 of FISTA

[Figure: two panels: ISTA, 1000 iterations; FISTA, 200 iterations.]
Original Versus Deblurring via FISTA

[Figure: two panels: original; FISTA, 1000 iterations.]
Function Value Errors F(x_k) − F(x*)

[Figure: semilog plot of F(x_k) − F(x*) versus iteration (0 to 10,000), vertical scale 10^{-8} to 10^1, for ISTA and FISTA.]
Example 2: l1 versus TV Regularization

Main difference between l1 and TV regularization:
• prox of l1: simple and explicit (shrinkage/soft threshold).
• prox of TV: the TV-denoising subproblem requires an iterative method. With g = TV,

  x_{k+1} = D( x_k − (2/L)Aᵀ(Ax_k − b), 2λ/L ),

  where D(w, t) = argmin_x { ||x − w||² + 2t TV(x) }.

Here the prox operation is a TV-based denoising problem with no analytic expression. It can still be solved very efficiently via a smooth dual formulation attacked by a fast gradient method [Beck-Teboulle, 2009], which can be seen as an acceleration of [Chambolle 04, 05].

Important to note:
• The fast proximal gradient method is implementable if the prox operation can be computed efficiently.
• It is not a method for solving general nonsmooth convex problems. More on this soon...
MFISTA: Monotone FISTA

FISTA is not a monotone method. This is problematic when the prox is not computed exactly.

Input: L ≥ L(f), an upper bound on the Lipschitz constant of ∇f.
Step 0. Take y_1 = x_0 ∈ E, t_1 = 1.
Step k (k ≥ 1). Compute

  z_k = prox_{g/L}( y_k − (1/L)∇f(y_k) ),
  t_{k+1} = (1 + √(1 + 4t_k²)) / 2,
  x_k = argmin{ F(x) : x ∈ {z_k, x_{k−1}} },
  y_{k+1} = x_k + (t_k/t_{k+1})(z_k − x_k) + ((t_k − 1)/t_{k+1})(x_k − x_{k−1}).

With the same rate of convergence as FPG!
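A sketch of MFISTA for an assumed l1 instance (f(x) = ½||Ax − b||², g = λ||·||_1); the only changes from FISTA are the best-of-two choice of x_k and the extra momentum term:

```python
import numpy as np

def mfista(A, b, lam, n_iter=200):
    """Monotone FISTA sketch for F(x) = 0.5||Ax - b||^2 + lam*||x||_1."""
    F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])                     # x holds x_{k-1}
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        w = y - A.T @ (A @ y - b) / L
        z = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # z_k = prox step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        x_new = z if F(z) <= F(x) else x                       # monotone choice of x_k
        y = x_new + (t / t_next) * (z - x_new) \
            + ((t - 1.0) / t_next) * (x_new - x)               # momentum update
        x, t = x_new, t_next
    return x
```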
Lena and 3 Reconstructions – N = 100 Iterations

[Figure: panels: blurred and noisy; ISTA (F_100 = 0.606); MFISTA (F_100 = 0.466).]
Extension: Gradient Schemes with Non-Euclidean Distances

• All previous schemes were based on the squared Euclidean distance for measuring the proximity of two points in E.
• It is useful to exploit the geometry of the constraint set X.
• This is done by selecting a "distance-like" function.

Typical example: Bregman-type distances, based on a strongly convex kernel ψ:

  D_ψ(x, y) = ψ(x) − ψ(y) − ⟨x − y, ∇ψ(y)⟩.

The advantage of a non-Euclidean distance that adequately exploits the constraints is that it allows one to:
1. simplify the prox computation for the given constraint set;
2. often improve the constant in the complexity bound.

Mirror descent algorithms, extragradient-like methods, Lagrangians, smoothing... Nemirovsky-Yudin (80), Teboulle (92), Beck-Teboulle (03), Nemirovsky (04), Nesterov (05), Auslender-Teboulle (05)...
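As a concrete example of the kernel idea (a standard fact, not specific to this talk): with ψ(x) = Σ_j x_j log x_j on the unit simplex, the Bregman prox of a linear term has a closed multiplicative form, which is exactly a mirror-descent step:

```python
import numpy as np

def entropic_prox_step(x, grad, t):
    """argmin_u { <grad, u> + (1/t) * D_psi(u, x) } over the unit simplex,
    with psi(u) = sum_j u_j log u_j: u_j is proportional to x_j * exp(-t * grad_j)."""
    w = x * np.exp(-t * grad)
    return w / w.sum()

# Toy use (assumed): minimize <c, x> over the simplex by repeated steps.
c = np.array([1.0, 0.3, 2.0])
x = np.ones(3) / 3
for _ in range(100):
    x = entropic_prox_step(x, c, t=1.0)
# mass concentrates on the cheapest coordinate
```

No projection subroutine is needed: the multiplicative update stays on the simplex automatically, which is the point of matching the kernel to the constraint set.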
Convex Nonsmooth Composite: Lagrangian-Based Methods
Nonsmooth Convex with Separable Objective

  (P)  p* = inf{ φ(x) ≡ f(x) + g(Ax) : x ∈ R^n },

where f, g are both nonsmooth and A : R^n → R^m is a given linear map.

Problem (P) is equivalent, via the standard variable-splitting trick, to

  (P)  p* = inf{ f(x) + g(z) : Ax = z, x ∈ R^n, z ∈ R^m }.

Rockafellar ('76) has shown that the Proximal Point Algorithm, applied to the dual and primal-dual formulations of (P), produces:
• the Method of Multipliers (augmented Lagrangian method);
• the Proximal Method of Multipliers (PMM).

Largely ignored over the last 20 years... Recent strong revival in sparse optimization: compressive sensing, image processing, etc. A very nice recent survey with many machine learning applications: [Boyd et al. 2011].
The PMM - Rockafellar (76)

PMM (the Proximal Method of Multipliers): generate (x_k, z_k) and a dual multiplier y_k via

  (x_{k+1}, z_{k+1}) ∈ argmin_{x,z} { f(x) + g(z) + ⟨y_k, Ax − z⟩ + (c/2)||Ax − z||² + q_k(x, z) },
  y_{k+1} = y_k + c(Ax_{k+1} − z_{k+1}).

The augmented Lagrangian (c > 0):

  L_c(x, z, y) := f(x) + g(z) + ⟨y, Ax − z⟩ + (c/2)||Ax − z||²,

where the first three terms form the usual Lagrangian, and

  q_k(x, z) := ½( ||x − x_k||²_{M1} + ||z − z_k||²_{M2} )

is the additional primal proximal term. The choice of M1 ∈ S^n_+ and M2 ∈ S^m_+ is used to conveniently describe/analyze several variants of the PMM. Taking M1 = M2 ≡ 0 recovers the Method of Multipliers (PPA on the dual).
Proximal Method of Multipliers: Key Difficulty

Main computational step in PMM: minimize, with respect to (x, z), the proximal augmented Lagrangian

  f(x) + g(z) + ⟨y_k, Ax − z⟩ + (c/2)||Ax − z||² + q_k(x, z).

The quadratic coupling term ||Ax − z||² destroys the separability between x and z, preventing separate minimization in (x, z). In many applications, separate minimization is often much easier.

Strategies to overcome this difficulty:
• Approximate minimization: linearize the quadratic term ||Ax − z||² with respect to (x, z).
• Alternating minimization: à la "Gauss-Seidel" in (x, z).
• A mixture of the above: partial linearization with respect to one variable, combined with alternating minimization over the other.

These result in various interesting schemes; e.g., the last one can be shown to recover the efficient primal-dual method of [Chambolle-Pock, 2010].
A Prototype: Alternating Direction Proximal Method of Multipliers

Eliminate the (x, z) coupling via alternating minimization steps.

Glowinski-Marocco (75), Gabay-Mercier (76), Fortin-Glowinski (83), Eckstein-Bertsekas (91)...: the so-called Alternating Direction Method of Multipliers (ADM), based on the Method of Multipliers, i.e., M1 = M2 ≡ 0.

(AD-PMM) Alternating Direction Proximal Method of Multipliers
1. Start with any (x_0, z_0, y_0) ∈ R^n × R^m × R^m and c > 0.
2. For k = 0, 1, ..., generate the sequence {x_k, z_k, y_k} as follows:

  x_{k+1} ∈ argmin_x { f(x) + (c/2)||Ax − z_k + c^{−1}y_k||² + ½||x − x_k||²_{M1} },
  z_{k+1} = argmin_z { g(z) + (c/2)||Ax_{k+1} − z + c^{−1}y_k||² + ½||z − z_k||²_{M2} },
  y_{k+1} = y_k + c(Ax_{k+1} − z_{k+1}).
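With M1 = M2 ≡ 0 and the assumed instance f(x) = ½||Bx − b||², g = λ||·||_1, A = I (a lasso-type splitting, chosen for illustration), the three steps become a linear solve, a shrinkage, and a multiplier update:

```python
import numpy as np

def adm_lasso(B, b, lam, c=1.0, n_iter=300):
    """ADM sketch (AD-PMM with M1 = M2 = 0) for
    min 0.5||Bx - b||^2 + lam*||z||_1  s.t.  x = z."""
    n = B.shape[1]
    x = z = y = np.zeros(n)
    M = np.linalg.inv(B.T @ B + c * np.eye(n))   # factor once for the x-step
    Btb = B.T @ b
    for _ in range(n_iter):
        x = M @ (Btb + c * z - y)                # x-minimization (linear solve)
        w = x + y / c
        z = np.sign(w) * np.maximum(np.abs(w) - lam / c, 0.0)  # z-min: shrinkage
        y = y + c * (x - z)                      # multiplier update
    return z
```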
Global Rate of Convergence Results

For AD-PMM and many other variants [Shefi-Teboulle (2013)].

Let (x*, z*, y*) be a saddle point for the Lagrangian l associated with (P). Then for all N ≥ 1:

  l(x̄_N, z̄_N, y) − p* ≤ C²(x*, z*, y)/N,   for all y ∈ R^m.

In particular:
• f(x̄_N) + g(z̄_N) + r||Ax̄_N − z̄_N|| − p* ≤ c₁/N;
• residual norm: ||Ax̄_N − z̄_N|| ≤ c₂/√N;
• original primal objective (when g is Lipschitz continuous): φ(x̄_N) − φ(x*) ≤ c₃/N.

The c_i are positive constants (as usual, in terms of distances to an optimal solution). For any sequence {x_k, z_k, y_k} and any N ≥ 1, the ergodic sequences are

  x̄_N := (1/N) Σ_{k=0}^{N−1} x_{k+1},   z̄_N := (1/N) Σ_{k=0}^{N−1} z_{k+1},   ȳ_N := (1/N) Σ_{k=0}^{N−1} y_{k+1}.
Non-Convex Smooth Models
Sparse PCA

Principal Component Analysis solves

  max{ xᵀAx : ||x||_2 = 1, x ∈ R^n }   (A ⪰ 0),

while Sparse Principal Component Analysis solves

  max{ xᵀAx : ||x||_2 = 1, ||x||_0 ≤ k, x ∈ R^n },   k ∈ (1, n] a sparsity level,

where ||x||_0 counts the number of nonzero entries of x.

Issues:
1. Maximizing a convex objective.
2. The hard nonconvex constraint ||x||_0 ≤ k.

Possible approaches:
1. SDP convex relaxations [D'Aspremont et al. 2008].
2. Approximation/modified formulations: many proposed approaches.
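While the feasible set {x : ||x||_2 = 1, ||x||_0 ≤ k} is nonconvex, projecting onto it is easy: keep the k largest-magnitude entries and renormalize. A small sketch of this building block (only the projection, not a full sparse-PCA method):

```python
import numpy as np

def project_sparse_sphere(x, k):
    """Euclidean projection onto {u : ||u||_2 = 1, ||u||_0 <= k}:
    zero out all but the k largest-magnitude entries, then renormalize.
    Assumes the retained entries are not all zero."""
    u = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]      # indices of the k largest |x_j|
    u[idx] = x[idx]
    return u / np.linalg.norm(u)
```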
Sparse PCA via Penalization/Relaxation/Approximation

The problem of interest is the difficult sparse PCA problem as is:

  max{ xᵀAx : ||x||_2 = 1, ||x||_0 ≤ k, x ∈ R^n }.

The literature has focused on solving various modifications:
• l0-penalized PCA: max{ xᵀAx − s||x||_0 : ||x||_2 = 1 }, s > 0;
• relaxed l1-constrained PCA: max{ xᵀAx : ||x||_2 = 1, ||x||_1 ≤ √k };
• relaxed l1-penalized PCA: max{ xᵀAx − s||x||_1 : ||x||_2 = 1 };
• approximation-penalized: max{ xᵀAx − s g_p(|x|) : ||x||_2 = 1 }, with g_p(x) ≈ ||x||_0;
• SDP convex relaxations: max{ tr(AX) : tr(X) = 1, X ⪰ 0, ||X||_1 ≤ k }.

SDP relaxations are often too computationally expensive for large problems, and no algorithm gives bounds on the optimal value of the original problem. Even when "simple", these algorithms solve modifications:
• they do not solve the original problem of interest;
• they require an unknown penalty parameter s to be tuned.
Quick Highlight of Simple Algorithms for "Modified Problems"
Each scheme is a threshold-and-normalize iteration (Type / Iteration / Per-iteration complexity / References):

- l1-constrained: x^{j+1} is a soft-threshold of a matrix–vector product, normalized, with threshold λ_j chosen so that the l1 constraint holds. O(n²), O(mn). Witten et al. (2009)
- l1-constrained: x^{j+1}_i = sgn((Axʲ)_i)(|(Axʲ)_i| − s_j)₊ / √(Σ_h (|(Axʲ)_h| − s_j)₊²), where s_j is the (k+1)-largest entry of the vector |Axʲ|. O(n²), O(mn). Sigg–Buhmann (2008)
- l0-penalized: z^{j+1} = Σ_i [sgn((b_iᵀzʲ)² − s)]₊ (b_iᵀzʲ) b_i / ‖Σ_i [sgn((b_iᵀzʲ)² − s)]₊ (b_iᵀzʲ) b_i‖₂. O(mn). Shen–Huang (2008), Journée et al. (2010)
- l0-penalized: x^{j+1}_i = sgn(2(Axʲ)_i)(|2(Axʲ)_i| − s·φ′ₚ(|xʲ_i|))₊ / √(Σ_h (|2(Axʲ)_h| − s·φ′ₚ(|xʲ_h|))₊²). O(n²). Sriperumbudur et al. (2010)
- l1-penalized: y^{j+1} = argmin_y {Σ_i ‖b_i − xʲyᵀb_i‖₂² + λ‖y‖₂² + s‖y‖₁}, then x^{j+1} = (Σ_i b_i b_iᵀ)y^{j+1} / ‖(Σ_i b_i b_iᵀ)y^{j+1}‖₂. Zou et al. (2006)
- l1-penalized: z^{j+1} = Σ_i (|b_iᵀzʲ| − s)₊ sgn(b_iᵀzʲ) b_i / ‖Σ_i (|b_iᵀzʲ| − s)₊ sgn(b_iᵀzʲ) b_i‖₂. O(mn). Shen–Huang (2008), Journée et al. (2010)
A Plethora of Models/Algorithms Revisited [Luss-Teboulle (2013)]

All the previously listed algorithms have been derived from various disparate approaches and motivations for solving modifications of SPCA: expectation maximization; majorization–minimization techniques; DC programming; alternating minimization; etc.
1. Are all these algorithms different? Is there any connection?
2. Is it possible to tackle the difficult sparse PCA problem "as is"?

Very recently we have shown that:

All the previously listed algorithms are particular realizations of a "father algorithm": ConGradU (based on the well-known conditional gradient algorithm), and ConGradU CAN be applied directly to the original problem!
Maximizing a Convex Function over a Compact Nonconvex Set

Classic conditional gradient algorithm [Frank-Wolfe '56, Polyak '63, Dunn '79, ...] solves max{F(x) : x ∈ C}, with F ∈ C¹ and C convex compact:

x⁰ ∈ C, pʲ = argmax{⟨x − xʲ, ∇F(xʲ)⟩ : x ∈ C},
x^{j+1} = xʲ + α_j(pʲ − xʲ), with α_j ∈ (0, 1] a stepsize.
Here: F is convex, possibly nonsmooth; C is compact but nonconvex.

The idea goes back to Mangasarian (1996), developed for C a polyhedral set.

ConGradU – Conditional Gradient with Unit Step Size

x⁰ ∈ C, x^{j+1} ∈ argmax{⟨x − xʲ, F′(xʲ)⟩ : x ∈ C}

Notes:
1. F is not assumed to be differentiable; F′(x) is a subgradient of F at x.
2. Useful when max{⟨x − xʲ, F′(xʲ)⟩ : x ∈ C} is easy to solve.
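The scheme above can be sketched in a few lines of numpy. The test problem below (maximizing the convex function x ↦ xᵀAx/2 over the unit sphere, where the linear oracle is g ↦ g/‖g‖₂ and ConGradU reduces to the classical power method) is an illustrative assumption, not part of the slides:

```python
import numpy as np

def congrad_u(linear_oracle, subgrad, x0, iters=200):
    """ConGradU sketch: x_{j+1} in argmax{<x - x_j, F'(x_j)> : x in C}.

    The constant shift -<x_j, F'(x_j)> does not affect the argmax, so
    linear_oracle(g) only needs to maximize <x, g> over C.
    """
    x = x0
    for _ in range(iters):
        x = linear_oracle(subgrad(x))
    return x

# Hypothetical test problem: F(x) = x^T A x / 2 (convex) over the unit
# sphere C = {x : ||x||_2 = 1}.  The oracle is g -> g/||g||_2, so
# ConGradU reduces to the power method for the leading eigenvector.
A = np.diag([5.0, 2.0, 1.0, 0.5, 0.1])
x = congrad_u(linear_oracle=lambda g: g / np.linalg.norm(g),
              subgrad=lambda v: A @ v,
              x0=np.ones(5) / np.sqrt(5.0))
```

On this toy matrix the iterates converge to the leading eigenvector e₁, illustrating why the method is attractive whenever the linear oracle over C is cheap.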
Solving the Original l0-constrained PCA via ConGradU

Applying ConGradU directly to max{xᵀAx : ‖x‖₂ = 1, ‖x‖₀ ≤ k, x ∈ ℝⁿ} results in

x^{j+1} = argmax{(xʲ)ᵀAx : ‖x‖₂ = 1, ‖x‖₀ ≤ k} = T_k(Axʲ) / ‖T_k(Axʲ)‖₂,
T_k(a) := argmin_x {‖x − a‖₂² : ‖x‖₀ ≤ k}.

Despite the hard constraint, T_k is easy to compute: (T_k(a))_i = a_i for the k largest entries (in absolute value) of a, and (T_k(a))_i = 0 otherwise.

Convergence: every limit point of {xʲ} is a stationary point.
Complexity: O(kn) or O(mn).

Thus, the original problem can be solved using ConGradU with the same complexity as when it is applied to the modifications! Penalized/modified problems require tuning an unknown tradeoff penalty parameter, which can be very computationally expensive and is not needed here.
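The truncation operator T_k and the resulting iteration have a direct numpy transcription. This is a sketch: the tie-breaking rule inside T_k and the diagonal toy matrix are illustrative assumptions, not from the slides:

```python
import numpy as np

def T_k(a, k):
    """T_k(a): keep the k largest-magnitude entries of a, zero out the rest."""
    out = np.zeros_like(a)
    keep = np.argsort(np.abs(a))[-k:]   # indices of the k largest |a_i|
    out[keep] = a[keep]
    return out

def sparse_pca_congradu(A, k, x0, iters=100):
    """ConGradU applied to the original problem
    max{x^T A x : ||x||_2 = 1, ||x||_0 <= k}:
    x_{j+1} = T_k(A x_j) / ||T_k(A x_j)||_2."""
    x = x0
    for _ in range(iters):
        g = T_k(A @ x, k)
        x = g / np.linalg.norm(g)
    return x

# Toy example (an assumption): for a diagonal A, the k-sparse maximizer
# is supported on the largest diagonal entries.
A = np.diag([5.0, 2.0, 1.0, 0.5, 0.1])
x = sparse_pca_congradu(A, k=2, x0=np.ones(5) / np.sqrt(5.0))
```

Every iterate is feasible for the original problem (unit norm and at most k nonzeros), which is exactly what the modified formulations cannot guarantee.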
ConGradU for a General Class of Problems

(G) max_x {f(x) + g(|x|) : x ∈ C}

f : ℝⁿ → ℝ is convex, C ⊆ ℝⁿ is a compact set, and g : ℝⁿ₊ → ℝ is convex, differentiable, and monotonically decreasing.

Particularly useful for handling approximate l0-penalized problems.

ConGradU applied to (G) produces the following simple weighted l1-norm maximization problem:

x⁰ ∈ C, x^{j+1} = argmax{⟨aʲ, x⟩ − Σ_i wʲ_i |x_i| : x ∈ C}, j = 0, 1, ...,

where wʲ := −g′(|xʲ|) > 0 and aʲ := f′(xʲ) ∈ ℝⁿ.

For penalized/approximately penalized SPCA, C is the unit ball, and the above admits a closed-form solution:

x^{j+1} = S_{wʲ}(f′(xʲ)) / ‖S_{wʲ}(f′(xʲ))‖, j = 0, 1, ...; S_w(a) := (|a| − w)₊ sgn(a) (soft threshold).
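The closed-form step above is two lines of numpy. The numerical check at the end is an illustrative assumption:

```python
import numpy as np

def soft_threshold(a, w):
    """S_w(a) = (|a| - w)_+ * sgn(a), applied elementwise."""
    return np.maximum(np.abs(a) - w, 0.0) * np.sign(a)

def congradu_penalized_step(grad_f, w):
    """One ConGradU step for (approximately) penalized SPCA, where C is
    the unit ball: x_{j+1} = S_{w_j}(f'(x_j)) / ||S_{w_j}(f'(x_j))||_2."""
    s = soft_threshold(grad_f, w)
    return s / np.linalg.norm(s)

# Entries with |a_i| <= w_i are zeroed; the rest shrink toward 0.
a = np.array([3.0, -1.0, 0.5])
step = congradu_penalized_step(a, w=np.array([1.0, 1.0, 1.0]))
```

This single step is exactly the building block that the tabulated l1/l0-penalized algorithms repeat, which is how they all fall out of ConGradU.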
Non-Convex and NonSmooth
A Nonsmooth Nonconvex Optimization Model

(M) minimize_{x,y} Ψ(x, y) := f(x) + g(y) + H(x, y)

Assumption
(i) f : ℝⁿ → (−∞, +∞] and g : ℝᵐ → (−∞, +∞] are proper and lsc functions.
(ii) H : ℝⁿ × ℝᵐ → ℝ is a C¹ function.
(iii) The partial gradients of H are Lipschitz continuous: H(·, y) ∈ C^{1,1}_{L₁(y)} and H(x, ·) ∈ C^{1,1}_{L₂(x)}.

NO convexity is assumed in the objective or the constraints (the latter built in through the extended-valued f and g).
The choice of two blocks of variables is only for the sake of simplicity of exposition.
The optimization model (M) covers many applications: signal/image processing, machine learning, etc. Vast literature...
Goal

Derive a simple scheme.
Prove that the whole sequence {zᵏ}_{k∈ℕ} := {(xᵏ, yᵏ)}_{k∈ℕ} converges to a critical point of Ψ.
Exploit partial smoothness: blend alternating minimization with proximal-gradient steps.
Exploit further data information: build a general recipe and algorithmic framework to prove convergence for a broad class of nonconvex nonsmooth problems.
J. Bolte, S. Sabach, M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, Series A. Just published online.
The Algorithm: Proximal Alternating Linearized Minimization (PALM)

1. Initialization: start with any (x⁰, y⁰) ∈ ℝⁿ × ℝᵐ.
2. For each k = 0, 1, ... generate a sequence {(xᵏ, yᵏ)}_{k∈ℕ}:
2.1. Take γ₁ > 1, set c_k = γ₁L₁(yᵏ), and compute
x^{k+1} ∈ prox^f_{c_k}(xᵏ − (1/c_k)∇ₓH(xᵏ, yᵏ)).
2.2. Take γ₂ > 1, set d_k = γ₂L₂(x^{k+1}), and compute
y^{k+1} ∈ prox^g_{d_k}(yᵏ − (1/d_k)∇_yH(x^{k+1}, yᵏ)).

Main computational step: the prox of a "nonconvex" function.

An interesting example will be given later. (More in our paper...)
Convergence of PALM: If the Data [f, g, H] is Semi-Algebraic

Theorem (Bolte-Sabach-Teboulle (2013))
Let {zᵏ}_{k∈ℕ} be a sequence generated by PALM. The following assertions hold.
(i) The sequence {zᵏ}_{k∈ℕ} has finite length, that is, Σ_{k=1}^{∞} ‖z^{k+1} − zᵏ‖ < ∞.
(ii) The sequence {zᵏ}_{k∈ℕ} converges to a critical point z* = (x*, y*) of Ψ.

Are there many semi-algebraic functions?
How do we prove this convergence result?
An Informal General Convergence Proof Recipe

Let Ψ : ℝᴺ → (−∞, +∞] be a proper, lsc, and bounded-from-below function.

(P) inf{Ψ(z) : z ∈ ℝᴺ}.

Suppose A is a generic algorithm which generates a sequence {zᵏ}_{k∈ℕ} via:

z⁰ ∈ ℝᴺ, z^{k+1} ∈ A(zᵏ), k = 0, 1, ....

Goal: prove that the whole sequence {zᵏ}_{k∈ℕ} converges to a critical point of Ψ.
The Recipe

Basically, the "recipe" consists of three main steps.

(i) Sufficient decrease property: find a positive constant ρ₁ such that

ρ₁‖z^{k+1} − zᵏ‖² ≤ Ψ(zᵏ) − Ψ(z^{k+1}), ∀k = 0, 1, ....

(ii) A subgradient lower bound for the iterates gap: assume that {zᵏ}_{k∈ℕ} is bounded, and find another positive constant ρ₂ such that

‖wᵏ‖ ≤ ρ₂‖zᵏ − z^{k−1}‖, wᵏ ∈ ∂Ψ(zᵏ), ∀k = 0, 1, ....

These two steps are typical of any descent-type algorithm, but lead ONLY to convergence of limit points.

Does the whole sequence {zᵏ}_{k∈ℕ} converge to a critical point of problem (M)?
The Third Main Step of our Recipe

(iii) The Kurdyka-Łojasiewicz property: assume that Ψ is a KL function. Use this property to prove that the generated sequence {zᵏ}_{k∈ℕ} is a Cauchy sequence, and thus converges!

We stress that this general recipe singles out the 3 main ingredients at play in analyzing many other optimization algorithms in the nonconvex and nonsmooth setting. KL is the key to proving that the sequence generated by algorithm A is Cauchy... a rare event for optimization schemes!

The remaining questions:
What is a KL function? [Łojasiewicz (68), Kurdyka (98), Bolte et al. (06, 07, 10)]
Are there many KL functions?
Are there Many Functions Satisfying KL?

YES! Semi-algebraic functions.

Theorem (Bolte-Daniilidis-Lewis (2006))
Let σ : ℝᵈ → (−∞, +∞] be a proper and lsc function. If σ is semi-algebraic then it satisfies the KL property at any point of dom σ.

Recall: semi-algebraic sets and functions
(i) A semi-algebraic subset of ℝᵈ is a finite union of sets

{x ∈ ℝᵈ : p_i(x) = 0, q_j(x) < 0, i ∈ I, j ∈ J},

where p_i, q_j : ℝᵈ → ℝ are real polynomial functions and I, J are finite.
(ii) A function σ is semi-algebraic if its graph is a semi-algebraic set.
There is a Wealth of Semi-Algebraic Functions!

Operations preserving the semi-algebraic property:
- Finite sums and products of semi-algebraic functions.
- Composition of semi-algebraic functions.
- Sup/inf-type functions, e.g., sup{g(u, v) : v ∈ C} is semi-algebraic when g is a semi-algebraic function and C a semi-algebraic set.

Some semi-algebraic sets/functions "starring" in optimization:
- Real polynomial functions.
- Indicator functions of semi-algebraic sets.
- In matrix theory: the cone of PSD matrices, constant-rank matrices, Stiefel manifolds...
- The function x → dist(x, S)² is semi-algebraic whenever S is a nonempty semi-algebraic subset of ℝⁿ.
- ‖·‖₀ is semi-algebraic.
- ‖·‖_p is semi-algebraic whenever p > 0 is rational.
Application to a Broad Class of Matrix Factorization Problems

Given A ∈ ℝ^{m×n} and r ≪ min{m, n}, find X ∈ ℝ^{m×r} and Y ∈ ℝ^{r×n} such that

A ≈ XY, X ∈ K_{m,r} ∩ F, Y ∈ K_{r,n} ∩ G,

where
K_{p,q} = {M ∈ ℝ^{p×q} : M ≥ 0},
F = {X ∈ ℝ^{m×r} : R₁(X) ≤ α},
G = {Y ∈ ℝ^{r×n} : R₂(Y) ≤ β}.

Here R₁ and R₂ are lsc functions and α, β ∈ ℝ₊ are given parameters. R₁ (R₂) is often used to describe additional features of X (Y).

(MF) covers a very large number of problems in applications...
The Optimization Approach

We adopt the constrained nonconvex nonsmooth formulation

(MF) min{d(A, XY) : X ∈ K_{m,r} ∩ F, Y ∈ K_{r,n} ∩ G},

where d : ℝ^{m×n} × ℝ^{m×n} → ℝ₊ is a proximity function measuring the quality of the approximation, with d(U, V) = 0 if and only if U = V.

Another way, the penalized version: the "hard" constraints are candidates to be penalized, with µ₁, µ₂ > 0 penalty parameters:

(P-MF) min{µ₁R₁(X) + µ₂R₂(Y) + d(A, XY) : X ∈ K_{m,r}, Y ∈ K_{r,n}}.

Note: the penalty approach requires tuning the unknown penalty parameters, which can be a difficult issue.

Both formulations fit our general nonsmooth nonconvex model (M) with obvious identifications for H, f, g. We now illustrate on two important models with semi-algebraic data.
Marc Teboulle (Tel Aviv University) Gradient Based Optimization Algorithms
Model I – Nonnegative Matrix Factorization Problems

Let the proximity measure be defined via the Frobenius norm:

    d(A, XY) := H(X, Y) = (1/2) ‖A − XY‖²_F,   with   F ≡ R^{m×r}, G ≡ R^{r×n}.

The problem (MF) then reduces to the so-called Nonnegative Matrix Factorization (NMF) problem:

    min { (1/2) ‖A − XY‖²_F : X ≥ 0, Y ≥ 0 }.

H is a real polynomial function, hence semi-algebraic.
X → H(X, Y) (for fixed Y) and Y → H(X, Y) (for fixed X) are C^{1,1}, with
    L_1(Y) ≡ ‖YY^T‖_F,   L_2(X) ≡ ‖X^T X‖_F.
H is C² on bounded subsets.

Thus we can PALM it! The two computational steps reduce to projection onto the nonnegative cone of matrices, which is trivial:

    P_+(U) := argmin { ‖U − V‖²_F : V ∈ R^{m×n}, V ≥ 0 } = max{0, U}.
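The two computational steps above are easy to put in code. Below is a minimal numpy sketch (the function names are our own) of one alternating pass of projected-gradient updates for the NMF objective, using the blockwise Lipschitz constants L_1(Y) = ‖YY^T‖_F and L_2(X) = ‖X^T X‖_F to set the step sizes and P_+ as the projection; it illustrates the structure, not the authors' exact implementation.

```python
import numpy as np

def project_nonneg(U):
    """P_+(U) = max{0, U}: entrywise projection onto the nonnegative cone."""
    return np.maximum(U, 0.0)

def nmf_block_gradient_steps(A, X, Y):
    """One alternating pass of projected-gradient updates for
    min (1/2)||A - XY||_F^2 s.t. X >= 0, Y >= 0, with step sizes 1/L_1, 1/L_2
    given by the blockwise Lipschitz constants (assumes X, Y are nonzero)."""
    L1 = np.linalg.norm(Y @ Y.T, 'fro')      # L_1(Y) = ||Y Y^T||_F
    X = project_nonneg(X - (X @ Y - A) @ Y.T / L1)   # grad_X H = (XY - A) Y^T
    L2 = np.linalg.norm(X.T @ X, 'fro')      # L_2(X) = ||X^T X||_F
    Y = project_nonneg(Y - X.T @ (X @ Y - A) / L2)   # grad_Y H = X^T (XY - A)
    return X, Y
```

Since 1/L_1 and 1/L_2 are valid descent step sizes for each block, repeating this pass monotonically decreases the objective.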
Model II – Sparse Constraints in NMF Problems

Now consider in NMF the overall sparsity measure of a matrix, defined by

    R_1(X) = ‖X‖_0 := Σ_i ‖x_i‖_0   (x_i the i-th column of X);   R_2(Y) = ‖Y‖_0.

To apply PALM, all we need is to compute the prox of f := δ_{X ≥ 0} + δ_{‖X‖_0 ≤ s}. It turns out that this can be done simply!

Proposition (Proximal map formula for f). Let U ∈ R^{m×n}. Then

    prox_f(U) = argmin { (1/2) ‖X − U‖²_F : X ≥ 0, ‖X‖_0 ≤ s } = T_s(P_+(U)),

where

    T_s(U) := argmin_{V ∈ R^{m×n}} { ‖U − V‖²_F : ‖V‖_0 ≤ s }.

Computing T_s simply requires determining the s-th largest of the mn entries, which can be done in O(mn) time, and then zeroing out the proper entries in one more pass over the mn entries.
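The proposition reduces the prox to a projection followed by hard thresholding. A minimal numpy sketch (the function name is our own; `np.argpartition`'s introselect provides the O(mn) selection mentioned on the slide):

```python
import numpy as np

def prox_nonneg_sparse(U, s):
    """Prox of f = delta_{X >= 0} + delta_{||X||_0 <= s}, i.e. T_s(P_+(U)):
    project entrywise onto the nonnegative cone, then keep the s largest
    entries and zero out the rest (assumes 1 <= s <= U.size)."""
    P = np.maximum(U, 0.0)                    # P_+(U)
    flat = P.ravel()
    keep = np.argpartition(flat, -s)[-s:]     # indices of the s largest, O(mn)
    out = np.zeros_like(flat)
    out[keep] = flat[keep]                    # one more pass zeroes the rest
    return out.reshape(P.shape)
```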
PALM for Sparse NMF

1. Initialization: select random nonnegative X^0 ∈ R^{m×r} and Y^0 ∈ R^{r×n}.
2. For each k = 0, 1, ..., generate a sequence {(X^k, Y^k)}_{k∈N}:

2.1. Take γ_1 > 1, set c_k = γ_1 ‖Y^k (Y^k)^T‖_F and compute

    U^k = X^k − (1/c_k)(X^k Y^k − A)(Y^k)^T;   X^{k+1} ∈ prox^{R_1}_{c_k}(U^k) = T_α(P_+(U^k)).

2.2. Take γ_2 > 1, set d_k = γ_2 ‖(X^{k+1})^T X^{k+1}‖_F and compute

    V^k = Y^k − (1/d_k)(X^{k+1})^T (X^{k+1} Y^k − A);   Y^{k+1} ∈ prox^{R_2}_{d_k}(V^k) = T_β(P_+(V^k)).

Applying our main theorem, we get the global convergence result:

Let {(X^k, Y^k)}_{k∈N} be a sequence generated by PALM-Sparse NMF. If inf_{k∈N} { ‖X^k‖_F, ‖Y^k‖_F } > 0, then {(X^k, Y^k)}_{k∈N} converges to a critical point (X*, Y*) of the Sparse NMF problem.
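The two steps above can be sketched end to end. This is a hedged numpy sketch, not the authors' code: the parameters `s_x`, `s_y` stand in for the sparsity levels α and β, and the small floor on c_k, d_k is our own safeguard against a zero block, in the spirit of the theorem's assumption inf{‖X^k‖_F, ‖Y^k‖_F} > 0.

```python
import numpy as np

def hard_threshold_nonneg(U, s):
    """T_s(P_+(U)): project entrywise onto >= 0, then keep the s largest
    entries and zero out the rest (assumes 1 <= s <= U.size)."""
    P = np.maximum(U, 0.0)
    flat = P.ravel()
    keep = np.argpartition(flat, -s)[-s:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(P.shape)

def palm_sparse_nmf(A, r, s_x, s_y, iters=200, gamma1=1.1, gamma2=1.1, seed=0):
    """Sketch of PALM for Sparse NMF following steps 2.1-2.2 of the slide."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = rng.random((m, r))                   # random nonnegative X^0
    Y = rng.random((r, n))                   # random nonnegative Y^0
    for _ in range(iters):
        # Step 2.1: gradient step on X with c_k = gamma1 * ||Y Y^T||_F, then prox
        ck = gamma1 * max(np.linalg.norm(Y @ Y.T, 'fro'), 1e-10)
        U = X - (X @ Y - A) @ Y.T / ck
        X = hard_threshold_nonneg(U, s_x)
        # Step 2.2: gradient step on Y with d_k = gamma2 * ||X^T X||_F, then prox
        dk = gamma2 * max(np.linalg.norm(X.T @ X, 'fro'), 1e-10)
        V = Y - X.T @ (X @ Y - A) / dk
        Y = hard_threshold_nonneg(V, s_y)
    return X, Y
```

Every iterate stays nonnegative with at most `s_x` (resp. `s_y`) nonzeros, so each step costs only a matrix multiply plus an O(mn) selection.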
Concluding Remarks on FOM

- Motivates new model formulations and novel/refined algorithms which exploit special structures.
- Powerful for constructing cheap iterations.
- Efficient algorithms for many applied optimization models with structure.
- Further research is needed on simple and efficient schemes that can cope with the curse of dimensionality and with nonconvex/nonsmooth settings.
THANK YOU FOR YOUR ATTENTION!