Lecture 3
Optimization Problems and Iterative Algorithms
January 13, 2016
This material was jointly developed with Angelia Nedic at UIUC for IE 598ns
Uday V. Shanbhag Lecture 3
Outline
• Special Functions: Linear, Quadratic, Convex
• Criteria for Convexity of a Function
• Operations Preserving Convexity
• Unconstrained Optimization
• First-Order Necessary Optimality Conditions
• Constrained Optimization
• First-Order Necessary Optimality Conditions
• KKT Conditions
• Iterative Algorithms
Stochastic Optimization 1
Convex Function
f is convex when dom(f) is a convex set and there holds
f(αx+ (1− α)y) ≤ αf(x) + (1− α)f(y)
for all x, y ∈ dom(f) and α ∈ [0,1]
strictly convex if the inequality is strict for all x, y ∈ dom(f) with x ≠ y, and α ∈ (0,1)
Note that dom(f) is defined as
dom(f) := {x : f(x) < +∞} .
[Figure: a convex function f, with the chord through (x, f(x)) and (y, f(y)) lying above the graph]
f is concave when −f is convex
f is strictly concave when −f is strictly convex
Examples of Convex/Concave Functions
Examples on R
Convex:
• Affine: ax + b over R for any a, b ∈ R
• Exponential: e^{ax} over R for any a ∈ R
• Power: x^p over (0,+∞) for p ≥ 1 or p ≤ 0
• Powers of absolute value: |x|^p over R for p ≥ 1
• Negative entropy: x ln x over (0,+∞)
Concave:
• Affine: ax + b over R for any a, b ∈ R
• Powers: x^p over (0,+∞) for 0 ≤ p ≤ 1
• Logarithm: ln x over (0,+∞)
Examples on Rn
• Affine functions are both convex and concave
• Norms ‖x‖, ‖x‖1, ‖x‖∞ are convex
Second-Order Conditions for Convexity
• Let f be twice differentiable and let dom(f) be the domain of f
[In general, when differentiability is considered, it is required that dom(f) is open]
• The Hessian ∇2f(x) is a symmetric n× n matrix whose entries are the
second-order partial derivatives of f at x:
[∇2f(x)]_{ij} = ∂²f(x) / (∂x_i ∂x_j) for i, j = 1, . . . , n
2nd-order conditions:
• f is convex if and only if dom(f) is a convex set and
∇2f(x) ⪰ 0 for all x ∈ dom(f)
Positive semidefiniteness of a matrix: [Recall that M ∈ Rn×n satisfies M ⪰ 0 if x^T M x ≥ 0 for all x ∈ Rn]
• f is strictly convex if dom(f) is a convex set and
∇2f(x) ≻ 0 for all x ∈ dom(f)
Positive definiteness of a matrix: [Recall that M ∈ Rn×n satisfies M ≻ 0 if x^T M x > 0 for all x ∈ Rn]
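The second-order test above is easy to check numerically. A minimal sketch in Python with NumPy (the two example Hessians are hypothetical, not from this lecture) tests positive semidefiniteness via the eigenvalues of the symmetrized matrix:

```python
import numpy as np

def is_psd(M, tol=1e-10):
    """Check positive semidefiniteness via eigenvalues of the symmetrized matrix."""
    M = 0.5 * (M + M.T)
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

# Hessian of f(x) = x1^2 + 3*x2^2 is constant: diag(2, 6), so f is convex.
H_convex = np.diag([2.0, 6.0])
# Hessian of f(x) = x1^2 - x2^2 is diag(2, -2): indefinite, so f is not convex.
H_indef = np.diag([2.0, -2.0])

print(is_psd(H_convex))  # True
print(is_psd(H_indef))   # False
```

For a non-constant Hessian the test would have to hold at every point of dom(f), so a numerical check like this can only sample candidate points.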
Examples
• Quadratic function: f(x) = (1/2)x′Qx+ q′x+ r with a symmetric
n× n matrix Q
∇f(x) = Qx+ q, ∇2f(x) = Q
Convex for Q ⪰ 0
• Least-squares objective: f(x) = ‖Ax − b‖^2 with an m × n matrix A
∇f(x) = 2AT(Ax− b), ∇2f(x) = 2ATA
Convex for any A
• Quadratic-over-linear: f(x, y) = x^2/y
∇2f(x, y) = (2/y^3) (y, −x)(y, −x)^T ⪰ 0, treating (y, −x) as a column vector
Convex for y > 0
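The least-squares claim can be sanity-checked numerically: the Hessian 2AᵀA satisfies xᵀ(2AᵀA)x = 2‖Ax‖² ≥ 0 for every x, regardless of A. A minimal sketch (the random A is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # arbitrary matrix: the claim holds for any A
H = 2.0 * A.T @ A                 # Hessian of f(x) = ||Ax - b||^2 (independent of b)
eigs = np.linalg.eigvalsh(H)      # eigenvalues of the symmetric Hessian
print(bool(np.all(eigs >= -1e-10)))  # True: H is positive semidefinite
```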
First-Order Condition for Convexity
Let f be differentiable and let dom(f) be its domain. Then, the gradient
∇f(x) = ( ∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xn )^T
exists at each x ∈ dom(f)
• 1st-order condition: f is convex if and only if dom(f) is convex and
f(x) +∇f(x)T(z − x) ≤ f(z) for all x, z ∈ dom(f)
• Note: the first-order approximation is a global underestimator of f
• This is a very important property, used in convex optimization for algorithm design and performance analysis
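As a quick illustration of the underestimator property, the sketch below checks f(z) ≥ f(x) + f′(x)(z − x) on a grid for one hypothetical convex function on R (not an example from the slides):

```python
import numpy as np

f = lambda x: np.exp(x) + x ** 2       # a convex function on R (hypothetical example)
df = lambda x: np.exp(x) + 2.0 * x     # its derivative

x0 = 0.5
zs = np.linspace(-3.0, 3.0, 101)
tangent = f(x0) + df(x0) * (zs - x0)   # first-order approximation at x0
# The tangent line is a global underestimate of the convex function f.
print(bool(np.all(f(zs) >= tangent - 1e-12)))  # True
```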
Operations Preserving Convexity
Let f and g be convex functions over Rn
• Positive Scaling: λf is convex for λ > 0; (λf)(x) = λf(x) for all x
• Sum: f + g is convex; (f + g)(x) = f(x) + g(x) for all x
• Composition with affine function: for g affine [i.e., g(x) = Ax+ b],
the composition f ◦ g is convex, where
(f ◦ g)(x) = f(Ax+ b) for all x
• Pointwise maximum: For convex functions f1, . . . , fm, the pointwise-max function
h(x) = max {f1(x), . . . , fm(x)} is convex
• Polyhedral function: f(x) = max_{i=1,...,m} (a_i^T x + b_i) is convex
• Pointwise supremum: Let Y ⊆ Rm and f : Rn × Rm → R. Let
f(x, y) be convex in x for each y ∈ Y . Then, the supremum function
over the set Y
h(x) = sup_{y∈Y} f(x, y) is convex
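The pointwise-maximum rule can be spot-checked by sampling the convexity inequality at random pairs and weights; the three component functions below are hypothetical examples:

```python
import numpy as np

# Three convex component functions (hypothetical examples).
fs = [lambda x: (x - 1.0) ** 2, lambda x: abs(x + 2.0), lambda x: np.exp(x)]
h = lambda x: max(f(x) for f in fs)  # pointwise maximum

rng = np.random.default_rng(1)
ok = True
for _ in range(1000):
    x, y = rng.uniform(-5.0, 5.0, size=2)
    a = float(rng.uniform())
    # Convexity inequality sampled at a random pair (x, y) and weight a.
    ok = ok and h(a * x + (1 - a) * y) <= a * h(x) + (1 - a) * h(y) + 1e-12
print(ok)  # True
```

A sampled check of course only provides evidence, not a proof; the proof is the one-line argument that each fi satisfies the convexity inequality and max preserves it.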
Optimization Terminology
Let C ⊆ Rn and f : C → R. Consider the following optimization problem
minimize f(x)
subject to x ∈ C
Example: C = {x ∈ Rn | g(x) ≤ 0, x ∈ X}
Terminology:
• The set C is referred to as the feasible set
• We say that the problem is feasible when C is nonempty
• The problem is unconstrained when C = Rn, and it is constrained
otherwise
• We say that a vector x∗ is an optimal solution or a global minimum when
x∗ is feasible and the value f(x∗) is not exceeded at any x ∈ C, i.e.,
x∗ ∈ C and f(x∗) ≤ f(x) for all x ∈ C
Local Minimum
minimize f(x)
subject to x ∈ C
• A vector x̄ is a local minimum for the problem if x̄ ∈ C and there is a
ball B(x̄, r) such that
f(x̄) ≤ f(x) for all x ∈ C with ‖x − x̄‖ ≤ r
• Every global minimum is also a local minimum
• When the set C is convex and the function f is convex, a local
minimum is also global
First-Order Necessary Optimality Condition:
Unconstrained Problem
Let f be a differentiable function with dom(f) = Rn and let C = Rn.
• If x is a local minimum of f over Rn, then the following holds:
∇f(x) = 0
• The gradient relation can be equivalently given as:
(y − x)′∇f(x) ≥ 0 for all y ∈ Rn
This is a variational inequality V I(K,F ) with the set K and the
mapping F given by
K = Rn, F (x) = ∇f(x)
• Solving a minimization problem can be reduced to solving a corresponding variational inequality
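For instance, for a strongly convex quadratic with C = Rn, solving VI(Rn, ∇f) amounts to solving the linear system ∇f(x) = 0. A minimal sketch (the particular Q and q are made-up data):

```python
import numpy as np

# minimize f(x) = (1/2) x^T Q x + q^T x with Q positive definite:
# VI(Rn, grad f) reduces to the linear system Qx + q = 0.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
q = np.array([-1.0, 4.0])
x_star = np.linalg.solve(Q, -q)
print(bool(np.allclose(Q @ x_star + q, 0.0)))  # True: the gradient vanishes
```

With the gradient equal to zero, (y − x)ᵀ∇f(x) ≥ 0 holds trivially for all y ∈ Rn.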
First-Order Necessary Optimality Condition:
Constrained Problem
Let f be a differentiable function with dom(f) = Rn and let C ⊆ Rn be a
closed convex set.
• If x is a local minimum of f over C, then the following holds:
(y − x)′∇f(x) ≥ 0 for all y ∈ C (1)
Again, this is a variational inequality V I(K,F ) with the set K and
the mapping F given by
K = C, F (x) = ∇f(x)
• Recall that when f is convex, a local minimum is also global
• When f is convex, the preceding relation is also sufficient for x to be
a global minimum, i.e.,
if x satisfies relation (1), then x is a (global) minimum
Equality and Inequality Constrained Problem
Consider the following problem
minimize f(x)
subject to h1(x) = 0, . . . , hp(x) = 0
g1(x) ≤ 0, . . . , gm(x) ≤ 0
where f , hi and gj are continuously differentiable over Rn.
Def. For a feasible vector x, the active set of (inequality) constraints is
the set given by
A(x) = {j | gj(x) = 0}
If j ∉ A(x), we say that the j-th constraint is inactive at x
Def. We say that a vector x is regular if the gradients
∇h1(x), . . . ,∇hp(x), and ∇gj(x) for j ∈ A(x) are linearly independent
NOTE: x is regular when there are no equality constraints and all the
inequality constraints are inactive [p = 0 and A(x) = ∅]
Lagrangian Function
With the problem
minimize f(x)
subject to h1(x) = 0, . . . , hp(x) = 0
g1(x) ≤ 0, . . . , gm(x) ≤ 0
(2)
we associate the Lagrangian function L(x, λ, µ) defined by
L(x, λ, µ) = f(x) + Σ_{i=1}^{p} λi hi(x) + Σ_{j=1}^{m} µj gj(x)
where λi ∈ R for all i, and µj ∈ R+ for all j
First-Order Karush-Kuhn-Tucker (KKT)
Necessary Conditions
Th. Let x be a local minimum of the equality/inequality constrained
problem (2). Also, assume that x is regular. Then, there exist unique
multipliers λ and µ such that
• ∇xL(x, λ, µ) = 0 [L is the Lagrangian function]
• µj ≥ 0 for all j
• µj = 0 for all j ∉ A(x)
The last condition is referred to as the complementarity condition
We can compactly write them as:
g(x) ⊥ µ
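As an illustration, the KKT conditions can be verified numerically on a small hypothetical problem (not one from the lecture): minimize x1² + x2² subject to g(x) = 1 − x1 − x2 ≤ 0, whose minimizer is x = (0.5, 0.5) with multiplier µ = 1:

```python
import numpy as np

# Hypothetical problem: minimize x1^2 + x2^2 s.t. g(x) = 1 - x1 - x2 <= 0.
x = np.array([0.5, 0.5])   # candidate minimizer
mu = 1.0                   # candidate multiplier

grad_f = 2.0 * x                     # gradient of the objective
grad_g = np.array([-1.0, -1.0])      # gradient of g
g = 1.0 - x.sum()                    # constraint value (active: g = 0)

print(bool(np.allclose(grad_f + mu * grad_g, 0.0)))  # stationarity: grad_x L = 0
print(mu >= 0.0 and abs(mu * g) < 1e-12)             # mu >= 0 and complementarity
```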
In fact, the complementarity-based formulation can be used to write the
first-order optimality conditions more compactly. Consider the following
constrained optimization problem:
minimize f(x)
subject to c1(x) ≥ 0, . . . , cm(x) ≥ 0
x ≥ 0.
If x is regular, then there exist multipliers λ such that
0 ≤ x ⊥ ∇xf(x)−∇xc(x)T λ ≥ 0 (3)
0 ≤ λ ⊥ c(x) ≥ 0 (4)
More succinctly, this is a nonlinear complementarity problem, denoted by
CP (Rm+n, F ), a problem that requires a z that satisfies
0 ≤ z ⊥ F (z) ≥ 0,
where
z := (x, λ) and F (z) := ( ∇xf(x) − ∇xc(x)^T λ, c(x) ).
Second-Order KKT Necessary Conditions
Th. Let x be a local minimum of the equality/inequality constrained
problem (2). Also, assume that x is regular and that f, hi, gj are twice
continuously differentiable. Then, there exist unique multipliers λ and µ
such that
• ∇xL(x, λ, µ) = 0
• µj ≥ 0 for all j
• µj = 0 for all j ∉ A(x)
• For any vector y such that ∇hi(x)′y = 0 for all i and ∇gj(x)′y = 0
for all j ∈ A(x), the following relation holds:
y′∇2xxL(x, λ, µ)y ≥ 0
Solution Procedures: Iterative Algorithms
For solving problems, we will consider iterative algorithms
• Given an initial iterate x0
•We generate a new iterate
xk+1 = Gk(xk)
where Gk is a mapping that depends on the optimization problem
Objectives:
• Provide necessary conditions on the mappings Gk that yield a sequence
{xk} converging to a solution of the problem of interest
• Study how fast the sequence {xk} converges:
• Global convergence rate (when far from optimal points)
• Local convergence rate (when near an optimal point)
Gradient Descent Method
Consider a continuously differentiable function f . We want to
minimize f(x) over x ∈ Rn
Gradient descent method
xk+1 = xk − αk∇f(xk)
• The scalar αk > 0 is a stepsize
• The stepsize is chosen as a constant αk = α, by line search, or by another
rule, so that f(xk+1) < f(xk)
Convergence Rate:
• Looking at the tail of the error sequence e(xk) = dist(xk, X∗),
where
dist(x,A) := inf_{a∈A} ‖x − a‖ .
• Local convergence is at best linear:
lim sup_{k→∞} e(xk+1)/e(xk) ≤ q for some q ∈ (0,1)
• Global convergence is also at best linear
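A minimal sketch of the fixed-stepsize variant, applied to the well-conditioned quadratic f(x, y) = x² + y² used in Example 1 later in the lecture (the stepsize and iteration count are arbitrary choices):

```python
import numpy as np

def gradient_descent(grad, x0, alpha, iters):
    """Fixed-stepsize gradient descent: x_{k+1} = x_k - alpha * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# minimize f(x, y) = x^2 + y^2; the unique minimizer is the origin.
grad = lambda v: 2.0 * v
x = gradient_descent(grad, [-1.0, -1.0], alpha=0.1, iters=200)
print(bool(np.allclose(x, 0.0, atol=1e-8)))  # True
```

Here the error shrinks by the constant factor |1 − 2α| = 0.8 per step, a concrete instance of linear convergence.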
Newton’s Method
Consider a twice continuously differentiable function f with Hessian ∇2f(x) ≻ 0 for all x. We want to solve the following problem:
minimize {f(x) : x ∈ Rn}
Newton’s method
xk+1 = xk − αk∇2f(xk)−1∇f(xk)
Local Convergence Rate (near x∗)
• ‖∇f(x)‖ converges to zero quadratically:
‖∇f(xk)‖ ≤ C q^{2^k} for all large enough k,
where C > 0 and q ∈ (0,1)
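A minimal sketch of the method with unit stepsize (αk = 1), applied to the quadratic f(x, y) = 0.1x² + y² from Example 2 below; for a quadratic, one Newton step lands on the minimizer exactly:

```python
import numpy as np

def newton(grad, hess, x0, iters):
    """Newton's method with unit stepsize: x_{k+1} = x_k - H(x_k)^{-1} grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# minimize the ill-conditioned quadratic f(x, y) = 0.1 x^2 + y^2.
grad = lambda v: np.array([0.2 * v[0], 2.0 * v[1]])
hess = lambda v: np.diag([0.2, 2.0])
x = newton(grad, hess, [-1.0, -1.0], iters=1)
print(bool(np.allclose(x, 0.0)))  # True: one Newton step solves a quadratic exactly
```

Unlike gradient descent, the Newton step is unaffected by the ill-conditioning of the Hessian, which is the point of Examples 1 and 2 below.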
Penalty Methods
For solving inequality constrained problems:
minimize f(x)
subject to gj(x) ≤ 0, j = 1, . . . ,m
Penalty Approach: Remove the constraints but penalize their violation
Pc : minimize F (x, c) = f(x)+cP (g1(x), . . . , gm(x)) over x ∈ Rn
where c > 0 is a penalty parameter and P is some penalty function
Penalty methods operate in two stages for c and x, respectively
• Choose initial value c0
(1) Having ck, solve the problem Pck to obtain its optimal solution x∗(ck)
(2) Using x∗(ck), update ck to obtain ck+1 and go to step (1)
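The slide leaves the penalty function P unspecified; the sketch below assumes the common quadratic penalty max(0, g(x))² on a one-dimensional hypothetical problem, and shows x∗(c) approaching the constrained solution as c grows:

```python
# minimize (x - 2)^2 subject to g(x) = x - 1 <= 0; the solution is x* = 1.
# Quadratic-penalty subproblem (an assumed choice of P):
#   F(x, c) = (x - 2)^2 + c * max(0, x - 1)^2

def x_of_c(c):
    # For the minimizer with x > 1, dF/dx = 2(x - 2) + 2c(x - 1) = 0,
    # which gives x*(c) = (2 + c) / (1 + c).
    return (2.0 + c) / (1.0 + c)

for c in (1.0, 10.0, 1000.0):
    print(round(x_of_c(c), 4))  # 1.5, 1.0909, 1.001 -> tends to x* = 1
```

Note that each x∗(c) is infeasible (it violates x ≤ 1 slightly): this penalty approaches the solution from outside the feasible set, in contrast with the interior-point method at the end of the lecture.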
Q-Rates of Convergence
Let {xk} be a sequence in Rn that converges to x∗
Convergence is said to be:
1. Q-linear if ∃ r ∈ (0,1) such that ‖xk+1 − x∗‖ / ‖xk − x∗‖ ≤ r for all k > K.
Example: xk = 1 + 0.5^k converges Q-linearly to 1.
2. Q-quadratic if ∃ M > 0 such that ‖xk+1 − x∗‖ / ‖xk − x∗‖^2 ≤ M for all k > K.
Example: xk = 1 + 0.5^{2^k} converges Q-quadratically to 1.
3. Q-superlinear if lim_{k→∞} ‖xk+1 − x∗‖ / ‖xk − x∗‖ = 0.
Example: xk = 1 + k^{−k} converges Q-superlinearly to 1.
4. Q-quadratic =⇒ Q-superlinear =⇒ Q-linear
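The three example sequences can be checked numerically by forming the error ratios straight from the definitions:

```python
# Errors e_k = |x_k - 1| for the three example sequences.
lin = [0.5 ** k for k in range(1, 8)]            # x_k = 1 + 0.5^k
quad = [0.5 ** (2 ** k) for k in range(1, 5)]    # x_k = 1 + 0.5^(2^k)
sup = [float(k) ** (-k) for k in range(2, 8)]    # x_k = 1 + k^(-k)

# Q-linear: the ratio e_{k+1}/e_k is the constant r = 0.5 < 1.
ratios_lin = [b / a for a, b in zip(lin, lin[1:])]
print(all(abs(r - 0.5) < 1e-12 for r in ratios_lin))  # True

# Q-quadratic: e_{k+1} = e_k^2, so e_{k+1}/e_k^2 = 1 (any M >= 1 works).
ratios_quad = [b / a ** 2 for a, b in zip(quad, quad[1:])]
print(all(abs(r - 1.0) < 1e-12 for r in ratios_quad))  # True

# Q-superlinear: the ratio e_{k+1}/e_k decreases toward 0.
ratios_sup = [b / a for a, b in zip(sup, sup[1:])]
print(all(b < a for a, b in zip(ratios_sup, ratios_sup[1:])))  # True
```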
Example 1
f(x, y) = x2 + y2
1. Steepest descent from (−1, −1)
2. Newton from (−1, −1)
3. Newton from (−1, 1)
Figure 1: Well-Conditioned Function: Steepest, Newton, Newton (contour plots over x, y ∈ [−2, 2])
Example 2
f(x, y) = 0.1x2 + y2
1. Steepest descent from (−1, −1)
2. Newton from (−1, −1)
3. Newton from (−1, 1)
Figure 2: Ill-Conditioned Function: Steepest, Newton, Newton (contour plots over x, y ∈ [−2, 2])
Interior-Point Methods
Solve the inequality-constrained (and, more generally, constrained) problem:
minimize f(x)
subject to gj(x) ≤ 0, j = 1, . . . ,m
The IPM solves a sequence of problems parametrized by t > 0:
minimize f(x) − (1/t) Σ_{j=1}^{m} ln(−gj(x)) over x ∈ Rn
• Can be viewed as a penalty method with
• Penalty parameter c = 1/t
• Penalty function
P (u1, . . . , um) = − Σ_{j=1}^{m} ln(−uj)
This function is known as the logarithmic barrier or log-barrier function
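A minimal one-dimensional sketch (the problem is a hypothetical example, not from the slides): minimize f(x) = x subject to g(x) = −x ≤ 0, so the barrier subproblem is min over x > 0 of x − (1/t) ln x, whose minimizer 1/t tends to the solution x∗ = 0 as t grows:

```python
# minimize f(x) = x subject to g(x) = -x <= 0; the solution is x* = 0.
# Barrier subproblem: minimize x - (1/t) * ln(x) over x > 0.

def barrier_argmin(t):
    # Setting the derivative 1 - 1/(t*x) to zero gives x = 1/t.
    return 1.0 / t

for t in (1.0, 10.0, 1000.0):
    print(barrier_argmin(t))  # 1.0, 0.1, 0.001 -> tends to x* = 0
```

In contrast with the exterior quadratic penalty, every barrier iterate is strictly feasible, which is why these are called interior-point methods.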
References for this lecture
The material for this lecture:
• (B) Bertsekas D.P. Nonlinear Programming
• Chapter 1 and Chapter 3 (descent and Newton’s methods, KKT
conditions)
• (FP) Facchinei and Pang Finite Dimensional ..., Vol I (Part on Complementarity Problems)
• Chapter 1 for Normal Cone, Dual Cone, and Tangent Cone
• (BNO) Bertsekas, Nedic, Ozdaglar Convex Analysis and Optimization
• Chapter 1 (convex functions)