NLP Slides


Transcript of NLP Slides

  • 8/8/2019 NLP Slides

    1/201

    LECTURE SLIDES ON NONLINEAR PROGRAMMING

    BASED ON LECTURES GIVEN AT THE

    MASSACHUSETTS INSTITUTE OF TECHNOLOGY

    CAMBRIDGE, MASS

    DIMITRI P. BERTSEKAS

    These lecture slides are based on the book: Nonlinear Programming, Athena Scientific, by Dimitri P. Bertsekas; see

    http://www.athenasc.com/nonlinbook.html

    for errata, selected problem solutions, and other support material.

    The slides are copyrighted but may be freely reproduced and distributed for any noncommercial purpose.

    LAST REVISED: Feb. 3, 2005

  • Slide 2/201

    6.252 NONLINEAR PROGRAMMING

    LECTURE 1: INTRODUCTION

    LECTURE OUTLINE

    Nonlinear Programming

    Application Contexts

    Characterization Issue

    Computation Issue

    Duality

    Organization

  • Slide 3/201

    NONLINEAR PROGRAMMING

    $\min_{x \in X} f(x)$,

    where

    $f : \Re^n \mapsto \Re$ is a continuous (and usually differentiable) function of $n$ variables

    $X = \Re^n$ or $X$ is a subset of $\Re^n$ with a continuous character.

    If $X = \Re^n$, the problem is called unconstrained

    If $f$ is linear and $X$ is polyhedral, the problem is a linear programming problem. Otherwise it is a nonlinear programming problem

    Linear and nonlinear programming have traditionally been treated separately. Their methodologies have gradually come closer.

  • Slide 4/201

    TWO MAIN ISSUES

    Characterization of minima

    Necessary conditions

    Sufficient conditions

    Lagrange multiplier theory

    Sensitivity

    Duality

    Computation by iterative algorithms

    Iterative descent

    Approximation methods

    Dual and primal-dual methods

  • Slide 5/201

    APPLICATIONS OF NONLINEAR PROGRAMMING

    Data networks - Routing

    Production planning

    Resource allocation

    Computer-aided design

    Solution of equilibrium models

    Data analysis and least squares formulations

    Modeling human or organizational behavior

  • Slide 6/201

    CHARACTERIZATION PROBLEM

    Unconstrained problems

    Zero 1st order variation along all directions

    Constrained problems

    Nonnegative 1st order variation along all feasible directions

    Equality constraints

    Zero 1st order variation along all directions on the constraint surface

    Lagrange multiplier theory

    Sensitivity

  • Slide 7/201

    COMPUTATION PROBLEM

    Iterative descent

    Approximation

    Role of convergence analysis

    Role of rate of convergence analysis

    Using an existing package to solve a nonlinear programming problem

  • Slide 8/201

    POST-OPTIMAL ANALYSIS

    Sensitivity

    Role of Lagrange multipliers as prices

  • Slide 9/201

    DUALITY

    Min-common point problem / max-intercept problem duality

    [Figure: two panels (a) and (b), each showing a set S with its Min Common Point and Max Intercept Point.]

    Illustration of the optimal values of the min common point and max intercept point problems. In (a), the two optimal values are not equal. In (b), the set S, when extended upwards along the nth axis, yields the set

    $\bar S = \{\bar x \mid \text{for some } x \in S,\ \bar x_n \ge x_n,\ \bar x_i = x_i,\ i = 1, \ldots, n-1\}$

    which is convex. As a result, the two optimal values are equal. This fact, when suitably formalized, is the basis for some of the most important duality results.

  • Slide 10/201

    6.252 NONLINEAR PROGRAMMING

    LECTURE 2

    UNCONSTRAINED OPTIMIZATION -

    OPTIMALITY CONDITIONS

    LECTURE OUTLINE

    Unconstrained Optimization

    Local Minima

    Necessary Conditions for Local Minima

    Sufficient Conditions for Local Minima

    The Role of Convexity

  • Slide 11/201

    MATHEMATICAL BACKGROUND

    Vectors and matrices in $\Re^n$

    Transpose, inner product, norm

    Eigenvalues of symmetric matrices

    Positive definite and semidefinite matrices

    Convergent sequences and subsequences

    Open, closed, and compact sets

    Continuity of functions

    1st and 2nd order differentiability of functions

    Taylor series expansions

    Mean value theorems

  • Slide 12/201

    LOCAL AND GLOBAL MINIMA

    [Figure: a one-dimensional function f(x) with a strict local minimum, a set of local minima, and the strict global minimum marked.]

    Unconstrained local and global minima in one dimension.

  • Slide 13/201

    NECESSARY CONDITIONS FOR A LOCAL MIN

    1st order condition: Zero slope at a local minimum $x^*$

    $\nabla f(x^*) = 0$

    2nd order condition: Nonnegative curvature at a local minimum $x^*$

    $\nabla^2 f(x^*)$ : Positive Semidefinite

    There may exist points that satisfy the 1st and 2nd order conditions but are not local minima

    [Figure: $f(x) = |x|^3$ (convex), $f(x) = x^3$, and $f(x) = -|x|^3$, each with $x^* = 0$.]

    First and second order necessary optimality conditions for functions of one variable.
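    The slide's point that the necessary conditions are not sufficient can be spot-checked numerically. The sketch below (an illustration, not part of the slides; the example $f(x) = x^3$ is from the figure) approximates the derivative and second derivative by central differences and shows both conditions hold at $x^* = 0$ even though $0$ is not a local minimum.

    ```python
    # Hedged numerical sketch: central-difference derivative checks for
    # f(x) = x**3, which satisfies both necessary conditions at x = 0
    # yet has no local minimum there.
    def grad(f, x, h=1e-5):
        # first derivative via central difference
        return (f(x + h) - f(x - h)) / (2 * h)

    def hess(f, x, h=1e-4):
        # second derivative via central difference
        return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

    f = lambda x: x ** 3
    print(abs(grad(f, 0.0)) < 1e-8)   # 1st order condition holds
    print(hess(f, 0.0) >= -1e-6)      # 2nd order condition holds
    print(f(-0.1) < f(0.0))           # yet f decreases to the left
    ```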

  • Slide 14/201

    PROOFS OF NECESSARY CONDITIONS

    1st order condition $\nabla f(x^*) = 0$. Fix $d \in \Re^n$. Then (since $x^*$ is a local min), from 1st order Taylor

    $d'\nabla f(x^*) = \lim_{\alpha \downarrow 0} \frac{f(x^* + \alpha d) - f(x^*)}{\alpha} \ge 0$,

    Replace $d$ with $-d$, to obtain

    $d'\nabla f(x^*) = 0, \quad \forall\, d \in \Re^n$

    2nd order condition $\nabla^2 f(x^*) \ge 0$. From 2nd order Taylor

    $f(x^* + \alpha d) - f(x^*) = \alpha \nabla f(x^*)'d + \frac{\alpha^2}{2}\, d'\nabla^2 f(x^*)d + o(\alpha^2)$

    Since $\nabla f(x^*) = 0$ and $x^*$ is a local min, there is sufficiently small $\bar\epsilon > 0$ such that for all $\alpha \in (0, \bar\epsilon)$,

    $0 \le \frac{f(x^* + \alpha d) - f(x^*)}{\alpha^2} = \frac{1}{2}\, d'\nabla^2 f(x^*)d + \frac{o(\alpha^2)}{\alpha^2}$

    Take the limit as $\alpha \downarrow 0$.

  • Slide 15/201

    SUFFICIENT CONDITIONS FOR A LOCAL MIN

    1st order condition: Zero slope

    $\nabla f(x^*) = 0$

    2nd order condition: Positive curvature

    $\nabla^2 f(x^*)$ : Positive Definite

    Proof: Let $\gamma > 0$ be the smallest eigenvalue of $\nabla^2 f(x^*)$. Using a second order Taylor expansion, we have for all $d$

    $f(x^* + d) - f(x^*) = \nabla f(x^*)'d + \frac{1}{2}\, d'\nabla^2 f(x^*)d + o(\|d\|^2)$

    $\ge \frac{\gamma}{2}\|d\|^2 + o(\|d\|^2) = \left(\frac{\gamma}{2} + \frac{o(\|d\|^2)}{\|d\|^2}\right)\|d\|^2$.

    For $\|d\|$ small enough, $o(\|d\|^2)/\|d\|^2$ is negligible relative to $\gamma/2$.
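    Positive definiteness of the Hessian can be checked through its eigenvalues, as in the proof above where $\gamma$ is the smallest one. A minimal sketch (example function and matrix chosen here, not from the slides): for $f(x, y) = x^2 + xy + y^2$ the Hessian is the constant matrix [[2, 1], [1, 2]], and a symmetric 2x2 matrix is positive definite exactly when its smallest eigenvalue is positive.

    ```python
    import math

    # Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]] in closed form.
    def eig2(a, b, c):
        mean = (a + c) / 2.0
        r = math.hypot((a - c) / 2.0, b)
        return mean - r, mean + r      # (smallest, largest)

    # Hessian of f(x, y) = x**2 + x*y + y**2 (an assumed example).
    lo, hi = eig2(2.0, 1.0, 2.0)
    print(lo > 0)      # smallest eigenvalue gamma > 0: positive definite
    print(lo, hi)
    ```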

  • Slide 16/201

    CONVEXITY

    [Figure: examples of convex and nonconvex sets; for a convex set, the segment $\alpha x + (1 - \alpha)y$, $0 < \alpha < 1$, between any points $x, y$ of the set stays in the set.]

    Convex and nonconvex sets.

    [Figure: a convex function on a set $C$; at $z = \alpha x + (1 - \alpha)y$ the chord value $\alpha f(x) + (1 - \alpha)f(y)$ lies above $f(z)$.]

    A convex function. Linear interpolation overestimates the function.

  • Slide 17/201

    MINIMA AND CONVEXITY

    Local minima are also global under convexity

    [Figure: graph of a convex $f$, with the chord value $\alpha f(x^*) + (1 - \alpha)f(x)$ lying above $f(\alpha x^* + (1 - \alpha)x)$ between $x$ and $x^*$.]

    Illustration of why local minima of convex functions are also global. Suppose that $f$ is convex and that $x^*$ is a local minimum of $f$. Let $x$ be such that $f(x) < f(x^*)$. By convexity, for all $\alpha \in (0, 1)$,

    $f\bigl(\alpha x^* + (1 - \alpha)x\bigr) \le \alpha f(x^*) + (1 - \alpha)f(x) < f(x^*)$.

    Thus, $f$ takes values strictly lower than $f(x^*)$ on the line segment connecting $x^*$ with $x$, and $x^*$ cannot be a local minimum which is not global.

  • Slide 18/201

    OTHER PROPERTIES OF CONVEX FUNCTIONS

    $f$ is convex if and only if the linear approximation at a point $x$ based on the gradient underestimates $f$:

    $f(z) \ge f(x) + \nabla f(x)'(z - x), \quad \forall\, z \in \Re^n$

    [Figure: the graph of $f(z)$ lying above the tangent line $f(x) + (z - x)'\nabla f(x)$.]

    Implication:

    $\nabla f(x) = 0$ if and only if $x$ is a global minimum

    $f$ is convex if and only if $\nabla^2 f(x)$ is positive semidefinite for all $x$

  • Slide 19/201

    6.252 NONLINEAR PROGRAMMING

    LECTURE 3: GRADIENT METHODS

    LECTURE OUTLINE

    Quadratic Unconstrained Problems

    Existence of Optimal Solutions

    Iterative Computational Methods

    Gradient Methods - Motivation

    Principal Gradient Methods

    Gradient Methods - Choices of Direction

  • Slide 20/201

    QUADRATIC UNCONSTRAINED PROBLEMS

    $\min_{x \in \Re^n} f(x) = \frac{1}{2}x'Qx - b'x$,

    where $Q$ is $n \times n$ symmetric, and $b \in \Re^n$.

    Necessary conditions:

    $\nabla f(x^*) = Qx^* - b = 0$,

    $\nabla^2 f(x^*) = Q \ge 0$ : positive semidefinite.

    $Q \ge 0$ $\Rightarrow$ $f$ : convex, nec. conditions are also sufficient, and local minima are also global

    Conclusions:

    $Q$ : not $\ge 0$ $\Rightarrow$ $f$ has no local minima

    If $Q > 0$ (and hence invertible), $x^* = Q^{-1}b$ is the unique global minimum.

    If $Q \ge 0$ but not invertible, either no solution or an infinite number of solutions

  • Slide 21/201

    [Figure: four panels of isocost surfaces in the $(x, y)$ plane, one for each of the cases listed below.]

    $\alpha > 0$, $\gamma > 0$: $(1/\alpha, 0)$ is the unique global minimum

    $\alpha > 0$, $\gamma = 0$: $\{(1/\alpha, \xi) \mid \xi \text{ real}\}$ is the set of global minima

    $\alpha = 0$: there is no global minimum

    $\alpha > 0$, $\gamma < 0$: there is no global minimum

    Illustration of the isocost surfaces of the quadratic cost function $f : \Re^2 \mapsto \Re$ given by

    $f(x, y) = \frac{1}{2}\bigl(\alpha x^2 + \gamma y^2\bigr) - x$

    for various values of $\alpha$ and $\gamma$.

  • Slide 22/201

    EXISTENCE OF OPTIMAL SOLUTIONS

    Consider the problem

    $\min_{x \in X} f(x)$

    The set of optimal solutions is

    $X^* = \bigcap_{k=1}^{\infty} \bigl\{x \in X \mid f(x) \le \gamma^k\bigr\}$

    where $\{\gamma^k\}$ is a scalar sequence such that $\gamma^k \downarrow f^*$ with

    $f^* = \inf_{x \in X} f(x)$

    $X^*$ is nonempty and compact if all the sets $\{x \in X \mid f(x) \le \gamma^k\}$ are compact. So:

    A global minimum exists if $f$ is continuous and $X$ is compact (Weierstrass theorem)

    A global minimum exists if $X$ is closed, and $f$ is continuous and coercive, that is, $f(x) \to \infty$ when $\|x\| \to \infty$

  • Slide 23/201

    GRADIENT METHODS - MOTIVATION

    [Figure: level sets $f(x) = c_1$, $f(x) = c_2 < c_1$, $f(x) = c_3 < c_2$, with the steepest descent step $\bar x = x - \alpha \nabla f(x)$.]

    If $\nabla f(x) \ne 0$, there is an interval $(0, \delta)$ of stepsizes such that

    $f\bigl(x - \alpha \nabla f(x)\bigr) < f(x)$ for all $\alpha \in (0, \delta)$.

    [Figure: the same level sets, with a direction $d$ and the step $\bar x = x + \alpha d$.]

    If $d$ makes an angle with $\nabla f(x)$ that is greater than 90 degrees,

    $\nabla f(x)'d < 0$,

    there is an interval $(0, \delta)$ of stepsizes such that $f(x + \alpha d) < f(x)$ for all $\alpha \in (0, \delta)$.

  • Slide 24/201

    PRINCIPAL GRADIENT METHODS

    $x^{k+1} = x^k + \alpha^k d^k, \quad k = 0, 1, \ldots$

    where, if $\nabla f(x^k) \ne 0$, the direction $d^k$ satisfies

    $\nabla f(x^k)'d^k < 0$,

    and $\alpha^k$ is a positive stepsize. Principal example:

    $x^{k+1} = x^k - \alpha^k D^k \nabla f(x^k)$,

    where $D^k$ is a positive definite symmetric matrix

    Simplest method: Steepest descent

    $x^{k+1} = x^k - \alpha^k \nabla f(x^k), \quad k = 0, 1, \ldots$

    Most sophisticated method: Newton's method

    $x^{k+1} = x^k - \alpha^k \bigl(\nabla^2 f(x^k)\bigr)^{-1} \nabla f(x^k), \quad k = 0, 1, \ldots$
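    The simplest of the methods above, steepest descent with a constant stepsize, can be sketched in a few lines. The test function and stepsize below are assumptions made for the illustration, not from the slides:

    ```python
    # Steepest descent x^{k+1} = x^k - a * grad f(x^k) on the ill-scaled
    # quadratic f(x, y) = x**2 + 10*y**2, minimized at (0, 0).
    def grad(p):
        x, y = p
        return (2.0 * x, 20.0 * y)

    p = (1.0, 1.0)
    a = 0.05                       # constant stepsize (assumed)
    for _ in range(200):
        g = grad(p)
        p = (p[0] - a * g[0], p[1] - a * g[1])
    print(p)                       # converges toward the minimum (0, 0)
    ```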

  • Slide 25/201

    STEEPEST DESCENT AND NEWTONS METHOD

    [Figure: zig-zagging iterates $x^0, x^1, x^2$ across the elongated level sets $f(x) = c_1$, $f(x) = c_2$, $f(x) = c_3 < c_2$, illustrating slow convergence of steepest descent.]

  • Slide 26/201

    OTHER CHOICES OF DIRECTION

    Diagonally Scaled Steepest Descent

    $D^k$ = diagonal approximation to $\bigl(\nabla^2 f(x^k)\bigr)^{-1}$

    Modified Newton's Method

    $D^k = \bigl(\nabla^2 f(x^0)\bigr)^{-1}, \quad k = 0, 1, \ldots$

    Discretized Newton's Method

    $D^k = \bigl(H(x^k)\bigr)^{-1}, \quad k = 0, 1, \ldots$

    where $H(x^k)$ is a finite-difference based approximation of $\nabla^2 f(x^k)$

    Gauss-Newton method for least squares problems: $\min_{x \in \Re^n} \frac{1}{2}\|g(x)\|^2$. Here

    $D^k = \bigl(\nabla g(x^k)\nabla g(x^k)'\bigr)^{-1}, \quad k = 0, 1, \ldots$

  • Slide 27/201

    6.252 NONLINEAR PROGRAMMING

    LECTURE 4

    CONVERGENCE ANALYSIS OF GRADIENT METHODS

    LECTURE OUTLINE

    Gradient Methods - Choice of Stepsize

    Gradient Methods - Convergence Issues

  • Slide 28/201

    CHOICES OF STEPSIZE I

    Minimization Rule: $\alpha^k$ is such that

    $f(x^k + \alpha^k d^k) = \min_{\alpha \ge 0} f(x^k + \alpha d^k)$.

    Limited Minimization Rule: Min over $\alpha \in [0, s]$

    Armijo rule:

    [Figure: the curve $\alpha \mapsto f(x^k + \alpha d^k) - f(x^k)$ together with the lines $\alpha \nabla f(x^k)'d^k$ and $\sigma\alpha \nabla f(x^k)'d^k$; the set of acceptable stepsizes, the unsuccessful stepsize trials $s$, $\beta s$, $\beta^2 s$, and the accepted stepsize $\alpha^k$.]

    Start with $s$ and continue with $\beta s, \beta^2 s, \ldots$, until $\beta^m s$ falls within the set of $\alpha$ with

    $f(x^k) - f(x^k + \alpha d^k) \ge -\sigma\alpha \nabla f(x^k)'d^k$.
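    The backtracking loop described above is short in code. A sketch for the one-dimensional case (the parameter values $s$, $\beta$, $\sigma$ and the test function are chosen here for illustration):

    ```python
    # Armijo rule: start at alpha = s, shrink by beta until the
    # sufficient-decrease test
    #   f(x) - f(x + alpha*d) >= -sigma * alpha * (grad f(x)' d)
    # is satisfied.
    def armijo(f, x, d, grad_dot_d, s=1.0, beta=0.5, sigma=0.1):
        alpha = s
        while f(x) - f(x + alpha * d) < -sigma * alpha * grad_dot_d:
            alpha *= beta
        return alpha

    f = lambda x: x ** 2
    x = 1.0
    d = -2.0 * x                    # steepest descent direction -f'(x)
    grad_dot_d = (2.0 * x) * d      # f'(x) * d, negative for descent
    alpha = armijo(f, x, d, grad_dot_d)
    print(alpha)                    # s = 1 fails, beta*s = 0.5 is accepted
    ```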

  • Slide 29/201

    CHOICES OF STEPSIZE II

    Constant stepsize: $\alpha^k$ is such that

    $\alpha^k = s$ : a constant

    Diminishing stepsize:

    $\alpha^k \to 0$

    but satisfies the infinite travel condition

    $\sum_{k=0}^{\infty} \alpha^k = \infty$

  • Slide 30/201

    GRADIENT METHODS WITH ERRORS

    $x^{k+1} = x^k - \alpha^k\bigl(\nabla f(x^k) + e^k\bigr)$

    where $e^k$ is an uncontrollable error vector

    Several special cases:

    $e^k$ small relative to the gradient; i.e., for all $k$, $\|e^k\| < \|\nabla f(x^k)\|$

    [Figure: the vectors $\nabla f(x^k)$, $e^k$, and $g^k$, illustrating the descent property of the direction $g^k = \nabla f(x^k) + e^k$.]

    $\{e^k\}$ is bounded, i.e., for all $k$, $\|e^k\| \le \delta$, where $\delta$ is some scalar.

    $\{e^k\}$ is proportional to the stepsize, i.e., for all $k$, $\|e^k\| \le q\alpha^k$, where $q$ is some scalar.

    $\{e^k\}$ are independent zero mean random vectors

  • Slide 31/201

    CONVERGENCE ISSUES

    Only convergence to stationary points can beguaranteed

    Even convergence to a single limit may be hardto guarantee (capture theorem)

    Danger of nonconvergence if directions $d^k$ tend to be orthogonal to $\nabla f(x^k)$

    Gradient related condition:

    For any subsequence $\{x^k\}_{k \in K}$ that converges to a nonstationary point, the corresponding subsequence $\{d^k\}_{k \in K}$ is bounded and satisfies

    $\limsup_{k \to \infty,\ k \in K} \nabla f(x^k)'d^k < 0$.

    Satisfied if $d^k = -D^k\nabla f(x^k)$ and the eigenvalues of $D^k$ are bounded above and bounded away from zero

  • Slide 32/201

    CONVERGENCE RESULTS

    CONSTANT AND DIMINISHING STEPSIZES

    Let $\{x^k\}$ be a sequence generated by a gradient method $x^{k+1} = x^k + \alpha^k d^k$, where $\{d^k\}$ is gradient related. Assume that for some constant $L > 0$, we have

    $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|, \quad \forall\, x, y \in \Re^n$,

    Assume that either

    (1) there exists a scalar $\epsilon$ such that for all $k$

    $0 < \epsilon \le \alpha^k \le \frac{(2 - \epsilon)\,|\nabla f(x^k)'d^k|}{L\|d^k\|^2}$

    or

    (2) $\alpha^k \to 0$ and $\sum_{k=0}^{\infty} \alpha^k = \infty$.

    Then either $f(x^k) \to -\infty$ or else $\{f(x^k)\}$ converges to a finite value and $\nabla f(x^k) \to 0$.

  • Slide 33/201

    MAIN PROOF IDEA

    [Figure: the cost difference $f(x^k + \alpha d^k) - f(x^k)$ versus $\alpha$, its linear lower bound $\alpha\nabla f(x^k)'d^k$, and the quadratic majorant $\alpha\nabla f(x^k)'d^k + \frac{1}{2}\alpha^2 L\|d^k\|^2$, which is minimized at $\bar\alpha = |\nabla f(x^k)'d^k| / (L\|d^k\|^2)$.]

    The idea of the convergence proof for a constant stepsize. Given $x^k$ and the descent direction $d^k$, the cost difference $f(x^k + \alpha d^k) - f(x^k)$ is majorized by $\alpha\nabla f(x^k)'d^k + \frac{1}{2}\alpha^2 L\|d^k\|^2$ (based on the Lipschitz assumption; see next slide). Minimization of this function over $\alpha$ yields the stepsize

    $\bar\alpha = \frac{|\nabla f(x^k)'d^k|}{L\|d^k\|^2}$

    This stepsize reduces the cost function $f$ as well.

  • Slide 34/201

    DESCENT LEMMA

    Let $t$ be a scalar and let $g(t) = f(x + ty)$. Have

    $f(x + y) - f(x) = g(1) - g(0) = \int_0^1 \frac{dg}{dt}(t)\,dt = \int_0^1 y'\nabla f(x + ty)\,dt$

    $\le \int_0^1 y'\nabla f(x)\,dt + \left|\int_0^1 y'\bigl(\nabla f(x + ty) - \nabla f(x)\bigr)\,dt\right|$

    $\le \int_0^1 y'\nabla f(x)\,dt + \int_0^1 \|y\| \cdot \|\nabla f(x + ty) - \nabla f(x)\|\,dt$

    $\le y'\nabla f(x) + \|y\|\int_0^1 Lt\|y\|\,dt = y'\nabla f(x) + \frac{L}{2}\|y\|^2$.
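    The descent lemma's quadratic upper bound can be checked numerically. For the scalar example below (chosen here, not from the slides) the gradient of $f(x) = \frac{L}{2}x^2$ is Lipschitz with constant exactly $L$, so the bound holds with equality:

    ```python
    import random

    # Spot-check of the descent lemma bound
    #   f(x + y) <= f(x) + y * f'(x) + (L/2) * y**2
    # for f(x) = (L/2) * x**2, whose derivative L*x is L-Lipschitz.
    L = 3.0
    f = lambda x: 0.5 * L * x * x
    df = lambda x: L * x
    random.seed(1)
    for _ in range(1000):
        x, y = random.uniform(-5, 5), random.uniform(-5, 5)
        assert f(x + y) <= f(x) + y * df(x) + 0.5 * L * y * y + 1e-9
    print("descent lemma bound verified on samples")
    ```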

  • Slide 35/201

    CONVERGENCE RESULT ARMIJO RULE

    Let $\{x^k\}$ be generated by $x^{k+1} = x^k + \alpha^k d^k$, where $\{d^k\}$ is gradient related and $\alpha^k$ is chosen by the Armijo rule. Then every limit point of $\{x^k\}$ is stationary.

    Proof Outline: Assume $\bar x$ is a nonstationary limit point. Then $f(x^k) \to f(\bar x)$, so $\alpha^k \nabla f(x^k)'d^k \to 0$. If $\{x^k\}_K \to \bar x$, $\limsup_{k \to \infty,\ k \in K} \nabla f(x^k)'d^k < 0$, by gradient relatedness, so that $\{\alpha^k\}_K \to 0$.

    By the Armijo rule, for large $k \in K$

    $f(x^k) - f\bigl(x^k + (\alpha^k/\beta)d^k\bigr) < -\sigma(\alpha^k/\beta)\nabla f(x^k)'d^k$.

    Defining $p^k = \frac{d^k}{\|d^k\|}$ and $\bar\alpha^k = \frac{\alpha^k\|d^k\|}{\beta}$, we have

    $\frac{f(x^k) - f(x^k + \bar\alpha^k p^k)}{\bar\alpha^k} < -\sigma\nabla f(x^k)'p^k$.

    Use the Mean Value Theorem and let $k \to \infty$. We get $-\nabla f(\bar x)'\bar p \le -\sigma\nabla f(\bar x)'\bar p$, where $\bar p$ is a limit point of $p^k$, a contradiction since $\nabla f(\bar x)'\bar p < 0$.

  • Slide 36/201

    6.252 NONLINEAR PROGRAMMING

    LECTURE 5: RATE OF CONVERGENCE

    LECTURE OUTLINE

    Approaches for Rate of Convergence Analysis

    The Local Analysis Method

    Quadratic Model Analysis

    The Role of the Condition Number

    Scaling

    Diagonal Scaling

    Extension to Nonquadratic Problems

    Singular and Difficult Problems

  • Slide 37/201

    APPROACHES FOR RATE OF

    CONVERGENCE ANALYSIS

    Computational complexity approach

    Informational complexity approach

    Local analysis

    Why we will focus on the local analysis method

  • Slide 38/201

    THE LOCAL ANALYSIS APPROACH

    Restrict attention to sequences $x^k$ converging to a local min $x^*$

    Measure progress in terms of an error function $e(x)$ with $e(x^*) = 0$, such as

    $e(x) = \|x - x^*\|, \quad e(x) = f(x) - f(x^*)$

    Compare the tail of the sequence $e(x^k)$ with the tail of standard sequences

    Geometric or linear convergence [if $e(x^k) \le q\beta^k$ for some $q > 0$ and $\beta \in [0, 1)$, and for all $k$]. Holds if

    $\limsup_{k \to \infty} \frac{e(x^{k+1})}{e(x^k)} < 1$

    Superlinear convergence [if $e(x^k) \le q\beta^{p^k}$ for some $q > 0$, $p > 1$ and $\beta \in [0, 1)$, and for all $k$].

    Sublinear convergence

  • Slide 39/201

    QUADRATIC MODEL ANALYSIS

    Focus on the quadratic function $f(x) = \frac{1}{2}x'Qx$ with $Q > 0$.

    Analysis also applies to nonquadratic problems in the neighborhood of a nonsingular local min

    Consider steepest descent

    $x^{k+1} = x^k - \alpha^k\nabla f(x^k) = (I - \alpha^k Q)x^k$

    $\|x^{k+1}\|^2 = x^{k\prime}(I - \alpha^k Q)^2 x^k \le \bigl(\max\text{ eig. }(I - \alpha^k Q)^2\bigr)\|x^k\|^2$

    The eigenvalues of $(I - \alpha^k Q)^2$ are equal to $(1 - \alpha^k\lambda_i)^2$, where $\lambda_i$ are the eigenvalues of $Q$, so

    $\max\text{ eig of }(I - \alpha^k Q)^2 = \max\bigl\{(1 - \alpha^k m)^2, (1 - \alpha^k M)^2\bigr\}$

    where $m$, $M$ are the smallest and largest eigenvalues of $Q$. Thus

    $\frac{\|x^{k+1}\|}{\|x^k\|} \le \max\bigl\{|1 - \alpha^k m|, |1 - \alpha^k M|\bigr\}$

  • Slide 40/201

    OPTIMAL CONVERGENCE RATE

    The value of $\alpha^k$ that minimizes the bound is $\alpha^* = 2/(M + m)$, in which case

    $\frac{\|x^{k+1}\|}{\|x^k\|} \le \frac{M - m}{M + m}$

    [Figure: the curves $|1 - \alpha m|$ and $|1 - \alpha M|$ versus the stepsize $\alpha$; their upper envelope $\max\{|1 - \alpha m|, |1 - \alpha M|\}$ attains its minimum $(M - m)/(M + m)$ at $\alpha = 2/(M + m)$; the stepsizes in $(0, 2/M)$ guarantee convergence.]

    Conv. rate for minimization stepsize (see text)

    $\frac{f(x^{k+1})}{f(x^k)} \le \left(\frac{M - m}{M + m}\right)^2$

    The ratio $M/m$ is called the condition number of $Q$, and problems with $M/m$ large are called ill-conditioned.

  • Slide 41/201

    SCALING AND STEEPEST DESCENT

    View the more general method

    $x^{k+1} = x^k - \alpha^k D^k\nabla f(x^k)$

    as a scaled version of steepest descent.

    Consider a change of variables $x = Sy$ with $S = (D^k)^{1/2}$. In the space of $y$, the problem is

    minimize $h(y) \equiv f(Sy)$
    subject to $y \in \Re^n$

    Apply steepest descent to this problem, multiply with $S$, and pass back to the space of $x$, using $\nabla h(y^k) = S\nabla f(x^k)$,

    $y^{k+1} = y^k - \alpha^k\nabla h(y^k)$
    $Sy^{k+1} = Sy^k - \alpha^k S\nabla h(y^k)$
    $x^{k+1} = x^k - \alpha^k D^k\nabla f(x^k)$

  • Slide 42/201

    DIAGONAL SCALING

    Apply the results for steepest descent to the scaled iteration $y^{k+1} = y^k - \alpha^k\nabla h(y^k)$:

    $\frac{\|y^{k+1}\|}{\|y^k\|} \le \max\bigl\{|1 - \alpha^k m^k|, |1 - \alpha^k M^k|\bigr\}$

    $\frac{f(x^{k+1})}{f(x^k)} = \frac{h(y^{k+1})}{h(y^k)} \le \left(\frac{M^k - m^k}{M^k + m^k}\right)^2$

    where $m^k$ and $M^k$ are the smallest and largest eigenvalues of the Hessian of $h$, which is

    $\nabla^2 h(y) = S\nabla^2 f(x)S = (D^k)^{1/2}Q(D^k)^{1/2}$

    It is desirable to choose $D^k$ as close as possible to $Q^{-1}$. Also if $D^k$ is so chosen, the stepsize $\alpha = 1$ is near the optimal $2/(M^k + m^k)$.

    Using as $D^k$ a diagonal approximation to $Q^{-1}$ is common and often very effective. Corrects for poor choice of units expressing the variables.

  • Slide 43/201

    NONQUADRATIC PROBLEMS

    Rate of convergence to a nonsingular local minimum of a nonquadratic function is very similar to the quadratic case (linear convergence is typical).

    If $D^k \to \bigl(\nabla^2 f(x^*)\bigr)^{-1}$, we asymptotically obtain optimal scaling and superlinear convergence

    More generally, if the direction $d^k = -D^k\nabla f(x^k)$ approaches asymptotically the Newton direction, i.e.,

    $\lim_{k \to \infty} \frac{\bigl\|d^k + \bigl(\nabla^2 f(x^*)\bigr)^{-1}\nabla f(x^k)\bigr\|}{\|\nabla f(x^k)\|} = 0$

    and the Armijo rule is used with initial stepsize equal to one, the rate of convergence is superlinear.

    Convergence rate to a singular local min is typically sublinear (in effect, condition number $= \infty$)

  • Slide 44/201

    6.252 NONLINEAR PROGRAMMING

    LECTURE 6

    NEWTON AND GAUSS-NEWTON METHODS

    LECTURE OUTLINE

    Newton's Method

    Convergence Rate of the Pure Form

    Global Convergence Variants of Newtons Method

    Least Squares Problems

    The Gauss-Newton Method

  • Slide 45/201

    NEWTON'S METHOD

    $x^{k+1} = x^k - \alpha^k\bigl(\nabla^2 f(x^k)\bigr)^{-1}\nabla f(x^k)$

    assuming that the Newton direction is defined and is a direction of descent

    Pure form of Newton's method (stepsize $\alpha^k = 1$)

    $x^{k+1} = x^k - \bigl(\nabla^2 f(x^k)\bigr)^{-1}\nabla f(x^k)$

    Very fast when it converges (how fast?)

    May not converge (or worse, it may not be defined) when started far from a nonsingular local min

    Issue: How to modify the method so that it converges globally, while maintaining the fast convergence rate

  • Slide 46/201

    CONVERGENCE RATE OF PURE FORM

    Consider solution of the nonlinear system $g(x) = 0$ where $g : \Re^n \mapsto \Re^n$, with the method

    $x^{k+1} = x^k - \bigl(\nabla g(x^k)'\bigr)^{-1}g(x^k)$

    If $g(x) = \nabla f(x)$, we get the pure form of Newton

    Quick derivation: Suppose $x^k \to x^*$ with $g(x^*) = 0$ and $\nabla g(x^*)$ is invertible. By Taylor

    $0 = g(x^*) = g(x^k) + \nabla g(x^k)'(x^* - x^k) + o\bigl(\|x^k - x^*\|\bigr)$.

    Multiply with $\bigl(\nabla g(x^k)'\bigr)^{-1}$:

    $x^k - x^* - \bigl(\nabla g(x^k)'\bigr)^{-1}g(x^k) = o\bigl(\|x^k - x^*\|\bigr)$,

    so

    $x^{k+1} - x^* = o\bigl(\|x^k - x^*\|\bigr)$,

    implying superlinear convergence and capture.

  • Slide 47/201

    CONVERGENCE BEHAVIOR OF PURE FORM

    [Figure: Newton iterates $x^0, x^1, x^2, x^3$ for $g(x) = e^x - 1$ starting at $x^0 = -1$, shown on the graph of $g$.]

    k    x^k        g(x^k)
    0    -1.00000   -0.63212
    1     0.71828    1.05091
    2     0.20587    0.22859
    3     0.01981    0.02000
    4     0.00019    0.00019
    5     0.00000    0.00000
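    The table above can be reproduced with a few lines of code applying the pure Newton iteration $x^{k+1} = x^k - g(x^k)/g'(x^k)$ to $g(x) = e^x - 1$, whose derivative is $g'(x) = e^x$:

    ```python
    import math

    # Pure Newton's method on g(x) = exp(x) - 1 from x0 = -1, as in the
    # slide's table; note the superlinear shrinkage of |x^k| per step.
    x = -1.0
    for k in range(6):
        g = math.exp(x) - 1.0
        print(k, round(x, 5), round(g, 5))
        x = x - g / math.exp(x)        # x^{k+1} = x^k - g(x^k)/g'(x^k)
    ```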

  • Slide 48/201

    MODIFICATIONS FOR GLOBAL CONVERGENCE

    Use a stepsize

    Modify the Newton direction when:

    Hessian is not positive definite

    When Hessian is nearly singular (needed to improve performance)

    Use

    $d^k = -\bigl(\nabla^2 f(x^k) + \Delta^k\bigr)^{-1}\nabla f(x^k)$,

    whenever the Newton direction does not exist or is not a descent direction. Here $\Delta^k$ is a diagonal matrix such that

    $\nabla^2 f(x^k) + \Delta^k > 0$

    Modified Cholesky factorization

    Trust region methods

  • Slide 49/201

    LEAST-SQUARES PROBLEMS

    minimize $f(x) = \frac{1}{2}\|g(x)\|^2 = \frac{1}{2}\sum_{i=1}^{m}\|g_i(x)\|^2$

    subject to $x \in \Re^n$,

    where $g = (g_1, \ldots, g_m)$, $g_i : \Re^n \mapsto \Re^{r_i}$.

    Many applications:

    Solution of systems of $n$ nonlinear equations with $n$ unknowns

    Model Construction - Curve Fitting

    Neural Networks

    Pattern Classification

  • Slide 50/201

    PURE FORM OF THE GAUSS-NEWTON METHOD

    Idea: Linearize around the current point $x^k$

    $\tilde g(x, x^k) = g(x^k) + \nabla g(x^k)'(x - x^k)$

    and minimize the norm of the linearized function $\tilde g$:

    $x^{k+1} = \arg\min_{x \in \Re^n} \frac{1}{2}\|\tilde g(x, x^k)\|^2 = x^k - \bigl(\nabla g(x^k)\nabla g(x^k)'\bigr)^{-1}\nabla g(x^k)g(x^k)$

    The direction

    $-\bigl(\nabla g(x^k)\nabla g(x^k)'\bigr)^{-1}\nabla g(x^k)g(x^k)$

    is a descent direction since

    $\nabla g(x^k)g(x^k) = \nabla\bigl((1/2)\|g(x)\|^2\bigr)\big|_{x = x^k}$ and $\nabla g(x^k)\nabla g(x^k)' > 0$
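    A tiny worked instance of the Gauss-Newton step above (residuals chosen here for illustration, not from the slides): with a single unknown and two scalar residuals, the step is $x - \bigl(\sum_i g_i'(x)^2\bigr)^{-1}\sum_i g_i'(x)g_i(x)$. Because the residuals below are linear, the linearization is exact and one step lands on the minimizer.

    ```python
    # One Gauss-Newton step for min (1/2)(g1(x)**2 + g2(x)**2) with
    # g1(x) = x - 1 and g2(x) = 2x - 4 (assumed example).
    def gn_step(x, residuals, derivs):
        num = sum(dg * g for g, dg in zip(residuals(x), derivs(x)))
        den = sum(dg * dg for dg in derivs(x))
        return x - num / den      # (grad g grad g')^{-1} grad g g, scalar case

    residuals = lambda x: [x - 1.0, 2.0 * x - 4.0]
    derivs = lambda x: [1.0, 2.0]
    x1 = gn_step(0.0, residuals, derivs)
    print(x1)   # 1.8, the exact minimizer, reached in a single step
    ```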

  • Slide 51/201

    MODIFICATIONS OF THE GAUSS-NEWTON

    Similar to those for Newton's method:

    $x^{k+1} = x^k - \alpha^k\bigl(\nabla g(x^k)\nabla g(x^k)' + \Delta^k\bigr)^{-1}\nabla g(x^k)g(x^k)$

    where $\alpha^k$ is a stepsize and $\Delta^k$ is a diagonal matrix such that

    $\nabla g(x^k)\nabla g(x^k)' + \Delta^k > 0$

    Incremental version of the Gauss-Newton method:

    Operate in cycles

    Start a cycle with $\psi_0$ (an estimate of $x$)

    Update using a single component of $g$

    $\psi_i = \arg\min_{x \in \Re^n} \sum_{j=1}^{i}\|\tilde g_j(x, \psi_{j-1})\|^2, \quad i = 1, \ldots, m$

    where $\tilde g_j$ are the linearized functions

    $\tilde g_j(x, \psi_{j-1}) = g_j(\psi_{j-1}) + \nabla g_j(\psi_{j-1})'(x - \psi_{j-1})$

  • Slide 52/201

    MODEL CONSTRUCTION

    Given a set of $m$ input-output data pairs $(y_i, z_i)$, $i = 1, \ldots, m$, from the physical system

    Hypothesize an input/output relation $z = h(x, y)$, where $x$ is a vector of unknown parameters, and $h$ is known

    Find $x$ that matches best the data in the sense that it minimizes the sum of squared errors

    $\frac{1}{2}\sum_{i=1}^{m}\|z_i - h(x, y_i)\|^2$

    Example of a linear model: Fit the data pairs by a cubic polynomial approximation. Take

    $h(x, y) = x_3y^3 + x_2y^2 + x_1y + x_0$,

    where $x = (x_0, x_1, x_2, x_3)$ is the vector of unknown coefficients of the cubic polynomial.

  • Slide 53/201

    NEURAL NETS

    Nonlinear model construction with multilayer perceptrons

    $x$ is the vector of weights

    Universal approximation property

  • Slide 54/201

    PATTERN CLASSIFICATION

    Objects are presented to us, and we wish to classify them in one of $s$ categories $1, \ldots, s$, based on a vector $y$ of their features.

    Classical maximum posterior probability approach: Assume we know

    $p(j \mid y) = P(\text{object with feature vector } y \text{ is of category } j)$

    Assign object with feature vector $y$ to category

    $j^*(y) = \arg\max_{j = 1, \ldots, s} p(j \mid y)$.

    If $p(j \mid y)$ are unknown, we can estimate them using functions $h_j(x_j, y)$ parameterized by vectors $x_j$. Obtain $x_j$ by minimizing

    $\frac{1}{2}\sum_{i=1}^{m}\bigl(z^i_j - h_j(x_j, y_i)\bigr)^2$,

    where

    $z^i_j = 1$ if $y_i$ is of category $j$, and $z^i_j = 0$ otherwise.

  • Slide 55/201

    6.252 NONLINEAR PROGRAMMING

    LECTURE 7: ADDITIONAL METHODS

    LECTURE OUTLINE

    Least-Squares Problems and Incremental Gradient Methods

    Conjugate Direction Methods

    The Conjugate Gradient Method

    Quasi-Newton Methods

    Coordinate Descent Methods

    Recall the least-squares problem:

    minimize $f(x) = \frac{1}{2}\|g(x)\|^2 = \frac{1}{2}\sum_{i=1}^{m}\|g_i(x)\|^2$

    subject to $x \in \Re^n$,

    where $g = (g_1, \ldots, g_m)$, $g_i : \Re^n \mapsto \Re^{r_i}$.

  • Slide 56/201

    INCREMENTAL GRADIENT METHODS

    Steepest descent method

    $x^{k+1} = x^k - \alpha^k\nabla f(x^k) = x^k - \alpha^k\sum_{i=1}^{m}\nabla g_i(x^k)g_i(x^k)$

    Incremental gradient method:

    $\psi_i = \psi_{i-1} - \alpha^k\nabla g_i(\psi_{i-1})g_i(\psi_{i-1}), \quad i = 1, \ldots, m$

    $\psi_0 = x^k, \quad x^{k+1} = \psi_m$

    [Figure: the scalar components $(a_ix - b_i)^2$, the interval between $\min_i b_i/a_i$ and $\max_i b_i/a_i$ containing the minimum $x^*$, and a far-off starting point $R$.]

    Advantage of incrementalism
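    A runnable sketch of the incremental gradient method for the scalar least-squares case pictured above (the data values and stepsize schedule are assumptions made for the illustration): one component $g_i(x) = a_ix - b_i$ is used per update, with a diminishing stepsize so that the infinite travel condition holds.

    ```python
    # Incremental gradient on min (1/2) sum_i (a_i*x - b_i)**2.
    # With a_i = 1 the minimizer is the mean of the b_i, here 2.0.
    data = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]   # (a_i, b_i) pairs
    x = 0.0
    for k in range(1, 2001):                       # cycles through the data
        step = 1.0 / k                             # diminishing, sum = inf
        for a, b in data:
            x -= step * a * (a * x - b)            # single-component update
    print(x)   # approaches the minimizer 2.0
    ```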

  • Slide 57/201

    VIEW AS GRADIENT METHOD W/ ERRORS

    Can write the incremental gradient method as

    $x^{k+1} = x^k - \alpha^k\sum_{i=1}^{m}\nabla g_i(x^k)g_i(x^k) + \alpha^k\sum_{i=1}^{m}\bigl(\nabla g_i(x^k)g_i(x^k) - \nabla g_i(\psi_{i-1})g_i(\psi_{i-1})\bigr)$

    Error term is proportional to stepsize $\alpha^k$

    Convergence (generically) for a diminishing stepsize (under a Lipschitz condition on $\nabla g_ig_i$)

    Convergence to a neighborhood for a constant stepsize

  • Slide 58/201

    CONJUGATE DIRECTION METHODS

    Aim to improve convergence rate of steepest descent, without the overhead of Newton's method.

    Analyzed for a quadratic model. They require $n$ iterations to minimize $f(x) = \frac{1}{2}x'Qx - b'x$ with $Q$ an $n \times n$ positive definite matrix $Q > 0$.

    Analysis also applies to nonquadratic problems in the neighborhood of a nonsingular local min.

    The directions $d^1, \ldots, d^k$ are Q-conjugate if $d^{i\prime}Qd^j = 0$ for all $i \ne j$.

    Generic conjugate direction method:

    $x^{k+1} = x^k + \alpha^k d^k$

    where $\alpha^k$ is obtained by line minimization.

    [Figure: conjugate directions $d^0 = Q^{-1/2}w^0$ and $d^1 = Q^{-1/2}w^1$ with iterates $x^0, x^1, x^2$, and the transformed coordinates $y^0, y^1, y^2$, $w^0, w^1$; Expanding Subspace Theorem.]

  • Slide 59/201

    GENERATING Q-CONJUGATE DIRECTIONS

    Given a set of linearly independent vectors $\xi^0, \ldots, \xi^k$, we can construct a set of Q-conjugate directions $d^0, \ldots, d^k$ s.t. $\text{Span}(d^0, \ldots, d^i) = \text{Span}(\xi^0, \ldots, \xi^i)$

    Gram-Schmidt procedure. Start with $d^0 = \xi^0$. If for some $i < k$, $d^0, \ldots, d^i$ are Q-conjugate and the above property holds, take

    $d^{i+1} = \xi^{i+1} + \sum_{m=0}^{i} c^{(i+1)m}d^m$;

    choose $c^{(i+1)m}$ so $d^{i+1}$ is Q-conjugate to $d^0, \ldots, d^i$:

    $d^{(i+1)\prime}Qd^j = \xi^{(i+1)\prime}Qd^j + \left(\sum_{m=0}^{i} c^{(i+1)m}d^m\right)'Qd^j = 0$.

    [Figure: the Gram-Schmidt construction $d^0 = \xi^0$, $d^1 = \xi^1 + c^{10}d^0$, $d^2 = \xi^2 + c^{20}d^0 + c^{21}d^1$.]

  • Slide 60/201

    CONJUGATE GRADIENT METHOD

    Apply Gram-Schmidt to the vectors $\xi^k = -g^k = -\nabla f(x^k)$, $k = 0, 1, \ldots, n-1$. Then

    $d^k = -g^k + \sum_{j=0}^{k-1}\frac{g^{k\prime}Qd^j}{d^{j\prime}Qd^j}\,d^j$

    Key fact: Direction formula can be simplified.

    Proposition: The directions of the CGM are generated by $d^0 = -g^0$, and

    $d^k = -g^k + \beta^k d^{k-1}, \quad k = 1, \ldots, n-1$,

    where $\beta^k$ is given by

    $\beta^k = \frac{g^{k\prime}g^k}{g^{(k-1)\prime}g^{k-1}}$ or $\beta^k = \frac{(g^k - g^{k-1})'g^k}{g^{(k-1)\prime}g^{k-1}}$

    Furthermore, the method terminates with an optimal solution after at most $n$ steps.

    Extension to nonquadratic problems.
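    The proposition can be sketched in code. The 2x2 values below are assumed for the example; per the termination result, the method solves $Qx = b$ in at most $n = 2$ steps.

    ```python
    # Conjugate gradient for f(x) = (1/2) x'Qx - b'x on a 2x2 example,
    # using exact line minimization and the beta = g'g / g_old'g_old rule.
    def cg(Q, b, x, n=2):
        g = [sum(Q[i][j] * x[j] for j in range(2)) - b[i] for i in range(2)]
        d = [-gi for gi in g]                       # d^0 = -g^0
        for _ in range(n):
            Qd = [sum(Q[i][j] * d[j] for j in range(2)) for i in range(2)]
            a = -(g[0] * d[0] + g[1] * d[1]) / (d[0] * Qd[0] + d[1] * Qd[1])
            x = [x[i] + a * d[i] for i in range(2)]     # line minimization
            g_new = [g[i] + a * Qd[i] for i in range(2)]
            beta = (g_new[0] ** 2 + g_new[1] ** 2) / (g[0] ** 2 + g[1] ** 2)
            d = [-g_new[i] + beta * d[i] for i in range(2)]
            g = g_new
        return x

    x = cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])
    print(x)   # solves Qx = b, i.e. approximately [1/11, 7/11]
    ```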

  • Slide 61/201

    PROOF OF CONJUGATE GRADIENT RESULT

    Use induction to show that all gradients $g^k$ generated up to termination are linearly independent. True for $k = 1$. Suppose no termination after $k$ steps, and $g^0, \ldots, g^{k-1}$ are linearly independent. Then, $\text{Span}(d^0, \ldots, d^{k-1}) = \text{Span}(g^0, \ldots, g^{k-1})$ and there are two possibilities:

    $g^k = 0$, and the method terminates.

    $g^k \ne 0$, in which case from the expanding manifold property

    $g^k$ is orthogonal to $d^0, \ldots, d^{k-1}$

    $g^k$ is orthogonal to $g^0, \ldots, g^{k-1}$

    so $g^k$ is linearly independent of $g^0, \ldots, g^{k-1}$, completing the induction.

    Since at most $n$ linearly independent gradients can be generated, $g^k = 0$ for some $k \le n$.

    Algebra to verify the direction formula.

  • Slide 62/201

    QUASI-NEWTON METHODS

    $x^{k+1} = x^k - \alpha^k D^k\nabla f(x^k)$, where $D^k$ is an inverse Hessian approximation.

    Key idea: Successive iterates $x^k, x^{k+1}$ and gradients $\nabla f(x^k), \nabla f(x^{k+1})$, yield curvature info

    $q^k \approx \nabla^2 f(x^{k+1})p^k$,

    $p^k = x^{k+1} - x^k, \quad q^k = \nabla f(x^{k+1}) - \nabla f(x^k)$,

    $\nabla^2 f(x^n) \approx \bigl[q^0 \cdots q^{n-1}\bigr]\bigl[p^0 \cdots p^{n-1}\bigr]^{-1}$

    Most popular Quasi-Newton method is a clever way to implement this idea

    $D^{k+1} = D^k + \frac{p^kp^{k\prime}}{p^{k\prime}q^k} - \frac{D^kq^kq^{k\prime}D^k}{q^{k\prime}D^kq^k} + \xi^k\tau^kv^kv^{k\prime}$,

    $v^k = \frac{p^k}{p^{k\prime}q^k} - \frac{D^kq^k}{\tau^k}, \quad \tau^k = q^{k\prime}D^kq^k, \quad 0 \le \xi^k \le 1$

    and $D^0 > 0$ is arbitrary, $\alpha^k$ by line minimization, and $D^n = Q^{-1}$ for a quadratic.

  • Slide 63/201

    NONDERIVATIVE METHODS

    Finite difference implementations

    Forward and central difference formulas

    $\frac{\partial f(x^k)}{\partial x_i} \approx \frac{1}{h}\bigl(f(x^k + he_i) - f(x^k)\bigr)$

    $\frac{\partial f(x^k)}{\partial x_i} \approx \frac{1}{2h}\bigl(f(x^k + he_i) - f(x^k - he_i)\bigr)$

    Use central difference for more accuracy near convergence

    [Figure: coordinate descent iterates $x^k, x^{k+1}, x^{k+2}$ moving along coordinate directions.] Coordinate descent. Applies also to the case where there are bound constraints on the variables.

    Direct search methods. Nelder-Mead method.
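    The accuracy advantage of central differences can be seen directly (the test function $f = \sin$ and the evaluation point are chosen here for illustration): forward differences have error of order $h$, central differences of order $h^2$.

    ```python
    import math

    # Forward vs central difference approximations of f'(x) for f = sin,
    # whose exact derivative is cos.
    f, x, h = math.sin, 1.0, 1e-4
    exact = math.cos(x)
    fwd = (f(x + h) - f(x)) / h              # O(h) error
    ctr = (f(x + h) - f(x - h)) / (2 * h)    # O(h**2) error
    print(abs(fwd - exact), abs(ctr - exact))
    print(abs(ctr - exact) < abs(fwd - exact))   # central is more accurate
    ```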

  • Slide 64/201

    6.252 NONLINEAR PROGRAMMING

    LECTURE 8

    OPTIMIZATION OVER A CONVEX SET;

    OPTIMALITY CONDITIONS

    Problem: $\min_{x \in X} f(x)$, where:

    (a) $X \subset \Re^n$ is nonempty, convex, and closed.

    (b) $f$ is continuously differentiable over $X$.

    Local and global minima. If $f$ is convex, local minima are also global.

    [Figure: a function $f(x)$ over a constraint set $X$, with the local minima and the global minimum marked.]

  • Slide 65/201

    OPTIMALITY CONDITION

    Proposition (Optimality Condition)

    (a) If $x^*$ is a local minimum of $f$ over $X$, then

    $\nabla f(x^*)'(x - x^*) \ge 0, \quad \forall\, x \in X$.

    (b) If $f$ is convex over $X$, then this condition is also sufficient for $x^*$ to minimize $f$ over $X$.

    [Figure: surfaces of equal cost $f(x)$ and a convex constraint set $X$.] At a local minimum $x^*$, the gradient $\nabla f(x^*)$ makes an angle less than or equal to 90 degrees with all feasible variations $x - x^*$, $x \in X$.

    [Figure: a nonconvex constraint set $X$.] Illustration of failure of the optimality condition when $X$ is not convex. Here $x^*$ is a local min but we have $\nabla f(x^*)'(x - x^*) < 0$ for the feasible vector $x$ shown.

  • Slide 66/201

    PROOF

    Proof: (a) By contradiction. Suppose that $\nabla f(x^*)'(x - x^*) < 0$ for some $x \in X$. By the Mean Value Theorem, for every $\epsilon > 0$ there exists an $s \in [0, 1]$ such that

    $f\bigl(x^* + \epsilon(x - x^*)\bigr) = f(x^*) + \epsilon\nabla f\bigl(x^* + s\epsilon(x - x^*)\bigr)'(x - x^*)$.

    Since $\nabla f$ is continuous, for sufficiently small $\epsilon > 0$,

    $\nabla f\bigl(x^* + s\epsilon(x - x^*)\bigr)'(x - x^*) < 0$

    so that $f\bigl(x^* + \epsilon(x - x^*)\bigr) < f(x^*)$. The vector $x^* + \epsilon(x - x^*)$ is feasible for all $\epsilon \in [0, 1]$ because $X$ is convex, so the optimality of $x^*$ is contradicted.

    (b) Using the convexity of $f$,

    $f(x) \ge f(x^*) + \nabla f(x^*)'(x - x^*)$

    for every $x \in X$. If the condition $\nabla f(x^*)'(x - x^*) \ge 0$ holds for all $x \in X$, we obtain $f(x) \ge f(x^*)$, so $x^*$ minimizes $f$ over $X$. Q.E.D.

  • Slide 67/201

    OPTIMIZATION SUBJECT TO BOUNDS

    Let $X = \{x \mid x \ge 0\}$. Then the necessary condition for $x^* = (x^*_1, \ldots, x^*_n)$ to be a local min is

    $\sum_{i=1}^{n}\frac{\partial f(x^*)}{\partial x_i}(x_i - x^*_i) \ge 0, \quad \forall\, x_i \ge 0,\ i = 1, \ldots, n$.

    Fix $i$. Let $x_j = x^*_j$ for $j \ne i$ and $x_i = x^*_i + 1$:

    $\frac{\partial f(x^*)}{\partial x_i} \ge 0, \quad \forall\, i$.

    If $x^*_i > 0$, let also $x_j = x^*_j$ for $j \ne i$ and $x_i = \frac{1}{2}x^*_i$. Then $\partial f(x^*)/\partial x_i \le 0$, so

    $\frac{\partial f(x^*)}{\partial x_i} = 0, \quad \text{if } x^*_i > 0$.

    [Figure: the two cases $x^* > 0$, where the gradient component vanishes, and $x^* = 0$, where it is nonnegative.]
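    The two conditions just derived can be checked on a concrete instance (the function below is an assumed example, not from the slides): $f(x_1, x_2) = (x_1 - 1)^2 + (x_2 + 1)^2$ over $x \ge 0$ has its minimum at $x^* = (1, 0)$, where the first component is interior and the second sits on the bound.

    ```python
    # Verify: df/dx_i >= 0 for all i, with equality whenever x*_i > 0,
    # for f(x1, x2) = (x1 - 1)**2 + (x2 + 1)**2 minimized over x >= 0.
    x = (1.0, 0.0)                              # the constrained minimizer
    g = (2.0 * (x[0] - 1.0), 2.0 * (x[1] + 1.0))   # gradient at x*
    print(all(gi >= 0 for gi in g))                     # nonnegativity
    print(all(gi == 0 for xi, gi in zip(x, g) if xi > 0))  # zero on interior
    ```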


    OPTIMIZATION OVER A SIMPLEX

X = { x | x ≥ 0, ∑_{i=1}^n x_i = r },

where r > 0 is a given scalar.

Necessary condition for x* = (x*₁, ..., x*ₙ)′ to be a local min:

∑_{i=1}^n (∂f(x*)/∂x_i)(x_i − x*_i) ≥ 0, ∀ x_i ≥ 0 with ∑_{i=1}^n x_i = r.

Fix i with x*_i > 0 and let j be any other index. Use x with x_i = 0, x_j = x*_j + x*_i, and x_m = x*_m for all m ≠ i, j:

(∂f(x*)/∂x_j − ∂f(x*)/∂x_i) x*_i ≥ 0,

so

x*_i > 0 ⟹ ∂f(x*)/∂x_i ≤ ∂f(x*)/∂x_j, ∀ j,

i.e., at the optimum, positive components have minimal (and equal) first cost derivative.
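The condition can be checked concretely on f(x) = ½‖x − z‖² over the simplex, whose minimizer is the Euclidean projection of z. The sorting-based routine below is a standard construction (not from the slides); at the solution, the positive components all share the same partial derivative x_i − z_i:

```python
import numpy as np

def project_simplex(z, r=1.0):
    """Euclidean projection of z onto {x | x >= 0, sum_i x_i = r}."""
    u = np.sort(z)[::-1]                      # coordinates in decreasing order
    css = np.cumsum(u)
    # Largest index k with u_k above the running threshold determines theta.
    k = np.nonzero(u * np.arange(1, len(z) + 1) > css - r)[0][-1]
    theta = (css[k] - r) / (k + 1.0)
    return np.maximum(z - theta, 0.0)

z = np.array([0.7, 0.2, -0.1])
x = project_simplex(z)        # minimizes (1/2)||x - z||^2 over the unit simplex
grad = x - z                  # partial derivatives of the cost at x
```

Here x = (0.75, 0.25, 0): the two positive components have the common derivative 0.05, while the zero component has the larger derivative 0.10, exactly as the necessary condition predicts.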


    OPTIMAL ROUTING

Given a data net, and a set W of OD pairs w = (i, j). Each OD pair w has input traffic r_w.

[Figure: network with origin and destination of OD pair w, input r_w, and path flows x₁, x₂, x₃, x₄.]

Optimal routing problem:

minimize D(x) = ∑_{(i,j)} D_ij( ∑_{all paths p containing (i,j)} x_p )

subject to ∑_{p∈P_w} x_p = r_w, ∀ w ∈ W,

x_p ≥ 0, ∀ p ∈ P_w, w ∈ W.

Optimality condition:

x*_p > 0 ⟹ ∂D(x*)/∂x_p ≤ ∂D(x*)/∂x_{p′}, ∀ p′ ∈ P_w,

i.e., paths carrying > 0 flow are shortest with respect to first cost derivative.


    TRAFFIC ASSIGNMENT

Transportation network with OD pairs w. Each w has paths p ∈ P_w and traffic r_w. Let x_p be the flow of path p and let T_ij(F_ij) be the travel time of link (i, j), where F_ij = ∑_{p: crossing (i,j)} x_p is the total flow of link (i, j).

User-optimization principle: Traffic equilibrium is established when each user of the network chooses, among all available paths, a path of minimum travel time, i.e., for all w ∈ W and paths p ∈ P_w,

x*_p > 0 ⟹ t_p(x*) ≤ t_{p′}(x*), ∀ p′ ∈ P_w, w ∈ W,

where t_p(x) is the travel time of path p:

t_p(x) = ∑_{all arcs (i,j) on path p} T_ij(F_ij), ∀ p ∈ P_w, w ∈ W.

Identical with the optimality condition of the routing problem if we identify the arc travel time T_ij(F_ij) with the cost derivative D′_ij(F_ij).


    PROJECTION OVER A CONVEX SET

Let z ∈ ℜⁿ and a closed convex set X be given. Problem:

minimize f(x) = ‖z − x‖²
subject to x ∈ X.

Proposition (Projection Theorem): The problem has a unique solution [z]⁺ (the projection of z).

[Figure: necessary and sufficient condition for x* to be the projection: the angle between z − x* and x − x* should be greater or equal to 90 degrees for all x ∈ X, or

(z − x*)′(x − x*) ≤ 0, ∀ x ∈ X.]

If X is a subspace: z − x* ⊥ X.

The mapping f: ℜⁿ → X defined by f(x) = [x]⁺ is continuous and nonexpansive, that is,

‖[x]⁺ − [y]⁺‖ ≤ ‖x − y‖, ∀ x, y ∈ ℜⁿ.
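For a box X = {x | l ≤ x ≤ u} the projection is a componentwise clip, and both properties of the theorem are easy to check numerically (a small illustration of my own, not from the slides):

```python
import numpy as np

lo, hi = np.zeros(3), np.ones(3)
proj = lambda z: np.clip(z, lo, hi)       # [z]^+ for the box X = [0,1]^3

z = np.array([1.7, -0.4, 0.3])
xstar = proj(z)
# Necessary/sufficient condition: (z - x*)'(x - x*) <= 0 for all x in X,
# spot-checked on random feasible points.
rng = np.random.default_rng(0)
samples = rng.uniform(lo, hi, size=(1000, 3))
cond = ((samples - xstar) @ (z - xstar)).max()
# Nonexpansiveness: ||[x]^+ - [y]^+|| <= ||x - y||.
y = np.array([0.2, 2.5, -1.0])
lhs, rhs = np.linalg.norm(proj(z) - proj(y)), np.linalg.norm(z - y)
```

Here xstar = (1, 0, 0.3), the sampled inner products are all ≤ 0, and the projected distance is no larger than the original distance.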


    6.252 NONLINEAR PROGRAMMING

    LECTURE 9: FEASIBLE DIRECTION METHODS

    LECTURE OUTLINE

    Conditional Gradient Method

    Gradient Projection Methods

A feasible direction at an x ∈ X is a vector d ≠ 0 such that x + αd is feasible for all suff. small α > 0.

[Figure: constraint set X in the (x₁, x₂) plane and the cone of feasible directions at a point x.]

Note: the set of feasible directions at x is the set of all α(z − x) where z ∈ X, z ≠ x, and α > 0.


    FEASIBLE DIRECTION METHODS

A feasible direction method:

x^{k+1} = x^k + α^k d^k,

where d^k is a feasible descent direction [∇f(x^k)′d^k < 0] and α^k > 0 is such that x^{k+1} ∈ X.

Alternative definition:

x^{k+1} = x^k + α^k(x̄^k − x^k),

where α^k ∈ (0, 1] and, if x^k is nonstationary,

x̄^k ∈ X,  ∇f(x^k)′(x̄^k − x^k) < 0.

Stepsize rules: limited minimization, constant α^k = 1, Armijo: α^k = β^{m_k}, where m_k is the first nonnegative m for which

f(x^k) − f(x^k + β^m(x̄^k − x^k)) ≥ −σβ^m ∇f(x^k)′(x̄^k − x^k).


    CONVERGENCE ANALYSIS

Similar to the one for (unconstrained) gradient methods.

The direction sequence {d^k} is gradient related to {x^k} if the following property can be shown: For any subsequence {x^k}_{k∈K} that converges to a nonstationary point, the corresponding subsequence {d^k}_{k∈K} is bounded and satisfies

lim sup_{k→∞, k∈K} ∇f(x^k)′d^k < 0.

Proposition (Stationarity of Limit Points): Let {x^k} be a sequence generated by the feasible direction method x^{k+1} = x^k + α^k d^k. Assume that:

{d^k} is gradient related

α^k is chosen by the limited minimization rule or the Armijo rule.

Then every limit point of {x^k} is a stationary point.

Proof: Nearly identical to the unconstrained case.


    CONDITIONAL GRADIENT METHOD

x^{k+1} = x^k + α^k(x̄^k − x^k), where

x̄^k = arg min_{x∈X} ∇f(x^k)′(x − x^k).

We assume that X is compact, so x̄^k is guaranteed to exist by Weierstrass.

[Figure: illustration of the direction of the conditional gradient method: surfaces of equal cost, the constraint set X, the current point x, and the extreme point x̄.]

[Figure: operation of the method: iterates x⁰, x¹, x², ... zigzagging toward x*. Slow (sublinear) convergence.]
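Over the unit simplex the direction subproblem has a closed form: the linear cost ∇f(x^k)′x is minimized at the vertex with the smallest gradient component. A minimal sketch with the classical α^k = 2/(k+2) stepsize (the quadratic cost and data below are chosen for illustration):

```python
import numpy as np

def conditional_gradient(grad, x0, num_iters):
    """Frank-Wolfe over the unit simplex {x >= 0, sum_i x_i = 1}."""
    x = x0.astype(float).copy()
    for k in range(num_iters):
        g = grad(x)
        xbar = np.zeros_like(x)
        xbar[np.argmin(g)] = 1.0          # vertex solving min_x g'x over the simplex
        x += 2.0 / (k + 2) * (xbar - x)   # diminishing stepsize in (0, 1]
    return x

z = np.array([0.0, 0.5, 0.0])
# Minimize (1/2)||x - z||^2 over the simplex; minimizer is (1/6, 2/3, 1/6).
x = conditional_gradient(lambda x: x - z, np.ones(3) / 3, 500)
```

Consistent with the slide, convergence is slow: after 500 iterations the iterate is only within about 0.1 of the solution.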


    CONVERGENCE OF CONDITIONAL GRADIENT

Show that the direction sequence of the conditional gradient method is gradient related, so the generic convergence result applies.

Suppose that {x^k}_{k∈K} converges to a nonstationary point x̃. We must prove that {x̄^k − x^k}_{k∈K} is bounded and

lim sup_{k→∞, k∈K} ∇f(x^k)′(x̄^k − x^k) < 0.

Boundedness follows from the compactness of X. Also, ∇f(x^k)′(x̄^k − x^k) = min_{x∈X} ∇f(x^k)′(x − x^k), and since this minimum value is negative when x^k is nonstationary, ∇f(x^k)′(x̄^k − x^k) < 0.

Newton's method corresponds to the scaled version with H^k = ∇²f(x^k) (involves the transformation y^k = (H^k)^{1/2}x).

Variants: projecting on an expanded constraint set, projecting on a restricted constraint set, combinations with unconstrained methods, etc.


    6.252 NONLINEAR PROGRAMMING

    LECTURE 10

    ALTERNATIVES TO GRADIENT PROJECTION

    LECTURE OUTLINE

    Three Alternatives/Remedies for Gradient Pro-jection

    Two-Metric Projection Methods

    Manifold Suboptimization Methods

    Affine Scaling Methods

Scaled GP method with scaling matrix H^k > 0:

x^{k+1} = x^k + α^k(x̄^k − x^k),

x̄^k = arg min_{x∈X} { ∇f(x^k)′(x − x^k) + (1/(2s^k))(x − x^k)′H^k(x − x^k) }.

The QP direction subproblem is complicated by:

Difficult inequality (e.g., nonorthant) constraints

Nondiagonal H^k, needed for Newton scaling


    THREE WAYS TO DEAL W/ THE DIFFICULTY

Two-metric projection methods:

x^{k+1} = [x^k − α^k D^k ∇f(x^k)]⁺

Use Newton-like scaling but use a standard projection

Suitable for bounds, simplexes, Cartesian products of simple sets, etc.

Manifold suboptimization methods:

Use (scaled) gradient projection on the manifold of active inequality constraints

Each QP subproblem is equality-constrained

Need strategies to cope with changing active manifold (add-drop constraints)

Affine scaling methods:

Go through the interior of the feasible set

Each QP subproblem is equality-constrained, AND we don't have to deal with a changing active manifold


    TWO-METRIC PROJECTION METHODS

In their simplest form, apply to the constraint x ≥ 0, but generalize to bound and other constraints.

Like unconstr. gradient methods except for [·]⁺:

x^{k+1} = [x^k − α^k D^k ∇f(x^k)]⁺, D^k > 0.

Major difficulty: Descent is not guaranteed for arbitrary D^k.

[Figure: two cases of the point x^k − α^k D^k ∇f(x^k) before projection; for a nondiagonal D^k the projected point may fail to decrease the cost.]

Remedy: Use D^k that is diagonal with respect to indices that are active and want to stay active:

I⁺(x^k) = { i | x^k_i = 0, ∂f(x^k)/∂x_i > 0 }.


    PROPERTIES OF 2-METRIC PROJECTION

Suppose D^k is diagonal with respect to I⁺(x^k), i.e., d^k_{ij} = 0 for i, j ∈ I⁺(x^k) with i ≠ j, and let

x^k(α) = [x^k − αD^k∇f(x^k)]⁺.

If x^k is stationary, x^k = x^k(α) for all α > 0.

Otherwise f(x^k(α)) < f(x^k) for all sufficiently small α > 0 (can use Armijo rule).

Because I⁺(x) is discontinuous w/ respect to x, to guarantee convergence we need to include in I⁺(x) constraints that are ε-active [those w/ x^k_i ∈ [0, ε] and ∂f(x^k)/∂x_i > 0].

The constraints in I⁺(x*) eventually become active and don't matter.

Method reduces to an unconstrained Newton-like method on the manifold of active constraints at x*.

Thus, superlinear convergence is possible w/ simple projections.


    MANIFOLD SUBOPTIMIZATION METHODS

Feasible direction methods for

min f(x) subject to a_j′x ≤ b_j, j = 1, ..., r.

Gradient is projected on a linear manifold of active constraints rather than on the entire constraint set (linearly constrained QP).

[Figure: two sample trajectories x⁰, x¹, x², x³, ... moving along faces of the polyhedral constraint set.]

Searches through a sequence of manifolds, each differing by at most one constraint from the next.

Potentially many iterations to identify the active manifold; then the method reduces to (scaled) steepest descent on the active manifold.

Well-suited for a small number of constraints, and for quadratic programming.


    OPERATION OF MANIFOLD METHODS

Let A(x) = { j | a_j′x = b_j } be the active index set at x. Given x^k, we find

d^k = arg min_{a_j′d=0, j∈A(x^k)} { ∇f(x^k)′d + ½d′H^k d }.

If d^k ≠ 0, then d^k is a feasible descent direction. Perform feasible descent on the current manifold.

If d^k = 0, either (1) x^k is stationary or (2) we enlarge the current manifold (drop an active constraint). For this, use the scalars μ_j such that

∇f(x^k) + ∑_{j∈A(x^k)} μ_j a_j = 0.

[Figure: ∇f(x^k) at a corner x^k where a₁′x = b₁ and a₂′x = b₂ are active, decomposed as −μ₁a₁ − μ₂a₂ with μ₁ < 0 and μ₂ > 0.]

If μ_j ≥ 0 for all j, x^k is stationary, since for all feasible x, ∇f(x^k)′(x − x^k) is equal to

−∑_{j∈A(x^k)} μ_j a_j′(x − x^k) ≥ 0.

Else, drop a constraint j with μ_j < 0.


    AFFINE SCALING

Particularly interesting choice (affine scaling):

H^k = (X^k)⁻²,

where X^k is the diagonal matrix having the (positive) coordinates x^k_i along the diagonal:

x^{k+1} = x^k − α^k(X^k)²(c − A′λ^k),  λ^k = (A(X^k)²A′)⁻¹A(X^k)²c.

Corresponds to the unscaled gradient projection iteration in the variables y = (X^k)⁻¹x. The vector x^k is mapped onto the unit vector y^k = (1, ..., 1).

[Figure: iterates x^k, x^{k+1} approaching x* in x-space, and their images y^k = (X^k)⁻¹x^k = (1, 1, 1), y^{k+1}, y* = (X^k)⁻¹x* in the scaled space.]

Extensions, convergence, practical issues.
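A bare-bones version of the iteration for min c′x s.t. Ax = b, x > 0, on a toy problem of my own (illustrative only; practical codes add presolve, stepsize safeguards, and stopping tests):

```python
import numpy as np

def affine_scaling(c, A, x, num_iters=60, beta=0.99):
    """Affine scaling iteration; x must satisfy Ax = b, x > 0 initially."""
    for _ in range(num_iters):
        X2 = np.diag(x ** 2)
        lam = np.linalg.solve(A @ X2 @ A.T, A @ X2 @ c)
        d = X2 @ (c - A.T @ lam)          # scaled projected-gradient direction
        if not np.any(d > 0):
            break                          # no component pushed toward its bound
        alpha = beta * np.min(x[d > 0] / d[d > 0])   # keep x strictly positive
        x = x - alpha * d
    return x

# min x1  s.t.  x1 + x2 = 1, x >= 0; optimum is (0, 1).
A, c = np.array([[1.0, 1.0]]), np.array([1.0, 0.0])
x = affine_scaling(c, A, np.array([0.5, 0.5]))
```

Note that the update keeps Ax = b invariant, since Ad = AX²c − AX²A′λ = 0 by the choice of λ.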


    6.252 NONLINEAR PROGRAMMING

    LECTURE 11

    CONSTRAINED OPTIMIZATION;

    LAGRANGE MULTIPLIERS

    LECTURE OUTLINE

Equality Constrained Problems

Basic Lagrange Multiplier Theorem

Proof 1: Elimination Approach

Proof 2: Penalty Approach

Equality constrained problem

minimize f(x)
subject to h_i(x) = 0, i = 1, ..., m,

where f: ℜⁿ → ℜ, h_i: ℜⁿ → ℜ, i = 1, ..., m, are continuously differentiable functions. (Theory also applies to the case where f and h_i are cont. differentiable in a neighborhood of a local minimum.)


    LAGRANGE MULTIPLIER THEOREM

Let x* be a local min and a regular point [∇h_i(x*): linearly independent]. Then there exist unique scalars λ*₁, ..., λ*ₘ such that

∇f(x*) + ∑_{i=1}^m λ*_i ∇h_i(x*) = 0.

If in addition f and h are twice cont. differentiable,

y′( ∇²f(x*) + ∑_{i=1}^m λ*_i ∇²h_i(x*) )y ≥ 0, ∀ y s.t. ∇h(x*)′y = 0.

Example:

minimize x₁ + x₂
subject to x₁² + x₂² = 2.

Here x* = (−1, −1), ∇f(x*) = (1, 1), ∇h(x*) = (−2, −2), and the Lagrange multiplier is λ* = 1/2.

[Figure: the circle h(x) = 0 of radius √2, with ∇f(x*) and ∇h(x*) collinear at x* = (−1, −1).]

Example (x* not regular):

minimize x₁ + x₂
s.t. (x₁ − 1)² + x₂² − 1 = 0,
(x₁ − 2)² + x₂² − 4 = 0.

[Figure: the two circles h₁(x) = 0 and h₂(x) = 0 touching at x* = (0, 0), where ∇h₁(x*) = (−2, 0) and ∇h₂(x*) = (−4, 0) are collinear, so x* is not regular and ∇f(x*) = (1, 1) is not a combination of them.]


    PROOF VIA ELIMINATION APPROACH

Consider the linear constraints case

minimize f(x)
subject to Ax = b,

where A is an m × n matrix with linearly independent rows and b ∈ ℜᵐ is a given vector.

Partition A = ( B R ), where B is m × m invertible, and x = ( x_B′ x_R′ )′. Equivalent problem:

minimize F(x_R) ≡ f( B⁻¹(b − Rx_R), x_R )
subject to x_R ∈ ℜ^{n−m}.

Unconstrained optimality condition:

0 = ∇F(x*_R) = −R′(B′)⁻¹∇_B f(x*) + ∇_R f(x*). (1)

By defining

λ* = −(B′)⁻¹∇_B f(x*),

we have ∇_B f(x*) + B′λ* = 0, while Eq. (1) is written ∇_R f(x*) + R′λ* = 0. Combining:

∇f(x*) + A′λ* = 0.


    ELIMINATION APPROACH - CONTINUED

Second order condition: For all d ∈ ℜ^{n−m},

0 ≤ d′∇²F(x*_R)d. (2)

After calculation we obtain

∇²F(x*_R) = R′(B′)⁻¹∇²_{BB} f(x*)B⁻¹R − R′(B′)⁻¹∇²_{BR} f(x*) − ∇²_{RB} f(x*)B⁻¹R + ∇²_{RR} f(x*).

Eq. (2) and the linearity of the constraints [implying that ∇²h_i(x*) = 0] yield, for all d ∈ ℜ^{n−m},

0 ≤ d′∇²F(x*_R)d = y′∇²f(x*)y = y′( ∇²f(x*) + ∑_{i=1}^m λ*_i ∇²h_i(x*) )y,

where y = ( y_B′ y_R′ )′ with y_B = −B⁻¹Rd, y_R = d. The vector y has this form iff

0 = By_B + Ry_R = ∇h(x*)′y.


    LAGRANGIAN FUNCTION

Define the Lagrangian function

L(x, λ) = f(x) + ∑_{i=1}^m λ_i h_i(x).

Then, if x* is a local minimum which is regular, the Lagrange multiplier conditions are written

∇_x L(x*, λ*) = 0,  ∇_λ L(x*, λ*) = 0,

a system of n + m equations with n + m unknowns. Also,

y′∇²_{xx}L(x*, λ*)y ≥ 0, ∀ y s.t. ∇h(x*)′y = 0.

Example:

minimize ½(x₁² + x₂² + x₃²)
subject to x₁ + x₂ + x₃ = 3.

Necessary conditions:

x*₁ + λ* = 0,  x*₂ + λ* = 0,
x*₃ + λ* = 0,  x*₁ + x*₂ + x*₃ = 3.


    EXAMPLE - PORTFOLIO SELECTION

Investment of 1 unit of wealth among n assets with random rates of return e_i, given means ē_i, and covariance matrix Q = [ E{(e_i − ē_i)(e_j − ē_j)} ].

If x_i: amount invested in asset i, we want to

minimize x′Qx  ( = variance of the return ∑_i e_i x_i )

subject to ∑_i x_i = 1, and a given mean ∑_i ē_i x_i = m.

Let λ₁ and λ₂ be the L-multipliers. Have 2Qx* + λ₁u + λ₂ē = 0, where u = (1, ..., 1)′ and ē = (ē₁, ..., ēₙ)′. This yields

x* = mv + w,  variance of return = σ² = (αm + β)² + γ,

where v and w are vectors, and α, β, and γ are some scalars that depend on Q and ē.

[Figure: the line σ̄ = αm + β in the (σ, m) plane. For given m the optimal σ lies on a line (called the efficient frontier).]


    SUFFICIENCY CONDITIONS

Second Order Sufficiency Conditions: Let x* ∈ ℜⁿ and λ* ∈ ℜᵐ satisfy

∇_x L(x*, λ*) = 0,  ∇_λ L(x*, λ*) = 0,

y′∇²_{xx}L(x*, λ*)y > 0, ∀ y ≠ 0 with ∇h(x*)′y = 0.

Then x* is a strict local minimum.

Example: Minimize −(x₁x₂ + x₂x₃ + x₁x₃) subject to x₁ + x₂ + x₃ = 3. We have that x*₁ = x*₂ = x*₃ = 1 and λ* = 2 satisfy the 1st order conditions. Also

∇²_{xx}L(x*, λ*) =
[  0 −1 −1
  −1  0 −1
  −1 −1  0 ].

We have for all y ≠ 0 with ∇h(x*)′y = 0, i.e., y₁ + y₂ + y₃ = 0,

y′∇²_{xx}L(x*, λ*)y = −y₁(y₂ + y₃) − y₂(y₁ + y₃) − y₃(y₁ + y₂) = y₁² + y₂² + y₃² > 0.

Hence, x* is a strict local minimum.
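The projected-Hessian check can be automated: take a basis Z for the nullspace of ∇h(x*)′ and test that Z′∇²_{xx}L(x*, λ*)Z is positive definite (a sketch for the example above; the SVD-based nullspace construction is my own choice):

```python
import numpy as np

H = np.array([[0.0, -1.0, -1.0],
              [-1.0, 0.0, -1.0],
              [-1.0, -1.0, 0.0]])   # Hessian of the Lagrangian at x* = (1,1,1)
a = np.ones((1, 3))                 # gradient of the constraint x1 + x2 + x3 = 3
# Orthonormal basis of the nullspace of a, from the rows of Vh in the full SVD.
Z = np.linalg.svd(a)[2][1:].T       # 3 x 2 matrix with a @ Z = 0
eigs = np.linalg.eigvalsh(Z.T @ H @ Z)
```

All eigenvalues of the 2 × 2 projected Hessian are positive, confirming the sufficiency condition numerically.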


    A BASIC LEMMA

Lemma: Let P and Q be two symmetric matrices. Assume that Q ≥ 0 and P > 0 on the nullspace of Q, i.e., x′Px > 0 for all x ≠ 0 with x′Qx = 0. Then there exists a scalar c̄ such that

P + cQ: positive definite, ∀ c > c̄.

Proof: Assume the contrary. Then for every k, there exists a vector x^k with ‖x^k‖ = 1 such that

x^k′Px^k + kx^k′Qx^k ≤ 0.

Consider a subsequence {x^k}_{k∈K} converging to some x̄ with ‖x̄‖ = 1. Taking the limit superior,

x̄′Px̄ + lim sup_{k→∞, k∈K}(kx^k′Qx^k) ≤ 0. (*)

We have x^k′Qx^k ≥ 0 (since Q ≥ 0), so {x^k′Qx^k}_{k∈K} → 0. Therefore, x̄′Qx̄ = 0 and, using the hypothesis, x̄′Px̄ > 0. This contradicts (*).


    PROOF OF SUFFICIENCY CONDITIONS

Consider the augmented Lagrangian function

L_c(x, λ) = f(x) + λ′h(x) + (c/2)‖h(x)‖²,

where c is a scalar. We have

∇_x L_c(x, λ) = ∇_x L(x, λ̃),  ∇²_{xx}L_c(x, λ) = ∇²_{xx}L(x, λ̃) + c∇h(x)∇h(x)′,

where λ̃ = λ + ch(x). If (x*, λ*) satisfy the suff. conditions, we have, using the lemma,

∇_x L_c(x*, λ*) = 0,  ∇²_{xx}L_c(x*, λ*) > 0,

for suff. large c. Hence for some γ > 0, ε > 0,

L_c(x, λ*) ≥ L_c(x*, λ*) + (γ/2)‖x − x*‖², if ‖x − x*‖ < ε.

Since L_c(x, λ*) = f(x) when h(x) = 0,

f(x) ≥ f(x*) + (γ/2)‖x − x*‖², if h(x) = 0, ‖x − x*‖ < ε.


    SENSITIVITY - GRAPHICAL DERIVATION

[Figure: the constraint lines a′x = b and a′x = b + Δb, with minima x* and x* + Δx, and the gradient ∇f(x*).]

Sensitivity theorem for the problem min_{a′x=b} f(x). If b is changed to b + Δb, the minimum x* will change to x* + Δx. Since b + Δb = a′(x* + Δx) = a′x* + a′Δx = b + a′Δx, we have a′Δx = Δb. Using the condition ∇f(x*) = −λ*a,

Δcost = f(x* + Δx) − f(x*) = ∇f(x*)′Δx + o(‖Δx‖) = −λ*a′Δx + o(‖Δx‖).

Thus Δcost = −λ*Δb + o(‖Δx‖), so up to first order

λ* = −Δcost/Δb.

For multiple constraints a_i′x = b_i, i = 1, ..., m, we have

Δcost = −∑_{i=1}^m λ*_i Δb_i + o(‖Δx‖).


    EXAMPLE

[Figure: graph of the primal function p(u), with slope ∇p(0) = −λ* = −1 at u = 0.]

Illustration of the primal function p(u) = f(x(u)) for the two-dimensional problem

minimize f(x) = ½(x₁² − x₂²) − x₂
subject to h(x) = x₂ = 0.

Here,

p(u) = min_{h(x)=u} f(x) = −½u² − u

and λ* = −∇p(0) = 1, consistently with the sensitivity theorem.

Need for regularity of x*: Change the constraint to h(x) = x₂² = 0. Then p(u) = −u/2 − √u for u ≥ 0 and is undefined for u < 0.


    PROOF OUTLINE OF SENSITIVITY THEOREM

Apply the implicit function theorem to the system

∇f(x) + ∇h(x)λ = 0,  h(x) = u.

For u = 0 the system has the solution (x*, λ*), and the corresponding (n + m) × (n + m) Jacobian

J = [ ∇²f(x*) + ∑_{i=1}^m λ*_i ∇²h_i(x*)   ∇h(x*)
      ∇h(x*)′                              0      ]

is shown nonsingular using the sufficiency conditions. Hence, for all u in some open sphere S centered at u = 0, there exist x(u) and λ(u) such that x(0) = x*, λ(0) = λ*, the functions x(·) and λ(·) are continuously differentiable, and

∇f(x(u)) + ∇h(x(u))λ(u) = 0,  h(x(u)) = u.

For u close to u = 0, using the sufficiency conditions, x(u) and λ(u) are a local minimum-Lagrange multiplier pair for the problem min_{h(x)=u} f(x).

To derive ∇p(u), differentiate h(x(u)) = u to obtain I = ∇x(u)∇h(x(u)), and combine with the relations

∇x(u)∇f(x(u)) + ∇x(u)∇h(x(u))λ(u) = 0

and

∇p(u) = ∇_u f(x(u)) = ∇x(u)∇f(x(u)),

to obtain ∇p(u) = −λ(u).


    6.252 NONLINEAR PROGRAMMING

    LECTURE 13: INEQUALITY CONSTRAINTS

    LECTURE OUTLINE

    Inequality Constrained Problems

Necessary Conditions

Sufficiency Conditions

Linear Constraints

Inequality constrained problem

minimize f(x)
subject to h(x) = 0, g(x) ≤ 0,

where f: ℜⁿ → ℜ, h: ℜⁿ → ℜᵐ, g: ℜⁿ → ℜʳ are continuously differentiable. Here

h = (h₁, ..., hₘ),  g = (g₁, ..., g_r).


    TREATING INEQUALITIES AS EQUATIONS

Consider the set of active inequality constraints

A(x) = { j | g_j(x) = 0 }.

If x* is a local minimum:

The active inequality constraints at x* can be treated as equations

The inactive constraints at x* don't matter

Assuming regularity of x* and assigning zero Lagrange multipliers to inactive constraints,

∇f(x*) + ∑_{i=1}^m λ*_i ∇h_i(x*) + ∑_{j=1}^r μ*_j ∇g_j(x*) = 0,

μ*_j = 0, ∀ j ∉ A(x*).

Extra property: μ*_j ≥ 0 for all j.

Intuitive reason: Relax the jth constraint, g_j(x) ≤ u_j. Since Δcost ≤ 0 if u_j > 0, by the sensitivity theorem we have

μ*_j = −(Δcost due to u_j)/u_j ≥ 0.


    BASIC RESULTS

Kuhn-Tucker Necessary Conditions: Let x* be a local minimum and a regular point. Then there exist unique Lagrange mult. vectors λ* = (λ*₁, ..., λ*ₘ), μ* = (μ*₁, ..., μ*_r), such that

∇_x L(x*, λ*, μ*) = 0,

μ*_j ≥ 0, j = 1, ..., r,
μ*_j = 0, ∀ j ∉ A(x*).

If f, h, and g are twice cont. differentiable,

y′∇²_{xx}L(x*, λ*, μ*)y ≥ 0, for all y ∈ V(x*),

where

V(x*) = { y | ∇h(x*)′y = 0, ∇g_j(x*)′y = 0, ∀ j ∈ A(x*) }.

Similar sufficiency conditions and sensitivity results. They require strict complementarity, i.e., μ*_j > 0, ∀ j ∈ A(x*), as well as regularity of x*.


    PROOF OF KUHN-TUCKER CONDITIONS

Use the equality-constraints result to obtain all the conditions except for μ*_j ≥ 0 for j ∈ A(x*). Introduce the penalty functions

g⁺_j(x) = max{ 0, g_j(x) }, j = 1, ..., r,

and for k = 1, 2, ..., let x^k minimize

f(x) + (k/2)‖h(x)‖² + (k/2)∑_{j=1}^r (g⁺_j(x))² + ½‖x − x*‖²

over a closed sphere centered at x* within which f(x*) ≤ f(x) for all feasible x. Using the same argument as for equality constraints,

λ*_i = lim_{k→∞} k h_i(x^k), i = 1, ..., m,

μ*_j = lim_{k→∞} k g⁺_j(x^k), j = 1, ..., r.

Since g⁺_j(x^k) ≥ 0, we obtain μ*_j ≥ 0 for all j.


    GENERAL SUFFICIENCY CONDITION

Consider the problem

minimize f(x)
subject to x ∈ X, g_j(x) ≤ 0, j = 1, ..., r.

Let x* be feasible and satisfy

μ*_j ≥ 0, j = 1, ..., r,  μ*_j = 0, ∀ j ∉ A(x*),

x* = arg min_{x∈X} L(x, μ*).

Then x* is a global minimum of the problem.

Proof: We have

f(x*) = f(x*) + μ*′g(x*) = min_{x∈X} { f(x) + μ*′g(x) }
≤ min_{x∈X, g(x)≤0} { f(x) + μ*′g(x) }
≤ min_{x∈X, g(x)≤0} f(x),

where the first equality follows from the hypothesis, which implies that μ*′g(x*) = 0, and the last inequality follows from the nonnegativity of μ*. Q.E.D.

Special Case: Let X = ℜⁿ, f and g_j be convex and differentiable. Then the 1st order Kuhn-Tucker conditions are also sufficient for global optimality.


    LINEAR CONSTRAINTS

Consider the problem min_{a_j′x ≤ b_j, j=1,...,r} f(x).

Remarkable property: No need for regularity.

Proposition: If x* is a local minimum, there exist μ*₁, ..., μ*_r with μ*_j ≥ 0, j = 1, ..., r, such that

∇f(x*) + ∑_{j=1}^r μ*_j a_j = 0,  μ*_j = 0, ∀ j ∉ A(x*).

The proof uses Farkas' Lemma: Consider the cone C generated by the a_j, and the polar cone C⊥ shown below:

C⊥ = { y | a_j′y ≤ 0, j = 1, ..., r },
C = { x | x = ∑_{j=1}^r μ_j a_j, μ_j ≥ 0 }.

Then (C⊥)⊥ = C, i.e.,

x ∈ C iff x′y ≤ 0, ∀ y ∈ C⊥.

[Figure: the cone C generated by a₁, a₂ and its polar cone C⊥.]


    PROOF OF LAGRANGE MULTIPLIER RESULT

[Figure: the constraint set {x | a_j′x ≤ b_j, j = 1, ..., r}, the cone C generated by a_j, j ∈ A(x*), and −∇f(x*) lying in C.]

The local min x* of the original problem is also a local min for the problem min_{a_j′x≤b_j, j∈A(x*)} f(x). Hence

∇f(x*)′(x − x*) ≥ 0, ∀ x with a_j′x ≤ b_j, j ∈ A(x*).

Since a constraint a_j′x ≤ b_j, j ∈ A(x*), can also be expressed as a_j′(x − x*) ≤ 0, we have

∇f(x*)′y ≥ 0, ∀ y with a_j′y ≤ 0, j ∈ A(x*).

From Farkas' lemma, −∇f(x*) has the form

∑_{j∈A(x*)} μ*_j a_j, for some μ*_j ≥ 0, j ∈ A(x*).

Let μ*_j = 0 for j ∉ A(x*).


    6.252 NONLINEAR PROGRAMMING

    LECTURE 14: INTRODUCTION TO DUALITY

    LECTURE OUTLINE

Convex Cost/Linear Constraints

Duality Theorem

Linear Programming Duality

Quadratic Programming Duality

Linear inequality constrained problem

minimize f(x)
subject to a_j′x ≤ b_j, j = 1, ..., r,

where f is convex and continuously differentiable over ℜⁿ.


    LAGRANGE MULTIPLIER RESULT

Let J ⊂ {1, ..., r}. Then x* is a global min if and only if x* is feasible and there exist μ*_j ≥ 0, j ∈ J, such that μ*_j = 0 for all j ∈ J with j ∉ A(x*), and

x* = arg min_{a_j′x≤b_j, j∉J} { f(x) + ∑_{j∈J} μ*_j(a_j′x − b_j) }.

Proof: Assume x* is a global min. Then there exist μ*_j ≥ 0 such that μ*_j(a_j′x* − b_j) = 0 for all j and ∇f(x*) + ∑_{j=1}^r μ*_j a_j = 0, implying

x* = arg min_{x∈ℜⁿ} { f(x) + ∑_{j=1}^r μ*_j(a_j′x − b_j) }.

Since μ*_j(a_j′x* − b_j) = 0 for all j,

f(x*) = min_{x∈ℜⁿ} { f(x) + ∑_{j=1}^r μ*_j(a_j′x − b_j) }
≤ min_{a_j′x≤b_j, j∉J} { f(x) + ∑_{j=1}^r μ*_j(a_j′x − b_j) }.

Since μ*_j(a_j′x − b_j) ≤ 0 if a_j′x − b_j ≤ 0,

f(x*) ≤ min_{a_j′x≤b_j, j∉J} { f(x) + ∑_{j∈J} μ*_j(a_j′x − b_j) } ≤ f(x*).


    PROOF (CONTINUED)

Conversely, if x* is feasible and there exist scalars μ*_j, j ∈ J, with the stated properties, then x* is a global min by the general sufficiency condition of the preceding lecture (where X is taken to be the set of x such that a_j′x ≤ b_j for all j ∉ J). Q.E.D.

Interesting observation: The same set of μ*_j works for all index sets J.

The flexibility to split the set of constraints into those that are handled by Lagrange multipliers (set J) and those that are handled explicitly comes in handy in many analytical and computational contexts.


    THE DUAL PROBLEM

Consider the problem

min_{x∈X, a_j′x≤b_j, j=1,...,r} f(x),

where f is convex and cont. differentiable over ℜⁿ and X is polyhedral.

Define the dual function q: ℜʳ → [−∞, ∞):

q(μ) = inf_{x∈X} L(x, μ) = inf_{x∈X} { f(x) + ∑_{j=1}^r μ_j(a_j′x − b_j) }

and the dual problem

max_{μ≥0} q(μ).

If X is bounded, the dual function takes real values. In general, q(μ) can take the value −∞. The effective constraint set of the dual is

Q = { μ | μ ≥ 0, q(μ) > −∞ }.


    DUALITY THEOREM

(a) If the primal problem has an optimal solution, the dual problem also has an optimal solution and the optimal values are equal.

(b) x* is primal-optimal and μ* is dual-optimal if and only if x* is primal-feasible, μ* ≥ 0, and

f(x*) = L(x*, μ*) = min_{x∈X} L(x, μ*).

Proof: (a) Let x* be a primal optimal solution. For all primal feasible x, and all μ ≥ 0, we have μ_j(a_j′x − b_j) ≤ 0 for all j, so

q(μ) ≤ inf_{x∈X, a_j′x≤b_j, j=1,...,r} { f(x) + ∑_{j=1}^r μ_j(a_j′x − b_j) }
≤ inf_{x∈X, a_j′x≤b_j, j=1,...,r} f(x) = f(x*). (*)

By the L-Mult. Th., there exists μ* ≥ 0 such that μ*_j(a_j′x* − b_j) = 0 for all j, and x* = arg min_{x∈X} L(x, μ*), so

q(μ*) = L(x*, μ*) = f(x*) + ∑_{j=1}^r μ*_j(a_j′x* − b_j) = f(x*).


    PROOF (CONTINUED)

(b) If x* is primal-optimal and μ* is dual-optimal, by part (a),

f(x*) = q(μ*),

which when combined with Eq. (*) yields

f(x*) = L(x*, μ*) = q(μ*) = min_{x∈X} L(x, μ*).

Conversely, the relation f(x*) = min_{x∈X} L(x, μ*) is written as f(x*) = q(μ*), and since x* is primal-feasible and μ* ≥ 0, Eq. (*) implies that x* is primal-optimal and μ* is dual-optimal. Q.E.D.

Linear equality constraints are treated similarly to inequality constraints, except that the sign of the Lagrange multipliers λ is unrestricted:

Primal: min_{x∈X, e_i′x=d_i, i=1,...,m, a_j′x≤b_j, j=1,...,r} f(x)

Dual: max_{λ∈ℜᵐ, μ≥0} q(λ, μ) = max_{λ∈ℜᵐ, μ≥0} inf_{x∈X} L(x, λ, μ).


    THE DUAL OF A LINEAR PROGRAM

Consider the linear program

minimize c′x
subject to e_i′x = d_i, i = 1, ..., m, x ≥ 0.

Dual function:

q(λ) = inf_{x≥0} { ∑_{j=1}^n ( c_j − ∑_{i=1}^m λ_i e_{ij} )x_j + ∑_{i=1}^m λ_i d_i }.

If c_j − ∑_{i=1}^m λ_i e_{ij} ≥ 0 for all j, the infimum is attained for x = 0, and q(λ) = ∑_{i=1}^m λ_i d_i. If c_j − ∑_{i=1}^m λ_i e_{ij} < 0 for some j, the expression in braces can be made arbitrarily small by taking x_j suff. large, so q(λ) = −∞. Thus, the dual is

maximize ∑_{i=1}^m λ_i d_i

subject to ∑_{i=1}^m λ_i e_{ij} ≤ c_j, j = 1, ..., n.
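Weak and strong duality are visible even on a one-constraint instance, min c′x s.t. x₁ + x₂ = d, x ≥ 0, whose optimum puts all mass on the cheapest coordinate. A quick numerical check of my own (the dual function is evaluated on a grid):

```python
import numpy as np

c, d = np.array([1.0, 2.0]), 1.0   # min c'x  s.t.  x1 + x2 = d, x >= 0
primal = d * c.min()               # optimal cost: all mass on the cheapest coordinate
# Dual function: q(lam) = lam*d if lam <= min_j c_j, else -infinity.
lams = np.linspace(-3.0, 3.0, 601)
q = np.where(lams <= c.min(), lams * d, -np.inf)
dual = q.max()
```

Every dual value lies below the primal optimum (weak duality), and the best dual value matches it (strong duality), up to the grid resolution.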


    THE DUAL OF A QUADRATIC PROGRAM

Consider the quadratic program

minimize ½x′Qx + c′x
subject to Ax ≤ b,

where Q is a given n × n positive definite symmetric matrix, A is a given r × n matrix, and b ∈ ℜʳ and c ∈ ℜⁿ are given vectors.

Dual function:

q(μ) = inf_{x∈ℜⁿ} { ½x′Qx + c′x + μ′(Ax − b) }.

The infimum is attained for x = −Q⁻¹(c + A′μ), and, after substitution and calculation,

q(μ) = −½μ′AQ⁻¹A′μ − μ′(b + AQ⁻¹c) − ½c′Q⁻¹c.

The dual problem, after a sign change, is

minimize ½μ′Pμ + t′μ
subject to μ ≥ 0,

where P = AQ⁻¹A′ and t = b + AQ⁻¹c.
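With one constraint the dual is a scalar problem and everything can be checked in closed form (the example data are my own: the closest point to the origin with x₁ ≤ −1):

```python
import numpy as np

# Primal: min (1/2) x'Qx + c'x  s.t.  Ax <= b, with Q = I, c = 0, A = [1 0], b = [-1].
Q, c = np.eye(2), np.zeros(2)
A, b = np.array([[1.0, 0.0]]), np.array([-1.0])

P = A @ np.linalg.solve(Q, A.T)              # P = A Q^{-1} A'
t = b + A @ np.linalg.solve(Q, c)            # t = b + A Q^{-1} c
mu = np.array([max(0.0, -t[0] / P[0, 0])])   # min of (1/2) mu'P mu + t'mu over mu >= 0
x = -np.linalg.solve(Q, c + A.T @ mu)        # primal recovery x = -Q^{-1}(c + A'mu)
dual_val = -(0.5 * mu @ P @ mu + t @ mu) - 0.5 * c @ np.linalg.solve(Q, c)
primal_val = 0.5 * x @ Q @ x + c @ x
```

The dual optimum μ* = 1 recovers the primal solution x* = (−1, 0), and both optimal values equal ½, as the duality theorem asserts.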


    6.252 NONLINEAR PROGRAMMING

    LECTURE 15: INTERIOR POINT METHODS

    LECTURE OUTLINE

Barrier and Interior Point Methods

Linear Programs and the Logarithmic Barrier

Path Following Using Newton's Method

Inequality constrained problem

minimize f(x)
subject to x ∈ X, g_j(x) ≤ 0, j = 1, ..., r,

where f and g_j are continuous and X is closed. We assume that the set

S = { x ∈ X | g_j(x) < 0, j = 1, ..., r }

is nonempty and any feasible point is in the closure of S.


    BARRIER METHOD

Consider a barrier function, that is, a function B continuous on S that goes to ∞ as any one of the constraints g_j(x) approaches 0 from negative values. Examples:

B(x) = −∑_{j=1}^r ln(−g_j(x)),  B(x) = −∑_{j=1}^r 1/g_j(x).

Barrier Method:

x^k = arg min_{x∈S} { f(x) + ε^k B(x) }, k = 0, 1, ...,

where the parameter sequence {ε^k} satisfies 0 < ε^{k+1} < ε^k for all k and ε^k → 0.

Linear programs and the logarithmic barrier: for the linear program min_{Ax=b, x≥0} c′x and ε > 0,

x(ε) = arg min_{x∈S} F_ε(x) = arg min_{x∈S} { c′x − ε∑_{i=1}^n ln x_i },

where S = { x | Ax = b, x > 0 }. We assume that S is nonempty and bounded.

As ε → 0, x(ε) follows the central path.

[Figure: the set S, the point x(ε) on the central path, and the optimal solution x* (ε = 0) in the direction of −c.]

All central paths start at the analytic center

x∞ = arg min_{x∈S} ( −∑_{i=1}^n ln x_i ),

and end at optimal solutions of (LP).
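For the toy LP min x₁ s.t. x₁ + x₂ = 1, x > 0, eliminating x₂ gives F_ε(t) = t − ε(ln t + ln(1 − t)), whose minimizer has a closed form; as ε ↓ 0 it traces the central path toward the solution x* = (0, 1). The derivation below is mine, following the slide's setup:

```python
import math

def x_central(eps):
    """Barrier minimizer of t - eps*(ln t + ln(1 - t)) on (0, 1):
    stationarity gives t^2 - (1 + 2*eps)*t + eps = 0 (take the smaller root)."""
    t = (1 + 2 * eps - math.sqrt(1 + 4 * eps ** 2)) / 2
    return (t, 1 - t)

# As eps decreases, x(eps) moves monotonically along the central path.
path = [x_central(eps) for eps in (1.0, 0.1, 0.01, 1e-6)]
```

Each point is strictly interior, the first coordinate decreases monotonically as ε shrinks, and for ε = 10⁻⁶ the point is already within about 10⁻⁶ of x*.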


    PATH FOLLOWING W/ NEWTONS METHOD

Newton's method for minimizing F_ε:

x̃ = x + α(x̄ − x),

where x̄ is the pure Newton iterate

x̄ = arg min_{Az=b} { ∇F_ε(x)′(z − x) + ½(z − x)′∇²F_ε(x)(z − x) }.

By straightforward calculation,

x̄ = x − Xq(x, ε),

q(x, ε) = (Xz)/ε − e,  e = (1, ..., 1)′,  z = c − A′λ,

λ = (AX²A′)⁻¹AX( Xc − εe ),

and X is the diagonal matrix with x_i, i = 1, ..., n along the diagonal.

View q(x, ε) as the Newton increment (x̄ − x) transformed by X⁻¹, the scaling that maps x into e.

Consider ‖q(x, ε)‖ as a proximity measure of the current point to the point x(ε) on the central path.


    KEY RESULTS

It is sufficient to minimize F_ε approximately, up to where ‖q(x, ε)‖ < 1.

[Figure: the central path with points x(ε⁰), x(ε¹), x(ε²), the iterates x⁰, x¹, x², and the set {x | ‖q(x, ε⁰)‖ < 1}.]

If x > 0, Ax = b, and ‖q(x, ε)‖ < 1, then

c′x − min_{Ay=b, y≥0} c′y ≤ ε(n + √n).

The termination set { x | ‖q(x, ε)‖ < 1 } is part of the region of quadratic convergence of the pure form of Newton's method.

Let x > 0, Ax = b, and suppose that for some γ < 1 we have ‖q(x, ε)‖ ≤ γ. Then if ε̄ = (1 − δn^{−1/2})ε for some δ > 0,

‖q(x, ε̄)‖ ≤ (γ² + δ) / (1 − δn^{−1/2}).

In particular, if

δ ≤ γ(1 − γ)(1 + γ)⁻¹,

we have ‖q(x, ε̄)‖ ≤ γ.

Can be used to establish nice complexity results; but ε must be reduced VERY slowly.


    LONG STEP METHODS

Main features:

Decrease ε faster than dictated by complexity analysis.

Require more than one Newton step per (approximate) minimization.

Use line search as in unconstrained Newton's method.

Require a much smaller number of (approximate) minimizations.

[Figure: (a) short step method: iterates closely track the central path points x(ε^k); (b) long step method: fewer, larger moves along the central path.]

The methodology generalizes to quadratic programming and convex programming.


    6.252 NONLINEAR PROGRAMMING

    LECTURE 16: PENALTY METHODS

    LECTURE OUTLINE

Quadratic Penalty Methods

Introduction to Multiplier Methods

*******************************************

Consider the equality constrained problem

minimize f(x)
subject to x ∈ X, h(x) = 0,

where f: ℜⁿ → ℜ and h: ℜⁿ → ℜᵐ are continuous, and X is closed.

The quadratic penalty method:

x^k = arg min_{x∈X} L_{c^k}(x, λ^k) ≡ f(x) + λ^k′h(x) + (c^k/2)‖h(x)‖²,

where {λ^k} is a bounded sequence and {c^k} satisfies 0 < c^k < c^{k+1} for all k and c^k → ∞.


    TWO CONVERGENCE MECHANISMS

Taking λᵏ close to a Lagrange multiplier vector:

− Assume X = ℜⁿ and (x*, λ*) is a local min-Lagrange multiplier pair satisfying the 2nd order sufficiency conditions.

− For c suff. large, x* is a strict local min of L_c(·, λ*).

Taking c_k very large: For large c and any λ,

L_c(x, λ) ≈ f(x) if x ∈ X and h(x) = 0, and L_c(x, λ) ≈ ∞ otherwise.

Example:
minimize f(x) = ½(x₁² + x₂²)
subject to x₁ = 1

L_c(x, λ) = ½(x₁² + x₂²) + λ(x₁ − 1) + (c/2)(x₁ − 1)²

x₁(λ, c) = (c − λ)/(c + 1),  x₂(λ, c) = 0
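For this example the inner minimization is available in closed form, so both mechanisms can be checked directly (a small illustrative snippet, not part of the original slides):

```python
def x1(lam, c):
    # closed-form minimizer of L_c(x, lam) in x1 for the example above
    return (c - lam) / (c + 1.0)

# Mechanism 1: lam at the true multiplier lam* = -1 recovers x1* = 1 for ANY c
exact = x1(-1.0, 1.0)

# Mechanism 2: lam fixed at 0; x1 -> 1 only as c -> infinity
approx = [x1(0.0, c) for c in (1.0, 10.0, 100.0)]   # 1/2, 10/11, 100/101
```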


    EXAMPLE CONTINUED

min_{x₁=1} ½(x₁² + x₂²),  x* = (1, 0),  λ* = −1

(Figure: contours of L_c(·, λ) in the (x₁, x₂) plane; the unconstrained minimum is at x₁ = 1/2 for c = 1, λ = 0; at x₁ = 3/4 for c = 1, λ = −1/2; and at x₁ = 10/11 for c = 10, λ = 0 — approaching x* = (1, 0) as c increases or as λ approaches λ*.)


    GLOBAL CONVERGENCE

Every limit point of {xᵏ} is a global min.

Proof: The optimal value of the problem is f* = inf_{h(x)=0, x∈X} f(x). We have

L_{c_k}(xᵏ, λᵏ) ≤ L_{c_k}(x, λᵏ),  ∀ x ∈ X,

so taking the inf of the RHS over x ∈ X, h(x) = 0,

L_{c_k}(xᵏ, λᵏ) = f(xᵏ) + λᵏ'h(xᵏ) + (c_k/2)‖h(xᵏ)‖² ≤ f*.

Let (x̄, λ̄) be a limit point of {xᵏ, λᵏ}. Without loss of generality, assume that {xᵏ, λᵏ} → (x̄, λ̄). Taking the limsup above,

f(x̄) + λ̄'h(x̄) + lim sup_{k→∞} (c_k/2)‖h(xᵏ)‖² ≤ f*.  (*)

Since ‖h(xᵏ)‖² ≥ 0 and c_k → ∞, it follows that h(xᵏ) → 0 and h(x̄) = 0. Hence, x̄ is feasible, and since from Eq. (*) we have f(x̄) ≤ f*, x̄ is optimal. Q.E.D.


    LAGRANGE MULTIPLIER ESTIMATES

Assume that X = ℜⁿ, and f and h are cont. differentiable. Let {λᵏ} be bounded, and c_k → ∞. Assume xᵏ satisfies ∇ₓL_{c_k}(xᵏ, λᵏ) = 0 for all k, and that xᵏ → x̄, where x̄ is such that ∇h(x̄) has rank m. Then h(x̄) = 0 and λ̃ᵏ → λ̄, where

λ̃ᵏ = λᵏ + c_k h(xᵏ),  ∇ₓL(x̄, λ̄) = 0.

Proof: We have

0 = ∇ₓL_{c_k}(xᵏ, λᵏ) = ∇f(xᵏ) + ∇h(xᵏ)(λᵏ + c_k h(xᵏ)) = ∇f(xᵏ) + ∇h(xᵏ)λ̃ᵏ.

Multiply with

(∇h(xᵏ)'∇h(xᵏ))⁻¹∇h(xᵏ)'

and take the limit to obtain λ̃ᵏ → λ̄ with

λ̄ = −(∇h(x̄)'∇h(x̄))⁻¹∇h(x̄)'∇f(x̄).

We also have ∇ₓL(x̄, λ̄) = 0 and h(x̄) = 0 (since λ̃ᵏ converges).
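On the earlier example min ½(x₁² + x₂²) s.t. x₁ = 1 (where λ* = −1), the estimate λ̃ᵏ = λᵏ + c_k h(xᵏ) can be tracked in closed form; a small illustrative check:

```python
# pure penalty run (lam^k = 0); the minimizer of L_c(x, 0) is x1 = c/(c+1)
lam_star = -1.0
estimates = []
for c in (1.0, 10.0, 100.0, 1000.0):       # c_k -> infinity
    x1 = c / (c + 1.0)
    lam_tilde = 0.0 + c * (x1 - 1.0)       # lam^k + c_k h(x^k) = -c/(c+1)
    estimates.append(lam_tilde)
```

The estimates −1/2, −10/11, −100/101, ... converge to λ* even though λᵏ itself is never updated.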


    PRACTICAL BEHAVIOR

    Three possibilities:

− The method breaks down because an xᵏ with ∇ₓL_{c_k}(xᵏ, λᵏ) ≈ 0 cannot be found.

− A sequence {xᵏ} with ∇ₓL_{c_k}(xᵏ, λᵏ) → 0 is obtained, but it either has no limit points, or for each of its limit points x̄ the matrix ∇h(x̄) has rank < m.

− A sequence {xᵏ} with ∇ₓL_{c_k}(xᵏ, λᵏ) → 0 is found and it has a limit point x̄ such that ∇h(x̄) has rank m. Then, x̄ together with λ̄ [the corresp. limit point of {λᵏ + c_k h(xᵏ)}] satisfies the first-order necessary conditions.

Ill-conditioning: The condition number of the Hessian ∇²ₓₓL_{c_k}(xᵏ, λᵏ) tends to increase with c_k.

To overcome ill-conditioning:

− Use a Newton-like method (and double precision).

− Use good starting points.

− Increase c_k at a moderate rate (if c_k is increased at a fast rate, {xᵏ} converges faster, but the likelihood of ill-conditioning is greater).


    INEQUALITY CONSTRAINTS

Convert them to equality constraints by using squared slack variables that are eliminated later: the inequality constraint gⱼ(x) ≤ 0 becomes the equality constraint gⱼ(x) + zⱼ² = 0.

The penalty method solves problems of the form

min_{x,z} L_c(x, z, μ) = f(x) + Σ_{j=1}^r [ μⱼ(gⱼ(x) + zⱼ²) + (c/2)|gⱼ(x) + zⱼ²|² ]

for various values of μ and c.

First minimize L_c(x, z, μ) with respect to z,

L_c(x, μ) = min_z L_c(x, z, μ) = f(x) + Σ_{j=1}^r min_{zⱼ} [ μⱼ(gⱼ(x) + zⱼ²) + (c/2)|gⱼ(x) + zⱼ²|² ],

and then minimize L_c(x, μ) with respect to x.
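The inner minimization over zⱼ has a well-known closed form, (1/(2c))[max(0, μⱼ + c gⱼ(x))² − μⱼ²]; the brute-force check below (an illustration, not from the slides) confirms it:

```python
import numpy as np

def via_slack(g, mu, c):
    # brute-force min over z of mu*(g + z^2) + (c/2)*(g + z^2)^2
    z = np.linspace(0.0, 10.0, 200001)
    u = g + z**2
    return float(np.min(mu * u + 0.5 * c * u**2))

def closed_form(g, mu, c):
    # standard result after eliminating the squared slack variable
    return (max(0.0, mu + c * g)**2 - mu**2) / (2.0 * c)

checks = [abs(via_slack(g, mu, 4.0) - closed_form(g, mu, 4.0))
          for g in (-2.0, -0.1, 0.5) for mu in (0.0, 1.0, 3.0)]
```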


    MULTIPLIER METHODS

Recall that if (x*, λ*) is a local min-Lagrange multiplier pair satisfying the 2nd order sufficiency conditions, then for c suff. large, x* is a strict local min of L_c(·, λ*). This suggests that for λᵏ ≈ λ*, we get xᵏ ≈ x*.

Hence it is a good idea to use λᵏ ≈ λ*, such as

λᵏ⁺¹ = λ̃ᵏ = λᵏ + c_k h(xᵏ)

This is the (1st order) method of multipliers.

Key advantages to be shown:

− Less ill-conditioning: It is not necessary that c_k → ∞ (only that c_k exceeds some threshold).

− Faster convergence when λᵏ is updated than when λᵏ is kept constant (whether c_k → ∞ or not).


    6.252 NONLINEAR PROGRAMMING

LECTURE 17: AUGMENTED LAGRANGIAN METHODS

LECTURE OUTLINE

Multiplier Methods

*******************************************

Consider the equality constrained problem

minimize f(x)
subject to h(x) = 0,

where f : ℜⁿ → ℜ and h : ℜⁿ → ℜᵐ are continuously differentiable.

The (1st order) multiplier method finds

xᵏ = arg min_{x∈ℜⁿ} L_{c_k}(x, λᵏ) ≡ f(x) + λᵏ'h(x) + (c_k/2)‖h(x)‖²

and updates λᵏ using

λᵏ⁺¹ = λᵏ + c_k h(xᵏ)


    CONVEX EXAMPLE

Problem: min_{x₁=1} ½(x₁² + x₂²) with optimal solution x* = (1, 0) and Lagr. multiplier λ* = −1.

We have

xᵏ = arg min_{x∈ℜⁿ} L_{c_k}(x, λᵏ) = ( (c_k − λᵏ)/(c_k + 1), 0 )

λᵏ⁺¹ = λᵏ + c_k ( (c_k − λᵏ)/(c_k + 1) − 1 )

λᵏ⁺¹ − λ* = (λᵏ − λ*)/(c_k + 1)

We see that:

− λᵏ → λ* = −1 and xᵏ → x* = (1, 0) for every nondecreasing sequence {c_k}. It is NOT necessary to increase c_k to ∞.

− The convergence rate becomes faster as c_k becomes larger; in fact |λᵏ − λ*| converges superlinearly if c_k → ∞.
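The iteration can be run numerically with a modest, fixed c_k (an illustrative snippet using the closed-form minimizer derived above):

```python
# method of multipliers on: min (1/2)(x1^2 + x2^2)  s.t.  x1 = 1
lam, c = 0.0, 4.0                  # c held FIXED: no need for c -> infinity
history = []
for _ in range(20):
    x1 = (c - lam) / (c + 1.0)     # exact minimizer of L_c(x, lam)
    lam = lam + c * (x1 - 1.0)     # lam^{k+1} = lam^k + c_k h(x^k)
    history.append(lam)
```

Each pass multiplies the error λᵏ − λ* by 1/(c + 1) = 1/5, a linear rate that is fast enough to reach λ* = −1 to machine precision without ever increasing c.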


    NONCONVEX EXAMPLE

Problem: min_{x₁=1} ½(−x₁² + x₂²) with optimal solution x* = (1, 0) and Lagr. multiplier λ* = 1.

We have

xᵏ = arg min_{x∈ℜⁿ} L_{c_k}(x, λᵏ) = ( (c_k − λᵏ)/(c_k − 1), 0 ),

provided c_k > 1 (otherwise the min does not exist),

λᵏ⁺¹ = λᵏ + c_k ( (c_k − λᵏ)/(c_k − 1) − 1 )

λᵏ⁺¹ − λ* = −(λᵏ − λ*)/(c_k − 1)

We see that:

− No need to increase c_k to ∞ for convergence; doing so results in a faster convergence rate.

− To obtain convergence, c_k must eventually exceed the threshold 2.


    THE PRIMAL FUNCTIONAL

Let (x*, λ*) be a regular local min-Lagr. pair satisfying the 2nd order suff. conditions.

The primal functional

p(u) = min_{h(x)=u} f(x)

is defined for u in an open sphere centered at u = 0, and we have

p(0) = f(x*),  ∇p(0) = −λ*.

(Figure: (a) for the convex example, p(u) = min_{x₁−1=u} ½(x₁² + x₂²) = ½(u + 1)², with p(0) = f(x*) = ½; (b) for the nonconvex example, p(u) = min_{x₁−1=u} ½(−x₁² + x₂²) = −½(u + 1)², with p(0) = f(x*) = −½.)


    AUGM. LAGRANGIAN MINIMIZATION

Break down the minimization of L_c(·, λ):

min_x L_c(x, λ) = min_u min_{h(x)=u} [ f(x) + λ'h(x) + (c/2)‖h(x)‖² ]
               = min_u [ p(u) + λ'u + (c/2)‖u‖² ],

where the minimization above is understood to be local in a neighborhood of u = 0.

(Figure: the primal function p(u), with slope −λ* at u = 0 and p(0) = f(x*), and the penalized primal function p(u) + (c/2)‖u‖²; a line of slope −λ supports the latter at u(λ, c), and its intercept gives min_x L_c(x, λ).)

If c is suf. large, p(u) + λ'u + (c/2)‖u‖² is convex in a neighborhood of 0. Also, for λ ≈ λ* and large c, the value min_x L_c(x, λ) ≈ p(0) = f(x*).


    INTERPRETATION OF THE METHOD

Geometric interpretation of the iteration

λᵏ⁺¹ = λᵏ + c_k h(xᵏ).

(Figure: on the graph of p(u) + (c/2)‖u‖², each minimization of L_{c_k}(·, λᵏ) determines a point uᵏ with −λᵏ⁺¹ = ∇p(uᵏ); the slopes −λᵏ, −λᵏ⁺¹, −λᵏ⁺², ... approach −λ* as uᵏ → 0.)

If λᵏ is sufficiently close to λ* and/or c_k is suff. large, λᵏ⁺¹ will be closer to λ* than λᵏ.

c_k need not be increased to ∞ in order to obtain convergence; it is sufficient that c_k eventually exceeds some threshold level.

If p(u) is linear, convergence to λ* will be achieved in one iteration.


    COMPUTATIONAL ASPECTS

Key issue is how to select {c_k}:

− c_k should eventually become larger than the threshold of the given problem.

− c⁰ should not be so large as to cause ill-conditioning at the 1st minimization.

− c_k should not be increased so fast that too much ill-conditioning is forced upon the unconstrained minimization too early.

− c_k should not be increased so slowly that the multiplier iteration has a poor convergence rate.

A good practical scheme is to choose a moderate value c⁰, and use c_{k+1} = βc_k, where β is a scalar with β > 1 (typically β ∈ [5, 10] if a Newton-like method is used).

In practice the minimization of L_{c_k}(x, λᵏ) is typically inexact (usually exact asymptotically). In some variants of the method, only one Newton step per minimization is used (with safeguards).


    DUALITY FRAMEWORK

Consider the problem

minimize f(x) + (c/2)‖h(x)‖²
subject to ‖x − x*‖ < ε, h(x) = 0,

where ε is small enough for a local analysis to hold based on the implicit function theorem, and c is large enough for the minimum to exist.

Consider the dual function and its gradient

q_c(λ) = min_{‖x−x*‖<ε} L_c(x, λ)


    6.252 NONLINEAR PROGRAMMING

    LECTURE 18: DUALITY THEORY

    LECTURE OUTLINE

    Geometrical Framework for Duality

Geometric Multipliers

The Dual Problem

Properties of the Dual Function

Consider the problem

minimize f(x)
subject to x ∈ X, gⱼ(x) ≤ 0, j = 1, . . . , r.

We assume that the problem is feasible and the cost is bounded from below:

−∞ < f* = inf_{x∈X, gⱼ(x)≤0, j=1,...,r} f(x) < ∞


    MIN COMMON POINT/MAX CROSSING POINT

Let M be a subset of ℜⁿ⁺¹:

− Min Common Point Problem: Among all points that are common to both M and the (n+1)st axis, find the one whose (n+1)st component is minimum.

− Max Crossing Point Problem: Among all hyperplanes that intersect the (n+1)st axis and support the set M from below, find the hyperplane for which the point of intercept with the (n+1)st axis is maximum.

(Figure: two sets M with their min common point w* and max crossing point q*.)

Note: We will see that the min common/max crossing framework applies to the problem of the preceding slide, i.e., min_{x∈X, g(x)≤0} f(x), with the choice

M = { (z, f(x)) | g(x) ≤ z, x ∈ X }


    GEOMETRICAL DEFINITION OF A MULTIPLIER

A vector μ* = (μ₁*, . . . , μᵣ*) is said to be a geometric multiplier for the primal problem if

μⱼ* ≥ 0, j = 1, . . . , r,

and

f* = inf_{x∈X} L(x, μ*).

(Figure: the set S = {(g(x), f(x)) | x ∈ X} and the hyperplane H = {(z, w) | f* = w + Σⱼ μⱼ*zⱼ} with normal (μ*, 1) passing through (0, f*); the support points of S on H correspond to the x that minimize L(x, μ*) over X.)

Note that the definition differs from the one given in Chapter 3 ... but if X = ℜⁿ, f and gⱼ are convex and differentiable, and x* is an optimal solution, the Lagrange multipliers corresponding to x*, as per the definition of Chapter 3, coincide with the geometric multipliers as per the above definition.


    EXAMPLES: A G-MULTIPLIER EXISTS

(Figure: three cases where a geometric multiplier exists, shown via S = {(g(x), f(x)) | x ∈ X}:

(a) min f(x) = x₁ − x₂ s.t. g(x) = x₁ + x₂ − 1 ≤ 0, x ∈ X = {(x₁, x₂) | x₁ ≥ 0, x₂ ≥ 0}

(b) min f(x) = (1/2)(x₁² + x₂²) s.t. g(x) = x₁ − 1 ≤ 0, x ∈ X = ℜ²

(c) min f(x) = |x₁| + x₂ s.t. g(x) = x₁ ≤ 0, x ∈ X = {(x₁, x₂) | x₂ ≥ 0}

In each case a hyperplane with normal (μ*, 1) supports S at a point whose vertical coordinate is f*.)


    EXAMPLES: A G-MULTIPLIER DOESNT EXIST

(Figure: two cases where a geometric multiplier does not exist:

(a) min f(x) = x s.t. g(x) = x² ≤ 0, x ∈ X = ℜ, with (0, f*) = (0, 0)

(b) min f(x) = −x s.t. g(x) = x − 1/2 ≤ 0, x ∈ X = {0, 1}, with (0, f*) = (0, 0)

In each case S = {(g(x), f(x)) | x ∈ X} admits no supporting hyperplane through (0, f*) with normal (μ*, 1), μ* ≥ 0.)

Proposition: Let μ* be a geometric multiplier. Then x* is a global minimum of the primal problem if and only if x* is feasible and

x* = arg min_{x∈X} L(x, μ*),  μⱼ* gⱼ(x*) = 0, j = 1, . . . , r


    THE DUAL FUNCTION AND THE DUAL PROBLEM

The dual problem is

maximize q(μ)
subject to μ ≥ 0,

where q is the dual function

q(μ) = inf_{x∈X} L(x, μ),  μ ∈ ℜʳ.

Question: How does the optimal dual value q* = sup_{μ≥0} q(μ) relate to f*?

(Figure: hyperplanes H = {(z, w) | w + μ'z = b} with normal (μ, 1) supporting S = {(g(x), f(x)) | x ∈ X} from below; support points correspond to minimizers of L(x, μ) over X, and the intercept is q(μ).)


    WEAK DUALITY

The domain of q is

D_q = { μ | q(μ) > −∞ }.

Proposition: The domain D_q is a convex set and q is concave over D_q.

Proposition: (Weak Duality Theorem) We have

q* ≤ f*.

Proof: For all μ ≥ 0, and x ∈ X with g(x) ≤ 0, we have

q(μ) = inf_{z∈X} L(z, μ) ≤ f(x) + Σ_{j=1}^r μⱼ gⱼ(x) ≤ f(x),

so

q* = sup_{μ≥0} q(μ) ≤ inf_{x∈X, g(x)≤0} f(x) = f*.


DUAL OPTIMAL SOLUTIONS AND G-MULTIPLIERS

Proposition: (a) If q* = f*, the set of geometric multipliers is equal to the set of optimal dual solutions. (b) If q* < f*, the set of geometric multipliers is empty.

Proof: By definition, a vector μ* ≥ 0 is a geometric multiplier if and only if f* = q(μ*) ≤ q*, which by the weak duality theorem, holds if and only if there is no duality gap and μ* is a dual optimal solution. Q.E.D.

(Figure: the dual functions of the two earlier examples:

(a) min f(x) = x s.t. g(x) = x² ≤ 0, x ∈ X = ℜ, with

q(μ) = min_{x∈ℜ} {x + μx²} = −1/(4μ) if μ > 0, −∞ if μ ≤ 0;

here q* = f* = 0, but the sup is not attained.

(b) min f(x) = −x s.t. g(x) = x − 1/2 ≤ 0, x ∈ X = {0, 1}, with

q(μ) = min_{x∈{0,1}} {−x + μ(x − 1/2)} = min{−μ/2, μ/2 − 1};

here q* = −1/2 < f* = 0.)
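The second example can be checked numerically (an illustrative snippet, not from the slides):

```python
def q(mu):
    # dual function of: min -x  s.t.  x - 1/2 <= 0,  x in X = {0, 1}
    return min(-x + mu * (x - 0.5) for x in (0.0, 1.0))

f_star = 0.0                                     # only x = 0 is feasible
q_star = max(q(k / 100.0) for k in range(1001))  # grid search over [0, 10]
gap = f_star - q_star                            # duality gap of 1/2
```

The maximum of q is attained at μ = 1, where both pieces equal −1/2, exhibiting the gap q* < f*.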


    6.252 NONLINEAR PROGRAMMING

    LECTURE 19: DUALITY THEOREMS

    LECTURE OUTLINE

    Duality and G-multipliers (continued)

Consider the problem

minimize f(x)
subject to x ∈ X, gⱼ(x) ≤ 0, j = 1, . . . , r,

assuming −∞ < f* < ∞.

μ* is a geometric multiplier if μ* ≥ 0 and f* = inf_{x∈X} L(x, μ*).

The dual problem is

maximize q(μ)
subject to μ ≥ 0,

where q is the dual function q(μ) = inf_{x∈X} L(x, μ).


    DUAL OPTIMALITY

(Figure: hyperplanes H = {(z, w) | w + μ'z = b} with normal (μ, 1) supporting S = {(g(x), f(x)) | x ∈ X}; support points correspond to minimizers of L(x, μ) over X, and the largest intercept is the optimal dual value.)

Weak Duality Theorem: q* ≤ f*.

Geometric Multipliers and Dual Optimal Solutions:

(a) If there is no duality gap, the set of geometric multipliers is equal to the set of optimal dual solutions.

(b) If there is a duality gap, the set of geometric multipliers is empty.


    DUALITY PROPERTIES

Optimality Conditions: (x*, μ*) is an optimal solution-geometric multiplier pair if and only if

x* ∈ X, g(x*) ≤ 0, (Primal Feasibility),
μ* ≥ 0, (Dual Feasibility),
x* = arg min_{x∈X} L(x, μ*), (Lagrangian Optimality),
μⱼ* gⱼ(x*) = 0, j = 1, . . . , r, (Compl. Slackness).

Saddle Point Theorem: (x*, μ*) is an optimal solution-geometric multiplier pair if and only if x* ∈ X, μ* ≥ 0, and (x*, μ*) is a saddle point of the Lagrangian, in the sense that

L(x*, μ) ≤ L(x*, μ*) ≤ L(x, μ*),  ∀ x ∈ X, μ ≥ 0.


    INFEASIBLE AND UNBOUNDED PROBLEMS

(Figure: three examples, shown via S = {(g(x), f(x)) | x ∈ X}:

(a) min f(x) = 1/x s.t. g(x) = x ≤ 0, x ∈ X = {x | x > 0}: f* = ∞, q* = ∞

(b) min f(x) = x s.t. g(x) = x² ≤ 0, x ∈ X = {x | x > 0}, with S = {(x², x) | x > 0}: f* = ∞, q* = 0

(c) min f(x) = x₁ + x₂ s.t. g(x) = x₁ ≤ 0, x ∈ X = {(x₁, x₂) | x₁ > 0}, with S = {(z, w) | z > 0}: f* = ∞, q* = −∞)


    EXTENSIONS AND APPLICATIONS

Equality constraints hᵢ(x) = 0, i = 1, . . . , m, can be converted into the two inequality constraints

hᵢ(x) ≤ 0,  −hᵢ(x) ≤ 0.

Separable problems:

minimize Σ_{i=1}^m fᵢ(xᵢ)
subject to Σ_{i=1}^m gᵢⱼ(xᵢ) ≤ 0, j = 1, . . . , r,
xᵢ ∈ Xᵢ, i = 1, . . . , m.

Separable problem with a single constraint:

minimize Σ_{i=1}^n fᵢ(xᵢ)
subject to Σ_{i=1}^n xᵢ ≥ A, αᵢ ≤ xᵢ ≤ βᵢ, ∀ i.


    DUALITY THEOREM I FOR CONVEX PROBLEMS

Strong Duality Theorem - Linear Constraints: Assume that the problem

minimize f(x)
subject to x ∈ X, aᵢ'x − bᵢ = 0, i = 1, . . . , m,
eⱼ'x − dⱼ ≤ 0, j = 1, . . . , r,

has finite optimal value f*. Let also f be convex over ℜⁿ and let X be polyhedral. Then there exists at least one geometric multiplier and there is no duality gap.

Proof Issues

Application to Linear Programming


    COUNTEREXAMPLE

A Convex Problem with a Duality Gap: Consider the two-dimensional problem

minimize f(x)
subject to x₁ ≤ 0, x ∈ X = {x | x ≥ 0},

where

f(x) = e^{−√(x₁x₂)}, ∀ x ∈ X,

and f(x) is arbitrarily defined for x ∉ X.

f is convex over X (its Hessian is positive definite in the interior of X), and f* = 1.

Also, for all μ ≥ 0 we have

q(μ) = inf_{x≥0} { e^{−√(x₁x₂)} + μx₁ } = 0,

since the expression in braces is nonnegative for x ≥ 0 and can approach zero by taking x₁ → 0 and x₁x₂ → ∞. It follows that q* = 0.


    DUALITY THEOREM II FOR CONVEX PROBLEMS

Consider the problem

minimize f(x)
subject to x ∈ X, gⱼ(x) ≤ 0, j = 1, . . . , r.

Assume that X is convex and the functions f : ℜⁿ → ℜ, gⱼ : ℜⁿ → ℜ are convex over X. Furthermore, the optimal value f* is finite and there exists a vector x̄ ∈ X such that

gⱼ(x̄) < 0, ∀ j = 1, . . . , r.

Strong Duality Theorem: There exists at least one geometric multiplier and there is no duality gap.

Extension to linear equality constraints.


    6.252 NONLINEAR PROGRAMMING

    LECTURE 20: STRONG DUALITY

    LECTURE OUTLINE

    Strong Duality Theorem

Linear Equality Constraints

Fenchel Duality

********************************

Consider the problem

minimize f(x)
subject to x ∈ X, gⱼ(x) ≤ 0, j = 1, . . . , r,

assuming −∞ < f* < ∞.

... Normalize (γ = 1), take the infimum over x ∈ X, and use the fact μ ≥ 0, to obtain

f* ≤ inf_{x∈X} { f(x) + μ'g(x) } = q(μ) ≤ sup_{μ≥0} q(μ) = q*.

Using the weak duality theorem, μ is a geometric multiplier and there is no duality gap.


    LINEAR EQUALITY CONSTRAINTS

Suppose we have the additional constraints

eᵢ'x − dᵢ = 0, i = 1, . . . , m.

We need the notion of the affine hull of a convex set X [denoted aff(X)]. This is the intersection of all hyperplanes containing X.

The relative interior of X, denoted ri(X), is the set of all x ∈ X s.t. there exists ε > 0 with

{ z | ‖z − x‖ < ε, z ∈ aff(X) } ⊂ X,

that is, ri(X) is the interior of X relative to aff(X).

Every nonempty convex set has a nonempty relative interior.


    DUALITY THEOREM FOR EQUALITIES

Assumptions:

− The set X is convex and the functions f, gⱼ are convex over X.

− The optimal value f* is finite and there exists a vector x̄ ∈ ri(X) such that

gⱼ(x̄) < 0, j = 1, . . . , r,
eᵢ'x̄ − dᵢ = 0, i = 1, . . . , m.

Under the preceding assumptions there exists at least one geometric multiplier and there is no duality gap.


    COUNTEREXAMPLE

Consider

minimize f(x) = x₁
subject to x₂ = 0, x ∈ X = { (x₁, x₂) | x₁² ≤ x₂ }.

The optimal solution is x* = (0, 0) and f* = 0.

The dual function is given by

q(λ) = inf_{x₁²≤x₂} { x₁ + λx₂ } = −1/(4λ) if λ > 0, −∞ if λ ≤ 0.

There is no dual optimal solution and therefore there is no geometric multiplier. (Even though there is no duality gap.)

The assumptions of the preceding theorem are violated (the feasible set and the relative interior of X have no common point).
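A small check (illustrative) that the dual values creep up to q* = 0 without the sup being attained:

```python
def q(lam):
    # for lam > 0 the inf over X = {x | x1^2 <= x2} of x1 + lam*x2 is
    # attained at x = (-1/(2*lam), 1/(4*lam**2)), giving -1/(4*lam)
    x1 = -1.0 / (2.0 * lam)
    return x1 + lam * x1**2

vals = [q(lam) for lam in (1.0, 10.0, 100.0, 1000.0)]   # -> 0 from below
```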


    FENCHEL DUALITY FRAMEWORK

Consider the problem

minimize f₁(x) − f₂(x)
subject to x ∈ X₁ ∩ X₂,

where f₁ and f₂ are real-valued functions on ℜⁿ, and X₁ and X₂ are subsets of ℜⁿ. Assume that −∞ < f* < ∞.

Convert problem to

minimize f₁(y) − f₂(z)
subject to z = y, y