Introduction to Nonlinear Optimization; and Optimality Conditions for Unconstrained Optimization Problems

Robert M. Freund

March, 2014

© 2014 Massachusetts Institute of Technology. All rights reserved.

1 Introduction to Nonlinear Optimization

1.1 What is Nonlinear Optimization?

A nonlinear optimization problem is a problem of the following general form:

    (P:)  minimize_x  f(x)
          s.t.        x ∈ F ,
                      x ∈ Rn .        (1)

Here x = (x1, . . . , xn)T are the variables, or unknowns, and are usually referred to as the decision variables. The function f(·) : Rn → R is called the objective function, and it is this function that we seek to minimize or, more generally, to optimize. The expression "x ∈ F" states that the decision variables x must lie in some specified set F. If the goal is maximization instead of minimization, then this can be formulated either by replacing "minimize" by "maximize" above, or by simply replacing f(·) by −f(·).

    1.2 History of Nonlinear Optimization, in Brief

Most of the theoretical and computational interest in nonlinear optimization has taken place since 1947. However, it is useful to note that nonlinear optimization was first studied as early as the 1600s. Indeed, both Fermat (1638) and Newton (1670) studied the 1-dimensional nonlinear optimization problem:

    min_x  f(x)   where x ∈ R ,

and Newton developed the classical optimality condition: df(x)/dx = 0. This was generalized by Euler (1755) to multivariate nonlinear optimization:

    min f(x1, . . . , xn)

    with the optimality condition: ∇f(x) = 0 .

In the late 1700s both Euler and Lagrange examined problems in infinite dimensions and developed the calculus of variations. The study of optimization received very little attention thereafter until 1947, when modern linear optimization was developed by Dantzig. In the late 1940s and early 1950s Kuhn and Tucker (preceded by Karush) developed general optimality conditions for nonlinear constrained optimization. At that time, applications of nonlinear optimization included problems in chemical engineering, portfolio optimization, and a host of decision problems in management and industrial operations. The 1960s saw the development of larger-scale nonlinear optimization coincident with the expanding capabilities of computers. In 1984, new possibilities in optimization methods were developed with Karmarkar's interior-point method for linear optimization, which was extended to nonlinear convex optimization by Nesterov and Nemirovskii in 1994. In the 1990s the important fields of semidefinite optimization, conic optimization, and "robust optimization" were developed. Since 2000 there has been great interest and progress in "huge-scale" nonlinear optimization, which considers problems in thousands or even millions of decision variables and constraints.

    1.3 Some Applications of Optimization

Applications of nonlinear optimization arise in just about every human endeavor. (In fact, one could argue that as individuals and/or societies we constantly seek to make the choices that will optimize our own utility or happiness, or to optimize what is best for our society.) Nonlinear optimization problems arise throughout our industrial economy. For example, optimization problems arise in urban vehicle traffic management, in airline transportation management, and in the logistics of determining how to get goods and/or people to their destinations at least cost. In airline transportation, optimization arises in everything from determining the shape of an aircraft wing (and its skeleton) to optimize its lift/weight, to determining which crew should be assigned to which flight routes to minimize costs. Optimization is used to determine prices and inventories of products in stores and in online enterprises, as well as to compute advertising strategies and online banner advertisements for websites. In the finance industry, nonlinear optimization is used to compute efficient asset portfolios, to price financial instruments, and to determine trade strategies of investment funds. In data mining, optimization is used to efficiently decompose data into its principal components, to classify data by properties, and to predict everything from economic trends to the most desirable next purchase of an individual consumer at an online retailer (recommender systems). In engineering, nonlinear optimization is used to compute antenna designs, sensor networks, optimal control policies, etc. In metamaterial design, optimization is used to design materials with specific wave-management properties. In physical systems, optimization is a governing principle behind physical laws; indeed, many such laws arise as the solution of minimum-energy principles. The list of applications of nonlinear optimization is virtually endless. Let us now visit several important applications that illustrate the structure, usefulness, and scope of nonlinear optimization.

    1.3.1 Pricing Models

Consider a demand function d(p) for a single good as a function of the price p of the good, whose unit cost of production is a scalar c. We presume that d(p) is a decreasing function, and we model it, perhaps locally, as:

    d(p) = d̄ − b̄p ,

where b̄ > 0. If C is our production capacity, we seek to maximize contribution to earnings by solving:

    (P)  max_p  f(p) := (d̄ − b̄p) × (p − c)
         s.t.   d̄ − b̄p ≥ 0
                p ≥ 0
                d̄ − b̄p ≤ C .

Now let us generalize this model to n differentiated substitutable products. Demand for our goods is a function of the vector of prices p = (p1, . . . , pn), and we presume that d(p) := (d1(p), . . . , dn(p)) are linear functions of the vector p of prices:

    d1(p) = d̄1 − b̄11p1 − b̄12p2 − . . . − b̄1npn
     ⋮
    dn(p) = d̄n − b̄n1p1 − b̄n2p2 − . . . − b̄nnpn .

    We write this in vector form as:

    d(p) = d̄− B̄p .

Let the per unit cost of production of good j be cj, and let Cj denote the production capacity of good j, for j = 1, . . . , n. We write the vector of costs as c = (c1, . . . , cn) and the vector of production capacities as C = (C1, . . . , Cn). We seek to maximize contribution to earnings by solving:

    (P)  max_p  f(p) := (d̄ − B̄p)T(p − c)
         s.t.   d̄ − B̄p ≥ 0
                p ≥ 0
                d̄ − B̄p ≤ C .
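For the single-good model, the optimal price can be computed in closed form, since f(p) is a concave quadratic in p. The sketch below works through one illustrative instance; the numbers d̄, b̄, c, and C are assumptions, not values from the notes.

```python
# Single-good pricing: maximize f(p) = (d_bar - b_bar*p) * (p - c)
# subject to 0 <= demand <= C and p >= 0.
# Illustrative numbers (assumptions, not data from the notes).
d_bar, b_bar, c, C = 100.0, 2.0, 10.0, 80.0

# f is a concave quadratic in p; its unconstrained maximizer is the
# vertex p = (d_bar/b_bar + c) / 2.
p_vertex = (d_bar / b_bar + c) / 2.0

# The constraints d_bar - b_bar*p >= 0, d_bar - b_bar*p <= C, and p >= 0
# restrict p to the interval [max(0, (d_bar - C)/b_bar), d_bar/b_bar].
p_lo = max(0.0, (d_bar - C) / b_bar)
p_hi = d_bar / b_bar

# Since f is concave, clipping the vertex to the interval is optimal.
p_star = min(max(p_vertex, p_lo), p_hi)
profit = (d_bar - b_bar * p_star) * (p_star - c)
print(p_star, profit)  # 30.0 800.0
```

Here the unconstrained maximizer happens to be feasible, so no clipping occurs; with a tighter capacity C the price would be pushed to the boundary of the interval.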

    1.3.2 A Pattern Classification Training Problem

By way of illustration, suppose we are given a set of pixelated digital images of faces. Suppose we have k male face images, expressed in digital form as a1, . . . , ak ∈ Rn (where each of the n coefficients corresponds to a pixel), and m female face images expressed in digital form as b1, . . . , bm ∈ Rn. We would like to compute a linear decision rule comprised of a vector v ∈ Rn and a scalar β that will satisfy vTai > β for all i = 1, . . . , k and vT bi < β for all i = 1, . . . , m. If we can compute v, β that satisfy these conditions, we can then use v, β to establish a prediction procedure for the gender of any other digital face image: given a face image c ∈ Rn, we will declare whether c is the image of a male or female face as follows:


    • if vT c > β, then we declare that c is a male face image, and

    • if vT c < β, then we declare that c is a female face image.

    Much more generally, suppose we are given:

    • points a1, . . . , ak ∈ Rn that have property “P”, and

    • points b1, . . . , bm ∈ Rn that do not have property “P”.

We would like to use these k + m points to develop a linear rule that can be used to predict whether or not other points c have property P. In particular, we seek a vector v and a scalar β for which:

    • vTai > β for all i = 1, . . . , k

    • vT bi < β for all i = 1, . . . ,m.

We will then use v, β to predict whether or not any other point c has property P. If we are given another vector c, we will declare whether c has property P as follows:

    • if vT c > β, then we declare that c has property P, and

    • if vT c < β, then we declare that c does not have property P.

    We therefore seek v, β that define the hyperplane

    Hv,β := {x : vTx = β}

    for which:

    • vTai > β for all i = 1, . . . , k

    • vT bi < β for all i = 1, . . . ,m.


Figure 1: Illustration of the pattern classification problem.

    This is illustrated in Figure 1.

In addition, we would like the hyperplane Hv,β to be as far away from the points a1, . . . , ak, b1, . . . , bm as possible. Elementary analysis shows that the distance of the hyperplane Hv,β to any point ai is equal to

    (vTai − β) / ∥v∥ ,

and similarly the distance of the hyperplane Hv,β to any point bi is equal to

    (β − vT bi) / ∥v∥ .

If we normalize the vector v so that ∥v∥ = 1, then the minimum distance of the hyperplane Hv,β to the points a1, . . . , ak, b1, . . . , bm is then:

    min{vTa1 − β, . . . , vTak − β, β − vT b1, . . . , β − vT bm} .

    Therefore we would also like v and β to satisfy:

    • ∥v∥ = 1, and

    • min{vTa1 − β, . . . , vTak − β, β − vT b1, . . . , β − vT bm} is maximized.

    This is formulated as the following optimization problem:


    PCP :  maximize_{v,β,δ}  δ
           s.t.  vTai − β ≥ δ,  i = 1, . . . , k
                 β − vT bi ≥ δ,  i = 1, . . . , m
                 ∥v∥ = 1,
                 v ∈ Rn, β ∈ R .

Now consider dividing each of the constraints of PCP by δ. If we perform the following transformation of variables:

    x = v/δ ,   α = β/δ ,

then the constraints on the data points become:

    xTai − α ≥ 1 ,  i = 1, . . . , k
    α − xT bi ≥ 1 ,  i = 1, . . . , m .

Also, maximizing δ is the same as maximizing ∥v∥/∥x∥ = 1/∥x∥, which is the same as minimizing ∥x∥. Therefore we can write the equivalent problem:

    CPCP :  minimize_{x,α}  ∥x∥
            s.t.  xTai − α ≥ 1,  i = 1, . . . , k
                  α − xT bi ≥ 1,  i = 1, . . . , m
                  x ∈ Rn, α ∈ R .

Figure 2: Illustration of the minimum norm problem.

The formulation CPCP has the nice property that all of the constraints are linear functions of the decision variables, and the objective function is the relatively simple function ∥x∥. We can solve CPCP for x, α, and substitute

    v = x/∥x∥ ,   β = α/∥x∥ ,   δ = 1/∥x∥

to obtain the solution of PCP .
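The reformulation above can be solved numerically with a general-purpose solver. The sketch below does so with scipy's SLSQP method; the two-dimensional data points and the choice of solver are illustrative assumptions, not part of the notes. Minimizing ½∥x∥² in place of ∥x∥ gives the same minimizer and is smoother for the solver.

```python
# A minimal sketch of solving CPCP numerically (the data points and the use
# of scipy's SLSQP solver are illustrative assumptions, not from the notes).
import numpy as np
from scipy.optimize import minimize

# Points with property P (the a^i) and without it (the b^i), in R^2.
A = np.array([[2.0, 2.0], [3.0, 3.0]])
B = np.array([[0.0, 0.0], [1.0, 0.0]])

# Decision variables z = (x_1, x_2, alpha).
def objective(z):
    return 0.5 * np.dot(z[:2], z[:2])

cons = (
    [{'type': 'ineq', 'fun': lambda z, a=a: a @ z[:2] - z[2] - 1.0} for a in A]
    + [{'type': 'ineq', 'fun': lambda z, b=b: z[2] - b @ z[:2] - 1.0} for b in B]
)

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=cons)
x, alpha = res.x[:2], res.x[2]

# Recover the PCP solution: v = x/||x||, beta = alpha/||x||, delta = 1/||x||.
norm_x = np.linalg.norm(x)
v, beta, delta = x / norm_x, alpha / norm_x, 1.0 / norm_x
print(v, beta, delta)
```

The recovered v has unit norm and δ is the margin of the separating hyperplane, exactly as in the change of variables above.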

    1.3.3 The Minimum Norm Problem

Given a vector c, we would like to find the closest point to c that satisfies the linear inequalities Ax ≤ b.

    This problem is formulated as:

    MNP :  minimize_x  ∥x − c∥
           s.t.  Ax ≤ b
                 x ∈ Rn .

    Figure 2 illustrates the formulation of the minimum norm problem.
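In the special case of a single inequality aTx ≤ b, MNP has a closed-form solution: the projection of c onto a halfspace. The sketch below works one illustrative instance (the vectors a, b, c are assumptions).

```python
# Sketch: MNP with a single inequality a^T x <= b has the closed-form solution
# x* = c - max(0, a^T c - b) / ||a||^2 * a  (projection of c onto a halfspace).
# The numbers below are illustrative assumptions.
import numpy as np

a = np.array([1.0, 1.0])
b = 2.0
c = np.array([3.0, 3.0])   # violates a^T c <= b, so the projection moves it

viol = max(0.0, a @ c - b)          # amount by which c violates the constraint
x_star = c - (viol / (a @ a)) * a   # step along -a back to the boundary

print(x_star)  # [1. 1.], which lies on the hyperplane a^T x = b
```

If c already satisfies the inequality, then viol = 0 and the projection is c itself.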


  • 1.3.4 The Fermat-Weber Problem

We are given m points c1, . . . , cm ∈ Rn. We would like to determine the location of a distribution center at the point x ∈ Rn that minimizes the sum of the distances from x to each of the points c1, . . . , cm. This problem is illustrated in Figure 3. It has the following formulation:

    FWP :  minimize_x  Σ_{i=1}^m ∥x − ci∥
           s.t.  x ∈ Rn .
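One standard method for FWP, not prescribed by these notes but natural to mention here, is Weiszfeld's iteration, which repeatedly re-centers x at a distance-weighted average of the points. The data points and iteration count below are illustrative assumptions.

```python
# Sketch of Weiszfeld's iteration for the Fermat-Weber problem.
import numpy as np

pts = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])

x = pts.mean(axis=0)          # start at the centroid
for _ in range(100):
    d = np.linalg.norm(pts - x, axis=1)
    if np.any(d < 1e-12):     # iterate landed on a data point; stop
        break
    w = 1.0 / d               # weights: inverse distances
    x = (w[:, None] * pts).sum(axis=0) / w.sum()

print(x)  # for these symmetric points, the minimizer is the center (2, 2)
```

For asymmetric point sets the iterate moves away from the centroid toward the points that are closest, since those receive the largest weights.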

    1.3.5 A Ball Covering Problem

We are given m points c1, . . . , cm ∈ Rn. We would like to determine the location of a distribution center at the point x ∈ Rn that minimizes the maximum distance from x to any of the points c1, . . . , cm. This problem is illustrated in Figure 4. It has the following formulation:

    BCP :  minimize_{x,δ}  δ
           s.t.  ∥x − ci∥ ≤ δ,  i = 1, . . . , m,
                 x ∈ Rn .
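A tiny instance of BCP can be checked by brute force: minimize the max-distance objective over a grid. The three points and the grid resolution below are illustrative assumptions; for these points the minimum enclosing ball is the circumcircle, centered at (1, 1) with radius √2.

```python
# Sketch: solving a tiny BCP instance by brute-force grid search.
import math

pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]

def max_dist(x, y):
    # the smallest feasible delta at center (x, y) is max_i ||(x,y) - c_i||
    return max(math.hypot(x - px, y - py) for px, py in pts)

best = None
for i in range(201):                 # grid over [0, 2] x [0, 2], step 0.01
    for j in range(201):
        x, y = i * 0.01, j * 0.01
        d = max_dist(x, y)
        if best is None or d < best[0]:
            best = (d, x, y)

delta, cx, cy = best
print(cx, cy, delta)  # approximately (1, 1) and sqrt(2)
```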

Figure 3: Illustration of the Fermat-Weber problem.

Figure 4: Illustration of the ball covering problem.

1.3.6 The Analytic Center Problem

Figure 5: Illustration of the analytic center problem.

Given a system of linear inequalities Ax ≤ b, we would like to determine a "nicely" interior point x̂ that satisfies Ax̂ < b. Of course, we would like the point to be as interior as possible, in some mathematically meaningful way. Consider the following problem, which is referred to as the "analytic center problem":

    ACP :  maximize_x  Π_{i=1}^m (b − Ax)i
           s.t.  Ax ≤ b,
                 x ∈ Rn .

Let us call the solution x̂ to this problem the "analytic center" of the linear inequality system "Ax ≤ b".

The analytic center problem is illustrated in Figure 5. The analytic center x̂ has the following rather remarkable (and non-obvious) property, which we state but do not prove:

Remark 1.1 Suppose that x̂ solves the analytic center problem ACP. Then for each i = 1, . . . , m, it holds that:

    (b − Ax̂)i ≥ (1/m) · max_{x:Ax≤b} (b − Ax)i .

If we assume that the strict linear inequality system "Ax < b" has a solution, then ACP is equivalent to:

    CACP :  minimize_x  Σ_{i=1}^m − ln((b − Ax)i)
            s.t.  Ax < b,
                  x ∈ Rn .
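The CACP objective can be minimized by a simple gradient method when a strictly feasible starting point is known. The sketch below computes the analytic center of the interval 0 ≤ x ≤ 1; the problem data, step size, and iteration count are illustrative assumptions.

```python
# Sketch: gradient descent on sum_i -ln((b - Ax)_i) for the interval
# 0 <= x <= 1, whose analytic center is x = 0.5 (it maximizes (1-x)*x).
import numpy as np

A = np.array([[1.0], [-1.0]])   # encodes x <= 1 and -x <= 0
b = np.array([1.0, 0.0])

x = np.array([0.25])            # any strictly feasible starting point
for _ in range(500):
    s = b - A @ x               # slacks, must stay positive
    grad = A.T @ (1.0 / s)      # gradient of -sum(ln(s)) with respect to x
    x = x - 0.01 * grad

print(x)  # approaches 0.5
```

At x = 0.5 both slacks equal 0.5, consistent with Remark 1.1: each slack is at least (1/m) = 1/2 of its maximum possible value, which here is 1.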

    1.3.7 Least Squares Problems, Model Reconstruction, and Curve-Fitting

We are given m functions g1(x), . . . , gm(x), where gi(·) : Rn ↦ Rri, i = 1, . . . , m, and we are interested in finding a value of x that minimizes the sum of the squares of the norms of these functions. This problem is:

    LS :  minimize_x  ½∥g(x)∥² := ½ Σ_{i=1}^m ∥gi(x)∥²
          s.t.  x ∈ Rn .

Examples include linear and nonlinear regression, model reconstruction (inverse problems), and curve-fitting (splines).

In model reconstruction, we are given input/output data (yi, zi), i = 1, . . . , m, from a physical system, and we postulate the input/output relationship

    z = h(x, y) ,

where x is a vector of unknown parameters, and the general form of h(·, ·) is known and given. We thus seek the value of the parameter vector x that best matches the data. One way to model this is to solve:

    MR :  minimize_x  ½ Σ_{i=1}^m ∥zi − h(x, yi)∥²
          s.t.  x ∈ Rn .

As a more specific example, let us consider the problem of fitting points to a quartic polynomial. Here

    h(x, y) = x0 + x1y + x2y² + x3y³ + x4y⁴ ,

and x = (x0, x1, x2, x3, x4) is the vector of unknown coefficients of the polynomial. Suppose we are given data (yi, zi), i = 1, . . . , m. Then we seek to solve:

    CF :  minimize_x  ½ Σ_{i=1}^m |zi − (x0 + x1yi + x2yi² + x3yi³ + x4yi⁴)|²
          s.t.  x ∈ Rn .
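Since CF is a linear least-squares problem in the coefficients, it can be solved directly. The sketch below uses numpy's polynomial fit on synthetic noiseless data generated from a known quartic (an illustrative assumption), so the fit recovers the coefficients exactly.

```python
# Sketch: solving CF with numpy's least-squares polynomial fit (np.polyfit).
import numpy as np

true_x = np.array([1.0, 2.0, -1.0, 0.5, 0.25])   # (x0, x1, x2, x3, x4)

y = np.linspace(-2.0, 2.0, 9)
z = sum(c * y**j for j, c in enumerate(true_x))  # noiseless observations

# np.polyfit minimizes sum_i |z_i - p(y_i)|^2 over degree-4 polynomials p;
# it returns coefficients with the highest degree first.
coeffs = np.polyfit(y, z, 4)[::-1]               # reorder to (x0, ..., x4)

print(coeffs)
```

With noisy data the same call gives the least-squares estimate of x rather than an exact recovery.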

    1.3.8 The Circumscribed Ellipsoid Problem

We are given m points c1, . . . , cm ∈ Rn. We would like to determine an ellipsoid of minimum volume that contains each of the points c1, . . . , cm. This problem is illustrated in Figure 6.

Before we show the formulation of this problem, first recall that a symmetric positive definite (SPD) matrix R and a given point z can be used to define an ellipsoid in Rn:

    ER,z := {y | (y − z)TR(y − z) ≤ 1} .

Figure 6: Illustration of the circumscribed ellipsoid problem.

Figure 7: Illustration of an ellipsoid.

Here R is the shape matrix of the ellipsoid and z is the center of the ellipsoid. Figure 7 shows an illustration of an ellipsoid.

One can prove that the volume of ER,z is proportional to √det(R⁻¹).

    Our problem is:

    MCP1 :  minimize_{R,z}  [det(R⁻¹)]^{1/2}
            s.t.  ci ∈ ER,z,  i = 1, . . . , m,
                  R is SPD .

Of course, minimizing [det(R⁻¹)]^{1/2} is the same as minimizing ln([det(R⁻¹)]^{1/2}), since the logarithm function is strictly increasing in its argument. Also,

    ln([det(R⁻¹)]^{1/2}) = −½ ln(det(R)) ,

and so our problem is equivalent to:

    MCP2 :  minimize_{R,z}  − ln(det(R))
            s.t.  (ci − z)TR(ci − z) ≤ 1,  i = 1, . . . , m
                  R is SPD .

We now factor R = M² where M is SPD (that is, M is a square root of R), and then MCP2 becomes:

    MCP3 :  minimize_{M,z}  − ln(det(M²))
            s.t.  (ci − z)TMTM(ci − z) ≤ 1,  i = 1, . . . , m,
                  M is SPD .

    which is the same as:

    MCP4 :  minimize_{M,z}  −2 ln(det(M))
            s.t.  ∥M(ci − z)∥ ≤ 1,  i = 1, . . . , m,
                  M is SPD .

Next substitute y = Mz to obtain:

    MCP5 :  minimize_{M,y}  −2 ln(det(M))
            s.t.  ∥Mci − y∥ ≤ 1,  i = 1, . . . , m,
                  M is SPD .

We can recover R and z after solving MCP5 by substituting R = M² and z = M⁻¹y.
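The chain of reformulations rests on the identity (c − z)TR(c − z) = ∥Mc − y∥² when R = M², y = Mz, and M is symmetric. The sketch below checks this numerically; the particular matrix M and random points are illustrative assumptions.

```python
# Sketch: numerical check of the change of variables R = M^2, y = M z used
# to pass from MCP2 to MCP5.
import numpy as np

M = np.array([[2.0, 0.5], [0.5, 1.0]])   # SPD (both eigenvalues positive)
z = np.array([1.0, -1.0])
R = M @ M
y = M @ z

rng = np.random.default_rng(0)
for c in rng.normal(size=(50, 2)):
    lhs = (c - z) @ R @ (c - z)          # MCP2 constraint value
    rhs = np.linalg.norm(M @ c - y)**2   # MCP5 constraint value, squared
    assert abs(lhs - rhs) < 1e-9         # identical, so the constraints agree
print("transformation verified")
```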

    1.3.9 Portfolio Management Optimization

Portfolio optimization models are used throughout the financial investment management sector. These are nonlinear models that are used to determine the composition of investment portfolios.

Investors prefer higher annual rates of return on investing to lower annual rates of return. Furthermore, investors prefer lower risk to higher risk. Portfolio optimization seeks to optimally trade off risk and return in investing.

We consider n assets, whose annual rates of return Ri are random variables, i = 1, . . . , n. The expected annual return of asset i is μi, i = 1, . . . , n, and so if we invest a fraction xi of our investment dollar in asset i, the expected return of the portfolio is:

    Σ_{i=1}^n μi xi = μTx ,

where of course the fractions xi must satisfy:

    Σ_{i=1}^n xi = eTx = 1   and   x ≥ 0 .

Here we use the notation that e is the vector of ones, namely e = (1, 1, . . . , 1)T.

The covariance of the rates of return of assets i and j is given as

    Qij = COV(Ri, Rj) .

We can think of the Qij values as forming a matrix Q, whereby the variance of the portfolio is then:

    Σ_{i=1}^n Σ_{j=1}^n COV(Ri, Rj) xi xj = Σ_{i=1}^n Σ_{j=1}^n Qij xi xj = xTQx .

The "risk" in the portfolio is the standard deviation of the portfolio:

    STDEV = √(xTQx) ,

and the "reward" of the portfolio is the expected annual rate of return of the portfolio:

    RETURN = μTx .

Suppose that we would like to determine the fractional investment values x1, . . . , xn in order to maximize the return of the portfolio, subject to meeting some pre-specified target risk level. For example, we might want to ensure that the standard deviation of the portfolio is at most 13.0%. We can formulate the following nonlinear optimization model:

    MAXIMIZE:       RETURN = μTx
    s.t.
    FSUM:           eTx = 1
    RISK:           √(xTQx) ≤ 13.0
    NONNEGATIVITY:  x ≥ 0 .

An alternative version of the basic portfolio optimization model is to determine the fractional investment values x1, . . . , xn in order to minimize the risk of the portfolio, subject to meeting a pre-specified target expected return level. For example, we might want to ensure that the expected return of the portfolio is at least 16.0%. We can formulate the following nonlinear optimization model:

    MINIMIZE:       RISK = √(xTQx)
    s.t.
    FSUM:           eTx = 1
    RETURN:         μTx ≥ 16.0
    NONNEGATIVITY:  x ≥ 0 .

Variations of this family of optimization models are used pervasively by asset management companies worldwide. Also, these models can be extended to yield the CAPM (Capital Asset Pricing Model). Finally, we point out that the Nobel Prize in economics was awarded in 1990 to Merton Miller, William Sharpe, and Harry Markowitz for their work on portfolio theory and portfolio models (and the implications for asset pricing).

The family of models just considered is conceptually simple but is obviously a simplification of the real situation. In reality, modern portfolio optimization models incorporate many features including limited short-selling, transaction costs, effects of buys and sells on market prices themselves, turnover, liquidity, and volatility constraints, regulatory restrictions, multiple periods, and robustness considerations that explicitly take account of the fact that the expected return data μ and the covariance data Q are not known with certainty, and hence the models must use imperfect estimates of the true data values.
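Evaluating the two objectives for a candidate portfolio is a direct computation. The sketch below does so for a two-asset example; the data (with returns in percent) are illustrative assumptions.

```python
# Sketch: evaluating RETURN and RISK for a candidate portfolio.
import numpy as np

mu = np.array([18.0, 10.0])               # expected annual returns (%)
Q = np.array([[400.0, -60.0],             # covariance of returns; the negative
              [-60.0, 100.0]])            # entry means the assets partly hedge

x = np.array([0.5, 0.5])                  # invest half in each asset

ret = mu @ x                              # portfolio expected return: mu^T x
risk = np.sqrt(x @ Q @ x)                 # portfolio standard deviation

print(ret, risk)  # 14.0 and sqrt(100 - 30 + 25) = sqrt(95)
```

Note that the portfolio standard deviation √95 ≈ 9.75 is below both individual asset standard deviations (20 and 10): this is the diversification effect that the covariance term captures.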

1.4 Two Important Formats for the General Nonlinear Optimization Problem

There are two important formats for general nonlinear optimization problems. The first is the constraint functions format, which is as follows:

    NLP :  min_x  f(x)
           s.t.  g1(x) ≤ 0
                  ⋮
                 gm(x) ≤ 0
                 h1(x) = 0
                  ⋮
                 hl(x) = 0
                 x ∈ X ,

    where X is most typically the entire space X = Rn, but occasionally is someother open (or occasionally closed) set. Also,

    f(·) : Rn 7→ R ,

    gi(·) : Rn 7→ R, i = 1, . . . ,m

    hj(·) : Rn 7→ R, j = 1, . . . , l

    where all of these functions are continuous and usually differentiable.

    The feasible region of this problem is the set:

    F = {x : g1(x) ≤ 0, . . . , gm(x) ≤ 0, h1(x) = 0, . . . , hl(x) = 0, x ∈ X} .

Let g(x) = (g1(x), . . . , gm(x)) : Rn → Rm and h(x) = (h1(x), . . . , hl(x)) : Rn → Rl. Then (NLP) can be written as

    (NLP)  min_x  f(x)
           s.t.  g(x) ≤ 0
                 h(x) = 0
                 x ∈ X .        (2)

The other important format is the subset inclusion format, which is as follows:

    (CNLS)  min_x  f(x)
            s.t.  b − Ax ∈ K
                  x ∈ C ,        (3)

where the affine transformation x ↦ b − Ax maps from Rn to Rm, and K ⊂ Rm and C ⊂ Rn. In most of this course, when we use this format, K and C will be convex cones, which we will discuss in a few lectures from now.

    Here we say that the feasible region of this problem is the set:

    F = {x : b−Ax ∈ K, x ∈ C} .
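Membership in the feasible region of the constraint functions format (2) is easy to test computationally once g and h are coded. The sketch below uses a small illustrative instance (the particular constraint functions are assumptions): g(x) = (x1² + x2² − 1,) and h(x) = (x1 − x2,), whose feasible region is the part of the line x1 = x2 inside the unit disc.

```python
# Sketch: checking feasibility in the constraint-functions format (2).
def g(x):  # inequality constraints, must satisfy g_i(x) <= 0 componentwise
    return (x[0]**2 + x[1]**2 - 1.0,)

def h(x):  # equality constraints, must satisfy h_j(x) == 0 componentwise
    return (x[0] - x[1],)

def is_feasible(x, tol=1e-9):
    return all(gi <= tol for gi in g(x)) and all(abs(hj) <= tol for hj in h(x))

print(is_feasible((0.5, 0.5)))   # True: on the line, inside the disc
print(is_feasible((1.0, 0.0)))   # False: violates x1 = x2
```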

1.5 Some Special Classes of Nonlinear Optimization Problems

There are some useful special classes of nonlinear optimization problems. Let X = Rn.

An unconstrained nonlinear optimization problem is a problem with format:

    (UP)  min_x  f(x)
          x ∈ X .

A linearly constrained nonlinear optimization problem is a problem with format:

    (LC) minx f(x)

    s.t. Ax ≤ b

    Mx = d

    x ∈ X .

    A linear optimization problem is a problem with format:

    (LP) minx cTx

    s.t. Ax ≤ b

    Mx = d

    x ∈ X .

    A quadratic optimization problem is a problem with format:

    (QP)  min_x  cTx + ½ xTQx
          s.t.  Ax ≤ b
                Mx = d
                x ∈ X .

A quadratically constrained optimization problem is a problem with format:

    (QC)  min_x  f(x)
          s.t.  gi(x) := ½ xTQix + (ci)Tx − di ≤ 0,  i = 1, . . . , m
                Mx = d
                x ∈ X .

A separable optimization problem is a problem with format:

    (SP)  min_x  Σ_{j=1}^n fj(xj)
          s.t.  gi(x) := Σ_{j=1}^n gij(xj) ≤ 0 ,  i = 1, . . . , m
                hi(x) := Σ_{j=1}^n hij(xj) = 0 ,  i = 1, . . . , l
                x ∈ X .

    A conic optimization problem is a problem with format:

    (CP) minx cTx

    s.t. b−Ax ∈ K

    x ∈ C ,

    where K and C are convex cones (to be defined later).

    1.6 Characterizations of Optimal Solutions: a Preview

From both a theoretical and computational point of view, it is important to understand ways to characterize (or just partially characterize) an optimal solution of a nonlinear optimization problem.

Consider the following 2-dimensional nonlinear optimization problem in variables (x, y):

    (P(a,b) :)  min_{x,y}  √((x − a)² + (y − b)²)
                s.t.  √((x − 8)² + (y − 9)²) ≤ 7
                      x ≤ 13
                      x ≥ 2
                      x + y ≤ 24 ,

where we are given the point (a, b) as part of the problem description. The feasible region F of P(a,b) is the intersection of four sets: (i) the disc of radius 7 centered at the point (8, 9), (ii) the halfspace of points whose x-coordinate is at most 13, (iii) the halfspace of points whose x-coordinate is at least 2, and (iv) the halfspace characterized by x + y ≤ 24. Notice that the objective function of P(a,b) is simply the Euclidean distance between (x, y) and the point (a, b).

Suppose that (a, b) = (16, 14). Then the solution of P(a,b) is the point (x, y)* = (13, 11), which is a "corner" of the feasible region F and is where the two constraints x ≤ 13 and x + y ≤ 24 are satisfied at equality.

Now suppose that (a, b) = (14, 14). Then the solution of P(a,b) is the point (x, y)* = (12, 12), which lies on the boundary of the feasible region F but is not at a corner. Indeed, the point (12, 12) lies in the "middle" of the boundary line segment corresponding to where the constraint x + y ≤ 24 is satisfied at equality.

Next suppose that (a, b) = (8, 8). Then the solution of P(a,b) is the point (x, y)* = (8, 8), which lies in the interior of the feasible region F.

These examples illustrate that unlike in the case of linear optimization, the optimal solution does not necessarily occur at a "corner" of the feasible region or even on the boundary of the feasible region. The optimal solution will be where the feasible region just touches the best iso-objective function line. This concept will be made more precise later on.
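The three cases above can be checked numerically with a brute-force grid search over the feasible region (the grid resolution is an illustrative choice, so this only locates the minimizers approximately).

```python
# Sketch: grid check of the three cases of P(a,b) discussed above.
import math

def feasible(x, y, tol=1e-9):
    return (math.hypot(x - 8.0, y - 9.0) <= 7.0 + tol
            and 2.0 - tol <= x <= 13.0 + tol
            and x + y <= 24.0 + tol)

def solve(a, b, step=0.05):
    best, arg = float("inf"), None
    for i in range(int(11.0 / step) + 1):        # x in [2, 13]
        x = 2.0 + i * step
        for j in range(int(14.0 / step) + 1):    # y in [2, 16] covers the disc
            y = 2.0 + j * step
            if feasible(x, y):
                d = math.hypot(x - a, y - b)
                if d < best:
                    best, arg = d, (x, y)
    return arg

print(solve(16, 14))  # near (13, 11): a corner of F
print(solve(14, 14))  # near (12, 12): on the boundary, but not at a corner
print(solve(8, 8))    # (8, 8): in the interior of F
```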

2 Unconstrained Optimization: Characterization of Optimal Solutions

    2.1 Local, Global, and Strict Optima

    Consider the following optimization problem over the set F :

    P : minx or maxx f(x)

    s.t. x ∈ F .

Here F is the "feasible region" of the problem, and recall that F corresponds to F = {x : g1(x) ≤ 0, . . . , gm(x) ≤ 0, h1(x) = 0, . . . , hl(x) = 0, x ∈ X} for the constraint functions format for NLP in (2), or to F = {x : b − Ax ∈ K, x ∈ C} for the subset inclusion format (3). Of course, if there are no constraints in the problem, then F = Rn and we will call P an unconstrained optimization problem.

    The ball centered at x̄ with radius ε is the set:

    B(x̄, ε) := {x|∥x− x̄∥ ≤ ε} .

We have the following definitions of local/global and strict/non-strict minima/maxima.

Definition 2.1 x ∈ F is a local minimum of P if there exists ε > 0 such that f(x) ≤ f(y) for all y ∈ B(x, ε) ∩ F.

Definition 2.2 x ∈ F is a strict local minimum of P if there exists ε > 0 such that f(x) < f(y) for all y ∈ B(x, ε) ∩ F, y ≠ x.

Definition 2.3 x ∈ F is a global minimum of P if f(x) ≤ f(y) for all y ∈ F.

Definition 2.4 x ∈ F is a strict global minimum of P if f(x) < f(y) for all y ∈ F, y ≠ x.

Definition 2.5 x ∈ F is a local maximum of P if there exists ε > 0 such that f(x) ≥ f(y) for all y ∈ B(x, ε) ∩ F.

Definition 2.6 x ∈ F is a strict local maximum of P if there exists ε > 0 such that f(x) > f(y) for all y ∈ B(x, ε) ∩ F, y ≠ x.

Definition 2.7 x ∈ F is a global maximum of P if f(x) ≥ f(y) for all y ∈ F.

Definition 2.8 x ∈ F is a strict global maximum of P if f(x) > f(y) for all y ∈ F, y ≠ x.
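The distinction between local and global minima can be seen concretely on a one-dimensional example. The sketch below scans f(x) = 2x⁴ − 4x² + x on a grid (the function and grid are illustrative assumptions; a grid check like this only detects minima up to the grid spacing).

```python
# Sketch: locating the local and global minima of f(x) = 2x^4 - 4x^2 + x.
def f(x):
    return 2*x**4 - 4*x**2 + x

xs = [i * 0.001 - 2.0 for i in range(4001)]   # grid on [-2, 2]
vals = [f(x) for x in xs]

# interior grid points that are local minima relative to their neighbors
local_mins = [xs[i] for i in range(1, len(xs) - 1)
              if vals[i] < vals[i - 1] and vals[i] < vals[i + 1]]

global_min = min(xs, key=f)
print(local_mins, global_min)
```

The function has two local minima (near x ≈ −1.06 and x ≈ 0.93); only the left one is the global minimum, so a local minimum need not be a global minimum.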

    2.2 Gradients and Hessians

Let f(·) : X → R, where X ⊂ Rn is open. f(·) is differentiable at x̄ ∈ X if there exists a vector ∇f(x̄) (called the gradient of f(·) at x̄) such that for each x ∈ X it holds that:

    f(x) = f(x̄) + ∇f(x̄)T(x − x̄) + ∥x − x̄∥ α(x̄, x − x̄) ,

where lim_{y→0} α(x̄, y) = 0. f(·) is differentiable on X if f(·) is differentiable at every x̄ ∈ X. In the standard coordinate system for Rn, the gradient vector corresponds to the vector of partial derivatives:

    ∇f(x̄) = (∂f(x̄)/∂x1, . . . , ∂f(x̄)/∂xn)T .

Example 2.1 Let f(x) = 3(x1)²(x2)³ + (x2)²(x3)³. Then

    ∇f(x) = (6(x1)(x2)³, 9(x1)²(x2)² + 2(x2)(x3)³, 3(x2)²(x3)²)T .

The directional derivative of f(·) at x̄ in the direction d is:

    lim_{λ→0} [f(x̄ + λd) − f(x̄)] / λ .

Proposition 2.1 The directional derivative satisfies:

    lim_{λ→0} [f(x̄ + λd) − f(x̄)] / λ = ∇f(x̄)Td .

Proof: From the definition of the gradient, we have:

    f(x̄ + λd) = f(x̄) + ∇f(x̄)T(λd) + ∥λd∥ α(x̄, λd) ,

where α(x̄, y) → 0 as y → 0. Rearranging and dividing by λ yields:

    [f(x̄ + λd) − f(x̄)] / λ = ∇f(x̄)Td + ∥d∥ α(x̄, λd) .

Now observe that the rightmost term goes to zero as λ → 0, yielding the result.

The function f(·) is twice differentiable at x̄ ∈ X if there exists a vector ∇f(x̄) and an n × n symmetric matrix H(x̄) (called the Hessian of f(·) at x̄) such that for each x ∈ X it holds that:

    f(x) = f(x̄) + ∇f(x̄)T(x − x̄) + ½(x − x̄)TH(x̄)(x − x̄) + ∥x − x̄∥² α(x̄, x − x̄) ,

where lim_{y→0} α(x̄, y) = 0. f(·) is twice differentiable on X if f(·) is twice differentiable at every x ∈ X. In the standard coordinate system for Rn, the Hessian corresponds to the matrix of second partial derivatives:

    H(x̄)ij = ∂²f(x̄)/∂xi∂xj .

Example 2.2 Continuing Example 2.1, we have

    H(x) = [ 6(x2)³         18(x1)(x2)²             0
             18(x1)(x2)²    18(x1)²(x2) + 2(x3)³    6(x2)(x3)²
             0              6(x2)(x3)²              6(x2)²(x3) ] .
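Gradient formulas like the one in Example 2.1 can be checked against finite differences, a useful habit when deriving them by hand. The sketch below does this check; the test point and step size are illustrative choices.

```python
# Sketch: finite-difference check of the gradient formula in Example 2.1.
def f(x):
    return 3*x[0]**2*x[1]**3 + x[1]**2*x[2]**3

def grad(x):  # the analytic gradient from Example 2.1
    return [6*x[0]*x[1]**3,
            9*x[0]**2*x[1]**2 + 2*x[1]*x[2]**3,
            3*x[1]**2*x[2]**2]

x = [1.0, 2.0, 3.0]
h = 1e-6
for i in range(3):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    fd = (f(xp) - f(xm)) / (2*h)          # central difference approximation
    assert abs(fd - grad(x)[i]) < 1e-3    # matches the analytic gradient
print("gradient formula verified at", x)
```

The same pattern, applied to each component of the gradient, verifies a Hessian formula such as the one in Example 2.2.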

  • 2.3 Positive Semidefinite and Positive Definite Matrices

    An n× n matrix M is called:

    • positive definite if xTMx > 0 for all x ∈ Rn, x ̸= 0

    • positive semidefinite if xTMx ≥ 0 for all x ∈ Rn

    • negative definite if xTMx < 0 for all x ∈ Rn, x ̸= 0

    • negative semidefinite if xTMx ≤ 0 for all x ∈ Rn

    • indefinite if there exist x, y ∈ Rn for which xTMx > 0 and yTMy < 0.

We say that M is SPD if M is symmetric and positive definite. Similarly, we say that M is SPSD if M is symmetric and positive semidefinite.

Example 2.3

    M = [ 2  0
          0  3 ]

is positive definite.

Example 2.4

    M = [  8  −1
          −1   1 ]

is positive definite. To see this, note that for x ≠ 0,

    xTMx = 8x1² − 2x1x2 + x2² = 7x1² + (x1 − x2)² > 0 .
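For a symmetric matrix, definiteness can be checked numerically from the signs of its eigenvalues. The sketch below classifies the two example matrices this way (np.linalg.eigvalsh is numpy's eigenvalue routine for symmetric matrices; the third matrix is an added illustrative example of indefiniteness).

```python
# Sketch: classifying definiteness via eigenvalues of a symmetric matrix.
import numpy as np

def classify(M):
    ev = np.linalg.eigvalsh(M)          # real eigenvalues, ascending
    if np.all(ev > 0):  return "positive definite"
    if np.all(ev >= 0): return "positive semidefinite"
    if np.all(ev < 0):  return "negative definite"
    if np.all(ev <= 0): return "negative semidefinite"
    return "indefinite"

print(classify(np.array([[2.0, 0.0], [0.0, 3.0]])))    # Example 2.3
print(classify(np.array([[8.0, -1.0], [-1.0, 1.0]])))  # Example 2.4
print(classify(np.array([[1.0, 0.0], [0.0, -1.0]])))   # indefinite
```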

2.4 The Weierstrass Theorem, and Existence of Optimal Solutions

In order to study optimal solutions of optimization problems, we first need to know when an optimal solution exists. We will make use of the following important result from real analysis.

Theorem 2.1 [Weierstrass' Theorem for sequences] Let {xk}, k → ∞, be an infinite sequence of points in the compact (i.e., closed and bounded) set F. Then some infinite subsequence of points xkj converges to a point contained in F.

    Theorem 2.1 allows us to prove:

Theorem 2.2 [Weierstrass' Theorem for functions] Let f(·) be a continuous real-valued function on the compact nonempty set F ⊂ Rn. Then F contains a point that minimizes (maximizes) f(x) on the set F.

Proof: Since the set F is non-empty and compact and f(·) is continuous, f(x) is bounded below on F. Therefore there exists v = inf_{x∈F} f(x). By definition, for any ε > 0, the set Fε = {x ∈ F : v ≤ f(x) ≤ v + ε} is non-empty. Let εk → 0 as k → ∞, and let xk ∈ Fεk. Since F is compact, there exists a subsequence of {xk} converging to some x̄ ∈ F. By continuity of f(·), we have f(x̄) = lim_{k→∞} f(xk), and, since v ≤ f(xk) ≤ v + εk, it follows that f(x̄) = lim_{k→∞} f(xk) = v.

Notice that the assumptions underlying Theorem 2.2 are (i) boundedness of the region F, (ii) closedness of the region F, and (iii) continuity of the function f(·). Let us see why each of these conditions is needed. Consider the problem:

    (P)  min_x  (1 + x)/(2x)
         s.t.  x ≥ 1 .

Here there is no optimal solution because the feasible region F is not bounded.

Next, consider the problem:

    (P)  min_x  1/x
         s.t.  1 ≤ x < 2 .

Here there is no optimal solution because the feasible region F is not closed.

Last of all, consider the problem:

    (P)  min_x  f(x)
         s.t.  1 ≤ x ≤ 2 ,

where

    f(x) := { 1/x  for 1 ≤ x < 2
            { 1    for x = 2 .

Here there is no optimal solution because the function f(·) is not continuous.

It turns out that the assumptions of Theorem 2.2 can be somewhat relaxed. For example, the following theorem can be proved in a manner similar to Theorem 2.2:

    Theorem 2.3 Consider the optimization problem:

    P : minx f(x)

    s.t. x ∈ F .

Suppose that f(·) is lower semi-continuous, i.e., for any constant c, the set {x ∈ F : f(x) ≤ c} is closed. Also, suppose that there exists some x̂ ∈ F for which the set {x ∈ F : f(x) ≤ f(x̂)} is compact. Then F contains a point that minimizes f(x) on the set F.

    2.5 Optimality Conditions for Unconstrained Problems

    Consider the unconstrained optimization problem:

    (P) minx f(x)

    s.t. x ∈ X ,

    where x = (x1, . . . , xn) ∈ Rn, f(·) : Rn → R, and X = Rn . Here the feasibleregion is the entire space X = Rn, so all points x are feasible.

Definition 2.9 The direction d̄ is called a descent direction of f(·) at x = x̄ if

    f(x̄ + εd̄) < f(x̄) for all ε > 0 sufficiently small.

Proposition 2.2 Suppose that f(·) is differentiable at x̄. If there is a vector d such that ∇f(x̄)Td < 0, then d is a descent direction of f(·) at x̄. That is, f(x̄ + λd) < f(x̄) for all λ > 0 sufficiently small.

Proof: From Proposition 2.1 we have:

lim_{λ→0} [f(x̄ + λd) − f(x̄)] / λ = ∇f(x̄)ᵀd .

Since ∇f(x̄)ᵀd < 0, it follows that f(x̄ + λd) − f(x̄) < 0 for all λ > 0 sufficiently small. Therefore d is a descent direction.
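To make Proposition 2.2 concrete, here is a small numerical check. The quadratic function, the point x̄, and the step size below are illustrative choices (not from the text); the direction is the negative gradient, which always satisfies ∇f(x̄)ᵀd < 0 when ∇f(x̄) ≠ 0:

```python
def f(x):
    # a simple smooth test function: f(x) = x1^2 + 3*x2^2
    return x[0] ** 2 + 3.0 * x[1] ** 2

def grad_f(x):
    return [2.0 * x[0], 6.0 * x[1]]

xbar = [1.0, 1.0]
g = grad_f(xbar)
d = [-g[0], -g[1]]                    # d = -grad f(xbar), so grad^T d < 0
slope = sum(gi * di for gi, di in zip(g, d))

lam = 1e-3                            # a small step along d
xnew = [xbar[i] + lam * d[i] for i in range(2)]
```

For this λ the objective strictly decreases, exactly as the proposition predicts for all sufficiently small step sizes.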

A necessary condition for local optimality is a statement of the form: “if x̄ is a local minimum of (P), then x̄ must satisfy . . . ” Such a condition helps us identify all candidates for local optima. The following corollary provides a necessary condition for a point x̄ to be a local minimum.

Corollary 2.1 Suppose f(·) is differentiable at x̄. If x̄ is a local minimum, then ∇f(x̄) = 0.

Proof: If it were true that ∇f(x̄) ≠ 0, then d = −∇f(x̄) would be a descent direction, since then ∇f(x̄)ᵀd = −∇f(x̄)ᵀ∇f(x̄) = −∥∇f(x̄)∥² < 0, whereby x̄ would not be a local minimum.

The above corollary is a first-order necessary optimality condition for an unconstrained minimization problem. The following theorem is a second-order necessary optimality condition.

Theorem 2.4 Suppose that f(·) is twice continuously differentiable at x̄ ∈ X. If x̄ is a local minimum, then ∇f(x̄) = 0 and H(x̄) is positive semidefinite.


Proof: From the first-order necessary condition in Corollary 2.1, it holds that ∇f(x̄) = 0. Suppose H(x̄) is not positive semidefinite. Then there exists d such that dᵀH(x̄)d < 0. We have:

f(x̄ + λd) = f(x̄) + λ∇f(x̄)ᵀd + ½λ²dᵀH(x̄)d + λ²∥d∥²α(x̄, λd)

           = f(x̄) + ½λ²dᵀH(x̄)d + λ²∥d∥²α(x̄, λd) ,

where α(x̄, λd) → 0 as λ → 0. Rearranging,

[f(x̄ + λd) − f(x̄)] / λ² = ½dᵀH(x̄)d + ∥d∥²α(x̄, λd) .

Since dᵀH(x̄)d < 0 and α(x̄, λd) → 0 as λ → 0, f(x̄ + λd) − f(x̄) < 0 for all λ > 0 sufficiently small, yielding the desired contradiction.

Example 2.5 Let

f(x) = ½x₁² + x₁x₂ + 2x₂² − 4x₁ − 4x₂ − x₂³ .

Then

∇f(x) = ( x₁ + x₂ − 4 ,  x₁ + 4x₂ − 4 − 3x₂² )ᵀ ,

and

H(x) = [ 1    1
         1    4 − 6x₂ ] .

∇f(x) = 0 has exactly two solutions: x̄ = (4, 0) and x̃ = (3, 1). We have

H(x̄) = [ 1  1
          1  4 ]

and

H(x̃) = [ 1   1
          1  −2 ] .

Notice that H(x̃) is indefinite, whereby from Theorem 2.4 it follows that x̃ = (3, 1) cannot be a local minimum. Therefore the only possible candidate for a local minimum is x̄ = (4, 0).
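The stationarity and curvature claims in Example 2.5 can be verified numerically. The sketch below uses the standard closed-form eigenvalue formula for a symmetric 2×2 matrix [[a, b], [b, c]]:

```python
import math

def grad(x1, x2):
    # gradient of f(x) = (1/2)x1^2 + x1*x2 + 2*x2^2 - 4*x1 - 4*x2 - x2^3
    return (x1 + x2 - 4, x1 + 4 * x2 - 4 - 3 * x2 ** 2)

def hess_eigs(x2):
    # eigenvalues of the symmetric 2x2 Hessian [[1, 1], [1, 4 - 6*x2]],
    # via the closed form for [[a, b], [b, c]]: (a+c)/2 +/- sqrt(((a-c)/2)^2 + b^2)
    a, b, c = 1.0, 1.0, 4.0 - 6.0 * x2
    mid = (a + c) / 2.0
    rad = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    return (mid - rad, mid + rad)

g_bar = grad(4, 0)            # stationary point xbar = (4, 0)
g_tilde = grad(3, 1)          # stationary point xtilde = (3, 1)
eigs_bar = hess_eigs(0)       # both eigenvalues positive: H(xbar) is positive definite
eigs_tilde = hess_eigs(1)     # one negative eigenvalue: H(xtilde) is indefinite
```

The gradient vanishes at both candidates, but only H(x̄) has all-positive eigenvalues, matching the analysis above.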

A sufficient condition for local optimality is a statement of the form: “if x̄ satisfies . . . , then x̄ is a local minimum of (P).” Such a condition allows us to automatically declare that x̄ is indeed a local minimum.


Theorem 2.5 Suppose that f(·) is twice differentiable at x̄. If ∇f(x̄) = 0 and H(x̄) is positive definite, then x̄ is a (strict) local minimum.

Proof: Since ∇f(x̄) = 0, we have

f(x) = f(x̄) + ½(x − x̄)ᵀH(x̄)(x − x̄) + ∥x − x̄∥²α(x̄, x − x̄) ,

where α(x̄, x − x̄) → 0 as x → x̄. Suppose that x̄ is not a strict local minimum. Then there exists a sequence xᵏ → x̄ such that xᵏ ≠ x̄ and f(xᵏ) ≤ f(x̄) for all k. Define dᵏ = (xᵏ − x̄)/∥xᵏ − x̄∥, and notice that ∥dᵏ∥ = 1. Then

f(xᵏ) = f(x̄) + ∥xᵏ − x̄∥² ( ½(dᵏ)ᵀH(x̄)dᵏ + α(x̄, xᵏ − x̄) ) ,

and so

½(dᵏ)ᵀH(x̄)dᵏ + α(x̄, xᵏ − x̄) = [f(xᵏ) − f(x̄)] / ∥xᵏ − x̄∥² ≤ 0 .

Since ∥dᵏ∥ = 1 for every k, there exists a subsequence of {dᵏ} converging to some point d with ∥d∥ = 1. Assume without loss of generality that dᵏ → d. Then

0 ≥ lim_{k→∞} { ½(dᵏ)ᵀH(x̄)dᵏ + α(x̄, xᵏ − x̄) } = ½dᵀH(x̄)d ,

which contradicts the positive definiteness of H(x̄).

Remark 2.1 If ∇f(x̄) = 0 and H(x̄) is positive semidefinite, we cannot be sure whether x̄ is a local minimum.

Remark 2.2 Similar to Theorem 2.5, it is easy to show that if ∇f(x̄) = 0 and H(x̄) is negative definite, then x̄ is a (strict) local maximum.

Example 2.6 Continuing Example 2.5, we compute H(x̄):

H(x̄) = [ 1  1
          1  4 ] .

Then H(x̄) is positive definite. To see this, note that for any d = (d₁, d₂), we have

dᵀH(x̄)d = d₁² + 2d₁d₂ + 4d₂² = (d₁ + d₂)² + 3d₂² > 0  for all d ≠ 0 .

Therefore, x̄ satisfies the sufficient conditions to be a strict local minimum, and so x̄ is a strict local minimum.


Example 2.7 Let

f(x) = x₁³ + x₂² .

Then

∇f(x) = ( 3x₁² , 2x₂ )ᵀ ,

and

H(x) = [ 6x₁  0
         0    2 ] .

At x̄ = (0, 0), we have ∇f(x̄) = 0 and

H(x̄) = [ 0  0
         0  2 ]

is positive semidefinite, but x̄ is not a local minimum, since f(−ε, 0) = −ε³ < 0 = f(0, 0) = f(x̄) for all ε > 0.

Example 2.8 Let

f(x) = x₁⁴ + x₂² .

Then

∇f(x) = ( 4x₁³ , 2x₂ )ᵀ ,

and

H(x) = [ 12x₁²  0
         0      2 ] .

At x̄ = (0, 0), we have ∇f(x̄) = 0 and

H(x̄) = [ 0  0
         0  2 ]

is positive semidefinite. Furthermore, x̄ is a local minimum, since for all x we have f(x) ≥ 0 = f(0, 0) = f(x̄).
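Examples 2.7 and 2.8 can be contrasted numerically: both functions have zero gradient and the same positive semidefinite Hessian diag(0, 2) at the origin, so the second-order necessary condition cannot distinguish them, yet only one has a minimum there (ε below is an arbitrary small probe size):

```python
def f_cubic(x1, x2):
    return x1 ** 3 + x2 ** 2    # Example 2.7: origin is NOT a local minimum

def f_quartic(x1, x2):
    return x1 ** 4 + x2 ** 2    # Example 2.8: origin IS a (global) minimum

eps = 1e-3
# the cubic dips below f(0,0) = 0 just to the left of the origin,
# while the quartic stays at or above 0 in a whole neighborhood:
cubic_dips_below = f_cubic(-eps, 0.0) < f_cubic(0.0, 0.0)
quartic_stays_above = all(
    f_quartic(eps * s1, eps * s2) >= f_quartic(0.0, 0.0)
    for s1 in (-1, 0, 1) for s2 in (-1, 0, 1)
)
```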

    2.6 Convexity and Minimization

We now present some very important definitions regarding convexity of sets and functions.


Let x, y ∈ Rⁿ. Points of the form λx + (1 − λ)y for λ ∈ [0, 1] are called convex combinations of x and y.

A set S ⊂ Rⁿ is called a convex set if for all x, y ∈ S and for all λ ∈ [0, 1] it holds that λx + (1 − λ)y ∈ S.

A function f(·) : S → R, where S is a nonempty convex set, is a convex function if

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)

for all x, y ∈ S and for all λ ∈ [0, 1].

A function f(·) as above is called a strictly convex function if the inequality above is strict for all x ≠ y and λ ∈ (0, 1).

A function f(·) : S → R, where S is a nonempty convex set, is a concave function if

f(λx + (1 − λ)y) ≥ λf(x) + (1 − λ)f(y)

for all x, y ∈ S and for all λ ∈ [0, 1].

A function f(·) as above is called a strictly concave function if the inequality above is strict for all x ≠ y and λ ∈ (0, 1).
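The defining inequality can be probed numerically. The following sketch samples random pairs of points and values of λ on a grid; being a finite sample it can refute convexity but cannot prove it (the grid, trial count, seed, and tolerance are all arbitrary choices):

```python
import random

def is_convex_on_samples(f, points, trials=200, seed=0):
    # heuristic check of f(l*x + (1-l)*y) <= l*f(x) + (1-l)*f(y) on random
    # pairs of sample points; can refute convexity, cannot prove it
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.choice(points), rng.choice(points)
        lam = rng.random()
        z = tuple(lam * xi + (1.0 - lam) * yi for xi, yi in zip(x, y))
        if f(z) > lam * f(x) + (1.0 - lam) * f(y) + 1e-12:
            return False
    return True

# a grid of points in the box [-1, 1] x [-1, 1]
pts = [(u / 10.0, v / 10.0) for u in range(-10, 11) for v in range(-10, 11)]
convex_ok = is_convex_on_samples(lambda p: p[0] ** 2 + p[1] ** 2, pts)
nonconvex_found = not is_convex_on_samples(lambda p: p[0] ** 3, pts)
```

The quadratic passes every sampled inequality, while x₁³ (which is concave for x₁ < 0) is quickly refuted by a sampled pair.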

    Consider the problem:

(CP)  min_x  f(x)

      s.t.  x ∈ S .

Theorem 2.6 Suppose S is a convex set, f(·) : S → R is a convex function, and x̄ is a local minimum of (CP). Then x̄ is a global minimum of f(·) over S.

Proof: Suppose x̄ is not a global minimum. Then there exists y ∈ S for which f(y) < f(x̄). Let y(λ) := (1 − λ)x̄ + λy for λ ∈ [0, 1]. Because y(λ) is a convex combination of x̄ and y, and because S is convex and both x̄ and y are elements of S, it follows that y(λ) ∈ S. Note that y(λ) → x̄ as λ → 0.


From the convexity of f(·), it holds that:

f(y(λ)) = f(λy + (1 − λ)x̄) ≤ λf(y) + (1 − λ)f(x̄) < λf(x̄) + (1 − λ)f(x̄) = f(x̄)

for all λ ∈ (0, 1]. Therefore, f(y(λ)) < f(x̄) for all λ ∈ (0, 1], and so x̄ is not a local minimum, resulting in a contradiction.

    Similar to Theorem 2.6, one can prove:

Corollary 2.2 The following are true:

• If f(·) is strictly convex, a local minimum is the unique global minimum.

• If f(·) is concave, a local maximum is a global maximum.

• If f(·) is strictly concave, a local maximum is the unique global maximum.

The following result shows that a convex function always lies above its first-order approximation.

Theorem 2.7 Suppose S is a non-empty open convex set, and f(·) : S → R is differentiable. Then f(·) is a convex function if and only if f(·) satisfies the following gradient inequality:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)  for all x, y ∈ S .

Proof: Suppose f(·) is convex. Then for any λ ∈ [0, 1], it holds that:

f(λy + (1 − λ)x) ≤ λf(y) + (1 − λ)f(x) .

Now consider λ ∈ (0, 1], and rearrange the above inequality to yield:

[f(x + λ(y − x)) − f(x)] / λ ≤ f(y) − f(x) .

Letting λ → 0 and invoking Proposition 2.1, we obtain ∇f(x)ᵀ(y − x) ≤ f(y) − f(x), establishing the “only if” part.


Now suppose that the gradient inequality holds for all pairs of points in S. Let w and z be any two points in S. Let λ ∈ [0, 1], and set x = λw + (1 − λ)z. Then

f(w) ≥ f(x) + ∇f(x)ᵀ(w − x)

and

f(z) ≥ f(x) + ∇f(x)ᵀ(z − x) .

Taking a convex combination of the above inequalities, we obtain

λf(w) + (1 − λ)f(z) ≥ f(x) + ∇f(x)ᵀ(λ(w − x) + (1 − λ)(z − x))

                    = f(x) + ∇f(x)ᵀ0

                    = f(λw + (1 − λ)z) ,

which shows that f(·) is convex.
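The gradient inequality of Theorem 2.7 can be checked numerically for a particular smooth convex function; the function exp(x₁) + x₂² and the grid of test points below are illustrative choices:

```python
import math

def f(x):
    # a smooth convex function: exp(x1) + x2^2
    return math.exp(x[0]) + x[1] ** 2

def grad(x):
    return [math.exp(x[0]), 2.0 * x[1]]

def gradient_inequality_holds(x, y, tol=1e-12):
    # checks f(y) >= f(x) + grad(x)^T (y - x), up to roundoff
    g = grad(x)
    lin = f(x) + sum(gi * (yi - xi) for gi, xi, yi in zip(g, x, y))
    return f(y) >= lin - tol

# check the inequality over a small grid of pairs (x, y)
grid = [a / 2.0 for a in range(-2, 3)]
pairs_ok = all(
    gradient_inequality_holds([x1, x2], [y1, y2])
    for x1 in grid for x2 in grid for y1 in grid for y2 in grid
)
```

Every linearization at every grid point lies at or below the function at every other grid point, as the theorem guarantees.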

Theorem 2.8 Suppose S is a non-empty open convex set, and f(·) : S → R is twice differentiable. Let H(·) denote the Hessian of f(·). Then f(·) is convex if and only if H(x) is positive semidefinite for all x ∈ S.

Proof: Suppose f(·) is convex. Let x̄ ∈ S and let d be any direction. Then for λ > 0 sufficiently small, x̄ + λd ∈ S. We have:

f(x̄ + λd) = f(x̄) + ∇f(x̄)ᵀ(λd) + ½(λd)ᵀH(x̄)(λd) + ∥λd∥²α(x̄, λd) ,

where α(x̄, y) → 0 as y → 0. From the gradient inequality, we also have

f(x̄ + λd) ≥ f(x̄) + ∇f(x̄)ᵀ(λd) .

Combining the above two relations, we obtain:

λ² ( ½dᵀH(x̄)d + ∥d∥²α(x̄, λd) ) ≥ 0 .


Dividing by λ² > 0 and letting λ → 0, we obtain dᵀH(x̄)d ≥ 0, proving the “only if” part.

Conversely, suppose that H(z) is positive semidefinite for all z ∈ S. Let x, y ∈ S be given. Then according to the second-order version of the Mean Value Theorem, there exists a value z which is a convex combination of x and y which satisfies the following equation:

f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀH(z)(y − x) .

Since z is a convex combination of x and y, then z = λx + (1 − λ)y for some λ ∈ [0, 1], and hence z ∈ S. Therefore H(z) is positive semidefinite, and this implies that

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) .

Therefore the gradient inequality holds, and hence f(·) is convex as a consequence of Theorem 2.7.

Returning to the optimization problem (P), knowing that the function f(·) is convex allows us to establish a global optimality condition that is both necessary and sufficient:

Theorem 2.9 Suppose f(·) : X → R is convex and differentiable on X. Then x̄ ∈ X is a global minimum if and only if ∇f(x̄) = 0.

Proof: The necessity of the condition ∇f(x̄) = 0 was established in Corollary 2.1, regardless of the convexity of the function f(·).

Suppose ∇f(x̄) = 0. Then by the gradient inequality we have f(y) ≥ f(x̄) + ∇f(x̄)ᵀ(y − x̄) = f(x̄) for all y ∈ X, and so x̄ is a global minimum.

Example 2.9 Continuing Example 2.5, recall that

H(x) = [ 1    1
         1    4 − 6x₂ ] .

Suppose that the domain of f(·) is X = {(x₁, x₂) | x₂ < ½}. Then f(·) is a convex function on this domain.


Example 2.10 Let

f(x) = − ln(1 − x₁ − x₂) − ln x₁ − ln x₂ .

Then

∇f(x) = ( 1/(1 − x₁ − x₂) − 1/x₁ ,  1/(1 − x₁ − x₂) − 1/x₂ )ᵀ ,

and

H(x) = [ (1/(1 − x₁ − x₂))² + (1/x₁)²    (1/(1 − x₁ − x₂))²
         (1/(1 − x₁ − x₂))²              (1/(1 − x₁ − x₂))² + (1/x₂)² ] .

It is easy to verify that H(x) is positive definite on the domain X = {(x₁, x₂) | x₁ > 0, x₂ > 0, x₁ + x₂ < 1}, and hence f(·) is a strictly convex function on X. At x̄ = (1/3, 1/3) we have ∇f(x̄) = 0, and so x̄ is the unique global minimum of f(·).
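The claims of Example 2.10 can be verified numerically: the gradient vanishes at (1/3, 1/3) (up to floating-point roundoff), and the Hessian there passes the standard 2×2 positive-definiteness test (positive diagonal entry and positive determinant):

```python
def grad(x1, x2):
    # gradient of f(x) = -ln(1 - x1 - x2) - ln(x1) - ln(x2)
    s = 1.0 - x1 - x2
    return (1.0 / s - 1.0 / x1, 1.0 / s - 1.0 / x2)

def hessian(x1, x2):
    s2 = (1.0 / (1.0 - x1 - x2)) ** 2
    return [[s2 + 1.0 / x1 ** 2, s2], [s2, s2 + 1.0 / x2 ** 2]]

gx = grad(1.0 / 3.0, 1.0 / 3.0)     # should vanish at xbar = (1/3, 1/3)
H = hessian(1.0 / 3.0, 1.0 / 3.0)
# a symmetric 2x2 matrix is positive definite iff its first diagonal
# entry and its determinant are both positive
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
```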

3 A Line-search Algorithm for a Convex Function based on Bisection

    Let us consider the unconstrained optimization problem:

(P)  min_x  f(x)

     s.t.  x ∈ Rⁿ .

In a typical algorithm for nonlinear optimization, the current iterate is some value x̄ ∈ Rⁿ, and the algorithm then computes a direction d̄ that satisfies

∇f(x̄)ᵀd̄ < 0 , (4)

whereby we know from Proposition 2.2 that d̄ is a descent direction of f(·) at x = x̄. A typical algorithm then determines the next iterate x̄new by computing a step-size α and setting:

    x̄new ← x̄+ αd̄ .


It is often desirable to choose the step-size value α so that the objective function value of the new iterate is improved as much as possible, and therefore we wish to determine the optimal value α∗ of the following 1-dimensional problem:

α∗ ← arg min_{α≥0} f(x̄ + αd̄) . (5)

Problem (5) is called a line-search problem, as we seek to search for the best value of α on the (half-)line L = {x̄ + αd̄ : α ≥ 0}.

Let us consider the case when f(·) is a continuously differentiable convex function. Let

h(α) := f(x̄ + αd̄) .

Then h(·) is a convex function in the scalar variable α, and our problem then is to solve for

α∗ = arg min_{α≥0} h(α) .

Due to the convexity of h(·), we equivalently seek a value α∗ for which

h′(α∗) = 0 .

    We have:

Proposition 3.1 The following properties of h(·) hold:

1. h′(α) = ∇f(x̄ + αd̄)ᵀd̄ ,

2. h′(0) < 0 , and

3. h′(α) is a monotone increasing function of α .

Proof: Item (1.) follows from the definition of the directional derivative and Proposition 2.1, and item (2.) follows from (4). Item (3.) is a basic property of a 1-dimensional convex function, namely that its slope is non-decreasing.

Because h′(α) is a monotonically increasing function, we can approximately compute α∗, the point that satisfies h′(α∗) = 0, by a suitable bisection method. Let us see how this can be done. Suppose that we know a value α̂ for which h′(α̂) > 0. Since h′(0) < 0 and h′(α̂) > 0, by the Intermediate Value Theorem


there exists some value of α ∈ (0, α̂) for which h′(α) = 0 ; in other words, we have α∗ ∈ (0, α̂). We will use the mid-value α̃ = (0 + α̂)/2 as a suitable test-point. Note the following:

    • If h′(α̃) = 0, we are done.

    • If h′(α̃) > 0, then α∗ ∈ (0, α̃).

    • If h′(α̃) < 0, then α∗ ∈ (α̃, α̂).

Repeating this process, we are led to the bisection algorithm presented in Algorithm 1 for approximately computing α∗ = arg min_{α≥0} h(α) = arg min_{α≥0} f(x̄ + αd̄) by approximately solving the equation h′(α) = 0.

Algorithm 1 Bisection Algorithm for Line-search Problem

Initialize at αl := 0 and αu := α̂ . Set k ← 0 .

At iteration k:

1. Set α̃ := (αl + αu)/2 and compute h′(α̃) .

   • If h′(α̃) > 0, re-set αu := α̃. Set k ← k + 1 .

   • If h′(α̃) < 0, re-set αl := α̃. Set k ← k + 1 .

   • If h′(α̃) = 0, stop.
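A minimal Python sketch of Algorithm 1, assuming access to the derivative h′ as a callable; for robustness it stops when |h′(α̃)| is small or after a fixed number of iterations, and the quadratic test problem is an illustrative choice:

```python
def bisection_linesearch(hprime, alpha_hat, tol=1e-8, max_iter=100):
    # bisection on h'(alpha), assuming h'(0) < 0 < h'(alpha_hat) and h convex
    # (so h' is nondecreasing); stops when |h'| is small or after max_iter steps
    lo, hi = 0.0, alpha_hat
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        slope = hprime(mid)
        if abs(slope) <= tol:
            return mid
        if slope > 0:
            hi = mid          # root lies in (lo, mid)
        else:
            lo = mid          # root lies in (mid, hi)
    return (lo + hi) / 2.0

# illustrative problem: h(alpha) = (alpha - 1)^2, so h'(alpha) = 2*(alpha - 1),
# with h'(0) = -2 < 0 and h'(3) = 4 > 0; the minimizer is alpha* = 1
alpha_star = bisection_linesearch(lambda a: 2.0 * (a - 1.0), 3.0)
```

Since the bracketing interval halves at every step, the iterate converges to α∗ = 1 at the geometric rate described in Proposition 3.2.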

The following properties of the bisection algorithm for solving the line-search problem are easy to establish:

Proposition 3.2

1. After every iteration of the bisection algorithm, the current interval [αl, αu] contains a point α∗ for which h′(α∗) = 0 .

2. At the kth iteration of the bisection algorithm, the length of the current interval [αl, αu] is

L = (1/2)ᵏ α̂ .


3. A value of α such that |α − α∗| ≤ ε can be found in at most

⌈ log₂(α̂/ε) ⌉

steps of the bisection algorithm.

3.1 Computing α̂ for which h′(α̂) > 0

Suppose that we do not have available a convenient value α̂ for which h′(α̂) > 0. One way to proceed is to pick an initial “guess” of α̂ > 0 and compute h′(α̂). If h′(α̂) > 0, then proceed with the bisection algorithm; if h′(α̂) ≤ 0, then re-set α̂ ← 2α̂ and repeat the process.
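The doubling scheme just described can be sketched as follows; the starting guess of 1.0 and the cap on doublings are arbitrary safeguards, and the linear h′ is an illustrative choice:

```python
def find_upper_bracket(hprime, alpha0=1.0, max_doublings=60):
    # pick a guess and double it until h'(alpha) > 0, as described above
    alpha = alpha0
    for _ in range(max_doublings):
        if hprime(alpha) > 0:
            return alpha
        alpha *= 2.0
    raise ValueError("no bracket found: h' may never become positive")

# illustrative h'(alpha) = 2*(alpha - 100): the doubling sequence
# 1, 2, 4, ..., 64 stays nonpositive, and 128 is the first positive value
alpha_hat = find_upper_bracket(lambda a: 2.0 * (a - 100.0))
```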

    3.2 Stopping Criteria for the Bisection Algorithm

In practice, we need to run the bisection algorithm with a stopping criterion. Some relevant stopping criteria are:

• Stop after a fixed number of iterations. That is, stop when k = K̄, where K̄ is specified by the user.

• Stop when the interval becomes small. That is, stop when αu − αl ≤ ε, where ε is specified by the user.

• Stop when |h′(α̃)| becomes small. That is, stop when |h′(α̃)| ≤ ε, where ε is specified by the user.

    The third stopping criterion typically yields the best results in practice.


3.3 Modification of the Bisection Algorithm when the Domain of f(·) is Restricted

The discussion and analysis of the bisection algorithm has presumed that our optimization problem is

P :  minimize_x  f(x)

     s.t.  x ∈ Rⁿ .

    Given a point x̄ and a descent direction d̄, the line-search problem then is

LS :  minimize_α  h(α) := f(x̄ + αd̄)

      s.t.  α ≥ 0 .

Suppose instead that the domain of definition of f(·) is an open set F ⊂ Rⁿ. Then our optimization problem is:

P :  minimize_x  f(x)

     s.t.  x ∈ F ,

    and the line-search problem then is

LS :  minimize_α  h(α) := f(x̄ + αd̄)

      s.t.  x̄ + αd̄ ∈ F

            α ≥ 0 .

In this case, we must ensure that all iterate values of α in the bisection algorithm satisfy x̄ + αd̄ ∈ F . As an example, consider the following problem:

P :  minimize_x  f(x) := − ∑_{i=1}^{m} ln(bᵢ − Aᵢx)

     s.t.  b − Ax > 0 .

Here the domain of f(·) is F = {x ∈ Rⁿ | b − Ax > 0}. Given a point x̄ ∈ F and a descent direction d̄, the line-search problem is:

LS :  minimize_α  h(α) := f(x̄ + αd̄) = − ∑_{i=1}^{m} ln(bᵢ − Aᵢ(x̄ + αd̄))

      s.t.  b − A(x̄ + αd̄) > 0

            α ≥ 0 .

Standard arithmetic manipulation can be used to establish that

b − A(x̄ + αd̄) > 0  if  0 ≤ α < α̂ ,

where

α̂ := min_{i : Aᵢd̄ > 0} { (bᵢ − Aᵢx̄) / (Aᵢd̄) } ,

and the line-search problem then is:

LS :  minimize_α  h(α) := − ∑_{i=1}^{m} ln(bᵢ − Aᵢ(x̄ + αd̄))

      s.t.  0 ≤ α < α̂ .
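The formula for α̂ is a ratio test over the rows of A with Aᵢd̄ > 0. A sketch follows; the particular A, b, x̄, and d̄ are hypothetical data chosen so that x̄ is strictly feasible:

```python
def max_step(A, b, xbar, d):
    # ratio test: alpha_hat = min over rows with A_i d > 0 of
    # (b_i - A_i xbar) / (A_i d); +inf if no row restricts the step
    alpha_hat = float("inf")
    for Ai, bi in zip(A, b):
        Ad = sum(a * dj for a, dj in zip(Ai, d))
        if Ad > 0:
            slack = bi - sum(a * xj for a, xj in zip(Ai, xbar))
            alpha_hat = min(alpha_hat, slack / Ad)
    return alpha_hat

# hypothetical data: xbar satisfies b - A xbar > 0
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [1.0, 1.0, 0.0]
xbar = [0.2, 0.3]
d = [1.0, 1.0]
alpha_hat = max_step(A, b, xbar, d)   # rows 1 and 2 restrict; row 3 does not
```

For this data the binding ratio comes from the second row, giving α̂ = (1 − 0.3)/1 = 0.7.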

    4 Exercises on Unconstrained Optimization

1. Find points satisfying necessary conditions for extrema (that is, local minima or local maxima) of the function

f(x) = (x₁ + x₂) / (3 + x₁² + x₂² + x₁x₂) .


Try to establish the nature of these points by checking sufficient conditions.

2. Find minima of the function

f(x) = (x₂² − x₁)²

    among all the points satisfying necessary conditions for an extremum.

3. Consider the problem of finding x that minimizes ∥Ax − b∥², where A is an m × n matrix and b is an m-dimensional vector.

a. Give a geometric interpretation of the problem.

b. Write a necessary condition for optimality. Is this also a sufficient condition?

    c. Is the optimal solution unique? Why or why not?

d. Can you construct a closed-form expression for the optimal solution? Specify any assumptions that you may need.

e. Solve the problem for A and b given below:

A = [ 1  −1   0
      0   2   1
      0   1   0
      1   0   1 ] ,    b = [ 2
                             1
                             1
                             0 ] .

4. Let S be a nonempty set in Rⁿ. Show that S is convex if and only if for each integer k ≥ 2 the following holds true:

x¹, . . . , xᵏ ∈ S  ⇒  ∑_{j=1}^{k} λⱼxʲ ∈ S

whenever λ₁, . . . , λₖ satisfy λ₁, . . . , λₖ ≥ 0 and ∑_{j=1}^{k} λⱼ = 1 .

    5. Prove Corollary 2.2.

6. Consider the function f(·) = fₐ(·) parameterized by the scalar a :

fₐ(x₁, x₂) = 4x₁² + x₂² + a·x₁x₂ + 2x₁ + x₂ .

For each value of a, describe the set of solutions of ∇fₐ(x) = 0 . Which of these solutions are global minima?


7. Bertsekas, Exercise 1.1.2, page 16, parts (a), (b), (c), and (d). (Note: x∗ is called a stationary point of f(·) if ∇f(x∗) = 0.)

8. Let f(·) : Rⁿ → R be differentiable at x̄, and let d¹, . . . , dⁿ be linearly independent vectors in Rⁿ. For each j = 1, . . . , n, suppose that the minimum of f(x̄ + λdʲ) over λ ∈ R occurs at λ = 0. Show that this implies that ∇f(x̄) = 0. Does this imply that f(·) has a local minimum at x̄?
