Learning Hessian matrix.pdf

download Learning Hessian matrix.pdf

of 100

Transcript of Learning Hessian matrix.pdf

  • 7/27/2019 Learning Hessian matrix.pdf

    1/100

    Derivatives, Higher Derivatives, Hessian Matrix

    And its Application in Numerical Optimization

    Dr M Zulfikar Ali

    Professor

    Department of MathematicsUniversity of Rajshahi

  • 7/27/2019 Learning Hessian matrix.pdf

    2/100

    1. First-, Second- and Higher-Order Derivatives

    11. Differentiable Real-Valued Function

    Let f be a real valued function defined on an open set in nR , and let x be in . f is said to be differentiable at x iffor all nRx such that + xx we have that

    xxxxxtxfxxf ),()()()( ++=+

    where )(xt is an n-dimensional bounded vector, and is a

    real-valued function of x such that0),(lim

    0=

    xx

    x

    f is said to be differentiable on if it is differentiable ateach x in .[Obviously, if f is differentiable on the open set , it isalso differentiable on any subset

    1 (open or not) of .

    Hence when we say that f is differentiable on some set

    1 (open or not), we shall mean that f is differentiable onsome open set containing

    1 .]

    Theorem 1 Let )(xf be a real-valued function defined on

    an open set in nR , and let x in .(i) If )(xf is differentiable atx , then )(xf is continuous

    atx , and )(xf exists (but not conversely), andxxxxxxfxxf ),()()()( ++=+ ,

    0),(lim0

    =

    xxx

    for + xx

    (ii) If )(xf has continuous partial derivatives at x with

    respect ton

    xxx ,,,21K , that is )(xf exists and f is

    continuous atx , then f is differentiable at x .

  • 7/27/2019 Learning Hessian matrix.pdf

    3/100

    Theorem 2 Let )(xf be a real-valued function defined on

    an open set in nR , and let x in .(i) If )(xf is differentiable atx , then )(xf is continuous

    atx , and )(xf exists(but not conversely), andxxxxxxfxxf ),()()()( ++=+ ,

    0),(lim0

    =

    xxx

    for + xx

    (ii) If )(xf has continuous partial derivatives at x with

    respect ton

    xxx ,,,21K , that is )(xf exists and f is

    continuous at x , then f is differentiable at x .

    1.2. Twice Differentiable Real-Valued Function

    Let f be a real-valued function defined on an open set

    in nR , and let xbe in . f is said to be twice differentiableat x if for all nRx such that + xx we have that

    2

    2

    ))(,(2

    )()()()( xxx

    xxfxxxfxfxxf +

    ++=+

    where )(2 xf is an nn matrix of bounded elements, and is a real-valued function ofx such that 0),(lim

    0=

    xx

    x

    The nn matrix )(2 xf is called the Hessian (matrix) off at x and its ij-th element is written as

    [ ] njixxxf

    xfji

    ij ,,2,1,,)(

    )(

    2

    2

    L=

    =

    Obviously, if f is twice differentiable atx , it must also be

    differentiable at x .

  • 7/27/2019 Learning Hessian matrix.pdf

    4/100

    1.3. First Partial Derivative

    Suppose ),( yxf is a real-valued function of two

    independent variables x and y. Then the partial derivative

    of ),( yxf with respect to x is defined as

    +

    =

    x

    yxfyxxf

    x

    fx

    y

    ),(),(lim

    0(1)

    Similarly, the partial derivative of ),( yxf with respect to y

    is defined as

    +

    =

    y

    yxfyyxf

    y

    fx

    x

    ),(),(lim

    0. (2)

    Example1 If 22 2),( yxyxf = . Then

    +

    =

    =

    x

    yxyxx

    x

    ff

    x

    y

    x

    )2(]2)[(lim

    2222

    0

    x2= Similarly

    +

    =

    =

    y

    yxyyx

    y

    ff

    x

    x

    y

    )2(])(2[lim

    2222

    0

    y4=

    .1.4. Gradient of Real-Valued Functions

    Let fbe a real-valued function defined on an open set innR , and let x be in . The n-dimensional vector of the

  • 7/27/2019 Learning Hessian matrix.pdf

    5/100

    partial derivatives of f with respect ton

    xxx ,,,21K at x is

    called the gradient of f at x

    and is denoted by )(xf , that is,

    ))(,,)(()(1 n

    xxfxxfxf = K

    Example2 Let .)(),( 22 yyxyxf += Then the gradient off ,

    )42,22())(,)(()(21

    yxyxxxfxxfxf +==

    1.5. Function of a Function

    It is well-known propert of functions of one independent

    variable that if f is a function of a variable u, and u is a

    function of a variable x, then

    dx

    du

    du

    df

    dx

    df= . (3)

    This result may be immediately extended to the case when f

    is a function of two or more independent variables.

    Suppose )(uff = and ),( yxuu = . Then, by the definitionof a partial derivative,

    yy

    x

    x

    u

    du

    df

    x

    ff

    =

    = , (4)

    xx

    y

    yu

    dudf

    yff

    =

    = . (5)

    Example 3 If

  • 7/27/2019 Learning Hessian matrix.pdf

    6/100

    x

    yyxf

    1tan),( =

    Then putting xyu = we have

    22

    1 )(tanyx

    y

    x

    uu

    du

    d

    x

    ff

    yy

    x

    +=

    =

    =

    22

    1 )(tanyx

    x

    y

    uu

    du

    d

    y

    ff

    xx

    y

    +=

    =

    = .

    1.6. Higher Partial Derivatives

    Provided the first partial derivatives of function are

    differentiable we may differentiate them partially to obtain

    the second partial derivatives. The four second partial

    derivatives of ),( yxf are therefore

    y

    xxx

    x

    f

    xf

    xx

    ff

    =

    =

    =

    2

    2

    , (6)

    x

    yyy

    y

    f

    yf

    yy

    ff

    =

    =

    =

    2

    2

    (7)

    x

    yxy

    y

    f

    xf

    xyx

    ff

    =

    =

    =

    2

    , (8)

    and

    y

    xyx x

    f

    yf

    yxy

    ff

    =

    =

    =

    2

    . (9)

    Higher partial derivatives than the second may be obtained

    in a similar way.

  • 7/27/2019 Learning Hessian matrix.pdf

    7/100

    Example 4 Ifx

    yyxf

    1tan),( = . Then

    2222,

    yxx

    xf

    yxy

    xf

    xy +=

    +=

    .

    Hence differentiating these first derivatives partially, we

    obtain

    222222

    2

    )(

    2)(

    yx

    xy

    yx

    y

    xx

    ff

    xx

    +=

    +

    =

    =

    and

    222222

    2

    )(

    2)(

    yx

    xy

    yx

    x

    yy

    ff

    yy

    +=

    +

    =

    =

    also

    222

    22

    22

    2

    )()(

    yx

    xy

    yx

    x

    xyx

    ff

    xy

    +

    =

    +

    =

    =

    and

    222

    22

    22

    2

    )()(

    yx

    xy

    yx

    y

    yxy

    ffyx

    +=

    +

    =

    = .

    We see that

    xy

    f

    yx

    f

    =

    22(10)

    which shows that the operatorsx

    and

    y

    are

    commutative.

    We also see that

    02

    2

    2

    2

    =

    +

    y

    f

    x

    f. (11)

  • 7/27/2019 Learning Hessian matrix.pdf

    8/100

    The above equation is called Laplace equation in two

    variables.

    In general, any function satisfying this equation is called a

    harmonic function.

    1.7. Total Derivatives

    Suppose ),( yxf is a continuous function defined in a

    region R of the xy-plane, and that bothyx

    f

    and

    xy

    f

    are

    continuous in this region. We now consider the change in

    the value of the function brought about by allowing smallchanges in x and y.

    If f is the change in ),( yxf due to changes x and y in

    x and y then

    ),(),( yxfyyxxff ++= (12)

    ),(),(

    ),(),(

    yxfyyxf

    yyxfyyxxf

    ++

    +++=

    (13)

    Now, by definition (1) and (2)

    +++

    =+

    x

    yyxfyyxxfyyxf

    x x

    ),(),(lim),(

    0(14)

    and

    +

    =

    y

    yxfyyxfyxf

    y y

    ),(),(lim),(

    0(15)

    Consequently ,

    xyyxfx

    yyxfyyxxf

    ++

    =+++ ),(),(),(

  • 7/27/2019 Learning Hessian matrix.pdf

    9/100

    (16)

    and

    yyxf

    y

    yxfyyxf

    +

    =+ ),(),(),( (17)

    where and satisfy the conditions

    0lim0

    =

    xand 0lim

    0=

    y. (18)

    Using (16) and (17) in (13) we now find

    yyxfy

    xyyxfx

    f

    +

    +

    ++

    = ),(),( . (19)

    Furthermore, since all first derivatives are continuous by

    assumption, the first term of (19) may be written as

    +

    =+

    x

    yxfyyxf

    x

    ),(),( (20)

    where satisfies the condition

    0lim0

    =y

    . (21)

    Hence, using (20) and (19) now becomes

    yxyy

    yxfx

    x

    yxff +++

    +

    = )(

    ),(),(. (22)

    The expression

    yy

    yxfxx

    yxff

    +

    ),(),( (23)

    obtained by neglecting the small terms x )( + and y in(22) represents, to the first order in x and y . The change

  • 7/27/2019 Learning Hessian matrix.pdf

    10/100

    f in f(x, y) due to changes x and y in x and y

    respectively is called the total differential of f.

    In case of a function of n independent variables

    ),,,( 21 nxxxf L we have

    =

    ++

    +

    =

    n

    rr

    r

    n

    n

    xx

    fx

    x

    fx

    x

    fx

    x

    ff

    12

    2

    1

    1

    L .

    (24)

    Example5. To find the change in

    xy

    xeyxf =),(

    when the values of x and y are slightly changed from 1 and

    0 to 1+ x and y respectively.

    We first use (23) to obtain

    yexxexyef

    xyxyxy

    2

    )(++

    Hence putting x=1, y=0 in the above expression we have

    yxf +

    For example, if 10.0=x and 05.0=y , then 15.0f

    We now return to the exact expression for f given in (22).

    Suppose u=f(x, y) and that both x and y are differentiablefunctions of a variable t so that

    x=x(t), y=y(t) (25)

  • 7/27/2019 Learning Hessian matrix.pdf

    11/100

    and

    u=u(t). (26)

    Hence dividing (22) by t and proceeding to the limit0t (which implies 0x , 0y andconsequently 0,, ) we have

    dt

    dy

    y

    f

    dt

    dx

    x

    f

    dt

    du

    +

    = . (27)

    This expression is called the total derivative of )(tu with

    respect to t.It is easily seen that if

    ),,,(21 n

    xxxfu L= (28)

    wheren

    xxx ,,,21L are all differentiable functions of a

    variable t, then u=u(t) and

    =

    ++

    +

    =

    n

    r

    r

    r

    n

    ndtdx

    xf

    dtdx

    xf

    dtdx

    xf

    dtdx

    xf

    dtdu

    1

    2

    2

    1

    1

    L

    (29)

    Example6. Suppose22),( yxyxfu +==

    And2

    ,sinh tytx == .Then

    yy

    fx

    x

    f2,2 =

    =

  • 7/27/2019 Learning Hessian matrix.pdf

    12/100

    tdt

    dyt

    dt

    dx2,cosh ==

    Hence

    34coshsinh24cosh2 tttyttxdt

    dy

    y

    f

    dt

    dx

    x

    f

    dt

    du+=+=

    +

    =

    .

    1.8. Implicit DifferentiationIn special case of the total derivative (27) aries when y is

    itself a function of x (i.e. t=x).Consequently u is a function of x only and

    dx

    dy

    y

    f

    x

    f

    dx

    du

    +

    = (30)

    Example7. Suppose

    y

    xyxfu 1tan),( ==

    andxy sin=

    Then by (30), we have

    xx

    xxx

    xyx

    x

    yx

    y

    dx

    du

    22

    222

    sin

    cossin

    cos

    +

    =

    +

    +=

    When y is defined as a function of x by the equation

    0),( == yxfu (31)

  • 7/27/2019 Learning Hessian matrix.pdf

    13/100

    Y is called an implicit function of x. Since u is identically

    zero its total derivative must vanish, and consequently from(30)

    xyy

    f

    x

    f

    dx

    dy

    = (32)

    Example8. Suppose

    0222),( 22 =+++++= cfygxbyhxyaxyxf

    (where a, h, b, g, f and c are constants)Then

    fhxby

    ghyax

    dx

    dy

    222

    222

    ++

    ++= .

    1.9. Higher Total Derivatives

    We have already seen that if ),( yxfu = and x and y aredifferentiable functions of t then

    dt

    dy

    y

    f

    dt

    dx

    x

    f

    dt

    du

    +

    = . (33)

    To find2

    2

    dt

    udwe note (33) that the operator

    dt

    dcan be

    written as

    ydtdy

    xdtdx

    dtd

    +

    . (34)

    Hence

    =

    =

    dt

    du

    dt

    d

    dt

    ud2

    2

    +

    +

    dt

    dy

    y

    f

    dt

    dx

    x

    f

    ydt

    dy

    xdt

    dx

  • 7/27/2019 Learning Hessian matrix.pdf

    14/100

    2

    2

    222

    2

    2

    2

    +

    +

    =

    dt

    dy

    y

    f

    dt

    dy

    dt

    dx

    yx

    f

    dt

    dx

    x

    f

    +2

    2

    2

    2

    dt

    yd

    y

    f

    dt

    xd

    x

    f

    +

    (35)

    where we have assumed thatyxxy

    ff = .Higher total

    derivatives may be obtained in a similar way.

    A special case of (35) is when

    ,, kdt

    dyh

    dt

    dx==

    where h and k are constants. We then have

    2

    2

    2

    2

    2

    2

    2

    2

    2

    2y

    fk

    yx

    fhk

    x

    fh

    dt

    ud

    +

    +

    = ,

    which, if we define the differential operator D* by

    y

    k

    x

    hD

    +

    =* ,

    may be written symbolically as

    .2*2

    2

    2

    fDfy

    kx

    hdt

    ud=

    +

    =

    Similarly we find

    3

    3

    3

    2

    3

    2

    2

    3

    2

    3

    3

    3

    3

    3

    33y

    fk

    yx

    fhk

    yx

    fkh

    x

    fh

    dt

    ud

    +

    +

    +

    =

    fDfy

    kx

    h3*

    3

    =

    +

    = ,

    assuming the commutative properties of partial

    differentiation.In general,

  • 7/27/2019 Learning Hessian matrix.pdf

    15/100

    fDfy

    kx

    hdt

    ud nn

    n

    n

    *=

    +

    = ,

    where the operatorn

    yk

    xh

    +

    is to be expanded by

    means of the binomial theorem.

    1.10 . Taylors Theorem for functions of Two Independent

    Variables

    Theorem3 (Taylors Theorem) If ),( yxf is defined in a

    region R of xy-plane all its partial derivatives of orders upto and including the (n+1)th are continuous in R , then for

    any point (a, b) in this region

    ,),(!

    1

    ),(!2

    1),(),(),(

    *

    2

    *

    *

    nn EbafD

    n

    bafDbaDfbafkbhaf

    ++

    +++=++ L

    where D* is the differential operator defined by

    yk

    xhD

    +

    =*

    and

    ),(* bafD r means ),( yxfy

    kx

    h

    r

    +

    evaluated at the point ),( ba . The Lagrange error termn

    E is

    given b

  • 7/27/2019 Learning Hessian matrix.pdf

    16/100

    ),()!1(

    11

    *

    kbhafDn

    E nn

    +++

    = +

    where 10

  • 7/27/2019 Learning Hessian matrix.pdf

    17/100

    +

    +

    +

    2

    3

    32

    1

    6

    3

    3)1(2

    2

    yyx

    + terms of degree 3 and higher.

    2. Derivatives by Linearization Technique

    2.1. Linear Part of a Function.

    Consider the function, where32

    3),( yxyxyxf = ;and try to approximate ),( yxf near the point

    )1,2(),( =yx by simpler function. To do this, set+= 2x and += 1y . Then

    )1,2(),( ++= fyxf

    )22(3)44(2

    ++++=

    )3331( 32 ++

    ).33()97(11 322 ++++= Here )1,2(11 = f ; 97 is a linear function of thevariables and; and 322 33 ++=v is small,

    compared to the linear function 97 , ifandare bothsmall enough. This means that, ifandare both small

    enough, then)97(11),( +yxf

  • 7/27/2019 Learning Hessian matrix.pdf

    18/100

    is a good approximation, the terms omitted being small in

    comparision to the terms retained.

    To make the idea of small enough precise, denote by

    dthe distance from )1,2(

    to )1,2( ++

    ; then2

    1

    22 ][ +=d .Then d and d , so that, as 0d ,

    022 = dddd ; 0333 2 = dddd ;

    and similarly for the remaining terms in v . This motivates

    the following definition.

    Definition 1. The function f of two variables yx, is

    differentiable at the point ),( ba if

    . ),()()(),(),( yxbyqaxpbafyxf +++= ,where p and q are constants, and

    0d as 0])()[( 21

    22 += byaxd .

    The linear function )()( byqaxp + will be called thelinear part of f at ),( ba . (some books call it the

    differential.)

    The numbers pandqcan be directly calculated. Ify is

    fixed at the value b , then axd = , and

    0),(

    0),(),(

    +

    ++=

    p

    ax

    bxp

    ax

    bafbxf

    as 0= axd , thus as ax . Thus p equals the partialderivative of f with respect to x at a , holding y fixed at

    b . This will be written

  • 7/27/2019 Learning Hessian matrix.pdf

    19/100

    ),(),(),( bafDbafbax

    fp

    xx==

    = , or as

    x

    f

    or

    xf or fD

    x

    for short. Similarly q equals ),(),( bafbayf

    y=

    , the partial

    derivative of f with respect to y at b , holding x fixed at a .

    Thus, in the above example, yxfx

    32 = , and so, at

    )1,2(),( =yx , pfx

    === 7)1(3)2(2 . Similarly,

    ))1,2((933 2 === atqyxfy

    .

    Example 1. Calculate the linear part of f at )0,4

    (

    where

    )32cos(),( yxyxf += .Solution. )32cos(),( yxyxf +=

    )0,4

    (3)32sin(3),(

    )0,4

    (2)32sin(2),(

    atyxyxf

    atyxyxf

    y

    x

    =+=

    =+=

    Hence the linear part of f at )0,4

    (

    is

    =

    04)32(

    )0(3)4

    (2

    yx

    yx

  • 7/27/2019 Learning Hessian matrix.pdf

    20/100

    Example 2. Calculate the linear part of g at )2

    ,0(

    , where

    ).sin(),( 22 yxyxg +=

    Solution. )4

    cos()2

    ,0(,0)2

    ,0(2

    ==yx

    gg

    Hence the linear part of g at )2

    ,0(

    is

    )2

    )(4

    cos(

    )2

    )(4

    cos()0.(0

    2

    2

    =

    +

    y

    yx

    2.2. Vector Viewpoint

    It will be convenient to regard f as a function of a vector

    variable w whose components are x andy . Denote

    =y

    xw , as a column vector, in matrix language; denote

    also

    =

    b

    ac . Define also row vector

    ( ) ),(),,(,)( bafbafqpcfyx

    == .

    ( The notation )(cf suggests a derivative; we have here asort of vector derivative.). Then, since f is differentiable,

    ),())(()()( wcwcfcfwf ++=

  • 7/27/2019 Learning Hessian matrix.pdf

    21/100

    where the product

    )()(),())(( byqaxp

    by

    axqpcwcf +=

    =

    is the usual matrix product. Now let cw denote the

    length of the vector cw ; then the previous cwd = ,

    and so 0)( cww

    Example 3. For ,1,2,3),( 32 === bayxyxyxf set

    = 1

    2c ; then

    )9,7()( = cf .

    2.3. Directional Derivative.

    Suppose that the point (x, y) moves along a straight line

    through the point (a, b); thus myx == ,l , where is

    the distance from (a, b) to (x, y) andl andm are constantsspecifying the direction of the line (with 122 =+ ml ). In

    vector language, let

    =

    y

    xw ,

    =

    b

    ac ,

    =

    mt

    l; then

    cw = ; and the line has equation tcw += .

    The rate of increase of )(wf , measured at cw = , as wmoves along the line tcw += , is called the directional

    derivative of f at c in the direction t.This may be

    calculated, assuming f differentiable, as follows:

    +=

    =

    +tcf

    tcfcftcf)(

    ))(()()(.

  • 7/27/2019 Learning Hessian matrix.pdf

    22/100

    The required directional derivative is the limit of this ratio

    as 0 , namely ( ) qmpm

    qptcf +=

    = l

    l)( (since

    0 as 0 ). Note that t is here a unit vector.

    Example 4(a) Let .3),( 32 yxyxyxf = The directional

    derivative of at

    1

    2in the direction

    sin

    cosis

    ( )

    sin9cos7

    sin

    cos97

    sin

    cos)1,2( =

    =

    f .

    (b) Find the directional derivative of222),,( yzxyxzyxf ++= at (1,-1, 2) in the direction of the

    vector A=(1,-2,2).

    Solution. )2,,4(),,( 2 yzzxyxzyxf ++= ,

    ).4,5,3()2,1,1( =f Unit vector in the direction of A= (1,-2, 2) is ).

    3

    2,

    3

    2,

    3

    1(

    Therefore the directional derivative at (1,-1, 2) in the

    direction of A is

    ( ) 5

    32

    32

    31

    453 =

    .

  • 7/27/2019 Learning Hessian matrix.pdf

    23/100

    2.4. Vector Functions

    Let and each be differentiable real functions of the

    two real variables x andy . The pair of equations

    ),(),(

    yxvyxu

    ==

    defines a mapping from the point ),( yx to the point ),( vu .

    If, instead of considering points, we consider a vectorw,

    with components yx, , and a vectors , with components u, v,

    then the two equations define a mapping from the vector w

    to the vectors . This mapping is then specified by the vector

    function

    =f .

    Definition. The vector function f is differentiable at c if

    there is a matrix )(cf such that

    )())(()()( wcwcfcfwf ++= (1)

    holds, with

    0)(

    cw

    was 0 cw .

    The term ))(( cwcf is called the linear part of f at c .

    Example 5. Let 22),( yxyx += and xyyx 2),( = .thesefunctions are differentiable at )2,1( , and calculation shows

    that

  • 7/27/2019 Learning Hessian matrix.pdf

    24/100

    )).2)(1(2()2(2)1(44

    );)2(4)1(()2(4)1(25 22

    +++=

    ++++=

    yxyxv

    yxyxu

    This pair of equations combines into a single matrixequation:

    ++

    +

    =

    )2)(1(2

    )2(4)1(

    2

    1

    24

    42

    4

    5 22

    yx

    yx

    y

    x

    v

    u.

    In vector notation, this may be written as

    )())(()()( wcwcfcfwf ++= ,

    where now )(cf is a 22 matrix. Since the components

    21, of satisfy 0)(

    1 cww and

    0)(2

    cww as 0 cw , it follows that

    0)( cww as 0 cw .

    Definition 2. The vector function f is differentiable at c if

    there is a matrix )(cf

    such that Equation (1) holds, with

    0)(

    cw

    was 0 cw .

    (2)

    The term ))(( cwcf is called the linear part of f at c.

    Example 6. For the function

    +=

    =

    xy

    yxf

    2

    22

    ,

  • 7/27/2019 Learning Hessian matrix.pdf

    25/100

    the linear part at

    2

    1is

    2

    1

    24

    42

    y

    x

    .

    Example7. Let .2

    22

    =

    xy

    yx

    y

    xf Calculate

    2

    1f .

    Here ;2

    22

    2

    =

    xyy

    yx

    y

    xf .

    44

    42

    2

    1

    =

    f

    2.5. Functions of Functions

    Let the differentiable function f map the vector w to the

    vector s; let the differentiable function g map the vector s to

    the vector t. Diagrammatically,

    tsw

    gf

    Then the composition fgh o= of the functions g and fmaps w to t.

    Since f and g are differentiable,

    ;)()()( += cwAcfwf ;))()(())(())(( += cfwfBcfgwfg

    where A and B are suitable matrices, and and can be

    neglected, then approximately

  • 7/27/2019 Learning Hessian matrix.pdf

    26/100

    )())(())(( cwBAcfgwfg .

    The linear part of h, is in fact, )( cwBA .

    From the chain rule we have)())(()( cfcfgch = .

    Example8. Let

    +=

    =

    =

    xy

    yx

    yx

    yx

    v

    uf

    2),(

    ),( 22

    ,

    23),( vuvug = ,

    =

    2

    1c .

    =

    +=

    =

    4

    5

    212

    21

    2

    1)(

    22

    fcf

    =

    =

    24

    42)(,

    22

    22),( cf

    xy

    yxyxf ,

    )2,3()2,3(),( vvvug == , )8,3()( = cg

    The chain rules then gives

    ).426(24

    42)83(

    2

    1=

    =

    h

    Example9 . tytxyxz sin,cos,2 22 === . Here

    222,sin

    cos)( yx

    y

    xg

    t

    ttf =

    = .

    Taking partial derivatives,

    ( )yxy

    xg

    t

    ttf 22;

    cos

    sin)( =

    =

    Hence

  • 7/27/2019 Learning Hessian matrix.pdf

    27/100

    ( )

    =

    t

    tyx

    dt

    dz

    cos

    sin22

    ( ) 0cos

    sin

    sin2cos2 =

    = t

    t

    tt

    as it should.

    3. Gateaux and Frechet Derivatives

    Nonlinear Operators can be investigated by establishing a

    connection between them and linear operators-more

    precisely, by the technique of local approximation to thenonlinear operator by a linear one. The differential calculus

    for nonlinear operators is needed for this

    purpose.Differentiation is a technique that helps us to

    approximate a nonlinear operator locally. In Banach spaces

    there are different types of derivatives. Among them

    Gateaux and Frechet Derivatives are very important for

    applications.One of them is more general than the other butin some special circustances they are equivalent.

    3.1. Gateaux Derivative

    The Gateaux derivative is the generalization of directional

    derivative.

    Definition.Let X and Y are Banach spaces and let P be an

    operator such that YXP : . Then P is said to be Gateauxdifferentiable at Xx

    0if there exists a continuous linear

    operator YXU : (in general depends on0

    x ) such that

  • 7/27/2019 Learning Hessian matrix.pdf

    28/100

    )()()(

    lim 000

    xUt

    xPtxxPt

    =+

    for every Xx .

    The above is clearly equivalent to

    0)]()()([1

    lim000

    =+

    xtUxPtxxPtt

    (1)

    for every Xx .

    In the above situation, U is called the Gareaux derivative

    of P at0

    x , written )(0

    xPU = , and its value at Xx is

    denoted by ))((0

    xxP or simply xxP )(0

    .

    Theorem1. The Gateaux derivative, if it exists, is unique.

    Proof. Let UandVbe two Gateaux derivatives of Pat0

    x .

    From the relation, for 0t

    )],()()([1

    )]()()([1

    )()(

    00

    00

    xtVxPtxxPt

    xtUxPtxxPt

    xUxV

    +

    +=

    we obtain for 0>t

    )()()(1

    )()()(1

    )()(

    00

    00

    xtVxPtxxPt

    xtUxPtxxPt

    xUxV

    ++

    +

  • 7/27/2019 Learning Hessian matrix.pdf

    29/100

    and as 0t , because of (1), both the expressions in theright hand side tend to zero.

    Since this is true for each Xx , we see that VU = and rhe

    theorem is proved.

    We now suppose that X and Y are finite dimensional, saynRX = and mRY = . Let us analyse the representation of the

    Gareaux derivative of an operator from X into Y. We know

    that if U is a linear operator from X into Y then U is given

    by the matrix

    mnmm

    n

    n

    aaa

    aaa

    aaa

    L

    LLLL

    L

    L

    21

    22221

    11211

    whereYyXxxUy

    mn=== ),,,(,),,,(),(

    2121 LL

    and

    k

    n

    kiki

    a ==1

    , mi ,,2,1 L= . (1)

    Let P be an operator mapping an open subset G of X into

    Y and Gxn

    = ),,,(21

    L

    Yym

    = ),,,(21

    L and )(xPy = . Then we see that there

    exist numerical functionsm

    ,,,21L such that

    =ii ( n ,,, 21 L ), mi ,,2,1 L= ; (2)

    Suppose that the Gateaux derivative of Pexists at

    ),,,( )0()0(2

    )0(

    10 nx L= and UxP = )(

    0. Let Ube given by

    the above matrix, If the equation

  • 7/27/2019 Learning Hessian matrix.pdf

    30/100

    )()()(

    lim 000

    xUt

    xPtxxPt

    =+

    is written in full then with the help of the relation (1) and

    (2) we obtain m relations

    t

    tttninni

    t

    ),,,(),,,(lim

    )0()0(

    2

    )0(

    1

    )0(

    2

    )0(

    21

    )0(

    1

    0

    LL +++

    =k

    n

    kik

    a =1

    , mi ,,2,1 L= (3)

    The relation (3) holds for all Xxn

    = ),,,(21

    L and

    therefore taking in turn an x whose all coordinates are zero

    except one which is equal to unity, we see that the

    functionsm

    ,,,21L have partial derivatives with respect

    ton

    ,,,21L and

    ik

    k

    ni a=

    ),,,( )0()0(2

    )0(

    1L

    where mi ,,2,1 L= and nk ,,2,1 L= .

    The derivative UxP = )( 0 is therefore given by the matrix of

    partial derivatives of the functions m ,,, 21 L

    =

    nmmm

    n

    n

    xP

    L

    LLLL

    L

    L

    21

    22212

    12111

    0 )(

    which is known as the Jacobian matrix and is denoted by

    )(0

    xJ .

  • 7/27/2019 Learning Hessian matrix.pdf

    31/100

    Example1. In this example, we show that the existence of

    the partial derivatives ofm

    ,,,21L need not guarantee

    the existence of )( 0xP .Let m=1 and n=2 and

    )0,0(,0)0,0(,)(

    ),(0122

    2

    2

    1

    21

    211==

    += x

    .

    h

    h

    h

    hhh

    )0,0()0,(lim

    )0,0()0,0(lim

    )0,0(11

    0

    11

    0

    1

    1

    =

    +=

    0)(

    0.lim

    220==

    hh

    hh

    and similarly, 0)0,0(

    2

    1 =

    .

    Therefore, if the derivative )(0

    xP were to exist then it must

    be the zero operator and then (3) should give0

    ),(lim 211

    0=

    t

    ttt

    .

  • 7/27/2019 Learning Hessian matrix.pdf

    32/100

    But

    +=

    +=

    =

    +=

    22

    2

    2

    1

    5

    21

    2

    0

    22

    2

    2

    1

    21

    0

    1

    2

    1

    1

    0

    11

    00

    })(){(lim

    })(){(

    .1lim

    )0,0()(lim

    )0,0()0(lim)(

    hht

    hht

    thth

    thth

    t

    t

    th

    th

    t

    thxP

    t

    t

    t

    t

    =

    + 222

    2

    1

    3

    21

    0 })(){(lim

    hht

    hht

    which does not

    exist.

    (b) Let m=n=2 and ),(),( 22

    3

    121xxxxP = . Let ),(

    21zzz = be

    any point then we see that

    =

    2

    21

    20

    03)(z

    zzP .

    3.2. Frechet Derivative

    The derivative of a real function of a real variable is

    defined by

    h

    xfhxf

    xf h)()(

    lim)( 0+

    = (1)

    This definition cannot be used in the case of mapping

    defined in Banach space because h

  • 7/27/2019 Learning Hessian matrix.pdf

    33/100

    is then a vector and division by a vector is meaningless. On

    the other hand, the division by a vector can be easily

    avoided by rewriting (1) in the form

    )()()()( hhxfxfhxf ++=+ (2)

    where is a function ofh such that 0)( h as. 0h Equivalently, we can now say that )(xf is the derivativeof f at x if

    )()()()( hhxfxfhxf +

    =+ (3)

    where 0)( hh as 0h .

    The definition based on (3) can be generalized to include

    mappings from a Banach space into a Banach space. This

    leads to the concept of the Frechet differentiability and

    Frechet derivative.

    Definition. Let f map a ball }:{0

  • 7/27/2019 Learning Hessian matrix.pdf

    34/100

    Here )(afDi

    denotes )(ax

    f

    i

    . If ),(2 RUCf then, for each

    n

    Rv with Uva + , the linear part of

    i

    n

    iii

    uafDvafDuafvaf +=+=1

    )]()([)]()([

    is

    ji

    n

    jiij

    vuafDvuaf ==1,

    )(),)((

    where )(afDij denotes )(ax

    f

    xij

    .

    This process may be continued. If ),( RUCfk , denote

    )()(11

    21

    afxxx

    afDiii

    iii

    kk

    k

    =

    LLLLL

    .

    Define then, for nk

    Rwww ,,,21

    LL ,

    kk

    k iki

    n

    iiiiiiik

    k

    wwwafDwwwaf,,2

    1,,,,121

    )(

    221

    121

    )(),,,)(( LLLL

    LLL=

    =

    where1,1 i

    w denotes the 1i component of 1w . If all wwi = , we

    abbreviate

    ),,,)((21

    )(

    k

    k

    wwwaf LL to .))(()(nk

    waf

    The derivative )(af is representd by a n1 matrix, whosecomponents are )(afD

    i. Also )(af is represented by an

    nn matrix, M say, whose ji, element is )(. afD ji ; ifu

    andv are regarded as colums, then

    MuvvuafT= ),)(( .

  • 7/27/2019 Learning Hessian matrix.pdf

    35/100

    It is shown below that M is a symmetric matrix.

    Example Define RRf n : by the polynomial

    yxyxyxyxyxf222

    623473),( ++++++= .Then

    ]643,1232[)0()0( 21212121

    vvvvvvvfvf ++++=+

    Taking the linear part of this expression, and applying it to

    =

    2

    1

    u

    uu ,

    ( )

    =

    ++=

    2

    1

    21

    2

    1

    2121

    43

    32

    )43,32(),)(0(

    u

    uvv

    u

    uvvvvvuf

    where the 22 matrix consists of second partialderivatives.

    In more abstract terms, let ),( RRL n denote the vector

    space of continuous linear maps from nR into R . Then, in

    terms of continuous linear maps, ),()( RRLaf n ,)),,(()( RRRLLaf n and so on. Also )(af is a bilinear

    map from nn RR into R ; this means that ),)(( vuaf islinear in u for each fixed v, and also linear in v for each

    fixed u.

    Theorem 2 (Taylors)

    Let );,( RUCf k let Ua and let Uxa + . Then

  • 7/27/2019 Learning Hessian matrix.pdf

    36/100

    L+++=+ 2))((!2

    1)(

    !1

    1)()( xafxafafxaf

    kkkk

    xafk

    xafk

    ))((!

    1))((

    )!1(

    1)(1)1( +

    + ,

    where xac += for some in .10

  • 7/27/2019 Learning Hessian matrix.pdf

    37/100

    xy

    fxfyfyxfyx

    )0,0()0,(),0(),(),(

    +=

    (i)

    and, for fixed y, let )0,(),()( xfyxfx = . Then themean-value theorem shows that, for fixed y,

    y

    xfyxf

    xy

    xx

    xy

    xyx xx

    )0,(),()()0()(),(

    =

    =

    =

    For some in 10

  • 7/27/2019 Learning Hessian matrix.pdf

    38/100

    3.4. Example of Bilinear Operator

    Assume that ( ) mi

    Rcc = is a real m-vector, thatij

    aA =

    is a real ),( mm matrix and that B=ijk

    b is a bilinear operator

    from mm RR tomR . Then the mapping mm RRf :

    defined by

    m

    RzBzAzczf ++= ,)( 2 ( where BzzBz =2 ) (b1)

    is called a quadratic operator. The equation 0)( =zf , thatis

    02 =++ cAzBz (b2)

    is called a quadratic equation in mR

    As a simple but very important example to the quadratic

    equation (b2), we consider the algebraic-eigenvalue

    problem

    xTx = (b3)where

    ijtT = is a real (n,n) matrix. We assume that the

    eigenvector )( ixx = has Eulidean length one

    ===

    n

    ii

    xx1

    22

    21 (b4)

    If we set ( ),,,,21 n

    T

    xxxz L= , then (b3) and (b4) can be

    written as a system of nonlinear equations, namely

    0)1(

    2

    1

    )(

    )( 22

    =

    =

    x

    xIT

    zf

    (b5)

  • 7/27/2019 Learning Hessian matrix.pdf

    39/100

    It is well known that (b5) is a quadratic equation of the

    form (b1) where m=n+1 and

    =

    2

    100 L

    T

    c , mRc (b6)

    =

    000

    0

    0

    L

    MTA (b7)

    =

    0000100

    0

    0

    1

    1

    1

    0

    0

    0010

    0

    1

    2

    1

    LL

    MOM

    L

    L

    MO

    O

    OOB . (q8)

    For the mapping (5), we get by using (q6, q7) and (q8) that

    BzfBzAzfBzAzczf 2)(,2)(,)( 2 =+++= .

    Therefore )(zf has the matrix representation

    =

    0)(

    x

    x

    Tzf

    T

    and )(zf is the bilinear operator defined in (b8),multiplied by the factor two.

    For, 2=n ,

    =

    =

    =

    2

    10

    0

    ,,2

    1

    2

    1

    cx

    x

    zx

    xx

    .

  • 7/27/2019 Learning Hessian matrix.pdf

    40/100

    +

    +

    =

    =

    )1(21

    )(

    )(

    )1(

    2

    1

    )(

    )(

    2

    2

    2

    1

    222121

    212111

    2

    2

    xx

    xtxt

    xtxt

    x

    xIT

    zf

    =

    0

    )(

    21

    22221

    11211

    xx

    xtt

    xtt

    zf

    =

    0

    T

    x

    xIT

    =

    000010001

    0

    0

    10

    01

    1

    0

    00

    00

    0

    1

    00

    00

    )(zf

    and for 3=n ,

    =

    0000010000100001 0

    0

    0

    100

    010

    001

    1

    0

    0

    000

    000

    000

    0

    1

    0

    000

    000

    000

    0

    0

    1

    000

    000

    000

    )(zf

    4. Hessian Matrix and Unconstraint Optimization

    In mathematics, the Hessian matrix(or simply the Hessian)

    is the square matrix of second-order partial derivatives of afunction, that is , it describes the local curvature of a

    function of many variables.The Hessian matrix was

    developed in 19th century by the German mathematician

  • 7/27/2019 Learning Hessian matrix.pdf

    41/100

    Ludwig Otto Hesse and later named after him. Hesse

    himself had used the term functional determinants.

    Given the real-valued function),,,(21 n

    xxxf L

    If all second partial derivatives of f exist, then the Hessian

    matrix of f is the matrix)()()( xfDDxfH

    jiij=

    where ),,,(21 n

    xxxx L= andi

    D is the differentiation

    operator with respect to the ith argument and the Hessian

    becomes

    =

    nnn

    n

    n

    xfxxfxxf

    xxfxfxxf

    xxfxxfxf

    fH

    22

    2

    2

    1

    2

    2

    2

    2

    22

    12

    2

    1

    2

    21

    2

    1

    22

    )()()(

    )()()(

    )()()(

    )(

    L

    MOMM

    L

    L

    ))(( xfH is frequently shortened to simply )(xH .

    Some mathematicians define the Hessian as the

    determinant of the above matrix.

    Hessian matrices are used in large-scale optimization

    problems within Newton-type methods because they are the

    coefficient of the quadratic term of a local Taylor

    expansion of a function.That is,

    xxHxxxJxfxxfy T +++= )(2

    1)()()(

    Where J is the Jacobian matrix, which is a vector (the

    gradient for a scalar valued function). The full Hessian

    matrix can be difficult to compute in practice, in such

    situations, quasi-Newton algorithms have been developed

    that use approximations to the Hessian. The best known

    quasi-Newton algorithm is the BFGS algorithm.

  • 7/27/2019 Learning Hessian matrix.pdf

    42/100

    4.1. Mixed derivatives and Symmetry of the Hessian

    The mixed derivatives of f are the entries off the main

    diagonal in the Hessian. Assuming that they are continuous,

    the order of differentiation does not matter (Clairauttheorem).

    For example,

    =

    x

    f

    yy

    f

    x.

    This can also be written asxyyx

    ff = .

    In a formal statement: if the second derivatives of f are all

    continuous in a neiborhood D, then the Hessian of f is asymmetric matrix throughout D.

    Example1

    Consider the real-valued function35),( xyyxf = .

    Then )15,5(),( 23 xyyyxf = and the Hessian matrix,

    ==

    2

    22

    2

    2

    2

    2 ),(

    x

    f

    xy

    f

    yxf

    xf

    yxfH =

    xyy

    y

    3015

    1502

    2

    .

    Example2

    Let 32 3),( yxyxyxf = .

    Then ( )2

    3332),( yxyxy

    f

    x

    f

    yxf =

    =

  • 7/27/2019 Learning Hessian matrix.pdf

    43/100

    The Hessian matrix,

    =

    ==y

    xf

    xyf

    yx

    f

    x

    f

    yxfH

    63

    32),(

    2

    22

    2

    2

    2

    2 .

    Example3 Let a function, 22: RRf is given by

    +=

    xy

    yxyxf

    2),(

    22

    .

    Then

    = xy

    yxyxf 22

    22),( and

    =

    0220

    2002),(2 yxf

    which is not a Hessian matrix but a bilinear operator.

    4.2. Critical points and discriminant

    If the gradient of f is zero at some point x, then f has acritical point (or a stationary point) at x. The determinant of

    the Hessian at x is then called the discriminant. If this

    determinant is zero then x is called a degenerate critical

    point of f, this is also called a non-Morse critical point of f.

    Otherwise, it is non-degenerate, this is called a Morse

    critical point of f.

    For a real-valued function ),( yxf of two variables x and y

    and let1and

    2 be the eigenvalues of the corresponding

    Hessian matrix of f, then

    =+21

    trace (H) and =21

    det(H)

  • 7/27/2019 Learning Hessian matrix.pdf

    44/100

    Example4

    For the previous example, the critical point of f is given by

    ( ) 03332),(2 ==

    yxyxyxf , whence wehave

    )0,0(),( =yx

    Thus,

    =

    03

    32H . The eigenvalues of H are

    ( ) ( )101,101,21

    +=

    Hence 221

    =+ =trace(H) and == 921

    det(H)

    4.3. Functions of one Variable.

    Definitions. Suppose )(xf is a real-valued function defined

    on some interval (The interval I may be finite or infinite,

    open or closed, or half-open.). A point *x in I is:(a) a global minimizer for )(xf on I if

    )()( * xfxf for all x in I;(b) a stritct global minimizer for )(xf on I if

    )()( * xfxf < for all x in I such that *xx ;(c) a local minimizer for )(xf if there is a positive

    number such that )()( * xfxf for all x in I for

    which +

  • 7/27/2019 Learning Hessian matrix.pdf

    45/100

    (e) a critical point of )(xf if )( *xf exists and isequal to zero.

    The Taylors formula (single variable)

    Theorem1. Suppose that )(,)(,)( xfxfxf exist on the

    closed interval [ ] { }bxaRxba = :, .If xx ,* are anytwo different points of ]ba, , then there exists a point zstrictly between *x andx such that

    2**** )(2

    )())(()()( xxzfxxxfxfxf ++= .

    If *x is the critical point of )(xf then the above formula

    reduces to

    2** )(

    2

    )(0)()( xx

    zfxfxf

    ++= ,

    or

    2** )(2

    )()()( xx

    zfxfxf

    =

    for all *xx .

    Theorem 2. Suppose that )(,)(,)( xfxfxf are all

    continuous on an interval Iand that Ix

    *

    is a criticalpoint of )(xf .

    (a) If 0)( xf for all Ix , then *x is a globalminimizer of )(xf on I.

  • 7/27/2019 Learning Hessian matrix.pdf

    46/100

    (b) If 0)( > xf for all Ix , such that *xx , then*

    x is a strict global minimizer of )(xf on I.

    (c) If 0)( > xf , then *x is a strict local minimizerof )(xf .

    Example5. Consider 143)( 34 += xxxf .Since ),1(121212)( 223 == xxxxxf the only critical

    points of )(xf are 0=x and 1=x . Also, since)23(122436)( 2 == xxxxxf we see that

    0)0( =f and 12)1( =f .Therefore, 1=x is a strict local minimizer of )(xf Definition. Suppose that )(xf be a numerical function

    defined on a subset D of nR . A point x in D is

    (i) a global minimizer for )(xf on D if )()( xfxf forall Dx ;

    (ii) a strictly global minimizer for )(xf on D if)()( xfxf < for all Dx such that xx ;

    (iii) a local minimizer for )(xf if there is a positivenumbersuch that )()( xfxf for all Dx forwhich ),( xBx ;

    (iv) a strictly local minimizer for )(xf if there is apositive numbersuch that )()( xfxf < for all

    Dx for which ),( xBx and xx ;(v) a critical point for )(xf if the first partial derivatives

    of )(xf exist at x and

  • 7/27/2019 Learning Hessian matrix.pdf

    47/100

    0)( =i

    xxf for ni ,,2,1 K= ,

    that is0)( = xf

    Example 6. Consider 23 )5(3)4(40)( ++= yxxxyf .))5(6,124( 23 = yxxf .

    0=f gives two critical points )}5,0(),5,3{(),( =yx

    4.4. Taylors Formula (several variables).

    Theorem 5. Suppose that x ,x are points in nR and that

    )(xf is a function of n variables with continuous first andsecond partial derivatives on some open set containing the

    line segment

    [ ] { }10;)(:, +== txxtxwRwxx n joining x andx . Then there exists a [ ]xxz , such that

    ))(()(

    2

    1)()()()( xxzHfxxxxxfxfxf ++=

    Here H denotes the Hessian Matrix

    This formula is used to develop tests for maximizers and

    minimizers among the critial points of a function.

    Theorem 6. Suppose that *x is a critical point of a function

    )(xf with continuous first and second partial derivatives

    on

    n

    R . Then:(a) *x is a global minimizer for )(xf if

    0))(()( xxzHfxx for all nRx and all ],[ * xxz ;

  • 7/27/2019 Learning Hessian matrix.pdf

    48/100

    (b) *x is a strict global minimizer for )(xf if

    0))(()( > xxzHfxx for all nRx such that *xx andall ],[ * xxz ;

    (c)*

    x is a global maximizer for )(xf if0))(()( xxzHfxx for all nRx and all ],[ * xxz ;

    (d) *x is a strict global maximizer for )(xf if

    0))(()(

  • 7/27/2019 Learning Hessian matrix.pdf

    49/100

    In general, )(yQA

    is a sum of terms of the formjiij

    yyc where

    nji ,,1, K= andij

    c is the coefficient which may be zero,

    that is, every term in )(yQA is of second degree in thevariables

    nyyy ,,,

    21K . On the other hand, any function

    ),,(1 n

    yyq K that is the sum of second-degree terms in

    nyyy ,,,

    21K can be expressed as the quadratic form

    associated with an nn -symmetric matrix A by splittingthe coefficient of

    jiyy between the ),( ji

    and ),( ij entries of A.

    Example 7(a). The function

    3221

    2

    3

    2

    2

    2

    1321424),,( yyyyyyyyyyq ++=

    is the sum of second degree terms in .,,321

    yyy

    Splitting the coefficients ofji

    yy , we get

    323231

    312121

    2

    3

    2

    2

    2

    1321

    220

    04),,(

    yyyyyy

    yyyyyyyyyyyyq

    +++

    ++=

    =233213311221

    2

    3

    2

    2

    2

    122004 yyyyyyyyyyyyyyy +++++

    =

    3

    2

    1

    321

    420

    211

    011

    ),,(

    y

    y

    y

    yyy

    where

    =

    =

    420

    211

    011

    333231

    232221

    131211

    ddd

    ddd

    ddd

    A

    with

  • 7/27/2019 Learning Hessian matrix.pdf

    50/100

    =ii

    d coefficient of 2i

    y

    ijd or

    jid =

    2

    1(coefficient of

    jiyy or

    jiy ) , ji .

    (b) 23

    2

    2

    2

    13213),,( yyyyyyq ++=

    323231312121

    2

    3

    2

    2

    2

    1.0.0.0.0.0.03 yyyyyyyyyyyyyyy +++++++=

    ( )

    3

    2

    1

    321

    100

    030

    001

    y

    y

    y

    yyy

    (c) 23

    2

    21321)2(),,( yyyyyyq +=

    21

    2

    3

    2

    2

    2

    144 yyyyy ++=

    3232

    31312121

    2

    3

    2

    2

    2

    1

    .0.0

    .0.0224

    yyyy

    yyyyyyyyyyy

    ++

    ++++=

    ( )

    =

    3

    2

    1

    321

    100

    042

    021

    y

    y

    y

    yyy

    The Hessian )(zHfH = of )(xf evaluated at a point z is annn -symmetric matrix.

    For nRxx *, , the quadratic form )(yQ

    Hassociated with H

    evaluated at *xx is))(()()( *** xxzHfxxxxQ

    H=

  • 7/27/2019 Learning Hessian matrix.pdf

    51/100

    4.6. Positive and Negative Semidefinite and Definiteness

    Definitions. Suppose that A is an nn -symmetric matrix

    and that AyyyQA=

    )( is the quadratic form associated with A. Then A and

    AQ are

    called:

    (a) positive semidefinite if 0)( = AyyyQA

    for alln

    Ry ;(b) positive definite if 0)( >= AyyyQ

    Afor all

    n

    Ry , 0y ;

    (c) negative semidefinite if 0)( = AyyyQA for allnRy , 0y ;

    (d) negative definite if 0)( = AyyyQ

    Afor some nRy and

    0)(

  • 7/27/2019 Learning Hessian matrix.pdf

    52/100

    (d) *x is a strict global maximizer for )(xf if )(xHf is

    negative definite on nR .

    Here are some examples.

    Example 8.

    (a) A symmetric matrix whose entries are all positiveneed not be positive definite. For example, the matrix

    =

    14

    41A

    is not positive definite. For if )1,1( =x , then

    .063

    3)1,1(

    1

    1

    14

    41)1,1()(

  • 7/27/2019 Learning Hessian matrix.pdf

    53/100

    =

    200

    030

    001

    A

    is positive definite because the associated quadratic form

    )(xQA is

    2

    3

    2

    2

    2

    123)( xxxAxxxQ

    A++==

    and so 0)( >xQA

    unless 0321

    === xxx .

    (d) A 33 -diagonal matrix

    =

    3

    2

    1

    00

    00

    00

    d

    d

    d

    A

    is

    (1) positive definite if 0>id for 3,2,1=i ;(2) positive semidefinite if 0i

    d for 3,2,1=i ;

    (3) negative definite if 0> ddd then

    0)( 222

    2

    11+= xdxdxQ

    A

    for all 0x since 0,021

    >> dd , but if )1,0,0(=x , then

    0)( =xQA

    even though 0x .

  • 7/27/2019 Learning Hessian matrix.pdf

    54/100

    (e) If a 22 -symmetric matrix

    =

    cb

    baA

    is positive definite , then 0>a and 0>c . For if )0,1(=x ,

    then 0x and so .0.0.0.21.)(0 22 acbaxQA

    =++=<

    Similarly, if )1,0(=x , then .)(0 cxQA

    =< However,

    (a)shows that there are 22 -symmetric matrices with0,0 >> ca that are not positive definite. We can see that

    the size of b relative to the size of the product ac is the

    determining factor for positive definiteness.

    The examples show that for general symmetric matrices

    there is little relationship between the signs of the matrix

    entries and the positive or negative definite features of the

    matrix. They also show that for diagonal matrices, these

    features are completely transparent.

    Here are some examples of positive definite, positive semi-

    definite, negative definite and negative semi-definite

    matrices in real field:

    Example 9.

    (a)

    =

    210

    121

    012

    A , ( ) 0321

    >=T

    xxxX

    0)()( 23

    2

    1

    2

    32

    2

    21>+++++= xxxxxxAXXT .

  • 7/27/2019 Learning Hessian matrix.pdf

    55/100

    The matrix A is PD

    (b)

    =

    1105

    0181551525

    A , ( ) 0321

    >=T

    xxxX

    0111825 23

    2

    2

    2

    1>++= xxxAXXT .

    The matrix A is PD

    (c)

    =

    151451

    141451

    5551

    1111

    A , ( ) 04321

    >=T

    xxxxX

    0)(9

    )(4)(2

    4

    2

    43

    2

    432

    2

    4321

    >+++

    ++++++=

    xxx

    xxxxxxxAXXT

    .

    The matrix A is PD

    (d)

    =

    1100

    1210

    0121

    0012

    A , ( ) 04321

    >=T

    xxxxX

    0)()()( 243

    2

    32

    2

    21

    2

    1>+++= xxxxxxxAXXT .

    The matrix A is PD

  • 7/27/2019 Learning Hessian matrix.pdf

    56/100

    (e) 322

    2

    1,)( Rxxxxq += , for any nonzero Rx

    3and

    021

    == xx , ( ) 33

    000 Rx

    gives 0)( 22

    2

    1=+= xxxq

    The matrix

    =

    000

    010

    001

    A is PSD

    (f) )3()( 23

    2

    2

    2

    1xxxxq ++=

    ( )

    =

    3

    2

    1

    321

    100

    030

    001

    x

    x

    x

    xxx .

    The matrix

    =

    100

    030

    001

    A is ND.

    (g) ))2(()( 23

    2

    21xxxxq +=

    =( )

    3

    2

    1

    321

    100

    042

    021

    x

    x

    x

    xxx .

    The matrix

    =

    100

    042

    021

    A is NSD

    We will develop two basic tests for positive and negative

    definiteness-one in terms of determinant, and another in

    terms of eigenvalues.

  • 7/27/2019 Learning Hessian matrix.pdf

    57/100

    4.7. Determinant Approach

    We begin by looking at functions of two variables. If A is a

    22

    -symmetric matrix

    =

    2212

    1211

    aa

    aaA

    then the associated quadratic form is.2.)( 2

    2222112

    2

    111xaxxaxaAxxxQ

    A++==

    For any 0x in 2R , either )0,(1

    xx = with

    01

    x ),(21

    xxx = with 02

    x .

    Let us analyze the sign of )(xQA in terms of entries of A in

    each of these two cases

    Case 1. )0,(1

    xx = with 01

    x .

    In this case, 2111

    )( xaxQA

    = so 0)( >xQA

    if and only if

    011

    >a , while 0)( t for all Rt .

    Note that1211

    22)( atat += ,11

    2)( at = , so that

    1112

    * / aat = is a critical point of )(t and this critical point is

    a strict minimizer if 011 >a and strict maximizer if 011 a and if Rt , then

  • 7/27/2019 Learning Hessian matrix.pdf

    58/100

    2212

    1211

    11

    2

    122211

    11

    22

    11

    2

    12

    11

    12*

    1

    )(

    1

    )()()(

    aa

    aa

    aaaaa

    aa

    a

    a

    att

    ==

    +==

    . Thus, if 011

    >a and

    det 02212

    1211 >

    aa

    aa, then 0)( >t for all Rt and so

    0)( >xQA

    for all ),(21

    xxx = with 02

    x .On the other hand,if 0)( >xQ

    Afor all such x, then 0)( >t for all Rt and

    so 011

    >a and the discriminant of )(t

    4442211

    2

    12= aaa det

    2212

    1211

    aa

    aais negative ,

    that is 011

    >a and det 02212

    1211 >

    aa

    aa. An entirely similar

    analysis shows that 0)( a , det 0

    2212

    1211

    >

    aaaa ;

  • 7/27/2019 Learning Hessian matrix.pdf

    59/100

    (b) negative definite if and only if 011

    aa

    aa. The 22 case and a little imagination

    suggest the correct formulation of the general case.

    Suppose A is an nn -symmetric matrix. Definek

    to be

    the determinant of the upper left-hand corner kk -sub-matrix of A for nk 1 .The determinant

    k is called the

    kth principal minor of A

    =

    nnnnn

    n

    n

    n

    aaaa

    aaaa

    aaaa

    aaaa

    A

    L

    MMMMM

    L

    L

    L

    321

    3332313

    2232212

    1131211

    ,111

    a= Aaa

    aan

    det,,det2212

    1211

    2=

    = L .

    The general theorem can be formulated as follows:

    Theorem.9. If A is an nn -symmetric matrix and ifk

    is

    the kth principal minor of A for nk

    1 , then:

    (a) A is positive definite if and only if 0>k

    for k=1, 2 ,

    ... , n;

  • 7/27/2019 Learning Hessian matrix.pdf

    60/100

    (b) A is negative definite if and only if 0)1( >k

    k for

    k=1, 2 , ... , n(that is, the principal minors alternate in

    sign with 01

  • 7/27/2019 Learning Hessian matrix.pdf

    61/100

    (a) 0. >Axx for all 03

    x such that 03

    =x if and only if

    ;0,021

    >>

    (b) 0. Axx for all

    0x such that 03 x if and only if

    =),( ts 022223131233

    2

    22

    2

    11>+++++ tasastaatasa

    for real numbers ts , . In addition, 0.

  • 7/27/2019 Learning Hessian matrix.pdf

    62/100

    0det2212

    1211

    2

    =

    aa

    aa,

    and this unique solution is given by Cramers Rule as

    =

    2223

    1213

    2

    * det1

    aa

    aas , =*t

    2312

    1311

    2

    det1

    aa

    aa. (1)

    if we multiply the equation0

    13

    *

    12

    *

    11=++ atasa

    by *s , and multiply the equation0

    23

    *

    22

    *

    12=++ atasa

    by

    *

    t and add the results, we obtain 02)()( *23

    *

    13

    **

    12

    2*

    22

    2*

    11=++++ tasatsatasa .

    Cosequently,

    33

    *

    23

    *

    13

    ** ),( atasats ++= ,

    and so (1) implies that if

    ,02

    then

    2

    3

    2332313

    232212

    131211

    2

    **det

    det1

    ),(

    =

    =

    =

    A

    aaa

    aaa

    aaa

    ts .

    (2)

    Since

    2

    2212

    1211

    422

    22det),( =

    =

    aa

    aatsH ,

    it follows from Theorem 8 and Theorem 7 that ),( ** ts is a

    strict global minimizer for ),( ts if and only if0,021

    >> . Similarly, ),( ** ts is a strict global maximizer

    for ),( ts if and only if 0,021

    >>> , then the conclusion (a) of Case 1

    shows that if 0x and 03

    =x , then 0. >Axx ; on the other

  • 7/27/2019 Learning Hessian matrix.pdf

    63/100

    hand , the considerations in Case 2 show that if

    ,,,0,031323

    sxxtxxxx == then

    .0),(),(.2

    32

    3

    **2

    3

    2

    3>

    == xtsxtsxAxx

    Therefore 0. >AXx for all 0x if 0,0,0321

    >>> .

    On the other hand , if 0. >Axx for all 0x , then theconclusion (a) of Case 1 shows that 0,0

    21>> . Also, if

    )1,,( *** tsx = , then (2) yields

    ,0),( ****

    1

    3 >==

    Axxts so 0

    3> . This proves part (a)

    of the theorem for 3=n

Example 9. (a) Minimize the function
$$f(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2 + x_1x_2 + x_2x_3 + x_1x_3.$$

The critical points of $f(x_1, x_2, x_3)$ are the solutions of the system
$$2x_1 + x_2 + x_3 = 0, \qquad x_1 + 2x_2 + x_3 = 0, \qquad x_1 + x_2 + 2x_3 = 0,$$
or
$$\begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.$$

Since $\det\begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix} \neq 0$, $x_1 = 0$, $x_2 = 0$, $x_3 = 0$ is the one and only solution.

The Hessian of $f(x_1, x_2, x_3)$ is the constant matrix
$$Hf(x_1, x_2, x_3) = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}.$$

Note that $\Delta_1 = 2$, $\Delta_2 = 3$, $\Delta_3 = 4$, so $Hf(x_1, x_2, x_3)$ is positive definite everywhere on $R^3$. It follows from Theorem 7 that the critical point $(0, 0, 0)$ is a strict global minimizer for $f(x_1, x_2, x_3)$, and this is the only critical point of $f(x_1, x_2, x_3)$.

(b) Find the global minimizer of
$$f(x, y, z) = e^{x-y} + e^{y-x} + e^{x^2} + z^2.$$

To this end, compute
$$\nabla f(x, y, z) = \begin{pmatrix} e^{x-y} - e^{y-x} + 2xe^{x^2} \\ -e^{x-y} + e^{y-x} \\ 2z \end{pmatrix}$$
and
$$Hf(x, y, z) = \begin{pmatrix} e^{x-y} + e^{y-x} + 4x^2e^{x^2} + 2e^{x^2} & -(e^{x-y} + e^{y-x}) & 0 \\ -(e^{x-y} + e^{y-x}) & e^{x-y} + e^{y-x} & 0 \\ 0 & 0 & 2 \end{pmatrix}.$$

Clearly $\Delta_1 > 0$ for all $x, y, z$ because all of its terms are positive. Also,
$$\Delta_2 = (e^{x-y} + e^{y-x})(e^{x-y} + e^{y-x} + 4x^2e^{x^2} + 2e^{x^2}) - (e^{x-y} + e^{y-x})^2 = (e^{x-y} + e^{y-x})(4x^2e^{x^2} + 2e^{x^2}) > 0,$$
because both factors are always positive. Finally, $\Delta_3 = 2\Delta_2 > 0$. Hence $Hf(x, y, z)$ is positive definite at all points. Therefore, by Theorem 7, $f(x, y, z)$ is strictly globally minimized at any critical point $(x^*, y^*, z^*)$. To find $(x^*, y^*, z^*)$, solve
$$\nabla f(x^*, y^*, z^*) = \begin{pmatrix} e^{x^*-y^*} - e^{y^*-x^*} + 2x^*e^{(x^*)^2} \\ -e^{x^*-y^*} + e^{y^*-x^*} \\ 2z^* \end{pmatrix} = 0.$$
This leads to $z^* = 0$ and $e^{x^*-y^*} = e^{y^*-x^*}$, hence $2x^*e^{(x^*)^2} = 0$. Accordingly, $x^* - y^* = y^* - x^*$; that is, $x^* = y^*$, and $x^* = 0$. Therefore $(x^*, y^*, z^*) = (0, 0, 0)$ is the global minimizer of $f(x, y, z)$.
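Since $Hf$ is positive definite everywhere, any reasonable descent method should find the same point. As a cross-check (our addition, assuming NumPy and SciPy are available), a general-purpose minimizer started away from the origin converges to $(0, 0, 0)$:

    import numpy as np
    from scipy.optimize import minimize

    def f(v):
        x, y, z = v
        return np.exp(x - y) + np.exp(y - x) + np.exp(x**2) + z**2

    res = minimize(f, x0=np.array([1.0, -1.0, 2.0]))
    print(res.x)      # approximately [0, 0, 0]
    print(res.fun)    # approximately 3.0, since f(0, 0, 0) = 1 + 1 + 1 = 3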

(c) Find the global minimizers of
$$f(x, y) = e^{x-y} + e^{y-x}.$$

To this end, compute
$$\nabla f(x, y) = \begin{pmatrix} e^{x-y} - e^{y-x} \\ -e^{x-y} + e^{y-x} \end{pmatrix}$$
and
$$Hf(x, y) = \begin{pmatrix} e^{x-y} + e^{y-x} & -(e^{x-y} + e^{y-x}) \\ -(e^{x-y} + e^{y-x}) & e^{x-y} + e^{y-x} \end{pmatrix}.$$

Since $e^{x-y} + e^{y-x} > 0$ for all $x, y$ and $\det Hf(x, y) = 0$, the Hessian $Hf(x, y)$ is positive semidefinite for all $x, y$. Therefore, by Theorem 7, $f(x, y)$ is minimized at any critical point $(x^*, y^*)$ of $f(x, y)$.

To find $(x^*, y^*)$, solve
$$0 = \nabla f(x^*, y^*) = \begin{pmatrix} e^{x^*-y^*} - e^{y^*-x^*} \\ -e^{x^*-y^*} + e^{y^*-x^*} \end{pmatrix}.$$
This gives $e^{x^*-y^*} = e^{y^*-x^*}$, or $x^* - y^* = y^* - x^*$; that is, $2x^* = 2y^*$. This shows that all points of the line $y = x$ are global minimizers of $f(x, y)$.

(d) Find the global minimizers of
$$f(x, y) = e^{x-y} + e^{x+y}.$$

In this case,
$$\nabla f(x, y) = \begin{pmatrix} e^{x-y} + e^{x+y} \\ -e^{x-y} + e^{x+y} \end{pmatrix}$$
and
$$Hf(x, y) = \begin{pmatrix} e^{x-y} + e^{x+y} & -e^{x-y} + e^{x+y} \\ -e^{x-y} + e^{x+y} & e^{x-y} + e^{x+y} \end{pmatrix}.$$

Since $e^{x-y} + e^{x+y} > 0$ for all $x, y$ and $\det Hf(x, y) = (e^{x-y} + e^{x+y})^2 - (e^{x+y} - e^{x-y})^2 = 4e^{2x} > 0$, $Hf(x, y)$ is positive definite for all $x, y$. Therefore, by Theorem 7, $f(x, y)$ is minimized at any critical point $(x^*, y^*)$. To find $(x^*, y^*)$, solve
$$0 = \nabla f(x^*, y^*) = \begin{pmatrix} e^{x^*-y^*} + e^{x^*+y^*} \\ -e^{x^*-y^*} + e^{x^*+y^*} \end{pmatrix}.$$
Thus $e^{x^*-y^*} + e^{x^*+y^*} = 0$. But $e^{x^*-y^*} > 0$ and $e^{x^*+y^*} > 0$ for all $x^*, y^*$, so the equality $e^{x^*-y^*} + e^{x^*+y^*} = 0$ is impossible. Thus $f(x, y)$ has no critical points, and hence $f(x, y)$ has no global minimizers.

There is no disputing that global minimization is far more important than mere local minimization. Still, there are certain situations in which scientists want knowledge of local minimizers of a function. Since we are in an excellent position to understand local minimization, let us get on with it. The basic fact to understand is the next theorem.

Theorem 10. Suppose that $f(x)$ is a function with continuous first and second partial derivatives on some set $D$ in $R^n$. Suppose $x^*$ is an interior point of $D$ and that $x^*$ is a critical point of $f(x)$. Then $x^*$ is:

(a) a strict local minimizer of $f(x)$ if $Hf(x^*)$ is positive definite;

(b) a strict local maximizer of $f(x)$ if $Hf(x^*)$ is negative definite.

Now we briefly investigate the meaning of an indefinite Hessian at a critical point of a function. Suppose that $f(x)$ has continuous second partial derivatives on a set $D$ in $R^n$, that $x^*$ is an interior point of $D$ which is a critical point of $f(x)$, and that $Hf(x^*)$ is indefinite. This means that there are nonzero vectors $y, w$ in $R^n$ such that
$$y \cdot Hf(x^*)y > 0, \qquad w \cdot Hf(x^*)w < 0.$$
Since $f(x)$ has continuous second partial derivatives on $D$, there is a $\delta > 0$ such that the one-variable functions
$$Y(t) = f(x^* + ty), \qquad W(t) = f(x^* + tw)$$
are defined for all $t$ with $|t| < \delta$; since $x^*$ is a critical point, $Y'(0) = W'(0) = 0$, while $Y''(0) = y \cdot Hf(x^*)y > 0$ and $W''(0) = w \cdot Hf(x^*)w < 0$. Therefore, $t = 0$ is a strict local minimizer for $Y(t)$ and a strict local maximizer for $W(t)$.

Thus, if we move from $x^*$ in the direction of $y$ or $-y$, the values of $f(x)$ increase, but if we move from $x^*$ in the direction of $w$ or $-w$, the values of $f(x)$ decrease. For this reason, we call the critical point $x^*$ a saddle point.

The following result summarizes this little discussion:

Theorem 11. If $f(x)$ is a function with continuous second partial derivatives on a set $D$ in $R^n$, if $x^*$ is an interior point of $D$ that is a critical point of $f(x)$, and if the Hessian $Hf(x^*)$ is indefinite, then $x^*$ is a saddle point for $f(x)$.
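Theorems 10 and 11 together give a mechanical recipe for classifying a critical point: evaluate the Hessian there and test its definiteness, here via eigenvalue signs (the eigenvalue characterization is stated below as Theorem 12). A minimal sketch, assuming NumPy; the function name classify_critical_point is our own:

    import numpy as np

    def classify_critical_point(H, tol=1e-10):
        # H: symmetric Hessian matrix evaluated at a critical point.
        lam = np.linalg.eigvalsh(H)
        if np.all(lam > tol):
            return "strict local minimizer"   # positive definite, Theorem 10(a)
        if np.all(lam < -tol):
            return "strict local maximizer"   # negative definite, Theorem 10(b)
        if lam.min() < -tol and lam.max() > tol:
            return "saddle point"             # indefinite, Theorem 11
        return "inconclusive"                 # some eigenvalue is numerically zero

    print(classify_critical_point(np.array([[2.0, 1.0], [1.0, 2.0]])))      # strict local minimizer
    print(classify_critical_point(np.array([[0.0, -12.0], [-12.0, 0.0]])))  # saddle point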

Example 10. Let us look for the global and local minimizers and maximizers (if any) of the function
$$f(x_1, x_2) = x_1^3 + 8x_2^3 - 12x_1x_2.$$

In this case, the critical points are the solutions of the system
$$0 = \frac{\partial f}{\partial x_1} = 3x_1^2 - 12x_2, \qquad 0 = \frac{\partial f}{\partial x_2} = 24x_2^2 - 12x_1.$$
This system can be readily solved to identify the critical points $(2, 1)$ and $(0, 0)$.

The Hessian of $f(x_1, x_2)$ is
$$Hf(x_1, x_2) = \begin{pmatrix} 6x_1 & -12 \\ -12 & 48x_2 \end{pmatrix}.$$
Since
$$Hf(2, 1) = \begin{pmatrix} 12 & -12 \\ -12 & 48 \end{pmatrix}$$
and since $\Delta_1 = 12$ and $\Delta_2 = 432$, it follows that the critical point $(2, 1)$ is a strict local minimizer.

Now let us see whether $(2, 1)$ is a global minimizer. Observe that $Hf(x_1, x_2)$ is not positive definite for all $(x_1, x_2)$; for example,
$$Hf(0, 1) = \begin{pmatrix} 0 & -12 \\ -12 & 48 \end{pmatrix}$$
is indefinite. In view of Theorem 7, this leads us to suspect that $(2, 1)$ may not be a global minimizer. The fact that
$$\lim_{x_1 \to -\infty} f(x_1, 0) = -\infty$$
shows conclusively that $f(x_1, x_2)$ has no global minimizer. Moreover, since
$$\lim_{x_1 \to +\infty} f(x_1, 0) = +\infty,$$
we see that there are no global maximizers or global minimizers.

How about the critical point $(0, 0)$? Well, since
$$Hf(0, 0) = \begin{pmatrix} 0 & -12 \\ -12 & 0 \end{pmatrix},$$
this matrix miserably fails the tests for positive definiteness, and this leads us to expect trouble at $(0, 0)$. Theorem 11 tells us that there is a saddle point at $(0, 0)$.
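All of Example 10 can be reproduced symbolically; the following sketch (our addition, assuming SymPy) finds the real critical points and classifies each by the eigenvalues of the Hessian:

    import sympy as sp

    x1, x2 = sp.symbols("x1 x2", real=True)
    f = x1**3 + 8*x2**3 - 12*x1*x2

    grad = [sp.diff(f, v) for v in (x1, x2)]
    crit = sp.solve(grad, (x1, x2), dict=True)
    real_crit = [c for c in crit if all(v.is_real for v in c.values())]
    print(real_crit)                       # expected: [{x1: 0, x2: 0}, {x1: 2, x2: 1}]

    H = sp.hessian(f, (x1, x2))
    for c in real_crit:
        print(c, H.subs(c).eigenvals())    # eigenvalue signs classify each point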


    4.8. Eigenvalues and Positive Definite Matrices

If $A$ is an $n \times n$ matrix and if $x$ is a nonzero vector in $R^n$ such that $Ax = \lambda x$ for some real or complex number $\lambda$, then $\lambda$ is called an eigenvalue of $A$. If $\lambda$ is an eigenvalue of $A$, then any nonzero vector $x$ that satisfies the equation $Ax = \lambda x$ is an eigenvector of $A$ corresponding to $\lambda$. Since $\lambda$ is an eigenvalue of an $n \times n$ matrix $A$ if and only if the homogeneous system $(A - \lambda I)x = 0$ of $n$ equations in $n$ unknowns has a nonzero solution $x$, it follows that the eigenvalues of $A$ are just the roots of the characteristic equation
$$\det(A - \lambda I) = 0.$$
Since $\det(A - \lambda I)$ is a polynomial of degree $n$ in $\lambda$, the characteristic equation has $n$ real or complex roots if we count the multiple roots according to their multiplicities, so an $n \times n$ matrix $A$ has $n$ real or complex eigenvalues, counting their multiplicities.

Symmetric matrices have the following special properties with respect to eigenvalues and eigenvectors:

(1) All of the eigenvalues of a symmetric matrix are real numbers.

(2) Eigenvectors corresponding to distinct eigenvalues of a symmetric matrix are orthogonal.

(3) If $\lambda$ is an eigenvalue of multiplicity $k$ for a symmetric matrix $A$, there are $k$ linearly independent eigenvectors corresponding to $\lambda$. By applying the Gram-Schmidt orthogonalization process, we can always replace these $k$ linearly independent eigenvectors with a set of $k$ mutually orthogonal eigenvectors of unit length.

Combining (2) and (3), we see that if $A$ is an $n \times n$ symmetric matrix, there are $n$ mutually orthogonal unit eigenvectors $u^{(1)}, \ldots, u^{(n)}$ corresponding to the $n$ eigenvalues $\lambda_1, \ldots, \lambda_n$. If $P$ is the $n \times n$ matrix whose $i$th column is the unit eigenvector $u^{(i)}$ corresponding to $\lambda_i$, and if $D$ is the diagonal matrix with the eigenvalues $\lambda_1, \ldots, \lambda_n$ down the main diagonal, then the following matrix equation holds:
$$AP = PD,$$
because $Au^{(i)} = \lambda_i u^{(i)}$ for $i = 1, \ldots, n$. Since the matrix $P$ is orthogonal (that is, its columns are mutually orthogonal unit vectors), $P$ is invertible, and the inverse $P^{-1}$ of $P$ is just the transpose $P^T$ of $P$. It follows that
$$P^TAP = D;$$
that is, the orthogonal matrix $P$ diagonalizes $A$. If $Q_A(x) = x \cdot Ax$ is the quadratic form associated with the symmetric matrix $A$ and if $x = Py$, then
$$Q_A(x) = x \cdot Ax = (Py)^TA(Py) = y^T(P^TAP)y = y^TDy = \lambda_1 y_1^2 + \lambda_2 y_2^2 + \cdots + \lambda_n y_n^2.$$
Moreover, since $P$ is invertible, $x \neq 0$ if and only if $y \neq 0$. Also, if $y^{(i)}$ is the vector in $R^n$ with the $i$th component equal to 1 and all other components equal to zero, and if $x^{(i)} = Py^{(i)}$, then
$$Q_A(x^{(i)}) = \lambda_i$$

for all $i = 1, \ldots, n$. These considerations yield the following eigenvalue test for definite, semidefinite, and indefinite matrices.

Theorem 12. If $A$ is a symmetric matrix, then:

(a) the matrix $A$ is positive definite (resp. negative definite) if and only if all the eigenvalues of $A$ are positive (resp. negative);

(b) the matrix $A$ is positive semidefinite (resp. negative semidefinite) if and only if all the eigenvalues of $A$ are nonnegative (resp. nonpositive);

(c) the matrix $A$ is indefinite if and only if $A$ has at least one positive eigenvalue and at least one negative eigenvalue.
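Both the diagonalization $P^TAP = D$ and the eigenvalue test of Theorem 12 can be exercised with NumPy's symmetric eigensolver (our sketch; np.linalg.eigh returns the eigenvalues in ascending order together with an orthogonal matrix of unit eigenvectors). Here $A$ is the Hessian from Example 9(a):

    import numpy as np

    A = np.array([[2.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 1.0, 2.0]])

    lam, P = np.linalg.eigh(A)     # eigenvalues and orthonormal eigenvector columns
    D = P.T @ A @ P                # P^T A P = D, diagonal
    print(np.round(D, 10))         # diag(1, 1, 4)
    print(np.all(lam > 0))         # True: positive definite by Theorem 12(a)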

Example 11. Let us locate all maximizers, minimizers, and saddle points of
$$f(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2 + 4x_1x_2.$$

The critical points of $f(x_1, x_2, x_3)$ are the solutions of the system of equations
$$0 = \frac{\partial f}{\partial x_1} = 2x_1 + 4x_2, \qquad 0 = \frac{\partial f}{\partial x_2} = 4x_1 + 2x_2, \qquad 0 = \frac{\partial f}{\partial x_3} = 2x_3.$$
$(0, 0, 0)$ is the one and only solution of this system.

The Hessian of $f(x_1, x_2, x_3)$ is the constant matrix
$$Hf(x_1, x_2, x_3) = \begin{pmatrix} 2 & 4 & 0 \\ 4 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}.$$

The eigenvalues of the Hessian matrix are $\lambda = -2, 6, 2$, so the Hessian is indefinite at the critical point $(0, 0, 0)$, and hence it is a saddle point for $f(x_1, x_2, x_3)$.
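A two-line check of this conclusion (our addition, again with NumPy):

    import numpy as np

    H = np.array([[2.0, 4.0, 0.0],
                  [4.0, 2.0, 0.0],
                  [0.0, 0.0, 2.0]])
    print(np.linalg.eigvalsh(H))   # [-2, 2, 6]: one negative, two positive eigenvalues,
                                   # so H is indefinite and (0, 0, 0) is a saddle point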

4.9. Problems posed as minimization problems.

(a) Consider the system of equations
$$Ax = b$$
with $A$ of order $(m, n)$ and $m > n$. The problem becomes that of finding the $x$ which minimizes $f(x) = \|Ax - b\|_2^2$; this is a quadratic function in $x$, and hence the vector $g = \nabla f(x)$ containing the first-order partial derivatives is linear in $x$. The solution is found to be that of
$$A^TAx = A^Tb.$$

Note that the Hessian of $f$ is $2A^TA$, which is seen to be a positive definite matrix when the rank of $A$ is $n$, meaning that the problem is posed as a minimization problem.
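In code, the normal equations can be formed and solved directly (a sketch, assuming NumPy; for ill-conditioned $A$ one would prefer np.linalg.lstsq, which solves the same least-squares problem more stably):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((10, 3))        # m = 10 > n = 3; full column rank almost surely
    b = rng.standard_normal(10)

    x = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations  A^T A x = A^T b
    print(x)
    print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True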

(b) Another example is the least squares method for fitting the straight line $y = ax + b$ to the set of points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. This problem becomes that of finding the constants $a$ and $b$ which minimize the function
$$f(a, b) = \sum_{i=1}^{n}(y_i - ax_i - b)^2.$$
They are given by
$$0 = \frac{\partial f}{\partial a} = -2\sum_{i=1}^{n}(y_i - ax_i - b)x_i$$
and
$$0 = \frac{\partial f}{\partial b} = -2\sum_{i=1}^{n}(y_i - ax_i - b).$$
The above equations can be organized in the convenient form
$$\begin{pmatrix} n & \sum_{i=1}^{n}x_i \\ \sum_{i=1}^{n}x_i & \sum_{i=1}^{n}x_i^2 \end{pmatrix}\begin{pmatrix} b \\ a \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n}y_i \\ \sum_{i=1}^{n}x_iy_i \end{pmatrix}.$$
We can easily justify that the matrix on the left-hand side is nonsingular (provided the $x_i$ are not all equal), because
$$n\sum_{i=1}^{n}x_i^2 > \Bigl(\sum_{i=1}^{n}x_i\Bigr)^2.$$

[Hölder's Inequality:
$$\sum_{i=1}^{n}|a_ib_i| \le \Bigl(\sum_{i=1}^{n}|a_i|^p\Bigr)^{1/p}\Bigl(\sum_{i=1}^{n}|b_i|^q\Bigr)^{1/q}, \qquad \frac{1}{p} + \frac{1}{q} = 1, \quad p > 1.$$
Letting $a_i = x_i$, $b_i = 1$, and $p = q = 2$, we get
$$\sum_{i=1}^{n}x_i \le \Bigl(\sum_{i=1}^{n}x_i^2\Bigr)^{1/2}n^{1/2}, \qquad \text{i.e.,} \qquad \Bigl(\sum_{i=1}^{n}x_i\Bigr)^2 \le n\sum_{i=1}^{n}x_i^2,$$
with equality only when all the $x_i$ are equal.]

The matrix on the left-hand side is also, up to the factor 2 and the ordering of the variables, the Hessian matrix of the function $f(a, b)$; this being positive definite means that we are again dealing with a minimization problem.
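The 2×2 system can be assembled verbatim (our sketch, assuming NumPy; the unknowns are ordered $(b, a)$ to match the displayed matrix):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])     # roughly y = 2x + 1

    M = np.array([[len(x), x.sum()],
                  [x.sum(), (x**2).sum()]])
    rhs = np.array([y.sum(), (x * y).sum()])
    b, a = np.linalg.solve(M, rhs)
    print(a, b)      # slope about 1.96, intercept about 1.1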

Many other nonlinear functions, which seem difficult to deal with except by nonlinear techniques, can be dealt with using the same technique explained above if they can be formulated as a quadratic objective function. For a model such as $y = e^bx^a$, the procedure is to take the logarithm of both sides and formulate the problem as follows:
$$\min_{a,b}\ \sum_{i=1}^{n}(\ln y_i - a\ln x_i - b)^2.$$
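A sketch of this log-transformed fit (our addition, assuming NumPy and positive data):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = 3.0 * x**1.5                          # exact power law, so a = 1.5 and e^b = 3

    X = np.column_stack([np.log(x), np.ones_like(x)])   # columns for a*ln(x) + b
    a, b = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
    print(a, np.exp(b))                       # approximately 1.5 and 3.0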

If the function is not quadratic, the above method fails, for $g(x) = \nabla f(x) = 0$ generates a set of $n$ nonlinear simultaneous equations. The procedure will then be to choose a guess point $x^0$ and improve on it until the solution is reached. The procedure is as follows.

Expand $f(x)$ around $x^0$ by Taylor's series:
$$f(x) = f(x^0) + g_{x^0}\cdot(x - x^0) + \frac{1}{2}(x - x^0)^TH_{x^0}(x - x^0) + O(\|x - x^0\|^3).$$
Now $\hat{x}$ is defined as the point at which
$$\left.\frac{\partial f}{\partial x}\right|_{x = \hat{x}} = 0.$$

Differentiating the Taylor expansion with respect to $x$, or equivalently with respect to $x - x^0$, gives
$$0 \approx g_{x^0} + H_{x^0}(\hat{x} - x^0),$$
where the third term of the expansion is neglected if $x^0$ is rightly chosen near $\hat{x}$. Hence
$$\hat{x} \approx x^0 - H_{x^0}^{-1}g_{x^0},$$
where the vector $y = -H^{-1}g$ is obtained by solving the linear equations
$$Hy = -g.$$
Because the above correction for $\hat{x}$ is only approximate, the solution $\hat{x}$ can be improved by iteration.

The convergence of the above method is guaranteed if $H$ after every iteration is found positive definite for a minimization problem, or negative definite for a maximization problem.

For example, for a minimization problem,
$$f(\hat{x}) = f(x^0) + g_{x^0}^T(\hat{x} - x^0) + \frac{1}{2}(\hat{x} - x^0)^TH_{x^0}(\hat{x} - x^0)$$
$$= f(x^0) - g_{x^0}^TH_{x^0}^{-1}g_{x^0} + \frac{1}{2}(H_{x^0}^{-1}g_{x^0})^TH_{x^0}(H_{x^0}^{-1}g_{x^0})$$
$$= f(x^0) - g_{x^0}^TH_{x^0}^{-1}g_{x^0} + \frac{1}{2}g_{x^0}^TH_{x^0}^{-1}g_{x^0}$$
$$= f(x^0) - \frac{1}{2}g_{x^0}^TH_{x^0}^{-1}g_{x^0},$$
and since $H_{x^0}$ is positive definite, $H_{x^0}^{-1}$ is also positive definite. Hence
$$g_{x^0}^TH_{x^0}^{-1}g_{x^0} > 0.$$
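The iteration just described is Newton's method. A minimal runnable sketch (our addition, assuming NumPy; grad and hess are user-supplied callables, and newton_minimize is our own name):

    import numpy as np

    def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
        # Repeat x <- x + y, where H(x) y = -g(x), as in the text.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            y = np.linalg.solve(hess(x), -g)    # solve H y = -g
            x = x + y
        return x

    # Applied to f(x1, x2) = x1^3 + 8 x2^3 - 12 x1 x2 from Example 10:
    grad = lambda v: np.array([3*v[0]**2 - 12*v[1], 24*v[1]**2 - 12*v[0]])
    hess = lambda v: np.array([[6*v[0], -12.0], [-12.0, 48*v[1]]])
    print(newton_minimize(grad, hess, [3.0, 2.0]))   # converges to the local minimizer (2, 1)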

From this, we obtain
$$f(\hat{x}) = f(x^0) - \frac{1}{2}g_{x^0}^TH_{x^0}^{-1}g_{x^0} < f(x^0),$$
so each iteration strictly decreases the value of $f$.

[Pages 79-82 of the transcript are missing here. The surviving fragments concern minimizing a quadratic function along successive search directions $d^1, d^2, \ldots$ with step lengths of the form $\alpha^k = \langle g^k, g^k\rangle/\langle d^k, Hd^k\rangle$, and establish that the successive gradients are mutually orthogonal: $\langle g^2, g^1\rangle = 0$ and, after minimizing in the direction of $d^2$, $\langle g^3, g^2\rangle = 0$. The latter property holds because $g^2$ is a linear combination of $d^2$ and $d^1$, and both $d^2$ and $d^1$ are orthogonal to $g^3$.]

Example. Minimize the function
$$f(x, y) = 4x^2 - 4xy + 2y^2.$$

Solution. The function can be rewritten as
$$f(x, y) = \begin{pmatrix} x & y \end{pmatrix}\begin{pmatrix} 4 & -2 \\ -2 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix},$$
with
$$\nabla f(x, y) = (8x - 4y,\ -4x + 4y), \qquad H = \nabla^2 f(x, y) = \begin{pmatrix} 8 & -4 \\ -4 & 4 \end{pmatrix}.$$

Let us start with
$$x^1 = \begin{pmatrix} 2 \\ 3 \end{pmatrix}.$$
Then $g^1 = (8\cdot 2 - 4\cdot 3,\ -4\cdot 2 + 4\cdot 3) = (4, 4)$ and $d^1 = -g^1 = -(4, 4)$, so
$$\alpha^1 = \frac{\langle g^1, g^1\rangle}{\langle d^1, Hd^1\rangle} = \frac{\langle (4, 4), (4, 4)\rangle}{\langle (-4, -4), (-16, 0)\rangle} = \frac{32}{64} = \frac{1}{2}.$$
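The arithmetic of this first step can be verified mechanically (our sketch, assuming NumPy; the final line computes the next iterate $x^2 = x^1 + \alpha^1 d^1$, one step beyond where the transcript breaks off):

    import numpy as np

    H = np.array([[8.0, -4.0], [-4.0, 4.0]])
    grad = lambda v: H @ v                    # f(x, y) = 4x^2 - 4xy + 2y^2 has gradient Hx

    x1 = np.array([2.0, 3.0])
    g1 = grad(x1)                             # (4, 4)
    d1 = -g1
    alpha1 = (g1 @ g1) / (d1 @ H @ d1)        # 32 / 64 = 0.5
    print(alpha1)
    print(x1 + alpha1 * d1)                   # next iterate: (0, 1)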