
Chapter 2

Optimization and Solving Nonlinear Equations

This chapter deals with an important problem in mathematics and statistics: finding values of x that satisfy f(x) = 0. Such values are called the roots of the equation and are also known as the zeros of f(x).

2.1 The bisection method

The goal is to find the solution of an equation f(x) = 0.

A question that should be raised first is the following: is there a (real) root of f(x) = 0 at all? One answer is provided by the intermediate value theorem.

Intermediate value theorem. If f(x) is continuous on an interval [a, b], and f(a) and f(b) have opposite signs, i.e., f(a)f(b) < 0, then there exists a point ξ ∈ (a, b) such that f(ξ) = 0.

The intermediate value theorem guarantees that a root exists under those conditions. However, it does not tell us the precise value of the root ξ.

The bisection method assumes that we know two values a and b such that f(a)f(b) < 0, and works by repeatedly narrowing the gap between a and b until it closes in on the root.

It narrows the gap by taking the average (a+b)/2 of a and b. If f((a+b)/2) = 0, then (a+b)/2 is a root. Otherwise, look at the two subintervals (a, (a+b)/2) and ((a+b)/2, b). By the intermediate value theorem again, there must be a root in the interval (a, (a+b)/2) when f(a)f((a+b)/2) < 0, or in the interval ((a+b)/2, b) when f((a+b)/2)f(b) < 0. We continue this procedure until the desired accuracy has been achieved.
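Since each step halves the bracketing interval, after n steps the root is confined to an interval of length (b − a)/2ⁿ. A minimal sketch of this bound in R (the helper name steps_needed is ours, not part of the notes):

steps_needed=function(a,b,tol){ ceiling(log2((b-a)/tol)) } # solve (b-a)/2^n <= tol for n
steps_needed(0, 1, 1e-6) # 20 steps suffice on a unit interval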



Example 1. Find the zeros of f(x) = 5x^5 − 4x^4 + 3x^3 − 2x^2 + x − 1.

There is at least one real zero of f(x). (Why? f is a polynomial of odd degree, so f(x) → −∞ as x → −∞ and f(x) → +∞ as x → +∞; the intermediate value theorem then applies.)

It is best to start by drawing a graph of f(x).

> f=function(x){5*x^5-4*x^4+3*x^3-2*x^2+x-1}

> x=seq(-50, 50, length=500)

> plot(x, f(x))

> x=seq(-5, 5, length=500)

> plot(x, f(x))

> x=seq(0, 1, length=500)

> plot(x, f(x))

Next we use the bisection method to find the zero between 0 and 1.

> f(0)

[1] -1

> f(1)

[1] 2

> f(.5) # f value at midpoint of (0, 1)

[1] -0.71875 # This suggests the next interval is (0.5, 1)

> f(0.75) # f value at midpoint of (0.5, 1)

[1] -0.1884766 # Go to (0.75, 1)

> f(0.875) # f value at midpoint of (0.75, 1)

[1] 0.5733337 # Go to (0.75, 0.875)

> f(0.8125) # f value at midpoint of (0.75, 0.875)

[1] 0.1285563 # Go to (0.75, 0.8125)

> f(0.78125) # at midpoint of (0.75, 0.8125)

[1] -0.04386625 # Go to (0.78125, 0.8125)

> f(0.796875) # at midpoint of (0.78125, 0.8125)

[1] 0.03862511 # Go to (0.78125, 0.796875)

> (0.78125+ 0.796875)/2

[1] 0.7890625

> f(0.7890625) # at midpoint of (0.78125, 0.796875)

[1] -0.003519249 # Go to (0.7890625, 0.796875)


> (0.7890625+ 0.796875)/2

[1] 0.7929688

> f(0.7929688) # at midpoint of (0.7890625, 0.796875)

[1] 0.01732467 # Go to (0.7890625, 0.7929688)

> (0.7890625+ 0.7929688)/2

[1] 0.7910157

> f(0.7910157)

[1] 0.006846331 # Go to (0.7890625, 0.7910157)

> (0.7890625+ 0.7910157)/2

[1] 0.7900391

> f((0.7890625+ 0.7910157)/2) # Go to (0.7890625, 0.7900391)

[1] 0.001649439

> (0.7890625+ 0.7900391)/2

[1] 0.7895508

> f((0.7890625+ 0.7900391)/2) # What do you think?

[1] -0.0009384231 # Is f(0.7895508) close enough to 0?

Below is a simple implementation.

f=function(x){5*x^5-4*x^4+3*x^3-2*x^2+x-1}

bisection=function(a,b,n){
  xa=a
  xb=b
  for(i in 1:n){
    if(f(xa)*f((xa+xb)/2)<0) xb=(xa+xb)/2 # a root lies in the left half
    else xa=(xa+xb)/2                     # otherwise keep the right half
  }
  list(left=xa, right=xb, midpoint=(xa+xb)/2)
}

> bisection(0,1,15)

$left

[1] 0.7897034

$right

[1] 0.7897339

$midpoint

[1] 0.7897186
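As a sanity check (our addition, not part of the notes), base R's uniroot() is a bracketing root finder and should agree with the bisection result:

> uniroot(f, c(0, 1))$root # approximately 0.78972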


Example 2. Find the maximum of

\[
g(x) = \frac{\log x}{1 + x}, \qquad x > 0.
\]

Since g(x) is differentiable, we look at its derivative

\[
g'(x) = \frac{\frac{1}{x}(1 + x) - \log x}{(1 + x)^2} = \frac{1}{(1 + x)^2}\left(\frac{x + 1}{x} - \log x\right), \qquad x > 0,
\]

and find a critical point of g(x) by solving g′(x) = 0, or equivalently by solving

\[
\frac{x + 1}{x} - \log x = 0,
\]

whose root is denoted by c. Clearly, c > 1. It can be shown that g′(x) > 0 for all x ∈ (0, c), and g′(x) < 0 for all x ∈ (c, ∞). Thus, g(c) is the maximum value of g(x).

> gd=function(x){(1+x)/x-log(x)}

> x=seq(1, 10, length=50)

> plot(x, gd(x)) # It seems that c is between 3 and 4

> gd(3)

[1] 0.2347210

> gd(6)

[1] -0.6250928

> bisection=function(a,b,n){
    xa=a
    xb=b
    for(i in 1:n){
      if(gd(xa)*gd((xa+xb)/2)<0) xb=(xa+xb)/2
      else xa=(xa+xb)/2
    }
    list(left=xa, right=xb, midpoint=(xa+xb)/2)
  }

> bisection(3,6,30)

$left

[1] 3.591121

$right

[1] 3.591121

$midpoint

[1] 3.591121
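As a cross-check (our addition), base R's optimize() can maximize g(x) directly over an interval:

> g=function(x){log(x)/(1+x)}
> optimize(g, c(1, 10), maximum=TRUE)$maximum # should be close to c = 3.591121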


Example 3. A Cauchy density function takes the form

\[
f(x) = \frac{1}{\pi\{1 + (x - \theta)^2\}}, \qquad x \in \mathbb{R},
\]

where θ is a parameter.

(1) Generate 50 random numbers from a Cauchy distribution with θ = 1.

data = rcauchy(50, 1) # 50 draws with location theta = 1

[Figure: two plots, "Log-likelihood function" showing l(data, θ) against θ, and "Derivative" showing ld(data, θ) against θ.]

(2) Treat the data you get from step (1) as sample observations from a Cauchy distribution with an unknown θ. Plot the log-likelihood function of θ,

\[
l(\theta) = -n \ln \pi - \sum_{i=1}^{n} \ln\{1 + (x_i - \theta)^2\}, \qquad \theta \in \mathbb{R}.
\]

l=function(x,t){ # log-likelihood; x = data, t = theta (t may be a vector)
  s=0
  n=length(x)
  for(j in 1:n) s=s + log(1+(x[j]-t)^2)
  -n*log(pi)-s
}

theta=seq(-50, 50,length=500)

plot(theta, l(data,theta), type="l",main="Log-likelihood function",

xlab=expression(theta), ylab=expression(l(data, theta)))


(3) The location of the maximum can be seen roughly in the above plot of the log-likelihood function of θ. Use the bisection method to find the maximum likelihood estimator of θ.

To do so, we calculate the derivative of l(θ). Dropping the constant factor −2, which does not change the zeros, we work with

\[
l'(\theta) \propto \sum_{i=1}^{n} \frac{\theta - x_i}{1 + (x_i - \theta)^2}, \qquad \theta \in \mathbb{R},
\]

and draw a plot of this function.

ld=function(x,t){ # (proportional to) the derivative of the log-likelihood
  s=0
  n=length(x)
  for(j in 1:n) s=s + (t-x[j])/(1+(x[j]-t)^2)
  s
}

theta=seq(-10, 10,length=500)

plot(theta, ld(data,theta), type="l",main="Derivative",

xlab=expression(theta), ylab=expression(ld(data, theta)))

The bisection method is applicable to l′(θ), since it is continuous everywhere.

f=function(t){ld(data, t)} # so that bisection() as defined in Example 1, which calls f, applies

bisection(-10,10,30)

$left

[1] 0.9758892

$right

[1] 0.9758892

$midpoint

[1] 0.9758892

Hence, θ̂ = 0.9758892.
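As a cross-check (our addition), maximizing the log-likelihood directly with base R's optimize() should give nearly the same value:

> optimize(function(t){l(data, t)}, c(-10, 10), maximum=TRUE)$maximum # close to 0.9758892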


2.2 Secant method

The secant method begins by finding two points on the curve of f(x), (x₀, f(x₀)) and (x₁, f(x₁)), hopefully near the root r we seek. The straight line that passes through these two points is

\[
\frac{y - f(x_0)}{f(x_1) - f(x_0)} = \frac{x - x_0}{x_1 - x_0}.
\]

If x₂ denotes the point where this line crosses the x-axis, i.e., the point (x₂, 0) is on the line, then

\[
\frac{0 - f(x_0)}{f(x_1) - f(x_0)} = \frac{x_2 - x_0}{x_1 - x_0}.
\]

From this we solve for x₂,

\[
x_2 = x_1 - f(x_1)\,\frac{x_0 - x_1}{f(x_0) - f(x_1)}.
\]

Because f(x) is not exactly linear, x₂ is not equal to r, but it should be closer than either of the two points we began with.

If we repeat this, we have

\[
x_{n+1} = x_n - f(x_n)\,\frac{x_{n-1} - x_n}{f(x_{n-1}) - f(x_n)}, \qquad n = 1, 2, \ldots
\]

Under the assumptions that the sequence {xₙ, n = 1, 2, ...} converges to r, f(x) is differentiable near r, and f′(r) ≠ 0, the difference quotient in the iteration converges:

\[
\frac{x_{n-1} - x_n}{f(x_{n-1}) - f(x_n)} \longrightarrow \frac{1}{f'(r)} \qquad \text{as } n \to \infty.
\]

Taking limits on both sides of the iteration then gives

\[
r = r - \frac{f(r)}{f'(r)},
\]

which gives f(r) = 0.


Example. Find the zeros of f(x) = x^3 − 2x^2 + x − 1 using the secant method.

f=function(x){x^3-2*x^2+x-1}

secant=function(x,y,n){ # Katherine Earles's code
  if(abs(f(x))<abs(f(y))) {xa=x; xb=y} # order the two starting points
  else {xa=y; xb=x}
  xc=0
  for(i in 1:n){
    xc=xb-f(xb)/(f(xa)-f(xb))*(xa-xb) # secant update
    xa=xb
    xb=xc
  }
  list("x(n)"=xa, "x(n+1)"=xb)
}

> secant(0,5,12)

$"x(n)"

[1] 1.754878

$"x(n+1)"

[1] 1.754878

> secant(5,0, 12)

$"x(n)"

[1] 1.754878

$"x(n+1)"

[1] 1.754878

> secant(5,0, 15)

$"x(n)"

[1] 1.754878

$"x(n+1)"

[1] NaN

The above code does break down for large enough values of n: once xa and xb coincide, f(xa) − f(xb) = 0 and the update returns NaN. The following function h is an improvement that fixes the problem. The if statement breaks out of the loop when the values of xa and xb are (numerically) equal.


g=function(x,y){y-(f(y)/(f(x)-f(y)))*(x-y)} # one secant update

h=function(x,y,n){ # Katherine Earles's code
  xa=x
  xb=y
  xc=0
  for(i in (1:n)){
    if (identical(all.equal(xa, xb), TRUE)) break # stop once xa and xb agree
    xc=g(xa,xb)
    xa=xb
    xb=xc
  }
  list("x(n)"=xa,"x(n+1)"=xb)
}

> h(-10,50,500)

$"x(n)"

[1] 1.754878

$"x(n+1)"

[1] 1.754878


2.3 Newton’s method

Newton's method, or the Newton-Raphson method, is a procedure for approximating the zeros of a function f (or, equivalently, the roots of an equation f(x) = 0). It consists of the following three steps:

Step 1. Make a reasonable initial guess as to the location of a solution, denoted by x₀.

Step 2. Calculate

\[
x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}.
\]

Step 3. If x₁ is sufficiently close to a solution, stop; otherwise, continue this procedure by

\[
x_2 = x_1 - \frac{f(x_1)}{f'(x_1)}, \quad x_3 = x_2 - \frac{f(x_2)}{f'(x_2)}, \quad \ldots, \quad x_n = x_{n-1} - \frac{f(x_{n-1})}{f'(x_{n-1})}.
\]

Under the assumptions that the sequence x₀, x₁, ..., xₙ, ... converges to r, and that f(x) is differentiable near r with f′(r) ≠ 0, by taking the limit on both sides of

\[
x_n = x_{n-1} - \frac{f(x_{n-1})}{f'(x_{n-1})},
\]

we obtain

\[
r = r - \frac{f(r)}{f'(r)},
\]

which results in f(r) = 0.

This method requires that the first approximation be sufficiently close to the root r.

A comparison between the secant method and Newton's method. The secant method is obtained from Newton's method by approximating the derivative of f(x) at the two points xₙ and xₙ₋₁ by

\[
f'(x_n) \approx \frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}}.
\]

Geometrically, Newton's method uses the tangent line, while the secant method approximates the tangent line by a secant line.


Example 1. Find a zero of f(x) = x^2006 + 2006x + 1.

> newton=function(x0, n){
    f=function(x){x^(2006)+2006*x+1}
    fd=function(x){2006*x^(2005)+2006} # f'(x)
    x=x0
    for (i in 1:n){x=x-f(x)/fd(x)}     # Newton update
    list(x)
  }

> newton(-.5, 20)

[1] -0.0004985045

> newton(3.5, 10) # Sensitive to the initial guess x0: 3.5^2006 overflows to Inf, so the update is NaN

[1] NaN

> nr=function(x0, numstp, eps){
    f=function(x){x^(2006)+2006*x+1}
    fd=function(x){2006*x^(2005)+2006}
    small = 1.0*10^(-8) # guards against division by zero in the stopping rule
    istop = 0
    numfin = 0          # iterations used so far; numstp caps the loop
    while(istop == 0 && numfin < numstp){
      x1 = x0 - f(x0)/fd(x0)             # Newton update
      check = abs(x0-x1)/abs(x0 + small) # relative change between iterates
      if(check < eps){istop=1}
      x0 = x1
      numfin = numfin + 1
    }
    list(x1=x1, check=check)
  }

> nr(0, 20, 0.3)

$x1

[1] -0.0004985045

$check

[1] 2.174953e-16


Example 2. The Weibull distribution function is of the form

\[
F(x) =
\begin{cases}
1 - \exp\{-(\beta x)^\lambda\}, & \text{if } x \geq 0,\\
0, & \text{elsewhere,}
\end{cases}
\]

where λ and β are positive parameters.

(1) Generate 50 random numbers from a Weibull distribution with β = 1 and λ = 1.8.

> weib = rweibull(50, shape=1.8, scale = 1)

(2) Add three more numbers to the above group. Treat these 53 observations as your data from a Weibull distribution with an unknown λ, but keep β = 1 fixed. Plot the log-likelihood function of λ.

> mydata = c(weib, 0.9, 1, 1.1) # add 3 numbers 0.9, 1, 1.1

The likelihood and log-likelihood functions of λ are

\[
L(\lambda) = \lambda^n \left(\prod_{k=1}^{n} x_k\right)^{\lambda - 1} \exp\left(-\sum_{k=1}^{n} x_k^\lambda\right),
\]

and

\[
l(\lambda) = n \ln \lambda + (\lambda - 1)\sum_{k=1}^{n} \ln x_k - \sum_{k=1}^{n} x_k^\lambda,
\]

respectively.

> loglike=function(t){
    x=mydata
    s=0
    for(i in 1:length(x)) s=s-x[i]^t+(t-1)*log(x[i])
    53*log(t)+s # n = 53 observations
  }

> l=seq(0.5, 3, len=200)

> plot(l, loglike(l), type='l', xlab=expression(lambda),
       ylab=expression(l(lambda)),
       main='loglikelihood function for Weibull Data')

It can be seen from the plot of the log-likelihood function that l(λ) is concave.


(3) Use Newton’s method to find the maximum likelihood estimator of λ.

To do so, we need to solve the equation l′(λ) = 0 for stationary points. The first and second derivatives of l(λ) are

\[
l'(\lambda) = \frac{n}{\lambda} + \sum_{k=1}^{n} \ln x_k - \sum_{k=1}^{n} x_k^\lambda \ln x_k,
\]

and

\[
l''(\lambda) = -\frac{n}{\lambda^2} - \sum_{k=1}^{n} x_k^\lambda \ln^2 x_k.
\]

Since l′′(λ) < 0 for all λ > 0, l(λ) is strictly concave, so the stationary point is the unique maximum point of l(λ).

ld=function(t){ # define l’(lambda)

x=mydata

s=0

for(i in 1:length(x)) s=s-(log(x[i]))*x[i]^t+log(x[i])

ld=s+53/t

ld

}

ldd=function(t){ # define l’’(lambda)

x=mydata

s=0

for(i in 1:length(x)) s=s-(log(x[i]))^2*x[i]^t

ldd=s-53/t^2

ldd

}

newton=function(t,n){ # Newton’s iteration

for(i in 1:n) {t=t-ld(t)/ldd(t)}

t

}

> newton(0.1,20) # x_0=0.1

[1] 1.704811
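A quick cross-check (our addition): bracketing the zero of l′(λ) with base R's uniroot() should agree, although the exact bracket may need adjusting for your simulated data:

> uniroot(ld, c(0.5, 3))$root # should be close to 1.704811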


2.4 Fixed-point iteration: x = g(x) method

Suppose that we can bring an equation f(x) = 0 into the form x = g(x), which usually can be done in several ways. Whenever r = g(r), r is said to be a fixed point of the function g(x).

We can solve this equation, under certain conditions, using iteration.

Start with an approximation x₀ of the root.

Calculate

\[
x_1 = g(x_0), \quad x_2 = g(x_1), \quad \ldots, \quad x_{n+1} = g(x_n), \qquad n = 0, 1, 2, \ldots
\]

Example. Consider a simple equation

\[
x^2 - 3x + 2 = 0.
\]

It can be rewritten as x = g(x) in many ways. For instance,

\[
x = \frac{x^2 + 2}{3}, \quad
x = \sqrt{3x - 2}, \quad
x = -\sqrt{3x - 2}, \quad
x = x^2 - 2x + 2, \quad
x = \frac{1}{2}\sqrt{3x - 2} + \frac{x}{2}.
\]

A for loop for this iteration is easily set down.

> fixed=function(x, n){
    for(i in 1:n){ x = g(x) } # iterate x = g(x) n times
    x
  }

Let us take a look at x = (x² + 2)/3.

> g=function(x){(x^2+2)/3}

> fixed(0.1, 20)

[1] 0.9999037 # It’s close to 1, one of the roots.

> fixed(3, 20)

[1] Inf # A problem of the initial point?

> fixed(-4, 20)

[1] Inf
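A quick check (our addition) anticipates the theorem below: for g(x) = (x² + 2)/3 we have g′(x) = 2x/3, so |g′(1)| = 2/3 < 1 while |g′(2)| = 4/3 > 1. The fixed point 1 attracts the iteration, the fixed point 2 repels it, and starting points such as 3 or −4 are pushed off toward infinity.

> gp=function(x){2*x/3} # g'(x) for g(x) = (x^2+2)/3
> c(gp(1), gp(2))
[1] 0.6666667 1.3333333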


A solution is guaranteed under the assumptions of the following theorem.

Theorem. If |g′(x)| ≤ k < 1 in an interval (a, b), and the sequence {x₀, x₁, ..., xₙ, ...} belongs to (a, b), then the sequence has a limit r, and r is the only root of x = g(x) in the interval (a, b).

Proof. Appealing to the mean value theorem (Lagrange's theorem), we can write

\[
\begin{aligned}
x_2 - x_1 &= g(x_1) - g(x_0) = (x_1 - x_0)g'(c_1), && c_1 \text{ between } x_0 \text{ and } x_1,\\
x_3 - x_2 &= g(x_2) - g(x_1) = (x_2 - x_1)g'(c_2), && c_2 \text{ between } x_1 \text{ and } x_2,\\
&\hspace{1em}\vdots\\
x_{n+1} - x_n &= g(x_n) - g(x_{n-1}) = (x_n - x_{n-1})g'(c_n), && c_n \text{ between } x_{n-1} \text{ and } x_n.
\end{aligned}
\]

Since |g′(x)| ≤ k < 1, we obtain

\[
\begin{aligned}
|x_2 - x_1| &\leq k|x_1 - x_0|,\\
|x_3 - x_2| &\leq k|x_2 - x_1| \leq k^2|x_1 - x_0|,\\
&\hspace{1em}\vdots\\
|x_{n+1} - x_n| &\leq k|x_n - x_{n-1}| \leq \cdots \leq k^n|x_1 - x_0|,
\end{aligned}
\]

and for m > n,

\[
|x_m - x_n| \leq |x_m - x_{m-1}| + |x_{m-1} - x_{m-2}| + \cdots + |x_{n+1} - x_n|
\leq (k^{m-1} + \cdots + k^n)|x_1 - x_0|
= \frac{k^n - k^m}{1 - k}\,|x_1 - x_0|.
\]

Since 0 ≤ k < 1, this bound tends to 0 as n → ∞. Thus, by Cauchy's criterion, the sequence {xₙ, n = 0, 1, 2, ...} converges. Say the limit is r. By taking the limit on both sides of the equation

\[
x_{n+1} = g(x_n),
\]

we obtain limₙ→∞ xₙ₊₁ = limₙ→∞ g(xₙ), or

\[
r = g(r),
\]

which means that r is a root of the equation x = g(x).

If r₁ is a second root of x = g(x) in the interval (a, b) with r₁ ≠ r, then

\[
r_1 - r = g(r_1) - g(r) = (r_1 - r)g'(c), \qquad \text{with } c \in (a, b).
\]

Dividing by r₁ − r gives g′(c) = 1, which contradicts |g′(c)| ≤ k < 1. ∎


Notice that Newton's method is a special case of the fixed-point iteration, with

\[
g(x) = x - \frac{f(x)}{f'(x)},
\]

and

\[
g'(x) = 1 - \frac{\{f'(x)\}^2 - f(x)f''(x)}{\{f'(x)\}^2} = \frac{f(x)f''(x)}{\{f'(x)\}^2}.
\]

Applying the above theorem to this particular case, we obtain the following.

Corollary. Assume that the function f(x) is continuous in the interval [a, b] and is twice differentiable in (a, b), with

\[
\left|\frac{f(x)f''(x)}{\{f'(x)\}^2}\right| \leq k < 1, \qquad x \in (a, b).
\]

If the sequence {x₀, x₁, x₂, ...} is generated by Newton's method with

\[
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}, \qquad n = 0, 1, 2, \ldots,
\]

and xₙ ∈ (a, b), n = 0, 1, 2, ..., then the sequence has a limit r, and r is the only root of f(x) = 0 in the interval [a, b].

This corollary indicates that the initial point x₀ is very important for Newton's method. A good try should start with an x₀ that satisfies

\[
\left|\frac{f(x_0)f''(x_0)}{\{f'(x_0)\}^2}\right| \leq k < 1.
\]
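For instance (our addition), for the function f(x) = x^2006 + 2006x + 1 of Example 1 in Section 2.3, this ratio is essentially 0 at x₀ = −0.5 but cannot even be evaluated at x₀ = 3.5, which matches the behavior of newton() observed there:

ratio=function(x){ # |f(x) f''(x)| / f'(x)^2 for f(x) = x^2006 + 2006x + 1
  f=function(x){x^(2006)+2006*x+1}
  fd=function(x){2006*x^(2005)+2006}
  fdd=function(x){2006*2005*x^(2004)}
  abs(f(x)*fdd(x))/fd(x)^2
}
ratio(-0.5) # essentially 0, well below 1
ratio(3.5)  # NaN: the powers overflow, so Newton cannot be trusted here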

2.5 Convergence rate

Consider a fixed-point iteration for solving the equation x = g(x) with the procedure

\[
x_{n+1} = g(x_n), \qquad n = 0, 1, 2, \ldots
\]

Let r be the root of the equation. Define the nth step error by

\[
e_n = r - x_n, \qquad n = 1, 2, \ldots
\]

Since r = g(r), we obtain, by the mean value theorem,

\[
e_{n+1} = r - x_{n+1} = g(r) - g(x_n) = g'(c_n)(r - x_n) = g'(c_n)e_n,
\]

where cₙ lies between xₙ and r. This means the error at the (n + 1)th step is linearly related to the error at the nth step.

For Newton's method, it can be shown that the error at the (n + 1)th step is quadratically related to the error at the nth step, i.e., eₙ₊₁ is of the order of eₙ².
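To see the two rates side by side, here is a small numerical experiment (our sketch, reusing f(x) = x² − 3x + 2 with root r = 1 from Section 2.4):

f=function(x){x^2-3*x+2}
fd=function(x){2*x-3}    # f'(x)
g=function(x){(x^2+2)/3} # a fixed-point form of f(x) = 0
x_fp=0.5; x_nt=0.5       # common starting point
for(i in 1:5){
  x_fp=g(x_fp)               # fixed-point step: error shrinks by a roughly constant factor
  x_nt=x_nt-f(x_nt)/fd(x_nt) # Newton step: error is roughly squared
  cat(i, "fixed-point error:", abs(1-x_fp), "Newton error:", abs(1-x_nt), "\n")
}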


2.6 Newton’s method for a system of nonlinear equations

Newton's method can be applied to solve a system of nonlinear equations. This is particularly useful when we try to find maximum likelihood estimators of several parameters.

Let F(x) be a vector-valued function of a vector argument x, where both vectors contain m components. To apply Newton's method to the problem of approximating a solution of

F(x) = 0,

we would like to start from an initial point x₀ and then write

x_{n+1} = x_n − F(x_n)/F′(x_n), n = 0, 1, 2, ...

Two questions arise immediately in the above procedure. First, what is meant by F′(xₙ)? And second, what is meant by the division F(xₙ)/F′(xₙ)?

Here, F′(x) is a matrix defined by

\[
F'(x) =
\begin{pmatrix}
\frac{\partial f_1(x)}{\partial x_1} & \frac{\partial f_1(x)}{\partial x_2} & \cdots & \frac{\partial f_1(x)}{\partial x_m}\\
\frac{\partial f_2(x)}{\partial x_1} & \frac{\partial f_2(x)}{\partial x_2} & \cdots & \frac{\partial f_2(x)}{\partial x_m}\\
\vdots & \vdots & & \vdots\\
\frac{\partial f_m(x)}{\partial x_1} & \frac{\partial f_m(x)}{\partial x_2} & \cdots & \frac{\partial f_m(x)}{\partial x_m}
\end{pmatrix}.
\]

This matrix is known as the Jacobian matrix for the system and is typically denoted by J(x).

For "division" by a matrix, we use multiplication by an inverse. Thus, Newton's method takes the form

\[
x_{n+1} = x_n - (J(x_n))^{-1}F(x_n), \qquad n = 0, 1, 2, \ldots
\]

When implementing this scheme, rather than actually computing the inverse of the Jacobian matrix, we define

\[
v_n = -(J(x_n))^{-1}F(x_n),
\]

and then solve the linear system of equations

\[
J(x_n)v_n = -F(x_n)
\]

for vₙ. Once vₙ is known, the next iterate is computed according to the rule

\[
x_{n+1} = x_n + v_n, \qquad n = 0, 1, 2, \ldots
\]


Example 1. Find the solution of the system of two nonlinear equations

\[
\begin{cases}
x_1^3 - 2x_2 + 1 = 0,\\
x_1 + 2x_2^3 - 3 = 0.
\end{cases}
\]

First of all, we set up

\[
x =
\begin{pmatrix}
x_1\\
x_2
\end{pmatrix},
\qquad \text{and} \qquad
F(x) =
\begin{pmatrix}
x_1^3 - 2x_2 + 1\\
x_1 + 2x_2^3 - 3
\end{pmatrix}.
\]

Then, find the Jacobian matrix for the system,

\[
J(x) =
\begin{pmatrix}
3x_1^2 & -2\\
1 & 6x_2^2
\end{pmatrix}.
\]

The following code was written by Katherine Earles.

F=function(x){ # define the (column) vector of equations

F=matrix(0,nrow=2) # nrow depends on the length of F

F[1]= x[1]^3-2*x[2]+1 # The first component of F

F[2]= x[1]+2*x[2]^3-3 # The second component of F

F # output F, a column vector of values

}

J=function(x){ # define the Jacobian of F

j=matrix(0,ncol=2,nrow=2) # ncol & nrow depend on the length of F

j[1,1]= 3*x[1]^2

j[1,2]= -2

j[2,1]= 1

j[2,2]= 6*x[2]^2

j # output j, a matrix of values

}

NNL=function(initial,n){ # Newton's method for a system of non-linear equations
  x=initial
  v=matrix(0,ncol=length(x))
  for (i in 1:n){
    v=solve(J(x),-F(x)) # solve J(x) v = -F(x) for the step v
    x=x+v
  }
  cat(" x1=",x[1],"\n","x2=",x[2],"\n")
}

Sometimes we may need to check whether the Jacobian matrix is invertible. For this purpose, the above code is improved as follows.


NNL=function(initial,n){ # Newton's method for a system of non-linear equations
  x=initial
  v=matrix(0,ncol=length(x))
  for (i in 1:n){
    d=det(J(x)) # check that J(x) is invertible
    if (identical(all.equal(d,0),TRUE))
      {cat("Jacobian has no inverse. Try a different initial point.","\n")
      break}
    else
      v=solve(J(x),-F(x))
    x=x+v
  }
  cat(" x1=",x[1],"\n","x2=",x[2],"\n")
}

> NNL(c(0.1, 0.2), 1)

x1= 2.901794

x2= 0.5425269

> NNL(c(0.1, 0.2), 2)

x1= 1.969765

x2= 0.9450524

> NNL(c(0.1, 0.2), 3)

x1= 1.387231

x2= 0.9309951

> NNL(c(0.1, 0.2), 4)

x1= 1.093614

x2= 0.9872401

> NNL(c(0.1, 0.2), 5)

x1= 1.007192

x2= 0.9989359

> NNL(c(0.1, 0.2), 6)

x1= 1.000047

x2= 0.9999933

> NNL(c(0.1, 0.2), 7)

x1= 1

x2= 1
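The limit point is easy to verify (our addition): both components of F vanish at (1, 1).

> F(c(1, 1)) # 1 - 2 + 1 = 0 and 1 + 2 - 3 = 0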


Example 2 (Logistic regression model). Let Y denote a binary response variable. The regression model

\[
E(Y) = \pi(x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}
\]

is called the logistic regression model, where β₀ and β₁ are parameters.

Suppose that Y₁, Y₂, ..., Yₙ are independent Bernoulli random variables with

\[
E(Y_i) = \pi_i = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}, \qquad i = 1, \ldots, n,
\]

where the x observations are assumed to be known constants.

The likelihood function of the parameters β₀ and β₁ is

\[
\begin{aligned}
L(\beta_0, \beta_1) &= \prod_{i=1}^{n} \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}\\
&= \prod_{i=1}^{n} \left(\frac{\pi_i}{1 - \pi_i}\right)^{y_i} \cdot \prod_{i=1}^{n} (1 - \pi_i)\\
&= \exp\left\{\sum_{i=1}^{n} (\beta_0 + \beta_1 x_i)y_i\right\} \cdot \prod_{i=1}^{n} \{1 + \exp(\beta_0 + \beta_1 x_i)\}^{-1}.
\end{aligned}
\]

From this we obtain the log-likelihood function

\[
\ell(\beta_0, \beta_1) = \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i)y_i - \sum_{i=1}^{n} \ln\{1 + \exp(\beta_0 + \beta_1 x_i)\}.
\]

However, no closed-form solution exists for the values of β₀ and β₁ that maximize the log-likelihood function ℓ(β₀, β₁). So we need to maximize ℓ(β₀, β₁) numerically.

A data set from Kutner et al. (2005), Applied Linear Statistical Models, page 566 (x = months of experience, y = task success):

x=c(14,29,6,25,18,4,18,12,22,6,30,11,30,5,20,13,9,32,24,13,19,4,28,22,8) # months

y=c(0,0,0,1,1,0,0,0,1,0,1,0,1,0,1,0,0,1,0,1,0,0,1,1,1) # success

We start by defining the partial derivatives ∂ℓ/∂β₀ and ∂ℓ/∂β₁ of ℓ(β₀, β₁), which are our target functions.

F1=function(b){ # dl/dbeta0, evaluated at b = c(beta0, beta1)

F1=0

for(i in 1:length(x)) F1=F1+y[i]-exp(b[1]+b[2]*x[i])/(1+exp(b[1]+b[2]*x[i]))

F1

}


F2=function(b){ # dl/dbeta1

F2=0

for(i in 1:length(x)) F2=F2+x[i]*y[i]-x[i]*exp(b[1]+b[2]*x[i])/(1+exp(b[1]+b[2]*x[i]))

F2

}

F=function(b){

F=matrix(0,nrow=2)

F[1]=F1(b)

F[2]=F2(b)

F

}

Alternatively, the vector function F(β₀, β₁) can be set up as follows:

F=function(b){

F=matrix(0,nrow=2)

s1=0

s2=0

for(i in 1:length(x)){

s1 = s1 +y[i]-((exp(b[1]+b[2]*x[i]))*(1+exp(b[1]+b[2]*x[i]))^(-1))

s2 = s2 +x[i]*y[i]-(x[i]*(exp(b[1]+b[2]*x[i]))*(1+exp(b[1]+b[2]*x[i]))^(-1))}

F[1]=s1

F[2]=s2

F}

The next step is to set down the Jacobian matrix, a 2 × 2 matrix.

J=function(b){

j=matrix(0,ncol=2,nrow=2) # The format of J is 2 by 2

s11=0

s12=0

s22=0

for(i in 1:length(x)){

s11 = s11-exp(b[1]+b[2]*x[i])*(1+exp(b[1]+b[2]*x[i]))^(-2)

s12 = s12 -x[i]*exp(b[1]+b[2]*x[i])*(1+exp(b[1]+b[2]*x[i]))^(-2)

s22 = s22 -(x[i]^(2))*exp(b[1]+b[2]*x[i])*(1+exp(b[1]+b[2]*x[i]))^(-2)

}

j[1,1]=s11

j[1,2]=s12

j[2,1]=s12

j[2,2]=s22

j

}


The R code for Newton's method is

NNL = function(initial,n){
  b=initial
  v=matrix(0,ncol=length(b))
  for (i in 1:n){
    d=det(J(b)) # check that J(b0,b1) is invertible
    if(identical(all.equal(d,0),TRUE))
      {cat('Jacobian has no inverse. Try a different initial point.','\n')
      break}
    else
      v=solve(J(b),-F(b))
    b=b+v
  }
  cat(' b0=',b[1],'\n','b1=',b[2],'\n')
}

Finally, let us try several particular cases.

> NNL(c(1,1),10) # A good initial point is important!

Error in qr(x, tol = tol) : NA/NaN/Inf in foreign function call (arg 1)

> NNL(c(1,0),5) # A small n

b0= -3.059696

b1= 0.1614859

> NNL(c(1,0),200) # A large n

b0= -3.059696

b1= 0.1614859

> F(c(-3.059696, 0.1614859)) # check the value of F

[1,] 2.066355e-06

[2,] 4.156266e-05

Thus, the maximum likelihood estimates of β₀ and β₁ are −3.059696 and 0.1614859, respectively.
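As a final cross-check (our addition, not in the original notes), base R's glm() fits the same model by iteratively reweighted least squares, and its coefficient estimates should agree:

> glm(y ~ x, family=binomial)$coefficients # close to (-3.059696, 0.1614859)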