This research was supported by the National Institutes of Health, Institute of General Medical Sciences, Grant No. GM-12868-04.
ON THE KIEFER-WOLFOWITZ PROCESS AND SOME OF ITS MODIFICATIONS

Mark Allyn Johnson
University of North Carolina
Institute of Statistics Mimeo Series No. 644
September 1969
ABSTRACT
JOHNSON, MARK ALLYN. On the Kiefer-Wolfowitz Process and Some of Its
Modifications. (Under the direction of JAMES E. GRIZZLE).
The Kiefer-Wolfowitz (K-W) process is a stochastic approximation
procedure used to estimate the value of a controlled variable that
maximizes the expected response (M(x)) of a corresponding dependent
random variable.
An extension of a theorem on the asymptotic distribution of a certain class of stochastic approximation procedures, which includes the K-W process, is given. This extension admits an increasing number of observations to be taken on each step, constrains the observed random variable to lie between predetermined bounds (not necessarily finite), and extends the class of admissible norming sequences. Each of these extensions is examined in the context of its usefulness in the application of the K-W process.
Approximations of the small sample mean and variance of the K-W
process are given and studied through an examination of various special
cases of the response function. Traditionally the estimation of the
variance of the process involves the estimation of M''(θ). It is shown that an efficient estimator of M''(θ), based on the observations generated by the process, is consistent.
A modification of the K-W process is developed for which the
dependent random variables influence the direction, but not the magnitude, of the correction on each step. In addition to obtaining the
asymptotic distribution of this process, an approximation for the small
sample variance is given. An estimator of this variance is given which
does not entail the estimation of unknown parameters. The estimator is
shown to be consistent.
Finally, examples illustrating the use of these procedures are
given.
BIOGRAPHY
The author was born September 30, 1943, in Weld County, Colorado.
He was reared on an irrigated farm near Eaton, Colorado. He graduated
from Eaton High School in 1956 and went directly to college.
He received the Bachelor of Science degree with a major in zoology
at Colorado State University in 1960. On receiving a grant from the
National Institute of Health, he studied experimental statistics in the
graduate school of the same university for two years. In September,
1966, he entered the Graduate School of North Carolina State University
at Raleigh to continue studying experimental statistics. He has accepted a position as a statistical consultant for the Upjohn Company
in Kalamazoo, Michigan.
The author married Miss Martha McLendon Willey from Nashville,
North Carolina, in 1968. As yet they have no children.
ACKNOWLEDGMENTS
The author wishes to thank his advisor, Dr. James Grizzle, who
suggested the topic of this dissertation, for his freely given counsel
throughout his doctoral program. He also wishes to thank Professors
N. L. Johnson, Pranab K. Sen, Harry Smith, Jr., and H. R. van der Vaart
for serving on his advisory committee and offering assistance during
this study.
The author wishes to thank Mrs. Gay Goss for typing the manuscript
and to thank Mrs. Mary Donelan for assisting in the typing of the rough
draft of this manuscript.
Finally, the author wishes to thank his parents, Mr. and Mrs.
Walter F. Johnson for their inspiration throughout his academic career,
and to thank his dearest wife, Martha, for her patience and her
assistance in the proofreading of each manuscript and in the typing of
the rough draft.
TABLE OF CONTENTS

LIST OF FIGURES

1. INTRODUCTION AND REVIEW OF LITERATURE
   1.1 Introduction
   1.2 Summary of Literature
   1.3 Summary of Results

2. STOCHASTIC APPROXIMATION PROCEDURES OF TYPE A_o^T
   2.1 Constrained Stochastic Approximation Procedures
   2.2 Constrained Random Variables
   2.3 Asymptotic Distribution of a Type A_o^T Process

3. THE CONSTRAINED KIEFER-WOLFOWITZ PROCESS
   3.1 Definition of the Constrained Kiefer-Wolfowitz Process
   3.2 Asymptotic Properties of the CKW Process
   3.3 Effect of Constraining the K-W Process
   3.4 Expected Small Sample Behavior of the K-W Process
   3.5 Approximations of the Small Sample Variance
   3.6 Estimation of E(X_n - θ)²

4. THE SIGNED KIEFER-WOLFOWITZ PROCESS
   4.1 Definition of the Signed Kiefer-Wolfowitz Process
   4.2 Asymptotic Theory of the SKW Process
   4.3 Sample Properties of the SKW Process
   4.4 Estimation of E(X_n - θ)²

5. TWO EXAMPLES

6. SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH
   6.1 Summary
   6.2 Suggestions for Further Research

LIST OF REFERENCES
LIST OF FIGURES

1. Graphical Definition of μ(c_n)
2. Optimal Determination of c_n (= C)
3. Graph of the Sequence {b_n}
4. Graph of Reproduction versus Temperature
CHAPTER 1. INTRODUCTION AND REVIEW OF LITERATURE
1.1 Introduction
The problem of finding the maximum of a response curve is often
encountered in applied statistics. Two methods for locating this maxi-
mum have been developed extensively. The method of steepest ascent,
which relies on multiple regression principles, is the most widely
accepted. The second method is a sequential procedure. Heretofore,
this sequential procedure has been developed mainly along asymptotic
lines. We shall develop two modifications of this latter method, with special emphasis given to its practical application.
With this in mind, let X be a variable that can be controlled by
the experimenter, and let Y(X) be an associated random variable such
that E[Y(X)] = M(X). In other words, let M(x) be a function of x,
and assume Y(x) is an observable random variable with a distribution
function F(·|x) such that

M(x) = ∫_{-∞}^{∞} y dF(y|x).
It will be assumed throughout this paper that M(x) has a unique maximum at x = θ and that M(x) is increasing or decreasing as x is less than or greater than θ.
The foundations of the method of estimation, which Kiefer and
Wolfowitz [8] adapted for the particular problem at hand, are contained
in a paper by Robbins and Monro [9]. The spirit of this estimation
procedure is to generate a sequence of estimates {X_n} that converges to θ. The procedure is such that the n'th estimate (X_n) equals X_{n-1} plus a correction term that is a function of an observable random variable Y(X_{n-1}) depending on X_i for i = 1, ..., n-1.
These procedures have certain advantages and disadvantages when
they are compared to the usual methods of response surface methodology.
They are simple to execute and easy to evaluate. They are generally
applicable in practice, and they admit a complete mathematical formu-
lation. On the negative side, they have the disadvantage of most
sequential procedures in that they require evaluation after each step.
Moreover, their variance often depends upon parameters for which a
good estimator cannot be calculated from the observations.
The stochastic approximation procedure proposed by Kiefer and
Wolfowitz [8] was later called the Kiefer-Wolfowitz process. In this
paper it will be referred to as the K-W process. Its definition
follows.
Definition: Let {a_n} and {c_n} be sequences of positive numbers, and suppose Y(x) is a random variable with distribution function F(y|x) where M(x) = ∫_{-∞}^{∞} y dF(y|x). Let X_1 be arbitrary, and let the random variable sequence {X_n} be defined by

X_{n+1} = X_n - a_n c_n^{-1} (Y(X_n - c_n) - Y(X_n + c_n)),        (1.1)

where Y(X_n - c_n) and Y(X_n + c_n) are random variables which, conditional on Y(X_1 - c_1), ..., Y(X_{n-1} + c_{n-1}), X_1, ..., X_{n-1}, X_n = x_n, are distributed independently as F(y|x_n - c_n) and F(y|x_n + c_n).
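As a concrete sketch (not from the dissertation; the response function, noise level, and the sequence choices a_n = n^{-1} and c_n = n^{-1/4} are all illustrative), the recursion (1.1) can be simulated directly:

```python
import random

def kiefer_wolfowitz(y, x1, n_steps,
                     a=lambda n: 1.0 / n,       # a_n: step weights, sum diverges
                     c=lambda n: n ** -0.25):   # c_n: spacings, tend to zero
    """Run the K-W recursion (1.1):
    X_{n+1} = X_n - a_n c_n^{-1} (Y(X_n - c_n) - Y(X_n + c_n))."""
    x = x1
    for n in range(1, n_steps + 1):
        an, cn = a(n), c(n)
        x -= (an / cn) * (y(x - cn) - y(x + cn))
    return x

# Hypothetical response M(x) = -(x - 2)^2 observed with Gaussian noise.
theta = 2.0
y = lambda x: -(x - theta) ** 2 + random.gauss(0.0, 0.1)

random.seed(1)
print(kiefer_wolfowitz(y, x1=0.0, n_steps=2000))   # wanders toward theta = 2
```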
Before summarizing the literature on this process, some motivation for the assumptions on the sequences {a_n} and {c_n} will be given. It follows from the definition that

E(X_{n+1} | X_n = x) = x - a_n c_n^{-1} (M(x - c_n) - M(x + c_n)).

If x - c_n ≥ θ, then M(x - c_n) - M(x + c_n) is positive, and if x + c_n ≤ θ, then M(x - c_n) - M(x + c_n) is negative. Obviously, under suitable regularity assumptions on M(x), there exists a point μ(c_n) such that M(x - c_n) - M(x + c_n) is greater than, equal to, or less than zero as x is greater than, equal to, or less than μ(c_n). See Figure 1 for an illustration.
[Figure: the points μ(c_n) - c_n, θ, μ(c_n), and μ(c_n) + c_n marked on the x-axis.]

Figure 1. Graphical Definition of μ(c_n)
It is equally clear that if μ(c_n) exists, then |μ(c_n) - θ| ≤ c_n, and that if M(x) is symmetric about θ, then μ(c_n) = θ. If c_n → c, intuition tells us that X_n should tend to μ(c) rather than θ. Thus, if M(x) is not required to be symmetric about θ, it is usually assumed that c_n → 0.
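To see this bias numerically, take a hypothetical asymmetric concave response, M(x) = -x² for x ≤ 0 and -2x² for x > 0, with its maximum at θ = 0; μ(c) can then be located by bisection on M(x - c) - M(x + c) (an illustrative sketch, not from the dissertation):

```python
def M(x):
    """Asymmetric concave response: falls twice as fast to the right of theta = 0."""
    return -x * x if x <= 0 else -2.0 * x * x

def mu(c, lo=-5.0, hi=5.0):
    """Bisection root of g(x) = M(x - c) - M(x + c), which is negative
    to the left of mu(c) and positive to the right of it."""
    g = lambda x: M(x - c) - M(x + c)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

for c in (1.0, 0.1, 0.01):
    print(c, mu(c))   # |mu(c) - theta| <= c, and mu(c) -> theta = 0 as c -> 0
```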
Moreover, if c_n^{-1}|M(x - c_n) - M(x + c_n)| is bounded, by 2σ say, then

E(X_2 - θ | X_1 = x_1) ≥ (x_1 - θ) - 2σ a_1.

By induction,

E(X_{n+1} - θ | X_1 = x_1) ≥ (x_1 - θ) - 2σ Σ_{m=1}^{n} a_m.

If Σ a_m < ∞, it is possible to choose (x_1 - θ) > 2σ Σ_{m=1}^{∞} a_m, implying an asymptotic bias; thus it is necessary to assume that Σ a_n = ∞.
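The need for Σ a_n = ∞ can be seen already in the noiseless quadratic case M(x) = -(x - θ)², where M(x - c_n) - M(x + c_n) = 4c_n(x - θ) exactly, so (1.1) becomes X_{n+1} - θ = (1 - 4a_n)(X_n - θ) for any {c_n}. A small sketch with illustrative sequences:

```python
def noiseless_kw_gap(a, x1=1.0, theta=0.0, n_steps=10_000):
    """X_n - theta for noiseless K-W on M(x) = -(x - theta)^2, where the
    recursion reduces to gap_{n+1} = (1 - 4 a_n) gap_n."""
    gap = x1 - theta
    for n in range(1, n_steps + 1):
        gap *= 1.0 - 4.0 * a(n)
    return gap

biased = noiseless_kw_gap(lambda n: 0.1 * 2.0 ** -n)   # sum a_n < infinity
unbiased = noiseless_kw_gap(lambda n: 0.1 / n)         # sum a_n = infinity
print(biased, unbiased)   # the first stays well away from 0, the second shrinks
```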
The least common assumption is that Σ a_n² c_n^{-2} < ∞. This assumption is somewhat less intuitive. However, it is easy to see why a_n² c_n^{-2} must tend to zero. In fact, it follows from (1.1) and the inequalities above that

Var(X_{n+1} | X_n = x) = E(-a_n c_n^{-1}[(Y(x - c_n) - M(x - c_n)) - (Y(x + c_n) - M(x + c_n))])²,

which is of order a_n² c_n^{-2}, so that the variance of the corrections cannot die out unless a_n² c_n^{-2} tends to zero.
These three assumptions have almost always been made in theorems implying X_n → θ in any usual mode of convergence. The exception is that c_n does not have to converge to zero if M(x) is symmetric about θ. The validity of these assumptions will always be assumed in the following review.
1.2 Summary of Literature
In their introduction of the K-W process, Kiefer and Wolfowitz [8] proved that X_n → θ in probability if M(x) is bounded by two quadratic functions, E(Y(x))² < ∞, and Σ a_n c_n^{-1} < ∞. In later papers (J. R. Blum [1], D. L. Burkholder [2], and A. Dvoretzky [4]) the assumption that Σ a_n c_n^{-1} < ∞ was dropped, and the conclusion was strengthened to that of convergence with probability one. Burkholder's theorem relaxes the assumption of quadratic bounds on M(x) with an assumption which implies for practical purposes that |M'(x)| > 0 if x ≠ θ. Burkholder also noted that this assumption admits response functions taking values on bounded value sets (e.g., exp(-x²)).
The first significant results on the asymptotic distribution of X_n were derived by Burkholder [2], who used methods suggested by K. L. Chung [3]. Burkholder established that the moments of n^{β}(X_n - θ) converged to those of a central normal random variable if the sequences {a_n} and {c_n} are of the form a_n = O(n^{-1}) and c_n = O(n^{-w}), and β = 1/2 - w. Burkholder also assumed that w > (4 + 2η)^{-1}, where |μ(ε) - θ| = O(ε^{1+η}). He proved that, if the third derivative of M(x) is continuous, then η ≥ 1. Using a method developed by Hodges and Lehmann [7], he was able to drop the assumption of the quadratic bounds, but he had to weaken the conclusion to that of convergence in distribution.
Sacks [10] used a different approach to show that n^{1/3}(ln n)^{-1}(X_n - θ) was asymptotically normal under similar restrictions on the sequence {a_n} and the assumption of quadratic bounds on M(x). He did this by considering a slightly different class of sequences {c_n}. He also proved that, in general, n^{1/3}(X_n - θ) is not asymptotically normal under the assumptions on the sequences {a_n} and {c_n} discussed in the introduction.
More recently, Fabian [6] proved a very general theorem about which he stated, "... [the theorem] implies in a simple way all known results on asymptotic normality in various cases of stochastic approximation." He used his result to obtain the asymptotic normality of his generalization of the K-W process (Fabian [5]), which can be constructed to obtain an asymptotic variance of order n^{-(1-ε)} for any ε > 0.
Fabian achieves this smaller variance by taking observations at
more than two points on each step. His results have interesting possibilities, especially if a very precise estimate of θ is needed.
However, the K-W process seems to have the better small sample properties unless one has substantial a priori knowledge as to the location of θ.
J. H. Venter [12] has proposed that observations be taken at X_n - c_n, X_n, and X_n + c_n on each iteration and thereby estimate M''(θ). He noted that this estimate could be used to minimize the variance of n^{β}(X_n - θ) if a_n is of the form An^{-1}. Again, this is mainly an asymptotic result and will not be considered here.
The only small sample results relevant to the K-W process seem
to be those of B. Springer's [11] simulation study on a process pro-
posed by Wilde [13]. This process is a modification of the K-W
process in which the sign of Y(X_n - c_n) - Y(X_n + c_n) is used rather than the difference. In this study he used a geometric sequence for {a_n}, which obviously does not satisfy the usual assumption that Σ a_n diverge. Then he unjustifiably criticized the process as a whole for the bias that he observed.
1.3 Summary of Results
The K-W process has been largely of theoretical interest, and the
behavior of this process has been studied mainly through its asymptotic
properties. With the exception of some remarks by a few writers,
little knowledge of use to the consulting statistician can be gleaned
from a review of the literature. It is the intent of this paper to
fill this gap.
Chapter 2 will be devoted to a generalization of Burkholder's
[2] theorem on the asymptotic distribution of a certain class of sto-
chastic approximation procedures. These extensions will admit
sequences {a_n} of the form O(n^{-α}), where 1/2 < α ≤ 1, admit the use of
constrained random variables (to be defined later), and allow the
experimenter to take an increasing number of observations at each
step. The usefulness of these extensions will be discussed in the
following chapters as they arise.
In the third chapter, a generalization of the K-W process will
be proposed and its asymptotic properties will be obtained from the
results of Chapter 2. Approximate bounds will be given for the small
sample variance of the process when M(x) is a quadratic function, and
this result will be used to obtain approximate bounds in cases of
more general functions. The effects of the response function, the
value of α, the variance of Y(x), and the starting value (X_1) on the
mean square error will be studied in some detail.
In the fourth chapter we shall develop the modification of the
K-W process proposed by Wilde [13]. Various asymptotic and small
sample properties will be established. In addition, a method of
estimating the variance of the estimator will be developed.
In the final chapter two examples will be given to illustrate
some of the results in the third and fourth chapters.
CHAPTER 2. STOCHASTIC APPROXIMATION PROCEDURES OF TYPE A_o^T
2.1 Constrained Stochastic Approximation Procedures
In this chapter we shall extend Burkholder's [2] Theorem 3 on
the asymptotic normality of a class of stochastic approximation
procedures. This extension will improve the practicality and the
flexibility of these procedures. The main discussion of the usefulness of this extension will be made in Chapter 2 and in Chapter 3, in which more concrete examples will be examined.
In the development that follows, F^{[r]}(·) will denote the distribution function of the mean of [r] ([r] denotes the greatest integer less than or equal to r) independent random variables with distribution function F(·). We shall denote by Y^T the random variable defined by

Y^T = T if Y > T,
    = Y if -T ≤ Y ≤ T,
    = -T if Y < -T,

where Y is any random variable. The random variable Y^T will be called a constrained random variable, and if R(Y) is any function or functional of Y, then R^T(Y) will denote the corresponding function or functional of Y^T. (E.g., if M(x) = E(Y(x)), then M^T(x) = E(Y^T(x)).)
When it is possible to do so without confusion, the random variable
sequence defined by
Y_n^{T_n} = T_n if Y_n > T_n,
          = Y_n if -T_n ≤ Y_n ≤ T_n,
          = -T_n if Y_n < -T_n,

will be denoted by Y^T. We shall say that a function sequence {D_n(x)} is continuously convergent to A at x = θ if for every ε > 0, there exists a δ(ε) > 0 and an N(ε) > 0 such that if |x - θ| < δ and n > N(ε), then |D_n(x) - A| < ε. The usual symbols O(·) and o(·) will be used to indicate the order of convergence.
Definition (2.1): Let {a_n'} and {T_n} be sequences of positive numbers such that if T_n ≠ ∞ for every n greater than some n_0, then inf T_n > 0. Let Z_n(x) denote a random variable with distribution function G_n(·|x). Let X_1 be fixed, and define the random variable sequence {X_n} by

X_{n+1} = X_n - a_n' Z_n^T(X_n),        (2.1)

where Z_n^T(x) is a random variable with conditional distribution G_n^T(·|x) given Z_1^T, ..., Z_{n-1}^T, X_1, ..., X_{n-1}, and X_n = x. The stochastic approximation sequence X_n will be said to be of Type A_o^T.
This definition is, in fact, Burkholder's [2] definition of a Type A_o process. However, the superscript T is added to emphasize the fact that, although the sequence {X_n} depends on constrained random variables, it is usually more reasonable to make assumptions on the unconstrained random variables.
In order to get an appreciation for this class of processes and
for some functions that will be defined later, we shall derive a few
well-known results for the K-W process.
That the K-W process is of Type A_o can be seen by letting a_n' = a_n c_n^{-1}, T_n ≡ ∞, and Z_n^T(X_n) = Y(X_n - c_n) - Y(X_n + c_n). Let R_n(x) = E(Z_n(x)). Then R_n(x) = M(x - c_n) - M(x + c_n), and, as was noted in the introduction, under suitable regularity assumptions there exists a sequence {μ_n} such that R_n(μ_n) = M(μ_n - c_n) - M(μ_n + c_n) = 0 and (x - μ_n) R_n(x) > 0 if x ≠ μ_n. By considering the Taylor series expansion of R_n(x) about μ_n, we can show that

R_n(x) = -2c_n (x - μ_n) M''(μ_n) + o(c_n (μ_n - θ)).

Letting D_n(x) = R_n(x)/c_n(x - μ_n) for x ≠ μ_n, = -2M''(θ) for x = μ_n, we find that the function sequence {D_n(x)} is continuously convergent to -2M''(θ) at x = θ, since |μ_n - θ| → 0 as c_n → 0.
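These quantities are easy to compute for a concrete response. For a quadratic M one has D_n(x) = -2M''(θ) exactly; for a non-quadratic M the convergence holds only in the limit. A numeric sketch with an illustrative quartic response (not from the dissertation):

```python
# Illustrative response: a symmetric quartic perturbation of the quadratic,
# M(x) = -(x - theta)^2 - 0.1 (x - theta)^4.  By symmetry mu_n = theta,
# and -2 M''(theta) = 4.
theta = 1.0
M = lambda x: -(x - theta) ** 2 - 0.1 * (x - theta) ** 4

def D(x, c):
    """D_n(x) = R_n(x) / (c_n (x - mu_n)) with R_n(x) = M(x - c) - M(x + c).
    Exact algebra gives D(x, c) = 4 + 0.8 (x - theta)^2 + 0.8 c^2."""
    return (M(x - c) - M(x + c)) / (c * (x - theta))

print(D(theta + 0.5, 0.1))    # 4 + 0.8*0.25 + 0.8*0.01 = 4.208
print(D(theta + 0.05, 0.01))  # approaches -2 M''(theta) = 4 as x -> theta, c -> 0
```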
2.2 Constrained Random Variables
The following lemmas establish certain relationships between
constrained and unconstrained random variables that are used in
Theorem (2.1).
Lemma (2.1): Let Y be a random variable with finite variance σ². Then Var(Y^T) ≤ σ².
Proof: Without loss of generality assume that E(Y) = 0. Define the random variable Y' by

Y' = T if Y > T,
   = Y if Y ≤ T.

Then |Y'| ≤ |Y|, and likewise |Y^T| ≤ |Y'|, since the constraining interval [-T, T] contains zero. Therefore

Var(Y^T) ≤ E(Y^T)² ≤ E(Y)² = σ².
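In code the constrained variable Y^T is simply a clip of Y to [-T, T], and the variance reduction of Lemma (2.1) can be checked by simulation (an illustrative sketch, not from the dissertation):

```python
import random

def constrain(y, t):
    """Y^T: constrain y to the interval [-t, t]."""
    return max(-t, min(t, y))

random.seed(0)
ys = [random.gauss(0.0, 3.0) for _ in range(100_000)]
yts = [constrain(y, 2.0) for y in ys]

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

print(var(ys), var(yts))   # the constrained sample variance is the smaller one
```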
Lemma (2.2): Let {T_n} and {r_n} be sequences satisfying Definition (2.1), and let {Z_n(x)} denote a sequence of random variables with distribution functions {F^{[r_n]}(·|x)} such that R(x) = ∫_{-∞}^{∞} z dF(z|x). Assume that there exists a finite number B and an even integer K ≥ 2 such that

sup_x ∫_{-∞}^{∞} (z - R(x))^K dF(z|x) ≤ B.

Then for 0 < C < 1 there exist positive constants D_1, D_2, and D_3 such that

a. if |R(x)| ≥ C T_n, then

R^{T_n}(x) sgn(R(x)) ≥ D_1 T_n - D_2 r_n^{-K/2} T_n^{-(K-1)};

b. and if |R(x)| < C T_n, then

|R(x) - R^{T_n}(x)| ≤ D_3 r_n^{-K/2} T_n^{-(K-1)}.
Proof: Case (a): Without loss of generality, assume R(x) > 0. Choose C' such that 0 < C' < C. We have then

E(Z_n^T(x)) = -T_n P(Z_n(x) < -T_n) + ∫_{-T_n}^{T_n} z dF^{[r_n]}(z|x) + T_n P(Z_n(x) > T_n)

≥ -T_n P(Z_n(x) < (C - C')T_n) + (C - C')T_n P(Z_n(x) ≥ (C - C')T_n)

= (C - C')T_n - (1 + C - C')T_n P(Z_n(x) < (C - C')T_n).

But since R(x) ≥ C T_n, we have

P(Z_n(x) < (C - C')T_n) ≤ P(|Z_n(x) - R(x)| > C'T_n)

≤ (C'T_n)^{-K} E(Z_n(x) - R(x))^K

≤ (C'T_n)^{-K} r_n^{-K/2} E(Z(x) - R(x))^K.

We have then

E(Z_n^T(x)) ≥ (C - C')T_n - (1 + C - C')(C')^{-K} B T_n^{-(K-1)} r_n^{-K/2}.

The desired result is obtained by letting D_1 = (C - C') and D_2 = (1 + C - C')(C')^{-K}.
Case (b): Again we can assume R(x) > 0. We have that

|R(x) - R^{T_n}(x)| ≤ (|R(x)| + T_n) P([Z_n(x) < -T_n] ∪ [Z_n(x) > T_n])
+ ∫_{-∞}^{-T_n} |z - R(x)| dF^{[r_n]}(z|x) + ∫_{T_n}^{∞} |z - R(x)| dF^{[r_n]}(z|x)

≤ (|R(x)| + T_n) P(|Z(x) - R(x)| > T_n - R(x))
+ (T_n + R(x))^{-(K-1)} ∫_{-∞}^{-T_n} |z - R(x)|^K dF^{[r_n]}(z|x)
+ (T_n - R(x))^{-(K-1)} ∫_{T_n}^{∞} |z - R(x)|^K dF^{[r_n]}(z|x),

since |R(x)| < C T_n < T_n. Again using Chebyshev bounds, we see that

|R(x) - R^{T_n}(x)| ≤ (1 + C) T_n ((1 - C) T_n)^{-K} B r_n^{-K/2} + ((1 - C) T_n)^{-(K-1)} r_n^{-K/2} B.

The desired result is obtained by letting D_3 = 2B(1 + C)(1 - C)^{-K}.
2.3 Asymptotic Distribution of a Type A_o^T Process
We shall now give an extension of Burkholder's [2] Theorem 3. It is helpful to recall the discussion on the K-W process preceding Lemma (2.1) and to bear in mind that the assumptions are based on the unconstrained random variables even though the theorem concerns a Type A_o^T process.
Before giving Theorem (2.1) we shall give a theorem by Burkholder [2] and a theorem by Fabian [6] which play integral parts in the proof of Theorem (2.1).
Theorem (Burkholder [2]): Suppose {X_n*} is a stochastic approximation process of Type A_o and θ is a real number such that

(1) there is a function Q from the positive real numbers into the integers such that if ε > 0, |x - θ| > ε, and n > Q(ε), then (x - θ) R_n*(x) > 0;

(2) sup_{n,x} (|R_n*(x)| / (1 + |x|)) < ∞;

(3) sup_{n,x} Var(Z_n*(x)) < ∞;

(4) if 0 < δ_1 < δ_2 < ∞, then Σ a_n* ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n*(x)| ) = ∞;

(5) Σ (a_n*)² < ∞;

(6) if n is a positive integer, then R_n*(x) and Var(Z_n*(x)) are Borel measurable.

Then P(lim X_n* = θ) = 1.

It will be useful to note that condition (1) can be replaced by the condition that

(1') there is a function Q from the positive real numbers into the positive integers such that if ε > 0, |x - θ| > ε, and n > Q(ε), then (x - θ) R_n*(x) ≥ E_n(x) + δ_n(x), where E_n(x) > 0, Σ a_n* E_n(x) = ∞, and sup_x |δ_n(x)| < ∞.
The following theorem is a one-dimensional version of Fabian's [6] theorem on asymptotic normality. In the theorem let R denote the real numbers. Let P denote the set of all measurable transformations from the measurable space (Ω, S) to R. Let χ[A] denote the indicator function of the set A, and, unless indicated otherwise, all convergence notions will mean convergence almost everywhere.

Theorem (Fabian): Suppose K is a positive integer, S_n a nondecreasing sequence of σ-fields, S_n ⊂ S; suppose U_n, V_n, T_n, Γ_n, Φ_n ∈ P, Γ, Φ ∈ R, and Γ > 0. Suppose Γ_n, Φ_{n-1}, V_{n-1} are S_n-measurable, C, α, β ∈ R, and

Γ_n → Γ, Φ_n → Φ, T_n → T or E|T_n - T| → 0,        (2.2)

E(V_n | S_n) = 0, C ≥ |E(V_n² | S_n) - Σ| → 0,        (2.3)

and, with σ²_{j,r} = E(χ[V_j² ≥ r j^α] V_j²), let either

lim_{j→∞} σ²_{j,r} = 0 for every r > 0,        (2.4)

or α = 1 and lim_{n→∞} n^{-1} Σ_{j=1}^{n} σ²_{j,r} = 0 for every r > 0. Suppose that

β_+ = β if α = 1, β_+ = 0 if α ≠ 1,

0 < α ≤ 1, 0 ≤ β, β_+ < 2Γ,        (2.5)

and

U_{n+1} = (1 - n^{-α} Γ_n) U_n + n^{-(α+β)/2} Φ_n V_n + n^{-α-β/2} T_n.        (2.6)

Then the asymptotic distribution of n^{β/2} U_n is normal with mean (Γ - β_+/2)^{-1} T and variance Σ Φ² / (2Γ - β_+).
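The limiting mean can be checked numerically in the noise-free case: with α = β = 1, Φ_n ≡ 0, Γ_n ≡ Γ, and T_n ≡ T, the recursion (2.6) is deterministic, and n^{1/2} U_n should approach (Γ - β_+/2)^{-1} T. A sketch with hypothetical constants Γ = 2, T = 3 (not values used in the text):

```python
# Deterministic check of the limiting mean in Fabian's theorem:
#   U_{n+1} = (1 - Gamma/n) U_n + n^{-3/2} T   (alpha = beta = 1, no noise),
# so sqrt(n) U_n should approach T / (Gamma - 1/2).
gamma, t = 2.0, 3.0
u = 0.0
n_steps = 1_000_000
for n in range(1, n_steps + 1):
    u = (1.0 - gamma / n) * u + t * n ** -1.5

print(n_steps ** 0.5 * u)   # approaches t / (gamma - 0.5) = 2.0
```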
Theorem (2.1): Let {X_n} be a stochastic approximation process of Type A_o^T with E Z_n(x) = R_n(x) and G_n(·|x) = G^{[n^p]}(·|x). Suppose that {μ_n} is a real number sequence, {c_n} is a positive number sequence, θ is a real number, p, w, and ζ are non-negative numbers, and each of n_0, A, C, σ², ν, α, and Λ is a positive number such that

i. {R_n(x)} is a function sequence such that if n > n_0, then R_n(μ_n) = 0, and if x ≠ μ_n, then (x - μ_n) R_n(x) > 0;

ii. sup_{n,x} n^w |R_n(x)| / (1 + |x|) < ∞;

iii. if 0 < δ_1 < δ_2 < ∞, then Σ a_n' ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| ) = ∞, and if T_n ≠ ∞ for all n > n_0, then lim_n ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| ) exists;

iv. {D_n(x)} is a function sequence defined by D_n(x) = R_n(x)/c_n(x - μ_n) if x ≠ μ_n, = Λ* if x = μ_n, and is continuously convergent to Λ at x = θ; [*The definition of D_n(μ_n) is arbitrary except for the restriction that D_n(μ_n) → Λ.]

v. sup_x Var(Z(x)) < ∞, Var(Z(x)) is continuous and converges to σ² at x = θ, there exists a finite open interval I containing θ such that for η > 0

lim_{j→∞} E{χ[(Z(x) - R(x))² ≥ η j^α] (Z(x) - R(x))²} = 0

uniformly for x ∈ I, and if T_n ≠ ∞ for every n > n_0, then sup_x E(|Z(x) - R(x)|^r) < ∞ for r an even integer ≥ 2;

vi. G(z|·) is Borel measurable for all z;

vii. (a) 1/2 < α ≤ 1, |μ_n - θ| = O(n^{-ν}), n^w c_n → C, n^{α-w} a_n' → A/C, (rp/2 + (r-1)ζ) > w + (1 - α), and β/2 = α/2 + p/2 - w < ν;

(b) if p/2 < w, then p > 1 - 2(α - w);

(c) if T_n ≠ ∞ for every n > n_0, then β/2 < (rp/2 + (r-1)ζ - w);

(d) and β_+ = 0 if α < 1, = β < 2AΛ if α = 1.

Then X_n → θ a.e., and n^{β/2}(X_n - θ) is asymptotically normal (0, σ²A²/C²(2AΛ - β_+)).
Proof: First we shall use Burkholder's theorem to show X_n → θ a.e. by considering two cases.

Case I: p/2 < w.

Burkholder has shown that condition (i) implies that, for every ε > 0, there exists a Q(ε) such that if |x - θ| > ε, then (x - θ) R_n(x) > 0 for every n > Q(ε). Thus if T_n = ∞,

(x - θ) R_n*(x) = (x - θ) n^{p/2} R_n(x) > 0.

If T_n ≠ ∞ for every n > n_0, then by Lemma (2.2) (with [r_n] = [n^p]) we have that the constraining alters R_n*(x) by at most

O(n^{-rp/2 - (r-1)ζ}) max{D_2, D_3}.

Denote

lim_{n→∞} ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| )

by L. This limit exists and is finite by (ii) and (iii). If L > 0, Lemma (2.2) and (vii. (a)) give

Σ a_n* ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n*(x)| ) ≥ D_4 Σ a_n'

for 0 < D_4 < min{L, D_1}. Since a_n' = O(n^{-(α-w)}), w ≥ 0, and α ≤ 1, again the series diverges. If L = 0, then

inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| < D T_n

for some 0 < D < 1 and sufficiently large n, since lim inf n^{-ζ} T_n > 0. Thus by (b) of Lemma (2.2) and the fact that (rp/2 + (r-1)ζ) > w + (1 - α) ≥ 0, the divergence of the series follows from the divergence for the case T_n = ∞. Condition (4) therefore holds in all cases.
Condition (5) follows from

Σ (a_n*)² = Σ (a_n')² n^{-p} = O( Σ n^{-(2(α-w)+p)} ) < ∞

by (vii. (b)).
Finally, Burkholder [2] has shown that if G(z|·) is Borel measurable, then G^{[n^p]}(z|·) is Borel measurable, and that if sup_x Var(Z(x)) < ∞, then the measurability of R(x) and Z(x) is implied by that of G(z|·). Since the measurability of G^{[n^p]}(z|·) clearly implies the same for the distribution function of n^{p/2} Z^T(x), the measurability of R_n*(x) and Var Z_n*(x) follows. Thus condition (6) is satisfied.

Therefore, X_n → θ a.e.
Case II: p/2 ≥ w. Let a_n* = a_n' c_n and Z_n*(x) = c_n^{-1} Z_n^T(x). Conditions (1), (2), (3), (4), and (6) are established as in Case I. Condition (5) follows from

Σ (a_n*)² = Σ (a_n' c_n)² < ∞

by (vii. (a)).

Thus, in either case, X_n → θ a.e.
We shall now establish the asymptotic normality of n^{β/2}(X_n - θ) using Fabian's theorem.

From the definition of {X_n} we have

X_{n+1} - θ = X_n - θ - a_n' R_n^T(X_n) - a_n' (Z_n^T(X_n) - R_n^T(X_n))
            = X_n - θ - a_n' R_n(X_n) - a_n' ((R_n^T(X_n) - R_n(X_n)) + (Z_n^T(X_n) - R_n^T(X_n))).

But by conditions (i) and (iv),

R_n(x) = c_n (x - μ_n) D_n(x)
       = c_n (θ - μ_n) D_n(x) + c_n (x - θ) D_n(x),

and so

X_{n+1} - θ = (1 - a_n' c_n D_n(X_n)) (X_n - θ) - a_n' (Z_n^T(X_n) - R_n^T(X_n))
            - a_n' ((R_n^T(X_n) - R_n(X_n)) + c_n (θ - μ_n) D_n(X_n)).        (2.7)
Make the correspondence with Fabian's theorem as follows:

U_n = X_n - θ,   Γ_n = n^{α} a_n' c_n D_n(X_n),

Φ_n = n^{(α+β)/2} [n^p]^{-1/2} a_n',   V_n = -[n^p]^{1/2} (Z_n^T(X_n) - R_n^T(X_n)),

and

T_n = T_n(X_n) = -n^{α+β/2} a_n' ((R_n^T(X_n) - R_n(X_n)) + c_n D_n(X_n)(θ - μ_n)).

(Note the possible confusion of T_n(X_n) with the values at which Z_n^T(X_n) is constrained.)

Let S_n be the σ-field generated by X_1, Z_1^T, ..., Z_{n-1}^T. Clearly this is the same σ-field generated by Z_1^T, ..., Z_{n-1}^T, X_1, ..., X_n. Moreover, S_n is increasing, and Φ_{n-1} and V_{n-1} are measurable with respect to S_n. Since R_n^T(X_n) = E(Z_n^T(X_n) | X_n), R_n^T(X_n) is measurable with respect to S_n, implying the same for D_n(X_n) and therefore Γ_n.
To establish (2.2) we recall that X_n → θ a.e. Since D_n(x) is continuously convergent to Λ at x = θ, it follows that D_n(X_n) → Λ a.e., and therefore R_n(X_n) → 0 a.e. Thus

Γ_n = n^{α} a_n' c_n D_n(X_n) → AΛ

by conditions (iv) and (vii. (a)). Again by (vii. (a)),

Φ_n = n^{(α+β)/2} [n^p]^{-1/2} a_n' → A/C,

since β/2 = α/2 + p/2 - w. Finally, since a_n' = O(n^{-(α-w)}), the term n^{α+β/2} a_n' (R_n^T(X_n) - R_n(X_n)) → 0 a.e. That

n^{β/2} (θ - μ_n) D_n(X_n) → 0 a.e.

follows from the fact that |θ - μ_n| = O(n^{-ν}) and β/2 < ν. Thus T_n → 0 a.e.
We have

E(V_n | S_n) = E(E(V_n | X_n) | S_n) = E(0 | S_n) = 0.

If T_n = ∞, then there exists a C > 0 such that (with Σ = σ²)

C ≥ |E(V_n² | S_n) - Σ| → 0 a.e.

by (v). If T_n ≠ ∞ for every n > n_0, we have from (vii. (a)) that rp/2 + (r-1)ζ > w + 1 - α ≥ 0, so that the effect of the constraining is asymptotically negligible. Hence |E(V_n² | S_n) - Σ| → 0 a.e. in either case. This establishes (2.3).
To show that lim_{j→∞} σ²_{j,t} = 0 for every t > 0, let I be defined as in (v). Since E(V_n) = 0, σ²_{j,t} ≤ sup_x E(V_j²) ≤ sup_x Var Z(x) < ∞ by Lemma (2.1) and condition (v). Letting D denote this last upper bound, we obtain

σ²_{j,t} = E χ{[V_j² ≥ t j^α] ∩ [X_j ∈ I]} V_j² + E χ{[V_j² ≥ t j^α] ∩ [X_j ∈ I^c]} V_j²
         ≤ E χ{[V_j² ≥ t j^α] ∩ [X_j ∈ I]} V_j² + D P(X_j ∈ I^c).

The first term goes to zero by (v), and the last term goes to zero since P(X_n ∈ I) → 1. Thus condition (2.4) is satisfied with t = r.
Finally, by noting that Γ = AΛ and β < 2AΛ if α = 1, it follows that n^{β/2}(X_n - θ) is asymptotically normal (0, σ²A²/C²(2AΛ - β_+)).
Although Theorem (2.1) does not extend the conclusions of
Fabian's theorem to a larger class of stochastic approximation pro-
cedures, the assumptions of Theorem (2.1) are more related to the
conditions of an experiment in which the use of these procedures may
be attractive. Theorem (2.1) also has the added attraction of removing the implied assumption that X_n → θ a.e. and including it in the conclusion of the theorem.
Theorem (2.1) extends Burkholder's [2] Theorem 3 in three directions. Firstly, it increases the possible values for α from that of 1 to that of the interval (1/2, 1]. Since the constant a_n (= A n^{-α}) can be thought of as the weight assigned to the n'th observation, this wider range of values for α gives the experimenter more flexibility
in weighting the observations. This greater flexibility often enables
the experimenter to decrease the mean squared error during the early
stages of the search for e. Secondly, it allows the use of con-
strained random variables. An example in Chapter 3 points out when
such constraining can improve the behavior of the K-W process. And,
thirdly, it allows an increasing number of observations to be taken
on each step. Again an example in Chapter 3 describes situations in
which, by taking more observations on each step, we can obtain the
same variance as by taking the same number of observations using more
steps. The practical importance of decreasing the number of times the
experimenter must obtain observations is quite clear.
Due to the generality of Theorem (2.1), it is difficult to com-
ment on the practical consequences of each assumption with the excep-
tion of assumptions (v) and (vi). It is sufficient to guarantee assumption (v) that our observations be bounded. Since this can almost always be assumed in practice, assumption (v) is not practically restrictive for any finite r. Thus if either ζ or p is greater than zero, the assumptions that

rp/2 + (r - 1)ζ > w + (1 - α)

and

β/2 < (rp/2 + (r - 1)ζ - w)

are almost always satisfied. Assumption (vi) seems never to be
questioned in practice.
Theorem (2.1) expresses the order of the variance in terms of the number of steps rather than the number of observations. Let N(n) be the total number of observations necessary to obtain X_n. Then

N(n) = Σ_{m=1}^{n-1} m^p = O(n^{p+1}).

If γ is such that (N(n))^γ = n^{β/2}, then

(p + 1)γ = α/2 + p/2 - w,

or

γ = (α/2 + p/2 - w)/(p + 1).

We shall discuss the value of γ for the K-W process in the next chapter, and shall only note here that w = 0 for the Robbins-Monro [9] process, giving γ = (α + p)/2(p + 1).
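As a quick numerical sanity check of the step-versus-observation accounting (the values of p, α, and w below are illustrative, not ones recommended by the text):

```python
# Check that N(n) = sum_{m=1}^{n-1} m^p grows like n^(p+1)/(p+1), so a rate
# of n^(beta/2) per step is N^gamma per observation with
# gamma = (alpha/2 + p/2 - w) / (p + 1).
p, alpha, w = 2.0, 1.0, 0.25          # hypothetical choices
gamma = (alpha / 2 + p / 2 - w) / (p + 1)
print(gamma)                          # (0.5 + 1.0 - 0.25) / 3 = about 0.4167

def N(n):
    return sum(m ** p for m in range(1, n))

for n in (1_000, 100_000):
    print(n, N(n) / n ** (p + 1))     # approaches 1 / (p + 1) = 1/3
```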
CHAPTER 3. THE CONSTRAINED KIEFER-WOLFOWITZ PROCESS
3.1 Definition of the Constrained Kiefer-Wolfowitz Process
The Kiefer-Wolfowitz process is not generally accepted as a
practical procedure for seeking a maximum in experimental work. The
basic reasons for this are: its sequential nature, insufficient
knowledge of its small sample properties, lack of a good estimate of
its variance, and an unfamiliarity of experimental statisticians with
its attributes.
We shall attempt to relieve these deficiencies somewhat. Following the definition of the Constrained Kiefer-Wolfowitz process, we
shall obtain, under certain mild restrictions, its asymptotic distri-
bution as a corollary to Theorem (2.1). Various examples will be
given to study the effect of the parameters of the process and to
illustrate its small sample behavior. Finally, bounds will be ob-
tained for expressions related to the small sample variance, and a
method of estimating this variance will be discussed.
Definition (3.1): Let {a_n}, {c_n}, {r_n}, and {T_n} denote sequences of positive numbers such that [r_n] ≥ 1 and, if T_n ≠ ∞ for all n greater than some n_0, inf T_n > 0. Let Y(x) denote a random variable with mean M(x) and distribution function F(·|x). Let X_1 be fixed, and let the random variable sequence {X_n} be defined by

    X_{n+1} = X_n - a_n c_n^{-1} (Y(X_n - c_n) - Y(X_n + c_n))^{T_n},   (3.1)

where Y(X_n - c_n) and Y(X_n + c_n) are random variables which, conditional on Y(X_1 - c_1), ..., Y(X_{n-1} + c_{n-1}), X_1, ..., X_{n-1}, X_n = x_n, are distributed independently as F^{[r_n]}(y|x_n - c_n) and F^{[r_n]}(y|x_n + c_n).

This will be called the Constrained Kiefer-Wolfowitz process, and it will be referred to as the CKW process.
The CKW process is a generalization of the K-W process. The CKW process bounds the difference (Y(X_n - c_n) - Y(X_n + c_n)) by ±T_n, and it allows the experimenter to vary the number of observations on each step. It clearly reduces to the K-W process when r_n ≡ r ≥ 1 and T_n ≡ ∞.
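The recursion (3.1) is easy to sketch numerically. The following illustration is not part of the thesis: the quadratic response, the Gaussian noise, and all parameter values are assumptions chosen only to show one CKW run, with the truncation (·)^{T_n} read as clipping to [-T_n, T_n] and F^{[r_n]} read as averaging [r_n] observations at each level.

```python
import random

def ckw_step(x, n, M, sigma, A=1.0, C=1.0, alpha=1.0, w=1/6, T=1.0, r=1):
    """One step of recursion (3.1):
    X_{n+1} = X_n - a_n c_n^{-1} (Ybar(X_n - c_n) - Ybar(X_n + c_n))^{T},
    where Ybar averages r noisy responses and (z)^T clips z to [-T, T]."""
    a_n, c_n = A * n ** -alpha, C * n ** -w
    ybar = lambda u: sum(M(u) + random.gauss(0, sigma) for _ in range(r)) / r
    z = ybar(x - c_n) - ybar(x + c_n)
    z = max(-T, min(T, z))               # the CKW constraint |Z| <= T
    return x - (a_n / c_n) * z

# Illustrative run: quadratic response with maximum at theta = 2.
random.seed(1)
theta, x = 2.0, 0.0
for n in range(1, 2001):
    x = ckw_step(x, n, M=lambda u: -(u - theta) ** 2, sigma=0.5)
print(round(x, 2))   # settles near theta = 2
```

Note that the clipped difference bounds each step by a_n c_n^{-1} T, which is what distinguishes the run above from an unconstrained K-W run.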
3.2 Asymptotic Properties of the CKW Process
We shall now examine certain asymptotic properties of the CKW
process through the following corollary to Theorem (2.1). In re-
viewing the corollary, it will be helpful to bear in mind that the
assumptions are based on the unmodified random variables.
Corollary (3.1): Let {X_n} be a CKW process with r_n = n^{2aw}. Let θ be a real number, let a and b be non-negative numbers, and let each of n_0, A, C, σ², α, w, and λ be positive numbers such that

a. M'(x) is continuous, (x - θ)M'(x) > 0 if x ≠ θ, and

    sup_x (|M'(x)|/(1 + |x|)) < ∞;

b. the third derivative of M(x) exists and is continuous for x in an open interval (I) containing θ, and M''(θ) = -λ/2;

c. Var Y(x) = σ²(x) is bounded and continuous, there exists an open interval (I) containing θ such that for any t > 0

    lim_{j→∞} E{χ[(Y(x) - M(x))² > t j^a] (Y(x) - M(x))²} = 0

uniformly for x ∈ I, and if T_n ≠ ∞ for every n > n_0, then sup_x E(|Y(x) - M(x)|^r) < ∞ for r an even integer ≥ 2;

d. F(y|·) is Borel measurable for all y;

e. (1) 1/2 < α ≤ 1, n^α a_n → A, n^w c_n → C, lim inf n^{-bw} T_n > 0,
       (ra + (r - 1)b) > 1 + (1 - α)/w, and
       β/2 = α/2 + (a - 1)w < 2w;
   (2) if a < 1, then a > 1 - (2α - 1)/2w;
   (3) if T_n ≠ ∞ for every n > n_0, then β/2 < (ra + (r - 1)b - 1)w;
   (4) and β₊ = 0 if α < 1, = β < 2Aλ if α = 1.

Then X_n → θ almost everywhere, and n^{β/2}(X_n - θ) is asymptotically normal (0, 2A²σ²/(C²(2Aλ - β₊))).

The basic ideas of Corollary (3.1) have already been noted by Burkholder [2]. However, since Corollary (3.1) is an extension of his work and since it is instructive to do so, we shall give a proof of this corollary.
Proof: Make the correspondence with Theorem (2.1) as follows: a'_n = a_n c_n^{-1}, Z_n(x) = (Y(x - c_n) - Y(x + c_n)), and R_n(x) = M(x - c_n) - M(x + c_n).

Condition (a) implies that M(x) is continuous, and that M(x) is strictly increasing or decreasing as x is less than or greater than θ, respectively. Thus, there exists a sequence {μ_n} such that

    (x - μ_n) R_n(x) > 0 for x ≠ μ_n.

From conditions (a) and (e.(1)) we obtain that

    sup_{n,x} n^w |R_n(x)|/(1 + |x|) = sup_{n,x} n^w (c_n O(|M'(x)|)/(1 + |x|)) < ∞.

This is condition (ii).

To verify condition (iii) we note that by (a) and (e.(1))

    Σ_n a'_n ( inf_{δ_1 < |x-θ| < δ_2} |R_n(x)| ) = Σ_n a_n ( inf_{δ_1 < |x-θ| < δ_2} |O(M'(x))| ) = O(Σ_n a_n) = ∞.

Moreover, lim inf_{δ_1 < |x-θ| < δ_2} |R_n(x)| = lim c_n O(1) = 0 by the above argument and the fact that c_n → 0.

The function sequence {D_n(x)} in condition (iv) is defined by D_n(x) = c_n^{-1}(x - μ_n)^{-1}(M(x - c_n) - M(x + c_n)) if x ≠ μ_n, and = λ at x = θ, using condition (b). For x ∈ I we have, since R_n(μ_n) = 0,

    M(x - c_n) - M(x + c_n) = (x - μ_n)(M'(μ_n - c_n) - M'(μ_n + c_n))
                                + O((x - μ_n)²)(M''(ξ_1 - c_n) - M''(ξ_2 + c_n))
                            = -2c_n(x - μ_n)M''(μ_n) + o(c_n(x - μ_n)),

where |ξ_i - x| ≤ |μ_n - x|, i = 1, 2. Since the third derivative exists for x ∈ I and |μ_n - θ| ≤ c_n → 0, it follows that for every ε > 0 there exists a δ > 0 and an N such that, for |x - θ| < δ and n > N, |D_n(x) - λ| < ε. Thus condition (iv) holds.

Condition (v) is a direct consequence of (c) with Var Z_n(x) ≤ 2σ².

Finally, Burkholder has noted that condition (d) implies (vi). He has also shown that the existence of a continuous third derivative in an open interval containing θ implies that |μ_n - θ| = O(c_n²) = O(n^{-2w}). Letting v = 2w, p = 2aw, and ζ = bw, condition (vii) can be verified using simple arithmetic.

Thus the corollary is proven.
Except in a few pathological cases, an experimenter can assume that the derivatives of M(x) exist, that the r'th central moment of Y(x) exists and is continuous, and that F(y|x) is a continuous function of x for every y. Moreover, it is a rare case when he cannot specify a finite interval which should contain θ. By restricting X_n to this interval, all the assumptions of Corollary (3.1) hold for r arbitrarily large.

If the above assumptions are satisfied, condition (e) is equivalent to (e'):

    1/2 < α ≤ 1, n^α a_n → A, n^w c_n → C, a + b > 0,
    β/2 = α/2 + (a - 1)w < 2w, and
    β₊ = 0 if α < 1, = β < 2Aλ if α = 1.

Thus, except for assumption (e'), the assumptions of Corollary (3.1) should not be looked upon as restrictions, but rather as factors influencing the behavior of the process.
Since in experimental work it can usually be assumed that the supremum over x of any central moment of Y(x) is finite (at least over an interval known to contain θ), the assumptions that

    (ra + (r - 1)b) > 1 + (1 - α)/w

and

    α/2 + (a - 1)w < (ra + (r - 1)b - 1)w

are generally not restrictive. Practically speaking then, the constant a must satisfy

    α/2 + (a - 1)w < 2w   (3.2)

and

    1 - (2α - 1)/2w < a,

or, equivalently,

    1 - (2α - 1)/2w < a < 3 - α/2w.   (3.3)
As in Chapter 2, let N(n) denote the number of observations required to obtain X_n, and let γ be defined by (N(n))^γ = n^{β/2}. By (2.8),

    γ = (α/2 + (a - 1)w)/(2aw + 1).   (3.4)

Thus, if a = 0 (i.e., the same number of observations are taken on each step), γ = α/2 - w. Since (3.2) implies

    w > α/(2(3 - a)),   (3.5)

we obtain, for a = 0,

    γ = β/2 < α/2 - α/6 = α/3.

Suppose a > 0. We have

    (2aw + 1)² dγ/dw = (2aw + 1)(a - 1) - (α/2 + (a - 1)w)2a = a(1 - α) - 1.   (3.6)
It follows that dγ/dw is greater than, equal to, or less than zero as a is greater than, equal to, or less than (1 - α)^{-1}.

Case I: a < (1 - α)^{-1}. In this case γ is maximized by minimizing w. Thus by (3.4) and (3.5)

    γ < (α/2 + α(a - 1)/(2(3 - a)))/(aα/(3 - a) + 1)
      = α/(3 - a(1 - α))
      ≤ 1/3.   (3.7)

The equality sign holds if either a = 0 or α = 1.

Case II: a ≥ (1 - α)^{-1}. By (3.3), (1 - α)^{-1} < 3 - α/2w. Since 1/2 < α ≤ 1, this inequality is not satisfied if either w ≤ 1/4 or α > 2/3.

Assuming these conditions are satisfied, substitution into (3.4) gives, for a = (1 - α)^{-1},

    γ = (α/2 + αw/(1 - α))/(2w/(1 - α) + 1) = α/2 ≤ 1/3.

If a > (1 - α)^{-1}, γ is maximized by maximizing w. Taking the limit of (3.4) as w → ∞ we obtain

    γ = (a - 1)/2a.

By (3.3), a < 3, and so γ < 1/3.

Thus in all cases γ < 1/3. We might add that a and w can be adjusted so that γ is arbitrarily close to the bound obtained in each case.
We shall see that the bias is minimized by increasing the number of steps and is usually not affected by increasing the number of observations taken on each step. The restriction that α > 1/2 implies for Case II that a > 2, and so this case is usually not of practical significance.

It follows from (3.7) that, for Case I, the order of the variance increases as either a decreases or α increases. Moreover, the effect of a change in a becomes less pronounced as α increases, and it vanishes at α = 1. Thus, if the number of steps is sufficient to discount a serious bias, the experimenter loses little efficiency in a statistical sense by moderately increasing the number of observations on each step and decreasing the number of steps when α is close to 1.
However, by taking observations at fewer levels of the controlled
variable (X), the experimenter decreases the time required for the
experiment and possibly the cost.
3.3 Effect of Constraining the K-W Process
It is difficult to study the advantages and disadvantages of
constraining the random variables via a strict mathematical treatment.
Rather, using an intuitive approach, we shall study an example which
illustrates when we may improve the K-W process by constraining the
random variables.
Let Z ~ N(μ, σ²). It is clear by Lemma 2.1 and the symmetry of the normal distribution that

    1. sgn(μ^T) = sgn(μ);
    2. μ^T ≈ μ if |μ| << T;   (3.9)
    3. |μ^T| < |μ| if |μ| > T;
    4. Var Z^T ≈ σ² if |μ| << T and σ << T;
    5. and Var Z^T < σ² if σ > T,

where Z^T denotes Z constrained to [-T, T] and μ^T = E(Z^T).
Partition the x-axis into the intervals I_1 = (-∞, θ - k), I_2 = [θ - k, θ + k], and I_3 = (θ + k, ∞). Assume that |M'(a)| << |M'(b)| for a ∈ I_1 and b ∈ I_3. Assume that T_n ≡ T, c_n ≡ C, and σ² are such that T < σ and |M(x - C) - M(x + C)| << T for x ∈ I_1. (These assumptions on M(x) are often realized in various threshold phenomena.)

Let X_n and X_n^T denote the K-W process and the CKW process respectively, and assume that X_1 = X_1^T = x.
To compare the processes we can use the observations in (3.9) with Z = Y(x - C) - Y(x + C). Using (5) and the fact that T < σ, we obtain

    Var (Y(x - C) - Y(x + C))^T < Var (Y(x - C) - Y(x + C)).

It follows that, at least for a few steps, the K-W process is more variable about its expected value. In addition to this, as long as both processes remain in the same interval, we can visualize the following cases.

Case I: x ∈ I_1. Since |M(x - C) - M(x + C)| << T for x ∈ I_1, (2) of (3.9) implies the processes approach θ at the same expected rate. The fact that M'(a) << M'(b) for a ∈ I_1 and b ∈ I_3 implies that this expected rate is much less than the corresponding rate for either process approaching θ from I_3. That is, the probability that either process moves from I_1 to I_2 in a few steps is relatively small.
It is possibly a little larger for the K-W process due to its more
erratic nature.
Case II: x ∈ I_2. While in this region, the behavior of both processes is characterized by random movement since M(x - C) - M(x + C) ≈ 0 << T. This movement is more erratic for the K-W process if T < 2σ. Moreover, if by chance either process leaves this region, the movement away from θ should most likely be greater for the K-W process due to its sensitivity to observations from the tails of the distribution. This difference increases with an increase in σ or a decrease in T.
Case III: x ∈ I_3. It follows from (3) of (3.9) that the expected rate of approach toward θ is faster for the K-W process. This difference increases with a decrease in T. However, the expected rate of approach toward θ is fast for either process, and so the probability that either process moves from I_3 to I_2 in a few steps is relatively large.
Since both processes tend to move out of the region I_3 more quickly than out of region I_1, we can expect both processes to be biased after a few steps given that X_1 is in a neighborhood of θ. The CKW is less biased for two reasons. Firstly, it tends to move from I_3 into I_2 more slowly, and it tends to move from I_1 into I_2 at about the same rate. Secondly, but probably more importantly, on leaving I_2 the K-W process tends to be thrown farther into regions I_1 and I_3. Due to the usually quick recovery from I_3 and the usually slow recovery from I_1, this can greatly increase the difference in the bias of the estimates.

If X_1 ∈ I_1, neither process seems to be the better, and although the K-W process seems better if X_1 ∈ I_3, the difference quickly disappears since both processes approach θ at a rapid rate. Thus the CKW process seems to be preferable for this example.
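The qualitative claim above can be probed with a small simulation. Nothing below comes from the thesis — the response exp(-x²), the noise level, and the constants are all assumptions for illustration — but it shows the mechanism: clipping the corrective variable keeps the iterate from being thrown deep into the flat region where |M'(x)| is small.

```python
import math, random

def run(T=None, steps=300, sigma=1.0, A=0.5, C=0.5, x0=0.5, seed=0):
    """One K-W run (T=None) or CKW run (finite T) on M(x) = exp(-x^2),
    whose flat tails mimic the threshold behavior discussed above."""
    rng = random.Random(seed)
    M = lambda u: math.exp(-u * u)        # maximum at theta = 0
    x = x0
    for n in range(1, steps + 1):
        a_n, c_n = A / n, C               # constant c_n, per the example
        z = (M(x - c_n) + rng.gauss(0, sigma)) - (M(x + c_n) + rng.gauss(0, sigma))
        if T is not None:
            z = max(-T, min(T, z))        # CKW clips the corrective variable
        x -= (a_n / c_n) * z
    return x

kw  = sum(abs(run(T=None, seed=s)) for s in range(200)) / 200
ckw = sum(abs(run(T=0.5,  seed=s)) for s in range(200)) / 200
print(kw > ckw)   # the unconstrained runs end, on average, farther from theta
```

A large early noise draw sends the K-W iterate past |x| ≈ 1.5, where the gradient is nearly dead and recovery is slow; the clipped process cannot take such a jump.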
The essential feature of this example is the existence of two regions, both excluding θ, that are differentiated by the magnitude of M'(x). Other examples with this feature can be constructed, but we note only that regression functions of the form e^{-x²} have this feature. The advantage of the CKW process is that of lessening the probability of a deep penetration into the region for which |M'(x)| is small.

The optimal choice of T is largely determined by σ² and the regions I_2 and I_3. Clearly, if 4σ < T, where σ² = Var Y(x), constraining is of little consequence in I_2, and if it is of consequence in I_3, it is a negative factor. Thus, we must choose T so as to decrease the variability in I_2 without sacrificing too much of the information that I_3 contains with respect to the direction of θ. As either more observations are taken at each step, or σ² is decreased, this becomes more difficult to do.
Before leaving this subject, it should be observed that as a_n → 0, the influence of each iteration goes to zero. This implies that both processes become characterized by their expected rate of approach toward θ. In any small neighborhood of θ, toward which both processes tend, this is very similar.
3.4 Expected Small Sample Behavior of the K-W Process
In this section we shall discuss the effects that the parameters
α and w, the starting value (X_1), and the shape of the response curve have on the behavior of the K-W process. Although the CKW process will
not be discussed, it will be seen that most of the results in this sec-
tion apply to it as well.
Let I denote an interval containing θ, and assume that M'(x) = B for x ∈ I^c. Assume that the distribution function of Y(x) - M(x) is independent of x. Let Z(X_n) = Y(X_n - c_n) - Y(X_n + c_n), and Var Z(X_n) = 2σ². We have by (3.1) that

    X_{n+1} - θ = X_n - θ - a_n c_n^{-1}(M(X_n - c_n) - M(X_n + c_n)) - a_n c_n^{-1} Z(X_n).   (3.10)
If X_k >> θ, the process will remain in I^c for a few iterations with a high probability. Thus if n - k is not too large, (3.10) becomes

    X_{n+1} - θ = (X_n - θ) - 2B a_n - a_n c_n^{-1} Z(X_n).

By induction

    X_{n+1} - θ = (X_k - θ) - 2B Σ_{m=k}^n a_m - Σ_{m=k}^n a_m c_m^{-1} Z(X_m),

where, conditional on X_k >> θ, Z(X_m), m = k, ..., n, are independent random variables. It follows that

    E(X_{n+1} - θ | X_k >> θ) = (X_k - θ) - 2B Σ_{m=k}^n a_m,   (3.11)

and

    Var(X_{n+1} - θ | X_k >> θ) = 2σ² Σ_{m=k}^n a_m² c_m^{-2}.   (3.12)
Let a_n = A n^{-α} and c_n = C n^{-w}. It is a well-known result that

    Σ_{m=k}^n m^{-p} ≈ ∫_k^{n+1} x^{-p} dx   (3.13)

if p > 0. Substituting the expressions for a_n and c_n into (3.11) and (3.12) and applying (3.13), we obtain

    E(X_{n+1} - θ | X_k >> θ) ≈ X_k - θ - 2BA(ln(n+1) - ln k),   α = 1,

(with ((n+1)^{1-α} - k^{1-α})/(1 - α) replacing the logarithmic difference when α < 1), and

    Var(X_{n+1} - θ | X_k >> θ) ≈ σ² A² C^{-2} (α - w - 1/2)^{-1} (k^{1-2(α-w)} - (n+1)^{1-2(α-w)}).
The above expressions indicate that if we are approaching θ along a constant slope, our expected rate of approach is proportional to the slope and increases very rapidly as α decreases. Furthermore, the variance of our rate of approach decreases as either α increases or w decreases.
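The closed forms (3.11) and (3.12) can be checked by Monte Carlo. This sketch is not from the thesis; the constant-slope response, the parameter values, and the sample sizes are assumptions, and the sums are computed directly rather than through the integral approximation (3.13).

```python
import random

A, C, B, sigma, alpha, w, x0, steps = 0.05, 1.0, 1.0, 0.3, 1.0, 0.25, 10.0, 60

def run(seed):
    """K-W recursion on a stretch where the slope magnitude is the constant B,
    so each step is exactly (X_n - theta) - 2 B a_n - (a_n/c_n) Z(X_n)."""
    rng = random.Random(seed)
    x = x0                                   # distance from theta
    for n in range(1, steps + 1):
        a_n, c_n = A * n ** -alpha, C * n ** -w
        z = rng.gauss(0, sigma) - rng.gauss(0, sigma)   # Z(X_n), Var = 2 sigma^2
        x = x - 2 * B * a_n - (a_n / c_n) * z
    return x

runs = [run(s) for s in range(2000)]
mean = sum(runs) / len(runs)
var = sum((r - mean) ** 2 for r in runs) / len(runs)

# The exact conditional mean (3.11) and variance (3.12):
mean_3_11 = x0 - 2 * B * sum(A * n ** -alpha for n in range(1, steps + 1))
var_3_12 = 2 * sigma ** 2 * sum((A * n ** -alpha) ** 2 * (C * n ** -w) ** -2
                                for n in range(1, steps + 1))
print(round(mean, 3), round(mean_3_11, 3))   # sample moments track (3.11)
print(round(var, 4), round(var_3_12, 4))     # and (3.12)
```

Because the recursion here is exactly linear, the sample mean and variance are unbiased for (3.11) and (3.12) and differ from them only by sampling error.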
Thus if we expect to approach θ in a region in which M'(x) is fairly constant, we should generally decrease both α and w. It may be wise to let c_n and a_n be constant until the sequence {X_n} no longer shows a trend in a particular direction. This lack of a trend indicates we may be in a neighborhood of θ.
In this example we are trying to decrease the bias for a given
number of observations. It is clear from (3.11) that the bias is not
decreased by increasing the number of observations on each step. There-
fore, unless it is practically restrictive, it is best to take a mini-
mum number of observations on each step until we are in a neighborhood of θ.
The previous example reveals the behavior of the K-W process for
a few iterations while in a region in which M'(x) is fairly constant.
However, the assumptions of this example are certainly violated in a
small neighborhood of 8, or in a region in which M'(x) cannot be ap-
proximated by a constant.
A more reasonable approximation, at least in a neighborhood of θ, is to assume that M(x) is a quadratic function of x. That is, assume

    M(x) = -(B/2)(x - θ)²   (3.14)

for B > 0. Let a_n = A n^{-α} and c_n = C n^{-w}. Let Z(X_n) = Y(X_n - c_n) - Y(X_n + c_n) - (M(X_n - c_n) - M(X_n + c_n)), and assume that the distribution function of Z(X_n) given X_n = x is independent of x. Let Var Z(X_n) = 2σ².

Using (3.10) we have

    X_{n+1} - θ = X_n - θ + A n^{-α} c_n^{-1}(B/2)((X_n - c_n - θ)² - (X_n + c_n - θ)²)
                      - AC^{-1} n^{-(α-w)} Z(X_n)
                = (1 - 2ABn^{-α})(X_n - θ) - AC^{-1} n^{-(α-w)} Z(X_n).   (3.15)
Since, conditional on X_n = x, Z(X_n) is independent of x, we obtain by induction that

    X_{n+1} - θ = Π_{j=k}^n (1 - 2ABj^{-α}) (X_k - θ)
                      - AC^{-1} Σ_{m=k}^n Π_{j=m+1}^n (1 - 2ABj^{-α}) m^{-(α-w)} Z(X_m),

where Π_{j=n+1}^n (1 - 2ABj^{-α}) = 1. It follows that

    E(X_{n+1} - θ | X_k) = Π_{j=k}^n (1 - 2ABj^{-α}) (X_k - θ),

and

    E((X_{n+1} - θ)² | X_k) = D(k,n)(X_k - θ)² + 2A²C^{-2}σ² Σ_{m=k}^n D(m+1,n) m^{-2(α-w)},   (3.16)

where D(m,n) = Π_{j=m}^n (1 - 2ABj^{-α})². The last two expressions of (3.16) are, conditional on X_k, the square of the bias and the variance of X_{n+1}, respectively.
Using results similar to (3.13), it can be shown that there exists δ_k → 0 such that, for every k and n,

    (1 - δ_k) exp{-4AB(1-α)^{-1}(n^{1-α} - k^{1-α})} ≤ D(k,n)
        ≤ (1 + δ_k) exp{-4AB(1-α)^{-1}(n^{1-α} - k^{1-α})},   α < 1,   (3.17)

and

    (1 - δ_k) k^{4AB} n^{-4AB} ≤ D(k,n) ≤ (1 + δ_k) k^{4AB} n^{-4AB},   α = 1.   (3.18)
We can use the results of (3.17), (3.18), and Corollary (3.1) to obtain asymptotic expressions for the bias and variance of X_n in this example. In fact, it follows from (3.16), (3.17), and (3.18) that

    D(k,n)(X_k - θ)² = ((X_k - θ)² + o(1)) exp{-4AB(1-α)^{-1} n^{1-α}},   α < 1,
                     = ((X_k - θ)² + o(1)) n^{-4AB},   α = 1.   (3.19)

Although Corollary (3.1) does not guarantee that Var(X_n) exists, it does imply that n^{β/2}(X_n - θ), with β/2 = (α - 2w)/2, tends to be distributed as a normal random variable with variance equal to

    2A²σ²/(C²(4AB - (α - 2w))),   α = 1,   (3.20)

for A > (α - 2w)/4B if α = 1. Many writers have noted that the variance expression for α = 1 is minimized with respect to A if A = (α - 2w)/2B.
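A simulation suggests how quickly the conditional variance approaches the limit (3.20) in the quadratic case. Again, nothing here is from the thesis — all constants are assumptions chosen to satisfy α = 1 and A > (α - 2w)/4B.

```python
import random

A, B, C, w, sigma, alpha, theta, N = 1.0, 1.0, 1.0, 0.25, 0.5, 1.0, 0.0, 2000

def final_x(seed):
    """K-W run on M(x) = -(B/2)(x - theta)^2 with a_n = A/n, c_n = C n^{-w}."""
    rng = random.Random(seed)
    x = 1.0
    for n in range(1, N + 1):
        a_n, c_n = A * n ** -alpha, C * n ** -w
        y_lo = -(B / 2) * (x - c_n - theta) ** 2 + rng.gauss(0, sigma)
        y_hi = -(B / 2) * (x + c_n - theta) ** 2 + rng.gauss(0, sigma)
        x -= (a_n / c_n) * (y_lo - y_hi)
    return x

xs = [final_x(s) for s in range(300)]
emp = N ** (alpha - 2 * w) * sum(v * v for v in xs) / len(xs)
lim = 2 * A ** 2 * sigma ** 2 / (C ** 2 * (4 * A * B - (alpha - 2 * w)))
print(round(emp, 3), round(lim, 3))   # the two agree to within sampling error
```

The bias term n^{-4AB} decays much faster than the n^{-(α-2w)} variance here, so the scaled second moment is essentially the variance being compared to (3.20).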
It is reasonably evident from the nature of D(k,n) that the bias of the K-W procedure is diminished by decreasing α. Asymptotically this is clear from (3.19). However, it follows from (3.5) that β/2 = (α - 2w)/2 < α/3. Thus the asymptotic variance increases as α decreases. Again, until X_n is in a neighborhood of θ, it is probably best to let α and w be small. Once X_n is in a neighborhood of θ, α should be increased.
It is important to note that in this example the bias is not a function of c_n, and, as is to be expected, the variance decreases as either C increases or w decreases. It follows that we can set w = 0. By doing so, γ (see (3.4)) becomes α/2. Hence with α = 1 we obtain an asymptotic variance of O(n^{-1}). As was noted in Chapter 1, it is the symmetry of M(x) about θ that allows us to assume w = 0. It is clear that if M(x) is reasonably symmetric about θ, and if only a rough estimate of θ is needed, then it is best to let w ≈ 0.

Suppose we can assume that M(x) is symmetric about θ. This implies that we can set c_n ≡ C. Asymptotically, the optimal choice of C maximizes the difference |M(x - C) - M(x + C)| for small deviations of x about θ. It is quite clear then how C should be chosen. We only note here that, for a quadratic response surface, C should be made arbitrarily large, and, for surfaces of the form e^{-x²}, C is determined as in Figure 2.

[Figure 2. Optimal Choice of c_n (= C)]
In this example we have assumed that Var Z(X_n) = 2σ². If Var Z(X_n) = 2[n^{2aw}]^{-1}σ², (3.16) becomes

    E(X_{n+1} - θ | X_k) = D^{1/2}(k,n)(X_k - θ)

and

    Var(X_{n+1} - θ | X_k) = 2A²C^{-2}σ² Σ_{m=k}^n D(m+1,n) m^{-2(α + (a-1)w)}.   (3.21)
3.5 Approximations of the Small Sample Variance
The importance of evaluating the last expression in equation
(3.21) is quite clear. This expression is, conditional on X_k, the small sample variance of X_{n+1} if M(x) is a quadratic function. If each X_m, m = 1, ..., n, lies in a neighborhood of θ in which M(x) is reasonably approximated by a quadratic function, it is a close approximation to this variance for more general response functions.
At present, the only simple, but adequate, approximations to this expression rely heavily on asymptotic theory. We now obtain bounds for this expression using an appeal to asymptotic theory that is satisfied for values of k and n usually realized in practice.

Let

    b_{(m,n)} = m^{-φ} Π_{j=m+1}^n (1 - D j^{-α})²,   (3.22)

where D, φ > 0. For simplicity, let b_m ≡ b_{(m,n)}, keeping in mind the dependence on n.
Using (3.22) we obtain

    b_m - b_{m-1} = b_m(1 - b_m^{-1} b_{m-1}) = b_m(2D m^{-α} - φ m^{-1} + O(m^{-2α})).   (3.23)

Thus, for k sufficiently large, {b_m} is an increasing sequence for m > k if either α < 1, or α = 1 and 2D > φ.

Since m^{-α} - (m-1)^{-α} = O(m^{-(1+α)}), the second difference is given by

    (b_m - b_{m-1}) - (b_{m-1} - b_{m-2}) = (2D m^{-α} - φ m^{-1})(b_m - b_{m-1})
                                                + O(m^{-2α})(b_m - b_{m-1}).   (3.24)

Thus, for k sufficiently large, {b_m - b_{m-1}} is an increasing sequence for m > k if either α < 1, or α = 1 and 2D > φ.
The last expression of (3.21) is of the form Σ_{m=k}^n b_m. It is clear from equations (3.23) and (3.24) that, for sufficiently large k, upper and lower bounds can be obtained for this sum using a geometrical argument. See Figure 3 for a representation of the elements b_m that illustrates this approach. In fact,

    Σ_{m=k}^n b_m ≥ Area △ADE - Area △ABC,   k > A,

and

    Σ_{m=k}^n b_m ≥ Area △ADE,   k < A.   (3.25)

[Figure 3. Graph of the Sequence {b_m}]

Also,

    Σ_{m=k}^{n-1} b_m < (n - k) b_n.

This geometrical approach will be used in only one case in the following lemmas. However, every bound will be obtained by finding sequences of functions {f_n(x)} such that either

    f_n(m) ≥ b_m,   m = k, ..., n,

or

    f_n(m) ≤ b_m,   m = k, ..., n.

The respective bound will be obtained by integrating f_n(x) over a region such as (k, n+1) or (k-1, n).
Lemma (3.1): Let the sequence {b_m} be defined by (3.22) with α = 1. Let λ_1, λ_2, φ, and D be real numbers such that φ < λ_1 < 2D < λ_2. Then there exists an N(λ_1, λ_2) such that, for every k, n > N(λ_1, λ_2),

    (λ_2 - (φ-1))^{-1} n^{-λ_2} (n^{λ_2-(φ-1)} - (k-1)^{λ_2-(φ-1)}) ≤ Σ_{m=k}^n b_m
        ≤ (λ_1 - (φ-1))^{-1} n^{-λ_1} ((n+1)^{λ_1-(φ-1)} - k^{λ_1-(φ-1)}).

Proof: Consider the function sequence {f_n(x,λ)} defined by

    f_n(x,λ) = n^{-λ} x^{λ-φ}   (3.26)

for x > 0. Clearly, by (3.22),

    f_n(n,λ) = n^{-φ} = b_n.   (3.27)

We have, using (3.26), that

    f_n(m,λ) ≥ b_m   (≤ b_m)   (3.28)

if and only if

    n^{-λ} ≥ m^{φ-λ} b_m   (≤ m^{φ-λ} b_m).   (3.29)

Let d_m(λ) = m^{φ-λ} b_m. We have

    d_m(λ)/d_{m-1}(λ) = 1 + (2D - λ) m^{-1} + O(m^{-2})

by (3.23). Since λ_1 < 2D < λ_2, there exists an N = N(λ_1, λ_2) such that the sequences {d_m(λ_1)}_{m>N} and {d_m(λ_2)}_{m>N} are increasing and decreasing respectively.

But (3.28), and hence (3.29), holds with equality for m = n. Thus, by the monotonicity of the sequences {d_m(λ_1)} and {d_m(λ_2)}, (3.29), and hence (3.28), holds for all m > N(λ_1, λ_2).

Since φ < λ_1 < λ_2, {f_n(x,λ_1)} and {f_n(x,λ_2)} are sequences of increasing functions. It follows that, for k > N(λ_1, λ_2),

    ∫_{k-1}^n n^{-λ_2} x^{λ_2-φ} dx ≤ Σ_{m=k}^n b_m ≤ ∫_k^{n+1} n^{-λ_1} x^{λ_1-φ} dx.   (3.30)

Lemma (3.1) is obtained by integrating the above expression.
Lemma (3.2): Let the sequence {b_m} be defined as in Lemma (3.1). Let λ_1, λ_2, φ, and D be such that 0 < φ - 1 < λ_1 < 2D < λ_2 < φ. Then there exists an N(λ_1, λ_2) such that, for every k, n > N(λ_1, λ_2),

    (λ_2 - (φ-1))^{-1} n^{-λ_2} ((n+1)^{λ_2-(φ-1)} - k^{λ_2-(φ-1)}) ≤ Σ_{m=k}^n b_m
        ≤ (λ_1 - (φ-1))^{-1} n^{-λ_1} (n^{λ_1-(φ-1)} - (k-1)^{λ_1-(φ-1)}).

Proof: The proof is a repetition of the proof of Lemma (3.1) except that, in this case, {f_n(x,λ)} is a sequence of decreasing functions. Thus the ranges of integration in (3.30) must be reversed. The restriction that λ_1 > φ - 1 is to prevent f_n(x,λ) from being of the form x^{-1}.
Lemma (3.3): Let the sequence {b_m} be defined by (3.22) with α < 1. Let 0 < λ < 2D. Then there exists an N(λ) such that

    b_n² (b_n - b_{n-1})^{-1} (1 - exp{2λ n^{-α}(k-n)})/2 ≤ Σ_{m=k}^n b_m
        ≤ λ^{-1} n^{-φ+α} (exp{λ n^{-α}} - exp{λ n^{-α}(k-n)}).

Proof: Consider the function sequence {f_n(x,λ)} defined by

    f_n(x,λ) = b_n exp{λ n^{-α}(x - n)}   (3.31)

for x > 0. We have

    f_n(m,λ) ≥ b_m   (3.32)

if and only if

    d_m ≤ d_n,   (3.33)

where d_m = b_m exp{-λ m n^{-α}}. Then

    d_m/d_{m-1} = (1 + 2D m^{-α} + O(m^{-1})) exp{-λ n^{-α}}   (3.34)

by (3.23). Since λ < 2D and m ≤ n, there exists an N(λ) such that {d_m}_{m>N(λ)} is an increasing sequence.

But (3.32), and hence (3.33), holds for m = n. Thus, by the monotonicity of the sequence {d_m}_{m>N(λ)}, (3.33), and hence (3.32), holds for all m > N(λ).

Since f_n(x,λ) is an increasing function for all n,

    Σ_{m=k}^n b_m ≤ b_n exp{-λ n^{1-α}} ∫_k^{n+1} exp{λ n^{-α} x} dx.

The right-hand inequality of Lemma (3.3) follows by integration and by noting that b_n = n^{-φ}.

To obtain the left-hand inequality of Lemma (3.3), we note that the slope of the line AE (see Figure 3) is given by b_n - b_{n-1}. Therefore D - A = b_n(b_n - b_{n-1})^{-1} and (B - A) = (C - B)(b_n - b_{n-1})^{-1}. But, for k > N(λ), (3.31) and (3.32) imply that

    C - B ≤ b_n exp{λ n^{-α}(k - n)}.

It follows from (3.25) that, for k > N(λ),

    Σ_{m=k}^n b_m ≥ Area △ADE - Area △ABC
                 ≥ b_n² (b_n - b_{n-1})^{-1} (1 - exp{2λ n^{-α}(k-n)})/2.

Thus Lemma (3.3) is proven.
We now use Lemmas (3.1), (3.2), and (3.3) to calculate asymptotic bounds for the variance.

Theorem (3.2): Let the sequence {b_m} be defined by (3.22). Then, for fixed k, if α < 1,

    (4D)^{-1} + o(1) ≤ n^{φ-α} Σ_{m=k}^n b_m ≤ (2D)^{-1} + o(1),

and if α = 1 and 2D > φ - 1 > 0,

    n^{φ-α} Σ_{m=k}^n b_m = (2D - (φ-1))^{-1} + o(1).

Proof: We see from (3.16) and (3.22) that b_m = D(m+1,n) m^{-φ} if D = 2AB. Using (3.16), (3.19), and, for α = 1, the fact that 2D > φ - 1, we have n^{φ-α} b_k → 0 for any fixed k. Thus we need consider only the sum from m > N, where N is chosen to satisfy the previous lemmas.

The case in which α = 1 and 2D ≠ φ follows directly from Lemmas (3.1) and (3.2), since λ_1 and λ_2 of these lemmas can be chosen arbitrarily close to 2D. It is true for φ = 2D since Σ_{m=k}^n b_m is a monotonic function of D for every k and n.

Again, the fact that λ is arbitrary in Lemma (3.3) implies the right-hand inequality of Theorem (3.2) when α < 1. The left-hand inequality holds from Lemma (3.3) since

    n^{φ-α} b_n² (b_n - b_{n-1})^{-1} (1 - exp{2λ n^{-α}(k-n)})/2 = (4D)^{-1} + o(1)

by (3.22) and (3.23).
We can apply Theorem (3.2) to the example in which M(x) is a quadratic function by letting D = 2AB and φ = 2(α - w). Thus, conditional on X_k, the asymptotic variance of n^{(α-2w)/2}(X_n - θ) is equal to ρ 2A²σ²/(C²(4AB - β₊)), where ρ = 1 if α = 1, 1/2 ≤ ρ ≤ 1 if α < 1, and β₊ is as in Corollary (3.1). This is in agreement with (3.20). It also points out the important fact that if M(x) is quadratic, the variance of X_n converges to the variance of its asymptotic distribution given by Corollary (3.1). The same result for a more general class of regression functions has been obtained by many writers using different approaches.
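Theorem (3.2) is easy to check numerically for the α = 1 case. The values of D, φ, n, and k below are assumptions for illustration only.

```python
# b_m = m^{-phi} * prod_{j=m+1}^{n} (1 - D j^{-alpha})^2, accumulated by a
# backward recursion that extends the product one factor at a time.
D, phi = 2.0, 1.5       # alpha = 1, with 2D > phi - 1 > 0
n, k = 20000, 5

prod, total = 1.0, 0.0
for m in range(n, k - 1, -1):
    total += m ** -phi * prod
    prod *= (1.0 - D / m) ** 2      # include the factor j = m for the next term

limit = 1.0 / (2 * D - (phi - 1))   # Theorem (3.2): n^{phi-1} * sum -> this
print(round(n ** (phi - 1) * total, 3), round(limit, 3))
```

The scaled sum and the limit 1/(2D - (φ-1)) agree to about three decimals at this n, which is the sense in which the bounds are "satisfied for values of k and n usually realized in practice."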
We now consider the extent to which Lemmas (3.1), (3.2), and (3.3) rely on asymptotic theory. We first note that it is not necessary for the sequence {d_m} to be monotonic in order that either (3.29) or (3.33) hold.

Consider the upper bounds established in Lemmas (3.1) and (3.2). The appeal to asymptotic results lies in the value of N such that, for all m > N,

    (1 - D m^{-1})² ≤ (1 - m^{-1})^{λ_2},   (3.35)

where λ_2 < 2D.
If D = 1, (3.35) holds for N = 1. Using the Taylor expansion, we find that (3.35) is equivalent to

    1 - 2D m^{-1} + D² m^{-2} ≤ Σ_{j=0}^∞ (Π_{i=0}^{j-1}(λ_2 - i)/j!)(-m^{-1})^j.   (3.36)

The right-hand side of (3.36), being the expansion of (1 - m^{-1})^{λ_2}, decreases as λ_2 increases. Thus N is less than or equal to the value of N obtained for λ_2 = 2D. Assuming λ_2 = 2D and considering terms up to order m^{-3}, we find N is approximately the solution of

    D² x^{-2} = D(2D - 1) x^{-2} - D(2D - 1)(2D - 2) x^{-3}/3.

This solution is given by x = 2(2D - 1)/3. Actually, if 2D > 3, the coefficient of m^{-4} is positive, implying the value of N is less than 2(2D - 1)/3 up to order m^{-4}.
It is clear that the upper bounds in Lemmas (3.1) and (3.2) are very practical bounds for moderate D. Moreover, we have already noted that, for α = 1, (3.20) is minimized at A = (1 - 2w)/2B. The corresponding value of 2D is 2(1 - 2w), giving N ≈ 2.
The appeal to asymptotic theory for the upper bound of Lemma (3.3) lies in the minimum value of N such that

    exp{λ n^{-α}}(1 - D m^{-α})² ≤ (1 - m^{-1})^{φ}   (3.37)

for all m, n > N (see (3.23) and (3.34)).

For the K-W process, φ = 2(α + (a-1)w). Since a is usually less than one, we shall consider the case φ < 2. It follows from (3.23) that if 2D is not significantly less than φ, then (1 - D m^{-α})²(1 - m^{-1})^{-φ} is increasing in m for quite small m. Let N_0 be such that this is the case for m > N_0. Then, if (3.37) holds for m = N_0, it holds for all m > N_0. Let N_1 be the smallest value for which (3.37) holds with m = n, and for which N_1 ≥ N_0. We then have that (3.37) holds for m, n ≥ N_1. Thus, if either 2D ≈ φ or 2D > φ, N_1 is a reasonable approximation for the minimum solution of (3.37).
If we let m = n and neglect terms of order n^{-3α} in (3.37), we obtain an approximation of N_1 given by the minimum solution of

    1 + (λ - 2D)x^{-α} + (λ²/2 + D² - 2Dλ)x^{-2α} ≤ 1 - φ x^{-1} + φ(φ - 1)x^{-2}/2.

Since φ > 1 for the K-W process, the coefficient of x^{-2} is positive. We therefore drop this term, knowing that the resulting minimum solution is increased.

If we let λ = CD for 0 < C < 2, the above inequality is surely satisfied if x is greater than the minimum solution of

    (λ - 2D)x^{-α} + D²(C²/2 - 2C + 1)x^{-2α} ≤ -φ x^{-1}.

For 2 - √2 < C < 2 the quadratic in C is negative. Thus if C is so restrained, our approximation is less than the solution of

    (2D - λ) x^{-α} = φ x^{-1},

or

    y = (φ/(2D - λ))^{1/(1-α)}.
3.6 Estimation of E(X_n - θ)²

A comparison of (3.11) and (3.12) with (3.16) quickly reveals that (3.16) is generally not the correct expression for E((X_{n+1} - θ)² | X_k). However, it often serves as a reasonable approximation when the variables {X_m ± c_m}_{m=k}^n lie in a region of θ in which M(x) is reasonably approximated by a quadratic function. We assume in this section that the process has been in such a region for m = k_0, ..., n.
A natural estimator of n^{α-2w} E(X_n - θ)² is the right-hand expression of (3.16) after the parameters have been replaced by their estimators and the expression has been suitably normalized. Using this estimator, the problem becomes one of choosing a good value for k and estimating the parameters.
Assume, for the moment, that estimates of the parameters are available. The squared bias expression in (3.16) is minimized as either k decreases or X_k approaches θ. This suggests that a good choice for X_k would be close to X_{n+1} (the estimate of θ) and such that k ≥ k_0. Using this method, the squared bias expression can be ignored if n - k_0 is of any consequence.
A natural estimate of the variance of X_n, conditional on X_k, would be to substitute the estimate for -4AM''(θ) (= 2D) into the upper bounds given in Lemmas (3.1), (3.2), and (3.3). If n - k is moderately large, this estimate (after normalizing by n^{α-2w}) closely approximates the corresponding estimate of the asymptotic variance given in Corollary (3.1).
The parameters, for which we have assumed estimates exist, are σ² and M''(θ). Carrying the assumption that M(x) is approximately quadratic a little further, we can obtain estimates of these parameters using regression analysis.
It is interesting to consider Fisher's information for M''(θ) when Y(x) ~ N(-B(x-θ)², σ²). We have

    -d²/dB² (C - (y + B(x-θ)²)²/(2σ²)) = d/dB ((y + B(x-θ)²)(x-θ)²/σ²)
                                       = (x-θ)⁴/σ².

Thus, conditional on X_i, i = k_0, ..., n, the information is given by (assuming 2[m^{2aw}] observations are taken on each step)

    σ^{-2} Σ_{m=k_0}^n [m^{2aw}]((X_m - c_m - θ)⁴ + (X_m + c_m - θ)⁴) = O(Σ_{m=k_0}^n c_m⁴ m^{2aw}).
If c_n = n^{-w} and w is arbitrarily close to its lower bound (α/(2(3-a))), the information is bounded below by a quantity of order

    O(n^{1 - α(2-a)/(3-a) - ε})

for arbitrarily small ε. This expression increases as either α decreases or a increases, until a = 2. Moreover, since (2-a)/(3-a) ≤ 2/3 for small a, the usual regression estimators of σ² and M''(θ) will yield consistent estimates if M(x) is quadratic.
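The regression idea can be sketched as follows. Nothing here is from the thesis: the constants, the single shared data stream, and the use of ordinary least squares on the pooled (level, response) pairs are all illustrative assumptions under the quadratic model (3.14), for which M''(θ) = 2β₂ when y is fitted as β₀ + β₁u + β₂u².

```python
import random

random.seed(3)
B, theta, sigma = 2.0, 1.0, 0.4
M = lambda u: -(B / 2) * (u - theta) ** 2      # quadratic model (3.14)

# Run a K-W process and keep every (level, response) pair it generates.
pts, x = [], 0.3
for n in range(1, 400):
    c = n ** -0.2
    for u in (x - c, x + c):
        pts.append((u, M(u) + random.gauss(0, sigma)))
    x -= (1.0 / n) / c * (pts[-2][1] - pts[-1][1])   # K-W update from same data

def quad_fit(pts):
    """Least squares for y = b0 + b1 u + b2 u^2 via the 3x3 normal equations."""
    S = [[0.0] * 4 for _ in range(3)]
    for u, y in pts:
        row = (1.0, u, u * u)
        for i in range(3):
            for j in range(3):
                S[i][j] += row[i] * row[j]
            S[i][3] += row[i] * y
    for i in range(3):                   # Gaussian elimination
        for j in range(i + 1, 3):
            f = S[j][i] / S[i][i]
            for t in range(i, 4):
                S[j][t] -= f * S[i][t]
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                  # back substitution
        b[i] = (S[i][3] - sum(S[i][j] * b[j] for j in range(i + 1, 3))) / S[i][i]
    return b

b0, b1, b2 = quad_fit(pts)
mpp = 2 * b2                                        # estimate of M''(theta) = -B
s2 = sum((y - (b0 + b1 * u + b2 * u * u)) ** 2 for u, y in pts) / (len(pts) - 3)
print(round(mpp, 1), round(s2, 2))   # near -B = -2.0 and sigma^2 = 0.16
```

The same observations that drive the process thus also estimate the two quantities needed for the variance bounds of the previous section.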
CHAPTER 4. THE SIGNED KIEFER-WOLFOWITZ PROCESS
4.1 Definition of the Signed Kiefer-Wolfowitz Process
Chapter 3 is a development of a generalization of the K-W process in which the corrective variable ((Y(X_n - c_n) - Y(X_n + c_n))^{T_n}) is constrained to lie between ±T_n. This corrective variable can be characterized as influencing both the direction and magnitude of the adjustment at the n'th step. In this chapter we shall develop a modification of the K-W process in which the corrective variable influences only the direction of this adjustment.
Definition (4.1). Let {a_n}, {c_n}, and {r_n} be sequences of positive numbers where [r_n] ≥ 1. Let Y(x) denote a random variable with mean M(x) and the distribution function F(·|x). Let

    δ(x,c_n) = [r_n]^{-1} Σ_{i=1}^{[r_n]} δ_i(x,c_n),

where

    δ_i(x,c_n) =  1  if Y_i(x-c_n) > Y_i(x+c_n),
               =  0  if Y_i(x-c_n) = Y_i(x+c_n),
               = -1  if Y_i(x-c_n) < Y_i(x+c_n),

i = 1, ..., [r_n]. Let X_1 be fixed, and let the random variable sequence {X_n} be defined by

    X_{n+1} = X_n - a_n c_n^{-1} δ(X_n, c_n),    (4.1)

where Y_i(X_n - c_n) and Y_i(X_n + c_n), i = 1, ..., [r_n], are random variables which, conditional on δ(X_1,c_1), ..., δ(X_{n-1},c_{n-1}), X_1, ..., X_{n-1}, and X_n = x_n, are distributed independently as F(·|x_n - c_n) and F(·|x_n + c_n).

This process will be called the Signed Kiefer-Wolfowitz process, and will be referred to as the SKW process. When [r_n] ≡ r (a positive integer) the SKW becomes the process first proposed by Wilde [13]. Wilde did not attempt to develop the process except to note that it seemed less sensitive to the variance of Y(x) than the K-W process. The only other attempt to study this process seems to be Springer's [11] work which was discussed in Chapter 1.
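The recursion in Definition (4.1) is simple to simulate. The sketch below, a minimal illustration rather than anything from the text, assumes a quadratic response M(x) = -(x-2)² with additive normal noise and a hypothetical step-size schedule; the function names and constants are all illustrative.

```python
import random

def skw_step(x, a_n, c_n, r_n, sample_y):
    """One Signed K-W step: X_{n+1} = X_n - (a_n/c_n) * delta(X_n, c_n),
    where delta averages the signs of [r_n] paired comparisons."""
    signs = 0
    for _ in range(r_n):
        lo, hi = sample_y(x - c_n), sample_y(x + c_n)
        signs += (lo > hi) - (lo < hi)      # delta_i in {-1, 0, 1}
    return x - (a_n / c_n) * (signs / r_n)

def sample_y(x):
    # Illustrative response: maximum at theta = 2, noise sd 1.
    return -(x - 2.0) ** 2 + random.gauss(0.0, 1.0)

random.seed(1)
x = 0.0
for n in range(1, 2001):
    # a_n = 2 n^{-1}, c_n = n^{-1/4} (alpha = 1, w = 1/4), r_n = 1.
    x = skw_step(x, 2.0 * n ** -1.0, n ** -0.25, 1, sample_y)
print(round(x, 1))  # settles near theta = 2
```

Only the sign of each paired comparison enters the correction, so the step sizes are fixed by {a_n} and {c_n} alone, exactly the property the chapter develops.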
The SKW process is intrinsically different from the K-W process. The K-W process seeks the value (θ) such that E[Y(θ)] ≥ E[Y(x)] for all x. The SKW process seeks the value (θ') such that

    P[Y(θ') > Y(x) | Y(θ') ≠ Y(x)] ≥ 1/2

for all x. Notice that if Y(x) is such that E[Y(x)] > E[Y(x')] for x ≠ x' implies that P[Y(x) > Y(x') | Y(x) ≠ Y(x')] > 1/2, then θ = θ'. In particular, if Y(x) ~ N(M(x), σ²(x)), then θ = θ'.
However, the behavior of the two processes can be markedly different because of their different emphases on the magnitude of the corrective variable.
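For independent normals the comparison probability has a closed form, Φ((μ₁ - μ₂)/√(σ₁² + σ₂²)), so it exceeds 1/2 exactly when μ₁ > μ₂, whatever the two variances; this is why θ = θ' in the normal case. A quick check (the function name is illustrative):

```python
import math

def p_first_larger(mu1, s1, mu2, s2):
    """P[Y1 > Y2] for independent N(mu1, s1^2) and N(mu2, s2^2):
    Phi of the standardized mean difference."""
    z = (mu1 - mu2) / math.hypot(s1, s2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(p_first_larger(1.0, 3.0, 0.0, 0.5) > 0.5)  # True: the larger mean wins
print(p_first_larger(0.0, 0.5, 1.0, 3.0) < 0.5)  # True
```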
When [r_n] = 1, the SKW process can be considered as a limiting case of the CKW process. Let a_n = A n^{-α} for the CKW process with T_n ≡ T, and let a_n = A^s n^{-α} for the SKW process. Then the corrective term for the CKW process

    (A T^{-1} n^{-α} c_n^{-1} (Y(X_n - c_n) - Y(X_n + c_n))^T)

converges in probability to the correction term of the SKW process (A^s n^{-α} c_n^{-1} δ(X_n, c_n)) for any fixed n as T → 0.
Before deriving any basic results it is helpful to discuss the random variable (δ(x,c_n)). We have

    E δ(x,c_n) = P[Y(x-c_n) > Y(x+c_n)] - P[Y(x-c_n) < Y(x+c_n)].    (4.2)

If Y(x) is a continuous random variable for all x, (4.2) can be expressed as

    E δ(x,c_n) = ∫_{-∞}^{∞} (∫_{-∞}^{y} dF(z|x+c_n)) dF(y|x-c_n) - ∫_{-∞}^{∞} (∫_{-∞}^{y} dF(z|x-c_n)) dF(y|x+c_n).    (4.3)
We now consider the important special case in which Y(x) is a normal random variable with density function

    f(y,x) = (2π σ²(x))^{-1/2} exp{-(1/2)((y - M(x))/σ(x))²}.

Substitution into (4.3) gives

    E δ(x,ε) = ∫_{-∞}^{∞} (∫_{-∞}^{y} f(z,x+ε) dz) f(y,x-ε) dy - ∫_{-∞}^{∞} (∫_{-∞}^{y} f(z,x-ε) dz) f(y,x+ε) dy.    (4.4)
Since we shall be interested in the behavior of δ(x,c_n) for c_n small and x in a neighborhood of θ, we shall need estimates of (∂/∂x) E δ(x,ε) and (∂²/∂ε∂x) E δ(x,ε).

Assume that

1. M'(x) is continuous and there exists an open interval I containing θ such that for x ∈ I M''(x) is continuous;

2. 0 < σ²(x) < ∞, σ'(x) is continuous, and σ''(x) is continuous for x ∈ I.

Then the Taylor expansion of f(y,x+ε) gives

    f(y,x+ε) = f(y,x) + ε ((∂/∂τ) f(y,τ))|_{τ=ξ}

for x < ξ < x + ε. Since

    (∂/∂τ) f(y,τ) = f(y,τ)[(y - M(τ))M'(τ)/σ²(τ) + ((y - M(τ))²/σ²(τ) - 1) σ'(τ)/σ(τ)],    (4.5)

assumptions (1) and (2) guarantee that the last expression in (4.5) can be dominated by an integrable function for x in a bounded interval. We can, therefore, express (4.4) as

    E δ(x,ε) = ∫_{-∞}^{∞} [∫_{-∞}^{y} (f(z,x+ε) - f(z,x-ε)) dz] (f(y,x) + ε g(y,x,ε)) dy    (4.7)

where g(y,x,ε) can be dominated by an integrable function.
Since the term in brackets is bounded by ±1 and goes to zero for every z as ε → 0, we have from (4.7) and assumptions (1) and (2) that

    lim_{ε→0} (∂/∂ε) E δ(x,ε) = 2 ∫_{-∞}^{∞} (∫_{-∞}^{y} (∂/∂x) f(z,x) dz) f(y,x) dy
                              = 2 σ(x) ∫_{-∞}^{∞} ((∂/∂x)[(y - M(x))/σ(x)]) f²(y,x) dy
                              = - M'(x)/√π σ(x).    (4.8)
In order to evaluate (∂²/∂ε∂x) E δ(x,ε) at x = θ and ε = 0, we note that, for x in a closed subinterval of I, the continuity of M''(x) and σ''(x) is sufficient to guarantee the existence and continuity of (∂/∂ε)(∂/∂x) E δ(x,ε) and (∂/∂x)(∂/∂ε) E δ(x,ε). Moreover, these second partials are equal, and we have by (4.6), (4.7), and (4.8) that

    lim_{c_n→0} (∂²/∂ε∂x) E δ(x,ε)|_{x=θ, ε=c_n} = - M''(θ)/σ(θ)√π    (4.9)

since M'(θ) = 0.

Thus if Y(x) is a normal random variable, we have

    E δ(x,c_n) = -c_n M'(x)/σ(x)√π + o(c_n),

and, for |x - θ| sufficiently small,

    E δ(x,c_n) = -c_n (x-θ) M''(θ)/σ(θ)√π + o(c_n(x-θ)).
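When σ is constant, the expectation can also be written exactly as E δ(x,ε) = 2Φ(ΔM/(σ√2)) - 1 with ΔM = M(x-ε) - M(x+ε), which makes the linear-in-ε behavior easy to check numerically. A small sketch (the quadratic M is illustrative; only the sign and the scaling are asserted, not the constant):

```python
import math

def e_delta(x, eps, M, sigma):
    """Exact E delta(x, eps) for Y(x) ~ N(M(x), sigma^2), sigma constant:
    2*Phi(dM/(sigma*sqrt(2))) - 1, i.e. erf(dM/(2*sigma))."""
    dm = M(x - eps) - M(x + eps)
    return math.erf(dm / (2.0 * sigma))

M = lambda x: -(x - 2.0) ** 2                    # illustrative response, theta = 2
vals = [e_delta(1.0, e, M, 1.0) for e in (0.05, 0.1, 0.2)]
# M'(1) > 0, so E delta < 0, and it scales roughly linearly in eps:
print(all(v < 0 for v in vals))  # True
```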
If, more generally, Y(x) is a continuous random variable whose density function f(y|x) is continuous in x for all y, then it follows from Definition (4.1) that E δ_i²(x,c_n) = 1 and E δ_i(x,c_n) → 0 as c_n → 0. Since, conditional on X_n = x, δ_i(x,c_n), i = 1, ..., [r_n], are independent random variables, we have

    lim_{c_n→0} Var δ(θ,c_n) = 1.    (4.10)
4.2 Asymptotic Theory of the SKW Process

Corollary (4.1): Let {X_n} be a SKW process with r_n = n^{2aw}. Let {μ_n} be a real number sequence, θ be a real number, a and w be non-negative numbers, and n_0, A^s, C, v, and α be positive numbers such that

a. if n > n_0, then E δ(μ_n,c_n) = 0, and if x ≠ μ_n, then (x - μ_n) E δ(x,c_n) > 0;

b. for every x, E δ(x,0) = 0 and (∂/∂ε) E δ(x,ε) is continuous, and if B = sup c_n, then sup_{|ε|<B, x} |(∂/∂ε) E δ(x,ε)|/(1 + |x|) < ∞;

c. for every x in an open interval (I) containing θ and for every |ε| < B, (∂/∂x) E δ(x,ε)|_{x=θ, ε=0} equals zero, (∂²/∂ε∂x) E δ(x,ε) is continuous, and the sequence {(∂²/∂ε∂x) E δ(x,ε)|_{x=θ, ε=c_n}} converges to A > 0;

d. Var δ(x,c_n) is continuously convergent to σ² at x = θ;

e. F(y|x) is a Borel measurable function of x for all y;

f. (1) 1/2 < α ≤ 1, n^α a_n → A^s, n^w c_n → C, |μ_n - θ| = o(n^{-v}),
   (2) if α < 1, then v > 1 - (2α - 1)/2w;
   (3) and β⁺ = 0 if α < 1, β⁺ = β < 2 A^s A if α = 1, where β = α - 2w.

Then X_n → θ a.e. and n^{β/2}(X_n - θ) converges in distribution to N(0, (A^s)² σ²/C²(2 A^s A - β⁺)).
Proof: Make the correspondence with Theorem (2.1) as follows: a'_n = a_n c_n^{-1}, Z_n(x) = δ(x,c_n), and R_n(x) = E δ(x,c_n).

To verify that conditions (i) through (vii) of Theorem (2.1) are satisfied, we first note that condition (i) is a direct consequence of condition (a).

It follows from condition (b) that

    sup_{n,x} n^w |E δ(x,c_n)|/(1 + |x|) = sup_{n,x} n^w c_n |(∂/∂ε) E δ(x,ε)|_{ε=ξ_n(x)}|/(1 + |x|)

where |ξ_n(x)| < c_n. The last expression is finite by condition (b), and therefore condition (ii) is verified.

To verify condition (iii) we note that condition (b) implies the existence of a function sequence {ξ_n(x)} such that

    |E δ(x,c_n)| = c_n |(∂/∂ε) E δ(x,ε)|_{ε=ξ_n(x)}|

where |ξ_n(x)| < c_n. By condition (f.(1)), c_n converges to a limit, and hence the function sequence {ξ_n(x)} converges to a function, say ξ(x). Moreover, condition (b) implies ξ(x) is continuous, and therefore the function |(∂/∂ε) E δ(x,ε)|_{ε=ξ(x)}| is continuous and attains its infimum (k(δ_1,δ_2)) over the set [δ_1 ≤ |x-θ| ≤ δ_2].

But μ_n → θ by (f.(1)). Therefore if 0 < δ_1 ≤ δ_2 < ∞, there exists an N such that for every n > N we have that [δ_1 ≤ |x-θ| ≤ δ_2] ∩ [|x-θ| ≤ |μ_n - θ|] = φ. This implies by (a) that k(δ_1,δ_2) > 0. Thus condition (iii) is verified by noting that the divergence of

    Σ a'_n inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)|

is implied by the divergence of Σ a_n.

We have by (a) and (c) that

    E δ(x,c_n)/c_n(x - μ_n) = c_n^{-1} ((∂/∂x) E δ(x,c_n))|_{x=η_n}

where |η_n - μ_n| < |x - μ_n| and |ξ| < c_n. The last expression is continuously convergent to A at x = θ by condition (c). Thus condition (iv) is satisfied.

Condition (v) follows from condition (d) and the fact that |δ(x,c_n)| ≤ 1.

Since P[Y(x-c_n) > Y(x+c_n)] can be written as the limit of Lebesgue-Stieltjes sums, which by condition (e) are Borel measurable, it is clear that the distribution function of δ(x,c_n) is Borel measurable for all (y). Thus condition (vi) is satisfied.

Condition (vii) can be shown to hold by direct substitution for the case T_n ≡ ∞ and p = 2aw.

Thus Corollary (4.1) is proven.
The conditions in this corollary are somewhat unintuitive. This arises from the fact that the SKW process explores a response curve (E δ(x,c_n)) which, although often closely related, is not the response curve of the observed random variables. However, as in Corollary (3.1), the assumptions of Corollary (4.1) are satisfied in practice almost without exception. This fact will become clearer in the following examples.
Example 1. Suppose Y(x) ~ N(M(x), σ²(x)). Let {X_n} be a SKW process which satisfies assumptions (1) and (2) preceding (4.5) and assumption (f) of Corollary (4.1) with w > 0. We have yet to determine v.

In addition, assume that

3. (x-θ) M'(x) < 0 for x ≠ θ, = 0 for x = θ;

4. M'''(x) is continuous for x ∈ I;

5. and sup_x (|M'(x)| + |σ'(x)|)/σ(x)(1 + |x|) < ∞.

We shall show that n^{β/2}(X_n - θ) → N(0, (A^s)²/C²(2 A^s A - β⁺)), where A = - M''(θ)/σ(θ)√π, using Corollary (4.1).

We first note that (3) implies condition (a) since P[Y(x) > Y(x')] > 1/2 if and only if E Y(x) > E Y(x') for independent non-degenerate normal random variables. It then follows by condition (4) and Corollary (3.1) that v = 2w.

Condition (e) is an immediate consequence of the continuity of M(x) and σ(x).
We now summarize the relevant results obtained earlier for this special case:

1. (∂/∂ε) E δ(x,ε) is continuous;

2. {Var δ(x,c_n)} is continuously convergent to 1 at x = θ;

3. for every x ∈ I and every |ε| < B, ((∂/∂x) E δ(x,ε))|_{x=θ, ε=0} = 0, the sequence {(∂²/∂ε∂x) E δ(x,ε)|_{x=θ, ε=c_n}} converges to A (= - M''(θ)/σ(θ)√π), and (∂²/∂ε∂x) E δ(x,ε) is continuous.

Thus, the assertion of asymptotic normality is established if we can show that

    sup_{|ε|<B, x} |(∂/∂ε) E δ(x,ε)|/(1 + |x|) < ∞.
Taking the partial with respect to ε under the integral signs in (4.3) (the continuity of M(x) and σ(x) assures us that this can be done) we obtain integrals of the form

    ∫_{-∞}^{∞} (∫_{-∞}^{y} (∂/∂ε) f(z,ε) dz) f(y,ε) dy

and

    ∫_{-∞}^{∞} (∫_{-∞}^{y} f(z,ε) dz) (∂/∂ε) f(y,ε) dy.

An examination of (4.5) reveals that assumption (5) is sufficient to guarantee that the above supremum is finite.

Thus the desired asymptotic normality of n^{β/2}(X_n - θ) is established.
The variances of the respective normal distributions to which the SKW process and the K-W process converge are respectively

    (A^s)²/C²(2 A^s |M''(θ)|/σ(θ)√π - β)  and  2 A² σ²(θ)/C²(4 A |M''(θ)| - β)

when α = 1, and

    A^s σ(θ)√π/2 C² |M''(θ)|  and  A σ²(θ)/2 C² |M''(θ)|    (4.11)

when 1/2 < α < 1. It is evident that the K-W process is asymptotically more sensitive to the variance of the observations.

Before leaving this example we note that if A^s = A σ(θ)/√π, the expressions in (4.11) are identical.
Example 2: Suppose Y(x) has an exponential density function given by

    f(y,x) = λ(x) e^{-λ(x)y},  y ≥ 0,
           = 0,                y < 0.

Let {X_n} be a SKW process satisfying assumption (f) of Corollary (4.1) with w > 0. Let M(x) = λ^{-1}(x) and assume that

1. M'(x) is continuous and (x-θ) M'(x) < 0 for x ≠ θ;

2. sup_x |M'(x)|/M(x)(1 + |x|) < ∞;

3. and there exists an open interval I containing θ such that M'''(x) is continuous.

We have

    P[Y(x-ε) < Y(x+ε)] = 1 - λ(x+ε)/(λ(x+ε) + λ(x-ε)) = 1 - M(x-ε)/(M(x+ε) + M(x-ε))

by the last expression in (4.3).
It follows that

    E δ(x,ε) = (M(x-ε) - M(x+ε))/(M(x-ε) + M(x+ε)).    (4.12)

It is clear from (4.12) that the value of v is less than 2w, as in Corollary (3.1).
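Formula (4.12) is exact and easy to verify by simulation. A sketch using only Python's standard library, with illustrative mean values:

```python
import random

def e_delta_exact(m_lo, m_hi):
    """Equation (4.12): (M(x-eps) - M(x+eps)) / (M(x-eps) + M(x+eps))."""
    return (m_lo - m_hi) / (m_lo + m_hi)

def e_delta_mc(m_lo, m_hi, reps=200_000, seed=7):
    """Monte Carlo E delta for independent exponentials with the given means."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        a = rng.expovariate(1.0 / m_lo)   # mean m_lo
        b = rng.expovariate(1.0 / m_hi)   # mean m_hi
        total += (a > b) - (a < b)
    return total / reps

exact, approx = e_delta_exact(2.0, 1.5), e_delta_mc(2.0, 1.5)
print(abs(exact - approx) < 0.01)  # True (the exact value is 1/7)
```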
With the possible exception of condition (b), the conditions of Corollary (4.1) can be easily verified using assumptions (1) and (2).

We have

    sup_{|ε|<B, x} |(∂/∂ε) E δ(x,ε)|/(1 + |x|) = sup_{|ε|<B, x} |M'(x)|/M(x)(1 + |x|) < ∞

by assumption (2).
It remains to evaluate Var δ(θ,0) and A. The lim_{c_n→0} Var δ(x,c_n) = 1 by (4.10) for all x and, in particular, for x = θ. By assumption (3) the value of A is given by

    A = (∂/∂x)(∂/∂ε) E δ(x,ε)|_{ε=0, x=θ},

or

    A = - M''(θ)/M(θ) = - M''(θ)/σ(θ)    (4.13)

since M(θ) = λ^{-1}(θ) = σ(θ).
A comparison of (4.13) with (4.9) suggests that A may generally be of the form - k M''(θ)/σ(θ), where k is a constant characterizing a family of distribution functions. The next example gives an important case in which this is not true.
Example 3: Suppose Y(x) is a Bernoulli random variable with P[Y(x) = 1] = p(x). Suppose that assumptions (1) and (3) of Example 2 hold with M(x) replaced by p(x). In addition, assume that

2. sup_x |p'(x)|/(1 + |x|) < ∞.

It is easily seen that these assumptions are sufficient to verify the conditions of both Corollary (4.1) and Corollary (3.1). Moreover, in this case the SKW process and the K-W process are identical. Thus, using the results of Corollary (3.1), we find that

    n^{β/2}(X_n - θ) → N(0, (A^s)² σ²/C²(2 A^s A - β⁺))

where A = - 2 p''(θ) and σ² = 2 p(θ)(1 - p(θ)).

This example shows that the corrective variable in the SKW process is a transformation of the observed random variable into a discrete, bounded random variable with a regression function that is less than, equal to, or greater than zero as x is less than, equal to, or greater than μ_n. That is, the SKW process is simply a K-W process after this transformation on the observed random variables has been made.
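The identity of the two processes in the Bernoulli case comes down to the fact that y₁ - y₂ already lies in {-1, 0, 1}, so taking its sign changes nothing; an exhaustive check:

```python
def sign(t):
    return (t > 0) - (t < 0)

# All four Bernoulli outcome pairs: the signed correction equals the raw difference.
print(all(sign(y1 - y2) == y1 - y2 for y1 in (0, 1) for y2 in (0, 1)))  # True
```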
These examples indicate that, at least asymptotically, the SKW process is more dependent than the K-W process on the family of distributions (e.g. normal, exponential) from which the observed random variables come. However, for the special cases of the normal and exponential families, the SKW process is less sensitive to the variance of the parent distribution.
4.3 Sample Properties of the SKW Process

Unlike the K-W process, whose small sample properties are largely determined by M(x) and σ(x), the small sample properties of the SKW process often depend on the distribution function in a more intrinsic manner. This makes a general discussion of the small sample properties of the SKW process very difficult.

However, a few general remarks can be made. Firstly, the effect of using the sign of (Y(X_n - c_n) - Y(X_n + c_n)) is somewhat similar to that of constraining an observed random variable; both operations lessen the effect of an observation from the tails of the distribution. Secondly, if the SKW process is approaching θ along a constant, relatively small slope, the expected behavior is fairly well approximated by (3.11) and (3.12), in which the constants B and σ² are characteristic of the family of observed random variables. And, thirdly, if the SKW process is approaching θ along a steep slope, the approach is very uniform, and the rate is determined by the sequences {a_n} and {c_n}.

It is clear from the nature of the SKW process that it is impossible to consider an example corresponding to M(x) being quadratic for the K-W process. That is, since δ(x,c_n) is bounded by ±1, it is impossible for E δ(x,c_n) = k c_n(x-θ) to hold for all x. This implies that the moments of n^{β/2}(X_n - θ) generally do not converge to those of its asymptotic normal distribution unless possibly when X_n is restricted to lie in a bounded interval containing θ. To study the expected rate of approach to θ therefore requires a slight modification of the method used for the K-W process.
Let {X_n} be a SKW process. Let I be an open interval of θ such that

    E δ(x,c_n) = c_n T_n(x)(x - θ)    (4.14)

where 0 < T ≤ T_n(x) ≤ T̄ < ∞. Let A_k be the event that X_m ∈ I for m ≥ k, and let X*_n denote the random variable X_n conditional on A_k. That is, assume that each member of the random variable sequence {X_m}_{m=k}^∞ lies in I.

It is not unreasonable to study the random variable sequence {X*_n} for two reasons. Since X_n → θ a.e. by Corollary (4.1), P(A_k) → 1 as k → ∞, and since P(A_k) → 1, n^{β/2}(X*_n - θ) → n^{β/2}(X_n - θ) in probability and therefore has the same asymptotic distribution.
We have, for a_n = A^s n^{-α} and c_n = C n^{-w},

    X*_{n+1} - θ = X*_n - θ - a_n c_n^{-1} δ(X*_n, c_n)    (4.15)
                 = (1 - A^s T_n(X*_n) n^{-α})(X*_n - θ) - A^s C^{-1} n^{-(α-w)} Z(X*_n)

where, conditional on X*_n, E Z(X*_n) = 0 and Var Z(X*_n) = Var δ(X*_n, c_n).

In much the same manner that we obtained (3.16), we obtain

    E((X*_{n+1} - θ)² | X*_k) ≤ D̄(k,n)(X*_k - θ)² + (A^s)² C^{-2} σ̄² Σ_{m=k}^{n} D̄(m+1,n) m^{-2(α-w)}    (4.16)

where D̄(k,n) = Π_{j=k}^{n} (1 - A^s T j^{-α})² and σ̄² = sup_x Var δ(x,c_n). The reverse inequality holds if T and σ̄² are replaced by T̄ and the corresponding infimum. Quite clearly, since T_n(X*_n) → A a.e. and Var δ(X*_n, c_n) → Var δ(θ,0) a.e., it could be shown that

    n^β E(X*_n - θ)² → φ    (4.17)

where β = α - 2w and φ = (A^s)² σ²/C²(2 A^s A - β⁺).
Since it will be useful in the next section, we shall show that

    n^{2β} E(X*_n - θ)⁴ → 3 φ²    (4.19)

where φ = (A^s)² σ²/C²(2 A^s A - β⁺).

If we raise (4.15) to the 4'th power and take expectations, we obtain terms of the forms

    E(X*_n - θ)⁴,  E((X*_n - θ)² Z²(X*_n)),  E((X*_n - θ) Z³(X*_n)),  and  E Z⁴(X*_n),

since E[(X*_n - θ)³ Z(X*_n) | X*_n] = 0. Using (4.16) and Hölder's inequality, the last two expressions are of smaller order than the second. Thus we can write

    E((X*_{n+1} - θ)⁴ | X*_k) ≤ D̄²(k,n)(X*_k - θ)⁴ + 6 Σ_{m=k}^{n} (A^s)² C^{-2}(σ̄² + o(1)) D̄²(m+1,n) m^{-3α+4w}.    (4.18)

Letting

    b_m = Π_{j=m+1}^{n} (1 - A^s T j^{-α})⁴ > 0,

it is seen that Lemmas (3.1), (3.2) and (3.3) as well as Theorem (3.2) hold with 2D replaced by 4D. Thus the last expression in (4.18) is less than

    n^{-2(α-2w)} 3 φ²(1 + o(1)).

It should be noted that, when α = 1, the restriction that β = α - 2w < 2 A^s A implies 2β = 2(α - 2w) < 4 A^s A = 4D in the above expressions.

Using (3.17) and (3.18) with 2B = A^s A, we obtain (4.19), as we might have expected.
4.4 Estimation of E(X_n - θ)²
The fact that the form of A is generally unknown for the SKW process adds a new complexity to the estimation of the variance of X_n. If we can assume a particular distribution for Y(x), we can often evaluate A, as was illustrated in the preceding examples. An estimate of the variance of X_n could then be calculated using the method developed in Chapter 3. However, the distribution of Y(x) is not always known. We now develop a method of estimating the variance of X_n which does not require this knowledge.

Denote Var X*_n by V_n. Let {b_m} be an increasing subsequence of the integers 1, ..., n, and let N be the number of its terms. Define V̂_n by

    V̂_n = N^{-1} Σ_{m=1}^{N} b_m^β (X*_{b_m} - X*_n)².    (4.20)

We shall study the properties of V̂_n through a study of the properties of

    V'_n = N^{-1} Σ_{m=1}^{N} b_m^β (X*_{b_m} - θ)².    (4.21)

The asterisk, which indicates that X_m, m = k, ..., n, all lie in a neighborhood of θ in which (4.14) is a reasonable approximation for E δ(x,c_n), will be omitted in the rest of this discussion.
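The estimator (4.20) is straightforward to compute from the generated sequence. A minimal sketch, assuming 1-based step indices and an illustrative fake sequence; the names and constants are hypothetical:

```python
def variance_estimate(xs, b, beta):
    """V_n = N^{-1} * sum_m b_m^beta * (X_{b_m} - X_n)^2  (equation (4.20)).
    xs holds X_1, ..., X_n; b is an increasing subsequence of indices."""
    x_n = xs[-1]
    return sum(bm ** beta * (xs[bm - 1] - x_n) ** 2 for bm in b) / len(b)

# Fake sequence oscillating toward theta = 2 at the n^{-1/4} rate, beta = 1/2.
xs = [2.0 + (-1) ** m * m ** -0.25 for m in range(1, 65)]
b = [2 ** m for m in range(1, 7)]          # geometric subsequence 2, 4, ..., 64
v = variance_estimate(xs, b, beta=0.5)
print(v > 0.0)  # True
```

No knowledge of A enters the computation, which is the point of the method.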
It follows from (4.17) that V'_n is asymptotically unbiased.

To obtain the variance of V'_n we note that

    Var[V'_n] = N^{-2} Σ_{m,m'=1}^{N} (b_m b_{m'})^β Cov[(X_{b_m} - θ)², (X_{b_{m'}} - θ)²]
              = N^{-2} Σ_{m,m'=1}^{N} (b_m b_{m'})^β (E[(X_{b_m} - θ)²(X_{b_{m'}} - θ)²] - E(X_{b_m} - θ)² E(X_{b_{m'}} - θ)²),

and the conditional structure of the process bounds each covariance in terms of the products D(b_m, b_{m'}). Using (4.19) we obtain

    Var(V'_n) ≤ (6 φ² + o(1)) N^{-2} Σ_{m'>m} D(b_m, b_{m'}).    (4.22)
Consider the case α = 1. We have from (3.22) and (3.28) that for every Λ < 2 A^s T there exists an N(Λ) such that, for every m, n > N(Λ),

    D(b_m, b_{m'}) ≤ (b_m/b_{m'})^Λ.    (4.23)

Let m' = m + 1, and suppose that the sequence {b_m} is chosen such that (b_m/b_{m+1}) is independent of m. Then, for some constant B,

    ln b_m = Σ_{i=1}^{m-1} (ln b_{i+1} - ln b_i) + ln b_1 = (m-1)B + ln b_1.

Thus b_m is of the form s^m.

Let b_m be so defined with s > 1. Then from (4.22) and (4.23) we obtain

    Var(V'_n) ≤ (6 φ² + o(1)) N^{-2} Σ_{m=0}^{N} (N-m) s^{-mΛ}.

Since N is approximately the solution of s^N = n, we have

    ln(n) Var(V'_n) ≤ 6 φ² ln(s)/(1 - s^{-Λ}) + o(1).
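The α = 1 prescription b_m = s^m gives N ≈ log_s(n) terms in the estimator. A quick sketch of that choice, with illustrative constants:

```python
import math

s, n = 2, 1024
N = int(round(math.log(n, s)))            # number of subsequence points
b = [s ** m for m in range(1, N + 1)]     # 2, 4, ..., 1024
ratios = {b[i] / b[i + 1] for i in range(N - 1)}
print(len(b), b[-1], ratios)              # 10 points, last one is n, ratio 1/2
```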
Consider the case α < 1. We have from (3.22) and (3.32) that for every Λ < 2 A^s T there exists an N(Λ) such that, for every m, n > N(Λ),

    D(b_m, b_{m'}) ≤ (b_m/b_{m'})^{φ'} exp[Λ b_{m'}^{-α}(b_m - b_{m'})]    (4.24)

where φ' > 0.

Let m' = m + 1 and suppose the sequence {b_m} is chosen such that the term in brackets is asymptotically independent of m. Then, if b_{m+1} = b_m(1 + o(m^{-1})), we have

    m^{-1} b_m^{1-α} → C.

It is easily seen that b_m = (τ m)^{(1-α)^{-1}} has the required property.

Let b_m be so defined. Then, for fixed k and m' = m + k, the term in brackets becomes [- Λ τ k + o(1)]. Thus by (4.22) and (4.24) we obtain

    Var(V'_n) ≤ (6 φ² + o(1)) N^{-2} Σ_{m=0}^{N} (N-m) e^{-Λτm}.

Since N is approximately the solution of (τ N)^{(1-α)^{-1}} = n, we have

    n^{1-α} Var(V'_n) ≤ 6 φ² τ/(1 - e^{-Λτ}) + o(1).
Since θ is unknown, it is natural to use V̂_n instead of V'_n. We have

    E V̂_n = N^{-1} Σ_{m=1}^{N} b_m^β E(X_{b_m} - X_n)²
          = N^{-1} Σ_{m=1}^{N} b_m^β (E(X_{b_m} - θ)² - 2 E(X_{b_m} - θ)(X_n - θ) + E(X_n - θ)²)
          = N^{-1} Σ_{m=1}^{N} (φ + o(1)).

Thus V̂_n is asymptotically unbiased. Moreover, for α < 1, the cross-product term is of order

    O(b_m^{-β} exp{- Λ n^{-α}(n - b_m)}).

Thus for α < 1, V̂_n is usually a conservative estimate of V_n.
This method of estimation is obviously very inefficient for α close to one. It does have the advantages of being relatively easy to calculate and of being generally applicable. It should be noted that the order of the variance of V̂_n depends on the number of steps and not on the number of observations. This is undesirable when we wish to economize by taking more observations on each step and taking fewer steps.
CHAPTER 5. TWO EXAMPLES

We now illustrate the application of some of the results of Chapter 3 and of Chapter 4 by means of two hypothetical examples. Both examples entail designing an experiment to emphasize the factors that influence the choice of the sequences {a_n}, {c_n}, and {T_n}.

Example 1: A research worker is studying the relationship between temperature and the rate of reproduction of a certain strain of bacteria. One of his objectives is to estimate the temperature (θ) that maximizes the average rate of reproduction. He emphasizes that he wishes to minimize E(M(θ̂) - M(θ))² rather than E(θ̂ - θ)².

From past experience he is reasonably confident that θ lies between 80 and 110 degrees Fahrenheit and that the average rate of reproduction is reasonably constant in the interval (θ-5, θ+5). For temperatures below θ - 5 degrees the average rate of reproduction changes approximately one unit for each 10 degree change in temperature, and for temperatures above θ + 5 degrees the average rate of reproduction changes two to three units for each 10 degree change in temperature. The standard deviation is expected to be approximately three units at temperatures below θ and to moderately increase as temperatures rise in a twenty degree range above θ. He also mentions that for practical reasons he can perform the experiment at only 20 different temperatures.

The above discussion indicates that the relation between the average rate of reproduction and temperature may be represented by Figure 4.
[Figure 4: average rate of reproduction (16 to 22 units) plotted against temperature (80 to 120 °F); the points D, E, F, and G referred to below mark the corners of the response curve near θ-5 and θ+5.]
Figure 4. Graph of Reproduction versus Temperature
We shall illustrate the design of this experiment using two CKW processes. In order that the first process (X'_n) approaches θ along a moderate slope and that the second process (X''_n) approaches θ along a steep slope, we set X'_1 = 80 and X''_1 = 110. A reason in favor of using two such processes in practice rather than one is that two processes would usually generate observations more uniformly and over a larger range of temperatures. This allows the experimenter to study the whole relationship more reliably.

Since so many of the results in this paper depend on (3.15) being a reasonable approximation of the process, we must define an interval I of θ for which this is true. Then we must design the process such that X_n is in this region at least for the last few steps. Since we are practically restricted to 20 steps, we shall choose the sequence {a_n} such that with a high probability X_n has reached this region after 10 steps.

Before doing this we must choose the sequence {c_n}. This sequence should be chosen with two thoughts in mind. Firstly, c_n should be large as long as |X_n - θ| > c_n; that is, while X_n is still searching for a neighborhood of θ. Secondly, when |X_n - θ| < c_n, c_n should be small enough so that the expected correction term always directs the process to a region in which M(x) ≈ M(θ).
The maximum value of c_n for which the last statement is true can be approximated by considering Figure 4. Let D - F denote the length of the line DF, and let c̄ = (D - F)/2. We see from the picture that if c_n < c̄, then |μ(c_n) - θ| < 5. The experimenter has noted that there is not a biologically significant difference between M(μ(c)) and M(θ) if |μ(c) - θ| < 5. Thus we seek the maximum value of c for which |μ(c) - θ| < 5.

Let p be such that (G - F) = 2pc. Then if s_1 and s_2 represent the slopes of lines FE and ED, we have

    E - G = 2cps_1 = 2c(1-p)s_2,

or p = s_2/(s_1 + s_2). Since

    5 > |μ(c) - θ| = c(s_2 - s_1)/(s_1 + s_2)

for s_2 > s_1, we have, for our example,

    c̄ = 5(.1 + .2)/(.2 - .1) = 15.
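The arithmetic of the bound c̄ can be sketched as follows, with the slope names s1 and s2 as in the text (the function itself is illustrative):

```python
def max_search_width(tol, s1, s2):
    """Largest c with |mu(c) - theta| <= tol for a response with slope s1 on
    the shallow side and s2 on the steep side: c = tol*(s1 + s2)/(s2 - s1)."""
    return tol * (s1 + s2) / (s2 - s1)

print(round(max_search_width(5.0, 0.1, 0.2), 2))  # 15.0
```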
Consider the choice of c_n for X'_n. Initially c_n should be large enough so that if X_n + c_n < θ, then X'_{n+1} > X'_n with a high probability. Since P[|Y(x) - M(x)| < σ] ≈ .6, we may choose c_1 (= C) as the solution of

    2σ = M(x_1 + C) - M(x_1 - C) ≈ .2C,

or C ≈ 30. If, in addition, we require that c_20 be somewhat less than c̄, then w ≈ .3. With these values for C and w we find that c_20 ≈ 11.5.

Similarly for the choice of c_n for X''_n we obtain

    2σ = M(x_1 + C) - M(x_1 - C) ≈ .4C,

or C ≈ 15. In this case, c_1 is sufficiently small to discount a serious bias. Thus, we can set w = 0, or c_n ≡ 15.

It should be noted that we could start the process at n = k > 1 and let n = k, ..., 20 + k. By doing so, we can adjust the rate of change of the sequences determined by the chosen constants.
An examination of Figure 4 reveals that if c_n ≤ 15 and |X_n - θ| < 5, then (3.15) is a reasonable approximation for X_{n+1}. Equations (3.11) and (3.12) are useful for choosing the sequence {a_n} such that X_n reaches the interval I by the n'th step.

In choosing a_n (= A n^{-α}) for X'_n we note first that this process will be somewhat exploratory due to the small slope. Therefore, we do not want a_n to decrease too rapidly, as that would make the process too sensitive to the first few steps. Thus we might set α = 3/4. Since θ ∈ (80, 110), |X'_1 - (θ - 5)| ≤ 25. Thus a conservative value for A, which reasonably assures that X_n ∈ I for some n ≤ 10, is the solution of

    2(.1)A Σ_{m=1}^{10} m^{-.75} ≈ .2(3.8)A = 25,    (4.25)

by (3.11). We therefore set A = 30.
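Solving (4.25) numerically confirms the choice: 2(.1)A Σ m^{-.75} = 25 gives A ≈ 33, which the text rounds down to the convenient A = 30. A two-line check:

```python
s = sum(m ** -0.75 for m in range(1, 11))   # ~ 3.76, the "3.8" in (4.25)
A = 25.0 / (2 * 0.1 * s)
print(round(s, 1), round(A))
```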
In choosing a_n (= A n^{-α}) for X''_n we note first that this process should be in a neighborhood of θ quite early due to the steep slope. Thus, we want the sequence {a_n} to optimize the behavior of X''_n under this assumption. Asymptotically this is done by setting α = 1 with the restriction that A > β/2|M''(θ)|, where β = 1 - 2w = 1. To obtain a conservative estimate of |M''(θ)|, we note from (3.14) and Figure 4 that

    -B((θ-10) - θ) ≈ M'(θ-10) ≈ .1,

or B ≈ |M''(θ)|. Thus A > 2.5. Furthermore, by the note following (3.20), asymptotically the optimal choice of A is 5. However, if we require A to be greater than the solution of the equation corresponding to (4.25), we must have A > 25.
There are numerous bases on which the sequence {T_n} of constraints may be chosen. We illustrate one method for X''_n. In the region of steepest slope the expected correction term is

    A C^{-1} n^{-1}(M(x - c_n) - M(x + c_n)) ≈ 6 A C^{-1} n^{-1}.

If we set T_n = D C A^{-1} n, then the maximum correction would be D. It is clear that for D > 6 this constraint does not seriously affect the approach of X''_n in the region of steep slope. By letting D < 10, the possible bias arising from a large movement to the left of θ in the early stages of the experiment is diminished. Obviously, the effect of T_n diminishes very quickly as n gets large.
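That the constrained correction collapses to the constant D is a one-line algebraic identity, checked below with illustrative values of A, C, and D:

```python
A, C, D = 25.0, 15.0, 8.0
for n in (1, 10, 100):
    a_n, c_n, T_n = A / n, C, D * C / A * n     # a_n = A/n, c_n = C, T_n = D*C*n/A
    assert abs((a_n / c_n) * T_n - D) < 1e-9    # maximum step is exactly D
print("constrained step equals D at every n")
```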
We can get an approximation for the variance of X'_n and X''_n using (3.16) and Lemmas (3.1) and (3.3). We have for X'_n that

    D(m,n) ≤ Π_{j=m+1}^{n} (1 - 2(.1)30 j^{-α})² = Π_{j=m+1}^{n} (1 - 6 j^{-.75})²,

giving, by Lemma (3.3),

    Var(X'_21 - θ | X'_k) ≈ (2(9)20^{-.75+.6}/12)(exp{6(20)^{-.75}} - exp{-(20-k)6(20)^{-.75}}).

For k = 10, this variance is small. If |X'_10 - θ| < 5, then

    E²(X'_20 - θ | X'_10) ≤ Π_{j=11}^{20} (1 - 6 j^{-.75})² · 25 ≈ (11/20)^{-1.5+.6} exp{-12(10)(20)^{-.75}} · 25 ≈ 0,

by (4.24) with b_m = 10 and b_{m'} = 20.

We have for X''_n that

    D(m,n) ≈ Π_{j=m+1}^{n} (1 - 2(.1)25 j^{-1})² = Π_{j=m+1}^{n} (1 - 5 j^{-1})².

Using Lemma (3.1), we obtain

    Var(X''_21 - θ | X''_10) ≤ (9)(25)(20^{-1} - (.5)⁵ 10^{-1})/((15)(10^{-1})) ≈ .14.    (4.26)

If |X''_n - θ| < 5, then

    E²(X''_20 - θ | X''_10) ≤ Π_{j=11}^{20} (1 - 5 j^{-1})² · 25 ≤ (11/20)^{10} ≤ .03.    (4.27)
Before comparing (4.26) and (4.27), we must adjust for the fact that X''_n approached θ along the steepest slope. The corresponding adjustment in (4.25) changes the restriction on A^s to A^s > 15. With this new value for A^s we obtain a correspondingly smaller variance bound.

We conclude this example by noting that much the same procedure would be followed if the SKW process were considered.
Example 2. Suppose it has been noted that the addition of a moderate amount of a certain medication to the daily diet of some chickens has improved their health, and it is desired to estimate the optimal amount.

In this type of problem, it is often difficult to quantify the observed variable (health). Moreover, the shape of the response function depends on the manner in which we choose to quantify this variable. Thus it seems natural to use a method of estimation that depends on relative magnitudes rather than absolute magnitudes. We therefore illustrate this example using the SKW process.

The corrective variable (δ(X_n,c_n)) is defined to be 1 if the chickens which are given the amount of medication (X_n - c_n) are healthier than the chickens which are given the amount of medication (X_n + c_n). As we noted in the discussion of Example 3, the random variable δ(x,c_n) generates a response curve that equals zero at μ_n. We shall suppose that the experimenter is reluctant to assume that μ_n = θ. If for no other reason than that it is somewhat meaningless to view the expected health as a function of diet, we shall also assume that the experimenter wishes to minimize E(θ̂ - θ)².

Since it is not too unreasonable to think of μ_n - θ as the bias of X_n, we might let c_n equal some constant (greater than 1) times the bias we are willing to accept. The values C and w should be chosen such that c_n is large during the period of time we expect the process to be searching for the region of θ, and approximately equal to the acceptable bias when in this region.

We have so little knowledge of A in this case that it is probably wise to set α < 1 and thereby obviate the risk of not satisfying the inequality 2 A^s A > β for the case α = 1. Moreover, due to the unknown nature of A, we are restricted to estimating the variance of X_n using the method proposed in the previous section. This latter fact may cause us to diminish α still further and to compensate for the corresponding increase in the variance by decreasing A^s.

Using the principles employed in Example 1, we could evaluate the sequences {a_n} and {c_n} more specifically.
CHAPTER 6. SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH

6.1 Summary

A large class of stochastic approximation procedures can be represented by

    X_{n+1} = X_n - a'_n Z(X_n)    (6.1)

where, conditioned on X_n, Z(X_n) is an observable random variable independent of X_i and Z(X_i), i = 1, ..., n-1. In particular the Kiefer-Wolfowitz (K-W) process admits this representation with a'_n = a_n c_n^{-1} and Z(X_n) = Y(X_n - c_n) - Y(X_n + c_n) where, conditional on X_n = x, Y(X_n - c_n) and Y(X_n + c_n) are independent random variables with means M(x - c_n) and M(x + c_n) respectively.

An extension of Burkholder's [2] theorem on the asymptotic distribution of the stochastic approximation procedures which can be represented by (6.1) is given in Chapter 2. This extension allows the number of observations taken on each step to be of O(n^p), p ≥ 0, allows the random variable |Z(X_n)| to be constrained to lie between bounds of O(n^s), s ≥ 0, and admits a more general sequence of constants a'_n. The practical consequences of this extension for the K-W process are discussed in Chapter 3.
The small sample mean and variance of X_n, conditional on X_k, are studied through a consideration of some specific examples. The case in which M(x) is a quadratic function is considered since M(x) is often closely approximated by a quadratic function in a neighborhood of θ. It is shown by means of some approximations that, after suitable normalization, the variance of the normal random variable to which the K-W process converges in distribution is a conservative estimate of Var(X_n | X_k). This asymptotic variance depends on the unknown parameters σ² and M''(θ). It is shown that an efficient estimator of M''(θ) is also a consistent estimator of M''(θ) when Y(x) ~ N(M(x), σ²).
A modification of the K-W process first proposed by Wilde [13] is
defined in Chapter 4. It satisfies (6.1) with a'_n = a_n c_n^{-1} and with
Z(X_n) being the mean of [r_n] independent random variables defined by

                    1 if Y_i(X_n - c_n) > Y_i(X_n + c_n),
    D_i(x, c_n) =   0 if Y_i(X_n - c_n) = Y_i(X_n + c_n),
                   -1 if Y_i(X_n - c_n) < Y_i(X_n + c_n),

where, conditional on X_n = x, Y_i(X_n - c_n) and Y_i(X_n + c_n),
i = 1, ..., [r_n], are independent random variables with means M(x - c_n)
and M(x + c_n), respectively. This process is termed the Signed Kiefer-
Wolfowitz (SKW) process. The asymptotic distribution of the SKW process
is obtained under quite general assumptions on Y(x) if the sequences
a_n, c_n, and r_n are of the form n^{-p}, p > 0. It is seen that the
orders of the variances of the K-W process and the SKW process are the
same, but that the variance of the SKW process shows a greater dependency
on the nature of the distribution of Y(x).
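A rough simulation sketch of one SKW step (illustrative names and sequences, not the thesis's): the magnitude of the correction is the deterministic a_n / c_n, and the observations enter only through the mean of the [r_n] signs D_i.

```python
import random

def skw_step(y, x, n, a, c, r):
    """One step of the Signed K-W process: the correction magnitude is
    fixed at a_n / c_n; the [r_n] sign variables set only its direction."""
    m = max(1, int(r(n)))                 # [r_n] paired observations
    signs = 0
    for _ in range(m):
        diff = y(x - c(n)) - y(x + c(n))
        signs += (diff > 0) - (diff < 0)  # D_i takes values in {-1, 0, 1}
    z_bar = signs / m
    return x - (a(n) / c(n)) * z_bar

random.seed(3)
y = lambda x: -(x - 2.0) ** 2 + random.gauss(0.0, 0.1)
x = 0.0
for n in range(1, 3001):
    x = skw_step(y, x, n,
                 a=lambda k: 1.0 / k,
                 c=lambda k: k ** -0.25,
                 r=lambda k: k ** 0.5)
```

Because |z_bar| ≤ 1, a single wild observation cannot produce a large correction, which is the point of the modification.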
The variance of X_n conditional on X_k is again studied by consider-
ing specific examples. It is seen that this variance depends on an
unknown parameter, the nature of which depends on the form of the dis-
tribution of Y(x). The estimator of the variance,

    Var-hat(X_n) = N^{-1} Σ_{m=1}^{N} b_m^{2S} (X_{b_m} - X_n)^2,

where {b_m} is an increasing subsequence of the integers 1, ..., n, and
n^S(X_n - θ) is asymptotically normal, is proposed since this estimator
does not depend on the unknown parameter. This estimator is shown to
be consistent for a somewhat restricted class of sequences {b_m}.
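The sketch below checks the arithmetic of such a subsequence estimator on a simulated trajectory whose deviations m^S (X_m - θ) are independent N(0, v) draws. This independence is a simplification: in the actual process successive iterates are dependent, which is why the class of admissible {b_m} must be restricted. All names and parameter choices are illustrative.

```python
import random

def var_estimate(x_path, b, S):
    """Var-hat(X_n) = N^{-1} * sum_{m=1}^{N} b_m^{2S} * (X_{b_m} - X_n)^2,
    where x_path holds X_1, ..., X_n and b holds the subsequence {b_m}."""
    x_n = x_path[-1]
    return sum(bm ** (2 * S) * (x_path[bm - 1] - x_n) ** 2
               for bm in b) / len(b)

random.seed(4)
S, v, n = 1.0 / 3.0, 4.0, 2 ** 16
theta = 0.0
# Simulated trajectory: m^S * (X_m - theta) ~ N(0, v), independently.
x_path = [theta + m ** -S * random.gauss(0.0, v ** 0.5)
          for m in range(1, n + 1)]
b = [100 * m for m in range(1, 66)]   # increasing subsequence with b_m << n
v_hat = var_estimate(x_path, b, S)
```

Keeping b_m small relative to n matters: for b_m near n the term (X_{b_m} - X_n)^2 picks up the variability of X_n itself and biases the estimate upward.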
In Chapter 5, two hypothetical examples are given to illustrate
the design of an experiment in which these procedures are used.
6.2 Suggestions for Further Research
Although the results in this paper provide some added insight into
the behavior of the K-W and SKW processes, and methods of estimating the
variance of the estimate of θ are presented, both of these areas need
further study. One very promising possibility would be a numerical
study comparing these stochastic approximation procedures with the
Method of Steepest Ascent.
LIST OF REFERENCES

1. Blum, J. R. 1954. Approximation methods which converge with probability one. Ann. Math. Statist. 25:382-386.

2. Burkholder, D. L. 1956. On a class of stochastic approximation processes. Ann. Math. Statist. 27:1044-1059.

3. Chung, K. L. 1954. On a stochastic approximation method. Ann. Math. Statist. 25:463-483.

4. Dvoretzky, A. 1956. On stochastic approximation. Proc. Third Berkeley Symposium. 1:39-56.

5. Fabian, V. 1967. Stochastic approximation with improved asymptotic speed. Ann. Math. Statist. 38:191-200.

6. Fabian, V. 1968. On asymptotic normality in stochastic approximation. Ann. Math. Statist. 39:1327-1332.

7. Hodges, J. L., Jr., and E. L. Lehmann. 1956. Two approximations to the Robbins-Monro process. Proc. Third Berkeley Symposium. 1:95-104.

8. Kiefer, J. and J. Wolfowitz. 1952. Stochastic approximation of the maximum of a regression function. Ann. Math. Statist. 23:462-466.

9. Robbins, H. E. and S. Monro. 1951. A stochastic approximation method. Ann. Math. Statist. 22:400-407.

10. Sacks, J. 1958. Asymptotic distribution of stochastic approximation procedures. Ann. Math. Statist. 29:373-405.

11. Springer, B. G. F. 1969. Numerical optimization in the presence of random variability. The single factor case. Biometrika 56:65-74.

12. Venter, J. H. 1967. An extension of the Robbins-Monro procedure. Ann. Math. Statist. 38:181-190.

13. Wilde, D. J. 1964. Optimum seeking methods. Prentice-Hall, Inc., Englewood Cliffs, N. J.