This research was supported by the National Institutes of Health, Institute of General Medical Sciences, Grant No. GM-12868-04.
ON THE KIEFER-WOLFOWITZ PROCESS AND SOME OF ITS MODIFICATIONS

Mark Allyn Johnson
University of North Carolina
Institute of Statistics Mimeo Series No. 644
September 1969
ABSTRACT
JOHNSON, MARK ALLYN. On the Kiefer-Wolfowitz Process and Some of Its
Modifications. (Under the direction of JAMES E. GRIZZLE).
The Kiefer-Wolfowitz (K-W) process is a stochastic approximation
procedure used to estimate the value of a controlled variable that
maximizes the expected response (M(x)) of a corresponding dependent
random variable.
An extension of a theorem on the asymptotic distribution of a certain class of stochastic approximation procedures, which includes the K-W process, is given. This extension admits an increasing number of observations to be taken on each step, constrains the observed random variable to lie between predetermined bounds (not necessarily finite), and extends the class of admissible norming sequences. Each of these extensions is examined in the context of its usefulness in the application of the K-W process.
Approximations of the small sample mean and variance of the K-W
process are given and studied through an examination of various special
cases of the response function. Traditionally the estimation of the
variance of the process involves the estimation of M''(θ). It is shown that an efficient estimator of M''(θ), based on the observations generated by the process, is consistent.
A modification of the K-W process is developed for which the
dependent random variables influence the direction, but not the magnitude, of the correction on each step. In addition to obtaining the
asymptotic distribution of this process, an approximation for the small
sample variance is given. An estimator of this variance is given which
does not entail the estimation of unknown parameters. The estimator is
shown to be consistent.
Finally, examples illustrating the use of these procedures are
given.
BIOGRAPHY
The author was born September 30, 1943, in Weld County, Colorado.
He was reared on an irrigated farm near Eaton, Colorado. He graduated
from Eaton High School in 1956 and went directly to college.
He received the Bachelor of Science degree with a major in zoology
at Colorado State University in 1960. On receiving a grant from the
National Institute of Health, he studied experimental statistics in the
graduate school of the same university for two years. In September,
1966, he entered the Graduate School of North Carolina State University
at Raleigh to continue studying experimental statistics. He has accepted a position as a statistical consultant for the Upjohn Company
in Kalamazoo, Michigan.
The author married Miss Martha McLendon Willey from Nashville,
North Carolina, in 1968. As yet they have no children.
ACKNOWLEDGMENTS
The author wishes to thank his advisor, Dr. James Grizzle, who
suggested the topic of this dissertation, for his freely given counsel
throughout his doctoral program. He also wishes to thank Professors
N. L. Johnson, Pranab K. Sen, Harry Smith, Jr., and H. R. van der Vaart
for serving on his advisory committee and offering assistance during
this study.
The author wishes to thank Mrs. Gay Goss for typing the manuscript
and to thank Mrs. Mary Donelan for assisting in the typing of the rough
draft of this manuscript.
Finally, the author wishes to thank his parents, Mr. and Mrs.
Walter F. Johnson for their inspiration throughout his academic career,
and to thank his dearest wife, Martha, for her patience and her
assistance in the proofreading of each manuscript and in the typing of
the rough draft.
TABLE OF CONTENTS

LIST OF FIGURES

1. INTRODUCTION AND REVIEW OF LITERATURE
   1.1 Introduction
   1.2 Summary of Literature
   1.3 Summary of Results

2. STOCHASTIC APPROXIMATION PROCEDURES OF TYPE A_o^T
   2.1 Constrained Stochastic Approximation Procedures
   2.2 Constrained Random Variables
   2.3 Asymptotic Distribution of a Type A_o^T Process

3. THE CONSTRAINED KIEFER-WOLFOWITZ PROCESS
   3.1 Definition of the Constrained Kiefer-Wolfowitz Process
   3.2 Asymptotic Properties of the CKW Process
   3.3 Effect of Constraining the K-W Process
   3.4 Expected Small Sample Behavior of the K-W Process
   3.5 Approximations of the Small Sample Variance
   3.6 Estimation of E(X_n - θ)²

4. THE SIGNED KIEFER-WOLFOWITZ PROCESS
   4.1 Definition of the Signed Kiefer-Wolfowitz Process
   4.2 Asymptotic Theory of the SKW Process
   4.3 Sample Properties of the SKW Process
   4.4 Estimation of E(X_n - θ)²

5. TWO EXAMPLES

6. SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH
   6.1 Summary
   6.2 Suggestions for Further Research

LIST OF REFERENCES
LIST OF FIGURES

1. Graphical Definition of μ(c_n)
2. Optimal Determination of c_n (= C)
3. Graph of the Sequence {b_n}
4. Graph of Reproduction versus Temperature
CHAPTER 1. INTRODUCTION AND REVIEW OF LITERATURE
1.1 Introduction
The problem of finding the maximum of a response curve is often
encountered in applied statistics. Two methods for locating this maxi-
mum have been developed extensively. The method of steepest ascent,
which relies on multiple regression principles, is the most widely
accepted. The second method is a sequential procedure. Heretofore,
this sequential procedure has been developed mainly along asymptotic
lines. We shall develop two modifications of this latter method, with special emphasis given to its practical application.
With this in mind, let X be a variable that can be controlled by
the experimenter, and let Y(X) be an associated random variable such
that E[Y(X)] = M(X). In other words, let M(x) be a function of x,
and assume Y(x) is an observable random variable with a distribution
function F(·|x) such that

M(x) = ∫_{-∞}^{∞} y dF(y|x).
It will be assumed throughout this paper that M(x) has a unique maximum at x = θ and that M(x) is increasing or decreasing as x is less than or greater than θ.
The foundations of the method of estimation, which Kiefer and
Wolfowitz [8] adapted for the particular problem at hand, are contained
in a paper by Robbins and Monro [9]. The spirit of this estimation
procedure is to generate a sequence of estimates {X_n} that converges to θ. The procedure is such that the n'th estimate (X_n) equals X_{n-1} plus a correction term that is a function of an observable random variable Y(X_{n-1}) depending on X_i for i = 1, ..., n-1.
These procedures have certain advantages and disadvantages when
they are compared to the usual methods of response surface methodology.
They are simple to execute and easy to evaluate. They are generally
applicable in practice, and they admit a complete mathematical formu-
lation. On the negative side, they have the disadvantage of most
sequential procedures in that they require evaluation after each step.
Moreover, their variance often depends upon parameters for which a
good estimator cannot be calculated from the observations.
The stochastic approximation procedure proposed by Kiefer and
Wolfowitz [8] was later called the Kiefer-Wolfowitz process. In this
paper it will be referred to as the K-W process. Its definition
follows.
Definition: Let {a_n} and {c_n} be sequences of positive numbers, and suppose Y(x) is a random variable with distribution function F(y|x) where M(x) = ∫_{-∞}^{∞} y dF(y|x). Let X_1 be arbitrary, and let the random variable sequence {X_n} be defined by

X_{n+1} = X_n - a_n c_n^{-1} (Y(X_n - c_n) - Y(X_n + c_n)),        (1.1)

where Y(X_n - c_n) and Y(X_n + c_n) are random variables which, conditional on Y(X_1 - c_1), ..., Y(X_{n-1} + c_{n-1}), X_1, ..., X_{n-1}, X_n = x_n, are distributed independently as F(y|x_n - c_n) and F(y|x_n + c_n).
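As a concrete sketch (not from the dissertation; the response function, noise level, and the sequence choices a_n = n^{-1} and c_n = n^{-1/4} are all illustrative), the recursion (1.1) can be simulated directly:

```python
import random

def kiefer_wolfowitz(y, x1, n_steps,
                     a=lambda n: 1.0 / n,       # a_n: step weights, sum diverges
                     c=lambda n: n ** -0.25):   # c_n: spacings, tend to zero
    """Run the K-W recursion (1.1):
    X_{n+1} = X_n - a_n c_n^{-1} (Y(X_n - c_n) - Y(X_n + c_n))."""
    x = x1
    for n in range(1, n_steps + 1):
        an, cn = a(n), c(n)
        x -= (an / cn) * (y(x - cn) - y(x + cn))
    return x

# Hypothetical response M(x) = -(x - 2)^2 observed with Gaussian noise.
theta = 2.0
y = lambda x: -(x - theta) ** 2 + random.gauss(0.0, 0.1)

random.seed(1)
print(kiefer_wolfowitz(y, x1=0.0, n_steps=2000))   # wanders toward theta = 2
```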
Before summarizing the literature on this process, some motivation for the assumptions on the sequences {a_n} and {c_n} will be given. It follows from the definition that

E(X_{n+1} | X_n = x) = x - a_n c_n^{-1} (M(x - c_n) - M(x + c_n)).

If x - c_n ≥ θ, then M(x - c_n) - M(x + c_n) is positive, and if x + c_n ≤ θ, then M(x - c_n) - M(x + c_n) is negative. Obviously, under suitable regularity assumptions on M(x), there exists a point μ(c_n) such that M(x - c_n) - M(x + c_n) is greater than, equal to, or less than zero as x is greater than, equal to, or less than μ(c_n). See Figure 1 for an illustration.
[Figure: the points μ(c_n) - c_n, θ, μ(c_n), and μ(c_n) + c_n marked on the x-axis.]

Figure 1. Graphical Definition of μ(c_n)
It is equally clear that if μ(c_n) exists, then |μ(c_n) - θ| ≤ c_n, and that if M(x) is symmetric about θ, then μ(c_n) = θ. If c_n → c, intuition tells us that X_n should tend to μ(c) rather than θ. Thus, if M(x) is not required to be symmetric about θ, it is usually assumed that c_n → 0.
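To see this bias numerically, take a hypothetical asymmetric concave response, M(x) = -x² for x ≤ 0 and -2x² for x > 0, with its maximum at θ = 0; μ(c) can then be located by bisection on M(x - c) - M(x + c) (an illustrative sketch, not from the dissertation):

```python
def M(x):
    """Asymmetric concave response: falls twice as fast to the right of theta = 0."""
    return -x * x if x <= 0 else -2.0 * x * x

def mu(c, lo=-5.0, hi=5.0):
    """Bisection root of g(x) = M(x - c) - M(x + c), which is negative
    to the left of mu(c) and positive to the right of it."""
    g = lambda x: M(x - c) - M(x + c)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

for c in (1.0, 0.1, 0.01):
    print(c, mu(c))   # |mu(c) - theta| <= c, and mu(c) -> theta = 0 as c -> 0
```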
Moreover, if c_n^{-1}|M(x - c_n) - M(x + c_n)| is bounded, by 2σ say, then

E(X_2 - θ | X_1 = x_1) ≥ (x_1 - θ) - 2σ a_1.

By induction,

E(X_{n+1} - θ | X_1 = x_1) ≥ (x_1 - θ) - 2σ Σ_{m=1}^{n} a_m.

If Σ a_m < ∞, it is possible to choose (x_1 - θ) > 2σ Σ_{m=1}^{∞} a_m, implying an asymptotic bias; thus it is necessary to assume that Σ a_n = ∞.
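The need for Σ a_n = ∞ can be seen already in the noiseless quadratic case M(x) = -(x - θ)², where M(x - c_n) - M(x + c_n) = 4c_n(x - θ) exactly, so (1.1) becomes X_{n+1} - θ = (1 - 4a_n)(X_n - θ) for any {c_n}. A small sketch with illustrative sequences:

```python
def noiseless_kw_gap(a, x1=1.0, theta=0.0, n_steps=10_000):
    """X_n - theta for noiseless K-W on M(x) = -(x - theta)^2, where the
    recursion reduces to gap_{n+1} = (1 - 4 a_n) gap_n."""
    gap = x1 - theta
    for n in range(1, n_steps + 1):
        gap *= 1.0 - 4.0 * a(n)
    return gap

biased = noiseless_kw_gap(lambda n: 0.1 * 2.0 ** -n)   # sum a_n < infinity
unbiased = noiseless_kw_gap(lambda n: 0.1 / n)         # sum a_n = infinity
print(biased, unbiased)   # the first stays well away from 0, the second shrinks
```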
The least common assumption is that Σ a_n² c_n^{-2} < ∞. This assumption is somewhat less intuitive. However, it is easy to see why a_n² c_n^{-2} must tend to zero. In fact, it follows from (1.1) and the inequalities above that

Var(X_{n+1} | X_n = x) = E(-a_n c_n^{-1}[(Y(x - c_n) - M(x - c_n)) - (Y(x + c_n) - M(x + c_n))])²,

which is of order a_n² c_n^{-2}, so that the variance of the corrections cannot die out unless a_n² c_n^{-2} tends to zero.
These three assumptions have almost always been made in theorems implying X_n → θ in any usual mode of convergence. The exception is that c_n does not have to converge to zero if M(x) is symmetric about θ. The validity of these assumptions will always be assumed in the following review.
1.2 Summary of Literature
In their introduction of the K-W process, Kiefer and Wolfowitz [8] proved that X_n → θ in probability if M(x) is bounded by two quadratic functions, E(Y(x))² < ∞, and Σ a_n c_n^{-1} < ∞. In later papers (J. R. Blum [1], D. L. Burkholder [2], and A. Dvoretzky [4]) the assumption that Σ a_n c_n^{-1} < ∞ was dropped, and the conclusion was strengthened to that of convergence with probability one. Burkholder's theorem relaxes the assumption of quadratic bounds on M(x) with an assumption which implies for practical purposes that |M'(x)| > 0 if x ≠ θ. Burkholder also noted that this assumption admits response functions taking values on bounded value sets (e.g., exp(-x²)).
The first significant results on the asymptotic distribution of X_n were derived by Burkholder [2], who used methods suggested by K. L. Chung [3]. Burkholder established that the moments of n^{β}(X_n - θ) converged to those of a central normal random variable if the sequences {a_n} and {c_n} are of the form a_n = O(n^{-1}) and c_n = O(n^{-w}), and β = 1/2 - w. Burkholder also assumed that w > (4 + 2η)^{-1}, where |μ(ε) - θ| = O(ε^{1+η}). He proved that, if the third derivative of M(x) is continuous, then η ≥ 1. Using a method developed by Hodges and Lehmann [7], he was able to drop the assumption of the quadratic bounds, but he had to weaken the conclusion to that of convergence in distribution.
Sacks [10] used a different approach to show that n^{1/3}(ln n)^{-1}(X_n - θ) was asymptotically normal under similar restrictions on the sequence {a_n} and the assumption of quadratic bounds on M(x). He did this by considering a slightly different class of sequences {c_n}. He also proved that, in general, n^{1/3}(X_n - θ) is not asymptotically normal under the assumptions on the sequences {a_n} and {c_n} discussed in the introduction.
More recently, Fabian [6] proved a very general theorem about which he stated, "... [the theorem] implies in a simple way all known results on asymptotic normality in various cases of stochastic approximation." He used his result to obtain the asymptotic normality of his generalization of the K-W process (Fabian [5]), which can be constructed to obtain an asymptotic variance of order n^{-(1-ε)} for any ε > 0.
Fabian achieves this smaller variance by taking observations at
more than two points on each step. His results have interesting possibilities, especially if a very precise estimate of θ is needed.
However, the K-W process seems to have the better small sample properties unless one has substantial a priori knowledge as to the location of θ.
J. H. Venter [12] has proposed that observations be taken at X_n - c_n, X_n, and X_n + c_n on each iteration and thereby estimate M''(θ). He noted that this estimate could be used to minimize the variance of n^{β}(X_n - θ) if a_n is of the form An^{-1}. Again, this is mainly an asymptotic result and will not be considered here.
The only small sample results relevant to the K-W process seem
to be those of B. Springer's [11] simulation study on a process pro-
posed by Wilde [13]. This process is a modification of the K-W
process in which the sign of Y(X_n - c_n) - Y(X_n + c_n) is used rather than the difference. In this study he used a geometric sequence for {a_n}, which obviously does not satisfy the usual assumption that Σ a_n diverge. Then he unjustifiably criticized the process as a whole for the bias that he observed.
1.3 Summary of Results
The K-W process has been largely of theoretical interest, and the
behavior of this process has been studied mainly through its asymptotic
properties. With the exception of some remarks by a few writers,
little knowledge of use to the consulting statistician can be gleaned
from a review of the literature. It is the intent of this paper to
fill this gap.
Chapter 2 will be devoted to a generalization of Burkholder's
[2] theorem on the asymptotic distribution of a certain class of sto-
chastic approximation procedures. These extensions will admit
sequences {a_n} of the form O(n^{-α}), where 1/2 < α ≤ 1, admit the use of
constrained random variables (to be defined later), and allow the
experimenter to take an increasing number of observations at each
step. The usefulness of these extensions will be discussed in the
following chapters as they arise.
In the third chapter, a generalization of the K-W process will
be proposed and its asymptotic properties will be obtained from the
results of Chapter 2. Approximate bounds will be given for the small
sample variance of the process when M(x) is a quadratic function, and
this result will be used to obtain approximate bounds in cases of
more general functions. The effects of the response function, the
value of α, the variance of Y(x), and the starting value (X_1) on the
mean square error will be studied in some detail.
In the fourth chapter we shall develop the modification of the
K-W process proposed by Wilde [13]. Various asymptotic and small
sample properties will be established. In addition, a method of
estimating the variance of the estimator will be developed.
In the final chapter two examples will be given to illustrate
some of the results in the third and fourth chapters.
CHAPTER 2. STOCHASTIC APPROXIMATION PROCEDURES OF TYPE A_o^T
2.1 Constrained Stochastic Approximation Procedures
In this chapter we shall extend Burkholder's [2] Theorem 3 on
the asymptotic normality of a class of stochastic approximation
procedures. This extension will improve the practicality and the
flexibility of these procedures. The main discussion of the usefulness of this extension will be made in Chapter 2 and in Chapter 3, in which more concrete examples will be examined.
In the development that follows, F^{[r]}(·) will denote the distribution function of the mean of [r] ([r] denotes the greatest integer less than or equal to r) independent random variables with distribution function F(·). We shall denote by Y^T the random variable defined by

Y^T = T if Y > T,
    = Y if -T ≤ Y ≤ T,
    = -T if Y < -T,

where Y is any random variable. The random variable Y^T will be called a constrained random variable, and if R(Y) is any function or functional of Y, then R^T(Y) will denote the corresponding function or functional of Y^T. (E.g., if M(x) = E(Y(x)), then M^T(x) = E(Y^T(x)).)
When it is possible to do so without confusion, the random variable
sequence defined by
Y_n^{T_n} = T_n if Y_n > T_n,
          = Y_n if -T_n ≤ Y_n ≤ T_n,
          = -T_n if Y_n < -T_n,

will be denoted by Y^T. We shall say that a function sequence {D_n(x)} is continuously convergent to A at x = θ if for every ε > 0, there exists a δ(ε) > 0 and an N(ε) > 0 such that if |x - θ| < δ and n > N(ε), then |D_n(x) - A| < ε. The usual symbols O(·) and o(·) will be used to indicate the order of convergence.
Definition (2.1): Let {a_n'} and {T_n} be sequences of positive numbers such that if T_n ≠ ∞ for every n greater than some n_0, then inf T_n > 0. Let Z_n(x) denote a random variable with distribution function G_n(·|x). Let X_1 be fixed, and define the random variable sequence {X_n} by

X_{n+1} = X_n - a_n' Z_n^T(X_n),        (2.1)

where Z_n^T(x) is a random variable with conditional distribution G_n^T(·|x) given Z_1^T, ..., Z_{n-1}^T, X_1, ..., X_{n-1}, and X_n = x. The stochastic approximation sequence X_n will be said to be of Type A_o^T.
This definition is, in fact, Burkholder's [2] definition of a Type A_o process. However, the superscript T is added to emphasize the fact that, although the sequence {X_n} depends on constrained random variables, it is usually more reasonable to make assumptions on the unconstrained random variables.
In order to get an appreciation for this class of processes and
for some functions that will be defined later, we shall derive a few
well-known results for the K-W process.
That the K-W process is of Type A_o can be seen by letting a_n' = a_n c_n^{-1}, T_n ≡ ∞, and Z_n^T(X_n) = Y(X_n - c_n) - Y(X_n + c_n). Let R_n(x) = E(Z_n(x)). Then R_n(x) = M(x - c_n) - M(x + c_n), and, as was noted in the introduction, under suitable regularity assumptions there exists a sequence {μ_n} such that R_n(μ_n) = M(μ_n - c_n) - M(μ_n + c_n) = 0 and (x - μ_n) R_n(x) > 0 if x ≠ μ_n. By considering the Taylor series expansion of R_n(x) about μ_n, we can show that

R_n(x) = -2c_n (x - μ_n) M''(μ_n) + o(c_n (μ_n - θ)).

Letting D_n(x) = R_n(x)/c_n(x - μ_n) for x ≠ μ_n, = -2M''(θ) for x = μ_n, we find that the function sequence {D_n(x)} is continuously convergent to -2M''(θ) at x = θ, since |μ_n - θ| → 0 as c_n → 0.
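These quantities are easy to compute for a concrete response. For a quadratic M one has D_n(x) = -2M''(θ) exactly; for a non-quadratic M the convergence holds only in the limit. A numeric sketch with an illustrative quartic response (not from the dissertation):

```python
# Illustrative response: a symmetric quartic perturbation of the quadratic,
# M(x) = -(x - theta)^2 - 0.1 (x - theta)^4.  By symmetry mu_n = theta,
# and -2 M''(theta) = 4.
theta = 1.0
M = lambda x: -(x - theta) ** 2 - 0.1 * (x - theta) ** 4

def D(x, c):
    """D_n(x) = R_n(x) / (c_n (x - mu_n)) with R_n(x) = M(x - c) - M(x + c).
    Exact algebra gives D(x, c) = 4 + 0.8 (x - theta)^2 + 0.8 c^2."""
    return (M(x - c) - M(x + c)) / (c * (x - theta))

print(D(theta + 0.5, 0.1))    # 4 + 0.8*0.25 + 0.8*0.01 = 4.208
print(D(theta + 0.05, 0.01))  # approaches -2 M''(theta) = 4 as x -> theta, c -> 0
```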
2.2 Constrained Random Variables
The following lemmas establish certain relationships between
constrained and unconstrained random variables that are used in
Theorem (2.1).
Lemma (2.1): Let Y be a random variable with finite variance σ². Then Var(Y^T) ≤ σ².
Proof: Without loss of generality assume that E(Y) = 0. Define the random variable Y' by

Y' = T if Y > T,
   = Y if Y ≤ T.

Then |Y'| ≤ |Y|, and likewise |Y^T| ≤ |Y'|, since the constraining interval [-T, T] contains zero. Therefore

Var(Y^T) ≤ E(Y^T)² ≤ E(Y)² = σ².
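In code the constrained variable Y^T is simply a clip of Y to [-T, T], and the variance reduction of Lemma (2.1) can be checked by simulation (an illustrative sketch, not from the dissertation):

```python
import random

def constrain(y, t):
    """Y^T: constrain y to the interval [-t, t]."""
    return max(-t, min(t, y))

random.seed(0)
ys = [random.gauss(0.0, 3.0) for _ in range(100_000)]
yts = [constrain(y, 2.0) for y in ys]

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

print(var(ys), var(yts))   # the constrained sample variance is the smaller one
```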
Lemma (2.2): Let {T_n} and {r_n} be sequences satisfying Definition (2.1), and let {Z_n(x)} denote a sequence of random variables with distribution functions {F^{[r_n]}(·|x)} such that R(x) = ∫_{-∞}^{∞} z dF(z|x). Assume that there exists a finite number B and an even integer K ≥ 2 such that

sup_x ∫_{-∞}^{∞} (z - R(x))^K dF(z|x) ≤ B.

Then for 0 < C < 1 there exist positive constants D_1, D_2, and D_3 such that

a. if |R(x)| ≥ C T_n, then

R^{T_n}(x) sgn(R(x)) ≥ D_1 T_n - D_2 r_n^{-K/2} T_n^{-(K-1)};

b. and if |R(x)| < C T_n, then

|R(x) - R^{T_n}(x)| ≤ D_3 r_n^{-K/2} T_n^{-(K-1)}.
Proof: Case (a): Without loss of generality, assume R(x) > 0. Choose C' such that 0 < C' < C. We have then

E(Z_n^T(x)) = -T_n P(Z_n(x) < -T_n) + ∫_{-T_n}^{T_n} z dF^{[r_n]}(z|x) + T_n P(Z_n(x) > T_n)

≥ -T_n P(Z_n(x) < (C - C')T_n) + (C - C')T_n P(Z_n(x) ≥ (C - C')T_n)

= (C - C')T_n - (1 + C - C')T_n P(Z_n(x) < (C - C')T_n).

But since R(x) ≥ C T_n, we have

P(Z_n(x) < (C - C')T_n) ≤ P(|Z_n(x) - R(x)| > C'T_n)

≤ (C'T_n)^{-K} E(Z_n(x) - R(x))^K

≤ (C'T_n)^{-K} r_n^{-K/2} E(Z(x) - R(x))^K.

We have then

E(Z_n^T(x)) ≥ (C - C')T_n - (1 + C - C')(C')^{-K} B T_n^{-(K-1)} r_n^{-K/2}.

The desired result is obtained by letting D_1 = (C - C') and D_2 = (1 + C - C')(C')^{-K}.
Case (b): Again we can assume R(x) > 0. We have that

|R(x) - R^{T_n}(x)| ≤ (|R(x)| + T_n) P([Z_n(x) < -T_n] ∪ [Z_n(x) > T_n])
+ ∫_{-∞}^{-T_n} |z - R(x)| dF^{[r_n]}(z|x) + ∫_{T_n}^{∞} |z - R(x)| dF^{[r_n]}(z|x)

≤ (|R(x)| + T_n) P(|Z(x) - R(x)| > T_n - R(x))
+ (T_n + R(x))^{-(K-1)} ∫_{-∞}^{-T_n} |z - R(x)|^K dF^{[r_n]}(z|x)
+ (T_n - R(x))^{-(K-1)} ∫_{T_n}^{∞} |z - R(x)|^K dF^{[r_n]}(z|x),

since |R(x)| < C T_n < T_n. Again using Chebyshev bounds, we see that

|R(x) - R^{T_n}(x)| ≤ (1 + C) T_n ((1 - C) T_n)^{-K} B r_n^{-K/2} + ((1 - C) T_n)^{-(K-1)} r_n^{-K/2} B.

The desired result is obtained by letting D_3 = 2B(1 + C)(1 - C)^{-K}.
2.3 Asymptotic Distribution of a Type A_o^T Process
We shall now give an extension of Burkholder's [2] Theorem 3. It is helpful to recall the discussion on the K-W process preceding Lemma (2.1) and to bear in mind that the assumptions are based on the unconstrained random variables even though the theorem concerns a Type A_o^T process.
Before giving Theorem (2.1) we shall give a theorem by Burkholder [2] and a theorem by Fabian [6] which play integral parts in the proof of Theorem (2.1).
Theorem (Burkholder [2]): Suppose {X_n*} is a stochastic approximation process of Type A_o and θ is a real number such that

(1) there is a function Q from the positive real numbers into the integers such that if ε > 0, |x - θ| > ε, and n > Q(ε), then (x - θ) R_n*(x) > 0;

(2) sup_{n,x} (|R_n*(x)| / (1 + |x|)) < ∞;

(3) sup_{n,x} Var(Z_n*(x)) < ∞;

(4) if 0 < δ_1 < δ_2 < ∞, then Σ a_n* ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n*(x)| ) = ∞;

(5) Σ (a_n*)² < ∞;

(6) if n is a positive integer, then R_n*(x) and Var(Z_n*(x)) are Borel measurable.

Then P(lim X_n* = θ) = 1.

It will be useful to note that condition (1) can be replaced by the condition that

(1') there is a function Q from the positive real numbers into the positive integers such that if ε > 0, |x - θ| > ε, and n > Q(ε), then (x - θ) R_n*(x) ≥ E_n(x) + δ_n(x), where E_n(x) > 0, Σ a_n* E_n(x) = ∞, and sup_x |δ_n(x)| < ∞.
The following theorem is a one-dimensional version of Fabian's [6] theorem on asymptotic normality. In the theorem let R denote the real numbers. Let P denote the set of all measurable transformations from the measurable space (Ω, S) to R. Let χ[A] denote the indicator function of the set A, and, unless indicated otherwise, all convergence notions will mean convergence almost everywhere.

Theorem (Fabian): Suppose K is a positive integer, S_n a nondecreasing sequence of σ-fields, S_n ⊂ S; suppose U_n, V_n, T_n, Γ_n, Φ_n ∈ P, Γ, Φ ∈ R, and Γ > 0. Suppose Γ_n, Φ_{n-1}, V_{n-1} are S_n-measurable, C, α, β ∈ R, and

Γ_n → Γ, Φ_n → Φ, T_n → T or E|T_n - T| → 0,        (2.2)

E(V_n | S_n) = 0, C ≥ |E(V_n² | S_n) - Σ| → 0,        (2.3)

and, with σ²_{j,r} = E(χ[V_j² ≥ r j^α] V_j²), let either

lim_{j→∞} σ²_{j,r} = 0 for every r > 0,        (2.4)

or α = 1 and lim_{n→∞} n^{-1} Σ_{j=1}^{n} σ²_{j,r} = 0 for every r > 0. Suppose that

β_+ = β if α = 1, β_+ = 0 if α ≠ 1,

0 < α ≤ 1, 0 ≤ β, β_+ < 2Γ,        (2.5)

and

U_{n+1} = (1 - n^{-α} Γ_n) U_n + n^{-(α+β)/2} Φ_n V_n + n^{-α-β/2} T_n.        (2.6)

Then the asymptotic distribution of n^{β/2} U_n is normal with mean (Γ - β_+/2)^{-1} T and variance Σ Φ² / (2Γ - β_+).
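The limiting mean can be checked numerically in the noise-free case: with α = β = 1, Φ_n ≡ 0, Γ_n ≡ Γ, and T_n ≡ T, the recursion (2.6) is deterministic, and n^{1/2} U_n should approach (Γ - β_+/2)^{-1} T. A sketch with hypothetical constants Γ = 2, T = 3 (not values used in the text):

```python
# Deterministic check of the limiting mean in Fabian's theorem:
#   U_{n+1} = (1 - Gamma/n) U_n + n^{-3/2} T   (alpha = beta = 1, no noise),
# so sqrt(n) U_n should approach T / (Gamma - 1/2).
gamma, t = 2.0, 3.0
u = 0.0
n_steps = 1_000_000
for n in range(1, n_steps + 1):
    u = (1.0 - gamma / n) * u + t * n ** -1.5

print(n_steps ** 0.5 * u)   # approaches t / (gamma - 0.5) = 2.0
```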
Theorem (2.1): Let {X_n} be a stochastic approximation process of Type A_o^T with E Z_n(x) = R_n(x) and G_n(·|x) = G^{[n^p]}(·|x). Suppose that {μ_n} is a real number sequence, {c_n} is a positive number sequence, θ is a real number, p, w, and ζ are non-negative numbers, and each of n_0, A, C, σ², ν, α, and Λ is a positive number such that

i. {R_n(x)} is a function sequence such that if n > n_0, then R_n(μ_n) = 0, and if x ≠ μ_n, then (x - μ_n) R_n(x) > 0;

ii. sup_{n,x} n^w |R_n(x)| / (1 + |x|) < ∞;

iii. if 0 < δ_1 < δ_2 < ∞, then Σ a_n' ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| ) = ∞, and if T_n ≠ ∞ for all n > n_0, then lim_n ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| ) exists;

iv. {D_n(x)} is a function sequence defined by D_n(x) = R_n(x)/c_n(x - μ_n) if x ≠ μ_n, = Λ* if x = μ_n, and is continuously convergent to Λ at x = θ; [*The definition of D_n(μ_n) is arbitrary except for the restriction that D_n(μ_n) → Λ.]

v. sup_x Var(Z(x)) < ∞, Var(Z(x)) is continuous and converges to σ² at x = θ, there exists a finite open interval I containing θ such that for η > 0

lim_{j→∞} E{χ[(Z(x) - R(x))² ≥ η j^α] (Z(x) - R(x))²} = 0

uniformly for x ∈ I, and if T_n ≠ ∞ for every n > n_0, then sup_x E(|Z(x) - R(x)|^r) < ∞ for r an even integer ≥ 2;

vi. G(z|·) is Borel measurable for all z;

vii. (a) 1/2 < α ≤ 1, |μ_n - θ| = O(n^{-ν}), n^w c_n → C, n^{α-w} a_n' → A/C, (rp/2 + (r-1)ζ) > w + (1 - α), and β/2 = α/2 + p/2 - w < ν;

(b) if p/2 < w, then p > 1 - 2(α - w);

(c) if T_n ≠ ∞ for every n > n_0, then β/2 < (rp/2 + (r-1)ζ - w);

(d) and β_+ = 0 if α < 1, = β < 2AΛ if α = 1.

Then X_n → θ a.e., and n^{β/2}(X_n - θ) is asymptotically normal (0, σ²A²/C²(2AΛ - β_+)).
Proof: First we shall use Burkholder's theorem to show X_n → θ a.e. by considering two cases.

Case I: p/2 < w.

Burkholder has shown that condition (i) implies that, for every ε > 0, there exists a Q(ε) such that if |x - θ| > ε, then (x - θ) R_n(x) > 0 for every n > Q(ε). Thus if T_n = ∞,

(x - θ) R_n*(x) = (x - θ) n^{p/2} R_n(x) > 0.

If T_n ≠ ∞ for every n > n_0, then by Lemma (2.2) (with [r_n] = [n^p]) we have that the constraining alters R_n*(x) by at most

O(n^{-rp/2 - (r-1)ζ}) max{D_2, D_3}.

Denote

lim_{n→∞} ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| )

by L. This limit exists and is finite by (ii) and (iii). If L > 0, Lemma (2.2) and (vii. (a)) give

Σ a_n* ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n*(x)| ) ≥ D_4 Σ a_n'

for 0 < D_4 < min{L, D_1}. Since a_n' = O(n^{-(α-w)}), w ≥ 0, and α ≤ 1, again the series diverges. If L = 0, then

inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| < D T_n

for some 0 < D < 1 and sufficiently large n, since lim inf n^{-ζ} T_n > 0. Thus by (b) of Lemma (2.2) and the fact that (rp/2 + (r-1)ζ) > w + (1 - α) ≥ 0, the divergence of the series follows from the divergence for the case T_n = ∞. Condition (4) therefore holds in all cases.
Condition (5) follows from

Σ (a_n*)² = Σ (a_n')² n^{-p} = O( Σ n^{-(2(α-w)+p)} ) < ∞

by (vii. (b)).
Finally, Burkholder [2] has shown that if G(z|·) is Borel measurable, then G^{[n^p]}(z|·) is Borel measurable, and that if sup_x Var(Z(x)) < ∞, then the measurability of R(x) and Z(x) is implied by that of G(z|·). Since the measurability of G^{[n^p]}(z|·) clearly implies the same for the distribution function of n^{p/2} Z^T(x), the measurability of R_n*(x) and Var Z_n*(x) follows. Thus condition (6) is satisfied.

Therefore, X_n → θ a.e.
Case II: p/2 ≥ w. Let a_n* = a_n' c_n and Z_n*(x) = c_n^{-1} Z_n^T(x). Conditions (1), (2), (3), (4), and (6) are established as in Case I. Condition (5) follows from

Σ (a_n*)² = Σ (a_n' c_n)² < ∞

by (vii. (a)).

Thus, in either case, X_n → θ a.e.
We shall now establish the asymptotic normality of n^{β/2}(X_n - θ) using Fabian's theorem.

From the definition of {X_n} we have

X_{n+1} - θ = X_n - θ - a_n' R_n^T(X_n) - a_n' (Z_n^T(X_n) - R_n^T(X_n))
            = X_n - θ - a_n' R_n(X_n) - a_n' ((R_n^T(X_n) - R_n(X_n)) + (Z_n^T(X_n) - R_n^T(X_n))).

But by conditions (i) and (iv),

R_n(x) = c_n (x - μ_n) D_n(x)
       = c_n (θ - μ_n) D_n(x) + c_n (x - θ) D_n(x),

and so

X_{n+1} - θ = (1 - a_n' c_n D_n(X_n)) (X_n - θ) - a_n' (Z_n^T(X_n) - R_n^T(X_n))
            - a_n' ((R_n^T(X_n) - R_n(X_n)) + c_n (θ - μ_n) D_n(X_n)).        (2.7)
Make the correspondence with Fabian's theorem as follows:

U_n = X_n - θ,   Γ_n = n^{α} a_n' c_n D_n(X_n),

Φ_n = n^{(α+β)/2} [n^p]^{-1/2} a_n',   V_n = -[n^p]^{1/2} (Z_n^T(X_n) - R_n^T(X_n)),

and

T_n = T_n(X_n) = -n^{α+β/2} a_n' ((R_n^T(X_n) - R_n(X_n)) + c_n D_n(X_n)(θ - μ_n)).

(Note the possible confusion of T_n(X_n) with the values at which Z_n^T(X_n) is constrained.)

Let S_n be the σ-field generated by X_1, Z_1^T, ..., Z_{n-1}^T. Clearly this is the same σ-field generated by Z_1^T, ..., Z_{n-1}^T, X_1, ..., X_n. Moreover, S_n is increasing, and Φ_{n-1} and V_{n-1} are measurable with respect to S_n. Since R_n^T(X_n) = E(Z_n^T(X_n) | X_n), R_n^T(X_n) is measurable with respect to S_n, implying the same for D_n(X_n) and therefore Γ_n.
To establish (2.2) we recall that X_n → θ a.e. Since D_n(x) is continuously convergent to Λ at x = θ, it follows that D_n(X_n) → Λ a.e., and therefore R_n(X_n) → 0 a.e. Thus

Γ_n = n^{α} a_n' c_n D_n(X_n) → AΛ

by conditions (iv) and (vii. (a)). Again by (vii. (a)),

Φ_n = n^{(α+β)/2} [n^p]^{-1/2} a_n' → A/C,

since β/2 = α/2 + p/2 - w. Finally, since a_n' = O(n^{-(α-w)}), the term n^{α+β/2} a_n' (R_n^T(X_n) - R_n(X_n)) → 0 a.e. That

n^{β/2} (θ - μ_n) D_n(X_n) → 0 a.e.

follows from the fact that |θ - μ_n| = O(n^{-ν}) and β/2 < ν. Thus T_n → 0 a.e.
We have

E(V_n | S_n) = E(E(V_n | X_n) | S_n) = E(0 | S_n) = 0.

If T_n = ∞, then there exists a C > 0 such that (with Σ = σ²)

C ≥ |E(V_n² | S_n) - Σ| → 0 a.e.

by (v). If T_n ≠ ∞ for every n > n_0, we have from (vii. (a)) that rp/2 + (r-1)ζ > w + 1 - α ≥ 0, so that the effect of the constraining is asymptotically negligible. Hence |E(V_n² | S_n) - Σ| → 0 a.e. in either case. This establishes (2.3).
To show that lim_{j→∞} σ²_{j,t} = 0 for every t > 0, let I be defined as in (v). Since E(V_n) = 0, σ²_{j,t} ≤ sup_x E(V_j²) ≤ sup_x Var Z(x) < ∞ by Lemma (2.1) and condition (v). Letting D denote this last upper bound, we obtain

σ²_{j,t} = E χ{[V_j² ≥ t j^α] ∩ [X_j ∈ I]} V_j² + E χ{[V_j² ≥ t j^α] ∩ [X_j ∈ I^c]} V_j²
         ≤ E χ{[V_j² ≥ t j^α] ∩ [X_j ∈ I]} V_j² + D P(X_j ∈ I^c).

The first term goes to zero by (v), and the last term goes to zero since P(X_n ∈ I) → 1. Thus condition (2.4) is satisfied with t = r.
Finally, by noting that Γ = AΛ and β < 2AΛ if α = 1, it follows that n^{β/2}(X_n - θ) is asymptotically normal (0, σ²A²/C²(2AΛ - β_+)).
Although Theorem (2.1) does not extend the conclusions of
Fabian's theorem to a larger class of stochastic approximation pro-
cedures, the assumptions of Theorem (2.1) are more related to the
conditions of an experiment in which the use of these procedures may
be attractive. Theorem (2.1) also has the added attraction of removing the implied assumption that X_n → θ a.e. and including it in the conclusion of the theorem.
Theorem (2.1) extends Burkholder's [2] Theorem 3 in three directions. Firstly, it increases the possible values for α from that of 1 to that of the interval (1/2, 1]. Since the constant a_n (= A n^{-α}) can be thought of as the weight assigned to the n'th observation, this wider range of values for α gives the experimenter more flexibility
in weighting the observations. This greater flexibility often enables
the experimenter to decrease the mean squared error during the early
stages of the search for e. Secondly, it allows the use of con-
strained random variables. An example in Chapter 3 points out when
such constraining can improve the behavior of the K-W process. And,
thirdly, it allows an increasing number of observations to be taken
on each step. Again an example in Chapter 3 describes situations in
which, by taking more observations on each step, we can obtain the
same variance as by taking the same number of observations using more
steps. The practical importance of decreasing the number of times the
experimenter must obtain observations is quite clear.
Due to the generality of Theorem (2.1), it is difficult to com-
ment on the practical consequences of each assumption with the excep-
tion of assumptions (v) and (vi). It is sufficient to guarantee assumption (v) that our observations be bounded. Since this can almost always be assumed in practice, assumption (v) is not practically restrictive for any finite r. Thus if either ζ or p is greater than zero, the assumptions that

rp/2 + (r - 1)ζ > w + (1 - α)

and

β/2 < (rp/2 + (r - 1)ζ - w)

are almost always satisfied. Assumption (vi) seems never to be
questioned in practice.
Theorem (2.1) expresses the order of the variance in terms of the number of steps rather than the number of observations. Let N(n) be the total number of observations necessary to obtain X_n. Then

N(n) = Σ_{m=1}^{n-1} m^p = O(n^{p+1}).

If γ is such that (N(n))^γ = n^{β/2}, then

(p + 1)γ = α/2 + p/2 - w,

or

γ = (α/2 + p/2 - w)/(p + 1).

We shall discuss the value of γ for the K-W process in the next chapter, and shall only note here that w = 0 for the Robbins-Monro [9] process, giving γ = (α + p)/2(p + 1).
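As a quick numerical sanity check of the step-versus-observation accounting (the values of p, α, and w below are illustrative, not ones recommended by the text):

```python
# Check that N(n) = sum_{m=1}^{n-1} m^p grows like n^(p+1)/(p+1), so a rate
# of n^(beta/2) per step is N^gamma per observation with
# gamma = (alpha/2 + p/2 - w) / (p + 1).
p, alpha, w = 2.0, 1.0, 0.25          # hypothetical choices
gamma = (alpha / 2 + p / 2 - w) / (p + 1)
print(gamma)                          # (0.5 + 1.0 - 0.25) / 3 = about 0.4167

def N(n):
    return sum(m ** p for m in range(1, n))

for n in (1_000, 100_000):
    print(n, N(n) / n ** (p + 1))     # approaches 1 / (p + 1) = 1/3
```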
CHAPTER 3. THE CONSTRAINED KIEFER-WOLFOWITZ PROCESS
3.1 Definition of the Constrained Kiefer-Wolfowitz Process
The Kiefer-Wolfowitz process is not generally accepted as a
practical procedure for seeking a maximum in experimental work. The
basic reasons for this are: its sequential nature, insufficient
knowledge of its small sample properties, lack of a good estimate of
its variance, and an unfamiliarity of experimental statisticians with
its attributes.
We shall attempt to relieve these deficiencies somewhat. Following the definition of the Constrained Kiefer-Wolfowitz process, we
shall obtain, under certain mild restrictions, its asymptotic distri-
bution as a corollary to Theorem (2.1). Various examples will be
given to study the effect of the parameters of the process and to
illustrate its small sample behavior. Finally, bounds will be ob-
tained for expressions related to the small sample variance, and a
method of estimating this variance will be discussed.
Definition (3.1): Let {a_n}, {c_n}, {r_n}, and {T_n} denote sequences of positive numbers such that [r_n] ≥ 1 and, if T_n ≠ ∞ for all n greater than some n_0, inf T_n > 0. Let Y(x) denote a random variable with mean M(x) and distribution function F(·|x). Let X_1 be fixed, and let the random variable sequence {X_n} be defined by

    X_{n+1} = X_n - a_n c_n^{-1} (Y(X_n - c_n) - Y(X_n + c_n))^{T_n},   (3.1)

where Y(X_n - c_n) and Y(X_n + c_n) are random variables which, conditional on Y(X_1 - c_1), ..., Y(X_{n-1} + c_{n-1}), X_1, ..., X_{n-1}, X_n = x_n, are distributed independently as F^{[r_n]}(y|x_n - c_n) and F^{[r_n]}(y|x_n + c_n).

This will be called the Constrained Kiefer-Wolfowitz process, and it will be referred to as the CKW process.
The CKW process is a generalization of the K-W process. The CKW process bounds the difference (Y(X_n - c_n) - Y(X_n + c_n)) by ±T_n, and it allows the experimenter to vary the number of observations on each step. It clearly reduces to the K-W process when r_n ≡ r ≥ 1 and T_n ≡ ∞.
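The recursion (3.1) is easy to sketch numerically. The following illustration is not part of the thesis: the quadratic response, the Gaussian noise, and all parameter values are assumptions chosen only to show one CKW run, with the truncation (·)^{T_n} read as clipping to [-T_n, T_n] and F^{[r_n]} read as averaging [r_n] observations at each level.

```python
import random

def ckw_step(x, n, M, sigma, A=1.0, C=1.0, alpha=1.0, w=1/6, T=1.0, r=1):
    """One step of recursion (3.1):
    X_{n+1} = X_n - a_n c_n^{-1} (Ybar(X_n - c_n) - Ybar(X_n + c_n))^{T},
    where Ybar averages r noisy responses and (z)^T clips z to [-T, T]."""
    a_n, c_n = A * n ** -alpha, C * n ** -w
    ybar = lambda u: sum(M(u) + random.gauss(0, sigma) for _ in range(r)) / r
    z = ybar(x - c_n) - ybar(x + c_n)
    z = max(-T, min(T, z))               # the CKW constraint |Z| <= T
    return x - (a_n / c_n) * z

# Illustrative run: quadratic response with maximum at theta = 2.
random.seed(1)
theta, x = 2.0, 0.0
for n in range(1, 2001):
    x = ckw_step(x, n, M=lambda u: -(u - theta) ** 2, sigma=0.5)
print(round(x, 2))   # settles near theta = 2
```

Note that the clipped difference bounds each step by a_n c_n^{-1} T, which is what distinguishes the run above from an unconstrained K-W run.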
3.2 Asymptotic Properties of the CKW Process
We shall now examine certain asymptotic properties of the CKW
process through the following corollary to Theorem (2.1). In re-
viewing the corollary, it will be helpful to bear in mind that the
assumptions are based on the unmodified random variables.
Corollary (3.1): Let {X_n} be a CKW process with r_n = n^{2aw}. Let θ be a real number, let a and b be non-negative numbers, and let each of n_0, A, C, σ², α, w, and λ be positive numbers such that

a. M'(x) is continuous, (x - θ)M'(x) > 0 if x ≠ θ, and

    sup_x (|M'(x)|/(1 + |x|)) < ∞;

b. the third derivative of M(x) exists and is continuous for x in an open interval (I) containing θ, and M''(θ) = -λ/2;

c. Var Y(x) = σ²(x) is bounded and continuous, there exists an open interval (I) containing θ such that for any t > 0

    lim_{j→∞} E{χ[(Y(x) - M(x))² > t j^a] (Y(x) - M(x))²} = 0

uniformly for x ∈ I, and if T_n ≠ ∞ for every n > n_0, then sup_x E(|Y(x) - M(x)|^r) < ∞ for r an even integer ≥ 2;

d. F(y|·) is Borel measurable for all y;

e. (1) 1/2 < α ≤ 1, n^α a_n → A, n^w c_n → C, lim inf n^{-bw} T_n > 0,
       (ra + (r - 1)b) > 1 + (1 - α)/w, and
       β/2 = α/2 + (a - 1)w < 2w;
   (2) if a < 1, then a > 1 - (2α - 1)/2w;
   (3) if T_n ≠ ∞ for every n > n_0, then β/2 < (ra + (r - 1)b - 1)w;
   (4) and β₊ = 0 if α < 1, = β < 2Aλ if α = 1.

Then X_n → θ almost everywhere, and n^{β/2}(X_n - θ) is asymptotically normal (0, 2A²σ²/(C²(2Aλ - β₊))).

The basic ideas of Corollary (3.1) have already been noted by Burkholder [2]. However, since Corollary (3.1) is an extension of his work and since it is instructive to do so, we shall give a proof of this corollary.
Proof: Make the correspondence with Theorem (2.1) as follows: a'_n = a_n c_n^{-1}, Z_n(x) = (Y(x - c_n) - Y(x + c_n)), and R_n(x) = M(x - c_n) - M(x + c_n).

Condition (a) implies that M(x) is continuous, and that M(x) is strictly increasing or decreasing as x is less than or greater than θ, respectively. Thus, there exists a sequence {μ_n} such that

    (x - μ_n) R_n(x) > 0 for x ≠ μ_n.

From conditions (a) and (e.(1)) we obtain that

    sup_{n,x} n^w |R_n(x)|/(1 + |x|) = sup_{n,x} n^w (c_n O(|M'(x)|)/(1 + |x|)) < ∞.

This is condition (ii).

To verify condition (iii) we note that by (a) and (e.(1))

    Σ_n a'_n ( inf_{δ_1 < |x-θ| < δ_2} |R_n(x)| ) = Σ_n a_n ( inf_{δ_1 < |x-θ| < δ_2} |O(M'(x))| ) = O(Σ_n a_n) = ∞.

Moreover, lim inf_{δ_1 < |x-θ| < δ_2} |R_n(x)| = lim c_n O(1) = 0 by the above argument and the fact that c_n → 0.

The function sequence {D_n(x)} in condition (iv) is defined by D_n(x) = c_n^{-1}(x - μ_n)^{-1}(M(x - c_n) - M(x + c_n)) if x ≠ μ_n, and = λ at x = θ, using condition (b). For x ∈ I we have, since R_n(μ_n) = 0,

    M(x - c_n) - M(x + c_n) = (x - μ_n)(M'(μ_n - c_n) - M'(μ_n + c_n))
                                + O((x - μ_n)²)(M''(ξ_1 - c_n) - M''(ξ_2 + c_n))
                            = -2c_n(x - μ_n)M''(μ_n) + o(c_n(x - μ_n)),

where |ξ_i - x| ≤ |μ_n - x|, i = 1, 2. Since the third derivative exists for x ∈ I and |μ_n - θ| ≤ c_n → 0, it follows that for every ε > 0 there exists a δ > 0 and an N such that, for |x - θ| < δ and n > N, |D_n(x) - λ| < ε. Thus condition (iv) holds.

Condition (v) is a direct consequence of (c) with Var Z_n(x) ≤ 2σ².

Finally, Burkholder has noted that condition (d) implies (vi). He has also shown that the existence of a continuous third derivative in an open interval containing θ implies that |μ_n - θ| = O(c_n²) = O(n^{-2w}). Letting v = 2w, p = 2aw, and ζ = bw, condition (vii) can be verified using simple arithmetic.

Thus the corollary is proven.
Except in a few pathological cases, an experimenter can assume that the derivatives of M(x) exist, that the r'th central moment of Y(x) exists and is continuous, and that F(y|x) is a continuous function of x for every y. Moreover, it is a rare case when he cannot specify a finite interval which should contain θ. By restricting X_n to this interval, all the assumptions of Corollary (3.1) hold for r arbitrarily large.

If the above assumptions are satisfied, condition (e) is equivalent to (e'):

    1/2 < α ≤ 1, n^α a_n → A, n^w c_n → C, a + b > 0,
    β/2 = α/2 + (a - 1)w < 2w, and
    β₊ = 0 if α < 1, = β < 2Aλ if α = 1.

Thus, except for assumption (e'), the assumptions of Corollary (3.1) should not be looked upon as restrictions, but rather as factors influencing the behavior of the process.
Since in experimental work it can usually be assumed that the supremum over x of any central moment of Y(x) is finite (at least over an interval known to contain θ), the assumptions that

    (ra + (r - 1)b) > 1 + (1 - α)/w

and

    α/2 + (a - 1)w < (ra + (r - 1)b - 1)w

are generally not restrictive. Practically speaking then, the constant a must satisfy

    α/2 + (a - 1)w < 2w   (3.2)

and

    1 - (2α - 1)/2w < a,

or, equivalently,

    1 - (2α - 1)/2w < a < 3 - α/2w.   (3.3)
As in Chapter 2, let N(n) denote the number of observations required to obtain X_n, and let γ be defined by (N(n))^γ = n^{β/2}. By (2.8),

    γ = (α/2 + (a - 1)w)/(2aw + 1).   (3.4)

Thus, if a = 0 (i.e., the same number of observations are taken on each step), γ = α/2 - w. Since (3.2) implies

    w > α/(2(3 - a)),   (3.5)

we obtain, for a = 0,

    γ = β/2 < α/2 - α/6 = α/3.

Suppose a > 0. We have

    (2aw + 1)² dγ/dw = (2aw + 1)(a - 1) - (α/2 + (a - 1)w)2a = a(1 - α) - 1.   (3.6)
It follows that dγ/dw is greater than, equal to, or less than zero as a is greater than, equal to, or less than (1 - α)^{-1}.

Case I: a < (1 - α)^{-1}. In this case γ is maximized by minimizing w. Thus by (3.4) and (3.5)

    γ < (α/2 + α(a - 1)/(2(3 - a)))/(aα/(3 - a) + 1)
      = α/(3 - a(1 - α))
      ≤ 1/3.   (3.7)

The equality sign holds if either a = 0 or α = 1.

Case II: a ≥ (1 - α)^{-1}. By (3.3), (1 - α)^{-1} < 3 - α/2w. Since 1/2 < α ≤ 1, this inequality is not satisfied if either w ≤ 1/4 or α > 2/3.

Assuming these conditions are satisfied, substitution into (3.4) gives, for a = (1 - α)^{-1},

    γ = (α/2 + αw/(1 - α))/(2w/(1 - α) + 1) = α/2 ≤ 1/3.

If a > (1 - α)^{-1}, γ is maximized by maximizing w. Taking the limit of (3.4) as w → ∞ we obtain

    γ = (a - 1)/2a.

By (3.3), a < 3, and so γ < 1/3.

Thus in all cases γ < 1/3. We might add that a and w can be adjusted so that γ is arbitrarily close to the bound obtained in each case.
We shall see that the bias is minimized by increasing the number of steps and is usually not affected by increasing the number of observations taken on each step. The restriction that α > 1/2 implies for Case II that a > 2, and so this case is usually not of practical significance.

It follows from (3.7) that, for Case I, the order of the variance increases as either a decreases or α increases. Moreover, the effect of a change in a becomes less pronounced as α increases, and it vanishes at α = 1. Thus, if the number of steps is sufficient to discount a serious bias, the experimenter loses little efficiency in a statistical sense by moderately increasing the number of observations on each step and decreasing the number of steps when α is close to 1.
However, by taking observations at fewer levels of the controlled
variable (X), the experimenter decreases the time required for the
experiment and possibly the cost.
3.3 Effect of Constraining the K-W Process
It is difficult to study the advantages and disadvantages of
constraining the random variables via a strict mathematical treatment.
Rather, using an intuitive approach, we shall study an example which
illustrates when we may improve the K-W process by constraining the
random variables.
Let Z ~ N(μ, σ²). It is clear by Lemma 2.1 and the symmetry of the normal distribution that

    1. sgn(μ^T) = sgn(μ);
    2. μ^T ≈ μ if |μ| << T;   (3.9)
    3. |μ^T| < |μ| if |μ| > T;
    4. Var Z^T ≈ σ² if |μ| << T and σ << T;
    5. and Var Z^T < σ² if σ > T,

where Z^T denotes Z constrained to [-T, T] and μ^T = E(Z^T).
Partition the x-axis into the intervals I_1 = (-∞, θ - k), I_2 = [θ - k, θ + k], and I_3 = (θ + k, ∞). Assume that |M'(a)| << |M'(b)| for a ∈ I_1 and b ∈ I_3. Assume that T_n ≡ T, c_n ≡ C, and σ² are such that T < σ and |M(x - C) - M(x + C)| << T for x ∈ I_1. (These assumptions on M(x) are often realized in various threshold phenomena.)

Let X_n and X_n^T denote the K-W process and the CKW process respectively, and assume that X_1 = X_1^T = x.
To compare the processes we can use the observations in (3.9) with Z = Y(x - C) - Y(x + C). Using (5) and the fact that T < σ, we obtain

    Var (Y(x - C) - Y(x + C))^T < Var (Y(x - C) - Y(x + C)).

It follows that, at least for a few steps, the K-W process is more variable about its expected value. In addition to this, as long as both processes remain in the same interval, we can visualize the following cases.

Case I: x ∈ I_1. Since |M(x - C) - M(x + C)| << T for x ∈ I_1, (2) of (3.9) implies the processes approach θ at the same expected rate. The fact that M'(a) << M'(b) for a ∈ I_1 and b ∈ I_3 implies that this expected rate is much less than the corresponding rate for either process approaching θ from I_3. That is, the probability that either process moves from I_1 to I_2 in a few steps is relatively small.
It is possibly a little larger for the K-W process due to its more
erratic nature.
Case II: x ∈ I_2. While in this region, the behavior of both processes is characterized by random movement since M(x - C) - M(x + C) ≈ 0 << T. This movement is more erratic for the K-W process if T < 2σ. Moreover, if by chance either process leaves this region, the movement away from θ should most likely be greater for the K-W process due to its sensitivity to observations from the tails of the distribution. This difference increases with an increase in σ or a decrease in T.
Case III: x ∈ I_3. It follows from (3) of (3.9) that the expected rate of approach toward θ is faster for the K-W process. This difference increases with a decrease in T. However, the expected rate of approach toward θ is fast for either process, and so the probability that either process moves from I_3 to I_2 in a few steps is relatively large.
Since both processes tend to move out of the region I_3 more quickly than out of region I_1, we can expect both processes to be biased after a few steps given that X_1 is in a neighborhood of θ. The CKW is less biased for two reasons. Firstly, it tends to move from I_3 into I_2 more slowly, and it tends to move from I_1 into I_2 at about the same rate. Secondly, but probably more importantly, on leaving I_2 the K-W process tends to be thrown farther into regions I_1 and I_3. Due to the usually quick recovery from I_3 and the usually slow recovery from I_1, this can greatly increase the difference in the bias of the estimates.

If X_1 ∈ I_1, neither process seems to be the better, and although the K-W process seems better if X_1 ∈ I_3, the difference quickly disappears since both processes approach θ at a rapid rate. Thus the CKW process seems to be preferable for this example.
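The qualitative claim above can be probed with a small simulation. Nothing below comes from the thesis — the response exp(-x²), the noise level, and the constants are all assumptions for illustration — but it shows the mechanism: clipping the corrective variable keeps the iterate from being thrown deep into the flat region where |M'(x)| is small.

```python
import math, random

def run(T=None, steps=300, sigma=1.0, A=0.5, C=0.5, x0=0.5, seed=0):
    """One K-W run (T=None) or CKW run (finite T) on M(x) = exp(-x^2),
    whose flat tails mimic the threshold behavior discussed above."""
    rng = random.Random(seed)
    M = lambda u: math.exp(-u * u)        # maximum at theta = 0
    x = x0
    for n in range(1, steps + 1):
        a_n, c_n = A / n, C               # constant c_n, per the example
        z = (M(x - c_n) + rng.gauss(0, sigma)) - (M(x + c_n) + rng.gauss(0, sigma))
        if T is not None:
            z = max(-T, min(T, z))        # CKW clips the corrective variable
        x -= (a_n / c_n) * z
    return x

kw  = sum(abs(run(T=None, seed=s)) for s in range(200)) / 200
ckw = sum(abs(run(T=0.5,  seed=s)) for s in range(200)) / 200
print(kw > ckw)   # the unconstrained runs end, on average, farther from theta
```

A large early noise draw sends the K-W iterate past |x| ≈ 1.5, where the gradient is nearly dead and recovery is slow; the clipped process cannot take such a jump.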
The essential feature of this example is the existence of two regions, both excluding θ, that are differentiated by the magnitude of M'(x). Other examples with this feature can be constructed, but we note only that regression functions of the form e^{-x²} have this feature. The advantage of the CKW process is that of lessening the probability of a deep penetration into the region for which |M'(x)| is small.

The optimal choice of T is largely determined by σ² and the regions I_2 and I_3. Clearly, if 4σ < T, where σ² = Var Y(x), constraining is of little consequence in I_2, and if it is of consequence in I_3, it is a negative factor. Thus, we must choose T so as to decrease the variability in I_2 without sacrificing too much of the information that I_3 contains with respect to the direction of θ. As either more observations are taken at each step, or σ² is decreased, this becomes more difficult to do.
Before leaving this subject, it should be observed that as a_n → 0, the influence of each iteration goes to zero. This implies that both processes become characterized by their expected rate of approach toward θ. In any small neighborhood of θ, toward which both processes tend, this is very similar.
3.4 Expected Small Sample Behavior of the K-W Process
In this section we shall discuss the effects that the parameters
α and w, the starting value (X_1), and the shape of the response curve have on the behavior of the K-W process. Although the CKW process will
not be discussed, it will be seen that most of the results in this sec-
tion apply to it as well.
Let I denote an interval containing θ, and assume that M'(x) = B for x ∈ I^c. Assume that the distribution function of Y(x) - M(x) is independent of x. Let Z(X_n) = Y(X_n - c_n) - Y(X_n + c_n), and Var Z(X_n) = 2σ². We have by (3.1) that

    X_{n+1} - θ = X_n - θ - a_n c_n^{-1}(M(X_n - c_n) - M(X_n + c_n)) - a_n c_n^{-1} Z(X_n).   (3.10)
If X_k >> θ, the process will remain in I^c for a few iterations with a high probability. Thus if n - k is not too large, (3.10) becomes

    X_{n+1} - θ = (X_n - θ) - 2B a_n - a_n c_n^{-1} Z(X_n).

By induction

    X_{n+1} - θ = (X_k - θ) - 2B Σ_{m=k}^n a_m - Σ_{m=k}^n a_m c_m^{-1} Z(X_m),

where, conditional on X_k >> θ, Z(X_m), m = k, ..., n, are independent random variables. It follows that

    E(X_{n+1} - θ | X_k >> θ) = (X_k - θ) - 2B Σ_{m=k}^n a_m,   (3.11)

and

    Var(X_{n+1} - θ | X_k >> θ) = 2σ² Σ_{m=k}^n a_m² c_m^{-2}.   (3.12)
Let a_n = A n^{-α} and c_n = C n^{-w}. It is a well-known result that

    Σ_{m=k}^n m^{-p} ≈ ∫_k^{n+1} x^{-p} dx   (3.13)

if p > 0. Substituting the expressions for a_n and c_n into (3.11) and (3.12) and applying (3.13), we obtain

    E(X_{n+1} - θ | X_k >> θ) ≈ X_k - θ - 2BA(ln(n+1) - ln k),   α = 1,

(with ((n+1)^{1-α} - k^{1-α})/(1 - α) replacing the logarithmic difference when α < 1), and

    Var(X_{n+1} - θ | X_k >> θ) ≈ σ² A² C^{-2} (α - w - 1/2)^{-1} (k^{1-2(α-w)} - (n+1)^{1-2(α-w)}).
The above expressions indicate that if we are approaching θ along a constant slope, our expected rate of approach is proportional to the slope and increases very rapidly as α decreases. Furthermore, the variance of our rate of approach decreases as either α increases or w decreases.
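The closed forms (3.11) and (3.12) can be checked by Monte Carlo. This sketch is not from the thesis; the constant-slope response, the parameter values, and the sample sizes are assumptions, and the sums are computed directly rather than through the integral approximation (3.13).

```python
import random

A, C, B, sigma, alpha, w, x0, steps = 0.05, 1.0, 1.0, 0.3, 1.0, 0.25, 10.0, 60

def run(seed):
    """K-W recursion on a stretch where the slope magnitude is the constant B,
    so each step is exactly (X_n - theta) - 2 B a_n - (a_n/c_n) Z(X_n)."""
    rng = random.Random(seed)
    x = x0                                   # distance from theta
    for n in range(1, steps + 1):
        a_n, c_n = A * n ** -alpha, C * n ** -w
        z = rng.gauss(0, sigma) - rng.gauss(0, sigma)   # Z(X_n), Var = 2 sigma^2
        x = x - 2 * B * a_n - (a_n / c_n) * z
    return x

runs = [run(s) for s in range(2000)]
mean = sum(runs) / len(runs)
var = sum((r - mean) ** 2 for r in runs) / len(runs)

# The exact conditional mean (3.11) and variance (3.12):
mean_3_11 = x0 - 2 * B * sum(A * n ** -alpha for n in range(1, steps + 1))
var_3_12 = 2 * sigma ** 2 * sum((A * n ** -alpha) ** 2 * (C * n ** -w) ** -2
                                for n in range(1, steps + 1))
print(round(mean, 3), round(mean_3_11, 3))   # sample moments track (3.11)
print(round(var, 4), round(var_3_12, 4))     # and (3.12)
```

Because the recursion here is exactly linear, the sample mean and variance are unbiased for (3.11) and (3.12) and differ from them only by sampling error.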
Thus if we expect to approach θ in a region in which M'(x) is fairly constant, we should generally decrease both α and w. It may be wise to let c_n and a_n be constant until the sequence {X_n} no longer shows a trend in a particular direction. This lack of a trend indicates we may be in a neighborhood of θ.
In this example we are trying to decrease the bias for a given
number of observations. It is clear from (3.11) that the bias is not
decreased by increasing the number of observations on each step. There-
fore, unless it is practically restrictive, it is best to take a mini-
mum number of observations on each step until we are in a neighborhood of θ.
The previous example reveals the behavior of the K-W process for
a few iterations while in a region in which M'(x) is fairly constant.
However, the assumptions of this example are certainly violated in a
small neighborhood of 8, or in a region in which M'(x) cannot be ap-
proximated by a constant.
A more reasonable approximation, at least in a neighborhood of θ, is to assume that M(x) is a quadratic function of x. That is, assume

    M(x) = -(B/2)(x - θ)²   (3.14)

for B > 0. Let a_n = A n^{-α} and c_n = C n^{-w}. Let Z(X_n) = Y(X_n - c_n) - Y(X_n + c_n) - (M(X_n - c_n) - M(X_n + c_n)), and assume that the distribution function of Z(X_n) given X_n = x is independent of x. Let Var Z(X_n) = 2σ².

Using (3.10) we have

    X_{n+1} - θ = X_n - θ + A n^{-α} c_n^{-1}(B/2)((X_n - c_n - θ)² - (X_n + c_n - θ)²)
                      - AC^{-1} n^{-(α-w)} Z(X_n)
                = (1 - 2ABn^{-α})(X_n - θ) - AC^{-1} n^{-(α-w)} Z(X_n).   (3.15)
Since, conditional on X_n = x, Z(X_n) is independent of x, we obtain by induction that

    X_{n+1} - θ = Π_{j=k}^n (1 - 2ABj^{-α}) (X_k - θ)
                      - AC^{-1} Σ_{m=k}^n Π_{j=m+1}^n (1 - 2ABj^{-α}) m^{-(α-w)} Z(X_m),

where Π_{j=n+1}^n (1 - 2ABj^{-α}) = 1. It follows that

    E(X_{n+1} - θ | X_k) = Π_{j=k}^n (1 - 2ABj^{-α}) (X_k - θ),

and

    E((X_{n+1} - θ)² | X_k) = D(k,n)(X_k - θ)² + 2A²C^{-2}σ² Σ_{m=k}^n D(m+1,n) m^{-2(α-w)},   (3.16)

where D(m,n) = Π_{j=m}^n (1 - 2ABj^{-α})². The last two expressions of (3.16) are, conditional on X_k, the square of the bias and the variance of X_{n+1}, respectively.
Using results similar to (3.13), it can be shown that there exists δ_k → 0 such that, for every k and n,

    (1 - δ_k) exp{-4AB(1-α)^{-1}(n^{1-α} - k^{1-α})} ≤ D(k,n)
        ≤ (1 + δ_k) exp{-4AB(1-α)^{-1}(n^{1-α} - k^{1-α})},   α < 1,   (3.17)

and

    (1 - δ_k) k^{4AB} n^{-4AB} ≤ D(k,n) ≤ (1 + δ_k) k^{4AB} n^{-4AB},   α = 1.   (3.18)
We can use the results of (3.17), (3.18), and Corollary (3.1) to obtain asymptotic expressions for the bias and variance of X_n in this example. In fact, it follows from (3.16), (3.17), and (3.18) that

    D(k,n)(X_k - θ)² = ((X_k - θ)² + o(1)) exp{-4AB(1-α)^{-1} n^{1-α}},   α < 1,
                     = ((X_k - θ)² + o(1)) n^{-4AB},   α = 1.   (3.19)

Although Corollary (3.1) does not guarantee that Var(X_n) exists, it does imply that n^{β/2}(X_n - θ), with β/2 = (α - 2w)/2, tends to be distributed as a normal random variable with variance equal to

    2A²σ²/(C²(4AB - (α - 2w))),   α = 1,   (3.20)

for A > (α - 2w)/4B if α = 1. Many writers have noted that the variance expression for α = 1 is minimized with respect to A if A = (α - 2w)/2B.
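A simulation suggests how quickly the conditional variance approaches the limit (3.20) in the quadratic case. Again, nothing here is from the thesis — all constants are assumptions chosen to satisfy α = 1 and A > (α - 2w)/4B.

```python
import random

A, B, C, w, sigma, alpha, theta, N = 1.0, 1.0, 1.0, 0.25, 0.5, 1.0, 0.0, 2000

def final_x(seed):
    """K-W run on M(x) = -(B/2)(x - theta)^2 with a_n = A/n, c_n = C n^{-w}."""
    rng = random.Random(seed)
    x = 1.0
    for n in range(1, N + 1):
        a_n, c_n = A * n ** -alpha, C * n ** -w
        y_lo = -(B / 2) * (x - c_n - theta) ** 2 + rng.gauss(0, sigma)
        y_hi = -(B / 2) * (x + c_n - theta) ** 2 + rng.gauss(0, sigma)
        x -= (a_n / c_n) * (y_lo - y_hi)
    return x

xs = [final_x(s) for s in range(300)]
emp = N ** (alpha - 2 * w) * sum(v * v for v in xs) / len(xs)
lim = 2 * A ** 2 * sigma ** 2 / (C ** 2 * (4 * A * B - (alpha - 2 * w)))
print(round(emp, 3), round(lim, 3))   # the two agree to within sampling error
```

The bias term n^{-4AB} decays much faster than the n^{-(α-2w)} variance here, so the scaled second moment is essentially the variance being compared to (3.20).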
It is reasonably evident from the nature of D(k,n) that the bias of the K-W procedure is diminished by decreasing α. Asymptotically this is clear from (3.19). However, it follows from (3.5) that β/2 = (α - 2w)/2 < α/3. Thus the asymptotic variance increases as α decreases. Again, until X_n is in a neighborhood of θ, it is probably best to let α and w be small. Once X_n is in a neighborhood of θ, α should be increased.
It is important to note that in this example the bias is not a function of c_n, and, as is to be expected, the variance decreases as either C increases or w decreases. It follows that we can set w = 0. By doing so, γ (see (3.4)) becomes α/2. Hence with α = 1 we obtain an asymptotic variance of O(n^{-1}). As was noted in Chapter 1, it is the symmetry of M(x) about θ that allows us to assume w = 0. It is clear that if M(x) is reasonably symmetric about θ, and if only a rough estimate of θ is needed, then it is best to let w ≈ 0.

Suppose we can assume that M(x) is symmetric about θ. This implies that we can set c_n ≡ C. Asymptotically, the optimal choice of C maximizes the difference |M(x - C) - M(x + C)| for small deviations of x about θ. It is quite clear then how C should be chosen. We only note here that, for a quadratic response surface, C should be made arbitrarily large, and, for surfaces of the form e^{-x²}, C is determined as in Figure 2.

[Figure 2. Optimal Choice of c_n (= C)]
In this example we have assumed that Var Z(X_n) = 2σ². If Var Z(X_n) = 2[n^{2aw}]^{-1}σ², (3.16) becomes

    E(X_{n+1} - θ | X_k) = D^{1/2}(k,n)(X_k - θ)

and

    Var(X_{n+1} - θ | X_k) = 2A²C^{-2}σ² Σ_{m=k}^n D(m+1,n) m^{-2(α + (a-1)w)}.   (3.21)
3.5 Approximations of the Small Sample Variance
The importance of evaluating the last expression in equation
(3.21) is quite clear. This expression is, conditional on X_k, the small sample variance of X_{n+1} if M(x) is a quadratic function. If each X_m, m = 1, ..., n, lies in a neighborhood of θ in which M(x) is reasonably approximated by a quadratic function, it is a close approximation to this variance for more general response functions.
At present, the only simple, but adequate, approximations to this expression rely heavily on asymptotic theory. We now obtain bounds for this expression using an appeal to asymptotic theory that is satisfied for values of k and n usually realized in practice.

Let

    b_{(m,n)} = m^{-φ} Π_{j=m+1}^n (1 - D j^{-α})²,   (3.22)

where D, φ > 0. For simplicity, let b_m ≡ b_{(m,n)}, keeping in mind the dependence on n.
Using (3.22) we obtain

    b_m - b_{m-1} = b_m(1 - b_m^{-1} b_{m-1}) = b_m(2D m^{-α} - φ m^{-1} + O(m^{-2α})).   (3.23)

Thus, for k sufficiently large, {b_m} is an increasing sequence for m > k if either α < 1, or α = 1 and 2D > φ.

Since m^{-α} - (m-1)^{-α} = O(m^{-(1+α)}), the second difference is given by

    (b_m - b_{m-1}) - (b_{m-1} - b_{m-2}) = (2D m^{-α} - φ m^{-1})(b_m - b_{m-1})
                                                + O(m^{-2α})(b_m - b_{m-1}).   (3.24)

Thus, for k sufficiently large, {b_m - b_{m-1}} is an increasing sequence for m > k if either α < 1, or α = 1 and 2D > φ.
The last expression of (3.21) is of the form Σ_{m=k}^n b_m. It is clear from equations (3.23) and (3.24) that, for sufficiently large k, upper and lower bounds can be obtained for this sum using a geometrical argument. See Figure 3 for a representation of the elements b_m that illustrates this approach. In fact,

    Σ_{m=k}^n b_m ≥ Area △ADE - Area △ABC,   k > A,

and

    Σ_{m=k}^n b_m ≥ Area △ADE,   k < A.   (3.25)

[Figure 3. Graph of the Sequence {b_m}]

Also,

    Σ_{m=k}^{n-1} b_m < (n - k) b_n.

This geometrical approach will be used in only one case in the following lemmas. However, every bound will be obtained by finding sequences of functions {f_n(x)} such that either

    f_n(m) ≥ b_m,   m = k, ..., n,

or

    f_n(m) ≤ b_m,   m = k, ..., n.

The respective bound will be obtained by integrating f_n(x) over a region such as (k, n+1) or (k-1, n).
Lemma (3.1): Let the sequence {b_m} be defined by (3.22) with α = 1. Let λ_1, λ_2, φ, and D be real numbers such that φ < λ_1 < 2D < λ_2. Then there exists an N(λ_1, λ_2) such that, for every k, n > N(λ_1, λ_2),

    (λ_2 - (φ-1))^{-1} n^{-λ_2} (n^{λ_2-(φ-1)} - (k-1)^{λ_2-(φ-1)}) ≤ Σ_{m=k}^n b_m
        ≤ (λ_1 - (φ-1))^{-1} n^{-λ_1} ((n+1)^{λ_1-(φ-1)} - k^{λ_1-(φ-1)}).

Proof: Consider the function sequence {f_n(x,λ)} defined by

    f_n(x,λ) = n^{-λ} x^{λ-φ}   (3.26)

for x > 0. Clearly, by (3.22),

    f_n(n,λ) = n^{-φ} = b_n.   (3.27)

We have, using (3.26), that

    f_n(m,λ) ≥ b_m   (≤ b_m)   (3.28)

if and only if

    n^{-λ} ≥ m^{φ-λ} b_m   (≤ m^{φ-λ} b_m).   (3.29)

Let d_m(λ) = m^{φ-λ} b_m. We have

    d_m(λ)/d_{m-1}(λ) = 1 + (2D - λ) m^{-1} + O(m^{-2})

by (3.23). Since λ_1 < 2D < λ_2, there exists an N = N(λ_1, λ_2) such that the sequences {d_m(λ_1)}_{m>N} and {d_m(λ_2)}_{m>N} are increasing and decreasing respectively.

But (3.28), and hence (3.29), holds with equality for m = n. Thus, by the monotonicity of the sequences {d_m(λ_1)} and {d_m(λ_2)}, (3.29), and hence (3.28), holds for all m > N(λ_1, λ_2).

Since φ < λ_1 < λ_2, {f_n(x,λ_1)} and {f_n(x,λ_2)} are sequences of increasing functions. It follows that, for k > N(λ_1, λ_2),

    ∫_{k-1}^n n^{-λ_2} x^{λ_2-φ} dx ≤ Σ_{m=k}^n b_m ≤ ∫_k^{n+1} n^{-λ_1} x^{λ_1-φ} dx.   (3.30)

Lemma (3.1) is obtained by integrating the above expression.
Lemma (3.2): Let the sequence {b_m} be defined as in Lemma (3.1). Let λ_1, λ_2, φ, and D be such that 0 < φ - 1 < λ_1 < 2D < λ_2 < φ. Then there exists an N(λ_1, λ_2) such that, for every k, n > N(λ_1, λ_2),

    (λ_2 - (φ-1))^{-1} n^{-λ_2} ((n+1)^{λ_2-(φ-1)} - k^{λ_2-(φ-1)}) ≤ Σ_{m=k}^n b_m
        ≤ (λ_1 - (φ-1))^{-1} n^{-λ_1} (n^{λ_1-(φ-1)} - (k-1)^{λ_1-(φ-1)}).

Proof: The proof is a repetition of the proof of Lemma (3.1) except that, in this case, {f_n(x,λ)} is a sequence of decreasing functions. Thus the ranges of integration in (3.30) must be reversed. The restriction that λ_1 > φ - 1 is to prevent f_n(x,λ) from being of the form x^{-1}.
Lemma (3.3): Let the sequence {b_m} be defined by (3.22) with α < 1. Let 0 < λ < 2D. Then there exists an N(λ) such that

    b_n² (b_n - b_{n-1})^{-1} (1 - exp{2λ n^{-α}(k-n)})/2 ≤ Σ_{m=k}^n b_m
        ≤ λ^{-1} n^{-φ+α} (exp{λ n^{-α}} - exp{λ n^{-α}(k-n)}).

Proof: Consider the function sequence {f_n(x,λ)} defined by

    f_n(x,λ) = b_n exp{λ n^{-α}(x - n)}   (3.31)

for x > 0. We have

    f_n(m,λ) ≥ b_m   (3.32)

if and only if

    d_m ≤ d_n,   (3.33)

where d_m = b_m exp{-λ m n^{-α}}. Then

    d_m/d_{m-1} = (1 + 2D m^{-α} + O(m^{-1})) exp{-λ n^{-α}}   (3.34)

by (3.23). Since λ < 2D and m ≤ n, there exists an N(λ) such that {d_m}_{m>N(λ)} is an increasing sequence.

But (3.32), and hence (3.33), holds for m = n. Thus, by the monotonicity of the sequence {d_m}_{m>N(λ)}, (3.33), and hence (3.32), holds for all m > N(λ).

Since f_n(x,λ) is an increasing function for all n,

    Σ_{m=k}^n b_m ≤ b_n exp{-λ n^{1-α}} ∫_k^{n+1} exp{λ n^{-α} x} dx.

The right-hand inequality of Lemma (3.3) follows by integration and by noting that b_n = n^{-φ}.

To obtain the left-hand inequality of Lemma (3.3), we note that the slope of the line AE (see Figure 3) is given by b_n - b_{n-1}. Therefore D - A = b_n(b_n - b_{n-1})^{-1} and (B - A) = (C - B)(b_n - b_{n-1})^{-1}. But, for k > N(λ), (3.31) and (3.32) imply that

    C - B ≤ b_n exp{λ n^{-α}(k - n)}.

It follows from (3.25) that, for k > N(λ),

    Σ_{m=k}^n b_m ≥ Area △ADE - Area △ABC
                 ≥ b_n² (b_n - b_{n-1})^{-1} (1 - exp{2λ n^{-α}(k-n)})/2.

Thus Lemma (3.3) is proven.
We now use Lemmas (3.1), (3.2), and (3.3) to calculate asymptotic bounds for the variance.

Theorem (3.2): Let the sequence {b_m} be defined by (3.22). Then, for fixed k, if α < 1,

    (4D)^{-1} + o(1) ≤ n^{φ-α} Σ_{m=k}^n b_m ≤ (2D)^{-1} + o(1),

and if α = 1 and 2D > φ - 1 > 0,

    n^{φ-α} Σ_{m=k}^n b_m = (2D - (φ-1))^{-1} + o(1).

Proof: We see from (3.16) and (3.22) that b_m = D(m+1,n) m^{-φ} if D = 2AB. Using (3.16), (3.19), and, for α = 1, the fact that 2D > φ - 1, we have n^{φ-α} b_k → 0 for any fixed k. Thus we need consider only the sum from m > N, where N is chosen to satisfy the previous lemmas.

The case in which α = 1 and 2D ≠ φ follows directly from Lemmas (3.1) and (3.2), since λ_1 and λ_2 of these lemmas can be chosen arbitrarily close to 2D. It is true for φ = 2D since Σ_{m=k}^n b_m is a monotonic function of D for every k and n.

Again, the fact that λ is arbitrary in Lemma (3.3) implies the right-hand inequality of Theorem (3.2) when α < 1. The left-hand inequality holds from Lemma (3.3) since

    n^{φ-α} b_n² (b_n - b_{n-1})^{-1} (1 - exp{2λ n^{-α}(k-n)})/2 = (4D)^{-1} + o(1)

by (3.22) and (3.23).
We can apply Theorem (3.2) to the example in which M(x) is a quadratic function by letting D = 2AB and φ = 2(α - w). Thus, conditional on X_k, the asymptotic variance of n^{(α-2w)/2}(X_n - θ) is equal to ρ 2A²σ²/(C²(4AB - β₊)), where ρ = 1 if α = 1, 1/2 ≤ ρ ≤ 1 if α < 1, and β₊ is as in Corollary (3.1). This is in agreement with (3.20). It also points out the important fact that if M(x) is quadratic, the variance of X_n converges to the variance of its asymptotic distribution given by Corollary (3.1). The same result for a more general class of regression functions has been obtained by many writers using different approaches.
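Theorem (3.2) is easy to check numerically for the α = 1 case. The values of D, φ, n, and k below are assumptions for illustration only.

```python
# b_m = m^{-phi} * prod_{j=m+1}^{n} (1 - D j^{-alpha})^2, accumulated by a
# backward recursion that extends the product one factor at a time.
D, phi = 2.0, 1.5       # alpha = 1, with 2D > phi - 1 > 0
n, k = 20000, 5

prod, total = 1.0, 0.0
for m in range(n, k - 1, -1):
    total += m ** -phi * prod
    prod *= (1.0 - D / m) ** 2      # include the factor j = m for the next term

limit = 1.0 / (2 * D - (phi - 1))   # Theorem (3.2): n^{phi-1} * sum -> this
print(round(n ** (phi - 1) * total, 3), round(limit, 3))
```

The scaled sum and the limit 1/(2D - (φ-1)) agree to about three decimals at this n, which is the sense in which the bounds are "satisfied for values of k and n usually realized in practice."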
We now consider the extent to which Lemmas (3.1), (3.2), and (3.3) rely on asymptotic theory. We first note that it is not necessary for the sequence {d_m} to be monotonic in order that either (3.29) or (3.33) hold.

Consider the upper bounds established in Lemmas (3.1) and (3.2). The appeal to asymptotic results lies in the value of N such that, for all m > N,

    (1 - D m^{-1})² ≤ (1 - m^{-1})^{λ_2},   (3.35)

where λ_2 < 2D.
If D = 1, (3.35) holds for N = 1. Using the Taylor expansion, we find that (3.35) is equivalent to

    1 - 2D m^{-1} + D² m^{-2} ≤ Σ_{j=0}^∞ (Π_{i=0}^{j-1}(λ_2 - i)/j!)(-m^{-1})^j.   (3.36)

The right-hand side of (3.36), being the expansion of (1 - m^{-1})^{λ_2}, decreases as λ_2 increases. Thus N is less than or equal to the value of N obtained for λ_2 = 2D. Assuming λ_2 = 2D and considering terms up to order m^{-3}, we find N is approximately the solution of

    D² x^{-2} = D(2D - 1) x^{-2} - D(2D - 1)(2D - 2) x^{-3}/3.

This solution is given by x = 2(2D - 1)/3. Actually, if 2D > 3, the coefficient of m^{-4} is positive, implying the value of N is less than 2(2D - 1)/3 up to order m^{-4}.
It is clear that the upper bounds in Lemmas (3.1) and (3.2) are very practical bounds for moderate D. Moreover, we have already noted that, for α = 1, (3.20) is minimized at A = (1 - 2w)/2B. The corresponding value of 2D is 2(1 - 2w), giving N ≈ 2.
The appeal to asymptotic theory for the upper bound of Lemma (3.3) lies in the minimum value of N such that

    exp{λ n^{-α}}(1 - D m^{-α})² ≤ (1 - m^{-1})^{φ}   (3.37)

for all m, n > N (see (3.23) and (3.34)).

For the K-W process, φ = 2(α + (a-1)w). Since a is usually less than one, we shall consider the case φ < 2. It follows from (3.23) that if 2D is not significantly less than φ, then (1 - D m^{-α})²(1 - m^{-1})^{-φ} is increasing in m for quite small m. Let N_0 be such that this is the case for m > N_0. Then, if (3.37) holds for m = N_0, it holds for all m > N_0. Let N_1 be the smallest value for which (3.37) holds with m = n, and for which N_1 ≥ N_0. We then have that (3.37) holds for m, n ≥ N_1. Thus, if either 2D ≈ φ or 2D > φ, N_1 is a reasonable approximation for the minimum solution of (3.37).
If we let m = n and neglect terms of order n^{-3α} in (3.37), we obtain an approximation of N_1 given by the minimum solution of

    1 + (λ - 2D)x^{-α} + (λ²/2 + D² - 2Dλ)x^{-2α} ≤ 1 - φ x^{-1} + φ(φ - 1)x^{-2}/2.

Since φ > 1 for the K-W process, the coefficient of x^{-2} is positive. We therefore drop this term, knowing that the resulting minimum solution is increased.

If we let λ = CD for 0 < C < 2, the above inequality is surely satisfied if x is greater than the minimum solution of

    (λ - 2D)x^{-α} + D²(C²/2 - 2C + 1)x^{-2α} ≤ -φ x^{-1}.

For 2 - √2 < C < 2 the quadratic in C is negative. Thus if C is so restrained, our approximation is less than the solution of

    (2D - λ) x^{-α} = φ x^{-1},

or

    y = (φ/(2D - λ))^{1/(1-α)}.
3.6 Estimation of E(X_n - θ)²

A comparison of (3.11) and (3.12) with (3.16) quickly reveals that (3.16) is generally not the correct expression for E((X_{n+1} - θ)² | X_k). However, it often serves as a reasonable approximation when the variables {X_m ± c_m}_{m=k}^n lie in a region of θ in which M(x) is reasonably approximated by a quadratic function. We assume in this section that the process has been in such a region for m = k_0, ..., n.
A natural estimator of n^{α-2w} E(X_n - θ)² is the right-hand expression of (3.16) after the parameters have been replaced by their estimators and the expression has been suitably normalized. Using this estimator, the problem becomes one of choosing a good value for k and estimating the parameters.
Assume, for the moment, that estimates of the parameters are available. The squared bias expression in (3.16) is minimized as either k decreases or X_k approaches θ. This suggests that a good choice for X_k would be close to X_{n+1} (the estimate of θ) and such that k ≥ k_0. Using this method, the squared bias expression can be ignored if n - k_0 is of any consequence.
A natural estimate of the variance of X_n, conditional on X_k, would be to substitute the estimate for -4AM''(θ) (= 2D) into the upper bounds given in Lemmas (3.1), (3.2), and (3.3). If n - k is moderately large, this estimate (after normalizing by n^{α-2w}) closely approximates the corresponding estimate of the asymptotic variance given in Corollary (3.1).
The parameters, for which we have assumed estimates exist, are σ² and M''(θ). Carrying the assumption that M(x) is approximately quadratic a little further, we can obtain estimates of these parameters using regression analysis.
It is interesting to consider Fisher's information for M''(θ) when Y(x) ~ N(-B(x-θ)², σ²). We have

    -d²/dB² (C - (y + B(x-θ)²)²/(2σ²)) = d/dB ((y + B(x-θ)²)(x-θ)²/σ²)
                                       = (x-θ)⁴/σ².

Thus, conditional on X_i, i = k_0, ..., n, the information is given by (assuming 2[m^{2aw}] observations are taken on each step)

    σ^{-2} Σ_{m=k_0}^n [m^{2aw}]((X_m - c_m - θ)⁴ + (X_m + c_m - θ)⁴) = O(Σ_{m=k_0}^n c_m⁴ m^{2aw}).
If c_n = n^{-w} and w is arbitrarily close to its lower bound (α/(2(3-a))), the information is bounded below by a quantity of order

    O(n^{1 - α(2-a)/(3-a) - ε})

for arbitrarily small ε. This expression increases as either α decreases or a increases, until a = 2. Moreover, since (2-a)/(3-a) ≤ 2/3 for small a, the usual regression estimators of σ² and M''(θ) will yield consistent estimates if M(x) is quadratic.
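The regression idea can be sketched as follows. Nothing here is from the thesis: the constants, the single shared data stream, and the use of ordinary least squares on the pooled (level, response) pairs are all illustrative assumptions under the quadratic model (3.14), for which M''(θ) = 2β₂ when y is fitted as β₀ + β₁u + β₂u².

```python
import random

random.seed(3)
B, theta, sigma = 2.0, 1.0, 0.4
M = lambda u: -(B / 2) * (u - theta) ** 2      # quadratic model (3.14)

# Run a K-W process and keep every (level, response) pair it generates.
pts, x = [], 0.3
for n in range(1, 400):
    c = n ** -0.2
    for u in (x - c, x + c):
        pts.append((u, M(u) + random.gauss(0, sigma)))
    x -= (1.0 / n) / c * (pts[-2][1] - pts[-1][1])   # K-W update from same data

def quad_fit(pts):
    """Least squares for y = b0 + b1 u + b2 u^2 via the 3x3 normal equations."""
    S = [[0.0] * 4 for _ in range(3)]
    for u, y in pts:
        row = (1.0, u, u * u)
        for i in range(3):
            for j in range(3):
                S[i][j] += row[i] * row[j]
            S[i][3] += row[i] * y
    for i in range(3):                   # Gaussian elimination
        for j in range(i + 1, 3):
            f = S[j][i] / S[i][i]
            for t in range(i, 4):
                S[j][t] -= f * S[i][t]
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                  # back substitution
        b[i] = (S[i][3] - sum(S[i][j] * b[j] for j in range(i + 1, 3))) / S[i][i]
    return b

b0, b1, b2 = quad_fit(pts)
mpp = 2 * b2                                        # estimate of M''(theta) = -B
s2 = sum((y - (b0 + b1 * u + b2 * u * u)) ** 2 for u, y in pts) / (len(pts) - 3)
print(round(mpp, 1), round(s2, 2))   # near -B = -2.0 and sigma^2 = 0.16
```

The same observations that drive the process thus also estimate the two quantities needed for the variance bounds of the previous section.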
CHAPTER 4. THE SIGNED KIEFER-WOLFOWITZ PROCESS
4.1 Definition of the Signed Kiefer-Wolfowitz Process
Chapter 3 is a development of a generalization of the K-W process in which the corrective variable ((Y(X_n - c_n) - Y(X_n + c_n))^{T_n}) is constrained to lie between ±T_n. This corrective variable can be characterized as influencing both the direction and magnitude of the adjustment at the n'th step. In this chapter we shall develop a modification of the K-W process in which the corrective variable influences only the direction of this adjustment.
Definition (4.1). Let {a_n}, {c_n}, and {r_n} be sequences of positive numbers where [r_n] ≥ 1. Let Y(x) denote a random variable with mean M(x) and the distribution function F(·|x). Let

    δ(x,c_n) = [r_n]^{-1} Σ_{i=1}^{[r_n]} δ_i(x,c_n),

where

    δ_i(x,c_n) =  1  if Y_i(x-c_n) > Y_i(x+c_n),
               =  0  if Y_i(x-c_n) = Y_i(x+c_n),
               = -1  if Y_i(x-c_n) < Y_i(x+c_n),

i = 1, ..., [r_n]. Let X_1 be fixed, and let the random variable sequence {X_n} be defined by

    X_{n+1} = X_n - a_n c_n^{-1} δ(X_n, c_n),    (4.1)

where Y_i(X_n - c_n) and Y_i(X_n + c_n), i = 1, ..., [r_n], are random variables which, conditional on δ(X_1,c_1), ..., δ(X_{n-1},c_{n-1}), X_1, ..., X_{n-1}, and X_n = x_n, are distributed independently as F(·|x_n - c_n) and F(·|x_n + c_n).

This process will be called the Signed Kiefer-Wolfowitz process, and will be referred to as the SKW process. When [r_n] ≡ r (a positive integer) the SKW becomes the process first proposed by Wilde [13]. Wilde did not attempt to develop the process except to note that it seemed less sensitive to the variance of Y(x) than the K-W process. The only other attempt to study this process seems to be Springer's [11] work which was discussed in Chapter 1.
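The recursion in Definition (4.1) is simple to simulate. The sketch below, a minimal illustration rather than anything from the text, assumes a quadratic response M(x) = -(x-2)² with additive normal noise and a hypothetical step-size schedule; the function names and constants are all illustrative.

```python
import random

def skw_step(x, a_n, c_n, r_n, sample_y):
    """One Signed K-W step: X_{n+1} = X_n - (a_n/c_n) * delta(X_n, c_n),
    where delta averages the signs of [r_n] paired comparisons."""
    signs = 0
    for _ in range(r_n):
        lo, hi = sample_y(x - c_n), sample_y(x + c_n)
        signs += (lo > hi) - (lo < hi)      # delta_i in {-1, 0, 1}
    return x - (a_n / c_n) * (signs / r_n)

def sample_y(x):
    # Illustrative response: maximum at theta = 2, noise sd 1.
    return -(x - 2.0) ** 2 + random.gauss(0.0, 1.0)

random.seed(1)
x = 0.0
for n in range(1, 2001):
    # a_n = 2 n^{-1}, c_n = n^{-1/4} (alpha = 1, w = 1/4), r_n = 1.
    x = skw_step(x, 2.0 * n ** -1.0, n ** -0.25, 1, sample_y)
print(round(x, 1))  # settles near theta = 2
```

Only the sign of each paired comparison enters the correction, so the step sizes are fixed by {a_n} and {c_n} alone, exactly the property the chapter develops.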
The SKW process is intrinsically different from the K-W process. The K-W process seeks the value (θ) such that E[Y(θ)] ≥ E[Y(x)] for all x. The SKW process seeks the value (θ') such that

    P[Y(θ') > Y(x) | Y(θ') ≠ Y(x)] ≥ 1/2

for all x. Notice that if Y(x) is such that E[Y(x)] > E[Y(x')] for x ≠ x' implies that P[Y(x) > Y(x') | Y(x) ≠ Y(x')] > 1/2, then θ = θ'. In particular, if Y(x) ~ N(M(x), σ²(x)), then θ = θ'.
However, the behavior of the two processes can be markedly different because of their different emphases on the magnitude of the corrective variable.
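For independent normals the comparison probability has a closed form, Φ((μ₁ - μ₂)/√(σ₁² + σ₂²)), so it exceeds 1/2 exactly when μ₁ > μ₂, whatever the two variances; this is why θ = θ' in the normal case. A quick check (the function name is illustrative):

```python
import math

def p_first_larger(mu1, s1, mu2, s2):
    """P[Y1 > Y2] for independent N(mu1, s1^2) and N(mu2, s2^2):
    Phi of the standardized mean difference."""
    z = (mu1 - mu2) / math.hypot(s1, s2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(p_first_larger(1.0, 3.0, 0.0, 0.5) > 0.5)  # True: the larger mean wins
print(p_first_larger(0.0, 0.5, 1.0, 3.0) < 0.5)  # True
```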
When [r_n] = 1, the SKW process can be considered as a limiting case of the CKW process. Let a_n = A n^{-α} for the CKW process with T_n ≡ T, and let a_n = A^s n^{-α} for the SKW process. Then the corrective term for the CKW process

    (A T^{-1} n^{-α} c_n^{-1} (Y(X_n - c_n) - Y(X_n + c_n))^T)

converges in probability to the correction term of the SKW process (A^s n^{-α} c_n^{-1} δ(X_n, c_n)) for any fixed n as T → 0.
Before deriving any basic results it is helpful to discuss the random variable (δ(x,c_n)). We have

    E δ(x,c_n) = P[Y(x-c_n) > Y(x+c_n)] - P[Y(x-c_n) < Y(x+c_n)].    (4.2)

If Y(x) is a continuous random variable for all x, (4.2) can be expressed as

    E δ(x,c_n) = ∫_{-∞}^{∞} (∫_{-∞}^{y} dF(z|x+c_n)) dF(y|x-c_n) - ∫_{-∞}^{∞} (∫_{-∞}^{y} dF(z|x-c_n)) dF(y|x+c_n).    (4.3)
We now consider the important special case in which Y(x) is a normal random variable with density function

    f(y,x) = (2π σ²(x))^{-1/2} exp{-(1/2)((y - M(x))/σ(x))²}.

Substitution into (4.3) gives

    E δ(x,ε) = ∫_{-∞}^{∞} (∫_{-∞}^{y} f(z,x+ε) dz) f(y,x-ε) dy - ∫_{-∞}^{∞} (∫_{-∞}^{y} f(z,x-ε) dz) f(y,x+ε) dy.    (4.4)
Since we shall be interested in the behavior of δ(x,c_n) for c_n small and x in a neighborhood of θ, we shall need estimates of (∂/∂x) E δ(x,ε) and (∂²/∂ε∂x) E δ(x,ε).

Assume that

1. M'(x) is continuous and there exists an open interval I containing θ such that for x ∈ I M''(x) is continuous;

2. 0 < σ²(x) < ∞, σ'(x) is continuous, and σ''(x) is continuous for x ∈ I.

Then the Taylor expansion of f(y,x+ε) gives

    f(y,x+ε) = f(y,x) + ε ((∂/∂τ) f(y,τ))|_{τ=ξ}

for x < ξ < x + ε. Since

    (∂/∂τ) f(y,τ) = f(y,τ)[(y - M(τ))M'(τ)/σ²(τ) + ((y - M(τ))²/σ²(τ) - 1) σ'(τ)/σ(τ)],    (4.5)

assumptions (1) and (2) guarantee that the last expression in (4.5) can be dominated by an integrable function for x in a bounded interval. We can, therefore, express (4.4) as

    E δ(x,ε) = ∫_{-∞}^{∞} [∫_{-∞}^{y} (f(z,x+ε) - f(z,x-ε)) dz] (f(y,x) + ε g(y,x,ε)) dy    (4.7)

where g(y,x,ε) can be dominated by an integrable function.
Since the term in brackets is bounded by ±1 and goes to zero for every z as ε → 0, we have from (4.7) and assumptions (1) and (2) that

    lim_{ε→0} (∂/∂ε) E δ(x,ε) = 2 ∫_{-∞}^{∞} (∫_{-∞}^{y} (∂/∂x) f(z,x) dz) f(y,x) dy
                              = 2 σ(x) ∫_{-∞}^{∞} ((∂/∂x)[(y - M(x))/σ(x)]) f²(y,x) dy
                              = - M'(x)/√π σ(x).    (4.8)
In order to evaluate (∂²/∂ε∂x) E δ(x,ε) at x = θ and ε = 0, we note that, for x in a closed subinterval of I, the continuity of M''(x) and σ''(x) is sufficient to guarantee the existence and continuity of (∂/∂ε)(∂/∂x) E δ(x,ε) and (∂/∂x)(∂/∂ε) E δ(x,ε). Moreover, these second partials are equal, and we have by (4.6), (4.7), and (4.8) that

    lim_{c_n→0} (∂²/∂ε∂x) E δ(x,ε)|_{x=θ, ε=c_n} = - M''(θ)/σ(θ)√π    (4.9)

since M'(θ) = 0.

Thus if Y(x) is a normal random variable, we have

    E δ(x,c_n) = -c_n M'(x)/σ(x)√π + o(c_n),

and, for |x - θ| sufficiently small,

    E δ(x,c_n) = -c_n (x-θ) M''(θ)/σ(θ)√π + o(c_n(x-θ)).
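When σ is constant, the expectation can also be written exactly as E δ(x,ε) = 2Φ(ΔM/(σ√2)) - 1 with ΔM = M(x-ε) - M(x+ε), which makes the linear-in-ε behavior easy to check numerically. A small sketch (the quadratic M is illustrative; only the sign and the scaling are asserted, not the constant):

```python
import math

def e_delta(x, eps, M, sigma):
    """Exact E delta(x, eps) for Y(x) ~ N(M(x), sigma^2), sigma constant:
    2*Phi(dM/(sigma*sqrt(2))) - 1, i.e. erf(dM/(2*sigma))."""
    dm = M(x - eps) - M(x + eps)
    return math.erf(dm / (2.0 * sigma))

M = lambda x: -(x - 2.0) ** 2                    # illustrative response, theta = 2
vals = [e_delta(1.0, e, M, 1.0) for e in (0.05, 0.1, 0.2)]
# M'(1) > 0, so E delta < 0, and it scales roughly linearly in eps:
print(all(v < 0 for v in vals))  # True
```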
If, more generally, Y(x) is a continuous random variable whose density function f(y|x) is continuous in x for all y, then it follows from Definition (4.1) that E δ_i²(x,c_n) = 1 and E δ_i(x,c_n) → 0 as c_n → 0. Since, conditional on X_n = x, δ_i(x,c_n), i = 1, ..., [r_n], are independent random variables, we have

    lim_{c_n→0} Var δ(θ,c_n) = 1.    (4.10)
4.2 Asymptotic Theory of the SKW Process

Corollary (4.1): Let {X_n} be a SKW process with r_n = n^{2aw}. Let {μ_n} be a real number sequence, θ be a real number, a and w be non-negative numbers, and n_0, A^s, C, v, and α be positive numbers such that

a. if n > n_0, then E δ(μ_n,c_n) = 0, and if x ≠ μ_n, then (x - μ_n) E δ(x,c_n) > 0;

b. for every x, E δ(x,0) = 0 and (∂/∂ε) E δ(x,ε) is continuous, and if B = sup c_n, then sup_{|ε|<B, x} |(∂/∂ε) E δ(x,ε)|/(1 + |x|) < ∞;

c. for every x in an open interval (I) containing θ and for every |ε| < B, (∂/∂x) E δ(x,ε)|_{x=θ, ε=0} equals zero, (∂²/∂ε∂x) E δ(x,ε) is continuous, and the sequence {(∂²/∂ε∂x) E δ(x,ε)|_{x=θ, ε=c_n}} converges to A > 0;

d. Var δ(x,c_n) is continuously convergent to σ² at x = θ;

e. F(y|x) is a Borel measurable function of x for all y;

f. (1) 1/2 < α ≤ 1, n^α a_n → A^s, n^w c_n → C, |μ_n - θ| = o(n^{-v}),
   (2) if α < 1, then v > 1 - (2α - 1)/2w;
   (3) and β⁺ = 0 if α < 1, β⁺ = β < 2 A^s A if α = 1, where β = α - 2w.

Then X_n → θ a.e. and n^{β/2}(X_n - θ) converges in distribution to N(0, (A^s)² σ²/C²(2 A^s A - β⁺)).
Proof: Make the correspondence with Theorem (2.1) as follows: a'_n = a_n c_n^{-1}, Z_n(x) = δ(x,c_n), and R_n(x) = E δ(x,c_n).

To verify that conditions (i) through (vii) of Theorem (2.1) are satisfied, we first note that condition (i) is a direct consequence of condition (a).

It follows from condition (b) that

    sup_{n,x} n^w |E δ(x,c_n)|/(1 + |x|) = sup_{n,x} n^w c_n |(∂/∂ε) E δ(x,ε)|_{ε=ξ_n(x)}|/(1 + |x|)

where |ξ_n(x)| < c_n. The last expression is finite by condition (b), and therefore condition (ii) is verified.

To verify condition (iii) we note that condition (b) implies the existence of a function sequence {ξ_n(x)} such that

    |E δ(x,c_n)| = c_n |(∂/∂ε) E δ(x,ε)|_{ε=ξ_n(x)}|

where |ξ_n(x)| < c_n. By condition (f.(1)), c_n converges to a limit, and hence the function sequence {ξ_n(x)} converges to a function, say ξ(x). Moreover, condition (b) implies ξ(x) is continuous, and therefore the function |(∂/∂ε) E δ(x,ε)|_{ε=ξ(x)}| is continuous and attains its infimum (k(δ_1,δ_2)) over the set [δ_1 ≤ |x-θ| ≤ δ_2].

But μ_n → θ by (f.(1)). Therefore if 0 < δ_1 ≤ δ_2 < ∞, there exists an N such that for every n > N we have that [δ_1 ≤ |x-θ| ≤ δ_2] ∩ [|x-θ| ≤ |μ_n - θ|] = φ. This implies by (a) that k(δ_1,δ_2) > 0. Thus condition (iii) is verified by noting that the divergence of

    Σ a'_n inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)|

is implied by the divergence of Σ a_n.

We have by (a) and (c) that

    E δ(x,c_n)/c_n(x - μ_n) = c_n^{-1} ((∂/∂x) E δ(x,c_n))|_{x=η_n}

where |η_n - μ_n| < |x - μ_n| and |ξ| < c_n. The last expression is continuously convergent to A at x = θ by condition (c). Thus condition (iv) is satisfied.

Condition (v) follows from condition (d) and the fact that |δ(x,c_n)| ≤ 1.

Since P[Y(x-c_n) > Y(x+c_n)] can be written as the limit of Lebesgue-Stieltjes sums, which by condition (e) are Borel measurable, it is clear that the distribution function of δ(x,c_n) is Borel measurable for all (y). Thus condition (vi) is satisfied.

Condition (vii) can be shown to hold by direct substitution for the case T_n ≡ ∞ and p = 2aw.

Thus Corollary (4.1) is proven.
The conditions in this corollary are somewhat unintuitive. This arises from the fact that the SKW process explores a response curve (E δ(x,c_n)) which, although often closely related, is not the response curve of the observed random variables. However, as in Corollary (3.1), the assumptions of Corollary (4.1) are satisfied in practice almost without exception. This fact will become clearer in the following examples.
Example 1. Suppose Y(x) ~ N(M(x), σ²(x)). Let {X_n} be a SKW process which satisfies assumptions (1) and (2) preceding (4.5) and assumption (f) of Corollary (4.1) with w > 0. We have yet to determine v.

In addition, assume that

3. (x-θ) M'(x) < 0 for x ≠ θ, = 0 for x = θ;

4. M'''(x) is continuous for x ∈ I;

5. and sup_x (|M'(x)| + |σ'(x)|)/σ(x)(1 + |x|) < ∞.

We shall show that n^{β/2}(X_n - θ) → N(0, (A^s)²/C²(2 A^s A - β⁺)), where A = - M''(θ)/σ(θ)√π, using Corollary (4.1).

We first note that (3) implies condition (a) since P[Y(x) > Y(x')] > 1/2 if and only if E Y(x) > E Y(x') for independent non-degenerate normal random variables. It then follows by condition (4) and Corollary (3.1) that v = 2w.

Condition (e) is an immediate consequence of the continuity of M(x) and σ(x).
We now summarize the relevant results obtained earlier for this special case:

1. (∂/∂ε) E δ(x,ε) is continuous;

2. {Var δ(x,c_n)} is continuously convergent to 1 at x = θ;

3. for every x ∈ I and every |ε| < B, ((∂/∂x) E δ(x,ε))|_{x=θ, ε=0} = 0, the sequence {(∂²/∂ε∂x) E δ(x,ε)|_{x=θ, ε=c_n}} converges to A (= - M''(θ)/σ(θ)√π), and (∂²/∂ε∂x) E δ(x,ε) is continuous.

Thus, the assertion of asymptotic normality is established if we can show that

    sup_{|ε|<B, x} |(∂/∂ε) E δ(x,ε)|/(1 + |x|) < ∞.
Taking the partial with respect to ε under the integral signs in (4.3) (the continuity of M(x) and σ(x) assures us that this can be done) we obtain integrals of the form

    ∫_{-∞}^{∞} (∫_{-∞}^{y} (∂/∂ε) f(z,ε) dz) f(y,ε) dy

and

    ∫_{-∞}^{∞} (∫_{-∞}^{y} f(z,ε) dz) (∂/∂ε) f(y,ε) dy.

An examination of (4.5) reveals that assumption (5) is sufficient to guarantee that the above supremum is finite.

Thus the desired asymptotic normality of n^{β/2}(X_n - θ) is established.
The variances of the respective normal distributions to which the SKW process and the K-W process converge are respectively

    (A^s)²/C²(2 A^s |M''(θ)|/σ(θ)√π - β)  and  2 A² σ²(θ)/C²(4 A |M''(θ)| - β)

when α = 1, and

    A^s σ(θ)√π/2 C² |M''(θ)|  and  A σ²(θ)/2 C² |M''(θ)|    (4.11)

when 1/2 < α < 1. It is evident that the K-W process is asymptotically more sensitive to the variance of the observations.

Before leaving this example we note that if A^s = A σ(θ)/√π, the expressions in (4.11) are identical.
Example 2: Suppose Y(x) has an exponential density function given by

    f(y,x) = λ(x) e^{-λ(x)y},  y ≥ 0,
           = 0,                y < 0.

Let {X_n} be a SKW process satisfying assumption (f) of Corollary (4.1) with w > 0. Let M(x) = λ^{-1}(x) and assume that

1. M'(x) is continuous and (x-θ) M'(x) < 0 for x ≠ θ;

2. sup_x |M'(x)|/M(x)(1 + |x|) < ∞;

3. and there exists an open interval I containing θ such that M'''(x) is continuous.

We have

    P[Y(x-ε) < Y(x+ε)] = 1 - λ(x+ε)/(λ(x+ε) + λ(x-ε)) = 1 - M(x-ε)/(M(x+ε) + M(x-ε))

by the last expression in (4.3).
It follows that

    E δ(x,ε) = (M(x-ε) - M(x+ε))/(M(x-ε) + M(x+ε)).    (4.12)

It is clear from (4.12) that the value of v is less than 2w, as in Corollary (3.1).
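Formula (4.12) is exact and easy to verify by simulation. A sketch using only Python's standard library, with illustrative mean values:

```python
import random

def e_delta_exact(m_lo, m_hi):
    """Equation (4.12): (M(x-eps) - M(x+eps)) / (M(x-eps) + M(x+eps))."""
    return (m_lo - m_hi) / (m_lo + m_hi)

def e_delta_mc(m_lo, m_hi, reps=200_000, seed=7):
    """Monte Carlo E delta for independent exponentials with the given means."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        a = rng.expovariate(1.0 / m_lo)   # mean m_lo
        b = rng.expovariate(1.0 / m_hi)   # mean m_hi
        total += (a > b) - (a < b)
    return total / reps

exact, approx = e_delta_exact(2.0, 1.5), e_delta_mc(2.0, 1.5)
print(abs(exact - approx) < 0.01)  # True (the exact value is 1/7)
```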
With the possible exception of condition (b), the conditions of Corollary (4.1) can be easily verified using assumptions (1) and (2).

We have

    sup_{|ε|<B, x} |(∂/∂ε) E δ(x,ε)|/(1 + |x|) = sup_{|ε|<B, x} |M'(x)|/M(x)(1 + |x|) < ∞

by assumption (2).
It remains to evaluate Var δ(θ,0) and A. The lim_{c_n→0} Var δ(x,c_n) = 1 by (4.10) for all x and, in particular, for x = θ. By assumption (3) the value of A is given by

    A = (∂/∂x)(∂/∂ε) E δ(x,ε)|_{ε=0, x=θ},

or

    A = - M''(θ)/M(θ) = - M''(θ)/σ(θ)    (4.13)

since M(θ) = λ^{-1}(θ) = σ(θ).
A comparison of (4.13) with (4.9) suggests that A may generally be of the form - k M''(θ)/σ(θ), where k is a constant characterizing a family of distribution functions. The next example gives an important case in which this is not true.
Example 3: Suppose Y(x) is a Bernoulli random variable with P[Y(x) = 1] = p(x). Suppose that assumptions (1) and (3) of Example 2 hold with M(x) replaced by p(x). In addition, assume that

2. sup_x |p'(x)|/(1 + |x|) < ∞.

It is easily seen that these assumptions are sufficient to verify the conditions of both Corollary (4.1) and Corollary (3.1). Moreover, in this case the SKW process and the K-W process are identical. Thus, using the results of Corollary (3.1), we find that

    n^{β/2}(X_n - θ) → N(0, (A^s)² σ²/C²(2 A^s A - β⁺))

where A = - 2 p''(θ) and σ² = 2 p(θ)(1 - p(θ)).

This example shows that the corrective variable in the SKW process is a transformation of the observed random variable into a discrete, bounded random variable with a regression function that is less than, equal to, or greater than zero as x is less than, equal to, or greater than μ_n. That is, the SKW process is simply a K-W process after this transformation on the observed random variables has been made.
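The identity of the two processes in the Bernoulli case comes down to the fact that y₁ - y₂ already lies in {-1, 0, 1}, so taking its sign changes nothing; an exhaustive check:

```python
def sign(t):
    return (t > 0) - (t < 0)

# All four Bernoulli outcome pairs: the signed correction equals the raw difference.
print(all(sign(y1 - y2) == y1 - y2 for y1 in (0, 1) for y2 in (0, 1)))  # True
```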
These examples indicate that, at least asymptotically, the SKW process is more dependent than the K-W process on the family of distributions (e.g. normal, exponential) from which the observed random variables come. However, for the special cases of the normal and exponential families, the SKW process is less sensitive to the variance of the parent distribution.
4.3 Sample Properties of the SKW Process

Unlike the K-W process, whose small sample properties are largely determined by M(x) and σ(x), the small sample properties of the SKW process often depend on the distribution function in a more intrinsic manner. This makes a general discussion of the small sample properties of the SKW process very difficult.

However, a few general remarks can be made. Firstly, the effect of using the sign of (Y(X_n - c_n) - Y(X_n + c_n)) is somewhat similar to that of constraining an observed random variable; both operations lessen the effect of an observation from the tails of the distribution. Secondly, if the SKW process is approaching θ along a constant, relatively small slope, the expected behavior is fairly well approximated by (3.11) and (3.12), in which the constants B and σ² are characteristic of the family of observed random variables. And, thirdly, if the SKW process is approaching θ along a steep slope, the approach is very uniform, and the rate is determined by the sequences {a_n} and {c_n}.

It is clear from the nature of the SKW process that it is impossible to consider an example corresponding to M(x) being quadratic for the K-W process. That is, since δ(x,c_n) is bounded by ±1, it is impossible for E δ(x,c_n) = k c_n(x-θ) to hold for all x. This implies that the moments of n^{β/2}(X_n - θ) generally do not converge to those of its asymptotic normal distribution unless possibly when X_n is restricted to lie in a bounded interval containing θ. To study the expected rate of approach to θ therefore requires a slight modification of the method used for the K-W process.
Let {X_n} be a SKW process. Let I be an open interval of θ such that

    E δ(x,c_n) = c_n T_n(x)(x - θ)    (4.14)

where 0 < T ≤ T_n(x) ≤ T̄ < ∞. Let A_k be the event that X_m ∈ I for m ≥ k, and let X*_n denote the random variable X_n conditional on A_k. That is, assume that each member of the random variable sequence {X_m}_{m=k}^∞ lies in I.

It is not unreasonable to study the random variable sequence {X*_n} for two reasons. Since X_n → θ a.e. by Corollary (4.1), P(A_k) → 1 as k → ∞, and since P(A_k) → 1, n^{β/2}(X*_n - θ) → n^{β/2}(X_n - θ) in probability and therefore has the same asymptotic distribution.
We have, for a_n = A^s n^{-α} and c_n = C n^{-w},

    X*_{n+1} - θ = X*_n - θ - a_n c_n^{-1} δ(X*_n, c_n)    (4.15)
                 = (1 - A^s T_n(X*_n) n^{-α})(X*_n - θ) - A^s C^{-1} n^{-(α-w)} Z(X*_n)

where, conditional on X*_n, E Z(X*_n) = 0 and Var Z(X*_n) = Var δ(X*_n, c_n).

In much the same manner that we obtained (3.16), we obtain

    E((X*_{n+1} - θ)² | X*_k) ≤ D̄(k,n)(X*_k - θ)² + (A^s)² C^{-2} σ̄² Σ_{m=k}^{n} D̄(m+1,n) m^{-2(α-w)}    (4.16)

where D̄(k,n) = Π_{j=k}^{n} (1 - A^s T j^{-α})² and σ̄² = sup_x Var δ(x,c_n). The reverse inequality holds if T and σ̄² are replaced by T̄ and the corresponding infimum. Quite clearly, since T_n(X*_n) → A a.e. and Var δ(X*_n, c_n) → Var δ(θ,0) a.e., it could be shown that

    n^β E(X*_n - θ)² → φ    (4.17)

where β = α - 2w and φ = (A^s)² σ²/C²(2 A^s A - β⁺).
Since it will be useful in the next section, we shall show that

    n^{2β} E(X*_n - θ)⁴ → 3 φ²    (4.19)

where φ = (A^s)² σ²/C²(2 A^s A - β⁺).

If we raise (4.15) to the 4'th power and take expectations, we obtain terms of the forms

    E(X*_n - θ)⁴,  E((X*_n - θ)² Z²(X*_n)),  E((X*_n - θ) Z³(X*_n)),  and  E Z⁴(X*_n),

since E[(X*_n - θ)³ Z(X*_n) | X*_n] = 0. Using (4.16) and Hölder's inequality, the last two expressions are of smaller order than the second. Thus we can write

    E((X*_{n+1} - θ)⁴ | X*_k) ≤ D̄²(k,n)(X*_k - θ)⁴ + 6 Σ_{m=k}^{n} (A^s)² C^{-2}(σ̄² + o(1)) D̄²(m+1,n) m^{-3α+4w}.    (4.18)

Letting

    b_m = Π_{j=m+1}^{n} (1 - A^s T j^{-α})⁴ > 0,

it is seen that Lemmas (3.1), (3.2) and (3.3) as well as Theorem (3.2) hold with 2D replaced by 4D. Thus the last expression in (4.18) is less than

    n^{-2(α-2w)} 3 φ²(1 + o(1)).

It should be noted that, when α = 1, the restriction that β = α - 2w < 2 A^s A implies 2β = 2(α - 2w) < 4 A^s A = 4D in the above expressions.

Using (3.17) and (3.18) with 2B = A^s A, we obtain (4.19), as we might have expected.
4.4 Estimation of E(X_n - θ)²
The fact that the form of A is generally unknown for the SKW process adds a new complexity to the estimation of the variance of X_n. If we can assume a particular distribution for Y(x), we can often evaluate A, as was illustrated in the preceding examples. An estimate of the variance of X_n could then be calculated using the method developed in Chapter 3. However, the distribution of Y(x) is not always known. We now develop a method of estimating the variance of X_n which does not require this knowledge.

Denote Var X*_n by V_n. Let {b_m} be an increasing subsequence of the integers 1, ..., n, and let N be the number of its terms. Define V̂_n by

    V̂_n = N^{-1} Σ_{m=1}^{N} b_m^β (X*_{b_m} - X*_n)².    (4.20)

We shall study the properties of V̂_n through a study of the properties of

    V'_n = N^{-1} Σ_{m=1}^{N} b_m^β (X*_{b_m} - θ)².    (4.21)

The asterisk, which indicates that X_m, m = k, ..., n, all lie in a neighborhood of θ in which (4.14) is a reasonable approximation for E δ(x,c_n), will be omitted in the rest of this discussion.
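The estimator (4.20) is straightforward to compute from the generated sequence. A minimal sketch, assuming 1-based step indices and an illustrative fake sequence; the names and constants are hypothetical:

```python
def variance_estimate(xs, b, beta):
    """V_n = N^{-1} * sum_m b_m^beta * (X_{b_m} - X_n)^2  (equation (4.20)).
    xs holds X_1, ..., X_n; b is an increasing subsequence of indices."""
    x_n = xs[-1]
    return sum(bm ** beta * (xs[bm - 1] - x_n) ** 2 for bm in b) / len(b)

# Fake sequence oscillating toward theta = 2 at the n^{-1/4} rate, beta = 1/2.
xs = [2.0 + (-1) ** m * m ** -0.25 for m in range(1, 65)]
b = [2 ** m for m in range(1, 7)]          # geometric subsequence 2, 4, ..., 64
v = variance_estimate(xs, b, beta=0.5)
print(v > 0.0)  # True
```

No knowledge of A enters the computation, which is the point of the method.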
It follows from (4.17) that V'_n is asymptotically unbiased.

To obtain the variance of V'_n we note that

    Var[V'_n] = N^{-2} Σ_{m,m'=1}^{N} (b_m b_{m'})^β Cov[(X_{b_m} - θ)², (X_{b_{m'}} - θ)²]
              = N^{-2} Σ_{m,m'=1}^{N} (b_m b_{m'})^β (E[(X_{b_m} - θ)²(X_{b_{m'}} - θ)²] - E(X_{b_m} - θ)² E(X_{b_{m'}} - θ)²),

and the conditional structure of the process bounds each covariance in terms of the products D(b_m, b_{m'}). Using (4.19) we obtain

    Var(V'_n) ≤ (6 φ² + o(1)) N^{-2} Σ_{m'>m} D(b_m, b_{m'}).    (4.22)
Consider the case α = 1. We have from (3.22) and (3.28) that for every Λ < 2 A^s T there exists an N(Λ) such that, for every m, n > N(Λ),

    D(b_m, b_{m'}) ≤ (b_m/b_{m'})^Λ.    (4.23)

Let m' = m + 1, and suppose that the sequence {b_m} is chosen such that (b_m/b_{m+1}) is independent of m. Then, for some constant B,

    ln b_m = Σ_{i=1}^{m-1} (ln b_{i+1} - ln b_i) + ln b_1 = (m-1)B + ln b_1.

Thus b_m is of the form s^m.

Let b_m be so defined with s > 1. Then from (4.22) and (4.23) we obtain

    Var(V'_n) ≤ (6 φ² + o(1)) N^{-2} Σ_{m=0}^{N} (N-m) s^{-mΛ}.

Since N is approximately the solution of s^N = n, we have

    ln(n) Var(V'_n) ≤ 6 φ² ln(s)/(1 - s^{-Λ}) + o(1).
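The α = 1 prescription b_m = s^m gives N ≈ log_s(n) terms in the estimator. A quick sketch of that choice, with illustrative constants:

```python
import math

s, n = 2, 1024
N = int(round(math.log(n, s)))            # number of subsequence points
b = [s ** m for m in range(1, N + 1)]     # 2, 4, ..., 1024
ratios = {b[i] / b[i + 1] for i in range(N - 1)}
print(len(b), b[-1], ratios)              # 10 points, last one is n, ratio 1/2
```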
Consider the case α < 1. We have from (3.22) and (3.32) that for every Λ < 2 A^s T there exists an N(Λ) such that, for every m, n > N(Λ),

    D(b_m, b_{m'}) ≤ (b_m/b_{m'})^{φ'} exp[Λ b_{m'}^{-α}(b_m - b_{m'})]    (4.24)

where φ' > 0.

Let m' = m + 1 and suppose the sequence {b_m} is chosen such that the term in brackets is asymptotically independent of m. Then, if b_{m+1} = b_m(1 + o(m^{-1})), we have

    m^{-1} b_m^{1-α} → C.

It is easily seen that b_m = (τ m)^{(1-α)^{-1}} has the required property.

Let b_m be so defined. Then, for fixed k and m' = m + k, the term in brackets becomes [- Λ τ k + o(1)]. Thus by (4.22) and (4.24) we obtain

    Var(V'_n) ≤ (6 φ² + o(1)) N^{-2} Σ_{m=0}^{N} (N-m) e^{-Λτm}.

Since N is approximately the solution of (τ N)^{(1-α)^{-1}} = n, we have

    n^{1-α} Var(V'_n) ≤ 6 φ² τ/(1 - e^{-Λτ}) + o(1).
Since θ is unknown, it is natural to use V̂_n instead of V'_n. We have

    E V̂_n = N^{-1} Σ_{m=1}^{N} b_m^β E(X_{b_m} - X_n)²
          = N^{-1} Σ_{m=1}^{N} b_m^β (E(X_{b_m} - θ)² - 2 E(X_{b_m} - θ)(X_n - θ) + E(X_n - θ)²)
          = N^{-1} Σ_{m=1}^{N} (φ + o(1)).

Thus V̂_n is asymptotically unbiased. Moreover, for α < 1, the cross-product term is of order

    O(b_m^{-β} exp{- Λ n^{-α}(n - b_m)}).

Thus for α < 1, V̂_n is usually a conservative estimate of V_n.
This method of estimation is obviously very inefficient for α close to one. It does have the advantages of being relatively easy to calculate and of being generally applicable. It should be noted that the order of the variance of V̂_n depends on the number of steps and not on the number of observations. This is undesirable when we wish to economize by taking more observations on each step and taking fewer steps.
CHAPTER 5. TWO EXAMPLES

We now illustrate the application of some of the results of Chapter 3 and of Chapter 4 by means of two hypothetical examples. Both examples entail designing an experiment to emphasize the factors that influence the choice of the sequences {a_n}, {c_n}, and {T_n}.

Example 1: A research worker is studying the relationship between temperature and the rate of reproduction of a certain strain of bacteria. One of his objectives is to estimate the temperature (θ) that maximizes the average rate of reproduction. He emphasizes that he wishes to minimize E(M(θ̂) - M(θ))² rather than E(θ̂ - θ)².

From past experience he is reasonably confident that θ lies between 80 and 110 degrees Fahrenheit and that the average rate of reproduction is reasonably constant in the interval (θ-5, θ+5). For temperatures below θ - 5 degrees the average rate of reproduction changes approximately one unit for each 10 degree change in temperature, and for temperatures above θ + 5 degrees the average rate of reproduction changes two to three units for each 10 degree change in temperature. The standard deviation is expected to be approximately three units at temperatures below θ and to moderately increase as temperatures rise in a twenty degree range above θ. He also mentions that for practical reasons he can perform the experiment at only 20 different temperatures.

The above discussion indicates that the relation between the average rate of reproduction and temperature may be represented by Figure 4.
[Figure 4: average rate of reproduction (16 to 22 units) plotted against temperature (80 to 120 °F); the points D, E, F, and G referred to below mark the corners of the response curve near θ-5 and θ+5.]
Figure 4. Graph of Reproduction versus Temperature
We shall illustrate the design of this experiment using two CKW processes. In order that the first process (X'_n) approaches θ along a moderate slope and that the second process (X''_n) approaches θ along a steep slope, we set X'_1 = 80 and X''_1 = 110. A reason in favor of using two such processes in practice rather than one is that two processes would usually generate observations more uniformly and over a larger range of temperatures. This allows the experimenter to study the whole relationship more reliably.

Since so many of the results in this paper depend on (3.15) being a reasonable approximation of the process, we must define an interval I of θ for which this is true. Then we must design the process such that X_n is in this region at least for the last few steps. Since we are practically restricted to 20 steps, we shall choose the sequence {a_n} such that with a high probability X_n has reached this region after 10 steps.

Before doing this we must choose the sequence {c_n}. This sequence should be chosen with two thoughts in mind. Firstly, c_n should be large as long as |X_n - θ| > c_n; that is, while X_n is still searching for a neighborhood of θ. Secondly, when |X_n - θ| < c_n, c_n should be small enough so that the expected correction term always directs the process to a region in which M(x) ≈ M(θ).
The maximum value of c_n for which the last statement is true can be approximated by considering Figure 4. Let D - F denote the length of the line DF, and let c̄ = (D - F)/2. We see from the picture that if c_n < c̄, then |μ(c_n) - θ| < 5. The experimenter has noted that there is not a biologically significant difference between M(μ(c)) and M(θ) if |μ(c) - θ| < 5. Thus we seek the maximum value of c for which |μ(c) - θ| < 5.

Let p be such that (G - F) = 2pc. Then if s_1 and s_2 represent the slopes of lines FE and ED, we have

    E - G = 2cps_1 = 2c(1-p)s_2,

or p = s_2/(s_1 + s_2). Since

    5 > |μ(c) - θ| = c(s_2 - s_1)/(s_1 + s_2)

for s_2 > s_1, we have, for our example,

    c̄ = 5(.1 + .2)/(.2 - .1) = 15.
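The arithmetic of the bound c̄ can be sketched as follows, with the slope names s1 and s2 as in the text (the function itself is illustrative):

```python
def max_search_width(tol, s1, s2):
    """Largest c with |mu(c) - theta| <= tol for a response with slope s1 on
    the shallow side and s2 on the steep side: c = tol*(s1 + s2)/(s2 - s1)."""
    return tol * (s1 + s2) / (s2 - s1)

print(round(max_search_width(5.0, 0.1, 0.2), 2))  # 15.0
```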
Consider the choice of c_n for X'_n. Initially c_n should be large enough so that if X_n + c_n < θ, then X'_{n+1} > X'_n with a high probability. Since P[|Y(x) - M(x)| < σ] ≈ .6, we may choose c_1 (= C) as the solution of

    2σ = M(x_1 + C) - M(x_1 - C) ≈ .2C,

or C ≈ 30. If, in addition, we require that c_20 be somewhat less than c̄, then w ≈ .3. With these values for C and w we find that c_20 ≈ 11.5.

Similarly for the choice of c_n for X''_n we obtain

    2σ = M(x_1 + C) - M(x_1 - C) ≈ .4C,

or C ≈ 15. In this case, c_1 is sufficiently small to discount a serious bias. Thus, we can set w = 0, or c_n ≡ 15.

It should be noted that we could start the process at n = k > 1 and let n = k, ..., 20 + k. By doing so, we can adjust the rate of change of the sequences determined by the chosen constants.
An examination of Figure 4 reveals that if c_n ≤ 15 and |X_n - θ| < 5, then (3.15) is a reasonable approximation for X_{n+1}. Equations (3.11) and (3.12) are useful for choosing the sequence {a_n} such that X_n reaches the interval I by the n'th step.

In choosing a_n (= A n^{-α}) for X'_n we note first that this process will be somewhat exploratory due to the small slope. Therefore, we do not want a_n to decrease too rapidly, as that would make the process too sensitive to the first few steps. Thus we might set α = 3/4. Since θ ∈ (80, 110), |X'_1 - (θ - 5)| ≤ 25. Thus a conservative value for A, which reasonably assures that X_n ∈ I for some n ≤ 10, is the solution of

    2(.1)A Σ_{m=1}^{10} m^{-.75} ≈ .2(3.8)A = 25,    (4.25)

by (3.11). We therefore set A = 30.
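Solving (4.25) numerically confirms the choice: 2(.1)A Σ m^{-.75} = 25 gives A ≈ 33, which the text rounds down to the convenient A = 30. A two-line check:

```python
s = sum(m ** -0.75 for m in range(1, 11))   # ~ 3.76, the "3.8" in (4.25)
A = 25.0 / (2 * 0.1 * s)
print(round(s, 1), round(A))
```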
In choosing a_n (= A n^{-α}) for X''_n we note first that this process should be in a neighborhood of θ quite early due to the steep slope. Thus, we want the sequence {a_n} to optimize the behavior of X''_n under this assumption. Asymptotically this is done by setting α = 1 with the restriction that A > β/2|M''(θ)|, where β = 1 - 2w = 1. To obtain a conservative estimate of |M''(θ)|, we note from (3.14) and Figure 4 that

    -B((θ-10) - θ) ≈ M'(θ-10) ≈ .1,

or B ≈ |M''(θ)|. Thus A > 2.5. Furthermore, by the note following (3.20), asymptotically the optimal choice of A is 5. However, if we require A to be greater than the solution of the equation corresponding to (4.25), we must have A > 25.
There are numerous bases on which the sequence {T_n} of constraints may be chosen. We illustrate one method for X''_n. In the region of steepest slope the expected correction term is

    A C^{-1} n^{-1}(M(x - c_n) - M(x + c_n)) ≈ 6 A C^{-1} n^{-1}.

If we set T_n = D C A^{-1} n, then the maximum correction would be D. It is clear that for D > 6 this constraint does not seriously affect the approach of X''_n in the region of steep slope. By letting D < 10, the possible bias arising from a large movement to the left of θ in the early stages of the experiment is diminished. Obviously, the effect of T_n diminishes very quickly as n gets large.
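That the constrained correction collapses to the constant D is a one-line algebraic identity, checked below with illustrative values of A, C, and D:

```python
A, C, D = 25.0, 15.0, 8.0
for n in (1, 10, 100):
    a_n, c_n, T_n = A / n, C, D * C / A * n     # a_n = A/n, c_n = C, T_n = D*C*n/A
    assert abs((a_n / c_n) * T_n - D) < 1e-9    # maximum step is exactly D
print("constrained step equals D at every n")
```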
We can get an approximation for the variance of X'_n and X''_n using (3.16) and Lemmas (3.1) and (3.3). We have for X'_n that

    D(m,n) ≤ Π_{j=m+1}^{n} (1 - 2(.1)30 j^{-α})² = Π_{j=m+1}^{n} (1 - 6 j^{-.75})²,

giving, by Lemma (3.3),

    Var(X'_21 - θ | X'_k) ≈ (2(9)20^{-.75+.6}/12)(exp{6(20)^{-.75}} - exp{-(20-k)6(20)^{-.75}}).

For k = 10, this variance is small. If |X'_10 - θ| < 5, then

    E²(X'_20 - θ | X'_10) ≤ Π_{j=11}^{20} (1 - 6 j^{-.75})² · 25 ≈ (11/20)^{-1.5+.6} exp{-12(10)(20)^{-.75}} · 25 ≈ 0,

by (4.24) with b_m = 10 and b_{m'} = 20.

We have for X''_n that

    D(m,n) ≈ Π_{j=m+1}^{n} (1 - 2(.1)25 j^{-1})² = Π_{j=m+1}^{n} (1 - 5 j^{-1})².

Using Lemma (3.1), we obtain

    Var(X''_21 - θ | X''_10) ≤ (9)(25)(20^{-1} - (.5)⁵ 10^{-1})/((15)(10^{-1})) ≈ .14.    (4.26)

If |X''_n - θ| < 5, then

    E²(X''_20 - θ | X''_10) ≤ Π_{j=11}^{20} (1 - 5 j^{-1})² · 25 ≤ (11/20)^{10} ≤ .03.    (4.27)
Before comparing (4.26) and (4.27), we must adjust for the fact that X''_n approached θ along the steepest slope. The corresponding adjustment in (4.25) changes the restriction on A^s to A^s > 15. With this new value for A^s we obtain a correspondingly smaller variance bound.

We conclude this example by noting that much the same procedure would be followed if the SKW process were considered.
Example 2. Suppose it has been noted that the addition of a moderate amount of a certain medication to the daily diet of some chickens has improved their health, and it is desired to estimate the optimal amount.

In this type of problem, it is often difficult to quantify the observed variable (health). Moreover, the shape of the response function depends on the manner in which we choose to quantify this variable. Thus it seems natural to use a method of estimation that depends on relative magnitudes rather than absolute magnitudes. We therefore illustrate this example using the SKW process.

The corrective variable (δ(X_n,c_n)) is defined to be 1 if the chickens which are given the amount of medication (X_n - c_n) are healthier than the chickens which are given the amount of medication (X_n + c_n). As we noted in the discussion of Example 3, the random variable δ(x,c_n) generates a response curve that equals zero at μ_n. We shall suppose that the experimenter is reluctant to assume that μ_n = θ. If for no other reason than that it is somewhat meaningless to view the expected health as a function of diet, we shall also assume that the experimenter wishes to minimize E(θ̂ - θ)².

Since it is not too unreasonable to think of μ_n - θ as the bias of X_n, we might let c_n equal some constant (greater than 1) times the bias we are willing to accept. The values C and w should be chosen such that c_n is large during the period of time we expect the process to be searching for the region of θ, and approximately equal to the acceptable bias when in this region.

We have so little knowledge of A in this case that it is probably wise to set α < 1 and thereby obviate the risk of not satisfying the inequality 2 A^s A > β for the case α = 1. Moreover, due to the unknown nature of A, we are restricted to estimating the variance of X_n using the method proposed in the previous section. This latter fact may cause us to diminish α still further and to compensate for the corresponding increase in the variance by decreasing A^s.

Using the principles employed in Example 1, we could evaluate the sequences {a_n} and {c_n} more specifically.
CHAPTER 6. SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH

6.1 Summary

A large class of stochastic approximation procedures can be represented by

    X_{n+1} = X_n - a'_n Z(X_n)    (6.1)

where, conditioned on X_n, Z(X_n) is an observable random variable independent of X_i and Z(X_i), i = 1, ..., n-1. In particular the Kiefer-Wolfowitz (K-W) process admits this representation with a'_n = a_n c_n^{-1} and Z(X_n) = Y(X_n - c_n) - Y(X_n + c_n) where, conditional on X_n = x, Y(X_n - c_n) and Y(X_n + c_n) are independent random variables with means M(x - c_n) and M(x + c_n) respectively.

An extension of Burkholder's [2] theorem on the asymptotic distribution of the stochastic approximation procedures which can be represented by (6.1) is given in Chapter 2. This extension allows the number of observations taken on each step to be of O(n^p), p ≥ 0, allows the random variable |Z(X_n)| to be constrained to lie between bounds of O(n^s), s ≥ 0, and admits a more general sequence of constants a'_n. The practical consequences of this extension for the K-W process are discussed in Chapter 3.
The small sample mean and variance of X_n, conditional on X_k, are studied through a consideration of some specific examples. The case in which M(x) is a quadratic function is considered since M(x) is often closely approximated by a quadratic function in a neighborhood of θ. It is shown by means of some approximations that, after suitable normalization, the variance of the normal random variable to which the K-W process converges in distribution is a conservative estimate of Var(X_n | X_k). This asymptotic variance depends on the unknown parameters σ² and M''(θ). It is shown that an efficient estimator of M''(θ) is also a consistent estimator of M''(θ) when Y(x) ~ N(M(x), σ²).
A modification of the K-W process first proposed by Wilde [13] is
defined in Chapter 4. It satisfies (6.1) with a'_n = a_n c_n^{-1} and with
Z(X_n) being the mean of [r_n] independent random variables defined by

                    1 if Y_i(X_n - c_n) > Y_i(X_n + c_n),
    D_i(x, c_n) =   0 if Y_i(X_n - c_n) = Y_i(X_n + c_n),
                   -1 if Y_i(X_n - c_n) < Y_i(X_n + c_n),

where, conditional on X_n = x, Y_i(X_n - c_n) and Y_i(X_n + c_n),
i = 1, ..., [r_n], are independent random variables with means M(x - c_n)
and M(x + c_n), respectively. This process is termed the Signed Kiefer-
Wolfowitz (SKW) process. The asymptotic distribution of the SKW process
is obtained under quite general assumptions on Y(x) if the sequences
a_n, c_n, and r_n are of the form n^{-p}, p > 0. It is seen that the
orders of the variances of the K-W process and the SKW process are the
same, but that the variance of the SKW process shows a greater dependency
on the nature of the distribution of Y(x).
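A rough simulation sketch of one SKW step (illustrative names and sequences, not the thesis's): the magnitude of the correction is the deterministic a_n / c_n, and the observations enter only through the mean of the [r_n] signs D_i.

```python
import random

def skw_step(y, x, n, a, c, r):
    """One step of the Signed K-W process: the correction magnitude is
    fixed at a_n / c_n; the [r_n] sign variables set only its direction."""
    m = max(1, int(r(n)))                 # [r_n] paired observations
    signs = 0
    for _ in range(m):
        diff = y(x - c(n)) - y(x + c(n))
        signs += (diff > 0) - (diff < 0)  # D_i takes values in {-1, 0, 1}
    z_bar = signs / m
    return x - (a(n) / c(n)) * z_bar

random.seed(3)
y = lambda x: -(x - 2.0) ** 2 + random.gauss(0.0, 0.1)
x = 0.0
for n in range(1, 3001):
    x = skw_step(y, x, n,
                 a=lambda k: 1.0 / k,
                 c=lambda k: k ** -0.25,
                 r=lambda k: k ** 0.5)
```

Because |z_bar| ≤ 1, a single wild observation cannot produce a large correction, which is the point of the modification.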
The variance of X_n conditional on X_k is again studied by consider-
ing specific examples. It is seen that this variance depends on an
unknown parameter, the nature of which depends on the form of the dis-
tribution of Y(x). The estimator of the variance,

    Var-hat(X_n) = N^{-1} Σ_{m=1}^{N} b_m^{2S} (X_{b_m} - X_n)^2,

where {b_m} is an increasing subsequence of the integers 1, ..., n, and
n^S(X_n - θ) is asymptotically normal, is proposed since this estimator
does not depend on the unknown parameter. This estimator is shown to
be consistent for a somewhat restricted class of sequences {b_m}.
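The sketch below checks the arithmetic of such a subsequence estimator on a simulated trajectory whose deviations m^S (X_m - θ) are independent N(0, v) draws. This independence is a simplification: in the actual process successive iterates are dependent, which is why the class of admissible {b_m} must be restricted. All names and parameter choices are illustrative.

```python
import random

def var_estimate(x_path, b, S):
    """Var-hat(X_n) = N^{-1} * sum_{m=1}^{N} b_m^{2S} * (X_{b_m} - X_n)^2,
    where x_path holds X_1, ..., X_n and b holds the subsequence {b_m}."""
    x_n = x_path[-1]
    return sum(bm ** (2 * S) * (x_path[bm - 1] - x_n) ** 2
               for bm in b) / len(b)

random.seed(4)
S, v, n = 1.0 / 3.0, 4.0, 2 ** 16
theta = 0.0
# Simulated trajectory: m^S * (X_m - theta) ~ N(0, v), independently.
x_path = [theta + m ** -S * random.gauss(0.0, v ** 0.5)
          for m in range(1, n + 1)]
b = [100 * m for m in range(1, 66)]   # increasing subsequence with b_m << n
v_hat = var_estimate(x_path, b, S)
```

Keeping b_m small relative to n matters: for b_m near n the term (X_{b_m} - X_n)^2 picks up the variability of X_n itself and biases the estimate upward.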
In Chapter 5, two hypothetical examples are given to illustrate
the design of an experiment in which these procedures are used.
6.2 Suggestions for Further Research
Although the results in this paper provide some added insight into
the behavior of the K-W and SKW processes, and methods of estimating the
variance of the estimate of θ are presented, both of these areas need
further study. One very promising possibility would be a numerical
study comparing these stochastic approximation procedures with the
Method of Steepest Ascent.
LIST OF REFERENCES

1. Blum, J. R. 1954. Approximation methods which converge with probability one. Ann. Math. Statist. 25:382-386.

2. Burkholder, D. L. 1956. On a class of stochastic approximation processes. Ann. Math. Statist. 27:1044-1059.

3. Chung, K. L. 1954. On a stochastic approximation method. Ann. Math. Statist. 25:463-483.

4. Dvoretzky, A. 1956. On stochastic approximation. Proc. Third Berkeley Symposium. 1:39-56.

5. Fabian, V. 1967. Stochastic approximation with improved asymptotic speed. Ann. Math. Statist. 38:191-200.

6. Fabian, V. 1968. On asymptotic normality in stochastic approximation. Ann. Math. Statist. 39:1327-1332.

7. Hodges, J. L., Jr., and E. L. Lehmann. 1956. Two approximations to the Robbins-Monro process. Proc. Third Berkeley Symposium. 1:95-104.

8. Kiefer, J. and J. Wolfowitz. 1952. Stochastic approximation of the maximum of a regression function. Ann. Math. Statist. 23:462-466.

9. Robbins, H. E. and S. Monro. 1951. A stochastic approximation method. Ann. Math. Statist. 22:400-407.

10. Sacks, J. 1958. Asymptotic distribution of stochastic approximation procedures. Ann. Math. Statist. 29:373-405.

11. Springer, B. G. F. 1969. Numerical optimization in the presence of random variability. The single factor case. Biometrika 56:65-74.

12. Venter, J. H. 1967. An extension of the Robbins-Monro procedure. Ann. Math. Statist. 38:181-190.

13. Wilde, D. J. 1964. Optimum seeking methods. Prentice-Hall, Inc., Englewood Cliffs, N. J.