8/13/2019 Stochastic Approximation Applications
http://slidepdf.com/reader/full/stochastic-approximation-applications 1/368
Stochastic Approximation and Its Applications
Stochastic Approximation
and Its Applications
by

Han-Fu Chen
Institute of Systems Science,
Academy of Mathematics and System Science,
Chinese Academy of Sciences,
Beijing, P.R. China

KLUWER ACADEMIC PUBLISHERS
NEW YORK, BOSTON, DORDRECHT, LONDON, MOSCOW
eBook ISBN: 0-306-48166-9
Print ISBN: 1-4020-0806-6

©2003 Kluwer Academic Publishers
New York, Boston, Dordrecht, London, Moscow

Print ©2002 Kluwer Academic Publishers, Dordrecht

All rights reserved

No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher

Created in the United States of America

Visit Kluwer Online at: http://kluweronline.com
and Kluwer's eBookstore at: http://ebooks.kluweronline.com
Contents

Preface
Acknowledgments

1. ROBBINS-MONRO ALGORITHM
   1.1 Finding Zeros of a Function
   1.2 Probabilistic Method
   1.3 ODE Method
   1.4 Truncated RM Algorithm and TS Method
   1.5 Weak Convergence Method
   1.6 Notes and References

2. STOCHASTIC APPROXIMATION ALGORITHMS WITH EXPANDING TRUNCATIONS
   2.1 Motivation
   2.2 General Convergence Theorems by TS Method
   2.3 Convergence Under State-Independent Conditions
   2.4 Necessity of Noise Condition
   2.5 Non-Additive Noise
   2.6 Connection Between Trajectory Convergence and Property of Limit Points
   2.7 Robustness of Stochastic Approximation Algorithms
   2.8 Dynamic Stochastic Approximation
   2.9 Notes and References

3. ASYMPTOTIC PROPERTIES OF STOCHASTIC APPROXIMATION ALGORITHMS
   3.1 Convergence Rate: Nondegenerate Case
   3.2 Convergence Rate: Degenerate Case
   3.3 Asymptotic Normality
   3.4 Asymptotic Efficiency
   3.5 Notes and References

4. OPTIMIZATION BY STOCHASTIC APPROXIMATION
   4.1 Kiefer-Wolfowitz Algorithm with Randomized Differences
   4.2 Asymptotic Properties of KW Algorithm
   4.3 Global Optimization
   4.4 Asymptotic Behavior of Global Optimization Algorithm
   4.5 Application to Model Reduction
   4.6 Notes and References

5. APPLICATION TO SIGNAL PROCESSING
   5.1 Recursive Blind Identification
   5.2 Principal Component Analysis
   5.3 Recursive Blind Identification by PCA
   5.4 Constrained Adaptive Filtering
   5.5 Adaptive Filtering by Sign Algorithms
   5.6 Asynchronous Stochastic Approximation
   5.7 Notes and References

6. APPLICATION TO SYSTEMS AND CONTROL
   6.1 Application to Identification and Adaptive Control
   6.2 Application to Adaptive Stabilization
   6.3 Application to Pole Assignment for Systems with Unknown Coefficients
   6.4 Application to Adaptive Regulation
   6.5 Notes and References

Appendices
   A.1 Probability Space
   A.2 Random Variable and Distribution Function
   A.3 Expectation
   A.4 Convergence Theorems and Inequalities
   A.5 Conditional Expectation
   A.6 Independence
   A.7 Ergodicity
   B.1 Convergence Theorems for Martingale
   B.2 Convergence Theorems for MDS I
   B.3 Borel-Cantelli-Lévy Lemma
   B.4 Convergence Criteria for Adapted Sequences
   B.5 Convergence Theorems for MDS II
   B.6 Weighted Sum of MDS

References
Index
Preface
Estimating unknown parameters on the basis of observation data containing information about the parameters is ubiquitous in diverse areas of both theory and application. For example, in system identification the unknown system coefficients are estimated on the basis of input-output data of the control system; in adaptive control systems the adaptive control gain should be defined based on observation data in such a way that the gain asymptotically tends to the optimal one; in blind channel identification the channel coefficients are estimated using the output data obtained at the receiver; in signal processing the optimal weighting matrix is estimated on the basis of observations; in pattern classification the parameters specifying the partition hyperplane are searched by learning; and more examples may be added to this list.
All these parameter estimation problems can be transformed to a root-seeking problem for an unknown function. To see this, let y_k denote the observation at time k, i.e., the information available about the unknown parameters at time k. It can be assumed that the parameter under estimation, denoted by x^0, is a root of some unknown function f(·): f(x^0) = 0. This is not a restriction, because, for example, f(x) = x^0 − x may serve as such a function. Let x_k be the estimate for x^0 at time k. Then the available information at time k+1 can formally be written as

y_{k+1} = f(x_k) + ε_{k+1},

where ε_{k+1} = y_{k+1} − f(x_k). Therefore, by considering y_{k+1} as an observation on f(·) at x_k with observation error ε_{k+1}, the problem has been reduced to seeking the root x^0 of f(·) on the basis of the observations {y_k}.

It is clear that for each problem to specify f(·) is of crucial importance. The parameter estimation problem can be solved only if f(·) is appropriately selected so that the observation error meets the requirements figured in the convergence theorems.
If f(·) and its gradient could be observed without error at any desired values, then numerical methods such as the Newton-Raphson method, among others, could be applied to solving the problem. However, such methods cannot be used here, because in addition to the obvious problem concerning the existence and availability of the gradient, the observations are corrupted by errors which may contain not only a purely random component but also the structural error caused by inadequacy of the selected f(·).
Aiming at solving the stated problem, Robbins and Monro proposed the following recursive algorithm

x_{k+1} = x_k + a_k y_{k+1}

to approximate the sought-for root, where a_k is the step size. This algorithm is now called the Robbins-Monro (RM) algorithm. Following this pioneering work on stochastic approximation, there have been a large number of applications to practical problems and research works on theoretical issues.
At the beginning, the probabilistic method was the main tool in convergence analysis for stochastic approximation algorithms, and rather restrictive conditions were imposed on both f(·) and {ε_k}. For example, it is required that the growth rate of ||f(x)|| be not faster than linear as ||x|| tends to infinity, and that {ε_k} be a martingale difference sequence [78]. Though the linear growth rate condition is restrictive, as shown by simulation it can hardly be simply removed without violating convergence of RM algorithms.
To weaken the noise conditions guaranteeing convergence of the algorithm, the ODE (ordinary differential equation) method was introduced in [72, 73] and further developed in [65]. Since the conditions on the noise required by the ODE method may be satisfied by a large class of {ε_k} including both random and structural errors, the ODE method has been widely applied for convergence analysis in different areas. However, in this approach one has to assume a priori that the sequence of estimates {x_k} is bounded. It is hard to say that the boundedness assumption is more desirable than a growth rate restriction on f(·).
The stochastic approximation algorithm with expanding truncations was introduced in [27], and the analysis method was then improved in [14]. In fact, this is an RM algorithm truncated at expanding bounds, and for its convergence the growth rate restriction on f(·) is not required. The convergence analysis method for the proposed algorithm is called the trajectory-subsequence (TS) method, because the analysis is carried out at trajectories where the noise condition is satisfied, and, in contrast to the ODE method, the noise condition need not be verified on the whole sequence {x_k} but is verified only along convergent subsequences {x_{n_k}}. This makes a great difference when dealing with state-dependent noise, because a convergent subsequence {x_{n_k}} is always bounded, while the boundedness of the whole sequence {x_k} is not guaranteed before establishing its convergence. As shown in Chapters 4, 5, and 6, for most parameter estimation problems, after transforming them to a root-seeking problem the structural errors are unavoidable, and they are state-dependent.
The expanding truncation technique equipped with the TS method appears to be a powerful tool for dealing with various parameter estimation problems: it not only has succeeded in essentially weakening the conditions for convergence of the general stochastic approximation algorithm, but also has made it possible to successfully apply stochastic approximation in diverse areas. However, there is a lack of a reference that systematically describes the theoretical part of the method and concretely shows how to apply the method to problems coming from different areas. To fill in this gap is the purpose of the book.
The book summarizes results on the topic mostly distributed over journal papers and partly contained in unpublished material. The book is written in a systematic way: it starts with a general introduction to stochastic approximation, then describes the basic method used in the book, proves the general convergence theorems, and demonstrates various applications of the general theory.
In Chapter 1 the problem of stochastic approximation is stated, and the basic methods for convergence analysis, such as the probabilistic method, the ODE method, the TS method, and the weak convergence method, are introduced.
Chapter 2 presents the theoretical foundation of the algorithm with expanding truncations: the basic convergence theorems are proved by the TS method; various types of noises are discussed; the necessity of the imposed noise condition is shown; the connection between stability of the equilibrium and convergence of the algorithm is discussed; the robustness of stochastic approximation algorithms is considered when the commonly used conditions are not exactly satisfied; and moving root tracking is also investigated. The basic convergence theorems are presented in Section 2.2, and their proof is elementary and purely deterministic.
Chapter 3 describes asymptotic properties of the algorithms: convergence rates for both cases, whether or not the gradient of f(·) is degenerate; asymptotic normality of the estimates; and asymptotic efficiency by the averaging method.
Starting from Chapter 4 the general theory developed so far is applied to different fields. Chapter 4 deals with optimization by stochastic approximation methods. Convergence and convergence rates of the Kiefer-Wolfowitz (KW) algorithm with expanding truncations and randomized differences are established. A global optimization method consisting in the combination of KW algorithms with search methods is defined, and its a.s. convergence as well as its asymptotic behavior are established. Finally, the global optimization method is applied to solving the model reduction problem.
In Chapter 5 the general theory is applied to problems arising from signal processing. Applying the stochastic approximation method to blind channel identification leads to a recursive algorithm estimating the channel coefficients and continuously improving the estimates while receiving new signals, in contrast to the existing “block” algorithms. Applying the TS method to principal component analysis results in improved conditions for convergence. Stochastic approximation algorithms with expanding truncations combined with the TS method are also applied to adaptive filters with and without constraints. As a result, the conditions required for convergence have been considerably improved in comparison with the existing results. Finally, the expanding truncation technique and the TS method are applied to asynchronous stochastic approximation.
In the last chapter, the general theory is applied to problems arising from systems and control. The ideal parameter for operation is identified for stochastic systems by using the methods developed in this book. Then the obtained results are applied to the adaptive quadratic control problem. Adaptive regulation for a nonlinear nonparametric system and learning pole assignment are also solved by the stochastic approximation method.

The book is self-contained in the sense that there are only a few points using knowledge for which we refer to other sources, and these points can be ignored when reading the main body of the book. The basic mathematical tools used in the book are calculus and linear algebra, based on which one will have no difficulty in reading the fundamental convergence Theorems 2.2.1 and 2.2.2 and their applications described in the subsequent chapters. To understand the other material, concepts of probability theory, especially the convergence theorems for martingale difference sequences, are needed. The necessary concepts of probability theory are given in Appendix A. Some facts from probability that are used at a few specific points are listed in Appendix A without proof, because omitting the corresponding parts still leaves the rest of the book readable. However, the proof of the convergence theorems for martingales and martingale difference sequences is provided in detail in Appendix B.
The book is written for students, engineers, and researchers working in the areas of systems and control, communication and signal processing, optimization and operations research, and mathematical statistics.
HAN-FU CHEN
Acknowledgments
The support of the National Key Project of China and the National Natural Science Foundation of China is gratefully acknowledged. The author would like to express his gratitude to Dr. Haitao Fang for his helpful suggestions and useful discussions. The author would also like to thank Ms. Jinling Chang for her skilled typing and his wife Shujun Wang for her constant support.
Chapter 1

ROBBINS-MONRO ALGORITHM

Optimization is ubiquitous in various research and application fields. Quite often an optimization problem can be reduced to finding zeros (roots) of an unknown function f(·), which can be observed, but the observation may be corrupted by errors. This is the topic of stochastic approximation (SA). The error source may be observation noise, but it may also come from structural inaccuracy of the observed function. For example, one wants to find zeros of f(x), but what one actually observes are functions f_k(x) which are different from f(x). Let us denote by y_{k+1} the observation at time k+1 and by ε_{k+1} the observation noise:

y_{k+1} = f(x_k) + ε_{k+1},   ε_{k+1} = e_{k+1} + (f_k(x_k) − f(x_k)).

Here, f_k(x_k) − f(x_k) is the additional error caused by the structural inaccuracy. It is worth noting that the structural error normally depends on the current estimate x_k, and it is hard to require it to have a certain probabilistic property such as independence, stationarity, or the martingale property. We call this kind of noise state-dependent noise.

The basic recursive algorithm for finding roots of an unknown function on the basis of noisy observations is the Robbins-Monro (RM) algorithm, which is characterized by its simplicity in computation. This chapter serves as an introduction to SA, describing various methods for analyzing convergence of the RM algorithm.

In Section 1.1 the motivation of the RM algorithm is explained, and its limitation is pointed out by an example. In Section 1.2 the classical approach to analyzing convergence of the RM algorithm is presented, which is based on probabilistic assumptions on the observation noise. To relax the restrictions made on the noise, a convergence analysis method connecting convergence of the RM algorithm with stability of an ordinary differential equation (ODE) was introduced in the nineteen seventies. The ODE method is demonstrated in Section 1.3. In Section 1.4 the convergence analysis is carried out at a sample path by considering convergent subsequences; we therefore call this method the trajectory-subsequence (TS) method, which is the basic tool used in the subsequent chapters.

In this book our main concern is the path-wise convergence of the algorithm. However, there is another approach to convergence analysis called the weak convergence method, which is briefly introduced in Section 1.5. Notes and references are given in the last section.

This chapter introduces the main methods used in the literature for convergence analysis, but restricted to the single root case. Extension to more general cases in various respects is given in later chapters.
1.1. Finding Zeros of a Function

Many theoretical and practical problems in diverse areas can be reduced to finding zeros of a function. To see this it suffices to notice that solving many problems finally consists in optimizing some function J(·), i.e., finding its minimum (or maximum). If J(·) is differentiable, then the optimization problem reduces to finding the roots of f(x), where f(x) = J'(x), the derivative of J(·).

In the case where the function or its derivatives can be observed without errors, there are many numerical methods for solving the problem. For example, the gradient method, by which the estimate x_k for the root of f(·) is recursively generated by the following algorithm:

x_{k+1} = x_k − [f'(x_k)]^{−1} f(x_k),   (1.1.1)

where f'(·) denotes the derivative of f(·). This kind of problem belongs to the topics of optimization theory, which considers general cases where J(·) may be nonconvex, nonsmooth, and with constraints. In contrast to optimization theory, SA is devoted to finding zeros of an unknown function f(·) which can be observed, but the observations are corrupted by errors.

Since f'(·) is not exactly known and even may not exist, (1.1.1)-like algorithms are no longer applicable. Consider the following simple example. Let f(·) be a linear function:

f(x) = c(x − x^0),   c > 0.

If the derivative of f(·) is available, i.e., if we know c, and if f(x) can precisely be observed, then according to (1.1.1),

x_1 = x_0 − c^{−1} c(x_0 − x^0) = x^0.
This means that the gradient algorithm leads to the zero of f(·) in one step.

Assume now that the derivative of f(·) is unavailable but f(x) can exactly be observed. Let us replace [f'(x_k)]^{−1} in (1.1.1) by a sequence of positive numbers a_k. Then we derive

x_{k+1} = x_k − a_k f(x_k),   (1.1.2)

or

x_{k+1} = (1 − c a_k) x_k + c a_k x^0.   (1.1.3)

This is a linear difference equation, which can inductively be solved, and the solution of (1.1.3) can be expressed as follows:

x_{k+1} − x^0 = ∏_{i=0}^{k} (1 − c a_i)(x_0 − x^0).   (1.1.4)

Clearly, if a_k → 0 and Σ_k a_k = ∞, then ∏_{i=0}^{k}(1 − c a_i) → 0, and x_k tends to the root x^0 of f(·) as k → ∞ for any initial value x_0. This is an attractive property: although the gradient of f(·) is unavailable, we can still approach the sought-for root if the inverse of the gradient is replaced by a sequence of positive real numbers decreasingly tending to zero.
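This behavior is easy to check numerically. The sketch below is an illustration added here, not an example from the book; the slope c = 0.5, the root x^0 = 1, and the gains a_k = 1/(k+1) are choices made for the demonstration. It iterates x_{k+1} = x_k − a_k f(x_k) and the error contracts by the factor ∏(1 − c a_i), which tends to zero because the gains are not summable.

```python
# Noise-free root seeking with decreasing gains a_k = 1/(k+1),
# for the linear function f(x) = c*(x - root), c > 0.
# The product prod(1 - c*a_i) -> 0 because sum a_i diverges,
# so x_k -> root from any starting point, without using f'(x).

def seek_root(x0, c=0.5, root=1.0, n=100_000):
    x = x0
    for k in range(n):
        a_k = 1.0 / (k + 1)           # step size, replacing the inverse gradient
        x = x - a_k * c * (x - root)  # x_{k+1} = x_k - a_k f(x_k)
    return x

print(abs(seek_root(5.0) - 1.0))   # small; shrinks like prod(1 - c*a_i)
```

Note that the decay here is only polynomial in k, which is much slower than the one-step convergence of the gradient algorithm; this is the price paid for not knowing the derivative.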
Let us consider the case where f(x) is observed with errors:

y_{k+1} = f(x_k) + ε_{k+1},

where y_{k+1} denotes the observation at time k+1, ε_{k+1} the corresponding observation error, and x_k the estimate for the root of f(·) at time k.

It is natural to ask how x_k will behave if the exact value of f(x_k) in (1.1.2) is replaced by its error-corrupted observation y_{k+1}, i.e., if x_k is recursively derived according to the following algorithm:

x_{k+1} = x_k − a_k y_{k+1}.   (1.1.5)

In our example, f(x) = c(x − x^0), and (1.1.5) turns out to be

x_{k+1} = (1 − c a_k) x_k + c a_k x^0 − a_k ε_{k+1}.
Similar to (1.1.3), the solution of this difference equation is

x_{k+1} − x^0 = ∏_{i=0}^{k}(1 − c a_i)(x_0 − x^0) − Σ_{j=0}^{k} a_j ε_{j+1} ∏_{i=j+1}^{k}(1 − c a_i).   (1.1.6)

Therefore, x_k converges to the root x^0 of f(·) if the last term in (1.1.6) tends to zero as k → ∞. This means that the replacement of the inverse gradient by a sequence of decreasing positive numbers still works even in the case of error-corrupted observations, if the observation errors can be averaged out. It is worth noting that in lieu of (1.1.5) we have to take the positive sign before a_k y_{k+1}, i.e., to consider

x_{k+1} = x_k + a_k y_{k+1},   (1.1.7)

if f(x) = c(x^0 − x) rather than f(x) = c(x − x^0), or, more generally, if f(x) is decreasing as x increases.

This simple example demonstrates the basic features of the algorithms (1.1.5) and (1.1.7): 1) the algorithm may converge to a root of f(·); 2) the limit of the algorithm, if it exists, should not depend on the initial value; 3) the convergence rate is determined by how fast the observation errors are averaged out.
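A numerical illustration of the averaging effect (an addition made here, not from the book): below the noise is the bounded zero-mean sequence ε_{k+1} = (−1)^k, chosen so that the averaging is transparent, with f(x) = x − 1 and a_k = 1/(k+1). One can check by induction that u_k = k(x_k − x^0) changes by exactly ∓1 per step, so |x_k − x^0| ≤ 1/k: the errors are averaged out.

```python
# Iteration (1.1.5)-style with error-corrupted observations
# y_{k+1} = f(x_k) + eps_{k+1}, where f(x) = x - 1 and eps_{k+1} = (-1)^k.
# The bounded zero-mean "noise" is averaged out by the decreasing gains
# a_k = 1/(k+1); here one can show |x_n - 1| <= 1/n exactly.

def rm_with_noise(x0, n=1000):
    x = x0
    for k in range(n):
        y = (x - 1.0) + (-1.0) ** k   # noisy observation of f(x_k)
        x = x - y / (k + 1)           # x_{k+1} = x_k - a_k y_{k+1}
    return x

print(abs(rm_with_noise(5.0) - 1.0))  # bounded by 1/n, i.e. <= 0.001 here
```

With a constant step size instead of a_k → 0 the iterate would keep oscillating with an amplitude of the order of the step size, which is the point of the decreasing-gain requirement discussed in the next section.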
From (1.1.6) it is seen that for linear functions the convergence rate is determined by the rate at which

Σ_{j=0}^{k} a_j ε_{j+1} ∏_{i=j+1}^{k}(1 − c a_i)

tends to zero. In the case where {ε_k} is a sequence of independent and identically distributed random variables with zero mean and bounded variance σ², we have

limsup_{k→∞} |Σ_{j=1}^{k} ε_j| / (2 k σ² log log k)^{1/2} = 1   a.s.

by the law of the iterated logarithm. This means that the convergence rate for the algorithms (1.1.5) and (1.1.7) with error-corrupted observations should not be faster than O((log log k / k)^{1/2}).
1.2. Probabilistic Method

We have just shown how to find the root of an unknown linear function based on noisy observations. We now formulate the general problem.
Let f(·) be an unknown function with unknown root x^0: f(x^0) = 0. Assume f(·) can be observed at each point, but the observation is corrupted by noise:

y_{k+1} = f(x_k) + ε_{k+1},   (1.2.1)

where y_{k+1} is the observation at time k+1, ε_{k+1} is the observation noise, and x_k is the estimate for x^0 at time k.

Stochastic approximation algorithms recursively generate x_k to approximate x^0 based on the past observations. In the pioneering work of this area, Robbins and Monro proposed the following algorithm:

x_{k+1} = x_k + a_k y_{k+1},   (1.2.2)

to estimate x^0, where the step size a_k is decreasing and satisfies the following conditions: a_k > 0, Σ_k a_k = ∞, and Σ_k a_k² < ∞. They proved convergence of x_k to x^0 in mean square.
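The two summability requirements on the step size can be checked numerically for the classical choice a_k = 1/k (this numeric check is an illustration added here): the partial sums of a_k keep growing like log N, while the partial sums of a_k² stay bounded, below π²/6.

```python
import math

# Partial sums for the classical step sizes a_k = 1/k:
# sum a_k diverges (growing like log N), while sum a_k^2 stays bounded
# (its limit is pi^2/6 ~= 1.6449), exactly the combination the
# Robbins-Monro conditions require.

N = 1_000_000
s1 = sum(1.0 / k for k in range(1, N + 1))       # partial sum of a_k
s2 = sum(1.0 / k**2 for k in range(1, N + 1))    # partial sum of a_k^2

print(s1)                  # ~ log(N) + 0.5772, still growing without bound
print(s2, math.pi**2 / 6)  # bounded, just below pi^2/6
```

The first sum diverging is what lets the algorithm travel an arbitrary distance from any initial value; the second converging is what tames the accumulated noise.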
We now explain the meaning of the conditions required for the step size a_k. The condition a_k → 0 aims at reducing the effect of the observation noises. To see this, consider the case where x_k is close to x^0 and f(x_k) is close to zero, say ||f(x_k)|| ≤ δ with δ > 0 small.
Throughout the book, ||x|| always means the Euclidean norm of a vector x, and for a matrix A, ||A|| denotes the square root of the maximum eigenvalue of A^T A, where A^T means the transpose of the matrix A.
By (1.2.2), x_{k+1} − x_k = a_k (f(x_k) + ε_{k+1}). Even in the Gaussian noise case, ||x_{k+1} − x_k|| may be large from time to time if a_k has a positive lower bound. Therefore, in order to have the desired consistency, i.e., x_k → x^0, it is necessary to use decreasing gains a_k such that a_k → 0.

On the other hand, consistency can also not be achieved if a_k decreases too fast as k → ∞. To see this, let Σ_k a_k < ∞. Then, even in the noise-free case, i.e., ε_k ≡ 0, from (1.2.2) we have

||x_{k+1} − x_0|| ≤ Σ_{i=0}^{k} a_i ||f(x_i)|| ≤ c Σ_{i=0}^{∞} a_i < ∞

if f(·) is a bounded function, ||f(x)|| ≤ c. Therefore, in this case x_k cannot reach x^0 if the initial value x_0 is taken far enough from the true root, and hence x_k will never converge to x^0.
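The failure mode for too quickly decreasing gains can be made concrete. In the sketch below (the choices a_k = 2^{−k−1} and a clipped linear f are made here for illustration) the gains are summable with Σ a_k = 1, so a function bounded by 1 can move the iterate by at most a total distance 1, and a start far from the root never reaches it.

```python
# When sum a_k < infinity (here a_k = 2^(-k-1), total sum 1) and the
# observed function is bounded (|f| <= 1), the iterate moves at most
# sum a_k * sup|f| = 1 in total.  Starting from x_0 = 10 it therefore
# never reaches the root x^0 = 1 of f(x) = clip(x - 1, -1, 1).

def clipped_f(x):
    return max(-1.0, min(1.0, x - 1.0))

x = 10.0
for k in range(200):
    x = x - 2.0 ** (-k - 1) * clipped_f(x)   # x_{k+1} = x_k - a_k f(x_k)

print(x)   # stuck near 9: the total correction cannot exceed 1
```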
The algorithm (1.2.2) is now called the Robbins-Monro (RM) algorithm.
The classical approach to convergence analysis of SA algorithms is based on probabilistic analysis of the trajectories. We now present a typical convergence theorem obtained by this approach. Related concepts and results from probability theory are given in Appendices A and B.

In fact, we will use the martingale convergence theorem to prove the path-wise convergence of x_k, i.e., to show x_k → x^0 a.s. For this, the following set of conditions will be used.
A1.2.1 The step size a_k is such that a_k > 0, a_k → 0, Σ_k a_k = ∞, and Σ_k a_k² < ∞.
A1.2.2 There exists a twice continuously differentiable Lyapunov function v(·) satisfying the following conditions:

i) its second derivative is bounded;

ii) v(x^0) = 0, v(x) > 0 for x ≠ x^0, and v(x) → ∞ as ||x|| → ∞;

iii) for any ε > 0 there is a β_ε > 0 such that

sup_{||x − x^0|| ≥ ε} f^T(x) v_x(x) ≤ −β_ε < 0,

where v_x(·) denotes the gradient of v(·).
A1.2.3 The observation noise {ε_k} is a martingale difference sequence with

E(ε_{k+1} | F_k) = 0,   (1.2.3)

where {F_k} is a family of nondecreasing σ-algebras such that ε_k is F_k-measurable.
A1.2.4 The function f(·) and the conditional second moment of the observation noise have the following upper bound:

||f(x_k)||² + E(||ε_{k+1}||² | F_k) ≤ c(1 + v(x_k))   ∀k,   (1.2.4)

where c is a positive constant.
Prior to formulating the theorem we need some auxiliary results.

Let {z_k, F_k} be an adapted sequence, i.e., z_k is F_k-measurable for each k. Define the first exit time of {z_k} from a Borel set B:

τ = min{k : z_k ∉ B}.

It is clear that {τ = k} ∈ F_k, i.e., τ is a Markov time.
Lemma 1.2.1 Assume τ is a Markov time and {z_k, F_k} is a nonnegative supermartingale, i.e., z_k ≥ 0 and

E(z_{k+1} | F_k) ≤ z_k   a.s.

Then {z_{k∧τ}, F_k} is also a nonnegative supermartingale, where k∧τ = min(k, τ).
The proof is given in Appendix B, Lemma B-2-1.
The following lemma concerning convergence of adapted sequences will be used in the proof of convergence of the RM algorithm, but the lemma is also of interest by itself.
Lemma 1.2.2 Let {z_k, F_k} and {ζ_k, F_k} be two nonnegative adapted sequences.

i) If E(z_{k+1} | F_k) ≤ z_k + ζ_k and Σ_k E ζ_k < ∞, then z_k converges a.s. to a finite limit.

ii) If E(z_{k+1} | F_k) ≤ z_k − ζ_k, then Σ_k ζ_k < ∞ a.s.
Proof. For proving i), set

u_k = z_k + E(Σ_{j=k}^{∞} ζ_j | F_k).   (1.2.5)

Then we have

E(u_{k+1} | F_k) ≤ z_k + ζ_k + E(Σ_{j=k+1}^{∞} ζ_j | F_k) = u_k.

By the convergence theorem for nonnegative supermartingales, u_k converges a.s. as k → ∞.

Since Σ_{j=0}^{∞} E ζ_j < ∞, by the convergence theorem for martingales it follows that M_k := E(Σ_{j=0}^{∞} ζ_j | F_k) converges a.s. as k → ∞. Since Σ_{j=0}^{k−1} ζ_j is F_{k−1}-measurable and is nondecreasing with Σ_{j=0}^{k−1} ζ_j ≤ M_k, it converges a.s., and we have

E(Σ_{j=k}^{∞} ζ_j | F_k) = M_k − Σ_{j=0}^{k−1} ζ_j.

Noticing that both M_k and Σ_{j=0}^{k−1} ζ_j converge a.s. as k → ∞, we conclude that E(Σ_{j=k}^{∞} ζ_j | F_k) is also convergent a.s. as k → ∞. Consequently, from (1.2.5) it follows that z_k converges a.s. as k → ∞.

For proving ii), set

u_k = z_k + Σ_{j=0}^{k−1} ζ_j.

Taking the conditional expectation leads to

E(u_{k+1} | F_k) = E(z_{k+1} | F_k) + Σ_{j=0}^{k} ζ_j ≤ z_k − ζ_k + Σ_{j=0}^{k} ζ_j = u_k.

Again, by the convergence theorem for nonnegative supermartingales, u_k converges a.s. as k → ∞. Since by the same theorem z_k also converges a.s. as k → ∞, it directly follows that Σ_k ζ_k < ∞ a.s.
Theorem 1.2.1 Assume Conditions A1.2.1–A1.2.4 hold. Then for any initial value x_0, the x_k given by the RM algorithm (1.2.2) converges to the root x^0 of f(·) a.s. as k → ∞.
Proof. Let v(·) be the Lyapunov function given in A1.2.2. Expanding v(x_{k+1}) to a Taylor series, we obtain

v(x_{k+1}) = v(x_k) + a_k v_x^T(x_k) y_{k+1} + (a_k²/2) y_{k+1}^T v_{xx}(ξ_k) y_{k+1}
          ≤ v(x_k) + a_k v_x^T(x_k) y_{k+1} + c_0 a_k² ||y_{k+1}||²,   (1.2.6)

where v_x and v_{xx} denote the gradient and Hessian of v(·), respectively, ξ_k is a vector with components located in-between the corresponding components of x_k and x_{k+1}, and c_0 denotes a constant such that ||v_{xx}(x)|| ≤ 2c_0 (by A1.2.2 i)).

Noticing that x_k is F_k-measurable and taking the conditional expectation in (1.2.6), by A1.2.3 and (1.2.4) we derive

E(v(x_{k+1}) | F_k) ≤ v(x_k) + a_k f^T(x_k) v_x(x_k) + c_1 a_k² (1 + v(x_k)),   (1.2.7)

where c_1 is a constant. Since Σ_k a_k² < ∞ by A1.2.1, we have

∏_{k=0}^{∞} (1 + c_1 a_k²) < ∞.   (1.2.8)

Denoting

z_k = (1 + v(x_k)) ∏_{i=k}^{∞} (1 + c_1 a_i²),

and noticing a_k f^T(x_k) v_x(x_k) ≤ 0 by A1.2.2 iii), from (1.2.7) and (1.2.8) it follows that

E(z_{k+1} | F_k) ≤ z_k + a_k f^T(x_k) v_x(x_k) ∏_{i=k+1}^{∞} (1 + c_1 a_i²) ≤ z_k.   (1.2.9)

Therefore, {z_k, F_k} is a nonnegative supermartingale, and z_k converges a.s. by the convergence theorem for nonnegative supermartingales. Since ∏_{i=k}^{∞}(1 + c_1 a_i²) → 1 as k → ∞, v(x_k) also converges a.s.
For any ε > 0 denote

B_ε = {x : ||x − x^0|| < ε}.

Let τ be the first exit time of {x_k} from B_ε^c, and for any integer m let

τ_m = min{k > m : x_k ∉ B_ε^c},

where B^c denotes the complement of B. This means that τ_m is the first exit time from B_ε^c after time m, i.e., the first time after m at which x_k enters B_ε.

Since a_k f^T(x_k) v_x(x_k) is nonpositive, from (1.2.9) and Lemma 1.2.1 it follows that

E(z_{(k+1)∧τ_m} | F_k) ≤ z_{k∧τ_m} + a_k f^T(x_k) v_x(x_k) I_{[k < τ_m]}

for any k > m. On the set {k < τ_m} with k > m we have ||x_k − x^0|| ≥ ε, and hence, by A1.2.2 iii), this implies that

E(z_{(k+1)∧τ_m} | F_k) ≤ z_{k∧τ_m} − β_ε a_k I_{[k < τ_m]}.

By Lemma 1.2.2 ii), the above inequality implies

Σ_{k > m} a_k I_{[k < τ_m]} < ∞   a.s.,

which means that τ_m must be finite a.s. Otherwise, we would have Σ_k a_k < ∞, a contradiction to A1.2.1. Therefore, for each m, with the possible exception of a set of probability zero, the trajectory of {x_k} must enter B_ε after time m.

Consequently, there is a subsequence {x_{n_j}} such that ||x_{n_j} − x^0|| < ε_j, where ε_j → 0 as j → ∞. By the arbitrariness of ε we then conclude that there is a subsequence, denoted still by {x_{n_j}}, such that x_{n_j} → x^0. Hence v(x_{n_j}) → 0.

However, we have shown that v(x_k) converges a.s. Therefore, v(x_k) → 0 a.s. By A1.2.2 ii) we then conclude that x_k → x^0 a.s.
Remark 1.2.1 If Condition A1.2.2 iii) changes to

inf? — more precisely, if for any ε > 0 there is a β_ε > 0 such that inf_{||x − x^0|| ≥ ε} f^T(x) v_x(x) ≥ β_ε > 0,

then the algorithm (1.2.2) should accordingly change to

x_{k+1} = x_k − a_k y_{k+1}.
We now explain the conditions required in Theorem 1.2.1. As noted in Section 1.1, the step size should satisfy a_k → 0 and Σ_k a_k = ∞, but the condition Σ_k a_k² < ∞ may be weakened.

Condition A1.2.2 requires the existence of a Lyapunov function v(·). This kind of condition is normally necessary for convergence of the algorithms, but the analytic properties of v(·) may be weakened. The noise condition A1.2.3 is rather restrictive. As will be shown in the subsequent chapters, ε_k may be composed of not only random noise but also structural errors, which hardly have nice probabilistic properties such as the martingale difference property, stationarity, or bounded variances, etc.

As in many cases one can take v(x) = ||x − x^0||² to serve as the Lyapunov function, it then follows from (1.2.4) that the growth rate of ||f(x)|| as ||x|| → ∞ should not be faster than linear. This is a major restriction on applying Theorem 1.2.1. However, if we a priori assume that {x_k} generated by the algorithm (1.2.2) is bounded, then {v(x_k)} is bounded provided v(·) is locally bounded, and then the linear growth is not a restriction for {f(x_k), k = 1, 2, ...}.
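As a small numerical companion to Theorem 1.2.1 (the two-dimensional setting, the choice f(x) = x^0 − x, and the bounded alternating perturbation standing in for the martingale-difference noise are all assumptions made here for illustration, not taken from the book): with v(x) = ||x − x^0||² one has f^T(x) v_x(x) = −2||x − x^0||² < 0 away from the root, so a condition of the A1.2.2 type holds, and the plus-sign recursion (1.2.2) drives the estimate to the root.

```python
# RM algorithm (1.2.2), x_{k+1} = x_k + a_k y_{k+1}, in R^2 for
# f(x) = root - x.  With v(x) = |x - root|^2 one gets
# f(x)^T v_x(x) = -2 |x - root|^2 < 0 away from the root, so the
# Lyapunov condition holds.  A bounded alternating perturbation
# stands in for the zero-mean observation noise.

root = (1.0, -2.0)

def rm_2d(x0, n=2000):
    x = list(x0)
    for k in range(n):
        a_k = 1.0 / (k + 1)
        eps = (-1.0) ** k                # bounded zero-mean error
        for i in range(2):
            y = (root[i] - x[i]) + eps   # y_{k+1} = f(x_k) + eps_{k+1}
            x[i] = x[i] + a_k * y        # plus sign: f is "decreasing"
    return x

x = rm_2d((10.0, 10.0))
print(x)   # close to (1.0, -2.0)
```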
1.3. ODE Method
As mentioned in Section 1.2, the classical probabilistic approach to analyzing SA algorithms requires rather restrictive conditions on the observation noise. In the nineteen seventies the so-called ordinary differential equation (ODE) method was proposed for analyzing convergence of SA algorithms. We explain the idea of the method. The estimates {x_k} generated by the RM algorithm are interpolated to a continuous function with interpolating length equal to the step size used in the algorithm. The tail part of the interpolating function is shown to satisfy the ordinary differential equation ẋ = f(x). The sought-for root x^0 is the equilibrium of the ODE. By stability of this equation, or by assuming the existence of a Lyapunov function, it is proved that the tail of the interpolating function tends to x^0. From this, it can be deduced that x_k → x^0.

For demonstrating the ODE method we need two facts from analysis, which are formulated below as propositions.
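Before the propositions, the interpolation idea can be previewed numerically. In the noise-free case the recursion x_{k+1} = x_k + a_k f(x_k) is exactly Euler's method for ẋ = f(x) with step a_k, so the iterate shadows the ODE solution at the times t_k = a_0 + ... + a_{k−1}. The sketch below uses the illustrative choices f(x) = −x and a_k = 1/(k+2) (made here, not taken from the book); the iterate stays within a bounded factor of the exact solution e^{−t_k}, the factor being controlled by Σ a_k² < ∞.

```python
import math

# The noise-free recursion x_{k+1} = x_k + a_k f(x_k) is Euler's method
# for dx/dt = f(x) with step a_k.  For f(x) = -x the ODE solution is
# x(t) = x(0) e^{-t}; the iterate equals prod(1 - a_i), which differs
# from e^{-t_k} only by exp(-delta) with 0 <= delta <= sum a_i^2.

f = lambda x: -x
x, t = 1.0, 0.0
for k in range(100_000):
    a_k = 1.0 / (k + 2)       # a_k <= 1/2, sum a_k = infinity, sum a_k^2 < infinity
    x = x + a_k * f(x)        # Euler step of the ODE
    t = t + a_k               # interpolation time t_k

ratio = x / math.exp(-t)      # iterate versus ODE solution at time t_k
print(ratio)                  # between roughly 0.5 and 1
```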
Proposition 1.3.1 (Arzelà-Ascoli) Let {f_k(·)} be a family of equi-continuous and uniformly bounded functions, where by equi-continuity we mean that for any ε > 0 there exists a δ > 0 such that

|f_k(t) − f_k(s)| < ε whenever |t − s| < δ,   ∀k.

Then there are a continuous function f(·) and a subsequence of functions {f_{k_j}(·)} which converge to f(·) uniformly in any finite interval, i.e.,

f_{k_j}(t) → f(t) as j → ∞

uniformly with respect to t belonging to any finite interval.
Proposition 1.3.2 For the following ODE

ẋ = f(x)   (1.3.1)

with f(x^0) = 0, if there exists a continuously differentiable function v(·) such that v(x^0) = 0, v(x) > 0 for x ≠ x^0, v(x) → ∞ as ||x|| → ∞, and

v_x^T(x) f(x) < 0   ∀x ≠ x^0,

then the solution to (1.3.1), starting from any initial value, tends to x^0 as t → ∞, i.e., x^0 is the globally asymptotically stable solution to (1.3.1).
Let us introduce the following conditions.

A1.3.1 a_k > 0, a_k → 0, and Σ_k a_k = ∞.

A1.3.2 There exists a twice continuously differentiable Lyapunov function v(·) such that v(x^0) = 0, v(x) → ∞ as ||x|| → ∞, and

v_x^T(x) f(x) < 0 whenever x ≠ x^0.
In order to describe the conditions on the noise, we introduce an integer-valued function m(k, T) for any T > 0 and any integer k. For T > 0 define

m(k, T) = max{m : Σ_{i=k}^{m} a_i ≤ T}.   (1.3.2)

Noticing that a_k tends to zero, for any fixed T > 0 the difference m(k, T) − k diverges to infinity as k → ∞. In fact, m(k, T) counts the number of iterations starting from time k as long as the sum of step sizes does not exceed T. The integer-valued function m(·, ·) will be used throughout the book.
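A small sketch (illustrative; the step sizes a_k = 1/(k+1) are a choice made here) shows how m(k, T) can be computed, and that the window of iterations it describes lengthens as k grows while the accumulated step stays within the budget T.

```python
# m(k, T) = max{m : a_k + ... + a_m <= T}, for step sizes a_k = 1/(k+1).
# Because a_k -> 0, the window m(k, T) - k gets longer as k grows,
# while the accumulated step over the window stays <= T.

def a(i):
    return 1.0 / (i + 1)

def m(k, T):
    total, i = 0.0, k
    while total + a(i) <= T:
        total += a(i)
        i += 1
    return i - 1              # last index still inside the budget T

for k in (1, 10, 100):
    mk = m(k, 1.0)
    window = sum(a(i) for i in range(k, mk + 1))
    print(k, mk, window)      # window <= 1.0 < window + a(mk + 1)
```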
The following conditions will be used:

A1.3.3 The observation noise {ε_k} satisfies

lim_{k→∞} max_{k ≤ m ≤ m(k,T)} || Σ_{i=k}^{m} a_i ε_{i+1} || = 0   for every T > 0.

A1.3.4 f(·) is continuous.
Theorem 1.3.1 Assume that A1.3.1, A1.3.2, and A1.3.4 hold. If for a fixed sample path A1.3.3 holds and {x_k} generated by the RM algorithm (1.2.2) is bounded, then for this sample path x_k tends to x^0 as k → ∞.
Proof. Set t_0 = 0 and t_k = a_1 + ··· + a_k.
Define the linear interpolating function taking the value x_k at time t_k.
It is clear that this function is continuous.
Further, define the corresponding linear interpolating function for the noise, defined by (1.3.4) with the iterates replaced by the noise terms.
Since we will deal with the tail part of the interpolating function, we define a shifted family by translating time by t_k.
Thus, we derive a family of continuous functions.
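The interpolation construction can be sketched numerically as follows. The iterate sequence used here is a toy example (x_{k+1} = x_k − a_k x_k, an illustrative assumption, not the algorithm under study); it only shows how the interpolating function and its time-shifts are formed.

```python
import numpy as np

def interpolants(x, a):
    """Linear interpolating function x0 through the iterates placed at the
    'algorithm time' instants t_k = a_1 + ... + a_k, plus the time-shifted
    tail functions xk(k)(t) = x0(t_k + t)."""
    t = np.concatenate(([0.0], np.cumsum(a)))   # t_0 = 0, t_k = sum of steps
    x0 = lambda s: np.interp(s, t, x)           # piecewise-linear interpolation
    xk = lambda k: (lambda s: x0(t[k] + s))     # shifted tail function
    return x0, xk

a = np.array([1.0 / (k + 1) for k in range(1, 200)])  # toy step sizes
x = [1.0]
for ak in a:                                          # toy iteration
    x.append(x[-1] * (1.0 - ak))
x0, xk = interpolants(np.array(x), a)
print(x0(0.0), xk(100)(0.0))   # xk(k)(0) equals the iterate x_k by construction
```

The shifted functions are exactly the objects the proof extracts convergent subsequences from.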
Let us define the constant interpolating function
Then summing up both sides of (1.2.2) yields
and hence
By the boundedness assumption on the iterates, the family is uniformly bounded. We now prove it is equi-continuous.
By definition,
Hence, we have
where since
From this it follows that
which tends to zero as and then by A1.3.3.
For any we have
By boundedness of and (1.3.11) we see that is equi-continuous.
By Proposition 1.3.1, we can select from this family a convergent subsequence which tends to a continuous function.
Consider the following difference with
which is derived by using (1.3.11).
By (1.3.9) it is clear that for
Then from (1.3.12) we obtain
Letting T tend to zero in (1.3.13), by continuity of f(·) and uniform convergence along the subsequence, we conclude that the last term in (1.3.13) converges to zero, and
By A1.3.2 and Proposition 1.3.2 we see that the limit function tends to x⁰ as time goes to infinity.
We now prove that x_k → x⁰. Assume the converse: there is a subsequence
Then for. There is a such that. By (1.3.4) we have

where, and denotes the integer part of, so
It is clear that the family of functions indexed byis uniformly bounded and equi-continuous. Hence, we can select a
convergent subsequence, denoted still by. The limit satisfies the ODE (1.3.14) and coincides with, being the limit of, by the uniqueness of the solution to (1.3.14).
By the uniform convergence we have
which implies that.
From here by (1.3.15) it follows that
Then we obtain a contradictory inequality:
for large enough such that and. This completes the proof.
We now compare conditions used in Theorem 1.3.1 with those in Theorem 1.2.1.
Conditions A1.3.1 and A1.3.2 are slightly weaker than A1.2.1 and A1.2.2, but they are almost the same. The noise condition A1.3.3 is significantly weaker than those used in Theorem 1.2.1, because under the conditions of Theorem 1.2.1 the series Σ_k a_k ε_{k+1} converges a.s.,

which certainly implies A1.3.3.
As a matter of fact, Condition A1.3.3 may be satisfied by sequences much more general than martingale difference sequences.
Example 1.3.1 Assume ε_k → 0 as k → ∞; {ε_k} may be any random or deterministic sequence. Then {ε_k} satisfies A1.3.3. This is because

‖ Σ_{i=k}^{m(k,T)} a_i ε_{i+1} ‖ ≤ T · sup_{i≥k} ‖ε_{i+1}‖ → 0 as k → ∞.
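The bound in this example is easy to check numerically; the concrete sequences below (a_i = 1/i and the deterministic noise ε_i = 1/√i) are illustrative assumptions only.

```python
a = lambda i: 1.0 / i              # illustrative step sizes
eps = lambda i: 1.0 / i ** 0.5     # a deterministic noise with eps_i -> 0

def weighted_tail(k, T):
    """|sum_{i=k}^{m(k,T)} a_i * eps_{i+1}|, with m(k,T) as in (1.3.2)."""
    total, s, i = 0.0, 0.0, k
    while s + a(i) <= T:
        total += a(i) * eps(i + 1)
        s += a(i)
        i += 1
    return abs(total)

for k in (10, 100, 1000, 10000):
    print(k, weighted_tail(k, 1.0))   # decreases toward zero as k grows
```

The printed values shrink like sup_{i≥k} |ε_{i+1}|, exactly as the displayed inequality predicts.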
Example 1.3.2 Let {ε_k} be an MA (moving average) process, i.e.,

ε_k = C_0 w_k + C_1 w_{k-1} + ··· + C_r w_{k-r},

where {w_k} is a martingale difference sequence with bounded conditional second moments. Then under condition A1.2.1, Σ_k a_k w_{k+1} converges a.s., and hence Σ_k a_k ε_{k+1} converges a.s. Consequently, A1.3.3 is satisfied for almost all sample paths.
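For a concrete (hypothetical) instance of this example, take the order-one moving average ε_k = w_k + 0.5 w_{k−1} with i.i.d. N(0,1) innovations and a_k = 1/k; along a sample path, the weighted sums settle to a finite limit, as the example asserts.

```python
import random

random.seed(0)
w_prev, partial, snapshots = 0.0, 0.0, []
for k in range(1, 200001):
    w = random.gauss(0.0, 1.0)       # martingale-difference innovations
    eps = w + 0.5 * w_prev           # MA(1) noise (an illustrative choice)
    w_prev = w
    partial += eps / k               # accumulate a_k * eps_k with a_k = 1/k
    if k in (1000, 10000, 100000, 200000):
        snapshots.append(partial)
print(snapshots)   # successive snapshots change less and less: the series converges
```

Convergence of the full series makes the tail sums in A1.3.3 vanish, which is the point of the example.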
Condition A1.3.4 requires continuity of f(·), which is not required in A1.2.4. At first glance, unlike A1.2.4, Condition A1.3.4 does not impose any growth rate condition on f(·); but Theorem 1.3.1 a priori requires the boundedness of {x_k}, which is an implicit requirement on the growth rate of f(·).
The ODE method is widely used in convergence analysis for algorithms arising from various application areas, because it requires from the noise no probabilistic property that would be difficult to verify. Concerning the weakness of the ODE method, we have mentioned that it a priori assumes that {x_k} is bounded. This condition is difficult to verify in the general case. The other point to be mentioned is that Condition A1.3.3 is also difficult to verify in the case where the noise depends on the past estimates, which often occurs when the noise contains structural errors of the function. This is because A1.3.3 may be verifiable if {x_k} is convergent, but the noise may behave badly depending upon the behavior of {x_k}. So we are somehow in a cyclic situation: with A1.3.3 we can prove convergence of {x_k}; on the other hand, with convergent {x_k} we can verify A1.3.3. This difficulty will be overcome by using the Trajectory-Subsequence (TS) method, to be introduced in the next section and used in subsequent chapters.
1.4. Truncated RM Algorithm and TS Method
In Section 1.2 we considered the root-seeking problem where the sought-for root may be any point. If the region the root belongs
to is known, then we may use the truncated algorithm, and the growth rate restriction on f(·) can be removed.
Let us assume that the root lies in a sphere of known radius. In lieu of (1.2.2) we now consider the following truncated RM algorithm:
where the observation is given by (1.2.1), x* is a given point inside the truncation region, and 1{·} is the indicator function.
The constant used in (1.4.1) will be specified later on.
The algorithm (1.4.1) coincides with the RM algorithm as long as the estimates evolve in the truncation sphere; but if an update exits the sphere, then the algorithm is pulled back to the fixed point x*.
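A minimal sketch of this fixed-bound scheme, in the spirit of (1.4.1). The linear f, the Gaussian noise, and the numerical constants are all illustrative assumptions; the point is that the root (here 2) lies inside the truncation ball (here |x| ≤ M = 5).

```python
import numpy as np

rng = np.random.default_rng(1)

def truncated_rm(f, x_star, M, n_iter):
    """Fixed-bound truncated RM iteration: if the ordinary RM update leaves
    the ball |x| <= M, the estimate is pulled back to the fixed interior
    point x_star."""
    x = x_star
    for k in range(1, n_iter + 1):
        y = f(x) + rng.normal(0.0, 1.0)   # noisy observation of f at x_k
        cand = x + y / k                  # RM step with a_k = 1/k
        x = cand if abs(cand) <= M else x_star
    return x

est = truncated_rm(lambda x: 2.0 - x, x_star=0.0, M=5.0, n_iter=20000)
print(est)   # close to the root 2
```

Because the root is interior to the ball, truncations can occur at most finitely often along a convergent run, which is exactly what the analysis below exploits.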
We will use the following set of conditions:
A1.4.1 The step size satisfies the same conditions as A1.3.1: a_k > 0, a_k → 0 as k → ∞, and Σ_{k=1}^∞ a_k = ∞;
A1.4.2 There exists a continuously differentiable Lyapunov function (not necessarily nonnegative) such that, and for the truncation bound used in (1.4.1) there is such that
A1.4.3 For any convergent subsequence of
where is given by (1.3.2);
A1.4.4 f(·) is measurable and locally bounded.
We first compare these conditions with A1.3.1–A1.3.4. We note that A1.4.1 is the same as A1.3.1, while A1.4.2 is weaker than A1.2.2.
The difference between A1.3.3 and A1.4.3 consists in that condition (1.4.2) is required to be verified only along convergent subsequences, while (1.3.3) in A1.3.3 has to be verified along the whole sequence.
if is small enough and is large enough. This, combined with (1.4.5), implies that the norm of the iterates cannot reach the truncation bound. In other words, the algorithm (1.4.1) turns into an untruncated RM algorithm (1.4.7) for small and large indices.
By the mean value theorem there exists a vector with components located in-between the corresponding components of and such that
Notice that by (1.4.2) the left-hand side of (1.4.6) is of for all sufficiently large, since is bounded. From this it follows that i) for small enough and large enough

and hence, and ii) the last term in (1.4.8) is of, since as. From (1.4.7) and (1.4.8) it then follows that
Since, the interval does not contain the origin. Noticing that, we find that there is such that
for sufficiently small and all large enough. Then by A1.4.2 there is such that

for all large and small enough. As mentioned above, from (1.4.9) we have

for sufficiently large and small enough, where denotes a magnitude tending to zero as
Taking (1.4.4) into account, from (1.4.10) we find that
for large However, we have shown that
The obtained contradiction shows that the number of truncations in(1.4.1) can only be finite.
We have proved that, starting from some large index, the algorithm (1.4.1) develops as an RM algorithm

and is bounded. We are now in a position to show that the sequence converges. Assume it were not true. Then we would have
Then there would exist an interval not containing the origin, and would cross it for infinitely many
Again, without loss of generality assuming, by the same argument as that used above, we will arrive at (1.4.9) and (1.4.10) for large indices and obtain a contradiction. Thus, the sequence tends to a finite limit as. It remains to show that
Assume the converse that there is a subsequence
Then there is a such that for all sufficiently large. We still have (1.4.8), (1.4.9), and (1.4.10) for some
If for any bounded continuous function defined on

then we say that weakly converges to.
If for any there is a compact measurable set such that

then the sequence is called tight.
Further, it is called relatively compact if each of its subsequences contains a weakly convergent subsequence.
In the weak convergence analysis an important role is played by Prohorov's theorem, which says that on a complete and separable metric space, tightness is equivalent to relative compactness. The weak convergence method establishes the weak limit of as, and convergence of to in probability as, whereas
Theorem 1.5.1 Assume the following conditions:
A1.5.1 {x_k} is a.s. bounded;

A1.5.2 f(·) is continuous;
A1.5.3 is adapted, is uniformly integrable in the sense that
and
Then is tight in and weakly converges to
that is a solution to
Further, if is asymptotically stable for (1.5.3), then for any, the distance between and converges to zero in probability as
Instead of a proof, we only outline its basic idea. First, it is shown that we can extract a subsequence of weakly converging to
8/13/2019 Stochastic Approximation Applications
http://slidepdf.com/reader/full/stochastic-approximation-applications 37/368
ROBBINS-MONRO ALGORITHM 23
For notational simplicity, denote the subsequence still by. By the Skorohod representation, we may assume convergence with probability one. For this we need only, if necessary, to change the probability space and take and on this new space such that and have the same distributions as those of and, respectively. Then, it is proved that
is a martingale. Since and as can be shown, is Lipschitzcontinuous, it follows that
Since is relatively compact and the limit does not depend onthe extracted subsequence, the whole family weakly convergesto as and satisfies (1.5.3). By asymptotic stability of
Remark 1.5.1 The boundedness assumption on may be removed.For this a smooth function is introduced such that
and the following truncated algorithm
is considered in lieu of (1.5.1). Then the sequence is interpolated to a piecewise-constant function. It is shown that the family is tight and weakly convergent as. The limit
satisfies
Finally, by showing lim sup lim sup for some, for each, it is proved that itself is tight and weakly converges to, satisfying (1.5.3).
1.6. Notes and References
The stochastic approximation algorithm was first proposed by Robbins and Monro in [82], where the mean square convergence of the algorithm was established under the independence assumption on the observation noise. Later, the noise was extended from independent sequences to martingale difference sequences (e.g. [7, 40, 53]).
The probabilistic approach to convergence analysis is well summarized in [78].
The ODE approach was proposed in [65, 72], and then it was widely used [4, 85]. For a detailed presentation of the ODE method we refer to [65, 68].
The proof of the Arzelà-Ascoli theorem can be found in ([37], p. 266).
Section 1.4 is an introduction to the method described in detail in the coming chapters. For stability and Lyapunov functions we refer to [69].
The weak convergence method was developed by Kushner [64, 68]. The Skorohod topology and Prohorov's theorem can be found in [6, 41].
For probability concepts briefly presented in Appendix A, we refer to [30, 32, 70, 76, 84]. But the proof of the convergence theorem for martingale difference sequences, which is frequently used throughout the book, is given in Appendix B.
Chapter 2

STOCHASTIC APPROXIMATION ALGORITHMS WITH EXPANDING TRUNCATIONS
In Chapter 1 the RM algorithm, the basic algorithm used in stochastic approximation (SA), was introduced, and four different methods for analyzing its convergence were presented. However, the conditions imposed for convergence are rather strong.
Comparing the theorems derived by various methods in Chapter 1, we find that the TS method introduced in Section 1.4 requires the weakest condition on noise. The trouble is that the sought-for root has to be inside the truncation region. This motivates us to consider SA algorithms with expanding truncations, with the purpose that the truncation region will finally cover the sought-for root, whose location is unknown. This is described in Section 2.1.
General convergence theorems for the SA algorithm with expanding truncations are given in Section 2.2. The key point of the proof is to show that the number of truncations is finite. If this is done, then the estimate sequence is bounded and the algorithm turns into the conventional RM algorithm in a finite number of steps. This is realized by using the TS method. It is worth noting that the fundamental convergence theorems given in this section are established by a completely elementary method, which is deterministic and requires no more than a knowledge of calculus. In Section 2.3 state-independent conditions on noise are given to guarantee convergence of the algorithm when the noise itself is state-dependent. In Section 2.4 conditions on noise are discussed; it appears that the noise condition in the general convergence theorems is, in a certain sense, necessary. In Section 2.5 the convergence theorem is given for the case where the observation noise is non-additive.
In the multi-root case, up to Section 2.6 we have only established that the distance between the estimate and the root set tends to zero. But by no means does this imply convergence of the estimate itself. This is briefly discussed in Section 2.4, and is considered in Section 2.6 in connection with properties of the equilibria, where conditions are given to guarantee trajectory convergence. It is also considered whether the limit of the estimate is a stable or an unstable equilibrium. In Section 2.7 it is shown that a small distortion of conditions may cause only a small estimation error in the limit, while Section 2.8 considers the case where the sought-for root is moving during the estimation process. Convergence theorems there are derived with the help of the general convergence theorem given in Section 2.2. Notes and references are given in the last section.

2.1. Motivation
In Chapter 1 we have presented four types of convergence theorems using different analysis methods for SA algorithms. However, none of these theorems is completely satisfactory in applications. Theorem 1.2.1 is proved by using the classical probabilistic method, which requires restrictive conditions on the noise. As mentioned before, the noise may contain a component caused by the structural inaccuracy of the function, and it is hard to assume this kind of noise to be mutually independent or to be a martingale difference sequence, etc. The growth rate restriction imposed on the function not only is severe, but is also unavoidable in a certain sense. To see this, let us consider the following example.
It is clear that conditions A1.2.1, A1.2.2, and A1.2.3 are satisfied. The only condition that is not satisfied is (1.2.4), since the left-hand side grows faster than the second order polynomial on the right-hand side of (1.2.4). Simple calculation shows that the sequence given by the RM algorithm rapidly diverges.
From this one might conclude that the growth rate restriction would be necessary. However, if we take the initial value close enough to the root, then the sequence given by the RM algorithm converges. Reducing the initial value is, in a certain sense, equivalent to using the step sizes not from the beginning of the sequence but from some later index on. The difficulty consists in that we do not know from which index we should start the algorithm. This is one of the motivations to use expanding truncations, to be introduced later.
Theorem 1.3.1, proved in Section 1.3, demonstrates the ODE method. By this approach, the condition imposed on the noise has been significantly weakened, and it covers a class of noises much larger than that treated by the probabilistic method. However, it a priori requires {x_k} to be bounded. This is the case if the sequence converges, but before establishing its convergence this is an artificial condition, which is not satisfied even for the simple example given above. Further, although the noise condition (1.3.3) is much more general than that used in Theorem 1.2.1, it is still difficult to verify for state-dependent noise. Consider, for example, noise driven by a martingale difference sequence with bounded conditional second moments. If {x_k} is bounded, then the weighted noise series converges a.s. and (1.3.3) holds. However, in general, it is difficult to directly verify (1.3.3), because the behavior of {x_k} is unknown. This is why we use Condition (1.4.2), which needs to be verified only along convergent subsequences. With convergent subsequences the noise is easier to deal with.
Considering convergent subsequences, path-wise convergence is proved for a truncated RM algorithm by the TS method in Theorem 1.4.1. The weakness of algorithms with fixed truncation bounds is that the sought-for root has to be located in the truncation region. But, in general, this cannot be ensured. This is another motivation to consider algorithms with expanding truncations.
The weak convergence method explained in Section 1.5 can avoid the boundedness assumption on {x_k}, but it can ensure convergence in distribution only, while in practical computation one always deals with a sample path. Hence, people in applications are mainly interested in path-wise convergence.
The SA algorithm with expanding truncations was introduced in order to remove the growth rate restriction on the function. It has been developed in two directions: weakening the conditions imposed on noise, and improving the analysis method. By the TS method we can show that the SA algorithm with expanding truncations converges under a truly weak condition on noise, which, in fact, is also necessary for a wide class of noises.
In Chapter 1, the root x⁰ of f(·) was a singleton. From now on we will consider the general case. Let J be the root set of f(·), i.e., J = {x : f(x) = 0}.
We now define the algorithm. Let {M_k} be a sequence of positive numbers increasingly diverging to infinity, and let x* be a fixed point.
Fix an arbitrary initial value and denote by x_k the estimate at time k, serving as the approximation to J. The estimates are defined by the following recursion:

where 1{·} is an indicator function: it equals 1 if the inequality indicated in the brackets is fulfilled, and equals 0 if the inequality does not hold.
We explain the algorithm. The truncation counter gives the number of truncations up to time k, and the corresponding bound serves as the truncation bound when the next estimate is generated. From (2.1.1) it is seen that if the estimate calculated by the RM algorithm remains in the truncation region, then the algorithm evolves as the RM algorithm. If it exits the sphere with the current radius, then the estimate is pulled back to the pre-specified point x* and the truncation bound is enlarged to the next one.
Consequently, if it can be shown that the number of truncations is finite or, equivalently, that {x_k} generated by (2.1.1) and (2.1.2) is bounded, then the algorithm (2.1.1) and (2.1.2) turns into one without truncations, i.e., into the RM algorithm, after a finite number of steps. This actually is the key step when we prove convergence of (2.1.1) and (2.1.2).
The convergence analysis of (2.1.1) and (2.1.2) will be given in the next section, and the analysis is carried out in a deterministic way at a fixed sample path, without involving any interpolating function.
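The recursion just described can be sketched as follows. Everything concrete here is an illustrative assumption: the bounds M_j = j + 1, the reset point x* = 5, the function f(x) = −x³ − x (root 0, growing faster than linearly), and i.i.d. N(0,1) observation noise.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return -x ** 3 - x    # unique root 0; grows faster than linearly

x, sigma, x_star = 5.0, 0, 5.0     # sigma counts truncations so far
for k in range(1, 50001):
    y = f(x) + rng.normal(0.0, 1.0)          # noisy observation
    cand = x + y / k                         # ordinary RM update, a_k = 1/k
    if abs(cand) <= sigma + 1.0:             # inside the current ball
        x = cand
    else:                                    # pull back, enlarge the bound
        x, sigma = x_star, sigma + 1
print(x, sigma)   # estimate near the root 0, after finitely many truncations
```

After an initial burst of truncations the bound index stops growing, and from then on the recursion is an ordinary RM iteration, which is exactly the behavior the convergence proofs below exploit.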
2.2. General Convergence Theorems by TS Method
In this section, by the TS method, we establish convergence of the RM algorithm with expanding truncations defined by (2.1.1)–(2.1.3) under general conditions. Let us first list the conditions to be used.
A2.2.2 There is a continuously differentiable function (not necessarily nonnegative) such that

for any, and v(J) is nowhere dense, where J is the zero set of f(·), i.e.,

J = {x : f(x) = 0},

and v_x denotes the gradient of v(·). Further, x* used in (2.1.1) is such that for some and
For introducing the condition on noise, let us denote by (Ω, F, P) the underlying probability space, and let the noise be a measurable function defined on the product space. Fixing an ω means that a sample path is under consideration. Let the noise be given by

Thus, state-dependent noise is considered, and for fixed x the noise may be random.

A2.2.3 For the sample path under consideration and for any sufficiently large integer,

for any such that converges, where is given by (1.3.2) and denotes the estimate given by (2.1.1)–(2.1.3) valued at the sample path ω.

In the sequel, the algorithm (2.1.1)–(2.1.3) is considered for the fixed ω for which A2.2.3 holds, and ω will often be suppressed if no confusion is caused.

A2.2.4 f(·) is measurable and locally bounded.
Remark 2.2.1 Comparing A2.2.1–A2.2.4 with A1.4.1–A1.4.4, we find that if the root set J degenerates to a singleton, then the only essential difference is that an indicator function is included in (2.2.2), while (1.4.2) stands without it. It is clear that if {x_k} is bounded, then this makes no difference. However, before establishing the boundedness of {x_k}, condition (2.2.2) is easier to verify. The key point here
is that, in contrast to Section 1.4, we do not assume availability of an upper bound for the roots of f(·).
Remark 2.2.2 It is worth noting that converges. To see this it suffices to take in (2.2.2).
Theorem 2.2.1 Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume A2.2.1–A2.2.4 hold. Then, for the sample path for which A2.2.3 holds.
Proof. The proof is completed in six steps by considering convergent subsequences at the sample path. This is why we call the analysis method used here the TS method.
Step 1. We show that there are constants such that
for any there exists such that for any

if is a convergent subsequence of, where M is independent of and. Since, we need only to prove (2.2.3) for.
If the number of truncations in (2.1.1)–(2.1.3) is finite, then there is an N such that, i.e., there is no more truncation afterwards. Hence, whenever. In this case, we may take in (2.2.3).
We now prove (2.2.3) for the case where as. Assume the converse, that (2.2.3) is not true. Take. There is such that
Take a sequence of positive real numbers and as
Since (2.2.3) is not true, for there are and such that
and for any there are and such that
Without loss of generality we may assume
Then for any from (2.2.4) and (2.2.6) it follows
that
Since, there is such that. Then from (2.2.7) it follows that
For any fixed if is large enough, then andand by (2.2.10)
Since from (2.2.11) it follows
that
and by (2.2.4), (2.2.7), and (2.2.8)
and hence
by A2.2.4, where is a constant.
Let, where is specified in A2.2.3. Then from A2.2.3,
for any
Taking and respectively in (2.2.10)
and noticing from (2.2.9), we then have
and hence
From (2.2.8), it follows that
where the second term on the right-hand side of the inequality tends to zero by (2.2.12) and (2.2.13), while the first term tends to zero because
Noticing that by
(2.2.9) and (2.2.13), we then by (2.2.14) have
On the other hand, by (2.2.6) we have
The obtained contradiction proves (2.2.3).
Step 2. We now show that, for all large enough,

if T is small enough, where is a constant.
If the number of truncations in (2.1.1)–(2.1.3) is finite, then is bounded and hence is also bounded.
Then for large enough there is no truncation, and by (2.2.2) for
if T is small enough. In (2.2.16), for the last inequality the boundednessof is invoked, and is a constant.
Thus, it suffices to prove (2.2.15) for the case where
From (2.2.3) it follows that for any
if is large enough. This implies that for
where is a constant. The last inequality of (2.2.18) yields
With in A2.2.3, from (2.2.2) we have
for large enough and small enough T.
Combining (2.2.18), (2.2.19), and (2.2.20) leads to
for all large enough. This together with (2.2.16) verifies (2.2.15).
Step 3. We now show the following assertion: for any interval with and, the sequence cannot cross infinitely many times with
andAssume the converse: there are infinitely many crossings
and is bounded.
By boundedness of, without loss of generality, we may assume
By setting in (2.2.15), we have
But by definition so we have
From (2.2.15) we see that if is taken sufficiently small, then

for sufficiently large.
By (2.2.18) and (2.2.15), for large we then have
where denotes the gradient of and asFor condition (2.2.2) implies that
By (2.2.15) and (2.2.18) it follows that
bounded, where by “crossing by we mean that
Then, by (2.2.23) and (2.2.1) from (2.2.24)–(2.2.26) it follows that thereare and such that
for all sufficiently large. Noticing (2.2.22), from (2.2.27) we derive
However, by (2.2.15) we have
which implies that, for small enough,. This means that, which contradicts (2.2.28).
Step 4. We now show that the number of truncations is bounded.
By A2.2.2, v(J) is nowhere dense, and hence a nonempty interval exists such that and
If, then starting from, the estimates will cross the sphere infinitely many times. Consequently, the corresponding values will cross infinitely often while remaining bounded. In Step 3 we have shown this process is impossible. Therefore, starting from some step, the algorithm (2.1.1)–(2.1.3) will have no truncations, and {x_k} is bounded.
This means that the algorithm defined by (2.1.1)–(2.1.3) turns into the conventional RM algorithm from some step on, and a condition stronger than (2.2.2) is satisfied:
for any such that converges.
Step 5. We now show that the sequence converges. Let
We have to show. If and one of and does not belong to, then exists such that and. By Step 3 this is impossible. So, both and belong to, and
If we can show that is dense in, then from (2.2.30) it will follow that is dense in, which contradicts the assumption that is nowhere dense. This will prove, i.e., the convergence of
To show that is dense in, it suffices to show that. Assume the converse: there is a subsequence
Without loss of generality, we may assume converges. Otherwise,a convergent subsequence can be extracted, which is possible because
is bounded. However, if we take in (2.2.15), we have
which contradicts (2.2.31). Thus, and converges.
Step 6. For proving, it suffices to show that all limit points of belong to J.
Assume the converse: By (2.2.15) we
have
for all large if is small enough. By (2.2.1) it follows that
and from (2.2.24)
for small enough. This leads to a contradiction, because converges and the left-hand side of (2.2.32) tends to zero as. Thus, we conclude
Remark 2.2.3 In (2.1.1)–(2.1.3) spheres with expanding radii are used for truncations. Obviously, the spheres can be replaced by other expanding sets. At first glance the point x* in (2.1.1) may be arbitrarily chosen, but actually a restriction is imposed: there must exist such that. The condition is obviously satisfied if as, because then the availability of is not required.
Remark 2.2.4 In the proof of Theorem 2.2.1 it can be seen that the conclusion remains valid if the requirement in A2.2.2 that "J is the zero set of f(·)" is removed. As a matter of fact, J may be bigger than the zero set of f(·). Of course, it should at least contain the zero set of f(·) in order for (2.2.1) to be satisfied. It should also be noted that for, we need not require to be nowhere dense.
Let us modify A2.2.2 as follows.
A2.2.2’ There is a continuously differentiable function
such that
for any and is nowhere dense. Further, used in
(2.1.1) is such that for some and
A2.2.2” There is a continuously differentiable function
such that
for any and J is closed. Further, used in (2.1.1) is such
that for some and
Notice that, in A2.2.2' and A2.2.2'' the set J is not specified, but it certainly contains the root sets of both and. We may modify Theorem 2.2.1 as follows.
Theorem 2.2.1' Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume A2.2.1, A2.2.2', A2.2.3, and A2.2.4 hold. Then, for the sample path for which A2.2.3 holds.
Proof. The proof of Theorem 2.2.1 applies without any change.
Theorem 2.2.1'' Let be given by (2.1.1)–(2.1.3) for a given initial value. If A2.2.1, A2.2.2'', A2.2.3, and A2.2.4 hold, then, for the sample path for which A2.2.3 holds.
Proof. We still have Steps 1–3 in the proof of Theorem 2.2.1. Let
If or, or both, do not belong to J, then exists such that, since J is closed. Then would cross infinitely many times. But, by Step 3 of the proof of Theorem 2.2.1, this is impossible. Therefore both and belong to
Theorems 2.2.1 and 2.2.1' only guarantee that the distance between the estimate and the set J tends to zero. As a matter of fact, we have a more precise result.
Theorem 2.2.2 Assume conditions of Theorem 2.2.1 or Theorem 2.2.1’
hold. Then for fixed and for which A2.2.3 holds, a connected subset
exists such that
where denotes the closure of and is generated by (2.1.1)–
(2.1.3).
Proof. Denote by the set of limit points of Assume the
converse: i.e., is disconnected. In other words, closed sets andexist such that and
Define
Since a exists such that
where denotes the neighborhood of the set A.
Define
It is clear that and
Since by we have
By boundedness of, we may assume that converges. Then, by taking in (2.2.15), we derive
which contradicts (2.2.33) and proves the theorem.
Corollary 2.2.1 If J is not dense in any connected set, then under the conditions of Theorem 2.2.1, the estimate given by (2.1.1)–(2.1.3) converges to a point in J. This is because in the present case any connected set in J consists of a single point.
Example 2.2.1 Reconsider the example given in Section 2.1:
It was shown that the RM algorithm rapidly diverges to even in thenoise-free case.
We now assume the observations are noise-corrupted:
where is an ARMA process driven by the independent identicallydistributed normal random variables
where
We use the algorithm (2.1.1)–(2.1.3) with The
computation shows
which tend to the sought-for root 10.
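The expanding-truncations scheme of (2.1.1)–(2.1.3) can be sketched numerically. The regression function, the ARMA coefficients, and the step sizes below are illustrative assumptions (the example's actual data are not reproduced in this scan); only the truncation mechanism follows the text: take an ordinary RM step, but whenever the candidate iterate leaves the current bound, reset to a fixed point x* and enlarge the bound.

```python
import numpy as np

def f(x):
    # Hypothetical regression function with unique root 10; the book's
    # Section 2.1 example is not reproduced in this scan.
    return -(x - 10.0) ** 3

def arma_noise(n, phi=0.5, theta=0.3, seed=0):
    # ARMA(1,1) noise driven by i.i.d. N(0,1) variables
    # (illustrative coefficients).
    xi = np.random.default_rng(seed).standard_normal(n)
    eps = np.zeros(n)
    for k in range(1, n):
        eps[k] = phi * eps[k - 1] + xi[k] + theta * xi[k - 1]
    return eps

def rm_expanding_truncations(n_steps=5000, x_star=0.0):
    """SA with expanding truncations, in the spirit of (2.1.1)-(2.1.3)."""
    eps = arma_noise(n_steps)
    x, sigma = x_star, 0                  # sigma counts truncations so far
    for k in range(n_steps):
        a_k = 1.0 / (k + 1)               # a_k -> 0, sum a_k = infinity
        M = 2.0 ** sigma                  # truncation bounds M_s -> infinity
        cand = x + a_k * (f(x) + eps[k])  # tentative RM step, noisy observation
        if abs(cand) <= M:
            x = cand                      # ordinary RM step
        else:
            x, sigma = x_star, sigma + 1  # truncate: reset to x*, enlarge bound
    return x

print(round(rm_expanding_truncations(), 1))
```

With these (assumed) choices the iterates are truncated a number of times early on; the bound then stabilizes and the estimate settles near the root 10, matching the behavior reported above.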
Example 2.2.2 Let Then
Clearly, A2.2.1 and A2.2.4 hold. Concerning A2.2.2, we may taketo serve as Since
(2.2.1) is satisfied. The existence of required in A2.2.2 is obvious, forexample,
Finally, is nowhere dense. So A2.2.2 also holds.
Now assume the noise is such that
Then A2.2.3 is satisfied too.
By Corollary 2.2.1, given by (2.1.1)–(2.1.3) converges to a point
If for the conventional (untruncated) RM algorithm
it is a priori known that is bounded, then we have the following theorem.
Theorem 2.2.3 Assume A2.2.1–A2.2.4 hold, but with the requirement in A2.2.2 (“Further, used in (2.1.1) is such that for some and”) removed. If produced by (2.2.34) is bounded, then for the sample path for which A2.2.3
holds, where is a connected subset of
Proof. As a matter of fact, by boundedness of (2.2.3) and (2.2.15) become obvious. Steps 3, 5, and 6 in the proof of Theorem 2.2.1 remain unchanged, while Step 4 is no longer needed. Then the conclusion follows from Theorems 2.2.1 and 2.2.2.
Remark 2.2.5 All theorems concerning SA algorithms with expanding truncations remain valid for produced by (2.2.34), if given by
(2.2.34) is known to be bounded.
Theorems 2.2.1 and 2.2.2 concern the time-invariant function but the results can easily be extended to time-varying functions, i.e., to the case where the measurements are carried out for
where depends on time
Conditions A2.2.2 and A2.2.4 are respectively replaced by the following conditions:
A2.2.2° There is a continuously differentiable function such that
is so weak that it is in fact necessary, as will be shown later. However, condition A2.2.3 is state-dependent in the sense that the condition itself depends on the behavior of This makes it not always possible to verify the condition beforehand. We now aim to give convergence theorems under conditions that do not involve the state. For this we have to reformulate Theorems 2.2.1 and 2.2.2.
As defined in Section 2.2, where is a measurable function. In lieu of A2.2.3 we introduce the following condition.
A2.3.1 For any sufficiently large integer there is an
with such that for any
for any such that converges.
Theorem 2.3.1 Assume A2.2.1, A2.2.2, A2.2.4, and A2.3.1 hold. Then
a.s. for generated by (2.1.1)–(2.1.3) with a
given initial value where is a connected subset contained in theclosure of J.
Proof. Let It is clear that
i.e., Then for any
A2.2.3 is fulfilled with possibly depending on and the conclusion of the theorem follows from Theorems 2.2.1 and 2.2.2.
We now introduce a state-independent condition on the noise.
A2.3.2 For any is a martingale difference sequence and for some
where is a family of nondecreasing σ-algebras independent of
We first give an example satisfying A2.3.2. Let be an -dimensional martingale difference sequence with
for some and let
be a measurable and locally bounded function. Then satisfies A2.3.2, because
and
by assumption.
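The construction just described can be sketched as follows; the specific g and the i.i.d. driving sequence are illustrative assumptions. For each fixed x the noise g(x)·ξ is a martingale difference sequence, and with step sizes satisfying the square-summability used below, the weighted series converges a.s. by the martingale convergence theorem, which is what Theorem 2.3.2 exploits.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x):
    # Illustrative locally bounded measurable function (an assumption).
    return 1.0 + abs(x)

n = 200_000
xi = rng.standard_normal(n)    # i.i.d. zero-mean, hence a martingale difference sequence
a = 1.0 / np.arange(1, n + 1)  # step sizes with sum a_k^2 < infinity

# For each fixed x, the noise g(x) * xi_k satisfies A2.3.2, and the series
# sum_k a_k * g(x) * xi_k converges a.s. (martingale convergence theorem);
# the partial sums should therefore settle down.
for x in (0.0, 3.0):
    s = np.cumsum(a * g(x) * xi)
    tail_spread = s[n // 2:].max() - s[n // 2:].min()
    print(x, round(float(tail_spread), 4))
```

The printed tail spreads are tiny, numerically illustrating the a.s. convergence of the weighted noise series for each fixed x.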
Theorem 2.3.2 Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume A2.2.1, A2.2.2, A2.2.4, and A2.3.2 hold and
for given in A2.3.2. Then a.s., where is a
connected subset contained in
Proof. Since is measurable and is it follows that is adapted. Approximating
by simple functions, it is seen that
Hence, is a martingale difference sequence, and
a.s.
By the convergence theorem for martingale difference sequences, theseries
converges a.s., which implies that with exists such thatfor each
converges to zero as uniformly in
This means that A2.3.1 holds, and the conclusion of the theorem
follows from Theorem 2.3.1.
In applications it may happen that is not directly observed. Instead, the time-varying functions are observed, and the observations may be done not at but at i.e., at with bias
Theorem 2.3.3 Let be given by (2.1.1)–(2.1.3) for a given initial value. Assume that A2.2.1, A2.2.2, A2.2.4, and A2.3.2 hold and
for p given in A2.3.2. Further, assume is an
adapted sequence, is bounded by a constant, and for any sufficiently
large integer there exists with such that for any
for any such that converges. Then, a.s., where is a connected subset contained in
Proof. By assumption where is a constant. Then
and again by the convergence theorem for martingale difference sequences,
the series
converges a.s. Consequently, there exists with such that
for any the convergence indicated in (2.3.5) holds and for any
integer
tends to zero as uniformly in
Therefore, A2.3.1 is fulfilled and the conclusion of the theorem follows
from Theorem 2.3.1.
Remark 2.3.1 The obvious sufficient condition for (2.3.5) is
which in turn is satisfied, if is continuous and
Remark 2.3.2 Theorems 2.3.2 and 2.3.3, with A2.2.2 and A2.2.4 replaced by A2.2.2° and A2.2.4’, respectively, remain valid if is replaced by the time-varying
2.4. Necessity of Noise Condition
Under Conditions A2.2.1–A2.2.4 we have established convergence theorems for recursively given by (2.1.1)–(2.1.3). Condition A2.2.1 is a commonly accepted requirement for decreasing step sizes, while A2.2.2 is a stability condition. This kind of condition is unavoidable for convergence of SA-type algorithms, although it may appear in different forms. Concerning A2.2.4 on it is the weakest possible: neither continuity nor a growth rate condition on is required. So, it is natural to ask: is it possible to further weaken Condition A2.2.3 on the noise? We now answer this question.
Theorem 2.4.1 Assume only has one root , i.e., and
is continuous at Further, assume A2.2.1 and A2.2.2 hold. Then given by (2.1.1)–(2.1.3) converges to on those sample paths for
which one of the following conditions holds:
i)
ii) can be decomposed into two parts such that
and
Conversely, if then both i) and ii) are satisfied.
Proof. Sufficiency. It is clear that ii) implies i), which in turn implies
A2.2.3. Consequently, sufficiency follows from Theorem 2.2.1.
Necessity. Assume Then is bounded and (2.1.1)–(2.1.3) becomes the RM algorithm after a finite number of steps (for . Therefore,
where
Since and is continuous, Condition ii) is satisfied. And,
Condition i) being a consequence of ii) also holds.
Remark 2.4.1 In the case where and is continuous at
, under conditions A2.2.1, A2.2.2, and A2.2.3, by Theorem 2.2.1 we arrive at Then by Theorem 2.4.1 we derive (2.4.1), which is stronger than A2.2.3. One may ask: why can the weaker condition A2.2.3 imply the stronger condition (2.4.1)? Are they equivalent? The answer
is both “yes” and “no”: yes, these conditions are equivalent, but only under the additional conditions A2.2.1, A2.2.2, and continuity of at being the unique root of However, by themselves these conditions are not equivalent, because A2.2.3 is indeed weaker than (2.4.1).
We now consider the multi-root case. Instead of the singleton wenow have a root set J . Accordingly, continuity of at is replacedby the following condition
In order to derive the necessary condition on noise, we consider the
linear interpolating function
where From form a family of functions, where
where is a constant.
For any subsequence define
where appearing on the right-hand side of (2.4.3) denotes the dependence of the limit function on the subsequence, and the limsup of a vector sequence is taken component-wise. In general, may be discontinuous. However, if then
which is not only continuous but also differentiable.
Thus, (2.4.2) for the multi-root case corresponds to the continuity of
at for the single root case, while and a certain
analytic property of correspond to
Theorem 2.4.2 Assume (2.4.2), A2.2.1, A2.2.2, and A2.2.4 hold. Then
given by (2.1.1)–(2.1.3) is bounded, and the
right derivative for any convergent subsequence
Necessity. We now assume is bounded, and
for any convergent subsequence and want to show A2.2.3. Let For any from (2.4.5) we have
From (2.4.6) it is seen that
where as
The assumption means that
where and
Noticing the continuity of from (2.4.10) and (2.4.11) it follows
that
which incorporating with yields (2.4.9). Thus, we have
for any such that converges.
By the boundedness of (2.4.12) is equivalent to (2.2.2), and the
proof is completed.
Corollary 2.4.1 Assume (2.4.2), A2.2.1, A2.2.2, and A2.2.4 hold, and assume J is not dense in any connected set. Then given by (2.1.1)–
(2.1.3) converges to some point in J if and only if A2.2.3 holds.
This corollary is a direct generalization of Theorem 2.4.1. The sufficiency part follows from Corollary 2.2.1, while the necessity part follows from Theorem 2.4.2 if we notice that convergence of implies
for sufficiently large
The first term on the right-hand side of (2.4.8) tends to zero asby (2.4.2) and So, to verify A2.2.3 it suffices to
show that
2.5. Non-Additive Noise
In the algorithm (2.1.1)–(2.1.3) the noise in the observation is additive. In this section we continue considering (2.1.1)–(2.1.2), but in lieu of (2.1.3) we now have the non-additive noise
where is the observation noise at time
The problem is: under which conditions does the algorithm defined by (2.1.1), (2.1.2), and (2.5.1) converge to J, the root set of which is the average of with respect to its second argument? To be precise, let be an measurable function and let be a
distribution function in The function is defined by
It is clear that the observation given by (2.5.1) can formally be expressed as one with additive noise:
and Theorems 2.2.1 and 2.2.2 can still be applied. The basic problem is how to verify A2.2.3. In other words, under which conditions on and does given by (2.5.3) satisfy A2.2.3?
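The reduction (2.5.3) can be illustrated with a toy example; the function g and the distribution of the driving noise below are assumptions made for illustration, not the book's data. The observation g(x_k, ξ_{k+1}) is treated as f(x_k) plus the zero-mean disturbance g(x_k, ξ_{k+1}) − f(x_k), and a plain RM iteration then finds the root of the averaged function f.

```python
import numpy as np

rng = np.random.default_rng(2)

def g(x, y):
    # Illustrative g (an assumption): with y ~ N(0, 1), averaging over the
    # second argument gives f(x) = E[g(x, y)] = 5 - x, whose root is 5.
    return (5.0 - x) * (1.0 + 0.5 * y)

x = 0.0
for k in range(20_000):
    a_k = 1.0 / (k + 1)
    obs = g(x, rng.standard_normal())  # non-additive noise: O_{k+1} = g(x_k, xi_{k+1})
    x += a_k * obs                     # formally an RM step driven by f(x_k) plus the
                                       # zero-mean disturbance g(x_k, xi_{k+1}) - f(x_k)

print(round(x, 1))
```

Because the disturbance is conditionally zero-mean given the past, the iterates converge to the root of the averaged function even though the noise enters multiplicatively rather than additively.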
Before describing the conditions to be used, we first introduce some notation. We always take the regular version of the conditional probability. This makes the conditional distributions introduced later well-defined.
Let be the distribution function of and be theconditional distribution of given where
Further, let us introduce the following coefficients,
where denotes the Borel σ-algebra in and, for a random variable, where runs over all sets with probability zero.
is known as the mixing coefficient of and it measures the
dependence between and It is clear thatmeasures the closeness of the distribution of to
The following conditions will be needed.
A2.5.2 (=A2.2.2);
A2.5.3 is a measurable function and is locally Lipschitz-continuous in the first argument, i.e., for any fixed
where is a constant depending on
A2.5.4 (Noise Condition)
i) is a process with mixing coefficient as uniformly in
ii)
where is defined in (2.5.6);iii) as
Theorem 2.5.1 Assume A2.5.1–A2.5.4. Then for generated by
(2.1.1), (2.1.2), and (2.5.1)
where is a connected subset of
The proof consists in verifying that Condition A2.2.3 is satisfied a.s. by given in (2.5.3). Then the theorem follows from Theorems 2.2.1 and
2.2.2.
We first prove some lemmas.
Lemma 2.5.1 Assume A2.5.1, A2.5.3, and A2.5.4 hold. Then there
is an with such that for any and any bounded
subsequence of say,
A2.5.1
as
(without loss of generality assume there exists an integer
such that for all
if T is small enough, where is given by (2.1.1), (2.1.2), and (2.5.1),and is given by (1.3.2).
Proof. For any set
By setting in (2.5.6), it is clear that
From (2.5.7), it follows that
and
where (and hereafter) L is taken large enough so that
Since is a convergent martingale, there is a a.s.
such that
From (2.5.13) and it is clear that for any integer L the
series of martingale differences
converges a.s. Denote by the set where the above series converges, and set
It is clear that
Let be fixed and with and
Then for any integer by (2.5.13) we have
where the first term on the right-hand side tends to zero as by(2.5.15).
Assume is sufficiently large such that
i) for if as or
ii) if
We note that in case ii) there will be no truncation in (2.1.1) for
Assume and fix a small enough T such that Let be arbitrarily fixed.
We prove (2.5.9) by induction. It is clear that (2.5.9) is true for
Assume (2.5.9) is true for and there is no truncation for if Noticing we have, by (2.5.16)
if is large enough.
This means that at time there is no truncation in (2.1.1), and
Lemma 2.5.2 Assume A2.5.1, A2.5.3, and A2.5.4 hold. There is an with such that if and if as
is a bounded subsequence of produced by (2.1.1), (2.1.2),
and (2.5.1), then
Proof. Write
where
By (2.5.13), for we have
which converges to a finite limit as by the martingale convergence theorem.
Therefore, for any integers L and
converges a.s.
Therefore, there is with such that (2.5.23) holds for any integers L and
Let be fixed, By Lemma 2.5.1, for small
Then
for any by (2.5.23).
We now estimate (II). By Lemma 2.5.1 we have the following,
Noticing (2.5.7) and (2.5.14), we then have
Similarly, by Lemma 2.5.1 and (2.5.7)
Combining (2.5.18), (2.5.24), and (2.5.26) leads to
Therefore, to prove the lemma it suffices to show that the right-hand side of (2.5.27) is zero.
Applying the Jordan-Hahn decomposition to the signed measure,
and noticing that is a process with mixing coefficient we know that there is a Borel set D in such that for any
Borel set A in
and
Then, we have the following,
where
For any given there is a j such that
For any fixed by (2.5.13), (2.5.14), and it follows that
Therefore,
Since may be arbitrarily small, this, combined with (2.5.27),
proves the lemma.
Proof of Theorem 2.5.1.
To prove the theorem it suffices to show that A2.2.3 is satisfied by a.s. By Lemma 2.5.2, we need only prove
that
for is a bounded subsequence, and as
Assume
Applying the Jordan-Hahn decomposition to the signed measure,
we conclude that
where for the last inequality (2.5.8) and (2.5.12) are invoked. Since as the right-hand side of (2.5.32) tends to zero as for any This proves (2.5.31) and completes the proof
of Theorem 2.5.1.
Remark 2.5.1 From the expression (2.5.3) for the observation it is seen that the observation with non-additive noise can be reduced to one with additive but state-dependent noise, which was considered in Section 2.3. However, Theorem 2.5.1 is not covered by the theorems in Section 2.3, and vice versa.
2.6. Connection Between Trajectory Convergence and Property of Limit Points
In the multi-root case, what we have established so far is that the distance between given by (2.1.1)–(2.1.3) and a connected subset of converges to zero under various sets of conditions.
As pointed out in Corollary 2.2.1, if J is not dense in any connected set, then converges to a point belonging to However, it is still not clear how behaves when J is dense in some connected set. The following example shows that may still fail to converge, although
Example 2.6.1 Let
and let
Take step sizes as follows
We apply the RM algorithm (2.2.34) with
As we may take
Then, all conditions A2.2.1–A2.2.4 are satisfied.
Notice that
and
where k is such that
By (2.6.1), it is clear that in (2.6.2)
and
Therefore, is bounded and by Theorem 2.2.4.
As a matter of fact, changes from one to zero and then from zero
to one, and this process repeats forever with decreasing step sizes.
Thus, is dense in [0,1]. This phenomenon hints that for trajectory convergence of the stability-like condition A2.2.2 is not enough; a stronger form of stability is needed.
Definition 2.6.1
A point i.e., a root of is called dominantly stable for if there exist a and a positive measurable function
which is bounded in the interval and
satisfies the following condition
for all the ball centered at with radius
Remark 2.6.1 The dominant stability implies stability. To see this, it
suffices to take as the Lyapunov function. Then
The dominant stability of however, is not necessary for asymptoticstability.
Remark 2.6.2 Equality (2.6.3) holds for any whatever is.Therefore, all interior points of J are dominantly stable for Further,
for a boundary point of J to be dominantly stable for it suffices to verify (2.6.3) for with small i.e., all that are close to and outside J.
Example 2.6.2 Let
In fact, is the gradient of
In this example We now show that all points of J
are dominantly stable for For this, by Remark 2.6.2, it suffices toshow that all with are dominantly stable for and for this,it in turn suffices to show (2.6.3) for any with and
for small enough Denoting by the angle between vectorsand we have for
It is clear that
for all small enough Therefore, all points in J are dominantly stable for
Theorem 2.6.1 Assume A2.2.1, A2.2.2, and A2.2.4 hold. If for a
given is convergent and a limit point of generated
by (2.1.1)–(2.1.3) is dominantly stable for then for this trajectory
Proof. For any define
where is the one indicated in Definition 2.6.1.
It is clear that is well-defined, because there is a convergent subsequence: and for any greater than some If for any for some then by arbitrariness of
Therefore, for proving the theorem, it suffices to show that, for anysmall an exists such that implies if
Since implies A2.2.3, all conditions of Theorem 2.2.1
are satisfied. By the boundedness of we may assume that is large enough so that the truncations no longer occur in (2.1.1)–(2.1.3) for It then follows that
Notice that for any and is bounded, and hence by (2.6.3)
for some because is convergent and
Further,
An argument similar to that used for (2.6.5) leads to
if is large enough. Then from (2.6.6) we have
From (2.6.4) and (2.6.7) we see that we can inductively obtain
Then, noticing by definitions of we have
where the elementary inequality
is used with for the first inequality in (2.6.8), and with
for the third inequality in (2.6.8). Because is bounded,
and an exists such
that
This means that and completes the proof.
For convergence of SA algorithms we have imposed the stability-like
condition A2.2.2 for and the dominant stability condition (2.6.3) for trajectory convergence. It is natural to ask: does a limit point of the trajectory possess a certain stability property? The following example gives a negative answer.
Example 2.6.3 Let
It is straightforward to check that
satisfies A2.2.2. Take
where is a sequence of mutually independent random
variables such that a.s. Then with 1 being
a stable attractor for and all A2.2.1–A2.2.4 are satisfied. TakeThen by Theorem 2.2.1 it follows that
a.s. Since must converge to 0 a.s. Zero, however, is unstable for
In this example converges to a limit, which is independent of initial values and unstable, although conditions A2.2.1–A2.2.4 hold. This
strange phenomenon happens because
as a function of is singular for some in the sense that it
restricts the algorithm to evolve only in a certain set of Therefore,
in order for the limit of to be stable, imposing a certain regularity condition on and some restrictions on the noise is unavoidable.
As in Section 2.3, assume that the observation noise is with being a measurable function defined on Set
Let us introduce the following conditions:
A2.6.1 For a given is a surjection for any
A2.6.2 For any and is continuous in and for any
and
where denotes the ball centered at with radius
It is clear that A2.6.2 is equivalent to A2.6.2’:
A2.6.2’ For any and any compact set
Before formulating Theorem 2.6.2 we first give some remarks on Conditions A2.6.1 and A2.6.2.
Remark 2.6.3 If does not depend on then in (2.6.9) can be removed when taking the supremum. In Condition A2.2.3, is a convergent subsequence, and hence is automatically located in a compact set. In the theorems in Sections 2.2, 2.3, 2.4, and 2.5, the initial value is fixed, and hence for fixed is a fixed sequence. In contrast, in Theorem 2.6.2 we will consider the case where the initial value varies arbitrarily, and hence for any fixed may be any point in If in (2.6.9) were not restricted to a compact set (i.e., with removed in (2.6.9)), then the resulting condition would be too strong. Therefore, putting in (2.6.9) makes the condition reasonable.
Remark 2.6.4 If is continuous and if then is a surjection.
By this property, is a surjection for a large class of For example, let be free of and let the growth rate of be not faster than linear as Then with satisfying A2.2.1 we have as for all Hence, A2.6.1 holds. In the case where the growth rate of is faster than linear as and for some we also have as for all and A2.6.1 holds.
In what follows, by stability of a set for we mean stability in the
Lyapunov sense, i.e., a nonnegative continuously differentiable function
exists such that andfor some where
Theorem 2.6.2 Assume A2.2.1, A2.2.2, and A2.6.2 hold, and that is continuous and for a given A2.6.1 holds. If defined by (2.1.1)–
(2.1.3) with any initial value converges to a limit independent of
then belongs to the unique stable set of
Proof. Since by A2.2.2 and by continuity of exists with such that
Hence, By continuity of J is closed, and hence by A2.2.2,
Since we must have Denote by the connected
subset of containing The minimizer set of that contains isclosed and is contained in Since is a connected set
and by A2.2.2 is nowhere dense, is a constant.
By continuity of all connected root-sets are closed and they are
separated. Thus, there exists a such that
i.e., contains no root of other than those located inSet
Then and
Therefore, by definition, is stable for
We have to show that and is the unique stable root-set.
Let be the connected set of such
that contains By continuity of for an arbitrary small
exist such that and the distance
between the interval and the set is positive;
i.e.,
We first show that, for any and there exist and such that, for any if then
By Theorem 2.2.1, for with sufficiently large there will be no truncation for (2.1.1)–(2.1.3), and
For any let By A2.6.2, sufficiently small
and large enough exist such that for any
If for then (2.6.10) immediately
follows by setting Assume for someLet be the first such one. Then
By (2.6.11), however,
which contradicts (2.6.12). Thus and (2.6.10) is verified.
For a given we now prove the existence of such thatfor any if where the dependence of
on and on the initial value is emphasized. For simplicity of writing,
is written as in the sequel.
Assume the assertion is not true; i.e., for any exists such that and for some
Suppose and
If there exists an with then with exists because is connected and with
This yields a contradictory inequality:
where the first inequality follows from A2.2.2, while the second inequality is because is the minimizer of
Consequently, for any and
and a subsequence of exists, also denoted by for
notational simplicity, such that By the continuity of
Hence, by the fact
By (2.6.10) and the fact we can choose sufficiently
small T and large enough N such that
and i.e.,
for any By (2.6.10), exists with the property such that
Because as for sufficiently large N,
by (2.6.10) the last term of (2.6.15) is Then
By (2.6.10) and the continuity of the third term on the right-hand side of (2.6.16) is and by A2.6.2 (since with for all sufficiently large N), the norm of the second term on the right-hand side of (2.6.16) is also as Hence by A2.2.2 and (2.6.13), some exists such that the right-hand side of (2.6.16) is less than for all sufficiently large N if T is small enough. By noticing and mentioned above, from (2.6.14) it follows that the left-hand side of (2.6.16) tends to a nonnegative limit as The obtained contradiction shows that exists such that for any if With fixed for any by A2.6.1 exists such that By and the arbitrary smallness of from this it
follows that Since by assumption, we have which means that is stable. If another stable set existed such that then by the same argument would belong to The contradiction shows the uniqueness of the stable set.
2.7. Robustness of Stochastic Approximation Algorithms
In this section, for the single root case, i.e., the case we consider the behavior of SA algorithms when the conditions for convergence of the algorithms to are not exactly satisfied. It will be shown that a “small” violation of the conditions has no large effect on the behavior of the algorithm.
The following result, known as the Kronecker lemma, will be used several times in the sequel. We state it separately for ease of reference.
Kronecker Lemma. If $\sum_{k=1}^{\infty} \frac{x_k}{a_k}$ converges, where $\{a_k\}$ is a sequence of positive numbers nondecreasingly diverging to infinity and $\{x_k\}$ is a sequence of matrices, then
$$\frac{1}{a_n} \sum_{k=1}^{n} x_k \xrightarrow[n \to \infty]{} 0.$$
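A quick numerical illustration of the Kronecker lemma: if the series of x_k/a_k converges with a_k positive and nondecreasing to infinity, then the a_n-normalized partial sums of x_k tend to zero. The alternating-harmonic example below is our illustrative choice, not the book's.

```python
import numpy as np

n = np.arange(1, 1_000_001)
x = (-1.0) ** n              # x_k = (-1)^k (scalar case)
a = n.astype(float)          # a_k = k, positive and nondecreasing to infinity

series = np.cumsum(x / a)    # partial sums of sum_k x_k / a_k: converge (to -log 2)
kronecker = np.cumsum(x) / a # (1/a_n) * sum_{k<=n} x_k: tends to 0

print(round(float(series[-1]), 4), round(float(kronecker[-1]), 6))  # → -0.6931 0.0
```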
Proof. Set $s_0 = 0$ and $s_n = \sum_{k=1}^{n} \frac{x_k}{a_k}$. Since $\{s_n\}$ converges, for any $\varepsilon > 0$ there is an $N$ such that $\|s_m - s_n\| \le \varepsilon$ if $m, n \ge N$. Then it
follows that
as and then
We still consider the algorithm given by (2.1.1)–(2.1.3), where denotes the estimate for at time but may not be the exact root of As a matter of fact, the following set of conditions will be used to
replace A2.2.1–A2.2.4:
A2.7.1 nonincreasingly tends to zero, and
exists such that
A2.7.2 There exists a nonnegative twice continuously differentiable function such that and
A2.7.3 For the sample path the observation noise satisfies the following condition
A2.7.4 is continuous, but need not be the root of
Comparing A2.7.1–A2.7.4 with A2.2.1–A2.2.4, we see that the following conditions required here are not assumed in Section 2.2: nonincreasing
Set
We will only consider those in (2.7.2) for which where is given in (2.7.7). From (2.7.7) and (2.7.8) it is seen that
Consequently, by (2.7.2), a given by (2.7.12) is positive.
By continuity of and and exist such that the following inequalities hold:
By A2.7.3, for can be taken sufficiently large such that
Lemma 2.7.1 Assume A2.7.1, A2.7.2, A2.7.4 hold with given in (2.7.3)being less than or equal to If for given by (2.1.1)–
(2.1.3) with (2.7.5) fulfilled, for some where K is
given in (2.7.18), then for any
Proof. Because is nondecreasing as T increases, it suffices to prove the lemma for
Assume the converse: there exists an such that
Then for any we have
and hence
which incorporating with the definition of leads to
On the other hand, from (2.7.20) and (2.7.21) it follows that
From (2.7.9) we have
By a partial summation we have
Applying (2.7.3) to the first two terms on the right-hand side of (2.7.25),
and (2.7.1) and (2.7.3) to the last term we find
From (2.7.24) and (2.7.26) it then follows that
which contradicts (2.7.22). This proves the lemma.
Lemma 2.7.2 Under the conditions of Lemma 2.7.1, for any
the following estimate holds:
Proof. Since by Lemma 2.7.1 we have
and hence
Consequently, we have
Lemma 2.7.3 Assume A2.7.1–A2.7.4 hold and satisfies (2.7.7). Then
for the sample path for which A2.7.3 holds, a that is independent of
and exists such that
in other words, given by (2.1.1)–(2.1.3) is bounded.
Proof. Let be a sufficiently large integer such that
where K is given by (2.7.18).
Assume the lemma is not true. Then there exist and such
that Let be the maximal integer satisfying thefollowing equality:
Then by definition we have
and by (2.7.28) and (2.7.29),
We first show that under the converse assumption there must be ansuch that
Otherwise, for any and from (2.7.24) it follows
that
This together with (2.7.30) implies
which contradicts the converse assumption. Hence (2.7.31) must hold. By the definition of (2.7.6), and (2.7.30) we have
Since by (2.7.31), from (2.7.4) and (2.7.6) it follows that
We now show For this it suffices to prove by noticing (2.7.34).
Since similar to (2.7.32) we have
and hence
From (2.7.32) and (2.7.36) it is seen that
where for the second inequality, (2.7.9) and are used, whilefor the last inequality (2.7.18) is invoked.
Paying attention to (2.7.10), we have and andby (2.7.16)
Then by (2.7.32) we see and (2.7.34) becomes
Thus, we can define
and have
Taking in Lemmas 2.7.1 and 2.7.2, and paying attentionto (2.7.4) and we know By Lemmas 2.7.1and 2.7.2, from (2.7.28) we see From (2.7.28)–(2.7.30) wehave obtained which together with the definition of
implies and hence Therefore, is well defined, and by Taylor's expansion we have
where with components located in-between and We now show that which, as will be shown, implies
a contradiction.By Lemma 2.7.2 we have
and hence
By (2.7.10) it follows that and by (2.7.11).Using Lemma 2.7.1, we continue (2.7.41) as follows:
Noticing we see. It is clear that (2.7.35) and (2.7.37) remain valid with replaced by Hence, similar to (2.7.37) we have
By (2.7.11) and Taylor's expansion we have
and consequently,
and
By (2.7.40), Substituting (2.7.44) into (2.7.43) and using(2.7.12) lead to
Estimating by a treatment similar to that used for
(2.7.26) yields
Noticing by Lemma 2.7.2 we find that
and
Hence, and by (2.7.15) from (2.7.45) it follows that
Using (2.7.14), from the above estimate we have
From (2.7.18) it follows that Taking notice of (2.7.13) by
(2.7.17) we derive
On the other hand, by Lemma 2.7.2 and (2.7.11), (2.7.17), and (2.7.44)
it follows that
where
From (2.7.39), (2.7.40), and (2.7.48) we see that
and hence which contradicts (2.7.47). This
means that the converse assumption of the lemma cannot hold.
Corollary 2.7.1 From Lemma 2.7.3 it follows that there exist
and which is independent of and arbitrarily varying in
intervals and such that
and for with sufficiently large the algorithm (2.1.1)–(2.1.3)
turns into an ordinary RM algorithm:
Set
Take and denote
By A2.7.2, Set
If in (2.7.2), then In the general case may
be positive.
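Before stating the theorem, it may help to see the expanding-truncation mechanism of (2.1.1)–(2.1.3) at work numerically. The scalar sketch below is purely illustrative: the regression function g, the noise level, the restart point, and the bounds are all hypothetical choices, not the book's specifications.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):                       # hypothetical regression function with root at 2
    return -(x - 2.0)

x, sigma = 10.0, 0              # initial estimate and truncation counter
x_star = 0.0                    # point to restart from after a truncation
truncation_bound = lambda s: s + 1.0    # expanding bounds, tending to infinity

for k in range(1, 20001):
    a_k = 1.0 / k                                # steps: sum a_k = inf, a_k -> 0
    y = g(x) + rng.normal(scale=0.5)             # noisy observation of g at x
    x_new = x + a_k * y                          # RM step
    if abs(x_new) > truncation_bound(sigma):     # estimate escapes current bound:
        x, sigma = x_star, sigma + 1             # restart and enlarge the bound
    else:
        x = x_new
# sigma stops increasing after finitely many steps; x converges to the root 2
```

Once the bound exceeds the region containing both the root and the wandering iterates, the truncation indicator never fires again and the recursion coincides with the ordinary RM algorithm, which is the content of Corollary 2.7.1.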
Theorem 2.7.1 Assume A2.7.1–A2.7.4 hold and is given by
(2.1.1)–(2.1.3) with (2.7.5) held. Then there exist
and a nondecreasing, left-continuous function defined on such
that for the sample path for which A2.7.3 holds,
whenever and where and are the ones appearing in
(2.7.2) and (2.7.3), respectively. As a matter of fact, can be taken as
the inverse function of
Proof. Given recursively define
We now show that exists such that
Set and assume
From the recursion of we have
Assume is large enough such that by A2.7.3
By a partial summation, from (2.7.57) we find that
where (2.7.58) is invoked.By (2.7.1) we see
Without loss of generality, we may assume Then by(2.7.1) we have
Applying (2.7.60) and (2.7.61) to (2.7.59) leads to
and hence
which implies (2.7.56).For and by (2.7.53)
Taking this into account for by (2.7.51)–(2.7.54) and the
Taylor’s expansion we have
Therefore, in the following Taylor’s expansion
we have and henceand
Denote
For we have
From (2.7.63) and (2.7.64) it then follows that
Similar to (2.7.62), we see that
Consequently, we arrive at
Define
It is clear that is nondecreasing as increases and
Take such that Then we have
Define function
It is clear that is left-continuous, nondecreasing and
From (2.7.66) and (2.7.67) it follows that
which implies, by (2.7.57) and the definition of
Corollary 2.7.2 If in (2.7.2) (may not be zero), then
and the right-hand side of (2.7.55) will be
Since may be arbitrarily small, and hence the estimation error may be arbitrarily small. If, in addition, in A2.7.3, then
tending and then in both sides of (2.7.55) we derive
In the case where by tending the right-hand side of (2.7.55) converges to
Consequently, as the estimation error depends on how big is. If in (2.7.2), then can also be taken
arbitrarily small and the estimation error depends on the magnitude of
2.8. Dynamic Stochastic Approximation
So far we have discussed the root-searching problem for an unknown function, which is unchanged during the process of estimation. We now consider the case where the unknown functions together with their roots change with time. To be precise, let be a sequence of unknown
functions with roots i.e., let be the estimate for at time based on the observations
Assume the evolution of the roots satisfies the following equation
where are known functions, while is a sequence of dynamic noises.
The observations are given by
where is the observation noise and is allowed to depend on
In what follows the discussion is for a fixed sample, and the analysisis purely deterministic. Let us arbitrarily take as the estimate forand define
From equation (2.8.1), we see that may serve as a rough estimate for In the sequel, we will impose some conditions on and so that
where is an unknown constant. Therefore, should notdiverge to infinity. But is unknown, so we will use the expandingtruncation technique.
Take a sequence of increasing numbers satisfying
Let be recursively defined by the following algorithm:
where denotes the number of truncations in (2.8.5) occurring up to time
We list conditions to be used.
A2.8.1 and
A2.8.2 is measurable and for any
constant possibly depending on exists so that
for with
A2.8.3 is known such that
for where
and
A2.8.4 and
A2.8.5 There is a continuously differentiable function
such that for and for any
where is a positive constant possibly depending on and A con-
stant exists such that
where is an unknown constant that is an upper bound for
A2.8.6 For any convergent subsequence the observation noise
satisfies
where
Remark 2.8.1 Condition A2.8.2 implies local boundedness, but the upper bound should be uniform with respect to In A2.8.3, measures the difference between the estimation error and the
prediction error In general, is greater than For example, if then A2.8.3 holds with A2.8.4 means that the noise in the root dynamics should be vanishing.
As A2.2.2, Condition A2.8.5 concerns the existence of a Lyapunov function. Imposing such a condition is unavoidable in the convergence analysis of SA algorithms. Inequality (2.8.7) is a mild condition: for example, if as then it is automatically satisfied. The noise condition A2.8.6 is similar to A2.2.3.
Before analyzing the convergence properties of the algorithm (2.8.5), (2.8.6), and (2.8.2), we give an example of application of dynamic stochastic approximation.
Example 2.8.1 Assume that a chemical product is produced in batch mode, and the product quality or quantity of a batch depends on the temperature in the batch. When the temperature equals the ideal one, the product is optimized. Let denote the deviation of the temperature from its optimal value for the batch, where denotes the control parameter, which may be, for example, the pressure in the batch, the quantity of catalytic promoter, the raw material proportion, and others. The deviation reduces to zero if the control equals its optimal value i.e., Because of environmental changes, the optimal parameter may change from batch to batch. Assume
where is known and is the noise.
Let be the estimate for Then may serve as a prediction
for Apply as the control parameter for the batch. Assume that the temperature deviation of for the th batch can be observed, but the observation may be corrupted by noise, i.e., where is the observation noise.
Then we can apply algorithm (2.8.5), (2.8.6), and (2.8.2) to estimateUnder conditions A2.8.1–A2.8.6, by Theorem 2.8.1 to be proved in
this section, the estimate is consistent, i.e.,
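A minimal numerical sketch of this batch example can be given. Everything concrete below is hypothetical: the drift map f, the noise scales, and the step sizes 1/k are illustrative stand-ins for the quantities in (2.8.1)–(2.8.2), and the expanding-truncation safeguard of (2.8.5)–(2.8.6) is omitted since the iterates stay bounded here.

```python
import numpy as np

rng = np.random.default_rng(5)

def f(k, u):                        # known drift of the optimal parameter (hypothetical)
    return u + 2.0 / (k + 1.0)

u_opt = 1.0                         # unknown, slowly moving optimal control parameter
u_hat = 6.0                         # initial estimate
for k in range(1, 50001):
    u_opt = f(k, u_opt) + rng.normal(scale=k ** -1.5)   # vanishing dynamic noise (A2.8.4)
    u_pred = f(k, u_hat)            # predict the next optimum and apply it as control
    # observed temperature deviation of the applied control, with observation noise
    y_obs = (u_pred - u_opt) + rng.normal(scale=0.5)
    u_hat = u_pred - y_obs / k      # RM-type correction driving the deviation to zero

err = abs(u_hat - u_opt)
```

In the run above the estimate tracks the drifting optimum closely even though the optimum itself never settles down, which is the kind of consistency asserted by Theorem 2.8.1.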
Theorem 2.8.1 Under Conditions A2.8.1–A2.8.6 the estimation error
tends to zero as where is given by (2.8.5),
(2.8.6), and (2.8.2).
To prove the theorem we start with lemmas.
Lemma 2.8.1 Under A2.8.3 and A2.8.4, the sequence
is bounded for any
Proof. By A2.8.3 and A2.8.4 from (2.8.1) it follows that
Lemma 2.8.2 Assume A2.8.1–A2.8.4 and A2.8.6 hold. Let be a convergent subsequence such that as Then, there
are a sufficiently small and a sufficiently large integer such that
for
where is implied by
for where is a constant independent
of
Proof. In the case as
is bounded, and hence is bounded. By Lemma
2.8.1, is bounded. Therefore, is bounded. For
large and
The following expression (2.8.11) and estimate (2.8.12) will frequently be used. By (2.8.1) and A2.8.3 we have
and
Substitution of (2.8.12) into (2.8.10) leads to
By boundedness of and A2.8.3,
for some By A2.8.4, while the last term is also
less than by A2.8.6. Without loss of generality, we may assume
Therefore, and the lemma is true for the case
We now consider the case as Let be so large
that for
with being a constant, and
where is given by (2.8.8).
Without loss of generality we may assume
Define and take T so small that We prove the lemma by induction.
By (2.8.8) and (2.8.12), we have
Therefore, at time there is no truncation. Then by (2.8.11) and
(2.8.12) we have
where (2.8.14) and (2.8.15) have been used.Let the conclusions of the lemma hold for
We prove that it also holds for Again by (2.8.12), we have
Hence there is no truncation at time By the inductive assumption, (2.8.11) and (2.8.12), it follows that
where (2.8.13) and (2.8.14) are invoked.Therefore, the conclusions of the lemma are also true for This
completes the proof.
Lemma 2.8.3 Assume A2.8.1–A2.8.6 hold. Then the number of truncations in (2.8.5) is finite and is bounded.
and (2.8.11) we have
Notice that by Lemma 2.8.2 and (2.8.13)
for sufficiently large From (2.8.21) and (2.8.23), it follows that
On the other hand, by Lemma 2.8.2
Identifying and in A2.8.5 to and respectively, we can
find such that
by A2.8.5.
Let us consider the right-hand side of (2.8.22). Noticing
by A2.8.3 and A2.8.4 we have
By A2.8.6,
Noticing that
as and by continuity of we find thattends to zero as and
Since the sum of the first and second
terms on the right-hand side of (2.8.22) is as and
This combining with (2.8.26) yields the following conclusion that for
with sufficiently large and for small enough T from (2.8.22) it
follows that
By (2.8.20), tending to infinity, from (2.8.30) we derive
By Lemma 2.8.2 we have
However, by definition, and Hence from (2.8.32), we must have
if T is small enough. Therefore, This contradicts
(2.8.31). The obtained contradiction shows that
Theorem 2.8.2 Assume A2.8.1–A2.8.6 hold. Then the estimation er-ror tends to zero as
Proof. We first show that converges. Assume the converse:
where because is bounded by Lemma 2.8.3. It is clear that there exists an interval that does not contain zero such that Without loss of generality, assume
From A2.8.6, it follows that there are infinitely manysequences such that and that
forWithout loss of generality we may assume converges:
Since exists such that and by Lemma 2.8.2, Completely the same argument as that used for (2.8.22)–(2.8.32) leads to a contradiction. Hence is convergent.
We now show that as Assume the converse: thereis a subsequence By the same argument we again arrive
at (2.8.30). Tending by convergence of we obtain a contradictory inequality This implies that as
The following theorem is similar to Theorem 2.4.1.
Theorem 2.8.3 Assume A2.8.1–A2.8.5 hold and is continuous at
uniformly in Then as if and only if A2.8.6
holds. Furthermore, under conditions A2.8.1–A2.8.5, the following three
conditions are equivalent.
1) Condition A2.8.6;
2)
3) can be decomposed into two parts: so that
Proof. Assume as Then is bounded. Wehave shown in the proof of Lemma 2.8.3 that the number of truncationsmust be finite if is bounded. Therefore, starting from some thealgorithm (2.8.5) becomes
From (2.8.11) we have
Set
By A2.8.3 and A2.8.4 and as
while tends to zero because is uniformly continuous at and Consequently, 3) holds.
On the other hand, it is clear that 3) implies 2), which in turn implies A2.8.6. By Theorem 2.8.1, under A2.8.1–A2.8.5, Condition A2.8.6 implies as
Thus, the equivalence of 1)–3) has been justified under A2.8.1–A2.8.5.
2.9. Notes and References
The initial version of SA algorithms with expanding truncations and its associated analysis method were introduced in [27], where the algorithm was called SA with randomly varying truncations. Convergence results for this kind of algorithm can also be found in [14, 28]. Theorems given in Section 2.2 are improved versions of those given in [14, 27, 28]. Theorems in Section 2.3 can be found in [18]. Necessity of
the noise condition is proved in [24, 94] for the single-root case, and in [17] for the multi-root case. Convergence results for SA algorithms with additive noise can be found
in [16]. Concerning the measure theory, we refer to [31, 76, 84]. Resultsgiven in Section 2.6 can be found in [48], and some related problems arediscussed in [3]. For the proof of Remark 2.6.4 we refer to Theorem 3.3in [34]. Example 2.6.1 can be found in [93]. Robustness of SA algorithmsis presented in [24]. The dynamic SA was considered in [38, 39, 91], but
the results presented in Section 2.8 are given in [25].
Chapter 3
ASYMPTOTIC PROPERTIES OF STOCHASTIC APPROXIMATION ALGORITHMS
In Chapter 2 we were mainly concerned with the path-wise convergence analysis for SA algorithms with expanding truncations. Conditions were given to guarantee where J denotes the
root set of the unknown function, and the estimate for the unknown root given by the algorithm.
In this chapter, for the case where J consists of a singleton we
consider the convergence rate of asymptotic normality of and asymptotic efficiency of the estimate. Assume is differentiable at Then as
where
It turns out that the convergence rate heavily depends on whether or not F is degenerate. Roughly speaking, in the case where the step size in (2.1.1), the convergence rate of is for some positive when F is nondegenerate, and for some when F vanishes.
It will be shown that is asymptotically normal and the covariance matrix of the limit distribution depends on the matrix D if in (2.1.1) the step size is replaced by If F in (3.0.1) is available, then D can be defined to make the limiting covariance matrix minimal, i.e., to make the estimate efficient. However, this is not the case in SA. One way to overcome the difficulty is to derive an approximate value of F by estimating it, but for this one has to impose rather heavy conditions on Efficiency here is derived by using a sequence of slowly
decreasing step sizes, and the averaged estimate appears asymptoticallyefficient.
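The effect of averaging can be previewed with a small simulation. The sketch below is illustrative only (scalar root-finding with the root at 1, i.i.d. standard normal observation noise, and slowly decreasing steps a_k = k^(-2/3)); it is not the general algorithm analyzed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100000
x = 5.0
xs = np.empty(n)
for k in range(1, n + 1):
    a_k = k ** (-2.0 / 3.0)            # slowly decreasing step sizes
    y = -(x - 1.0) + rng.normal()      # noisy observation; the root is 1
    x += a_k * y                       # RM iteration
    xs[k - 1] = x
x_bar = xs.mean()                      # averaged estimate
```

The last iterate fluctuates roughly on the scale of the square root of a_n, while the average of the iterates fluctuates on the smaller scale n^(-1/2); this contrast is the heuristic behind the averaging approach.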
3.1. Convergence Rate: Nondegenerate Case
In this section, we give the rate of convergence of to zero
in the case F in (3.0.1) is nondegenerate, where is given by (2.1.1)–(2.1.3). It is worth noting that F is the coefficient for the first order inthe Taylor’s expansion for
The following conditions are to be used.
A3.1.2 A continuously differentiable function exists
such that
for any and for some with
where is used in (2.1.1).
A3.1.3 For the sample path under consideration the observation noise
in (2.1.3) can be decomposed into two parts such that
for some
A3.1.4 is measurable and locally bounded, and is differentiable at
such that as
The matrix F is stable (this implies nondegeneracy of F); in addition, is also stable, where and are given by (3.1.1) and (3.1.3), respectively.
By stability of a matrix we mean that all its eigenvalues have negative real parts.
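Stability in this sense is equivalent to solvability of a Lyapunov equation: for a stable H and any positive definite S there is a unique positive definite P with H^T P + P H = -S, a fact used repeatedly below (e.g. as (3.1.9)). A small numerical check, with an illustrative H, can be done by vectorizing the linear map P -> H^T P + P H:

```python
import numpy as np

H = np.array([[-1.0, 2.0],
              [0.0, -3.0]])          # stable: eigenvalues -1 and -3
S = np.eye(2)                        # any positive definite S

# Build the matrix of the linear map P -> H^T P + P H column by column
# from its action on basis matrices, then solve H^T P + P H = -S for P.
n = H.shape[0]
A = np.zeros((n * n, n * n))
for idx in range(n * n):
    E = np.zeros(n * n)
    E[idx] = 1.0
    E = E.reshape(n, n)
    A[:, idx] = (H.T @ E + E @ H).flatten()
P = np.linalg.solve(A, -S.flatten()).reshape(n, n)

residual = H.T @ P + P @ H + S                 # should vanish
eigP = np.linalg.eigvalsh((P + P.T) / 2.0)     # P is symmetric positive definite
```

The test matrices are hypothetical; the column-by-column construction works in any dimension and avoids committing to a particular vectorization convention.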
Remark 3.1.1 We now compare A3.1.1–A3.1.4 with A2.2.1–A2.2.4. Because of the additional requirement (3.1.1), A3.1.1 is stronger than A2.2.1, but it is automatically satisfied if with In this case a in (3.1.1) equals Also, (3.1.1) is satisfied if with
In this case Take sufficiently small such that
Then and Assume
is a martingale difference sequence with
Then, by the convergence theorem for martingale difference sequences, Therefore (3.1.3) is satisfied a.s. with
Condition A3.1.4 assumes differentiability of which is not required in A2.2.4.
Lemma 3.1.1 Let and H be -matrices. Assume H is stable
and If satisfies A3.1.1 and l-dimensional vectors
satisfy the following conditions
then defined by the following recursion with arbitrary initial value
tends to zero:
Proof. Set
We now show that there exist constants and such that
Let S be any negative definite matrix. Consider
and hence
where denotes the minimum eigenvalue of P.
Paying attention to the fact that
from (3.1.13) we derive
which verifies (3.1.8). From (3.1.6) it follows that
We have to show that the right-hand side of (3.1.14) tends to zero as
For any fixed because of (3.1.1) and(3.1.8). This implies that as for any initial value
Since as for any exists such thatThen by (3.1.8) we have
The first term on the right-hand side of (3.1.15) tends to zero by A3.1.1, while the second term can be estimated as follows:
where the first inequality is valid for sufficiently large sinceas and the second inequality is valid when
Therefore, the right-hand side of (3.1.15) tends to zero asand then
Set
By assumption of the lemma Hence, for anythere exists such that By a partialsummation, we have
where, except for the last term, the sum of the remaining terms tends to zero as by (3.1.8) and
Let us now estimate
Since for and as by (3.1.8)
we have
which tends to zero as and by (3.1.16) and the factthat Thus, the right-hand side of (3.1.17) tends to
zero as and the proof of the lemma is completed.
Theorem 3.1.1 Assume A3.1.1–A3.1.4 hold. Then given by (2.1.1)–
(2.1.3) for those sample paths for which (3.1.3) holds converges to
with the following convergence rate:
where is the one given in (3.1.3).
Proof. We first note that by Theorem 2.4.1 and there is no
truncation after a finite number of steps. Without loss of generality, we
may assume By (3.1.1), Hence, by Taylor's expansion we
have
Write given by (3.1.4) as follows
where
By (3.1.4) and (3.1.19), for sufficiently large k we have
if is a martingale difference sequence with
So, for (3.1.25) it is sufficient to require
Since the best convergence rate is achieved at the convergence rate is Since the convergence rate slows down as approaches
When (3.1.25) cannot be guaranteed. From this it is seen that the convergence rate depends on how big is.
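The scalar case with i.i.d. noise makes the rate visible numerically. In the illustrative sketch below (g(x) = -x, unit-variance noise, steps a_k = 1/k, where the best exponent 1/2 should be attainable), the root-mean-square error over independent runs scales like k^(-1/2): multiplying the number of iterations by 100 divides the RMS error by about 10.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 400, 100000                  # 400 independent runs of length 100000
x = np.full(m, 3.0)                 # scalar iterates; the root is 0
rms = {}
for k in range(1, n + 1):
    y = -x + rng.normal(size=m)     # g(x) = -x plus martingale-difference noise
    x += y / k                      # RM iteration with a_k = 1/k
    if k in (1000, 100000):
        rms[k] = float(np.sqrt(np.mean(x ** 2)))
ratio = rms[100000] / rms[1000]     # roughly sqrt(1000/100000) = 0.1
```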
3.2. Convergence Rate: Degenerate Case
In the previous section, for obtaining the convergence rate of
stability and hence nondegeneracy of F is an essential requirement. Wenow consider what will happen if the linear term vanishes in the Taylor’sexpansion of For this we introduce the following set of conditions:
A3.2.2 A continuously differentiable function exists
such that
for any and for some withwhere is used in (2.1.1);
A3.2.3 For the observation noise on the sample path under con-sideration the following series converges:
where
A3.2.4 is measurable and locally bounded, and is differentiable at
such that as
where F is a stable matrix, and is the one used in A3.2.3.
We first note that in comparison with A3.1.1–A3.1.4, here we do notrequire (3.1.1), but A3.2.2 is the same as A3.1.2. From (3.2.3) we see that
A3.2.1 and
the Taylor’s expansion for does not contain the linear term. HereF is the coefficient for a term higher than second order in the Taylor’sexpansion of The noise condition A3.2.3 is different from A3.1.3,but, as to be shown by the following lemma, it also implies A2.2.3.
Lemma 3.2.1 If (3.2.2) holds, then and hence A2.2.3
is satisfied.
Proof. We need only to show
Setting
by a partial summation we have
Since as and converges as the first twoterms on the right-hand side of (3.2.4) tend to zero as and
The last term in (3.2.4) is dominated by
where
By the following elementary calculation we conclude that the right-hand side of (3.2.5) tends to zero as and
which tends to zero as and because as
This, combined with (3.2.4) and (3.2.5), shows that
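The partial-summation step used here is the familiar Kronecker-lemma mechanism: convergence of a series forces suitably normalized weighted sums to vanish. A scalar numerical illustration, with purely illustrative sequences b_k = k and c_k = (-1)^k / sqrt(k):

```python
import numpy as np

n = 200000
k = np.arange(1, n + 1)
c = (-1.0) ** k / np.sqrt(k)     # the series sum of c_k converges (alternating)
weighted = np.cumsum(k * c)      # partial sums of b_k * c_k with b_k = k increasing
ratio = weighted / k             # Kronecker average (1/b_n) * sum_{j<=n} b_j c_j
# ratio -> 0 even though the summands b_k c_k = (-1)^k sqrt(k) grow in magnitude
```

Although the individual summands grow, the normalized partial sums tend to zero, which is the kind of cancellation exploited in the proof above.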
By the Lyapunov equation (3.1.9), there is a positive definite matrixP > 0 such that
Assuming is large enough so that there is no truncation, by (3.2.3) wehave
where is the maximum eigenvalue of P given by (3.2.6).
We start with lemmas. Note that by Theorems 2.2.1 or 2.4.1Therefore, starting from some the algorithm has no truncation.
Define
Denote by and the maximum and minimum eigenvalues of P, respectively, and by K the condition number
Theorem 3.2.1 Assume A3.2.1–A3.2.4 hold and is given by (2.1.1)
–(2.1.3). Then for the sample paths where A3.2.3 holds the following
convergence rate takes place:
consider the case since if it is not true then is clearlybounded.
Let P be given by (3.2.6). We have
where
In what follows we will prove that
By (3.2.10) and (3.2.6) it is clear that
where the last inequality follows from the following consideration:
By (3.2.11) so for (3.2.16) it suffices to show that
By definition of we have and hence
or
Consequently,
and by the agreement
which verifies the last inequality in (3.2.16).
We now estimate By (3.2.10), (3.2.11), and the agreement we have
Noticing that, as agreed, from (3.2.17) we have
and by (3.2.13),
Again, from (3.2.10) and noticing we have
Consequently, by (3.2.12)
Combining (3.2.14), (3.2.16), (3.2.18), and (3.2.20) yields
Proof of Theorem 3.2.1. By Lemma 3.2.2 and the fact
we have
where
By setting
from (3.2.9) it follows that
This is nothing else but an RM algorithm. Since by Lemma 3.2.2is bounded, no truncation is needed and one may apply Theorem 2.2.1”.
First note that
Hence, A2.2.1 is satisfied.
as So A2.2.3 holds with replaced by
A2.2.4 is clearly satisfied, since is continuous. The key issue is tofind a satisfying A2.2.2”.
Take
and define which is closed. Notice
For Then we have
This means that
and the condition A2.2.2” holds.By Theorem 2.2.1”, This implies
which in turn implies (3.2.7) by (3.2.8).
Imposing some additional conditions on F, we may obtain results more precise than (3.2.7) by using different Lyapunov functions.
Theorem 3.2.2 Assume A3.2.1–A3.2.4 hold, in addition, assume F is
normal, i.e., Let be given by (2.1.1)–(2.1.3). Then
for those sample paths for which A3.2.3 holds, converges
to either zero or one of where denotes an eigenvalue of
More precisely,
where is a unit eigenvector of H corresponding to
Proof. Since F is stable, the integral
is well defined. Noticing that we have
This means that H is also stable. Therefore, all eigenvalues arenegative. Further, by we find
and hence
We consider (3.2.23) and take
By (3.2.26) we have
Define
Obviously,
for any
Clearly,
where is the dimension of Thus, J is a discrete set, and is nowhere dense because is
continuous. This together with (3.2.28) shows that A2.2.2’ is satisfied.
By Theorem 2.2.1’, and (3.2.25) is verified.
Corollary 3.2.1 Let Then
In this case,
and hence (3.2.7) and (3.2.25) are respectively equivalent to
and
Remark 3.2.1 For the convergence rate given by (3.1.18)for the nondegenerate case is while for the degenerate case is
by (3.2.29), which is much slower than
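The gap between the two rates is easy to observe in a noise-free scalar sketch. The functions below are illustrative: g(x) = -x is nondegenerate at the root 0, while g(x) = -x^3 has vanishing derivative there, mimicking the degenerate case.

```python
n = 100000
x_lin, x_cub = 1.0, 1.0
for k in range(1, n + 1):
    a = 1.0 / (k + 1)
    x_lin += a * (-x_lin)         # nondegenerate: the error is exactly 1/(n+1)
    x_cub += a * (-x_cub ** 3)    # degenerate: the error decays only logarithmically
```

With steps a_k = 1/(k+1) the nondegenerate iterate satisfies x_n = 1/(n+1) exactly, while the degenerate one behaves like 1/sqrt(2 log n) and is still near 0.2 after 10^5 steps.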
3.3. Asymptotic Normality
In Theorem 3.1.1 we have shown that givenby (2.1.1)–(2.1.3). As shown in Remark 3.1.2,
This is a path-wise result. Assuming the observation noise isa random sequence, we show that is asymptotically normal,
i.e., the distribution of converges to a normal distributionas This convergence implies that in the convergence rate
cannot be improved to We first consider the linear regression case, i.e., is a linear function, but may be time-varying.
Let us introduce a central limit theorem on double-indexed random
variables. We formulate it as a lemma.
Lemma 3.3.1 Let be an array of l-dimensional random
vectors. Denote
and
Assume
and
Then
where and hereafter denotes the normal distribution with meanand covariance S.
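Before turning to the linear recursion, the flavor of the result can be checked by Monte Carlo in the classical scalar case, which is standard and not specific to this book: for x_{k+1} = x_k + (1/k)(-beta * x_k + xi_k) with beta > 1/2 and i.i.d. noise of variance sigma^2, sqrt(k) * x_k is asymptotically N(0, sigma^2 / (2 beta - 1)). With beta = sigma = 1 the limit variance is 1.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 2000, 20000                 # 2000 independent runs of length 20000
beta, sigma = 1.0, 1.0             # slope at the root and noise level
x = np.zeros(m)                    # start at the root and watch the fluctuations
for k in range(1, n + 1):
    x += (1.0 / k) * (-beta * x + rng.normal(scale=sigma, size=m))
z = np.sqrt(n) * x                 # approximately N(0, sigma^2 / (2 beta - 1))
sample_var = float(z.var())
inside_one_sd = float(np.mean(np.abs(z) < 1.0))   # about 0.68 for a standard normal
```

Both the sample variance of z and the fraction of samples within one standard deviation come out close to their Gaussian values.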
Let us first consider the linear recursion (3.1.6) and derive its asymptotic normality. We keep the notation introduced by (3.1.7).
We have obtained estimate (3.1.8) for and now derive moreproperties for it.
Lemma 3.3.2 Assume and
H where H is stable. Then for any
Proof. By (3.1.8) it follows that
We will use the following elementary inequality
which follows from the fact that the function equals
zero at x = 0 and its derivative By (3.3.8), we derive
which implies
Assume is sufficiently large such that Then
where for the last inequality (3.3.9) is invoked.
Combining (3.3.7) and (3.3.10) gives (3.3.6).
Lemma 3.3.3 Set
Under conditions of Lemma 3.3.2,
uniformly with respect to and
uniformly with respect to
Proof. Expanding to the series
with we have
where by definition
By stability of H , there exist constants and p > 0 such that
Putting (3.3.13) into (3.3.12) yields that for any
where for the last inequality is assumed to be sufficiently large such that and (3.1.8) is used too.
as
as
Since and may be arbitrarily small, the conclusions of the lemma follow from (3.3.14) by Lemma 3.3.2.
Lemma 3.3.4 Assume as and
Let A, B, and Q be matrices and let A and B be stable. Then
Proof. For any T > 0 define
Since for fixed T. Denoting
by we then have Consequently,
serves as an integral sum for or equivalently, for
and hence
Therefore, for (3.3.15) it suffices to show that
Similar to (3.3.10), by stability of A we can show that there is a constant such that
By stability of A and B, constants and exist such that
Consequently, we have
which verifies (3.3.18) and completes the proof of the lemma.
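In the standard formulation the limit in Lemma 3.3.4 is the integral of e^{At} Q e^{Bt} over [0, ∞), the solution of the Lyapunov equation AX + XB = -Q when A and B are stable. The integral-sum argument of the proof can be sketched numerically in the scalar case (our own illustration, with hypothetical choices a_i = 1/i, A = B = -1, Q = 1, so the limit is the integral of e^{-2t}, namely 1/2):

```python
import math

# Step sizes a_i = 1/i; the partial sums s_i play the role of "time".
n = 5000
a = [1.0 / i for i in range(1, n + 1)]
s, acc = [], 0.0
for ai in a:
    acc += ai
    s.append(acc)

# Integral sum  sum_i a_i e^{A(s_n - s_i)} Q e^{B(s_n - s_i)}  with A = B = -1,
# Q = 1: a Riemann sum for the integral of e^{-2t} over [0, s_n].
s_n = s[-1]
riemann = sum(ai * math.exp(-2.0 * (s_n - si)) for ai, si in zip(a, s))
print(riemann)  # close to the limit 1/2
```

The sum converges to the integral because s_n diverges while the mesh widths a_i tend to zero.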
Theorem 3.3.1 Let be given by (3.1.6) with an arbitrarily given initial value. Assume the following conditions hold:
where are constant matrices with is
a martingale difference sequence of dimension satisfying the following
conditions:
and
and is stable;
and
as
and
Then is asymptotically normal:
where
Proof. Define by the following recursion
By (3.1.6) it follows that
Using (3.3.19) we have
Consequently,
where
and
by (3.3.20).
Define
By (3.3.30) and stability of A, from (3.1.8) it follows that constants and exist such that
Consequently, from (3.3.29) we have
The first term on the right-hand side of (3.3.34) tends to zero as
by (3.3.33), while the second term is estimated as follows. By (3.3.31)
where for the last equality, Lemma 3.3.2 and (3.3.33) are used. This means that r and have the same limit distribution if exists.
Consequently, for the theorem it suffices to show
Similar to (3.3.29) and (3.3.31), by (3.3.28) we have
Noticing
by Lemma 3.3.2 and (3.1.8), we find that the last term of (3.3.36) tends
to zero in probability. Therefore, for (3.3.24) it suffices to show
We now show that for (3.3.37) it is sufficient to prove
For any fixed we have
By (3.3.21) we have
where convergence to zero follows from and Lemma 3.3.2.
It is worth noting that the convergence is uniform with respect to This
By (3.3.21) and we see that
implies that the second term on the right-hand side of (3.3.39) tends to zero in probability. The first term on the right-hand side of (3.3.39) can be rewritten as
By (3.3.33) for any fixed we estimate the first term of (3.3.40) as follows
while for the second term we have
since and
We now show that the last term of (3.3.40) also converges to zero inprobability as
Notice that by (3.3.28), for any fixed and
Therefore, for a fixed there exist constants
and such that
as
Then the last term of (3.3.40) is estimated as follows:
For the first term on the right-hand side of (3.3.44) we have
where the last inequality is obtained because is bounded
by some constant by (3.3.30). Since is fixed, in order to
prove that the right-hand side of (3.3.45) tends to zero as it suffices to show
By (3.3.33), for any fixed
while for any given we may take sufficiently large such that
Therefore,
by Lemma 3.3.2.
Incorporating (3.3.47) with (3.3.48) proves (3.3.46). Therefore, the
right-hand side of (3.3.45) tends to zero as This implies
that the first term on the right-hand side of (3.3.44) tends to zero in probability. By (3.3.43), for the last term of (3.3.44) we have
which tends to zero as as can be shown by an argument similar to that used for (3.3.45).
In summary, we conclude that the right-hand side of (3.3.44) tends to zero in probability, and hence all terms in (3.3.40) tend to zero in probability. This implies that the right-hand side of (3.3.39) tends to zero in probability as and then Thus, we have shown that for (3.3.37) it suffices to show (3.3.38).
We now intend to apply Lemma 3.3.1, identifying
to in that lemma. We have to check the conditions of the lemma. Since is a martingale difference sequence, (3.3.1) is obviously satisfied.
By (3.3.22) and Lemma 3.3.2,
This verifies (3.3.3). We now verify (3.3.2). We have
where the last term tends to zero by (3.3.22) and Lemma 3.3.2. We show that the first term on the right-hand side of (3.3.49) tends to (3.3.25).
With A and respectively identified to H and in Lemma 3.3.3,
by Lemmas 3.3.2 and 3.3.3 we have
This incorporating with (3.3.49) leads to
By Lemma 3.3.4 we conclude
Finally, we have to verify (3.3.4).
By (3.3.33) we have
Noticing that uniformly with respect to
since or equivalently,
uniformly with respect to by (3.3.23) we have
Consequently, for any by Lemma 3.3.2
Thus, all conditions of Lemma 3.3.1 hold, and by this lemma we conclude (3.3.38). The proof is completed.
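The content of Theorem 3.3.1 can be illustrated numerically in the scalar case (a toy example of ours, not from the text): for the recursion x_{k+1} = x_k + (1/k)(-x_k + eps_{k+1}) with i.i.d. N(0,1) noise, the matrix H is the scalar -1, and in fact x_n collapses exactly to the sample mean of the noises, so sqrt(n) x_n is asymptotically N(0, 1):

```python
import math
import random
import statistics

random.seed(0)

def run(n):
    """One path of x_{k+1} = x_k + (1/k) * (-x_k + eps), eps ~ N(0, 1)."""
    x = 5.0                      # arbitrary initial value; killed at the first step
    for k in range(1, n):
        x += (1.0 / k) * (-x + random.gauss(0.0, 1.0))
    return x

n, reps = 1000, 1000
samples = [math.sqrt(n) * run(n) for _ in range(reps)]
m = statistics.fmean(samples)
v = statistics.variance(samples)
print(m, v)  # sample mean near 0 and sample variance near 1
```

Here the limiting covariance S reduces to the scalar 1, matching the Monte Carlo variance of the normalized estimates.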
Remark 3.3.1 Under the conditions of Theorem 3.3.1, if integers
are such that then it can be
shown that converges in distribution to
where is a stationary Gaussian Markov process satisfying
the following stochastic differential equation
where is the standard Wiener process.
Corollary 3.3.1 From (3.1.7) and (3.3.28), similar to (3.3.29)–(3.3.31) we have
and
By (3.3.33), the first term on the right-hand side of (3.3.50) tends to zero as Note that the last term in (3.3.34) has been proved to vanish as and it is just a different way of writing Therefore, from (3.3.50) by Theorem 3.3.1, it follows that for any fixed
We have discussed the asymptotic normality of for the case
where is linear. We now consider the general Let us first introduce the conditions to be used.
and
A3.3.2 A continuously differentiable function exists such that
for any and for some with
where is used in (2.1.1).
for some
where is a martingale difference sequence satisfying (3.3.21)–
(3.3.23).
A3.3.3
A3.3.4 is measurable and locally bounded. As
where with a specified in (3.3.52) is stable and
satisfying which is specified in (3.3.53).
Theorem 3.3.2 Let be given by (2.1.1)–(2.1.3) and let A3.3.1–A3.3.4 hold. Then
where
Proof. Since there exists such that
which implies From (3.3.53) it follows that
This together with the convergence theorem for martingale difference sequences yields
which implies
Since from it follows that Stability of is implied by stability of which is a part of A3.3.4. Then by Theorem 3.1.1
By (3.3.55) and (3.3.58) we have
From Theorem 3.1.1 we also know that there is an integer-valued (possibly depending on sample paths) such that and there is no truncation in (2.1.1) for Consequently, for we have
Denoting
by (3.3.59) and (3.3.54) we see a.s. Then (3.3.60) is written as
By (3.3.28) it follows that
where
Using introduced by (3.3.32), we find
By an argument similar to that used in Corollary 3.3.1, we have
and as
Then by (3.3.51) from (3.3.63) we conclude (3.3.56).
Corollary 3.3.2 Let D be an matrix and let in (2.1.1)–(2.1.2) be replaced by In other words, instead of (2.1.1) and (2.1.2), if we consider
then this is equivalent to replacing and by and respectively.
In this case the only modification to be made in the conditions of Theorem 3.3.2 is that stability of in A3.3.4 should be replaced by stability of The conclusion of Theorem 3.3.2 remains valid with the only modification that and F in (3.3.57) should be replaced by and DF, respectively.
3.4. Asymptotic Efficiency
In Corollary 3.3.2 we have mentioned that the limiting covariance matrix S(D) for depends on D, if in (2.1.1)–(2.1.3) is replaced by By efficiency we mean that S(D) reaches its minimum with respect to D.
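The answer is classical; in the standard formulation (with generic symbols of ours: $F$ the stable matrix of A3.3.4 and $S_{0}$ the limiting covariance of the noise, not necessarily the book's exact notation) the minimum is attained when $DF = -I$, i.e., $D = -F^{-1}$, and

```latex
S^{*} \;=\; F^{-1} S_{0} \bigl(F^{-1}\bigr)^{\top} \;\le\; S(D)
\qquad \text{for every admissible } D .
```

Computing the optimal D requires knowledge of F; the averaging approach developed in this section attains the same minimal covariance without that knowledge.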
Denote
In what follows we will show that is asymptotically normal and is asymptotically efficient.
We list the conditions to be used.
A3.4.1 nonincreasingly converges to zero,
and for some
A3.4.2 A continuously differentiable function exists such that
for any and for some with
where is used in (2.1.1).
A3.4.3 The observation noise is such that
with being a constant independent of and
where is specified in (3.4.7).
A3.4.4 is measurable and locally bounded. There exist a stable ma-
trix F, and such that
where is a constant.
Remark 3.4.1 It is clear that satisfies A3.4.1. From (3.4.7) it follows that
where denotes the integer part of
Since is nonincreasing, from (3.4.12) we have
which implies
or
Remark 3.4.2 If with being a martingale
difference sequence satisfying (3.3.21)–(3.3.23), then identifying to
in Lemma 3.3.1, by this lemma we have
where is given by (3.4.1). Thus, in this case the second condition in(3.4.8) holds.
We now show that the first condition in (3.4.8) holds too. By the estimate for the weighted sum of martingale difference sequences (see Appendix B) we have
which incorporating with (3.4.13) yields
It is clear that (3.4.9) is implied by (3.3.21). Therefore, in the present case all requirements in A3.4.3 are satisfied.
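The distinctive feature of slowly decreasing step sizes (e.g., a_k = k^{-alpha} with alpha in (1/2, 1), an assumed instance consistent with A3.4.1) is that the reciprocal increments 1/a_{k+1} - 1/a_k tend to zero, whereas for the classical choice a_k = 1/k they are identically 1. A quick numerical check:

```python
def recip_gap(a, k):
    """Return 1/a(k+1) - 1/a(k) for a step-size function a."""
    return 1.0 / a(k + 1) - 1.0 / a(k)

def slow(k):
    return k ** (-2.0 / 3.0)   # slowly decreasing: alpha = 2/3

def fast(k):
    return 1.0 / k             # classical choice

for k in (10, 1000, 1000000):
    print(k, recip_gap(slow, k), recip_gap(fast, k))
# the gap for the slow step size tends to 0; for a_k = 1/k it equals 1
```

This vanishing reciprocal gap is what the proofs below exploit through estimates such as (3.4.17) and (3.4.18).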
Theorem 3.4.2 Assume A3.4.1–A3.4.4 hold. Let be given by
(2.1.1)–(2.1.3) and be given by (3.4.5). Then is asymptotically efficient:
Prior to proving the theorem we establish some properties of the slowly decreasing step size.
Set
By (3.1.8) we have
where and are constants.
Set
Lemma 3.4.1 i) The following estimate takes place
where o(1) denotes a magnitude that tends to zero as
ii) is uniformly bounded with respect to both and
and
Proof. i) By (3.4.6) we know that
and
which implies (3.4.17) since as
ii) By (3.4.6) as and hence for any we have
where denotes the integer part of Using (3.4.15) we have
for any where the first term on the right-hand side tends to zero as by (3.4.20), and the last term tends to zero as Therefore, for (3.4.18) it suffices to show
Noticing that (3.4.13) implies for any we have
Lemma 3.4.2 Under Conditions A3.4.1–A3.4.4, there exists an integer-valued such that a.s., a.s., and given by (2.1.1)–(2.1.3) has no truncation for i.e.,
and a.s.
Proof. If we can show that A2.2.3 is implied by A3.4.3, then all conditions of Theorem 2.2.1 are fulfilled a.s., and the conclusions of the lemma follow from Theorem 2.2.1.
Since we have
which means that (2.2.2) is satisfied for
We now check (2.2.2) for By a partial summation we have
where (3.4.6) is used and as
By (3.4.8) the first two terms on the right-hand side of (3.4.34) tend to zero as by the same reason and by the fact the last term of (3.4.34) also tends to zero as This means that satisfies (2.2.2), and the lemma follows.
By Lemma 3.4.2 we have
and by (3.4.14)
For specified in (3.4.11) and a deterministic integer define the
stopping time as follows
From (3.4.35) we have
and
Lemma 3.4.3 If A 3.4.1-A3.4.4 hold, then
is uniformly bounded with respect to
Proof. By (3.4.11) and (3.4.15) from (3.4.39) we have
where respectively denote the terms on the right-hand side of the inequality in (3.4.40).
By (3.4.19) we see
where as From this we find that is bounded in if is large enough so that
By (3.4.19) we estimate as follows:
where is assumed to be large enough such that
Thus, by (3.4.9)
We now pay attention to (3.3.10) in the proof of Lemma 3.3.2 and find that the right-hand side of (3.4.42) is bounded with respect to
For by (3.4.19) and (3.4.10) we have
where is a constant. Again, by (3.3.10), is bounded in It remains to estimate By the Schwarz inequality we have
By (3.4.19), for large enough
which, as shown by (3.3.11), is bounded in We then by (3.4.37) have
where is a constant.
Combining (3.4.40)–(3.4.44) we find that there exists a constant such that
Setting
and
from (3.4.45) we have
where is a constant. Denoting
from (3.4.48) we find
where is set equal to 1.
From (3.4.48) and (3.4.50) it then follows that
which combined with (3.4.46) leads to
where for the last equality we have used (3.4.47). Choosing sufficiently small so that
from (3.4.51) we then have
which is bounded with respect to as shown by (3.3.10).
Lemma 3.4.4 If A3.4.1-A3.4.4 hold, then
Proof. It suffices to prove
Then the lemma follows from (3.4.53) by using the Kronecker lemma. By (3.4.11) and (3.4.37) we have
where the last inequality follows by using the Lyapunov inequality.
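The Kronecker lemma invoked above states that if b_n increases to infinity and the series of x_n / b_n converges, then (1/b_n) times the partial sum of x_1, ..., x_n tends to zero. A quick illustration of ours, with the hypothetical choices b_n = n and x_n = (-1)^n, so that the weighted series converges as an alternating series:

```python
partial = 0.0    # x_1 + ... + x_n
series = 0.0     # x_1/1 + ... + x_n/n, a convergent alternating series
N = 100000
for n in range(1, N + 1):
    x = 1.0 if n % 2 == 0 else -1.0   # x_n = (-1)^n
    partial += x
    series += x / n
avg = partial / N
print(series, avg)  # series near -log 2, while the Kronecker average is near 0
```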
By Lemma 3.4.2, a.s. and
Consequently,
where as
Noticing we have
and hence
By (3.4.16) and (3.4.57), from here we derive
By Lemma 3.4.1, is bounded. Then with the help of (3.4.58) we have
From (3.4.58) and the boundedness of there exists a constant such that
Then, we have
where the convergence to zero a.s. follows from Lemma 3.4.4. Putting (3.4.59) and (3.4.61) into (3.4.56) leads to
By (3.4.58) we then have
Notice that
Let us denote by the upper bound for where the existence of is guaranteed by Lemma 3.4.1. Then using (3.4.9) and (3.4.18) we have
This incorporating with (3.4.8) implies the conclusion of the theorem.
This theorem tells us that if in (2.1.1)-(2.1.3) we apply the slowly
decreasing step size, then the averaged estimate leads to the minimal
covariance matrix of the limit distribution.
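The scheme can be sketched in a minimal scalar form (our own illustration; the regression function f(x) = -x with root 0, the noise level, and the exponent 2/3 are assumptions): run the recursion with the slow step a_k = k^{-2/3} and output the running average of the iterates.

```python
import random

random.seed(1)

def averaged_sa(n, alpha=2.0 / 3.0):
    """SA with slow step a_k = k^{-alpha} plus averaging of the iterates."""
    x, xbar = 5.0, 0.0
    for k in range(1, n + 1):
        y = -x + random.gauss(0.0, 1.0)   # noisy observation of f(x) = -x
        x += k ** (-alpha) * y
        xbar += (x - xbar) / k            # running average (1/n) sum of x_k
    return x, xbar

x_n, xbar_n = averaged_sa(200000)
print(x_n, xbar_n)  # both near the root 0; the average has the smaller asymptotic variance
```

The raw iterate fluctuates on the scale of the slow step, while the averaged estimate attains the n^{-1/2} scale with the minimal limiting covariance.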
3.5. Notes and References
Convergence rates and asymptotic normality can be found in [28, 68, 78] for the nondegenerate case. The rate of convergence for the degenerate case was first considered by Pflug in [74]. The results presented in Section 3.2 are given in [15, 47].
For the proof of the central limit theorem (Lemma 3.3.1) we refer to [6, 56, 78], while for Remark 3.3.1 we refer to [78]. The proofs of Theorems 3.3.1 and 3.3.2 can be found in [28].
Asymptotic normality of the stochastic approximation algorithm was first considered in [44].
For asymptotic efficiency the averaging technique was introduced in [80, 83], and further considered in [35, 59, 66, 67, 74, 98]. The theorems given in Section 3.4 can be found in [13]. For adaptive stochastic approximation refer to [92, 95].
Chapter 4
OPTIMIZATION BY STOCHASTIC
APPROXIMATION
Up to now we have been concerned with finding roots of an unknown function observed with noise. In applications, however, one often faces the optimization problem, i.e., finding the minimizer or maximizer of an unknown function It is well known that achieves its maximum or minimum values at the root set of its gradient, i.e., at although possibly only in the local sense. The gradient is also written as
If the gradient can be observed with or without noise, then the optimization problem is reduced to the SA problem we have discussed in
previous chapters. Here, we are considering the optimization problem for
the case where the function itself rather than its gradient is observed
and the observations are corrupted by noise. This problem was solved by the classical Kiefer-Wolfowitz (KW) algorithm, which takes finite differences as estimates of the partial derivatives. To be precise,
let be the estimate at time for the minimizer (maximizer) of and let
be two observations on at time with noises and
respectively, where
are two vectors perturbed from the estimate by and respectively, on the component of The KW algorithm suggests taking
151
the finite difference
as the observation of the component of the gradient It is clear that
where the component of equals
The RM algorithm
with defined above is called the KW algorithm.
It is understandable that in the classical theory for convergence of the KW algorithm rather restrictive conditions are imposed not only on but also on and Besides, at each iteration observations are needed to form the finite differences, where is the dimension of In some problems may be very large; for example, in the problem of optimizing the weights of a neural network, corresponds to the number of nodes, which may be large. Therefore, it is of interest not only to weaken the conditions required for convergence of the optimizing algorithm but also to reduce the number of observations per iteration.
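The classical KW iteration just described can be sketched as follows (our own illustration; the quadratic test function, the noise level, and the gains a_k = 1/k, c_k = k^{-1/4} are assumptions). Each coordinate consumes two noisy evaluations of L, so in this two-sided form one iteration in dimension l costs 2l observations:

```python
import random

random.seed(2)

def L(x):
    """Hypothetical test function to minimize; minimizer (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

def kw_step(x, k, sigma=0.1):
    """One KW step: finite differences built from noisy evaluations of L."""
    a_k = 1.0 / k           # step size
    c_k = k ** (-0.25)      # finite-difference width
    g = []
    for i in range(len(x)):
        xp = list(x); xp[i] += c_k
        xm = list(x); xm[i] -= c_k
        yp = L(xp) + random.gauss(0.0, sigma)   # noisy observation at x + c_k e_i
        ym = L(xm) + random.gauss(0.0, sigma)   # noisy observation at x - c_k e_i
        g.append((yp - ym) / (2.0 * c_k))       # estimate of the i-th partial derivative
    return [xi - a_k * gi for xi, gi in zip(x, g)]

x = [0.0, 0.0]
for k in range(1, 5001):
    x = kw_step(x, k)
print(x)  # approaches the minimizer (1, -2)
```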
In Section 4.1 the KW algorithm with expanding truncations using randomized differences is considered. As will be shown, because finite differences are replaced by randomized differences, the number of observations is reduced from to 2 per iteration, and because expanding truncations are involved in the algorithm and the TS method is applied for convergence analysis, the conditions needed for have been weakened significantly and the conditions imposed on the noise have been improved to the weakest possible. The convergence rate and asymptotic normality for the KW algorithm with randomized differences and
expanding truncations are given in Section 4.2.
The KW algorithm, like other gradient-based optimization algorithms, may get stuck at a local minimizer (or maximizer). How to approach the global optimizer is one of the important issues in optimization theory. In particular, how to reach the global optimizer path-wise is a difficult and challenging problem. In Section 4.3 the KW algorithm is combined with a search over initial values, and it is shown that the resulting algorithm a.s. converges to the global optimizer of the unknown function
The obtained results are then applied to some practical problems in Section 4.4.
4.1. Kiefer-Wolfowitz Algorithm with Randomized Differences
There is a fairly long history of random search and approximation ideas in SA. Different random versions of the KW algorithm have been introduced: for example, in one version a sequence of random unit vectors that are independent and uniformly distributed on the unit sphere or unit cube was used; in another version the KW algorithm with random directions was introduced and was called a simultaneous perturbation stochastic approximation algorithm.
Here, we consider the expandingly truncated KW algorithm with randomized differences. The conditions needed for convergence of the proposed algorithm are considerably weaker than existing ones.
Conditions on
Let be a sequence of independent andidentically distributed (iid) random variables such that
Furthermore, let be independent of the algebra generated by
is the observation noise to be explained later. For convenience of writing, let us denote
It should be emphasized that is a vector and does not denote an inverse. At each time two observations are taken: either
or
where is the estimate for the sought-for minimizer (maximizer) of denote the observation noises, and is a real number.
and
may serve as observations of randomized differences.
To be specific, let us consider the observations defined by (4.1.3) and (4.1.4). The convergence analysis, however, can be carried out analogously for the observations (4.1.5) and (4.1.6).
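The construction can be sketched as follows (our own minimal illustration in the spirit of (4.1.3)-(4.1.8); the Bernoulli plus/minus-one perturbations, the test function, and the gains are assumptions). Both perturbed points share one random vector, so only two observations are taken per iteration regardless of the dimension; dividing by the i-th component of the perturbation is the componentwise "inverse" mentioned above, not a matrix inverse. The expanding truncations (4.1.11)-(4.1.12) are omitted for brevity:

```python
import random

random.seed(3)

def L(x):
    """Hypothetical test function to minimize; minimizer (1, -2, 0)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + x[2] ** 2

def rd_step(x, k, sigma=0.1):
    """One KW step with randomized differences: two observations, any dimension."""
    a_k = 1.0 / k
    c_k = k ** (-0.25)
    delta = [random.choice((-1.0, 1.0)) for _ in x]   # iid Bernoulli +-1 components
    yp = L([xi + c_k * d for xi, d in zip(x, delta)]) + random.gauss(0.0, sigma)
    ym = L([xi - c_k * d for xi, d in zip(x, delta)]) + random.gauss(0.0, sigma)
    # i-th gradient estimate: (y+ - y-) / (2 c_k delta_i); here 1/delta_i = delta_i
    g = [(yp - ym) / (2.0 * c_k) * d for d in delta]
    return [xi - a_k * gi for xi, gi in zip(x, g)]

x = [0.0, 0.0, 0.0]
for k in range(1, 20001):
    x = rd_step(x, k)
print(x)  # approaches the minimizer (1, -2, 0)
```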
Thus, the observations considered in the sequel are
where
We now define the KW algorithm with expanding truncations and
randomized differences. Let be a sequence of positive numbers increasingly diverging to infinity, and let be a fixed point in Given any initial value the algorithm is defined by:
where is given by (4.1.9) and (4.1.10).
It is worth noting that the algorithm (4.1.9)–(4.1.12) differs from (2.1.1)–(2.1.3) only in the observations As a matter of fact, (4.1.11) and (4.1.12) are exactly the same as (2.1.1) and (2.1.2), but (4.1.9) and
Remark 4.1.2 If is the unique minimizer of then in (4.1.11) and (4.1.12) should be replaced by
Theorem 4.1.1 Assume A4.1.1, A4.1.2, and Conditions on hold.
Let be given by (4.1.9)–(4.1.12) (or (4.1.11)–(4.1.14)) with any initial value. Then
if and only if for each the random noise given by (4.1.10) can be
decomposed into the sum of two terms in ways such that
with
and
where is given in Conditions on
Proof. We will apply Theorem 2.2.1 for sufficiency and Theorem 2.4.1 for necessity.
Let us first check Conditions A2.2.1–A2.2.4. Condition A2.2.1 is a part of A4.1.1. Condition A2.2.2 is automatically satisfied if we take noticing that in the present case. Condition A2.2.4 is contained in A4.1.2. So, the key issue is to verify that given by (4.1.14) satisfies the requirements.
Let and be vector functions obtained from with some of its components replaced by zero:
It is clear that
and
For notational convenience, let denote a generic random vector such that
where is specified in (4.1.1), and may vary for different applications.
We express given by (4.1.14) in an appropriate form to be dealt with. We mainly use the local Lipschitz-continuity to treat the structural error (4.1.15) in
Rewrite the component of the structural error as follows
and for any express
where on the right-hand side of the equality all terms are cancelled except the first and the last terms, and in each difference of L, the arguments of L differ from each other only by one
We write (4.1.25) in the compact form:
Applying Taylor's expansion to (4.1.26) we derive
where
Similarly, we have
and
where
Define the following vectors:
Finally, putting (4.1.27)–(4.1.35) into (4.1.14) we obtain the following expression for
It is worth noting that each component of and is a martingale difference sequence, because both and are independent of
For the sufficiency part we have to show that (2.2.2) is satisfied a.s. Let us show that (2.2.2) is satisfied by all components of and
For components of we have for any
since by (4.1.1), and as Therefore, for any integer N
for any such that converges.
Thus, all sample paths of components of satisfy (2.2.2). Completely the same situation takes place for the components of
and
By the convergence theorem for martingale difference sequences, we find that for any integer N
This is because is independent of and is bounded by a constant uniformly with respect to by the Lipschitz-continuity of Then the martingale convergence theorem applies, since for some by A4.1.1.
A similar argument can be applied to the components of Since for any integer N (4.1.38) holds outside an exceptional set with probability zero, there is an with such that for any
and
for all and N = 1,2, ….
Therefore, for all and any integer N
where is given by (1.3.2).
From (4.1.17) and (4.1.18) it follows that there exists such that and for each
and hence
Combining (4.1.41) and (4.1.42), we find for each
This means that for the algorithm (4.1.11)–(4.1.14), Condition A2.2.3 is satisfied on Thus by Theorem 2.2.1, on This proves the sufficiency part of the theorem.
Under the assumption a.s., it is clear that both and converge to zero a.s., and (4.1.39) and (4.1.40) become
and
Then the necessity part of the theorem follows from Theorem 2.4.1. We show this. By Theorem 2.4.1, can be decomposed into two parts
and such that and Let us
denote by the component of a vector Define
Then for
and
From (4.1.43) and (4.1.36) it follows that
This together with (4.1.44) and (4.1.45) proves the necessity part of the theorem.
Theorem 2.4.1 gives a necessary and sufficient condition on the observation noise in order that the KW algorithm with expanding truncations and randomized differences converge to the unique maximizer of a function L. We now give some simple sufficient conditions on
Theorem 4.1.2 Assume A4.1.1 and A4.1.2 hold. Further, assume that
is independent of
and satisfies one of the following two conditions:
i) where is a random variable;
ii) Then
where is given by (4.1.9)–(4.1.12).
Proof. It suffices to prove (4.1.16)–(4.1.18). Assume i) holds. Let be given by
By definition, is independent of and so
and
A2.2.3 is satisfied as shown in Theorems 4.1.1 and 4.1.2. Then the conclusion of the theorem follows from Theorem 2.2.2.
Remark 4.1.3 In the multi-extreme case, the necessary conditions on
for convergence can also be obtained by analogy with Theorem 2.4.2.
Remark 4.1.4 Conditions i) and ii) used in Theorem 4.1.2 are indeed simple. However, in Theorem 4.1.2 is required to be independent of This may not be satisfied if the observation noise is state-dependent. Taking into account that is the observation noise when observing at and we see that depends on and if the observation noise is state-dependent. In this case, does depend on This violates the independence assumption made in Theorem 4.1.2.
Consider the case where the observation noise may depend on the locations of measurement, i.e., in lieu of (4.1.3) and (4.1.4) consider
Introduce the following condition.
A4.1.3 Both and are measurable functions
and are martingale difference sequences for any and
for p specified in A4.1.1 with
where is a family of nondecreasing independent of both
and
Theorem 4.1.4 Let be given by (4.1.9)–(4.1.12) with a given initial value Assume A4.1.1, A4.1.2', and A4.1.3 hold. Then
where is a connected subset of
Proof. Introduce the generated by and i.e.,
It is clear that is measurable with respect to and hence are Both and are
Approximating and by simple functions, it is seen that
Therefore, and aremartingale difference sequences, and
where
Hence, is a martingale difference sequence with
Noticing is bounded and as by (4.1.50) and
(4.1.51) and the convergence theorem for martingale difference sequences, we have, for any integer N > 0
This together with (4.1.37) with replaced by (4.1.39), and (4.1.40) verifies that expressed by (4.1.36) satisfies A2.2.3. Then the conclusion of the theorem follows from Theorem 2.2.2.
Remark 4.1.5 If J consists of a singleton then Theorems 4.1.3 and 4.1.4 ensure a.s. If J is composed of isolated points, then
theorems ensure that converges to some point in J . However, the
limit is not guaranteed to be a global minimizer of Depending on the initial value, may converge to a local minimizer. We will return to this issue in Section 4.3.
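Since most of the displayed formulas are not reproduced here, the following Python sketch is only a hedged illustration of the kind of algorithm treated in this section: a KW iteration with two-sided randomized differences and expanding truncations. The test function, the step sizes, the differencing magnitudes, and the truncation bounds below are all illustrative assumptions, not the text's exact (4.1.9)–(4.1.12).

```python
import numpy as np

def kw_randomized_differences(L, x0, n_iter=2000, a=1.0, c=1.0,
                              delta=0.25, noise=0.0, seed=0):
    """Hedged sketch of a Kiefer-Wolfowitz iteration with two-sided
    randomized differences and expanding truncations.  Every tuning
    choice here is an illustrative assumption, not the text's exact
    (4.1.9)-(4.1.12)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    sigma = 0  # number of truncations performed so far
    for n in range(1, n_iter + 1):
        a_n = a / n               # step sizes with divergent sum
        c_n = c / n ** delta      # differencing magnitudes, c_n -> 0
        # Rademacher perturbation direction (the "randomized difference")
        Delta = rng.choice([-1.0, 1.0], size=x.shape)
        y_plus = L(x + c_n * Delta) + noise * rng.standard_normal()
        y_minus = L(x - c_n * Delta) + noise * rng.standard_normal()
        # two-sided randomized-difference estimate of the gradient
        g = (y_plus - y_minus) / (2.0 * c_n) / Delta
        x_next = x - a_n * g
        if np.linalg.norm(x_next) > 10.0 * (sigma + 1):
            # expanding truncation: reset to the initial value and
            # enlarge the truncation bound
            x_next = np.asarray(x0, dtype=float)
            sigma += 1
        x = x_next
    return x
```

For the quadratic L(x) = ||x||² the estimate g is unbiased for the true gradient 2x and the iterates approach the minimizer, with only finitely many truncations occurring in this toy run.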
4.2. Asymptotic Properties of KW Algorithm
We now present results on the convergence rate and asymptotic normality
of the KW algorithm with randomized differences.
Theorem 4.2.1 Assume hypotheses of Theorem 4.1.2 or Theorem 4.1.4
with and that
for some and as
where is stable and and are specified in (4.2.1) and (4.2.2),
respectively.
Then given by (4.1.9)–(4.1.12) satisfies
Proof. First of all, under the conditions of Theorem 4.1.2 or 4.1.4, By Theorem 3.1.1 it suffices to show that given by (4.1.36) can be represented as
where
From (4.1.28) and (4.1.31), by the local Lipschitz continuity of it follows that
by (4.2.2). Since it follows that
Since and given by (4.1.27) and (4.1.32) are uniformly bounded for for each
where converges. By the convergence theorem for martingale difference sequences it follows that
where and are given by (4.1.35).
In the proof of Theorem 4.1.2, replacing by and using (4.2.2),
the same argument leads to
Then by defining
we have shown (4.2.4) under the hypotheses of Theorem 4.1.2.
Under the hypotheses of Theorem 4.1.4 we have the same conclusions
about and as before. We need only to show (4.2.5). But
this follows from (4.1.52) with replaced by and the convergence
Remark 4.2.1 Let be given by (4.1.9)–(4.1.12). If and
with then conditions (4.2.1) and (4.2.2) are satisfied.
Theorem 4.2.2 Assume A4.1.1 and A4.1.2 hold and that i) and for some
ii) for some c > 0 and
iii) is stable and for some
iv) given by (4.1.10) is an MA process:
for
and
where are real numbers and is a martingale
difference sequence which is independent of and satisfies
Then
where and
Proof. Since it follows that and
By assumption is independent of and hence is independent of Then by (4.2.11) and the convergence theorem for martingale difference sequences we obtain (4.2.5). By Theorem 4.2.1 we have as
and after a finite number of iterations of (4.1.11), say, for there are no more truncations.
Since and is stable, it follows that
Let be given by
By (4.1.11), (4.1.13), (4.1.36), and condition ii) it follows that for
Let be given by
where
Since is stable, by (3.1.8) it follows that there are constants
and such that
Noticing where because by condition iii), we have
where respectively denote the five terms on the right-hand side of the first equality of (4.2.19).
By (4.2.18),
By Lemma 3.3.2, because and By (4.1.28) and (4.1.3) it follows that and hence
by i) and (4.2.18)
where is a constant.
By Lemma 3.3.2 and the right-hand side of (4.2.20) tends to zero a.s. as
To estimate let us consider the following linear recursion
By (4.2.17) it follows that
By (4.2.11), Since and
Then by the convergence theorem for martingale difference sequences it follows that
i.e.,
Similarly,
Applying Lemma 3.1.1, we find that From (4.2.22),
it follows that
Since is an MA process driven by a martingale difference sequence
satisfying (4.2.6),
By an argument similar to that used for (4.2.21) and (4.2.22), from
Lemma 3.1.1 it follows that
Therefore, putting all these convergence results into (4.2.19) yields
By (3.3.37),
where is given by (4.2.10). By (4.2.18), from (4.2.23) and (4.2.24)
it follows that which together with the definition
(4.2.14) for proves the theorem.
Example 4.2.1 The following example of and satisfies Conditions i) and iii) of Theorem 4.2.2:
In this example, and
Remark 4.2.2 Results in Sections 4.1 and 4.2 are proved for the case where the two-sided randomized differences
are used, where and are given by (4.1.3) and (4.1.4), respectively.
But all results presented in Sections 4.1 and 4.2 are also valid for the case where the one-sided randomized differences
are used, where and are given by (4.1.3) and (4.1.6), respectively.
In this case, in (4.1.27), (4.1.28) and in the expression of should
be replaced by 1, and (4.1.29)–(4.1.32) disappear. Accordingly, (4.1.36)
changes to
Theorems 4.1.1-4.1.4 and 4.2.1 remain unchanged. The conclusion of
Theorem 4.2.2 remains valid too, if in Condition iv)
changes to
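As a hedged numerical illustration of this remark (the quadratic test function, the Rademacher perturbation, and all tuning constants below are assumptions for the sketch, not the text's exact (4.1.3)–(4.1.6)), the two-sided and one-sided randomized-difference gradient estimates can be compared as follows:

```python
import numpy as np

def grad_estimates(L, x, c, rng):
    """One draw of the two-sided and one-sided randomized-difference
    gradient estimates at x, using a Rademacher perturbation Delta."""
    Delta = rng.choice([-1.0, 1.0], size=x.shape)
    y_plus = L(x + c * Delta)      # observation at the perturbed point
    y_minus = L(x - c * Delta)     # used only by the two-sided version
    y_center = L(x)                # used only by the one-sided version
    two_sided = (y_plus - y_minus) / (2.0 * c) / Delta
    one_sided = (y_plus - y_center) / c / Delta
    return two_sided, one_sided

rng = np.random.default_rng(1)
L = lambda v: float(v @ v)         # true gradient is 2v
x = np.array([1.0, -2.0, 0.5])
draws = [grad_estimates(L, x, 0.01, rng) for _ in range(20000)]
avg_two = np.mean([d[0] for d in draws], axis=0)
avg_one = np.mean([d[1] for d in draws], axis=0)
```

Both averages approach the true gradient 2x; the one-sided variant needs one fewer fresh observation per iteration, typically at the price of a larger variance.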
4.3. Global Optimization
As pointed out at the beginning of the chapter, the KW algorithm may lead to a local minimizer of Before the 1980s, random search or its combination with a local search method was the main stochastic approach to achieving the global minimum when the values of L can be observed exactly, without noise. When the structural properties of L are used for local search, a rather rapid convergence rate can be derived, but it is hard to escape a local attraction domain. Random search has a chance to fall into any attraction domain, but its convergence rate
decreases exponentially as the dimension of the problem increases.
Simulated annealing is an attractive method for global optimization,
but it provides only convergence in probability rather than path-wise
convergence. Moreover, simulation shows that for functions with a few
local minima, simulated annealing is not efficient. This motivates one to combine the KW-type method with random search. However, a simple combination of SA and random search does not work: in order to reach
the global minimum one has to reduce the noise effect as time goes on.
A hybrid algorithm composed of a search method and the KW algorithm is presented in the sequel, with the main effort devoted to designing easily realizable switching rules and to providing an effective noise-reducing method.
We define a global optimization algorithm, which consists of three parts: search, selection, and optimization. To be specific, let us discuss the global minimization problem. In the search part, we choose an initial value and carry out the local search by the KW algorithm with randomized differences and expanding truncations described in Section 4.1 to approach the bottom of the local attraction domain. At the same time, the average of the observations of L serves as an estimate of the local minimum of L in this attraction domain. In the selection part, the estimates obtained for the local minima of L are compared with each other, and the smallest among them, together with the corresponding minimizer given by the KW algorithm, is selected. Then the optimization part takes place, where the local search is carried out again, i.e., the KW algorithm without any truncations is applied to improve the estimate for the minimizer. At the same time, the corresponding minimum of L is reestimated by averaging the noisy observations. After this, the algorithm returns to the search part.
For the local search, we use observations (4.1.3) and (4.1.4), or (4.1.5) and (4.1.6). To be specific, let us use (4.1.5) and (4.1.6).
In the sequel, by the KW algorithm with expanding truncations we mean the algorithm defined by (4.1.11) and (4.1.12) with
where and are given by (4.1.5) and (4.1.6), respectively. Similar to (4.1.9) and (4.1.10) we have
where
By KW algorithm we mean
with defined by (4.3.2). It is worth noting that unlike (4.1.8), is used in (4.3.1).
Roughly speaking, this is because in the neighborhood of a minimizer of is increasing, and in (4.1.11) should be an observation on
In order to define the switching rules, we introduce integer-valued increasing functions and such that and
Define
In the sequel, by the search period we mean the part of the algorithm starting from the test for selecting the initial value up to the next selection of the initial value. At the end of the search period, we are given and being the estimates for the global minimizer and the minimum of L, respectively. Variables such as
and etc. in the search period are equipped with the superscript etc.
The global optimization algorithm is defined by the following five steps.
(GO1) Starting from at the search period, the initial value
is chosen according to a given rule (deterministic or random),
and then is calculated by the KW algorithm with expanding
truncations (4.1.11) and (4.1.12) with defined by (4.3.1), for which , step sizes and and used for truncation are defined as follows:
where c > 0 and are fixed constants, and and are two sequences of positive real numbers increasingly diverging to infinity.
(GO2) Set the initial estimate for and update the estimate for by
where is the noise when observing
After steps, is obtained.
(GO3) Let be a given sequence of real numbers such that and as Set For if
as
e.g.,
then set Otherwise, keep unchanged.
(GO4) Improve to by the KW algorithm with expanding
truncations (4.1.11) and (4.1.12) with defined by (4.3.1), for which
where in (4.1.11) and (4.1.12) may be an arbitrary sequence of numbers increasingly diverging to infinity, and
At the same time, update the estimate for by
where is the noise when observing At the end of this step, and are derived.
(GO5) Go back to (GO1) for the search period.
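Because the displayed formulas for (GO1)–(GO5) are not reproduced above, the following Python sketch is only a hedged, much-simplified rendering of the search/selection cycle. The double-well test function, the placeholder derivative, the step-size rule 1/(i + k), and the omission of the refinement step (GO4) are all assumptions made for the illustration.

```python
import numpy as np

def global_opt(L, initial_points, n_search=400, noise=0.2, seed=0):
    """Hedged, simplified sketch of the search/selection cycle
    (GO1)-(GO3); the refinement step (GO4) is omitted, and the
    step-size rule 1/(i + k) is a stand-in for (4.3.7)-(4.3.8),
    showing how the search-period index i in the denominator
    damps the observation noise from period to period."""
    rng = np.random.default_rng(seed)
    # placeholder derivative; the text would use randomized differences
    dL = lambda x: (L(x + 1e-4) - L(x - 1e-4)) / 2e-4
    x_best, L_best = None, np.inf
    for i, x in enumerate(initial_points, start=1):   # (GO1): new period
        L_avg = 0.0
        for k in range(1, n_search + 1):
            a_k = 1.0 / (i + k)       # period index i in the denominator
            y = L(x) + noise * rng.standard_normal()  # noisy value of L
            L_avg += (y - L_avg) / k                  # (GO2): averaging
            x = x - a_k * (dL(x) + noise * rng.standard_normal())
        if L_avg < L_best:            # (GO3): keep the smallest estimate
            x_best, L_best = x, L_avg
    return x_best, L_best

# Tilted double-well: global minimum near x = 2, local minimum near x = -0.9
L = lambda x: 0.25 * (x - 2.0) ** 2 * (x + 1.0) ** 2 - 0.5 * (x + 1.0)
x_star, _ = global_opt(L, initial_points=[-0.5, 0.5, 2.5])
```

Starting the search periods from several initial values lets the selection step pick the attraction domain of the global minimizer near x = 2 rather than the shallow local one near x = -0.9.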
We note that for the search period is added to and (see (4.3.7) and (4.3.8)). The purpose of this is to diminish the effect of the observation noise as increases. Therefore, and both tend to zero, not only as but also as The following example shows that adding an increasing to the denominators of and is necessary.
Example 4.3.1 Let
It is clear that the global minimizer is and are two local minima. Furthermore, and are attraction domains for –1 and +1, respectively.
Since is linear, for the local search we apply the ordinary KW algorithm without truncation
Here, no randomized differences are introduced, because this is a one-dimensional problem.
Assume
where
and and are mutually independent and both are sequences of iid random variables with
Let us start from (GO1) and take
(not tending to infinity),
If then, by noticing one of and must belong to Elementary calculation shows that
Paying attention to (4.3.13), we see
and
i.e.,
A4.3.2
A4.3.3 For any convergent subsequence of
where denotes given by (4.3.3) with replaced by
denotes used for the search period, and
A4.3.4 For any convergent subsequence
where is given by (1.3.2).
It is worth emphasizing that each in the sequence
is used only once when we form and
We now give sufficient conditions for A4.3.2, A4.3.3, and A4.3.4. For
this, we first need to define generated by the estimates and derived up to the current time. Precisely, for running in the search
period of Step (GO1) define
and for running in Step (GO4) define
Remark 4.3.1 If both sequences
and are martingale difference sequences with
and if
for some then A4.3.2 holds.
This is because
is a martingale difference sequence with bounded second conditional moment, and hence
which implies (4.3.15).
By using the second parts of conditions (4.3.22) and (4.3.23), (4.3.16)
can be verified in a similar way.
Remark 4.3.2 If and is independent of
and if there exists
such that then by the uncorrelatedness of
with for or
where
Assume, further, for fixed
Lemma 4.3.1 Assume L( J ) is nowhere dense, where
Let be a nonempty interval such that If there are
two sequences and such that
and is bounded, then it is impossible to have
where
Proof. Without loss of generality we may assume converges as
otherwise, it suffices to select a subsequence.
Assume the converse, i.e., that (4.3.28) holds. Along the lines of the proof
for Theorem 2.2.1 we can show that
for some constant M if is sufficiently large. As a matter of fact, this is
an analogue of (2.2.3). From (4.3.29) the following analogue of (2.2.15)
takes place:
and the algorithm for has no truncation for if is large enough, where is a constant. Similar to
(2.2.27), we then have
and
and
for some small T > 0 and all sufficiently large
From this, by (4.3.27) and convergence of it follows that
By continuity of and (4.3.30) we have
which implies that for small enough T.
Then by definition,
which contradicts (4.3.32). The obtained contradiction shows the impossibility of (4.3.28).
Introduce
such that
and
Lemma 4.3.2 Let be given by (GO1). Assume
A4.3.1 and A4.3.3 hold and for some Then
for any may occur infinitely often with
probability 0, i.e.,
Proof. Since L( J ) is nowhere dense, for any belonging to infinitely
many of there are subsequences such that
and
where and
By assumption as must be bounded.
Hence, is bounded. Without loss of generality we may assume
that is convergent.
Notice that at Step (GO1), is calculated according to (4.1.11)and (4.1.12) with given by (4.3.2) and (4.3.3), i.e.,
which differ from (4.1.11), (4.1.12), (4.3.2), and (4.3.3) by the superscript (i), which means the calculation is carried out in the search period.
By (4.1.27) with notations (4.1.33) and (4.1.34), equipped with the superscript we have
where
If we can show that and
where
then by Lemma 4.3.1, (4.3.42) contradicts the fact that all sequences
cross the interval which is disjoint with L ( J ) .
This then proves (4.3.36).
We now show for all sufficiently large if T is small enough.
Since and are finite, where
We now show that on the
if is sufficiently large and T is small enough.
Suppose the converse: for any fixed T > 0, there always exists
whatever large is taken such that
Since by continuity of there is a constant
q > 0 such that
For any let us estimate By
and the local Lipschitz continuity of it is seen that
is uniformly bounded with respect to and all Then by A4.3.3, it follows that there is a constant such that
From this it follows that there is no truncation for and
Let T be so small that
On the other hand, however, we have and The obtained contradiction shows for all sufficiently
large if T is small enough.
We now prove (4.3.42). Let us order in the following way
From (4.1.34) and the fact that is an iid sequence and is independent of the sums appearing in (4.1.34), it is easy to see that is a martingale difference sequence.
By the condition for some it is clear that for with being a constant. Then we have
By (4.1.28) and (4.3.8), we have
where is a constant. Noticing that for large and small T, by (4.3.44), (4.3.45), and A4.3.3 we may assume sufficiently large and T small enough such that
This will imply (4.3.42) if we can show
We prove (4.3.47) by induction.
We have by definition of Assume that
and by the convergence theorem for martingale difference sequences
and Then there is no truncation at time since by (4.3.46) (with chosen such that
if T in (4.3.46) is sufficiently small.
Then by (4.3.40), we have
and by (4.3.43) and (4.3.46)
for small T. This completes the induction, and (4.3.42) is proved, which, in turn, concludes the lemma.
Lemma 4.3.3 Assume A4.3.1–A4.3.3 hold. Further, assume that
for some and If there
exists a subsequence such that then
Proof. For any by Lemma 4.3.2 there exists such that for
any if By (GO2),
we have
Then by A4.3.2, there exists such that, for any
This implies the conclusion of the lemma by the arbitrariness of
Lemma 4.3.4 Assume A4.3.1–A4.3.3 hold, for
some and If subsequence is such that
then
where denotes the closure of L( J ) , and and are
given by (GO1) and (GO2) for the search period.
Proof. Since by A4.3.1, for (4.3.50) it is
seen that contains a bounded infinite subsequence, and hence, a
convergent subsequence (for simplicity of notation, assume
such that
Since there exists a such that
and hence
Define
It is worth noting that for any T > 0, is well defined for all
sufficiently large because and hence
We now show that
By the same argument as that used before, without loss of generality we may assume is convergent (otherwise, a convergent subsequence should be extracted) and thus
We have to show
as
By the same argument as that used for deriving (2.2.27), it follows that there is such that
which implies the correctness of (4.3.53).
From (4.3.53) it follows that
because, otherwise, we would have a subsequence with
such that and by (4.3.54)
for large However, by (2.2.15), so for small enough T > 0, (4.3.56) is impossible. This verifies (4.3.55).
We now show
Assume the converse, i.e.,
From (4.3.54) and (4.3.58) it is seen that for all sufficiently large the sequence
contains at least one crossing the interval with In other words, we are dealing with a sample path on which both (4.3.54) and (4.3.58) are satisfied. Thus, belongs to By Lemma 4.3.2, the set composed of such has probability zero. This verifies (4.3.57).
From (4.3.57) it follows that
for all sufficiently large
Notice that from the following elementary inequalities
by (4.3.5) it follows that
By definition of we write
By (4.3.59) and (4.3.61), noticing we have
because
By (4.3.55) and (4.3.61) we have
Since by (4.3.15), combining (4.3.62)–(4.3.64)
leads to
which completes the proof of the lemma.
Lemma 4.3.5 Let be given by (GO1)–(GO5). Assume that A4.3.1–
A4.3.4 hold, initial values selected in (GO1) are dense in an open
set U containing the set of global minima of for some and Then for any
Proof. Among the first search periods denote by the number of those search periods for which are reset to be i.e.,
Since L( J ) is not dense in any interval, there exists an interval such that So, for the lemma it suffices to prove
that cannot cross infinitely many times a.s.
If then after a finite number of steps, is generated by (GO4). By Lemma 4.3.1 the assertion of the lemma follows immediately. Therefore, we need only consider the case where
Denote by the search period for which a resetting happens, i.e., It is clear that by
In the case by (GO4) the algorithm generates a family
of consecutive sequences:
Let us denote the sequence by
and the corresponding sequence of the values of by
Let be sufficiently small such that
and which is possible because L( J ) is nowhere dense.
Since is dense in U, visits infinitely often. Assume
By Lemma 4.3.2
if is large enough.
Define
This means that the first resetting in or after the search period occurs in the search period.
We now show that there is a large enough such that the
following requirements are simultaneously satisfied:
i) implies
where is fixed;
ii) does not cross the intervals
and
iii)
iv)
v)
We first show ii)–v).
Since all three intervals indicated in ii) have an empty intersection with L( J ), by Lemma 4.3.1, ii) is true if S is large enough. It is clear
that iii) and iv) are correct for fixed and if is large enough, while v) is true because
For i) we first show that there are infinitely many for which
By (4.3.68) and (4.3.71) we have
Consider two cases.
1) There is no resetting in the search period. Then
and by (4.3.72) and (4.3.74) it follows that
By (4.3.70) and the definition of there exists at least one integer among such that
because, otherwise, we would have which contradicts (4.3.74).
By ii) we conclude that
and by (4.3.68) we also have (4.3.76).
From (4.3.76), by ii) does not cross for
Consequently,
This together with (4.3.70) implies that
and, in particular,
2) If there is a resetting in the search period, then
By (GO3) we then have
Noticing as we conclude that there are infinitely many for which (4.3.73) holds.
We now show that there is a such that
where lim sup is taken along those for which (4.3.73) holds.
Assume the converse: there is a subsequence of such that
Then by Lemma 4.3.4,
which contradicts (4.3.73). This proves (4.3.78), and also i). As a matter of fact, we have proved more than i): precisely, we have shown that there are infinitely many for which (4.3.73) holds, and for (4.3.73) implies the following inequality:
Let us denote by the totality of those for which (4.3.73) holds and What we have just proved is that contains infinitely
many if Consider a sequence By ii) it cannot cross the interval
This means that
Then by (4.3.70)
and by (GO3)
since is a search period with resetting.
Thus, we have shown that if then also belongs to Therefore, and
From here and (4.3.67) it follows that
Since may cross the interval only a finite number of times by Lemma 4.3.1. This completes the proof of the lemma.
Proof of Theorem 4.3.1.
By Lemma 4.3.5 the limit exists. By the arbitrariness of from (4.3.69) it follows that
By continuity of we conclude that
4.4. Asymptotic Behavior of the Global Optimization Algorithm
In the last section a global optimization algorithm combining the KW algorithm with a search method was proposed, and it was proved that the algorithm converges to the set of global minimizers, i.e., However, in the algorithm defined by (GO1)–(GO5), resettings are involved. The convergence by no means excludes the algorithm from resettings asymptotically. In other words, although it may still happen that
where is defined in Lemma 4.3.5, i.e., it may still be possible to have infinitely many resettings.
In what follows we will give conditions under which
In this case, the global optimization algorithm (GO1)–(GO5) asymptotically behaves like a KW algorithm with expanding truncations and randomized differences, because for large is purely generated by (GO4) without resetting.
A4.4.1 is a singleton, is twice continuously differentiable
in the ball centered at with radius for some and
of is positive definite.
A4.4.2 and ordered as in (4.3.20), (4.3.21), and Remark 4.3.1 are martingale difference sequences with
A4.4.3 is independent of
for and
and
for
We recall that is the observation noise in the search period.
A4.4.4 is independent of and where
denotes the observation noise when is calculated
in (GO4).
Lemma 4.4.1 Assume A4.4.2 holds and, in addition,
Then there exists an (maybe depending on ) such that for any
and
and
Proof. Notice that by A4.4.2 is a martingale
difference sequence with bounded conditional variance. By the convergence theorem for martingale difference sequences
gence theorem for martingale difference sequences
which implies (4.4.2).
Estimate (4.4.3) can be proved in a similar way.
Lemma 4.4.2 Assume A4.4.3 and A4.4.4 hold. If for some then
and
for where and are given in (4.1.34), where the superscript denotes the corresponding values in the ith search period.
Proof. Let us prove
Note that
is a martingale difference sequence with bounded conditional second moment. So, by the convergence theorem for martingale difference sequences, for (4.4.6) it suffices to show
By assumption of the lemma or and
for large The last inequality yields
and hence
Therefore,
Thus, (4.4.6) is correct. As noted in the proof of Lemma 4.3.2, is a martingale difference sequence. So, (4.4.4) is true.
Similarly, (4.4.5) is also verified by using the convergence theorem for
martingale difference sequences.
Lemma 4.4.3 In addition to the conditions of Theorem 4.3.1, suppose
that A4.4.1 and A4.4.3 hold, is positive definite, and
for some Then there exists a sufficiently large such
that, for if the inequality
holds for some with then the following inequality holds
Proof. By A4.4.1 and Taylor's expansion, we have
i.e.,
where
Therefore, for any there is a such that for any
and
where and denote the minimum and maximum eigenvalues of H, respectively, and o(1) is the one given in (4.4.10).
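Since the displayed expansion is not reproduced above, the following is a hedged reconstruction of the standard second-order Taylor argument it relies on; the symbols $x^0$ for the unique minimizer and $H$ for the Hessian there are assumptions based on the surrounding text.

```latex
% Around the minimizer x^0 the gradient vanishes, so that
L(x) - L(x^0) = \tfrac{1}{2}\,(x - x^0)^{\top} H\,(x - x^0) + o(\|x - x^0\|^{2}),
% hence for every \varepsilon > 0 there is a \delta > 0 with
\tfrac{1}{2}\bigl(\lambda_{\min}(H) - \varepsilon\bigr)\,\|x - x^0\|^{2}
  \le L(x) - L(x^0)
  \le \tfrac{1}{2}\bigl(\lambda_{\max}(H) + \varepsilon\bigr)\,\|x - x^0\|^{2}
% whenever \|x - x^0\| \le \delta.
```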
Since is the unique minimizer of and is continuous, there
is such that if We always assume that is large enough such that
and
where is used in (GO1). From (4.4.8) it then follows that and there is no truncation at time
Denote
For satisfying (4.4.8) and we have
where is given by (4.3.41).
By (4.4.11) it then follows that
where is given by (4.1.33) with the superscript denoting the search period and
By (4.4.14) it is clear that
Let
For (4.4.9) it suffices to show that
Assume the converse:
Let
By (4.4.20), for all
and hence,
Thus, (4.4.12)-(4.4.14) are applicable.
By (4.4.17) and the second inequality of (4.4.13), we have for
which combined with (4.4.21) yields
Applying the first inequality of (4.4.13) and then (4.4.20) leads to
Since for there is no truncation for
Using (4.4.18) we have
where
We now show that is negative for all sufficiently large
Let us consider the terms in By assumption,
from (4.4.19) and (4.4.22) it follows that
We now estimate the second term on the right-hand side of (4.4.25) after multiplying it by
From (4.4.4) and (4.4.16) it follows that
uniformly with respect to and with
Noticing that with being a constant,
and that which implies we find
Then, noticing that is bounded by some constant we have
For the third term on the right-hand side of (4.4.25), multiplying it by we have
where is a constant.
Finally, for the last term of (4.4.25) we have the following estimate
Combining (4.4.26)–(4.4.30) we find that
where
and for large
Consequently, from (4.4.25) it follows that
We now show that
by induction.
Assume it holds for i.e.,
which has been verified for We have to show it is true for
By (4.4.18) we have
and
where
Comparing (4.4.35) with (4.4.25), we find that in lieu of and
we now have and respectively. But for both cases we use the same estimate (4.4.27). Therefore, by exactly the same argument as (4.4.26)–(4.4.30), we can prove that
and for large
Thus, we have proved (4.4.32).
By the elementary inequality
for which is derived from
for any matrices A and B of compatible dimensions, we derive
from (4.4.32)
As mentioned before, for and there is no truncation. Then by (4.4.18)
where
Then from (4.4.36) and (4.4.27) it follows that
where
which tends to zero as by (4.4.27) and (4.4.38).
Then
where for the last equality (4.4.10) is used.Finally, by (4.4.21), for large from (4.4.39) it follows that
which together with (4.4.10) yields
This contradicts (4.4.20), the definition of The contradiction shows
Theorem 4.4.1 Assume that A4.3.1, A4.4.1–A4.4.4 hold, and
is positive definite for some
Further, assume that
and for some constants
Then the number of resettings is finite, i.e.,
where is the number of resettings among the first search periods
(GO1), and is given in (GO3).
Proof. If (4.4.44) were not true, then there would be an S with positive
probability such that, for any there exists a subsequence such that at the search period a resetting occurs, i.e.,
Notice that
by(4.4.41) and and
by (4.4.41) and (4.4.42). Hence, the conditions of Lemma 4.4.1 are satisfied. Without loss of generality, we may assume that (4.4.2)–(4.4.5) and the conclusion of Theorem 4.3.1 hold. From now on assume that
is fixed.
It is clear that, for any constant
if is large enough, since forLet
Rewrite (4.4.46) as
Define
and
Noticing that there is no resetting between and and (4.4.47)
corresponds to (4.4.8), by the same argument as that used in the proof of Lemma 4.4.3, we find that, for any
Since we have
By (4.4.3) (4.4.42) and (4.4.43) it follows that
where for the last inequality (4.4.41) is used.
Thus, by (4.4.40)
By (4.4.33) it follows that
provided is large enough, where for the last inequality, (4.4.2) is
used.
Since by (4.4.43)
and since
and
we find
where the last inequality follows from (4.4.40).Using (4.4.51) and (4.4.53), from (4.4.52) for sufficiently large we
have
Using the second inequality of (4.4.43) and then observing that
and
by (4.4.40) and (4.4.41) and we find
We now show that there is such that
Assume the converse:
with
Then, we have
for large enough because
Inequality (4.4.57) contradicts (4.4.55). Consequently, (4.4.56) is true. In particular, for we have
Completely by the same argument as that used for (4.4.47)–(4.4.50), by
noticing that there is no resetting from to we conclude that
By the same treatment as that used for deriving (4.4.54) from (4.4.50), we obtain
Comparing (4.4.58) with (4.4.54), we find that has been changed to and this procedure can be continued if the number of resettings
is infinite. Therefore, for any we have
From (4.4.40) we see
Since we have and hence by
Consequently, by (4.4.41) the right-hand side of (4.4.59) can be estimated as follows:
by (4.4.61) if is large enough. However, the left-hand side of (4.4.59) is nonnegative. The obtained
contradiction shows that must be finite, and (4.4.44) is correct.
By Theorem 4.4.1, our global optimization algorithm coincides with
the KW algorithm with randomized differences and expanding truncations for all sufficiently large. Therefore, the theorems proved in Section 4.2 are applicable to the global optimization algorithm. By Theorems 4.2.1 and 4.2.2 we can derive the convergence rate and asymptotic normality of the algorithm described by (GO1)–(GO5).
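The overall scheme (GO1)–(GO5) — search periods started from randomly drawn initial values, each running a Kiefer-Wolfowitz iteration with randomized differences and a resetting (truncation) safeguard, with the best point kept — can be sketched as follows. Every concrete ingredient in the snippet (the objective, step sizes, truncation bound, number of periods) is a hypothetical stand-in and not from the book:

```python
import numpy as np

def kw_randomized_diff(f, x0, n_iters=200, a0=0.5, c0=0.5, bound=10.0):
    """One search period: KW iteration with randomized differences.
    The gradient is estimated from two function values along a random
    +/-1 direction Delta:
        g ~ Delta * (f(x + c*Delta) - f(x - c*Delta)) / (2*c).
    If the iterate leaves the ball of radius `bound`, it is reset to the
    initial point (a crude stand-in for the expanding truncations)."""
    rng = np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for k in range(1, n_iters + 1):
        a_k = a0 / k              # step sizes: a_k -> 0, sum a_k = inf
        c_k = c0 / k ** 0.25      # difference widths: c_k -> 0
        delta = rng.choice([-1.0, 1.0], size=x.shape)
        g = delta * (f(x + c_k * delta) - f(x - c_k * delta)) / (2.0 * c_k)
        x = x - a_k * g
        if np.linalg.norm(x) > bound:     # resetting
            x = np.array(x0, dtype=float)
    return x

def global_search(f, sample_initial, n_periods=20):
    """(GO)-style outer loop: run search periods from random initial
    values and keep the point with the smallest observed value."""
    best_x, best_f = None, np.inf
    for _ in range(n_periods):
        x = kw_randomized_diff(f, sample_initial())
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    return best_x

# Toy multimodal objective with its minimum near (2, 2).
def f(x):
    return float(np.sum((x - 2.0) ** 2) - 0.5 * np.cos(3.0 * x[0]))

rng0 = np.random.default_rng(1)
x_hat = global_search(f, lambda: rng0.uniform(-4.0, 4.0, size=2))
```

The random restarts play the role of the randomly selected initial values in (GO1); keeping the best period is the comparison step of the scheme.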
4.5. Application to Model Reduction
In this section we apply the global optimization algorithm to system
modeling. A real system may be modeled by a high order system which, however, may be too complicated for control design. In control engineering the order reduction of a model is of great importance. In the linear system case, this means that a high order transfer function is to be approximated by a lower order transfer function. For this one may use methods like balanced truncation and Hankel norm approximation. These methods are based on the concept of balanced realization. We are interested in recursively estimating the optimal coefficients of the
reduced model by using the stochastic optimization algorithm presented in Section 4.3.
Let the high order transfer function be
and let it be approximated by a lower order transfer function
If is of order then is taken to be of order
To fix ideas, let us take to be a polynomial of order and of order
where the coefficients should not be confused with the step sizes used in Steps (GO1)–(GO5). Write as where
and stand for coefficients of and
It is natural to take
as the performance index of approximation. The parameters and are to be selected to minimize under the constraint that
is stable. For notational simplicity we denote and write as
Let us describe the region where has the required property. Stability requires that
This implies that
because is the sum of two complex-conjugate roots of
If then which yields If
then and hence
(or ).
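The stability region of a second-order denominator can be checked numerically. The sketch below assumes the parametrization is a monic quadratic z^2 + d1 z + d2 with d1 ranging over [-2, 2] and d2 over [-1, 1] (matching the sampling box used later in this section); the closed-form description is the classical Jury/Schur stability triangle:

```python
import numpy as np

def stable_by_roots(d1, d2):
    """z^2 + d1*z + d2 is Schur stable iff both roots lie strictly
    inside the unit circle."""
    return bool(np.all(np.abs(np.roots([1.0, d1, d2])) < 1.0))

def stable_by_triangle(d1, d2):
    """Closed-form description of the stability region:
    |d2| < 1 and |d1| < 1 + d2."""
    return abs(d2) < 1.0 and abs(d1) < 1.0 + d2

# The two criteria agree at random points of [-2,2] x [-1,1], the box
# from which the initial denominator values are later sampled.
rng = np.random.default_rng(0)
pts = rng.uniform([-2.0, -1.0], [2.0, 1.0], size=(500, 2))
agree = all(stable_by_roots(d1, d2) == stable_by_triangle(d1, d2)
            for d1, d2 in pts)
```

The triangle condition is exactly the constraint set D described in the text: the product of the roots bounded by 1 in modulus, and the sum bounded accordingly.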
Set
Identify and appearing in Section 4.3 with and, respectively, in the present case.
We now apply the optimization algorithm (GO1)–(GO5) to minimizing under the constraint that the parameter in belongs to D. For this we first concretize Steps (GO1)–(GO5) described in Section 4.3.
Since is convex in for fixed we take the fixed initial value
for any search period and randomly select initial values only for according to a distribution density which is defined as follows:
where with and being the uniform distributions over [–2, 2] and [–1, 1], respectively.
After having been selected in the search period, the algorithm
(4.1.11) and (4.1.12) is calculated with and
As to observations, instead of (4.3.1) we will use information about the gradient, because in the present case the gradient of can be explicitly expressed:
In the search period the observation is denoted by and is given by
where is independently selected from according to the uniform
distribution, and stands for the estimate for at time in the
search period. It is clear that is an approximation to the integral
where are independently selected from according to the uniform distribution for each Clearly, is an approximation to
Finally, take equal to
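The randomized-frequency approximation of the performance index can be sketched numerically: evaluate both transfer functions at frequencies drawn uniformly and average the squared error. Both systems below are hypothetical low-order examples, not the book's 10th-order plants:

```python
import numpy as np

def freq_resp(num, den, w):
    """Discrete-time frequency response num(z)/den(z) at z = e^{iw};
    `num` and `den` hold coefficients in decreasing powers of z."""
    z = np.exp(1j * np.asarray(w))
    return np.polyval(num, z) / np.polyval(den, z)

def mc_error(num_g, den_g, num_r, den_r, n=20000, seed=0):
    """Monte Carlo estimate of J = (1/2pi) Int_{-pi}^{pi} |G - G_r|^2 dw,
    sampling frequencies uniformly as in the randomized observations."""
    w = np.random.default_rng(seed).uniform(-np.pi, np.pi, size=n)
    d = freq_resp(num_g, den_g, w) - freq_resp(num_r, den_r, w)
    return float(np.mean(np.abs(d) ** 2))

# Hypothetical 2nd-order "true" system and a 1st-order candidate model.
G  = ([1.0, 0.5], [1.0, -0.9, 0.2])      # (z + 0.5)/(z^2 - 0.9z + 0.2)
Gr = ([1.2], [1.0, -0.5])                # 1.2/(z - 0.5)

j_mc = mc_error(*G, *Gr)

# Reference value computed on a dense uniform frequency grid.
wg = np.linspace(-np.pi, np.pi, 200001)
dg = freq_resp(*G, wg) - freq_resp(*Gr, wg)
j_ref = float(np.mean(np.abs(dg) ** 2))
```

Averaging over randomly drawn frequencies gives an unbiased estimate of the integral, which is what makes the gradient observations usable in the SA recursion.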
In control theory there are several well-known model reduction methods, such as model reduction by balanced truncation and Hankel norm approximation, among others. These methods depend on the balanced realization, which is a state space realization method for a transfer matrix keeping the controllability and observability Gramians of the realized system balanced. In order to compare with the proposed global optimization (GO) method, we take the commonly used model reduction methods of balanced truncation (BT) and Hankel norm approximation (HNA), which are realized by using Matlab. For this, the discrete-time transfer functions are transformed to continuous-time ones by using d2c provided in Matlab. Then the reduced systems are discretized to compute for comparison.
As we take a 10th order transfer function, respectively, for the following examples:
Example 4.5.1
Example 4.5.2
Example 4.5.3
Using the algorithm described in Section 4.3, for Examples 4.5.1–4.5.3 we obtain the approximate transfer functions of order 4, respectively,
denoted by and with
Using Matlab we also derive the 4th order approximations for Examples 4.5.1–4.5.3 by balanced truncation and Hankel norm approximation, which are as follows:
where the subscripts and H denote the results obtained by balanced truncation and Hankel norm approximation, respectively.
The approximation errors are given in the following table:
From this table we see that the algorithm presented in Section 4.3 gives smaller approximation errors in in comparison with the other methods.
We now compare approximation errors in norm and compare step responses between the approximate models and the true one by figures.
In the figures of step response
the solid lines denote the true high order systems;
the dashed lines (- - -) denote the system reduced by Hankel normapproximation;
the dotted lines denote the system reduced by balanced truncation;
the dotted-dashed lines denote the systems reduced by the stochastic optimization method given in Section 4.3.
In the figures of the approximation error
the solid lines denote the systems reduced by the stochastic optimization method;
the dashed lines (- - -) denote the system reduced by Hankel norm approximation;
the dotted lines denote the system reduced by balanced truncation.
Example 4.5.1
Example 4.5.2
Example 4.5.3
These figures show that the algorithm given in Section 4.3 gives smaller approximation error in in comparison with the other methods for Example 4.5.1, and an intermediate error in for Examples 4.5.2 and 4.5.3. Concerning step responses, the algorithm given in Section 4.3 provides better approximation in comparison with the other methods for all three examples.
Chapter 5
APPLICATION TO SIGNAL PROCESSING
The general convergence theorems developed in Chapter 2 can deal
with noises containing not only random components but also structural
errors. This property allows us to apply SA algorithms to parameter estimation problems arising from various fields. The general approach, roughly speaking, is as follows. First, the parameter estimation problem coming from practice is transformed to a root-seeking problem for a reasonable but unknown function which may not be directly observed.
Then, the real observation is artificially written in the standard form
with Normally, it is quite straightforward to arrive
at this point. The main difficulty is to verify that the complicated noise
satisfies one of the noise conditions required in the
convergence theorems. It is common that there is no standard method to
complete the verification procedure, because for different problems the noises are completely different from each other.
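The root-seeking recursion underlying this approach is the Robbins-Monro scheme x_{k+1} = x_k + a_k O_k with O_k = f(x_k) + ε_k. A minimal sketch, in which the regression function, the noise (a random part plus a vanishing structural error of the kind these theorems tolerate), and all constants are invented for illustration:

```python
import numpy as np

def robbins_monro(observe, x0, n_iters=5000, a0=1.0):
    """x_{k+1} = x_k + a_k * O_k with step sizes a_k = a0 / k."""
    x = x0
    for k in range(1, n_iters + 1):
        x = x + (a0 / k) * observe(x, k)
    return x

rng = np.random.default_rng(0)
f = lambda x: 2.0 - x                # unknown regression function, root at 2

def observe(x, k):
    # observation = f(x) + noise; the noise mixes a random component
    # with a structural error 1/k that vanishes as k grows
    return f(x) + rng.normal(0.0, 0.5) + 1.0 / k

x_hat = robbins_monro(observe, x0=0.0)
```

The verification work described above amounts to showing that the accumulated weighted noise in such a recursion is asymptotically negligible.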
In Section 5.1, SA algorithms are applied to solve the blind channel identification problem, an active topic in communication. In Section 5.2, the principal component analysis used in pattern classification is dealt
with by SA methods. Section 5.3 continues the problem discussed in
Section 5.1, but in a more general setting. Namely, unlike Section 5.1, the covariance matrix of the observation noise is no longer assumed to
be known. In Section 5.4, adaptive filtering is considered: very simple conditions for convergence of sign-algorithms are given. Section 5.5 discusses the asymptotic behavior of asynchronous SA algorithms, which take the possible communication delays between parallel processors into consideration.
5.1. Recursive Blind Identification
In the system and control area, the unknown parameters are estimated on
the basis of observed input and output data of the system. This is the subject of system identification. In contrast, for communication channels only the channel output is observed and the channel input is unavailable. The task of blind channel identification is to estimate the channel parameters by using the output data only. Blind channel identification has drawn much attention from researchers because of its potential applications in wireless communication. However, most existing estimation methods are “block” algorithms in nature, i.e., the parameters are estimated after an entire block of data has been received.
By using the SA method, a recursive approach is presented here: estimates are continuously improved while new signals are being received.
Consider a system consisting of channels with L being the maximum order of the channels. Let be the one-dimensional input signal, and be the channel output at time where N is the number of samples and may not be fixed:
where
are the unknown channel coefficients. Let us denote by
the coefficients of the channel, and by
the coefficients of the whole system, which compose a vector.
The observations may be corrupted by noise
where is a vector. The problem is to estimate on the basis of the observations.
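The channel model can be simulated directly, and the identity behind the linear equations derived below checked numerically: filtering one channel's output by the other channel's coefficients gives the same sequence either way, since both equal the input filtered by the product of the channels. The channel coefficients below are illustrative, not from the book:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 2
h1 = np.array([1.0, -0.4, 0.25])     # illustrative channel 1, order L = 2
h2 = np.array([0.6, 0.3, -0.5])      # illustrative channel 2
N = 500
s = rng.standard_normal(N)           # common input, unobserved in practice

def channel_output(h, s):
    """y(k) = sum_{l=0}^{L} h(l) s(k-l), truncated to the input length."""
    return np.convolve(s, h)[: len(s)]

y1 = channel_output(h1, s)
y2 = channel_output(h2, s)

# Cross-relation: (h2 * y1)(k) = (h1 * y2)(k), both equal (h1*h2*s)(k),
# so the channel vector satisfies linear equations whose coefficients
# are the observed outputs.
resid = np.convolve(y1, h2)[:N] - np.convolve(y2, h1)[:N]
max_resid = float(np.max(np.abs(resid)))
```

This is the mechanism by which the channel coefficients become identifiable from the outputs alone, up to a common scalar factor.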
Let us introduce polynomials in backward-shift operator
whereWrite and in the component forms
respectively, and express the component via
From this it is clear that
Define
where is a
It is clear that is a xSimilar to and let us define and and and which
have the same structure as and but with replaced by and
respectively.
By (5.1.5) we have
From (5.1.8), (5.1.4), and (5.1.10) it is seen that
This means that the channel coefficient satisfies the set of linear equations (5.1.12) with coefficients being the system outputs.
From the input sequence we form the (N – 2L + 1) × (2L + 1) Hankel matrix
It is clear that the maximal rank of is 2 L + 1 as
If is of full rank for some then will also be of full rank for any
Lemma 5.1.1 Assume the following conditions hold:
A5.1.1 have no common root.
A5.1.2 The Hankel matrix composed of the input signal is of full rank (rank = 2L + 1).
Then is the unique (up to a scalar multiple) nonzero vector simultaneously satisfying
Proof. Assume there is another solution to (5.1.14) which is different from
where is
Denote
From (5.1.15) it follows that
By (5.1.7), we then have
which implies
where by we denote the (2L + 1)-dimensional vector composed of the coefficients of the polynomial written in the form of increasing orders of
Since is of full rank, In other words,
For a
fixed
(5.1.17) is valid for all Therefore, all roots of should be roots of for all By A5.1.1,
all roots of must be roots of Consequently, there is a
constant such that Substituting this into (5.1.17) leads to
and hence Thus, we conclude that
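Assumption A5.1.2 (full rank of the input Hankel matrix) is easy to test numerically. The sketch below builds the sliding-window matrix from a synthetic input; the sample indexing is illustrative and yields one row fewer than the (N − 2L + 1) count in the text, which indexes samples s(0), …, s(N):

```python
import numpy as np

def input_hankel(s, L):
    """Sliding-window (Hankel) matrix of the input sequence:
    row k is (s(k), s(k+1), ..., s(k+2L))."""
    return np.array([s[k : k + 2 * L + 1] for k in range(len(s) - 2 * L)])

rng = np.random.default_rng(0)
L = 3
s = rng.standard_normal(50)          # a generic, sufficiently rich input
H = input_hankel(s, L)
full_rank = int(np.linalg.matrix_rank(H)) == 2 * L + 1

# A constant input is not rich enough: its Hankel matrix has rank 1,
# so A5.1.2 fails and the channel vector is not identifiable from it.
rank_const = int(np.linalg.matrix_rank(input_hankel(np.ones(50), L)))
```

A random (persistently exciting) input satisfies the full-rank condition with probability one, while degenerate inputs such as constants do not.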
We first establish a convergence theorem for blind channel identification based on stochastic approximation methods for the case where a noise-free data sequence is observed.
Then, we extend the results to the case where N is not fixed and
observation is noise-corrupted.
Assume is observed. In this case
are available, and we have We will repeatedly use the data by setting
Define the estimate for recursively by
with an initial value
We need the following condition.
Theorem 5.1.1 Assume A5.1.1–A5.1.3 hold. Let be given by
(5.1.19) with any initial value with Then
where is a constant.
Proof. Decompose and respectively into orthogonal vectors:
where
If serves as the initial value for (5.1.19), then by (5.1.14). Again, by (5.1.14) we have
and we conclude that
and
Therefore, for proving the theorem it suffices to show that as
Denote
and
Then by (5.1.21) we have
Noticing that and is uniformly bounded with respect to for large we have
and
By (5.1.18)
and by Lemma 5.1.1, is its unique (up to a constant multiple) eigenvector corresponding to the zero eigenvalue, and the rank of
is
Denote by the minimal nonzero eigenvalue of
Let be an arbitrary vector orthogonal to
Then can be expressed by
where – 1, are the unit eigenvectors of
corresponding to its nonzero eigenvalues.
It is clear that
By this, from (5.1.23) and (5.1.24), it follows that for
and
Noticing that
we conclude
and hence
From (5.1.21) it is seen that is nonincreasing for. Hence, the convergence implies that
The proof is completed.
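The recursion (5.1.19) and the conclusion of Theorem 5.1.1 can be illustrated numerically. The displayed form of (5.1.19) is lost in this scan, so the sketch below assumes a cross-relation recursion of the kind described, x_{k+1} = x_k − a_k φ_k (φ_k^T x_k), with the finite data reused cyclically; the channels and the small constant step size are illustrative stand-ins (the book uses decreasing step sizes a_k):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 2
h1 = np.array([1.0, -0.4, 0.25])     # illustrative channel 1
h2 = np.array([0.6, 0.3, -0.5])      # illustrative channel 2, no common roots
h = np.concatenate([h1, h2])         # channel vector, to be found up to scale

N = 400
s = rng.standard_normal(N)           # unobserved input
y1 = np.convolve(s, h1)[:N]          # observed outputs
y2 = np.convolve(s, h2)[:N]

# Cross-relation regressors: phi_k . h = 0 for every k >= L, since both
# sum_l h1(l) y2(k-l) and sum_l h2(l) y1(k-l) equal (h1*h2*s)(k).
Phi = np.array([np.concatenate([y2[k - L : k + 1][::-1],
                                -y1[k - L : k + 1][::-1]])
                for k in range(L, N)])

x = np.ones(2 * (L + 1))
x /= np.linalg.norm(x)               # initial value, not orthogonal to h
a = 0.02                             # small constant step for this sketch
for k in range(20000):               # reuse the data cyclically
    phi = Phi[k % len(Phi)]
    x = x - a * phi * (phi @ x)      # the component along h is untouched

# x converges to a scalar multiple of h, the Theorem 5.1.1 conclusion.
cos_angle = abs(x @ h) / (np.linalg.norm(x) * np.linalg.norm(h))
```

Because every regressor is orthogonal to h, each step leaves the h-component of the iterate intact while shrinking the orthogonal components, which is exactly the mechanism used in the proof.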
Remark 5.1.1 If the initial value is orthogonal to then and (5.1.20) is also true. But this is an uninteresting case giving no information about
Remark 5.1.2 Algorithm (5.1.19) is an SA algorithm with linear time-varying regression function The root set J for is time-invariant: As mentioned above, evolves in one of the subspaces depending on the initial value: In the proof of Theorem 5.1.1 we have actually verified that may serve as the Lyapunov function satisfying A2.2.20 for. Then applying Remark 2.2.6 also leads to the desired conclusion.
We now assume the input signal is a sequence of infinitely many mutually independent random variables and that the observations do not contain noise, i.e., in (5.1.5).
Lemma 5.1.2 Assume A5.1.1 holds and is a sequence of mutually
independent random variables with Then is
the unique unit eigenvector corresponding to the zero eigenvalue for the
matrices
and the rank of is
Proof. Since is a sequence of mutually independent random variables and it follows that
where
Proceeding along the lines of the proof of Lemma 5.1.1, we arrive at the analogue of (5.1.16):
which implies
From (5.1.28) and (5.1.29) it follows that Then, following the proof of Lemma 5.1.1, we conclude that is the unique unit vector satisfying
This shows that is of rank and is its unique unit eigenvector corresponding to the zero eigenvalue.
Let denote the minimal nonzero eigenvalue of On we need the following condition.
A5.1.4 is a sequence of mutually independent random variables
with for some and such that
Condition A5.1.3 is strengthened to the following A5.1.5.
A5.1.5 A5.1.3 holds and where is given in A5.1.4.
It is obvious that if is an iid sequence, then is a positive constant, and (5.1.30) is automatically satisfied.
Theorem 5.1.2 Assume A5.1.1, A5.1.4, and A5.1.5 hold, and is
given by (5.1.19) with initial value Then
where
Proof. In the present situation we still have (5.1.21) and (5.1.22). So, it suffices to show
With N replaced by 4L in the definitions of and we again arrive at (5.1.23).
Since
converges a.s. by A5.1.4 and A5.1.5, there is a large such that
Let be an arbitrary vector such that
Then by Lemma 5.1.2,
and hence
Therefore, which
tends to zero since This implies
is bounded, and
a.s.,
We now consider the noisy observation (5.1.5). By the definition
(5.1.11), similar to (5.1.9) we have
where and have the same structure as given by (5.1.10) with
replaced by and, respectively. The following truncated algorithm is used to estimate
with initial value and
Introduce the following conditions.
A5.1.6 and are mutually independent and each of them is a sequence of mutually independent random variables (vectors) such that
and
for some
and where is given in A5.1.4.
Set
Then
Denote by the resetting times, i.e.,
Then, we have
A5.1.7
and
Let be an orthogonal matrix, where
Denote
Then
Noticing we find that
Lemma 5.1.3 Assume A5.1.6 and A5.1.7 hold. Then for given by (5.1.32),
Proof. Setting
we have
and
Proof. Since is a sequence of mutually independent nondegenerate
random variables, where
Notice that coincides with given by (5.1.13) if setting N = 4 L and in (5.1.13).
Proceeding as in the proof of Lemma 5.1.1, we again arrive at (5.1.16).
Then, we have Since
we find that Then by the same argument as that used in the proof of Lemma 5.1.1, we conclude that for any is the unique nonzero unit vector simultaneously satisfying
Since is a matrix, the above assertion
proves that the rank of is and also
proves that is its unique unit eigenvector corresponding to the zero eigenvalue.
Denote by the minimal nonzero eigenvalue of
We need the following condition.
A5.1.8 There is a such that
It is clear that if is an iid sequence, then is independent of and and A5.1.8 is automatically satisfied.
Lemma 5.1.6 Assume A5.1.1 and A5.1.6–A5.1.8 hold. Then for any
which together with (5.1.44) leads to
for large enough
where and
Theorem 5.1.3 Assume A5.1.1 and A5.1.6–A5.1.8 hold. Then for
given by (5.1.32) with initial value and
where is a random variable expressed by (5.1.60).
Proof. We first prove that the number of truncations is finite, i.e.,
a.s. Assume the converse:
By Lemma 5.1.3, for any given
and
as
if is large enough, say. By the definition of we have
which together with (5.1.52) implies
and
Define
Since is well-defined
by (5.1.54). Notice that from to there is no truncation. Consequently,
and
To fix ideas, let us take From (5.1.52) and (5.1.54) it follows that the sequences
starting from cross the interval for each This means that
crosses the interval for each
Here, we say that the sequence
crosses an interval with if and
there is no truncation in the algorithm (5.1.32) forWithout loss of generality, we may assume converges:
It is clear that and By Lemma 5.1.4, there is no truncation for if T is small enough.
Then, similar to (2.2.24), for large by Lemmas 5.1.3 and 5.1.4 we have
where and
By Lemma 5.1.6, for large and small T we have
By Lemma 5.1.4 Noticing that
and by the definition of crossing we see that for small enough T,
This implies that
Letting in (5.1.57), we find that
which contradicts (5.1.58). The contradiction shows that
Thus, starting from the algorithm (5.1.32) suffers from no truncation. If did not converge as then
and would cross a nonempty interval
infinitely often. But this leads to a contradiction as shown above. Therefore, converges as
If were not zero, then there would exist a convergent
subsequence. Replacing in (5.1.56) by from
(5.1.57) it follows that
Since converges, the left-hand side of (5.1.59) tends to zero, which makes (5.1.59) a contradictory inequality. Thus, we have proved
a.s.
Since from (5.1.40) it follows that
By (5.1.38) and the fact that we finally conclude that a.s.
The difficulty of applying the algorithm (5.1.32) consists in the fact that the second moment of the noise may not be available. Identification of the channel coefficients without using will be discussed in Section 5.3, by using the principal component analysis to be described in the next section.
5.2. Principal Component Analysis
The principal component analysis (PCA) is one of the basic methods used in feature extraction, signal processing, and other areas. Roughly speaking, PCA gives recursive algorithms for finding eigenvectors of a symmetric matrix A based on noisy observations of A.
Let be a sequence of observed symmetric matrices, and
The problem is to find eigenvectors of A, in particular, the one corresponding to the maximal eigenvalue.
Define
with initial value being a nonzero unit vector. serves as an estimate for a unit eigenvector of A.
If then is reset to a different vector with norm equalto 1.
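The displayed recursions (5.2.1)–(5.2.2) are lost in this scan; the surrounding description (a unit-norm iterate updated with the noisy matrices A_k and renormalized, so that it evolves on the unit sphere) suggests an Oja-type scheme x_{k+1} = (x_k + a_k A_k x_k) / ||x_k + a_k A_k x_k||. A sketch under that assumption, with an artificial matrix and noise:

```python
import numpy as np

rng = np.random.default_rng(0)
# Artificial symmetric matrix A; its top eigenvector is known here only
# so the result can be checked -- the algorithm sees only A_k = A + N_k.
Q = np.linalg.qr(rng.standard_normal((4, 4)))[0]
A = Q @ np.diag([5.0, 2.0, 1.0, 0.5]) @ Q.T
top = Q[:, 0]                        # eigenvector of the largest eigenvalue

x = np.ones(4) / 2.0                 # initial unit vector
for k in range(1, 20001):
    noise = rng.standard_normal((4, 4))
    A_k = A + (noise + noise.T) / 2.0    # symmetric zero-mean noise
    y = x + (1.0 / k) * (A_k @ x)        # move along the noisy image of x
    x = y / np.linalg.norm(y)            # renormalize: stay on the sphere

cos_angle = abs(float(x @ top))
```

The renormalization keeps the iterate on the unit sphere S, and the decreasing steps a_k = 1/k average out the observation noise, so the iterate aligns with the dominant eigenvector.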
Assume have been defined as estimates for unit
eigenvectors of A. Denote which is an where
where denotes the pseudo-inverse of Since for large is a full-rank matrix,
Define
if with. If we redefine an with such that
Define the estimate for the eigenvalue corresponding to the
eigenvector whose estimate at time is by the following recursion.
Take a sequence increasing to infinity
and define by the SA algorithm with expanding truncations:
where
We will use the following conditions:
A5.2.1 and
A5.2.2 are symmetric, and
A5.2.3 and
where is given by (1.3.2).
Examples for which (5.2.8) is satisfied are given in Chapters 1 and 2. We now give one more example.
Example 5.2.1 Assume is stationary and ergodic. If then satisfies (5.2.8). Set By
ergodicity, we have a.s. By a partial summation it follows
that
which implies (5.2.8).
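The partial-summation computation, whose displayed lines are lost in this scan, presumably runs along the following standard lines; write $S_n = \sum_{j=1}^{n} \varepsilon_j$, so that ergodicity gives $S_n/n \to 0$ a.s., and take $a_k = 1/k$ for illustration:

```latex
\sum_{k=n}^{m} a_k \varepsilon_k
  = \sum_{k=n}^{m} a_k \,(S_k - S_{k-1})
  = a_m S_m - a_n S_{n-1} + \sum_{k=n}^{m-1} (a_k - a_{k+1})\, S_k .
```

With $a_k = 1/k$ the boundary terms are $S_m/m \to 0$ and $S_{n-1}/n \to 0$, while $\sum_{k=n}^{m-1} (a_k - a_{k+1}) S_k = \sum_{k=n}^{m-1} \frac{S_k}{k(k+1)} = o(1) \sum_{k=n}^{m-1} \frac{1}{k+1}$, i.e., $o(1)$ times the total step length of the window; the weighted tail sums therefore vanish, which is the form of noise condition (5.2.8) used in the convergence theorems.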
Let be the unit eigenvector of A corresponding to eigenvalue
where may not be different.
Theorem 5.2.1 Assume A5.2.1 and A5.2.2 hold. Then given by
(5.2.1)–(5.2.6) converges at those samples for which A5.2.3 holds,
and the limits of coincide with
Let denote the limit of as Then
Proof. Consider those for which A5.2.3 holds. We first prove convergence of Note that may happen only for a finite
number of steps because as and By
boundedness of we expand into the power series of
where
Further, we rewrite (5.2.9) as
where
Denote by S the unit sphere in Then defined by (5.2.2)
evolves on S.
Define
The root set of on S is
Defining we find for
Thus, Condition A2.2.2(S) introduced in Remark 2.2.6 is satisfied.
Since is bounded, no truncation is needed. Then, by Remark
2.2.6 we conclude that converges to one of say
Denote
Inductively, we now assume
We then have
Since and from (5.2.21) and (5.2.5) it
follows that and by (5.2.6)
We now proceed to show that converges to one of the unit eigenvectors contained in
From (5.2.5) we see that the last term in the recursion
tends to zero as So, by (5.2.22) we need to reset with and at most for a finite number of times.
Replacing by in (5.2.9)–(5.2.11), we again arrive at
(5.2.11) for Precisely,
where
and
By noticing
and using (5.2.22), (5.2.23) can be rewritten as
where as
Since tends to an eigenvector of A, from (5.2.11) it follows that
where
Since converges, from (5.2.13) and it follows that
Inductively, assume that
with satisfying (5.2.27), i.e.,
Noticing that for any matrix V, we have
by (5.2.28).
Since by (5.2.24), denoting by
the term we have
for any convergent subsequence
Denoting
from (5.2.26) we see
By (5.2.8) and (5.2.30), similar to (5.2.18)–(5.2.20), by Remark 2.2.6
converges to a unit eigenvector of From (5.2.5) it
is seen that converges since and Then from
(5.2.6) it follows that itself converges as
Thus, we have
From (5.2.5) it follows that
which implies that and consequently,
Since the limit of is a unit eigenvector of
we have
By (5.2.33) it is clear that can be expressed as a linear combination of eigenvectors Consequently,
which together with (5.2.34) implies that
This means that is an eigenvector of A, and is different from
by (5.2.33). Thus, we have shown (5.2.21) for To complete the induction it
remains to show (5.2.28) for
As we have just shown, tends to zero as from (5.2.31) we have
where satisfies (5.2.29) with replaced by by taking notice that (5.2.30) is fulfilled for the whole sequence because which has been shown to be convergent.
Elementary manipulation leads to
This expression together with (5.2.35) proves (5.2.28) for
Thus, we have proved that given by (5.2.1)–(5.2.6)
converge to different unit eigenvectors of A, respectively.
To complete the proof of the theorem it remains to show
Rewrite the untruncated version of (5.2.7) as follows
We have just proved that Then by (5.2.8) and
noticing the fact that converges and we see that
satisfies A2.2.3. The regression function in (5.2.36) is linear:
Applying Theorem 2.2.1 leads to
Remark 5.2.1 If in (5.2.1) and (5.2.3) is replaced by then Theorem 5.2.1 remains valid. In this case given by (5.2.18) should change to and correspondingly changes to As a result, the limit of changes to the opposite sign, from to
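The displayed recursions (5.2.1)–(5.2.6) did not survive this extraction. As a hedged illustration of the idea behind PCA by stochastic approximation, a minimal Oja-type normalized SA iteration for the principal unit eigenvector of E[xxᵀ] can be sketched as follows (the function name, step-size choice, and synthetic data are assumptions, not the book's algorithm):

```python
import numpy as np

def oja_pca(samples, dim, step=lambda k: 1.0 / (k + 10)):
    """Estimate the principal unit eigenvector of E[x x^T] from a stream
    of sample vectors by a normalized SA (Oja-type) iteration."""
    rng = np.random.default_rng(0)
    u = rng.standard_normal(dim)
    u /= np.linalg.norm(u)              # keep the iterate on the unit sphere
    for k, x in enumerate(samples):
        u = u + step(k) * (x @ u) * x   # SA step: (x x^T) u is a noisy "A u"
        u /= np.linalg.norm(u)          # renormalization replaces the projection
    return u

# Synthetic check: samples with covariance diag(4, 1, 0.25), whose
# principal unit eigenvector is +/- e1.
rng = np.random.default_rng(1)
samples = rng.standard_normal((20000, 3)) @ np.diag([2.0, 1.0, 0.5])
u = oja_pca(samples, 3)
print(np.abs(u))  # first component close to 1
```

The sign ambiguity of the limit is exactly the phenomenon addressed in Remark 5.2.1: both unit eigenvectors of opposite sign are legitimate limits.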
5.3. Recursive Blind Identification by PCA
As mentioned in Section 5.1, the algorithm (5.1.32) for identifying channel coefficients uses the second moment of the observation noise. This causes difficulty in possible applications, because may not be available.
We continue to consider the problem stated in Section 5.1 with notations introduced there. In particular, (5.1.1)–(5.1.12) and (5.1.31) will be used without explanation.
Instead of (5.1.32) we now consider the following normalized SA algorithm:
Comparing (5.3.1) and (5.3.2) with (5.2.1) and (5.2.2), we find that the channel parameter identification algorithm coincides with the PCA algorithm with By Remark 5.2.1, Theorem 5.2.1 can be applied to (5.3.1) and (5.3.2) if conditions A5.2.1, A5.2.2, and A5.2.3 hold.
The following conditions will be used.
A5.3.1 The input is a sequence, i.e., there exist a constant and a function such that for any
where
A5.3.2 There exists a distribution function over such that
where denotes the Borel σ-algebra in and
A5.3.3 The (2L + 1) × (2L + 1)-matrix is nondegenerate,
where
A5.3.4 The signal is independent of and
a.s., where is a random variable with
A5.3.5 All components of are
mutually independent with and and is bounded, where
is a constant.
A5.3.6 have no common root.
A5.3.7 and
For Theorem 5.1.1, is assumed to be a sequence of mutually independent random variables (Condition A5.1.6), while in A5.3.1 the independence is weakened to a property, but the distribution of is additionally required to be convergent. Although there is no requirement on the distribution of in Theorem 5.1.1, we notice that (5.1.30) is satisfied if are identically distributed.
In the sequel, denotes the identity matrix.
Define with
and
In what follows denotes the Kronecker product.
Theorem 5.3.1 Assume A5.3.1–A5.3.7 hold. Then
where C is a -matrix and Q is given in A5.3.3, and for given by (5.3.1) and (5.3.2),
where J denotes the set of unit eigenvectors of C.
Proof. By the definition of we have
Since
and by A5.3.2, (5.3.3) immediately follows.
From the definition (5.1.31) for by A5.3.5 it is clear that
is a -identity matrix multiplied by with Then by A5.3.4 and A5.3.5
Identifying in Theorem 5.2.1 to we find that Theorem 5.2.1 can be applied to the present algorithm, if we can show (5.2.8),
which, in the present case, is expressed as
where is given by (1.3.2), and B is given by (5.3.6).
Notice, by the notation introduced by (5.1.33),
Since
and
by the convergence theorem for martingale difference
sequences, for (5.3.7) it suffices to show
Identifying and in Lemma 2.5.2 to
and respectively, we find that conditions required there are
satisfied. Then (5.3.8) follows from Lemma 2.5.2, and hence (5.3.7) is
fulfilled.
By Theorem 5.2.1 given by (5.3.1) and (5.3.2) converges to a unit eigenvector of B, which clearly is an eigenvector of C.
Lemma 5.3.1 is, up to a scalar multiple, the unique nonzero vector simultaneously satisfying
Proof. Since it is known that satisfies (5.3.9), it suffices to prove the
uniqueness.
As in the proof of Lemma 5.1.1, assume is
also a solution to (5.3.9). Then, along the lines of the proof of Lemma 5.1.1, we obtain the analogue of (5.1.16), which implies (5.1.29):
where is given by (5.1.28) while by (5.1.16).
By A5.3.3 which is nondegenerate. Then we have The rest of the proof for uniqueness coincides with that given in Lemma 5.1.1.
By Lemma 5.3.1 zero is an eigenvalue of C with multiplicity one and
the corresponding eigenvector is Theorem 5.3.1 guarantees that the estimate approaches J, but it is not clear whether tends to the direction of
Let be all different eigenvalues
of C. J is composed of disconnected sets and where Note that
the limit points of are in a connected set, so converges to a
for some Let We want to prove that
a.s. or This is the conclusion of Theorem 5.3.2, which is essentially based on the following lemma, proved in [9].
Lemma 5.3.2 Let be a family of nondecreasing and
be a martingale difference sequence with
Let be an adapted random sequence and be a real sequence
such that and Suppose that on the following conditions 1), 2), and 3) hold.
2) can be decomposed into two adapted sequences and
such that
3) coincides with an random variablefor some
Then
Theorem 5.3.2 Assume A5.3.1–A5.3.7 hold. Then defined by
(5.3.1) and (5.3.2) converges to up-to a constant multiple:
where equals either
Proof. Assume the contrary: for some
Since C is a symmetric matrix, for where and hereafter a possible set with zero probability in is ignored. The proof is completed in four steps.
Step 1. We first explicitly express
Expanding defined by (5.3.2) into the power series of we derive
where
Noting and we derive
and
where is defined by (5.1.4), is given by (5.1.10)
with replaced by the observation noise, and denotes the estimate for at time
By (5.3.4) and (5.3.5), there exists a.s. such that a.s.
For any integers and define and
Note that for
and by the convergence of from (5.3.12) it follows that where is a constant for all in By
(5.3.7) we then have
as where and hereafter T should not be confused with the superscript T for transpose.
Choose large enough and sufficiently small T such that Let
and It then follows that for
In
for sufficiently large.
Consequently, for with fixed
and hence
Define
From (5.3.15) it follows that
Letting tend to infinity in (5.3.21) and replacing by in the resulting equality, by (5.3.19) we have
Thus, we have expressed in two ways: (5.3.21) shows that is
measurable, while (5.3.22) is in the form required in Lemma 5.3.2, where
Step 2. In order to show that the summand in (5.3.22) can be expressed as that required in Lemma 5.3.2, we first show that the series
is convergent on By (5.3.14) and (5.3.7) it suffices to show that
is convergent on
Define
and
Clearly, is measurable with respect to and Then by the convergence theorem for martingale difference sequences,
By (5.3.16) it follows that
The first term on the right-hand side of the last equality of (5.3.29) can
be expressed in the following form:
where the last term equals
Combining (5.3.30) and (5.3.31) we derive that the first term on the
right-hand side of the last equality of (5.3.29) is
By A5.3.4, A5.3.5, and A5.3.7 it is clear that
Hence replacing by in (5.3.29) results in
producing an additional term of magnitude Thus, by (5.3.24)–
(5.3.26) we can rewrite (5.3.29) as
where and is By (5.3.28) and A5.3.7
the series (5.3.33) is convergent, and hence given by (5.3.23) is a
convergent series.
Step 3. We now define sequences corresponding to and in
Lemma 5.3.2.
Let We have
where
Denote
Then and are adapted sequences, is a martingale difference sequence, and is written in the form of Lemma 5.3.2:
It remains to verify (5.3.10) and (5.3.11).
From (5.3.23) and (5.3.33) it follows that there is a constant such that Then for noticing
and
we have
By A5.3.4 and A5.3.5 it follows that
As in Step 4 it will be shown that
From this it follows that
Then from the following inequality
by (5.3.34) and (5.3.36) it follows that
Therefore all conditions required in Lemma 5.3.2 are met, and we conclude Since it follows that
and must converge to a.s.
Step 4. To complete the proof we have to show (5.3.35). If (5.3.35) were not true, then there would exist a subsequence
such that
For notational simplicity, let us denote the subsequence still by
Since by A5.3.5 for if and for any but if we then have
which, combined with (5.3.37), implies that
and
Noticing that and from (5.3.38)
and (5.3.24) it follows that
On the other hand, we have
and hence,
where denotes the estimate provided by for at time Since for any
we have
Hence (5.3.40) implies that
and
By A5.3.4 the left-hand side of (5.3.41) equals
Since it follows that for any
The left-hand side of (5.3.42) equals
Thus (5.3.42) implies that for any
Noticing from (5.3.25) we have
Then by A5.3.5, (5.3.39) implies that for any
Notice that
and
Then by A5.3.5, from (5.3.45)–(5.3.47) it follows that
and hence for any
and
Notice that (5.3.49) means that
However, the above expression equals
Therefore,
In the sequel, it will be shown that (5.3.43), (5.3.44), (5.3.48), and (5.3.50) imply that which contradicts
This means that the converse assumption (5.3.37) is not true.
For any since are coprime, where is given in (5.1.6), there exist polynomials such that
Let and be the degrees of and respectively. Set
Introduce the q-dimensional vector and q × q
square matrices W and A as follows:
Note that where and Then (5.3.43), (5.3.44), (5.3.48),and (5.3.50) can be written in the following compact form:
To see this, note that for any fixed and on the left-hand sides of (5.3.48) and (5.3.50) there are 2L different sums when varies from 0 to L – 1 and exchange roles with each other. These together with (5.3.43) and (5.3.44) give us 2L + 1 sums, and each of them tends to zero. Explicitly expressing (5.3.52), we find that there are 2L + 1 nonzero rows, and each row corresponds to one of the relationships in (5.3.43), (5.3.44), (5.3.48), and (5.3.50).
Since we have put enough zeros in the definition of multiplying the left-hand side of (5.3.52) by
has only shifted nonzero elements in
From (5.3.52) it follows that for any and in
(5.3.51)
From (5.3.53) it follows that
Note that for any polynomial of degree if the last elements of are zeros. From (5.3.54) it follows that
Denoting
from (5.3.55) we find that
By the definition of the first elements of are zeros, i.e.,
This means that the last elements of are zeros, i.e.,
On the other hand,
By (5.3.56), from (5.3.57) and (5.3.58) it is seen that i.e.,
From (5.3.53) it then follows that
i.e., But this is impossible, because are unit vectors. Consequently, (5.3.37) is impossible and this completes
the proof of Theorem 5.3.2.
5.4. Constrained Adaptive Filtering
We now apply SA methods to adaptive filtering, which is an important topic in signal processing. We consider the constrained problem; the unconstrained problem is a special case of the constrained one, as will be explained.
Let and be two observed sequences, where and are respectively. Assume is stationary and ergodic with
which, however, is unknown.
It is required to design the optimal weighting X, which minimizes
under the constraint
where C and are matrices, respectively. In the case where C = 0, the problem reduces to the unconstrained one.
It is clear that (5.4.3) is solvable with respect to X if and only if
and in this case the solution to (5.4.3) is
where Z is any
For notational simplicity, denote
Let L(C ) denote the vector space spanned by the columns of matrix C , and let the columns of matrix be an orthonormal basis
of L(C ). Then there is a full-rank decomposition Noticing we have Let be an orthogonal matrix. Then
and hence
From this it follows that
and hence a.s. This implies that
Let us express the optimal X minimizing (5.4.2) via By (5.4.8), substituting (5.4.4) into (5.4.2) leads to
On the right-hand side of (5.4.9) only the first term, which is quadratic, depends on Z. Therefore, the optimal should be the solution of
i.e.,
where is any satisfying
Combining (5.4.4) with (5.4.11), we find that
Using the ergodic property of we may replace and by their sample averages to obtain the estimate for And the estimate can be updated by using new observations. However, updating the estimate involves taking the pseudo-inverse of the updated estimate for which may be of high dimension. This will slow down the computation. Instead, we now use an SA algorithm to approach
By (5.4.8), we can rewrite (5.4.10) as
or
We now face the standard root-seeking problem for a linear function
As before, let and
The following algorithm is used to estimate given by (5.4.12), which in the notation used in previous chapters is the root set J for the linear function given by (5.4.14):
with initial value such that and
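The precise form of (5.4.16) with its expanding truncation bounds is garbled in this copy. As an illustration of the general pattern only — a root-seeking SA iteration that restarts from a fixed point and enlarges the bound whenever the iterate escapes — the following sketch may help; the linear example f(x) = b − Rx, the bound sequence, and the step sizes are all assumptions:

```python
import numpy as np

def sa_expanding_truncations(obs, x0, bounds, step=lambda k: 1.0 / (k + 1)):
    """Seek x* with f(x*) = 0 from noisy observations O_k(x) = f(x) + noise,
    restarting from x0 and enlarging the bound M_sigma whenever the
    candidate iterate exceeds it (expanding truncations)."""
    x, sigma = np.array(x0, float), 0
    for k, O in enumerate(obs):
        cand = x + step(k) * O(x)
        if np.linalg.norm(cand) > bounds(sigma):
            x, sigma = np.array(x0, float), sigma + 1  # truncate: restart, enlarge bound
        else:
            x = cand
    return x

# Hypothetical linear example: f(x) = b - R x with R positive definite,
# observed through independent noisy samples.
rng = np.random.default_rng(0)
R = np.array([[2.0, 0.3], [0.3, 1.0]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(R, b)
obs = [(lambda x, e=rng.standard_normal(2): b - R @ x + 0.1 * e)
       for _ in range(5000)]
x = sa_expanding_truncations(obs, [0.0, 0.0], bounds=lambda s: 2.0 ** s + 5.0)
print(x)  # approaches x_star
```

With a correct root inside the initial bound the truncation mechanism is typically inactive after finitely many steps, mirroring the "no more truncations" claim of Theorem 5.4.1.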
Theorem 5.4.1 Assume that is stationary and ergodic with second moment given by (5.4.1) and that Then, after a finite number of steps, say (5.4.16) has no more
truncations, i.e.,
Then from (5.4.26) it follows that
Denote
and
Since is stationary and ergodic, a.s., and
Then by a partial summation, we have
Notice that a.s. by ergodicity. Then for large
and from (5.4.29) it follows that
where (5.4.24) is used together with the fact that
and is stationary with E
From (5.4.27)–(5.4.30) by convergence of it follows that
for large and small T , where and are constants independent of
and
Consequently, in the case i.e., in
(5.4.16), will never reach the truncation bound for
if is large enough and T is small enough.
Then coincides with This verifies (5.4.22), while (5.4.23) follows from (5.4.16) because for a fixed
and are bounded, and
are also bounded by (5.4.31) and the convergence In
the case i.e., for some is bounded, and hence (5.4.22) and (5.4.23) are also satisfied.
We are now in a position to verify the noise condition required in
Theorem 2.2.1 for given by (5.4.20), i.e., we want to show that
for any convergent subsequence
By (5.4.24)
so for (5.4.32) it suffices to show
Again, by (5.4.24) and also by (5.4.23)
which implies (5.4.33). By Theorem 2.2.1, there is such that for is defined by (5.4.17) and converges to the root set J for given by (5.4.14). This completes the proof of the theorem.
Remark 5.4.1 For the unconstrained problem and C = 0, the algorithm (5.4.16) becomes
Theorem 5.5.1 Assume is stationary and ergodic with
Then
where is defined by (5.5.4) and (5.5.5) with an arbitrary initial value. In addition, in a finite number of steps truncations cease to exist
in (5.5.4).
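The recursion (5.5.4)–(5.5.5) is elided in this extraction; according to the notes in Section 5.7, this section concerns sign algorithms for adaptive filtering. The classical sign-error update, of which the truncated algorithm here is a variant, can be sketched as follows (names, step sizes, and the synthetic data are hypothetical):

```python
import numpy as np

def sign_lms(phi, y, dim, step=lambda k: 1.0 / (k + 1)):
    """Sign (sign-error) algorithm: only the sign of the prediction error
    enters the update, making each step cheap and robust to
    heavy-tailed observation noise."""
    w = np.zeros(dim)
    for k, (p, yk) in enumerate(zip(phi, y)):
        w = w + step(k) * p * np.sign(yk - p @ w)
    return w

# Hypothetical stationary regression y_k = phi_k^T w* + noise.
rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0, 0.5])
phi = rng.standard_normal((50000, 3))
y = phi @ w_star + 0.1 * rng.standard_normal(50000)
w = sign_lms(phi, y, 3)
print(w)  # close to w_star
```

Because only sign information is used, the conditions needed for convergence can be kept weak, which is the point of the comparison with [42] made in Section 5.7.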
Proof. Define
and
Let be a countable set that is dense in let and be two sequences of positive real numbers such that and as and denote
and
where and is an integer.
The summands of (5.5.9)–(5.5.11) are stationary with finite expectations for any any integer any and any and then the ergodic theorem yields that
a.s.,
and
Therefore, there is an such that and for each the convergence in (5.5.12)–(5.5.14) takes place for any any integer any and any
Let us fix an We first show that for any fixed
if is large enough (say, for ), and in addition,
where c is a constant which may depend on but is independent of In what follows always denote constants that may
depend on but are independent of By (5.4.24) we have for any
There are two cases to be considered. If then for large enough, and (5.5.15) holds. If is bounded, then the truncations cease to exist after a finite number of steps. So, (5.5.15) also holds if is sufficiently large. Then (5.5.16) follows immediately from
(5.5.15) and (5.5.17).Let us define
where is given by (5.5.2). Then (5.5.15) can be represented as
Let be a convergent subsequence of and let be such that We now show that
Let By (5.5.16) or for some integer
We verify that the terms on the right-hand side of (5.5.20) satisfy (5.5.19).
For the first term on the right-hand side of (5.5.20) we have
where and are deterministic for a fixed and the expectation is taken with respect to and
Since by (5.5.6), a.s., applying the dominated convergence theorem yields
Then from (5.5.21) it follows that
Similarly, for the second term on the right-hand side of (5.5.20) we have
since a.s.
For the third term on the right-hand side of (5.5.20) by (5.4.24),(5.5.10), and (5.5.13) we have
since
Finally, for the last term in (5.5.20), by (5.5.14) and (5.4.24) we have
where the last convergence follows from the fact that
a.s. as since and
a.s.Combining (5.5.23)–(5.5.26) yields that
Since the left-hand side of (5.5.27) is free of tending to infinity in (5.5.27) leads to (5.5.19). Then the conclusion of the theorem follows from Theorem 2.2.1 by noticing that as in A2.2.2 one may take
5.6. Asynchronous Stochastic Approximation
When dealing with large interconnected systems, it is natural to consider distributed, asynchronous SA algorithms. For example, in a communication network with servers, each server has to allocate audio and video bandwidths in an appropriate proportion in order to minimize the average queueing delay. Denote by the bandwidth ratio for the server, and Assume the average delay time depends on only and is differentiable, Then, to minimize is equivalent to finding the root of Assume the time, denoted by spent on transmitting data from the server to the server is not negligible. Then at the server for the iteration we can observe or only at where denotes the total time spent until completion of iterations for the server. This is a typical problem solved by asynchronous SA. A similar problem arises from job-scheduling for computers in a computer network.
We now precisely define the problem and the algorithm.
At time denote by the estimate for the unknown root of Components of are observed by different processors, and the communication delays from the processor to the processor at time are taken into account. The observation of the processor is carried out only at i.e.,
where is the observation noise.
In contrast to the synchronous case, the update steps are now different for different processors, so it is unreasonable to use the same step size for all processors in an asynchronous environment. At time the step size used in the processor is known and is denoted by
We will still use the expanding truncation technique, but we are unable to simultaneously change estimates in different processors when the estimate exceeds the truncation bound, because of the communication delay.
Assume all processors start at the same given initial value and for all The observation at
the processor is and is updated to by the rule given below. Because of the communication delay the estimate produced by the processor cannot reach the processor for the initial steps:
By agreement we will take to serve as whenever
At the processor, there are two sequences and recursively generated, where is the estimate for the component of at time and is connected with the number of truncations up to and including time at the processor. For the processor at time the newest information about the other processors is In all algorithms discussed until now all components of are observed at the same point at time and this makes updating to meaningful. In the present case, although we are unable to make all processors observe at the same points at each time, it is still desirable to require all processors to observe at points located as close to each other as possible. Presumably, this would make the estimate updating reasonable. For this, by noticing that the estimate gradually changes after a truncation, the ideal is to keep all equal, but the best we can do is to equalize with the other
Keeping this idea in mind, we now define the algorithm and the observations for the processor,
Let be a fixed point from which the algorithm restarts after a truncation.
i) If there exists with then reset to equal the biggest one among and pull back to the fixed point although may not exceed the truncation bound. Precisely, in this case define
and observe
for any
ii) If then observe at
i.e.,
For both cases i) and ii), and are updated as follows:
where is the step size at time and may be random, and
is a sequence of positive numbers increasingly diverging to infinity.
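A toy sketch may clarify the bookkeeping described above: each processor owns one coordinate, updates it with its own decreasing step sizes, and sees the other coordinates only through delayed messages. The resetting rule i) and the truncation counters are deliberately omitted, and the diagonally dominant linear example is an assumption for illustration:

```python
import numpy as np

def async_sa(f_noisy, dim, n_iter, max_delay=5, seed=0):
    """Each 'processor' i updates only its own coordinate, using a stale
    (delayed) copy of the other coordinates and its own step-size
    sequence -- a simplified asynchronous SA scheme, without the
    expanding-truncation bookkeeping of (5.6.1)-(5.6.6)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    stale = [x.copy() for _ in range(dim)]   # processor i's delayed view of x
    for k in range(1, n_iter + 1):
        for i in range(dim):
            if k % rng.integers(1, max_delay + 1) == 0:
                stale[i] = x.copy()          # a message arrives: refresh the view
            view = stale[i].copy()
            view[i] = x[i]                   # own coordinate is always current
            x[i] += (1.0 / (k + 2)) * f_noisy(view, rng)[i]
    return x

# Hypothetical root-seeking problem: f(x) = b - A x with A diagonally
# dominant, so the asynchronous iteration still converges to A^{-1} b.
A = np.array([[2.0, 0.5, 0.0], [0.5, 2.0, 0.5], [0.0, 0.5, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x_star = np.linalg.solve(A, b)
f_noisy = lambda x, rng: (b - A @ x) + 0.1 * rng.standard_normal(3)
x = async_sa(f_noisy, 3, 4000)
print(x)  # approaches x_star
```

Because the step sizes decrease while the communication delays stay bounded, the effect of the stale information vanishes asymptotically; this is the intuition behind conditions of the type A5.6.2 and A5.6.5 below.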
Let us list conditions to be used.
A5.6.1 is locally Lipschitz continuous.
A5.6.2 and
there exist two positive constants such that
A5.6.3 There is a twice continuously differentiable function (not necessarily nonnegative) such that
and is nowhere dense, where
and denotes the gradient of
A5.6.4 For any convergent subsequence any and any
where
and
A5.6.5
Note that (5.6.10) holds if is bounded, since Note
also that A5.6.3 holds if and
Theorem 5.6.1 Let be given by (5.6.1)–(5.6.6) with initial value Assume A5.6.1–A5.6.5 hold, and there is a constant such that and
where is given in A5.6.3. Then
where
The proof of the theorem is separated into lemmas. From now on wealways assume that A5.6.1–A5.6.5 hold.
We first introduce an auxiliary sequence and its associated observation noise It will be shown that differs from only
by a finite number of steps. Therefore, for convergence of it suffices to prove convergence of
Let be a sample path generated by the algorithm (5.6.1)–(5.6.6), where is the one after resetting according to (5.6.2). Let where is
defined in A5.6.4. Assume By the resetting rule given
in i), for any after resetting we have For we
have and by the definition of
In the processor we take and to replace and
respectively, and define for those
Further, define and for
Then we obtain new sequences associated with
By (5.6.1)–(5.6.6), if then there exists a with
and
since and for
Because during the period there is no truncation for the sequences are recursively updated as follows:
where
Define delays for as follows
is available to the processor at time
Lemma 5.6.1 For any any convergent subsequence
and any satisfies the following condition
where
Proof. Since equals either or which is available at time it is seen that
For by definition of we have
which is certainly available to the processor. Therefore,
We rewrite By the definition of and
paying attention to (5.6.17) we see
so
as
We now show that (5.6.18) is true for all For there is no truncation for the processor,
and hence by the resetting rule i). If
for some then by (5.6.16) and the definition of it follows that
which implies (5.6.18).
If for some then as explained above for the processor at time the latest information about the estimate produced by the processor is In other words,
However, by definition of which yields
This again implies (5.6.18).
In summary, we have
This means that for there is no truncation at any time equal to and the observation is carried out at
i.e.,
For any any convergent subsequence and any we have
By (5.6.11), Then from A5.6.2 and
A5.6.5 it follows that and hence the second term
on the right-hand side of (5.6.21) tends to zero as Further, from the definition of there is such that Hence the first term on the right-hand side of (5.6.21) is of order o(T ) by A5.6.4. Consequently, from A5.6.2, A5.6.4 and A5.6.5 it follows that satisfies (5.6.15).
Lemma 5.6.2 Let be generated by (5.6.12)–(5.6.14). For any convergent subsequence of if is bounded,
then there are and such that
where is given in (5.6.14).
Proof. Let where and
where is given in A5.6.2.
By (5.6.15), for a convergent subsequence there exists such that for any and
Choose such that For any let
Then for any
If then if is sufficiently large, i.e., no truncation occurs after and hence for
If then there exists such that for any From (5.6.24) it follows that
Therefore, in both cases
If then for sufficiently large
i.e.,
This contradicts the definition of Therefore,
Lemma 5.6.3 Let be given by (5.6.12) – (5.6.14). For any
with the following assertions take place:
i) In the case cannot cross infinitely many times keeping bounded, where are the starting points of crossing;
ii) In the case cannot converge to keeping
bounded.
Proof. i) Since is bounded, there exists a convergent subsequence, which is still denoted by for notational simplicity,
By the boundedness of and (5.6.22)
for sufficiently large there is no truncation between and and hence
where By (5.6.20), (5.6.22) and
it follows that
By A5.6.2 and A5.6.3 we have
Then by A5.6.1
where is the Lipschitz coefficient of in and
By the boundedness of
and the fact that there is no truncation between and it follows
that
Without loss of generality, we may assume is a convergent sequence. Then by A5.6.3 and A5.6.5
Therefore,
where
Since is continuous for fixed by A5.6.4 there exists a for such that
Thus, for sufficiently small T and sufficiently large we have
On the other hand, by Lemma 5.6.2
Thus, for sufficiently small T , and
This contradicts (5.6.31), and i) is proved.
ii) If is bounded, then there is a convergent subsequence Then the assertion can be deduced in a way similar to that for i).
Lemma 5.6.4 Under the conditions of Theorem 5.6.1
where is given by (5.6.14).
Proof. If then there exists a sequence such that
From (5.6.12)–(5.6.14) we have
Choose a small positive constant such that Let be a connected set containing and included in the set
and let be a connected set containing and included in the set Clearly, and and are
bounded.
Since diverges to infinity, there exists such that for Noting that there exists i such that
and we can define and
for
Since there is a convergent subsequence in also denoted by Let be a limit point of By the definition of is bounded. But
crosses infinitely many times, which is impossible by Lemma 5.6.3. Thus,
Proof of Theorem 5.6.1
and
By Lemma 5.6.4 is bounded. Let
If then by Lemma 5.6.3, we have
If then there are and such that and since is nowhere dense. But by Lemma 5.6.3 this is impossible. Therefore,
We now show If there is a convergent subsequence
and then (5.6.26)–(5.6.30) still hold. Hence,
This is a contradiction to
Consequently, i.e.,
Since and the truncations occur only finitely many times. Therefore, and differ from each other only for a finite number of So,
5.7. Notes and References
For blind identification with “block” algorithms we refer to [71, 96]. Recursive blind channel identification algorithms appear to be new. Section 5.1 is written on the basis of the joint work “H. F. Chen, X. R. Cao, and J. Zhu, Convergence of stochastic approximation based algorithms for blind channel identification”. Principal component analysis is applied in different areas (see, e.g., [36, 79]). The results presented in Section 5.2 are an improved version of those given in [101]. Principal component analysis is applied to solve the blind identification problem in Section 5.3, which is based on the recent work “H. T. Fang and H. F. Chen, Blind channel identification based on noisy observation by stochastic approximation method”. The proof of Lemma 5.3.2 is given in [9].
For adaptive filter we refer to [57]. The results presented in Sec-tion 5.4 are stronger than those given in [11, 28]. The sign algorithmsare dealt with in [42], but conditions used in Section 5.5 are consider-ably weaker than those in [42]. Section 5.5 is based on the recent work“H. F. Chen and G. Yin, Asymptotic properties of sign algorithms for
adaptive filtering”.Asynchronous stochastic approximation was considered in [9, 88, 89,
99]. Section 5.6 is written on the basis of [50].
Chapter 6
APPLICATION TO SYSTEMS
AND CONTROL
Assume a control system depends on a parameter and the system operation reaches its ideal status when the parameter equals some. Since is unknown, we have to estimate it during the operation of the system, which, therefore, can work only on the estimate of. In other words, the real system is not under the ideal parameter, and the problem is to on-line estimate and to make the system asymptotically operate in the ideal status. It is clear that this kind of system parameter identification can be dealt with by SA methods.
Adaptive control for linear stochastic systems is a typical examplefor the situation described above. If the system coefficients are known,then the optimal stochastic control may be a feedback control of thesystem state. The corresponding feedback gain can be viewed as theideal parameter which depends on the system coefficients. In the setupof adaptive control, system coefficients are unknown, and hence isunknown. The problem is to estimate and to prove that the resultingadaptive control system by using the estimate as the feedback gain isasymptotically optimal as tends to infinity.
In Section 6.1 the ideal parameter is identified by SA methods for systems in a general setting, and the results are applied to solving the adaptive quadratic control problem. The adaptive stabilization problem is solved for stochastic systems in Section 6.2, while adaptive exact pole assignment is discussed in Section 6.3. An adaptive regulation problem for nonlinear and nonparametric systems is considered in Section 6.4.
6.1. Application to Identification and AdaptiveControl
Consider the following linear stochastic system depending on param-
eter
where and are unknown.
The ideal parameter for System (6.1.1) is a root of an unknownfunction
The system actually operates with equal to some estimate for ,i.e., the real system is as follows:
For notational simplicity, we suppress the dependence on the state and rewrite (6.1.3) as
The observation at time is
where is a noise process. From (6.1.5) it is seen that the function is not directly observed,
but it is connected with as follows:
We list conditions that will be used.
where is generated by (6.1.1). Let be a sequence of positive numbers increasingly diverging to infinity, and let be a fixed point. Fixing an initial value, we recursively estimate by the SA algorithm with expanding truncations:
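The display (6.1.8) is not legible in this reproduction. As a rough scalar sketch of the expanding-truncation mechanism (cf. (2.1.1)–(2.1.3)): whenever the iterate leaves the current ball, it is pulled back to a fixed point and the truncation radius is enlarged. Everything concrete below — the function f(x) = 2 − x, the harmonic step sizes, the radii M_k = 2^k, and the bounded deterministic "noise" — is an invented assumption, not taken from the text:

```python
# Toy scalar sketch of a Robbins-Monro algorithm with expanding
# truncations: when the iterate leaves the current ball it is reset to
# a fixed point and the radius is enlarged.  All numerical choices here
# are illustrative assumptions.
def rm_expanding_truncations(observe, x0, x_reset, steps=20000):
    M = lambda s: 2.0 ** s           # truncation radii, increasing to infinity
    x, sigma = x0, 0                 # sigma counts truncations so far
    for k in range(1, steps + 1):
        x_next = x + (1.0 / k) * observe(x, k)
        if abs(x_next) > M(sigma):   # truncation: pull back, enlarge radius
            x_next, sigma = x_reset, sigma + 1
        x = x_next
    return x

# seek the root x* = 2 of f(x) = 2 - x under bounded deterministic "noise"
est = rm_expanding_truncations(lambda x, k: (2.0 - x) + 0.5 * (-1) ** k,
                               x0=50.0, x_reset=0.0)
```

Despite the wild initial value, one early truncation resets the iterate, and afterwards the decreasing steps keep it inside the enlarged ball while it converges to the root.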
A6.1.2 There is a continuously differentiable function
such that
for any and is nowhere dense,
where J is given by (6.1.2). Further, used in (6.1.8) is such that
inf for some and
A6.1.3 The random sequence in (6.1.1) satisfies a mixing condition
characterized by
uniformly in where Further, is such that
sup where
A6.1.4 For sufficiently large integer
for any such that converges, where is given by (1.3.2).
Let is stable}, and let be an open, connected subsetof
A6.1.5 and f are connected by (6.1.6) and (6.1.1) for each
satisfies a local Lipschitz condition on
with for any constants and where is
given in A6.1.3.
with
A6.1.1 and
A6.1.6 and in (6.1.1) are globally Lipschitz continuous:
where L is a constant.
A6.1.7 given by (6.1.7) is If converges for some
then where may depend on
Theorem 6.1.1 Assume A6.1.1–A6.1.7 hold. Then
where is a connected subset of
Proof. By (6.1.5) we rewrite the observation in the standard form
where
By Theorem 2.2.2 and Condition A6.1.4, the assertion of the theoremwill immediately follow if we can show that for almost all condition(2.2.2) is satisfied with replaced by
Let be expressed as a sum of seven terms:
where
where
and and denote the distribution and
conditional distribution of given, respectively. To prove the theorem it suffices to show that there exists with
such that for each all satisfy
(2.2.2) with respectively identified to
By definition, for any there is such that
where is independent of
Let us first show that satisfy (2.2.2). Solving (6.1.1) yields
By A6.1.3 is bounded. Hence, by (6.1.18) is bounded and by A6.1.5 is also bounded:
where
where is given in A6.1.5.
Since, we have. We now show that and are continuous in, uniformly with respect to.
By (6.1.18) and (6.1.20), from (6.1.19) it follows that
By (6.1.18), (6.1.20), and the Lipschitz condition A6.1.5 for, it follows that
and
which implies the uniform continuity of. This together with (6.1.13) yields that is also uniformly continuous. Let be a countable dense subset of. Noticing that is, and expressing
as a sum of martingale difference sequences
by (6.1.20) and, we find that there is with such that for each
for any integer and any. From here, by uniform continuity of, it follows that for and for any integer
Note that
This is because by (6.1.18) and (6.1.20) we have the following estimate:
We now estimate by the treatment used in Lemma 2.5.2. By applying the Jordan-Hahn decomposition to the signed measure
Similarly, we can find with such that forand
since is bounded by the martingale convergence theorem. It is worth noting that (6.1.23) holds a.s. for any, but without loss of generality (6.1.23) may be assumed to hold for all with. To see this, we first select such that (6.1.23) holds for any. This is possible because is a countable set. Then, we notice that is continuous in, uniformly with respect to. Thus, we have
where is the mixing coefficient given in A6.1.3. Thus, by(6.1.27)–(6.1.29) we have
and
it is seen that there is a Borel set D in the sampling space such that for any A in the sampling space
By A6.1.5, (6.1.18), (6.1.20), and noticing we find
whose expectation is finite as explained for (6.1.20). Therefore, on the right-hand side of (6.1.30) the conditional expectation is bounded with respect to by the martingale convergence theorem, and the last term is also bounded with respect to. Thus, by (6.1.10) from (6.1.30) it follows
that there is with such that
Assume is a convergent subsequence
Define
Write (6.1.4) as
Let be fixed.
where. By induction we now show that
for all suitably large.
For any fixed if is large enough, since
Therefore, (6.1.36) holds for, since. Assume (6.1.36) holds for some. By noticing, from (6.1.34) and (6.1.35) it follows that
By using (6.1.20), (6.1.37), and the inductive assumption, and applying (6.1.19) to, it follows that
for, where and satisfies the following equation
By A6.1.7 and (6.1.20) we have
and using (6.1.18), (6.1.37), and the inductive assumption we derive
Combining this with (6.1.38), it follows that there are real numbers and such that
for From here it follows that
From the inductive assumption it follows that for
for some large enough integer N . Then by (6.1.12)
Setting
we derive
where (6.1.22), (6.1.24), (6.1.25), (6.1.31), (6.1.39), and (6.1.40) are used.
Choose sufficiently small so that (6.1.35) holds, and
Since by A6.1.5 there is such that
for all From (6.1.41) it then follows that
It can be assumed that is sufficiently large so that
Since by (6.1.42) it follows that
and hence there is no truncation at
Thus, we have
or equivalently,
which proves (6.1.36).
Consequently, (6.1.39) is valid for and hence
where is the estimate of Let be given by (6.1.7) and (6.1.8) with given by (6.1.5).
where and are related by (6.1.44). However, since the ideal is unknown, the real system satisfies the
equation
where and are symmetric such that and. Let be as given by A6.1.3. The control
where is the feedback control which is required to minimize
Finally, noticing that A6.1.5 assumes (6.1.6), we conclude that for each
all satisfy (2.2.2) with
respectively replaced by. The proof of the theorem is completed.
We now apply the obtained result to an adaptive control problem. Assume that is the ideal parameter for the system, being
the unique zero of an unknown function The system in the idealcondition is described by the equation
From (6.1.21) and (6.1.13) it is seen that is continuous in, uniformly with respect to. Therefore, its limit is a continuous function. Then by (6.1.36) it follows that
should be selected in the family U of admissible controls:
In order to give adaptive control we need the expression of the optimalcontrol when is known.
Lemma 6.1.1 Suppose that
is a martingale difference sequence with
ii) where is controllable and observable, i.e., · · · , and · · · , are of full rank.
Then in the class of nonnegative definite matrices there is a unique satisfying
and
where
and
Proof. The existence of a unique solution to (6.1.50) and the stability of F given by (6.1.51) are well-known facts in control theory. We show the optimality of the control given by (6.1.52).
For notational simplicity, we temporarily suppress the dependence of
and on and write them as A, B,
and D, respectively.Noticing
is stable. The optimal control minimizing (6.1.45) is
we then have
Since, by the estimate for the weighted sum of a martingale difference sequence, from (6.1.55) it follows that
where is the state in (6.1.47).
Thus the closed system becomes
Notice that the last term of (6.1.56) is nonnegative. The conclusions of
the lemma follow from (6.1.56).
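As a concrete numerical illustration of Lemma 6.1.1 in the scalar case, the Riccati equation (6.1.50) can be solved by fixed-point iteration and the optimal feedback gain of (6.1.51)–(6.1.52) computed from its solution. The system values a, b and the weights q, r below are invented for this sketch, not taken from the text:

```python
# Scalar numerical illustration of the Riccati equation (6.1.50) and
# the optimal feedback gain (6.1.51)-(6.1.52); a, b, q, r are invented.
def solve_dare_scalar(a, b, q, r, iters=200):
    """Fixed-point iteration for the scalar discrete algebraic Riccati
    equation  s = a*s*a + q - (a*s*b)**2 / (r + b*s*b)."""
    s = q
    for _ in range(iters):
        s = a * s * a + q - (a * s * b) ** 2 / (r + b * s * b)
    return s

def lq_gain(a, b, q, r):
    """Optimal stationary feedback u_k = f * x_k for the quadratic cost."""
    s = solve_dare_scalar(a, b, q, r)
    f = -(a * s * b) / (r + b * s * b)
    return s, f

# open-loop unstable example: a = 1.2
s, f = lq_gain(a=1.2, b=1.0, q=1.0, r=1.0)
closed_loop_pole = 1.2 + 1.0 * f   # stability of F: |pole| < 1 after feedback
```

The stabilizing property of the resulting closed-loop matrix is exactly what the lemma asserts for F given by (6.1.51).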
According to (6.1.52), by the certainty-equivalence principle, we form
the adaptive control
which has the same structure as (6.1.4). Therefore, under the assumptions A6.1.1–A6.1.7 with replaced by and with J being a singleton, by Theorem 6.1.1 it is concluded that
By continuity and stability of, it is seen that there are and, possibly depending on, such that
This yields the boundedness of and
because
By (6.1.60) it follows that
Therefore, the closed system (6.1.58) asymptotically operates under the ideal parameter and minimizes the performance index (6.1.45).
6.2. Application to Adaptive Stabilization
Consider the single-input single-output system
where and are the system input, output, and noise, respectively, and
where is the backward shift operator. The system coefficient
is unknown. The purpose of adaptive stabilization is to design control
so that
a.s.
The fact that and a can be solved from (6.2.5) for any means that
is nonzero. In other words, the coprimeness of and is equivalent to
In the case where is unknown, the certainty-equivalence principle suggests replacing by its estimate to derive the adaptive control law. However, for, may be zero and (6.2.5) may not be solvable with and replaced by their estimates.
Let us estimate by the following algorithm called the weighted leastsquares (WLS) estimate, which is convergent for any feedback control
If is known and if and are coprime, then for an arbitrary stable polynomial of degree there are unique polynomials
and both of order with such that
Then the feedback control generated by
leads the system (6.2.1) to
Then, by stability of, (6.2.4) holds if we assume
Considering the coefficients of and as unknowns, and identifying the coefficients of on both sides of (6.2.5), we derive a system of linear algebraic equations with matrix for the unknowns:
where
Though converges a.s., its limit may not be the true. If a bounded sequence can be found such that the modified estimate
and for some
is convergent and
then the control obtained from (6.2.6) with replaced by solves the adaptive stabilization problem, i.e., makes (6.2.4) hold.
Therefore, the central issue in adaptive stabilization is to find a bounded sequence such that given by (6.2.12) is convergent and (6.2.13) is fulfilled. This gives rise to the following definition.
Definition. System (6.2.1) is called adaptively stabilizable by the use of parameter estimate if there is a bounded sequence such that (6.2.13) holds and given by (6.2.12) is convergent.
It can be shown that if system (6.2.1) is controllable, i.e., and are coprime, then it is adaptively stabilizable by the use of the WLS estimate. It can also be shown that the system is adaptively stabilizable by use of if and only if, where and F denote the limits of and, respectively, which are generated by (6.2.9)–(6.2.11).
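The WLS recursion (6.2.9)–(6.2.11) itself is not legible in this reproduction. To convey the flavor, here is a generic recursive least-squares sketch (with unit weights, i.e., plain RLS rather than the book's weighted scheme) for estimating the parameter in a linear regression; the regressor model and all numerical values are invented assumptions:

```python
# Generic recursive least-squares sketch for y_k = phi_k' theta + w_k.
# Unit weights (plain RLS); the book's WLS recursion uses specific
# weights not reproduced here.  Regressors and the true parameter
# below are invented for illustration.
import math

def rls(data, dim):
    """data: iterable of (phi, y) pairs, phi a list of length dim."""
    theta = [0.0] * dim
    # P is the (scaled) inverse information matrix
    P = [[100.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for phi, y in data:
        Pphi = [sum(P[i][j] * phi[j] for j in range(dim)) for i in range(dim)]
        denom = 1.0 + sum(phi[i] * Pphi[i] for i in range(dim))
        err = y - sum(phi[i] * theta[i] for i in range(dim))
        for i in range(dim):
            theta[i] += Pphi[i] * err / denom        # gain K = P phi / denom
        for i in range(dim):
            for j in range(dim):                     # matrix inversion lemma
                P[i][j] -= Pphi[i] * Pphi[j] / denom
    return theta

# recover theta = (1.5, -0.7) from noise-free regression data
data = [([math.sin(k), math.cos(k)], 1.5 * math.sin(k) - 0.7 * math.cos(k))
        for k in range(1, 51)]
theta = rls(data, dim=2)
```

As in the WLS case discussed above, such a recursion converges under any feedback control that keeps the regressors sufficiently informative; the modification of the limit, not the recursion itself, is what the rest of this section addresses.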
We now use an SA algorithm to recursively produce such that is convergent and the resulting estimate by (6.2.12) satisfies
(6.2.13).
is generated by (6.2.9)–(6.2.11), is defined by (6.2.11), and is recursively defined by an SA algorithm given below.
Let us take a few real sequences defined as follows:
where
which can be written as
From algebraic geometry it is known that is a
finite set.
However, is not directly observed; the real observation is
The root set of is denoted by where
where
As a matter of fact,
Let and be –dimensional, and let
Let be l-dimensional with only one nonzero element, equal to either +1 or –1. Similarly, let be -dimensional with only nonzero elements, each of which equals either +1 or –1,
The total number of such vectors is
Normalize these vectors and denote the resulting vectors by, in nondecreasing order of the number of nonzero elements in
Define and for Introduce
Define the recursive algorithm for as follows:
and is a fixed vector. The algorithm (6.2.23)–(6.2.27) is the RM algorithm with expanding
truncations, but it differs from the algorithm given by (2.1.1)–(2.1.3)
as follows. The algorithm (2.1.1)–(2.1.3) is truncated at the upper side only, but the present algorithm is truncated not only at the upper side but also at the lower side: is allowed neither to diverge to infinity nor to tend to zero; whenever it reaches the truncation bounds, the estimate is pulled back to, and is enlarged to at the upper side, while at the lower side is pulled back to, which will change to the
next whenever is satisfied. If for successive resettings of we have to change to the next one, then we reduce to
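A toy scalar sketch of this two-sided truncation may help: the iterate may neither diverge to infinity nor collapse to zero, so it is reset whenever it leaves an expanding upper bound or falls below a shrinking lower bound. A single fixed reset point beta stands in for the cycling reset vectors of (6.2.23)–(6.2.27), and the function, bounds, and noise are all invented assumptions:

```python
# Toy scalar sketch of an RM iteration truncated at BOTH an expanding
# upper bound and a shrinking lower bound; a single reset point beta
# replaces the cycling reset vectors of the text.  All numerical
# choices are illustrative assumptions.
def rm_two_sided(observe, x0, beta=0.5, steps=10000):
    M = lambda s: 2.0 ** s            # upper bounds, increasing to infinity
    m = lambda d: 2.0 ** (-d - 2)     # lower bounds, decreasing to zero
    x, sigma, delta = x0, 0, 0
    for k in range(1, steps + 1):
        x_next = x + (1.0 / k) * observe(x, k)
        if abs(x_next) > M(sigma):    # upper-side truncation
            x_next, sigma = beta, sigma + 1
        elif abs(x_next) < m(delta):  # lower-side truncation
            x_next, delta = beta, delta + 1
        x = x_next
    return x

# seek the root x* = 1 of f(x) = 0.5*(1 - x) under decaying noise
est = rm_two_sided(lambda x, k: 0.5 * (1.0 - x) + 0.3 * (-1) ** k / k, x0=5.0)
```

In this run the upper truncation fires once early on, after which the iterate behaves like an ordinary RM algorithm, matching the claim of Lemma 6.2.1 below that the truncations cease after finitely many steps.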
Lemma 6.2.1 Assume the following conditions hold:
A6.2.2 System (6.2.1) is adaptively stabilizable by use of generated
by (6.2.9)–(6.2.11), i.e.,
If then after a finite number of steps the algorithm (6.2.23)– (6.2.27) becomes the RM algorithm
converges and
Proof. The basic steps of the proof are essentially the same as those for proving Theorem 2.2.1, but some modifications should be made because of the truncations at the lower side.
Step 1. Let be a convergent subsequence of
For any define the RM algorithm
with or for some for some
We show that there are M > 0, T > 0 such that when, and when, if is large enough, where is given by (1.3.2).
Let > 1 be a constant such that
It is clear that
A6.2.1 and
Since and are convergent, there is such that
Let By (6.2.29) and (6.2.30), we have
for if and for if, where. Let (6.2.31) hold for or
It then follows that
where or. Thus, (6.2.31) has been inductively proved for or.
Step 2. Let be a convergent subsequence. We show that there
are M > 0 and T > 0 such that
if is large enough. If defined by (6.2.25) is bounded, then (6.2.32) follows directly. Again take such that and set. Assume. Then there is a such that
By the result proved in Step 1, starting from, the algorithm for cannot directly hit the sphere with radius without a truncation for. So it may first hit some lower bound at time and switch to some, from which again by Step 1 it cannot directly reach without a truncation. The only possibility is to be truncated again at a lower bound. Therefore, (6.2.32) takes place.
Step 3. Since and are convergent, by (6.2.32) it follows that
from any convergent subsequence there are constants and such that
if is large enough. Consequently, there is such that
By (6.2.32) and the convergence of and it also follows that
Therefore,
Using (6.2.33) and (6.2.34) by the same argument as that given in Step 3
of the proof for Theorem 2.2.1, we arrive at the following conclusion. If, starting from, the algorithm (6.2.24) is calculated
as an RM algorithm and is bounded, then for
any with and, cannot cross infinitely often.
Step 4. We now show that is bounded.
If is unbounded, then as. Therefore, is unbounded and comes back to the fixed point infinitely many times.
Notice that is a finite set and
We see that there is an interval with and
0 such that crosses infinitely often, and during each crossing the algorithm (6.2.24) behaves like an RM algorithm with starting point. It is clear that is bounded because as. But by Step 3, this is impossible. Thus, we conclude that
is bounded, and after a finite number of steps (6.2.24) becomes
Step 5. We now show (6.2.28), i.e., after a finite number of steps the algorithm (6.2.35) ceases to truncate at the lower side.
Since and by A6.2.2, it follows that there is at least one nonzero coefficient in the polynomial for some
with Therefore, for some and a small
From (6.2.16) it is seen that for sufficiently small we have
This, combined with the convergence of and, leads to
for sufficiently large
From (6.2.26) and (6.2.36) it follows that must be bounded, and hence is bounded. This means that there is a such that
We now show that is bounded. Since for all sufficiently large, it follows that. If were unbounded, then by (6.2.37) the algorithm, starting from, would infinitely many times enter the sphere with radius
where is small enough such that
Then would cross an interval infinitely often. Since is a finite set, we may assume. It is clear that during the crossing the algorithm behaves like an RM algorithm. By Step 4, this is impossible.
Therefore, there is a such that
Noticing (6.2.20), (6.2.34), and that serves as the Lyapunov function for, from Theorem 2.2.1 we conclude the remaining assertions of the lemma.
where and are defined by 1)-3) described above.
Proof. The key step is to show that
Assume the converse:
Case i) The assumption implies that
and occurs infinitely many times. However,
this is impossible, since and. The contradiction shows
Theorem 6.2.1 Assume conditions A6.2.1 and A6.2.2 hold. Then there is such that and converges and
and use to produce the adaptive control as in 1), and go back to
1) for.
3) If and none of a)-c) of 2) is the case, then set
and go back to 1) for and at the same time change to
i.e.,
Define
Using we now define in (6.2.12) satisfying (6.2.13), thus solving the adaptive stabilization problem.
Let
1) If then set. Using we produce
the adaptive control from (6.2.6) with and defined from
(6.2.5) with replaced by, and go back to 1) for.
2) If then define
a) for the case where
b) defined by (6.2.24) for the case where
butc) for the case, where
but
and the algorithm defining will run over the following cases: 1) and 2a)-2c). Since and are convergent, the inequality
for all sufficiently large. Again, this means that (6.2.41) may take place at most a finite number of times, and we conclude that
Thus, there is such that
If then from (6.2.43) it follows that
Since and for sufficiently large
from (6.2.42) it follows that
for all sufficiently large. Thus, (6.2.41) may take place at most a finite number of times. The contradiction shows that
we havethen as
Take a convergent subsequence of. For notational simplicity denote by itself its convergent subsequence. Thus
By Lemma 6.2.1,1) If then
Case ii) The assumption implies that there
is a sequence of integers such that and, i.e., for all the following indicator equals one
2) If
implies
6.3. Application to Pole Assignment for Systems
with Unknown Coefficients
Consider the linear stochastic system
where is the -dimensional state, is the one-dimensional control, and is the -dimensional system noise.
The task of pole assignment is to define the feedback control
in order that the characteristic polynomial
of the closed-loop system coincides with a given polynomial
The pair is called similar to if there exists a nonsingular matrix such that
where denotes the column of T .
Consequently, the truncation at the lower bound in (6.2.24) should be very rare. The computation will be simplified if there is no lower bound truncation.
for sufficiently large. This means that the algorithm can be at 2b) only finitely many times. For the same reason it cannot be at 2c) infinitely many times. Therefore, the algorithm will stick on 1) if
and on 2a) if and in both cases there is a
such that and
The convergence of follows from the convergence of and
Remark 6.2.1 For the case the origin is not a stable equilibrium for the equation
So, is nonsingular if and only if is nonsingular. Assume that is controllable and is already in its controller form (6.3.5). For notational simplicity, we will write rather than
where
which imply
Define
where are coefficients of
The pair is called the controller form associated to the pair
If is controllable, i.e., is of full rank, then is similar to its controller form. To see this, we note that (6.3.4) implies, and from it follows that
where is the system noise at time "1" for the system with feedback gain applied.
Having observed, we compute its characteristic polynomial det, which is a noise-corrupted characteristic polynomial of.
Let be the estimate for. By observing det, we actually learn the difference det, which in a certain sense reflects how far det differs from the ideal polynomial
For any let
With feedback control, the closed-loop system takes the form
Since is in controller form,
where are elements of the row vector F :
Therefore, if is known, then comparing (6.3.10) with (6.3.3) givesthe solution to the pole assignment problem, where
We now solve the pole assignment problem by learning for the casewhere is unknown.
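When the parameter is known, the controller-form comparison of (6.3.10) with (6.3.3) reduces to matching polynomial coefficients. A sketch of that known-parameter calculation, assuming the standard companion (controller) form with input vector (1, 0, …, 0)′; the numerical coefficients are invented for illustration:

```python
# Sketch of exact pole assignment via coefficient matching, assuming
# the standard controller (companion) form with b = (1,0,...,0)'.
# The numerical coefficients below are invented for illustration.
def pole_assign_controller_form(a, p):
    """a: [a1,...,an] with det(zI - A) = z^n + a1 z^(n-1) + ... + an,
    p: coefficients of the desired polynomial in the same convention.
    Returns F with u_k = F x_k so that det(zI - (A + bF)) matches p."""
    # In controller form the top row of A is (-a1,...,-an); adding bF
    # changes it to (-a1+F1,...,-an+Fn), whose characteristic
    # coefficients are a_i - F_i.  Matching a_i - F_i = p_i gives:
    return [ai - pi for ai, pi in zip(a, p)]

# move the poles of z^2 - 3z + 2 (roots 1 and 2, unstable) to those of
# z^2 - 0.5z + 0.06 (roots 0.2 and 0.3)
F = pole_assign_controller_form([-3.0, 2.0], [-0.5, 0.06])
```

The learning problem of this section is precisely to reach this F when the coefficients are unknown and only noise-corrupted characteristic polynomials are observed.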
Let us combine the vector equation (6.3.9) for initial values to form
the matrix equation
Let In learning control, can be observed at any fixed
For any the observation of is denoted by
be the row vector composed of coefficients of
By (6.3.10)
composed of coefficients of
and, respectively. Take a sequence of positive real numbers
and
Calculate the estimate for by the following RM algorithm with
expanding truncations:
with fixed
Theorem 6.3.1 Assume that is controllable and is in the controller form. Further, assume the following conditions A6.3.1 and A6.3.2 hold:
A6.3.1 The components of
of in (6.3.13) are mutually independent with
A6.3.2
where is the same as that in A6.3.1. Then there is with such that for each, as
Similarly, define row vectors
for some
From here it is seen that is a sum of products of elements from, with +1 and –1 as a multiplier for each product, where and denote elements of A and, respectively. It is important to note that each product in includes at least one of as its factor. Thus, the product is of the form
From (6.3.21) by (6.3.18), (6.3.15), and (6.3.13) it follows that
Therefore, the conclusion of the theorem will follow from Theorem 2.2.1, if we can show that for any integer N
where is the desired feedback gain realizing the exact pole
assignment.
Proof. Define
where and are given by (6.3.14) and (6.3.17), respectively. By (6.3.11) and (6.3.16) it follows that
Thus, (6.3.19) and (6.3.20) become
It is clear that the recursive algorithm for has the same structure
as (2.1.1)–(2.1.3). For the present case, as the function required in A2.2.2 we may take
where
By A6.3.1 we have
where. By A6.3.2 and the convergence theorem for martingale difference sequences it follows that
for any integer which implies (6.3.24).
6.4. Application to Adaptive Regulation
We now apply the SA method to solve the adaptive regulation problem for a nonlinear nonparametric system.
Consider the following system
where is the system state, is the control, and is an unknown nonlinear function with being the unknown equilibrium for the system (6.4.1). Assume the state is observed, but the observations are corrupted by noise:
where is the observation noise, which may depend on. The purpose of adaptive regulation is to define adaptive control based on measurements so that the system state reaches the desired value, which, without loss of generality, may be assumed to be zero. We need the following conditions.
A6.4.1 and
A6.4.2 The upper bound for is known, i.e., and is
a robustly stabilizing control in the sense that for any the state
tends to zero for the following system
A6.4.3 The system (6.4.1) is BIBS stable, i.e., for any bounded input,the system state is also bounded;
A6.4.4 is continuous for bounded i.e., for any
A6.4.5 The system (6.4.1) is strictly input passive, i.e., there are and such that for any input
A6.4.6 For any convergent subsequence
where is defined by (1.3.2).
It is worth noting that A6.4.6 becomes
if is independent of. The adaptive control is given by the following recursive
algorithm:
where b is specified in A6.4.2.
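The display (6.4.4) is not legible in this reproduction. As a toy scalar illustration of the mechanism: the control is corrected by an SA step driven by the noisy state observation and truncated to the known bound b. The linear system dynamics, the unknown equilibrium u0, and the decaying observation noise below are all invented assumptions; the section itself treats a nonparametric nonlinear model:

```python
# Toy scalar sketch of the adaptive regulation scheme: an SA step on
# the control, driven by the noisy state observation and truncated to
# the known bound b.  All numerical choices are invented assumptions.
def adaptive_regulate(u0, b, steps=5000):
    x, u = 5.0, 0.0
    for k in range(1, steps + 1):
        x = 0.5 * x + (u - u0)           # state is regulated iff u -> u0
        y = x + 0.3 * (-1) ** k / k      # noisy observation of the state
        u = max(-b, min(b, u - (1.0 / k) * y))   # truncated SA step
    return x, u

x_final, u_final = adaptive_regulate(u0=0.8, b=2.0)
```

The control converges to the unknown equilibrium value and the state is driven to zero, which is the conclusion Theorem 6.4.1 establishes in the general setting.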
Theorem 6.4.1 Assume A6.4.1 – A6.4.6. Then the system (6.4.1), (6.4.2),
and (6.4.4) has the desired properties:
at sample paths where A6.4.6 holds.
Proof. Let be a convergent subsequence of such that
and
We have
for sufficiently large and small enough T, where is a constant to be specified later on. The relationships (6.4.5) and (6.4.6) can be proved along the lines of the proof for Theorem 2.2.1, but here is known to be bounded, and (6.4.5) and (6.4.6) can be proved more straightforwardly. We show this.
Since the system (6.4.1) is BIBS, from it follows that there is such that
By A6.4.6 for large and small T > 0,
This implies that
Let be large enough such that
and let T be small enough such that
Then we have
and hence there is no truncation in (6.4.4) for, i.e., (6.4.5) holds for. Therefore,
indeed. By induction, the assertions (6.4.5) and (6.4.6) have been proved. We now show that for any convergent subsequence
there is a such that
from (6.4.4) it follows that (6.4.5) holds for. Hence,
Thus, (6.4.5) and (6.4.6) hold for. Assume they are true for all. We now show that they are true for
too. Since
for small enough T > 0. By A6.4.5, we have
Let us restrict in (6.4.8) to. Then for small T and large, from (6.4.6) and (6.4.8) it follows that
and (6.4.6) is true for
Since and it is seen that
Using a partial summation, by (6.4.9) we have
for all sufficiently large and small enough T > 0. Set
for
This implies that there exist a and a sufficiently large, which may depend on but is independent of, such that
Then (6.4.10) implies that
This proves (6.4.7).
Define
From (6.4.7) it follows that
for any convergent subsequence.
Using A6.4.6 and (6.4.11), by the same argument as that used in the proof (Steps 3–6) of Theorem 2.2.1, we conclude that
Finally, write (6.4.1) as
By A6.4.4 and the boundedness of we have
and by A6.4.2 we conclude
Remark 6.4.1 It is easy to see that A6.4.6 is also necessary if A6.4.1–A6.4.5 hold and and. This is because for large the observation noise can be expressed as
and hence
6.5. Notes and References
For system identification and adaptive control we refer to [10, 23, 54, 62, 75, 90]. The identification problem stated in Section 6.1 was solved in [72] by the ODE method. In comparison with [72], the conditions used here have been considerably weakened, and the convergence is proved by the TS method rather than the ODE method. Section 6.1 is based on joint work by H. F. Chen, T. Duncan, and B. Pasik-Duncan. The existence and uniqueness of the solution to (6.1.50) can be found, e.g., in [23]. For stochastic quadratic control refer to [2, 10, 12, 33].
Adaptive stabilization for stochastic systems is dealt with in [5, 55, 77].
The convergence of WLS and adaptive stabilization using WLS are given in [55]. The problem is solved by the SA method in [19]; this approach is presented in Section 6.2.
The pole assignment problem for stochastic systems with unknown coefficients is solved by SA with the help of learning in Section 6.3, which is based on [20]. For the concept of linear control systems we refer to
which tends to zero as since and
Remark 6.4.2 In the formulation of Theorem 6.4.1 the condition A6.4.5 can be replaced either by (6.4.7) or by (6.4.11), which are consequences of A6.4.5. Further, the quadratic can be replaced by a continuously differentiable function such that and. In this case, in (6.4.7) should be correspondingly replaced by
Example 6.4.1 Let the nonlinear system be affine:
where the scalar nonlinear function is bounded from above and from below by positive constants:
Note that, and hence (6.4.7) holds, if. Assume is known. Then A6.4.2, A6.4.3, and A6.4.4 are satisfied. Therefore, if satisfies A6.4.6, then given by (6.4.4) leads to and
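The affine example can be simulated. The sketch below uses hypothetical choices f(y) = 0.5y and g(y) = 1 + 0.2 sin(y) (so that g is bounded between positive constants, as required) and corrects the control by a decaying-gain SA step:

```python
import math
import random

def adaptive_regulate(f, g, y_target, n=3000, b=0.5, seed=1):
    """SA-based regulation of the affine system
    y_{k+1} = f(y_k) + g(y_k) u_k + w_{k+1}:
    the control is corrected by u_{k+1} = u_k - (b/k)(y_{k+1} - y*)."""
    rng = random.Random(seed)
    y, u = 0.0, 0.0
    for k in range(1, n + 1):
        y = f(y) + g(y) * u + rng.gauss(0.0, 0.05)
        u -= (b / k) * (y - y_target)
    return u

u = adaptive_regulate(lambda y: 0.5 * y, lambda y: 1.0 + 0.2 * math.sin(y), 1.0)
# At the target the closed loop should balance: f(y*) + g(y*) u ~ y*.
residual = 0.5 * 1.0 + (1.0 + 0.2 * math.sin(1.0)) * u - 1.0
```

The decaying gain b/k plays the role of the step size in (6.4.4); the specific system and noise level here are made-up illustrations, not the book's assumptions.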
In the area of systems and control, SA methods are also successfully applied to discrete event dynamic systems, especially to perturbation-analysis-based parameter optimization.
[1, 46, 60]. The connection between the feedback gain and the coefficients of the desired characteristic polynomial is called Ackermann's formula, which can be found in [46].
Application of SA to adaptive regulation is based on [26].
For perturbation analysis of discrete event dynamic systems we refer to [58]. Perturbation-analysis-based parameter optimization is dealt with in [29, 86, 87].
Appendix A
In Appendix A we introduce the basic concepts of probability theory. Results are presented without proof. For details we refer to [31, 32, 70, 76, 84].
A.1. Probability Space
The basic space is denoted by. The point is called an elementary event or sample. A point set in is denoted by A.
Let be a family of sets in satisfying the following conditions:
1.
2.
3.
Then is called a σ-algebra or σ-field. The element A of is called a measurable set, or a random event, or simply an event.
As a consequence of Properties 2 and 3,
then the complement of A, also belongs toIf
If
if
A set function defined on is called σ-additive if for any sequence of disjoint events. By definition, one of the values or is not allowed to be taken by.
A nonnegative σ-additive set function is called a measure. Define
The set functions and are called the upper, lower, and total variation of on, respectively.
Jordan-Hahn Decomposition Theorem If is σ-additive on, then there exists a set D such that, for any
and are measures and.
Let P be a set function defined on with the following properties:
1.
2.
then
3. if are disjoint.
Then P is called a probability measure on. The triple is called a probability space, and P(A) is called the probability of the random event A.
It is assumed that any subset of a measurable set of probability zero is measurable and its probability is zero. After such a completion of the measurable sets the resulting probability space is called complete.
If a relationship between random variables holds for any with the possible exception
of a set with probability zero, then we say this relationship holds a.s. (almost surely)
or with probability one.
A.2. Random Variable and Distribution Function
In R, the real line, the smallest σ-algebra containing all intervals is called the Borel σ-algebra and is denoted by. The "smallest" means that if there is a σ-algebra containing all intervals, then there must be, in the sense that for any.
The Borel σ-algebra can also be defined in. Any set in or is called a Borel set.
Any interval can be endowed with a measure equal to its length. This measure can be extended to each, i.e., to each Borel set. Any subset of a set with measure zero is also assumed to be a measurable set with measure zero. After such a completion, the measurable sets are called Lebesgue measurable, and the measure the Lebesgue measure. In what follows, always means the completed Borel σ-algebra.
A real function defined on is called measurable if
If is a real measurable function defined on and, then is called a random variable. Therefore, if is a measurable function, then is also a random variable if.
Let be a random variable. The distribution function of is defined as
By a random vector we mean that each component of is a random variable. The distribution function of a random vector is defined as
If is differentiable, then its derivative is called the density of. The density of a random vector is defined in a similar way. The density of the l-dimensional normal distribution is defined by
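For concreteness, the one-dimensional case of the normal density, p(x) = (2πσ²)^(−1/2) exp(−(x − μ)²/(2σ²)), can be evaluated and checked to integrate to one (a numerical sketch; the l-dimensional density replaces σ² by a covariance matrix):

```python
import math

def normal_density_1d(x, mu=0.0, sigma2=1.0):
    """Density of the one-dimensional N(mu, sigma2) distribution."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) \
        / math.sqrt(2.0 * math.pi * sigma2)

# Riemann sum over [-8, 8]; the total mass should be close to 1.
total = sum(normal_density_1d(-8.0 + 0.01 * i) * 0.01 for i in range(1600))
```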
A.3. Expectation
Let be a random variable and let
Define
where
is called the expectation of.
For an arbitrary random variable define
The expectation of is defined as
if at least one of and is finite. If, then is called integrable.
The expectation of can be expressed by a Lebesgue-Stieltjes integral with respect
to its distribution function
In the density of l-dimensional random vector with normal distribution,
A.4. Convergence Theorems and Inequalities
Let be a sequence of random variables and let be a random variable. If, then we say that converges to and write
If for any, then we say that converges to in probability and write
If the distribution functions of converge to at any where is continuous, then we say weakly (or in distribution) converges to and write
If, then we say converges to in the mean square sense and write l.i.m.
implies, which in turn implies
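The implication chain can be illustrated numerically: if l.i.m. X_n = 0, the Chebyshev inequality (stated below) forces convergence in probability. A hypothetical sketch with X_n = Y/√n, Y ~ N(0, 1):

```python
import random

def prob_exceeds(n, eps=0.5, trials=20000, seed=3):
    """Empirical P(|X_n| > eps) for X_n = Y / sqrt(n), Y ~ N(0,1).
    Since E X_n^2 = 1/n -> 0 (mean-square convergence), Chebyshev
    gives P(|X_n| > eps) <= 1/(n * eps^2) -> 0 (in probability)."""
    rng = random.Random(seed)
    return sum(abs(rng.gauss(0.0, 1.0)) / n ** 0.5 > eps
               for _ in range(trials)) / trials

p1, p100 = prob_exceeds(1), prob_exceeds(100)
```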
Monotone Convergence Theorem If random variables nondecreasingly (nonincreasingly) converge to and, then
Dominated Convergence Theorem If and there exists an integrable random variable such that, then and
Fatou Lemma If for some random variable with, then
If is a measurable function, then
Chebyshev Inequality
Lyapunov Inequality
Hölder Inequality
In the special case where, the Hölder inequality is called the Schwarz inequality.
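On a finite sample space with equal weights these inequalities can be checked directly (a toy numerical illustration with made-up values):

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 0.5, 3.0]
n = len(xs)
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
eps = 1.0
# Chebyshev: P(|X - EX| >= eps) <= Var(X) / eps^2
p_tail = sum(abs(x - mean) >= eps for x in xs) / n
# Schwarz: (E|XY|)^2 <= E X^2 * E Y^2
exy = sum(abs(x * y) for x, y in zip(xs, ys)) / n
ex2 = sum(x * x for x in xs) / n
ey2 = sum(y * y for y in ys) / n
```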
A.5. Conditional Expectation
Let be a probability space. is called a sub-σ-algebra of if is a σ-algebra and, by which it is meant that any implies.
Radon-Nikodym Theorem Let be a sub-σ-algebra of. For any random
variable with at least one of and being finite, there is a unique measurable random variable, denoted by, such that for any
The random variable satisfying the above equality is called the conditional expectation of given.
Let be the smallest σ-algebra (see A.2) containing all sets; is called the σ-algebra generated by.
The conditional expectation of given is defined as
Let A be an event. Conditional probability of A given is defined by
Properties of the conditional expectation are listed below.
1) for constants and
2)
3) if is and
4) if
5) if
Convergence theorems and inequalities stated in A.4 remain true with the expectation replaced by the conditional expectation. For example, the conditional Hölder inequality
for. For a sequence of random variables and a σ-algebra, the consistent
conditional distribution functions of given
Let and Then
can be defined such that i) they are for any and any fixed, ii) they are distribution functions for any fixed, and iii) for any measurable function
A.6. Independence
Let be a sequence of events. If for any set of indices
then is called mutually independent.
Let be a sequence of σ-algebras. If events are mutually independent whenever, then the family of σ-algebras is called mutually independent.
Let be a sequence of random variables and let be the σ-algebra generated by. If is mutually independent, then the sequence of random variables is called mutually independent.
Law of the iterated logarithm Let be a sequence of independent and identically distributed (iid) random variables. Then
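For iid variables with mean 0 and variance 1 the law states limsup |S_n|/√(2n log log n) = 1 a.s. A single simulated ±1 path illustrates that the normalized partial sums stay of order one (a sketch, with the starting index n = 100 chosen arbitrarily):

```python
import math
import random

def lil_max_ratio(n_max=10000, seed=4):
    """Track max_n |S_n| / sqrt(2 n log log n) along one path of a
    symmetric +-1 random walk, for n >= 100."""
    rng = random.Random(seed)
    s, best = 0, 0.0
    for n in range(1, n_max + 1):
        s += rng.choice((-1, 1))
        if n >= 100:
            best = max(best, abs(s) / math.sqrt(2 * n * math.log(math.log(n))))
    return best

ratio = lil_max_ratio()
```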
Proposition A.6.1 Let be a measurable function defined on.
If the l-dimensional random vector is independent of the m-dimensional random vector, then
where
From this proposition it follows that
if is independent of
A.7. Ergodicity
Let be a sequence of random variables and let be the distribution function of. If for any integer, then is called stationary, or is a stationary process.
Proposition A.7.1 Let be stationary.
provided exists for all in the range of
If exists, then
where is a σ-algebra of and is called invariant.
If, then the stationary process is called ergodic. Thus, for a stationary and ergodic process we have
If is a sequence of mutually independent and identically distributed (and hence
stationary) random variables, then and the sequence is ergodic.
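For instance, for an iid Uniform(0, 1) sequence the time average converges a.s. to the expectation 1/2; a quick simulation:

```python
import random

def time_average(n=50000, seed=5):
    """(1/n) * sum of n iid Uniform(0,1) samples; by ergodicity this
    converges a.s. to the expectation 1/2."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n

avg = time_average()
```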
Appendix B
In Appendix B we present detailed proofs of the convergence theorems for martingales and martingale difference sequences.
Let be a sequence of random variables, and let be a family of nondecreasing σ-algebras, i.e.,
If is for any, then we write and call it an adapted process.
An adapted process with is called a martingale if a supermartingale if and a submartingale if
An adapted process is called a martingale difference sequence (MDS) if
A sequence of mutually independent random vectors with is an obvious example of an MDS.
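As a sanity check, iid ±1 signs form an MDS; empirically their sample mean is near 0 and their sample variance near 1 (a minimal numerical sketch of the defining property E[ξ_k | F_{k-1}] = 0):

```python
import random

def mds_sample_stats(n=20000, seed=6):
    """Sample mean and variance of iid +-1 signs, the simplest MDS:
    E[xi_k | F_{k-1}] = E xi_k = 0 and Var xi_k = 1."""
    rng = random.Random(seed)
    xs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var

mean, var = mds_sample_stats()
```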
An integer-valued measurable function is called a Markov time with respect toif
If, in addition, then is called a stopping time.
B.1. Convergence Theorems for Martingales
Lemma B.1.1 Let be adapted, a Markov time, and B a Borel set. Let be the first time at which the process hits the set B after time, i.e.,
Then is a Markov time.
Proof. The conclusion follows from the following expression:
For defining the number of up-crossings of an interval by a submartingale, we first define
The largest for which is called the number of up-crossings of the interval by the process and is denoted by.
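The up-crossing count can be computed mechanically; the sketch below counts completed passages of a path from a level ≤ a to a subsequent level ≥ b:

```python
def upcrossings(path, a, b):
    """Number of up-crossings of (a, b): completed moves from a value
    <= a to a later value >= b, as in the definition above."""
    count, below = 0, False
    for x in path:
        if x <= a:
            below = True
        elif x >= b and below:
            count += 1
            below = False
    return count

# The hypothetical path below up-crosses the interval (0, 1) twice.
k = upcrossings([0.5, -0.2, 1.3, 0.4, -0.1, 2.0, 0.7], 0.0, 1.0)
```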
By Lemma B.1.1
So, is a Markov time.
Assume is a Markov time. Again, by Lemma B.1.1,
and
Therefore, all are Markov times.
Theorem B.1.1 (Doob) For submartingales the following inequalities
hold
where
Proof. Note that equals the number of up-crossings of the interval by the submartingale, or by. Since for
is a submartingale.
Thus, without loss of generality, it suffices to prove that for a nonnegative submartingale
Define
Define also. Then for even, crosses (0, b) from time to. Therefore,
and
Further, the set is, since is a Markov time, and
Taking expectations on both sides of (B.1.2) yields
where the last inequality holds because is a submartingale and hence the integrand is nonnegative.
Thus (B.1.1), and hence the theorem, is proved.
Theorem B.1.2 (Doob) Let be a submartingale with a.s. Then there is a random variable with such that
Proof. Set
Assume the converse. Then
where and run over all rational numbers.
By the converse assumption there exist rational numbers such that
Let be the number of up-crossings of the interval by. By Theorem B.1.1,
By the monotone convergence theorem, from (B.1.4) it follows that
However, (B.1.3) implies which contradicts (B.1.5). Hence,
and
where is invoked. Hence,
Corollary B.1.1 If is a nonnegative supermartingale or a nonpositive submartingale, then
Because for nonpositive submartingales the corollary follows from the theorem, while for a nonnegative supermartingale, is a nonpositive submartingale.
Corollary B.1.2 If is a martingale with, then and
This is because for a martingale, and, and hence
or converges to a limit which is finite a.s. By the Fatou lemma it follows that
B.2. Convergence Theorems for MDS I
Let be an adapted process, and let G be a Borel set in
Then the first exit time from G defined by
is a Markov time. This is because
Lemma B.2.1 Let be a martingale (supermartingale, submartingale) and a Markov time. Then the process stopped at is again a martingale (supermartingale, submartingale), where
Proof. Note that
is
If is a martingale, then
This shows that is a martingale. For supermartingales and submartingales the proof is similar.
Theorem B.2.1. Let be a one-dimensional MDS. Then as
converges on
Proof. Since is the first exit time,
is a Markov time, and by Lemma B.2.1, is a martingale, where M is a positive constant.
Noticing that and that
is, we find
By Corollary B.1.2, converges as. It is clear that on. Therefore, as, pathwise converges on. Since M is arbitrary, converges on, which equals A.
Theorem B.2.2 Let be an MDS and. If
then converges on. If, then
converges on
Proof. It suffices to prove the first assertion, because the second one reduces to the first if is replaced by.
Define
By Lemma B.2.1 is a martingale. It is clear that
Consequently,
By Theorem B.1.2, converges as.
Since on as, converges on, and consequently on, which equals
B.3. Borel-Cantelli-Lévy Lemma
Theorem B.3.1 (Borel-Cantelli-Lévy Lemma) Let be a sequence of events,. Then if and only if, or equivalently,
Proof. Define
Clearly, is a martingale and is an MDS.
Since, by Theorem B.2.2, converges on
If then from (B.3.2) it follows that which implies that
converges. Then, combining this with, by (B.3.2) we obtain
Conversely, if then from (B.3.2) it follows that
Noticing that is contained in the set where converges by
Theorem B.2.2, from the convergence of by (B.3.2) it follows that
If are mutually independent and then
Proof. Denote by the σ-algebra generated by
If then
and hence which, by (B.3.1), implies (B.3.3).
When are mutually independent, then
B.4. Convergence Criteria for Adapted Sequences
Let be an adapted process.
Theorem B.4.1 Let be a sequence of positive numbers. Then
Consequently, implies, and follows from (B.3.1).
Theorem B.3.2 (Borel-Cantelli Lemma) Let be a sequence of events. If
then the probability that the events occur infinitely often is zero, i.e.,
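A quick simulation: take independent A_n = {U_n < n^{-2}} with U_n ~ Uniform(0, 1); then ΣP(A_n) < ∞, and along a typical path essentially no events with large n occur (the cutoff n₀ = 100 below is an arbitrary illustration):

```python
import random

def late_occurrences(n0=100, n_max=100000, seed=7):
    """Count occurrences of A_n = {U_n < 1/n^2} for n >= n0; the
    expected count is sum of 1/n^2 over n >= n0, below 0.01 here."""
    rng = random.Random(seed)
    return sum(rng.random() < 1.0 / n ** 2 for n in range(n0, n_max))

count = late_occurrences()
```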
Proof. Set
By Theorem B.3.1
or
This means that A is the set where events may occur only finitely many times.
Therefore, on A the series converges if and only if converges.
Theorem B.4.2 (Three Series Criterion) Denote by S the set where the following three series converge:
and
where c is a positive constant.
Then converges on S as
Proof. Taking in (B.4.1), we have and
by Theorem B.4.1.
Define
Since converges on S, from (B.4.2) it follows that
Noticing that is an MDS and
we see
By Theorem B.2.1 converges on S, or
Then from (B.4.3) it follows that
or converges.
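As an illustration, for X_k = ξ_k/k with iid signs ξ_k = ±1 and c = 1, all three series converge: no truncation occurs since |X_k| ≤ 1, the means are zero, and Σ Var X_k = Σ 1/k² < ∞. So the partial sums settle down:

```python
import random

def partial_sums(n=100000, seed=8):
    """Partial sums of X_k = xi_k / k with iid +-1 signs; the three
    series criterion guarantees a.s. convergence, so the second half
    of the path barely moves."""
    rng = random.Random(seed)
    s, s_half = 0.0, 0.0
    for k in range(1, n + 1):
        s += rng.choice((-1.0, 1.0)) / k
        if k == n // 2:
            s_half = s
    return s_half, s

s_half, s_full = partial_sums()
```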
B.5. Convergence Theorems for MDS II
Let be an MDS.
Theorem B.5.1 (Y. S. Chow) converges on
Proof. By Theorem B.4.2 it suffices to prove, where S is defined in Theorem B.4.2 with replaced by that considered in the present theorem.
We now verify that the three series defined in Theorem B.4.2 converge on A if is replaced by.
For convergence of the first series it suffices to note
For convergence of the second series, taking into account we find
Finally, for convergence of the last series it suffices to note
and
by the conditional Schwarz inequality.
Theorem B.5.2 The conclusion of Theorem B.5.1 is also valid for
Proof. Define
Then we have
on A, where A is still defined by (B.5.1) but with
Applying Theorem B.5.1 with to the MDS shows that converges on A, i.e.,
This is equivalent to
B.6. Weighted Sum of MDS
Theorem B.6.1 Let be an l-dimensional MDS and let be a matrix adapted process. If
for some then as
where
Proof. Without loss of generality, assume.
Notice that convergence of implies convergence of, since for sufficiently large,
Consequently, from (B.5.2) it follows that
We have the following estimate:
By Theorems B.5.1 and B.5.2 it follows that
where
Notice that is nondecreasing as. If is bounded, then the conclusion of the theorem follows from (B.6.1). If, then by the Kronecker lemma (see Section 3.4) the conclusion of the theorem also follows from (B.6.1).
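The Kronecker-lemma step can be illustrated directly: for an MDS ε_k of iid ±1 signs, Σ ε_k/k converges a.s. (Section B.4), hence (1/n) Σ_{k≤n} ε_k → 0:

```python
import random

def normalized_sum(n=200000, seed=9):
    """(1/n) * S_n for a symmetric +-1 MDS; by the Kronecker lemma
    applied to the convergent series sum of eps_k / k, this tends
    to 0 a.s."""
    rng = random.Random(seed)
    return sum(rng.choice((-1, 1)) for _ in range(n)) / n

val = normalized_sum()
```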
References
[1] B. D. O. Anderson and J. B. Moore, Optimal Control: Linear Quadratic Methods, Prentice-Hall, N.J., 1990.
[2] K. J. Åström, Introduction to Stochastic Control, Academic Press, New York, 1970.
[3] M. Benaim, A dynamical systems approach to stochastic approximation, SIAM J. Control & Optimization, 34:437–472, 1996.
[4] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximation, Springer-Verlag, New York, 1990.
[5] B. Bercu, Weighted estimation and tracking for ARMAX models, SIAM J. Control & Optimization, 33:89–106, 1995.
[6] P. Billingsley, Convergence of Probability Measures, Wiley, New York, 1968.
[7] J. R. Blum, Multidimensional stochastic approximation, Ann. Math. Statist., 9:737–744, 1954.
[8] V. S. Borkar, Asynchronous stochastic approximations, SIAM J. Control & Optimization, 36:840–851, 1998.
[9] O. Brandière and M. Duflo, Les algorithmes stochastiques contournent-ils les pièges? Ann. Inst. Henri Poincaré, 32:395–427, 1996.
[10] P. E. Caines, Linear Stochastic Systems, Wiley, New York, 1988.
[11] H. F. Chen, Recursive algorithms for adaptive beam-formers, Kexue Tongbao (Science Bulletin), 26:490–493, 1981.
[12] H. F. Chen, Recursive Estimation and Control for Stochastic Systems, Wiley, New York, 1985.
[13] H. F. Chen, Asymptotic efficient stochastic approximation, Stochastics and Stochastics Reports, 45:1–16, 1993.
[14] H. F. Chen, Stochastic approximation and its new applications, Proceedings of 1994 Hong Kong International Workshop on New Directions of Control and Manufacturing, 2–12, 1994.
[15] H. F. Chen, Convergence rate of stochastic approximation algorithms in the degenerate case, SIAM J. Control & Optimization, 36:100–114, 1998.
[16] H. F. Chen, Stochastic approximation with non-additive measurement noise, J. of Applied Probability, 35:407–417, 1998.
[17] H. F. Chen, Convergence of SA algorithms in multi-root or multi-extreme cases, Stochastics and Stochastics Reports, 64:255–266, 1998.
[18] H. F. Chen, Stochastic approximation with state-dependent noise, Science in China (Series E), 43:531–541, 2000.
[19] H. F. Chen and X. R. Cao, Controllability is not necessary for adaptive pole placement control, IEEE Trans. Autom. Control, AC-42:1222–1229, 1997.
[20] H. F. Chen and X. R. Cao, Pole assignment for stochastic systems with unknown coefficients, Science in China (Series E), 43:313–323, 2000.
[21] H. F. Chen, T. Duncan, and B. Pasik-Duncan, A Kiefer-Wolfowitz algorithm with randomized differences, IEEE Trans. Autom. Control, AC-44:442–453, 1999.
[22] H. F. Chen and H. T. Fang, Nonconvex stochastic optimization for model reduction, Global Optimization, 2002.
[23] H. F. Chen and L. Guo, Identification and Stochastic Adaptive Control, Birkhäuser, Boston, 1991.
[24] H. F. Chen, L. Guo, and A. J. Gao, Convergence and robustness of the Robbins-Monro algorithm truncated at randomly varying bounds, Stochastic Processes and Their Applications, 27:217–231, 1988.
[25] H. F. Chen and K. Uosaki, Convergence analysis of dynamic stochastic approximation, Systems and Control Letters, 35:309–315, 1998.
[26] H. F. Chen and Q. Wang, Adaptive regulator for discrete-time nonlinear nonparametric systems, IEEE Trans. Autom. Control, AC-46: , 2001.
[27] H. F. Chen and Y. M. Zhu, Stochastic approximation procedures with randomly varying truncations, Scientia Sinica (Series A), 29:914–926, 1986.
[28] H. F. Chen and Y. M. Zhu, Stochastic Approximation (in Chinese), Shanghai Scientific and Technological Publishers, Shanghai, 1996.
[29] E. K. P. Chong and P. J. Ramadge, Optimization of queues using an infinitesimal perturbation analysis-based stochastic algorithm with general update times, SIAM J. Control & Optimization, 31:698–732, 1993.
[30] Y. S. Chow, Local convergence of martingales and the law of large numbers, Ann. Math. Statist., 36:552–558, 1965.
[31] Y. S. Chow and H. Teicher, Probability Theory: Independence, Interchangeability, Martingales, Springer-Verlag, New York, 1978.
[32] K. L. Chung, A Course in Probability Theory (second edition), Academic Press, New York, 1974.
[33] M. H. A. Davis, Linear Estimation and Stochastic Control, Chapman and Hall, New York, 1977.
[34] K. Deimling, Nonlinear Functional Analysis, Springer, Berlin, 1985.
[35] B. Delyon and A. Juditsky, Stochastic optimization with averaging of trajectories, Stochastics and Stochastics Reports, 39:107–118, 1992.
[36] E. F. Deprettere (ed.), SVD and Signal Processing, Elsevier, North-Holland, 1988.
[37] N. Dunford and J. T. Schwartz, Linear Operators, Part 1: General Theory, Wiley Interscience, New York, 1966.
[38] V. Dupač, A dynamic stochastic approximation method, Ann. Math. Statist., 36:1695–1702, 1965.
[39] V. Dupač, Stochastic approximation in the presence of trend, Czechoslovak Math. J., 16:454–461, 1966.
[40] A. Dvoretzky, On stochastic approximation, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 39–55, 1956.
[41] S. N. Ethier and T. G. Kurtz, Markov Processes: Characterization and Convergence, Wiley, New York, 1986.
[42] E. Eweda, Convergence of the sign algorithm for adaptive filtering with correlated data, IEEE Trans. Information Theory, IT-37:1450–1457, 1991.
[43] V. Fabian, On asymptotic normality in stochastic approximation, Ann. Math. Statist., 39:1327–1332, 1968.
[44] V. Fabian, On asymptotically efficient recursive estimation, Ann. Statist., 6:854–856, 1978.
[45] V. Fabian, Simulated annealing simulated, Computers Math. Applic., 33:81–94, 1997.
[46] F. W. Fairman, Linear Control Theory: The State Space Approach, Wiley, Chichester, 1998.
[47] H. T. Fang and H. F. Chen, Sharp convergence rates of stochastic approximation for degenerate roots, Science in China (Series E), 41:383–392, 1998.
[48] H. T. Fang and H. F. Chen, Stability and instability of limit points of stochastic approximation algorithms, IEEE Trans. Autom. Control, AC-45:413–420, 2000.
[49] H. T. Fang and H. F. Chen, An a.s. convergent algorithm for global optimization with noise corrupted observations, J. Optimization and Its Applications, 104:343–376, 2000.
[68] H. J. Kushner and G. Yin, Stochastic Approximation Algorithms and Applications, Springer-Verlag, New York, 1997.
[69] J. P. LaSalle and S. Lefschetz, Stability by Lyapunov's Direct Method with Applications, Academic Press, New York, 1961.
[70] R. Liptser and A. N. Shiryaev, Statistics of Random Processes, Springer-Verlag, New York, 1977.
[71] R. Liu, Blind signal processing: An introduction, Proceedings 1996 Intl. Symp. Circuits and Systems, Vol. 2, 81–83, 1996.
[72] L. Ljung, Analysis of recursive stochastic algorithms, IEEE Trans. Autom. Control, AC-22:551–575, 1977.
[73] L. Ljung, On positive real transfer functions and the convergence of some recursive schemes, IEEE Trans. Autom. Control, AC-22:539–551, 1977.
[74] L. Ljung, G. Pflug, and H. Walk, Stochastic Approximation and Optimization of Random Systems, Birkhäuser, Basel, 1992.
[75] L. Ljung and T. Söderström, Theory and Practice of Recursive Identification, MIT Press, Cambridge, MA, 1983.
[76] M. Loève, Probability Theory, Springer, New York, 1977–1978.
[77] R. Lozano and X. H. Zhao, Adaptive pole placement without excitation probing signals, IEEE Trans. Autom. Control, AC-39:47–58, 1994.
[78] M. B. Nevelson and R. Z. Khasminskii, Stochastic Approximation and Recursive Estimation, Amer. Math. Soc., Providence, RI, 1976, Translations of Math. Monographs, Vol. 47.
[79] E. Oja, Subspace Methods of Pattern Recognition, 1st ed., Research Studies Press Ltd., Letchworth, Hertfordshire, 1983.
[80] B. T. Polyak, New stochastic approximation type procedures (in Russian), Autom. i Telemekh., 7:98–107, 1990.
[81] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control & Optimization, 30:838–855, 1992.
[82] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22:400–407, 1951.
[83] D. Ruppert, Stochastic approximation, in B. K. Ghosh and P. K. Sen (eds.), Handbook in Sequential Analysis, 503–529, Marcel Dekker, New York, 1991.
[84] A. N. Shiryaev, Probability, Springer, New York, 1984.
[85] J. C. Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE Trans. Autom. Control, AC-37:331–341, 1992.
[86] Q. Y. Tang and H. F. Chen, Convergence of perturbation analysis based optimization algorithm with fixed-number of customers period, Discrete Event Dynamic Systems, 4:359–373, 1994.
[87] Q. Y. Tang, H. F. Chen, and Z. J. Han, Convergence rates of perturbation-analysis-Robbins-Monro-single-run algorithms, IEEE Trans. Autom. Control, AC-42:1442–1447, 1997.
[88] J. N. Tsitsiklis, Asynchronous stochastic approximation and Q-learning, Machine Learning, 16:185–202, 1994.
[89] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, Distributed asynchronous deterministic and stochastic gradient optimization algorithms, IEEE Trans. Autom. Control, 31:803–812, 1986.
[90] Ya. Z. Tsypkin, Adaptation and Learning in Automatic Systems, Academic Press, New York, 1971.
[91] K. Uosaki, Some generalizations of dynamic stochastic approximation processes, Ann. Statist., 2:1042–1048, 1974.
[92] J. Venter, An extension of the Robbins-Monro procedure, Ann. Math. Statist., 38:181–190, 1967.
[93] G. J. Wang and H. F. Chen, Behavior of stochastic approximation algorithm in root set of regression function, Systems Science and Mathematical Sciences, 12:92–96, 1999.
[94] I. J. Wang, E. K. P. Chong, and S. R. Kulkarni, Equivalent necessary and sufficient conditions on noise sequences for stochastic approximation algorithms, Adv. Appl. Probab., 28:784–801, 1996.
[95] C. Z. Wei, Multivariate adaptive stochastic approximation, Ann. Statist., 15:1115–1130, 1987.
[96] G. Xu, L. Tong, and T. Kailath, A least squares approach to blind identification, IEEE Trans. Signal Processing, SP-43:2982–2993, 1995.
[97] S. Yakowitz, A globally convergent stochastic approximation, SIAM J. Control & Optimization, 31:30–40, 1993.
[98] G. Yin, On extensions of Polyak's averaging approach to stochastic approximation, Stochastics and Stochastics Reports, 36:245–264, 1991.
[99] G. Yin and Y. M. Zhu, On w.p.1 convergence of a parallel stochastic approximation algorithm, Probability in the Eng. and Infor. Sciences, 3:55–75, 1989.
[100] R. Zielinski, Global stochastic approximation: A review of results and some open problems, in F. Archetti and M. Cugiani (eds.), Numerical Techniques for Stochastic Systems, 379–386, North-Holland Publ. Co., 1980.
[101] J. H. Zhang and H. F. Chen, Convergence of algorithms used for principalcomponent analysis, Science in China (Series E), 40:597–604, 1997.
[102] K. Zhou, J. C. Doyle, and K. Glover, Robust and Optimal Control, Prentice-Hall, New Jersey, 1996.
Index
50, 55, 247329
329329
Ackermann's formula, 328
adapted process, 335
adapted sequence, 341
adaptive control, 290, 303, 327
adaptive filter, 288
adaptive filtering, 265, 273
adaptive regulation, 321
adaptive stabilization, 305, 307, 314, 327
adaptive stochastic approximation, 132, 149
adaptively stabilizable, 310
admissible controls, 302
algebraic Riccati equation, 131
ARMA process, 39
Arzelá-Ascoli theorem, 11, 24
asymptotic behavior, 194
asymptotic efficiency, 95, 130, 132, 149
asymptotic normality, 95, 113, 119, 127, 149, 210
asymptotic properties, 95, 166
asymptotically efficient, 135
asynchronous stochastic approximation, 219, 278, 288
averaging technique, 132, 149
balanced realization, 210, 214
balanced truncation, 214, 215
blind channel identification, 219, 220, 223
blind identification, 220
Borel σ-algebra, 330
Borel set, 330
Borel-Cantelli Lemma, 341
Borel-Cantelli-Lévy Lemma, 340
certainty-equivalence principle, 304, 306
Chebyshev inequality, 332
closure, 38
conditional distribution function, 332
conditional expectation, 332
conditional probability, 332
conditional Schwarz inequality, 343
constant interpolating function, 13
constrained optimization problem, 268
controllable, 307, 317, 319
controller form, 317–319
convergence, 28, 36, 41, 153, 223, 331, 341
convergence analysis, 6, 28, 95, 154
convergence rate, 95, 96, 101–103, 105, 149
convergence theorem for martingale difference sequences, 97, 128, 160, 170, 185, 196, 231, 249, 321, 339, 343
convergence theorem for nonnegative supermartingales, 7–9
convergence theorems for martingales, 335
convergent subsequence, 17, 18, 30, 36, 84, 86, 89, 178, 187, 237, 241, 244, 271, 275, 280, 282, 283, 285, 287, 288, 297, 312, 315, 322, 323
coprimeness, 306
covariance matrix, 130, 132
crossing, 18, 34, 188, 236, 312
degenerate case, 103, 149
density, 330
distribution function, 330
dominant stability, 59, 62
dominated convergence theorem, 331
dynamic stochastic approximation, 82, 93
equi-continuous, 15
ergodic, 265, 268, 270, 273, 274, 334
ergodicity, 333
event, 329
expectation, 330
Fatou lemma, 331
first exit time, 9, 339
general convergence theorems, 28
global minimum, 177
global minimizer, 174, 177, 180
global optimization, 172–174, 218
global optimization algorithm, 180, 194
global optimizer, 152
globally Lipschitz continuous, 292
Gronwall inequality, 298
Hölder inequality, 332
Hankel matrix, 222
Hankel norm approximation, 210, 214, 215
Hessian, 8, 195
identification, 290
integrable, 331
interpolating function, 11
invariant σ-algebra, 334
Jordan-Hahn decomposition, 55, 56, 295, 329
Kiefer-Wolfowitz (KW) algorithm, 151–153, 166, 173, 218
Kronecker lemma, 67, 144, 148, 345
Kronecker product, 248
KW algorithm with expanding truncations, 152, 154, 173–175
Law of iterated logarithm, 333
Lebesgue measurable, 330
Lebesgue measure, 330
Lebesgue-Stieltjes integral, 331
linear interpolating function, 12
Lipschitz continuous, 23
Lipschitz-continuity, 160
local search, 172, 173
locally bounded, 17, 29, 96, 103, 133
locally Lipschitz continuous, 50, 155, 163, 177, 280
Lyapunov equation, 105
Lyapunov function, 6, 8, 10, 11, 17, 111, 226, 268, 313
Lyapunov inequality, 144, 332
Lyapunov theorem, 98
MA process, 171
Markov time, 6, 335, 336, 339
martingale, 335, 339, 340
martingale convergence theorem, 6, 180, 297
martingale difference sequence, 6, 16, 42, 97, 128, 134, 159, 164, 168, 179, 185, 195–197, 231, 250, 257, 294, 335
maximizer, 151
measurable, 17, 29, 96, 103, 133
measurable function, 330
measurable set, 329
measure, 329
minimizer, 151
mixing condition, 291
model reduction, 210
monotone convergence theorem, 331
multi-extreme, 163, 164
multi-root, 46, 57
mutually independent, 333, 341
necessity of noise condition, 45
non-additive noise, 49
nondegenerate case, 96, 149
nonnegative adapted sequence, 7
nonnegative supermartingale, 6, 7, 338
nonpositive submartingale, 338
normal distribution, 113, 114, 330
nowhere dense, 29, 35, 37, 41, 177, 181, 182, 280, 291
observation, 5, 17, 132, 321
observation noise, 5, 103, 133, 175, 195, 321
ODE method, 2, 10, 24, 327
one-sided randomized difference, 172
optimal control, 303
optimization, 151
optimization algorithm, 212
ordinary differential equation (ODE), 10
pattern classification, 219
perturbation analysis, 328
pole assignment, 316, 318, 327
principal component analysis, 238, 288
probabilistic method, 4
probability measure, 330
probability of random event, 330
probability space, 329, 330
Prohorov's theorem, 22, 24
Radon-Nikodym theorem, 332
random noise, 10, 21
random search, 172
random variable, 330
randomized difference, 152–154
recursive blind identification, 246
relatively compact, 22
RM algorithm with expanding truncations, 28, 155, 309, 319
Robbins-Monro (RM) algorithm, 1, 5, 8, 11, 12, 17, 20, 45, 110, 310, 313
robustness, 67, 93
SA algorithm, 67
SA algorithm with expanding truncations, 25, 40, 95, 290
SA with randomly varying truncations, 93
Schwarz inequality, 142, 332
sign algorithms, 273, 288
signal processing, 219, 265
signed measure, 56, 295
Skorohod representation, 23
Skorohod topology, 21, 24
slowly decreasing step sizes, 132
spheres with expanding radiuses, 36
stability, 131
stable, 96, 97, 102, 131, 133
state-dependent, 42, 164
state-dependent noise, 29, 57
state-independent condition, 41, 42
stationary, 265, 268, 270, 273, 274, 333
step size, 5, 6, 17, 102, 132, 174
stochastic approximation (SA), 1, 223, 226, 246
stochastic approximation algorithm, 5, 307, 308
stochastic approximation method, 321
stochastic differential equation, 126
stochastic optimization, 211
stopping time, 335
strictly input passive, 322
structural error, 10, 157
structural inaccuracy, 21
submartingale, 335–337, 339
subspace, 41, 226
supermartingale, 335, 339
surjection, 63
system identification, 327
three series criterion, 342
time-varying, 44
trajectory-subsequence (TS) method, 2, 16, 21
truncated RM algorithm, 16, 17
TS method, 28, 327
uniformly bounded, 15
uniformly locally bounded, 41
up-crossing, 336, 338
weak convergence method, 21, 24
weighted least squares, 306
weighted sum of MDS, 344
Wiener process, 126
Nonconvex Optimization and Its Applications
22. H. Tuy: Convex Analysis and Global Optimization. 1998 ISBN 0-7923-4818-4
23. D. Cieslik: Steiner Minimal Trees. 1998 ISBN 0-7923-4983-0
24. N.Z. Shor: Nondifferentiable Optimization and Polynomial Problems. 1998 ISBN 0-7923-4997-0
25. R. Reemtsen and J.J. Rückmann (eds.): Semi-Infinite Programming. 1998 ISBN 0-7923-5054-5
26. B. Ricceri and S. Simons (eds.): Minimax Theory and Applications. 1998 ISBN 0-7923-5064-2
27. J.-P. Crouzeix, J.-E. Martinez-Legaz and M. Volle (eds.): Generalized Convexity, Generalized Monotonicity: Recent Results. 1998 ISBN 0-7923-5088-X
28. J. Outrata, M. Kočvara and J. Zowe: Nonsmooth Approach to Optimization Problems with Equilibrium Constraints. 1998 ISBN 0-7923-5170-3
29. D. Motreanu and P.D. Panagiotopoulos: Minimax Theorems and Qualitative Properties of the Solutions of Hemivariational Inequalities. 1999 ISBN 0-7923-5456-7
30. J.F. Bard: Practical Bilevel Optimization. Algorithms and Applications. 1999 ISBN 0-7923-5458-3
31. H.D. Sherali and W.P. Adams: A Reformulation-Linearization Technique for Solving Discrete and Continuous Nonconvex Problems. 1999 ISBN 0-7923-5487-7
32. F. Forgó, J. Szép and F. Szidarovszky: Introduction to the Theory of Games. Concepts, Methods, Applications. 1999 ISBN 0-7923-5775-2
33. C.A. Floudas and P.M. Pardalos (eds.): Handbook of Test Problems in Local and Global Optimization. 1999 ISBN 0-7923-5801-5
34. T. Stoilov and K. Stoilova: Noniterative Coordination in Multilevel Systems. 1999 ISBN 0-7923-5879-1
35. J. Haslinger, M. Miettinen and P.D. Panagiotopoulos: Finite Element Method for Hemivariational Inequalities. Theory, Methods and Applications. 1999 ISBN 0-7923-5951-8
36. V. Korotkich: A Mathematical Structure of Emergent Computation. 1999 ISBN 0-7923-6010-9
37. C.A. Floudas: Deterministic Global Optimization: Theory, Methods and Applications. 2000 ISBN 0-7923-6014-1
38. F. Giannessi (ed.): Vector Variational Inequalities and Vector Equilibria. Mathematical Theories. 1999 ISBN 0-7923-6026-5
39. D.Y. Gao: Duality Principles in Nonconvex Systems. Theory, Methods and Applications. 2000 ISBN 0-7923-6145-3
40. C.A. Floudas and P.M. Pardalos (eds.): Optimization in Computational Chemistry and Molecular Biology. Local and Global Approaches. 2000 ISBN 0-7923-6155-5
41. G. Isac: Topological Methods in Complementarity Theory. 2000 ISBN 0-7923-6274-8