
Average cost optimal control under weak ergodicity hypotheses: Relative value iterations

ARI ARAPOSTATHIS† AND VIVEK S. BORKAR‡

†Deceased; was with the Department of ECE, The University of Texas at Austin, Austin, TX 78712.
‡Department of Electrical Engineering, Indian Institute of Technology, Powai, Mumbai.
E-mail addresses: [email protected], [email protected].
2000 Mathematics Subject Classification. Primary: 90C40; Secondary: 93E20.
Key words and phrases. Ergodic control, Bellman equation, inf-compact cost, relative value iteration.

Abstract. We study Markov decision processes with Polish state and action spaces. The action space is state-dependent and is not necessarily compact. We first establish the existence of an optimal ergodic occupation measure using only a near-monotone hypothesis on the running cost. Then we study the well-posedness of the Bellman equation, or what is commonly known as the average cost optimality equation, under the additional hypothesis of the existence of a small set. We deviate from the usual approach, which is based on the vanishing discount method, and instead map the problem to an equivalent one for a controlled split chain. We employ a stochastic representation of the Poisson equation to derive the Bellman equation. Next, under suitable assumptions, we establish convergence results for the 'relative value iteration' algorithm, which computes the solution of the Bellman equation recursively. In addition, we present some results concerning the stability and asymptotic optimality of the associated rolling horizon policies.

1. Introduction

The long-run average or 'ergodic' cost is popular in applications when transients are fast and/or unimportant and one is optimizing over possible asymptotic behaviors. The dynamic programming equation for this criterion, in the finite state-action case, goes back to Howard [29]. A recursive algorithm to solve it in the aforementioned case is the so-called relative value iteration scheme [42], dubbed so because it is a modification of the value iteration scheme for the (simpler) discounted cost criterion. This modification consists of subtracting a suitable offset at each step and tracking only the 'relative' values. Suitable counterparts of this algorithm for a general state space are available, if at all, only under rather strong conditions (see, e.g., Section 5.6 of [28]). Our aim here is to consider a special case of immense practical importance, viz., that of a near-monotone or inf-compact cost which penalizes instability [7], [8], and to establish both the dynamic programming equation and the relative value iteration scheme for it. Perforce the latter involves iteration in a function space and, as far as implementation is concerned, would have to be replaced by suitable finite approximations through either state aggregation or parametrized approximation of the value function. But the validity of such an approximate scheme depends on provable convergence properties of the algorithm. Our aim is to provide this.
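To fix ideas, the following minimal Python sketch (an editorial illustration, not from the paper; the MDP data, the reference state, and the iteration count are hypothetical) implements the finite state-action scheme just described: one step of undiscounted value iteration followed by subtraction of the value at a fixed reference state.

```python
import numpy as np

def relative_value_iteration(P, c, n_iter=500):
    """Relative value iteration for a finite MDP (a sketch, assuming a unichain model).

    P: transition tensor of shape (nS, nA, nS) with P[x, a, y] = P(y | x, a).
    c: running cost of shape (nS, nA).
    Returns the relative value function h and the offset, which approximates
    the optimal ergodic cost beta.
    """
    h = np.zeros(P.shape[0])
    offset = 0.0
    for _ in range(n_iter):
        Th = (c + P @ h).min(axis=1)   # undiscounted value iteration step
        offset = Th[0]                 # offset: value at reference state 0
        h = Th - offset                # track only the 'relative' values
    return h, offset
```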

The results on convergence of the relative value iteration presented here may be viewed as discrete-time counterparts of the results of [3]. It is not, however, the case that they can be derived simply from the results of [3], which relies heavily on the analytic machinery of the partial differential equations arising therein. This, in particular, leads to convenient regularity results which are not available here.

For studies on the average cost optimality equation (ACOE) of Markov decision processes (MDPs) on Borel state spaces, we refer the reader to [14, 19–21, 27, 28, 30, 37, 40, 41]. All these papers assume only the (weak) Feller property on the transition kernel, whereas in this paper the kernel is assumed to be strong Feller (with the exception of Lemma 2.1 and Theorem 2.1).


Classical approaches based on the vanishing discount argument, such as [14, 21, 27, 37], need to ensure some variant of pointwise boundedness of the relative discounted value function. This typically requires additional hypotheses, or it is directly imposed as an assumption. The weakest condition appears in [21], where only the limit infimum of the relative discounted value functions is required to be pointwise bounded in the vanishing discount limit. It follows from [23, Theorem 4.1] that if the solution of the ACOE is bounded, then the relative discounted value functions are also bounded uniformly in the discount factor. This is a very specific case though, and for the more general case studied in this paper it is unclear how our assumptions compare with those of [21].

Pointwise boundedness of discounted relative value functions was verified from scratch for a specific application in [1]. The techniques therein, which leverage near-monotonicity in a manner different from here, may be more generally applicable.

Some of the aforementioned works derive an average cost optimality inequality (ACOI) as opposed to an equation. The ACOE is derived in [14, 19, 20, 28, 30, 40, 41]. Also worth noting is [20], which derives the ACOE for a classical inventory problem under a weak condition known as K-inf-compactness. The works in [30, 41] derive the ACOE under additional uniform stability conditions, which we avoid. The works [40], [41] also use a minorization condition like us, but the purpose there is to facilitate a fixed point argument, which is possible due to the stronger stability assumptions.

Our focus is on the ACOE rather than the ACOI because our eventual aim is to establish convergence of relative value iteration, for which the ACOE is explicitly used. Moreover, for this convergence result we require uniqueness of the solution to the ACOE within a suitable class of functions.

Studies such as [21, 27, 37] work with standard Borel state spaces, whereas we work with Polish spaces. We assume that the running cost is near-monotone (see (C) in Subsection 2.2), a notion more general than the more commonly used 'inf-compactness'. The latter requires the level sets of the running cost functions to be compact, necessitating in particular that for non-σ-compact spaces they be extended real-valued. On the contrary, a K-inf-compact cost (see (A1) in Subsection 3.1) together with (C) allows for more flexibility.

Furthermore, the above works do not address the relative value iteration, which is our main focus here. This algorithm, after the seminal work of [42] for the finite state case, has been extended to denumerable state spaces in [10, 11, 13]. An analogous treatment for a general metric state space appears in [28, Section 5.6]. This directly assumes equicontinuity of the iterates, for which problems with convex value functions [24] have been cited as an example. We do not make any such assumption. The related though distinct algorithm of policy iteration has been analyzed in [33]. That work also uses the 'pseudo-atom' construction, as we do, in order to obtain a solution to the fixed-policy Poisson equation. We use it to derive the Bellman equation itself, using a representation of the value function.

Another important part of this work concerns the stability and asymptotic optimality of rolling horizon policies. Analogous results in the literature have been reported only under very strong blanket ergodicity assumptions [11, 26]. For a review of this topic, see [15]. In this paper, we avoid any blanket ergodicity assumptions and impose a stabilizability hypothesis, namely, that under some Markov control the process is geometrically ergodic with a Lyapunov function that has the same growth as the running cost (see (H2) and Remark 6.1 in Section 6). This property is natural for 'linear-like' problems, and is also manifested in queueing problems with abandonment, or problems with the structure in Example 6.1. Under this hypothesis, we assert in Theorems 6.1 and 6.3 global convergence for the relative value iteration, and show in Theorem 6.2 that the rolling horizon procedure is stabilizing after a finite number of iterations. Then, under a 'uniform' ψ-irreducibility condition, Theorem 6.4 shows that the rolling horizon procedure is asymptotically optimal. The latter is an important problem of current interest (see, e.g., [12, 28]). Our results also contain computable error bounds.


The article is organized as follows. Section 2 has three main parts. We first review the formalism and basic notation of Markov decision processes in Subsection 2.1, and then, in Subsection 2.2, we establish the existence of an optimal ergodic occupation measure, thus extending the results of the convex analytic framework in [8] to MDPs on a Polish state space. Subsection 2.4 introduces an equivalent controlled split chain and the associated pseudo-atom. Section 3 then derives the dynamic programming equation to characterize optimality. Section 4 establishes the convergence of the 'value iteration', which is the name we give to the analog of value iteration for discounted cost with no discounting, but with the cost-per-stage function modified by subtracting from it the optimal cost. The latter is in principle unknown, so this is not a legitimate algorithm. It does, however, pave the way to prove convergence of the true relative value iteration scheme, which we do in Section 5. Section 6 is devoted to the analysis of the rolling horizon procedure.

1.1. Notation. We summarize some notation used throughout the paper. We use R^d (and R^d_+), d ≥ 1, to denote the space of real-valued d-dimensional (nonnegative) vectors, and write R for d = 1. Also, N denotes the natural numbers, and N_0 := N ∪ {0}. For x, y ∈ R, we let

x ∨ y := max{x, y} and x ∧ y := min{x, y} .

The Euclidean norm on R^d is denoted by |·|. We use A^c, Ā, and 1_A to denote the complement, the closure, and the indicator function of a set A, respectively.

For a Polish space X we let B(X) stand for its Borel σ-algebra, and P(X) for the space of probability measures on B(X) with the Prokhorov topology. We let M(X), L(X), and C(X) denote the spaces of real-valued Borel measurable functions, lower semi-continuous functions bounded from below, and continuous functions on X, respectively. Also, M_b(X), L_b(X), and C_b(X) denote the corresponding subspaces consisting of bounded functions.

For a Borel probability measure µ on B(X) and a measurable function f : X → R which is integrable under µ, we often use the convenient notation µ(f) := ∫_X f(x) µ(dx).

For f ∈ M(X), we define

‖g‖_f := sup_{x∈X} |g(x)| / (1 + |f(x)|) , g ∈ M(X) ,

and O(f) := {g ∈ M(X) : ‖g‖_f < ∞}.

2. Preliminaries

In this paper, we consider a controlled Markov chain, otherwise referred to as a Markov decision process (MDP), taking values in a Polish space X.

2.1. The MDP model. Recall the notation introduced in Subsection 1.1. According to the most prevalent definition in the literature (see [17, 28]), an MDP is represented as a tuple

(X, U, U, P, c) ,

whose elements can be described as follows.

(a) The state space X is a Polish space (complete, separable, metric). Its elements are called states.
(b) U is a Polish space, referred to as the action or control space.
(c) The map U : X → B(U) is a strict, measurable multifunction. The set of admissible state/action pairs is defined as

K := {(x, u) : x ∈ X, u ∈ U(x)} ,

endowed with the subspace topology corresponding to B(X × U).
(d) The transition probability P(· | x, u) is a stochastic kernel on K × B(X), that is, P(· | x, u) is a probability measure on B(X) for each (x, u) ∈ K, and (x, u) ↦ P(A | x, u) is in M(K) for each A ∈ B(X).
(e) The map c : K → R is measurable, and is called the running cost or one-stage cost. We assume that it is bounded from below on K, so without loss of generality it takes values in [1, ∞).

The (admissible) history spaces are defined as

H_0 := X , H_t := K^t × X , t ∈ N ,

and the canonical sample space is defined as Ω := (X × U)^∞. These spaces are endowed with their respective product topologies and are therefore Polish spaces. The state, action (or control), and information processes, denoted by {X_t}_{t∈N_0}, {U_t}_{t∈N_0}, and {H_t}_{t∈N_0}, respectively, are defined by the projections

X_t(ω) := x_t , U_t(ω) := u_t , H_t(ω) := (x_0, u_0, . . . , u_{t−1}, x_t)

for each ω = (x_0, u_0, . . . , u_{t−1}, x_t, u_t, . . . ) ∈ Ω.

An admissible control strategy, or policy, is a sequence ξ = {ξ_t}_{t∈N_0} of stochastic kernels on H_t × B(U) satisfying the constraint

ξ_t(U(x_t) | h_t) = 1 , x_t ∈ X , h_t ∈ H_t .

The set of all admissible strategies is denoted by U. It is well known (see [34, Prop. V.1.1, pp. 162–164]) that for any given µ ∈ P(X) and ξ ∈ U there exists a unique probability measure P^ξ_µ on (Ω, B(Ω)) satisfying

P^ξ_µ(X_0 ∈ D) = µ(D) for all D ∈ B(X) ,
P^ξ_µ(U_t ∈ C | H_t) = ξ_t(C | H_t) P^ξ_µ-a.s., for all C ∈ B(U) ,
P^ξ_µ(X_{t+1} ∈ D | H_t, U_t) = P(D | X_t, U_t) P^ξ_µ-a.s., for all D ∈ B(X) .

The expectation operator corresponding to P^ξ_µ is denoted by E^ξ_µ. If µ is a Dirac mass at x ∈ X, we simply write these as P^ξ_x and E^ξ_x.

A strategy ξ is called randomized Markov if there exists a sequence of measurable maps {v_t}_{t∈N_0}, where v_t : X → P(U) for each t ∈ N_0, such that

ξ_t(· | H_t) = v_t(X_t)(·) P^ξ_µ-a.s.

With some abuse of notation, such a strategy is identified with the sequence v = {v_t}_{t∈N_0}. Note then that v_t may be written as a stochastic kernel v_t(· | x) on X × B(U) which satisfies v_t(U(x) | x) = 1. We say that a randomized Markov strategy ξ is simple, or precise, if ξ_t is a Dirac mass, in which case v_t is identified with a Borel measurable function v_t : X → U. In other words, v_t is a measurable selector from the set-valued map U(x) [22].

We add the adjective stationary to indicate that the strategy does not depend on t ∈ N_0, that is, v_t = v for all t ∈ N_0. We let U_sm denote the class of stationary randomized Markov strategies, henceforth referred to simply as stationary strategies.

The basic structural hypotheses on the model, which are assumed throughout the paper, are as follows.

Assumption 2.1. The following hold:

(i) The transition probability P(dy | x, u) is weakly continuous, that is, the map

(x, u) ↦ ∫_X f(y) P(dy | x, u)

is continuous for every f ∈ C_b(X).
(ii) The set-valued map U : X → B(U) is upper semi-continuous and closed-valued.
(iii) The running cost c : K → [1, ∞) is lower semi-continuous.


Assumption 2.1 is assumed throughout the paper, and repeated only for emphasis. More specific assumptions are imposed later in Subsection 2.4 and Section 3.

Definition 2.1. For v ∈ U_sm we use the abbreviated notation

P_v(A | x) := ∫_{U(x)} P(A | x, u) v(du | x) , and c_v(x) := ∫_{U(x)} c(x, u) v(du | x) .

Also, P_v f(x) := ∫_X f(y) P_v(dy | x) for a function f ∈ M(X), assuming that the integral is well defined. Similarly, we write P_u(A | x) := P(A | x, u) for u ∈ U(x), and define P_u f analogously. When needed to avoid ambiguity, we denote the chain controlled under v as {X^v_n}_{n∈N_0}.

2.1.1. Control objective. The control objective is to minimize over all admissible ξ = {ξ_n}_{n∈N_0} the average (or 'ergodic') cost

E(µ, ξ) := limsup_{N→∞} (1/N) E^ξ_µ [ ∑_{n=0}^{N−1} c(X_n, U_n) ] , µ ∈ P(X) , ξ ∈ U .

We let

J(µ) := inf_{ξ∈U} E(µ, ξ) , and β := inf_{µ∈P(X)} J(µ) . (2.1)

We say that an admissible strategy ξ is optimal if E(µ, ξ) = J(µ) for all µ ∈ P(X). The class of stationary Markov strategies that are optimal is denoted by U⋆_sm.

In the next section we introduce the concept of an optimal ergodic occupation measure, and show that, under a near-monotone type hypothesis on the running cost, such a measure exists. We use this result in Section 3 to derive a solution to the Bellman equation.

2.2. Existence of an optimal ergodic occupation measure. Recall that ζ ∈ P(K) is called an ergodic occupation measure if it satisfies

∫_K ( f(x) − ∫_X f(y) P(dy | x, u) ) ζ(dx, du) = 0 for all f ∈ C_b(X) .

We let M_erg stand for the class of ergodic occupation measures. Any ζ ∈ M_erg can be disintegrated as

ζ(dx, du) = π_ζ(dx) v_ζ(du | x) π_ζ-a.s. , (2.2)

where π_ζ ∈ P(X) and v_ζ is a stochastic kernel on X × B(U) which satisfies v_ζ(U(x) | x) = 1. We denote this disintegration as ζ = π_ζ ⊛ v_ζ.

Remark 2.1. Note that (2.2) does not define v_ζ on the entire space, and thus v_ζ cannot be viewed as an element of U_sm. However, if v ∈ U_sm is any strategy that agrees π_ζ-a.e. with v_ζ, then π_ζ(·) = ∫_X P_v(· | x) π_ζ(dx), or, in other words, π_ζ is an invariant probability measure for the chain controlled under v. Note that such a strategy can easily be constructed. For example, for arbitrary v_0 ∈ U_sm, we can define v = v_ζ on the support of π_ζ and v = v_0 on its complement.

Definition 2.2. We say that ζ⋆ ∈ M_erg is optimal if

∫_K c dζ⋆ = β := inf_{ζ∈M_erg} ∫_K c dζ , (2.3)

and denote the set of optimal ergodic occupation measures by M⋆_erg.

The convex analytic method introduced in [6] (see also [8]) is a powerful tool for the analysis of ergodic occupation measures. Two main models have been considered: MDPs with a blanket stability property, and MDPs with a near-monotone running cost. Near-monotonicity is a structural assumption which, stated in simple terms, postulates that the running cost is strictly larger than the optimal average value in (2.1) on the complement of some compact set. More precisely, this assumption is stated as follows:


(C) There exists a continuous one-to-one embedding Ψ : K → K* of K into a Polish space K* such that the closure of Ψ(K) is compact in K*. By abuse of notation, we identify K with its image Ψ(K) under this map and K* with the closure of Ψ(K). Furthermore, we assume that there exists a compact set 𝒦 ⊂ K and an ε_0 > 0 such that

c(x, u) ≥ β + ε_0 for all (x, u) ∈ K \ 𝒦 . (2.4)

This implies in particular that if {x_n} ⊂ K and x_n → ∂K := K* \ K, then

liminf_{n↑∞} inf_u c(x_n, u) ≥ β + ε_0 . (2.5)

It will be convenient for us to take K* to be the closure of K (≈ Ψ(K)) embedded in X* × U*, where X* and U* are, respectively, compact dense embeddings of X and U into suitable Polish spaces, assumed to exist. We shall assume that this is so.

For MDPs on a countable state space a natural counterpart of assumption (C) is enough to guarantee the existence of an optimal ergodic occupation measure, as shown in [8]. In Theorem 2.1, we extend this result to MDPs on a Polish space under Assumption 2.1 and the above assumption.

As an example, consider the case where the state space X is locally compact and K = X × U for a compact action space U. Suppose that a sequence {ζ_n}_{n∈N} of mean empirical measures converges vaguely to a positive measure µ on K, meaning that ∫_K f dζ_n → ∫_K f dµ as n → ∞ for all f ∈ C_c(K), where C_c(K) denotes the subspace of C_b(K) consisting of functions with compact support. A key lemma then asserts that the normalized measure µ/µ(K) is an ergodic occupation measure. This is established in [8, Lemma 2.6] for models with a countable state space, and the proof can be adapted to MDPs with a locally compact state space. An important ingredient in this proof is employing the Alexandroff extension, commonly known as the one-point compactification, and then applying Prokhorov's theorem to the compactified space X ∪ {∞}.

The Alexandroff extension has a simple and geometrically meaningful structure, but it does not result in a Hausdorff compactification unless the original space is locally compact. For models with general Polish state and action spaces, a general scheme that is always available is to employ Urysohn's theorem to embed K in the Hilbert cube, and use the closure of its image as a compactification. This is done as follows.

Definition 2.3 (Embedding in the Hilbert cube). As is well known, K, being a Polish space, can be homeomorphically embedded as a G_δ subset of the Hilbert cube [0, 1]^∞ by a homeomorphism Ψ : K ↔ Ψ(K) ⊂ [0, 1]^∞ [5, Propositions 7.2 and 7.3]. Thus we can identify K with Ψ(K). Let K* be the closure of Ψ(K), and view K as being densely homeomorphically embedded in K* with ∂K := K* \ K. We may view P(K) as being isometrically embedded in P(K*) in the obvious manner. The latter is compact by Prokhorov's theorem.

This may not always be convenient, and one may use better problem-specific choices. As an example, consider K := a closed bounded subset of C¹[0, 1], the space of continuous real-valued functions on [0, 1] which are continuously differentiable on (0, 1) with left, resp. right, limits at 0, 1, equipped with the norm ‖f‖_1 := sup_{x∈[0,1]} |f(x)| + sup_{x∈(0,1)} |f′(x)|. Its natural embedding into C[0, 1] := the space of continuous real-valued functions on [0, 1] with the sup-norm, is compact and dense. Thus taking any bounded subset of C¹[0, 1] as state space, with extended real-valued cost c(f, u) := ‖f‖_* := ‖f‖_1 + sup_{x∈(0,1)} |f″(x)|, satisfies the above conditions. Further examples can be constructed using compact embedding theorems for Sobolev and Hölder spaces, such as the ones provided by the Rellich–Kondrachov theorem. Another example is a bounded subset of the space of probability measures on a Euclidean space with finite p-th moment, p ≥ 1, with the Wasserstein-p distance, embedded in the space of all probability measures on the underlying space with the Prohorov topology. A set of laws with uniformly bounded p-th moment is necessarily tight, hence the embedding is compact. Density follows easily. Yet another simple example is the open unit ball {f : ‖f‖ < 1} in L²[0, 1] with the norm topology, with its natural embedding into L²[0, 1] with the weak* topology.

Let F be the class of nonnegative functions in C_b(X*). The functions in F can also be viewed as functions on K* by letting f(x, u) ≡ f(x) for u ∈ U*(x). Abusing notation, we use the same symbol F to denote the pullback of the family F by the map Ψ⁻¹. These are functions on Ψ(X), that is, f(z) for z ∈ Ψ(X) is identified with f(Ψ⁻¹(z)). Since f(x, u) in the family F ⊂ C_b(K*) does not depend on u, abusing notation, we denote it simply as f(x) whenever this is convenient. We make the further assumption:

(D) Given f ∈ F, the map (x, u) ↦ ∫ f(y) P(dy | x, u) extends uniquely to a map H_f ∈ C(X* × U*).

In the study of the average cost problem, empirical occupation measures play an important role. These are defined as follows.

Definition 2.4. For any given µ ∈ P(X) and ξ ∈ U, we define the family of mean empirical measures {ζ_t ∈ P(K), t > 0} by

∫_K h(x, u) ζ_t(dx, du) := (1/t) ∑_{m=0}^{t−1} E^ξ_µ [ h(X_m, U_m) ] for all h ∈ C_b(K) .

Naturally, ζ_t depends on µ and ξ, but we suppress this dependence in the notation.
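As an illustration of Definition 2.4, the following Python sketch (ours; the finite MDP, the stationary strategy, and the initial Dirac mass are all hypothetical) estimates ζ_t by averaging state-action occupancies over independent runs.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, t, runs = 3, 2, 200, 500
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[x, a] = P(. | x, a)
v = rng.dirichlet(np.ones(nA), size=nS)         # a stationary strategy v(du | x)

zeta = np.zeros((nS, nA))                       # Monte Carlo estimate of zeta_t
for _ in range(runs):
    x = 0                                       # mu = Dirac mass at state 0
    for _ in range(t):
        a = rng.choice(nA, p=v[x])
        zeta[x, a] += 1.0
        x = rng.choice(nS, p=P[x, a])
zeta /= runs * t                                # zeta_t sums to 1 over states x actions
print(zeta)
```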

Concerning limit points of sequences of mean empirical measures, the notion of vague convergence in locally compact spaces is replaced by the following:

∫_K f dζ_n → ∫_K f dµ as n → ∞, for all f ∈ C_b(K) having bounded support. (2.6)

As shown in [18, Proposition 4.4], for a separable metric space, if (2.6) holds and µ is a probability measure, then ζ_n ⇒ µ, where ⇒ denotes weak convergence of probability measures in P(K). In other words, the subspace of C_b(K) consisting of functions having bounded support is convergence determining. This also means that the limit µ in (2.6) is unique, irrespective of whether or not it is a probability measure. Note also that unless the sequence {ζ_n}_{n∈N} is tight, the measure µ in (2.6) cannot be a probability measure.

We state and prove a key lemma, which is analogous to the one mentioned earlier for the locally compact case.

Lemma 2.1. Suppose that a sequence {ζ_n}_{n∈N} of mean empirical measures satisfies (2.6) along a subsequence for some positive measure µ in the vague topology. Then µ/µ(K) ∈ M_erg. The same conclusion applies for a sequence {ζ_n}_{n∈N} ⊂ M_erg.

Proof. Using the strong law of large numbers for martingales, given by

(1/t) ∑_{m=1}^{t} ( f(X_m) − E^ξ_µ[ f(X_m) | X_{m−1}, U_{m−1} ] ) → 0 a.s. as t → ∞

for f ∈ C_b(X), we obtain, upon taking expectations, that

∫_K ( f(x) − ∫_X f(y) P(dy | x, u) ) ζ_t(dx, du) → 0 as t → ∞ . (2.7)

Consider the embedding in K* in Definition 2.3. By Prokhorov's theorem, moving to a subsequence if necessary, also denoted as {ζ_n}_{n∈N}, we have ζ_n ⇒ ζ for some ζ ∈ P(K*). Since K* is the disjoint union of K and ∂K, it is clear that ζ must be of the form

ζ = a ζ_0 + (1 − a) ζ_1 (2.8)

for some a ∈ [0, 1], ζ_0 ∈ P(∂K), and ζ_1 ∈ P(K). Since

lim_{n→∞} ∫_K f dζ_n = lim_{n→∞} ∫_{K*} f dζ_n = ∫_{K*} f dζ = a ∫_{∂K} f dζ_0 + (1 − a) ∫_K f dζ_1 for all f ∈ C_b(K*) (2.9)

by the hypothesis that ζ_n ⇒ ζ, and since the family of functions in C_b(K*) restricted to K is convergence determining for K, it is clear by comparing (2.6) and (2.9) that µ = (1 − a) ζ_1. Since µ(K) > 0 by hypothesis, it follows that a < 1. Hence it suffices to show that ζ_1 ∈ M_erg.

As shown in [18, Theorem 4.5], if 𝒳 is Polish, then any subset F ⊂ C_b(𝒳) which separates points in 𝒳 and is also an algebra is a separating class for Borel probability measures, meaning that if µ′, µ″ ∈ P(𝒳) satisfy ∫ f dµ′ = ∫ f dµ″ for all f ∈ F, then µ′ = µ″. The method that we use in this proof reduces the problem of proving that ζ_1 ∈ M_erg to establishing equality of two given measures in P(X). Therefore, it suffices to continue with a class F that only separates probability measures.

Thus we have

liminf_{n→∞} ∫_K ( ∫_X f(y) P(dy | x, u) ) ζ_n(dx, du) = liminf_{n→∞} ∫_{K*} H_f(x, u) ζ_n(dx, du)
  ≥ (1 − a) ∫_K H_f(x, u) ζ_1(dx, du)
  = (1 − a) ∫_K ( ∫_X f(y) P(dy | x, u) ) ζ_1(dx, du) . (2.10)

Combining (2.9) with the above, and using also Fubini's theorem, we get

∫_K f(x) ζ_1(dx, du) ≥ ∫_X f(y) ( ∫_K P(dy | x, u) ζ_1(dx, du) ) for all f ∈ F . (2.11)

For A ∈ B(X), define

η_1(A) := ∫_{(A×U)∩K} ζ_1(dx, du) , and η_2(A) := ∫_K P(A | x, u) ζ_1(dx, du) . (2.12)

Then, (2.11) can be written as

∫_X f(x) η_1(dx) ≥ ∫_X f(y) η_2(dy) for all f ∈ F . (2.13)

Since X is Polish, inner regularity of a measure µ ∈ P(X) implies that for any bounded open neighborhood O, we can choose an ascending sequence of compact sets K_1 ⊂ K_2 ⊂ · · · ⊂ O such that µ(K_n) ↗ µ(O). Therefore, since F contains all nonnegative continuous functions with bounded support, it is standard to show using Urysohn's lemma [5, Lemma 7.1] and (2.13) that

η_1(B) ≥ η_2(B) = ∫_K P(B | x, u) ζ_1(dx, du) for all B ∈ B(X) . (2.14)

However, η_1(X) = η_2(X) = 1 by (2.12). Thus, equality must hold in (2.14) for all B ∈ B(X), which means that ζ_1 ∈ M_erg by the definition of M_erg.


In the case of a sequence {ζ_n}_{n∈N} ⊂ M_erg which satisfies (2.6), observe that the left-hand side of (2.7) over this sequence is identically equal to 0 by the definition of an ergodic occupation measure. Thus, the proof of the statement is identical to the above.

We continue by showing that (C) implies the existence of an optimal ergodic occupation measure in the sense of Definition 2.2.

Theorem 2.1. Under (C), we have M⋆_erg ≠ ∅. In addition, if ζ⋆ ∈ M⋆_erg, π_{ζ⋆} ∈ P(X) and v_{ζ⋆} satisfy (2.2), and v ∈ U_sm agrees π_{ζ⋆}-a.e. with v_{ζ⋆}, then

lim_{N→∞} (1/N) E^v_x [ ∑_{n=0}^{N−1} c_v(X_n) ] = J(x) = β π_{ζ⋆}-a.e. (2.15)

Proof. Let {ζ_k}_{k∈N} ⊂ M_erg be such that ∫ c dζ_k ↓ β as k → ∞. We select a subsequence such that ζ_k ⇒ ζ ∈ P(K*), and write ζ = a ζ′ + (1 − a) ζ″, with a ∈ [0, 1], ζ′ ∈ P(∂K), and ζ″ ∈ P(K). Since c is lower semi-continuous on K, there exists a sequence {c_n} ⊂ C_b(K) such that c_n ↑ c pointwise. Then for n_0 > β + 2ε, we have

β ≥ liminf_{k→∞} ∫_{K*} c dζ_k ≥ liminf_{k→∞} ∫_{K*} c_n dζ_k ≥ a(β + ε) + (1 − a) ∫_K c_n dζ″ (2.16)

for all n ≥ n_0. From (2.16) we deduce that a < 1, which by Lemma 2.1 implies that ζ″ ∈ M_erg. Letting n → ∞ in (2.16), we obtain

β ≥ a(β + ε) + (1 − a) ∫_K c dζ″ ≥ a(β + ε) + (1 − a) β .

This shows that a = 0 and ∫_K c dζ″ = β. Therefore, ζ″ ∈ M⋆_erg.

It remains to establish (2.15). If ζ⋆ = π_{ζ⋆} ⊛ v_{ζ⋆} ∈ M⋆_erg and v ∈ U_sm agrees π_{ζ⋆}-a.e. with v_{ζ⋆}, then an application of Birkhoff's ergodic theorem shows that

β = ∫_K c dζ⋆ = lim_{N→∞} (1/N) E^v_x [ ∑_{n=0}^{N−1} c_v(X_n) ] π_{ζ⋆}-a.e. (2.17)

This completes the proof.

Remark 2.2. The pair (v, π_{ζ⋆}) in Theorem 2.1 is a stationary minimum pair in the sense of [45, Definition 2.2] (see also [44]). It is worthwhile comparing the assumptions in [45] to the ones in this paper. In [45] the state space X is Borel, U is countable, and K is a Borel subset of X × U. Existence of a stationary minimum pair is established under the assumption that c is strictly unbounded and the transition kernel satisfies a majorization condition. The latter involves weak continuity of the kernel P and lower semi-continuity of c when these are restricted to D × U, where D ⊂ X is a closed set that appears in the majorization condition [45, Assumption 3.1].

By comparison, we allow U to be Polish, the running cost satisfies (C) and is not necessarily strictly unbounded, and we do not need a majorization condition. However, we assume that X is Polish, that U is upper semi-continuous, weak continuity of P, and lower semi-continuity of c on K, which are more restrictive than [45, Assumption 3.1].

2.3. Discussion. To guide the reader in the approach we follow to establish the Bellman equation and the existence of an optimal stationary Markov policy, we review the case of an MDP on a countable state space with compact action space under the near-monotone hypothesis [7]. Let the state space be N_0 := {0, 1, 2, . . . }, and U(x) = U for all x ∈ N_0. Under the near-monotone hypothesis in (C), we obtain an optimal ergodic occupation measure ζ⋆ = π_{ζ⋆} ⊛ v_{ζ⋆}. Let K ⊂ N_0 denote the support of π_{ζ⋆}. Without loss of generality, suppose that 0 ∈ K. Then v_{ζ⋆} is defined on K via the disintegration of ζ⋆, and thus the Markov chain 'controlled' by v_{ζ⋆} is well defined when restricted to K. We would like to extend v_{ζ⋆} to some policy v⋆ ∈ U_sm which is optimal in the sense of the definition in Subsection 2.1.1. Let τ_A denote the first return time to a set A, defined by

τ_A := min {n ≥ 1 : X_n ∈ A} .

Let τ_0 := τ_{{0}}. A key observation is that v_{ζ⋆} satisfies

E^{v_{ζ⋆}}_x [ ∑_{n=0}^{τ_0−1} ( c_{v_{ζ⋆}}(X_n) − β ) ] = inf_{v∈U_sm} E^v_x [ ∑_{n=0}^{τ_0−1} ( c_v(X_n) − β ) ] for all x ∈ K . (2.18)

This can be shown by following the proof of Lemma 3.1, which establishes an analogous result for the model in this paper. Therefore, any Markov control that arises from the disintegration of an optimal ergodic occupation measure attains the infimum on the right-hand side of (2.18) for x ∈ K. Let

V(x) := inf_{v∈U_sm} E^v_x [ ∑_{n=0}^{τ_0−1} ( c_v(X_n) − β ) ] , x ∈ N_0 , (2.19)

with β as in (2.3), and suppose that the right-hand side of (2.19) is finite for all x ∈ N. Then it is straightforward to show, using a one-step analysis, that V satisfies

V(x) = min_{u∈U} [ c(x, u) − β + ∑_{y∈N_0} V(y) P(y | x, u) ] for all x ∈ N ;

in other words, we have the Bellman equation on the entire state space except perhaps at x = 0. Now, since β is the ergodic value, we have

E^v_0 [ ∑_{n=0}^{τ_0−1} ( c_v(X_n) − β ) ] ≥ 0 ,

with equality when v = v_{ζ⋆}. In particular, V(0) = 0. But this shows that the Bellman equation also holds for x = 0. One crucial step in this derivation is the finiteness of the right-hand side of (2.19). Since c − β ≥ 0 on the complement of a finite set by the near-monotone hypothesis, it is easy to show that it suffices to assume that there exists some v ∈ U_sm which satisfies

E^v_x [ ∑_{n=0}^{τ_0−1} c_v(X_n) ] < ∞ for all x ∈ N . (2.20)

The fact that {0} is an atom of course plays an important role in showing that the Bellman equation is satisfied at x = 0. For the model in this paper, we circumvent this difficulty by imposing a suitable hypothesis and adopting the splitting method introduced by Athreya–Ney and Nummelin. This is discussed in the next subsection.
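The following Python sketch (ours, under illustrative assumptions: a randomly generated finite MDP in which every policy is irreducible) reproduces this construction numerically: β is obtained by enumerating stationary deterministic policies, V is computed from (2.19) under an optimal policy by tabooing state 0, and the Bellman equation together with V(0) = 0 is then verified.

```python
import numpy as np
from itertools import product

nS, nA = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[x, a, y] = P(y | x, a), all positive
c = rng.uniform(1.0, 3.0, size=(nS, nA))        # running cost c(x, a) >= 1

def avg_cost(pol):
    """Ergodic cost of a stationary deterministic policy (irreducible here)."""
    Pp = P[np.arange(nS), pol]
    A = np.vstack([Pp.T - np.eye(nS), np.ones(nS)])  # pi (Pp - I) = 0, sum(pi) = 1
    pi = np.linalg.lstsq(A, np.append(np.zeros(nS), 1.0), rcond=None)[0]
    return pi @ c[np.arange(nS), pol]

# beta: enumerate all deterministic policies (feasible only for tiny MDPs)
beta, pol = min(((avg_cost(np.array(p)), np.array(p))
                 for p in product(range(nA), repeat=nS)), key=lambda t: t[0])

# V(x) = E_x sum_{n < tau_0} (c_v - beta): taboo the column of state 0, cf. (2.19)
Pp = P[np.arange(nS), pol]
M = Pp.copy(); M[:, 0] = 0.0
V = np.linalg.solve(np.eye(nS) - M, c[np.arange(nS), pol] - beta)

# Bellman check: V(x) = min_a [c(x,a) - beta + sum_y P(y|x,a) V(y)], with V(0) = 0
T = (c - beta + P @ V).min(axis=1)
print("V(0) =", V[0], " max |V - TV| =", np.abs(V - T).max())
```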

2.4. The split-chain and the pseudo-atom. We introduce here the notions of the split chain and pseudo-atom, originally due to Athreya and Ney [4] and Nummelin [35] for uncontrolled Markov chains. We follow the treatment of [2, Section 8.4]. See [32] for an extended treatment, albeit in the uncontrolled framework.

The basic assumption concerns the existence of a 1-small set which is compatible with the near-monotonicity condition (C). More precisely, the transition probability P is assumed to satisfy the following minorization hypothesis.

(A0) There exists a bounded set B ⊂ X which satisfies

inf_{(x,u)∈(B^c×U)∩K} c(x, u) > β ,

such that for some measure ν ∈ P(X) with ν(B) = 1 and a constant δ > 0, we have P(A | x, ·) ≥ δ ν(A) 1_B(x) for all A ∈ B(X). Here, β is as defined in (2.3).


Remark 2.3. If X = R^d and the transition kernel has a continuous density ϕ, then a necessary and sufficient condition for the minorization condition in (A0) is that the function Γ : B → R_+ defined by

Γ(y) := inf_{(x,u)∈(B×U)∩K} ϕ(y | x, u)

is not a.e. equal to 0. In particular, if the density ϕ is strictly positive, then (A0) is automatically satisfied.
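For a concrete instance of Remark 2.3, the following Python sketch (ours; the Gaussian kernel, the set B, and the grids are hypothetical) computes Γ on a grid and recovers a minorization pair (δ, ν). It illustrates only the minorization part of (A0), not the cost condition.

```python
import numpy as np

B = (-1.0, 1.0)
xs = np.linspace(*B, 41)          # grid over the small set B
us = np.linspace(-1.0, 1.0, 41)   # grid over the action space U
ys = np.linspace(*B, 401)         # y-grid over B, since nu(B) = 1

def phi(y, x, u):
    """Hypothetical transition density: N(y; 0.5*x + u, 1)."""
    m = 0.5 * x + u
    return np.exp(-0.5 * (y - m) ** 2) / np.sqrt(2.0 * np.pi)

# Gamma(y) = inf over (x, u) in B x U of phi(y | x, u), computed on the grid
Gamma = phi(ys[:, None, None], xs[None, :, None], us[None, None, :]).min(axis=(1, 2))

# delta = integral of Gamma over B (trapezoidal rule); nu = Gamma / delta on B
delta = float(((Gamma[1:] + Gamma[:-1]) / 2.0 * np.diff(ys)).sum())
nu = Gamma / delta
print(f"delta = {delta:.4f}")     # then P(A | x, u) >= delta * nu(A) for x in B
```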

Definition 2.5 (Pseudo-atom). Let

𝒳 := (X × {0}) ∪ (B × {1})

and let B(𝒳) denote its Borel σ-algebra. For a probability measure µ ∈ P(X) we define the corresponding probability measure µ̃ on B(𝒳) by

µ̃(A × {0}) := (1 − δ) µ(A ∩ B) + µ(A ∩ B^c) , A ∈ B(X) ,
µ̃(A × {1}) := δ µ(A) , A ∈ B(B) . (2.21)

Let B̃ := B × {1}, and refer to it as the pseudo-atom.

Definition 2.6 (Split chain). Given the controlled Markov chain (X, U, U, P, c) as described in Section 2, we define the corresponding split chain (𝒳, U, U, Q, c̃), with state space 𝒳 and transition kernel given by

Q(dy | (x, i), u) := P(dy | x, u) if (x, i) ∈ (X × {0}) \ (B × {0}) ,
Q(dy | (x, i), u) := (1/(1−δ)) ( P(dy | x, u) − δ ν(dy) ) if (x, i) ∈ B × {0} ,
Q(dy | (x, i), u) := ν(dy) if (x, i) ∈ B × {1} . (2.22)

The running cost c̃ is defined in Subsection 2.5.

Using Definition 2.5 and (2.22), the kernel Q of the split chain can be expressed as follows:

Q(A × {0} | (x, 0), u) := [ P(A ∩ B | x, u) − δν(A ∩ B) + (1/(1−δ)) P(A ∩ B^c | x, u) ] 1_B(x)
  + [ (1 − δ) P(A ∩ B | x, u) + P(A ∩ B^c | x, u) ] 1_{B^c}(x) (2.23)

for A ∈ B(X),

Q(A × {1} | (x, 0), u) := (δ/(1−δ)) ( P(A | x, u) − δν(A) ) 1_B(x) + δ P(A | x, u) 1_{B^c}(x) (2.24)

for A ∈ B(B), and

Q(dy × {0} | (x, 1), u) := (1 − δ) ν(dy) , Q(dy × {1} | (x, 1), u) := δ ν(dy) . (2.25)

Note that B^c × {1} is not visited.

Given an initial distribution µ_0 of X_0, the corresponding initial distribution µ̃_0 of the split chain is determined according to (2.21). We let X̃_n = (X_n, i_n) ∈ 𝒳 denote the state process of the split chain.

An equivalent description of the split chain is as follows. Let {ξ_n} denote the control process.

(1) If X_n = x ∈ B, ξ_n = u, and i_n = 0, then X_{n+1} = y according to the transition probability

(1/(1−δ)) ( P(dy | x, u) − δν(dy) ) .

Furthermore, if y ∈ B, then i_{n+1} = 1 with probability δ and i_{n+1} = 0 with probability 1 − δ.
(2) If X_n = x ∈ B and i_n = 1, then X_{n+1} = y ∈ B with probability ν(dy), and i_{n+1} = 1 with probability δ and i_{n+1} = 0 with probability 1 − δ.
(3) If X_n = x ∉ B and i_n = 0, then X_{n+1} = y according to P(dy | x, u), and i_{n+1} is as in (1) above.
(4) The set B^c × {1} is never visited.

This gives a causal description of the split chain, illustrated by the sampler below. We dub the control ξ = {ξ_n} an admissible control. Intuitively, it can depend at time n on the past history up to n, i.e., on {(X_m, i_m), m ≤ n; ξ_k, k < n}, and, in addition, on any extraneous randomization conditionally independent of the 'future' {(X_m, i_m), ξ_m, m > n} given the history up to n. An alternative interpretation, given on pp. 62 and 66 of [35], exhibits i_n above in terms of (X_n, X_{n+1}) and loses causality. We do not use this interpretation. In particular, in our framework, an admissible control ξ defined as above remains admissible in the same sense for the original chain {X_n}.
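The causal description above translates directly into a sampler. The following Python sketch (ours, for a hypothetical finite-state model with exact minorization data) draws one transition of the split chain.

```python
import numpy as np

rng = np.random.default_rng(1)

def split_step(x, i, u, P, nu, delta, B):
    """Sample (X_{n+1}, i_{n+1}) given X_n = x, i_n = i, U_n = u.

    P[u][x] is the row P(. | x, u); nu is a distribution with nu(B) = 1;
    B is a set of states; P(y | x, u) >= delta * nu(y) is assumed for x in B.
    """
    if i == 1:                                       # on the pseudo-atom B x {1}
        y = rng.choice(len(nu), p=nu)                # regenerate: y ~ nu
    elif x in B:                                     # residual kernel on B x {0}
        p = (P[u][x] - delta * nu) / (1.0 - delta)
        y = rng.choice(len(p), p=p)
    else:                                            # ordinary transition
        y = rng.choice(len(P[u][x]), p=P[u][x])
    # the 'bell' i_{n+1} rings with probability delta whenever y lands in B
    j = 1 if (y in B and rng.random() < delta) else 0
    return y, j
```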

It is clear that an admissible strategy ξ ∈ U, or a randomized Markov strategy v = {v_t}_{t∈N_0}, maps in a natural manner to a corresponding control for the split chain, which is also denoted as ξ or v, respectively. We use the symbol E^ξ_{(x,i)} to denote the expectation operator on the path space of the split chain controlled under ξ ∈ U, and adopt the analogous notation as in Definition 2.1, that is, Q_v and {X̃^v_n}_{n∈N_0}. In addition, we let

τ := min {n ≥ 1 : X̃_n ∈ B̃} , (2.26)

that is, the first return time to B̃ := B × {1}.

Let

δ̃ := ((1 − δ)/δ) ( inf_{(x,u)∈(B×U)∩K} P(B | x, u) − δ )^{−1} . (2.27)

Since δ > 0 in (A0) can always be chosen so that (x, u) ↦ P(B | x, u) − δ is strictly positive on (B × U) ∩ K, we may assume that δ̃ is a (finite) positive constant. We have the following simple lemma.

Lemma 2.2. For any v ∈ U_sm it holds that

E^v_{(x,0)} [ ∑_{k=0}^{τ−1} 1_{B×{0}}(X̃_k) ] ≤ δ̃ .

Proof. This follows directly from the fact that Q(B × {1} | (x, 0), u) ≥ δ̃^{−1} for x ∈ B, by (2.24) and (2.27).

Let µ_0 be an initial distribution of {X_n}_{n∈N_0}. Adopting the notation Q_u(· | z) = Q(· | z, u) for z ∈ 𝒳, an easy calculation using Definition 2.6 shows that µ̃_0 Q_u(·) := ∫_𝒳 Q_u(· | z) µ̃_0(dz) is given by

µ̃_0 Q_u(A × {0}) = ∫_X [ (1 − δ) P(A ∩ B | x, u) + P(A ∩ B^c | x, u) ] µ_0(dx) , A ∈ B(X) ,
µ̃_0 Q_u(A × {1}) = ∫_X δ P(A ∩ B | x, u) µ_0(dx) , A ∈ B(B) . (2.28)

It is important to note, as seen from (2.28), that the marginal of the law of {(X̃_n, U_n), n ≥ 0} on K^∞ coincides with the law of {(X_n, U_n), n ≥ 0}, but the split chain has a pseudo-atom B × {1} with many desirable properties that will become apparent in the next section (see [2, Theorem 8.4.1, p. 289] and [4, 35]).

2.5. An equivalent running cost for the split chain. Consider a function c̃ : 𝒳 × U → R satisfying

c̃((x, 0), u) = c(x, u) , (x, u) ∈ (B^c × U) ∩ K ,
δ c̃((x, 1), u) + (1 − δ) c̃((x, 0), u) = c(x, u) , (x, u) ∈ (B × U) ∩ K ,

with c̃((x, 1), u) not depending on u.

Let P̃(𝒳) denote the class of probability measures µ̃ on B(𝒳) which satisfy (1 − δ) µ̃(A × {1}) = δ µ̃(A × {0}) for all A ∈ B(B). It follows by (2.28) that for any initial µ̃_0 ∈ P̃(𝒳), we have µ̃_0 Q_u ∈ P̃(𝒳). In other words, P̃(𝒳) is invariant under the action of Q. This property implies that

E^ξ_{µ̃_0} [ ∑_{n=0}^{N−1} c̃(X̃_n, U_n) ] = E^ξ_{µ_0} [ ∑_{n=0}^{N−1} c(X_n, U_n) ] for all ξ ∈ U .

In particular, the ergodic control problem for the split chain under the cost-per-stage function c̃ is equivalent to the original ergodic control problem.

With the above property in mind, we introduce the following definition.

Definition 2.7. We define the cost-per-stage function c̃ : 𝒳 × U → R for the split chain by

c̃((x, 0), u) := c(x, u) for (x, u) ∈ (B^c × U) ∩ K ,
c̃((x, 0), u) := c(x, u)/(1 − δ) for (x, u) ∈ (B × U) ∩ K ,
c̃((x, 1), u) := 0 for x ∈ B . (2.29)

For v ∈ U_sm, we let c̃_v be as in Definition 2.1 with c(·) replaced by c̃(·).

2.6. Some basic notions. We now recall some standard background from the theory of Markov chains on a general state space; see, e.g., [32] for a more detailed treatment. For v ∈ U_sm we define the resolvent R_v by

R_v(x, A) := ∑_{n=1}^{∞} 2^{−n} P^n_v(x, A) .

Consider the chain {X_n}_{n≥0} controlled by v ∈ U_sm. Recall that a measure ψ on B(X) is called a (maximal) irreducibility measure for the chain if ψ is absolutely continuous with respect to R_v(x, ·) for all x ∈ X (and ψ is maximal among such measures). In turn, the chain itself is said to be ψ-irreducible. Let B^+(X) denote the class of Borel sets A satisfying ψ(A) > 0. Let τ_A denote the first return time to a set A, defined by

τ_A := min {n ≥ 1 : X_n ∈ A} .

For a ψ-irreducible chain, a set C is petite if there exists a positive constant c such that R_v(x, A) ≥ c ψ(A) for every A ∈ B(X) and x ∈ C. Recall also that a ψ-irreducible chain is called Harris if P_x(τ_A < ∞) = 1 for every A ∈ B^+(X) and x ∈ X, and it is called positive Harris if, in addition, it admits an invariant probability measure.

Let f : X → [1, ∞) be a measurable map. For a ψ-irreducible chain, a set D ∈ B(X) is called f-regular [32] if

sup_{x∈D} E_x [ ∑_{n=0}^{τ_A−1} f(X_n) ] < ∞ for all A ∈ B^+(X) .

If there is a countable cover of X by f-regular sets, then the chain is called f-regular. An f-regular chain is always positive Harris with a unique invariant probability measure π, and satisfies

lim_{N→∞} (1/N) ∑_{n=0}^{N−1} E_x[ f(X_n) ] = π(f) := ∫_X f(x) π(dx) for all x ∈ X .
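As a quick illustration of the last display, the following Python sketch (ours; the chain and the function f are hypothetical) compares the time average (1/N) ∑_n E_x[f(X_n)] with π(f) on a finite chain, which is trivially f-regular.

```python
import numpy as np

Pv = np.array([[0.5, 0.3, 0.2],
               [0.2, 0.6, 0.2],
               [0.3, 0.3, 0.4]])          # an irreducible transition matrix
f = np.array([1.0, 2.0, 5.0])             # f : X -> [1, inf)

# invariant probability measure pi: left eigenvector of Pv for eigenvalue 1
w, V = np.linalg.eig(Pv.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()

# time average of E_x[f(X_n)] started from x = 0, via the distribution flow
mu = np.zeros(3); mu[0] = 1.0
avg, N = 0.0, 5000
for _ in range(N):
    avg += mu @ f
    mu = mu @ Pv
print(avg / N, "vs pi(f) =", pi @ f)      # the two nearly agree for large N
```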


3. The Bellman equation

In view of the definitions of the preceding section, we lift the control problem in Subsection 2.1.1 to an equivalent problem on the controlled split chain (𝒳, U, U(x), Q, c̃) described in Definitions 2.6 and 2.7. In other words, we seek to minimize over all admissible ξ ∈ U the cost

limsup_{N→∞} (1/N) E^ξ_{(x,i)} [ ∑_{n=0}^{N−1} c̃(X̃_n, U_n) ] .

3.1. Two assumptions. We need two additional assumptions. To state the first, we borrow the notion of K-inf-compactness from [22]. Recall that a function f : S → R, where S is a topological space, is called inf-compact (on S) if the set {x ∈ S : f(x) ≤ κ} (possibly empty) is compact in S for all κ ∈ R. A function f : K → R is called K-inf-compact if for every compact set 𝒦 ⊂ X the function is inf-compact on (𝒦 × U) ∩ K.

The first assumption is a structural hypothesis on the running cost, and is stated as follows:

(A1) One of the following holds.
(i) For some x ∈ X, we have J(x) < ∞, and the running cost c is inf-compact on K.
(ii) The running cost c is K-inf-compact and (C) holds.

It is clear that part (i) of (A1) implies (C). Therefore, as shown in Theorem 2.1, under (A1) there exists an optimal ergodic occupation measure.

Remark 3.1. Hypothesis (A1) (i) cannot be satisfied unless K is σ-compact. A non-trivial example of such a Polish space is ⋃_{i∈N} {λ e_i : λ ≥ 0}, where {e_i}_{i∈N} is a complete orthonormal basis for a Hilbert space, with the relative topology inherited from the ambient Hilbert space. This space is not locally compact. Note also that an inf-compact c is automatically K-inf-compact [22, Lemma 2.1 (ii)].

The second assumption is analogous to (2.20) for denumerable MDPs. We start with the following definition.

Definition 3.1. Let τ be as defined in (2.26). We say that v ∈ U_sm is c-stable if, for the chain controlled by v, the map

x ↦ E^v_{(x,0)} [ ∑_{k=0}^{τ−1} c̃_v(X̃_k) ]

is locally bounded on X, by which we mean that it is bounded on every bounded set of X.

We impose the following assumption.

(A2) There exists a c-stable v ∈ U_sm.

It is clear that (A0) implies that the split chain controlled by a c-stable v ∈ U_sm is positive Harris.

In the rest of the paper we let

ζ⋆ = π_{ζ⋆} ⊛ v_{ζ⋆} ∈ M⋆_erg (3.1)

be a generic optimal ergodic occupation measure. It is clear that E^{v_{ζ⋆}}_{(x,0)} [ ∑_{k=0}^{τ−1} c̃_{v_{ζ⋆}}(X̃_k) ] is well defined π_{ζ⋆}-a.e. We continue with the following lemma.

Lemma 3.1. Any c-stable v ∈ U_sm satisfies

E^v_{(x,0)} [ ∑_{k=0}^{τ−1} ( c̃_v(X̃_k) − β ) ] ≥ E^{v_{ζ⋆}}_{(x,0)} [ ∑_{k=0}^{τ−1} ( c̃_{v_{ζ⋆}}(X̃_k) − β ) ] π_{ζ⋆}-a.e.

In particular, (A2) implies that x ↦ E^{v_{ζ⋆}}_{(x,0)} [ ∑_{k=0}^{τ−1} c̃_{v_{ζ⋆}}(X̃_k) ] is locally bounded π_{ζ⋆}-a.e.


Proof. If not, then we have the reverse inequality on some set A ∈ B(X) with π_{ζ⋆}(A) > 0, that is,

E^v_{(x,0)} [ ∑_{k=0}^{τ−1} ( c̃_v(X̃_k) − β ) ] < E^{v_{ζ⋆}}_{(x,0)} [ ∑_{k=0}^{τ−1} ( c̃_{v_{ζ⋆}}(X̃_k) − β ) ] for all x ∈ A . (3.2)

To simplify the expressions, let

𝒥(v) := ∑_{k=0}^{τ−1} ( c̃_v(X̃_k) − β ) , v ∈ U_sm .

Since π_{ζ⋆}(A) > 0, (3.2) implies that

E^{v_{ζ⋆}}_{(x,1)} [ 1_{τ>τ_A} E^{v_{ζ⋆}}_{X̃_{τ_A}} [ 𝒥(v_{ζ⋆}) ] ] > E^{v_{ζ⋆}}_{(x,1)} [ 1_{τ>τ_A} E^v_{X̃_{τ_A}} [ 𝒥(v) ] ] . (3.3)

Consider v̂ = (v̂_n, n ∈ N_0) defined by

v̂_n := v if τ_A ≤ n < τ , and v̂_n := v_{ζ⋆} otherwise.

It is clear that this can be extended to a (nonstationary) strategy v̂ ∈ U over the infinite horizon, by using the k-th return time to B̃, denoted as τ_k, and the number of cycles κ(n) completed at time n ∈ N, which is defined by

κ(n) := max {k : n ≥ τ_k} .

Using the strong Markov property and (3.3), we obtain

0 = E^{v_{ζ⋆}}_{(x,1)} [ 𝒥(v_{ζ⋆}) ]
  = E^{v_{ζ⋆}}_{(x,1)} [ 1_{τ≤τ_A} 𝒥(v_{ζ⋆}) ] + E^{v_{ζ⋆}}_{(x,1)} [ 1_{τ>τ_A} E^{v_{ζ⋆}}_{X̃_{τ_A}} [ 𝒥(v_{ζ⋆}) ] ]
  > E^{v̂}_{(x,1)} [ 1_{τ≤τ_A} 𝒥(v_{ζ⋆}) ] + E^{v̂}_{(x,1)} [ 1_{τ>τ_A} E^v_{X̃_{τ_A}} [ 𝒥(v) ] ]
  = E^{v̂}_{(x,1)} [ ∑_{k=0}^{τ−1} ( c̃_{v̂}(X̃_k) − β ) ] .

For m > 0, let c̃^m_{v̂} := min{m, c̃_{v̂}}. The preceding inequality shows that, for some ε > 0, we have

E^{v̂}_{(x,1)} [ ∑_{k=0}^{τ−1} ( c̃^m_{v̂}(X̃_k) − β ) ] < −ε for all m ∈ N . (3.4)

We claim that (3.4) contradicts the fact that β is the optimal ergodic value. Indeed, it is rather standard to show (see the proof of Theorem 5.1 of [25]) that

(1/T) ∑_{t=0}^{T−1} ( c̃^m_{v̂}(X̃_t) − β ) → ( 1 / E^{v̂}_{(x,1)}[τ] ) E^{v̂}_{(x,1)} [ ∑_{k=0}^{τ−1} ( c̃^m_{v̂}(X̃_k) − β ) ] as T → ∞ , P^{v̂}-a.s.,

which together with (3.4) implies (since c̃^m_{v̂} is bounded) that

lim_{T→∞} (1/T) E^{v̂}_{(x,1)} [ ∑_{t=0}^{T−1} ( c̃^m_{v̂}(X̃_t) − β ) ] < −ε_1 for all m ∈ N , (3.5)

for some ε_1 > 0. Let c^m := min{m, c}, and let β^m denote the optimal ergodic value for c^m in place of c, defined as in (2.3). We first show that β^m → β as m → ∞. By Theorem 2.1, there exists ζ^m ∈ M_erg such that

β^m = ∫_K c^m dζ^m for all m > β + 2ε .

As argued in the proof of Theorem 2.1, ζ^m converges along some subsequence to a measure aζ′ + (1 − a)ζ″ ∈ P(K*), with ζ′ ∈ P(∂K) and ζ″ ∈ P(K). We employ a family {c^m_n , m, n ∈ N} of lower semi-continuous functions on K*, defined as in the proof of Theorem 2.1 with c replaced by c^m. Then, analogously to (2.16), for any fixed m > β + 2ε, we have

lim_{k→∞} β^k ≥ lim_{k→∞} ∫_K c^m_m dζ^k ≥ a(β + ε) + (1 − a) ∫_K c^m_m dζ″ .

Taking limits as m → ∞ and using monotone convergence, we obtain

β ≥ lim_{k→∞} β^k ≥ a(β + ε) + (1 − a) ∫_K c dζ″ . (3.6)

This implies that a < 1, and therefore ζ″ ∈ M_erg by Lemma 2.1. But then ∫_K c dζ″ ≥ β by (2.3), and the equality lim_{k→∞} β^k = β follows from (3.6). Parenthetically, we mention that the above argument also shows that the sequence {ζ^m}_{m∈N} is tight. Continuing, (3.5) implies that

β^m − β ≤ lim_{T→∞} (1/T) E^{v̂}_{(x,1)} [ ∑_{t=0}^{T−1} ( c̃^m_{v̂}(X̃_t) − β ) ] ≤ −ε_1 for all m ∈ N ,

which is a contradiction. This completes the proof.

Let v ∈ U_sm be c-stable. It follows from Lemma 3.1 that the strategy which agrees with v_{ζ⋆} on the support of π_{ζ⋆} and with v on its complement is also c-stable.

Assumptions (A0)–(A2) are in effect throughout the rest of the paper, unless mentioned otherwise. To see how they are used, consider the following. Let v ∈ U_sm be c-stable and such that it agrees π_{ζ⋆}-a.e. with the control v_{ζ⋆} obtained via the disintegration of an optimal ergodic occupation measure ζ⋆ = π_{ζ⋆} ⊛ v_{ζ⋆}, whose existence is guaranteed by (A1). It is then clear by (A0) that the chain controlled by v is ν-irreducible and aperiodic. Thus, the invariant probability measure π_{ζ⋆} is unique for the chain controlled by v. This implies that E(x, v) = β, a constant that does not depend on x ∈ X. Compare this with the counterexample in [17, Example 1, p. 178]. Also, (A2) should be compared with part (b) of [45, Theorem 3.5].

It is clear from the definition of Q that the first exit distribution of the split chain from B × {1} does not depend on x ∈ B. Thus x ↦ E^v_{(x,1)}[τ] is constant on B. This implies that, for all f ∈ C_b(X), with f̃ defined analogously to (2.29) so that f̃((x, 1)) = 0 for all x ∈ B, we have

π_{ζ⋆}(f) := ∫_X f(y) π_{ζ⋆}(dy) = E^v_{(x,1)} [ ∑_{k=0}^{τ−1} f̃(X̃_k) ] / E^v_{(x,1)}[τ] for all x ∈ B . (3.7)

In fact, (3.7) holds for any f ∈ L¹(X; π_v) by [36, Proposition 5.9]. Therefore, we have

E^v_{(x,1)} [ ∑_{k=0}^{τ−1} ( c̃_v(X̃_k) − β ) ] = 0 for all x ∈ B . (3.8)

It then follows by (A2) that the function

G^{(i)}_v(x) = G_v(x, i) := E^v_{(x,i)} [ ∑_{k=0}^{τ−1} ( c̃_v(X̃_k) − β ) ] , (x, i) ∈ 𝒳 , (3.9)

is locally bounded from above. On the other hand, by Lemma 2.2 and the fact that c̃_v ≥ β on B^c × {0}, we have inf_X G^{(0)}_v ≥ −β δ̃.

In Subsection 3.2 we show that G^{(i)}_v solves the Poisson equation.


3.2. Solution to the Poisson equation. Let v ∈ U_sm be c-stable, and such that it agrees π_{ζ⋆}-a.e. with the control v_{ζ⋆} obtained via the disintegration of an optimal ergodic occupation measure ζ⋆ = π_{ζ⋆} ⊛ v_{ζ⋆}. By one-step analysis, using (2.23)–(2.25), (2.29), and (3.9), and adopting the notation in Definition 2.1, we obtain

G^{(1)}_v(x) = −β + (1 − δ) ∫_B G^{(0)}_v(y) ν(dy) + δ ∫_B G^{(1)}_v(y) ν(dy) , x ∈ B , (3.10)

G^{(0)}_v(x) = c_v(x)/(1−δ) − β + ∫_B G^{(0)}_v(y) [ P_v(dy | x) − δν(dy) ] + (1/(1−δ)) ∫_{B^c} G^{(0)}_v(y) P_v(dy | x)
  + (δ/(1−δ)) ∫_B G^{(1)}_v(y) [ P_v(dy | x) − δν(dy) ] , x ∈ B , (3.11)

and

G^{(0)}_v(x) = c_v(x) − β + (1 − δ) ∫_B G^{(0)}_v(y) P_v(dy | x) + ∫_{B^c} G^{(0)}_v(y) P_v(dy | x)
  + δ ∫_B G^{(1)}_v(y) P_v(dy | x) , x ∈ B^c . (3.12)

Let

c̄(x, u) := c(x, u) − β , and c̄_v(x) := c_v(x) − β . (3.13)

Multiplying (3.10) and (3.11) by δ and (1 − δ), respectively, and adding them together, we obtain

(1 − δ) G^{(0)}_v(x) + δ G^{(1)}_v(x) = c̄_v(x) + ∫_B [ (1 − δ) G^{(0)}_v(y) + δ G^{(1)}_v(y) ] P_v(dy | x)
  + ∫_{B^c} G^{(0)}_v(y) P_v(dy | x) , x ∈ B . (3.14)

We define

G_v(x) := (1 − δ) G^{(0)}_v(x) + δ G^{(1)}_v(x) for x ∈ B , and G_v(x) := G^{(0)}_v(x) otherwise. (3.15)

It follows by (3.12), (3.14), and (3.15) that

G_v(x) = c̄_v(x) + ∫ G_v(y) P_v(dy | x) = c̄_v(x) + P_v G_v(x) , x ∈ X . (3.16)

It is clear that (3.8) implies that G^{(1)}_v ≡ 0 on B. Thus

∫_B G_v(y) ν(dy) = ∫_B (1 − δ) G^{(0)}_v(y) ν(dy) = β

by (3.10) and (3.15).

Note that G^{(i)}_{v_{ζ⋆}} and G_{v_{ζ⋆}} are well defined π_{ζ⋆}-a.e. via (3.9) and (3.15).
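On a finite state space the splitting is unnecessary and (3.16) can be checked directly. The following Python sketch (ours; the chain and cost are hypothetical) solves the Poisson equation G = c̄_v + P_v G numerically and reports the residual.

```python
import numpy as np

Pv = np.array([[0.6, 0.4, 0.0],
               [0.3, 0.4, 0.3],
               [0.0, 0.5, 0.5]])          # transition matrix under a fixed policy v
cv = np.array([1.0, 2.0, 4.0])            # running cost under v

# beta = pi(c_v) for the invariant probability measure pi
w, V = np.linalg.eig(Pv.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()
beta = pi @ cv

# G solves G = (c_v - beta) + Pv G; fix the additive constant by pinning pi(G) = 0
A = np.vstack([np.eye(3) - Pv, pi])
b = np.append(cv - beta, 0.0)
G = np.linalg.lstsq(A, b, rcond=None)[0]
print("Poisson residual:", np.abs(G - (cv - beta) - Pv @ G).max())
```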

3.3. Derivation of the Bellman equation (ACOE). Starting in this section, and throughout the rest of the paper, we enforce the following structural hypothesis on the controlled chain. This assumption is implicit in all the results of the paper which follow, unless otherwise mentioned.

Assumption 3.1. P(dy | x, u) is strongly continuous (or strong Feller), that is, the map K ∋ (x, u) ↦ ∫_X f(y) P(dy | x, u) is continuous for every f ∈ M_b(X).

Remark 3.2. Assumption 3.1 implies that the family {P(· | x, u) : (x, u) ∈ 𝒦} is tight for any compact set 𝒦 ⊂ K. Indeed, for any sequence {(x_n, u_n)} ⊂ 𝒦 converging to some (x, u) in this set, we have P(· | x_n, u_n) ⇒ P(· | x, u). Then the above set, being the continuous image of a compact set, is compact. By Prokhorov's theorem, it is tight.


Remark 3.3. If X = R^d, a sufficient condition for Assumption 3.1 is that

P(dy | x, u) = ϕ(y | x, u) λ(dy)

for a density function ϕ with respect to λ, the Lebesgue measure on R^d, and that the map

(y, x, u) ∈ R^d × K ↦ ϕ(y | x, u) ∈ [0, ∞)

is continuous. This implies the continuity of the measure-valued map (x, u) ↦ ϕ(y | x, u) dy in the total variation norm by Scheffé's theorem, which in turn implies Assumption 3.1.

Definition 3.2. Define

V^{(0)}_⋆(x) := inf_{v∈Û_sm} E^v_{(x,0)} [ ∑_{k=0}^{τ−1} ( c̃_v(X̃_k) − β ) ] , x ∈ X ,

where Û_sm denotes the set of c-stable v ∈ U_sm. Also let V^{(1)}_⋆(x) = 0 for x ∈ B, and

V⋆(x) := (1 − δ) V^{(0)}_⋆(x) for x ∈ B , and V⋆(x) := V^{(0)}_⋆(x) otherwise.

Recall the definition of O(f) in Subsection 1.1, and that L(X) denotes the class of real-valued lower semi-continuous functions which are bounded from below in X.

Theorem 3.1. The function V⋆ in Definition 3.2 is in the class L(X) and satisfies

V⋆(x) = min_{u∈U(x)} [ c̄(x, u) + ∫_X V⋆(y) P(dy | x, u) ] for all x ∈ X , (3.17)

with c̄ as in (3.13). Moreover, every v⋆ ∈ U_sm which satisfies

v⋆(x) ∈ Argmin_{u∈U(x)} [ c̄(x, u) + P_u V⋆(x) ] (3.18)

is an optimal stationary Markov strategy. In addition, (3.17) has, up to an additive constant, a unique solution in L(X) ∩ O(V⋆).

Proof. As in (3.11), by one-step analysis, we obtain

V^{(0)}_⋆(x) = min_{u∈U(x)} [ c(x,u)/(1−δ) − β + ∫_B V^{(0)}_⋆(y) [ P(dy | x, u) − δν(dy) ]
  + (1/(1−δ)) ∫_{B^c} V^{(0)}_⋆(y) P(dy | x, u) ] , x ∈ B , (3.19)

and

V^{(0)}_⋆(x) = min_{u∈U(x)} [ c(x, u) − β + (1 − δ) ∫_B V^{(0)}_⋆(y) P(dy | x, u) + ∫_{B^c} V^{(0)}_⋆(y) P(dy | x, u) ] (3.20)

for x ∈ B^c. One-step analysis here is justified because, if the policy after the first stage is not c-stable with positive probability with respect to P(dy | x, u), then the value of the quantity being minimized is ∞. Thus minimization over all admissible policies and minimization over c-stable policies are equivalent.

On the other hand, since V^{(0)}_⋆ = G^{(0)}_{v_{ζ⋆}} ν-a.e. by Lemma 3.1, (3.10) shows that

0 = V^{(1)}_⋆(x) = −β + (1 − δ) ∫_B V^{(0)}_⋆(y) ν(dy) for all x ∈ B . (3.21)

It then follows by (3.19)–(3.21) and Definition 3.2 that V⋆ satisfies

V⋆(x) = min_{u∈U(x)} [ c̄(x, u) + P_u V⋆(x) ] . (3.22)

Since the kernel P is strongly continuous and V⋆ is bounded from below in X by Lemma 2.2, the map (x, u) ↦ P_u V⋆(x) is lower semi-continuous on K. Therefore, since U is upper semi-continuous, the map (x, u) ↦ c̄(x, u) + P_u V⋆(x) is K-inf-compact by [22, Lemma 2.1 (i)]. Hence, applying Theorem 2.1 of [22] to (3.22), we deduce that V⋆ ∈ L(X). Since V⋆ is bounded from below in X by Lemma 2.2, optimality of v⋆ in (3.18) follows by a standard argument using Birkhoff's ergodic theorem.

We continue with the proof of uniqueness. Since V^{(0)}_⋆ is bounded from below in X, it is standard to show, using (3.20) and Fatou's lemma, that

V^{(0)}_⋆(x) ≥ E^{v⋆}_{(x,0)} [ ∑_{k=0}^{τ−1} ( c̃_{v⋆}(X̃_k) − β ) ] , x ∈ X , (3.23)

with v⋆ as in (3.18). Definition 3.2 shows that we must have equality in (3.23). In turn, applying Dynkin's formula to (3.20) we obtain

V^{(0)}_⋆(x) = lim_{n→∞} E^{v⋆}_{(x,0)} [ ∑_{k=0}^{τ∧n−1} ( c̃_{v⋆}(X̃_k) − β ) ] , x ∈ X ,

and

limsup_{n→∞} E^{v⋆}_{(x,0)} [ V^{(0)}_⋆(X̃_{τ∧n}) 1_{τ>n} ] = 0 . (3.24)

Let V ∈ L(X) ∩ O(V⋆) be a solution of (3.17) and v ∈ U_sm a selector from its minimizer. Going to the split chain and scaling with an additive constant, we obtain functions V^{(i)}(x), i = 0, 1, which satisfy (3.19)–(3.21) (with V^{(i)}_⋆ replaced by V^{(i)}), and

V(x) := (1 − δ) V^{(0)}(x) for x ∈ B , and V(x) := V^{(0)}(x) otherwise.

In analogy to (3.23), we also have

V^{(0)}(x) ≥ E^v_{(x,0)} [ ∑_{k=0}^{τ−1} ( c̃_v(X̃_k) − β ) ] ≥ V^{(0)}_⋆(x) , x ∈ X , (3.25)

where for the second inequality we use Definition 3.2. Thus, if v⋆ is as in (3.18), then using the kernel Q_{v⋆} of the split chain in (2.22), we deduce that V^{(i)} − V^{(i)}_⋆ is a nonnegative local supermartingale under Q_{v⋆}. Since V^{(1)} = V^{(1)}_⋆ = 0, and V ∈ O(V⋆), using Dynkin's formula, we obtain from (3.24) and the supermartingale inequality that

V^{(0)}(x) − V^{(0)}_⋆(x) ≤ 0 for all x ∈ X . (3.26)

Therefore, V^{(0)} = V^{(0)}_⋆ on X by (3.25) and (3.26). This completes the proof.

Remark 3.4. If we relax the strong Feller hypothesis in Assumption 3.1 and assume instead that the transition kernel is weak Feller, we can obtain an ACOI with a lower semi-continuous potential function. Indeed, if we let
\[
\underline V_\star(x) \,:=\, \sup_{r>0}\,\inf_{y\in B_r(x)}\, V_\star(y)\,,
\]
with $B_r(x)$ denoting the open ball of radius $r$ centered at $x$, then $\underline V_\star\in L(X)$. Therefore, by (3.16) we have
\[
V_\star(x) \,\ge\, \inf_{u\in U(x)}\bigl[\bar c(x,u) + P_u V_\star(x)\bigr] \,\ge\, \inf_{u\in U(x)}\bigl[\bar c(x,u) + P_u \underline V_\star(x)\bigr]\,, \tag{3.27}
\]


and the term on the right-hand side of (3.27) is in $L(X)$. Since $\underline V_\star$ is the largest lower semi-continuous function dominated by $V_\star$ [41], we obtain
\[
\underline V_\star(x) \,\ge\, \inf_{u\in U(x)}\bigl[\bar c(x,u) + P_u \underline V_\star(x)\bigr]\,.
\]
It is standard to show that any measurable selector from this equation is optimal. We refer the reader to [30, 41] on how to improve this to an ACOE under additional hypotheses.

Remark 3.5. Our approach differs from the standard approach of deriving the Bellman equation via a vanishing discount argument. We briefly indicate here how near-monotonicity or inf-compactness of the cost function can help with the standard methodology. One important consequence of near-monotonicity is that the discounted value function attains its minimum on a fixed compact set as the discount parameter varies. Thus, if we can establish equicontinuity of the relative discounted value functions as the discount factor varies (e.g., using convexity when available, as in [24], or as in Example 6.1 in Section 6), one can argue that as the discount parameter tends to 1, the relative discounted value functions either remain bounded on compacts or tend to infinity uniformly on compacts along a subsequence. Eliminating the latter possibility by a suitable choice of the offset in the definition of the relative discounted value function shows uniform boundedness over compacts. This idea is used in [1] for deriving the Bellman equation for the average cost for a specific class of problems, and can potentially be generalized. See also [21, Theorem 6].

Remark 3.6. It is worth noting that the derivation of the Bellman equation (3.22) does not require the strong continuity of Assumption 3.1; weak continuity suffices. We do, however, require strong continuity in order to obtain a solution in the class $L(X)$.

4. The value iteration

Throughout this section, as well as in Section 6, $v_\star\in U^\star_{\mathrm{sm}}$ is some optimal stationary Markov strategy which is kept fixed.

4.1. The value iteration algorithm. We start with the following definition.

Definition 4.1 (Value Iteration). Given $\Phi_0\in L(X)$, which serves as an initial condition, we define the value iteration (VI) by
\[
\Phi_{n+1}(x) \,=\, \mathcal T\Phi_n(x) \,:=\, \min_{u\in U(x)}\bigl[\bar c(x,u) + P_u\Phi_n(x)\bigr]\,, \qquad n\in\mathbb N_0\,. \tag{4.1}
\]
Since $\mathcal T\colon L(X)\to L(X)$, it is clear that the algorithm lives in the space of lower semi-continuous functions which are bounded from below in $X$. It is also clear that $\mathcal T$ is a monotone operator on $L(X)$; that is, for any $f, f'\in L(X)$ with $f\le f'$, we have $\mathcal T f\le \mathcal T f'$.
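For intuition, the following is a minimal numerical sketch of the recursion (4.1) for a hypothetical finite state-action model (all data below are illustrative, not from the paper). It runs the VI with the shifted running cost $\bar c = c - \beta$, where $\beta$ is the optimal average cost, here obtained by brute-force policy enumeration, which is feasible only because the model is tiny.
\begin{verbatim}
import numpy as np
from itertools import product

# Illustrative finite model (hypothetical data, not from the paper).
nS, nA = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[x, u] = P(. | x, u)
c = rng.uniform(0.0, 2.0, size=(nS, nA))         # running cost c(x, u)

def average_cost(policy):
    """Long-run average cost of a stationary deterministic policy."""
    Pv = P[np.arange(nS), policy]                # transition matrix under v
    cv = c[np.arange(nS), policy]
    A = np.vstack([Pv.T - np.eye(nS), np.ones(nS)])
    b = np.append(np.zeros(nS), 1.0)
    pi = np.linalg.lstsq(A, b, rcond=None)[0]    # stationary distribution
    return pi @ cv

# beta = optimal average cost, by enumeration (tiny models only).
beta = min(average_cost(np.array(v)) for v in product(range(nA), repeat=nS))

# VI (4.1): Phi_{n+1}(x) = min_u [ c(x,u) - beta + sum_y P(y|x,u) Phi_n(y) ].
Phi = np.zeros(nS)
for _ in range(500):
    Phi = np.min(c - beta + P @ Phi, axis=-1)

v_sel = np.argmin(c - beta + P @ Phi, axis=-1)   # selector, cf. (3.18)
print(beta, Phi, v_sel)
\end{verbatim}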

4.1.1. The value iteration for the split chain. Using (2.23)–(2.25), we can also express the algorithm via the split chain as follows. The value iteration functions $\Phi^{(i)}_n$, $n\in\mathbb N_0$, $i=0,1$, are defined as follows. Let $V_0\colon X\to\mathbb R$ be a nonnegative continuous function. The initial condition is $\Phi^{(1)}_0 = 0$ and $\Phi^{(0)}_0(x) = V_0(x)\bigl[(1-\delta)^{-1}\mathbb 1_B(x) + \mathbb 1_{B^c}(x)\bigr]$, and for each $n\in\mathbb N$ we define
\[
\Phi_n(x) \,=\, \begin{cases} (1-\delta)\,\Phi^{(0)}_n(x) + \delta\,\Phi^{(1)}_n\,, & x\in B\,,\\[2pt] \Phi^{(0)}_n(x)\,, & \text{otherwise}. \end{cases} \tag{4.2}
\]


Thus, the algorithm takes the form
\[
\begin{aligned}
\Phi^{(0)}_{n+1}(x) \,&=\, \frac{1}{1-\delta}\,\min_{u\in U(x)}\Bigl[c(x,u) + \int_X \Phi_n(y)\,P(dy\,|\,x,u)\Bigr] \,-\, \beta \,-\, \frac{\delta}{1-\delta}\int_B \Phi_n(y)\,\nu(dy)\,, && x\in B\,,\\
\Phi^{(0)}_{n+1}(x) \,&=\, \min_{u\in U(x)}\Bigl[c(x,u) - \beta + \int_X \Phi_n(y)\,P(dy\,|\,x,u)\Bigr]\,, && x\in B^c\,,\\
\Phi^{(1)}_{n+1}(x) \,&=\, -\beta + \int_B \Phi_n(y)\,\nu(dy)\,, && x\in B\,.
\end{aligned}
\]
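As a sanity check on this bookkeeping, the following sketch runs the split-chain recursion side by side with the plain VI on a hypothetical finite model, with $\nu$ uniform and $\delta$ chosen so that the minorization $P(\cdot\,|\,x,u)\ge\delta\nu(\cdot)$ holds on $B$; the constant \texttt{beta} below is a stand-in for the optimal average cost (its value does not affect the recombination identity (4.2)).
\begin{verbatim}
import numpy as np

# Illustrative finite model; beta is a stand-in constant.
nS, nA, beta = 4, 2, 1.0
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
c = rng.uniform(0.0, 2.0, size=(nS, nA))
B = np.array([0, 1])                     # a "small set"
nu = np.full(nS, 1.0 / nS)               # minorizing probability measure
delta = (P[B] / nu).min()                # P(.|x,u) >= delta*nu(.) on B

Phi0, Phi1 = np.zeros(nS), 0.0           # Phi_n^(0) and Phi_n^(1) (constant on B)
Phi_plain = np.zeros(nS)                 # plain VI (4.1), for comparison
for _ in range(50):
    Phi = Phi0.copy()
    Phi[B] = (1 - delta) * Phi0[B] + delta * Phi1   # recombine as in (4.2)
    assert np.allclose(Phi, Phi_plain)              # the two VIs agree
    nuPhiB = (nu[B] * Phi[B]).sum()                 # integral of Phi over B wrt nu
    M = np.min(c + P @ Phi, axis=-1)
    Phi0 = M - beta                                 # update on B^c
    Phi0[B] = M[B] / (1 - delta) - beta - delta / (1 - delta) * nuPhiB
    Phi1 = -beta + nuPhiB                           # update of Phi^(1)
    Phi_plain = np.min(c - beta + P @ Phi_plain, axis=-1)
\end{verbatim}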

Notation 4.1. We adopt the following simplified notation. We let $v_n\in U_{\mathrm{sm}}$ be a measurable selector from the minimizer of (4.1), and define
\[
P_n(\cdot\,|\,x) \,:=\, P\bigl(\cdot\,|\,x,v_n(x)\bigr)\,, \quad\text{and}\quad \bar c_n(x) \,:=\, c\bigl(x,v_n(x)\bigr) - \beta\,. \tag{4.3}
\]
Note that these depend on the initial value $\Phi_0$. We fix an optimal strategy $v_\star\in U^\star_{\mathrm{sm}}$, and let $P_\star(dy\,|\,x)$ denote the transition kernel under $v_\star$. In addition, we let $c_\star(x) = c\bigl(x,v_\star(x)\bigr)$ and $\bar c_\star = c_\star - \beta$.

With this notation, for $n\in\mathbb N_0$ we have
\[
\Phi_{n+1}(x) \,=\, \min_{u\in U(x)}\bigl[\bar c(x,u) + P_u\Phi_n(x)\bigr] \,=\, \bar c_n(x) + P_n\Phi_n(x) \qquad \forall\, x\in X\,, \tag{4.4}
\]
and
\[
V_\star(x) \,=\, \bar c_\star(x) + P_\star V_\star(x) \qquad \forall\, x\in X\,. \tag{4.5}
\]
It follows from the optimality of $v_\star$ and $v_n$ that
\[
\Phi_{n+1} \,\le\, \bar c_\star + P_\star\Phi_n\,, \tag{4.6}
\]
and
\[
V_\star \,\le\, \bar c_n + P_n V_\star\,. \tag{4.7}
\]

4.2. General results on convergence of the VI. Recall the function $V_\star$ from Theorem 3.1, and let $\pi_\star$ denote the associated invariant probability measure. Consider the following hypothesis.

(H1) $\pi_\star(V_\star) < \infty$.

For $c$ bounded, finiteness of the second moments of $\tau$ implies (H1) (see, for example, [7, p. 66]). In general, (H1) is equivalent to the finiteness of the second moments of the modulated first hitting times to $B\times\{1\}$ on a full and absorbing set.

For a constant $\kappa\in\mathbb R$, we define the set
\[
\mathcal V(\kappa) \,:=\, \bigl\{f\in L(X)\cap O(V_\star) \,:\, f\ge V_\star - \kappa\,,\ \pi_\star(f)\le\kappa+1\bigr\}\,. \tag{4.8}
\]

Under (H1), we show that the VI converges pointwise for any $\Phi_0\in\mathcal V(\kappa)$. In order to prove this result, we need the following lemma.

Lemma 4.1. Under (H1), if $\Phi_0\in\mathcal V(\kappa)$ for some $\kappa\in\mathbb R$, then $\Phi_n\in\mathcal V(\kappa)$ for all $n\in\mathbb N$; in other words, the set $\mathcal V(\kappa)$ is invariant under the action of $\mathcal T$. In addition, $\pi_\star(\Phi_{n+1})\le\pi_\star(\Phi_n)$ for all $n\in\mathbb N_0$.


Proof. Subtracting (4.5) from (4.6), we obtain
\[
\Phi_{n+1} - V_\star \,\le\, P_\star(\Phi_n - V_\star)\,, \tag{4.9}
\]
while subtracting (4.7) from (4.4), we have
\[
\Phi_{n+1} - V_\star \,\ge\, P_n(\Phi_n - V_\star)\,. \tag{4.10}
\]
It is straightforward to show, using the $f$-norm ergodic theorem (see [32, Theorem 14.0.1]), that under (H1) there exists a constant $m$ such that $P^n_\star g(x) \le m\,\lVert g\rVert_{V_\star}$ for all $x\in X$, all $n\in\mathbb N$, and all $g\in O(V_\star)$. Therefore, $\pi_\star(\Phi_{n+1})\le\pi_\star(\Phi_n)$ by (4.9), while $\Phi_{n+1}-V_\star \ge \inf_X(\Phi_n - V_\star)$ by (4.10). The result then follows.

Theorem 4.1. Assume (H1), and suppose $\Phi_0\in\mathcal V(\kappa)$ for some $\kappa\in\mathbb R$. Then the following hold:
\[
\Phi_n \;\xrightarrow[n\to\infty]{}\; V_\star + \lim_{n\to\infty}\pi_\star(\Phi_n - V_\star) \quad \text{in } L^1(X;\pi_\star) \text{ and } \pi_\star\text{-a.s.} \tag{4.11}
\]
Also,
\[
\lim_{n\to\infty}\,\bigl|\nu(\Phi_n) - \pi_\star(\Phi_n - V_\star)\bigr| \,=\, \beta\,. \tag{4.12}
\]

Proof. Let $\{X^\star_n\}_{n\in\mathbb Z}$ denote the stationary optimal process controlled by $v_\star$. If $\Phi_0\in\mathcal V(\kappa)$, we have
\[
\sup_{n\in\mathbb N}\,\int \bigl|\Phi_n(x) - V_\star(x)\bigr|\,\pi_\star(dx) \,<\, \infty
\]
by Lemma 4.1. Then (4.9) implies that the process
\[
M_k \,:=\, \Phi_{-k}(X^\star_{-k}) - V_\star(X^\star_{-k})\,, \qquad k\le 0\,,
\]
is a backward submartingale with respect to the filtration $\{\mathcal F_k\}_{k\le 0} := \bigl\{\sigma(X^\star_{-\ell}\,,\ \ell\le k)\bigr\}_{k\le 0}$. From the theory of martingales, it is well known that $M_k$ converges a.s. and in the mean. The latter implies the convergence in $L^1(X;\pi_\star)$ claimed in (4.11). By the ergodicity of $\{X^\star_n\}_{n\in\mathbb Z}$, the limit is constant $\pi_\star$-a.s.

Convergence of $\Phi_n$ in $L^1(X;\pi_\star)$, and hence also in $L^1(B;\nu)$ by (A0), implies that $\nu(\Phi_n)$ converges. Since $\nu(V_\star) = \beta$ by (3.21), (4.12) then follows from (4.11).

Corollary 4.1. Assume (H1), and suppose that $V_\star$ is bounded. Then $\Phi_n(x) - V_\star(x)$ converges to a constant $\pi_\star$-a.e. as $n\to\infty$ for any initial condition $\Phi_0\in L_b(X)$.

Proof. This clearly follows from Theorem 4.1, since if $\Phi_0\in L_b(X)$, then $\Phi_0\in\mathcal V(\kappa)$ for some $\kappa\in\mathbb R$.

5. Relative value iteration

We consider three variations of the relative value iteration algorithm (RVI). All of these start with the initial condition $V_0\in L(X)$. Let
\[
\mathcal S f(x) \,:=\, \inf_{u\in U(x)}\bigl[c(x,u) + P_u f(x)\bigr]\,, \qquad f\in L(X)\,.
\]
The iterates $\{V_n\}_{n\in\mathbb N}\subset L(X)$ are defined by
\[
V_n(x) \,=\, \widehat{\mathcal T} V_{n-1}(x) \,:=\, \mathcal S V_{n-1}(x) - \nu(V_{n-1})\,, \qquad x\in X\,. \tag{5.1}
\]
An important variation of this is
\[
\bar V_n(x) \,=\, \bar{\mathcal T}\bar V_{n-1}(x) \,:=\, \mathcal S\bar V_{n-1}(x) - \min_X \bar V_{n-1}\,, \qquad x\in X\,. \tag{5.2}
\]
Also, we can modify (5.1) to
\[
\widetilde V_n(x) \,=\, \widetilde{\mathcal T}\widetilde V_{n-1}(x) \,:=\, \mathcal S\widetilde V_{n-1}(x) - \widetilde V_{n-1}(\hat x)\,, \qquad x\in X\,, \tag{5.3}
\]
where $\hat x\in B$ is some point that is kept fixed. We let $v_n$ be a measurable selector from the minimizer of (5.1)–(5.3) (note that all three minimizers agree if the algorithms start with the same initial condition). We refer to $\{v_n\}$ as the receding horizon control sequence.
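The following is a minimal sketch of the three variants (5.1)–(5.3) on a hypothetical finite model (illustrative data; $\nu$ is a probability measure supported on a set $B$, and $\hat x$ a fixed state in $B$). By Lemma 5.1 below, the three iterates differ from each other by constants, and the three offsets all converge to $\beta$.
\begin{verbatim}
import numpy as np

nS, nA = 4, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
c = rng.uniform(0.0, 2.0, size=(nS, nA))
nu = np.zeros(nS); nu[:2] = 0.5          # probability measure on B = {0, 1}
x_hat = 0                                # fixed point in B for (5.3)

def S(f):
    """S f(x) = min_u [ c(x,u) + sum_y P(y|x,u) f(y) ]."""
    return np.min(c + P @ f, axis=-1)

V0 = np.zeros(nS)                        # common initial condition
V, Vbar, Vtil = V0.copy(), V0.copy(), V0.copy()
for _ in range(500):
    off1, off2, off3 = nu @ V, Vbar.min(), Vtil[x_hat]
    V    = S(V)    - off1                # (5.1)
    Vbar = S(Vbar) - off2                # (5.2)
    Vtil = S(Vtil) - off3                # (5.3)

print("offsets -> beta:", off1, off2, off3)
print("iterates differ by constants:", np.ptp(V - Vbar), np.ptp(V - Vtil))
\end{verbatim}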

Lemma 5.1. Provided that $\Phi_0 = V_0$, we have
\[
\Phi_n(x) - \Phi_n(y) \,=\, V_n(x) - V_n(y) \qquad \forall\, x,y\in X\,,\ \forall\, n\in\mathbb N\,, \tag{5.4}
\]
and the same applies if $V_n$ is replaced by $\bar V_n$ or $\widetilde V_n$. In addition, the convergence of $\Phi_n$ implies the convergence of $V_n$, and also that of $\bar V_n$ and $\widetilde V_n$, in the same space.

Proof. A straightforward calculation shows that
\[
V_n(x) - \Phi_n(x) \,=\, n\beta - \sum_{k=0}^{n-1}\nu(V_k)\,,
\]
from which (5.4) follows. The proofs for $\bar V_n$ and $\widetilde V_n$ are completely analogous.

We have
\[
V_{n+1}(x) - \Phi_{n+1}(x) \,=\, V_n(x) - \Phi_n(x) + \beta - \nu(V_n)\,,
\]
which implies that
\[
\nu(V_{n+1}) \,=\, \nu(\Phi_{n+1}) - \nu(\Phi_n) + \beta\,.
\]
Therefore, convergence of $\Phi_n$ implies that $\nu(V_n)\to\beta$ as $n\to\infty$. In turn, this implies the convergence of $V_n$ by (5.4). In the case of $\bar V_n$ we obtain $\min_X \bar V_n\to\beta$ as $n\to\infty$, and analogously for $\widetilde V_n$.
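The identity in the first display of the proof is purely algebraic: it uses only that $\mathcal S(f + a) = \mathcal S f + a$ for constant $a$. It is easy to confirm numerically; a sketch on an illustrative finite model (with \texttt{beta} a stand-in constant, since the identity holds for any choice):
\begin{verbatim}
import numpy as np

nS, nA, beta = 3, 2, 1.0
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
c = rng.uniform(0.0, 2.0, size=(nS, nA))
nu = np.full(nS, 1.0 / nS)

S = lambda f: np.min(c + P @ f, axis=-1)
V = np.zeros(nS)                  # RVI iterate (5.1)
Phi = np.zeros(nS)                # VI iterate (4.1), same initial condition
acc = 0.0                         # accumulates sum_k nu(V_k)
for n in range(1, 25):
    acc += nu @ V
    V, Phi = S(V) - nu @ V, S(Phi) - beta
    assert np.allclose(V - Phi, n * beta - acc)  # V_n - Phi_n = n*beta - sum
\end{verbatim}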

The following theorem, under hypotheses (a)–(b), is a direct consequence of Theorem 4.1, Corollary 4.1, and Lemma 5.1. Hypothesis (H2) is given at the beginning of the next section.

Theorem 5.1. Let one of the following assumptions be satisfied:
(a) (H1) holds and $V_0\in\mathcal V(\kappa)$ for some $\kappa\in\mathbb R$. Here, $\mathcal V(\kappa)$ is as defined in (4.8).
(b) (H1) holds, $V_\star$ is bounded, and $V_0\in L_b(X)$.
(c) (H2) holds and $V_0\in L(X)\cap O(V_\star)$.
Then the value iteration functions in (5.1)–(5.3) converge $\pi_\star$-a.e. to $V_\star$ as $n\to\infty$. In addition, if (c) holds, then the convergence is pointwise for all $x\in X$.

The assertions concerning (H2) are proved in the next section. They are included in Theorem 5.1 in order to give a unified statement.

6. Stability of the rolling horizon procedure

Consider the following hypothesis.

(H2) There exist constants $\theta_1>0$ and $\theta_2$ such that
\[
\min_{u\in U(x)} c(x,u) \,\ge\, \theta_1 V_\star(x) - \theta_2 \qquad \forall\, x\in X\,.
\]

Without loss of generality, we assume that $\theta_1\in(0,1)$.

Remark 6.1. Hypothesis (H2) can be written in the following equivalent, but seemingly more general, form.

(H2′) There exist $v\in U_{\mathrm{sm}}$ and a function $V_v\colon X\to[1,\infty)$ satisfying
\[
\min_{u\in U(x)} c(x,u) \,\ge\, \theta_1 V_v(x) - \theta_2 \qquad \forall\, x\in X\,,
\]
for some constants $\theta_1>0$ and $\theta_2$, and
\[
P_v V_v(x) - V_v(x) \,\le\, C\,\mathbb 1_B(x) - c_v(x)
\]
for some constant $C$.

It is clear that (H2) implies (H2′), while the converse follows by the stochastic representation of $V_v$ and Definition 3.2.

Concerning the value iteration, we have the following.

Theorem 6.1. Assume (H2), and suppose that the initial condition $\Phi_0$ lies in $L(X)\cap O(V_\star)$. Then there exists a constant $C_0$, depending on $\Phi_0$, such that
\[
\bigl|\Phi_n(x) - V_\star(x)\bigr| \,\le\, C_0\bigl(1 + (1-\theta_1)^n V_\star(x)\bigr) \qquad \forall\, x\in X\,,\ \forall\, n\in\mathbb N\,.
\]
In addition, $\Phi_n(x) - V_\star(x)$ converges to a constant $\pi_\star$-a.e. as $n\to\infty$.

Proof. Under (H2) we obtain
\[
P_\star V_\star(x) \,=\, \beta - c_\star(x) + V_\star(x) \,\le\, \beta + \theta_2 + (1-\theta_1) V_\star(x)\,. \tag{6.1}
\]
Let $\rho := 1-\theta_1$, and define
\[
f_n(x) \,:=\, \Phi_n(x) - (1-\rho^n)\Bigl(V_\star(x) - \tfrac{\beta+\theta_2}{\theta_1}\Bigr)\,.
\]
Recall (4.3) and (4.5). We have
\[
\begin{aligned}
f_{n+1}(x) - P_n f_n(x) \,&=\, \bar c_n(x) - \theta_1\rho^n\Bigl(V_\star(x) - \tfrac{\beta+\theta_2}{\theta_1}\Bigr) + (1-\rho^n)(P_n - I)V_\star(x)\\[2pt]
&\ge\, \bar c_n(x) - \theta_1\rho^n\Bigl(V_\star(x) - \tfrac{\beta+\theta_2}{\theta_1}\Bigr) - (1-\rho^n)\,\bar c_n(x)\\[2pt]
&=\, \rho^n\bigl(-\theta_1 V_\star(x) + \theta_2 + c_n(x)\bigr) \,\ge\, 0 \qquad \forall\,(x,n)\in X\times\mathbb N\,,
\end{aligned}
\]
where we also used (4.7); here $c_n(x) = c\bigl(x,v_n(x)\bigr)$. Iterating the above inequality, we get $f_n \ge \inf_X \Phi_0$ for all $n\in\mathbb N$. Assuming without loss of generality that $\Phi_0$ is nonnegative, this implies that
\[
(1-\rho^n)\Bigl(V_\star(x) - \tfrac{\beta+\theta_2}{\theta_1}\Bigr) \,\le\, \Phi_n(x)\,. \tag{6.2}
\]
On the other hand, by (4.9) and (6.1), we obtain
\[
\Phi_n(x) \,\le\, V_\star(x) + C_0\bigl(C_1 + \rho^n V_\star(x)\bigr) \tag{6.3}
\]
for some constants $C_0$ and $C_1$ which depend on $\Phi_0$. Since $\pi_\star(\Phi_n - V_\star)$ is bounded from above by (6.3), and bounded from below by (6.2), the result follows by the same argument as was used in the proof of Theorem 4.1.

Definition 6.1. We say that $v\in U_{\mathrm{sm}}$ is stabilizing if
\[
\limsup_{N\to\infty}\,\frac{1}{N}\, E^v_x\Biggl[\sum_{k=0}^{N-1} c_v(X_k)\Biggr] \,<\, \infty \qquad \forall\, x\in X\,,
\]
and we denote the class of stabilizing controls by $U_{\mathrm{stab}}$.
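The quantity in Definition 6.1 can be estimated by simulation; a sketch for a finite model follows (hypothetical sampler; in the general Polish-space setting the categorical draw would be replaced by a simulator of $P(\cdot\,|\,x,v(x))$).
\begin{verbatim}
import numpy as np

def average_cost_estimate(P, c, v, x0=0, N=100_000, seed=0):
    """Monte Carlo estimate of (1/N) E_x^v [ sum_{k<N} c_v(X_k) ],
    whose boundedness in N is what 'stabilizing' requires."""
    rng = np.random.default_rng(seed)
    x, total = x0, 0.0
    for _ in range(N):
        total += c[x, v[x]]
        x = rng.choice(len(P), p=P[x, v[x]])   # one step of P(.|x, v(x))
    return total / N

# Usage: with P of shape (nS, nA, nS), c of shape (nS, nA), and v an integer
# array of actions, average_cost_estimate(P, c, v) approximates the long-run
# average cost of the stationary Markov strategy v.
\end{verbatim}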

Recall the definition of $v_n$ in Notation 4.1. We have the following theorem.

Theorem 6.2. Under (H2), for every $\Phi_0\in L(X)\cap O(V_\star)$ there exists $N_0\in\mathbb N$ such that the stationary Markov control $v_n$ is stabilizing for any $n\ge N_0$.


Proof. Combining (4.4), (6.2), and (6.3), we obtain
\[
\begin{aligned}
P_n\Bigl((1-\rho^n)\Bigl(V_\star - \tfrac{\beta+\theta_2}{\theta_1}\Bigr)\Bigr)(x) \,&\le\, P_n\Phi_n(x) \,=\, -\bar c_n(x) + \Phi_{n+1}(x)\\[2pt]
&\le\, -\bar c_n(x) + V_\star(x) + C_0\bigl(C_1 + \rho^{n+1} V_\star(x)\bigr)\,.
\end{aligned}
\]
Rearranging, this gives
\[
(1-\rho^n)\,P_n V_\star(x) \,\le\, -\bar c_n(x) + (1-\rho^n)\tfrac{\beta+\theta_2}{\theta_1} + C_0 C_1 + (1+C_0\rho)\rho^n V_\star(x) + (1-\rho^n)V_\star(x)\,. \tag{6.4}
\]
From (H2) we have $V_\star \le \frac{\bar c_n + \beta + \theta_2}{\theta_1}$, and using this in (6.4) gives
\[
(1-\rho^n)\,P_n V_\star(x) \,\le\, -\Bigl(1 - \tfrac{1+C_0\rho}{\theta_1}\,\rho^n\Bigr)\bar c_n(x) + \bigl(1 + C_0\rho^{n+1}\bigr)\tfrac{\beta+\theta_2}{\theta_1} + C_0 C_1 + (1-\rho^n)V_\star(x)\,. \tag{6.5}
\]
Let $N_0\in\mathbb N$ be large enough such that $\frac{1+C_0\rho}{\theta_1}\rho^{N_0} < 1$. Using (6.5) as a telescoping sum, together with the fact that $V_\star$ is bounded from below in $X$, we obtain
\[
\limsup_{N\to\infty}\,\frac{1}{N}\,E^{v_n}_x\Biggl[\sum_{k=0}^{N-1} c_n(X_k)\Biggr] \,\le\, \beta + \frac{\bigl(1 + C_0\rho^{n+1}\bigr)(\beta+\theta_2)}{\theta_1 - \bigl(1 + C_0\rho\bigr)\rho^n} \qquad \forall\, n\ge N_0\,, \tag{6.6}
\]
with $c_n(x) := c\bigl(x,v_n(x)\bigr)$. This shows that $v_n$ is stabilizing for all $n\ge N_0$.

We improve the convergence result in Theorem 6.1.

Theorem 6.3. Assume (H2). Then, for every initial condition $\Phi_0\in L(X)\cap O(V_\star)$, the sequence $\Phi_n(x) - V_\star(x)$ converges pointwise to a constant.

Proof. Without loss of generality, we may translate the initial condition $\Phi_0$ by a constant so that $\Phi_n\to V_\star$ as $n\to\infty$ $\pi_\star$-a.s. Then, of course, $\Phi^{(1)}_n\to V^{(1)}_\star = 0$ on $B$. Recall also that these functions are constant on $B$. Let $\Psi_n := \Phi_n - V_\star$. Then $\bigl|\Psi^{(1)}_n\bigr| = \epsilon_n$ on $B$ for some sequence $\epsilon_n\to 0$. Also,
\[
\bigl|\Psi_n(x)\bigr| \,\le\, C_0\bigl(1 + (1-\theta_1)^n V_\star(x)\bigr) \qquad \forall\, x\in X \tag{6.7}
\]
by Theorem 6.1, and
\[
P^n_\star V_\star(x) \,\le\, \frac{\beta+\theta_2}{\theta_1} + (1-\theta_1)^n V_\star(x) \tag{6.8}
\]
by (6.1). Let $\tau_m$ denote the first exit time from the ball of radius $m$ centered at some fixed point $x_0$. Applying the optional sampling theorem to (4.9) relative to the stopping time $\tau\wedge n\wedge\tau_m$, we obtain
\[
\begin{aligned}
\Psi^{(0)}_{2n}(x) \,&\le\, E^{v_\star}_{(x,0)}\Bigl[\Psi^{(1)}_{2n-\tau}(X_\tau)\,\mathbb 1_{\{\tau\le n<\tau_m\}} + \Psi^{(0)}_n(X_n)\,\mathbb 1_{\{n<\tau<\tau_m\}} + \Psi^{(0)}_{2n-\tau_m}(X_{\tau_m})\,\mathbb 1_{\{\tau_m\le\tau\wedge n\}}\Bigr]\\[2pt]
&\le\, \sup_{k\ge n}\epsilon_k + E^{v_\star}_{(x,0)}\Bigl[\Psi^{(0)}_n(X_n)\,\mathbb 1_{\{\tau>n\}}\Bigr] + E^{v_\star}_{(x,0)}\Bigl[\Psi^{(0)}_{2n-\tau_m}(X_{\tau_m})\,\mathbb 1_{\{\tau_m\le n\}}\Bigr]\,.
\end{aligned} \tag{6.9}
\]
By (4.2), we have
\[
\bigl|\Psi^{(0)}_n\bigr| \,\le\, \frac{1}{1-\delta}\,\bigl|\Psi_n\bigr| + \frac{\delta\epsilon_n}{1-\delta}\,. \tag{6.10}
\]
Therefore,
\[
\limsup_{n\to\infty}\; E^{v_\star}_{(x,0)}\Bigl[\Psi^{(0)}_n(X_n)\,\mathbb 1_{\{\tau>n\}}\Bigr] \,\le\, 0
\]
by (6.7), (6.8), and (6.10), and the fact that $E^{v_\star}_{(x,0)}[\tau] < \infty$.

By a standard application of the optional sampling theorem to the Poisson equation $P_\star V_\star + c_\star = V_\star + \beta$, and keeping in mind that $V_\star$ is bounded from below, we obtain, for some constant $\kappa_1$,
\[
E^{v_\star}_x\bigl[V_\star(X_{\tau_m})\,\mathbb 1_{\{\tau_m\le n\}}\bigr] \,\le\, n\beta + \kappa_1 + V_\star(x) \qquad \text{for all } m,n\in\mathbb N\,. \tag{6.11}
\]
Therefore, combining (6.7) and (6.11), we see that there exists some constant $\kappa_2$ such that
\[
E^{v_\star}_{(x,0)}\Bigl[\Psi^{(0)}_{2n-\tau_m}(X_{\tau_m})\,\mathbb 1_{\{\tau_m\le n\}}\Bigr] \,\le\, \kappa_2\Bigl(P^{v_\star}_{(x,0)}(\tau_m\le n) + (1-\theta_1)^n\bigl(n\beta + \kappa_1 + V_\star(x)\bigr)\Bigr)
\]
for all $m,n\in\mathbb N$. This shows that the third term on the right-hand side of (6.9) tends to $0$ as $m\to\infty$ for any fixed $n\in\mathbb N$.

A slight modification of (6.9) also shows that $\limsup_{n\to\infty}\Psi^{(0)}_{2n+1}(x)\le 0$ for all $x\in X$. Thus we have established that for each initial condition $\Phi_0\in L(X)\cap O(V_\star)$ there exists a constant $\kappa_0$ such that
\[
\limsup_{n\to\infty}\,\Phi^{(0)}_n(x) \,\le\, \kappa_0 + V^{(0)}_\star(x) \quad \forall\, x\in X\,, \qquad\text{and}\qquad \Phi^{(1)}_n \xrightarrow[n\to\infty]{} \kappa_0\,. \tag{6.12}
\]

In the second part of the proof, we establish that $\liminf_{n\to\infty}\Phi^{(0)}_n$ agrees with the value of the superior limit in (6.12). Let $v^n$ denote the nonstationary Markov policy $(v_n, v_{n-1}, \dotsc, v_1)$, with $v_n$ as defined in Notation 4.1. We claim that
\[
P^{v^{2n}}_{(x,0)}[\tau > n] \xrightarrow[n\to\infty]{} 0\,. \tag{6.13}
\]
To prove the claim, we apply Dynkin's formula and Fatou's lemma to the value iteration over the split chain, to obtain
\[
\Phi^{(0)}_n(x) \,\ge\, E^{v^n}_{(x,0)}\Biggl[\sum_{k=0}^{\tau\wedge n-1}\bigl(c_{v_{n-k}}(X_k) - \beta\bigr)\Biggr] + E^{v^n}_{(x,0)}\Bigl[\Phi^{(1)}_{n-\tau}\,\mathbb 1_{\{\tau\le n\}} + \Phi^{(0)}_0(X_n)\,\mathbb 1_{\{\tau>n\}}\Bigr] \tag{6.14}
\]
for $x\in X$. By (A0), there exists some positive constant $\epsilon_0$ such that $c\bigl((x,0),u\bigr) \ge \beta + \epsilon_0$ for $(x,u)\in(B^c\times U)\cap K$. Therefore,
\[
\sum_{k=0}^{\tau\wedge n-1}\bigl(c_{v_{n-k}}(X_k) - \beta\bigr) \,\ge\, \epsilon_0\,(\tau\wedge n) - (\beta+\epsilon_0)\sum_{k=0}^{\tau\wedge n-1}\mathbb 1_{B\times\{0\}}(X_k)\,.
\]
Taking expectations and using Lemma 2.2, we see that (6.14) reduces to
\[
\Phi^{(0)}_n(x) \,\ge\, \epsilon_0\, E^{v^n}_{(x,0)}[\tau\wedge n] - (\beta+\epsilon_0)\delta + E^{v^n}_{(x,0)}\Bigl[\Phi^{(1)}_{n-\tau}\,\mathbb 1_{\{\tau\le n\}} + \Phi^{(0)}_0(X_n)\,\mathbb 1_{\{\tau>n\}}\Bigr]\,. \tag{6.15}
\]
We use (6.12) and the hypothesis that $\Phi^{(0)}_0$ is bounded from below to obtain from (6.15) that
\[
\limsup_{n\to\infty}\; E^{v^n}_{(x,0)}[\tau\wedge n] \,\le\, \kappa_3 + \frac{1}{\epsilon_0}\,V^{(0)}_\star(x)
\]
for some constant $\kappa_3$ which depends on $\Phi_0$. This establishes (6.13).

Continuing, we assume without loss of generality (as in the first part of the proof) that the initial condition $\Phi_0$ is translated by a constant so that $\kappa_0 = 0$ in (6.12). Recall that $\Psi_n = \Phi_n - V_\star$. Applying Dynkin's formula together with Fatou's lemma to (4.10) relative to the stopping time $\tau\wedge n$, we obtain
\[
\begin{aligned}
\Psi^{(0)}_{2n}(x) \,&\ge\, E^{v^{2n}}_{(x,0)}\Bigl[\Psi^{(1)}_{2n-\tau}(X_\tau)\,\mathbb 1_{\{\tau\le n\}} + \Psi^{(0)}_n(X_n)\,\mathbb 1_{\{\tau>n\}}\Bigr]\\[2pt]
&\ge\, -\sup_{k\ge n}\epsilon_k + E^{v^{2n}}_{(x,0)}\Bigl[\Psi^{(0)}_n(X_n)\,\mathbb 1_{\{\tau>n\}}\Bigr]\,.
\end{aligned} \tag{6.16}
\]

Let $N_0\in\mathbb N$ be such that $2\bigl(1 + C_0\rho\bigr)\rho^{N_0} < \theta_1$. From (H2) we have $\bar c_n \ge \theta_1 V_\star - \theta_2 - \beta$, which, when used in (6.4), yields
\[
P_n V_\star(x) \,\le\, \widetilde C_0 + \Bigl(1 - \frac{\theta_1}{2}\Bigr)V_\star(x) \qquad \forall\, n\ge N_0\,, \tag{6.17}
\]
with
\[
\widetilde C_0 \,:=\, \frac{\beta+\theta_2}{\theta_1} + \frac{1}{1-\rho^{N_0}}\bigl(C_0 C_1 + \beta + \theta_2\bigr)\,.
\]
In turn, (6.17) shows that
\[
E^{v^{2n}}_{(x,0)}\bigl[V_\star(X_n)\bigr] \,\le\, \frac{2\widetilde C_0}{\theta_1} + \Bigl(1 - \frac{\theta_1}{2}\Bigr)^{\!n} V_\star(x) \qquad \forall\, n\ge N_0\,. \tag{6.18}
\]

Shifting our attention to the split chain, it is clear from (6.7) and (6.18) that for some constant $\kappa_4$ we have
\[
E^{v^{2n}}_{(x,0)}\Bigl[\bigl|\Psi^{(0)}_n(X_n)\bigr|\Bigr] \,\le\, \kappa_4\biggl(1 + \Bigl(1 - \frac{\theta_1}{2}\Bigr)^{\!2n} V_\star(x)\biggr) \qquad \forall\, n\ge N_0\,. \tag{6.19}
\]
By (6.13) and (6.19), we obtain
\[
\liminf_{n\to\infty}\; E^{v^{2n}}_{(x,0)}\Bigl[\Psi^{(0)}_n(X_n)\,\mathbb 1_{\{\tau>n\}}\Bigr] \,\ge\, 0\,,
\]
which together with (6.16) shows that $\liminf_{n\to\infty}\Psi^{(0)}_{2n}(x)\ge 0$. Using Dynkin's formula for $\Psi^{(0)}_{2n+1}$ in an analogous manner to (6.16), we obtain the same conclusion for this function. Thus we have shown that
\[
\lim_{n\to\infty}\,\Psi^{(0)}_n(x) \,=\, 0\,,
\]
which completes the proof.

We next show that the sequence $\{v_n\}_{n\in\mathbb N}$ is asymptotically optimal.

Theorem 6.4. In addition to (H2), we assume the following:
(a) The running cost $c$ is inf-compact on $K$.
(b) There exists $\psi\in\mathcal P(X)$ such that, under the stabilizing policies $v_n$ in Theorem 6.2, the controlled chain is positive Harris recurrent and the corresponding invariant probability measures are absolutely continuous with respect to $\psi$.
Then for every $\Phi_0\in L(X)\cap O(V_\star)$ the sequence $\{v_n\}_{n\in\mathbb N}$ is asymptotically optimal, in the sense that
\[
\lim_{n\to\infty}\,\pi_n(c_n) \,=\, \beta\,,
\]
where $\pi_n$ denotes the invariant probability measure of the chain under the control $v_n$.

Proof. First, by (6.6), we have
\[
\pi_n(c_n) \,\le\, \beta + \frac{\bigl(1 + C_0\rho^{n+1}\bigr)(\beta+\theta_2)}{\theta_1 - \bigl(1 + C_0\rho\bigr)\rho^n} \qquad \forall\, n\ge N_0\,, \tag{6.20}
\]
with $N_0$ as in the proof of Theorem 6.2.

Combining (6.2), (6.3), and (H2), we obtain
\[
\begin{aligned}
|\Phi_{n+1} - \Phi_n| \,&<\, C_0 C_1 + \tfrac{\beta+\theta_2}{\theta_1} + (C_0+1)\rho^n\, V_\star(x)\\[2pt]
&\le\, C_0 C_1 + \tfrac{\beta+\theta_2}{\theta_1} + \theta_1^{-1}\theta_2(C_0+1)\rho^n + \theta_1^{-1}(C_0+1)\rho^n\, c_n\,.
\end{aligned} \tag{6.21}
\]
Writing (4.4) as
\[
P_n\Phi_n \,=\, -\bigl(\bar c_n - \Phi_{n+1} + \Phi_n\bigr) + \Phi_n\,, \tag{6.22}
\]
and combining this with (6.21), we obtain
\[
P_n\Phi_n \,\le\, \beta + C_0 C_1 + \tfrac{\beta+\theta_2}{\theta_1} + \theta_1^{-1}\theta_2(C_0+1)\rho^n - \bigl(1 - \theta_1^{-1}(C_0+1)\rho^n\bigr)c_n + \Phi_n\,. \tag{6.23}
\]
On the other hand, by (H2) and (6.3), we have
\[
c_n \,\ge\, \theta_1\bigl(1 + C_0\rho^n\bigr)^{-1}\bigl(\Phi_n - C_0 C_1\bigr) - \theta_2\,. \tag{6.24}
\]
Select $N_1$ such that $2(C_0+1)\rho^{N_1} < \theta_1$. Then (6.23) and (6.24) imply that there exists a constant $C_2$ such that
\[
P_n\Phi_n \,\le\, C_2 + \Bigl(1 - \frac{\theta_1}{2}\Bigr)\Phi_n \qquad \forall\, n\ge N_1\,. \tag{6.25}
\]
Note that $\Phi_{n+1} - \Phi_n \in O(c_n)$ by (6.21). Thus, (6.20), (6.22), and (6.25) imply that
\[
\pi_n\bigl(\bar c_n - \Phi_{n+1} + \Phi_n\bigr) \,=\, 0 \qquad \forall\, n\ge N_0\vee N_1\,. \tag{6.26}
\]
Since $\pi_n$ is absolutely continuous with respect to $\psi$ by hypothesis, and the sequence $\{\pi_n\}_{n\ge N_0}$ is tight by (6.20), the Radon–Nikodym derivatives $\frac{d\pi_n}{d\psi}$ are uniformly integrable with respect to $\psi$. Therefore, every sequence of $\Lambda_n := \frac{d\pi_n}{d\psi}$ contains a subsequence which converges in the topology $\sigma(L^1,L^\infty)$ by the Dunford–Pettis theorem [16, p. 27-II]. Consider such a subsequence, which we denote as $\{\Lambda_n\}_{n\in\mathbb N}$ for simplicity, and let $\Lambda$ be its limit. Define $\pi(A) := \psi(\mathbb 1_A\,\Lambda)$ for $A\in\mathcal B(X)$. For every $f\in C_b(X)$ we have
\[
\pi_n(f) \,=\, \psi(f\,\Lambda_n) \xrightarrow[n\to\infty]{} \psi(f\,\Lambda) \,=\, \pi(f)\,.
\]
On the other hand, we have
\[
\pi_n(A) \,=\, \psi(\mathbb 1_A\,\Lambda_n) \xrightarrow[n\to\infty]{} \psi(\mathbb 1_A\,\Lambda)\,, \tag{6.27}
\]
where $\pi_n(A) = \pi_n(\mathbb 1_A)$ by a liberal use of the notation. Since $\psi(f\,\Lambda) = \pi(f)$ for all $f\in C_b(X)$, it follows of course that $\psi(\mathbb 1_A\,\Lambda) = \pi(A)$. Thus $\lim_{n\to\infty}\pi_n(A) = \pi(A)$ by (6.27). Let $f_n := \Phi_{n+1} - \Phi_n$. Then
\[
\pi_n\bigl(|f_n| > \epsilon\bigr) \,=\, \int_{\{|f_n|>\epsilon\}}\Lambda_n\,d\psi \,\le\, \sup_{m\in\mathbb N}\,\int_{\{|f_n|>\epsilon\}}\Lambda_m\,d\psi \xrightarrow[n\to\infty]{} 0 \tag{6.28}
\]
by uniform integrability, since $\psi\bigl(|f_n| > \epsilon\bigr)\to 0$ as $n\to\infty$ by Theorem 6.1. Equation (6.28) implies that $f_n\to 0$ in $\pi_n$-measure in the sense of [38, p. 385]. It is also straightforward to verify, using (6.20) and (6.21) and the inf-compactness of $c$, that $f_n$ is tightly and uniformly $\pi_n$-integrable in the sense of the definitions [38, (2.4)–(2.5)]. Hence,
\[
\pi_n(f_n) \xrightarrow[n\to\infty]{} 0 \tag{6.29}
\]
by [38, Theorem 2.8]. Since (6.29) holds over any sequence over which $\Lambda_n$ converges in $\sigma(L^1,L^\infty)$, it is clear that it must hold over the original sequence $\{n\}$. The result then follows by (6.26) and (6.29).

Remark 6.2. Concerning the positive Harris assumption in Theorem 6.4, it is clear that the Lyapunov equation (6.25) implies that the controlled chain is bounded in probability. If, in addition, the chain is a $\psi$-irreducible T-model (see [39, p. 177]), then it is positive Harris recurrent [39, Theorem 3.4].

The results in Theorems 6.2 and 6.4 justify, in particular, the use of $v_n$ for large $n$ as a 'rolling horizon' approximation of an optimal long-run average policy, as is often done in Model Predictive Control, a popular approach in control engineering practice (see, e.g., [31]), wherein one works with time horizons of duration $T\gg 1$, and at each time instant $t$ the Markov control strategy optimal for the finite horizon control problem on the horizon $\{t, t+1, \dotsc, t+T\}$ is used.
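In the finite-model setting, the rolling horizon strategy just described amounts to freezing the VI selector at some large step; a minimal sketch (assuming the model arrays and the optimal average cost \texttt{beta} are given as in the earlier snippets):
\begin{verbatim}
import numpy as np

def rolling_horizon_policy(P, c, beta, T):
    """Run T >= 1 steps of the VI (4.1) and return the last selector v_T,
    to be used as a stationary rolling horizon strategy (cf. Theorem 6.2)."""
    Phi = np.zeros(P.shape[0])
    for _ in range(T):
        Q = c - beta + P @ Phi        # Q(x,u) = cbar(x,u) + P_u Phi(x)
        Phi = Q.min(axis=-1)
    return Q.argmin(axis=-1)

# Subtracting beta does not change the argmin, so the same selector is
# produced by the plain T-horizon dynamic programming recursion.
\end{verbatim}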

We present an important class of problems for which (H2) is satisfied.

Example 6.1. Consider a linear quadratic Gaussian (LQG) system
\[
\begin{aligned}
X_{t+1} \,&=\, A X_t + B U_t + D W_t\,, \qquad t\ge 0\,,\\
X_0 \,&\sim\, \mathcal N(x_0,\Sigma_0)\,,
\end{aligned} \tag{6.30}
\]
where $X_t\in\mathbb R^d$ is the system state, $U_t\in\mathbb R^{d_u}$ is the control, $W_t\in\mathbb R^{d_w}$ is a white noise process, and $\mathcal N(x,\Sigma)$ denotes the normal distribution in $\mathbb R^d$ with mean $x$ and covariance matrix $\Sigma$. We assume that the $W_t\sim\mathcal N(0,I_{d_w})$ are i.i.d. and independent of $X_0$, and that $(A,B)$ is stabilizable. The system is observed via a finite number of sensors scheduled, or queried, by the controller at each time step. Let $\gamma_t$ be a Bernoulli process indicating whether the data is lost in the network: each observation is either received ($\gamma_t = 1$) or lost ($\gamma_t = 0$). A scheduled sensor attempts to send information to the controller through the network; depending on the state of the network, the information may be received or lost. The query process $Q_t$ takes values in the finite set of allowable sensor queries, denoted by $\mathcal Q$. The observation process $Y_t$ is given by
\[
Y_t \,=\, \gamma_t\bigl(C_{Q_{t-1}} X_t + F_{Q_{t-1}} W_t\bigr)\,, \qquad t\ge 1\,, \tag{6.31}
\]
if $\gamma_t = 1$; otherwise, no observation is received. The value of $\gamma_t$ is assumed to be known to the controller at every time step. In (6.31), $C_q$ and $F_q$ are matrices which depend on the query $q\in\mathcal Q$. Their dimension is not fixed but depends on the number of sensors queried by $q$.

For each query $q\in\mathcal Q$, we assume that $\det(F_q F_q^{\mathsf T})\ne 0$ and (primarily to simplify the analysis) that $D F_q^{\mathsf T} = 0$. Also, without loss of generality, we assume that $B$ is full rank; if not, we restrict control actions to the row space of $B$.

The observed information is lost with a probability that depends on the query, that is,
\[
P(\gamma_{t+1} = 0) \,=\, \lambda(Q_t)\,, \tag{6.32}
\]
where the loss rate $\lambda\colon\mathcal Q\to[0,1)$.

The running cost is the sum of a positive querying cost $c\colon\mathcal Q\to\mathbb R$ and a quadratic plant cost $c_p\colon\mathbb R^d\times\mathbb R^{d_u}\to\mathbb R$ given by
\[
c_p(x,u) \,=\, x^{\mathsf T} R x + u^{\mathsf T} M u\,,
\]
where $R, M\in\mathcal M^+$. Here, $\mathcal M^+$ ($\mathcal M^+_0$) denotes the cone of real symmetric, positive definite (positive semi-definite) $d\times d$ matrices.

The system evolves as follows. At each time $t$, the controller takes an action $v_t = (U_t, Q_t)$, and the system state evolves as in (6.30). Then the observation at $t+1$ is either lost or received, as determined by (6.31) and (6.32). The decision $v_t$ is non-anticipative; that is, it depends only on the history $\mathcal F_t$ of observations up to time $t$, defined by
\[
\mathcal F_t \,:=\, \sigma\bigl(x_0, \Sigma_0, Y_1, \gamma_1, \dotsc, Y_t, \gamma_t\bigr)\,.
\]
This model is an extension of the one studied in [43]. More details can be found in [9], which considers an even broader class of problems where the loss rate depends on the 'network congestion'.

We convert the partially observed controlled Markov chain in (6.30)–(6.32) to an equivalent completely observed one. Standard linear estimation theory tells us that the conditional mean $\widehat X_t := E[X_t\,|\,\mathcal F_t]$ is a sufficient statistic. Let $\Pi_t$ denote the error covariance matrix, given by
\[
\Pi_t \,=\, \operatorname{cov}(X_t - \widehat X_t) \,=\, E\bigl[(X_t - \widehat X_t)(X_t - \widehat X_t)^{\mathsf T}\bigr]\,.
\]

The state estimate $\widehat X_t$ can be recursively calculated via the Kalman filter
\[
\widehat X_{t+1} \,=\, A\widehat X_t + B U_t + \mathcal K_{Q_t,\gamma_{t+1}}(\Pi_t)\bigl(Y_{t+1} - C_{Q_t}(A\widehat X_t + B U_t)\bigr)\,, \tag{6.33}
\]
with $\widehat X_0 = x_0$. The Kalman gain $\mathcal K_{q,\gamma}$ is given by
\[
\begin{aligned}
\mathcal K_{q,\gamma}(\Pi) \,&:=\, \Xi(\Pi)\,\gamma C_q^{\mathsf T}\bigl(\gamma^2 C_q\Xi(\Pi)C_q^{\mathsf T} + F_q F_q^{\mathsf T}\bigr)^{-1}\,,\\
\Xi(\Pi) \,&:=\, D D^{\mathsf T} + A\Pi A^{\mathsf T}\,,
\end{aligned}
\]
and the error covariance evolves on $\mathcal M^+_0$ as
\[
\Pi_{t+1} \,=\, \Xi(\Pi_t) - \mathcal K_{Q_t,\gamma_{t+1}}(\Pi_t)\,C_{Q_t}\,\Xi(\Pi_t)\,, \qquad \Pi_0 = \Sigma_0\,.
\]
When an observation is lost ($\gamma_{t+1} = 0$), the gain $\mathcal K_{q,\gamma_{t+1}} = 0$ and the observer (6.33) simply evolves without any correction term.
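A direct transcription of the covariance dynamics (the matrices $A$, $D$, $C_q$, $F_q$ are assumed given; this is an illustrative sketch, not part of the analysis):
\begin{verbatim}
import numpy as np

def Xi(Pi, A, D):
    """One-step prediction map Xi(Pi) = D D^T + A Pi A^T."""
    return D @ D.T + A @ Pi @ A.T

def gain(Pi, A, D, Cq, Fq, gamma):
    """Kalman gain K_{q,gamma}(Pi); it vanishes when gamma = 0."""
    X = Xi(Pi, A, D)
    return gamma * X @ Cq.T @ np.linalg.inv(
        gamma**2 * Cq @ X @ Cq.T + Fq @ Fq.T)

def next_cov(Pi, A, D, Cq, Fq, gamma):
    """Pi_{t+1} = Xi(Pi_t) - K_{q,gamma}(Pi_t) C_q Xi(Pi_t)."""
    X = Xi(Pi, A, D)
    return X - gain(Pi, A, D, Cq, Fq, gamma) @ Cq @ X
\end{verbatim}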

Define $T_q\colon\mathcal M^+_0\to\mathcal M^+_0$ by
\[
T_q(\Pi) \,:=\, \Xi(\Pi) - \mathcal K_{q,1}(\Pi)\,C_q\,\Xi(\Pi)\,, \qquad q\in\mathcal Q\,,
\]
and an operator $\mathcal T_q$ on functions $f\colon\mathcal M^+_0\to\mathbb R$ by
\[
\mathcal T_q f(\Pi) \,:=\, \bigl(1-\lambda(q)\bigr)\, f\bigl(T_q(\Pi)\bigr) + \lambda(q)\, f\bigl(\Xi(\Pi)\bigr)\,.
\]
It is clear then that $\Pi_t$ forms a completely observed controlled Markov chain on $\mathcal M^+_0$, with action space $\mathcal Q$ and kernel $\mathcal T_q$. Admissible and Markov policies are defined as usual, but with $v_t = Q_t$, since the evolution of $\Pi_t$ does not depend on the state control $U_t$.

As shown in [43], there is a partial separation of control and observation for the ergodic control problem which seeks to minimize the long-term average cost
\[
J^v \,:=\, \limsup_{T\to\infty}\,\frac{1}{T}\, E^v\Biggl[\sum_{t=0}^{T-1}\bigl(c(Q_t) + c_p(X_t,U_t)\bigr)\Biggr]\,.
\]
The dynamic programming equation is given by
\[
V_\star(\Pi) + \varrho_* \,=\, \min_{q\in\mathcal Q}\,\bigl\{c(q) + \operatorname{trace}\bigl(\widetilde\Pi_*\Pi\bigr) + \mathcal T_q V_\star(\Pi)\bigr\}\,, \tag{6.34}
\]
with $\widetilde\Pi_* := R - \Pi_* + A^{\mathsf T}\Pi_* A$, and $\Pi_*\in\mathcal M^+$ the unique solution of the algebraic Riccati equation
\[
\Pi_* \,=\, R + A^{\mathsf T}\Pi_* A - A^{\mathsf T}\Pi_* B\bigl(M + B^{\mathsf T}\Pi_* B\bigr)^{-1} B^{\mathsf T}\Pi_* A\,.
\]
If $q_*\colon\mathcal M^+_0\to\mathcal Q$ is a selector of the minimizer in (6.34), then the policy given by $v_* = \bigl\{U^*_t,\, q_*(\Pi_t)\bigr\}_{t\ge 0}$, with
\[
U^*_t \,:=\, -K_*\widehat X_t\,, \qquad K_* \,:=\, \bigl(M + B^{\mathsf T}\Pi_* B\bigr)^{-1} B^{\mathsf T}\Pi_* A\,,
\]
and $\widehat X_t$ as in (6.33), is optimal and satisfies
\[
J^{v_*} \,=\, \inf_v J^v \,=\, \varrho_* + \operatorname{trace}\bigl(\Pi_* D D^{\mathsf T}\bigr)\,.
\]
In addition, the querying component of any optimal stationary Markov policy is an a.e. selector of the minimizer in (6.34).
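The Riccati solution $\Pi_*$ and the gain $K_*$ can be computed by fixed-point iteration; a sketch with illustrative matrices (a hypothetical stabilizable pair $(A,B)$, not from the paper):
\begin{verbatim}
import numpy as np

def riccati(A, B, R, M, iters=1000):
    """Iterate Pi <- R + A^T Pi A - A^T Pi B (M + B^T Pi B)^{-1} B^T Pi A."""
    Pi = R.copy()
    for _ in range(iters):
        G = np.linalg.inv(M + B.T @ Pi @ B)
        Pi = R + A.T @ Pi @ A - A.T @ Pi @ B @ G @ B.T @ Pi @ A
    return Pi

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # illustrative plant
B = np.array([[0.0], [0.1]])
R, M = np.eye(2), np.eye(1)
Pi_star = riccati(A, B, R, M)
K_star = np.linalg.inv(M + B.T @ Pi_star @ B) @ B.T @ Pi_star @ A
print(K_star)                            # U_t^* = -K_star @ Xhat_t
\end{verbatim}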

The analysis of the problem also shows that $V_\star$ is concave and non-decreasing on $\mathcal M^+_0$, and thus
\[
V_\star(\Sigma) \,\le\, m^*_1\operatorname{trace}(\Sigma) + m^*_0 \tag{6.35}
\]
for some positive constants $m^*_1$ and $m^*_0$. Note that the running cost corresponding to the equivalent completely observed problem is
\[
r(q,\Sigma) \,:=\, c(q) + \operatorname{trace}\bigl(\widetilde\Pi_*\Sigma\bigr)\,. \tag{6.36}
\]
It thus follows by (6.35) and (6.36), and the fact that $\widetilde\Pi_*\in\mathcal M^+$, that (H2) is satisfied for this problem. Note also that the RVI and VI are given by
\[
\begin{aligned}
\widehat\varphi_{n+1}(\Sigma) \,&=\, \min_{q\in\mathcal Q}\,\bigl\{r(q,\Sigma) + \mathcal T_q\widehat\varphi_n(\Sigma)\bigr\} - \widehat\varphi_n(0)\,,\\
\varphi_{n+1}(\Sigma) \,&=\, \min_{q\in\mathcal Q}\,\bigl\{r(q,\Sigma) + \mathcal T_q\varphi_n(\Sigma)\bigr\} - \varrho_*\,, \qquad \varphi_0 = \widehat\varphi_0\,,
\end{aligned}
\]
respectively, where both algorithms are initialized with the same function $\widehat\varphi_0\colon\mathcal M^+_0\to\mathbb R_+$.


Acknowledgements

Most of this work was done during the visits of AA to the Department of Electrical Engineering of the Indian Institute of Technology Bombay, and of VB to the Department of Electrical and Computer Engineering at the University of Texas at Austin; the finishing touches were given when both authors were at the Institute of Mathematics of the Polish Academy of Sciences (IMPAN) in Warsaw, for a workshop during the 2019 Simons Semester on Stochastic Modeling and Control. The work of AA was supported in part by the National Science Foundation through grant DMS-1715210, and in part by the Army Research Office through grant W911NF-17-1-001, while the work of VB was supported by a J. C. Bose Fellowship. VB acknowledges some early discussions with Prof. Debasish Chatterjee which spurred some of this work.

References

[1] M. Agarwal, V. S. Borkar, and A. Karandikar, Structural properties of optimal transmission policies over a randomly varying channel, IEEE Trans. Automat. Control 53 (2008), no. 6, 1476–1491. MR2451236
[2] A. Arapostathis, V. S. Borkar, and M. K. Ghosh, Ergodic control of diffusion processes, Encyclopedia Math. Appl., vol. 143, Cambridge University Press, Cambridge, 2012. MR2884272
[3] A. Arapostathis, V. S. Borkar, and K. S. Kumar, Convergence of the relative value iteration for the ergodic control problem of nondegenerate diffusions under near-monotone costs, SIAM J. Control Optim. 52 (2014), no. 1, 1–31. MR3148068
[4] K. B. Athreya and P. Ney, A new approach to the limit theory of recurrent Markov chains, Trans. Amer. Math. Soc. 245 (1978), 493–501. MR511425
[5] D. P. Bertsekas and S. E. Shreve, Stochastic optimal control: The discrete time case, Math. in Science and Engineering, vol. 139, Academic Press, Inc., New York-London, 1978. MR511544
[6] V. S. Borkar, A convex analytic approach to Markov decision processes, Probab. Theory Related Fields 78 (1988), no. 4, 583–602. MR950347
[7] V. S. Borkar, Topics in controlled Markov chains, Pitman Research Notes in Mathematics Series, vol. 240, Longman Scientific & Technical, Harlow, 1991. MR1110145
[8] V. S. Borkar, Convex analytic methods in Markov decision processes (E. A. Feinberg and A. Shwartz, eds.), Internat. Ser. Oper. Res. Management Sci., vol. 40, Kluwer Acad. Publ., Boston, MA, 2002. MR1887208
[9] J. Carroll, H. Hmedi, and A. Arapostathis, Optimal scheduling of multiple sensors which transmit measurements over a dynamic lossy network, Proceedings of the 58th IEEE Conference on Decision and Control (Nice, France), 2019, pp. 684–689.
[10] R. Cavazos-Cadena, Value iteration in a class of communicating Markov decision chains with the average cost criterion, SIAM J. Control Optim. 34 (1996), no. 6, 1848–1873. MR1416491
[11] R. Cavazos-Cadena, A note on the convergence rate of the value iteration scheme in controlled Markov chains, Systems Control Lett. 33 (1998), no. 4, 221–230. MR1613070
[12] D. Chatterjee and J. Lygeros, On stability and performance of stochastic predictive control techniques, IEEE Trans. Automat. Control 60 (2015), no. 2, 509–514. MR3310179
[13] R.-R. Chen and S. Meyn, Value iteration and optimization of multiclass queueing networks, Queueing Systems Theory Appl. 32 (1999), no. 1-3, 65–97. MR1720550
[14] O. L. V. Costa and F. Dufour, Average control of Markov decision processes with Feller transition probabilities and general action spaces, J. Math. Anal. Appl. 396 (2012), no. 1, 58–69. MR2956943
[15] E. Della Vecchia, S. Di Marco, and A. Jean-Marie, Illustrated review of convergence conditions of the value iteration algorithm and the rolling horizon procedure for average-cost MDPs, Ann. Oper. Res. 199 (2012), 193–214. MR2971812
[16] C. Dellacherie and P.-A. Meyer, Probabilities and potential, North-Holland Mathematics Studies, vol. 29, North-Holland Publishing Co., Amsterdam-New York. MR521810
[17] E. B. Dynkin and A. A. Yushkevich, Controlled Markov processes, Grundlehren Math. Wiss., vol. 235, Springer-Verlag, Berlin-New York, 1979. MR554083
[18] S. N. Ethier and T. G. Kurtz, Markov processes. Characterization and convergence, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, Inc., New York, 1986. MR838085
[19] E. A. Feinberg, P. O. Kasyanov, and Y. Liang, Fatou's lemma for weakly converging measures under the uniform integrability condition, Theory Probab. Appl. 64 (2019), no. 4, 615–630. MR4030821
[20] E. A. Feinberg and Y. Liang, On the optimality equation for average cost Markov decision processes and its validity for inventory control, Ann. Oper. Res. 64 (2017), no. 4, 771–790.
[21] E. A. Feinberg, P. O. Kasyanov, and N. V. Zadoianchuk, Average cost Markov decision processes with weakly continuous transition probabilities, Math. Oper. Res. 37 (2012), no. 4, 591–607. MR2997893
[22] E. A. Feinberg, P. O. Kasyanov, and N. V. Zadoianchuk, Berge's theorem for noncompact image sets, J. Math. Anal. Appl. 397 (2013), no. 1, 255–259. MR2968988
[23] E. Fernandez-Gaucherand, A. Arapostathis, and S. I. Marcus, Remarks on the existence of solutions to the average cost optimality equation in Markov decision processes, Systems Control Lett. 15 (1990), no. 5, 425–432. MR1084585
[24] E. Fernandez-Gaucherand, A. Arapostathis, and S. I. Marcus, Convex stochastic control problems, Proceedings of the 31st IEEE Conference on Decision and Control (Tucson, AZ), 1992, pp. 2179–2180.
[25] R. Z. Has'minskii, Stochastic stability of differential equations, Sijthoff & Noordhoff, Alphen aan den Rijn, 1980. MR600653
[26] O. Hernandez-Lerma and J. B. Lasserre, Error bounds for rolling horizon policies in discrete-time Markov control processes, IEEE Trans. Automat. Control 35 (1990), no. 10, 1118–1124. MR1073256
[27] O. Hernandez-Lerma, Average optimality in dynamic programming on Borel spaces—unbounded costs and controls, Systems Control Lett. 17 (1991), no. 3, 237–242. MR1125975
[28] O. Hernandez-Lerma and J. B. Lasserre, Discrete-time Markov control processes. Basic optimality criteria, Appl. Math. (N. Y.), vol. 30, Springer-Verlag, New York, 1996. MR1363487
[29] R. A. Howard, Dynamic programming and Markov processes, The Technology Press of M.I.T., Cambridge, Mass.; John Wiley & Sons, Inc., New York-London, 1960. MR0118514
[30] A. Jaskiewicz and A. S. Nowak, On the optimality equation for average cost Markov control processes with Feller transition probabilities, J. Math. Anal. Appl. 316 (2006), no. 2, 495–509. MR2206685
[31] A. Mesbah, Stochastic model predictive control: An overview and perspectives for future research, IEEE Control Systems Magazine 36 (2016), no. 6, 30–44.
[32] S. Meyn and R. L. Tweedie, Markov chains and stochastic stability, 2nd edition, Cambridge University Press, Cambridge, 2009. MR2509253
[33] S. P. Meyn, The policy iteration algorithm for average reward Markov decision processes with general state space, IEEE Trans. Automat. Control 42 (1997), no. 12, 1663–1680. MR1490975
[34] J. Neveu, Mathematical foundations of the calculus of probability, Holden-Day, Inc., San Francisco, Calif.-London-Amsterdam, 1965. MR0198505
[35] E. Nummelin, A splitting technique for Harris recurrent Markov chains, Z. Wahrsch. Verw. Gebiete 43 (1978), no. 4, 309–318. MR0501353
[36] E. Nummelin, General irreducible Markov chains and nonnegative operators, Cambridge Tracts in Math., vol. 83, Cambridge University Press, Cambridge, 1984. MR776608
[37] M. Schal, Average optimality in dynamic programming with general state space, Math. Oper. Res. 18 (1993), no. 1, 163–172. MR1250112
[38] R. Serfozo, Convergence of Lebesgue integrals with varying measures, Sankhya Ser. A 44 (1982), no. 3, 380–402. MR705462
[39] R. L. Tweedie, Topological conditions enabling use of Harris methods in discrete and continuous time, Acta Appl. Math. 34 (1994), no. 1-2, 175–188. MR1273853
[40] O. Vega-Amaya, The average cost optimality equation: a fixed point approach, Bol. Soc. Mat. Mexicana (3) 9 (2003), no. 1, 185–195. MR1988598
[41] O. Vega-Amaya, Solutions of the average cost optimality equation for Markov decision processes with weakly continuous kernel: the fixed-point approach revisited, J. Math. Anal. Appl. 464 (2018), no. 1, 152–163. MR3794081
[42] D. J. White, Dynamic programming, Markov chains, and the method of successive approximations, J. Math. Anal. Appl. 6 (1963), 373–376. MR0148480
[43] W. Wu and A. Arapostathis, Optimal sensor querying: general Markovian and LQG models with controlled observations, IEEE Trans. Automat. Control 53 (2008), no. 6, 1392–1405. MR2451230
[44] H. Yu, On linear programming for constrained and unconstrained average-cost Markov decision processes with countable action spaces and strictly unbounded costs, ArXiv e-prints 1905.12095 (2019), available at https://arxiv.org/abs/1905.12095.
[45] H. Yu, On the minimum pair approach for average cost Markov decision processes with countable discrete action spaces and strictly unbounded costs, SIAM J. Control Optim. 58 (2020), no. 2, 660–685. MR4074005