
Finite-Sample Analysis for SARSA and Q-Learning with Linear Function Approximation

Shaofeng Zou 1 Tengyu Xu 2 Yingbin Liang 2

Abstract

Though the convergence of major reinforcement learning algorithms has been extensively studied, the finite-sample analysis to further characterize the convergence rate in terms of the sample complexity for problems with continuous state space is still very limited. Such a type of analysis is especially challenging for algorithms with dynamically changing learning policies and under non-i.i.d. sampled data. In this paper, we present the first finite-sample analysis for the SARSA algorithm and its minimax variant (for zero-sum Markov games), with a single sample path and linear function approximation. To establish our results, we develop a novel technique to bound the gradient bias for dynamically changing learning policies, which can be of independent interest. We further provide finite-sample bounds for Q-learning and its minimax variant. Comparison of our result with the existing finite-sample bound indicates that linear function approximation achieves order-level lower sample complexity than the nearest neighbor approach.

1. Introduction

The major success of reinforcement learning (RL) lies in the development of various algorithms for finding policies that attain desirable (often optimal) cumulative rewards over time (Sutton & Barto, 2018). In particular, model-free examples, which do not assume knowledge of the underlying Markov decision process (MDP), include the temporal-difference (TD) (Sutton, 1988), SARSA (Rummery & Niranjan, 1994), Q-learning (Watkins & Dayan, 1992), and more recently deep Q-network (DQN) (Mnih et al., 2015) and actor-critic (A3C) (Mnih et al., 2016) algorithms. Among them, algorithms that incorporate parametrized function approximation gain extensive attention due to their efficiency and scalability under continuous state space in practice.

1 Department of Electrical Engineering, University at Buffalo, The State University of New York, New York, USA. 2 Department of Electrical and Computer Engineering, The Ohio State University, Ohio, USA. Correspondence to: Shaofeng Zou <[email protected]>.

Focusing on the basic TD, SARSA and Q-learning algorithms (see Section 1.2 for a summary of their variants) with continuous state space and with function approximation, the theoretical asymptotic convergence has been established for TD with linear function approximation in (Tsitsiklis & Roy, 1997), for Q-learning and SARSA with linear function approximation in (Melo et al., 2008), and for Q-learning with kernel-based approximation in (Ormoneit & Glynn, 2002; Ormoneit & Sen, 2002). Furthermore, the finite-sample analysis of the convergence rate in terms of the sample complexity has been provided for TD with function approximation in (Dalal et al., 2018a; Lakshminarayanan & Szepesvari, 2018), under the assumption that sample observations are independent and identically distributed (i.i.d.).

Under the non-i.i.d. assumption, (Bhandari et al., 2018) recently provided a finite-sample analysis for the TD algorithm with linear function approximation, which was further shown to be valid without modification for Q-learning in high-dimensional optimal stopping problems. Furthermore, (Shah & Xie, 2018) proposed a Q-learning algorithm based on the nearest neighbor approach. In fact, the finite-sample analysis for RL algorithms under the non-i.i.d. assumption is still a largely open direction, and the focus of this paper is on the following three open and fundamental problems.

• Under non-i.i.d. observations, existing studies provided finite-sample analysis only for TD and Q-learning algorithms, where samples are taken under a fixed policy. The existing analysis tools are not sufficient to handle the additional challenges due to dynamically changing sample distributions arising in algorithms such as SARSA.

• The finite-sample analysis in (Shah & Xie, 2018) for Q-learning with the nearest neighbor approach relies heavily on state discretization, which can suffer from slow convergence when a high-accuracy solution is required in practice. It is thus of interest to provide finite-sample analysis for Q-learning with linear function approximation, which requires a different analysis from that in (Shah & Xie, 2018).

• The finite-sample analysis for two-player zero-sum MDP games has been provided for a deep Q-learning model in (Yang et al., 2019) (see a summary of other studies in Section 1.2), but under i.i.d. observations. This motivates us to provide the finite-sample analysis for minimax SARSA and Q-learning algorithms under non-i.i.d. observations.

1.1. Contributions

Our contributions are summarized as follows.

• We develop the first finite-sample analysis for the on-policy algorithm SARSA with a continuous state space and linear function approximation, which is applicable to a single sample path and non-i.i.d. data. To accomplish this analysis, we propose a novel technique to handle on-policy algorithms with dynamically changing policies, which may be of independent interest. Existing studies (e.g., (Bhandari et al., 2018)) on a fixed policy exploited the uniform ergodicity of the Markov chain to decouple the dependency on the Markovian noise. But the uniform ergodicity does not hold generally for the Markov chain induced by a dynamically changing policy. Our approach constructs an auxiliary uniformly ergodic Markov chain to approximate the true MDP to facilitate the analysis.

• We further develop the finite-sample analysis for Q-learning with a continuous state space and linear function approximation. In contrast to many existing studies which assumed i.i.d. samples, our analysis is applicable to the online case with a single sample path and non-i.i.d. data. Furthermore, our analysis of both SARSA and Q-learning indicates that linear function approximation yields an order-level faster convergence rate than the nearest neighbor approach (Shah & Xie, 2018).

• By leveraging the aforementioned technique we develop for single-agent algorithms, we further provide the first finite-sample analysis for the minimax SARSA and Q-learning algorithms for two-player zero-sum Markov games with a single sample path and non-i.i.d. samples.

1.2. Related Work

Due to a vast amount of literature on theoretical analysis of RL algorithms, we here focus only on highly relevant studies, which investigated model-free RL algorithms for solving continuous state-space MDP problems.

Fitted value iteration algorithms: The least-squares temporal difference learning (LSTD) algorithms have been extensively studied in (Bradtke & Barto, 1996; Boyan, 2002; Munos & Szepesvari, 2008; Lazaric et al., 2010; Ghavamzadeh et al., 2010; Pires & Szepesvari, 2012; Prashanth et al., 2013; Tagorti & Scherrer, 2015; Tu & Recht, 2018) and references therein. These algorithms follow the TD type of update, with each iteration solving a least-squares regression problem based on batch data in order to fit the approximate function model. A differential LSTD algorithm was recently proposed and studied in (Devraj et al., 2018).

Fitted policy iteration algorithms: Approximate (fitted) policy iteration (API) algorithms further extend fitted value iteration with policy improvement. Several variants of such a type were studied, which adopt different objective functions, including least-squares policy iteration (LSPI) algorithms in (Lagoudakis & Parr, 2003; Lazaric et al., 2012; Yang et al., 2019), fitted policy iteration based on Bellman residual minimization (BRM) in (Antos et al., 2008; Farahmand et al., 2010), and the classification-based policy iteration algorithm in (Lazaric et al., 2016).

Our study here focuses on the SARSA and Q-learning algorithms in the online case, with each iteration updated based on one data sample, whereas the above fitted value and policy iteration algorithms use the full batch of data samples at each iteration for model fitting. Hence, our analysis tools are very different from those for these two types of algorithms.

Gradient TD algorithms: The off-policy gradient TD algorithm with linear function approximation was proposed in (Sutton et al., 2009a;b) based on gradient descent for minimizing the mean-square projected Bellman error. The convergence of gradient TD algorithms was established in (Sutton et al., 2009a;b), and the finite-sample analysis was further provided in (Dalal et al., 2018b; Liu et al., 2015; Touati et al., 2018). All these analyses were based on the i.i.d. data assumption, whereas our paper here focuses on non-i.i.d. scenarios.

Two-player zero-sum MDP game: The zero-sum Markov game problem has been studied extensively for discrete state space models in (Littman, 1994; Bowling, 2001; Conitzer & Sandholm, 2007; Srinivasan et al., 2018; Wei et al., 2017) and references therein, where the convergence and finite-sample analysis have been developed. This problem was further studied with function approximation in (Prasad et al., 2015; Perolat et al., 2018), where only the convergence was provided, but the finite-sample analysis was not characterized. Fitted (i.e., batch) algorithms have also been designed and studied for zero-sum Markov game problems in (Lagoudakis & Parr, 2002; Perolat et al., 2016b;a; 2018; Zhang et al., 2018; Yang et al., 2019). Our study here provides the first finite-sample analysis for the online SARSA and Q-learning algorithms (differently from the aforementioned fitted algorithms) under the non-i.i.d. data assumption, for which the previously developed technical tools are not applicable.

2. Preliminaries

In this section, we introduce the basic MDP problem and the linear function approximation.

2.1. Markov Decision Process

Consider a general reinforcement learning setting, where an agent interacts with a stochastic environment, which is modeled as a Markov decision process (MDP). Specifically, we consider an MDP that consists of (X, A, P, r, γ), where X is a compact continuous state space X ⊂ Rd, and A is a finite action set. We further let Xt ∈ X denote the state at time t, and At ∈ A denote the action at time t. Then, the measure P defines the action-dependent transition kernel for the underlying Markov chain {Xt}t≥0: P(Xt+1 ∈ U |Xt = x, At = a) = ∫_U P(dy|x, a), for any measurable set U ⊆ X. The one-stage reward at time t is given by r(Xt, At), where r : X × A → R is the reward function, and is assumed to be uniformly bounded, i.e., r(x, a) ∈ [0, rmax] for any (x, a) ∈ X × A. Finally, γ denotes the discount factor.

A stationary Markov policy maps a state x ∈ X to a probability distribution π(·|x) over A, which does not depend on time. For a policy π, the corresponding value function V π : X → R is defined as the expected total discounted reward obtained by actions executed according to π:

V π(x0) = E[ ∑_{t=0}^{∞} γᵗ r(Xt, At) | X0 = x0 ].

The action-value function Qπ : X × A → R is defined as

Qπ(x, a) = r(x, a) + γ ∫_X P(dy|x, a) V π(y).

The goal is to find an optimal policy that maximizes the value function from any initial state. The optimal value function is defined as V ∗(x) = supπ V π(x), ∀x ∈ X. The optimal action-value function is defined as

Q∗(x, a) = supπ Qπ(x, a), ∀(x, a) ∈ X × A.

Based on Q∗, the optimal policy π∗ is the greedy policy with respect to Q∗. It can be verified that Q∗ = Qπ∗. The Bellman operator H is defined as

(HQ)(x, a) = r(x, a) + γ ∫_X max_{b∈A} Q(y, b) P(dy|x, a).

It is clear that H is a contraction in the sup norm defined as ‖Q‖sup = sup_{(x,a)∈X×A} |Q(x, a)|, and the optimal action-value function Q∗ is the fixed point of H (Bertsekas, 2012).
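For completeness, here is a standard one-step argument (not spelled out in the text) for why H is a γ-contraction; it uses only the definition above and the elementary inequality |max_b u(b) − max_b v(b)| ≤ max_b |u(b) − v(b)|:

\begin{aligned}
|(HQ_1)(x,a) - (HQ_2)(x,a)|
&= \gamma \left| \int_{\mathcal{X}} \Big( \max_{b\in\mathcal{A}} Q_1(y,b) - \max_{b\in\mathcal{A}} Q_2(y,b) \Big)\, \mathsf{P}(dy\mid x,a) \right| \\
&\le \gamma \int_{\mathcal{X}} \max_{b\in\mathcal{A}} \big| Q_1(y,b) - Q_2(y,b) \big|\, \mathsf{P}(dy\mid x,a)
\;\le\; \gamma \,\|Q_1 - Q_2\|_{\sup}.
\end{aligned}

Taking the supremum over (x, a) gives ‖HQ1 − HQ2‖sup ≤ γ‖Q1 − Q2‖sup, so for γ < 1 the fixed point Q∗ is unique by the Banach fixed-point theorem.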

2.2. Linear Function Approximation

Let Q = {Qθ : θ ∈ RN} be a family of real-valued functions defined on X × A. We consider the problem where any function in Q is a linear combination of a set of N fixed linearly independent functions φi : X × A → R, i = 1, . . . , N. Specifically, for θ ∈ RN,

Qθ(x, a) = ∑_{i=1}^{N} θi φi(x, a) = φᵀ(x, a)θ.

We assume that ‖φ(x, a)‖2 ≤ 1 for any (x, a) ∈ X × A, which can be ensured by normalizing the basis functions {φi}_{i=1}^{N}. The goal is to find a Qθ with a compact representation in θ to approximate the optimal action-value function Q∗ with a continuous state space.
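As a concrete illustration of this parameterization, the following minimal Python sketch evaluates Qθ(x, a) = φᵀ(x, a)θ; the radial-basis feature construction and all names are our own assumptions for illustration, not part of the paper:

import numpy as np

def normalize(phi):
    # Scale the feature vector so that ||phi||_2 <= 1, as assumed in Section 2.2.
    norm = np.linalg.norm(phi)
    return phi / norm if norm > 1.0 else phi

def features(x, a, num_actions, centers, bandwidth=1.0):
    # Hypothetical features phi(x, a) in R^N with N = num_actions * len(centers):
    # one block of radial basis functions per action (only the block for action a is active).
    block = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * bandwidth ** 2))
    phi = np.zeros(num_actions * len(centers))
    phi[a * len(centers):(a + 1) * len(centers)] = block
    return normalize(phi)

def q_value(theta, x, a, num_actions, centers):
    # Q_theta(x, a) = phi(x, a)^T theta.
    return features(x, a, num_actions, centers) @ theta

For example, with centers = np.random.randn(10, 2) and num_actions = 3, the parameter θ lives in R^30 and q_value(np.zeros(30), np.zeros(2), 1, 3, centers) returns 0.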

3. Finite-Sample Analysis for SARSA

As suggested in (Melo et al., 2008; Tsitsiklis & Roy, 1997; Perkins & Pendrith, 2002), on-policy algorithms may potentially yield more reliable convergence performance. In this section, we present our main result of the finite-sample analysis for the on-policy algorithm SARSA, viewed as a variation of the off-policy Q-learning algorithm. We further present our finite-sample analysis of the Q-learning algorithm as a comparison of performance in Section 4. As will be seen, the sufficient condition for Q-learning to converge is stricter.

3.1. SARSA with Linear Function Approximation

We consider a θ-dependent learning policy, which changes with time. Specifically, the learning policy πθt is ε-greedy with respect to the Q-function φᵀ(x, a)θt. Suppose that {xt, at, rt}t≥0 is a sampled trajectory of states, actions and rewards obtained from the MDP following the time-dependent learning policy πθt. Then the projected SARSA algorithm with linear function approximation updates as follows:

θt+1 = proj2,R(θt + αt gt(θt)),   (1)

where gt(θt) = ∇θQθ(xt, at)∆t = φ(xt, at)∆t, ∆t denotes the temporal difference at time t given by ∆t = r(xt, at) + γφᵀ(xt+1, at+1)θt − φᵀ(xt, at)θt, and

proj2,R(θ) := arg min_{θ′: ‖θ′‖2 ≤ R} ‖θ − θ′‖2.

Here, the projection step is to control the norm of the gradient gt(θt), which is a commonly used technique to control the gradient bias (Bhandari et al., 2018; Kushner, 2010; Lacoste-Julien et al., 2012; Bubeck et al., 2015; Nemirovski et al., 2009). The basic idea is that with a decaying step size αt and a projected gradient, θt does not change too fast.

The convergence of this algorithm (without a projection operation) was established using an O.D.E. argument (Melo et al., 2008). However, the finite-sample analysis of the convergence still remains unsolved, which is the goal here.
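A minimal Python sketch of update (1), with an ε-greedy learning policy and the projection proj2,R; the environment interface env.reset()/env.step(a) and the feature map features(x, a) are assumed here for illustration and are not part of the paper:

import numpy as np

def project_l2(theta, radius):
    # proj_{2,R}: Euclidean projection onto the ball {theta : ||theta||_2 <= R}.
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def epsilon_greedy(theta, x, actions, features, eps):
    # epsilon-greedy learning policy with respect to Q_theta(x, .) = phi(x, .)^T theta.
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    return actions[int(np.argmax([features(x, a) @ theta for a in actions]))]

def projected_sarsa(env, features, actions, dim, radius, w_s, gamma, eps, num_steps):
    # Projected SARSA with linear function approximation, step size alpha_t = 1 / (w_s (t + 1)),
    # run on a single sample path (non-i.i.d. data).
    theta = np.zeros(dim)
    x = env.reset()
    a = epsilon_greedy(theta, x, actions, features, eps)
    for t in range(num_steps):
        x_next, r = env.step(a)
        a_next = epsilon_greedy(theta, x_next, actions, features, eps)   # policy depends on current theta
        phi = features(x, a)
        delta = r + gamma * features(x_next, a_next) @ theta - phi @ theta   # temporal difference
        alpha = 1.0 / (w_s * (t + 1))
        theta = project_l2(theta + alpha * delta * phi, radius)             # update (1)
        x, a = x_next, a_next
    return theta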

For any θ ∈ RN, the learning policy πθ is assumed to be Lipschitz with respect to θ: ∀(x, a) ∈ X × A,

|πθ1(a|x) − πθ2(a|x)| ≤ C‖θ1 − θ2‖2,   (2)

where C > 0 is the Lipschitz constant. Since πθ is an ε-greedy policy with respect to Qθ, it is clear that C decreases with a larger ε, and equals zero if ε = 1. We further assume that for any fixed θ ∈ RN, the Markov chain {Xt}t≥0 induced by the learning policy πθ and the transition kernel P is uniformly ergodic with the invariant measure denoted by Pθ, and satisfies the following assumption.

Assumption 1. There are constants m > 0 and ρ ∈ (0, 1) such that

sup_{x∈X} dTV(P(Xt ∈ ·|X0 = x), Pθ) ≤ mρᵗ, ∀t ≥ 0,

where dTV(P, Q) denotes the total-variation distance between the probability measures P and Q.

We note that such an assumption holds for irreducible and aperiodic Markov chains (Meyn & Tweedie, 2012). We denote by µθ the probability measure induced by the invariant measure Pθ and the learning policy πθ.

3.2. Finite-Sample Analysis

We first note that the following facts are useful for our analysis. The limit point θ∗ of the algorithm in (1) satisfies the following relation (Theorem 2 in (Melo et al., 2008)):

Aθ∗θ∗ + bθ∗ = 0,

where Aθ∗ = Eθ∗[φ(X, A)(γφᵀ(Y, B) − φᵀ(X, A))] and bθ∗ = Eθ∗[φ(X, A)r(X, A)]. Here Eθ∗ denotes the expectation where X follows the invariant probability measure Pθ∗, A is generated by the learning policy πθ∗(A = ·|X), Y is the subsequent state of X following the action A, i.e., Y follows from the transition kernel P(Y ∈ ·|X, A), and again, B is generated by the learning policy πθ∗(B = ·|Y). Note that it has been shown in (Perkins & Precup, 2003; Tsitsiklis & Roy, 1997) that Aθ is negative definite for all θ ∈ RN.

Let G = rmax + 2R and λ = G|A|(2 + ⌈logρ m⁻¹⌉ + 1/(1 − ρ)). Recall in (2) that the learning policy πθ is assumed to be Lipschitz with respect to θ with Lipschitz constant C. We then make the following assumption, as justified in (Melo et al., 2008).

Assumption 2. The Lipschitz constant C is small enough so that (Aθ∗ + CλI) is negative definite, with its largest eigenvalue denoted by −ws < 0.

We then present our main result of the finite-sample bound on the convergence of SARSA.

Theorem 1. Consider the projected SARSA algorithm with linear function approximation in (1) with ‖θ∗‖2 ≤ R. Under Assumptions 1 and 2, with a decaying step size αt = 1/(ws(t + 1)) for t ≥ 0, we have

E‖θT − θ∗‖₂² ≤ G²(4C|A|Gτ0² + 12τ0ws + 1)(log T + 1) / (ws²T) + 4G²(τ0ws + 2ρ⁻¹) / (ws²T),   (3)

where τ0 = min{t ≥ 0 : mρᵗ ≤ αT}.

Remark 1. As T → ∞, τ0 ∼ log T, and therefore

E‖θT − θ∗‖₂² ≲ log³T / T.
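To see where the extra logarithmic factors come from, here is the short calculation behind Remark 1, using only the quantities defined in Theorem 1 (assuming T is large enough that m ws(T + 1) ≥ 1):

\tau_0 = \min\{t \ge 0: m\rho^t \le \alpha_T\}
= \Big\lceil \frac{\log\big(m\, w_s (T+1)\big)}{\log(1/\rho)} \Big\rceil = O(\log T),

since m\rho^t \le \tfrac{1}{w_s(T+1)} is equivalent to t \log(1/\rho) \ge \log(m w_s (T+1)). Substituting \tau_0 = O(\log T) into (3), the dominant term is G^2 \cdot 4C|\mathcal{A}|G\,\tau_0^2 (\log T + 1)/(w_s^2 T) = O(\log^3 T / T), which is the rate stated in Remark 1.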

From Theorem 1, it is clear that for a smaller C, the algorithm converges faster. However, to obtain a smaller C, ε shall be increased. On the other hand, a smaller ε means that the learning policy, and thus the temporal difference update, are "more greedy", which results in a more accurate approximation of the optimal Q-function. Hence there is a trade-off between the convergence rate and the accuracy with which φᵀ(x, a)θ∗ approximates the optimal Q-function.

In order for Theorem 1 to hold, the projection radius R shall be chosen such that ‖θ∗‖2 ≤ R. However, θ∗ is unknown in advance. We next provide an upper bound on ‖θ∗‖2, which can be estimated in practice (Bhandari et al., 2018).

Lemma 1. For the projected SARSA algorithm in (1), the limit point θ∗ satisfies ‖θ∗‖2 ≤ −rmax/wl, where wl < 0 is the largest eigenvalue of Aθ∗.

3.3. Outline of Technical Proof

The challenges in analyzing the SARSA algorithm are two-fold: (1) non-i.i.d. samples; and (2) a dynamically changing learning policy. First, as per the updating rule in (1), there is a strong coupling between the sample path and {θt}t≥0, since the samples are used to compute the gradient gt and then θt+1, which introduces a strong dependency between {θt}t≥0 and {Xt, At}t≥0, and thus a bias in the gradient gt. Moreover, differently from TD learning and Q-learning, θt is further used (as in the policy πθt) to generate the subsequent actions, which makes the dependency even stronger. Although the convergence can still be established using the O.D.E. approach (Benveniste et al., 2012; Melo et al., 2008), in order to derive a finite-sample analysis, the stochastic bias in the gradient needs to be explicitly characterized, which makes the problem challenging. Second, as θt updates, the transition kernel for the state-action pair (Xt, At) changes with time. The analysis in (Bhandari et al., 2018) relies on the fact that the learning policy is fixed, so that the Markov process reaches its stationary distribution quickly. In (Perkins & Precup, 2003), an episodic SARSA algorithm is studied, where within each episode the learning policy is fixed, only the Q-function of the learning policy is updated, and the learning policy is then updated only at the end of each episode. Therefore, within each episode, the Markov process can reach its stationary distribution so that the analysis can be conducted. The SARSA algorithm studied here does not possess these nice properties, since the learning policy is changing at each time step. Thus, to provide a finite-sample analysis, we design a new uniformly ergodic Markov chain to approximate the original Markov chain induced by the SARSA algorithm. Using such an approach, the gradient bias can be explicitly characterized. To illustrate the idea of the proof, we provide a sketch below; the detailed proof can be found in the supplementary materials.

Proof sketch. We sketch the key steps in our proof. We first introduce some notation. For any fixed θ ∈ RN, define g(θ) = Eθ[gt(θ)], where Xt follows the stationary distribution Pθ, and (At, Xt+1, At+1) are the subsequent actions and state generated according to the policy πθ and the transition kernel P. Here, g(θ) can be interpreted as the noiseless gradient at θ. We then define Λt(θ) = ⟨θ − θ∗, gt(θ) − g(θ)⟩. Thus, Λt(θt) measures the bias caused by using non-i.i.d. samples to estimate the gradient.

Step 1. Error decomposition. The error at each time step can be decomposed recursively as follows:

E[‖θt+1 − θ∗‖₂²] ≤ E[‖θt − θ∗‖₂²] + 2αt E[⟨θt − θ∗, g(θt) − g(θ∗)⟩] + αt² E[‖gt(θt)‖₂²] + 2αt E[Λt(θt)].   (4)

Step 2. Gradient descent type analysis. The first three terms in (4) mimic the analysis of the gradient descent algorithm without noise, because the noiseless gradient g(θt) appears there.

Due to the projection step in (1), ‖gt(θt)‖2 is upper bounded by G. It can also be shown that ⟨θt − θ∗, g(θt) − g(θ∗)⟩ ≤ (θt − θ∗)ᵀ(Aθ∗ + CλI)(θt − θ∗). For a small enough C, i.e., if πθ is smooth enough with respect to θ, (Aθ∗ + CλI) is negative definite. Then, we have

E[⟨θt − θ∗, g(θt) − g(θ∗)⟩] ≤ −ws E[‖θt − θ∗‖₂²].   (5)

Step 3. Stochastic bias analysis. This step consists of our major technical developments. The last term in (4) corresponds to the bias caused by using a single sample path with non-i.i.d. data and dynamically changing learning policies. For convenience, we rewrite Λt(θt) as Λt(θt, Ot), where Ot = (Xt, At, Xt+1, At+1). This term is very challenging to bound due to the strong dependency between θt and Ot.

We first show that Λt(θ, Ot) is Lipschitz in θ. Due to the projection step, θt changes slowly with t. Combining the two facts, we can show that for any τ > 0,

Λt(θt, Ot) ≤ Λt(θt−τ, Ot) + 6G² ∑_{i=t−τ}^{t−1} αi.   (6)

Such a step is intended to decouple the dependency between Ot and θt by considering Ot and θt−τ. If the Markov chain {(Xt, At)}t≥0 induced by SARSA were uniformly ergodic and satisfied Assumption 1, then given any θt−τ, Ot would reach its stationary distribution quickly for large τ. However, such an argument is not necessarily true, since the learning policy πθt changes with time.

Our idea is to construct an auxiliary Markov chain to assist our proof (an illustrative simulation of this construction is sketched after this outline). Consider the following new Markov chain. Before time t − τ + 1, the states and actions are generated according to the SARSA algorithm, but after time t − τ + 1, the learning policy is kept fixed as πθt−τ to generate all the subsequent actions. We then denote by Õt = (X̃t, Ãt, X̃t+1, Ãt+1) the observations of the new Markov chain at time t and time t + 1. For this new Markov chain, for large τ, Õt reaches the stationary distribution induced by πθt−τ and P. And thus, it can be shown that

E[Λt(θt−τ, Õt)] ≤ 4G²mρ^(τ−1).   (7)

The next step is to bound the difference between the Markov chain generated by the SARSA algorithm and the auxiliary Markov chain that we construct. Since the learning policy changes slowly, due to its Lipschitz property and the decaying step size αt, the two Markov chains should not deviate from each other too much. It can be shown that

E[Λt(θt−τ, Ot)] − E[Λt(θt−τ, Õt)] ≤ (2C|A|G³τ / ws) log(t / (t − τ)).   (8)

Combining (6), (7) and (8) yields an upper bound on E[Λt(θt)].

Step 4. Putting the first three steps together and recursivelyapplying Step 1 complete the proof.
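The following Python sketch makes the auxiliary-chain construction of Step 3 concrete: from time t − τ + 1 on, it rolls out, next to the actual trajectory driven by the changing iterates θi, a coupled trajectory whose actions are always drawn from the frozen policy πθt−τ. The interfaces env_step(x, a) and policy(theta, x) and all names are our own illustrative assumptions, not the authors' code:

def coupled_rollout(env_step, policy, theta_history, t, tau, x_shared, a_shared):
    # Both chains share the state-action pair at time t - tau + 1; afterwards the true
    # SARSA chain keeps using the time-varying iterates theta_i, while the auxiliary
    # chain keeps the learning policy frozen at theta_{t - tau} (uniformly ergodic).
    theta_frozen = theta_history[t - tau]
    x_true, a_true = x_shared, a_shared
    x_aux, a_aux = x_shared, a_shared
    for i in range(t - tau + 1, t + 1):
        x_true, _ = env_step(x_true, a_true)
        a_true = policy(theta_history[i], x_true)   # dynamically changing learning policy
        x_aux, _ = env_step(x_aux, a_aux)
        a_aux = policy(theta_frozen, x_aux)         # frozen policy pi_{theta_{t-tau}}
    return (x_true, a_true), (x_aux, a_aux)

Comparing the distributions of the two returned pairs is exactly what (7) and (8) quantify: the auxiliary pair mixes to the stationary distribution induced by πθt−τ, while the two pairs stay close because θ changes slowly under the decaying step size.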

4. Finite-Sample Analysis for Q-Learning

4.1. Q-Learning with Linear Function Approximation

Let π be a fixed stationary learning policy. Suppose that {xt, at, rt}t≥0 is a sampled trajectory of states, actions and rewards obtained from the MDP using policy π. We consider the projected Q-learning algorithm with the following update rule:

θt+1 = proj2,R(θt + αt gt(θt)),   (9)

where αt is the step size at time t, gt(θt) = ∇θQθ(xt, at)∆t = φ(xt, at)∆t, and ∆t is the temporal difference at time t:

∆t = r(xt, at) + γ max_{b∈A} φᵀ(xt+1, b)θt − φᵀ(xt, at)θt.

The same projection step as in the SARSA algorithm is used to control the norm of the gradient gt(θt), and thus to control the gradient bias. The convergence of this algorithm (without a projection operation) was established using an O.D.E. argument (Melo et al., 2008). However, the finite-sample characterization of the convergence still remains unsolved, which is our interest here.
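A minimal Python sketch of update (9); the greedy maximization in the temporal-difference target is over the finite action set, and features, behavior_policy, and env are assumed interfaces for illustration, not from the paper:

import numpy as np

def project_l2(theta, radius):
    # proj_{2,R}: Euclidean projection onto {theta : ||theta||_2 <= R}.
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def projected_q_learning(env, features, behavior_policy, actions, dim, radius, w_s, gamma, num_steps):
    # Projected Q-learning with linear function approximation and a fixed behavior policy pi.
    theta = np.zeros(dim)
    x = env.reset()
    for t in range(num_steps):
        a = behavior_policy(x)                                        # fixed stationary learning policy
        x_next, r = env.step(a)
        phi = features(x, a)
        q_next = max(features(x_next, b) @ theta for b in actions)    # greedy target max_b phi(x', b)^T theta
        delta = r + gamma * q_next - phi @ theta                      # temporal difference
        alpha = 1.0 / (w_s * (t + 1))
        theta = project_l2(theta + alpha * delta * phi, radius)       # update (9)
        x = x_next
    return theta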


4.2. Finite-Sample Analysis

To explicitly quantify the statistical bias in the iterative update in (9), we assume that the Markov chain {Xt}t≥0 induced by the learning policy π is uniformly ergodic with invariant measure Pπ, which satisfies Assumption 1, and that X0 is initialized with Pπ. We then introduce some notations. We denote by µπ the probability measure induced by the invariant measure Pπ and the learning policy π: µπ(U × {a}) = ∫_U π(a|x)Pπ(dx), for any measurable U ⊆ X. We define the matrix Σπ as

Σπ = Eπ[φ(X, A)φ(X, A)ᵀ] = ∫_{x∈X} ∑_{a∈A} φ(x, a)φᵀ(x, a)π(a|x)Pπ(dx).

For any fixed θ ∈ RN and x ∈ X, define aθx = argmax_{a∈A} φᵀ(x, a)θ. We then define the θ-dependent matrix

Σ∗π(θ) = Eπ[φ(X, aθX)φᵀ(X, aθX)] = ∫_X φ(x, aθx)φᵀ(x, aθx)Pπ(dx).

By their construction, Σπ and Σ∗π(θ) are positive definite. We further make the following assumption.

Assumption 3. For all θ ∈ RN, Σπ ≻ γ²Σ∗π(θ).
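Assumption 3 can be probed empirically from a long trajectory generated by π: estimate Σπ and Σ∗π(θ) by sample averages and check the smallest eigenvalue of their difference. The Python sketch below is our own illustration (a positive value for one given θ is only a necessary check, since the assumption is required for all θ):

import numpy as np

def assumption3_margin(phi_samples, greedy_phi_samples, gamma):
    # phi_samples[i]        = phi(x_i, a_i) with a_i drawn from the behavior policy pi,
    # greedy_phi_samples[i] = phi(x_i, a^theta_{x_i}) for the greedy action under a fixed theta.
    Phi = np.asarray(phi_samples)                        # shape (T, N)
    Phi_star = np.asarray(greedy_phi_samples)            # shape (T, N)
    sigma_pi = Phi.T @ Phi / len(Phi)                    # estimate of Sigma_pi
    sigma_star = Phi_star.T @ Phi_star / len(Phi_star)   # estimate of Sigma*_pi(theta)
    # Smallest eigenvalue of Sigma_pi - gamma^2 Sigma*_pi(theta); its infimum over theta is w_s.
    return float(np.linalg.eigvalsh(sigma_pi - gamma ** 2 * sigma_star).min())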

We denote the minimum of the smallest eigenvalue of Σπ − γ²Σ∗π(θ) over all θ ∈ RN by ws. The limit point θ∗ of the algorithm in (9) (without a projection operation) satisfies the fixed-point relation

Qθ∗ = projQ HQθ∗,   (10)

where projQ is the orthogonal projection onto Q defined with the inner product ⟨f, g⟩ = ∫_X ∑_{a∈A} f g dµπ (Theorem 1 in (Melo et al., 2008)). We then further characterize the finite-sample bound on the performance of the projected Q-learning in (9).

Theorem 2. Consider the projected Q-learning algorithm in (9) with ‖θ∗‖2 ≤ R. If Assumptions 1 and 3 are satisfied for π, then with a decaying step size αt = 1/(ws(t + 1)), we have

E‖θT − θ∗‖₂² ≤ (9 + 24τ0)G²(log T + 1) / (ws²T),   (11)

where τ0 = min{t ≥ 0 : mρᵗ ≤ αT}.

Remark 2. As T → ∞, τ0 ∼ log T, and therefore

E‖θT − θ∗‖₂² ≲ log²T / T.

Remark 3. As an interesting comparison, we note that the convergence rate O((log T / T)^(1/3)) has been characterized for the nearest neighbor approach for Q-learning with continuous state space in (Shah & Xie, 2018). Clearly, Theorem 2 implies that linear function approximation yields a much faster convergence rate O(log²T / T). On the other hand, the learning policy and basis functions need to satisfy the conditions that guarantee the convergence (Melo et al., 2008), due to the nature of linear function approximation.
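To translate these rates into sample complexities, fix a target accuracy ε for the respective error criterion used in each analysis and ignore constants. With linear function approximation (Theorem 2) versus the nearest neighbor approach of (Shah & Xie, 2018):

\frac{\log^2 T}{T} \le \varepsilon \;\Longrightarrow\; T = \widetilde{O}(1/\varepsilon),
\qquad
\Big(\frac{\log T}{T}\Big)^{1/3} \le \varepsilon \;\Longrightarrow\; T = \widetilde{O}(1/\varepsilon^{3}),

where \widetilde{O}(\cdot) hides logarithmic factors. This is the order-level sample complexity gap referred to in the abstract; the price is the additional conditions on the learning policy and the basis functions.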

To clarify the difference between the proof of SARSA and that of Q-learning, we note that for Q-learning, the learning policy does not change with time, and the induced Markov chain can get close to its stationary distribution. Thus, to characterize the stochastic bias, it is not necessary to construct an auxiliary Markov chain as in Step 3 for SARSA. On the other hand, to compute the temporal difference in Q-learning, a greedy action is taken, which does not satisfy the Lipschitz condition we impose on the learning policy in the proof for SARSA.

Similar to the SARSA algorithm, we also provide the following upper bound on ‖θ∗‖2 for practical consideration.

Lemma 2. For the projected Q-learning algorithm in (9), the limit point θ∗ satisfies ‖θ∗‖2 ≤ 2rmax/ws.

5. Minimax SARSA

5.1. Zero-Sum Markov Game

In this subsection, we introduce the two-player zero-sum Markov game and the corresponding linear function approximation.

A two-player zero-sum Markov game is defined as a six-tuple (X, A1, A2, P, r, γ), where X is a compact continuous state space X ⊂ Rd, and A1 and A2 are finite action sets of players 1 and 2. We further let Xt ∈ X denote the state at time t, and A1t ∈ A1, A2t ∈ A2 denote the actions of players 1 and 2 at time t, respectively. The measure P defines the action-dependent transition kernel for the underlying Markov chain {Xt}t≥0:

P(Xt+1 ∈ U |Xt = x, A1t = a1, A2t = a2) = ∫_U P(dy|x, a1, a2),   (12)

for any measurable set U ⊂ X. The one-stage reward at time t is given by r(Xt, A1t, A2t), where r : X × A1 × A2 → R is the reward function, and is assumed to be uniformly bounded, i.e., r(x, a1, a2) ∈ [0, rmax] for any x ∈ X, a1 ∈ A1, a2 ∈ A2. Finally, γ denotes the discount factor.

A stationary policy πi, i ∈ {1, 2}, maps a state x ∈ X to a probability distribution πi(·|x) over Ai, which does not depend on time. For a policy π = {π1, π2}, the corresponding value function V π : X → R is defined as the expected total discounted reward obtained by actions executed according to π: V π(x0) = E[ ∑_{t=0}^{∞} γᵗ r(Xt, A1t, A2t) | X0 = x0 ]. The action-value function Qπ : X × A1 × A2 → R is then defined as Qπ(x, a1, a2) = r(x, a1, a2) + γ ∫_X P(dy|x, a1, a2) V π(y).

For the two-player zero-sum game, the goal of player 1 is to maximize the expected accumulated γ-discounted reward from any initial state, while the goal of player 2 is to minimize it. The optimal value function for both player 1 and player 2 is then defined in the minimax sense as follows:

V ∗(x) = min_{π2} max_{π1} V π(x), ∀x ∈ X.   (13)

The above minimization and maximization are well-defined because π1 and π2 lie in a compact probability simplex since the action sets are finite. For all (x, a1, a2) ∈ X × A1 × A2, the optimal action-value function is also defined in the following minimax sense: Q∗(x, a1, a2) = min_{π2} max_{π1} Qπ(x, a1, a2). It can be verified that Q∗ has the following property (Perolat et al., 2015):

Q∗(x, a1, a2) = Qπ∗(x, a1, a2) = min_{π2} max_{π1} Qπ(x, a1, a2) = max_{π1} min_{π2} Qπ(x, a1, a2).

The Bellman operator H for the Markov game is defined as

(HQ)(x, a1, a2) = r(x, a1, a2) + γ ∫_X min_{π2∈δ2} max_{π1∈δ1} π1ᵀ Q(y) π2 P(dy|x, a1, a2),   (14)

where Q(y) is the action-value matrix at state y, and δi denotes the probability simplex over Ai, i.e., entry j of πi ∈ δi denotes the probability that player i takes action j, for i = 1, 2.
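The inner min–max over the probability simplices in (14), and the analogous min–max used for the minimax policies and Q-learning target in Sections 5.2 and 6, is a finite matrix game, which can be solved by linear programming. The following Python sketch is our own illustration using scipy, not the authors' implementation:

import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    # Value max_{pi1} min_{pi2} pi1^T Q pi2 of the matrix game with payoff matrix Q
    # (player 1 maximizes); by the minimax theorem this equals min_{pi2} max_{pi1} pi1^T Q pi2.
    # LP: maximize v subject to sum_i pi1[i] Q[i, j] >= v for all j, pi1 in the simplex.
    n1, n2 = Q.shape
    c = np.zeros(n1 + 1)
    c[-1] = -1.0                                   # variables [pi1, v]; linprog minimizes, so use -v
    A_ub = np.hstack([-Q.T, np.ones((n2, 1))])     # v - sum_i Q[i, j] pi1[i] <= 0 for each column j
    b_ub = np.zeros(n2)
    A_eq = np.hstack([np.ones((1, n1)), np.zeros((1, 1))])   # sum_i pi1[i] = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n1 + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n1]                   # game value and a maximin strategy for player 1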

The linear function approximation here takes a form similar to the single-agent case, except that the function depends on two actions. Let Q = {Qθ : θ ∈ RN} be a family of real-valued functions defined on X × A1 × A2. We assume that any function in Q is a linear combination of a fixed set of N linearly independent functions φi : X × A1 × A2 → R, i = 1, . . . , N. Specifically, for θ ∈ RN,

Qθ(x, a1, a2) = ∑_{i=1}^{N} θi φi(x, a1, a2) = φᵀ(x, a1, a2)θ.

We assume that ‖φ(x, a1, a2)‖2 ≤ 1 for any (x, a1, a2) ∈ X × A1 × A2, which can be ensured by normalizing the basis functions {φi}_{i=1}^{N}. The goal is to find Qθ to approximate the optimal action-value function Q∗ with a continuous state space. We also express φi(x, a1, a2) in matrix form Φi(x), with its (j, k)-th entry defined as [Φi(x)]j,k = φi(x, a1 = j, a2 = k), j ∈ A1, k ∈ A2.

5.2. Minimax SARSA with Linear Function Approximation

In this subsection, we present our finite-sample analysis for an on-policy minimax SARSA algorithm, which adapts SARSA to solve the two-player zero-sum Markov game.

We consider a θ-dependent ε-greedy learning policy πθ, which changes with time. For any θ ∈ RN, the optimal minimax policy πθi,x, i ∈ {1, 2}, at state x is given by

{πθ2,x, πθ1,x} = argmin_{π2∈δ2} argmax_{π1∈δ1} π1ᵀ[Φ(x)ᵀθ]π2,

where the action-value matrix [Φ(x)ᵀθ] = ∑_{i=1}^{N} Φi(x)θi.

The ε-greedy learning policy πθ = {π1,θ, π2,θ} balances the exploration and exploitation for each player by choosing all actions in A1 and A2 with probability at least ε. Suppose that {xt, a1t, a2t, rt}t≥0 is a sampled trajectory of states, actions and rewards obtained from the Markov game following the time-dependent learning policy πθt. Then the projected minimax SARSA algorithm takes the following update rule:

θt+1 = proj2,R(θt + αt gt(θt)),   (15)

where gt(θt) = φ(xt, a1t, a2t)(r(xt, a1t, a2t) + γφᵀ(xt+1, a1t+1, a2t+1)θt − φᵀ(xt, a1t, a2t)θt). Similarly, we introduce a projection step to control the norm of the gradient. We assume that for any θ ∈ RN, the learning policy πθ is Lipschitz with respect to θ: for any (x, a1, a2) ∈ X × A1 × A2,

|πθ1(a1, a2|x) − πθ2(a1, a2|x)| ≤ C‖θ1 − θ2‖2,   (16)

where C > 0 is the Lipschitz constant, and πθ(a1, a2|x) = π1,θ(a1|x)π2,θ(a2|x). We further assume that for any fixed θ ∈ RN, the Markov chain {Xt}t≥0 induced by the learning policy πθ is uniformly ergodic with invariant measure Pθ, and satisfies Assumption 1.

5.3. Finite-Sample Analysis for Minimax SARSA

Define θ∗ ∈ RN such that Aθ∗θ∗ + bθ∗ = 0, where

Aθ∗ = Eθ∗[φ(X, A1, A2)(γφᵀ(Y, B1, B2) − φᵀ(X, A1, A2))]

and bθ∗ = Eθ∗[φ(X, A1, A2)r(X, A1, A2)]. Here Eθ∗ is defined similarly to that in Section 3.2. Following steps similar to those in the proof of Theorem 5.1 in (De Farias & Van Roy, 2000), we can verify that such a θ∗ exists.

Following the proof in (Perkins & Precup, 2003; Tsitsiklis & Roy, 1997), we can verify that Aθ is negative definite for any θ ∈ RN. Let G = rmax + 2R and λ = G|A1||A2|(2 + ⌈logρ m⁻¹⌉ + 1/(1 − ρ)). Recall in (16) that the learning policy πθ is assumed to be Lipschitz with respect to θ with Lipschitz constant C. We then make the following assumption.

Assumption 4. The Lipschitz constant C is small enough so that (Aθ∗ + CλI) is negative definite, with its largest eigenvalue denoted by −ws < 0.


We have the following finite-sample bound.

Theorem 3. Consider the projected minimax SARSA algorithm in (15) with ‖θ∗‖2 ≤ R. Under Assumptions 1 and 4, with a decaying step size αt = 1/(ws(t + 1)) for t ≥ 0, we have

E‖θT − θ∗‖₂² ≤ G²(4C|A|Gτ0² + 12τ0ws + 1)(log T + 1) / (ws²T) + 4G²(τ0ws + 2ρ⁻¹) / (ws²T),   (17)

where |A| = |A1||A2| and τ0 = min{t ≥ 0 : mρᵗ ≤ αT}.

Remark 4. As T → ∞, τ0 ∼ log T, and then it follows from Theorem 3 that E‖θT − θ∗‖₂² ≲ log³T / T.

We provide the following upper bound on ‖θ∗‖2.

Lemma 3. For the minimax SARSA algorithm in (15), the limit point θ∗ satisfies ‖θ∗‖2 ≤ −rmax/wl, where wl < 0 is the largest eigenvalue of Aθ∗.

6. Minimax Q-Learning

6.1. Minimax Q-Learning with Linear Function Approximation

Let π = {π1, π2} be a pair of fixed stationary learning policies for players 1 and 2. Suppose that {xt, a1t, a2t, rt}t≥0 is a sampled trajectory of states, actions and rewards obtained from the Markov game using policy π. We consider the projected minimax Q-learning algorithm with the following update rule:

θt+1 = proj2,R(θt + αt gt(θt)),   (18)

where αt is the step size at time t, and

gt(θt) = φ(xt, a1t, a2t)(r(xt, a1t, a2t) + γ min_{π2∈δ2} max_{π1∈δ1} π1ᵀ[Φᵀ(xt+1)θt]π2 − φᵀ(xt, a1t, a2t)θt).

Here, we also use projection to control the norm of the gradient gt(θt). We assume that the Markov chain {Xt}t≥0 induced by the learning policy π is uniformly ergodic with invariant measure Pπ, and that X0 is initialized with Pπ.
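The min–max inside this temporal-difference target is again a matrix game over the finite action sets. Reusing the hypothetical matrix_game_value sketch from Section 5.1, and an assumed feature map features(x, a1, a2), the target in gt(θt) could be computed as follows (an illustration only, not the authors' code):

import numpy as np
# matrix_game_value is the LP-based sketch given in Section 5.1.

def minimax_td_error(theta, features, actions1, actions2, x, a1, a2, r, x_next, gamma):
    # Temporal difference used in the projected minimax Q-learning update (18).
    Q_next = np.array([[features(x_next, b1, b2) @ theta for b2 in actions2] for b1 in actions1])
    target, _ = matrix_game_value(Q_next)   # min over pi2, max over pi1 of pi1^T Q_next pi2
    return r + gamma * target - features(x, a1, a2) @ theta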

6.2. Finite-Sample Analysis

We denote by µπ the probability measure induced by the invariant measure Pπ and the learning policy π: µπ(U, a1, a2) = ∫_U π(a1, a2|x)Pπ(dx), for any measurable U ⊂ X, and define the matrix

Σπ = Eπ[φ(X, A1, A2)φ(X, A1, A2)ᵀ] = ∫_X ∑_{a1∈A1} ∑_{a2∈A2} φ(x, a1, a2)φᵀ(x, a1, a2) dµπ.

For any fixed θ ∈ RN and x ∈ X, define {πθ2,x, πθ1,x} = argmin_{π2∈δ2} argmax_{π1∈δ1} π1ᵀ[Φᵀ(x)θ]π2. We then define the matrix

Σ∗π(θ1, θ2) = E[((πθ1 1,X)ᵀ Φ(X) πθ2 2,X)((πθ1 1,X)ᵀ Φ(X) πθ2 2,X)ᵀ] = ∫_X ((πθ1 1,x)ᵀ Φ(x) πθ2 2,x)((πθ1 1,x)ᵀ Φ(x) πθ2 2,x)ᵀ Pπ(dx).

It can be easily verified that Σπ and Σ∗π(θ1, θ2) are positive definite. We further make the following assumption.

Assumption 5. For all θ1, θ2 ∈ RN, Σπ ≻ γ²Σ∗π(θ1, θ2).

We denote the minimum of the smallest eigenvalue of Σπ − γ²Σ∗π(θ1, θ2) over all θ1, θ2 ∈ RN by ws. Define θ∗ as the vector that satisfies

θ∗ = Σπ⁻¹ Eπ[Φ HQθ∗].   (19)

The following theorem establishes the existence of θ∗.

Theorem 4. Under Assumption 5, there exists a unique θ∗ that satisfies (19).

We then characterize the finite-sample bound for the projected minimax Q-learning in (18) in the following theorem.

Theorem 5. Consider the projected minimax Q-learning algorithm in (18) with ‖θ∗‖2 ≤ R. If Assumptions 1 and 5 are satisfied, then with a decaying step size αt = 1/(ws(t + 1)), we have

E‖θT − θ∗‖₂² ≤ (9 + 24τ0)G²(log T + 1) / (ws²T),   (20)

where τ0 = min{t ≥ 0 : mρᵗ ≤ αT}.

Remark 5. As T → ∞, τ0 ∼ log T, and therefore

E‖θT − θ∗‖₂² ≲ log²T / T.

We provide the following bound on ‖θ∗‖2.

Lemma 4. For the minimax Q-learning algorithm in (18), the limit point θ∗ satisfies ‖θ∗‖2 ≤ 2rmax/ws.

7. Conclusion

In this paper, we presented the first finite-sample analysis for the SARSA and Q-learning algorithms with continuous state space and linear function approximation. We obtained an improved convergence rate bound compared to the nearest neighbor approach in (Shah & Xie, 2018). Our analysis is applicable to the online case with a single sample path and non-i.i.d. data. In particular, we developed a novel technique to handle the stochastic bias of dynamically changing learning policies. We also extended our study to two-player zero-sum Markov games, and developed finite-sample analysis for the minimax SARSA and Q-learning algorithms. For future work, it is of interest to study the convergence and finite-sample analysis for reinforcement learning with general function approximation, e.g., deep neural networks, with non-i.i.d. data.


References

Antos, A., Szepesvari, C., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

Benveniste, A., Metivier, M., and Priouret, P. Adaptive Algorithms and Stochastic Approximations, volume 22. Springer Science & Business Media, 2012.

Bertsekas, D. P. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 3rd edition, 2012.

Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.

Bowling, M. Rational and convergent learning in stochastic games. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2001.

Boyan, J. A. Technical update: Least-squares temporal difference learning. Machine Learning, 49:233–246, 2002.

Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.

Bubeck, S. et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Conitzer, V. and Sandholm, T. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67:23–43, 2007.

Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. Finite sample analyses for TD(0) with function approximation. In Proc. AAAI Conference on Artificial Intelligence (AAAI), 2018a.

Dalal, G., Thoppe, G., Szörényi, B., and Mannor, S. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Proc. Conference On Learning Theory (COLT), 2018b.

De Farias, D. P. and Van Roy, B. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3):589–608, 2000.

Devraj, A. M., Kontoyiannis, I., and Meyn, S. P. Differential temporal difference learning. arXiv preprint arXiv:1812.11137, 2018.

Farahmand, A.-M., Szepesvari, C., and Munos, R. Error propagation for approximate policy and value iteration. In Proc. Advances in Neural Information Processing Systems (NIPS), 2010.

Ghavamzadeh, M., Lazaric, A., Maillard, O., and Munos, R. LSTD with random projections. In Proc. Advances in Neural Information Processing Systems (NIPS), 2010.

Kushner, H. Stochastic approximation: a survey. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):87–96, 2010.

Lacoste-Julien, S., Schmidt, M., and Bach, F. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.

Lagoudakis, M. G. and Parr, R. Value function approximation in zero-sum Markov games. In Proc. Uncertainty in Artificial Intelligence (UAI), 2002.

Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

Lakshminarayanan, C. and Szepesvari, C. Linear stochastic approximation: How far does constant step-size and iterate averaging go? In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of LSTD. In Proc. International Conference on Machine Learning (ICML), 2010.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13:3041–3074, 2012.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Analysis of classification-based policy iteration algorithms. Journal of Machine Learning Research, 17:583–612, 2016.

Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Proc. International Conference on Machine Learning (ICML), 1994.

Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. Finite-sample analysis of proximal gradient TD algorithms. In Proc. Conference on Uncertainty in Artificial Intelligence (UAI), 2015.

Melo, F. S., Meyn, S. P., and Ribeiro, M. I. An analysis of reinforcement learning with function approximation. In Proc. International Conference on Machine Learning (ICML), pp. 664–671. ACM, 2008.

Meyn, S. P. and Tweedie, R. L. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.


Mitrophanov, A. Y. Sensitivity and convergence of uniformly ergodic Markov chains. Journal of Applied Probability, 42(4):1003–1014, 2005.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., and Ostrovski, G. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proc. International Conference on Machine Learning (ICML), 2016.

Munos, R. and Szepesvari, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, May 2008.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Ormoneit, D. and Glynn, P. Kernel-based reinforcement learning in average-cost problems. IEEE Transactions on Automatic Control, 47(10):1624–1636, 2002.

Ormoneit, D. and Sen, S. Kernel-based reinforcement learning. Machine Learning, 49(2-3):161–178, 2002.

Perkins, T. J. and Pendrith, M. D. On the existence of fixed points for Q-learning and Sarsa in partially observable domains. In Proc. International Conference on Machine Learning (ICML), pp. 490–497, 2002.

Perkins, T. J. and Precup, D. A convergent form of approximate policy iteration. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 1627–1634, 2003.

Perolat, J., Scherrer, B., Piot, B., and Pietquin, O. Approximate dynamic programming for two-player zero-sum Markov games. In Proc. International Conference on Machine Learning (ICML), 2015.

Perolat, J., Piot, B., Scherrer, B., and Pietquin, O. On the use of non-stationary strategies for solving two-player zero-sum Markov games. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2016a.

Perolat, J., Piot, B., Geist, M., Scherrer, B., and Pietquin, O. Softened approximate policy iteration for Markov games. In Proc. International Conference on Machine Learning (ICML), 2016b.

Perolat, J., Piot, B., and Pietquin, O. Actor-critic fictitious play in simultaneous move multistage games. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

Pires, B. A. and Szepesvari, C. Statistical linear estimation with penalized estimators: An application to reinforcement learning. In Proc. International Conference on Machine Learning (ICML), 2012.

Prasad, H. L., L.A., P., and Bhatnagar, S. Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In Proc. International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2015.

Prashanth, L., Korda, N., and Munos, R. Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. In Proc. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2013.

Rummery, G. A. and Niranjan, M. Online Q-learning using connectionist systems. Technical Report, Cambridge University Engineering Department, September 1994.

Shah, D. and Xie, Q. Q-learning with nearest neighbors. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2018.

Srinivasan, S., Lanctot, M., Zambaldi, V., Perolat, J., Tuyls, K., Munos, R., and Bowling, M. Actor-critic policy optimization in partially observable multiagent environments. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2018.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction, Second Edition. The MIT Press, Cambridge, Massachusetts, 2018.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proc. International Conference on Machine Learning (ICML), 2009a.

Sutton, R. S., Maei, H. R., and Szepesvari, C. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Proc. Advances in Neural Information Processing Systems (NIPS), 2009b.

Tagorti, M. and Scherrer, B. On the rate of convergence and error bounds for LSTD(λ). In Proc. International Conference on Machine Learning (ICML), 2015.


Touati, A., Bacon, P.-L., Precup, D., and Vincent, P. Convergent TREE BACKUP and RETRACE with function approximation. In Proc. International Conference on Machine Learning (ICML), 2018.

Tsitsiklis, J. N. and Roy, B. V. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, May 1997.

Tu, S. and Recht, B. Least-squares temporal difference learning for the linear quadratic regulator. In Proc. International Conference on Machine Learning (ICML), 2018.

Watkins, C. J. C. H. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. Online reinforcement learning in stochastic games. In Proc. Advances in Neural Information Processing Systems (NIPS), 2017.

Yang, Z., Xie, Y., and Wang, Z. A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137, 2019.

Zhang, K., Yang, Z., Liu, H., Zhang, T., and Basar, T. Finite-sample analyses for fully decentralized multi-agent reinforcement learning. arXiv preprint arXiv:1812.02783, 2018.


Supplementary Materials

A. Technical Proofs for SARSA with Linear Function Approximation

A.1. Useful Lemmas for Proof of Theorem 1

For the SARSA algorithm, define for any θ ∈ RN,

g(θ) = Eθ[φ(X, A)(r(X, A) + γφᵀ(Y, B)θ − φᵀ(X, A)θ)].   (21)

It can be easily verified that g(θ∗) = 0. We then define Λt(θ) = ⟨θ − θ∗, gt(θ) − g(θ)⟩.

Lemma 5. For any θ ∈ RN such that ‖θ‖2 ≤ R, ‖gt(θ)‖2 ≤ G.

Proof. By the definition of gt(θ), we obtain

‖gt(θ)‖2 = ‖φ(xt, at)(r(xt, at) + γφᵀ(xt+1, at+1)θ − φᵀ(xt, at)θ)‖2
≤ |r(xt, at) + γφᵀ(xt+1, at+1)θ − φᵀ(xt, at)θ|
≤ rmax + (1 + γ)‖θ‖2
≤ G,   (22)

where the first two inequalities are due to the assumption that ‖φ(x, a)‖2 ≤ 1.

The following lemma is useful to deal with the time-varying learning policy.

Lemma 6. For any θ1 and θ2 in RN,

dTV(Pθ1, Pθ2) ≤ |A|C(⌈logρ m⁻¹⌉ + 1/(1 − ρ))‖θ1 − θ2‖2,   (23)

and

dTV(µθ1, µθ2) ≤ |A|C(1 + ⌈logρ m⁻¹⌉ + 1/(1 − ρ))‖θ1 − θ2‖2.   (24)

Proof. For θi, i = 1, 2, define the transition kernels respectively as follows:

Ki(x, dy) = ∑_{a∈A} P(dy|x, a)πθi(a|x).   (25)

Following from Theorem 3.1 in (Mitrophanov, 2005), we obtain

dTV(Pθ1, Pθ2) ≤ (⌈logρ m⁻¹⌉ + 1/(1 − ρ))‖K1 − K2‖,   (26)

where ‖·‖ is the operator norm ‖A‖ := sup_{‖q‖TV=1} ‖qA‖TV, and ‖·‖TV denotes the total-variation norm. Then, we have

‖K1 − K2‖ = sup_{‖q‖TV=1} ‖∫_X q(dx)(K1 − K2)(x, ·)‖TV
= sup_{‖q‖TV=1} ∫_X |∫_X q(dx)(K1 − K2)(x, dy)|
≤ sup_{‖q‖TV=1} ∫_X ∫_X |q(dx)| |(K1 − K2)(x, dy)|
= sup_{‖q‖TV=1} ∫_X ∫_X |q(dx)| |∑_{a∈A} P(dy|x, a)(πθ1(a|x) − πθ2(a|x))|
≤ sup_{‖q‖TV=1} ∫_X ∫_X |q(dx)| ∑_{a∈A} P(dy|x, a) |πθ1(a|x) − πθ2(a|x)|
≤ |A|C‖θ1 − θ2‖2.   (27)


By definition, µθi(dx, a) = Pθi(dx)πθi(a|x) for i = 1, 2. Therefore, the second result follows after a few steps of simple computations.

Lemma 7. For any θ ∈ RN such that ‖θ‖2 ≤ R,

⟨θ − θ∗, g(θ) − g(θ∗)⟩ ≤ −ws‖θ − θ∗‖₂².   (28)

Proof. Let θ = θ − θ∗. Denote by dψθ = µθ(dx, a)P(dy|x, a)πθ(b|y). By the definition of g, we have

〈θ − θ∗, g(θ)− g(θ∗)〉

=

∫x∈X

∑a∈A

θTφ(x, a)r(x, a)(µθ(dx, a)− µθ∗(dx, a))

+

∫x∈X

∑a∈A

∫y∈X

∑b∈A

θTφ(x, a)(γφT (y, b)− φT (x, a)) (θdψθ − θ∗dψθ∗)

=

∫x∈X

∑a∈A

θTφ(x, a)r(x, a)(µθ(dx, a)− µθ∗(dx, a))

+

∫x∈X

∑a∈A

∫y∈X

∑b∈A

θTφ(x, a)(γφT (y, b)− φT (x, a))θ(dψθ − dψθ∗)

+

∫x∈X

∑a∈A

∫y∈X

∑b∈A

θTφ(x, a)(γφT (y, b)− φT (x, a))θdψθ∗ . (29)

The first term in (29) can be bounded as follows:
\[
\begin{aligned}
\int_{x\in\mathcal{X}} \sum_{a\in\mathcal{A}} \bar\theta^T \phi(x,a)\, r(x,a)\big(\mu_\theta(dx,a) - \mu_{\theta^*}(dx,a)\big)
&\le \|\bar\theta\|_2\, r_{\max}\, \|\mu_\theta - \mu_{\theta^*}\|_{TV} \\
&\le \|\bar\theta\|_2^2\, r_{\max} |\mathcal{A}|\, C \Big(1 + \lceil \log_\rho m^{-1} \rceil + \frac{1}{1-\rho}\Big) \\
&\le \lambda_1 C \|\bar\theta\|_2^2,
\end{aligned} \tag{30}
\]
where the second inequality follows from Lemma 6, and λ_1 = r_max |A| (2 + ⌈log_ρ m^{-1}⌉ + 1/(1−ρ)).

The second term in (29) can be bounded as follows:
\[
\begin{aligned}
\int_{x\in\mathcal{X}} \sum_{a\in\mathcal{A}} \int_{y\in\mathcal{X}} \sum_{b\in\mathcal{A}} \bar\theta^T \phi(x,a)\big(\gamma\phi^T(y,b) - \phi^T(x,a)\big)\theta\,\big(d\psi_\theta - d\psi_{\theta^*}\big)
&\le \|\bar\theta\|_2 (1+\gamma) \|\theta\|_2\, \|\psi_\theta - \psi_{\theta^*}\|_{TV} \\
&\le \|\bar\theta\|_2^2\, (1+\gamma) R\, |\mathcal{A}|\, C \Big(2 + \lceil \log_\rho m^{-1} \rceil + \frac{1}{1-\rho}\Big) \\
&= \lambda_2 C \|\bar\theta\|_2^2,
\end{aligned} \tag{31}
\]
where the second inequality follows from Lemma 6, and λ_2 = (1 + γ) R |A| (2 + ⌈log_ρ m^{-1}⌉ + 1/(1−ρ)).

Let A_{θ*} = E_{θ*}[φ(X,A)(γφ^T(Y,B) − φ^T(X,A))], which is negative definite (Perkins & Precup, 2003; Tsitsiklis & Roy, 1997). The third term in (29) is equal to θ̄^T A_{θ*} θ̄.

Hence,
\[
\langle \theta - \theta^*, g(\theta) - g(\theta^*) \rangle \le \bar\theta^T \big(A_{\theta^*} + C(\lambda_1 + \lambda_2) I\big)\bar\theta \le -w_s \|\theta - \theta^*\|_2^2, \tag{32}
\]
where I is the identity matrix, and −w_s is the largest eigenvalue of A_{θ*} + C(λ_1 + λ_2)I.
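Once A_{θ*} and the shift C(λ_1 + λ_2) are known, w_s can be computed directly as minus the top eigenvalue of the shifted matrix. The sketch below uses a synthetic negative-definite stand-in for A_{θ*}; the shift value is an arbitrary placeholder assumed small enough to preserve negative definiteness.

```python
import numpy as np

# Illustrative sketch: -w_s in (32) is the largest eigenvalue of
# A_{theta*} + C (lambda_1 + lambda_2) I. The matrix below is a synthetic
# negative-definite stand-in for A_{theta*}; the shift plays the role of
# C (lambda_1 + lambda_2) and is assumed small enough to keep the sum negative definite.
rng = np.random.default_rng(6)
M = rng.normal(size=(5, 5))
A = -(M @ M.T + np.eye(5))                      # synthetic negative-definite matrix
shift = 0.05                                    # stand-in for C (lambda_1 + lambda_2)
w_s = -np.max(np.linalg.eigvalsh(A + shift * np.eye(5)))
print("w_s =", w_s)                             # positive whenever the shifted matrix stays negative definite
```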


Lemma 8. For all t ≥ 0, Λ_t(θ_t) ≤ 2G².

Proof. The result follows from Lemma 5.

Lemma 9. |Λ_t(θ_1) − Λ_t(θ_2)| ≤ 6G ‖θ_1 − θ_2‖_2.

Proof. It is clear that
\[
\begin{aligned}
|\Lambda_t(\theta_1) - \Lambda_t(\theta_2)|
&= \big|\langle \theta_1 - \theta^*, g_t(\theta_1) - g(\theta_1)\rangle - \langle \theta_2 - \theta^*, g_t(\theta_2) - g(\theta_2)\rangle\big| \\
&= \big|\langle \theta_1 - \theta^*, g_t(\theta_1) - g(\theta_1) - (g_t(\theta_2) - g(\theta_2))\rangle + \langle \theta_1 - \theta^* - (\theta_2 - \theta^*), g_t(\theta_2) - g(\theta_2)\rangle\big| \\
&\le 2R\, \|g_t(\theta_1) - g(\theta_1) - (g_t(\theta_2) - g(\theta_2))\|_2 + 2G\, \|\theta_1 - \theta_2\|_2.
\end{aligned} \tag{33}
\]
The first term in (33) can be further upper bounded as follows:
\[
\|g_t(\theta_1) - g(\theta_1) - (g_t(\theta_2) - g(\theta_2))\|_2 \le \|g_t(\theta_1) - g_t(\theta_2)\|_2 + \|g(\theta_1) - g(\theta_2)\|_2. \tag{34}
\]

For the first term in (34),
\[
\begin{aligned}
\|g_t(\theta_1) - g_t(\theta_2)\|_2 &= \big\|\phi(x_t,a_t)\big(\gamma(\phi^T(x_{t+1},a_{t+1})\theta_1 - \phi^T(x_{t+1},a_{t+1})\theta_2) - \phi^T(x_t,a_t)(\theta_1 - \theta_2)\big)\big\|_2 \\
&\le \big|\gamma\,\phi^T(x_{t+1},a_{t+1})(\theta_1 - \theta_2)\big| + \big|\phi^T(x_t,a_t)(\theta_1 - \theta_2)\big| \\
&\le (1+\gamma)\|\theta_1 - \theta_2\|_2.
\end{aligned} \tag{35}
\]
For the second term in (34), it follows similarly to (35) that ‖g(θ_1) − g(θ_2)‖_2 ≤ (1 + γ)‖θ_1 − θ_2‖_2. Substituting (34) and (35) into (33), and noting that (1 + γ)R ≤ G, then yields |Λ_t(θ_1) − Λ_t(θ_2)| ≤ 4G‖θ_1 − θ_2‖_2 + 2G‖θ_1 − θ_2‖_2 = 6G‖θ_1 − θ_2‖_2, which completes the proof.

We next prove the key technical lemma used in the proof of Theorem 1.

Lemma 10. For any τ > 0 and t > τ,
\[
\mathbb{E}[\Lambda_t(\theta_t)] \le \frac{2C|\mathcal{A}|G^3\tau}{w_s}\log\frac{t}{t-\tau} + 4G^2 m\rho^{\tau-1} + 6G^2\log\frac{t}{t-\tau}.
\]

Proof. Step 1. For any i ≥ 0, by the update rule in (1), it is clear that
\[
\|\theta_{i+1} - \theta_i\|_2 = \big\|\mathrm{proj}_{2,R}(\theta_i + \alpha_i g_i(\theta_i)) - \mathrm{proj}_{2,R}(\theta_i)\big\|_2 \le \|\alpha_i g_i(\theta_i)\|_2 \le \alpha_i G. \tag{36}
\]
Therefore, for any τ ≥ 0, ‖θ_t − θ_{t−τ}‖_2 ≤ Σ_{i=t−τ}^{t−1} ‖θ_{i+1} − θ_i‖_2 ≤ G Σ_{i=t−τ}^{t−1} α_i, which, together with the Lipschitz continuity of Λ_t(θ) in Lemma 9, implies that |Λ_t(θ_t) − Λ_t(θ_{t−τ})| ≤ 6G² Σ_{i=t−τ}^{t−1} α_i, and thus
\[
\Lambda_t(\theta_t) \le \Lambda_t(\theta_{t-\tau}) + 6G^2 \sum_{i=t-\tau}^{t-1} \alpha_i. \tag{37}
\]

Step 2. Conditioning on θ_{t−τ} and X_{t−τ+1}, we construct the following auxiliary Markov chain:
\[
\tilde X_{t-\tau+1} \to \tilde X_{t-\tau+2} \to \tilde X_{t-\tau+3} \to \cdots \to \tilde X_t \to \tilde X_{t+1}, \tag{38}
\]
where for any measurable set B ⊆ X and for k = t−τ+2, ..., t+1,
\[
P(\tilde X_k \in B \,|\, \theta_{t-\tau}, \tilde X_{k-1}) = \sum_{a\in\mathcal{A}} \pi_{\theta_{t-\tau}}(\tilde A_{k-1} = a \,|\, \tilde X_{k-1})\, P(\tilde X_k \in B \,|\, \tilde A_{k-1} = a, \tilde X_{k-1}), \tag{39}
\]
and X̃_{t−τ+1} = X_{t−τ+1}. In short, conditioning on θ_{t−τ} and X_{t−τ+1}, this auxiliary Markov chain is generated by repeatedly applying the same learning policy π_{θ_{t−τ}}.


From the construction of the Markov chain in (38), and the assumption that for any θ the Markov chain induced by π_θ is uniformly ergodic and satisfies Assumption 1, it follows that, conditioning on (θ_{t−τ}, X_{t−τ+1}), for any k ≥ t−τ+1,
\[
\big\|P(\tilde X_k \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}) - P_{\theta_{t-\tau}}\big\|_{TV} \le m\rho^{\,k-(t-\tau+1)}. \tag{40}
\]
It can be shown that
\[
\mathbb{E}[\Lambda_t(\theta_{t-\tau}, \tilde O_t) \,|\, \theta_{t-\tau}, X_{t-\tau+1}] - \mathbb{E}[\Lambda_t(\theta_{t-\tau}, O'_t) \,|\, \theta_{t-\tau}, X_{t-\tau+1}] \le 2G^2(m\rho^{\tau-1} + m\rho^{\tau}) \le 4G^2 m\rho^{\tau-1}, \tag{41}
\]
where O'_t is generated independently according to the stationary distribution P_{θ_{t−τ}} and the policy π_{θ_{t−τ}}. Since for any fixed θ, E[Λ_t(θ)] = 0, we have E[Λ_t(θ_{t−τ}, O'_t) | θ_{t−τ}, X_{t−τ+1}] = 0. It then follows that
\[
\mathbb{E}[\Lambda_t(\theta_{t-\tau}, \tilde O_t)] \le 2G^2(m\rho^{\tau-1} + m\rho^{\tau}) \le 4G^2 m\rho^{\tau-1}. \tag{42}
\]

Step 3. For convenience, rewrite Λ_t(θ_{t−τ}) = Λ_t(θ_{t−τ}, O_t), where O_t = (X_t, A_t, X_{t+1}, A_{t+1}), and let Õ_t = (X̃_t, Ã_t, X̃_{t+1}, Ã_{t+1}) denote the corresponding tuple generated by the auxiliary chain in (38). Then, conditioning on θ_{t−τ} and X_{t−τ+1}, we have
\[
\mathbb{E}[\Lambda_t(\theta_{t-\tau}, O_t) \,|\, \theta_{t-\tau}, X_{t-\tau+1}] - \mathbb{E}[\Lambda_t(\theta_{t-\tau}, \tilde O_t) \,|\, \theta_{t-\tau}, X_{t-\tau+1}]
\le 2G^2 \big\|P(O_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}) - P(\tilde O_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1})\big\|_{TV}. \tag{43}
\]

To develop an upper bound on the total-variation norm above, recall that for the Markov chain generated by the SARSA algorithm, conditioning on (θ_{t−τ}, X_{t−τ+1}),
\[
X_{t-\tau+1} \to X_{t-\tau+2} \to X_{t-\tau+3} \to \cdots \to X_t \to X_{t+1}, \tag{44}
\]
where for any measurable set B ⊆ X and for k = t−τ+2, ..., t+1,
\[
P(X_k \in B \,|\, \theta_{t-\tau}, X_{k-1}) = \sum_{a\in\mathcal{A}} \pi_{\theta_{k-2}}(A_{k-1} = a \,|\, X_{k-1})\, P(X_k \in B \,|\, A_{k-1} = a, X_{k-1}). \tag{45}
\]
By the definitions of the two Markov chains, with a slight abuse of notation, it follows that
\[
P(O_t \,|\, \theta_{t-\tau}, X_{t-\tau+1}) = P(X_t \,|\, \theta_{t-\tau}, X_{t-\tau+1})\,\pi_{\theta_{t-1}}(A_t|X_t)\,P(X_{t+1}|X_t, A_t)\,\pi_{\theta_t}(A_{t+1}|X_{t+1}), \tag{46}
\]
\[
P(\tilde O_t \,|\, \theta_{t-\tau}, X_{t-\tau+1}) = P(\tilde X_t \,|\, \theta_{t-\tau}, X_{t-\tau+1})\,\pi_{\theta_{t-\tau}}(\tilde A_t|\tilde X_t)\,P(\tilde X_{t+1}|\tilde X_t, \tilde A_t)\,\pi_{\theta_{t-\tau}}(\tilde A_{t+1}|\tilde X_{t+1}), \tag{47}
\]
where we omit the expectation over θ_t and θ_{t−1} conditioned on (θ_{t−τ}, X_{t−τ+1}) in the notation, which does not affect the results. Following a few steps of simple computations, we obtain
\[
\begin{aligned}
&\big\|P(O_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}) - P(\tilde O_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1})\big\|_{TV} \\
&\quad \le \big\|P(X_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}) - P(\tilde X_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1})\big\|_{TV} \\
&\qquad + \big\|\pi_{\theta_{t-1}}(A_t = \cdot \,|\, X_t) - \pi_{\theta_{t-\tau}}(A_t = \cdot \,|\, X_t)\big\|_{TV}
+ \big\|\pi_{\theta_t}(A_{t+1} = \cdot \,|\, X_{t+1}) - \pi_{\theta_{t-\tau}}(A_{t+1} = \cdot \,|\, X_{t+1})\big\|_{TV}.
\end{aligned} \tag{48}
\]

Recalling that ‖θ_t − θ_{t−τ}‖_2 ≤ Σ_{i=t−τ}^{t−1} ‖θ_{i+1} − θ_i‖_2 ≤ G Σ_{i=t−τ}^{t−1} α_i, and by the Lipschitz condition in (2), it follows that
\[
\big\|\pi_{\theta_{t-1}}(A_t = \cdot \,|\, X_t) - \pi_{\theta_{t-\tau}}(A_t = \cdot \,|\, X_t)\big\|_{TV} \le C|\mathcal{A}|\,\|\theta_{t-1} - \theta_{t-\tau}\|_2 \le C|\mathcal{A}|G \sum_{i=t-\tau}^{t-2} \alpha_i, \tag{49}
\]
and similarly
\[
\big\|\pi_{\theta_t}(A_{t+1} = \cdot \,|\, X_{t+1}) - \pi_{\theta_{t-\tau}}(A_{t+1} = \cdot \,|\, X_{t+1})\big\|_{TV} \le C|\mathcal{A}|G \sum_{i=t-\tau}^{t-1} \alpha_i. \tag{50}
\]


To bound the first term in (48), it can be shown that
\[
P(X_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}) = \int_{\mathcal{X}} P(X_{t-1} = dx, X_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1})
= \int_{\mathcal{X}} P(X_{t-1} = dx \,|\, \theta_{t-\tau}, X_{t-\tau+1})\, P(X_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}, X_{t-1} = x), \tag{51}
\]
where
\[
P(X_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}, X_{t-1} = x) = \sum_{a\in\mathcal{A}} \pi_{\theta_{t-2}}(A_{t-1} = a \,|\, X_{t-1} = x)\, P(X_t \in \cdot \,|\, X_{t-1} = x, A_{t-1} = a), \tag{52}
\]
and similarly
\[
P(\tilde X_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}) = \int_{\mathcal{X}} P(\tilde X_{t-1} = dx, \tilde X_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1})
= \int_{\mathcal{X}} P(\tilde X_{t-1} = dx \,|\, \theta_{t-\tau}, X_{t-\tau+1})\, P(\tilde X_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}, \tilde X_{t-1} = x), \tag{53}
\]
where
\[
P(\tilde X_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}, \tilde X_{t-1} = x) = \sum_{a\in\mathcal{A}} \pi_{\theta_{t-\tau}}(\tilde A_{t-1} = a \,|\, \tilde X_{t-1} = x)\, P(\tilde X_t \in \cdot \,|\, \tilde X_{t-1} = x, \tilde A_{t-1} = a). \tag{54}
\]
Thus, ‖P(X_t ∈ · | θ_{t−τ}, X_{t−τ+1}) − P(X̃_t ∈ · | θ_{t−τ}, X_{t−τ+1})‖_TV can be bounded recursively as follows:
\[
\begin{aligned}
&\big\|P(X_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}) - P(\tilde X_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1})\big\|_{TV} \\
&\quad \le \big\|P(X_{t-1} \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}) - P(\tilde X_{t-1} \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1})\big\|_{TV} + |\mathcal{A}|C\,\|\theta_{t-2} - \theta_{t-\tau}\|_2 \\
&\quad \le \big\|P(X_{t-1} \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}) - P(\tilde X_{t-1} \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1})\big\|_{TV} + C|\mathcal{A}|G \sum_{i=t-\tau}^{t-3} \alpha_i.
\end{aligned} \tag{55}
\]

Doing this recursively implies that
\[
\begin{aligned}
\big\|P(O_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}) - P(\tilde O_t \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1})\big\|_{TV}
&\le \big\|P(X_{t-\tau+2} \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1}) - P(\tilde X_{t-\tau+2} \in \cdot \,|\, \theta_{t-\tau}, X_{t-\tau+1})\big\|_{TV} + C|\mathcal{A}|G \sum_{j=t-\tau}^{t-1} \sum_{i=t-\tau}^{j} \alpha_i \\
&= C|\mathcal{A}|G \sum_{j=t-\tau}^{t-1} \sum_{i=t-\tau}^{j} \alpha_i \\
&\le \frac{C|\mathcal{A}|G\tau}{w_s} \log\frac{t}{t-\tau}.
\end{aligned} \tag{56}
\]
Combining Steps 1, 2 and 3 completes the proof.
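As a side illustration of Step 1, the sketch below implements the projected update θ_{i+1} = proj_{2,R}(θ_i + α_i g_i(θ_i)) with the SARSA semi-gradient g_i and checks the per-step drift bound ‖θ_{i+1} − θ_i‖_2 ≤ α_i G from (36). The feature vectors, rewards, and constants are synthetic placeholders chosen for illustration, not the paper's setting.

```python
import numpy as np

# Illustrative sketch of the projected update theta_{i+1} = proj_{2,R}(theta_i + alpha_i g_i(theta_i))
# with the SARSA semi-gradient g_i, checking the per-step drift bound (36):
# ||theta_{i+1} - theta_i||_2 <= alpha_i G.  Features, rewards and constants are synthetic placeholders.
R, gamma, r_max, w_s = 5.0, 0.9, 1.0, 0.1
G = r_max + (1.0 + gamma) * R                   # the bound of Lemma 5

def proj_2R(theta):
    norm = np.linalg.norm(theta)
    return theta if norm <= R else theta * (R / norm)

def sarsa_semi_gradient(theta, phi_sa, reward, phi_next_sa):
    # g_t(theta) = phi(x_t,a_t) ( r + gamma phi(x_{t+1},a_{t+1})^T theta - phi(x_t,a_t)^T theta )
    td_error = reward + gamma * phi_next_sa @ theta - phi_sa @ theta
    return phi_sa * td_error

rng = np.random.default_rng(1)
theta = proj_2R(rng.normal(size=8))
for i in range(5):
    alpha_i = 1.0 / (w_s * (i + 1))             # step size alpha_i = 1 / (w_s (i + 1))
    phi_sa = rng.normal(size=8);  phi_sa /= max(np.linalg.norm(phi_sa), 1.0)
    phi_next = rng.normal(size=8); phi_next /= max(np.linalg.norm(phi_next), 1.0)
    g_i = sarsa_semi_gradient(theta, phi_sa, rng.uniform(-r_max, r_max), phi_next)
    theta_next = proj_2R(theta + alpha_i * g_i)
    assert np.linalg.norm(theta_next - theta) <= alpha_i * G + 1e-9   # drift bound (36)
    theta = theta_next
```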

A.2. Proof of Theorem 1

We decompose the error as follows:
\[
\begin{aligned}
\mathbb{E}\big[\|\theta_{t+1} - \theta^*\|_2^2\big]
&= \mathbb{E}\big[\big\|\mathrm{proj}_{2,R}(\theta_t + \alpha_t g_t(\theta_t)) - \mathrm{proj}_{2,R}(\theta^*)\big\|_2^2\big] \\
&\le \mathbb{E}\big[\|\theta_t + \alpha_t g_t(\theta_t) - \theta^*\|_2^2\big] \\
&= \mathbb{E}\big[\|\theta_t - \theta^*\|_2^2 + \alpha_t^2\|g_t(\theta_t)\|_2^2 + 2\alpha_t \langle \theta_t - \theta^*, g_t(\theta_t)\rangle\big] \\
&= \mathbb{E}\big[\|\theta_t - \theta^*\|_2^2 + \alpha_t^2\|g_t(\theta_t)\|_2^2 + 2\alpha_t \langle \theta_t - \theta^*, g(\theta_t) - g(\theta^*)\rangle + 2\alpha_t \Lambda_t(\theta_t)\big],
\end{aligned} \tag{57}
\]


where the inequality is due to the fact that orthogonal projections onto a convex set are non-expansive, and the last step is due to the fact that g(θ*) = 0. Applying Lemmas 5 and 7 implies that
\[
\mathbb{E}\big[\|\theta_{t+1} - \theta^*\|_2^2\big] \le (1 - \alpha_t w_s)\,\mathbb{E}\big[\|\theta_t - \theta^*\|_2^2\big] + \alpha_t^2 G^2 + 2\alpha_t\,\mathbb{E}[\Lambda_t(\theta_t)], \tag{58}
\]
which, with α_t = 1/(w_s(t+1)), further implies that
\[
w_s(t+1)\,\mathbb{E}\big[\|\theta_{t+1} - \theta^*\|_2^2\big] \le t w_s\,\mathbb{E}\big[\|\theta_t - \theta^*\|_2^2\big] + \alpha_t G^2 + 2\,\mathbb{E}[\Lambda_t(\theta_t)]. \tag{59}
\]

Applying (59) recursively, and bounding E[Λ_t(θ_t)] by Lemma 8 for t ≤ τ_0 and by Lemma 10 (with τ = τ_0) for t > τ_0, yields
\[
\begin{aligned}
w_s T\,\mathbb{E}\big[\|\theta_T - \theta^*\|_2^2\big]
&\le \sum_{t=0}^{T-1} \big(\alpha_t G^2 + 2\,\mathbb{E}[\Lambda_t(\theta_t)]\big) \\
&\le \frac{(\log T + 1)G^2}{w_s} + \sum_{t=0}^{\tau_0} 2\,\mathbb{E}[\Lambda_t(\theta_t)] + \sum_{t=\tau_0+1}^{T-1} 2\,\mathbb{E}[\Lambda_t(\theta_t)] \\
&\le \frac{(\log T + 1)G^2}{w_s} + 4G^2\tau_0 + \frac{4\big(C|\mathcal{A}|G^3\tau_0^2 + 3G^2\tau_0 w_s\big)(\log T + 1)}{w_s} + \frac{8G^2}{\rho w_s} \\
&= \frac{G^2\big(4C|\mathcal{A}|G\tau_0^2 + 12\tau_0 w_s + 1\big)(\log T + 1)}{w_s} + \frac{4G^2(\tau_0 w_s + 2\rho^{-1})}{w_s},
\end{aligned} \tag{60}
\]
which completes the proof.
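The rate in (60) can also be seen numerically from the scalar recursion behind (58)-(59). The sketch below is illustrative only: it iterates e_{t+1} ≤ (1 − α_t w_s)e_t + α_t² G² + 2α_t L_t with α_t = 1/(w_s(t+1)) and an O(log t / t) surrogate L_t standing in for the bias bound of Lemma 10; the constants are arbitrary placeholders.

```python
import numpy as np

# Illustrative sketch of the scalar recursion behind (58)-(59):
#   e_{t+1} <= (1 - alpha_t w_s) e_t + alpha_t^2 G^2 + 2 alpha_t L_t,  alpha_t = 1/(w_s (t+1)),
# with an O(log t / t) surrogate L_t standing in for the bias bound of Lemma 10.
# Constants are arbitrary; the point is the O(log T / T) decay matching Theorem 1.
w_s, G, T = 0.5, 2.0, 100_000
e = 1.0
for t in range(T):
    alpha = 1.0 / (w_s * (t + 1))
    bias = min(2 * G**2, G**2 * np.log(t + 2) / (t + 1))   # surrogate for E[Lambda_t(theta_t)]
    e = (1 - alpha * w_s) * e + alpha**2 * G**2 + 2 * alpha * bias
print(f"e_T = {e:.6f}   vs   log(T)/(w_s^2 T) = {np.log(T) / (w_s**2 * T):.6f}")
```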

A.3. Proof of Lemma 1

Consider A_θ = E_θ[φ(X,A)(γφ^T(Y,B) − φ^T(X,A))] and b_θ = E_θ[φ(X,A)r(X,A)], as defined in Section 3.2 with respect to θ. It has been shown that for any θ ∈ R^N, A_θ is negative definite (Perkins & Precup, 2003; Tsitsiklis & Roy, 1997). Denote by w_l the largest eigenvalue of A_{θ*}. Recall that the limit point θ* satisfies the relationship −A_{θ*}θ* = b_{θ*} (Theorem 2 of (Melo et al., 2008)). It then follows that −(θ*)^T A_{θ*} θ* = (θ*)^T b_{θ*}, which implies −w_l ‖θ*‖_2^2 ≤ ‖θ*‖_2 r_max. Thus, ‖θ*‖_2 ≤ −r_max/w_l, which completes the proof.
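A quick numerical sanity check of this bound, using a random negative-definite matrix as a stand-in for A_{θ*} and r_max = 1 (the example is ours, not the paper's):

```python
import numpy as np

# Quick numerical sanity check of Lemma 1: if A is negative definite with largest
# eigenvalue w_l < 0 and -A theta* = b with ||b||_2 <= r_max, then ||theta*||_2 <= -r_max / w_l.
# The matrix below is a random negative-definite stand-in for A_{theta*}, with r_max = 1.
rng = np.random.default_rng(2)
M = rng.normal(size=(6, 6))
A = -(M @ M.T + 0.1 * np.eye(6))                          # negative definite
b = rng.normal(size=6)
b *= 1.0 / max(np.linalg.norm(b), 1.0)                    # enforce ||b||_2 <= r_max = 1
theta_star = np.linalg.solve(-A, b)                       # -A theta* = b
w_l = np.max(np.linalg.eigvalsh(A))                       # largest (least negative) eigenvalue
assert np.linalg.norm(theta_star) <= -1.0 / w_l + 1e-9
```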

B. Technical Proofs for Q-Learning with Linear Function Approximation

B.1. Useful Lemmas for Proof of Theorem 2

We first introduce some notation. For any θ ∈ R^N, we define
\[
g(\theta) = \mathbb{E}_\pi\Big[\phi(X,A)\Big(r(X,A) + \gamma \max_{b\in\mathcal{A}} \phi^T(Y,b)\,\theta - \phi^T(X,A)\,\theta\Big)\Big]. \tag{61}
\]
It can be easily verified that g(θ*) = 0. We then define Λ_t(θ) = 〈θ − θ*, g_t(θ) − g(θ)〉.

Lemma 11. For any θ ∈ R^N such that ‖θ‖_2 ≤ R, ‖g_t(θ)‖_2 ≤ G.

Proof. The proof is similar to the one for Lemma 5, and is hence omitted.

Lemma 12. For any θ ∈ R^N,
\[
\langle \theta - \theta^*, g(\theta) - g(\theta^*) \rangle \le -\frac{w_s\|\theta - \theta^*\|_2^2}{2}.
\]

Proof. For simplicity of notation, denote θ̄ = θ − θ*. By the definition of g(θ), it follows that
\[
\langle \bar\theta, g(\theta) - g(\theta^*) \rangle = \bar\theta^T\big(g(\theta) - g(\theta^*)\big)
= \gamma\,\mathbb{E}_\pi\Big[\phi^T(X,A)\bar\theta\,\Big(\max_{b\in\mathcal{A}} \phi^T(Y,b)\theta - \max_{b\in\mathcal{A}} \phi^T(Y,b)\theta^*\Big)\Big] - \bar\theta^T \Sigma_\pi \bar\theta. \tag{62}
\]

Following steps similar to those in the proof of Theorem 1 in (Melo et al., 2008), the first term can be upper bounded as follows:
\[
\gamma\,\mathbb{E}_\pi\Big[\phi^T(X,A)\bar\theta\,\Big(\max_{b\in\mathcal{A}} \phi^T(Y,b)\theta - \max_{b\in\mathcal{A}} \phi^T(Y,b)\theta^*\Big)\Big]
\le \gamma \sqrt{\bar\theta^T \Sigma_\pi \bar\theta}\, \sqrt{\max\big\{\bar\theta^T \Sigma_\pi^*(\theta)\bar\theta,\ \bar\theta^T \Sigma_\pi^*(\theta^*)\bar\theta\big\}}. \tag{63}
\]


This implies that
\[
\begin{aligned}
\langle \bar\theta, g(\theta) - g(\theta^*) \rangle
&\le \gamma \sqrt{\bar\theta^T \Sigma_\pi \bar\theta}\, \sqrt{\max\big\{\bar\theta^T \Sigma_\pi^*(\theta)\bar\theta,\ \bar\theta^T \Sigma_\pi^*(\theta^*)\bar\theta\big\}} - \bar\theta^T \Sigma_\pi \bar\theta \\
&= \sqrt{\bar\theta^T \Sigma_\pi \bar\theta}\, \Big(\sqrt{\gamma^2 \max\big\{\bar\theta^T \Sigma_\pi^*(\theta)\bar\theta,\ \bar\theta^T \Sigma_\pi^*(\theta^*)\bar\theta\big\}} - \sqrt{\bar\theta^T \Sigma_\pi \bar\theta}\,\Big) \\
&\le -\frac{w_s\|\bar\theta\|_2^2}{2},
\end{aligned} \tag{64}
\]
where the last step is due to Assumption 3.

Lemma 13. For any θ with ‖θ‖_2 ≤ R and any t ≥ 0, Λ_t(θ) ≤ 2G².

Proof. The result follows directly from Lemma 11.

The following lemma shows that Λ_t(θ) is Lipschitz continuous with respect to θ.

Lemma 14. For any θ_1 and θ_2 in R^N, |Λ_t(θ_1) − Λ_t(θ_2)| ≤ 6G ‖θ_1 − θ_2‖_2.

Proof. It is clear that
\[
\begin{aligned}
|\Lambda_t(\theta_1) - \Lambda_t(\theta_2)|
&= \big|\langle \theta_1 - \theta^*, g_t(\theta_1) - g(\theta_1)\rangle - \langle \theta_2 - \theta^*, g_t(\theta_2) - g(\theta_2)\rangle\big| \\
&= \big|\langle \theta_1 - \theta^*, g_t(\theta_1) - g(\theta_1) - (g_t(\theta_2) - g(\theta_2))\rangle + \langle \theta_1 - \theta^* - (\theta_2 - \theta^*), g_t(\theta_2) - g(\theta_2)\rangle\big| \\
&\le 2R\,\|g_t(\theta_1) - g(\theta_1) - (g_t(\theta_2) - g(\theta_2))\|_2 + 2G\,\|\theta_1 - \theta_2\|_2.
\end{aligned} \tag{65}
\]
The first term in (65) can be further upper bounded as follows:
\[
\|g_t(\theta_1) - g(\theta_1) - (g_t(\theta_2) - g(\theta_2))\|_2 \le \|g_t(\theta_1) - g_t(\theta_2)\|_2 + \|g(\theta_1) - g(\theta_2)\|_2. \tag{66}
\]

The first term in (66) can be further bounded as follows:
\[
\begin{aligned}
\|g_t(\theta_1) - g_t(\theta_2)\|_2 &= \Big\|\phi(x_t,a_t)\Big(\gamma\Big(\max_{b\in\mathcal{A}} \phi^T(x_{t+1},b)\theta_1 - \max_{b\in\mathcal{A}} \phi^T(x_{t+1},b)\theta_2\Big) - \phi^T(x_t,a_t)(\theta_1 - \theta_2)\Big)\Big\|_2 \\
&\le \Big|\gamma\Big(\max_{b\in\mathcal{A}} \phi^T(x_{t+1},b)\theta_1 - \max_{b\in\mathcal{A}} \phi^T(x_{t+1},b)\theta_2\Big)\Big| + \big|\phi^T(x_t,a_t)(\theta_1 - \theta_2)\big| \\
&\le (1+\gamma)\|\theta_1 - \theta_2\|_2,
\end{aligned} \tag{67}
\]

where the last inequality is due to the following argument. Let a^θ_x ∈ argmax_{b∈A} φ^T(x,b)θ denote a maximizing action. Consider
\[
\max_{b\in\mathcal{A}} \phi^T(x_{t+1},b)\theta_1 - \max_{b\in\mathcal{A}} \phi^T(x_{t+1},b)\theta_2 = \phi^T\big(x_{t+1}, a^{\theta_1}_{x_{t+1}}\big)\theta_1 - \phi^T\big(x_{t+1}, a^{\theta_2}_{x_{t+1}}\big)\theta_2, \tag{68}
\]
which can be lower bounded by φ^T(x_{t+1}, a^{θ_2}_{x_{t+1}})(θ_1 − θ_2) and upper bounded by φ^T(x_{t+1}, a^{θ_1}_{x_{t+1}})(θ_1 − θ_2). It then follows that
\[
\Big|\max_{b\in\mathcal{A}} \phi^T(x_{t+1},b)\theta_1 - \max_{b\in\mathcal{A}} \phi^T(x_{t+1},b)\theta_2\Big|
\le \max\Big\{\big|\phi^T\big(x_{t+1}, a^{\theta_2}_{x_{t+1}}\big)(\theta_1 - \theta_2)\big|,\ \big|\phi^T\big(x_{t+1}, a^{\theta_1}_{x_{t+1}}\big)(\theta_1 - \theta_2)\big|\Big\}
\le \|\theta_1 - \theta_2\|_2. \tag{69}
\]
For the second term in (66), it follows similarly that ‖g(θ_1) − g(θ_2)‖_2 ≤ (1 + γ)‖θ_1 − θ_2‖_2. This completes the proof.
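The non-expansiveness of the max operation used in (68)-(69) is easy to verify numerically; the feature matrix below is a random placeholder with rows normalized so that ‖φ(x,b)‖_2 ≤ 1.

```python
import numpy as np

# Numerical check of the argument in (68)-(69): with ||phi(x,b)||_2 <= 1,
# | max_b phi(x,b)^T theta1 - max_b phi(x,b)^T theta2 | <= ||theta1 - theta2||_2.
# The feature matrix below is a random placeholder.
rng = np.random.default_rng(3)
Phi = rng.normal(size=(5, 8))
Phi /= np.maximum(np.linalg.norm(Phi, axis=1, keepdims=True), 1.0)   # rows: phi(x_{t+1}, b)
for _ in range(1000):
    theta1, theta2 = rng.normal(size=8), rng.normal(size=8)
    gap = abs(np.max(Phi @ theta1) - np.max(Phi @ theta2))
    assert gap <= np.linalg.norm(theta1 - theta2) + 1e-12
```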

Lemma 15. For t ≤ τ_0, E[Λ_t(θ_t)] ≤ 6G²(log t + 1)/w_s; and for t > τ_0, E[Λ_t(θ_t)] ≤ (4 + 6τ_0)G² α_{t−τ_0}.


Proof. Following steps similar to those from (36) to (37), and applying the Lipschitz continuity of Λ_t(θ) in Lemma 14, it follows that
\[
\Lambda_t(\theta_t) \le \Lambda_t(\theta_{t-\tau}) + 6G^2 \sum_{i=t-\tau}^{t-1} \alpha_i. \tag{70}
\]
The next step is to provide an upper bound on E_π[Λ_t(θ_{t−τ})]. It is clear that for any fixed θ, E_π[Λ_t(θ)] = 0, since E_π[g_t(θ)] = g(θ) by definition. Let O_t = (X_t, A_t, X_{t+1}), and rewrite Λ_t(θ) as Λ_t(θ, O_t). Further define an independent O'_t = (X'_t, A'_t, X'_{t+1}) that has the same marginal distribution as O_t. It is clear that E[Λ_t(θ_{t−τ}, O'_t)] = 0, and X'_t ∼ P_π.

Note that the following Markov chain holds:
\[
\theta_{t-\tau} \to X_{t-\tau} \to X_t \to O_t. \tag{71}
\]

Consider E[Λ_t(θ_{t−τ}, O_t)] = E[E[Λ_t(θ_{t−τ}, O_t) | X_{t−τ}]], where we first take the expectation with respect to O_t conditioned on X_{t−τ}, and then take the expectation with respect to X_{t−τ}. Due to the Markov chain (71), conditioned on X_{t−τ}, O_t and θ_{t−τ} are independent. Further, following from Assumption 1, it can be shown that
\[
d_{TV}\big(P(X_t \in \cdot \,|\, X_{t-\tau}),\ P_\pi\big) \le m\rho^{\tau}. \tag{72}
\]

Then
\[
\big|\mathbb{E}[\mathbb{E}[\Lambda_t(\theta_{t-\tau}, O_t) \,|\, X_{t-\tau}]] - \mathbb{E}[\mathbb{E}[\Lambda_t(\theta_{t-\tau}, O'_t) \,|\, X_{t-\tau}]]\big|
\le 4G^2\, d_{TV}\big(P(X_t \in \cdot \,|\, X_{t-\tau}),\ P_\pi\big) \le 4G^2 m\rho^{\tau}, \tag{73}
\]
which implies that
\[
\mathbb{E}[\Lambda_t(\theta_{t-\tau}, O_t)] \le 4G^2 m\rho^{\tau}. \tag{74}
\]
Recall that τ_0 = min{t ≥ 0 : mρ^t ≤ α_T}. For t ≤ τ_0, let τ = t. Then, since α_t = 1/(w_s(t+1)), it follows that
\[
\mathbb{E}[\Lambda_t(\theta_t)] \le \mathbb{E}[\Lambda_t(\theta_0)] + 6G^2 \sum_{i=0}^{t-1} \alpha_i \le \frac{6G^2(\log t + 1)}{w_s}. \tag{75}
\]

For t > τ_0, let τ = τ_0. It then follows that
\[
\mathbb{E}[\Lambda_t(\theta_t)] \le \mathbb{E}[\Lambda_t(\theta_{t-\tau_0})] + 6G^2 \sum_{i=t-\tau_0}^{t-1} \alpha_i
\le 4G^2 m\rho^{\tau_0} + 6G^2 \tau_0 \alpha_{t-\tau_0}
\le (4 + 6\tau_0)G^2 \alpha_{t-\tau_0}. \tag{76}
\]
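The threshold τ_0 = min{t ≥ 0 : mρ^t ≤ α_T} used above grows only logarithmically in T under geometric mixing, which is what keeps the (4 + 6τ_0)G² α_{t−τ_0} bound small. A small sketch with illustrative placeholder constants m, ρ, w_s:

```python
import math

# Illustrative sketch of the threshold tau_0 = min{t >= 0 : m rho^t <= alpha_T}
# with alpha_T = 1/(w_s (T+1)): under geometric mixing it grows only logarithmically in T.
# The constants m, rho, w_s are placeholders.
def tau_0(T, m=2.0, rho=0.9, w_s=0.5):
    alpha_T = 1.0 / (w_s * (T + 1))
    t = 0
    while m * rho**t > alpha_T:
        t += 1
    return t

for T in (10**2, 10**4, 10**6):
    approx = math.log(2.0 * 0.5 * (T + 1)) / math.log(1 / 0.9)   # log(m w_s (T+1)) / log(1/rho)
    print(f"T = {T:>8d}   tau_0 = {tau_0(T):3d}   ~ {approx:.1f}")
```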

B.2. Proof of Theorem 2

We bound the error term as follows:
\[
\begin{aligned}
\mathbb{E}\big[\|\theta_{t+1} - \theta^*\|_2^2\big]
&= \mathbb{E}\big[\big\|\mathrm{proj}_{2,R}(\theta_t + \alpha_t g_t(\theta_t)) - \mathrm{proj}_{2,R}(\theta^*)\big\|_2^2\big] \\
&\le \mathbb{E}\big[\|\theta_t + \alpha_t g_t(\theta_t) - \theta^*\|_2^2\big] \\
&= \mathbb{E}\big[\|\theta_t - \theta^*\|_2^2 + \alpha_t^2\|g_t(\theta_t)\|_2^2 + 2\alpha_t \langle \theta_t - \theta^*, g_t(\theta_t)\rangle\big] \\
&= \mathbb{E}\big[\|\theta_t - \theta^*\|_2^2 + \alpha_t^2\|g_t(\theta_t)\|_2^2 + 2\alpha_t \langle \theta_t - \theta^*, g(\theta_t) - g(\theta^*)\rangle + 2\alpha_t \Lambda_t(\theta_t)\big],
\end{aligned} \tag{77}
\]


where the inequality is due to the fact that orthogonal projections onto a convex set are non-expansive, and the last step is due to the fact that g(θ*) = 0. Following from (77) and the lemmas in Section B.1, we obtain
\[
\mathbb{E}\big[\|\theta_{t+1} - \theta^*\|_2^2\big] \le (1 - \alpha_t w_s)\,\mathbb{E}\big[\|\theta_t - \theta^*\|_2^2\big] + \alpha_t^2 G^2 + 2\alpha_t\,\mathbb{E}[\Lambda_t(\theta_t)], \tag{78}
\]
which implies that
\[
0 \le w_s t\,\mathbb{E}\big[\|\theta_t - \theta^*\|_2^2\big] - w_s(t+1)\,\mathbb{E}\big[\|\theta_{t+1} - \theta^*\|_2^2\big] + \frac{G^2}{w_s(t+1)} + 2\,\mathbb{E}[\Lambda_t(\theta_t)]. \tag{79}
\]

Taking the summation of (79) over t = 0, 1, ..., T − 1, we obtain
\[
\begin{aligned}
w_s T\,\mathbb{E}\big[\|\theta_T - \theta^*\|_2^2\big]
&\le \frac{G^2(\log T + 1)}{w_s} + \sum_{t=0}^{\tau_0} 2\,\mathbb{E}[\Lambda_t(\theta_t)] + \sum_{t=\tau_0+1}^{T-1} 2\,\mathbb{E}[\Lambda_t(\theta_t)] \\
&\le \frac{G^2(\log T + 1)}{w_s} + \frac{12\tau_0 G^2(\log T + 1)}{w_s} + \frac{(8 + 12\tau_0)G^2(\log T + 1)}{w_s} \\
&= \frac{(9 + 24\tau_0)G^2(\log T + 1)}{w_s}.
\end{aligned} \tag{80}
\]
This completes the proof of Theorem 2.
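For concreteness, the following sketch runs the projected Q-learning iterate analyzed in Theorem 2 on a small synthetic MDP with random features; the MDP, features, behavior policy, and constants are hypothetical placeholders, not an experiment from the paper.

```python
import numpy as np

# Illustrative sketch of the projected Q-learning iterate analyzed in Theorem 2,
#   theta <- proj_{2,R}( theta + alpha_t phi(x,a) ( r + gamma max_b phi(y,b)^T theta - phi(x,a)^T theta ) ),
# run on a small synthetic MDP with random features and a uniform behavior policy.
# The MDP, features and constants are hypothetical placeholders, not the paper's setting.
rng = np.random.default_rng(4)
nS, nA, N = 20, 4, 6
Phi = rng.normal(size=(nS, nA, N))
Phi /= np.maximum(np.linalg.norm(Phi, axis=2, keepdims=True), 1.0)   # ||phi(x,a)||_2 <= 1
P = rng.dirichlet(np.ones(nS), size=(nS, nA))                        # transition kernel P(y|x,a)
Rew = rng.uniform(0.0, 1.0, size=(nS, nA))                           # rewards in [0, r_max] with r_max = 1
gamma, R, w_s = 0.9, 10.0, 0.1

def proj_2R(theta):
    n = np.linalg.norm(theta)
    return theta if n <= R else theta * (R / n)

theta, x = np.zeros(N), 0
for t in range(50_000):
    a = rng.integers(nA)                          # fixed uniform behavior policy (stand-in for pi)
    y = rng.choice(nS, p=P[x, a])
    alpha = 1.0 / (w_s * (t + 1))
    td = Rew[x, a] + gamma * np.max(Phi[y] @ theta) - Phi[x, a] @ theta
    theta = proj_2R(theta + alpha * Phi[x, a] * td)
    x = y
print("||theta_T||_2 =", np.linalg.norm(theta))
```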

B.3. Proof of Lemma 2

Following from (10), θ* satisfies the following equation:
\[
\theta^* = \Sigma_\pi^{-1}\,\mathbb{E}_\pi\Big[\phi(X,A)\Big(r(X,A) + \gamma \max_{b\in\mathcal{A}} \phi^T(Y,b)\theta^*\Big)\Big], \tag{81}
\]

where E_π denotes the expectation with (X,A) ∼ μ_π and Y ∼ P(·|X,A). It then follows that
\[
\theta^{*T}\Sigma_\pi\theta^* = \theta^{*T}\,\mathbb{E}_\pi\Big[\phi(X,A)\Big(r(X,A) + \gamma \max_{b\in\mathcal{A}} \phi^T(Y,b)\theta^*\Big)\Big], \tag{82}
\]
which implies that
\[
\theta^{*T}\Sigma_\pi\theta^* - \gamma\,\theta^{*T}\,\mathbb{E}_\pi\Big[\phi(X,A) \max_{b\in\mathcal{A}} \phi^T(Y,b)\theta^*\Big]
= \theta^{*T}\,\mathbb{E}_\pi\big[\phi(X,A)\,r(X,A)\big] \le \|\theta^*\|_2\, r_{\max}, \tag{83}
\]

where the last step follows from the assumptions that ‖φ(x,a)‖_2 ≤ 1 and r(x,a) ≤ r_max for all (x,a) ∈ X × A. By Hölder's inequality,
\[
\theta^{*T}\,\mathbb{E}_\pi\Big[\phi(X,A) \max_{b\in\mathcal{A}} \phi^T(Y,b)\theta^*\Big]
\le \sqrt{\mathbb{E}_\pi\big[(\theta^{*T}\phi(X,A))^2\big]\,\mathbb{E}_\pi\Big[\Big(\max_{b\in\mathcal{A}} \phi^T(Y,b)\theta^*\Big)^2\Big]}
= \sqrt{(\theta^{*T}\Sigma_\pi\theta^*)\,(\theta^{*T}\Sigma_\pi^*(\theta^*)\theta^*)}. \tag{84}
\]

Combining (83) and (84) yields
\[
\sqrt{\theta^{*T}\Sigma_\pi\theta^*}\Big(\sqrt{\theta^{*T}\Sigma_\pi\theta^*} - \sqrt{\gamma^2\,\theta^{*T}\Sigma_\pi^*(\theta^*)\theta^*}\Big) \le \|\theta^*\|_2\, r_{\max}. \tag{85}
\]
By Assumption 3, it then follows that √(θ*^T Σ_π θ*) > √(γ² θ*^T Σ*_π(θ*) θ*), and
\[
\frac{\sqrt{\theta^{*T}\Sigma_\pi\theta^*}}{\sqrt{\theta^{*T}\Sigma_\pi\theta^*} + \sqrt{\gamma^2\,\theta^{*T}\Sigma_\pi^*(\theta^*)\theta^*}} > \frac{1}{2}, \tag{86}
\]


which together with (85) yields
\[
\|\theta^*\|_2 \le \frac{2 r_{\max}}{w_s}. \tag{87}
\]

C. Technical Proofs for Minimax SARSA with Linear Function Approximation

C.1. Proof of Theorem 3

The proof follows in a similar fashion to that of Theorem 1, with adaptations to the zero-sum Markov game and the minimax SARSA algorithm. For the minimax SARSA algorithm, define for any θ ∈ R^N,
\[
g(\theta) = \mathbb{E}_\theta\big[\phi(X,A_1,A_2)\big(r(X,A_1,A_2) + \gamma\,\phi^T(Y,B_1,B_2)\,\theta - \phi^T(X,A_1,A_2)\,\theta\big)\big]. \tag{88}
\]
Define θ* such that g(θ*) = 0. Following steps similar to those in the proof of Theorem 5.1 in (De Farias & Van Roy, 2000), we can verify that θ* exists. We then define Λ_t(θ) = 〈θ − θ*, g_t(θ) − g(θ)〉.

We then obtain the following properties for minimax SARSA by following steps similar to those used to prove Lemmas 5-10.

(a) For any θ ∈ R^N such that ‖θ‖_2 ≤ R, ‖g_t(θ)‖_2 ≤ G.

(b) For any θ_1 and θ_2 in R^N,
\[
d_{TV}(P_{\theta_1}, P_{\theta_2}) \le |\mathcal{A}_1||\mathcal{A}_2|\,C\Big(\lceil \log_\rho m^{-1} \rceil + \frac{1}{1-\rho}\Big)\|\theta_1 - \theta_2\|_2, \tag{89}
\]
and
\[
d_{TV}(\mu_{\theta_1}, \mu_{\theta_2}) \le |\mathcal{A}_1||\mathcal{A}_2|\,C\Big(1 + \lceil \log_\rho m^{-1} \rceil + \frac{1}{1-\rho}\Big)\|\theta_1 - \theta_2\|_2. \tag{90}
\]

(c) For any θ ∈ R^N such that ‖θ‖_2 ≤ R, 〈θ − θ*, g(θ) − g(θ*)〉 ≤ −w_s ‖θ − θ*‖_2^2.

(d) For all t ≥ 0, Λ_t(θ_t) ≤ 2G².

(e) |Λ_t(θ_1) − Λ_t(θ_2)| ≤ 6G ‖θ_1 − θ_2‖_2.

(f) For any τ > 0 and t > τ,
\[
\mathbb{E}[\Lambda_t(\theta_t)] \le \frac{2C|\mathcal{A}_1||\mathcal{A}_2|G^3\tau}{w_s}\log\frac{t}{t-\tau} + 4G^2 m\rho^{\tau-1} + 6G^2\log\frac{t}{t-\tau}.
\]

With the above properties, we can obtain the following bound on the error by following steps similar to those in the proof of Theorem 1:
\[
w_s T\,\mathbb{E}\big[\|\theta_T - \theta^*\|_2^2\big] \le \frac{(\log T + 1)G^2}{w_s} + 4G^2\tau_0 + \frac{\big(4C|\mathcal{A}_1||\mathcal{A}_2|G^3\tau_0^2 + 12G^2\tau_0 w_s\big)(\log T + 1)}{w_s} + \frac{8G^2}{\rho w_s}, \tag{91}
\]
which completes the proof.

C.2. Proof of Lemma 3

Consider A_θ = E_θ[φ(X,A_1,A_2)(γφ^T(Y,B_1,B_2) − φ^T(X,A_1,A_2))] and b_θ = E_θ[φ(X,A_1,A_2)r(X,A_1,A_2)], as defined in Section 5.3 with respect to θ. Denote by w_l the largest eigenvalue of A_{θ*}. Following steps similar to those in the proof of Lemma 1, we have
\[
\|\theta^*\|_2 \le -\frac{r_{\max}}{w_l}, \tag{92}
\]
which completes the proof.


D. Technical Proofs for Minimax Q-Learning with Linear Function Approximation

D.1. Proof of Theorem 4

We define the operator T(θ) = Σ_π^{-1} E_π[φ H Q_θ], and the Σ_π-norm of a vector x ∈ R^N as ‖x‖_{Σ_π} = √(x^T Σ_π x). Choose any θ_1, θ_2 ∈ R^N. Then, we have
\[
T(\theta_1) = \Sigma_\pi^{-1}\,\mathbb{E}_\pi\Big[\phi(X,A_1,A_2)\Big(r(X,A_1,A_2) + \gamma\,\big(\pi^{\theta_1}_{1,Y}\big)^T\big[\Phi^T(Y)\,\theta_1\big]\pi^{\theta_1}_{2,Y} - \phi^T(X,A_1,A_2)\theta_1\Big)\Big], \tag{93}
\]
\[
T(\theta_2) = \Sigma_\pi^{-1}\,\mathbb{E}_\pi\Big[\phi(X,A_1,A_2)\Big(r(X,A_1,A_2) + \gamma\,\big(\pi^{\theta_2}_{1,Y}\big)^T\big[\Phi^T(Y)\,\theta_2\big]\pi^{\theta_2}_{2,Y} - \phi^T(X,A_1,A_2)\theta_2\Big)\Big]. \tag{94}
\]

Then it can be shown that
\[
\big(T(\theta_1) - T(\theta_2)\big)^T \Sigma_\pi \big(T(\theta_1) - T(\theta_2)\big)
= \gamma\,\mathbb{E}_\pi\Big[\phi^T(X,A_1,A_2)\big(T(\theta_1) - T(\theta_2)\big)\Big(\big(\pi^{\theta_1}_{1,Y}\big)^T\big[\Phi^T(Y)\theta_1\big]\pi^{\theta_1}_{2,Y} - \big(\pi^{\theta_2}_{1,Y}\big)^T\big[\Phi^T(Y)\theta_2\big]\pi^{\theta_2}_{2,Y}\Big)\Big]. \tag{95}
\]

Define the sets S_+ = {(x,a_1,a_2) : φ^T(x,a_1,a_2)(T(θ_1) − T(θ_2)) > 0} and S_− = X × A_1 × A_2 \ S_+, i.e., the complement of S_+. For fixed T(θ_1) − T(θ_2) and any y ∈ X, a_1 ∈ A_1 and a_2 ∈ A_2, the following holds:
\[
\begin{aligned}
&\mathbf{1}_{\{(x,a_1,a_2)\in S_+\}}\,\phi^T(x,a_1,a_2)\big(T(\theta_1) - T(\theta_2)\big)\Big(\big(\pi^{\theta_1}_{1,y}\big)^T\big[\Phi^T(y)\theta_1\big]\pi^{\theta_1}_{2,y} - \big(\pi^{\theta_2}_{1,y}\big)^T\big[\Phi^T(y)\theta_2\big]\pi^{\theta_2}_{2,y}\Big) \\
&\quad = \mathbf{1}_{\{(x,a_1,a_2)\in S_+\}}\,\phi^T(x,a_1,a_2)\big(T(\theta_1) - T(\theta_2)\big)\Big[\Big(\big(\pi^{\theta_1}_{1,y}\big)^T\big[\Phi^T(y)\theta_1\big]\pi^{\theta_1}_{2,y} - \big(\pi^{\theta_1}_{1,y}\big)^T\big[\Phi^T(y)\theta_1\big]\pi^{\theta_2}_{2,y}\Big) \\
&\qquad + \Big(\big(\pi^{\theta_1}_{1,y}\big)^T\big[\Phi^T(y)\theta_1\big]\pi^{\theta_2}_{2,y} - \big(\pi^{\theta_1}_{1,y}\big)^T\big[\Phi^T(y)\theta_2\big]\pi^{\theta_2}_{2,y}\Big)
+ \Big(\big(\pi^{\theta_1}_{1,y}\big)^T\big[\Phi^T(y)\theta_2\big]\pi^{\theta_2}_{2,y} - \big(\pi^{\theta_2}_{1,y}\big)^T\big[\Phi^T(y)\theta_2\big]\pi^{\theta_2}_{2,y}\Big)\Big] \\
&\quad \le \mathbf{1}_{\{(x,a_1,a_2)\in S_+\}}\,\phi^T(x,a_1,a_2)\big(T(\theta_1) - T(\theta_2)\big)\Big(\big(\pi^{\theta_1}_{1,y}\big)^T\big[\Phi^T(y)\theta_1\big]\pi^{\theta_2}_{2,y} - \big(\pi^{\theta_1}_{1,y}\big)^T\big[\Phi^T(y)\theta_2\big]\pi^{\theta_2}_{2,y}\Big) \\
&\quad = \mathbf{1}_{\{(x,a_1,a_2)\in S_+\}}\,\phi^T(x,a_1,a_2)\big(T(\theta_1) - T(\theta_2)\big)\Big(\big(\pi^{\theta_1}_{1,y}\big)^T\big[\Phi^T(y)(\theta_1 - \theta_2)\big]\pi^{\theta_2}_{2,y}\Big),
\end{aligned} \tag{96}
\]
and
\[
\begin{aligned}
&\mathbf{1}_{\{(x,a_1,a_2)\in S_-\}}\,\phi^T(x,a_1,a_2)\big(T(\theta_1) - T(\theta_2)\big)\Big(\big(\pi^{\theta_1}_{1,y}\big)^T\big[\Phi^T(y)\theta_1\big]\pi^{\theta_1}_{2,y} - \big(\pi^{\theta_2}_{1,y}\big)^T\big[\Phi^T(y)\theta_2\big]\pi^{\theta_2}_{2,y}\Big) \\
&\quad = \mathbf{1}_{\{(x,a_1,a_2)\in S_-\}}\,\phi^T(x,a_1,a_2)\big(T(\theta_1) - T(\theta_2)\big)\Big[\Big(\big(\pi^{\theta_1}_{1,y}\big)^T\big[\Phi^T(y)\theta_1\big]\pi^{\theta_1}_{2,y} - \big(\pi^{\theta_2}_{1,y}\big)^T\big[\Phi^T(y)\theta_1\big]\pi^{\theta_1}_{2,y}\Big) \\
&\qquad + \Big(\big(\pi^{\theta_2}_{1,y}\big)^T\big[\Phi^T(y)\theta_1\big]\pi^{\theta_1}_{2,y} - \big(\pi^{\theta_2}_{1,y}\big)^T\big[\Phi^T(y)\theta_2\big]\pi^{\theta_1}_{2,y}\Big)
+ \Big(\big(\pi^{\theta_2}_{1,y}\big)^T\big[\Phi^T(y)\theta_2\big]\pi^{\theta_1}_{2,y} - \big(\pi^{\theta_2}_{1,y}\big)^T\big[\Phi^T(y)\theta_2\big]\pi^{\theta_2}_{2,y}\Big)\Big] \\
&\quad \le \mathbf{1}_{\{(x,a_1,a_2)\in S_-\}}\,\phi^T(x,a_1,a_2)\big(T(\theta_1) - T(\theta_2)\big)\Big(\big(\pi^{\theta_2}_{1,y}\big)^T\big[\Phi^T(y)\theta_1\big]\pi^{\theta_1}_{2,y} - \big(\pi^{\theta_2}_{1,y}\big)^T\big[\Phi^T(y)\theta_2\big]\pi^{\theta_1}_{2,y}\Big) \\
&\quad = \mathbf{1}_{\{(x,a_1,a_2)\in S_-\}}\,\phi^T(x,a_1,a_2)\big(T(\theta_1) - T(\theta_2)\big)\Big(\big(\pi^{\theta_2}_{1,y}\big)^T\big[\Phi^T(y)(\theta_1 - \theta_2)\big]\pi^{\theta_1}_{2,y}\Big).
\end{aligned} \tag{97}
\]

Then (95) can be bounded as follows:
\[
\begin{aligned}
\|T(\theta_1) - T(\theta_2)\|_{\Sigma_\pi}^2
&\le \gamma\,\mathbb{E}_\pi\Big[\phi^T(X,A_1,A_2)\big(T(\theta_1) - T(\theta_2)\big)\Big(\big(\pi^{\theta_1}_{1,Y}\big)^T\big[\Phi^T(Y)(\theta_1 - \theta_2)\big]\pi^{\theta_2}_{2,Y}\Big)\mathbf{1}_{S_+}\Big] \\
&\quad + \gamma\,\mathbb{E}_\pi\Big[\phi^T(X,A_1,A_2)\big(T(\theta_1) - T(\theta_2)\big)\Big(\big(\pi^{\theta_2}_{1,Y}\big)^T\big[\Phi^T(Y)(\theta_1 - \theta_2)\big]\pi^{\theta_1}_{2,Y}\Big)\mathbf{1}_{S_-}\Big].
\end{aligned} \tag{98}
\]

Following steps similar to those in the proof of Theorem 1 in (Melo et al., 2008), we obtain
\[
\|T(\theta_1) - T(\theta_2)\|_{\Sigma_\pi}^2 \le \gamma \sqrt{\|T(\theta_1) - T(\theta_2)\|_{\Sigma_\pi}^2\, \max\big\{\|\theta_1 - \theta_2\|_{\Sigma_\pi^*(\theta_1,\theta_2)}^2,\ \|\theta_1 - \theta_2\|_{\Sigma_\pi^*(\theta_2,\theta_1)}^2\big\}}. \tag{99}
\]

By Assumption 5, we have
\[
\max\big\{\|\theta_1 - \theta_2\|_{\Sigma_\pi^*(\theta_1,\theta_2)}^2,\ \|\theta_1 - \theta_2\|_{\Sigma_\pi^*(\theta_2,\theta_1)}^2\big\} < \frac{1}{\gamma^2}\|\theta_1 - \theta_2\|_{\Sigma_\pi}^2. \tag{100}
\]

Define
\[
\eta = \gamma \sqrt{\sup_{\theta_1,\theta_2\in\mathbb{R}^N} \frac{\|\theta_1 - \theta_2\|_{\Sigma_\pi^*(\theta_1,\theta_2)}^2}{\|\theta_1 - \theta_2\|_{\Sigma_\pi}^2}}, \tag{101}
\]
which is less than 1 by Assumption 5. Then, we have ‖T(θ_1) − T(θ_2)‖_{Σ_π} ≤ η‖θ_1 − θ_2‖_{Σ_π}. Since η < 1, T is a contraction operator in the Σ_π-norm. Hence there exists a unique fixed point θ* such that T(θ*) = θ*.
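The conclusion rests on the Banach fixed-point theorem: an η-contraction with η < 1 has a unique fixed point, which the iteration θ_{k+1} = T(θ_k) locates. The sketch below demonstrates this with an affine map whose Euclidean Lipschitz constant is below 1, as a generic stand-in for the paper's operator T and its Σ_π-norm.

```python
import numpy as np

# Illustrative sketch of the fixed-point argument: an eta-contraction (eta < 1) has a unique
# fixed point, which Banach iteration theta_{k+1} = T(theta_k) locates. The affine map below
# is a stand-in for the paper's operator T, contracting in the Euclidean norm instead of the
# Sigma_pi-norm.
rng = np.random.default_rng(5)
M = rng.normal(size=(6, 6))
M *= 0.8 / np.linalg.norm(M, 2)                  # spectral norm 0.8, so eta = 0.8 < 1
c = rng.normal(size=6)
T_op = lambda theta: M @ theta + c

theta = np.zeros(6)
for _ in range(200):
    theta = T_op(theta)
theta_star = np.linalg.solve(np.eye(6) - M, c)   # exact fixed point of the affine map
print("||theta_200 - theta*||_2 =", np.linalg.norm(theta - theta_star))
```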


D.2. Proof of Theorem 5

For any θ ∈ R^N, we define
\[
g(\theta) = \mathbb{E}_\pi\Big[\phi(X,A_1,A_2)\Big(r(X,A_1,A_2) + \gamma \min_{\pi_2\in\delta_2}\max_{\pi_1\in\delta_1} \pi_1^T\big[\Phi^T(Y)\,\theta\big]\pi_2 - \phi^T(X,A_1,A_2)\theta\Big)\Big]. \tag{102}
\]
It is shown in Theorem 4 that there exists θ* ∈ R^N such that g(θ*) = 0. We then define Λ_t(θ) = 〈θ − θ*, g_t(θ) − g(θ)〉. We can obtain the following properties for minimax Q-learning by following steps similar to those used to prove Lemma 11 and Lemmas 13-15.

(a) For any θ ∈ R^N such that ‖θ‖_2 ≤ R, ‖g_t(θ)‖_2 ≤ G.

(b) For any θ with ‖θ‖_2 ≤ R and any t ≥ 0, Λ_t(θ) ≤ 2G².

(c) For any θ_1 and θ_2 in R^N, |Λ_t(θ_1) − Λ_t(θ_2)| ≤ 6G ‖θ_1 − θ_2‖_2.

(d) For t ≤ τ_0, E[Λ_t(θ_t)] ≤ 6G²(log t + 1)/w_s; and for t > τ_0, E[Λ_t(θ_t)] ≤ (4 + 6τ_0)G² α_{t−τ_0}.

We further prove the following lemma.

Lemma 16. For any θ ∈ R^N,
\[
\langle \theta - \theta^*, g(\theta) - g(\theta^*) \rangle \le -\frac{w_s\|\theta - \theta^*\|_2^2}{2}.
\]

Proof. For simplicity of notation, denote θ̄ = θ − θ*. By the definition of g(θ), it follows that
\[
\langle \bar\theta, g(\theta) - g(\theta^*) \rangle = \bar\theta^T\big(g(\theta) - g(\theta^*)\big)
= \gamma\,\mathbb{E}_\pi\Big[\phi^T(X,A_1,A_2)\bar\theta\,\Big(\min_{\pi_2\in\delta_2}\max_{\pi_1\in\delta_1} \pi_1^T\big[\Phi^T(Y)\theta\big]\pi_2 - \min_{\pi_2\in\delta_2}\max_{\pi_1\in\delta_1} \pi_1^T\big[\Phi^T(Y)\theta^*\big]\pi_2\Big)\Big] - \bar\theta^T \Sigma_\pi \bar\theta. \tag{103}
\]

The following steps are similar to those in the proofs of Theorem 4 above and of Theorem 1 in (Melo et al., 2008). The first term above can be upper bounded as follows:
\[
\gamma\,\mathbb{E}_\pi\Big[\phi^T(X,A_1,A_2)\bar\theta\,\Big(\min_{\pi_2\in\delta_2}\max_{\pi_1\in\delta_1} \pi_1^T\big[\Phi^T(Y)\theta\big]\pi_2 - \min_{\pi_2\in\delta_2}\max_{\pi_1\in\delta_1} \pi_1^T\big[\Phi^T(Y)\theta^*\big]\pi_2\Big)\Big]
\le \gamma \sqrt{\bar\theta^T \Sigma_\pi \bar\theta}\, \sqrt{\max\big\{\bar\theta^T \Sigma_\pi^*(\theta,\theta^*)\bar\theta,\ \bar\theta^T \Sigma_\pi^*(\theta^*,\theta)\bar\theta\big\}}. \tag{104}
\]

This implies that
\[
\begin{aligned}
\langle \bar\theta, g(\theta) - g(\theta^*) \rangle
&\le \gamma \sqrt{\bar\theta^T \Sigma_\pi \bar\theta}\, \sqrt{\max\big\{\bar\theta^T \Sigma_\pi^*(\theta,\theta^*)\bar\theta,\ \bar\theta^T \Sigma_\pi^*(\theta^*,\theta)\bar\theta\big\}} - \bar\theta^T \Sigma_\pi \bar\theta \\
&= \sqrt{\bar\theta^T \Sigma_\pi \bar\theta}\, \Big(\sqrt{\gamma^2 \max\big\{\bar\theta^T \Sigma_\pi^*(\theta,\theta^*)\bar\theta,\ \bar\theta^T \Sigma_\pi^*(\theta^*,\theta)\bar\theta\big\}} - \sqrt{\bar\theta^T \Sigma_\pi \bar\theta}\,\Big) \\
&\le -\frac{w_s\|\bar\theta\|_2^2}{2},
\end{aligned} \tag{105}
\]
where the last step is due to Assumption 5.

With the above properties, Theorem 5 follows by steps similar to those in the proof of Theorem 2.
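The inner term min_{π_2∈δ_2} max_{π_1∈δ_1} π_1^T[Φ^T(Y)θ]π_2 appearing in (102) is the value of a matrix game and can be computed by a standard linear program. The sketch below is illustrative only: it assumes SciPy is available and uses the matching-pennies payoff matrix as a stand-in for Φ^T(y)θ.

```python
import numpy as np
from scipy.optimize import linprog   # assumes SciPy is available

# Illustrative sketch: the inner term min_{pi2} max_{pi1} pi1^T Q pi2 in (102) is the value of a
# matrix game with payoff matrix Q = Phi^T(y) theta, computable by the standard LP formulation.
# The matching-pennies matrix below is a stand-in for Phi^T(y) theta.
def matrix_game_value(Q):
    n1, n2 = Q.shape
    # Variables [pi1 (n1 entries), v]; maximize v subject to Q^T pi1 >= v 1 and pi1 in the simplex.
    c = np.zeros(n1 + 1); c[-1] = -1.0                       # minimize -v
    A_ub = np.hstack([-Q.T, np.ones((n2, 1))])               # v - (Q^T pi1)_j <= 0 for each column j
    b_ub = np.zeros(n2)
    A_eq = np.hstack([np.ones((1, n1)), np.zeros((1, 1))])   # sum(pi1) = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n1 + [(None, None)]               # pi1 >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n1]

Q = np.array([[1.0, -1.0], [-1.0, 1.0]])                     # matching pennies: value 0, uniform strategy
value, pi1 = matrix_game_value(Q)
print(f"game value = {value:.3f}, player-1 strategy = {np.round(pi1, 3)}")
```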


D.3. Proof of Lemma 4

Following from Theorem 4, θ* satisfies the following equation:
\[
\theta^* = \Sigma_\pi^{-1}\,\mathbb{E}_\pi\Big[\phi(X,A_1,A_2)\Big(r(X,A_1,A_2) + \gamma\,\big(\pi^{\theta^*}_{1,Y}\big)^T\big[\Phi^T(Y)\,\theta^*\big]\pi^{\theta^*}_{2,Y}\Big)\Big], \tag{106}
\]
where [Φ^T(Y) θ*] denotes the action-value matrix with parameter θ* at state Y. It then follows that
\[
\theta^{*T}\Sigma_\pi\theta^* = \theta^{*T}\,\mathbb{E}_\pi\Big[\phi(X,A_1,A_2)\Big(r(X,A_1,A_2) + \gamma\,\big(\pi^{\theta^*}_{1,Y}\big)^T\big[\Phi^T(Y)\,\theta^*\big]\pi^{\theta^*}_{2,Y}\Big)\Big], \tag{107}
\]

which implies that
\[
\theta^{*T}\Sigma_\pi\theta^* - \gamma\,\theta^{*T}\,\mathbb{E}_\pi\Big[\phi(X,A_1,A_2)\Big(\big(\pi^{\theta^*}_{1,Y}\big)^T\big[\Phi^T(Y)\,\theta^*\big]\pi^{\theta^*}_{2,Y}\Big)\Big]
= \theta^{*T}\,\mathbb{E}_\pi\big[\phi(X,A_1,A_2)\,r(X,A_1,A_2)\big] \le \|\theta^*\|_2\, r_{\max}, \tag{108}
\]

where the last step follows from the assumptions that ‖φ(x,a_1,a_2)‖_2 ≤ 1 and r(x,a_1,a_2) ≤ r_max for all (x,a_1,a_2) ∈ X × A_1 × A_2. By Hölder's inequality,
\[
\theta^{*T}\,\mathbb{E}_\pi\Big[\phi(X,A_1,A_2)\Big(\big(\pi^{\theta^*}_{1,Y}\big)^T\big[\Phi^T(Y)\,\theta^*\big]\pi^{\theta^*}_{2,Y}\Big)\Big]
\le \sqrt{\mathbb{E}_\pi\big[(\theta^{*T}\phi(X,A_1,A_2))^2\big]\,\mathbb{E}_\pi\Big[\Big(\big(\pi^{\theta^*}_{1,Y}\big)^T\big[\Phi^T(Y)\,\theta^*\big]\pi^{\theta^*}_{2,Y}\Big)^2\Big]}
= \sqrt{(\theta^{*T}\Sigma_\pi\theta^*)\,(\theta^{*T}\Sigma_\pi^*(\theta^*,\theta^*)\theta^*)}. \tag{109}
\]

Combining (108) and (109) implies that
\[
\sqrt{\theta^{*T}\Sigma_\pi\theta^*}\Big(\sqrt{\theta^{*T}\Sigma_\pi\theta^*} - \sqrt{\gamma^2\,\theta^{*T}\Sigma_\pi^*(\theta^*,\theta^*)\theta^*}\Big) \le \|\theta^*\|_2\, r_{\max}. \tag{110}
\]
By Assumption 5, it follows that √(θ*^T Σ_π θ*) > √(γ² θ*^T Σ*_π(θ*,θ*) θ*), and
\[
\frac{\sqrt{\theta^{*T}\Sigma_\pi\theta^*}}{\sqrt{\theta^{*T}\Sigma_\pi\theta^*} + \sqrt{\gamma^2\,\theta^{*T}\Sigma_\pi^*(\theta^*,\theta^*)\theta^*}} > \frac{1}{2}, \tag{111}
\]
which together with (110) implies that
\[
\|\theta^*\|_2 \le \frac{2 r_{\max}}{w_s}. \tag{112}
\]