Opportunistic Fair Scheduling in Wireless Networks: An Approximate Dynamic Programming Approach


Mobile Netw Appl (2010) 15:710–728. DOI 10.1007/s11036-009-0198-x


Zhi Zhang · Sudhir Moola · Edwin K. P. Chong

Published online: 4 August 2009. © Springer Science + Business Media, LLC 2009

Abstract We consider the problem of temporal fair scheduling of queued data transmissions in wireless heterogeneous networks. We deal with both the throughput maximization problem and the delay minimization problem. Taking fairness constraints and the data arrival queues into consideration, we formulate the transmission scheduling problem as a Markov decision process (MDP) with fairness constraints. We study two categories of fairness constraints, namely temporal fairness and utilitarian fairness. We consider two criteria: infinite horizon expected total discounted reward and expected average reward. Applying the dynamic programming approach, we derive and prove explicit optimality equations for the above constrained MDPs, and give corresponding optimal fair scheduling policies based on those equations. A practical stochastic-approximation-type algorithm is applied to calculate the control parameters online in the policies. Furthermore, we develop a novel approximation method, temporal fair rollout, to achieve a tractable computation. Numerical results show that the proposed scheme

This research was supported in part by NSF under grant ECCS-0700559. Parts of an early version of this paper were presented at the IEEE Conference on Decision and Control 2008 [1].

Z. Zhang · S. Moola · E. K. P. Chong (corresponding author)
Department of Electrical and Computer Engineering, Colorado State University, Ft. Collins, CO 80523-1373, USA
e-mail: [email protected]

Z. Zhang
e-mail: [email protected]

S. Moola
e-mail: [email protected]

achieves significant performance improvement for both throughput maximization and delay minimization problems compared with other existing schemes.

Keywords approximate dynamic programming · fairness · Markov decision process · resource allocation · scheduling · stochastic approximation · wireless networks

1 Introduction

Next generation wireless networks, which support high-speed packet data while providing heterogeneous QoS guarantees, require flexible and efficient radio resource scheduling schemes. One of the fundamental characteristics of wireless networks is the time-varying and location-dependent channel conditions due to multipath fading. Efficient exploitation of such channel variation has attracted significant research interest in the past decade [2–4]. A good survey of wireless scheduling techniques can be found in [5].

From an information-theoretic viewpoint, Knopp and Humblet showed that the system capacity is maximized by exploiting the inherent multiuser diversity gain in the wireless channel [2]. The basic idea is to schedule a single user with the best instantaneous channel condition to transmit at any one time. Technology based on this idea has already been implemented in the current 3G systems: High Data Rate (HDR) [4] and high-speed downlink packet access (HSDPA) [6].

Good scheduling schemes in wireless networks should opportunistically seek to exploit the time-varying channel conditions to improve spectrum efficiency, thereby achieving multiuser diversity gain.


In this context, it is also important to consider the tradeoff between wireless resource efficiency and the level of satisfaction among individual users (fairness).

Fairness criteria are critical to the scheduling problem in wireless networks. For example, allowing only users close to the base station to transmit at high transmission rates may result in very high throughput, but sacrifice the transmission of other users. In [7–9], Liu et al. developed a unified opportunistic scheduling framework for multimedia communication in a cellular system, while providing three long-term QoS/fairness guarantees: temporal fairness, utilitarian fairness, and minimum-performance guarantees.

Practical radio channels are commonly modeled as multipath Rayleigh fading channels, which are correlated random processes. However, much of the prior work on scheduling, including [7–9], is based on relatively simple memoryless channel models. Finite-State Markov Channel (FSMC) models have been found to be accurate in modeling such channels with memory [10]. When channel memory is taken into consideration, the existing work on memoryless channels does not apply directly. Also, much of the previous work focused on "elastic" traffic [11] and assumed that the system has infinitely backlogged data queues, which is not always an appropriate assumption. This assumption makes it impossible to consider the data arrival queues and further evaluate the system delay performance.

The widely studied Markov decision processes (MDPs) and the associated dynamic programming methodology provide us with a general framework for posing and analyzing problems of sequential decision making under uncertainty [12–14]. Constrained Markov decision processes have been studied mostly via linear programming and Lagrangian methods [15–19]. However, the existing dynamic programming approach does not directly treat long-term fairness constraints; we show later how such constraints represent the users' fairness guarantees.

In [19], the authors considered a Markov decision problem to maximize the long-run average reward subject to multiple long-run average cost constraints. A linear program produces the optimal policy with limited randomization. In [20], the authors considered controlled Markov models with total discounted expected losses. They used a dynamic programming approach to find optimal admissible strategies, where admissibility means meeting a set of given constraint inequalities. The solution in their approach is a function of a probability distribution and the admissible expected loss, and randomization is allowed. In [21], the authors derived the dynamic programming equations for discounted cost minimization problems subject to a single identically structured constraint. The equations allowed them to characterize the structure of optimal policies, but the authors did not provide exact solution schemes or explicit computational tools for their approximation.

On the other hand, because of the notorious "curse of dimensionality," even exact solution schemes, such as value iteration and policy iteration, often cannot be applied directly in practice to solve MDP problems. This has motivated a broad class of approximation methods that involve more tractable computation but yield suboptimal policies, which we refer to as approximate dynamic programming (ADP) methods [22]. For example, Bertsekas and Castanon proposed an approximate approach called rollout for deterministic and stochastic problems that are usually computationally intractable [23, 24].

In this paper, we consider the opportunistic fair scheduling problem for the uplink of a single-cell Time Division Multiplexing (TDM) system. We provide a novel formulation of the scheduling problem with the Markov channel model as an MDP with explicit fairness constraints. We deal with both the throughput maximization problem and the delay minimization problem. We consider two criteria: infinite horizon expected total discounted reward and expected average reward. In either case, we characterize the corresponding optimal MDP-based fair scheduling scheme. We focus on two categories of fairness constraints, namely temporal fairness and utilitarian fairness. Owing to the particular characteristics of the constraints, we are able to derive and prove explicit dynamic programming equations for MDPs with fairness constraints. Based on these optimality equations, we obtain the exact corresponding optimal scheduling policies. A practical stochastic approximation algorithm is applied to calculate the control parameters online in the policies. Furthermore, based on the rollout algorithm, we develop a novel approximation method, temporal fair rollout, to achieve a tractable computation.

Our proposed scheme can easily be extended to different objective functions and other fairness measures. Although we only focus on uplink scheduling, the scheme is equally applicable to the downlink case.

Our work addresses heterogeneity of networks in three dimensions. First, there is heterogeneity in the channel conditions, owing to factors such as path loss, shadowing, and fading. Second is heterogeneity in the utility of the channel, which depends on factors such as the heterogeneity of the end-user devices and their capability (e.g., transmission power, battery capacity, signal-processing hardware, and application software). Third, there is heterogeneity in the end-user QoS and fairness requirements.

The rest of the paper is organized as follows. In Section 2, we describe our system model and MDP formulation. In Section 3, we derive the dynamic programming equation and the optimal scheduling policy for the temporal fair constrained problem with both the expected total discounted reward and expected average reward criteria. In Section 4, we derive the dynamic programming equations and the optimal scheduling policies for the utilitarian fair constrained problem with both the expected total discounted reward and expected average reward criteria. In Section 5, we propose an efficient approximation algorithm, temporal fair rollout. We discuss the stochastic approximation method for parameter estimation in Section 6. In Section 7, we present and analyze the channel model and the simulation results. Finally, concluding remarks are given in Section 8.

2 System model and problem formulation

2.1 System model

Figure 1 depicts an uplink data queueing model for a single-cell TDM system. We assume that there is a base station receiving data from K mobile users. A scheduler, located at the base station, decides at the start of each scheduling interval which (single) user to serve. We call a decision rule for scheduling which user transmits at each interval a scheduling policy.

Fig. 1 Uplink queueing model of a single-cell TDM system (mobile users with arrival processes A1(t), . . . , AK(t), queue lengths X1(t), . . . , XK(t), and channel states S1(t), . . . , SK(t) feeding a base station)

The wireless channel for each user differs depending on the location, the surrounding environment, and the mobility. Here we assume that the base station knows the channel state information (CSI) of all users perfectly. Each user has its own packet queue for transmission with unlimited queue capacity. We assume that packets arrive in each queue randomly (according to some distribution) and independently. The length of a scheduling interval (time slot) is fixed, and the channel does not vary significantly during a time slot. We also assume that all users have the same fixed packet size.

In practice, before each scheduling interval, all users need to report their current CSI to the base station. So, the perfect CSI assumption here potentially involves significant feedback signaling cost [25]. This issue has motivated the recent research interest in opportunistic scheduling with partial or reduced feedback [26–28].

Let t = 0, 1, . . . be the index of time slots, and k = 1, . . . , K the index of users. For user k at time slot t, we use Xk(t), Sk(t), and Ak(t) to denote the queue length, the channel state, and the exogenous packet arrivals, respectively (all in terms of number of packets). The channel state here is measured by the maximum number of packets each user can transmit to the base station in each time slot.

Let πt be the user scheduled at time slot t given a scheduling policy π. Using this notation, the queue length evolution is given by, for all k ∈ {1, . . . , K},

Xk(t + 1) = Xk(t) + Ak(t) − min (Xk(t), Sk(t)) 1{πt=k},

where 1{·} is the indicator function.
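As a concrete illustration, the following minimal Python sketch (hypothetical names, not the authors' simulator) applies this one-slot queue update when user k_sched is scheduled.

import numpy as np

def update_queues(X, S, A, k_sched):
    # X, S, A: length-K integer arrays of queue lengths, channel states
    # (packets transmittable this slot), and new arrivals; k_sched: scheduled user.
    served = min(X[k_sched], S[k_sched])   # at most X_k(t) packets can be sent
    X_next = X + A                         # every queue receives its arrivals
    X_next[k_sched] -= served              # only the scheduled user transmits
    return X_next, served

# example: 3 users, user 1 scheduled; served = min(7, 5) = 5
X_next, served = update_queues(np.array([4, 7, 2]), np.array([3, 5, 6]),
                               np.array([1, 0, 2]), k_sched=1)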

2.2 MDP problem formulation

A discrete-time, finite-state Markov decision process [18] is specified by a tuple (S, A, P(·|·, ·), r(·, ·)). The state space S and the action space A are finite sets. At time slot t, if the system is in state s ∈ S and action a ∈ A is chosen, then the following happens:

1) a reward r(s, a) is earned immediately;
2) the process moves to state s′ ∈ S with transition probability P(s′|s, a), where P(s′|s, a) ≥ 0 and ∑_{s′} P(s′|s, a) = 1 for all s and a.

The goal is to determine a policy, a decision rule for action selection at each time, to optimize a given performance criterion. This optimization involves balancing immediate reward against future rewards: a high reward now may lead the process into a bad situation later.
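For background only, here is a generic value iteration sketch for a finite MDP in this tuple form (tabular P and r with hypothetical shapes); it is not specific to the scheduling problem, whose state space is far too large for this to be practical.

import numpy as np

def value_iteration(P, r, alpha, tol=1e-8):
    # P: (|A|, |S|, |S|) transition matrices, r: (|S|, |A|) rewards, alpha: discount
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = r(s, a) + alpha * sum_{s'} P(s'|s, a) V(s')
        Q = r + alpha * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and a greedy policy
        V = V_new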


We formulate our scheduling problem as a Markov decision process as follows:

• State: The state space S is the set of all vectors s ∈ R^{2K} of the form

s = (x1, x2, . . . , xK, s1, s2, . . . , sK),

where xk and sk are the queue length and the channel state of user k during a generic time slot. The state of the system at time slot t is

Xt = (X1(t), X2(t), . . . , XK(t), S1(t), S2(t), . . . , SK(t)).

• Action: The action at each time is to choose one of K users for transmission; thus an action here corresponds to a user. The action space A is thus A = {1, 2, . . . , K}.
• Transition probability function: Since a state consists of all queue lengths and channel states, the transition probability function is determined by the queue length evolution and the dynamics of the channels.

• Reward: We consider the following two problems: the throughput maximization problem and the delay minimization problem.
The throughput maximization problem involves maximizing the system throughput with the temporal fairness constraint (described below). The throughput in a time slot is defined as the actual number of packets transmitted between a user and the base station in the time slot. The corresponding reward function is given by

r(Xt, πt) = ∑_{k=1}^{K} 1{πt=k} min(Xk(t), Sk(t)).   (1)

Note that the throughput for user k is the minimum of the queue length Xk(t) and the available channel transmission packets Sk(t), because at most Xk(t) packets can be transmitted at time slot t.
The delay minimization problem is to minimize the sum of the user queue lengths with the temporal fairness constraint. The corresponding reward function is given by

r(Xt, πt) = −∑_{k=1}^{K} Xk(t).   (2)

(The negative sign accounts for minimization.) Each of these reward functions leads to an overall objective function to be maximized, defined roughly as the long-term cumulative reward; these are defined precisely in the next two sections.

• Policy: In this paper, the space of policies under consideration is restricted to stationary policies. A stationary policy is a mapping π : S → A from the state space S to the action space A; i.e., the stationary policy π selects action π(s) when the process is in state s. Let Π be the set of all stationary policies.
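To make the two reward functions above concrete, here is a minimal sketch (hypothetical helper names) of Eqs. 1 and 2 evaluated at a state with queue-length vector X and channel-state vector S when user a is scheduled.

import numpy as np

def throughput_reward(X, S, a):
    # Eq. 1: packets actually delivered in this slot when user a transmits
    return min(X[a], S[a])

def delay_reward(X, S, a):
    # Eq. 2: negative total queue length (maximizing it minimizes the summed delay)
    return -int(np.sum(X))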

A natural fairness criterion is to give each user a certain long-term fraction of time, because time is the basic resource shared among users. This is called temporal fairness [8], which is closely related to generalized processor sharing (GPS) in wireline networks [29]. An alternative fairness criterion, called utilitarian fairness, would ensure that all users get at least a certain fraction of the overall system performance.

Based on the MDP model described above, our goal can be formally stated as: find a policy π that maximizes the specified objective function Jπ while satisfying the corresponding fairness constraints.

In this paper, we consider infinite-horizon models. In the following sections, we discuss the above scheduling problems with two types of objective functions: the expected total discounted reward and the expected average reward criteria.

3 Temporal fairness scheduling

3.1 Expected total discounted reward criterion

Discounting arises naturally in applications in which we account for the time value of the rewards, such as in economic problems. The discount factor α measures the present value of one unit of currency received in the future. The meaning of α < 1 is that future rewards matter to us less than the same reward received at the present time [12, 13].

In this subsection, we study the infinite horizon expected total discounted reward MDP problem with the expected discounted temporal fairness constraints. We derive and prove an explicit dynamic programming equation for the constrained MDP, and give an optimal scheduling policy based on that equation.

3.1.1 Problem formulation

For any policy π, we define the expected discounted reward objective function as

Jπ(s) = lim_{T→∞} Eπ[ ∑_{t=0}^{T−1} α^t r(Xt, πt) | X0 = s ],   s ∈ S,


where Eπ represents expectation given that a policy π is employed, α is the discount factor with 0 < α < 1, and X0 is the initial state. Since r(Xt, πt) is the immediate reward received at time t, it follows that Jπ(s) represents the expected total discounted reward received when the policy π is employed and the initial state is s. A policy π∗ is said to be α-optimal if

Jπ∗(s) = max_{π∈Π} Jπ(s),   ∀s ∈ S.   (3)

Hence, a policy is α-optimal if its expected discounted reward is maximal for every initial state.

The expected discounted temporal fairness constraint is

lim_{T→∞} Eπ[ ∑_{t=0}^{T−1} α^t 1{πt=a} | X0 = s ] ≥ C(a),   ∀a ∈ A,   (4)

where C(a) denotes the minimum discounted time-fraction in which action (user) a should be chosen, with 0 ≤ C(a) ≤ 1 and ∑_{a∈A} C(a) ≤ 1.

Therefore, our goal can be stated as: find an α-optimal policy π∗ subject to the expected discounted temporal fairness constraint.

3.1.2 Optimal scheduling policy

Theoretically, the above constrained optimization problem can be solved directly by linear programming or Lagrangian methods [16–18]. Practically, those methods are computationally formidable even for problems with moderate state spaces. Moreover, they cannot be used if the state-transition distribution is not available explicitly. Dynamic programming can be used to solve such problems online iteratively. Here we derive and prove an explicit dynamic programming equation for the above constrained MDP. Then we characterize an optimal solution to Eq. 3 subject to Eq. 4.

Given a function u : A → R, for any policy π, we define

Vπ(s) = lim_{T→∞} Eπ[ ∑_{t=0}^{T−1} α^t [r(Xt, πt) + u(πt)] | X0 = s ],   s ∈ S,

and let

Vα(s) = sup_{π∈Π} Vπ(s),   s ∈ S.

Lemma 1 Given a function u : A → R, Vα satisfies the optimality equation

Vα(s) = max_{a∈A} { r(s, a) + u(a) + α ∑_{s′∈S} P(s′|s, a) Vα(s′) },   s ∈ S.   (5)

Proof See Appendix A. □

Now let B(S) be the Banach space of real-valued bounded functions on the state space S. Note that since rewards are bounded, Vπ ∈ B(S) for any policy π. For any stationary policy π, we define the mapping Tπ : B(S) → B(S) in the following manner:

(Tπ v)(s) = r(s, π(s)) + u(π(s)) + α ∑_{s′∈S} P(s′|s, π(s)) v(s′).

We can interpret Tπ v at s as representing the expected weighted reward if we use policy π but terminate it after one period and receive a final reward α v(s′) when the final state is s′.

The following lemma and theorem characterize the optimal policy for our temporal fair constrained MDP and the corresponding optimal discounted reward.

Lemma 2 Let u : A → R satisfy u(a) ≥ 0 for all a ∈ A. Let π∗ be a stationary policy that, when the process is in state s, selects an action maximizing the right-hand side of Eq. 5:

π∗(s) = argmax_{a∈A} { r(s, a) + u(a) + α ∑_{s′∈S} P(s′|s, a) Vα(s′) }.   (6)

Then

Vπ∗(s) = Vα(s),   ∀s ∈ S.

Proof See Appendix B. □

We now show that, under certain assumptions, the policy π∗ in Lemma 2 is an α-optimal policy for the discounted temporal fair constrained MDP.

Theorem 1 Suppose there exists a function u : A → R such that:

1) ∀a ∈ A, u(a) ≥ 0;
2) ∀a ∈ A, lim_{T→∞} Eπ∗[ ∑_{t=0}^{T−1} α^t 1{π∗t=a} | X0 = s ] ≥ C(a);
3) ∀a ∈ A, if lim_{T→∞} Eπ∗[ ∑_{t=0}^{T−1} α^t 1{π∗t=a} | X0 = s ] > C(a), then u(a) = 0.

Then π∗ defined in Eq. 6 is an α-optimal policy as defined by Eq. 3 subject to Eq. 4. The corresponding optimal discounted reward is

Jπ∗(s) = Vπ∗(s) − ∑_{a∈A} u(a) C(a),   ∀s ∈ S.   (7)

Proof See Appendix C. □

Lemma 2 and Theorem 1 provide an optimal scheduling policy for the discounted temporal fair constrained MDP. The α-optimal scheduling policy π∗ is given by Eq. 6, and the corresponding optimal discounted reward is given by Eq. 7.

We can think of the parameter u(a) in Theorem 1 as an "offset" or "threshold" for each user (action) to satisfy the fairness constraint, analogous to the result of [8]. Under this constraint, the scheduling policy schedules the "relatively best" user to transmit. It is straightforward to see that by setting u(a) = 0 for all a ∈ A, the optimal policy reduces to an optimal policy for a standard (unconstrained) MDP. However, that policy could be unfair to certain users. If u(a) > 0, then user a is an "unfortunate" user, i.e., the channel condition it experiences is relatively poor. Hence, it has to take advantage of other users (e.g., users with u(a) = 0) to satisfy its fairness requirement. But to maximize the overall system performance, we can only give the "unfortunate" users their minimum resource requirements, hence condition 3 for u(a).
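Schematically (with hypothetical array shapes, not the paper's implementation), the selection rule of Eq. 6 is a one-line argmax once the value function Vα and the offsets u(a) are available.

import numpy as np

def alpha_optimal_action(r, u, P, V, s, alpha):
    # Eq. 6: argmax_a { r(s,a) + u(a) + alpha * sum_{s'} P(s'|s,a) V(s') }
    # r: (|S|, |A|), u: (|A|,), P: (|A|, |S|, |S|), V: (|S|,)
    scores = r[s, :] + u + alpha * (P[:, s, :] @ V)
    return int(np.argmax(scores))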

3.2 Expected average reward criterion

In the previous subsection, we posed the temporal fairness scheduling problem as an expected discounted reward MDP with constraints. In this subsection, we consider optimization problems with average reward criteria. Such problems are common in economic, computer, and communication systems. Some examples are inventory control problems and computer communication networks, where decisions are made based on the throughput rate or the average time a job or packet remains in the system [12].

In this subsection, we study the problem as an infinite horizon average reward MDP with expected average temporal fairness constraints. Analogous to the results in the last subsection, we derive and prove an explicit dynamic programming equation for the constrained MDP, and give an optimal scheduling policy based on that equation.

3.2.1 Problem formulation

For any policy π, we define the expected average reward objective function as

Jπ(s) = lim_{T→∞} Eπ[ (1/T) ∑_{t=0}^{T−1} r(Xt, πt) | X0 = s ],   s ∈ S,   (8)

where Eπ represents conditional expectation given that the policy π is employed. Since r(Xt, πt) is the immediate reward received at time t, it follows that Jπ(s) represents the expected average reward received per stage when the policy π is employed and the initial state is s. If the limit in Eq. 8 does not exist, then we agree to use limsup instead of lim. We say that the policy π∗ is average-reward-optimal if

Jπ∗(s) = max_{π∈Π} Jπ(s),   ∀s ∈ S.   (9)

The expected average temporal fairness constraint is defined as

lim_{T→∞} Eπ[ (1/T) ∑_{t=0}^{T−1} 1{πt=a} | X0 = s ] ≥ C(a),   ∀a ∈ A,   (10)

where C(a) denotes the minimum relative frequency at which action a should be taken, with C(a) ≥ 0 and ∑_{a∈A} C(a) ≤ 1.

Therefore, our goal can be stated as: find an average-reward-optimal policy π∗ subject to the expected average temporal fairness constraint.

3.2.2 Optimal scheduling policy

Here we derive and prove an explicit dynamic programming equation for the above constrained MDP, and give an optimal scheduling policy based on that equation.

Theorem 2 Suppose the system is unichain.¹ Suppose we have a bounded function h : S → R, a function u : A → R, a constant g, and a stationary policy π∗ such that for s ∈ S,

1) ∀a ∈ A, u(a) ≥ 0;

¹ An MDP is unichain if the transition matrix corresponding to every deterministic stationary policy consists of one single recurrent class plus a possibly empty set of transient states [12].


2) ∀a ∈ A, lim_{T→∞} Eπ∗[ (1/T) ∑_{t=0}^{T−1} 1{π∗t=a} | X0 = s ] ≥ C(a);
3) ∀a ∈ A, if lim_{T→∞} Eπ∗[ (1/T) ∑_{t=0}^{T−1} 1{π∗t=a} | X0 = s ] > C(a), then u(a) = 0;
4)

g + h(s) = max_{a∈A} { r(s, a) + u(a) + ∑_{s′∈S} P(s′|s, a) h(s′) };   (11)

5) π∗ is a policy which, for each s, prescribes an action which maximizes the right-hand side of Eq. 11:

π∗(s) = argmax_{a∈A} { r(s, a) + u(a) + ∑_{s′∈S} P(s′|s, a) h(s′) }.   (12)

Then π∗ is an average-reward-optimal policy as defined by Eq. 9 subject to Eq. 10. The corresponding optimal average reward is

Jπ∗(s) = g − ∑_{a∈A} u(a) C(a),   ∀s ∈ S.   (13)

Proof See Appendix D. □

The average-reward-optimal policy π∗ is given by Eq. 12, and the corresponding optimal average reward is given by Eq. 13.

Analogous to u(a) in the last subsection, u(a) in Eq. 11 can be considered as an "offset" for each user to satisfy the average temporal fairness constraint. If we relax the fairness constraint by letting C(a) = 0 for all a ∈ A, the optimal policy would reduce to an optimal policy for a standard (unconstrained) average reward MDP, as expected.

4 Utilitarian fairness scheduling

In the previous section, we studied the scheduling problem with temporal fairness constraints. In wireline networks, when a certain amount of resource is assigned to a user, it is equivalent to granting the user a certain amount of throughput. However, the situation is different in wireless networks, where the performance value and the amount of resource are not directly related. Therefore, a potential problem in wireless networks is that the temporal fairness scheme has no way of explicitly ensuring that each user receives a certain guaranteed fair amount of utility (e.g., data rate). Hence, in this section we describe an alternative fair scheduling problem that ensures that all users get at least a certain fraction of the overall system performance, called utilitarian fairness scheduling.

We consider both the infinite horizon expected total discounted and average reward criteria here. In either case, we characterize the corresponding optimal MDP-based fair scheduling scheme.

4.1 Expected total discounted reward criterion

4.1.1 Problem formulation

As defined in Section 3.1, for any policy π, the expected discounted reward objective function is

Jπ(s) = lim_{T→∞} Eπ[ ∑_{t=0}^{T−1} α^t r(Xt, πt) | X0 = s ],   s ∈ S.

The expected discounted utilitarian fairness constraint is

lim_{T→∞} Eπ[ ∑_{t=0}^{T−1} α^t r(Xt, πt) 1{πt=a} | X0 = s ] ≥ D(a) Jπ(s),   ∀a ∈ A,   (14)

where D(a) denotes the minimum discounted fraction of overall system performance in which action (user) a should be chosen, with 0 ≤ D(a) ≤ 1 and ∑_{a∈A} D(a) ≤ 1.

Therefore, our goal can be stated as: find an α-optimal policy π∗ subject to the expected discounted utilitarian fairness constraint.

4.1.2 Optimal scheduling policy

Given a function ω : A → R, for any policy π, we define

Uπ(s) = lim_{T→∞} Eπ[ ∑_{t=0}^{T−1} α^t (κ + ω(πt)) r(Xt, πt) | X0 = s ],   s ∈ S,

where κ = 1 − ∑_{a∈A} D(a) ω(a), and let

Uα(s) = sup_{π∈Π} Uπ(s),   s ∈ S.

Lemma 3 Given a function ω : A → R, Uα satisfies the optimality equation

Uα(s) = max_{a∈A} { (κ + ω(a)) r(s, a) + α ∑_{s′∈S} P(s′|s, a) Uα(s′) },   s ∈ S.   (15)


Proof The proof is similar to that of Lemma 1 in Section 3.1. The details are omitted for the sake of space. □

The following lemma and theorem characterize the optimal policy for our utilitarian fair constrained MDP and the corresponding optimal discounted reward.

Lemma 4 Let ω : A → R satisfy ω(a) ≥ 0 for all a ∈ A. Let π∗ be a stationary policy that, when the process is in state s, selects an action maximizing the right-hand side of Eq. 15:

π∗(s) = argmax_{a∈A} { (κ + ω(a)) r(s, a) + α ∑_{s′∈S} P(s′|s, a) Uα(s′) }.   (16)

Then

Uπ∗(s) = Uα(s),   ∀s ∈ S.

Proof The proof is similar to that of Lemma 2 in Section 3.1. □

We now show that, under certain assumptions, the policy π∗ in Lemma 4 is an α-optimal policy for the discounted utilitarian fair constrained MDP.

Theorem 3 Suppose there exists a function ω : A → R such that:

1) ∀a ∈ A, ω(a) ≥ 0;
2) ∀a ∈ A, lim_{T→∞} Eπ∗[ ∑_{t=0}^{T−1} α^t r(Xt, π∗t) 1{π∗t=a} | X0 = s ] ≥ D(a) Jπ∗(s);
3) ∀a ∈ A, if lim_{T→∞} Eπ∗[ ∑_{t=0}^{T−1} α^t r(Xt, π∗t) 1{π∗t=a} | X0 = s ] > D(a) Jπ∗(s), then ω(a) = 0.

Then π∗ defined in Eq. 16 is an α-optimal policy as defined by Eq. 3 subject to Eq. 14. The corresponding optimal discounted reward is

Jπ∗(s) = Uπ∗(s),   ∀s ∈ S.   (17)

Proof See Appendix E. □

4.2 Expected average reward criterion

4.2.1 Problem formulation

As defined in Section 3.2, for any policy π, the expected average reward objective function is

Jπ(s) = lim_{T→∞} Eπ[ (1/T) ∑_{t=0}^{T−1} r(Xt, πt) | X0 = s ],   s ∈ S.

The expected average utilitarian fairness constraint is

lim_{T→∞} Eπ[ (1/T) ∑_{t=0}^{T−1} r(Xt, πt) 1{πt=a} | X0 = s ] ≥ D(a) Jπ(s),   ∀a ∈ A,   (18)

where D(a) denotes the minimum fraction of overall system performance in which action (user) a should be chosen, with 0 ≤ D(a) ≤ 1 and ∑_{a∈A} D(a) ≤ 1.

Therefore, our goal can be stated as: find an average-reward-optimal policy π∗ subject to the expected average utilitarian fairness constraint.

4.2.2 Optimal scheduling policy

The following theorem characterizes the optimal policy for our average utilitarian fair constrained MDP.

Theorem 4 Suppose the system is unichain. Suppose we have a bounded function h : S → R, a function ω : A → R, a constant g, and a stationary policy π∗ such that for s ∈ S,

1) ∀a ∈ A, ω(a) ≥ 0;

2) ∀a ∈ A, lim_{T→∞} Eπ∗[ (1/T) ∑_{t=0}^{T−1} r(Xt, π∗t) 1{π∗t=a} | X0 = s ] ≥ D(a) Jπ∗(s);
3) ∀a ∈ A, if lim_{T→∞} Eπ∗[ (1/T) ∑_{t=0}^{T−1} r(Xt, π∗t) 1{π∗t=a} | X0 = s ] > D(a) Jπ∗(s), then ω(a) = 0;
4)

g + h(s) = max_{a∈A} { (κ + ω(a)) r(s, a) + ∑_{s′∈S} P(s′|s, a) h(s′) },   (19)

where κ = 1 − ∑_{a∈A} D(a) ω(a);

5) π∗ is a policy which, for each s, prescribes an action which maximizes the right-hand side of Eq. 19:

π∗(s) = argmax_{a∈A} { (κ + ω(a)) r(s, a) + ∑_{s′∈S} P(s′|s, a) h(s′) }.   (20)

Then π∗ is an average-reward-optimal policy as defined by Eq. 9 subject to Eq. 18. The corresponding optimal average reward is

Jπ∗(s) = g, ∀s ∈ S. (21)

Proof See Appendix F. □

5 Temporal fair rollout algorithm

5.1 Rollout algorithm

In the previous sections, we derived optimal policies for the expected total discounted reward and the expected average reward criteria MDP problems with the corresponding temporal fairness and utilitarian fairness constraints. Note that the optimal policies may be obtained in principle by maximizing the right-hand side of Eqs. 5, 11, 15, or 19, but this requires the calculation of the optimal value function on the right-hand side, which for many problems is computationally overwhelming.

The rollout algorithm yields a one-step lookahead policy, with the optimal value function approximated by the value function of a known base policy π. The base policy π is typically heuristic and suboptimal; its value function is calculated either analytically or by simulation. The policy thus obtained is called the rollout policy based on π.

The salient feature of the rollout algorithm is its reward-improvement property: the rollout policy performs no worse than the base policy. In many cases, the rollout policy is substantially better than the base policy. The rollout algorithm can also be viewed as the policy improvement step of the policy iteration method, which is a primary method for solving infinite horizon MDP problems [13].

5.2 Temporal fair rollout algorithm

We will extend the idea of rollout to our temporal fair constrained MDPs to propose an efficient approximation method, temporal fair rollout. (A similar treatment applies to the utilitarian fairness case, but is omitted here for the sake of brevity.) We use the expected total discounted reward MDP as an example here (a similar approach applies to the expected average reward case).

Suppose that there exists a function u : A → R satisfying the conditions in Theorem 1. Then we have

Vπ∗(s) = max_{a∈A} { r(s, a) + u(a) + α E[Vπ∗(s′) | s, a] },   ∀s ∈ S,

where E[·|s, a] is the conditional expectation given the state s and action a. Moreover, an optimal policy is given by

π∗(s) = argmax_{a∈A} { r(s, a) + u(a) + α E[Vπ∗(s′) | s, a] },   s ∈ S.

Applying Theorem 1, we can rewrite the optimal policy as:

π∗(s) = argmax_{a∈A} { r(s, a) + u(a) + α E[Jπ∗(s′) | s, a] + α ∑_{a∈A} u(a) C(a) },   s ∈ S.

Removing the last constant term, we get

π∗(s) = argmax_{a∈A} { r(s, a) + u(a) + α E[Jπ∗(s′) | s, a] },   s ∈ S.

Instead of calculating the optimal value function directly, we approximate it with the value function of a base policy that also satisfies the discounted temporal fairness requirements. Let πb be a base policy and Jπb the value function of the policy. Then the temporal fair rollout policy is

π^tfr(s) = argmax_{a∈A} { r(s, a) + u(a) + α E[Jπb(s′) | s, a] },   s ∈ S.   (22)

The expected value of the base policy is obtained by Monte Carlo simulation. The selection of the base policy is problem specific. In our experiments, we use the temporal fair opportunistic scheduling policy of [8] as the base policy. We will show by simulation that the temporal fair rollout policy in fact performs better than the base policy.
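A minimal Monte Carlo sketch of the selection rule in Eq. 22, assuming a simulator step(s, a) that samples a next state and reward and a base policy pi_b; the function names, horizon, and sample counts are illustrative assumptions rather than the authors' implementation.

import numpy as np

def base_policy_value(step, pi_b, s, alpha, horizon=50, n_runs=20):
    # Monte Carlo estimate of the discounted value J_pi_b(s) of the base policy
    total = 0.0
    for _ in range(n_runs):
        state, discount, ret = s, 1.0, 0.0
        for _ in range(horizon):
            state, reward = step(state, pi_b(state))
            ret += discount * reward
            discount *= alpha
        total += ret
    return total / n_runs

def temporal_fair_rollout(step, pi_b, reward_fn, u, actions, s, alpha, n_next=20):
    # Eq. 22: argmax_a { r(s,a) + u(a) + alpha * E[ J_pi_b(s') | s, a ] }
    def score(a):
        est = np.mean([base_policy_value(step, pi_b, step(s, a)[0], alpha)
                       for _ in range(n_next)])
        return reward_fn(s, a) + u[a] + alpha * est
    return max(actions, key=score)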

6 Stochastic approximation for parameter estimation

The temporal fair rollout policy (Eq. 22) described in the previous section involves some control parameters u(a) that need to be estimated. Figure 2 shows a block diagram of a general iterative procedure to estimate these control parameters online. We use a practical stochastic approximation technique, similar to the one in [8], to estimate such parameters.


Fig. 2 Block diagram of the scheduling policy with online parameter estimation (measure channel and queue length → estimate expected value function → apply scheduling policy → update control parameter)

We first briefly explain the idea of the stochastic approximation algorithm used here. Suppose we wish to find a zero root of an unknown continuous function f(·). If we can evaluate the value of f(x) at any x, then we can use the iterative algorithm

x_{t+1} = x_t − β_t f(x_t),

which will converge to a point x∗ such that f(x∗) = 0 as long as the step size β_t is appropriately chosen. Suppose that we cannot have the exact value of f(x_t) at x_t; instead, we only have a noisy observation g_t of f(x_t), i.e., g_t = f(x_t) + e_t, where e_t is the observation error (noise) and E[e_t] = 0. Then the iterative stochastic approximation algorithm

x_{t+1} = x_t − β_t g_t

converges to x∗ with probability 1 under appropriate conditions on β_t and f. We refer readers to [30, 31] for a systematic and rigorous study of stochastic approximation algorithms.
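As a toy illustration only (not from the paper), the iteration below finds the root of f(x) = x − 2 from noisy observations, with the diminishing step size β_t = 1/t.

import random

def noisy_f(x):
    # noisy observation g_t = f(x_t) + e_t with f(x) = x - 2 and zero-mean noise
    return (x - 2.0) + random.gauss(0.0, 0.5)

x = 0.0
for t in range(1, 100001):
    x -= (1.0 / t) * noisy_f(x)   # x_{t+1} = x_t - beta_t * g_t
# x ends up close to the root x* = 2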

We now use a stochastic approximation algorithm to estimate the vector u = (u(a), a ∈ A). Note that we can write u as a root of the equation f(u) = 0, where the kth component of f(u) is given by

f_k(u) = lim_{T→∞} E_{π^tfr}[ ∑_{t=0}^{T} α^t 1{π^tfr_t = k} | X0 = s ] − C(k),   ∀k ∈ A.

Next, we use a stochastic approximation algorithm to generate a sequence of iterates u_1, u_2, . . . that represent estimates of u. Each u_t defines a policy given by

π^{tfr,t}(s) = argmax_{a∈A} { r(s, a) + u_t(a) + α E[Jπb(s′) | s, a] },   ∀s ∈ S.

To construct the stochastic approximation algorithm, we need an estimate g_t of f(u_t). Although we cannot obtain f(u_t) directly, we have a noisy observation of its components:

g_t(k) = α^t 1{π^{tfr,t}(s) = k} − C(k),   ∀k ∈ A.

Hence, we get a stochastic approximation algorithm of the form

u_{t+1}(k) = u_t(k) − β_t ( α^t 1{π^{tfr,t}(s) = k} − C(k) ),

where the step size β_t is appropriately chosen; for example, β_t = 1/t. The initial estimate u_1 can be set to 0 or to some value based on historical information. The computational burden above is O(K) per time slot, where K is the number of users, which suggests that the algorithm is easy to implement online. Simulations show that with the above stochastic approximation algorithm, u_t converges to u relatively quickly.
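A minimal sketch of this online update, assuming the scheduling loop calls a temporal fair rollout selector each slot; the names (C as the vector of fairness targets, u as the offset vector) are illustrative.

import numpy as np

def update_offsets(u, chosen, C, t, alpha):
    # u_{t+1}(k) = u_t(k) - beta_t * ( alpha^t * 1{chosen = k} - C(k) ), beta_t = 1/t
    indicator = np.zeros_like(u)
    indicator[chosen] = 1.0
    return u - (1.0 / t) * (alpha ** t * indicator - C)

# inside the per-slot loop (schematic):
#   a = scheduler_with_offsets(state, u)   # e.g., the temporal fair rollout rule
#   u = update_offsets(u, a, C, t, alpha)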

7 Numerical results

In this section, we present numerical results to illustrate the performance of the proposed temporal fair rollout algorithm. We first describe our simulation setup of a cellular system, as well as the channel model. We then show the simulation results for each scheduling policy using the model.

7.1 Simulation setup

We consider the uplink of a single-cell system with 10 mobile users and 1 base station in our simulation. The preset temporal fairness requirements for users 1–10 are 1/11, 1/11, 1/13, 1/13, 1/13, 1/13, 1/11, 1/11, 1/13, and 1/13. Note that the temporal fairness requirements are nonuniform and their sum is less than 1. This gives the system the freedom to assign the remaining fraction of the resource to some "better" users to further improve the system performance.

We assume that the packet arrivals at each queue are independent Poisson processes. For simplicity, we assume that we know the maximum arrival rate for each user. We denote the arrival rate and the maximum arrival rate for user k by λk and λk^max, respectively. We define the normalized arrival rate for user k as λk/λk^max.

We divide the 10 users into five groups based on their heterogeneous arrival rates and mean channel conditions. Specifically, users 1 and 2 have low arrival rates and low mean channel conditions. Users 3 and 4 have high arrival rates and high mean channel conditions. Users 5 and 6 have high arrival rates and moderate mean channel conditions. Users 7 and 8 have low arrival rates and high mean channel conditions. Finally, users 9 and 10 have moderate arrival rates and moderate channel conditions. With this range of heterogeneous user environments, we can study how the arrival rates and channel conditions affect the users' performance under different scheduling schemes.

For the purpose of comparison, we evaluate six related scheduling policies, including temporal fair rollout:

1) Round-robin: A well-known non-opportunistic scheduling policy that schedules users in a predetermined order. At time slot t, user (t mod K + 1) is chosen.

2) Greedy: The natural greedy policy always selects the user with maximum possible throughput to transmit at any time. At time slot t, the user is chosen according to the following index policy:

argmax_{a∈A} { min(Xa(t), Sa(t)) }.

3) Rollout: At time slot t, the user chosen is

argmax_{a∈A} { r(Xt, a) + α E[Jπb(Xt+1) | Xt, a] },

where the base policy πb is the above greedy policy.

4) Opportunistic scheduling-1: In [8], Liu et al. proposed an opportunistic scheduling scheme with temporal fairness constraints for memoryless channels. The policy is

argmax_{a∈A} { Sa(t) + v1(a) },

where v1(a) is estimated online via stochastic approximation.

5) Opportunistic scheduling-2: This policy is a variation of the above opportunistic scheduling-1 policy with the consideration of the queue lengths. The policy is

argmax_{a∈A} { min(Xa(t), Sa(t)) + v2(a) },

where v2(a) is also estimated online via stochastic approximation.

6) Temporal fair rollout: We select the above opportunistic scheduling-2 policy as our base policy πb. It not only satisfies the discounted temporal fairness constraints, but also takes the queue lengths into account. At time slot t, the user chosen is

argmax_{a∈A} { r(Xt, a) + u(a) + α E[Jπb(Xt+1) | Xt, a] }.

(A compact sketch of the simpler index rules is given after this list.)
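For reference, a minimal sketch of the simpler index rules above (round-robin, greedy, and opportunistic scheduling-2); the handling of the stochastic-approximation offsets v2 is an assumption, not the authors' code.

import numpy as np

def round_robin(t, K):
    return t % K                                    # user (t mod K + 1), 0-indexed

def greedy(X, S):
    return int(np.argmax(np.minimum(X, S)))         # largest possible throughput

def opportunistic_2(X, S, v2):
    return int(np.argmax(np.minimum(X, S) + v2))    # greedy index plus offsets v2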

The primary motivation of this paper is to improve wireless resource efficiency by exploiting time-varying channel conditions while also satisfying certain QoS constraints among users. However, it turns out that policies (1)–(3) above violate the temporal fairness constraints (see Figs. 4 and 6), which means they are infeasible for our problem. The reason we include them in our comparison is that either they are very simple and widely used, or they can serve as a performance benchmark/bound.

In our evaluation, our focus is not on the effect of the discount factor (which was introduced primarily for analytical tractability). Therefore, in our simulation, we treat α as a number very close to 1, and replace all normalized discounted sums by finite-horizon averages. For example, in the throughput maximization problem, we calculate the throughput as a time average (without discounting). Similarly, in the delay minimization problem, the delay is calculated as the time average of the queue length. The constraints are also calculated as time averages.

7.2 Channel model

The digital cellular radio transmission environment usually consists of a large number of scatterers that result in multiple propagation paths. Associated with each path is a propagation delay and an attenuation factor depending on the obstacles in the path that reflect electromagnetic waves. Multipath fading results in a correlated random process, i.e., a random process with memory. This kind of channel is known as the multipath Rayleigh fading channel.

Finite-State Markov Channel (FSMC) models have been found to be accurate in modeling such channels with memory [10]. The base station uses the pilot channels to estimate the SNRs at the receiver. The SNR is used as a measure of channel condition here. The study of the FSMC emerges from a two-state Markov channel known as the Gilbert-Elliott channel [32, 33]. However, in some cases, modeling a radio communication channel as a two-state Gilbert-Elliott channel is not adequate when the channel quality varies dramatically. We need more than two states to capture the channel quality and take advantage of rate adaptation techniques used in cellular networks.

Fig. 3 State transition for Rayleigh fading channel model (each state n transitions only to the nearby states n−2, n−1, n, n+1, n+2 with probabilities P_{n,n−2}, . . . , P_{n,n+2})

In our simulation, we use an 8-state Markov channel model described in [34] to capture the channel conditions. Figure 3 shows the state transition for the Rayleigh fading channel model. We partition the range of possible SNR values into eight equal intervals, where each interval corresponds to a state in the Markov chain. We denote the set of states by N = {0, 1, 2, 3, 4, 5, 6, 7}, where state 0 corresponds to an SNR range of 0 dB to 5 dB, state 1 to 5 dB to 10 dB, and so on. The time interval between channel measurements for this model is 1 ms, also called the time granularity of the model. For convenience, we assume that the length of a time slot is also 1 ms, so that the granularity of the channel model and the scheduling intervals are consistent. The channel state transition probabilities are given in Table 1.
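A minimal sketch, assuming the transition probabilities of Table 1 (below) are stored as a row-stochastic 8×8 matrix P, of drawing the per-slot channel state sequence for one user; the names are illustrative.

import numpy as np

def simulate_channel(P, s0, n_slots, rng=None):
    # P[i, j] = Prob(next state j | current state i); one sample path of length n_slots
    rng = rng or np.random.default_rng()
    states = np.empty(n_slots, dtype=int)
    s = s0
    for t in range(n_slots):
        states[t] = s
        s = rng.choice(P.shape[0], p=P[s])   # draw the next SNR state
    return states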

Table 1 Channel state transition probabilities [34]

n \ n′     n−2        n−1        n          n+1        n+2
0          –          –          0.677567   0.319746   0.002687
1          –          0.109712   0.676353   0.212175   0.001760
2          0.000739   0.137678   0.671957   0.188237   0.001389
3          0.000885   0.149863   0.670962   0.176897   0.001393
4          0.001099   0.157779   0.671564   0.168380   0
5          0.001102   0.164497   0.670785   0.162652   0.000964
6          0.001252   0.169662   0.670743   0.158343   –
7          0.000248   0.041396   0.958356   –          –

7.3 Simulation results

Figures 4 and 5 show the performance of the six policies (described in Section 7.1) for the throughput maximization problem, where we use Eq. 1 as the reward function. Figure 4 indicates the long-term time fraction allocations of all 10 users under the various scheduling policies for the problem. We plot the 95% confidence intervals for each user. For each user, the rightmost bar shows the minimum time fraction requirement. The remaining six bars represent the time fraction allocated to this user under the six policies evaluated here. We see that only the opportunistic scheduling-1, opportunistic scheduling-2, and temporal fair rollout policies satisfy the minimum temporal fairness requirements for all users. Therefore, these three policies are feasible solutions for our constrained problem.

Figure 5 evaluates the scheduling policies by examining the impact of the arrival traffic on the average system throughput (packets/time slot). We take the normalized arrival rate for each user to be the same for every simulation, varying from 0.1 to 1.0. We also plot the 95% confidence intervals for each step. From Fig. 5, we see that our temporal fair rollout policy outperforms (higher means better) all the other policies except the rollout policy with the greedy base policy. This is not surprising since the latter policy achieves the best overall performance at the cost of unfairness among the users (and is thus not a feasible solution to the problem).

Figures 6 and 7 show the performance of the six policies for the delay minimization problem, where we use Eq. 2 as the reward function. Similar to Fig. 4, Fig. 6 indicates the time-fraction allocations of the 10 users under the various scheduling policies for the problem. Again, we can see that only the opportunistic scheduling-1, opportunistic scheduling-2, and our temporal fair rollout policies are feasible solutions.

Figure 7 evaluates the scheduling policies by examining the impact of the arrival traffic on the average system queue length (packets/time slot). It is evident that the average system queue length increases significantly with increasing arrival traffic. Similar to Fig. 5, we also see that our temporal fair rollout policy outperforms (lower means better) all the other policies except the rollout policy with the greedy base policy (which, again, is not a feasible solution).

Fig. 4 Time fraction allocation for throughput maximization problem (temporal fairness vs. user index)

Fig. 5 Average system throughput vs. normalized arrival rate

Fig. 6 Time fraction allocation for delay minimization problem (temporal fairness vs. user index)

Fig. 7 Average system queue length vs. normalized arrival rate

In summary, the simulation results show that our temporal fair rollout policy performs significantly better than the other policies, including the two opportunistic scheduling policies that also satisfy the temporal fairness requirements, especially for the delay minimization case.

8 Conclusions

In this paper, we formulated the opportunistic fair scheduling problem as an MDP with explicit fairness constraints. We derived the dynamic programming optimality equations for MDPs with temporal fairness and utilitarian fairness constraints with two criteria: infinite horizon expected discounted reward and expected average reward. Based on the optimality equations, we obtained the corresponding optimal scheduling policies for the two criteria. We applied the methods to two common scheduling objectives: the throughput maximization and delay minimization problems. Our approach can naturally be extended to fit different objective functions and many other fairness measures. To compute the optimal policies efficiently, we developed a practically viable approximation algorithm called temporal fair rollout. Simulations showed that the algorithm achieves significant performance gains over other existing opportunistic and non-opportunistic scheduling schemes.

Appendix

A Proof of Lemma 1

Proof Let π be an arbitrary policy, and suppose that π chooses action a at time slot 0 with probability Pa, a ∈ A. Then,

Vπ(s) = ∑_{a∈A} Pa [ r(s, a) + u(a) + ∑_{s′∈S} P(s′|s, a) Wπ(s′) ],

where Wπ(s′) represents the expected discounted weighted reward (with the weight u(πt)) incurred from time slot 1 onwards, given that π is employed and the state at time 1 is s′. However, it follows that

Wπ(s′) ≤ α Vα(s′),

and hence that

Vπ(s) ≤ ∑_{a∈A} Pa { r(s, a) + u(a) + α ∑_{s′∈S} P(s′|s, a) Vα(s′) }
≤ ∑_{a∈A} Pa max_{a∈A} { r(s, a) + u(a) + α ∑_{s′∈S} P(s′|s, a) Vα(s′) }
= max_{a∈A} { r(s, a) + u(a) + α ∑_{s′∈S} P(s′|s, a) Vα(s′) }.   (23)

Since π is arbitrary, Eq. 23 implies that

Vα(s) ≤ max_{a∈A} { r(s, a) + u(a) + α ∑_{s′∈S} P(s′|s, a) Vα(s′) }.   (24)

To go the other way, let a0 be such that

r(s, a0) + u(a0) + α ∑_{s′∈S} P(s′|s, a0) Vα(s′) = max_{a∈A} { r(s, a) + u(a) + α ∑_{s′∈S} P(s′|s, a) Vα(s′) },   (25)

and let π be the policy that chooses a0 at time 0; and, if the next state is s′, views the process as originating in state s′ and follows a policy πs′ such that Vπs′(s′) ≥ Vα(s′) − ε, s′ ∈ S. Hence,

Vπ(s) = r(s, a0) + u(a0) + α ∑_{s′∈S} P(s′|s, a0) Vπs′(s′)
≥ r(s, a0) + u(a0) + α ∑_{s′∈S} P(s′|s, a0) Vα(s′) − αε,

which, since Vα(s) ≥ Vπ(s), implies that

Vα(s) ≥ r(s, a0) + u(a0) + α ∑_{s′∈S} P(s′|s, a0) Vα(s′) − αε.

Hence, from Eq. 25, we have

Vα(s) ≥ max_{a∈A} { r(s, a) + u(a) + α ∑_{s′∈S} P(s′|s, a) Vα(s′) } − αε.   (26)

Since πs′ could be arbitrary, ε is arbitrary; from Eqs. 24 and 26, we have

Vα(s) = max_{a∈A} { r(s, a) + u(a) + α ∑_{s′∈S} P(s′|s, a) Vα(s′) },   s ∈ S.  □

B Proof of Lemma 2

Proof By applying the mapping Tπ∗ to Vα, we obtain

(Tπ∗ Vα)(s) = r(s, π∗(s)) + u(π∗(s)) + α ∑_{s′∈S} P(s′|s, π∗(s)) Vα(s′)
= max_{a∈A} { r(s, a) + u(a) + α ∑_{s′∈S} P(s′|s, a) Vα(s′) }
= Vα(s),

where the last equation follows from Lemma 1. Hence, by induction we have

T^n_{π∗} Vα = Vα,   ∀n.

Letting n → ∞ and using the Banach fixed-point theorem yields the result,

Vπ∗(s) = Vα(s),   ∀s ∈ S.  □

C Proof of Theorem 1

Proof Let π be a policy satisfying the expected discounted temporal fairness constraint, and suppose there exists u : A → R satisfying conditions 1–3. Then,

Jπ(s) = lim_{T→∞} Eπ[ ∑_{t=0}^{T−1} α^t r(Xt, πt) | X0 = s ]

≤ lim_{T→∞} Eπ[ ∑_{t=0}^{T−1} α^t r(Xt, πt) | X0 = s ] + ∑_{a∈A} u(a) ( lim_{T→∞} Eπ[ ∑_{t=0}^{T−1} α^t 1{πt=a} | X0 = s ] − C(a) )

= lim_{T→∞} Eπ[ ∑_{t=0}^{T−1} α^t r(Xt, πt) | X0 = s ] + lim_{T→∞} Eπ[ ∑_{t=0}^{T−1} α^t u(πt) | X0 = s ] − ∑_{a∈A} u(a) C(a)

= lim_{T→∞} Eπ[ ∑_{t=0}^{T−1} α^t [r(Xt, πt) + u(πt)] | X0 = s ] − ∑_{a∈A} u(a) C(a)

= Vπ(s) − ∑_{a∈A} u(a) C(a).

Since Vπ(s) ≤ Vα(s) = Vπ∗(s) from Lemma 2, we have

Jπ(s) ≤ Vπ∗(s) − ∑_{a∈A} u(a) C(a)   (27)

= lim_{T→∞} Eπ∗[ ∑_{t=0}^{T−1} α^t [r(Xt, π∗t) + u(π∗t)] | X0 = s ] − ∑_{a∈A} u(a) C(a)

= lim_{T→∞} Eπ∗[ ∑_{t=0}^{T−1} α^t r(Xt, π∗t) | X0 = s ] + ∑_{a∈A} u(a) ( lim_{T→∞} Eπ∗[ ∑_{t=0}^{T−1} α^t 1{π∗t=a} | X0 = s ] − C(a) )

= lim_{T→∞} Eπ∗[ ∑_{t=0}^{T−1} α^t r(Xt, π∗t) | X0 = s ]

= Jπ∗(s),   (28)

where the second term of Eq. 28 equals zero because of condition 3 on u. From Eq. 27, we get the corresponding optimal discounted reward as

Jπ∗(s) = Vπ∗(s) − ∑_{a∈A} u(a) C(a),   ∀s ∈ S.  □

D Proof of Theorem 2

Proof Let π be a policy satisfying the expected average temporal fairness constraint; and let $H_t = (X_0, \pi_0, \ldots, X_{t-1}, \pi_{t-1}, X_t, \pi_t)$ denote the history of the process up to time $t$. First, we have

\[
E_\pi\Biggl\{ \sum_{t=1}^{T} \bigl[ h(X_t) - E_\pi\bigl( h(X_t) \mid H_{t-1}\bigr) \bigr] \Biggr\} = 0,
\]

since

\begin{align*}
E_\pi\Biggl\{ \sum_{t=1}^{T} \bigl[ h(X_t) - E_\pi\bigl( h(X_t) \mid H_{t-1}\bigr) \bigr] \Biggr\}
&= \sum_{t=1}^{T} E_\pi\bigl[ h(X_t) - E_\pi\bigl( h(X_t) \mid H_{t-1}\bigr) \bigr] \\
&= \sum_{t=1}^{T} \bigl\{ E_\pi[ h(X_t)] - E_\pi\bigl[ E_\pi\bigl( h(X_t) \mid H_{t-1}\bigr) \bigr] \bigr\} \\
&= \sum_{t=1}^{T} \bigl\{ E_\pi[ h(X_t)] - E_\pi[ h(X_t)] \bigr\} = 0.
\end{align*}

Also,

\begin{align*}
E_\pi[ h(X_t) \mid H_{t-1}]
&= \sum_{s' \in S} h(s') P(s' \mid X_{t-1}, \pi_{t-1}) \\
&= r(X_{t-1}, \pi_{t-1}) + u(\pi_{t-1}) + \sum_{s' \in S} h(s') P(s' \mid X_{t-1}, \pi_{t-1}) - r(X_{t-1}, \pi_{t-1}) - u(\pi_{t-1}) \\
&\le \max_{a \in A}\Bigl\{ r(X_{t-1}, a) + u(a) + \sum_{s' \in S} P(s' \mid X_{t-1}, a)\, h(s') \Bigr\} - r(X_{t-1}, \pi_{t-1}) - u(\pi_{t-1}) \\
&= g + h(X_{t-1}) - r(X_{t-1}, \pi_{t-1}) - u(\pi_{t-1}),
\end{align*}

with equality for $\pi^*$, since $\pi^*$ is defined to take the maximizing action. Hence,

\begin{align*}
0 &\ge E_\pi\Biggl\{ \sum_{t=1}^{T} \bigl[ h(X_t) - g - h(X_{t-1}) + r(X_{t-1}, \pi_{t-1}) + u(\pi_{t-1}) \bigr] \Biggr\} \\
\Leftrightarrow\;
g &\ge E_\pi \frac{h(X_T)}{T} - E_\pi \frac{h(X_0)}{T}
+ E_\pi \frac{1}{T}\sum_{t=1}^{T} r(X_{t-1}, \pi_{t-1})
+ E_\pi \frac{1}{T}\sum_{t=1}^{T} u(\pi_{t-1}).
\end{align*}


Letting $T \to \infty$ and using the fact that $h$ is bounded, we have that

\begin{align*}
g &\ge J_\pi(X_0) + \lim_{T\to\infty} E_\pi \frac{1}{T}\sum_{t=0}^{T-1} u(\pi_t) \\
\Leftrightarrow\;
g - \sum_{a \in A} u(a) C(a)
&\ge J_\pi(X_0) + \lim_{T\to\infty} E_\pi \frac{1}{T}\sum_{t=0}^{T-1} u(\pi_t) - \sum_{a \in A} u(a) C(a) \\
&= J_\pi(s) + \lim_{T\to\infty} E_\pi\Biggl[ \frac{1}{T}\sum_{t=0}^{T-1} \sum_{a \in A} u(a)\, 1_{\{\pi_t = a\}} \,\Bigg|\, X_0 = s \Biggr] - \sum_{a \in A} u(a) C(a) \\
&= J_\pi(s) + \sum_{a \in A} u(a) \Biggl( \lim_{T\to\infty} E_\pi\Biggl[ \frac{1}{T}\sum_{t=0}^{T-1} 1_{\{\pi_t = a\}} \,\Bigg|\, X_0 = s \Biggr] - C(a) \Biggr). \tag{29}
\end{align*}

Since we know that $u \ge 0$ and that the policy π satisfies the temporal fairness constraints, the second part of Eq. 29 is greater than or equal to zero. We get

\[
g - \sum_{a \in A} u(a) C(a) \ge J_\pi(s).
\]

With policy $\pi^*$, we have

\begin{align*}
g - \sum_{a \in A} u(a) C(a)
&= J_{\pi^*}(s) + \sum_{a \in A} u(a) \Biggl( \lim_{T\to\infty} E_{\pi^*}\Biggl[ \frac{1}{T}\sum_{t=0}^{T-1} 1_{\{\pi^*_t = a\}} \Biggr] - C(a) \Biggr) \\
&= J_{\pi^*}(s), \tag{30}
\end{align*}

where the second part of Eq. 30 equals zero because of condition 3 on $u(a)$. Hence, the desired result is proven. □
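The average-reward optimality equation that drives this proof, $g + h(s) = \max_{a}\{ r(s,a) + u(a) + \sum_{s'} P(s' \mid s,a) h(s') \}$, can be solved for small examples by relative value iteration. The Python sketch below uses hypothetical toy data and assumes the resulting chain is unichain and aperiodic, so that the iteration converges; it is an illustration, not the paper's algorithm.

import numpy as np

# Toy example (hypothetical data): 3 states, 2 actions.
r = np.array([[1.0, 0.5], [0.2, 0.8], [0.0, 0.3]])   # r[s, a]
u = np.array([0.0, 0.1])                             # u[a]
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
              [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1]],
              [[0.2, 0.2, 0.6], [0.1, 0.1, 0.8]]])   # P[s, a, s']

h = np.zeros(3)   # relative value function (bias), normalized so that h(0) = 0
g = 0.0           # gain estimate
for _ in range(10000):
    # One-step lookahead with the u-modified reward.
    Q = r + u[None, :] + np.einsum('sap,p->sa', P, h)
    Th = Q.max(axis=1)
    g_new, h_new = Th[0], Th - Th[0]   # state 0 serves as the reference state
    if np.max(np.abs(h_new - h)) < 1e-12 and abs(g_new - g) < 1e-12:
        g, h = g_new, h_new
        break
    g, h = g_new, h_new

print("gain g ~", g, "  bias h ~", h)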

E Proof of Theorem 3

Proof Let π be a policy satisfying the expected discounted utilitarian fairness constraint, and suppose there exists $\omega : A \to \mathbb{R}$ satisfying conditions 1–3. Then,

\begin{align*}
J_\pi(s)
&\le J_\pi(s) + \sum_{a \in A} \omega(a) \Biggl( \lim_{T\to\infty} E_\pi\Biggl[\,\sum_{t=0}^{T-1} \alpha^t r(X_t, \pi_t)\, 1_{\{\pi_t = a\}} \,\Bigg|\, X_0 = s \Biggr] - D(a) J_\pi(s) \Biggr) \\
&= J_\pi(s) + \lim_{T\to\infty} E_\pi\Biggl[\,\sum_{t=0}^{T-1} \alpha^t \omega(\pi_t)\, r(X_t, \pi_t) \,\Bigg|\, X_0 = s \Biggr] - \sum_{a \in A} \omega(a) D(a) J_\pi(s) \\
&= \lim_{T\to\infty} E_\pi\Biggl[\,\sum_{t=0}^{T-1} \alpha^t \bigl(\kappa + \omega(\pi_t)\bigr) r(X_t, \pi_t) \,\Bigg|\, X_0 = s \Biggr] \\
&= U_\pi(s),
\end{align*}

where $\kappa = 1 - \sum_{a \in A} D(a)\,\omega(a)$. Since $U_\pi(s) \le U_\alpha(s) = U_{\pi^*}(s)$ from Lemma 4, we have

\begin{align*}
J_\pi(s) &\le U_{\pi^*}(s) \tag{31} \\
&= \lim_{T\to\infty} E_{\pi^*}\Biggl[\,\sum_{t=0}^{T-1} \alpha^t r(X_t, \pi^*_t) \,\Bigg|\, X_0 = s \Biggr]
+ \sum_{a \in A} \omega(a) \Biggl( \lim_{T\to\infty} E_{\pi^*}\Biggl[\,\sum_{t=0}^{T-1} \alpha^t r(X_t, \pi^*_t)\, 1_{\{\pi^*_t = a\}} \,\Bigg|\, X_0 = s \Biggr] - D(a) J_{\pi^*}(s) \Biggr) \\
&= J_{\pi^*}(s), \tag{32}
\end{align*}

where the second part of Eq. 32 equals zero because of condition 3 on $\omega$. From Eq. 31, we get the corresponding optimal discounted reward as

\[
J_{\pi^*}(s) = U_{\pi^*}(s), \quad \forall s \in S.
\]
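For the utilitarian-fairness case the same value-iteration idea applies, with the immediate reward rescaled to $(\kappa + \omega(a))\, r(s,a)$, where $\kappa = 1 - \sum_a D(a)\,\omega(a)$. The following Python sketch uses hypothetical multipliers $\omega(a)$ and throughput shares $D(a)$; it is an illustration of the fixed-point equation, not data or code from the paper.

import numpy as np

# Toy example (hypothetical data): 3 states, 2 actions.
alpha = 0.9
r = np.array([[1.0, 0.5], [0.2, 0.8], [0.0, 0.3]])   # r[s, a]
omega = np.array([0.0, 0.2])                         # omega(a): constraint multipliers
D     = np.array([0.5, 0.3])                         # D(a): minimum throughput shares
kappa = 1.0 - np.dot(D, omega)                       # kappa = 1 - sum_a D(a) omega(a)
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
              [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1]],
              [[0.2, 0.2, 0.6], [0.1, 0.1, 0.8]]])   # P[s, a, s']

U = np.zeros(3)
for _ in range(1000):
    # Q[s, a] = (kappa + omega(a)) r(s, a) + alpha * sum_{s'} P(s'|s, a) U(s')
    Q = (kappa + omega[None, :]) * r + alpha * np.einsum('sap,p->sa', P, U)
    U_new = Q.max(axis=1)
    if np.max(np.abs(U_new - U)) < 1e-10:
        U = U_new
        break
    U = U_new

print("U_alpha ~", U)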

F Proof of Theorem 4

Proof Let π be a policy satisfying the expected average utilitarian fairness constraint; and let $H_t = (X_0, \pi_0, \ldots, X_{t-1}, \pi_{t-1}, X_t, \pi_t)$ denote the history of the process up to time $t$. First, we have

\[
E_\pi\Biggl\{ \sum_{t=1}^{T} \bigl[ h(X_t) - E_\pi\bigl( h(X_t) \mid H_{t-1}\bigr) \bigr] \Biggr\} = 0.
\]

Also,

\begin{align*}
E_\pi[ h(X_t) \mid H_{t-1}]
&= \sum_{s' \in S} h(s') P(s' \mid X_{t-1}, \pi_{t-1}) \\
&= \bigl(\kappa + \omega(\pi_{t-1})\bigr) r(X_{t-1}, \pi_{t-1}) + \sum_{s' \in S} h(s') P(s' \mid X_{t-1}, \pi_{t-1}) - \bigl(\kappa + \omega(\pi_{t-1})\bigr) r(X_{t-1}, \pi_{t-1}) \\
&\le \max_{a \in A}\Bigl\{ \bigl(\kappa + \omega(a)\bigr) r(X_{t-1}, a) + \sum_{s' \in S} P(s' \mid X_{t-1}, a)\, h(s') \Bigr\} - \bigl(\kappa + \omega(\pi_{t-1})\bigr) r(X_{t-1}, \pi_{t-1}) \\
&= g + h(X_{t-1}) - \bigl(\kappa + \omega(\pi_{t-1})\bigr) r(X_{t-1}, \pi_{t-1}),
\end{align*}

with equality for $\pi^*$, since $\pi^*$ is defined to take the maximizing action. Hence,

\begin{align*}
0 &\ge E_\pi\Biggl\{ \sum_{t=1}^{T} \bigl[ h(X_t) - g - h(X_{t-1}) + \bigl(\kappa + \omega(\pi_{t-1})\bigr) r(X_{t-1}, \pi_{t-1}) \bigr] \Biggr\} \\
\Leftrightarrow\;
g &\ge E_\pi \frac{h(X_T)}{T} - E_\pi \frac{h(X_0)}{T}
+ E_\pi \frac{1}{T}\sum_{t=1}^{T} \bigl(\kappa + \omega(\pi_{t-1})\bigr) r(X_{t-1}, \pi_{t-1}) \\
\Leftrightarrow\;
g &\ge E_\pi \frac{h(X_T)}{T} - E_\pi \frac{h(X_0)}{T}
+ E_\pi \frac{1}{T}\sum_{t=1}^{T} r(X_{t-1}, \pi_{t-1})
+ E_\pi \frac{1}{T}\sum_{t=1}^{T} \Bigl( \omega(\pi_{t-1}) - \sum_{a \in A} D(a)\,\omega(a) \Bigr) r(X_{t-1}, \pi_{t-1}).
\end{align*}

Letting $T \to \infty$ and using the fact that $h$ is bounded, we have that

\begin{align*}
g &\ge J_\pi(X_0) + \lim_{T\to\infty} E_\pi \frac{1}{T}\sum_{t=0}^{T-1} \Bigl( \omega(\pi_t) - \sum_{a \in A} D(a)\,\omega(a) \Bigr) r(X_t, \pi_t) \\
\Leftrightarrow\;
g &\ge J_\pi(X_0) + \sum_{a \in A} \omega(a) \Biggl( \lim_{T\to\infty} E_\pi\Biggl[ \frac{1}{T}\sum_{t=0}^{T-1} r(X_t, \pi_t)\, 1_{\{\pi_t = a\}} \,\Bigg|\, X_0 = s \Biggr] - D(a) J_\pi(s) \Biggr). \tag{33}
\end{align*}

Since we know that $\omega \ge 0$ and that the policy π satisfies the utilitarian fairness constraints, the second part of Eq. 33 is greater than or equal to zero. We get

\[
g \ge J_\pi(s).
\]

With policy $\pi^*$, we have

\begin{align*}
g &= J_{\pi^*}(s) + \sum_{a \in A} \omega(a) \Biggl( \lim_{T\to\infty} E_{\pi^*}\Biggl[ \frac{1}{T}\sum_{t=0}^{T-1} r(X_t, \pi^*_t)\, 1_{\{\pi^*_t = a\}} \,\Bigg|\, X_0 = s \Biggr] - D(a) J_{\pi^*}(s) \Biggr) \\
&= J_{\pi^*}(s), \tag{34}
\end{align*}

where the second part of Eq. 34 equals zero because of condition 3 on $\omega(a)$. Hence, the desired result is proven. □
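Analogously to Theorem 2, the optimality equation behind this proof, $g + h(s) = \max_{a}\{ (\kappa + \omega(a))\, r(s,a) + \sum_{s'} P(s' \mid s,a) h(s') \}$, can be solved by relative value iteration on the rescaled reward. A brief Python sketch with the same kind of hypothetical toy data (again assuming a unichain, aperiodic toy chain so the iteration converges):

import numpy as np

# Toy example (hypothetical data): 3 states, 2 actions.
r = np.array([[1.0, 0.5], [0.2, 0.8], [0.0, 0.3]])   # r[s, a]
omega = np.array([0.0, 0.2])                         # omega(a)
D     = np.array([0.5, 0.3])                         # D(a)
kappa = 1.0 - np.dot(D, omega)                       # kappa = 1 - sum_a D(a) omega(a)
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
              [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1]],
              [[0.2, 0.2, 0.6], [0.1, 0.1, 0.8]]])   # P[s, a, s']

h = np.zeros(3)
for _ in range(10000):
    # g + h(s) = max_a { (kappa + omega(a)) r(s, a) + sum_{s'} P(s'|s, a) h(s') }
    Q = (kappa + omega[None, :]) * r + np.einsum('sap,p->sa', P, h)
    Th = Q.max(axis=1)
    g, h = Th[0], Th - Th[0]   # relative value iteration with reference state 0

print("gain g ~", g, "  bias h ~", h)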
