TUTORIAL ON STOCHASTIC GAMES I - Lake Como
TUTORIAL ON STOCHASTIC GAMES
Anna Jaśkiewicz
Wrocław University of Science and Technology
Department of Pure and Applied Mathematics
e-mail: [email protected]
PLAN:
I Theory: the existence result; comments on the extensions; algorithms; the limiting average payoff game.
I Applications: Bequest games.
MODEL OF A STOCHASTIC GAME
I I = {1, . . . , n} - the set of players; i ∈ I ;
I X = {1, . . . ,N} - a state space;
I Ai - the finite action set of player i ; (or Ai(x))
I A := A1 × · · · × An - the set of action profiles;
I ri : X × A 7→ R - the reward function of player i ;
I q(x′|x, a) - the probability that the next state is x′ given the current state x and the action profile a ∈ A;
I β ∈ (0, 1) - a discount factor.
EVOLUTION OF A STOCHASTIC GAME
[Figure: the game passes through states x_1, x_2, . . . , x_k, . . .; at stage k the players choose the action profile a(k) = (a_{1k}, a_{2k}, . . . , a_{nk}), each player i collects the reward r_i(x_k, a(k)), and the next state is drawn from q(·|x_k, a(k)).]
STRATEGIES
A strategy for player i is a sequence:
πi = (πi1, πi2, . . .),
where
π_{ik}(·|h_k) ∈ Pr(A_i), k ∈ N,
and
h_k = (x_1, a(1), x_2, a(2), . . . , x_k)
is the history of the process up to the k-th state (h_1 = x_1), and
a(m) = (a1m, a2m, . . . , anm)
is the profile of actions at the m-th stage of the game.
STATIONARY STRATEGIES
A stationary strategy for player i is a sequence:
f ∞i = (fi , fi , . . .),
where f_i : X 7→ Pr(A_i).
Identify: f^∞_i ←→ f_i.
We shall write: f_i(·|x), f_i(x)(·),
f = (f1, . . . , fn), π = (π1, . . . , πn).
The set of stationary strategies for player i is Fi .
EXPECTED DISCOUNTED REWARD
Let x ∈ X be an initial state and let π be the strategy profile chosen by the players. Then r_i^{(k)}(x, π) is the expected reward of player i at the k-th stage of the game. Define the expected discounted reward over the infinite time horizon as follows:

J_i(x, π) = Σ_{k=1}^∞ β^{k−1} r_i^{(k)}(x, π).

It is well-defined, since

|J_i(x, π)| ≤ R/(1 − β), where R := max_{i, x, a} |r_i(x, a)|.

β is the probability of continuation of the game, or β = 1/(1 + ρ), where ρ is the interest rate.
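The geometric bound above can be checked numerically; a minimal sketch (the reward bound R, the discount factor β and the truncation length T are illustrative assumptions):

```python
# illustrative numbers (assumptions): reward bound R, discount beta, truncation T
beta, R, T = 0.9, 6.0, 100
partial = sum(beta ** (k - 1) * R for k in range(1, T + 1))
tail = beta ** T * R / (1 - beta)    # geometric tail bound beta**T * R / (1 - beta)
assert partial <= R / (1 - beta)     # the whole series is dominated by R/(1 - beta)
assert abs(R / (1 - beta) - partial) <= tail + 1e-9
```

Truncating the series after T stages therefore costs at most β^T R/(1 − β), which is the usual stopping criterion in computations.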
AN EXAMPLE OF 2-PLAYER STOCHASTIC GAME I
[Figure: state x = 1 is a 2×2 bimatrix game with payoff pairs (6, 2), (3, −3) in the first row and (2, 1), (1, 3) in the second; three of the cells lead back to state 1 with probability one (transition (1, 0)), while one cell has transition (1/3, 2/3); state x = 2 is absorbing with payoff (0, 0) and transition (0, 1).]

Stationary strategies: f_1 = ((1/2, 1/2), 1), f_2 = ((0, 1), 1).

r_1(x, f_1, f_2) = Σ_{a_1 ∈ A_1} Σ_{a_2 ∈ A_2} r_1(x, a_1, a_2) f_1(a_1|x) f_2(a_2|x) = 7/2 if x = 1, and 0 if x = 2,

r_2(x, f_1, f_2) = Σ_{a_1 ∈ A_1} Σ_{a_2 ∈ A_2} r_2(x, a_1, a_2) f_1(a_1|x) f_2(a_2|x) = 5/2 if x = 1, and 0 if x = 2,

q(y|x, f_1, f_2) = Σ_{a_1 ∈ A_1} Σ_{a_2 ∈ A_2} q(y|x, a_1, a_2) f_1(a_1|x) f_2(a_2|x).
AN EXAMPLE OF 2-PLAYER STOCHASTIC GAME II
[Figure: the same game as on the previous slide, with f_1 = ((1/2, 1/2), 1) and f_2 = ((0, 1), 1).]

J_1(1, f_1, f_2) = (7/2)(1 + β(2/3) + β^2(2/3)^2 + · · ·) = (7/2)/(1 − 2β/3),

J_2(1, f_1, f_2) = (5/2)(1 + β(2/3) + β^2(2/3)^2 + · · ·) = (5/2)/(1 − 2β/3).

Formula:

Q_{f_1 f_2} = [ 2/3  1/3 ; 0  1 ]

r_i^{(k)}(x, f_1, f_2) = Σ_{y ∈ X} r_i(y, f_1, f_2) q^{(k−1)}(y|x, f_1, f_2)

J_i(f_1, f_2) = Σ_{k=1}^∞ β^{k−1} Q_{f_1 f_2}^{k−1} r_i(f_1, f_2) = (I − β Q_{f_1 f_2})^{−1} r_i(f_1, f_2)
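The matrix formula can be checked numerically for this example; a minimal sketch (β = 0.9 is an illustrative choice, and Cramer's rule stands in for a general linear solver in the 2-state case):

```python
def discounted_payoff(Q, r, beta):
    """Solve (I - beta*Q) J = r for a 2-state chain by Cramer's rule."""
    a, b = 1 - beta * Q[0][0], -beta * Q[0][1]
    c, d = -beta * Q[1][0], 1 - beta * Q[1][1]
    det = a * d - b * c
    return [(d * r[0] - b * r[1]) / det, (a * r[1] - c * r[0]) / det]

beta = 0.9                         # illustrative discount factor (an assumption)
Q = [[2 / 3, 1 / 3], [0.0, 1.0]]   # transition matrix under (f1, f2) from the slide
r1, r2 = [3.5, 0.0], [2.5, 0.0]    # stage rewards r_i(x, f1, f2)

J1 = discounted_payoff(Q, r1, beta)
J2 = discounted_payoff(Q, r2, beta)
# closed form from the slide: J_i(1, f1, f2) = r_i(1) / (1 - 2*beta/3)
assert abs(J1[0] - 3.5 / (1 - 2 * beta / 3)) < 1e-9
assert abs(J2[0] - 2.5 / (1 - 2 * beta / 3)) < 1e-9
```

For β = 0.9 this gives J_1(1, f_1, f_2) = 8.75 and J_2(1, f_1, f_2) = 6.25, matching the geometric-series computation above.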
NASH EQUILIBRIUM IN A DISCOUNTED STOCHASTIC GAME
A profile π* = (π*_1, . . . , π*_n) is a Nash equilibrium if

J_i(x, π*) ≥ J_i(x, π_i, π*_{−i})

for all x ∈ X, all strategies π_i and all i ∈ I.
Recall that for the vector y ∈ Rn we define
y−i = (y1, . . . , yi−1, yi+1, . . . , yn)
(zi , y−i) = (y1, . . . , yi−1, zi , yi+1, . . . , yn).
A stationary Nash equilibrium is a Nash equilibrium that belongs to the class of strategy profiles F_1 × · · · × F_n.
EXISTENCE OF A NASH EQUILIBRIUM
Every discounted stochastic game possesses a stationary Nash equilibrium, i.e., there exists f* = (f*_1, . . . , f*_n) ∈ F := F_1 × · · · × F_n such that

J_i(x, f*) ≥ J_i(x, π_i, f*_{−i})

for all x ∈ X, all strategies π_i and all i ∈ I.
PROOF I
I Note that F is a compact convex set in some Euclidean space.
I Observe that J_i(x, ·) is continuous on F, i.e., if

(f^k_1, f^k_2, . . . , f^k_n) → (f_1, f_2, . . . , f_n) as k → ∞

(i.e., f^k_i(x) → f_i(x) for all x ∈ X, i ∈ I), then J_i(x, f^k_1, . . . , f^k_n) → J_i(x, f_1, . . . , f_n). This follows from the formula J_i(f_1, . . . , f_n) = (I − β Q_{f_1...f_n})^{−1} r_i(f_1, . . . , f_n).
I CLAIM: A profile f* = (f*_1, . . . , f*_n) is a Nash equilibrium if J_i(·, f*) satisfies the optimality equation

J_i(x, f*) = max_{µ ∈ Pr(A_i)} [ r_i(x, µ, f*_{−i}) + β Σ_{y ∈ X} J_i(y, f*) q(y|x, µ, f*_{−i}) ]
           = r_i(x, f*) + β Σ_{y ∈ X} J_i(y, f*) q(y|x, f*)

for all x ∈ X and i ∈ I.
PROOF II
I Note that for every f ∈ F there exists g_i ∈ F_i such that

max_{µ ∈ Pr(A_i)} [ r_i(x, µ, f_{−i}) + β Σ_{y ∈ X} J_i(y, f) q(y|x, µ, f_{−i}) ]
 = r_i(x, g_i, f_{−i}) + β Σ_{y ∈ X} J_i(y, f) q(y|x, g_i, f_{−i}).

I Let ϕ_i(f) be the set of all g_i ∈ F_i that satisfy the above equality for all x ∈ X.
I Observe that ϕ_i(f) ≠ ∅ and ϕ_i(f) is convex:

r_i(x, λg_{i1} + (1 − λ)g_{i2}, f_{−i}) = λ r_i(x, g_{i1}, f_{−i}) + (1 − λ) r_i(x, g_{i2}, f_{−i}), λ ∈ [0, 1],

and the same holds for q.
I Define the mapping
Φ(f ) = ϕ1(f )× · · · × ϕn(f ).
PROOF: upper semicontinuity of f 7→ Φ(f ).
Fix two sequences (f k) and (gk) of stationary strategies such that
(1) f^k → f ∈ F, g^k → g ∈ F,
(2) g^k ∈ Φ(f^k).
Then, letting k → ∞ in

max_{µ ∈ Pr(A_i)} [ r_i(x, µ, f^k_{−i}) + β Σ_{y ∈ X} J_i(y, f^k) q(y|x, µ, f^k_{−i}) ]
 = r_i(x, g^k_i, f^k_{−i}) + β Σ_{y ∈ X} J_i(y, f^k) q(y|x, g^k_i, f^k_{−i}),

we get (using the continuity of J_i(x, ·)!)

max_{µ ∈ Pr(A_i)} [ r_i(x, µ, f_{−i}) + β Σ_{y ∈ X} J_i(y, f) q(y|x, µ, f_{−i}) ]
 = r_i(x, g_i, f_{−i}) + β Σ_{y ∈ X} J_i(y, f) q(y|x, g_i, f_{−i})

for all i ∈ I and x ∈ X. Hence, g ∈ Φ(f).
PROOF: the Kakutani fixed point theorem (1941)
Let S ⊂ Rm be a compact and convex set and let φ bean upper semicontinuous correspondence from S to Ssuch that for every s ∈ S the set φ(s) is non-empty andconvex. Then there exists a point s∗ ∈ S such thats∗ ∈ φ(s∗).
Since Φ is an upper semicontinuous correspondence from the compact convex set F into itself, it follows that there exists f* ∈ F such that f* ∈ Φ(f*).
PROOF: CLAIM REVISITED I
We have

max_{µ ∈ Pr(A_i)} [ r_i(x, µ, f*_{−i}) + β Σ_{y ∈ X} J_i(y, f*) q(y|x, µ, f*_{−i}) ]
 = r_i(x, f*) + β Σ_{y ∈ X} J_i(y, f*) q(y|x, f*) = J_i(x, f*),

because for any f ∈ F

J_i(x, f) = Σ_{k=1}^∞ β^{k−1} r_i^{(k)}(x, f) = r_i(x, f) + β Σ_{k=2}^∞ β^{k−2} r_i^{(k)}(x, f)
 = r_i(x, f) + β Σ_{y ∈ X} J_i(y, f) q(y|x, f).
PROOF: CLAIM REVISITED II
For every µ ∈ Pr(A_i) we have

J_i(x, f*) ≥ r_i(x, µ, f*_{−i}) + β Σ_{y ∈ X} J_i(y, f*) q(y|x, µ, f*_{−i}).

Iterating this inequality (T − 1) times gives, for any strategy π_i of player i,

J_i(x, f*) ≥ Σ_{k=1}^T β^{k−1} r_i^{(k)}(x, π_i, f*_{−i}) + β^T Σ_{y ∈ X} J_i(y, f*) q^{(T)}(y|x, π_i, f*_{−i}).

Since |J_i(x, π)| ≤ R/(1 − β), letting T → ∞ we get

J_i(x, f*) ≥ J_i(x, π_i, f*_{−i}).
COMMENTS ON THE EXISTENCE OF A STATIONARY NASH EQUILIBRIUM
I Non-zero-sum stochastic games with finite state and action sets: A.M. Fink (J Sci Hiroshima Univ Ser A-I, 28 (1964)), M. Takahashi (J Sci Hiroshima Univ Ser A-I, 26 (1963));
I These works generalise the paper of L.S. Shapley (Proc Natl Acad Sci 39 (1953)) on zero-sum stochastic games with finite state and action sets;
Year 1953 - the beginning of stochastic games
L.S. Shapley, Stochastic Games
COMMENTS II
I Non-zero-sum stochastic games with countable state space and compact metric action spaces: A. Federgruen (Adv Appl Prob, 10 (1978)):
(1) introduce the topology of pointwise convergence in the set of stationary strategies;
(2) f 7→ J_i(x, f) is continuous for every i ∈ I;
(3) a generalisation of Kakutani's fixed point theorem due to I. Glicksberg (1952) and K. Fan (1952).
COMMENTS III
I Non-zero-sum stochastic games with uncountable (general) state space and finite action spaces.
The problem of the existence of a stationary Nash equilibrium remained unsolved until 2015!
Y.J. Levy and A. McLennan (Econometrica 83 (2015)) gave a counterexample:
1. an 8-player stochastic game with state space X = [0, 1] and finite action sets;
2. the transition probability is a convex combination of a Dirac measure δ_1 and the uniform distribution on [x, 1], with coefficients depending on the current state x and the action profile a chosen by the players;
3. the payoff functions are complex;
4. E. Kohlberg and J.-F. Mertens (Econometrica 54 (1986): On strategic stability of equilibria).
COMMENTS IV: Why is it so difficult?
I The problem of the existence of a stationary Nash equilibrium is a fixed-point problem:
- compactness of the domain
- continuity of the expected payoffs on the product of stationary strategy sets
Remedy (?):
1. the weak-star topology on the set of stationary strategies (the Banach-Alaoglu theorem gives compactness of F); however, we lose continuity of the expected payoffs (counterexample: R.J. Elliott, N.J. Kalton, L. Markus (1973));
2. the topology of uniform convergence on the set F; but then a family is compact only if it is uniformly bounded and equicontinuous (the Arzelà-Ascoli theorem): one may consider Lipschitz functions with constant 1, but the best response can be a Lipschitz function with constant 2...
COMMENTS V: General state space
I P. Barelli and J. Duggan (J Econ Theory 15 (2014)):
1. the transition law q(·|y, a) is absolutely continuous w.r.t. some non-atomic measure µ;
2. the equilibrium strategies are independent of the stage of the game; they may depend on the current state, the previous state and the action profile chosen by the players in the previous state;
I A. Jaskiewicz, A.S. Nowak (Math Oper Res 41 (2016)):
1. q(·|x, a) = Σ_{k=1}^l g_k(x, a) q_k(·|x) with Σ_{k=1}^l g_k(x, a) = 1, where each g_k is a Carathéodory function;
2. q_k(·|x) is absolutely continuous w.r.t. a measure µ;
3. the equilibrium strategies are independent of the stage of the game; they may depend on the current state and the previous state;
4. the counterexample of Levy/McLennan belongs to the class of games considered here (SAMPE).
I Proofs: Look for a fixed point in the set of Nash equilibrium payoffs!
I J.-F. Mertens and T. Parthasarathy (1991, 2003): subgame perfect equilibrium (the entire history of the game).
ALGORITHMS: ZERO-SUM STOCHASTIC GAMES I
I Consider the game in which r := r_1 = −r_2; player 1 maximises, player 2 minimises; X, A_1 and A_2 are finite.
I Define the game Γ_v(x) with the payoff matrix:

P_v(x) = [ r(x, a_1, a_2) + β Σ_{y ∈ X} v(y) q(y|x, a_1, a_2) ]_{a_1, a_2}

Tv(x) = val[P_v(x)]

I By Shapley (1953), the value function w is the unique solution to the equation w = Tw, and if f*_1(x) and f*_2(x) are optimal strategies in the game Γ_w(x), then f*_1 and f*_2 are optimal in the stochastic game, i.e.,

J(x, π_1, f*_2) ≤ J(x, f*_1, f*_2) = w(x) ≤ J(x, f*_1, π_2)

for every x ∈ X, π_1 and π_2.
ALGORITHMS: ZERO-SUM STOCHASTIC GAMES II
I w is the fixed point of the contractive operator T;
I value iteration: T^{(k)}0 → w, but slowly!
I A.J. Hoffman and R.M. Karp (1966): a modification of VI that uses information on optimal strategies at the k-th stage of the game:
1. find g_1(x) - an optimal strategy for player 2 in Γ_0(x)
2. find w_1(x) = sup_{π_1} J(x, π_1, g_1)
3. find g_2(x) - an optimal strategy for player 2 in Γ_{w_1}(x)
4. find w_2(x) = sup_{π_1} J(x, π_1, g_2)
5. find g_3(x) - an optimal strategy for player 2 in Γ_{w_2}(x)
6. etc.
7. Claim: w_k → w
ALGORITHMS: ZERO-SUM STOCHASTIC GAMES III
I Pollatschek and Avi-Itzhak algorithm (1969) - proved tobe convergent under stringent assumptions;
I J. Filar and B. Tolwinski (1991) applied a modified Newton's method and improved this algorithm: w is the unique solution to the Shapley equation ≡ the unconstrained optimisation problem

min_{v ∈ R^N} Σ_{x ∈ X} (Tv(x) − v(x))^2.

It has a unique global minimum (v = (v(1), . . . , v(N)));
I 1-player game - MDP: LP solves the problem!
I In zero-sum games this is not the case!
ALGORITHMS: COUNTEREXAMPLE FOR LP
[Figure: state x = 1 is a 2×2 zero-sum game with rewards 1, 0 in the first row and 0, 3 in the second; the cells with rewards 1 and 3 keep the game in state 1 (transition (1, 0)), the cells with reward 0 move it to state 2 (transition (0, 1)); state x = 2 is absorbing with reward 0.]

Shapley equation:

w(1) = val[ r(1, a_1, a_2) + β Σ_{y ∈ X} w(y) q(y|1, a_1, a_2) ]_{a_1, a_2}.

With β = 1/2 and w(2) = 0:

w(1) = val[ 1 + (1/2)w(1)   0 + (1/2)w(2) ; 0 + (1/2)w(2)   3 + (1/2)w(1) ]

w(1) = (1/3)(−4 + 2√13)

The value is irrational although all the data of the game are rational, so it cannot be produced by a single linear program.
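The fixed point can be approximated by Shapley's value iteration; a minimal sketch (val2x2 uses the standard closed form for 2×2 matrix games, checking for a pure saddle point first):

```python
import math

def val2x2(M):
    """Value of a 2x2 zero-sum matrix game (row player maximises)."""
    (a, b), (c, d) = M
    lower = max(min(a, b), min(c, d))   # pure maximin
    upper = min(max(a, c), max(b, d))   # pure minimax
    if lower == upper:                  # saddle point in pure strategies
        return lower
    return (a * d - b * c) / (a + d - b - c)

beta = 0.5
w1 = 0.0                                # w(2) = 0: state 2 is absorbing with reward 0
for _ in range(200):                    # value iteration w <- Tw (contraction, modulus beta)
    w1 = val2x2([[1 + beta * w1, 0.0],
                 [0.0, 3 + beta * w1]])

# the fixed point is the irrational number (1/3)(-4 + 2*sqrt(13))
assert abs(w1 - (2 * math.sqrt(13) - 4) / 3) < 1e-9
```

The iterates converge to w(1) ≈ 1.0704, the irrational value that no linear program with the game's rational data can output exactly.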
ALGORITHMS: LP FOR ZERO-SUM STOCHASTIC GAMES

LP for a Single-Controller Stochastic Game: player 2 controls the transition probability, i.e., q(y|x, a_1, a_2) = q(y|x, a_2).

(LP) max_{v ∈ R^N, f_1 ∈ F_1} Σ_{x ∈ X} v(x)

subject to

r(x, f_1, a_2) + β Σ_{y ∈ X} v(y) q(y|x, a_2) ≥ v(x), a_2 ∈ A_2, x ∈ X,
v(x) ≥ 0.

The problem (LP) has a solution (v*, f*_1) such that v* = w and f*_1 is an optimal strategy in the SCSG. To find an optimal strategy f*_2, use the Shapley equation w = Tw.
ALGORITHMS FOR NON-ZERO-SUM STOCHASTIC GAMES I
Linear Complementarity Problem (LCP):
- given a square m × m matrix M
- given a column vector Q ∈ R^m
- find two vectors Z = [z_1, . . . , z_m]^T ∈ R^m and W = [w_1, . . . , w_m]^T ∈ R^m such that

MZ + Q = W, w_j, z_j ≥ 0, w_j z_j = 0 for all j = 1, . . . , m.
C.E. Lemke (1965) proposed a finite step pivoting algorithm tosolve LCP for a large class of matrices M and vectors Q.
ALGORITHMS FOR NON-ZERO-SUM STOCHASTIC GAMES II
Finding a NE in any bimatrix game (C, D) ≡ solving the LCP with

M = [ O  C ; D^T  O ],  Q = [−1, . . . , −1]^T.

C.E. Lemke and J.T. Howson (1964): a finite-step algorithm for this LCP.
If Z* = [Z*_1, Z*_2] is a solution to the LCP, then the normalisation of Z*_i is an equilibrium strategy for player i.
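For intuition, a minimal sketch that finds a fully mixed equilibrium of a 2×2 bimatrix game directly from the indifference conditions; this is not Lemke's pivoting method, and the payoffs below are illustrative assumptions:

```python
def mixed_ne_2x2(C, D):
    """Fully mixed NE of a 2x2 bimatrix game (C, D) via indifference conditions
    (assumes an interior equilibrium exists)."""
    # p: row player's probability of row 1, making the column player indifferent
    p = (D[1][1] - D[1][0]) / (D[0][0] - D[1][0] - D[0][1] + D[1][1])
    # q: column player's probability of column 1, making the row player indifferent
    q = (C[1][1] - C[1][0]) / (C[0][0] - C[1][0] - C[0][1] + C[1][1])
    return (p, 1 - p), (q, 1 - q)

# battle-of-the-sexes style payoffs (illustrative assumptions)
C = [[2.0, 0.0], [0.0, 1.0]]   # row player's payoff matrix
D = [[1.0, 0.0], [0.0, 2.0]]   # column player's payoff matrix
(p, _), (q, _) = mixed_ne_2x2(C, D)
assert abs(p - 2 / 3) < 1e-12 and abs(q - 1 / 3) < 1e-12
```

Lemke-Howson pivoting solves the same complementarity conditions, but scales to arbitrary bimatrix games where support enumeration becomes exponential.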
ALGORITHMS: LCP FOR NON-ZERO-SUM STOCHASTIC GAMES I

LCP for a Single-Controller Stochastic Game:
- player 2 controls the transition probability, i.e., q(y|x, a_1, a_2) = q(y|x, a_2);
- {f_1^1, . . . , f_1^{m_1}}, {f_2^1, . . . , f_2^{m_2}} - the sets of pure stationary strategies of the players.

Consider the bimatrix game:

C ↔ c_{k,l} = Σ_{x ∈ X} r_1(x, f_1^k(x), f_2^l(x))

D ↔ d_{k,l} = Σ_{x ∈ X} J_2(x, f_1^k, f_2^l)
ALGORITHMS: LCP FOR NON-ZERO-SUM STOCHASTIC GAMES II
Let p = (p_1, . . . , p_{m_1}), q = (q_1, . . . , q_{m_2}) be a NE in the bimatrix game (C, D). Then the stationary strategies

f*_1(x) = Σ_{k=1}^{m_1} p_k δ_{f_1^k(x)},  f*_2(x) = Σ_{l=1}^{m_2} q_l δ_{f_2^l(x)}

form a NE in the discounted stochastic game.
A.S. Nowak, T.E.S. Raghavan (1993); improvements due to S.R. Mohan, S.K. Neogy, T. Parthasarathy (1997, 2001).
RECENT DEVELOPMENTS ON ALGORITHMS
I J.J.P. Herings and R.J.A.P. Peters (2004, 2010): homotopy methods for computing Nash equilibria. The idea is based on an application of the Browder fixed point theorem (1960).
I S. Govindan and R. Wilson (2003): the algorithmcombines the global Newton method and a homotopymethod for finding fixed points of some continuousmapping.
MODEL OF A ZERO-SUM STOCHASTIC GAME
I I = {1, 2} - the set of players; i ∈ I;
I A := A_1 × A_2 - the set of action profiles;
I r : X × A 7→ R - the reward function of player 1 (the loss function of player 2);
I V_β - the value of the normalised β-discounted game, i.e., for every x ∈ X and all strategies π_i, i = 1, 2,

J_β(x, π_1, f*_2) ≤ J_β(x, f*_1, f*_2) = V_β(x) ≤ J_β(x, f*_1, π_2),

where (f*_1, f*_2) ∈ F_1 × F_2 are optimal stationary strategies and

J_β(x, π_1, π_2) = (1 − β) Σ_{k=1}^∞ β^{k−1} r^{(k)}(x, π_1, π_2).
SHAPLEY’S THEOREM
An auxiliary matrix game Γ_v(x):

[ (1 − β) r(x, a_1, a_2) + β Σ_{y ∈ X} v(y) q(y|x, a_1, a_2) ]_{a_1, a_2}

The discounted zero-sum stochastic game possesses the value V_β, which is the unique solution of

V_β(x) = val[ (1 − β) r(x, a_1, a_2) + β Σ_{y ∈ X} V_β(y) q(y|x, a_1, a_2) ]

for all x ∈ X. Moreover, if (f*_1(x), f*_2(x)) is an optimal strategy pair in the game Γ_{V_β}(x) for every x ∈ X, then (f*_1, f*_2) is an optimal pair of strategies in the zero-sum stochastic game.

The map T_β : R^N 7→ R^N is contractive:

(T_β v)(x) := val[ (1 − β) r(x, a_1, a_2) + β Σ_{y ∈ X} v(y) q(y|x, a_1, a_2) ].
THE LIMITING AVERAGE PAYOFF CRITERION
The worst-case scenario for player 1:

V(x, π_1, π_2) = liminf_{T→∞} (1/T) Σ_{k=1}^T r^{(k)}(x, π_1, π_2)

I In the MDP model (1-player game), the LAP criterion is considerably more difficult to analyse;
I A stationary optimal strategy exists and can be found by a suitably constructed linear program.
Questions:
I Does the value for the limiting average game exist?
I Do players possess optimal stationary strategies?
THE BIG MATCH (GILLETTE (1957))
[Figure: the Big Match. In state x = 1, player 1 chooses a row (predict 1 or predict 2) and player 2 a column; as long as player 1 predicts 1, the game stays in state 1, the reward being 1 when the prediction is correct and 0 otherwise. The first time player 1 predicts 2, the game moves to the absorbing state x = 2 (reward 0 forever, if he was wrong) or to the absorbing state x = 3 (reward 1 forever, if he was correct).]
Every day player 2 chooses a number, 1 or 2, and player 1 tries to predict 2's choice, winning
a point if he is correct. This continues as long as player 1 predicts 1. But if he ever predicts 2,
all future choices for both players are required to be the same as that day’s choices: if player
1 is correct on that day, he wins a point every day thereafter; if he is wrong on that day, he
wins zero every day thereafter. The payoff to player 1 is
V(1, π_1, π_2) = liminf_{n→∞} (ω_1 + · · · + ω_n)/n, ω_k ∈ {0, 1},

where ω_k is the point won by player 1 on day k. (*States 2 and 3 are absorbing.)
STATIONARY STRATEGIES IN THE BIG MATCH
Gillette (1957): stationary strategies
max_{f_1 ∈ F_1} min_{f_2 ∈ F_2} V(1, f_1, f_2) = 0 < min_{f_2 ∈ F_2} max_{f_1 ∈ F_1} V(1, f_1, f_2) = 1/2
Consider the following stationary strategies:
f_1^p = ((p, 1 − p), 1, 1), f_2^q = ((q, 1 − q), 1, 1).
THE LOWER VALUE
[Figure: the lower-value analysis of the Big Match; states x = 2 and x = 3 are absorbing.]

f_1^p(1) = (p, 1 − p), f_2^q(1) = (q, 1 − q).

CASE 1: p = 1 - player 1 never chooses the row causing absorption; but against the strategy f_2^0(1) = (0, 1) player 1 will always earn 0, and hence V(1, f_1^1, f_2^0) = 0.

CASE 2: 0 ≤ p < 1 - against f_2^1(1) = (1, 0) player 1 will ultimately be absorbed in state 2 with probability

(1 − p) + p(1 − p) + p^2(1 − p) + · · · = 1.

In view of the nature of the LAP, V(1, f_1^p, f_2^1) = 0.

CONCLUSION: min_{f_2 ∈ F_2} V(1, f_1^p, f_2) = 0.
THE UPPER VALUE
[Figure: the upper-value analysis of the Big Match; player 2 mixes (q, 1 − q) in state 1, and states x = 2 and x = 3 are absorbing.]

f_1^p(1) = (p, 1 − p), f_2^q(1) = (q, 1 − q).

CASE 0: q = 1/2 - irrespective of what player 1 does in state 1, V(1, f_1^p, f_2^{1/2}) = 1/2.

CASE 1: p < 1 - absorption in state 2 [state 3] occurs with probability q [1 − q]:

q(1 − p) + pq(1 − p) + p^2 q(1 − p) + · · · = q.

CASE 2: p = 1 - state 1 repeats itself infinitely often. Hence

V(1, f_1^p, f_2^q) = q if p = 1, and 1 − q if p < 1,

so max_{f_1 ∈ F_1} V(1, f_1, f_2^q) = max{q, 1 − q}.

CONCLUSION: min_{f_2 ∈ F_2} max_{f_1 ∈ F_1} V(1, f_1, f_2) = 1/2.
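The max-min/min-max gap can be checked numerically from the case analysis on this slide; a minimal sketch (the closed-form V below restates the cases, with pure strategies of player 2 handled by the same reasoning):

```python
# limiting average payoff of the Big Match under stationary profiles,
# read off from the case analysis: V(1, f1^p, f2^q) = q if p = 1, else 1 - q
def V(p, q):
    return q if p == 1.0 else 1.0 - q

grid = [i / 100 for i in range(101)]
maxmin = max(min(V(p, q) for q in grid) for p in grid)
minmax = min(max(V(p, q) for p in grid) for q in grid)
assert maxmin == 0.0               # player 1 guarantees nothing in stationary strategies
assert abs(minmax - 0.5) < 1e-12   # player 2 caps the payoff at 1/2 with q = 1/2
```

The strict gap 0 < 1/2 is exactly the failure of the limiting average value in stationary strategies.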
SOLUTION OF THE BIG MATCH
I The Big Match does not have a limiting average value in stationary strategies;
I The Big Match does not have a limiting average value in Markov strategies (dependence on the stage and the current state);
I The Big Match was solved by Blackwell and Ferguson (1968):

The limiting average value of the Big Match equals 1/2.

Note: Behavioural strategies are indispensable in achieving limiting average ε-optimality. Player 1 can guarantee a LAP as close to 1/2 as he likes by carefully taking into account the opponent's behaviour (his past actions) in the process of choosing his own actions. There is no way that player 1 can guarantee exactly 1/2.
PROOF - COMMENTS
I Blackwell and Ferguson (1968) provided two different constructions of an ε-optimal strategy for player 1. One of them relies on using a sequence of optimal stationary strategies in discounted games with discount factors tending to 1. The idea is to switch from one discounted optimal strategy to another on the basis of some statistics defined on the past plays.
I Their results were generalised by Kohlberg (1974) to absorbing zero-sum stochastic games, i.e., stochastic games in which all states but one are absorbing.
THE RESULT OF MERTENS AND NEYMAN (1981)
The limiting average value V* exists for every finite zero-sum stochastic game.

Moreover, for every ε > 0 there exist (π_1^ε, π_2^ε), n_0 ∈ N and β_0 ∈ (0, 1) such that

sup_{π_1} J_n(x, π_1, π_2^ε) − ε ≤ V*(x) ≤ inf_{π_2} J_n(x, π_1^ε, π_2) + ε for all n ≥ n_0, x ∈ X,

and

sup_{π_1} J_β(x, π_1, π_2^ε) − ε ≤ V*(x) ≤ inf_{π_2} J_β(x, π_1^ε, π_2) + ε for all β ∈ (β_0, 1), x ∈ X,

where J_n(x, π_1, π_2) = (1/n) Σ_{k=1}^n r^{(k)}(x, π_1, π_2).

(π_1^ε, π_2^ε) are nearly optimal in sufficiently long finite games and in all discounted games with discount factor sufficiently close to 1.
THE BIG MATCH
V*(x) = lim_{n→∞} V_n(x) = lim_{β→1} V_β(x)

The value of the n-stage game: V_n(1) = 1/2 for all n.
The value of the β-discounted game: V_β(1) = 1/2 for all β.
The unique optimal strategy for player 2 is (1/2, 1/2) in all n-stage games and in all β-discounted games.
For player 1 the β-discounted optimal strategy is (1/(2 − β), (1 − β)/(2 − β)).
For player 1 an optimal Markov strategy in the n-stage game is to play ((1 + m)/(2 + m), 1/(2 + m)) at stage n − m, for m = 1, 2, . . . , n − 1.
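The claimed discounted optimal strategy can be verified against the normalised Shapley equation; a sketch in which the 2×2 auxiliary matrix at state 1 is derived from the Big Match payoffs under V_β(1) = 1/2 (the derivation itself is ours, not taken verbatim from the slides):

```python
def val_and_row_mix_2x2(M):
    """Value and the row player's optimal mix of a 2x2 zero-sum game
    without a pure saddle point."""
    (a, b), (c, d) = M
    den = a + d - b - c
    x = (d - c) / den                     # probability of the first row
    return (a * d - b * c) / den, (x, 1 - x)

for beta in (0.1, 0.5, 0.9, 0.99):
    # normalised Shapley auxiliary matrix at state 1 with V_beta(1) = 1/2:
    # row 1 (predict 1) stays in state 1, row 2 (predict 2) absorbs at 0 or 1
    M = [[1 - beta / 2, beta / 2],
         [0.0, 1.0]]
    v, (x1, x2) = val_and_row_mix_2x2(M)
    assert abs(v - 0.5) < 1e-12                      # V_beta(1) = 1/2 for all beta
    assert abs(x1 - 1 / (2 - beta)) < 1e-12          # optimal mix (1/(2-beta), ...)
    assert abs(x2 - (1 - beta) / (2 - beta)) < 1e-12
```

As β → 1 the absorbing row gets weight (1 − β)/(2 − β) → 0, which is why no single stationary strategy works in the limiting average game.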
PROOF: COMMENTS
The proof contains a clever use of the Blackwell and Ferguson approach and the Bewley and Kohlberg result (the field of real Puiseux series turns out to be extremely useful in studying the asymptotic behaviour of V_β as β ↗ 1 and V_n as n → ∞).
The proof is complex and makes use of algebraic tools and Tarski’s principle
from logic.
Bewley and Kohlberg (1976):

There exist β* ∈ (0, 1), M ∈ N and real numbers c_k(x), k = 0, 1, . . ., such that for all β ∈ (β*, 1)

V_β(x) = Σ_{k=0}^∞ c_k(x)(1 − β)^{k/M}.

There exist M ∈ N and real numbers d_k(x), k = 0, 1, . . ., such that for all sufficiently large n

|V_n(x) − Σ_{k=0}^∞ d_k(x) n^{−k/M}| = O(ln n / n).
LAP IN NON-ZERO SUM GAMES
I The limiting average equilibrium payoffs can neither be approached asymptotically from the set of β-discounted equilibrium payoffs nor from the set of n-stage equilibrium payoffs (see The Paris Match due to S. Sorin).
I The Puiseux series expansions for the normalised β-discounted payoffs in finite non-zero-sum games were provided by Mertens (1982).
MODEL OF A BEQUEST GAME
I {1, 2, . . .} - the set of short-lived generations;
I generation t lives, acts and dies in period t;
I there is a single good S = [0, 1] used both for consumption and as productive capital (the space of endowments);
I s_t ∈ [0, 1] - the endowment of generation t;
I generation t saves y_t ∈ [0, s_t] and consumes a_t:

s_t (endowment) = a_t (consumption) + y_t (investment)

I the next generation's inheritance or capital: s_{t+1} = f(y_t), where f is an increasing continuous production function with f(0) = 0;
I u_t := u(a_t, a_{t+1}) - generation t's utility (a generation cares about itself and its descendant).
EVOLUTION OF A BEQUEST GAME
[Figure: generation t receives the endowment s_t, consumes a_t, and invests y_t = s_t − a_t; the next endowment is s_{t+1} = f(y_t). Each generation derives its utility u(a_t, a_{t+1}) from its own consumption and the follower's consumption.]
STRATEGIES
I Φ - the set of functions ϕ : S 7→ S such that ϕ(s) ∈ [0, s];
I a strategy for generation t is a function c_t ∈ Φ;
I if c_t = c for all t for some c ∈ Φ, then the generations employ a stationary strategy;
I c ∈ Φ - a consumption strategy =⇒ i(s) = s − c(s) - an investment/saving strategy.
SOLUTION
E. Phelps and R. Pollack (Rev. Econ. Stud. (1968)): a game between generations!
Definition: A strategy c* is a stationary Markov perfect equilibrium (SMPE) if
c*(s) ∈ argmax_{a ∈ [0, s]} u(a, c*(f(s − a))).
The best reply of the current generation is c* if the next one uses c*.
EXAMPLE
I u(a_t, a_{t+1}) = ln a_t + β ln a_{t+1}, where β ∈ (0, 1);
I f(y) = y^ξ, where ξ ∈ (0, 1);
I we look for a stationary strategy c(·) that is differentiable;
I solve the problem

max_{a ∈ [0, s]} ( ln a + β ln c((s − a)^ξ) )

I FOC:

1/a = βξ (s − a)^{ξ−1} c′((s − a)^ξ) / c((s − a)^ξ)

I this suggests a linear strategy c(s) := As, A ∈ (0, 1):

1/a = βξ (s − a)^{ξ−1} A / (A (s − a)^ξ) = βξ/(s − a) ⇒ a = s/(1 + βξ)

I hence, c*(s) = s/(1 + βξ) is a SMPE.
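The linear SMPE can be verified numerically: the best reply to c* should be c* itself. A minimal sketch (β = 0.9 and ξ = 0.5 are illustrative choices, and a grid search stands in for the first-order condition):

```python
import math

# illustrative parameters (assumptions, not from the slides)
beta, xi = 0.9, 0.5
c_star = lambda s: s / (1 + beta * xi)   # candidate SMPE c*(s) = s/(1 + beta*xi)

def best_reply(s, c, n=20000):
    """Grid search of max over a in (0, s) of ln a + beta * ln c((s - a)**xi)."""
    best_a, best_val = None, -math.inf
    for j in range(1, n):
        a = s * j / n
        val = math.log(a) + beta * math.log(c((s - a) ** xi))
        if val > best_val:
            best_a, best_val = a, val
    return best_a

s = 0.8
# the best reply to c* agrees with c* up to the grid resolution
assert abs(best_reply(s, c_star) - c_star(s)) < 1e-3
```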
EXISTENCE OF SMPE: BEQUEST GAME
I Suppose that the generation inherits s and chooses a consumption a = c̄(s) that maximises

u(a, c(f(s − a))),

where c(·) is the strategy of the following generation;
I c(·) continuous ⇒ the maximum on [0, s] exists;
I one can assign to a continuous function c a best-reply function c̄;
I the fixed point of such a map is a SMPE;
I PROBLEM!
- c̄ need not be continuous
- C[0, 1] is not compact, e.g., (x^n);
I Remedy: Lipschitz continuous functions (a stringent assumption);
I compactness of the domain: the Arzelà-Ascoli theorem, the Banach-Alaoglu theorem.
EXAMPLE: FISH WAR GAME I
I Each generation extracts a renewable common-property resource, e.g., fish;
I f (y) = y ξ - the production function;
I The utility of generation t:

u_t(a_t, a_{t+1}, a_{t+2}, . . .) := ln a_t + α(β ln a_{t+1} + β^2 ln a_{t+2} + . . .)

I Generation t cares about the consumption levels of all future generations;
I α - the altruism coefficient (α < 1).
EXAMPLE: FISH WAR GAME II
I Game: between the current generation and all future generations;
I Look for a SMPE in the set of linear functions, i.e., c(s) = As, where 0 < A < 1;
I Assume that all future generations use c(s) = As:

s_τ = f(s_{τ−1} − a_{τ−1}) = f(s_{τ−1} − c(s_{τ−1})) = (s_{τ−1} − A s_{τ−1})^ξ = (s_{τ−1})^ξ (1 − A)^ξ

I Define the discounted payoff when all future generations t + 1, t + 2, . . . employ strategy c:

J(c)(s_{t+1}) = Σ_{τ=1}^∞ β^{τ−1} ln c(s_{t+τ}) = Σ_{τ=1}^∞ β^{τ−1} ln(A s_{t+τ})

J(c)(s_{t+1}) = ln s_{t+1}/(1 − ξβ) + ln A/(1 − β) + ξβ ln(1 − A)/((1 − β)(1 − ξβ))
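The closed form for J(c) can be verified against direct summation of the series; a sketch with illustrative parameter values (β, ξ, A and the initial endowment are assumptions):

```python
import math

# illustrative parameters (assumptions): discount, technology, linear strategy c(s) = A*s
beta, xi, A = 0.9, 0.5, 0.4
s1 = 0.7                                   # endowment s_{t+1}

# direct summation of J(c)(s_{t+1}) = sum over tau >= 1 of beta^{tau-1} ln(A * s_{t+tau})
s, J, disc = s1, 0.0, 1.0
for _ in range(2000):
    J += disc * math.log(A * s)
    disc *= beta
    s = (s * (1 - A)) ** xi                # s_tau = ((1 - A) * s_{tau-1})**xi

closed = (math.log(s1) / (1 - xi * beta)
          + math.log(A) / (1 - beta)
          + xi * beta * math.log(1 - A) / ((1 - beta) * (1 - xi * beta)))
assert abs(J - closed) < 1e-8
```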
EXAMPLE: FISH WAR GAME III
I The payoff to the current generation, if it consumes a:

P(s_t, c)(a) = ln a + αβ J(c)(s_{t+1})

I The problem of the current generation:

max_{a ∈ [0, s]} P(s, c)(a) = ln a + (αβξ/(1 − ξβ)) ln(s − a) + αβ ln A/(1 − β) + αξβ^2 ln(1 − A)/((1 − β)(1 − ξβ))

I FOC:

1/a = αβξ / ((1 − ξβ)(s − a))

I SMPE:

c*(s) = (1 − ξβ) s / (1 − ξβ + αξβ)
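As with the bequest game, the fish war SMPE can be checked by brute force; a minimal sketch (α, β, ξ are illustrative choices, and the additive constants in the objective are dropped since they do not affect the argmax):

```python
import math

# illustrative parameters (assumptions, not from the slides)
alpha, beta, xi = 0.8, 0.9, 0.5
coef = alpha * beta * xi / (1 - xi * beta)

def best_reply(s, n=20000):
    """Grid search of max_a ln a + coef * ln(s - a) (constant terms dropped)."""
    best_a, best_val = None, -math.inf
    for j in range(1, n):
        a = s * j / n
        val = math.log(a) + coef * math.log(s - a)
        if val > best_val:
            best_a, best_val = a, val
    return best_a

s = 1.0
a_star = (1 - xi * beta) * s / (1 - xi * beta + alpha * xi * beta)  # claimed SMPE
assert abs(best_reply(s) - a_star) < 1e-3
```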
MODEL OF A STOCHASTIC BEQUEST GAME
I {1, 2, . . .} - the set of short-lived generations;
I S = [0, 1] - the space of endowments:

s_t (endowment) = a_t (consumption) + y_t (investment)

I the next generation's inheritance or capital: s_{t+1} ∼ q(·|y_t);
I u(a_t) - generation t's utility;
I v(a_{t+1}) - the satisfaction, from the point of view of generation t, of its child's consumption.
STOCHASTIC BEQUEST GAME: DYNAMICS
[Figure: generation t consumes a_t and invests y_t = s_t − a_t; the next endowment is drawn from s_{t+1} ∼ q(·|y_t).]

P(a, c)(s) = u(a) + β ∫_S v(c(s′)) q(ds′|s − a)
STRATEGIES
I I - the set of non-decreasing lower semicontinuous functions i : S 7→ S such that i(s) ∈ [0, s]; every i ∈ I is continuous from the left and has a countable number of discontinuity points;
I Define:
F := {c ∈ Φ : c(s) = s − i(s), i ∈ I};
every c ∈ F is upper semicontinuous and continuous from the left;
I D. Bernheim and D. Ray (1987), R. Sundaram (1989)
WHY THE CLASS F? (I)
I X - the vector space of real-valued functions of bounded variation on S that are continuous from the left;
I (η_m) in X converges weakly to η ∈ X if lim_{m→∞} η_m(s) = η(s) for every continuity point of η;
I Endow I ⊂ X with the topology of weak convergence;
I Consider the dual of C(S) (regular signed measures of bounded variation) and equip it with the weak-star topology;
I i ←→ µ, where µ ∈ C*(S) is such that µ(S) ≤ 1 (denote this set of measures by M);
I M is compact (the Banach-Alaoglu thm) ⇒ I - compact⇒ F - compact (c(s) = s − i(s)).
WHY THE CLASS F? (II)
The Helly Theorem
Let (η_m) be a sequence of functions of bounded variation on S such that η_m(0) ≤ constant. Then there exists a subsequence converging to some function η of bounded variation on S.
Note: If η_m =: i_m is non-decreasing, then η =: i is also non-decreasing. We may assume that η_m, η are continuous from the left.
ASSUMPTIONS
(A1) u ∈ C(S) - strictly concave, increasing; v ∈ C(S) - increasing;
(A2) q is weakly continuous on S, i.e., q(·|y_m) ⇒ q(·|y_0) if y_m → y_0:

∫_S w(s) q(ds|y_m) → ∫_S w(s) q(ds|y_0) for every w ∈ C(S);

moreover, q({0}|0) = 1;
(A3) Z_s := {y ∈ S : q({s}|y) > 0} is countable for each s;
(A4) q is stochastically increasing, i.e., if z 7→ Q(z|y) is the cdf of q(·|y), then for all y_1 < y_2 and z ∈ S

Q(z|y_1) ≥ Q(z|y_2).
EXAMPLE - (A3)
Zs := {y ∈ S : q({s}|y) > 0} is countable
I N ⊂ N;
I {f_n}_{n ∈ N} - a set of continuous increasing production functions (f_n(0) = 0);
I {α_n}_{n ∈ N} - a set of positive numbers such that Σ_{n ∈ N} α_n = 1;
I Define

q(·|y) = Σ_{n ∈ N} α_n δ_{f_n(y)}(·), y ∈ S.

I Z_s := {y ∈ S : f_n(y) = s, n ∈ N} is countable, since |Z_s| ≤ |N|.
EXAMPLE - (A3)
q(·|y) = Σ_{n ∈ N} α_n δ_{f_n(y)}(·), y ∈ S

I (A2): for w ∈ C(S) and y_m → y_0,

∫_S w(s) q(ds|y_m) = Σ_{n ∈ N} α_n w(f_n(y_m)) → Σ_{n ∈ N} α_n w(f_n(y_0)) = ∫_S w(s) q(ds|y_0),

and q({0}|0) = Σ_{n ∈ N} α_n = 1 since f_n(0) = 0;
I (A4) is satisfied: f_n ↗;
I Other examples: convex combinations of the above transition probabilities and non-atomic transitions.
MAIN RESULT
Assume (A1)-(A4). Then there exists a SMPE c* ∈ F, i.e.,

max_{a ∈ [0, s]} P(a, c*)(s) = P(c*(s), c*)(s),

where

P(a, c*)(s) = u(a) + β ∫_S v(c*(s′)) q(ds′|s − a).
PROOF - THE MAIN IDEA
I c ∈ F - the strategy used by the descendant;
I c_0(s) = s − i_0(s) - the best reply of the current generation to c;
I CLAIM: c_0 ∈ F, or equivalently i_0 ∈ I (non-decreasing and continuous from the left);
I Define the mapping L : F 7→ F by

Lc(s) := c_0(s).

I L is continuous when F is given the topology of weak convergence, i.e., IF c_n → c weakly in F, c_{0,n}(s) = Lc_n(s) and c_{0,n} → c_0 weakly in F, THEN c_0(s) = Lc(s).
I By the Schauder-Tychonoff fixed point theorem there exists c* ∈ F such that c*(s) = Lc*(s) for s ∈ S ⇒ c* is a SMPE.
PROOF
I Assumption (A4) and the Skorohod representation theorem give

lim_{m→∞} ∫_S v(c(s′)) q(ds′|y_m) = ∫_S v(c(s′)) q(ds′|y).

I Assumption (A3) helps in controlling the atoms of q.
EXTENSION TO A MODEL WITH MORE DESCENDANTS?
I Two descendants; utilities:

u(a) = 8^{1/4} √a, v_1(a) = 0.8 u(a), v_2(a) = 0.64 u(a).

I The transition law is deterministic: f(y) = √y.

y - investment of t → f(y) - endowment of t + 1 → c(f(y)) - consumption of t + 1 → f(y) − c(f(y)) - investment of t + 1 → f(f(y) − c(f(y))) - endowment of t + 2 → c(f(f(y) − c(f(y)))) - consumption of t + 2.

R(y, c)(s) (:= P(a, c)(s)) = u(s − y) + 0.8 u(c(f(y))) + 0.64 u(c(f(f(y) − c(f(y))))).
EXTENSION TO A MODEL WITH MORE DESCENDANTS? - CONT.

R(y, c)(s) = u(s − y) + 0.8 u(c(f(y))) + 0.64 u(c(f(f(y) − c(f(y))))).

I The successors employ the following strategy:

c(s) = s for s ∈ [0, 0.5], and c(s) = s/2 for s ∈ (0.5, 1].

Then

R(y, c)(1) = 8^{1/4} √(1 − y) + 0.8 (8y)^{1/4} for y ∈ [0, 0.25],
R(y, c)(1) = 8^{1/4} √(1 − y) + 0.8 (2y)^{1/4} + 0.64 y^{1/8} for y ∈ (0.25, 1].

Note that

argmax_{y ∈ A(1)} R(y, c)(1) = ∅.
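The emptiness of the argmax can be seen numerically; a sketch evaluating R(y, c)(1) from the two-piece formula on this slide:

```python
def R(y):
    """R(y, c)(1) for the two-descendant example, with c(s) = s on [0, 0.5], s/2 above."""
    base = 8 ** 0.25 * (1 - y) ** 0.5
    if y <= 0.25:
        return base + 0.8 * (8 * y) ** 0.25
    return base + 0.8 * (2 * y) ** 0.25 + 0.64 * y ** 0.125

left_max = max(R(i / 10000) for i in range(2501))   # best value on [0, 0.25]
assert R(0.2501) > left_max                          # R jumps up just after y = 0.25
assert R(0.2501) > R(0.26) > R(0.30)                 # and then decreases
# hence sup_y R(y) = lim as y -> 0.25+ of R(y) is not attained: the argmax is empty
```

The supremum is approached only as y ↓ 0.25, a point that belongs to the other branch, so no maximiser exists.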
PLOT OF R(y, c)(1)

[Figure: graph of R(y, c)(1) on [0, 1]; the function jumps up just after y = 0.25 and then decreases, so its supremum is not attained.]

The result cannot be easily extended!
POSSIBLE EXTENSION TO A MODEL WITH MORE DESCENDANTS
Assume that generation t’s utility equals
U_t(h^t) := u(a_t) + w(a_{t+1}, a_{t+2}, . . .)

h^t = (a_t, s_{t+1}, a_{t+1}, . . .) - the feasible future history from period t onwards; u, w ∈ C(S).
STOCHASTIC BEQUEST GAMES: DYNAMICS
[Figure: generation t and its successors 1, . . . , m, . . . consume a_{t+j} and invest y_{t+j} = s_{t+j} − a_{t+j}; endowments evolve as s_{t+j+1} ∼ q(·|y_{t+j}); generation t's utility is u(a_t) + w(a_{t+1}, a_{t+2}, . . . , a_{t+m}, . . .).]
A MODEL WITH INFINITELY MANY DESCENDANTS
I The expected utility of generation t:

W_t(c)(s_t) := E^c_{s_t} U_t(h^t) = u(c(s_t)) + E^c_{s_t}[w(a_{t+1}, a_{t+2}, . . .)]

I Define

J(c)(s_{t+1}) = E^c_{s_{t+1}}[w(a_{t+1}, a_{t+2}, . . .)]

I Then the expected utility of generation t:

W_t(c)(s_t) = u(c(s_t)) + ∫_S J(c)(s_{t+1}) q(ds_{t+1}|s_t − c(s_t)).
A MODEL WITH INFINITELY MANY DESCENDANTS
I Assume that the successors use c. Then

P(a, c)(s) := u(a) + ∫_S J(c)(s′) q(ds′|s − a)

I A SMPE is a function c* ∈ F such that

sup_{a ∈ [0, s]} P(a, c*)(s) = P(c*(s), c*)(s).
ASSUMPTIONS
(A1) u - strictly concave, increasing, continuous; w - continuous;
(A2) q is weakly continuous on S, i.e., q(·|y_m) ⇒ q(·|y_0) if y_m → y_0; q(·|y) is non-atomic for each y ∈ S \ {0} and q({0}|0) = 1.
SOME NOTABLE EXAMPLES
The dynamics evolve according to the equation s_{t+1} = f(y_t, z_t), where (z_t) are i.i.d. random shocks with a non-atomic probability distribution π:

q(B|y) = ∫_S 1_B(f(y, z)) π(dz);

(C1) f(y_t, z_t) = z_t + f_0(y_t), where f_0 : S 7→ S is continuous and increasing; supp(π) ⊂ [0, 1 − f_0(1)];
(C2) f(y_t, z_t) = z_t f_0(y_t); supp(π) ⊂ [0, 1/f_0(1)];
(C3) q(B|y) = Σ_{j=1}^l g_j(y) µ_j(B), where the g_j ≥ 0 are continuous and the µ_j are non-atomic.
MAIN RESULT I
Assume (A1)-(A2).
Then there exists a SMPE c∗ ∈ F .
Why does it work: an investment strategy i ∈ I (hence c ∈ F) has a countable number of discontinuity points; since q is non-atomic, these points do not count and do not spoil the continuity of the operator L (the proof is based on the Schauder-Tychonoff theorem).
TRANSITION PROBABILITIES WITH ATOMS I
I Consider the model as in the fish war game:

U_t(h^t) = u(a_t) + αβ ( Σ_{τ=1}^∞ β^{τ−1} u(a_{t+τ}) )

I The expected utility of generation t:

W(c)(s_t) = E^c_{s_t} U_t(h^t) = u(c(s_t)) + αβ E^c_{s_t} ( Σ_{τ=1}^∞ β^{τ−1} u(a_{t+τ}) )

I Put

J(c)(s_{t+1}) = E^c_{s_{t+1}} ( Σ_{τ=1}^∞ β^{τ−1} u(a_{t+τ}) )

I Define

P(a, c)(s) = u(a) + αβ ∫_S J(c)(s′) q(ds′|s − a).
TRANSITION PROBABILITIES WITH ATOMS II
I q(·|s − a) = q(·|y) = Σ_{j=1}^l g_j(y) µ_j(·), where
I the g_j are continuous and Σ_{j=1}^l g_j(y) = 1;
I there exist ν_j such that µ_j is absolutely continuous with respect to ν_j for each j;
I g_2, . . . , g_l are non-decreasing and concave;
I µ_1 is (first-order) stochastically dominated by µ_j for all j ≥ 2.
MAIN RESULT II
There exists a SMPE c* in the class of Lipschitz functions with constant one.
LITERATURE
1. A. Alj, A. Haurie (IEEE Trans Autom Control (1983)) - denumerable space of endowments
2. C. Harris, D. Laibson (Econometrica (2001)) - all descendants
3. R. Amir (Econ Theory (1996)), L. Balbus, K. Reffett, L. Wozny (J Math Econ (2012)) - a narrow class of transitions; an SMPE is shown to exist in the class of Lipschitz continuous functions
4. L. Balbus, A. Jaskiewicz, A.S. Nowak (Games Econ Behav(2015), J Optim Theory Appl (2015))
5. Purely deterministic cases:
W. Leininger (Rev Econ Studies (1986)) - "levelling" technique;
D. Bernheim, D. Ray (Report (1983)) - "filling" technique;
L. Balbus, A. Jaskiewicz, A.S. Nowak (J Math Analysis Appl (2015)) - unbounded utilities.
CREDITS
1. J. Filar, K. Vrieze, Competitive Markov Decision Processes,Springer, 1996
2. A. Jaskiewicz, A.S. Nowak, Zero-sum stochastic games,Handbook of Dynamic Games, Vol I, Birkhauser, 2017
3. A. Jaskiewicz, A.S. Nowak, Non-zero-sum stochastic games,Handbook of Dynamic Games, Vol I, Birkhauser, 2017
4. A. Neyman, S. Sorin (Eds), Stochastic games, NATO Series,2003