
Stochastic Control and Games
Thor Stead

29 April 2020


What is Stochastic Control?
● Optimizing a set of decisions in a random process evolving under uncertainty

How is it studied?
● Typically as a discrete-time, Markov or non-Markov decision process, with a reward received at each step or at the end

Applications:
● Games of chance, air traffic control, reinforcement learning


Markov Decision Process (MDP)
● An agent G occupies some state s, and can take some set of actions A = (a1, a2, ..., an) at that state s (a toy example in code follows the reference below)
    ○ Memoryless (Markov) property: P(s' | sn, an) = P(s' | sn, an, ..., s1, a1), i.e. the next state depends only on the current state and action, not on the earlier history

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley and Sons, Inc., 2005.
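As a concrete illustration, an MDP can be written down as plain transition and reward tables. A minimal sketch, assuming invented states, actions, probabilities, and rewards (none of these numbers come from the slides):

```python
# A minimal MDP sketch; all names and numbers are illustrative.
# P[(s, a)] maps each successor state s' to P(s' | s, a); the Markov
# property holds by construction, since nothing before (s, a) appears.
P = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s2": 1.0},
}
# r[(s, a)] is the reward received for taking action a in state s.
r = {("s1", "a1"): 0.0, ("s1", "a2"): 1.0,
     ("s2", "a1"): 2.0, ("s2", "a2"): 0.0}

# Sanity check: each conditional distribution sums to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in P.values())
```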


Policies and Feedback Control
● To denote a strategy in our MDP, we introduce π, our policy
    ○ The set of actions we will take at each possible state s
● Feedback control (a minimal code sketch follows the figure reference below):
    ○ Actions depend on the output of the current state
    ○ The value of a certain state depends on which states can be reached from it, and on their respective rewards

[Figure: closed-loop control diagram; labels: "Action", "State transition", and "'Value' of current state → converges"]
https://en.wikipedia.org/wiki/Control_theory#Open-loop_and_closed-loop_(feedback)_control
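A minimal sketch of feedback control in code. The three-state chain, the policy, and the 0.7 success probability below are illustrative assumptions; the point is that the action at each step is read off from the currently observed state, not from a precomputed open-loop plan:

```python
import random

# Illustrative toy: three states on a line. The policy maps each
# state to an action, and the action taken depends only on the
# state we currently observe (closed-loop / feedback control).
states = [0, 1, 2]
policy = {0: "right", 1: "right", 2: "stay"}

def step(s, a):
    """Stochastic transition: the intended move succeeds 70% of the time."""
    if a == "right" and random.random() < 0.7:
        return min(s + 1, 2)
    return s  # otherwise we stay put

s = 0
for t in range(10):
    a = policy[s]      # feedback: the action is chosen from the current state
    s = step(s, a)
print("final state:", s)
```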


Optimizing our Policy
● 𝜋 = policy = (s_t, a_t) ∀ t ∈ [0, n]

● Define the final value W(s, 𝜋) = the expected sum of all discounted rewards
    ○ This is what we want to maximize
● We then need to figure out the set of optimal actions

Distinction between W(i,j) and r(i,j):
r(i,j) refers to the reward function at a single point.
W(i,j) refers to the recursive formula
    W(i,j) = r(i,j) + ∑ [ P(i',j') · W(i',j') ]
and can thus be interpreted as a recursive sum of rewards along the path from (i,j) outwards.


The Bellman Equation
● We can solve the previous equation for A if we obtain W(i,j) for all points (vectors) within our sample space (here, we use R² for simplicity)
● This equation is known as the Bellman equation, after Richard Bellman (1957), and is central to dynamic programming
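The equation itself did not survive extraction from the slide. In the conventional form, consistent with the W(i,j) recursion on the previous slide and with the discounted-reward objective (the discount factor γ is my addition, implied by "sum of all discounted rewards"), it reads:

```latex
W(i,j) \;=\; \max_{a} \Big[\, r(i,j) \;+\; \gamma \sum_{(i',j')} P\big((i',j') \,\big|\, (i,j),\, a\big)\, W(i',j') \,\Big],
\qquad 0 \le \gamma < 1 .
```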

Bellman also formulated a solution for W, as we will see...


Solving for W

Define 𝚽 as the operator that takes the max over all actions a of the right-hand side of the Bellman equation. Bellman's value iteration algorithm repeatedly applies 𝚽 until the values converge (a hedged sketch in code follows the reference below).
[Figure: the value iteration update, applying 𝚽 to the RHS of the Bellman equation]

M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, The MIT Press, 2018.
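The algorithm's pseudocode on the slide did not survive extraction. The following is a hedged sketch of value iteration for the grid setting used later in the talk; the grid size, reward placement, discount, and the 0.7 slip model are all illustrative assumptions, not values from the slides:

```python
import numpy as np

n, gamma, p_ok = 10, 0.9, 0.7             # grid size, discount, P(intended move)
reward = np.zeros((n, n))
reward[7, 7] = 1.0                        # a single high-reward cell (invented)
moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def clamp(i, j):
    """Keep coordinates inside the grid."""
    return min(max(i, 0), n - 1), min(max(j, 0), n - 1)

W = np.zeros((n, n))
for sweep in range(500):                  # repeatedly apply the operator Phi
    W_new = np.empty_like(W)
    for i in range(n):
        for j in range(n):
            best = -np.inf
            for di, dj in moves:
                ii, jj = clamp(i + di, j + dj)
                # intended move with prob 0.7, otherwise stay in place
                q = reward[i, j] + gamma * (p_ok * W[ii, jj]
                                            + (1 - p_ok) * W[i, j])
                best = max(best, q)
            W_new[i, j] = best
    if np.max(np.abs(W_new - W)) < 1e-8:  # the contraction guarantees this
        W = W_new
        break
    W = W_new
```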


Why this works
Because the Bellman equation is a contraction mapping:
    ||𝚽(x) - 𝚽(y)|| ≤ 𝛂 ||x - y||, for some 0 ≤ 𝛂 < 1
● Banach fixed point theorem: for any contraction mapping 𝚽, the sequence x_n = 𝚽(x_{n-1}) converges, from any starting point, to the unique fixed point x* with 𝚽(x*) = x* (a small numerical demonstration follows the proof link below)
    ○ This is the heart of value iteration in a vector space

An adapted proof: https://people.eecs.berkeley.edu/~ananth/223Spr07/ee223_spr07_lec19.pdf
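A tiny numerical illustration of the theorem. The map 𝚽 below is an invented affine contraction, not the Bellman operator itself; each application shrinks the distance to the fixed point by the factor α, so the error decays geometrically:

```python
import numpy as np

alpha = 0.9
b = np.array([1.0, 2.0])
Phi = lambda x: alpha * x + b       # ||Phi(x) - Phi(y)|| = alpha * ||x - y||

x_star = b / (1 - alpha)            # the unique fixed point: Phi(x*) = x*
x = np.zeros(2)
for k in range(60):
    x = Phi(x)                      # the iteration x_n = Phi(x_{n-1})

# error after 60 steps is alpha**60 * ||x_0 - x*||, roughly 0.04 here
print(np.linalg.norm(x - x_star))
```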


Phases of this Project
1: Card Game Proposal
Initially proposed an application to the card game Bluff (BS). This turned out to be a non-Markovian process, so it requires a different framework, and there were many hurdles in coding the AI that were unrelated to the math.

2: One-player Optimization
Optimized a path for a particle moving on a grid and evolving under uncertainty.
● Reward function based on grid position (x, y)

3: Competing AIs
Had two AIs play 'tag' on a torus-shaped grid, with varying degrees of uncertainty.
● Reward function based on player-player distance


The Optimal Path
1. Define a reward function based on grid location
2. Start a particle at a random (x, y) on the grid, adding in stochasticity
    a. For example, P(goes in the intended direction) = 0.7
3. Iteratively solve the Bellman equation to get the value W of each location
4. Simulate
5. Repeat from (3) until the particle reaches (x*, y*), the location of maximum value W
A code sketch of steps 2-4 follows the figure below.

[Figure: value heatmap; yellow = high value, purple = low value]
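Continuing the value-iteration sketch from the "Solving for W" slide (this reuses n, moves, clamp, and the computed W from that block; the greedy rule and step count are illustrative assumptions), steps 2-4 might look like:

```python
import random

def simulate(W, steps=50, p_ok=0.7):
    """Follow W greedily; the intended move succeeds with prob p_ok."""
    i, j = random.randrange(n), random.randrange(n)   # random start (step 2)
    path = [(i, j)]
    for _ in range(steps):
        # greedy feedback control: head for the neighbor of highest value
        di, dj = max(moves, key=lambda m: W[clamp(i + m[0], j + m[1])])
        if random.random() < p_ok:                    # stochasticity
            i, j = clamp(i + di, j + dj)
        path.append((i, j))
    return path
```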


Playing Tag?
1. Define a reward function based on the other player's location
2. Start 2 particles at random (x, y) positions on the grid, adding in stochasticity
3. Solve the Bellman equation to get the value W of each location
4. Simulate player 1's turn
5. Solve the Bellman equation using player 1's location to get W
6. Simulate player 2's turn
7. In this example, #turns = 20 each (a sketch of this alternating loop follows the list)
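A hedged sketch of that alternating loop on the torus. The grid size, discount, sweep count, and the distance-based rewards are illustrative assumptions; the torus wrap-around is handled with modular arithmetic:

```python
import random
import numpy as np

n, gamma, p_ok = 15, 0.9, 0.7
moves = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]   # include "stay"

def solve_W(reward, sweeps=60):
    """Approximate value iteration on the torus for a fixed reward field."""
    W = np.zeros((n, n))
    for _ in range(sweeps):
        W_new = np.full((n, n), -np.inf)
        for i in range(n):
            for j in range(n):
                for di, dj in moves:
                    ii, jj = (i + di) % n, (j + dj) % n   # torus wrap-around
                    q = reward[i, j] + gamma * (p_ok * W[ii, jj]
                                                + (1 - p_ok) * W[i, j])
                    W_new[i, j] = max(W_new[i, j], q)
        W = W_new
    return W

def dist_reward(p, sign):
    """Torus distance to player p; sign=-1 rewards closing in, +1 fleeing."""
    d = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            dx = min(abs(i - p[0]), n - abs(i - p[0]))
            dy = min(abs(j - p[1]), n - abs(j - p[1]))
            d[i, j] = sign * (dx + dy)
    return d

def move(p, W):
    """Greedy feedback step; the intended move succeeds with prob p_ok."""
    di, dj = max(moves, key=lambda m: W[(p[0] + m[0]) % n, (p[1] + m[1]) % n])
    if random.random() < p_ok:
        p = ((p[0] + di) % n, (p[1] + dj) % n)
    return p

p1, p2 = (0, 0), (7, 7)
for turn in range(20):                           # 20 turns each, as on the slide
    p1 = move(p1, solve_W(dist_reward(p2, -1)))  # chaser: minimize distance
    p2 = move(p2, solve_W(dist_reward(p1, +1)))  # runner: maximize distance
```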


Going Forward
● Computing and solving the Bellman equation for an MDP is a fundamental tool of reinforcement learning and of optimality under randomness
● We want to extend the principles here to include and solve:
    ○ other Markovian games
    ○ non-Markovian games
    ○ games with varying win conditions, where there are many ways to attain the maximum reward


Thank you!
Credit to Patrick Flynn for helping to educate, develop, and revise versions of this work throughout the semester.