Stochastic Control and Games
Thor Stead
29 April 2020
What is Stochastic Control?
● Optimizing a set of decisions in a random process that evolves under uncertainty
How is it studied?
● Typically as a discrete-time, Markov or non-Markov decision process, assuming a reward at each step or at the end
Applications:
● Games of chance, air traffic control, reinforcement learning
Markov Decision Process (MDP)
● Define an agent G that occupies some state s and can take some set of actions A = (a1, a2, ..., an) at that state s.
○ Markov property: P(s’ | sn, an) = P(s’ | sn, an, …, s1, a1), i.e., the process is memoryless
M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., 2005.
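The MDP definition above can be sketched in code. This is a minimal, hypothetical toy example (the states, actions, and probabilities are illustrative, not from the slides); the key point is that sampling the next state needs only the current (state, action) pair, never the earlier history.

```python
import random

# Hypothetical toy MDP: (state, action) -> list of (next_state, probability),
# encoding the transition distribution P(s' | s, a).
mdp = {
    ("s0", "a1"): [("s0", 0.3), ("s1", 0.7)],
    ("s0", "a2"): [("s0", 1.0)],
    ("s1", "a1"): [("s0", 0.5), ("s1", 0.5)],
    ("s1", "a2"): [("s1", 1.0)],
}

def step(state, action, rng=random):
    """Sample s' from P(s' | s, a); no history is needed (memoryless)."""
    next_states, probs = zip(*mdp[(state, action)])
    return rng.choices(next_states, weights=probs)[0]
```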
Policies and Feedback Control
● To denote a strategy in our MDP, we introduce π, our policy
○ The set of actions we will take at each possible state s
● Feedback control:
○ Actions depend on the output of the current state
○ The value of a given state depends on which states can be reached from it and on their respective rewards
[Figure: feedback-control loop showing action, state transition, and the converging “value” of the current state]
https://en.wikipedia.org/wiki/Control_theory#Open-loop_and_closed-loop_(feedback)_control
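The feedback-control idea can be sketched as a closed-loop simulation: at every step the controller reads the current state and looks up its action in the policy. The function names and arguments here are hypothetical, chosen only to illustrate the loop.

```python
import random

# Closed-loop (feedback) rollout sketch: the action at each step is a
# function only of the observed current state, via the policy lookup.
def rollout(start, policy, transition, reward, steps, rng=random):
    """Simulate `steps` steps; return the final state and total reward."""
    state, total = start, 0.0
    for _ in range(steps):
        action = policy[state]              # feedback: depends on current state
        total += reward(state, action)      # collect the one-step reward
        state = transition(state, action, rng)  # environment transition
    return state, total
```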
Optimizing our Policy
● π = policy = (s_t, a_t) ∀ t ∈ [0, n]
● Define the final value W(s, π) = the expected sum of all discounted rewards
○ This is what we want to maximize
● We need to find the set of optimal actions
Distinction between W(i,j) and r(i,j):
r(i,j) refers to the reward function at a single point.
W(i,j) refers to the recursive formula
W(i,j) = r(i,j) + γ ∑ P(i’,j’) · W(i’,j’)
and can thus be interpreted as a recursive sum of discounted rewards along the path from (i,j) outward.
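The distinction between the one-step reward r and the recursive value W can be checked numerically in the simplest possible case: a single state that loops to itself with probability 1. There the recursion W = r + γ·W has the closed form W = r / (1 − γ). The numbers below are illustrative.

```python
# Repeated Bellman backups for a single self-looping state: each sweep
# replaces W with (reward now) + (discounted future value).
def iterate_W(r, gamma, sweeps=200):
    W = 0.0
    for _ in range(sweeps):
        W = r + gamma * W  # one Bellman backup
    return W
```

With r = 1 and γ = 0.9 this converges to 1 / (1 − 0.9) = 10: the value W accumulates the whole discounted reward stream, while r alone stays at 1.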
The Bellman Equation
● We can solve the previous equation for the optimal action set A if we obtain W(i,j) for all points (vectors) within our sample space (here, we use ℝ² for simplicity)
● This equation is known as the Bellman equation, after Richard Bellman (1957), and is central to dynamic programming
Bellman also formulated a solution for W, as we will see...
Solving for W
Bellman’s value iteration algorithm: repeatedly apply 𝚽, the max over all actions a of the right-hand side of the Bellman equation.
[Figure: the RHS of the Bellman equation]
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of Machine Learning, The MIT Press, 2018.
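Value iteration can be sketched directly: apply the operator 𝚽 (reward plus the discounted, max-over-actions expected value of the successor states) until W stops changing. The two-state MDP below is hypothetical, chosen so the fixed point is easy to check by hand.

```python
# Hypothetical two-state MDP: reward 1 in s1, 0 in s0; actions either
# stay in place or move to the other state, deterministically.
GAMMA = 0.9
R = {"s0": 0.0, "s1": 1.0}
P = {  # (state, action) -> {next_state: probability}
    ("s0", "go"): {"s1": 1.0},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"): {"s0": 1.0},
    ("s1", "stay"): {"s1": 1.0},
}

def phi(W):
    """One application of the Bellman operator: W <- r + gamma * max_a E[W]."""
    return {
        s: R[s] + GAMMA * max(
            sum(p * W[s2] for s2, p in P[(s, a)].items())
            for a in ("go", "stay"))
        for s in R
    }

def value_iteration(tol=1e-9):
    """Iterate phi from W = 0 until successive sweeps agree within tol."""
    W = {s: 0.0 for s in R}
    while True:
        W_new = phi(W)
        if max(abs(W_new[s] - W[s]) for s in R) < tol:
            return W_new
        W = W_new
```

Here the optimal policy is to reach s1 and stay, so W(s1) = 1/(1 − 0.9) = 10 and W(s0) = 0.9 · 10 = 9.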
Why this works: the Bellman operator is a contraction mapping:
||𝚽(x) − 𝚽(y)|| ≤ 𝛂||x − y||, for some 0 ≤ 𝛂 < 1
● Banach fixed-point theorem: for any contraction mapping, the sequence x_n = 𝚽(x_{n-1}) converges to the unique fixed point x* satisfying 𝚽(x*) = x*, from any starting point x
○ This is the heart of value iteration in a vector space
An adapted proof.
https://people.eecs.berkeley.edu/~ananth/223Spr07/ee223_spr07_lec19.pdf
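The Banach fixed-point theorem is easy to demonstrate numerically with a one-dimensional stand-in for 𝚽. The map below (𝚽(x) = 0.5x + 1, contraction constant α = 0.5, fixed point x* = 2) is an illustrative example, not the Bellman operator itself; the point is that iteration converges to the same fixed point from any start.

```python
# Iterate x_n = phi(x_{n-1}); for a contraction the error shrinks by a
# factor alpha each step, so this converges geometrically to x*.
def fixed_point(phi, x0, iters=100):
    x = x0
    for _ in range(iters):
        x = phi(x)
    return x
```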
Phases of this Project
1: Card Game Proposal
● Initially proposed an application to the card game of Bluff (BS)
● This is in fact a non-Markovian process, so it requires a different framework, and coding the AI posed many hurdles unrelated to the math
2: One-player Optimization
● Optimized a path for a particle moving on a grid and evolving under uncertainty
● Reward function based on grid position (x, y)
3: Competing AIs
● Had two AIs play ‘tag’ on a torus-shaped grid, with varying degrees of uncertainty
● Reward function based on player-to-player distance
The Optimal Path
1. Define a reward function based on grid location
2. Start a particle at a random (x, y) on the grid, adding in stochasticity
a. For example, P(moves in the intended direction) = 0.7
3. Iteratively solve the Bellman equation to get the value W of each location
4. Simulate
5. Repeat from (3) until (x, y) reaches the maximum-value location (x*, y*)
[Figure: value heat map. Yellow = high value, purple = low value]
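The steps above can be sketched with value iteration on a small grid. The grid size, goal cell, and discount below are illustrative stand-ins for the actual experiment; the slip model follows the slide (intended direction with probability 0.7, the remainder split among the other moves).

```python
# Gridworld value-iteration sketch: reward depends on grid location, and
# each move succeeds with probability 0.7 (otherwise the particle slips).
N = 5                      # hypothetical N x N grid
GOAL = (4, 4)              # hypothetical high-reward cell
GAMMA = 0.9
P_INTENDED = 0.7
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def clip(v):
    return max(0, min(N - 1, v))  # stay on the grid

def next_cell(cell, move):
    return (clip(cell[0] + move[0]), clip(cell[1] + move[1]))

def reward(cell):
    return 1.0 if cell == GOAL else 0.0

def outcomes(cell, move):
    """Transition distribution: intended move w.p. 0.7, else a slip."""
    slip = (1.0 - P_INTENDED) / (len(MOVES) - 1)
    return [(next_cell(cell, m), P_INTENDED if m == move else slip)
            for m in MOVES]

def value_iteration(sweeps=500):
    """Repeated Bellman backups over every grid cell."""
    W = {(x, y): 0.0 for x in range(N) for y in range(N)}
    for _ in range(sweeps):
        W = {c: reward(c) + GAMMA * max(
                 sum(p * W[c2] for c2, p in outcomes(c, m)) for m in MOVES)
             for c in W}
    return W
```

The resulting W is highest at the goal and decays with distance from it, matching the yellow-to-purple heat map on the slide; following the gradient of W greedily yields the optimal path.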
Playing Tag?
1. Define a reward function based on the other player’s location
2. Start 2 particles at random (x, y) positions on the grid, adding in stochasticity
3. Solve the Bellman equation to get the value W of each location
4. Simulate player 1’s turn
5. Solve the Bellman equation using player 1’s location to get W
6. Simulate player 2’s turn
7. In this example, #turns = 20 each
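The distinctive pieces of the tag setup, the torus-shaped grid and a distance-based reward, can be sketched as below. The grid size is illustrative, and instead of the full alternating Bellman solves of the slides, this sketch uses a simplified greedy pursuer that just minimizes torus distance each turn.

```python
# Torus-grid sketch for 'tag': coordinates wrap modulo N, and the
# pursuer's objective is to shrink its distance to the other player.
N = 7  # hypothetical grid size
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]

def wrap(cell, move):
    """Torus wrap-around: both coordinates are taken modulo N."""
    return ((cell[0] + move[0]) % N, (cell[1] + move[1]) % N)

def torus_distance(a, b):
    """Shortest Manhattan distance on the torus along each axis."""
    dx = min(abs(a[0] - b[0]), N - abs(a[0] - b[0]))
    dy = min(abs(a[1] - b[1]), N - abs(a[1] - b[1]))
    return dx + dy

def greedy_pursuer_move(p, q):
    """Simplified stand-in for a Bellman solve: pick the move that
    minimizes the pursuer's distance to the evader q."""
    return min(MOVES, key=lambda m: torus_distance(wrap(p, m), q))
```

In the full experiment, steps 3 and 5 of the slide would replace `greedy_pursuer_move` with a Bellman solve against the opponent's current location, re-run before each player's turn.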
Going Forward
● Computing and solving the Bellman equation for an MDP is a fundamental tool of reinforcement learning and of optimality under randomness
● We want to extend the principles here to include and solve:
○ Other Markovian games
○ Non-Markovian games
○ Games with varying win conditions, i.e., many ways to attain the maximum reward
Thank you!
Credit to Patrick Flynn for helping to educate, develop, and revise versions of this work throughout the semester.