Page 1:

Presented by Thomas Asikis, [email protected]

Agent-Based Modeling and Social System Simulation, Fall Semester 2019

1

Reinforcement Learning for ABM

Page 2:

§ An agent learns to become a billionaire…
§ How to design agents that learn to optimize!
§ Relevant code is found in:

https://github.com/asikist-ethz/reinforcement_learning

2

In Today’s Course

Page 3:

3

MDP Example: Investing Agent

[Figure: MDP transition diagram for the investing agent. Labels include the states Low, Medium, High, Unlimited and a target state, the actions (decisions) Save and Invest, and a transition probability and reward on each edge, e.g. 𝑝 = 0.3, 𝑟 = −1; 𝑝 = 0.001, 𝑟 = 100; 𝑝 = 0.999, 𝑟 = 0.]

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. Ch. 3 Finite Markov Decision Processes, p. 52.

Page 4:

Reinforcement Learning Agents
§ Main elements: an agent and an environment.
§ ”A goal directed agent in an uncertain environment”.
§ Learn a behavior that achieves a goal by interacting with the environment:
⤷ Behavior: choice of actions.
§ Maximization of a reward (or minimization of a cost).
§ Multi-agent simulations can be done via Independent Reinforcement Learning, and then extended by using communication.

4

Intro

Page 5:

Machine Learning

§ Supervised: learning from a labeled set.
⤷ External supervisor.
⤷ Certainty: for a given input, always the same correct output.
§ Unsupervised: finding structure/patterns in data.
⤷ No goal; descriptive.
§ Reinforcement: goal-driven agents that learn to optimize.
⤷ Exploration vs. exploitation.
⤷ Uncertainty: the same input does not always result in the same output.
⤷ Sequential: the current model output affects future received inputs.
⤷ Delayed feedback: the evaluation of the current output may be included in a later sample.

5

Reinforcement Learning and Machine Learning

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. Ch. 1 Introduction, p. 2.

Page 6:


in the near future. If decision making and model learning are both computation-intensive processes, then the available computational resources may need to be divided between them. To begin exploring these issues, in this section we present Dyna-Q, a simple architecture integrating the major functions needed in an online planning agent. Each function appears in Dyna-Q in a simple, almost trivial, form. In subsequent sections we elaborate some of the alternate ways of achieving each function and the trade-offs between them. For now, we seek merely to illustrate the ideas and stimulate your intuition.

Within a planning agent, there are at least two roles for real experience: it can be used to improve the model (to make it more accurately match the real environment) and it can be used to directly improve the value function and policy using the kinds of reinforcement learning methods we have discussed in previous chapters. The former we call model-learning, and the latter we call direct reinforcement learning (direct RL). The possible relationships between experience, model, values, and policy are summarized in the diagram to the right. Each arrow shows a relationship of influence and presumed improvement. Note how experience can improve value functions and policies either directly or indirectly via the model. It is the latter, which is sometimes called indirect reinforcement learning, that is involved in planning.

[Diagram: the Dyna loop relating experience, model, and value/policy, with arrows labeled acting, direct RL, model learning, and planning.]

Both direct and indirect methods have advantages and disadvantages. Indirect methods often make fuller use of a limited amount of experience and thus achieve a better policy with fewer environmental interactions. On the other hand, direct methods are much simpler and are not affected by biases in the design of the model. Some have argued that indirect methods are always superior to direct ones, while others have argued that direct methods are responsible for most human and animal learning. Related debates in psychology and artificial intelligence concern the relative importance of cognition as opposed to trial-and-error learning, and of deliberative planning as opposed to reactive decision making (see Chapter 14 for discussion of some of these issues from the perspective of psychology). Our view is that the contrast between the alternatives in all these debates has been exaggerated, that more insight can be gained by recognizing the similarities between these two sides than by opposing them. For example, in this book we have emphasized the deep similarities between dynamic programming and temporal-difference methods, even though one was designed for planning and the other for model-free learning.

Dyna-Q includes all of the processes shown in the diagram above (planning, acting, model-learning, and direct RL), all occurring continually. The planning method is the random-sample one-step tabular Q-planning method on page 161. The direct RL method is one-step tabular Q-learning. The model-learning method is also table-based and assumes the environment is deterministic. After each transition 𝑆_𝑡, 𝐴_𝑡 → 𝑅_{𝑡+1}, 𝑆_{𝑡+1}, the model records in its table entry for 𝑆_𝑡, 𝐴_𝑡 the prediction that 𝑅_{𝑡+1}, 𝑆_{𝑡+1} will deterministically follow. Thus, if the model is queried with a state-action pair that has been experienced before, it simply returns the last-observed next state and next reward as its prediction.

6

Reinforcement Learning and Planning

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. Ch. 8 Planning and Learning with Tabular Methods, pp. 162-163.

Page 7:

7

Markov Decision Process

Agent: the learner and decision maker.
Environment: everything outside the agent; the agent interacts with it.
Time 𝑡: discrete timesteps (discrete time).
State 𝑠: the agent's perception of the environment.
Action 𝑎: an action the agent takes based on observing 𝑠.
Reward 𝑟: the consequence of an action.

[Figure: agent-environment interaction loop. At time 𝑡 the agent observes state 𝑠_𝑡 and reward 𝑟_𝑡 and selects action 𝑎_𝑡; the environment returns 𝑠_{𝑡+1} and 𝑟_{𝑡+1}, and time advances, 𝑡 ← 𝑡 + 1. The state sequence 𝑠_1, 𝑠_2, …, 𝑠_𝑇 ends in a terminal or goal state. A minimal code sketch of this loop follows below.]

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. Ch. 3 Finite Markov Decision Processes, p. 47.
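To make the loop concrete, here is a minimal Python sketch of one episode of agent-environment interaction. The InvestEnv class, its reset/step interface, and the random placeholder policy are illustrative assumptions, not the course's implementation.

import random

class InvestEnv:
    """Toy, assumed environment loosely inspired by the investing MDP above."""
    def reset(self):
        self.t = 0
        return "Low"                                  # initial state s_0

    def step(self, state, action):
        # Assumed transitions: investing may pay off, saving keeps the state.
        self.t += 1
        if action == "Invest":
            next_state = random.choice(["Low", "Medium", "High"])
            reward = 1 if next_state != "Low" else -1
        else:
            next_state = state
            reward = 0
        done = self.t >= 10                           # episode ends after T = 10 steps
        return next_state, reward, done

env = InvestEnv()
state = env.reset()
done = False
while not done:                                       # s_t, r_t -> a_t -> s_{t+1}, r_{t+1}
    action = random.choice(["Save", "Invest"])        # placeholder policy pi(a|s)
    state, reward, done = env.step(state, action)
    print(state, reward)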

Page 8:

8

MDP Example: Investing Agent

[Figure: the same MDP transition diagram for the investing agent as on Page 3, repeated for reference: states, the actions Save and Invest, and transition probability/reward labels on each edge.]

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. Ch. 3 Finite Markov Decision Processes, p. 52.

Page 9:

𝒔        𝒂        𝒔′       𝒓(𝒔, 𝒂, 𝒔′)
Low      Invest   Low      0
Low      Save     Medium   1
Medium   Save     Medium   0
Medium   Invest   High     0
High     Invest   High     0
High     Invest   Low      0

9

MDP Example ~ Accumulating Experience
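A minimal sketch of how such experience could be accumulated in code; the tuple layout (s, a, s′, r) mirrors the table above, while everything else is an assumption for illustration.

# One experience = (state, action, next_state, reward), as in the table above.
experiences = [
    ("Low",    "Invest", "Low",    0),
    ("Low",    "Save",   "Medium", 1),
    ("Medium", "Save",   "Medium", 0),
    ("Medium", "Invest", "High",   0),
    ("High",   "Invest", "High",   0),
    ("High",   "Invest", "Low",    0),
]

# Example use: empirical average reward per (state, action) pair.
from collections import defaultdict
totals, counts = defaultdict(float), defaultdict(int)
for s, a, s_next, r in experiences:
    totals[(s, a)] += r
    counts[(s, a)] += 1
for key in totals:
    print(key, totals[key] / counts[key])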

Page 10:

§ Optimization objective: min 𝑂(𝑎, 𝑠) or max 𝑂(𝑎, 𝑠).
§ Time 𝑡: usually discrete, with (non-)uniform time intervals.
§ State 𝑠_𝑡 ∈ 𝕊 ⊂ ℝ^{𝑛×𝑚×⋯}
§ Action 𝑎_𝑡 ∈ 𝔸 ⊂ ℝ^{𝑛×𝑚×⋯}
§ Reward: 𝑟_𝑡 = ℎ(𝑂(𝑎, 𝑠)), usually a noisy estimate of 𝑂(𝑎, 𝑠).
§ Environment → state transition: 𝑠_{𝑡+1} = 𝑔(𝑠_𝑡, 𝑎_𝑡)

10

Usual Setting

Page 11:

§ Usually used to model the environment; the true transition function is unknown.
§ Deterministic, or
§ Stochastic:
⤷ The same action-state pair may lead to different next states. This is often the case for a single agent in a single-agent environment.
§ Affected by:
⤷ the current action and state,
⤷ and past actions and states.

11

State transition 𝑔(𝑠_𝑡, 𝑎_𝑡)

Page 12:

The strategy to select an action: 𝜋(𝑎_𝑡 | 𝑠_𝑡), with 𝜋: 𝕊 → 𝔸.

§ Usually deterministic; probabilistic mostly to explore new actions.

12

Policy

𝒔        𝒂
Low      Save
Medium   Save
Medium   Invest
High     Buy
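A minimal sketch of how a policy like the one tabulated above could be represented and queried in code; the dictionary layout and the epsilon_greedy helper for the exploration mentioned on this slide are illustrative assumptions.

import random

# A deterministic policy: one action per state (assumed values).
policy = {"Low": "Save", "Medium": "Invest", "High": "Invest"}

def epsilon_greedy(state, actions, epsilon=0.1):
    """With probability epsilon explore a random action, otherwise exploit the policy."""
    if random.random() < epsilon:
        return random.choice(actions)
    return policy[state]

print(epsilon_greedy("Medium", ["Save", "Invest"]))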

Page 13:

§ Skinner, behavioral theory:
⤷ Positive is encouragement.
⤷ Negative is punishment.
§ Reward ≠ objective:
⤷ e.g. the difference of objective values in consecutive timesteps: 𝑟_𝑡 = 𝑂_𝑡 − 𝑂_{𝑡−1}

13

Reward

* Skinner, Burrhus F. "Operant behavior." American Psychologist 18.8 (1963): 503.

Page 14:

§ Reward in the long term:
⤷ greedy selection of the current maximum reward, or …
⤷ non-optimal selection now, long-term optimization later.
§ Episode: an interval of timesteps until a goal or failure (𝑡 = 𝑇).
§ Usually a discount factor is used (discounted reward):

𝑅_𝑡 = Σ_{𝑘=0}^{∞} 𝛾^𝑘 𝑟_{𝑡+𝑘},   0 < 𝛾 < 1

§ The return is the cumulative reward at the end of an episode*:

𝐺_𝑡 = Σ_{𝑘=0}^{𝑇−𝑡} 𝛾^𝑘 𝑟_{𝑡+𝑘}

14

Cumulative Reward & Return
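A minimal Python sketch of the discounted return defined above; the reward values and γ = 0.9 are assumptions for illustration.

def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * r_{t+k}, computed backwards for efficiency."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards collected over one episode (assumed values).
episode_rewards = [-1, 0, 1, 1, 100]
print(discounted_return(episode_rewards))      # return from t = 0
print(discounted_return(episode_rewards[2:]))  # return from t = 2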

Page 15:

§ Estimate how good it is for an agent to be in a state
➥ in terms of acquiring high future rewards.
§ True state value function 𝑣(𝑠).
§ Each policy has its own true state value function:
𝑣_𝜋(𝑠) = 𝔼_𝜋(𝐺_𝑡 | 𝑆_𝑡 = 𝑠)
§ Usually 𝑣_𝜋(𝑠) is unknown, so we need to approximate it:
𝑉_𝜋(𝑠) → 𝑣_𝜋(𝑠)
§ For a fixed policy, the Bellman equation can be derived:

𝑣_𝜋(𝑠) = Σ_𝑎 𝜋(𝑎|𝑠) Σ_{𝑠′,𝑟} 𝑝(𝑠′, 𝑟 | 𝑠, 𝑎) [𝑟 + 𝛾 𝑣_𝜋(𝑠′)],   ∀𝑠 ∈ 𝕊

15

State Value Function
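For completeness, a sketch of the one-step derivation behind the Bellman equation on this slide, written in LaTeX; it follows Sutton & Barto's convention that 𝑟_{𝑡+1} and 𝑆_{𝑡+1} follow the action taken in state 𝑆_𝑡 = 𝑠.

\begin{align*}
G_t &= r_{t+1} + \gamma G_{t+1} \\
v_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
         = \mathbb{E}_\pi\left[ r_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right] \\
        &= \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_\pi(s') \bigr]
\end{align*}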

Page 16:

§ Estimate how good it is for an agent to take an action in a state
➥ in terms of acquiring high future rewards.
§ True action-state value function 𝑞(𝑠, 𝑎).
§ Each policy has its own true action-state value function:
𝑞_𝜋(𝑎|𝑠) = 𝔼_𝜋(𝐺_𝑡 | 𝑆_𝑡 = 𝑠, 𝐴_𝑡 = 𝑎)
§ Usually 𝑞_𝜋(𝑎|𝑠) is unknown, so we need to approximate it:
𝑄_𝜋(𝑎|𝑠) → 𝑞_𝜋(𝑎|𝑠)
§ For a fixed policy, the Bellman equation can be derived:

𝑞_𝜋(𝑎|𝑠) = Σ_{𝑠′,𝑟} 𝑝(𝑠′, 𝑟 | 𝑠, 𝑎) [𝑟 + 𝛾 Σ_{𝑎′} 𝜋(𝑎′|𝑠′) 𝑞_𝜋(𝑎′|𝑠′)]

16

Action-State Value Function

Page 17:

§ Estimating a true value function given a policy (Policy Evaluation).

§ Several parts of the Bellman equation can be estimated (if unknown) to predict the actual value function.

§ Different methods, e.g. Monte-Carlo, Dynamic Programming, Temporal Differences have different approaches to prediction.

§ There are methods without value estimation for policy optimization: genetic algorithms, simulated annealing.

17

The Prediction Problem: Estimating values

Page 18:

Optimal policy: the policy that maximizes the return by selecting optimal actions:
𝜋^∗(𝑎_𝑡 | 𝑠_𝑡, 𝑜_𝑡)
The optimal policy has the optimal value functions:
𝑣^∗, 𝑞^∗
To optimize the policy, many iterations are repeated over all states, updating the policy until it is stable, i.e. for every state it yields the same action in consecutive iterations.

18

The Control Problem: Finding an Optimal Policy

Page 19:

Solving both problems simultaneously:

19

Generalized Policy Iteration

[Figure: generalized policy iteration. Evaluation drives 𝑉 → 𝑣_𝜋 while improvement drives 𝜋 → greedy(𝑉); alternating the two, the sequence 𝑣, 𝜋, … converges to the optimal 𝑣^∗, 𝜋^∗.]

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. Ch. 4 Dynamic Programming, p. 86.

Page 20:

§ The dynamics of the system are known, e.g. the transition probabilities and rewards in an MDP.

§ We can then derive a policy iteration algorithm that solves the prediction and control problems efficiently.

§ Polynomial time in the number of states and actions, less than direct search, which takes exponential time.

§ Asynchronous dynamic programming can be used to avoid systematic sweeps of the whole state space. Still, all states need to be visited at least once.

§ Next slides contain pseudocode relevant to the prediction and control problem in DP!

20

Dynamic Programming

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. Ch. 4 Dynamic Programming, p. 80.

Page 21:

Policy Evaluation Algorithm
§ Input: 𝜋(𝑠), the policy to evaluate
§ Parameter: 𝜃 > 0, a small threshold for the accuracy of estimation
§ Variable: 𝑉(𝑠) ∀ 𝑠 ∈ 𝑆, a dictionary/map such that:
𝑉(𝑠) ~ 𝒰 ∀ 𝑠 ∈ 𝑆^+ and 𝑉(𝑠_terminal) = 0

Do:
    Δ ← 0
    For each 𝑠 ∈ 𝑆:
        𝓋 ← 𝑉(𝑠)
        𝑉(𝑠) ← Σ_𝑎 𝜋(𝑎|𝑠) Σ_{𝑠′,𝑟} 𝑝(𝑠′, 𝑟 | 𝑠, 𝑎) [𝑟 + 𝛾 𝑉(𝑠′)]
        Δ ← max(Δ, |𝓋 − 𝑉(𝑠)|)
While Δ ≥ 𝜃

Return 𝑉(𝑠)

21

Policy Evaluation

Complementary notation:
𝒰: a uniform distribution
𝑠_terminal: the terminal state
𝑆^+: the set of all states except the terminal state

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. Ch. 4 Dynamic Programming, p. 75.
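A minimal Python sketch of the policy evaluation loop above. It assumes the MDP dynamics are given as a dictionary p[(s, a)] → list of (probability, next_state, reward) triples and that policies map each state to an {action: probability} dictionary; both layouts are assumptions for illustration, not the course's code.

def policy_evaluation(states, policy, p, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: sweep all states until values change by less than theta.

    policy[s]  -> dict {action: probability}
    p[(s, a)]  -> list of (probability, next_state, reward); terminal states have no entries
    """
    # Initialized to 0 for simplicity (the slide initializes non-terminal states randomly).
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            V[s] = sum(
                pi_a * prob * (r + gamma * V[s_next])
                for a, pi_a in policy[s].items()
                for prob, s_next, r in p.get((s, a), [])
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V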

Page 22:

22

Policy Improvement

Policy Improvement Algorithm
§ Input: 𝑉(𝑠), a value function for states
§ Variable: 𝜋(𝑠), a randomly generated policy

policy-stable ← true

For each 𝑠 ∈ 𝑆:
    𝑎_old ← 𝜋(𝑠)
    𝜋(𝑠) ← argmax_𝑎 Σ_{𝑠′,𝑟} 𝑝(𝑠′, 𝑟 | 𝑠, 𝑎) [𝑟 + 𝛾 𝑉(𝑠′)]
    If 𝑎_old ≠ 𝜋(𝑠): policy-stable ← false

Return 𝝅(𝒔), policy-stable

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. Ch. 4 Dynamic Programming, p. 80.
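A matching Python sketch of the greedy improvement sweep above, using the same assumed data layouts as the policy evaluation sketch.

def policy_improvement(states, actions, old_policy, V, p, gamma=0.9):
    """One greedy improvement sweep: make pi greedy with respect to V.

    Policies are dicts {state: {action: probability}}, matching policy_evaluation above.
    """
    new_policy, policy_stable = {}, True
    for s in states:
        # Expected one-step value of each action under the known dynamics p.
        q = {a: sum(prob * (r + gamma * V[s_next])
                    for prob, s_next, r in p.get((s, a), []))
             for a in actions}
        best = max(q, key=q.get)
        old_best = max(old_policy[s], key=old_policy[s].get)
        if best != old_best:
            policy_stable = False
        new_policy[s] = {best: 1.0}          # deterministic greedy choice
    return new_policy, policy_stable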

Page 23:

23

Policy Iteration

Policy Iteration Algorithm
§ Variables: 𝑉(𝑠), any value function for states
  𝜋(𝑠), any policy to decide actions

policy-stable ← true

Do:
    𝑉(𝑠) ← Policy Evaluation Algorithm(𝜋(𝑠))
    𝝅(𝒔), policy-stable ← Policy Improvement Algorithm(𝑉(𝑠))
While not policy-stable

Return 𝝅(𝒔), 𝑽(𝒔)

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. Ch. 4 Dynamic Programming, p. 80.
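Combining the two previous sketches, a minimal policy iteration loop under the same assumptions (states, actions, and the p[(s, a)] dynamics dictionary).

def policy_iteration(states, actions, p, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    # Start from an arbitrary deterministic policy: always take the first action.
    policy = {s: {actions[0]: 1.0} for s in states}
    while True:
        V = policy_evaluation(states, policy, p, gamma)
        policy, stable = policy_improvement(states, actions, policy, V, p, gamma)
        if stable:
            return policy, V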

Page 24:

24

Value Iteration

Value Iteration Algorithm
§ Parameter: 𝜃 > 0, a small threshold for the accuracy of estimation
§ Variables: 𝑉(𝑠), any value function for states
  𝜋(𝑠), any policy to decide actions

Do:
    Δ ← 0
    For each 𝑠 ∈ 𝑆:
        𝓋 ← 𝑉(𝑠)
        𝑉(𝑠) ← max_𝑎 Σ_{𝑠′,𝑟} 𝑝(𝑠′, 𝑟 | 𝑠, 𝑎) [𝑟 + 𝛾 𝑉(𝑠′)]
        Δ ← max(Δ, |𝓋 − 𝑉(𝑠)|)
While Δ ≥ 𝜃

For each 𝑠 ∈ 𝑆:
    𝜋(𝑠) ← argmax_𝑎 Σ_{𝑠′,𝑟} 𝑝(𝑠′, 𝑟 | 𝑠, 𝑎) [𝑟 + 𝛾 𝑉(𝑠′)]

Return 𝑉(𝑠), 𝜋(𝑠)

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. Ch. 4 Dynamic Programming, p. 83.
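A minimal Python sketch of value iteration, with the same assumed p[(s, a)] data layout as the earlier sketches.

def value_iteration(states, actions, p, gamma=0.9, theta=1e-6):
    """Sweep states with the Bellman optimality backup, then extract a greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            q = {a: sum(prob * (r + gamma * V[s_next])
                        for prob, s_next, r in p.get((s, a), []))
                 for a in actions}
            V[s] = max(q.values())
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    # Greedy policy extraction from the converged values.
    policy = {}
    for s in states:
        q = {a: sum(prob * (r + gamma * V[s_next])
                    for prob, s_next, r in p.get((s, a), []))
             for a in actions}
        policy[s] = max(q, key=q.get)
    return V, policy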

Page 25:

§ The dynamics of the system are unknown.
§ We can sample action-state pairs and estimate their returns over entire episodes.
§ Monte Carlo methods estimate values asymptotically.
§ Assumptions:
⤷ exploring starts (starting from all possible states),
⤷ infinite samples of episodes.
§ The assumptions can be lifted via:
⤷ 𝜖-soft policies (always a non-zero probability of taking an action, exploring via “random” actions),
⤷ importance sampling (sampling episodes that contain more information about value estimation and policy optimization).

§ Next slides contain pseudocode relevant to the prediction and control problem in MC!

25

Monte Carlo Control

Page 26:

Play Episode Algorithm
§ Inputs: 𝜋(𝑠), the policy to evaluate
  environment

experiences ← list
𝑠 ← environment.initial_state()
While 𝒔 ≠ 𝒔_terminal:
    𝑎 ← 𝜋(𝑠)
    𝑠′, 𝑟 ← environment.apply(𝑎)
    experiences.add(𝑠, 𝑎, 𝑠′, 𝑟)
    𝑠 ← 𝑠′

Return experiences

26

Sample experiences
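A minimal Python version of the episode-sampling routine above; the environment interface (initial_state / apply) follows the pseudocode and is otherwise an assumption.

def play_episode(policy, environment, terminal_state):
    """Roll out one episode and collect (s, a, s', r) experience tuples."""
    experiences = []
    s = environment.initial_state()
    while s != terminal_state:
        a = policy(s)                        # policy maps a state to an action
        s_next, r = environment.apply(a)
        experiences.append((s, a, s_next, r))
        s = s_next
    return experiences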

Page 27:

27

On-Policy First Visit Monte Carlo

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press. Ch. 5 Monte Carlo Methods, p. 101.

On-Policy First Visit Monte Carlo Algorithm
§ Inputs: 𝜖-soft 𝜋(𝑠), environment
§ Parameter: 𝛾
§ Variables: 𝑉(𝑠), total_episodes,
  𝑄(𝑠, 𝑎), the expected return after taking an action 𝑎 in state 𝑠,
  state_returns: {(𝑠, 𝑎): list}, a structure holding all observed returns for each pair (𝑠, 𝑎)

For episode in total_episodes:
    experiences ← Play Episode Algorithm(𝝅(𝒔), environment)
    𝐺 ← 0, 𝑇 ← experiences.length − 1
    For 𝑡 in [𝑇, 𝑇 − 1, …, 0]:
        𝒔_𝒕, 𝒂_𝒕, 𝒔_{𝒕+𝟏}, 𝒓_𝒕 ← experiences.get(𝑡)
        𝐺 ← 𝛾𝐺 + 𝑟_𝑡
        If (𝑠_𝑡, 𝑎_𝑡) not in experiences.get(𝑡 − 1, 𝑡 − 2, …, 0):
            state_returns.get((𝑠_𝑡, 𝑎_𝑡)).append(𝐺)
            𝑄(𝑠_𝑡, 𝑎_𝑡) ← mean(state_returns.get((𝑠_𝑡, 𝑎_𝑡)))
            𝑎^∗ ← argmax_𝑎 𝑄(𝑠_𝑡, 𝑎)
            For 𝑎 in 𝒜_𝑡:   # 𝒜_𝑡 ← environment.all_possible_actions(𝑠_𝑡)
                𝜋(𝑎|𝑠_𝑡) = 1 − 𝜖 + 𝜖/|𝒜_𝑡|,  if 𝑎 = 𝑎^∗
                𝜋(𝑎|𝑠_𝑡) = 𝜖/|𝒜_𝑡|,          if 𝑎 ≠ 𝑎^∗

Return 𝜋(𝑎|𝑠)
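A minimal Python sketch of on-policy first-visit Monte Carlo control following the pseudocode above. It reuses the play_episode helper sketched earlier and assumes a fixed action set instead of environment.all_possible_actions; these simplifications are assumptions for illustration.

import random
from collections import defaultdict

def mc_first_visit_control(environment, terminal_state, actions,
                           episodes=1000, gamma=0.9, epsilon=0.1):
    """On-policy first-visit Monte Carlo control with an epsilon-soft policy."""
    Q = defaultdict(float)                  # Q[(s, a)] estimates
    returns = defaultdict(list)             # all first-visit returns per (s, a)
    probs = defaultdict(lambda: {a: 1.0 / len(actions) for a in actions})  # pi(a|s)

    def policy(s):
        dist = probs[s]
        return random.choices(list(dist), weights=list(dist.values()))[0]

    for _ in range(episodes):
        experiences = play_episode(policy, environment, terminal_state)
        G = 0.0
        for t in range(len(experiences) - 1, -1, -1):   # walk the episode backwards
            s, a, s_next, r = experiences[t]
            G = gamma * G + r
            # First-visit check: (s, a) must not appear earlier in the episode.
            if all((e[0], e[1]) != (s, a) for e in experiences[:t]):
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                best = max(actions, key=lambda act: Q[(s, act)])
                probs[s] = {act: epsilon / len(actions) for act in actions}
                probs[s][best] += 1 - epsilon           # epsilon-soft greedy update
    return probs, Q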

Page 28:

§ Reinforcement Learning can be used for uncertain and dynamic environments.
§ Exploitation vs. exploration.
§ Control problem: optimal policy ~ policy iteration.
§ Prediction problem: optimal estimation ~ policy evaluation.
§ Dynamic Programming: system dynamics are known; guaranteed convergence & optimality.
§ Monte Carlo: sample whole episodes; asymptotically optimal.

28

Summary

Page 29:

?

29

Questions

Page 30:

§ Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press. Retrieved from https://drive.google.com/file/d/1opPSz5AZ_kVa1uWOdOiveNiBFiEOHjkG/view
Interesting chapters:

§ 1 Introduction: 1.1 Reinforcement Learning, 1.2 Examples, 1.3 Elements of Reinforcement Learning

§ 3 Finite Markov Decision Processes: 3.1 The Agent-Environment Interface, 3.2 Goals and Rewards, 3.3 Returns and Episodes

§ 4 Dynamic Programming: All

§ 5 Monte Carlo Methods: All

§ 6 Temporal-Difference Learning: 6.1 TD Prediction, 6.2 Advantages of TD Prediction Methods, 6.4 Sarsa: On-policy TD Control, 6.5 Q-learning Off-policy TD Control

§ 8 Planning and Learning with Tabular Methods: 8.1 Models and Planning, 8.2 Dyna: Integrated Planning, Acting, and Learning, 8.13 Summary of Part I: Dimensions

§ 9 On-policy Prediction with Approximation: 9.1 Value-function Approximation, 9.2 The Prediction Objective 𝑉𝐸 , 9.3 Stochastic-gradient and Semi-gradient Methods

§ 13 Policy Gradient Methods: 13.1 Policy Approximation and its Advantages, 13.2 The Policy Gradient Theorem, 13.5 Actor-Critic Methods

§ 14 Psychology: 14.1 Prediction and Control

§ 15 Neuroscience: 15.4 Dopamine, 15.7 Neural Actor-Critic, 15.9 Hedonistic Neurons, 15.10 Collective Reinforcement Learning

§ 17 Frontiers: 17.3 Observations and State

§ Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., … Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359. https://doi.org/10.1038/nature24270

§ David Silver’s slides: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

30

Some References

Page 31:

Extra Slides

Just in case

31

Page 32:

Symbol           Explanation
𝑖, 𝑗              Agent indices
𝑂(𝑥)             An objective function that operates on input 𝑥
𝑡                 A timestep
𝑎_𝑡              The action taken by an agent at time 𝑡
𝑠_𝑡              The state of an agent at time 𝑡
𝑟_𝑡              The reward received by an agent at time 𝑡
𝑔(𝑠_𝑡, 𝑎_𝑡)       The state transition from time 𝑡 to 𝑡 + 1, given the agent's state and selected action
𝑉(𝑠_𝑡)           The value function, which gives the agent an estimate of how good its state is at time 𝑡
𝑄(𝑠_𝑡, 𝑎_𝑡)       The action-value function, which gives the agent an estimate of how good its state 𝑠_𝑡 and selected action 𝑎_𝑡 are at time 𝑡

32

Notation Table I (Appendix)

Page 33:

Symbol              Explanation
𝑣(𝑠_𝑡), 𝑞(𝑠_𝑡, 𝑎_𝑡)   The true state and action-state value functions, which give the actual value of how good the state 𝑠_𝑡 and selected action 𝑎_𝑡 are for an agent at time 𝑡. Usually they are not known.
𝜋(𝑎_𝑡|𝑠_𝑡)           The policy that selects the action 𝑎_𝑡 given the state 𝑠_𝑡 for the agent at time 𝑡.
𝛾                    The discount factor, usually 0 ≤ 𝛾 ≤ 1, which discounts future rewards and values.
𝑅_𝑡                  The cumulative reward from time 𝑡 onward for an agent.
𝐺_𝑡                  The return: the cumulative reward from time 𝑡 until the end of an episode (e.g. when a goal is met or failed for an agent).
𝜋^∗(𝑎_𝑡|𝑠_𝑡)          The optimal policy, which maximizes the cumulative reward and return.
𝑜_𝑡                  The environmental observation of an agent at time 𝑡. Usually modelled along with the state.

33

Notation Table II (Appendix)