Advanced Prediction Models

Transcript of Advanced Prediction Models

Page 1: Advanced Prediction Models

Advanced Prediction Models

Page 2: Advanced Prediction Models

Today’s Outline

• Complex Decisions
• Reinforcement Learning Basics
• Markov Decision Process
• (State, Action) Value Function
• Q Learning Algorithm


Page 3: Advanced Prediction Models

Complex Decisions


Page 4: Advanced Prediction Models

Complex Decision Making is Everywhere

[Diagram: RL sits at the intersection of Optimal Control/Engineering, Machine Learning/AI, Neuroscience/Psychology, and Economics/Operations Research]

• Control: fly drones, autonomous driving

• Operations: retain customers, UX; inventory management

• Logistics: schedule transportation, resource allocation

• Games: Chess, Go, Atari

Page 5: Advanced Prediction Models

Complex Decision Making is Everywhere

Credit: Sebastien Bubeck

• Control: fly drones, autonomous driving

• Operations: retain customers, UX; inventory management

• Logistics: schedule transportation, resource allocation

• Games: Chess, Go, Atari

Page 6: Advanced Prediction Models

Playing Atari Using RL (2013)

1Figure: Defazio & Graepel, Arcade Learning Environment

Page 7: Advanced Prediction Models

1Reference: DeepMind, March 2016

AlphaGo Conquers Go (2016)

Page 8: Advanced Prediction Models

Need for Reinforcement Learning

1Reference: https://medium.com/@awjuliani/simple-reinforcement-learning-with-tensorflow-part-1-5-contextual-bandits-bff01d1aad9c

Non-exogenous change of states/contexts

Page 9: Advanced Prediction Models

RL Overview

• Reinforcement Learning (RL) addresses a version of the problem of sequential decision making

• Ingredients:
  • There is an environment
  • Within which, an agent takes actions
  • This action influences the future
  • The agent gets a (potentially delayed) feedback signal

• How to select actions to maximize total reward?
  • RL provides several sound answers to this question

Page 10: Advanced Prediction Models

The Environment

• Sees the agent's action $A_t$ and generates an observation $S_{t+1}$ and a reward $R_{t+1}$

• Subscript $t$ indexes time. The current observation $S_t$ is called the state

• Assume the future (at times $t+1, t+2, \ldots$) is independent of the past ($\ldots, t-2, t-1$) given the present ($t$): this is called the Markov assumption

• Assume everything relevant is observed

$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, S_2, \ldots, S_t)$

Page 11: Advanced Prediction Models

The Agent

• The agent observes $R_{t+1}, S_{t+1}$, and these are not i.i.d. across time

• The agent's objective is to maximize the expected total future reward $\mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots]$

• The agent's actions affect what it sees in the future ($S_{t+1}$)

• It may be better to trade off the current reward $R_{t+1}$ to gain more reward in the future

$\gamma$: discount factor
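
As a concrete illustration of this objective, here is a minimal Python sketch that computes the discounted return for a made-up reward sequence (the numbers are purely illustrative, not from the slides):

def discounted_return(rewards, gamma=0.9):
    # Sum of gamma**k * R_{t+1+k} over an observed reward sequence
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Example: an immediate reward of 1 plus a delayed reward of 10 three steps later
print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29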

Page 12: Advanced Prediction Models

The Reward

1Reference: David Silver, 2015

Page 13: Advanced Prediction Models

The Goal

1Reference: David Silver, 2015

Page 14: Advanced Prediction Models

The Interactions

• Pictorially:

  Environment → Agent: $S_t, R_t$;   Agent → Environment: $A_t$
  Environment → Agent: $S_{t+1}, R_{t+1}$;   Agent → Environment: $A_{t+1}$
  Environment → Agent: $S_{t+2}, R_{t+2}$;   Agent → Environment: $A_{t+2}$
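
Written as code rather than a diagram, the interaction is a simple loop. This is a minimal sketch assuming a generic environment with reset() and step(action) methods and an agent with act(state) and observe(...) methods; these names are illustrative assumptions, not a specific library's API:

def run_episode(env, agent, max_steps=1000):
    # One episode of the agent-environment interaction loop
    state = env.reset()                                    # S_t
    for t in range(max_steps):
        action = agent.act(state)                          # agent picks A_t
        next_state, reward, done = env.step(action)        # environment returns S_{t+1}, R_{t+1}
        agent.observe(state, action, reward, next_state)   # any learning happens here
        state = next_state
        if done:
            break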


Page 17: Advanced Prediction Models

RL versus other Machine Learning Settings

1Reference: Joelle Pineau, DLSS 2016

Page 18: Advanced Prediction Models

Components of an RL Agent

1Reference: David Silver, 2015

Page 19: Advanced Prediction Models

Components of RL: Policy

1Reference: David Silver, 2015


Page 22: Advanced Prediction Models

Components of RL: Value Function

1Reference: David Silver, 2015


Page 24: Advanced Prediction Models

Components of RL: Model

1Reference: David Silver, 2015


Page 26: Advanced Prediction Models

Questions?


Page 27: Advanced Prediction Models

Today’s Outline

• Complex Decisions
• Reinforcement Learning Basics
• Markov Decision Process
• (State, Action) Value Function
• Q Learning Algorithm


Page 28: Advanced Prediction Models

Components of RL: MDP Framework


• We will now revisit these components formally
  • Policy $\pi(a \mid s)$
  • Value function $v_\pi(s)$
  • Model $\mathcal{P}^a_{ss'}$ and $\mathcal{R}^a_s$

• In the framework of Markov Decision Processes

• And then we will address the question of optimizing for the best $\pi$ in realistic environments
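
As a toy illustration (the states, actions, and numbers below are made up for this transcript, not taken from the slides), the three objects can be written out directly in Python:

# Policy pi(a|s): a distribution over actions for each state
policy = {"s0": {"stay": 0.5, "go": 0.5},
          "s1": {"stay": 0.9, "go": 0.1}}

# Model: transition probabilities P^a_{ss'} and expected rewards R^a_s
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}}
R = {("s0", "stay"): -1.0, ("s0", "go"): 0.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}

# Value function v_pi(s): one number per state, to be computed or learned
v = {"s0": 0.0, "s1": 0.0}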

Page 29: Advanced Prediction Models

Towards a Markov Decision Process

• MDPs are a useful way to describe the RL problem

• MDPs can be understood via the following progression
  • Start with a Markov Chain: state transitions happen autonomously
  • Add rewards: becomes a Markov Reward Process
  • Add actions that influence state transitions: becomes a Markov Decision Process

1Reference: David Silver, 2015

Page 30: Advanced Prediction Models

Markov Chain/Process

1Reference: David Silver, 2015

Page 31: Advanced Prediction Models

Example Markov Chain

1Reference: David Silver, 2015


Page 34: Advanced Prediction Models

Markov Chain with Rewards

1Reference: David Silver, 2015


Page 37: Advanced Prediction Models

Example Markov Reward Process

1Reference: David Silver, 2015

Page 38: Advanced Prediction Models

Recursions in Markov Reward Process

1Reference: David Silver, 2015
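
The recursion the referenced figure shows is not reproduced in this transcript; in standard notation it is the Bellman equation for a Markov Reward Process, which for a small state space can also be solved directly in matrix form:

v(s) = \mathbb{E}\big[ R_{t+1} + \gamma \, v(S_{t+1}) \mid S_t = s \big] = \mathcal{R}_s + \gamma \sum_{s'} \mathcal{P}_{ss'} \, v(s'), \qquad v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}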


Page 40: Advanced Prediction Models

1Reference: David Silver, 2015

Markov Decision Process

Page 41: Advanced Prediction Models

Example Markov Decision Process

1Reference: David Silver, 2015

Page 42: Advanced Prediction Models

Markov Decision Process: Policy

• Now that we have introduced actions, we can discuss policies again

• Recall

1Reference: David Silver, 2015

Page 43: Advanced Prediction Models

MDP is an MRP for a Fixed Policy

1Reference: David Silver, 2015


Page 45: Advanced Prediction Models

Markov Decision Process: Value Function

• We can also talk about the value function(s)

1Reference: David Silver, 2015


Page 47: Advanced Prediction Models

Recursions in MDP

1Reference: David Silver, 2015

*Also called the Bellman Expectation Equations
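
The equations themselves are not reproduced in this transcript; in standard notation, the Bellman expectation equations for a fixed policy $\pi$ are

v_\pi(s) = \mathbb{E}_\pi\big[ R_{t+1} + \gamma \, v_\pi(S_{t+1}) \mid S_t = s \big], \qquad q_\pi(s, a) = \mathbb{E}_\pi\big[ R_{t+1} + \gamma \, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \big]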


Page 49: Advanced Prediction Models

Markov Decision Process: Objective

1Reference: David Silver, 2015

Page 50: Advanced Prediction Models

Markov Decision Process: Objective

1Reference: David Silver, 2015

*Also called the Bellman Optimality Equation
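
For reference (the slide's equation is not reproduced in this transcript), the Bellman optimality equation for the state-action value function is, in standard notation,

q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')

and an optimal policy acts greedily with respect to $q_*$: $\pi_*(s) = \arg\max_a q_*(s, a)$.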

Page 51: Advanced Prediction Models

Markov Decision Process: Optimal Policy

1Reference: David Silver, 2015

Page 52: Advanced Prediction Models

Questions?


Page 53: Advanced Prediction Models

Today’s Outline

• Complex Decisions
• Reinforcement Learning Basics
• Markov Decision Process
• (State, Action) Value Function
• Q Learning Algorithm


Page 54: Advanced Prediction Models

Finding the Best Policy

• Need to be able to do two things, ideally
  • Prediction: for a given policy, evaluate how good it is, i.e., compute $q_\pi(s, a)$
  • Control: make an improvement from $\pi$

• We will focus on the Q Learning algorithm
  • It does prediction and control ‘simultaneously’

1Reference: David Silver, 2015

Page 55: Advanced Prediction Models

The Q Learning Algorithm

• Initialize $Q$, which is a table of size #states × #actions

• Start at state $s_1$
• For $t = 1, 2, 3, \ldots$
  • Take $A_t$ chosen uniformly at random with probability $\epsilon$ (Explore)
  • Take $\arg\max_{a \in \mathcal{A}} Q(S_t, a)$ with probability $1 - \epsilon$ (Exploit)
  • Update $Q$:
    $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha_t \big( R_{t+1} + \gamma \max_{a \in \mathcal{A}} Q(S_{t+1}, a) - Q(S_t, A_t) \big)$
    (the term multiplying $\alpha_t$ is the temporal difference error)

• Parameter $\epsilon$ is the exploration parameter

• Parameter $\alpha_t$ is the learning rate

• Under appropriate assumptions1, $\lim_{t \to \infty} Q = Q^*$

1Reference: Christopher J. C. H. Watkins and Peter Dayan, 1992
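
A minimal tabular Q-learning sketch in Python. It assumes a small environment object exposing reset() and step(action) (returning the next state, the reward, and a done flag); that interface and all parameter values are illustrative assumptions, not part of the slides:

import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))        # table of size #states x #actions
    for _ in range(n_episodes):
        s = env.reset()                        # start state
        done = False
        while not done:
            if np.random.rand() < epsilon:     # explore with probability epsilon
                a = np.random.randint(n_actions)
            else:                              # exploit with probability 1 - epsilon
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)      # environment returns S_{t+1}, R_{t+1}
            target = r if done else r + gamma * np.max(Q[s_next])
            td_error = target - Q[s, a]        # temporal difference error
            Q[s, a] += alpha * td_error        # Q-learning update
            s = s_next
    return Q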


Page 61: Advanced Prediction Models

The Q Learning Algorithm: Notation

• Bellman Optimality Equation gives rise to the Q-Value Iteration algorithm

• Making this algorithm incremental and sampled, and adding $\epsilon$-greedy exploration, gives the Q Learning Algorithm

1Reference: David Silver, 2015

Page 62: Advanced Prediction Models

Q-Value Iteration

• If we know the model
  • Turn the Bellman Optimality Equation into an iterative update
  • Replace the equality with the ‘arrow’ (assignment) sign below

1Reference: David Silver, 2015
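
A minimal sketch of Q-value iteration when the model is known, using NumPy arrays P (shape: states × actions × next states) and R (shape: states × actions); the array shapes and tolerance are illustrative assumptions:

import numpy as np

def q_value_iteration(P, R, gamma=0.99, tol=1e-8):
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Bellman optimality backup: Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
        Q_new = R + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new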

Page 63: Advanced Prediction Models

Q-Value Iteration

1Reference: David Silver, 2015


Page 65: Advanced Prediction Models

The Q Learning Algorithm

• If we do not know the model
  • Do sampling to get an incremental iterative update
  • Choose next actions to ensure exploration

1Reference: David Silver, 2015


Page 69: Advanced Prediction Models

Questions?


Page 70: Advanced Prediction Models

Summary

• RL is a great framework to make agents intelligent
  • Specify goals and provide feedback

• Many challenges still remain: an exciting opportunity to contribute towards the next generation of artificially intelligent and autonomous agents

• In the next lecture, we will see that RL agents based on deep-learning function approximation show promise in large, complex tasks: representations matter!
  • Applications such as
    • Self-driving cars
    • Intelligent virtual agents


Page 71: Advanced Prediction Models

Appendix


Page 72: Advanced Prediction Models

Sample Exam Questions

• What is the difference between a Markov Chain and a Markov Reward Process?

• What is the difference between a Markov Chain and a Markov Decision Process?

• Why is exploration needed in the reinforcement learning setting?

• What does the optimal state-action value function signify?

• What are the two objects (distributions) of an RL model?

• What is the difference between supervised learning and reinforcement learning?


Page 73: Advanced Prediction Models

Additional Resources

• An Introduction to Reinforcement Learning by Richard Sutton and Andrew Barto
  • http://incompleteideas.net/sutton/book/the-book.html

• Course on Reinforcement Learning by David Silver at UCL (includes video lectures)
  • http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

• Research Papers
  • Deep RL collection: https://github.com/junhyukoh/deep-reinforcement-learning-papers
  • [MKSRVBGRFOPBSAKKWLH2015] Mnih et al. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
  • [SHMGSDSAPLDGNKSLLKGH2016] Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.

Page 74: Advanced Prediction Models

Cons of RL

• Reinforcement learning requires experiencing the environment many, many times

• This is because it is a trial-and-error-based approach

• Impractical for many complex tasks
  • Unless one has access to simulators where an RL agent can practice a billion times

Page 75: Advanced Prediction Models

RL versus other Machine Learning Settings


• There is a notion of exploration and exploitation, similar to Multi-armed bandits and Contextual bandits

• Key difference: actions influence future contexts

1Reference: David Silver, 2015

Page 76: Advanced Prediction Models

RL versus other Sequential Decision Making Settings

1Reference: David Silver, 2015

Page 77: Advanced Prediction Models

Types of RL Agents

• There are many ways to design them, so we roughly categorize them as below:

1Reference: David Silver, 2015

Page 78: Advanced Prediction Models

Relating the Two Value Functions I

1Reference: David Silver, 2015
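
The slide's equation is not reproduced in this transcript; the first relation, in standard notation, expresses the state value function as a policy-weighted average of state-action values:

v_\pi(s) = \sum_a \pi(a \mid s) \, q_\pi(s, a)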

Page 79: Advanced Prediction Models

Relating the Two Value Functions II

1Reference: David Silver, 2015
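
The second relation, again in standard notation (the figure is not reproduced here), expresses the state-action value function via a one-step lookahead on the state value function:

q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, v_\pi(s')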

Page 80: Advanced Prediction Models

Recursion in MDP: Value Function Version

1Reference: David Silver, 2015
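
Combining the two relations above gives the recursion this slide refers to, written purely in terms of the state value function:

v_\pi(s) = \sum_a \pi(a \mid s) \Big( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \, v_\pi(s') \Big)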

Page 81: Advanced Prediction Models

Relating Policy and Value Function

1Reference: David Silver, 2015