Monte Carlo Approaches to Reinforcement Learning
Robert Platt (w/ Marcus Gualtieri’s edits) Northeastern University
Model Free Reinforcement Learning

Agent ↔ World
– agent → world: joystick command
– world → agent: observed screen pixels
– reward = game score

Goal: learn a value function through trial-and-error experience.

Recall: the value of a state $s$ when acting according to policy $\pi$ is
$v_\pi(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right]$

How?

Simplest solution: average all returns observed from previous visits to a given state
– this is called a Monte Carlo method
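As a concrete sketch of that averaging (hypothetical states and returns; this is not code from the slides):

```python
from collections import defaultdict

# Monte Carlo value estimate: V(s) is the average of all returns
# observed from previous visits to state s.
returns_sum = defaultdict(float)   # total return accumulated per state
returns_count = defaultdict(int)   # number of returns observed per state

def record_return(state, G):
    """Record one observed return G from a visit to `state`."""
    returns_sum[state] += G
    returns_count[state] += 1

def value(state):
    """The MC estimate: mean of the returns observed so far."""
    if returns_count[state] == 0:
        return 0.0  # no experience from this state yet
    return returns_sum[state] / returns_count[state]

# Hypothetical example: two episodes from state s yielded returns -1 and +1,
# so the MC estimate is their average, 0.0.
record_return("s", -1.0)
record_return("s", +1.0)
print(value("s"))  # 0.0
```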
Running Example: Blackjack

State: (sum of cards in the agent's hand, dealer's showing card, does the agent have a usable ace?)
Actions: hit, stick
Objective: have the agent's card sum be greater than the dealer's without exceeding 21
Reward: +1 for winning, 0 for a draw, -1 for losing
Discounting: none ($\gamma = 1$; the task is episodic)
Dealer policy: draw until the sum is at least 17
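The dealer's fixed rule is simple enough to write down directly (a minimal sketch; the function name and string encoding are mine, not the slides'):

```python
def dealer_policy(dealer_sum: int) -> str:
    """Fixed dealer rule from the slides: draw until the sum is at least 17."""
    return "hit" if dealer_sum < 17 else "stick"
```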
Blackjack "Basic Strategy" is a set of rules for play that maximizes return
– well known in the gambling community
– how might an RL agent learn the Basic Strategy?
Monte Carlo Policy Evaluation: Example

Dealer's showing card: 10
Agent's hand: sum of 19, no usable ace

State        Action   Next State    Reward
19, 10, no   HIT      22, 10, no    -1

(State = agent's sum, dealer's showing card, usable ace?)

The agent hits on 19 and busts with 22 (reward = -1).

Upon episode termination, make the following value function updates: append the return $G = -1$ to $Returns(19, 10, \text{no})$ and set $V(19, 10, \text{no})$ to the average of the recorded returns, here $-1$.
Monte Carlo Policy Evaluation: Example
Next episode...
![Page 16: Monte Carlo Approaches to Reinforcement Learning13, 10, no HIT 16, 10, no 0 16, 10, no HIT 19, 10, no 0 19, 10, no HIT 21, 22, no 1 Upon episode termination, make the following value](https://reader034.fdocuments.in/reader034/viewer/2022050214/5f60330f5da64947ba01a615/html5/thumbnails/16.jpg)
Monte Carlo Policy Evaluation: Example
Dealer card:
Agent’s hand:
State Action Next State Reward
13, 10, no
![Page 17: Monte Carlo Approaches to Reinforcement Learning13, 10, no HIT 16, 10, no 0 16, 10, no HIT 19, 10, no 0 19, 10, no HIT 21, 22, no 1 Upon episode termination, make the following value](https://reader034.fdocuments.in/reader034/viewer/2022050214/5f60330f5da64947ba01a615/html5/thumbnails/17.jpg)
Monte Carlo Policy Evaluation: Example
Dealer card:
Agent’s hand:
State Action Next State Reward
13, 10, no HIT 16, 10, no 0
![Page 18: Monte Carlo Approaches to Reinforcement Learning13, 10, no HIT 16, 10, no 0 16, 10, no HIT 19, 10, no 0 19, 10, no HIT 21, 22, no 1 Upon episode termination, make the following value](https://reader034.fdocuments.in/reader034/viewer/2022050214/5f60330f5da64947ba01a615/html5/thumbnails/18.jpg)
Monte Carlo Policy Evaluation: Example
Dealer card:
Agent’s hand:
State Action Next State Reward
13, 10, no HIT 16, 10, no 013, 10, no
![Page 19: Monte Carlo Approaches to Reinforcement Learning13, 10, no HIT 16, 10, no 0 16, 10, no HIT 19, 10, no 0 19, 10, no HIT 21, 22, no 1 Upon episode termination, make the following value](https://reader034.fdocuments.in/reader034/viewer/2022050214/5f60330f5da64947ba01a615/html5/thumbnails/19.jpg)
Monte Carlo Policy Evaluation: Example
Dealer card:
Agent’s hand:
State Action Next State Reward
13, 10, no HIT 16, 10, no 013, 10, no HIT 19, 10, no 0
![Page 20: Monte Carlo Approaches to Reinforcement Learning13, 10, no HIT 16, 10, no 0 16, 10, no HIT 19, 10, no 0 19, 10, no HIT 21, 22, no 1 Upon episode termination, make the following value](https://reader034.fdocuments.in/reader034/viewer/2022050214/5f60330f5da64947ba01a615/html5/thumbnails/20.jpg)
Monte Carlo Policy Evaluation: Example
Dealer card:
Agent’s hand:
State Action Next State Reward
13, 10, no HIT 16, 10, no 013, 10, no HIT 19, 10, no 019, 10, no
![Page 21: Monte Carlo Approaches to Reinforcement Learning13, 10, no HIT 16, 10, no 0 16, 10, no HIT 19, 10, no 0 19, 10, no HIT 21, 22, no 1 Upon episode termination, make the following value](https://reader034.fdocuments.in/reader034/viewer/2022050214/5f60330f5da64947ba01a615/html5/thumbnails/21.jpg)
Monte Carlo Policy Evaluation: Example
Dealer card:
Agent’s hand:
State Action Next State Reward
13, 10, no HIT 16, 10, no 013, 10, no HIT 19, 10, no 019, 10, no HIT 21, 22, no 1
![Page 22: Monte Carlo Approaches to Reinforcement Learning13, 10, no HIT 16, 10, no 0 16, 10, no HIT 19, 10, no 0 19, 10, no HIT 21, 22, no 1 Upon episode termination, make the following value](https://reader034.fdocuments.in/reader034/viewer/2022050214/5f60330f5da64947ba01a615/html5/thumbnails/22.jpg)
Monte Carlo Policy Evaluation: Example
State Action Next State Reward
13, 10, no HIT 16, 10, no 016, 10, no HIT 19, 10, no 019, 10, no HIT 21, 22, no 1
Upon episode termination, make the following value function updates:
Figure (on the slide): value function learned for the "hit everything except 20 and 21" policy.
Monte Carlo Policy Evaluation

Given a policy $\pi$, estimate the value function $v_\pi(s)$ for all states $s \in \mathcal{S}$.

Monte Carlo Policy Evaluation (first visit):
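The algorithm box on the slide did not survive extraction; the following is a minimal Python sketch of first-visit MC prediction in the spirit of Sutton & Barto (the episode-generation function is an assumed placeholder, not the slide's code):

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation (sketch).

    `generate_episode` is assumed to run the policy being evaluated and
    return a list of (state, action, reward) tuples for one episode.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        # Walk the episode backwards, accumulating the return.
        for t in reversed(range(len(episode))):
            state, _, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: only update on the first occurrence
            # of `state` within this episode.
            if state not in [s for s, _, _ in episode[:t]]:
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```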
To get an accurate estimate of the value function, every state has to be visited many times, i.e. many rollouts must be run from every state (the slide illustrates this with rollouts fanning out from each state).
Think-pair-share: frozenlake env

    0 1 2 3
0   S F F F
1   F H F H
2   F F F H
3   H F F G

States: grid world coordinates
Actions: L, R, U, D
Reward: 0 everywhere except at G, where r = 1

Given: three episodes (shown in a figure on the slide)
Calculate: the values of the states on the top row, as computed by MC
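The three episodes live in the slide figure and are not reproduced here, but the calculation itself is mechanical. A sketch with placeholder episode data (NOT the slide's episodes), assuming undiscounted returns as in the blackjack example:

```python
from collections import defaultdict

# Placeholder episodes: lists of (state, reward-after-leaving-state) pairs.
# The slide's actual three episodes are in a figure and NOT shown here.
episodes = [
    [((0, 0), 0.0), ((0, 1), 0.0), ((0, 2), 0.0)],
]

returns = defaultdict(list)
for episode in episodes:
    # Index of each state's first visit within this episode.
    first_visit = {}
    for t, (state, _) in enumerate(episode):
        first_visit.setdefault(state, t)
    # Accumulate returns backwards (gamma = 1, i.e. undiscounted).
    G = 0.0
    for t in reversed(range(len(episode))):
        state, reward = episode[t]
        G += reward
        if first_visit[state] == t:
            returns[state].append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)
```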
Monte Carlo Control

So far, we have only been talking about policy evaluation...
... but RL requires us to find a policy, not just evaluate one. How?

Key idea: evaluate/improve the policy iteratively:
– estimate $q_\pi$ via rollouts
– improve the policy by acting greedily with respect to that estimate (in symbols below)
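This is the standard generalized policy iteration loop, reconstructed from context (the slide shows it as a diagram):

```latex
\pi_0 \xrightarrow{\ \text{evaluate}\ } q_{\pi_0}
      \xrightarrow{\ \text{improve}\ } \pi_1
      \xrightarrow{\ \text{evaluate}\ } q_{\pi_1}
      \xrightarrow{\ \text{improve}\ } \pi_2
      \longrightarrow \cdots \longrightarrow \pi_* ,
\qquad
\pi_{k+1}(s) = \arg\max_a q_{\pi_k}(s, a).
```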
Monte Carlo, Exploring Starts (the full algorithm box is a figure on the slide)

Exploring starts:
– each episode starts with a random action taken from a random state

Notice there is only one step of policy evaluation per iteration; that's okay:
– each evaluation iteration moves the value function toward its optimal value, which is good enough to improve the policy (see the sketch below).
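A minimal Python sketch of Monte Carlo ES (the environment interface `env_step`, the `random_state_action` starter, and the episode format are my assumptions, not the slide's pseudocode):

```python
import random
from collections import defaultdict

def mc_exploring_starts(env_step, random_state_action, actions,
                        num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (sketch).

    Assumed interface: random_state_action() -> a uniformly random
    (state, action) start; env_step(state, action) -> (next_state,
    reward, done).
    """
    Q = defaultdict(float)          # action-value estimates Q[(s, a)]
    returns = defaultdict(list)     # observed returns per (s, a)
    policy = {}                     # greedy policy: state -> action

    for _ in range(num_episodes):
        # Exploring start: random state AND random first action.
        state, action = random_state_action()
        episode, done = [], False
        while not done:
            next_state, reward, done = env_step(state, action)
            episode.append((state, action, reward))
            state = next_state
            if not done:
                action = policy.get(state, random.choice(actions))

        # One sweep of (first-visit) policy evaluation ...
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if all((s, a) != (s2, a2) for s2, a2, _ in episode[:t]):
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                # ... immediately followed by policy improvement.
                policy[s] = max(actions, key=lambda a2: Q[(s, a2)])
    return Q, policy
```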
Figure (on the slide): the policy the MC agent learned, side by side with the official "basic strategy".
Monte Carlo Control: Convergence

Policy Improvement Theorem:

If $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s \in \mathcal{S}$,
then $v_{\pi'}(s) \ge v_\pi(s)$ for all $s \in \mathcal{S}$, i.e. $\pi'$ is as good as or better than $\pi$.
Policy Improvement Theorem: Proof (Sketch)
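The proof on the slide is an image; the standard argument (as in Sutton & Barto) repeatedly expands the right-hand side:

```latex
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
         &= \mathbb{E}\!\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s,\, A_t = \pi'(s) \right] \\
         &\le \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \\
         &\le \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s \right] \\
         &\;\;\vdots \\
         &\le \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right] = v_{\pi'}(s).
\end{aligned}
```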
E-Greedy Exploration

Monte Carlo, Exploring Starts: without exploring starts, we are not guaranteed to explore the state/action space.
– why is this a problem?
– what happens if we never experience certain transitions?

Can we accomplish exploration without exploring starts?

Yes: use a stochastic (ε-greedy) policy.
Greedy policy: $\pi(s) = \arg\max_a Q(s, a)$

ε-Greedy policy:

$$
\pi(s) =
\begin{cases}
\arg\max_a Q(s, a) & \text{with probability } 1 - \varepsilon \\
\text{an action drawn uniformly from } \mathcal{A}(s) & \text{with probability } \varepsilon
\end{cases}
$$

This guarantees every state/action pair will be visited infinitely often.
– Notice that this is a stochastic policy (not a deterministic one).
– This is an example of a soft policy.
– Soft policy: all actions in all states have non-zero probability.
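A minimal sketch of ε-greedy action selection (names are illustrative, not from the slides):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore uniformly; otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit

```

Note that the greedy action is also reachable through the uniform branch, so every action has probability at least $\varepsilon / |\mathcal{A}(s)|$, which is exactly the soft-policy property above.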
E-Greedy Exploration

Monte Carlo, ε-greedy exploration (the full algorithm box is a figure on the slide): the same evaluate/improve loop as Monte Carlo ES, except that episodes are generated with an ε-greedy policy rather than exploring starts, and the improvement step makes the policy ε-greedy with respect to the current Q.
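The distinctive step in that algorithm is the improvement update, which in its standard form (reconstructed, not copied from the slide) sets, for each state $s$:

```latex
\pi(a \mid s) \leftarrow
\begin{cases}
1 - \varepsilon + \varepsilon / |\mathcal{A}(s)| & \text{if } a = \arg\max_{a'} Q(s, a'), \\
\varepsilon / |\mathcal{A}(s)| & \text{otherwise.}
\end{cases}
```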
Off-Policy Methods

● On-policy methods evaluate or improve the policy that is used to make decisions.
● Off-policy methods evaluate or improve a policy different from the one used to generate the data.
● The target policy is the policy (π) we wish to evaluate or improve.
● The behavior policy is the policy (b) used to generate experience.
● Coverage: every action taken under π must have a chance of being taken under b, i.e. $\pi(a \mid s) > 0$ implies $b(a \mid s) > 0$.
MC Summary

MC methods estimate the value function by doing rollouts.
Can estimate either the state value function, $v_\pi(s)$, or the action value function, $q_\pi(s, a)$.
MC Control alternates between policy evaluation and policy improvement.
ε-greedy exploration explores all possible actions while preferring greedy actions.
Off-policy methods update a policy other than the one used to generate experience.