Lecture 12: Fast Reinforcement Learning Part II
Emma Brunskill
CS234 Reinforcement Learning
Winter 2018
(With many slides from or derived from David Silver; worked examples new.)
Class Structure
Last time: Fast Learning, Exploration/Exploitation Part 1
This Time: Fast Learning Part II
Next time: Batch RL
Table of Contents
1 Metrics for evaluating RL algorithms
2 Principles for RL Exploration
3 Probability Matching
4 Information State Search
5 MDPs
6 Principles for RL Exploration
7 Metrics for evaluating RL algorithms
Performance Criteria of RL Algorithms
Empirical performance
Convergence (to something ...)
Asymptotic convergence to optimal policy
Finite sample guarantees: probably approximately correct
Regret (with respect to optimal decisions)
Optimal decisions given the information available
PAC uniform
Table of Contents
1 Metrics for evaluating RL algorithms
2 Principles for RL Exploration
3 Probability Matching
4 Information State Search
5 MDPs
6 Principles for RL Exploration
7 Metrics for evaluating RL algorithms
Principles
Naive Exploration (last time)
Optimistic Initialization (last time)
Optimism in the Face of Uncertainty (last time + this time)
Probability Matching (last time + this time)
Information State Search (this time)
Multi-armed Bandits
A multi-armed bandit is a tuple (A, R)
A: known set of m actions (arms)
R^a(r) = P[r | a] is an unknown probability distribution over rewards
At each step t the agent selects an action a_t ∈ A
The environment generates a reward r_t ∼ R^{a_t}
Goal: maximize cumulative reward Σ_{τ=1}^t r_τ
Regret
Action-value is the mean reward for action a: Q(a) = E[r | a]
Optimal value V*: V* = Q(a*) = max_{a∈A} Q(a)
Regret is the opportunity loss for one step: l_t = E[V* − Q(a_t)]
Total regret is the total opportunity loss: L_t = E[Σ_{τ=1}^t (V* − Q(a_τ))]
Maximize cumulative reward ⇐⇒ minimize total regret
Optimism Under Uncertainty: Upper Confidence Bounds
Estimate an upper confidence U_t(a) for each action value, such that Q(a) ≤ Q_t(a) + U_t(a) with high probability
This depends on the number of times N_t(a) that action a has been selected:
Small N_t(a) → large U_t(a) (estimated value is uncertain)
Large N_t(a) → small U_t(a) (estimated value is accurate)
Select the action maximizing the Upper Confidence Bound (UCB):
a_t = argmax_{a∈A} [Q_t(a) + U_t(a)]
UCB1
This leads to the UCB1 algorithm:
a_t = argmax_{a∈A} [Q(a) + √(2 log t / N_t(a))]
Theorem: The UCB algorithm achieves logarithmic asymptotic total regret
lim_{t→∞} L_t ≤ 8 log t Σ_{a|Δ_a>0} 1/Δ_a
where Δ_a = V* − Q(a) is the gap for arm a.
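The UCB1 rule can be sketched in a few lines. A minimal illustration on Bernoulli arms; the arm means (0.95, 0.9, 0.1, matching the broken-toe example below) and the horizon are illustrative choices, not part of the slides.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Minimal UCB1 on Bernoulli arms: after pulling each arm once,
    pick a_t = argmax_a Q(a) + sqrt(2 ln t / N_t(a))."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n    # N_t(a): pulls of each arm
    q = [0.0] * n       # Q(a): empirical mean reward of each arm
    for t in range(1, horizon + 1):
        if t <= n:
            a = t - 1   # initialization: sample each arm once
        else:
            a = max(range(n),
                    key=lambda i: q[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < means[a] else 0.0   # Bernoulli reward draw
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]                # incremental mean update
    return counts

counts = ucb1([0.95, 0.9, 0.1], horizon=500)
```

Consistent with the logarithmic-regret theorem, the clearly suboptimal arm (mean 0.1, large gap) should be pulled only rarely, while the two near-optimal arms absorb almost all pulls.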
Toy Example: Ways to Treat Broken Toes
Consider deciding how to best treat patients with broken toes
Imagine we have 3 possible options: (1) surgery, (2) buddy taping the broken toe to another toe, (3) do nothing
Outcome measure is a binary variable: whether the toe has healed (+1) or not healed (0) after 6 weeks, as assessed by x-ray
Note: this is a made-up example. These are not the actual expected efficacies of the various treatment options for a broken toe.
Toy Example: Ways to Treat Broken Toes
Consider deciding how to best treat patients with broken toes
Imagine we have 3 common options: (1) surgery, (2) a surgical boot, (3) buddy taping the broken toe to another toe
Outcome measure is a binary variable: whether the toe has healed (+1) or not (0) after 6 weeks, as assessed by x-ray
Model as a multi-armed bandit with 3 arms, where each arm is a Bernoulli variable with an unknown parameter θ_i
Check your understanding: what does a pull of an arm / taking an action correspond to? Why is it reasonable to model this as a multi-armed bandit instead of a Markov decision process?
Toy Example: Ways to Treat Broken Toes
Imagine the true (unknown) parameters for each arm (action) are:
surgery: Q(a_1) = θ_1 = 0.95
buddy taping: Q(a_2) = θ_2 = 0.9
doing nothing: Q(a_3) = θ_3 = 0.1
Toy Example: Ways to Treat Broken Toes, Optimism
True (unknown) parameters for each arm (action) are:
surgery: Q(a_1) = θ_1 = 0.95
buddy taping: Q(a_2) = θ_2 = 0.9
doing nothing: Q(a_3) = θ_3 = 0.1
Optimism under uncertainty, UCB1 (Auer, Cesa-Bianchi, Fischer 2002):
1. Sample each arm once
Toy Example: Ways to Treat Broken Toes, Optimism
True (unknown) parameters for each arm (action) are:
surgery: Q(a_1) = θ_1 = 0.95
buddy taping: Q(a_2) = θ_2 = 0.9
doing nothing: Q(a_3) = θ_3 = 0.1
UCB1 (Auer, Cesa-Bianchi, Fischer 2002):
1. Sample each arm once
Take action a_1 (r ∼ Bernoulli(0.95)), get +1, Q(a_1) = 1
Take action a_2 (r ∼ Bernoulli(0.90)), get +1, Q(a_2) = 1
Take action a_3 (r ∼ Bernoulli(0.1)), get 0, Q(a_3) = 0
Toy Example: Ways to Treat Broken Toes, Optimism
True (unknown) parameters for each arm (action) are:
surgery: Q(a_1) = θ_1 = 0.95
buddy taping: Q(a_2) = θ_2 = 0.9
doing nothing: Q(a_3) = θ_3 = 0.1
UCB1 (Auer, Cesa-Bianchi, Fischer 2002):
1. Sample each arm once
Take action a_1 (r ∼ Bernoulli(0.95)), get +1, Q(a_1) = 1
Take action a_2 (r ∼ Bernoulli(0.90)), get +1, Q(a_2) = 1
Take action a_3 (r ∼ Bernoulli(0.1)), get 0, Q(a_3) = 0
2. Set t = 3, compute the upper confidence bound on each action:
ucb(a) = Q(a) + √(2 ln t / N_t(a))
Toy Example: Ways to Treat Broken Toes, Optimism
True (unknown) parameters for each arm (action) are:
surgery: Q(a_1) = θ_1 = 0.95
buddy taping: Q(a_2) = θ_2 = 0.9
doing nothing: Q(a_3) = θ_3 = 0.1
UCB1 (Auer, Cesa-Bianchi, Fischer 2002):
1. Sample each arm once
Take action a_1 (r ∼ Bernoulli(0.95)), get +1, Q(a_1) = 1
Take action a_2 (r ∼ Bernoulli(0.90)), get +1, Q(a_2) = 1
Take action a_3 (r ∼ Bernoulli(0.1)), get 0, Q(a_3) = 0
2. Set t = 3, compute the upper confidence bound on each action:
ucb(a) = Q(a) + √(2 ln t / N_t(a))
3. t = 3, select action a_t = argmax_a ucb(a)
4. Observe reward 1
5. Compute the upper confidence bound on each action
Toy Example: Ways to Treat Broken Toes, Optimism
True (unknown) parameters for each arm (action) are:
surgery: Q(a_1) = θ_1 = 0.95
buddy taping: Q(a_2) = θ_2 = 0.9
doing nothing: Q(a_3) = θ_3 = 0.1
UCB1 (Auer, Cesa-Bianchi, Fischer 2002):
1. Sample each arm once
Take action a_1 (r ∼ Bernoulli(0.95)), get +1, Q(a_1) = 1
Take action a_2 (r ∼ Bernoulli(0.90)), get +1, Q(a_2) = 1
Take action a_3 (r ∼ Bernoulli(0.1)), get 0, Q(a_3) = 0
2. Set t = 3, compute the upper confidence bound on each action:
ucb(a) = Q(a) + √(2 ln t / N_t(a))
3. t = t + 1, select action a_t = argmax_a ucb(a)
4. Observe reward 1
5. Compute the upper confidence bound on each action
Toy Example: Ways to Treat Broken Toes, Optimism, Assessing Regret
True (unknown) parameters for each arm (action) are:
surgery: Q(a_1) = θ_1 = 0.95
buddy taping: Q(a_2) = θ_2 = 0.9
doing nothing: Q(a_3) = θ_3 = 0.1
UCB1 (Auer, Cesa-Bianchi, Fischer 2002):
Action | Optimal Action | Regret
a_1 | a_1 |
a_2 | a_1 |
a_3 | a_1 |
a_1 | a_1 |
a_2 | a_1 |
Check Your Understanding
An alternative would be to always select the arm with the highest lower bound
Why can this yield linear regret?
Consider a two arm case for simplicity
Table of Contents
1 Metrics for evaluating RL algorithms
2 Principles for RL Exploration
3 Probability Matching
4 Information State Search
5 MDPs
6 Principles for RL Exploration
7 Metrics for evaluating RL algorithms
Probability Matching
Assume we have a parametric distribution over rewards for each arm
Probability matching selects action a according to the probability that a is the optimal action:
π(a | h_t) = P[Q(a) > Q(a′), ∀a′ ≠ a | h_t]
Probability matching is optimistic in the face of uncertainty:
Uncertain actions have a higher probability of being the max
Can be difficult to compute analytically from the posterior
Thompson sampling implements probability matching
Thompson sampling:
π(a | h_t) = P[Q(a) > Q(a′), ∀a′ ≠ a | h_t]
          = E_{R|h_t}[1(a = argmax_{a∈A} Q(a))]
Thompson sampling implements probability matching
Thompson sampling:
π(a | h_t) = P[Q(a) > Q(a′), ∀a′ ≠ a | h_t]
          = E_{R|h_t}[1(a = argmax_{a∈A} Q(a))]
Use Bayes' rule to compute the posterior distribution p[R | h_t]
Sample a reward distribution R from the posterior
Compute the action-value function Q(a) = E[R_a]
Select the action maximizing value on the sample, a_t = argmax_{a∈A} Q(a)
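The four steps above can be sketched for Bernoulli arms with Beta priors (the conjugate case used in the toy example that follows). A minimal illustration; the arm means and horizon are made up. Note it uses the standard convention where α counts observed 1s and β counts observed 0s (conventions vary across these slides).

```python
import random

def thompson(means, horizon, seed=0):
    """Thompson sampling for Bernoulli arms with Beta(1,1) priors.
    Convention here: alpha counts observed 1s, beta counts observed 0s."""
    rng = random.Random(seed)
    n = len(means)
    alpha = [1] * n
    beta = [1] * n
    pulls = [0] * n
    for _ in range(horizon):
        # Sample a parameter per arm from its posterior (Bayes' rule is
        # implicit in the Beta counts), then act greedily on the sample.
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(n)]
        a = theta.index(max(theta))
        # Observe a Bernoulli reward and update the pulled arm's posterior.
        r = 1 if rng.random() < means[a] else 0
        alpha[a] += r
        beta[a] += 1 - r
        pulls[a] += 1
    return pulls

pulls = thompson([0.95, 0.9, 0.1], horizon=500)
```

As the posterior for the bad arm concentrates near 0.1, its samples almost never win the argmax, so it is pulled rarely; the two near-optimal arms share the remaining pulls.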
Thompson sampling implements probability matching
Thompson sampling achieves the Lai and Robbins lower bound
As last checked, known bounds for optimism are tighter than for Thompson sampling
But empirically Thompson sampling can be extremely effective
Thompson Sampling for News Article Recommendation(Chapelle and Li, 2010)
Contextual bandit: an input context impacts the reward of each arm; the context is sampled i.i.d. each step
Arms = articles
Reward = click (+1) on article (Q(a)=click through rate)
Toy Example: Ways to Treat Broken Toes, Thompson Sampling
True (unknown) Bernoulli parameters for each arm/action:
Surgery: θ_1 = 0.95 / Taping: θ_2 = 0.9 / Nothing: θ_3 = 0.1
Thompson sampling:
Place a prior over each arm's parameter. Here choose Beta(1,1) (uniform)
1. Sample a Bernoulli parameter for each arm given its current prior, Beta(1,1), Beta(1,1), Beta(1,1):
Toy Example: Ways to Treat Broken Toes, Thompson Sampling
True (unknown) Bernoulli parameters for each arm/action:
Surgery: θ_1 = 0.95 / Taping: θ_2 = 0.9 / Nothing: θ_3 = 0.1
Thompson sampling:
Place a prior over each arm's parameter. Here choose Beta(1,1)
1. Sample a Bernoulli parameter for each arm given its current prior, Beta(1,1), Beta(1,1), Beta(1,1): 0.3, 0.5, 0.6
2. Select a_t = argmax_{a∈A} Q(a) = argmax_{a∈A} θ(a) = a_3
Toy Example: Ways to Treat Broken Toes, Thompson Sampling
True (unknown) Bernoulli parameters for each arm/action:
Surgery: θ_1 = 0.95 / Taping: θ_2 = 0.9 / Nothing: θ_3 = 0.1
Thompson sampling:
Place a prior over each arm's parameter. Here choose Beta(1,1)
1. Sample a Bernoulli parameter for each arm given its current prior: 0.3, 0.5, 0.6
2. Select a_t = argmax_{a∈A} Q(a) = argmax_{a∈A} θ(a) = a_3
3. Observe the patient's outcome: 0
4. Update the posterior over Q(a_t) = Q(a_3), the value of the arm pulled
Toy Example: Ways to Treat Broken Toes, Thompson Sampling
True (unknown) Bernoulli parameters for each arm/action:
Surgery: θ_1 = 0.95 / Taping: θ_2 = 0.9 / Nothing: θ_3 = 0.1
Thompson sampling:
Place a prior over each arm's parameter. Here choose Beta(1,1)
1. Sample a Bernoulli parameter for each arm given its current prior, Beta(1,1), Beta(1,1), Beta(1,1): 0.3, 0.5, 0.6
2. Select a_t = argmax_{a∈A} Q(a) = argmax_{a∈A} θ(a) = a_3
3. Observe the patient's outcome: 0
4. Update the posterior over Q(a_t) = Q(a_3), the value of the arm pulled
Beta(c_1, c_2) is the conjugate distribution for the Bernoulli: if we observe 1, c_1 ← c_1 + 1; if we observe 0, c_2 ← c_2 + 1
5. New posterior over the Q value for the arm pulled:
6. New posterior p(Q(a_3)) = p(θ(a_3)) = Beta(1, 2)
Toy Example: Ways to Treat Broken Toes, Thompson Sampling
True (unknown) Bernoulli parameters for each arm/action:
Surgery: θ_1 = 0.95 / Taping: θ_2 = 0.9 / Nothing: θ_3 = 0.1
Thompson sampling:
Place a prior over each arm's parameter. Here choose Beta(1,1)
1. Sample a Bernoulli parameter for each arm given its current prior, Beta(1,1), Beta(1,1), Beta(1,1): 0.3, 0.5, 0.6
2. Select a_t = argmax_{a∈A} Q(a) = argmax_{a∈A} θ(a) = a_3
3. Observe the patient's outcome: 0
4. New posterior p(Q(a_3)) = p(θ(a_3)) = Beta(1, 2)
Toy Example: Ways to Treat Broken Toes, Thompson Sampling
True (unknown) Bernoulli parameters for each arm/action:
Surgery: θ_1 = 0.95 / Taping: θ_2 = 0.9 / Nothing: θ_3 = 0.1
Thompson sampling:
Place a prior over each arm's parameter. Here choose Beta(1,1)
1. Sample a Bernoulli parameter for each arm given its current prior, Beta(1,1), Beta(1,1), Beta(1,2): 0.7, 0.5, 0.3
Toy Example: Ways to Treat Broken Toes, Thompson Sampling
True (unknown) Bernoulli parameters for each arm/action:
Surgery: θ_1 = 0.95 / Taping: θ_2 = 0.9 / Nothing: θ_3 = 0.1
Thompson sampling:
Place a prior over each arm's parameter. Here choose Beta(1,1)
1. Sample a Bernoulli parameter for each arm given its current prior, Beta(1,1), Beta(1,1), Beta(1,2): 0.7, 0.5, 0.3
2. Select a_t = argmax_{a∈A} Q(a) = argmax_{a∈A} θ(a) = a_1
3. Observe the patient's outcome: 1
4. New posterior p(Q(a_1)) = p(θ(a_1)) = Beta(2, 1)
Toy Example: Ways to Treat Broken Toes, Thompson Sampling
True (unknown) Bernoulli parameters for each arm/action:
Surgery: θ_1 = 0.95 / Taping: θ_2 = 0.9 / Nothing: θ_3 = 0.1
Thompson sampling:
Place a prior over each arm's parameter. Here choose Beta(1,1)
1. Sample a Bernoulli parameter for each arm given its current prior, Beta(2,1), Beta(1,1), Beta(1,2): 0.71, 0.65, 0.1
2. Select a_t = argmax_{a∈A} Q(a) = argmax_{a∈A} θ(a) = a_1
3. Observe the patient's outcome: 1
4. New posterior p(Q(a_1)) = p(θ(a_1)) = Beta(3, 1)
Toy Example: Ways to Treat Broken Toes, Thompson Sampling
True (unknown) Bernoulli parameters for each arm/action:
Surgery: θ_1 = 0.95 / Taping: θ_2 = 0.9 / Nothing: θ_3 = 0.1
Thompson sampling:
Place a prior over each arm's parameter. Here choose Beta(1,1)
1. Sample a Bernoulli parameter for each arm given its current prior, Beta(3,1), Beta(1,1), Beta(1,2): 0.75, 0.45, 0.4
2. Select a_t = argmax_{a∈A} Q(a) = argmax_{a∈A} θ(a) = a_1
3. Observe the patient's outcome: 1
4. New posterior p(Q(a_1)) = p(θ(a_1)) = Beta(4, 1)
Toy Example: Ways to Treat Broken Toes, Thompson Sampling vs Optimism
Surgery: θ_1 = 0.95 / Taping: θ_2 = 0.9 / Nothing: θ_3 = 0.1
How does the sequence of arm pulls compare in this example so far?
Optimism | TS | Optimal | Regret Optimism | Regret TS
a_1 | a_3 | | |
a_2 | a_1 | | |
a_3 | a_1 | | |
a_1 | a_1 | | |
a_2 | a_1 | | |
Toy Example: Ways to Treat Broken Toes, Thompson Sampling vs Optimism
Surgery: θ_1 = 0.95 / Taping: θ_2 = 0.9 / Nothing: θ_3 = 0.1
Incurred regret?
Optimism | TS | Optimal | Regret Optimism | Regret TS
a_1 | a_3 | a_1 | 0 | 0.85
a_2 | a_1 | a_1 | 0.05 | 0
a_3 | a_1 | a_1 | 0.85 | 0
a_1 | a_1 | a_1 | 0 | 0
a_2 | a_1 | a_1 | 0.05 | 0
Alternate Metric: Probably Approximately Correct
Theoretical regret bounds specify how regret grows with T
Could be making lots of little mistakes or infrequent large ones
May care about bounding the number of non-small errors
More formally, probably approximately correct (PAC) results state that the algorithm will choose an action a whose value is ε-optimal (Q(a) ≥ Q(a*) − ε) with probability at least 1 − δ on all but a polynomial number of steps
Polynomial in the problem parameters (# actions, 1/ε, 1/δ, etc.)
There exist PAC algorithms based on optimism or Thompson sampling
Toy Example: Probably Approximately Correct and Regret
Surgery: θ1 = .95 / Taping: θ2 = .9 / Nothing: θ3 = .1
Let ε = 0.05.
O = Optimism, TS = Thompson Sampling; W/in ε = 1(Q(a_t) ≥ Q(a*) − ε)
O | TS | Optimal | O Regret | O W/in ε | TS Regret | TS W/in ε
a_1 | a_3 | a_1 | 0 | | 0.85 |
a_2 | a_1 | a_1 | 0.05 | | 0 |
a_3 | a_1 | a_1 | 0.85 | | 0 |
a_1 | a_1 | a_1 | 0 | | 0 |
a_2 | a_1 | a_1 | 0.05 | | 0 |
Toy Example: Probably Approximately Correct and Regret
Surgery: θ1 = .95 / Taping: θ2 = .9 / Nothing: θ3 = .1
Let ε = 0.05.
O = Optimism, TS = Thompson Sampling; W/in ε = 1(Q(a_t) ≥ Q(a*) − ε)
O | TS | Optimal | O Regret | O W/in ε | TS Regret | TS W/in ε
a_1 | a_3 | a_1 | 0 | Y | 0.85 | N
a_2 | a_1 | a_1 | 0.05 | Y | 0 | Y
a_3 | a_1 | a_1 | 0.85 | N | 0 | Y
a_1 | a_1 | a_1 | 0 | Y | 0 | Y
a_2 | a_1 | a_1 | 0.05 | Y | 0 | Y
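The regret and within-ε columns above can be recomputed mechanically. A small sketch; the action sequences are the ones from the toy example, and the helper name is ours.

```python
def regret_rows(actions, q, eps):
    """Per-step regret V* - Q(a_t) and the epsilon-optimality indicator
    1(Q(a_t) >= Q(a*) - eps) for a sequence of pulled arms."""
    v_star = max(q.values())
    return [(a, round(v_star - q[a], 2), q[a] >= v_star - eps) for a in actions]

q = {"a1": 0.95, "a2": 0.9, "a3": 0.1}   # true arm values
optimism = ["a1", "a2", "a3", "a1", "a2"]  # UCB1's pulls in the example
ts = ["a3", "a1", "a1", "a1", "a1"]        # Thompson sampling's pulls
opt_rows = regret_rows(optimism, q, eps=0.05)
ts_rows = regret_rows(ts, q, eps=0.05)
# e.g. opt_rows[2] is ('a3', 0.85, False): regret 0.85, not within epsilon
```

This makes the contrast concrete: with ε = 0.05, optimism's one pull of a_3 is its only non-ε-optimal step, while Thompson sampling's single mistake happens on the very first pull.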
Table of Contents
1 Metrics for evaluating RL algorithms
2 Principles for RL Exploration
3 Probability Matching
4 Information State Search
5 MDPs
6 Principles for RL Exploration
7 Metrics for evaluating RL algorithms
Relevant Background: Value of Information
Exploration is useful because it gains information
Can we quantify the value of information (VOI)?
How much reward a decision-maker would be prepared to pay in order to have that information, prior to making a decision
Long-term reward after getting the information − immediate reward without it
Relevant Background: Value of Information Example
Consider bandit where only get to make a single decision
An oil company is considering buying the rights to drill in 1 of 5 locations
1 of the locations contains $10 million worth of oil; the others contain none
Cost of buying the rights to drill is $2 million
A seismologist says that, for a fee, she will survey one of the 5 locations and report back definitively whether that location does or does not contain oil
What should one consider paying the seismologist?
Relevant Background: Value of Information Example
1 of the 5 locations contains $10 million worth of oil; the others contain none
Cost of buying the rights to drill is $2 million
The seismologist will, for a fee, survey one location and report back definitively whether it does or does not contain oil
Value of information: expected profit if we ask the seismologist minus expected profit if we don't ask
Expected profit if we don't ask (guess at random):
(1/5)(10 − 2) + (4/5)(0 − 2) = 0
Relevant Background: Value of Information Example
1 of the 5 locations contains $10 million worth of oil; the others contain none
Cost of buying the rights to drill is $2 million
The seismologist will, for a fee, survey one location and report back definitively whether it does or does not contain oil
Value of information: expected profit if we ask the seismologist minus expected profit if we don't ask
Expected profit if we don't ask (guess at random):
(1/5)(10 − 2) + (4/5)(0 − 2) = 0
Expected profit if we ask:
If the surveyed location has oil, expected profit is: 10 − 2 = 8
If the surveyed location doesn't have oil, expected profit (guess at random from the other 4 locations): (1/4)(10 − 2) + (3/4)(−2) = 0.5
Weigh by the probability we survey the location with oil: (1/5)(8) + (4/5)(0.5) = 2
VOI: 2 − 0 = 2
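The arithmetic above is easy to check in code; the following is a direct transcription of the slide's numbers.

```python
oil, cost, n = 10.0, 2.0, 5   # $M of oil, $M for drilling rights, number of sites

# Expected profit without the survey: drill one of the 5 sites at random.
no_survey = (1 / n) * (oil - cost) + ((n - 1) / n) * (0 - cost)   # = 0

# Expected profit with a survey of one random site:
hit = oil - cost                                   # surveyed site has oil: 8
miss = (1 / (n - 1)) * (oil - cost) + ((n - 2) / (n - 1)) * (0 - cost)
# miss = (1/4)(8) + (3/4)(-2) = 0.5: guess among the remaining 4 sites
survey = (1 / n) * hit + ((n - 1) / n) * miss      # = (1/5)(8) + (4/5)(0.5) = 2

voi = survey - no_survey                           # value of information = 2
```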
Relevant Background: Value of Information
Back to making a sequence of decisions under uncertainty
Information gain is higher in uncertain situations
But we need to consider the value of that information:
Would it change our decisions?
What is the expected utility benefit?
Information State Space
So far we have viewed bandits as a simple fully observable Markov decision process (where actions don't impact the next state)
Beautiful idea: frame bandits as a partially observable Markov decision process where the hidden state is the mean reward of each arm
Information State Space
So far we have viewed bandits as a simple fully observable Markov decision process (where actions don't impact the next state)
Beautiful idea: frame bandits as a partially observable Markov decision process where the hidden state is the mean reward of each arm
The (hidden) state is static
Actions are the same as before: pulling an arm
Observations: a sample from the reward model given the hidden state
POMDP planning = optimal bandit learning
Information State Space
The POMDP belief state / information state s̃ is the posterior over the hidden parameters (e.g. the mean reward of each arm)
s̃ is a statistic of the history, s̃ = f(h_t)
Each action a causes a transition to a new information state s̃′ (by adding information), with probability P̃^a_{s̃,s̃′}
Equivalent to a POMDP
Or an MDP M̃ = (S̃, A, P̃, R̃, γ) in the augmented information state space
Bernoulli Bandits
Consider a Bernoulli bandit such that R^a = B(µ_a)
e.g. win or lose a game with probability µ_a
Want to find which arm has the highest µ_a
The information state is s̃ = (α, β), where:
α_a counts the pulls of arm a where the reward was 0
β_a counts the pulls of arm a where the reward was 1
Solving Information State Space Bandits
We now have an infinite MDP over information states
This MDP can be solved by reinforcement learning
Model-free reinforcement learning (e.g. Q-learning)
Bayesian model-based RL (e.g. Gittins indices)
This approach is known as Bayes-adaptive RL: it finds the Bayes-optimal exploration/exploitation trade-off with respect to the prior distribution
In other words, it selects actions that maximize expected reward given the information we have so far
Check your understanding: can an algorithm that optimally solves an information state bandit have non-zero regret? Why or why not?
Bayes-Adaptive Bernoulli Bandits
Start with a Beta(α_a, β_a) prior over the reward function R^a
Each time a is selected, update the posterior for R^a:
Beta(α_a + 1, β_a) if r = 0
Beta(α_a, β_a + 1) if r = 1
This defines the transition function P̃ for the Bayes-adaptive MDP
The information state (α, β) corresponds to the reward model Beta(α, β)
Each state transition corresponds to a Bayesian model update
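One Bayes-adaptive transition is just a count update. A tiny sketch following this slide's convention (α_a counts r = 0 observations, β_a counts r = 1); the function name is ours.

```python
def bayes_adaptive_step(alpha, beta, arm, reward):
    """Deterministic information-state transition (alpha, beta) -> (alpha', beta')
    for the pulled arm. Slide convention: alpha counts r=0, beta counts r=1."""
    alpha, beta = list(alpha), list(beta)
    if reward == 0:
        alpha[arm] += 1   # Beta(alpha_a + 1, beta_a) if r = 0
    else:
        beta[arm] += 1    # Beta(alpha_a, beta_a + 1) if r = 1
    return tuple(alpha), tuple(beta)

state = ((1, 1), (1, 1))                       # Beta(1,1) priors on two arms
state = bayes_adaptive_step(*state, arm=0, reward=1)
state = bayes_adaptive_step(*state, arm=0, reward=0)
# state is now ((2, 1), (2, 1)): arm 0 has seen one 0 and one 1
```

Because the transition is a deterministic function of (state, action, reward), the randomness of the Bayes-adaptive MDP lives entirely in which reward is observed.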
Gittins Indices for Bernoulli Bandits
The Bayes-adaptive MDP can be solved by dynamic programming
The solution is known as the Gittins index
An exact solution to the Bayes-adaptive MDP is typically intractable; the information state space is too large
Recent idea: apply simulation-based search (Guez et al. 2012, 2013)
Forward search in the information state space, using simulations from the current information state
Table of Contents
1 Metrics for evaluating RL algorithms
2 Principles for RL Exploration
3 Probability Matching
4 Information State Search
5 MDPs
6 Principles for RL Exploration
7 Metrics for evaluating RL algorithms
Principles for Strategic Exploration
The same principles for exploration/exploitation apply to MDPs:
Naive Exploration
Optimistic Initialization
Optimism in the Face of Uncertainty
Probability Matching
Information State Search
Optimistic Initialization: Model-Free RL
Initialize the action-value function Q(s, a) to r_max / (1 − γ)
Run your favorite model-free RL algorithm:
Monte-Carlo control
Sarsa
Q-learning
etc.
Encourages systematic exploration of states and actions
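A minimal sketch of optimistic initialization with tabular Q-learning. The environment interface `step(s, a, rng) -> (reward, next_state, done)`, the toy environment, and all hyperparameters are our own assumptions, not from the slides; greedy action selection suffices here because the optimistic values themselves drive exploration.

```python
import random

def q_learning_optimistic(n_states, n_actions, step, r_max, gamma,
                          episodes, horizon, lr=0.1, seed=0):
    """Tabular Q-learning with every Q(s, a) initialized to r_max / (1 - gamma)."""
    rng = random.Random(seed)
    Q = [[r_max / (1 - gamma)] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            a = max(range(n_actions), key=lambda i: Q[s][i])  # greedy on optimistic Q
            r, s2, done = step(s, a, rng)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += lr * (target - Q[s][a])
            if done:
                break
            s = s2
    return Q

# Hypothetical one-state environment: action 1 pays 1, action 0 pays 0.
def bandit_step(s, a, rng):
    return (1.0 if a == 1 else 0.0, 0, True)

Q = q_learning_optimistic(1, 2, bandit_step, r_max=1.0, gamma=0.9,
                          episodes=2000, horizon=1)
```

Both actions start at the optimistic value 10, so both get tried; Q(0, 1) then settles near its true value of 1 while the worse action is abandoned once its estimate falls below.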
Optimistic Initialization: Model-Based RL
Construct an optimistic model of the MDP
Initialize transitions to go to terminal state with rmax reward
Solve optimistic MDP by favorite planning algorithm
Encourages systematic exploration of states and actions
e.g. RMax algorithm (Brafman and Tennenholtz)
UCB: Model-Based RL
Maximize the UCB on the action-value function Q^π(s, a):
a_t = argmax_{a∈A} [Q(s_t, a) + U(s_t, a)]
Estimates uncertainty in policy evaluation (easy)
Ignores uncertainty from policy improvement
Maximize the UCB on the optimal action-value function Q*(s, a):
a_t = argmax_{a∈A} [Q(s_t, a) + U_1(s_t, a) + U_2(s_t, a)]
Estimates uncertainty in policy evaluation (easy), plus uncertainty from policy improvement (hard)
Bayesian Model-Based RL
Maintain posterior distribution over MDP models
Estimate both transitions and rewards, p[P, R | h_t], where h_t = (s_1, a_1, r_1, . . . , s_t) is the history
Use the posterior to guide exploration:
Upper confidence bounds (Bayesian UCB)
Probability matching (Thompson sampling)
Thompson Sampling: Model-Based RL
Thompson sampling implements probability matching
π(s, a | h_t) = P[Q(s, a) > Q(s, a′), ∀a′ ≠ a | h_t]
             = E_{P,R|h_t}[1(a = argmax_{a∈A} Q(s, a))]
Use Bayes' rule to compute the posterior distribution p[P, R | h_t]
Sample an MDP (P, R) from the posterior
Solve the MDP using your favorite planning algorithm to get Q*(s, a)
Select the optimal action for the sampled MDP, a_t = argmax_{a∈A} Q*(s_t, a)
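The loop above is posterior-sampling RL (often called PSRL). A minimal sketch for a tiny two-state MDP with known rewards and Dirichlet posteriors over transitions; the toy environment, shapes, and helper names are all illustrative assumptions.

```python
import random

def sample_dirichlet(rng, counts):
    # A Dirichlet sample via normalized Gamma draws.
    g = [rng.gammavariate(c, 1.0) for c in counts]
    s = sum(g)
    return [x / s for x in g]

def value_iteration(P, R, gamma, iters=100):
    nS, nA = len(R), len(R[0])
    V = [0.0] * nS
    for _ in range(iters):
        V = [max(R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(nS))
                 for a in range(nA)) for s in range(nS)]
    return V

def psrl_episode(rng, counts, R, gamma, env_step, s0, horizon):
    """One PSRL episode: sample a model from the Dirichlet posterior,
    plan in it, act greedily, and update the posterior counts."""
    nS, nA = len(R), len(R[0])
    P = [[sample_dirichlet(rng, counts[s][a]) for a in range(nA)]
         for s in range(nS)]
    V = value_iteration(P, R, gamma)
    s = s0
    for _ in range(horizon):
        a = max(range(nA), key=lambda i: R[s][i] +
                gamma * sum(P[s][i][s2] * V[s2] for s2 in range(nS)))
        s2 = env_step(s, a, rng)
        counts[s][a][s2] += 1      # Bayesian posterior update
        s = s2

# Toy deterministic environment: action 1 moves to state 1, action 0 stays put.
# Known rewards: being in state 1 pays 1.
R = [[0.0, 0.0], [1.0, 1.0]]
def env_step(s, a, rng):
    return 1 if a == 1 else s

counts = [[[1, 1] for _ in range(2)] for _ in range(2)]   # Dirichlet(1,1) priors
rng = random.Random(0)
for _ in range(100):
    psrl_episode(rng, counts, R, gamma=0.9, env_step=env_step, s0=0, horizon=5)
```

Early on, the sampled models are diverse, so the agent tries both actions; as counts accumulate, the sampled models concentrate on the truth and the greedy plan heads to the rewarding state.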
Information State Search in MDPs
MDPs can be augmented to include an information state
Now the augmented state is (s, s̃):
where s is the original state within the MDP
and s̃ is a statistic of the history (accumulated information)
Each action a causes a transition:
to a new state s′ with probability P^a_{s,s′}
to a new information state s̃′
Defines an MDP M̃ in the augmented information state space
Bayes Adaptive MDP
The posterior distribution over the MDP model is an information state:
s̃_t = P[P, R | h_t]
The augmented MDP over (s, s̃) is called the Bayes-adaptive MDP
Solve this MDP to find the optimal exploration/exploitation trade-off (with respect to the prior)
However, the Bayes-adaptive MDP is typically enormous
Simulation-based search has proven effective (Guez et al. 2012, 2013)
Table of Contents
1 Metrics for evaluating RL algorithms
2 Principles for RL Exploration
3 Probability Matching
4 Information State Search
5 MDPs
6 Principles for RL Exploration
7 Metrics for evaluating RL algorithms
Principles
Naive Exploration
Add noise to greedy policy (e.g. ε-greedy)
Optimistic Initialization
Assume the best until proven otherwise
Optimism in the Face of Uncertainty
Prefer actions with uncertain values
Probability Matching
Select actions according to probability they are best
Information State Search
Lookahead search incorporating value of information
Generalization and Strategic Exploration
Active area of ongoing research: combining generalization and strategic exploration
Many approaches are grounded in the principles outlined here
Some examples:
Optimism under uncertainty: Bellemare et al. NIPS 2016; Ostrovski et al. ICML 2017; Tang et al. NIPS 2017
Probability matching: Osband et al. NIPS 2016; Mandel et al. IJCAI 2016
Table of Contents
1 Metrics for evaluating RL algorithms
2 Principles for RL Exploration
3 Probability Matching
4 Information State Search
5 MDPs
6 Principles for RL Exploration
7 Metrics for evaluating RL algorithms
Performance Criteria of RL Algorithms
Empirical performance
Convergence (to something ...)
Asymptotic convergence to optimal policy
Finite sample guarantees: probably approximately correct
Regret (with respect to optimal decisions)
Optimal decisions given the information available
PAC uniform (Dann, Lattimore, Brunskill NIPS 2017): a stronger criterion that directly provides both PAC and regret bounds
Summary: What You Are Expected to Know
Define the tension between exploration and exploitation in RL and why it does not arise in supervised or unsupervised learning
Be able to define and compare different criteria for "good" performance (empirical, convergence, asymptotic, regret, PAC)
Be able to map the algorithms discussed in detail in class to the performance criteria they satisfy