Reinforcement Learning Evaluative Feedback and Bandit Problems
Reinforcement Learning
Evaluative Feedback and Bandit Problems
Subramanian Ramamoorthy, School of Informatics
20 January 2012
![Page 2: Reinforcement Learning Evaluative Feedback and Bandit Problems](https://reader036.fdocuments.in/reader036/viewer/2022062422/56812fc9550346895d95491d/html5/thumbnails/2.jpg)
Recap: What is Reinforcement Learning?
• An approach to Artificial Intelligence
• Learning from interaction
• Goal-oriented learning
• Learning about, from, and while interacting with an external environment
• Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal
• Can be thought of as a stochastic optimization over time
Recap: The Setup for RL
Agent is:
• Temporally situated
• Continually learning and planning
• Objective is to affect the environment – actions and states
• Environment is uncertain, stochastic
[Diagram: agent-environment loop; the agent sends an action to the environment, which returns a state and a reward]
Recap: Key Features of RL
• Learner is not told which actions to take
• Trial-and-error search
• Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
• The need to explore and exploit
• Consider the whole problem of a goal-directed agent interacting with an uncertain environment
Multi-arm Bandits (MAB)
• n possible actions (arms)
• You can play for some period of time, and you want to maximize reward (expected utility)
Which is the best arm/machine?
DEMO
Real-Life Version
• Choose the best content to display to the next visitor of your commercial website
• Content options = slot machines
• Reward = user's response (e.g., click on an ad)
• Also, clinical trials: arm = treatment, reward = patient cured
• Simplifying assumption: no context (no visitor profiles). In practice, we want to solve contextual bandit problems, but that is for later discussion.
What is the Choice?
n-Armed Bandit Problem
• Choose repeatedly from one of n actions; each choice is called a play
• After each play $a_t$, you get a reward $r_t$, where $E[r_t \mid a_t = a] = Q^*(a)$. These $Q^*(a)$ are the unknown action values; the distribution of $r_t$ depends only on $a_t$.
• Objective is to maximize the reward in the long term, e.g., over 1000 plays
• To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them
Exploration/Exploitation Dilemma
• Suppose you form action-value estimates $Q_t(a) \approx Q^*(a)$
• The greedy action at time t is $a_t^* = \arg\max_a Q_t(a)$
– $a_t = a_t^*$: exploitation
– $a_t \neq a_t^*$: exploration
• You can't exploit all the time; you can't explore all the time
• You can never stop exploring; but you could reduce exploring. Why?
Action-Value Methods
• Methods that adapt action-value estimates and nothing else. E.g., suppose by the t-th play, action a has been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$; then

$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a} \qquad \text{("sample average")}$$

$$\lim_{k_a \to \infty} Q_t(a) = Q^*(a)$$
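As a concrete illustration, the sample-average estimate can be computed from a play history. This is a minimal sketch, not from the slides; the function name `sample_average_estimates` and the `(action, reward)` pair format are my own choices.

```python
def sample_average_estimates(history):
    """Sample-average action values: Q_t(a) = (r_1 + ... + r_ka) / ka,
    where ka is the number of times action a appears in the history."""
    sums, counts = {}, {}
    for action, reward in history:
        sums[action] = sums.get(action, 0.0) + reward
        counts[action] = counts.get(action, 0) + 1
    return {a: sums[a] / counts[a] for a in sums}
```

For example, `sample_average_estimates([(0, 1.0), (0, 0.0), (1, 2.0)])` returns `{0: 0.5, 1: 2.0}`.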
Remark
• The simple greedy action selection strategy: $a_t = \arg\max_a Q_t(a)$
• Why might this be insufficient?
• You are estimating, online, from a few samples. How will this behave?
DEMO
ε-Greedy Action Selection
• Greedy action selection: $a_t = a_t^* = \arg\max_a Q_t(a)$
• ε-Greedy:
$$a_t = \begin{cases} a_t^* & \text{with probability } 1-\varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}$$
. . . the simplest way to balance exploration and exploitation
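The selection rule above can be sketched in a few lines of Python (the function name `epsilon_greedy` and the list-of-estimates interface are my own assumptions, not from the slides):

```python
import random

def epsilon_greedy(Q, epsilon, rng=random):
    """Pick an action from estimates Q: greedy with probability 1 - epsilon,
    uniformly random with probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))                   # explore: random action
    return max(range(len(Q)), key=lambda a: Q[a])      # exploit: greedy action
```

With `epsilon = 0` this reduces to pure greedy selection, e.g. `epsilon_greedy([0.1, 0.9, 0.3], 0.0)` returns `1`.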
Worked Example: 10-Armed Testbed
• n = 10 possible actions
• Each $Q^*(a)$ is chosen randomly from a normal distribution: $N(0, 1)$
• Each reward $r_t$ is also normal: $N(Q^*(a_t), 1)$
• 1000 plays; repeat the whole thing 2000 times and average the results
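A testbed task of this kind is easy to generate. The sketch below follows the two distributions stated above; the helper names `make_testbed` and `pull` are mine:

```python
import random

def make_testbed(n=10, rng=random):
    """One bandit task: true action values Q*(a) drawn from N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def pull(q_star, action, rng=random):
    """One play: reward r_t drawn from N(Q*(a_t), 1)."""
    return rng.gauss(q_star[action], 1.0)
```

Averaging over 2000 independently generated tasks, as the slide describes, smooths out the noise in the learning curves.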
ε-Greedy Methods on the 10-Armed Testbed
Softmax Action Selection
• Softmax action selection methods grade action probabilities by estimated values.
• The most common softmax uses a Gibbs, or Boltzmann, distribution:
Choose action a on play t with probability
$$\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$$
where $\tau$ is a 'computational temperature'.
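The Gibbs/Boltzmann rule above can be implemented directly; this is a small sketch (the name `softmax_action` is mine), subtracting the maximum before exponentiating for numerical stability:

```python
import math, random

def softmax_action(Q, tau, rng=random):
    """Sample an action with probability exp(Q[a]/tau) / sum_b exp(Q[b]/tau)."""
    m = max(q / tau for q in Q)                      # shift for numerical stability
    weights = [math.exp(q / tau - m) for q in Q]
    probs = [w / sum(weights) for w in weights]
    return rng.choices(range(len(Q)), weights=probs)[0]
```

As $\tau \to 0$ the distribution concentrates on the greedy action; large $\tau$ approaches uniform random selection.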
Incremental Implementation
The average of the first k rewards (dropping the dependence on a) is:

$$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$$

This is the sample-average estimation method. How to do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:

$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[ r_{k+1} - Q_k \right]$$

This is an instance of a common form: NewEstimate = OldEstimate + StepSize [Target – OldEstimate]
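The incremental update can be checked against the batch average in a few lines (a sketch; `incremental_mean` is my own name):

```python
def incremental_mean(rewards):
    """Apply Q_{k+1} = Q_k + (1/(k+1)) * (r_{k+1} - Q_k) over a reward stream.
    The result equals the batch sample average, with O(1) memory."""
    Q, k = 0.0, 0
    for r in rewards:
        k += 1
        Q += (1.0 / k) * (r - Q)
    return Q
```

For example, `incremental_mean([1.0, 2.0, 3.0])` gives `2.0`, the same as the stored-rewards average.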
Tracking a Nonstationary Problem
Choosing $Q_k$ to be a sample average is appropriate in a stationary problem, i.e., when none of the $Q^*(a)$ change over time, but not in a nonstationary problem. Better in the nonstationary case is:

$$Q_{k+1} = Q_k + \alpha\left[ r_{k+1} - Q_k \right] \qquad \text{for constant } \alpha,\ 0 < \alpha \le 1$$

$$= (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i$$

an exponential, recency-weighted average.
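The constant-step-size update is the same loop with $\alpha$ fixed instead of $1/k$ (a sketch; `recency_weighted` is my own name):

```python
def recency_weighted(rewards, alpha, Q0=0.0):
    """Apply Q_{k+1} = Q_k + alpha * (r_{k+1} - Q_k) with constant alpha.
    Recent rewards get exponentially more weight than old ones."""
    Q = Q0
    for r in rewards:
        Q += alpha * (r - Q)
    return Q
```

For example, with `alpha = 0.5`, `recency_weighted([1.0, 1.0], 0.5)` gives `0.75`: the estimate tracks toward the recent rewards rather than averaging over all history equally.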
Optimistic Initial Values
• All methods so far depend on $Q_0(a)$, i.e., they are biased
• Encourage exploration: initialize the action values optimistically, e.g., on the 10-armed testbed, use $Q_0(a) = +5$ for all a
Beyond Counting…
An Interpretation of MAB Type Problems
[Figure: an interpretation of MAB-type problems, related to 'rewards']
MAB is a Special Case of Online Learning
How to Evaluate Online Alg.: Regret
• After you have played for T rounds, you experience a regret:
Regret = [Reward sum of optimal strategy] – [Sum of actual collected rewards]

$$\rho = T\mu^* - E\left[\sum_{t=1}^{T} \hat{r}_t\right], \qquad \mu^* = \max_k \mu_k$$

where the expectation is over the randomness in the draw of rewards and the player's strategy.
• If the average regret per round goes to zero with probability 1, asymptotically, we say the strategy has the no-regret property
~ guaranteed to converge to an optimal strategy
• ε-greedy is sub-optimal (so has some regret). Why?
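Given the true arm means and the rewards actually collected, the (empirical) regret is a one-liner. This sketch uses my own function name `regret` and computes $T\mu^* - \sum_t \hat r_t$ for one realized run rather than the expectation:

```python
def regret(mus, rewards):
    """rho = T * mu_star - sum of collected rewards, mu_star = max_k mu_k.
    mus: true mean reward of each arm; rewards: rewards actually received."""
    return len(rewards) * max(mus) - sum(rewards)
```

For example, with arm means `[0.5, 0.9]` and collected rewards `[0.9, 0.5]`, the regret is `2 * 0.9 - 1.4 = 0.4`.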
Interval Estimation
• Attribute to each arm an "optimistic initial estimate" within a certain confidence interval
• Greedily choose the arm with the highest optimistic mean (upper bound of the confidence interval)
• An infrequently observed arm will have an over-valued reward mean, leading to exploration
• Frequent usage pushes the optimistic estimate toward the true value
Interval Estimation Procedure
• Associate to each arm a 100(1−α)% reward mean upper bound
• Assume, e.g., rewards are normally distributed
• Arm is observed n times to yield empirical mean $\hat\mu$ & std dev $\hat\sigma$
• α-upper bound:

$$u_\alpha = \hat\mu + \frac{\hat\sigma}{\sqrt{n}}\, c^{-1}(1-\alpha), \qquad c(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} \exp\left(-\frac{x^2}{2}\right) dx$$

where $c$ is the cumulative distribution function of the standard normal.
• If α is carefully controlled, this could be made a zero-regret strategy; in general, we don't know how to do so.
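Under the normality assumption above, the upper bound is straightforward to compute with the standard library's inverse normal CDF (a sketch; the function name `upper_bound` is mine):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def upper_bound(samples, alpha):
    """100(1-alpha)% upper bound on an arm's mean, assuming normal rewards:
    u = mu_hat + (sigma_hat / sqrt(n)) * c^{-1}(1 - alpha)."""
    n = len(samples)
    mu_hat, sigma_hat = mean(samples), stdev(samples)
    z = NormalDist().inv_cdf(1.0 - alpha)   # c^{-1}(1 - alpha)
    return mu_hat + sigma_hat / sqrt(n) * z
```

With `alpha = 0.5` the quantile is 0, so the bound collapses to the empirical mean; smaller α gives a more optimistic bound.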
Variant: UCB Strategy
• Again based on the notion of an upper confidence bound, but more generally applicable
• Algorithm:
– Play each arm once
– At time t > K, play the arm $i_t$ maximizing

$$\bar{r}_j(t) + \sqrt{\frac{2 \ln t}{T_{j,t}}}$$

where $T_{j,t}$ is the number of times arm j has been played so far.
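The selection step of UCB can be sketched as follows (my own function name `ucb_choice`; it assumes each arm's count is already at least 1, i.e., the play-each-arm-once phase is done):

```python
import math

def ucb_choice(means, counts, t):
    """Play the arm maximizing r_bar_j + sqrt(2 * ln(t) / T_j),
    where T_j = counts[j] is how often arm j has been played so far."""
    scores = [m + math.sqrt(2.0 * math.log(t) / c)
              for m, c in zip(means, counts)]
    return max(range(len(means)), key=lambda j: scores[j])
```

Note how a rarely played arm (small `counts[j]`) gets a large exploration bonus even if its empirical mean is mediocre.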
UCB Strategy
Reminder: Chernoff-Hoeffding Bound
UCB Strategy – Behaviour
We will not try to prove the following result, but I quote it to tell you why UCB may be a desirable strategy: its regret is bounded.
K = number of arms
Variation on Softmax: Exp3
• It is possible to drive regret down by annealing the temperature $\tau$ in the softmax distribution
• Exp3: Exponential-weight algorithm for exploration and exploitation
• Probability of choosing arm k at time t is

$$P_k(t) = (1-\gamma)\,\frac{w_k(t)}{\sum_{j} w_j(t)} + \frac{\gamma}{K}$$

with weights updated as

$$w_j(t+1) = \begin{cases} w_j(t)\, \exp\left(\dfrac{\gamma\, \hat{r}_j(t)}{K\, P_j(t)}\right) & \text{if arm } j \text{ is pulled at } t \\ w_j(t) & \text{otherwise} \end{cases}$$

where $\gamma$ is a user-defined open parameter.
• Regret is $O(\sqrt{T K \log K})$
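The two Exp3 formulas translate directly into code. This is a minimal sketch (function names `exp3_probs` and `exp3_update` are mine; a full agent would loop over rounds, sample an arm from the probabilities, and feed the observed reward back in):

```python
import math

def exp3_probs(weights, gamma):
    """P_k = (1 - gamma) * w_k / sum_j w_j + gamma / K."""
    K, total = len(weights), sum(weights)
    return [(1.0 - gamma) * w / total + gamma / K for w in weights]

def exp3_update(weights, probs, arm, reward, gamma):
    """Multiplicative update for the pulled arm only:
    w_arm <- w_arm * exp(gamma * (reward / P_arm) / K)."""
    K = len(weights)
    new = list(weights)
    new[arm] *= math.exp(gamma * (reward / probs[arm]) / K)
    return new
```

Dividing the reward by `probs[arm]` gives an unbiased estimate of each arm's reward even though only the pulled arm is observed.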
The Gittins Index
• Each arm delivers reward with a probability
• This probability may change through time, but only when the arm is pulled
• Goal is to maximize discounted rewards; the future is discounted by an exponential discount factor $\beta$
• The structure of the problem is such that all you need to do is compute an "index" for each arm and play the one with the highest index
• Index is of the form:

$$\nu_i = \sup_{T \ge 0} \frac{E\left[\sum_{t=0}^{T} \beta^t R_i(t)\right]}{E\left[\sum_{t=0}^{T} \beta^t\right]}$$
Gittins Index – Intuition
• Proving optimality isn't within our scope, but it is based on the notion of a stopping time: the point where you should 'terminate' a bandit
• Nice property: the Gittins index for any given bandit is independent of the expected outcome of all other bandits
– Once you have a good arm, keep playing until there is a better one
– If you add/remove machines, the computation doesn't really change
• BUT:
– hard to compute, even when you know the distributions
– exploration issues; an arm isn't updated unless used (restless bandits?)
Numerous Applications!
Extending the MAB Model
• In this lecture, we are in a single casino and the only decision is to pull from a set of n arms
– except perhaps in the very last slides, exactly one state!
Next,
• What if there is more than one state?
• So, in this state space, what is the effect of the distribution of payout changing based on how you pull arms?
• What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)?
Acknowledgements
• Many slides are adapted from web resources associated with Sutton and Barto’s Reinforcement Learning book