Introduction to Multi-Armed Bandits: The Exploration-Exploitation Dilemma
Cathy Wu
6.246 Reinforcement Learning: Foundations and Methods
Apr 6, 2021
References
1. Alessandro Lazaric. Reinforcement Learning, Lecture 6. INRIA Lille, 2017.
2. Aleksandrs Slivkins. Introduction to Multi-Armed Bandits. 2019. Chapters 1-3, 8.
Outline
1. Basic Exploration Strategies: Explore-then-Commit • ε-greedy • Softmax
2. Advanced Strategies: Lower bounds • UCB • Thompson Sampling
3. Linear and Contextual Linear Bandit
Recall: Q-Learning
Proposition
If the learning rate satisfies the Robbins-Monro conditions in all state-action pairs $(s,a) \in S \times A$,
$$\sum_{t \ge 0} \alpha_t(s,a) = \infty, \qquad \sum_{t \ge 0} \alpha_t^2(s,a) < \infty,$$
and all state-action pairs are tried infinitely often, then for all $(s,a) \in S \times A$:
$$\hat{Q}(s,a) \xrightarrow{\ a.s.\ } Q^*(s,a)$$
Remark: "infinitely often" requires a steady exploration policy. (For example, the step size $\alpha_t(s,a) = 1/N_t(s,a)$, where $N_t(s,a)$ counts visits to $(s,a)$, satisfies both conditions.)
Learning the Optimal Policy
for i = 1, …, n do
  1. Set t = 0
  2. Set initial state s₀
  3. while (s_t not terminal)
     1) Take action $a_t = \arg\max_a \hat{Q}(s_t, a)$
     2) Observe next state s_{t+1} and reward r_t
     3) Compute the temporal difference (Q-learning)
        $$\delta_t = r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a') - \hat{Q}(s_t, a_t)$$
     4) Update the Q-function: $\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha(s_t, a_t)\,\delta_t$
     5) Set t = t + 1
  endwhile
endfor
No convergence: the greedy choice in step 1) never explores, so not all state-action pairs are tried infinitely often; actions must instead be taken according to a suitable exploration policy.
Learning the Optimal Policy
for i = 1, …, n do
  1. Set t = 0
  2. Set initial state s₀
  3. while (s_t not terminal)
     1) Take action $a_t \sim \bar\pi(\cdot)$ (a fixed exploration policy)
     2) Observe next state s_{t+1} and reward r_t
     3) Compute the temporal difference (Q-learning)
        $$\delta_t = r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a') - \hat{Q}(s_t, a_t)$$
     4) Update the Q-function: $\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha(s_t, a_t)\,\delta_t$
     5) Set t = t + 1
  endwhile
endfor
Bad convergence: a fixed exploration policy does try all pairs infinitely often, so $\hat{Q}$ converges, but the agent keeps selecting bad actions along the way. A minimal code sketch of this loop follows.
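To make the loop concrete, here is a minimal tabular Q-learning sketch with the exploration policy left pluggable; the `env` object and its `reset()`/`step()` interface are hypothetical stand-ins, not part of the slides.

import numpy as np

def q_learning(env, n_states, n_actions, explore, episodes=100, gamma=0.99):
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))          # visit counts for the step size
    for _ in range(episodes):
        s, done = env.reset(), False             # assumed: reset() -> state
        while not done:
            a = explore(Q, s)                    # pluggable exploration policy
            s_next, r, done = env.step(a)        # assumed: step() -> (s', r, done)
            N[s, a] += 1
            alpha = 1.0 / N[s, a]                # Robbins-Monro step size
            td = r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a]
            Q[s, a] += alpha * td                # Q-learning update
            s = s_next
    return Q

# The two failure modes from the slides:
greedy = lambda Q, s: int(np.argmax(Q[s]))            # no convergence
uniform = lambda Q, s: np.random.randint(Q.shape[1])  # converges, bad rewards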
From RL to Multi-armed Bandit
for i = 1, …, n do
  1. Set t = 0
  2. Set initial state s₀
  3. while (s_t not terminal)
     1) Take action a_t
     2) Observe next state s_{t+1} and reward r_t
  endwhile
endfor
From RL to Multi-armed Bandit
The problem
§ Set of K actions
§ Reward distribution ν(a) with μ(a) = 𝔼[r(a)] (bounded in [0, 1] for convenience)
The protocol
for t = 1, …, T do
  1. Take action a_t
  2. Observe reward r_t ~ ν(a_t)
endfor
The objective
§ Maximize the sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$
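A minimal sketch of this protocol, assuming Bernoulli arms as in the recommendation example that follows; the `select`/`update` policy interface is my convention, reused in the strategy sketches later in the deck.

import numpy as np

class BernoulliBandit:
    """K-armed bandit: arm a pays 1 with probability mu[a], else 0."""
    def __init__(self, mu, seed=0):
        self.mu = np.asarray(mu, dtype=float)
        self.rng = np.random.default_rng(seed)
    def pull(self, a):
        return float(self.rng.random() < self.mu[a])

def run(bandit, policy, T):
    """Play the protocol for T rounds and return the observed rewards."""
    rewards = np.empty(T)
    for t in range(1, T + 1):
        a = policy.select(t)      # take action a_t
        r = bandit.pull(a)        # observe reward r_t ~ nu(a_t)
        policy.update(a, r)       # update the policy's statistics
        rewards[t - 1] = r
    return rewards

For a known instance, T · max(μ) − Σ_t r_t gives an empirical estimate of the regret defined a few slides below.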
A Simple Recommendation System
§ A recommendation system (RS) can recommend different genres of movies (e.g. action, adventure, romance, animation)
§ Users arrive at random and no information about the user is available
§ The RS picks a genre to recommend to the user, but not the specific movie
§ The feedback is whether the user watched a movie of the recommended genre or not
§ Objective: design a RS that maximizes the number of movies watched in the recommended genre
RS as a Multi-armed Bandit
for t = 1, …, T do
  1. User arrives
  2. Recommend genre a_t
  3. Reward: r_t = 1 if the user watches a movie of genre a_t, 0 otherwise
endfor
RS as a Multi-armed Bandit
The model
§ ν(a) is a Bernoulli distribution
§ μ(a) = 𝔼[r(a)] is the probability that a random user watches a movie of genre a
§ Assumption: r_t ~ ν(a_t) is a realization of the Bernoulli of genre a_t
The objective
§ Maximize the sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$
Other Examples
§ Packet routing
§ Clinical trials
§ Web advertising
§ Health advice
§ Education
§ Computer games
§ Resource mining
§ …
The Regret
$$R_T = \max_a \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a)\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a_t)\Big]$$
The expectation summarizes every possible source of randomness (either in r or in the algorithm).
Relation to RL: can think of this as T trajectories, each of length 1.
The regret measures not only the final error, but all the mistakes made over T "iterations."
The Regret
§ Number of times action a has been selected after T rounds:
$$T_T(a) = \sum_{t=1}^{T} \mathbb{1}\{a_t = a\}$$
§ Gap: $\Delta(a) = \mu(a^*) - \mu(a)$
§ Regret:
$$\begin{aligned}
R_T &= \max_a \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a)\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a_t)\Big] \\
&= \max_a\, T\mu(a) - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a_t)\Big] \\
&= \max_a\, T\mu(a) - \sum_a \mathbb{E}[T_T(a)]\,\mu(a) \\
&= T\mu(a^*) - \sum_a \mathbb{E}[T_T(a)]\,\mu(a) \\
&= \sum_{a \ne a^*} \mathbb{E}[T_T(a)]\,\big(\mu(a^*) - \mu(a)\big) \\
&= \sum_{a \ne a^*} \mathbb{E}[T_T(a)]\,\Delta(a)
\end{aligned}$$
The Regret
$$R_T = \sum_{a \ne a^*} \mathbb{E}[T_T(a)]\,\Delta(a)$$
Ø We only need to study the expected number of times suboptimal actions are selected
Ø Worst possible case: R_T = O(T)
Ø A good algorithm has R_T = o(T), i.e. R_T / T → 0
The Exploration-Exploitation Dilemma
Problem 1: The environment does not reveal the rewards of the actions not selected by the learner
Ø The learner should gain information by repeatedly selecting all actions ⟹ exploration
Problem 2: Whenever the learner selects a bad action, it suffers some regret
Ø The learner should reduce the regret by repeatedly selecting the best action ⟹ exploitation
Challenge: The learner must solve two opposing problems at once: the exploration-exploitation dilemma!
Outline
1. Basic Exploration Strategies: Explore-then-Commit • ε-greedy • Softmax
2. Advanced Strategies: Lower bounds • UCB • Thompson Sampling
3. Linear and Contextual Linear Bandit
Explore-then-Commit
[Figure: a timeline of length T split into an explore phase (the first τ rounds) followed by an exploit phase (the remaining rounds).]
Co-designed w/ Pulkit Agrawal
Explore-then-Commit: Algorithm
Explore phase
for t = 1, …, τ do
  1. Take action a_t ~ 𝒰(A) (or round robin)
  2. Observe reward r_t ~ ν(a_t)
endfor
Compute statistics for each action a:
$$\hat\mu_\tau(a) = \frac{1}{T_\tau(a)} \sum_{t=1}^{\tau} r_t\, \mathbb{1}\{a_t = a\}$$
Exploit phase
for t = τ + 1, …, T do
  1. Take action $\hat{a}^* = \arg\max_a \hat\mu_\tau(a)$
  2. Observe reward $r_t \sim \nu(\hat{a}^*)$
endfor
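A sketch of Explore-then-Commit under the protocol above, using round-robin exploration and the `BernoulliBandit` from the earlier sketch:

import numpy as np

def explore_then_commit(bandit, K, tau, T):
    counts, sums = np.zeros(K), np.zeros(K)
    rewards = []
    for t in range(tau):                          # explore phase (round robin)
        a = t % K
        r = bandit.pull(a)
        counts[a] += 1
        sums[a] += r
        rewards.append(r)
    best = int(np.argmax(sums / np.maximum(counts, 1)))  # empirical best arm
    for t in range(tau, T):                       # exploit phase: commit to it
        rewards.append(bandit.pull(best))
    return np.array(rewards)

With the tuning from the theorem below, τ would be set on the order of T^(2/3).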
Explore-then-Commit: Regret
Theorem
If explore-then-commit is run with parameter τ for T steps, then it suffers a regret:
$$R_T \le \sum_{a \ne a^*} \left[ \frac{\tau}{K}\,\Delta(a) + 2\,(T - \tau - 1)\exp\!\big(-\tau\,\Delta(a)^2/2\big) \right]$$
§ Difficult to tune: τ should be adjusted depending on T and Δ(a)
§ Worst case w.r.t. Δ(a): $R_T = O(T^{2/3})$ (for $\tau = T^{2/3}$)
§ Recall: worst possible: $R_T = O(T)$
Explore-then-Commit: Regret Analysis
§ Regret decomposition
$$R_T = \sum_{t=1}^{\tau} \mathbb{E}\big[\mu(a^*) - \mu(a_t)\big] + \sum_{t=\tau+1}^{T} \mathbb{E}\big[\mu(a^*) - \mu(\hat{a}^*)\big]$$
§ During the explore phase
$$\sum_{t=1}^{\tau} \mathbb{E}\big[\mu(a^*) - \mu(a_t)\big] = \frac{\tau}{K} \sum_{a \ne a^*} \Delta(a)$$
§ During the exploit phase
$$\begin{aligned}
\sum_{t=\tau+1}^{T} \mathbb{E}\big[\mu(a^*) - \mu(\hat{a}^*)\big]
&= (T - \tau - 1) \sum_{a \ne a^*} \mathbb{P}(\hat{a}^* = a)\,\Delta(a) \\
&= (T - \tau - 1) \sum_{a \ne a^*} \mathbb{P}\big(\forall a' \colon \hat\mu_\tau(a) \ge \hat\mu_\tau(a')\big)\,\Delta(a) \\
&\le (T - \tau - 1) \sum_{a \ne a^*} \mathbb{P}\big(\hat\mu_\tau(a) \ge \hat\mu_\tau(a^*)\big)\,\Delta(a)
\end{aligned}$$
Recall:
$$\hat\mu_\tau(a) = \frac{1}{T_\tau(a)} \sum_{t=1}^{\tau} r_t\, \mathbb{1}\{a_t = a\}, \qquad \hat{a}^* = \arg\max_a \hat\mu_\tau(a)$$
Explore-then-Commit: Regret Analysis
Proposition (Hoeffding Inequality)
Let $X_t \in [a, b]$ be independent random variables with common mean $\mu = \mathbb{E}[X_t]$. Then, for all $\varepsilon > 0$:
$$\mathbb{P}\big( |\bar{x}_n - \mu| > \varepsilon \big) \le 2 \exp\!\left( -\frac{2 n \varepsilon^2}{(b-a)^2} \right)$$
where $\bar{x}_n = \frac{1}{n} \sum_{t=1}^{n} X_t$.
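As a quick sanity check of the bound, a Monte Carlo sketch with Bernoulli(1/2) variables (the particular n and ε are arbitrary choices of mine):

import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 100_000
samples = rng.integers(0, 2, size=(trials, n))    # X_t in [0, 1], mu = 0.5
deviations = np.abs(samples.mean(axis=1) - 0.5)
print("empirical P(|x_bar - mu| > eps):", (deviations > eps).mean())      # ~0.035
print("Hoeffding bound 2*exp(-2*n*eps^2):", 2 * np.exp(-2 * n * eps**2))  # ~0.271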
Explore-then-Commit: Regret Analysis
§ Probability of error
$$\begin{aligned}
\mathbb{P}\big(\hat\mu_\tau(a) \ge \hat\mu_\tau(a^*)\big)
&= \mathbb{P}\big(\hat\mu_\tau(a) - \mu(a) \ge \hat\mu_\tau(a^*) - \mu(a^*) + \Delta(a)\big) \\
&\le \mathbb{P}\Big(\hat\mu_\tau(a) - \mu(a) \ge \tfrac{\Delta(a)}{2}\Big) + \mathbb{P}\Big(\mu(a^*) - \hat\mu_\tau(a^*) \ge \tfrac{\Delta(a)}{2}\Big)
\end{aligned}$$
§ Apply the Hoeffding bound for random variables $r_t \in [0, 1]$:
$$\mathbb{P}\big(\hat\mu_\tau(a) \ge \hat\mu_\tau(a^*)\big) \le 2 \exp\big(-\tau\,\Delta(a)^2/2\big)$$
Outline
1. Basic Exploration Strategies: Explore-then-Commit • ε-greedy • Softmax
2. Advanced Strategies: Lower bounds • UCB • Thompson Sampling
3. Linear and Contextual Linear Bandit
ε-greedy: Algorithm
for t = 1, …, T do
  1. Take action
     $$a_t = \begin{cases} \text{random action} & \text{with probability } \varepsilon_t \text{ (explore)} \\ \arg\max_a \hat\mu_t(a) & \text{with probability } 1 - \varepsilon_t \text{ (exploit)} \end{cases}$$
  2. Observe reward r_t ~ ν(a_t)
  3. Update statistics for action a_t:
     $$N_t(a_t) = N_{t-1}(a_t) + 1, \qquad \hat\mu_t(a_t) = \frac{1}{N_t(a_t)} \sum_{s=1}^{t} r_s\, \mathbb{1}\{a_s = a_t\}$$
endfor
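A sketch of ε-greedy in the policy interface from the protocol sketch, using the decaying schedule ε_t = (K log t / t)^(1/3) from the theorem below; the clipping to [0, 1] is mine.

import numpy as np

class EpsilonGreedy:
    def __init__(self, K, seed=0):
        self.K = K
        self.counts = np.zeros(K)
        self.means = np.zeros(K)
        self.rng = np.random.default_rng(seed)
    def select(self, t):
        eps = min(1.0, (self.K * np.log(max(t, 2)) / t) ** (1 / 3))
        if self.rng.random() < eps:
            return int(self.rng.integers(self.K))   # explore: uniform action
        return int(np.argmax(self.means))           # exploit: empirical best
    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]  # running mean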
ε-greedy: Regret
Theorem
If ε-greedy is run with the schedule $\varepsilon_t = t^{-1/3} (K \log t)^{1/3}$, then at each round t it suffers a regret:
$$R_t \le \tilde{C}\, t^{2/3} (K \log t)^{1/3}$$
§ Same asymptotic regret as explore-then-commit, but it now holds at all rounds t
§ Can do better, but the optimal ε depends on knowledge of Δ (difficult to tune)
§ Keeps selecting very bad arms with some probability
§ Sharply separates exploration and exploitation
Types of exploration strategies
Non-adaptive exploration vs. adaptive exploration
§ Explore-then-commit: explore + exploit separately (non-adaptive)
§ ε-greedy: exploit + explore, with exploration agnostic to the exploitation statistics (non-adaptive)
§ Adaptive alternative: exploit + explore, with exploration based on the exploitation statistics (e.g., the strategies that follow)
Outline
1. Basic Exploration Strategies: Explore-then-Commit • ε-greedy • Softmax
2. Advanced Strategies: Lower bounds • UCB • Thompson Sampling
3. Linear and Contextual Linear Bandit
Soft-max (aka Exp3): Algorithm
for t = 1, …, T do
  1. Take action
     $$a_t \sim \frac{\exp\big(\hat\mu_t(a)/\eta\big)}{\sum_{a'} \exp\big(\hat\mu_t(a')/\eta\big)}$$
  2. Observe reward r_t ~ ν(a_t)
  3. Update statistics for action a_t:
     $$N_t(a_t) = N_{t-1}(a_t) + 1, \qquad \hat\mu_t(a_t) = \frac{1}{N_t(a_t)} \sum_{s=1}^{t} r_s\, \mathbb{1}\{a_s = a_t\}$$
endfor
§ More probability mass on better actions (arms)
§ Temperature η: large for exploration, small for exploitation
§ Difficult to tune
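A sketch of the soft-max rule in the same interface; the fixed temperature η = 0.1 is only an illustrative choice since, as the slide notes, tuning it is the hard part.

import numpy as np

class Softmax:
    def __init__(self, K, eta=0.1, seed=0):
        self.counts = np.zeros(K)
        self.means = np.zeros(K)
        self.eta = eta
        self.rng = np.random.default_rng(seed)
    def select(self, t):
        logits = self.means / self.eta
        p = np.exp(logits - logits.max())   # shift by max for numerical stability
        p /= p.sum()
        return int(self.rng.choice(len(p), p=p))
    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]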
Example of Regret Performance
[Figure: empirical regret curves of the strategies above; the plot itself is not recoverable from the extraction.]
Outline
1. Basic Exploration Strategies: Explore-then-Commit • ε-greedy • Softmax
2. Advanced Strategies: Lower bounds • UCB • Thompson Sampling
3. Linear and Contextual Linear Bandit
Problem-Independent Lower Bound
Theorem
Consider the family of multi-armed bandit problems with K Bernoulli arms. For any algorithm and fixed T, there exists a Bernoulli MAB problem instance such that:
$$R_T = \Omega\big(\sqrt{KT}\big)$$
§ At any finite time T, the regret may be as large as $\Omega(\sqrt{T})$
Problem-Dependent Lower Bound
Theorem
Consider the family of multi-armed bandit problems with K Bernoulli arms and an algorithm that satisfies $\mathbb{E}[T_T(a)] = o(T^{\alpha})$ for any $\alpha > 0$, any action a, and any Bernoulli MAB problem. Then for any Bernoulli MAB problem with gaps $\Delta(a) > 0$, $\forall a \ne a^*$, the algorithm suffers regret:
$$\liminf_{T \to \infty} \frac{R_T}{\log T} \ge \sum_{a \ne a^*} \frac{\Delta(a)}{\mathrm{KL}\big(\nu(a), \nu(a^*)\big)}$$
where $\mathrm{KL}(p, q) = p \log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$.
§ No algorithm can achieve a regret smaller than Ω(log T) (asymptotically)
§ The ratio $\Delta(a)/\mathrm{KL}(\nu(a), \nu(a^*))$ measures the difficulty of the problem
§ Algorithms such as ε-greedy with the right tuning are optimal!
Outline
1. Basic Exploration Strategies: Explore-then-Commit • ε-greedy • Softmax
2. Advanced Strategies: Lower bounds • UCB • Thompson Sampling
3. Linear and Contextual Linear Bandit
Recipe for Effective Exploration-Exploitation
1. Computation of estimates
2. Evaluation of uncertainty
3. Mechanism to combine estimates and uncertainty
4. Select the best action (according to its combined value)
Optimism in the Face of Uncertainty
"Whenever the value of an action is uncertain, consider its largest plausible value, and then select the best action."
Missing ingredient: the uncertainty of our estimates.
Measuring Uncertainty
Proposition (Chernoff-Hoeffding Inequality)
Let $X_t \in [a, b]$ be $n$ independent random variables with mean $\mu = \mathbb{E}[X_t]$. Then:
$$\mathbb{P}\left( \left| \frac{1}{n}\sum_{t=1}^{n} X_t - \mu \right| > (b-a)\sqrt{\frac{\log(2/\delta)}{2n}} \right) \le \delta$$
Recipe of UCB
1. Computation of estimates
$$\hat\mu_t(a) = \frac{1}{N_t(a)} \sum_{s=1}^{t} r_s\, \mathbb{1}\{a_s = a\}$$
2. Evaluation of uncertainty
$$|\hat\mu_t(a) - \mu(a)| \le \sqrt{\frac{\log(2/\delta)}{2 N_t(a)}}$$
3. Mechanism to combine estimates and uncertainty
$$B_t(a) = \hat\mu_t(a) + \rho \sqrt{\frac{\log(2t)}{2 N_t(a)}}$$
4. Select the best action (according to its combined value)
$$a_t = \arg\max_a B_t(a)$$
Upper Confidence Bound (UCB) Algorithm
$$a_{t+1} = \arg\max_a\; \hat\mu_t(a) + \rho\sqrt{\frac{\log(2t)}{2N_t(a)}}$$
The first term is the exploitation term (the estimated mean reward of the action); the second is an exploration bonus for rarely selected actions (optimism).
[Figure: initial confidence intervals for actions a₁, a₂, a₃, shrinking as each action is selected over time T.]
Co-designed w/ Pulkit Agrawal
UCB: Algorithm
for t = 1, …, T do
  1. Compute the upper-confidence bound
     $$B_t(a) = \hat\mu_t(a) + \rho\sqrt{\frac{\log(2t)}{2N_t(a)}}$$
  2. Take action $a_t = \arg\max_a B_t(a)$
  3. Observe reward $r_t \sim \nu(a_t)$
  4. Update statistics for action a_t:
     $$N_t(a_t) = N_{t-1}(a_t) + 1, \qquad \hat\mu_t(a_t) = \frac{1}{N_t(a_t)} \sum_{s=1}^{t} r_s\, \mathbb{1}\{a_s = a_t\}$$
endfor
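A sketch of UCB in the same interface. Pulling each arm once before applying the bound is a standard initialization, implicit in the slides, since N_t(a) = 0 makes the bonus infinite.

import numpy as np

class UCB:
    def __init__(self, K, rho=1.0):
        self.counts = np.zeros(K)
        self.means = np.zeros(K)
        self.rho = rho
    def select(self, t):
        untried = np.flatnonzero(self.counts == 0)
        if untried.size > 0:
            return int(untried[0])           # infinite bonus: try the arm once
        bonus = self.rho * np.sqrt(np.log(2 * t) / (2 * self.counts))
        return int(np.argmax(self.means + bonus))
    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]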
UCB: Regret
Theorem
Consider a MAB problem with K Bernoulli arms with gaps Δ(a). If UCB is run with ρ = 1 and δ_t = 1/t for T steps, then it suffers regret:
$$R_T = O\left( \sum_{a \ne a^*} \frac{\log T}{\Delta(a)} \right)$$
Consider a 2-action MAB problem; then for any fixed T, in the worst case (w.r.t. Δ) UCB suffers a regret:
$$R_T = O\big(\sqrt{T \log T}\big)$$
§ It (almost) matches the lower bounds
§ It does not require any prior knowledge about the MAB, apart from the range of the r.v.
§ The big-O hides a few numerical constants and T-independent additive terms
UCB: Proof Sketch
§ Disclaimer: this is a slightly suboptimal proof, but it provides an easy path.
§ Define the (high-probability) event [statistics]
$$\mathcal{E} = \left\{ \forall a, t \colon\; |\hat\mu_t(a) - \mu(a)| \le \sqrt{\frac{\log(2/\delta)}{2N_t(a)}} \right\}$$
§ By Chernoff-Hoeffding and a union bound: $\mathbb{P}(\mathcal{E}) \ge 1 - TK\delta$
§ If at time t we select action a, then [algorithm]
$$B_t(a) \ge B_t(a^*) \;\Longleftrightarrow\; \hat\mu_t(a) + \sqrt{\frac{\log(2/\delta)}{2N_t(a)}} \ge \hat\mu_t(a^*) + \sqrt{\frac{\log(2/\delta)}{2N_t(a^*)}}$$
§ On the event $\mathcal{E}$, we have [math]
$$\mu(a) + 2\sqrt{\frac{\log(2/\delta)}{2N_t(a)}} \ge \mu(a^*)$$
UCB: Proof Sketch
§ Assume t is the last time a is selected; then $N_s(a) = N_{t-1}(a) + 1$ for all $s \ge t$, in particular $N_T(a) = N_{t-1}(a) + 1$, thus:
$$\mu(a) + 2\sqrt{\frac{\log(2/\delta)}{2N_T(a)}} \ge \mu(a^*)$$
§ Reordering [math]:
$$N_T(a) \le \frac{2\log(2/\delta)}{\Delta(a)^2}$$
under event $\mathcal{E}$, and thus with probability $1 - TK\delta$
§ Moving to the expectation [statistics]:
$$\mathbb{E}[N_T(a)] = \mathbb{E}[N_T(a) \mid \mathcal{E}]\,\mathbb{P}(\mathcal{E}) + \mathbb{E}[N_T(a) \mid \mathcal{E}^c]\,\mathbb{P}(\mathcal{E}^c) \le \frac{2\log(2/\delta)}{\Delta(a)^2} + T^2 K \delta$$
§ Trading off the two terms with $\delta = 1/T^2$, we obtain:
$$\mathbb{E}[N_T(a)] \le \frac{4\log T}{\Delta(a)^2} + K$$
Tuning the ρ Parameter
Theory
§ ρ < 1: polynomial regret w.r.t. T
§ ρ ≥ 1: logarithmic regret w.r.t. T
Practice: ρ = 0.2 is often the best choice
Recall:
$$a_{t+1} = \arg\max_a\; \hat\mu_t(a) + \rho\sqrt{\frac{\log(2t)}{2N_t(a)}}$$
Improvements: UCB-V
Idea: Use empirical Bernstein bounds for more accurate confidence intervals (c.i.)
Algorithm:
§ Compute the score of each arm a
$$B_t(a) = \hat\mu_t(a) + \sqrt{\frac{2\hat\sigma_t^2(a)\log(1/\delta)}{N_t(a)}} + \frac{8\log(1/\delta)}{3N_t(a)}$$
§ Select action $a_t = \arg\max_a B_t(a)$
§ Update the statistics $N_t(a_t)$, $\hat\mu_t(a_t)$, and $\hat\sigma_t^2(a_t)$
Regret:
$$R_T \le O\left( \sum_{a \ne a^*} \Big( \frac{\sigma^2(a)}{\Delta(a)} + 1 \Big) \log T \right)$$
Improvements: KL-UCB
Idea: Use even tighter c.i. based on the Kullback-Leibler divergence
$$\mathrm{KL}(p, q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$$
Algorithm: compute the score of each arm a (a convex optimization problem):
$$B_t(a) = \max\big\{ q \in [0,1] \;:\; N_t(a)\,\mathrm{KL}\big(\hat\mu_t(a), q\big) \le \log t + c \log\log t \big\}$$
Regret: the pulls to suboptimal arms satisfy
$$\mathbb{E}[N_n(a)] \le (1+\varepsilon)\,\frac{\log n}{\mathrm{KL}\big(\nu(a), \nu(a^*)\big)} + C_1 \log\log n + \frac{C_2(\varepsilon)}{n^{\beta(\varepsilon)}}$$
where $\mathrm{KL}\big(\nu(a), \nu(a^*)\big) \ge 2\Delta(a)^2$.
Outline
1. Basic Exploration Strategies: Explore-then-Commit • ε-greedy • Softmax
2. Advanced Strategies: Lower bounds • UCB • Thompson Sampling
3. Linear and Contextual Linear Bandit
Measuring Uncertainty
[Figure: Beta posterior densities; source: Steve Roberts.]
§ Assume that the rewards r_t(a) are distributed as a Bernoulli with parameter μ(a), for every action a
§ Define a prior μ(a) ~ Beta(α₀, β₀)
§ After t rewards, compute the posterior for action a as Beta(α_t(a), β_t(a)) with:
$$\alpha_t(a) = \alpha_0 + \sum_{s=1}^{t} \mathbb{1}\{a_s = a \wedge r_s = 1\}, \qquad \beta_t(a) = \beta_0 + \sum_{s=1}^{t} \mathbb{1}\{a_s = a \wedge r_s = 0\}$$
(α counts observed successes and β observed failures, so that the posterior mean $\alpha_t(a)/(\alpha_t(a) + \beta_t(a))$ estimates μ(a).)
Recipe of Thompson Sampling*
1. Computation of estimates
$$\hat\mu_t(a) = \frac{\alpha_t(a)}{\alpha_t(a) + \beta_t(a)}$$
2. Evaluation of uncertainty: the posterior $\mathrm{Beta}\big(\alpha_t(a), \beta_t(a)\big)$
3. Mechanism to combine estimates and uncertainty
$$\theta_t(a) \sim \mathrm{Beta}\big(\alpha_t(a), \beta_t(a)\big)$$
4. Select the best action (according to its combined value)
$$a_t = \arg\max_a \theta_t(a)$$
*aka Posterior Sampling
Thompson Sampling: Algorithm
for t = 1, …, T do
  1. Sample $\theta_t(a) \sim \mathrm{Beta}\big(\alpha_t(a), \beta_t(a)\big)$ for each action a
  2. Take action $a_t = \arg\max_a \theta_t(a)$
  3. Observe reward $r_t \sim \nu(a_t)$
  4. Update statistics for action a_t:
     $$\alpha_t(a_t) = \alpha_{t-1}(a_t) + \mathbb{1}\{r_t = 1\}, \qquad \beta_t(a_t) = \beta_{t-1}(a_t) + \mathbb{1}\{r_t = 0\}$$
endfor
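A sketch of Beta-Bernoulli Thompson Sampling in the same interface, with the uniform Beta(1, 1) prior as an illustrative choice:

import numpy as np

class ThompsonSampling:
    def __init__(self, K, seed=0):
        self.alpha = np.ones(K)   # alpha_0 = 1, plus observed successes
        self.beta = np.ones(K)    # beta_0 = 1, plus observed failures
        self.rng = np.random.default_rng(seed)
    def select(self, t):
        theta = self.rng.beta(self.alpha, self.beta)  # one posterior draw per arm
        return int(np.argmax(theta))
    def update(self, a, r):
        self.alpha[a] += r        # r = 1: success
        self.beta[a] += 1 - r     # r = 0: failure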
Thompson Sampling: Regret
Theorem
Consider a MAB problem with K Bernoulli arms with gaps Δ(a). If Thompson Sampling is run for T steps, then for any ε > 0 it suffers regret:
$$R_T \le (1+\varepsilon) \sum_{a \ne a^*} \frac{\Delta(a)\,\log T}{\mathrm{KL}\big(\nu(a), \nu(a^*)\big)} + C(\varepsilon)$$
§ It matches the lower bound
§ It requires defining a prior on the actions
Outline
1. Basic Exploration Strategies: Explore-then-Commit • ε-greedy • Softmax
2. Advanced Strategies: Lower bounds • UCB • Thompson Sampling
3. Linear and Contextual Linear Bandit
A Simple Recommendation System
§ A RS can recommend specific movies
§ Users arrive at random and no information about the user is available
§ The RS picks a movie to recommend to the user
§ The feedback is whether the user watched the movie or not
§ Objective: design a RS that maximizes the number of movies watched
RS as a Multi-armed Bandit
for t = 1, …, T do
  1. User arrives
  2. Recommend movie a_t
  3. Reward: r_t = 1 if the user watches movie a_t, 0 otherwise
endfor
Issue: too many movies are available to collect enough feedback for each movie separately
RS as a Linear Bandit
The model
§ μ(a) = 𝔼[r(a)] is the probability that a random user watches movie a
§ Each movie a is characterized by some features $\varphi(a) \in \mathbb{R}^d$ (e.g. genre, release date, past rating, income, etc.)
§ Assumption:
  • The expected value is a linear function $\mu(a) = \varphi(a)^\top \theta^*$ (with $\theta^* \in \mathbb{R}^d$ unknown)
  • The rewards are noisy observations $r_t(a) = \mu(a) + \varepsilon_t$ with $\mathbb{E}[\varepsilon_t] = 0$
The objective
§ Maximize the sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$
Recipe of UCB
1. Computation of estimates
$$\hat\mu_t(a) = \frac{1}{N_t(a)} \sum_{s=1}^{t} r_s\, \mathbb{1}\{a_s = a\}$$
2. Evaluation of uncertainty
$$|\hat\mu_t(a) - \mu(a)| \le \sqrt{\frac{\log(2/\delta)}{2 N_t(a)}}$$
3. Mechanism to combine estimates and uncertainty
$$B_t(a) = \hat\mu_t(a) + \rho \sqrt{\frac{\log(2t)}{2 N_t(a)}}$$
4. Select the best action (according to its combined value): $a_t = \arg\max_a B_t(a)$
Issue: $N_t(a)$ is likely to be 0 for most a. We need more sample-efficient estimates.
The Regret
$$R_T = \max_a \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a)\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a_t)\Big] = \mathbb{E}\Big[\sum_{t=1}^{T} \mu(a^*) - \mu(a_t)\Big]$$
Issue: $a^*$ is unlikely to be ever selected if $T \ll K$
Least-Squares Estimate of θ*
§ Least-squares estimate
$$\hat\theta_t = \arg\min_{\theta \in \mathbb{R}^d} \sum_{s=1}^{t} \big(r_s - \varphi(a_s)^\top \theta\big)^2 + \lambda \|\theta\|^2$$
§ Closed-form solution
$$A_t = \sum_{s=1}^{t} \varphi(a_s)\varphi(a_s)^\top + \lambda I, \qquad b_t = \sum_{s=1}^{t} \varphi(a_s)\, r_s \quad\Longrightarrow\quad \hat\theta_t = A_t^{-1} b_t$$
§ Estimate of the value of action a
$$\hat\mu_t(a) = \varphi(a)^\top \hat\theta_t$$
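The closed form translates directly to code; a sketch, where Phi stacks the feature vectors φ(a_s) row-wise, r is the reward vector, and lam is λ:

import numpy as np

def ridge_estimate(Phi, r, lam=1.0):
    """theta_hat = A^{-1} b with A = Phi^T Phi + lam*I and b = Phi^T r."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    b = Phi.T @ r
    return np.linalg.solve(A, b)   # solving is preferable to forming A^{-1}

# Value estimate for an action with features phi_a:
#   mu_hat_a = phi_a @ ridge_estimate(Phi, r)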
Measuring Uncertainty
Proposition
Let $a_1, \dots, a_t$ be any sequence of actions adapted to the filtration $\mathcal{F}_t$. If the noise ε is sub-Gaussian with parameter σ and the features are bounded, $\|\varphi(a)\| \le B$, then for any a, with probability $1 - \delta$:
$$|\hat\mu_t(a) - \mu(a)| \le \beta_t \sqrt{\varphi(a)^\top A_t^{-1} \varphi(a)}$$
where
$$\beta_t = \sigma\sqrt{d \log\frac{1 + tB^2/\lambda}{\delta}} + \lambda^{1/2} \|\theta^*\|$$
§ $\|\varphi(a)\|_{A_t^{-1}}$ measures the correlation between φ(a) and the actions selected so far
§ If $\{\varphi(a)\}_a$ is an orthogonal basis of $\mathbb{R}^d$, this reduces to the MAB problem and $\|\varphi(a)\|_{A_t^{-1}}^2 = \frac{1}{N_t(a)}$
Recipe of LinUCB
1. Computation of estimates: $\hat\theta_t = A_t^{-1} b_t$, $\quad \hat\mu_t(a) = \varphi(a)^\top \hat\theta_t$
2. Evaluation of uncertainty: $|\hat\mu_t(a) - \mu(a)| \le \beta_t \sqrt{\varphi(a)^\top A_t^{-1} \varphi(a)}$
3. Mechanism to combine estimates and uncertainty: $B_t(a) = \hat\mu_t(a) + \beta_t \sqrt{\varphi(a)^\top A_t^{-1} \varphi(a)}$
4. Select the best action (according to its combined value): $a_t = \arg\max_a B_t(a)$
LinUCB: Algorithm
for t = 1, …, T do
  1. Compute the upper-confidence bound
     $$B_t(a) = \hat\mu_t(a) + \beta_t \sqrt{\varphi(a)^\top A_t^{-1} \varphi(a)}$$
  2. Take action $a_t = \arg\max_a B_t(a)$
  3. Observe reward $r_t \sim \varphi(a_t)^\top \theta^* + \varepsilon_t$
  4. Update statistics:
     $$A_{t+1} = A_t + \varphi(a_t)\varphi(a_t)^\top, \qquad b_{t+1} = b_t + \varphi(a_t)\, r_t, \qquad \hat\theta_{t+1} = A_{t+1}^{-1} b_{t+1}$$
endfor
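A sketch of LinUCB over a finite action set, where features is the K × d matrix of φ(a); holding the confidence width β_t constant is a simplification of the β_t formula two slides back.

import numpy as np

class LinUCB:
    def __init__(self, features, lam=1.0, beta=1.0):
        self.Phi = np.asarray(features, dtype=float)   # K x d matrix of phi(a)
        d = self.Phi.shape[1]
        self.A = lam * np.eye(d)   # A_t = sum_s phi phi^T + lam * I
        self.b = np.zeros(d)       # b_t = sum_s phi * r
        self.beta = beta
    def select(self, t):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                       # ridge estimate theta_hat
        mu = self.Phi @ theta                        # estimated values
        widths = np.sqrt(np.einsum('kd,de,ke->k', self.Phi, A_inv, self.Phi))
        return int(np.argmax(mu + self.beta * widths))
    def update(self, a, r):
        phi = self.Phi[a]
        self.A += np.outer(phi, phi)
        self.b += phi * r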
LinUCB: Regret
Theorem
Consider a linear MAB problem with actions defined in $\mathbb{R}^d$ and unknown parameter $\theta^* \in \mathbb{R}^d$. If LinUCB is run with $\delta_t = 1/t$ for T steps, then it suffers a regret:
$$R_T = O\big(d\sqrt{T}\log T\big)$$
§ It depends on d but not on the number of actions K
§ If $K < \infty$, we can improve the bound to
$$R_T = O\big(\sqrt{dT\log(TK)}\big)$$
A Simple Recommendation System
§ A RS can recommend specific movies
§ Users arrive at random and we have information about them
§ The RS picks a movie to recommend to the user
§ The feedback is whether the user watched the movie or not
§ Objective: design a RS that maximizes the number of movies watched
RS as a Multi-armed Bandit
for t = 1, …, T do
  1. User x_t arrives
  2. Recommend movie a_t
  3. Reward: r_t = 1 if user x_t watches movie a_t, 0 otherwise
endfor
Issue: too many users to collect enough feedback for each user separately
RS as a Contextual Linear Bandit
The model
§ μ(x, a) = 𝔼[r(x, a)] is the probability that user x watches movie a
§ Each user x and movie a are characterized by some features $\varphi(x, a) \in \mathbb{R}^d$ (e.g. name, location, genre, release date, past rating, income, etc.)
§ Assumption:
  • The expected value is a linear function $\mu(x, a) = \varphi(x, a)^\top \theta^*$ (with $\theta^* \in \mathbb{R}^d$ unknown)
  • The rewards are noisy observations $r_t(x, a) = \mu(x, a) + \varepsilon_t$ with $\mathbb{E}[\varepsilon_t] = 0$
The objective
§ Maximize the sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$
The Regret
$$R_T = \mathbb{E}\Big[\sum_{t=1}^{T} \max_a r_t(x_t, a)\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(x_t, a_t)\Big] = \mathbb{E}\Big[\sum_{t=1}^{T} \mu(x_t, a_t^*) - \mu(x_t, a_t)\Big]$$
where $a_t^* = \arg\max_a \mu(x_t, a)$ is the optimal action for context $x_t$.
Least-Squares Estimate of θ*
§ Least-squares estimate
$$\hat\theta_t = \arg\min_{\theta \in \mathbb{R}^d} \sum_{s=1}^{t} \big(r_s - \varphi(x_s, a_s)^\top \theta\big)^2 + \lambda \|\theta\|^2$$
§ Closed-form solution
$$A_t = \sum_{s=1}^{t} \varphi(x_s, a_s)\varphi(x_s, a_s)^\top + \lambda I, \qquad b_t = \sum_{s=1}^{t} \varphi(x_s, a_s)\, r_s \quad\Longrightarrow\quad \hat\theta_t = A_t^{-1} b_t$$
§ Estimate of the value of action a in context x
$$\hat\mu_t(x, a) = \varphi(x, a)^\top \hat\theta_t$$
ContextualLinUCB: Algorithm
for t = 1, …, T do
  1. Observe context x_t
  2. Compute the upper-confidence bound
     $$B_t(x_t, a) = \hat\mu_t(x_t, a) + \beta_t \sqrt{\varphi(x_t, a)^\top A_t^{-1} \varphi(x_t, a)}$$
  3. Take action $a_t = \arg\max_a B_t(x_t, a)$
  4. Observe reward $r_t \sim \varphi(x_t, a_t)^\top \theta^* + \varepsilon_t$
  5. Update statistics:
     $$A_{t+1} = A_t + \varphi(x_t, a_t)\varphi(x_t, a_t)^\top, \qquad b_{t+1} = b_t + \varphi(x_t, a_t)\, r_t, \qquad \hat\theta_{t+1} = A_{t+1}^{-1} b_{t+1}$$
endfor
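The contextual variant only changes where the features come from: they are built per round from the observed context. A sketch, where featurize(x, a) is a hypothetical feature map returning φ(x, a) ∈ ℝᵈ:

import numpy as np

class ContextualLinUCB:
    def __init__(self, d, actions, featurize, lam=1.0, beta=1.0):
        self.actions = actions
        self.featurize = featurize       # (context, action) -> length-d array
        self.A = lam * np.eye(d)
        self.b = np.zeros(d)
        self.beta = beta
    def select(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        def score(a):
            phi = self.featurize(x, a)
            return phi @ theta + self.beta * np.sqrt(phi @ A_inv @ phi)
        return max(self.actions, key=score)   # argmax over the action set
    def update(self, x, a, r):
        phi = self.featurize(x, a)
        self.A += np.outer(phi, phi)
        self.b += phi * r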
ContextualLinUCB: Regret
Theorem
Consider a contextual linear MAB problem with contexts and actions defined in $\mathbb{R}^d$ and unknown parameter $\theta^* \in \mathbb{R}^d$. If ContextualLinUCB is run with $\delta_t = 1/t$ for T steps, then for any arbitrary sequence of contexts $x_1, x_2, \dots, x_T$, it suffers a regret:
$$R_T = O\big(d\sqrt{T}\log T\big)$$
Regret and Finite Sample Analysis
§ How many data points (or experiences) are needed in order to get a good approximation of the optimal policy in a reinforcement learning task with an algorithm 𝒜?
  • Sample complexity $n(\delta, \varepsilon, \mathcal{A})$: the smallest n such that, with probability at least $1 - \delta$,
    $$\mathrm{AvgError}(\mathcal{A}_n) \le \varepsilon,$$
    where $\mathrm{AvgError}(\mathcal{A}_t)$ is the average error made by the algorithm after t steps.
  • Regret: $\sum_{t=1}^{n} \mathrm{Error}(\mathcal{A}_t)$
Adapted from Mohammed Amine Bennouna & Moïse Blanchard (MIT)
Fitted Q-Iteration
Offline FQI guarantee [Munos, 2003; Munos and Szepesvári, 2008; Antos et al., 2008; Agarwal et al., 2021]
The k-th iterate of offline Fitted Q-Iteration guarantees that, with probability $1 - \delta$:
$$\|Q^* - Q^{\pi_k}\| \le O\!\left(\frac{1}{(1-\gamma)^2}\sqrt{\frac{C\,\log(|\mathcal{F}|/\delta)}{n}}\right) + \frac{2\gamma^k}{(1-\gamma)^2}$$
§ Remark: estimation and optimization error terms
§ Assumptions
  • Concentrability C, a measure of data coverage (a strong assumption)
  • Realizability: $Q^* \in \mathcal{F}$, where $\mathcal{F}$ is the function class for the Q-function
  • Bellman completeness: for any $f \in \mathcal{F}$, $\mathcal{T}f \in \mathcal{F}$
§ Ingredients
  • Performance difference lemma
  • Bernstein's inequality