Cathy Wu - MIT · 2021. 4. 6.
Introduction to Multi-Armed Bandits: The Exploration-Exploitation Dilemma
Cathy Wu
6.246 Reinforcement Learning: Foundations and Methods
Apr 6, 2021
References
1. Alessandro Lazaric. INRIA Lille. Reinforcement Learning. 2017, Lecture 6.
2. Aleksandrs Slivkins. Introduction to Multi-Armed Bandits. 2019. Chapters 1-3, 8.
Outline

1. Basic Exploration Strategies
• Explore then Commit
• ε-greedy
• Softmax
2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling
3. Linear and Contextual Linear Bandit
Recall: Q-Learning

Proposition
If the learning rate satisfies the Robbins-Monro conditions in all state-action pairs (s, a) ∈ S × A:

∑_{t≥0} α_t(s, a) = ∞,   ∑_{t≥0} α_t²(s, a) < ∞

and all state-action pairs are tried infinitely often, then for all (s, a) ∈ S × A:

Q_t(s, a) → Q*(s, a) almost surely.

Remark: "infinitely often" requires a steady exploration policy.
Learning the Optimal Policy

for k = 1, ..., K do
1. Set t = 0
2. Set initial state s_0
3. while (s_t not terminal)
   1) Take action a_t = argmax_a Q(s_t, a)
   2) Observe next state s_{t+1} and reward r_t
   3) Compute the temporal difference
      δ_t = r_t + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t)   (Q-learning)
   4) Update the Q-function
      Q(s_t, a_t) ← Q(s_t, a_t) + α(s_t, a_t) δ_t
   5) Set t = t + 1
endwhile
endfor

No convergence: purely greedy action selection never explores, so actions must instead be taken according to a suitable exploration policy.
Learning the Optimal Policy

for k = 1, ..., K do
1. Set t = 0
2. Set initial state s_0
3. while (s_t not terminal)
   1) Take action a_t ~ π(s_t)   (a suitable exploration policy)
   2) Observe next state s_{t+1} and reward r_t
   3) Compute the temporal difference
      δ_t = r_t + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t)   (Q-learning)
   4) Update the Q-function
      Q(s_t, a_t) ← Q(s_t, a_t) + α(s_t, a_t) δ_t
   5) Set t = t + 1
endwhile
endfor

Bad convergence: a fixed exploration policy does learn, but it keeps selecting bad actions along the way.
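The two loops above differ only in how a_t is chosen. A minimal runnable sketch of tabular Q-learning with ε-greedy exploration; the 4-state chain MDP (state 3 terminal, reward 1 on reaching it), the constants, and the episode budget are illustrative assumptions, not from the lecture:

```python
import random

# Tabular Q-learning with an epsilon-greedy exploration policy on an
# assumed toy chain MDP: states 0..3, state 3 is terminal.
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1
ACTIONS = (0, 1)  # 0 = left, 1 = right

def step(s, a):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

random.seed(0)
Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
for episode in range(500):
    s = random.choice([0, 1, 2])  # exploring starts
    while s != 3:
        # epsilon-greedy exploration instead of pure argmax
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r = step(s, a)
        target = r if s2 == 3 else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])  # temporal-difference update
        s = s2

print(round(Q[(2, 1)], 2))  # approaches the optimal value 1.0
```

With a steady exploration policy every state-action pair keeps being tried, which is exactly the Robbins-Monro requirement from the previous slide.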
From RL to Multi-armed Bandit

for k = 1, ..., K do
1. Set t = 0
2. Set initial state s_0
3. while (s_t not terminal)
   1) Take action a_t
   2) Observe next state s_{t+1} and reward r_t
endwhile
endfor
From RL to Multi-armed Bandit

The problem
§ Set of K actions (arms)
§ Reward distribution ν(a) with μ(a) = E[r(a)] (bounded in [0, 1] for convenience)

The protocol
for t = 1, ..., T do
1. Take action a_t
2. Observe reward r_t ~ ν(a_t)
endfor

The objective
§ Maximize the sum of rewards E[ ∑_{t=1}^T r_t ]
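The protocol above takes only a few lines to simulate. A minimal sketch with three Bernoulli arms; the means and the uniformly random placeholder policy are illustrative assumptions:

```python
import random

# The bandit protocol: repeatedly pick an arm, observe a reward drawn
# from that arm's distribution. Arm means mu are assumed for illustration.
random.seed(0)
mu = [0.2, 0.5, 0.8]          # mu[a] = E[r(a)], bounded in [0, 1]
T = 10_000

def pull(a):
    """Observe r_t ~ nu(a_t): a Bernoulli draw with mean mu[a]."""
    return 1.0 if random.random() < mu[a] else 0.0

total = 0.0
for t in range(T):
    a_t = random.randrange(len(mu))   # placeholder policy: uniform choice
    total += pull(a_t)

print(round(total / T, 2))  # near the average mean 0.5; the best arm would give 0.8
```

Any learning algorithm in this lecture only changes how `a_t` is chosen inside the loop.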
A Simple Recommendation System

§ A RS can recommend different genres of movies (e.g. action, adventure, romance, animation)
§ Users arrive at random and no information about the user is available
§ The RS picks a genre to recommend to the user, but not the specific movie
§ The feedback is whether the user watched a movie of the recommended genre or not
§ Objective: Design a RS that maximizes the number of movies watched in the recommended genre
RS as a Multi-armed Bandit

for t = 1, ..., T do
1. User arrives
2. Recommend genre a_t
3. Reward
   r_t = 1 if the user watches a movie of genre a_t, 0 otherwise
endfor
RS as a Multi-armed Bandit

The model
§ ν(a) is a Bernoulli distribution
§ μ(a) = E[r(a)] is the probability that a random user watches a movie of genre a
§ Assumption: r_t ~ ν(a_t) is a realization of the Bernoulli of genre a_t

The objective
§ Maximize the sum of rewards E[ ∑_{t=1}^T r_t ]
Other Examples

§ Packet routing
§ Clinical trials
§ Web advertising
§ Health advice
§ Education
§ Computer games
§ Resource mining
§ ...
The Regret

R_T = max_a E[ ∑_{t=1}^T r_t(a) ] − E[ ∑_{t=1}^T r_t(a_t) ]

The expectation summarizes any possible source of randomness (either in ν or in the algorithm).

Relation to RL: can think of this as T trajectories (each of length 1).

The regret measures not only the final error, but all mistakes made over T "iterations."
The Regret

§ Number of times action a has been selected after T rounds:
  N_T(a) = ∑_{t=1}^T 1{a_t = a}

§ Gap: Δ(a) = μ(a*) − μ(a)

§ Regret:
  R_T = max_a E[ ∑_{t=1}^T r_t(a) ] − E[ ∑_{t=1}^T r_t(a_t) ]
      = max_a T μ(a) − E[ ∑_{t=1}^T r_t(a_t) ]
      = max_a T μ(a) − ∑_a E[ N_T(a) ] μ(a)
      = T μ(a*) − ∑_a E[ N_T(a) ] μ(a)
      = ∑_{a ≠ a*} E[ N_T(a) ] ( μ(a*) − μ(a) )
      = ∑_{a ≠ a*} E[ N_T(a) ] Δ(a)
The Regret

R_T = ∑_{a ≠ a*} E[ N_T(a) ] Δ(a)

Ø We only need to study the expected number of times suboptimal actions are selected
Ø Worst case possible: R_T = O(T)
Ø A good algorithm has R_T = o(T), i.e. R_T / T → 0
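The decomposition can be checked numerically: averaged over runs, the regret computed from realized rewards and the regret computed from pull counts and gaps estimate the same quantity. A sketch with illustrative arm means and a uniform policy:

```python
import random

# Numerical check of R_T = sum_{a != a*} E[N_T(a)] * Delta(a).
# Arm means and the uniform policy are assumed for illustration.
random.seed(1)
mu = [0.3, 0.6, 0.9]
best = max(mu)
T, runs = 1000, 200

gap_regret = reward_regret = 0.0
for _ in range(runs):
    counts = [0] * len(mu)
    reward = 0.0
    for t in range(T):
        a = random.randrange(len(mu))   # uniform policy, for concreteness
        counts[a] += 1
        reward += 1.0 if random.random() < mu[a] else 0.0
    gap_regret += sum(n * (best - m) for n, m in zip(counts, mu))
    reward_regret += best * T - reward

gap_regret /= runs
reward_regret /= runs
# Under the uniform policy both estimate T * (mean gap) = 1000 * 0.3 = 300.
print(round(gap_regret), round(reward_regret))
```

This is why the analysis of every algorithm below focuses on bounding E[N_T(a)] for the suboptimal arms.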
The Exploration-Exploitation Dilemma

Problem 1: The environment does not reveal the reward of the actions not selected by the learner
Ø The learner should gain information by repeatedly selecting all actions ⟹ exploration

Problem 2: Whenever the learner selects a bad action, it suffers some regret
Ø The learner should reduce the regret by repeatedly selecting the best action ⟹ exploitation

Challenge: The learner should solve two opposite problems: the exploration-exploitation dilemma!
Outline

1. Basic Exploration Strategies
• Explore then Commit
• ε-greedy
• Softmax
2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling
3. Linear and Contextual Linear Bandit
Explore-then-Commit

[Timeline figure: an explore phase over the first rounds, followed by an exploit phase up to time T.]

Co-designed w/ Pulkit Agrawal
Explore-then-Commit: Algorithm

Explore phase
for t = 1, ..., τ do
1. Take action a_t ~ U(A) (or round robin)
2. Observe reward r_t ~ ν(a_t)
endfor

Compute statistics for each action a:
μ̂_τ(a) = (1 / N_τ(a)) ∑_{t=1}^τ r_t 1{a_t = a}

Exploit phase
for t = τ + 1, ..., T do
1. Take action â* = argmax_a μ̂_τ(a)
2. Observe reward r_t ~ ν(â*)
endfor
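A runnable sketch of the algorithm above; the three Bernoulli means and the choice of τ are illustrative assumptions:

```python
import random

# Explore-then-commit: round-robin exploration for tau rounds, then
# commit to the empirically best arm. Arm means are assumed.
random.seed(0)
mu = [0.4, 0.5, 0.7]
K, T, tau = len(mu), 20_000, 3_000

def pull(a):
    return 1.0 if random.random() < mu[a] else 0.0

counts, sums, total = [0] * K, [0.0] * K, 0.0

# Explore phase: round robin, so each arm gets tau / K pulls
for t in range(tau):
    a = t % K
    r = pull(a)
    counts[a] += 1
    sums[a] += r
    total += r

# Commit to the empirically best arm
a_hat = max(range(K), key=lambda a: sums[a] / counts[a])

# Exploit phase
for t in range(tau, T):
    total += pull(a_hat)

regret = max(mu) * T - total
print(a_hat, round(regret))  # exploration alone contributes about (tau/K) * sum of gaps
```

Note the tension the next slide quantifies: a larger τ makes `a_hat` more reliable but pays more exploration regret up front.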
Explore-then-Commit: Regret

Theorem
If explore-then-commit is run with exploration parameter τ for T steps, then it suffers a regret:

R_T ≤ ∑_{a ≠ a*} [ (τ/K) Δ(a) + 2 (T − τ) exp( −τ Δ(a)² / (2K) ) ]

§ Difficult to tune: τ should be adjusted depending on T and Δ(a)
§ Worst case w.r.t. Δ(a): R_T = O(T^{2/3}) (for τ = O(T^{2/3}))
§ Recall: worst possible: R_T = O(T)
Explore-then-Commit: Regret Analysis

§ Regret decomposition:
  R_T = ∑_{t=1}^{τ} E[ μ(a*) − μ(a_t) ] + ∑_{t=τ+1}^{T} E[ μ(a*) − μ(â*) ]

§ During the explore phase:
  ∑_{t=1}^{τ} E[ μ(a*) − μ(a_t) ] = (τ/K) ∑_{a ≠ a*} Δ(a)

§ During the exploit phase:
  ∑_{t=τ+1}^{T} E[ μ(a*) − μ(â*) ] = (T − τ) ∑_{a ≠ a*} ℙ( â* = a ) Δ(a)
    = (T − τ) ∑_{a ≠ a*} ℙ( ∀a′: μ̂_τ(a) ≥ μ̂_τ(a′) ) Δ(a)
    ≤ (T − τ) ∑_{a ≠ a*} ℙ( μ̂_τ(a) ≥ μ̂_τ(a*) ) Δ(a)

Recall:
  μ̂_τ(a) = (1 / N_τ(a)) ∑_{t=1}^{τ} r_t 1{a_t = a},   â* = argmax_a μ̂_τ(a)
Explore-then-Commit: Regret Analysis

Proposition (Hoeffding Inequality)
Let X_i ∈ [a, b] be n independent r.v. with common mean μ = E[X_i]. Then for all ε > 0:

ℙ( |X̄_n − μ| > ε ) ≤ 2 exp( −2nε² / (b − a)² )

where X̄_n = (1/n) ∑_{i=1}^n X_i.
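The bound can be sanity-checked by simulation. A sketch for Bernoulli(0.5) variables in [0, 1], with illustrative choices of n and ε:

```python
import math
import random

# Monte Carlo check of Hoeffding: the empirical probability that the
# sample mean deviates from mu by more than eps stays below the bound
# 2 * exp(-2 * n * eps**2) for variables in [0, 1].
random.seed(0)
n, eps, trials = 100, 0.1, 20_000
mu = 0.5
exceed = 0
for _ in range(trials):
    xbar = sum(random.random() < mu for _ in range(n)) / n  # Bernoulli mean
    if abs(xbar - mu) > eps:
        exceed += 1

empirical = exceed / trials
bound = 2 * math.exp(-2 * n * eps ** 2)
print(empirical <= bound)
```

The bound is loose here (the true deviation probability is much smaller than 2e^{-2} ≈ 0.27), which is typical: Hoeffding only uses boundedness, not the variance.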
Explore-then-Commit: Regret Analysis

§ Probability of error:
  ℙ( μ̂_τ(a) ≥ μ̂_τ(a*) )
    = ℙ( μ̂_τ(a) − μ(a) ≥ μ̂_τ(a*) − μ(a*) + Δ(a) )
    ≤ ℙ( μ̂_τ(a) − μ(a) ≥ Δ(a)/2 ) + ℙ( μ(a*) − μ̂_τ(a*) ≥ Δ(a)/2 )

§ Apply the Hoeffding bound for random variables X_i ∈ [0, 1], with N_τ(a) = τ/K samples per arm:
  ℙ( μ̂_τ(a) ≥ μ̂_τ(a*) ) ≤ 2 exp( −τ Δ(a)² / (2K) )
Outline

1. Basic Exploration Strategies
• Explore then Commit
• ε-greedy
• Softmax
2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling
3. Linear and Contextual Linear Bandit
ε-greedy: Algorithm

for t = 1, ..., T do
1. Take action
   a_t = a ~ U(A)           with probability ε_t (explore)
   a_t = argmax_a μ̂_t(a)    with probability 1 − ε_t (exploit)
2. Observe reward r_t ~ ν(a_t)
3. Update statistics for action a_t:
   N_t(a_t) = N_{t−1}(a_t) + 1
   μ̂_t(a_t) = (1 / N_t(a_t)) ∑_{s=1}^t r_s 1{a_s = a_t}
endfor
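A runnable sketch of the loop above, using a decaying schedule ε_t ∝ (K log t / t)^{1/3}; the arm means and horizon are illustrative assumptions:

```python
import math
import random

# Epsilon-greedy with a decaying exploration schedule on assumed
# Bernoulli arms; estimates are maintained incrementally.
random.seed(0)
mu = [0.3, 0.5, 0.8]
K, T = len(mu), 30_000
counts = [0] * K
means = [0.0] * K
total = 0.0
for t in range(1, T + 1):
    eps = min(1.0, (K * math.log(t + 1) / (t + 1)) ** (1 / 3))
    if random.random() < eps or 0 in counts:
        a = random.randrange(K)                    # explore
    else:
        a = max(range(K), key=lambda b: means[b])  # exploit
    r = 1.0 if random.random() < mu[a] else 0.0
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]         # incremental mean update
    total += r

regret = max(mu) * T - total
print(round(regret))  # grows sublinearly in T, unlike the uniform policy
```

Unlike explore-then-commit, exploration here is spread over the whole horizon, which is why the regret guarantee holds at every round t.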
ε-greedy: Regret

Theorem
If ε-greedy is run with parameter ε_t = t^{−1/3} (K log t)^{1/3}, then for each round t it suffers a regret:

R_t ≤ Õ( t^{2/3} )

§ Same asymptotic regret as explore-then-commit, but now the bound holds for all rounds t
§ Can do better, but the optimal ε depends on knowledge of Δ (difficult to tune)
§ Keeps selecting very bad arms with some probability
§ Sharply separates exploration and exploitation
Types of Exploration Strategies

Non-adaptive vs. adaptive exploration

§ Non-adaptive:
• Explore-then-commit: explore and exploit in separate phases
• ε-greedy: exploit + explore (agnostic to exploitation)
§ Adaptive:
• E.g., exploit + explore (based on exploitation)
Outline

1. Basic Exploration Strategies
• Explore then Commit
• ε-greedy
• Softmax
2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling
3. Linear and Contextual Linear Bandit
Soft-max (aka Exp3): Algorithm

for t = 1, ..., T do
1. Take action
   a_t ~ exp( μ̂_t(a) / η ) / ∑_{a′} exp( μ̂_t(a′) / η )
2. Observe reward r_t ~ ν(a_t)
3. Update statistics for action a_t:
   N_t(a_t) = N_{t−1}(a_t) + 1
   μ̂_t(a_t) = (1 / N_t(a_t)) ∑_{s=1}^t r_s 1{a_s = a_t}
endfor

§ More probability to better actions (arms)
§ Temperature η: large for exploration, small for exploitation
§ Difficult to tune
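The choice rule can be sketched as follows; the estimates μ̂ and the two temperatures are illustrative, chosen to show the uniform-vs-greedy extremes:

```python
import math
import random

# Soft-max action selection: sample an arm with probability proportional
# to exp(mu_hat(a) / eta). The estimates mu_hat are assumed values.
def softmax_choice(mu_hat, eta, rng):
    weights = [math.exp(m / eta) for m in mu_hat]
    z = sum(weights)
    return rng.choices(range(len(mu_hat)), weights=[w / z for w in weights])[0]

rng = random.Random(0)
mu_hat = [0.2, 0.5, 0.8]
# Large temperature: close to uniform; small temperature: close to greedy.
picks_hot = [softmax_choice(mu_hat, eta=10.0, rng=rng) for _ in range(5000)]
picks_cold = [softmax_choice(mu_hat, eta=0.05, rng=rng) for _ in range(5000)]
p_hot = picks_hot.count(2) / 5000
p_cold = picks_cold.count(2) / 5000
print(round(p_hot, 2), round(p_cold, 2))
```

The tuning difficulty is visible here: a single fixed η interpolates between the two regimes, but the right interpolation depends on the unknown gaps.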
Example of Regret Performance

[Figure: regret curves of the strategies above; not recoverable from the transcript.]
Outline

1. Basic Exploration Strategies
• Explore then Commit
• ε-greedy
• Softmax
2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling
3. Linear and Contextual Linear Bandit
Problem-Independent Lower Bound

Theorem
Consider the family of multi-armed bandit problems with K Bernoulli arms. For any algorithm and fixed T, there exists a Bernoulli MAB problem instance such that:

R_T = Ω( √(KT) )

§ At any finite time T, the regret may be as large as Ω(√T)
Problem-Dependent Lower Bound

Theorem
Consider the family of multi-armed bandit problems with K Bernoulli arms and an algorithm that satisfies E[N_T(a)] = o(T^α) for any α > 0, any action a, and any Bernoulli MAB problem. Then for any Bernoulli MAB problem with gaps Δ(a) > 0, ∀a ≠ a*, the algorithm suffers regret:

lim inf_{T→∞} R_T / log T ≥ ∑_{a ≠ a*} Δ(a) / KL( μ(a), μ(a*) )

where KL(p, q) = p log(p/q) + (1 − p) log( (1 − p) / (1 − q) )

§ No algorithm can achieve a regret smaller than Ω(log T) (asymptotically)
§ The ratio Δ(a) / KL( μ(a), μ(a*) ) measures the difficulty of the problem
§ Algorithms such as ε-greedy with the right tuning are optimal!
Outline

1. Basic Exploration Strategies
• Explore then Commit
• ε-greedy
• Softmax
2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling
3. Linear and Contextual Linear Bandit
Recipe for Effective Exploration-Exploitation

1. Computation of estimates
2. Evaluation of uncertainty
3. Mechanism to combine estimates and uncertainty
4. Select the best action (according to its combined value)
Optimism in the Face of Uncertainty

"Whenever the value of an action is uncertain, consider its largest plausible value, and then select the best action."

Missing ingredient: the uncertainty of our estimates.
Measuring Uncertainty

Proposition (Chernoff-Hoeffding Inequality)
Let X_i ∈ [a, b] be n independent r.v. with mean μ = E[X_i]. Then:

ℙ( | (1/n) ∑_{i=1}^n X_i − μ | > (b − a) √( log(2/δ) / (2n) ) ) ≤ δ
Recipe of UCB

1. Computation of estimates
   μ̂_t(a) = (1 / N_t(a)) ∑_{s=1}^t r_s 1{a_s = a}
2. Evaluation of uncertainty
   | μ̂_t(a) − μ(a) | ≤ √( log(2/δ) / (2 N_t(a)) )
3. Mechanism to combine estimates and uncertainty
   B_t(a) = μ̂_t(a) + ρ √( log(2t) / (2 N_t(a)) )   (taking δ_t = 1/t)
4. Select the best action (according to its combined value)
   a_t = argmax_a B_t(a)
Upper Confidence Bound (UCB) Algorithm

a_{t+1} = argmax_a [ μ̂_t(a) + ρ √( log(2t) / (2 N_t(a)) ) ]

§ First term: exploitation (the mean reward of action a)
§ Second term: exploration bonus for rarely selected actions (optimism)

[Figure: initial confidence intervals for arms a_1, a_2, a_3, ..., shrinking as pulls accumulate up to time T.]

Co-designed w/ Pulkit Agrawal
UCB: Algorithm

for t = 1, ..., T do
1. Compute the upper-confidence bound
   B_t(a) = μ̂_t(a) + ρ √( log(2t) / (2 N_t(a)) )
2. Take action a_t = argmax_a B_t(a)
3. Observe reward r_t ~ ν(a_t)
4. Update statistics for action a_t:
   N_t(a_t) = N_{t−1}(a_t) + 1
   μ̂_t(a_t) = (1 / N_t(a_t)) ∑_{s=1}^t r_s 1{a_s = a_t}
endfor
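A runnable sketch of the loop above with ρ = 1; the Bernoulli means are illustrative assumptions:

```python
import math
import random

# UCB with bonus rho * sqrt(log(2t) / (2 N_t(a))) on assumed Bernoulli arms.
random.seed(0)
mu = [0.3, 0.5, 0.8]
K, T, rho = len(mu), 20_000, 1.0
counts = [0] * K
means = [0.0] * K
total = 0.0
for t in range(1, T + 1):
    if t <= K:
        a = t - 1                                # pull each arm once to start
    else:
        a = max(range(K), key=lambda b: means[b]
                + rho * math.sqrt(math.log(2 * t) / (2 * counts[b])))
    r = 1.0 if random.random() < mu[a] else 0.0
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]
    total += r

regret = max(mu) * T - total
print(counts.index(max(counts)) == 2)  # the best arm dominates the pulls
```

Because the bonus shrinks as N_t(a) grows, suboptimal arms are pulled only O(log T / Δ(a)²) times, matching the regret theorem that follows.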
UCB: Regret

Theorem
Consider a MAB problem with K Bernoulli arms with gaps Δ(a). If UCB is run with ρ = 1 and δ_t = 1/t for T steps, then it suffers regret:

R_T = O( ∑_{a ≠ a*} log T / Δ(a) )

Consider a 2-action MAB problem; then for any fixed T, in the worst case (w.r.t. Δ) UCB suffers a regret:

R_T = O( √(T log T) )

§ It (almost) matches the lower bounds
§ It does not require any prior knowledge about the MAB, apart from the range of the r.v.
§ The big-O hides a few numerical constants and T-independent additive terms
UCB: Proof Sketch

§ Disclaimer: this is a slightly suboptimal proof, but it provides an easy path.
§ Define the (high-probability) event [statistics]:
  ℰ = { ∀a, t: | μ̂_t(a) − μ(a) | ≤ √( log(2/δ) / (2 N_t(a)) ) }
§ By Chernoff-Hoeffding and a union bound: ℙ(ℰ) ≥ 1 − KTδ
§ If at time t we select action a, then [algorithm]:
  B_t(a) ≥ B_t(a*)
  μ̂_t(a) + √( log(2/δ) / (2 N_t(a)) ) ≥ μ̂_t(a*) + √( log(2/δ) / (2 N_t(a*)) )
§ On the event ℰ, we have [math]:
  μ(a) + 2 √( log(2/δ) / (2 N_t(a)) ) ≥ μ(a*)
UCB: Proof Sketch

§ Assume t is the last time a is selected; then N_s(a) = N_{t−1}(a) + 1 for s ≥ t, thus:
  μ(a) + 2 √( log(2/δ) / (2 N_T(a)) ) ≥ μ(a*)
§ Reordering [math]:
  N_T(a) ≤ 2 log(2/δ) / Δ(a)²
  under the event ℰ, and thus with probability at least 1 − KTδ
§ Moving to the expectation [statistics]:
  E[ N_T(a) ] = E[ N_T(a) | ℰ ] ℙ(ℰ) + E[ N_T(a) | ℰᶜ ] ℙ(ℰᶜ)
  E[ N_T(a) ] ≤ 2 log(2/δ) / Δ(a)² + K T² δ
§ Trading off the two terms with δ = 1/T², we obtain:
  E[ N_T(a) ] ≤ 4 log T / Δ(a)² + K
Tuning the ρ Parameter

Theory
§ ρ < 1: polynomial regret w.r.t. T
§ ρ ≥ 1: logarithmic regret w.r.t. T
Practice: ρ = 0.2 is often the best choice

Recall:
a_{t+1} = argmax_a [ μ̂_t(a) + ρ √( log(2t) / (2 N_t(a)) ) ]
Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate confidence intervals (c.i.)

Algorithm:
§ Compute the score of each arm a:
  B_t(a) = μ̂_t(a) + √( 2 σ̂_t²(a) log t / N_t(a) ) + 8 log t / (3 N_t(a))
§ Select action a_t = argmax_a B_t(a)
§ Update the statistics N_t(a_t), μ̂_t(a_t), and σ̂_t²(a_t)

Regret:
R_T ≤ O( (1/Δ) log T )
Improvements: KL-UCB

Idea: use even tighter c.i. based on the Kullback-Leibler divergence:

KL(p, q) = p log( p/q ) + (1 − p) log( (1 − p) / (1 − q) )

Algorithm: compute the score of each arm a (a convex optimization problem):

B_t(a) = max{ q ∈ [0, 1] : N_t(a) KL( μ̂_t(a), q ) ≤ log t + c log log t }

Regret: pulls to suboptimal arms satisfy

E[ N_T(a) ] ≤ (1 + ε) log n / KL( μ(a), μ(a*) ) + C₁ log log n + C₂(ε) / n^{β(ε)}

where KL(μ_a, μ*) ≥ 2 Δ_a², so the bound improves on UCB.
Outline

1. Basic Exploration Strategies
• Explore then Commit
• ε-greedy
• Softmax
2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling
3. Linear and Contextual Linear Bandit
Measuring Uncertainty
Source: Steve Roberts

§ Assume that the rewards r_t(a) are Bernoulli-distributed for all actions a, with parameter μ(a)
§ Define a prior μ(a) ~ Beta(α_0, β_0)
§ After t rewards, compute the posterior for action a as Beta(α_t(a), β_t(a)) with:

α_t(a) = α_0 + ∑_{s=1}^t 1{ a_s = a ∧ r_s = 1 }
β_t(a) = β_0 + ∑_{s=1}^t 1{ a_s = a ∧ r_s = 0 }

(α counts observed successes and β observed failures.)
Recipe of Thompson Sampling*

1. Computation of estimates
   μ̂_t(a) = α_t(a) / ( α_t(a) + β_t(a) )
2. Evaluation of uncertainty
   Beta( α_t(a), β_t(a) )
3. Mechanism to combine estimates and uncertainty
   θ_t(a) ~ Beta( α_t(a), β_t(a) )
4. Select the best action (according to its combined value)
   a_t = argmax_a θ_t(a)

*aka Posterior Sampling
Thompson Sampling: Algorithm

for t = 1, ..., T do
1. Sample a posterior draw for each action: θ_t(a) ~ Beta( α_t(a), β_t(a) )
2. Take action a_t = argmax_a θ_t(a)
3. Observe reward r_t ~ ν(a_t)
4. Update statistics for action a_t:
   α_t(a_t) = α_{t−1}(a_t) + 1{ r_t = 1 }
   β_t(a_t) = β_{t−1}(a_t) + 1{ r_t = 0 }
endfor
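A runnable sketch with uniform Beta(1, 1) priors and illustrative Bernoulli means, using `random.betavariate` for the posterior draws:

```python
import random

# Thompson sampling for Bernoulli arms: sample one value per arm from its
# Beta posterior, act greedily on the samples. Arm means are assumed.
random.seed(0)
mu = [0.3, 0.5, 0.8]
K, T = len(mu), 20_000
alpha = [1.0] * K   # prior + observed successes
beta = [1.0] * K    # prior + observed failures
total = 0.0
for t in range(T):
    theta = [random.betavariate(alpha[a], beta[a]) for a in range(K)]
    a = theta.index(max(theta))          # act greedily on the posterior sample
    r = 1.0 if random.random() < mu[a] else 0.0
    alpha[a] += r
    beta[a] += 1.0 - r
    total += r

regret = max(mu) * T - total
print(round(alpha[2] + beta[2] - 2))  # pulls of the best arm dominate
```

Exploration here is implicit: arms with few pulls have wide posteriors, so their samples θ_t(a) are occasionally large enough to be selected.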
Thompson Sampling: Regret

Theorem
Consider a MAB problem with K Bernoulli arms with gaps Δ(a). If Thompson Sampling is run for T steps, then for all ε > 0 it suffers regret:

R_T ≤ (1 + ε) ∑_{a ≠ a*} Δ(a) log T / KL( μ(a), μ(a*) ) + O( K / ε² )

§ It matches the lower bound
§ It requires defining a prior on the actions
Outline

1. Basic Exploration Strategies
• Explore then Commit
• ε-greedy
• Softmax
2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling
3. Linear and Contextual Linear Bandit
A Simple Recommendation System

§ A RS can recommend specific movies
§ Users arrive at random and no information about the user is available
§ The RS picks a movie to recommend to the user
§ The feedback is whether the user watched the movie or not
§ Objective: Design a RS that maximizes the number of movies watched
RS as a Multi-armed Bandit
for t = 1, …, n do
1. User arrives
2. Recommend movie a_t
3. Reward r_t = 1 if the user watches movie a_t, 0 otherwise
endfor
Issue: Too many movies are available to collect enough feedback for each movie separately
RS as a Linear Bandit
The model
§ μ(a) = E[r(a)] is the probability a random user watches movie a
§ Each movie a is characterized by some features x(a) ∈ ℝ^d (e.g. genre, release date, past rating, income, etc.)
§ Assumption:
• The expected value is a linear function μ(a) = x(a)^T θ* (with θ* ∈ ℝ^d unknown)
• The rewards are noisy observations r_t(a) = μ(a) + ε_t with E[ε_t] = 0

The objective
§ Maximize the sum of rewards E[Σ_{t=1}^n r_t]
Recipe of UCB
1. Computation of estimates
μ̂_t(a) = (1 / N_t(a)) Σ_{s=1}^{t} r_s(a) 1{a_s = a}

2. Evaluation of uncertainty
|μ̂_t(a) − μ(a)| ≤ √( log(1/δ) / (2 N_t(a)) )

3. Mechanism to combine estimates and uncertainty
B_t(a) = μ̂_t(a) + √( log(1/δ_t) / (2 N_t(a)) )

4. Select the best action (according to its combined value)
a_t = argmax_a B_t(a)

Issue: N_t(a) is likely to be 0 for most a. We need more sample-efficient estimates.
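The four-step recipe can be sketched as follows. This is a minimal simulation assuming Bernoulli arms and a fixed confidence level `delta` (the slide's δ_t may shrink with t); an arm that has never been pulled gets an infinite bonus so every arm is tried at least once.

```python
import numpy as np

def ucb(true_means, n_steps, delta=0.05, seed=0):
    """UCB sketch following the four-step recipe: estimate, uncertainty,
    combined index B_t(a), then argmax selection."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)   # N_t(a)
    sums = np.zeros(k)     # cumulative reward per arm
    for _ in range(n_steps):
        safe = np.maximum(counts, 1)
        mu_hat = sums / safe                       # step 1: estimates
        bonus = np.where(counts > 0,               # step 2: uncertainty
                         np.sqrt(np.log(1 / delta) / (2 * safe)),
                         np.inf)
        a = int(np.argmax(mu_hat + bonus))         # steps 3-4: B_t(a), argmax
        r = float(rng.random() < true_means[a])    # Bernoulli reward
        counts[a] += 1
        sums[a] += r
    return counts
```

The bonus shrinks as 1/√N_t(a), so a suboptimal arm stops being pulled once its confidence interval separates from the best arm's.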
The Regret
R_n = max_a E[ Σ_{t=1}^n r_t(a) ] − E[ Σ_{t=1}^n r_t(a_t) ]
    = E[ Σ_{t=1}^n μ(a*) − μ(a_t) ]

Issue: a* is unlikely to ever be selected if n ≪ K
Least-Squares Estimate of θ*
§ Least-squares estimate
θ̂_n = argmin_{θ ∈ ℝ^d} (1/n) Σ_{t=1}^n ( r_t − x(a_t)^T θ )² + λ ‖θ‖²

§ Closed-form solution
A_n = Σ_{t=1}^n x(a_t) x(a_t)^T + λ I,   b_n = Σ_{t=1}^n x(a_t) r_t
⟹ θ̂_n = A_n^{-1} b_n

§ Estimate of the value of action a
μ̂_n(a) = x(a)^T θ̂_n
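The closed-form ridge estimate above is a few lines of linear algebra. A minimal sketch, where `X` stacks the observed feature vectors x(a_t) row by row and `r` holds the rewards:

```python
import numpy as np

def ridge_estimate(X, r, lam=1.0):
    """theta_hat = A^{-1} b with A = sum_t x_t x_t^T + lam * I
    and b = sum_t x_t r_t, as on the slide.

    X: (n, d) feature matrix; r: (n,) reward vector; lam: regularizer lambda.
    """
    n, d = X.shape
    A = X.T @ X + lam * np.eye(d)   # A_n
    b = X.T @ r                     # b_n
    # Solving the linear system is preferable to forming A^{-1} explicitly.
    return np.linalg.solve(A, b)
```

The regularizer λ keeps A_n invertible even before d linearly independent actions have been played.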
Measuring Uncertainty
Proposition
Let a_1, …, a_n be any sequence of actions adapted to the filtration ℱ_n. If the noise ε is sub-Gaussian with parameter σ and the features are bounded by ‖x(a)‖ ≤ L, then for any a, with probability 1 − δ:

|μ̂_n(a) − μ(a)| ≤ α_n ‖x(a)‖_{A_n^{-1}}

where

α_n = σ √( d log( (1 + nL²/λ) / δ ) ) + λ^{1/2} ‖θ*‖₂

§ ‖x(a)‖_{A_n^{-1}} measures the correlation between x(a) and the actions selected so far
§ If {x(a)}_a is an orthogonal basis for ℝ^d, this reduces to the MAB problem and ‖x(a)‖²_{A_n^{-1}} = 1 / N_n(a)
Recipe of LinUCB
1. Computation of estimates
θ̂_t = A_t^{-1} b_t,   μ̂_t(a) = x(a)^T θ̂_t

2. Evaluation of uncertainty
|μ̂_t(a) − μ(a)| ≤ α_t √( x(a)^T A_t^{-1} x(a) )

3. Mechanism to combine estimates and uncertainty
B_t(a) = μ̂_t(a) + α_t √( x(a)^T A_t^{-1} x(a) )

4. Select the best action (according to its combined value)
a_t = argmax_a B_t(a)
LinUCB: Algorithm
for t = 1, …, n do
1. Compute the upper-confidence bound
B_t(a) = μ̂_t(a) + α_t √( x(a)^T A_t^{-1} x(a) )
2. Take action a_t = argmax_a B_t(a)
3. Observe reward r_t ~ x(a_t)^T θ* + ε_t
4. Update statistics for action a_t
A_{t+1} = A_t + x(a_t) x(a_t)^T,   b_{t+1} = b_t + x(a_t) r_t,   θ̂_{t+1} = A_{t+1}^{-1} b_{t+1}
endfor
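The loop above translates almost directly into code. A minimal sketch for a finite action set, assuming Gaussian reward noise and a fixed exploration coefficient `alpha` in place of the slide's time-dependent α_t; `features` is a hypothetical (K, d) array holding x(a) for each of the K actions.

```python
import numpy as np

def linucb(features, theta_star, n_steps, lam=1.0, alpha=1.0,
           noise=0.1, seed=0):
    """LinUCB sketch: ridge statistics A_t, b_t; UCB index per action;
    rank-one update after each observed reward."""
    rng = np.random.default_rng(seed)
    K, d = features.shape
    A = lam * np.eye(d)   # A_t, initialized with the regularizer
    b = np.zeros(d)       # b_t
    chosen = []
    for _ in range(n_steps):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b                  # theta_hat_t = A_t^{-1} b_t
        mu = features @ theta_hat              # mu_hat_t(a) for every action
        # width[k] = sqrt(x(a_k)^T A_t^{-1} x(a_k))
        width = np.sqrt(np.einsum("kd,de,ke->k", features, A_inv, features))
        a = int(np.argmax(mu + alpha * width)) # a_t = argmax_a B_t(a)
        x = features[a]
        r = x @ theta_star + noise * rng.normal()
        A += np.outer(x, x)                    # A_{t+1} = A_t + x x^T
        b += r * x                             # b_{t+1} = b_t + r_t x
        chosen.append(a)
    return np.array(chosen)
```

Inverting A at each step is O(d³) for clarity; a practical implementation would maintain A^{-1} incrementally with the Sherman-Morrison formula.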
LinUCB: Regret
Theorem
Consider a linear MAB problem with actions defined in ℝ^d and unknown parameter θ* ∈ ℝ^d. If LinUCB is run with δ_t = 1/t for n steps, then it suffers a regret:

R_n = O( d √n log n )

§ It depends on d but not on the number of actions K
§ If K < ∞, we can improve the bound to

R_n = O( √( d n log(nK) ) )
A Simple Recommendation System

§ A RS can recommend specific movies
§ Users arrive at random and we have information about them
§ The RS recommends a movie to the user
§ The feedback is whether the user watched the movie or not
§ Objective: Design a RS that maximizes the number of movies watched
RS as a Multi-armed Bandit
for t = 1, …, n do
1. User u_t arrives
2. Recommend movie a_t
3. Reward r_t = 1 if the user watches movie a_t, 0 otherwise
endfor
Issue: Too many users to collect enough feedback for each user separately
RS as a Contextual Linear Bandit
The model
§ μ(u, a) = E[r(u, a)] is the probability user u watches movie a
§ Each user u and movie a pair is characterized by some features x(u, a) ∈ ℝ^d (e.g. name, location, genre, release date, past rating, income, etc.)
§ Assumption:
• The expected value is a linear function μ(u, a) = x(u, a)^T θ* (with θ* ∈ ℝ^d unknown)
• The rewards are noisy observations r_t(u, a) = μ(u, a) + ε_t with E[ε_t] = 0

The objective
§ Maximize the sum of rewards E[Σ_{t=1}^n r_t]
The Regret
R_n = E[ Σ_{t=1}^n max_a r_t(u_t, a) ] − E[ Σ_{t=1}^n r_t(u_t, a_t) ]
    = E[ Σ_{t=1}^n μ(u_t, a_t*) − μ(u_t, a_t) ]
Least-Squares Estimate of θ*
§ Least-squares estimate
θ̂_n = argmin_{θ ∈ ℝ^d} (1/n) Σ_{t=1}^n ( r_t − x(u_t, a_t)^T θ )² + λ ‖θ‖²

§ Closed-form solution
A_n = Σ_{t=1}^n x(u_t, a_t) x(u_t, a_t)^T + λ I,   b_n = Σ_{t=1}^n x(u_t, a_t) r_t
⟹ θ̂_n = A_n^{-1} b_n

§ Estimate of the value of action a
μ̂_n(u, a) = x(u, a)^T θ̂_n
ContextualLinUCB: Algorithm
for t = 1, …, n do
1. Observe context u_t
2. Compute the upper-confidence bound
B_t(u_t, a) = μ̂_t(u_t, a) + α_t √( x(u_t, a)^T A_t^{-1} x(u_t, a) )
3. Take action a_t = argmax_a B_t(u_t, a)
4. Observe reward r_t ~ x(u_t, a_t)^T θ* + ε_t
5. Update statistics for action a_t
A_{t+1} = A_t + x(u_t, a_t) x(u_t, a_t)^T,   b_{t+1} = b_t + x(u_t, a_t) r_t,   θ̂_{t+1} = A_{t+1}^{-1} b_{t+1}
endfor
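The only change from LinUCB is that the features depend on the observed context. A minimal sketch, assuming Gaussian noise and a fixed exploration coefficient `alpha`; `feature_fn(u, a)`, which returns the vector x(u, a), is a hypothetical helper supplied by the caller.

```python
import numpy as np

def contextual_linucb(feature_fn, contexts, n_actions, theta_star,
                      lam=1.0, alpha=1.0, noise=0.1, seed=0):
    """ContextualLinUCB sketch: same ridge statistics as LinUCB, but the
    UCB index is computed on context-dependent features x(u_t, a)."""
    rng = np.random.default_rng(seed)
    d = len(theta_star)
    A = lam * np.eye(d)
    b = np.zeros(d)
    picks = []
    for u in contexts:                          # 1. observe context u_t
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b
        X = np.stack([feature_fn(u, a) for a in range(n_actions)])
        ucb = X @ theta_hat + alpha * np.sqrt(
            np.einsum("kd,de,ke->k", X, A_inv, X))
        a = int(np.argmax(ucb))                 # 3. take the greedy UCB action
        x = X[a]
        r = x @ theta_star + noise * rng.normal()
        A += np.outer(x, x)                     # 5. update statistics
        b += r * x
        picks.append(a)
    return picks
```

Because θ* is shared across contexts, feedback from one user improves the value estimates for every other user, which is exactly what makes the contextual formulation sample-efficient.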
ContextualLinUCB: Regret
Theorem
Consider a contextual linear MAB problem with contexts and actions defined in ℝ^d and unknown parameter θ* ∈ ℝ^d. If ContextualLinUCB is run with δ_t = 1/t for n steps, then for any arbitrary sequence of contexts u_1, u_2, …, u_n, it suffers a regret:

R_n = O( d √n log n )
Regret and Finite Sample Analysis
§ How many data points (or experiences) are needed in order to get a good approximation of the optimal policy in a reinforcement learning task with an algorithm A?
• Sample complexity n(δ, ε, A): the smallest n such that, with probability at least 1 − δ,
AvgError(A_n) ≤ ε
where AvgError(A_t) is the average error made by the algorithm after t steps.
• Regret:
Σ_{t=1}^n Error(A_t)
Adapted from Mohammed Amine Bennouna & Moïse Blanchard (MIT)
Fitted Q-Iteration
Offline FQI guarantee [Munos, 2003; Munos and Szepesvári, 2008; Antos et al., 2008; Agarwal et al., 2021]

The k-th iterate of offline Fitted Q-Iteration guarantees that, with probability 1 − δ:

‖Q* − Q^{π_k}‖ ≤ O( (1/(1−γ))² √( C log(|ℱ|/δ) / n ) ) + 2γ^k / (1 − γ)

§ Remark: estimation and optimization error terms
§ Assumptions
• Concentrability C, a measure of data coverage (a strong assumption)
• Realizability: Q* ∈ ℱ, where ℱ is the function class for the Q-function
• Bellman completeness: for any f ∈ ℱ, Tf ∈ ℱ
§ Ingredients
• Performance difference lemma
• Bernstein's inequality