
Transcript
Page 1

Introduction to Multi-Armed Bandits
The Exploration-Exploitation Dilemma

Cathy Wu
6.246 Reinforcement Learning: Foundations and Methods

Apr 6, 2021

Page 2

References

1. Alessandro Lazaric. INRIA Lille. Reinforcement Learning. 2017, Lecture 6.

2. Aleksandrs Slivkins. Introduction to Multi-Armed Bandits. 2019. Chapters 1-3, 8.

Page 3

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 4

Recall: Q-Learning

Proposition
If the learning rate satisfies the Robbins-Monro conditions in all states and actions $(s, a) \in \mathcal{S}\times\mathcal{A}$,

$$\sum_{t \ge 0} \alpha_t(s, a) = \infty, \qquad \sum_{t \ge 0} \alpha_t(s, a)^2 < \infty,$$

and all state-action pairs are tried infinitely often, then for all $(s, a) \in \mathcal{S}\times\mathcal{A}$:

$$Q_t(s, a) \xrightarrow{\;a.s.\;} Q^*(s, a)$$

Remark: "infinitely often" requires a steady exploration policy.

Page 5

Learning the Optimal Policy

for k = 1, …, K do
  1. Set t = 0
  2. Set initial state s_0
  3. while (s_t not terminal)
     1) Take action a_t = argmax_a Q(s_t, a)
     2) Observe next state s_{t+1} and reward r_t
     3) Compute the temporal difference
        δ_t = r_t + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t)   (Q-learning)
     4) Update the Q-function
        Q(s_t, a_t) ← Q(s_t, a_t) + α(s_t, a_t) δ_t
     5) Set t = t + 1
  endwhile
endfor

⟹ No convergence: the greedy choice in step 1) never explores, so some state-action pairs may never be tried; the action should instead be taken according to a suitable exploration policy. (A runnable sketch follows below.)
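Below is a minimal runnable sketch of this loop in Python. The toy chain MDP, the ε = 0.1 exploration rate, and the 1/N(s,a) learning-rate schedule are illustrative assumptions, not from the lecture; the schedule satisfies the Robbins-Monro conditions above.

```python
# Hedged sketch: tabular Q-learning with an epsilon-greedy exploration policy
# on a hypothetical 5-state chain MDP (illustrative, not from the lecture).
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)

def step(s, a):
    """Chain MDP: action 1 moves right (reward 1 at the last state), action 0 resets."""
    if a == 1:
        s2 = s + 1
        return s2, float(s2 == n_states), s2 == n_states
    return 0, 0.0, False

Q = np.zeros((n_states + 1, n_actions))
counts = np.zeros_like(Q)

for episode in range(500):
    s, done, t = 0, False, 0
    while not done and t < 50:
        # Exploration policy: epsilon-greedy. Pure greedy selection may never
        # try some state-action pairs, which is why the slide flags "No Convergence".
        a = int(rng.integers(n_actions)) if rng.random() < 0.1 else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        counts[s, a] += 1
        alpha = 1.0 / counts[s, a]    # Robbins-Monro: sum alpha = inf, sum alpha^2 < inf
        delta = r + gamma * np.max(Q[s2]) - Q[s, a]    # temporal difference
        Q[s, a] += alpha * delta
        s, t = s2, t + 1
```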

Page 6

Learning the Optimal Policy

for k = 1, …, K do
  1. Set t = 0
  2. Set initial state s_0
  3. while (s_t not terminal)
     1) Take action a_t ~ π(s_t) (a fixed exploration policy, e.g. uniform over actions)
     2) Observe next state s_{t+1} and reward r_t
     3) Compute the temporal difference
        δ_t = r_t + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t)   (Q-learning)
     4) Update the Q-function
        Q(s_t, a_t) ← Q(s_t, a_t) + α(s_t, a_t) δ_t
     5) Set t = t + 1
  endwhile
endfor

⟹ Bad convergence: all state-action pairs are tried infinitely often, but the agent keeps collecting poor reward while it learns.

Page 7

From RL to Multi-armed Bandit

for k = 1, …, K do
  1. Set t = 0
  2. Set initial state s_0
  3. while (s_t not terminal)
     1) Take action a_t
     2) Observe next state s_{t+1} and reward r_t
  endwhile
endfor

Page 8

From RL to Multi-armed Bandit

The problem
§ Set of A actions
§ Reward distribution ν(a) with μ(a) = E[r(a)] (bounded in [0,1] for convenience)

The protocol
for t = 1, …, T do
  1. Take action a_t
  2. Observe reward r_t ~ ν(a_t)
endfor

The objective
§ Maximize the sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$
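A minimal sketch of this protocol in Python; the three Bernoulli means and the uniform placeholder policy are illustrative assumptions.

```python
# Hedged sketch of the bandit protocol: the environment reveals only the
# reward of the selected action. The means are illustrative.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])    # mu(a) = E[r(a)], unknown to the learner
T = 1000

total_reward = 0.0
for t in range(T):
    a_t = int(rng.integers(len(mu)))        # placeholder policy: uniform
    r_t = float(rng.random() < mu[a_t])     # r_t ~ nu(a_t), bounded in [0, 1]
    total_reward += r_t
# Objective: maximize E[sum_t r_t]; the best fixed action would give T * mu.max().
```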

Page 9

A Simple Recommendation System

§ A RS can recommend different genres of movies (e.g. action, adventure, romance, animation)

§ Users arrive at random and no information about the user is available

§ The RS picks a genre to recommend to the user but not the specific movies

§ The feedback is whether the user watched a movie of the recommended genre or not

§ Objective: Design a RS that maximizes the number of movies watched in the recommended genre


Page 10

RS as a Multi-armed Bandit

for t = 1, …, T do
  1. User arrives
  2. Recommend genre a_t
  3. Reward:
     r_t = 1 if the user watches a movie of genre a_t, and 0 otherwise
endfor

Page 11

RS as a Multi-armed Bandit

The model
§ ν(a) is a Bernoulli distribution
§ μ(a) = E[r(a)] is the probability that a random user watches a movie of genre a
§ Assumption: r_t ~ ν(a_t) is a realization of the Bernoulli of genre a_t

The objective
§ Maximize the sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$

Page 12

Other Examples

§ Packet routing
§ Clinical trials
§ Web advertising
§ Health advice
§ Education
§ Computer games
§ Resource mining
§ …

Page 13

The Regret

$$R_T = \max_a \mathbb{E}\left[\sum_{t=1}^{T} r_t(a)\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_t(a_t)\right]$$

The expectation summarizes every possible source of randomness (either in ν or in the algorithm).

Relation to RL: one can think of this as T trajectories (each of length 1).

The regret measures not only the final error, but all the mistakes made over T "iterations."

Page 14

The Regret

§ Number of times action a has been selected after T rounds:

$$N_T(a) = \sum_{t=1}^{T} \mathbb{1}\{a_t = a\}$$

§ Gap: Δ(a) = μ(a*) − μ(a)

§ Regret:

$$\begin{aligned}
R_T &= \max_a \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a)\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a_t)\Big] \\
&= \max_a T\,\mu(a) - \mathbb{E}\Big[\sum_{t=1}^{T} \mu(a_t)\Big] \\
&= \max_a T\,\mu(a) - \sum_a \mathbb{E}[N_T(a)]\,\mu(a) \\
&= T\,\mu(a^*) - \sum_a \mathbb{E}[N_T(a)]\,\mu(a) \\
&= \sum_{a \neq a^*} \mathbb{E}[N_T(a)]\,\big(\mu(a^*) - \mu(a)\big) \\
&= \sum_{a \neq a^*} \mathbb{E}[N_T(a)]\,\Delta(a)
\end{aligned}$$

Page 15

The Regret

$$R_T = \sum_{a \neq a^*} \mathbb{E}[N_T(a)]\,\Delta(a)$$

Ø We only need to study the expected number of times suboptimal actions are selected
Ø Worst case possible: R_T = O(T)
Ø A good algorithm has R_T = o(T), i.e., R_T/T → 0

Page 16

The Exploration-Exploitation Dilemma

Problem 1: The environment does not reveal the reward of the actions not selected by the learner
Ø The learner should gain information by repeatedly selecting all actions ⟹ exploration

Problem 2: Whenever the learner selects a bad action, it suffers some regret
Ø The learner should reduce the regret by repeatedly selecting the best action ⟹ exploitation

Challenge: The learner should solve two opposite problems ⟹ the exploration-exploitation dilemma!

Page 17

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 18

Explore-then-Commit

[Figure: a horizon of length T split into an explore phase (the first K steps) and an exploit phase (the remaining steps).]

Co-designed w/ Pulkit Agrawal

Page 19

Explore-then-Commit: Algorithm

Explore phase
for t = 1, …, K do
  1. Take action a_t ~ U(A) (or round robin)
  2. Observe reward r_t ~ ν(a_t)
endfor

Compute statistics for each action a:

$$\hat\mu_K(a) = \frac{1}{N_K(a)} \sum_{t=1}^{K} r_t\,\mathbb{1}\{a_t = a\}$$

Exploit phase
for t = K + 1, …, T do
  1. Take action $\hat a^* = \arg\max_a \hat\mu_K(a)$
  2. Observe reward $r_t \sim \nu(\hat a^*)$
endfor
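The following is a minimal Python sketch of this procedure for Bernoulli arms; the arm means (0.3, 0.5, 0.7) and the exploration budget K = 90 are illustrative assumptions, not values from the lecture.

```python
# Hedged sketch of Explore-then-Commit with a round-robin explore phase.
import numpy as np

def explore_then_commit(mu, K, T, rng):
    A = len(mu)
    counts, sums = np.zeros(A), np.zeros(A)
    total = 0.0
    for t in range(K):                       # explore phase (round robin)
        a = t % A
        r = float(rng.random() < mu[a])
        counts[a] += 1
        sums[a] += r
        total += r
    a_hat = int(np.argmax(sums / counts))    # commit to the empirical best arm
    for t in range(K, T):                    # exploit phase
        total += float(rng.random() < mu[a_hat])
    return total

rng = np.random.default_rng(0)
print(explore_then_commit(np.array([0.3, 0.5, 0.7]), K=90, T=1000, rng=rng))
```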

Page 20

Explore-then-Commit: Regret

Theorem
If explore-then-commit is run with parameter K for T steps, then it suffers a regret:

$$R_T \le \sum_{a\neq a^*} \left[ \frac{K}{A}\,\Delta(a) + 2\,(T - K - 1)\exp\!\left(-\frac{K\,\Delta(a)^2}{2}\right) \right]$$

§ Difficult to tune: K should be adjusted depending on T and Δ(a)
§ Worst case w.r.t. Δ(a): $R_T = O(T^{2/3})$ (for $K = T^{2/3}$)
§ Recall: worst possible: $R_T = O(T)$

Page 21

Explore-then-Commit: Regret Analysis

§ Regret decomposition:

$$R_T = \sum_{t=1}^{K} \mathbb{E}\big[\mu(a^*) - \mu(a_t)\big] + \sum_{t=K+1}^{T} \mathbb{E}\big[\mu(a^*) - \mu(\hat a^*)\big]$$

§ During the explore phase:

$$\sum_{t=1}^{K} \mathbb{E}\big[\mu(a^*) - \mu(a_t)\big] = \frac{K}{A} \sum_{a \neq a^*} \Delta(a)$$

§ During the exploit phase:

$$\begin{aligned}
\sum_{t=K+1}^{T} \mathbb{E}\big[\mu(a^*) - \mu(\hat a^*)\big]
&= (T - K - 1) \sum_{a \neq a^*} \mathbb{P}\big(\hat a^* = a\big)\,\Delta(a) \\
&= (T - K - 1) \sum_{a \neq a^*} \mathbb{P}\big(\forall a'\!: \hat\mu_K(a) \ge \hat\mu_K(a')\big)\,\Delta(a) \\
&\le (T - K - 1) \sum_{a \neq a^*} \mathbb{P}\big(\hat\mu_K(a) \ge \hat\mu_K(a^*)\big)\,\Delta(a)
\end{aligned}$$

Recall:

$$\hat\mu_K(a) = \frac{1}{N_K(a)} \sum_{t=1}^{K} r_t\,\mathbb{1}\{a_t = a\}, \qquad \hat a^* = \arg\max_a \hat\mu_K(a)$$

Page 22

Explore-then-Commit: Regret Analysis

Proposition (Hoeffding Inequality)
Let X_i ∈ [a, b] be independent r.v. with common mean μ = E[X_i]. Then, for all ε > 0:

$$\mathbb{P}\big(|\hat X_n - \mu| > \varepsilon\big) \le 2\exp\!\left(-\frac{2n\varepsilon^2}{(b-a)^2}\right)$$

where $\hat X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.
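A quick Monte Carlo illustration of the bound; the sample size n = 100, tolerance ε = 0.1, and Bernoulli(0.5) distribution are illustrative assumptions. The empirical deviation frequency should sit below 2 exp(−2nε²).

```python
# Hedged numerical check of Hoeffding's inequality for [0, 1]-valued samples.
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 20000
x = rng.random((trials, n)) < 0.5                # Bernoulli(0.5) samples in [0, 1]
dev = np.abs(x.mean(axis=1) - 0.5) > eps         # event |X_bar_n - mu| > eps
print("empirical:", dev.mean(), "  bound:", 2 * np.exp(-2 * n * eps**2))
```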

Page 23

Explore-then-Commit: Regret Analysis

§ Probability of error:

$$\begin{aligned}
\mathbb{P}\big(\hat\mu_K(a) \ge \hat\mu_K(a^*)\big)
&= \mathbb{P}\big(\hat\mu_K(a) - \mu(a) \ge \hat\mu_K(a^*) - \mu(a^*) + \Delta(a)\big) \\
&\le \mathbb{P}\Big(\hat\mu_K(a) - \mu(a) \ge \tfrac{\Delta(a)}{2}\Big) + \mathbb{P}\Big(\mu(a^*) - \hat\mu_K(a^*) \ge \tfrac{\Delta(a)}{2}\Big)
\end{aligned}$$

§ Applying the Hoeffding bound for random variables $r_t \in [0, 1]$:

$$\mathbb{P}\big(\hat\mu_K(a) \ge \hat\mu_K(a^*)\big) \le 2\exp\!\left(-\frac{K\,\Delta(a)^2}{2}\right)$$

Page 24

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 25

ε-greedy: Algorithm

for t = 1, …, T do
  1. Take action

$$a_t = \begin{cases} \sim U(\mathcal{A}) & \text{with probability } \varepsilon_t \text{ (explore)} \\ \arg\max_a \hat\mu_t(a) & \text{with probability } 1 - \varepsilon_t \text{ (exploit)} \end{cases}$$

  2. Observe reward r_t ~ ν(a_t)
  3. Update statistics for action a_t:

$$N_t(a_t) = N_{t-1}(a_t) + 1, \qquad \hat\mu_t(a_t) = \frac{1}{N_t(a_t)} \sum_{s=1}^{t} r_s\,\mathbb{1}\{a_s = a_t\}$$

endfor
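A minimal Python sketch of this loop, using the decaying schedule ε_t = t^(−1/3)(A log t)^(1/3) from the theorem on the next page; the arm means, the log(t+1) smoothing (to avoid log 1 = 0), and the forced initial pull of each arm are illustrative practical tweaks.

```python
# Hedged sketch of epsilon-greedy with a decaying exploration rate.
import numpy as np

def eps_greedy(mu, T, rng):
    A = len(mu)
    counts, means = np.zeros(A), np.zeros(A)
    total = 0.0
    for t in range(1, T + 1):
        eps = min(1.0, t ** (-1 / 3) * (A * np.log(t + 1)) ** (1 / 3))
        if rng.random() < eps or counts.min() == 0:    # explore
            a = int(rng.integers(A))
        else:                                          # exploit
            a = int(np.argmax(means))
        r = float(rng.random() < mu[a])
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]         # incremental mean update
        total += r
    return total

rng = np.random.default_rng(0)
print(eps_greedy(np.array([0.3, 0.5, 0.7]), T=10000, rng=rng))
```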

Page 26

ε-greedy: Regret

Theorem
If ε-greedy is run with parameter $\varepsilon_t = t^{-1/3}\,(A \log t)^{1/3}$, then for each round t it suffers a regret:

$$R_t \le \tilde O\big(t^{2/3}\big)$$

§ Same asymptotic regret as explore-then-commit, but now the bound holds at all rounds t
§ Can do better, but the optimal ε depends on knowledge of Δ (difficult to tune)
§ Keeps selecting very bad arms with some probability
§ Sharply separates exploration and exploitation

Page 27

Types of Exploration Strategies

Non-adaptive exploration vs. adaptive exploration:

§ Explore-then-commit: explore and exploit in separate phases (non-adaptive)
§ ε-greedy: exploit + explore, with exploration agnostic to what has been learned so far (non-adaptive)
§ Adaptive exploration: exploit + explore, with exploration based on the exploitation estimates learned so far

Page 28

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 29

Soft-max (aka Exp3): Algorithm

for t = 1, …, T do
  1. Take action

$$a_t \sim \frac{\exp\big(\hat\mu_t(a)/\tau\big)}{\sum_{a'} \exp\big(\hat\mu_t(a')/\tau\big)}$$

  2. Observe reward r_t ~ ν(a_t)
  3. Update statistics for action a_t:

$$N_t(a_t) = N_{t-1}(a_t) + 1, \qquad \hat\mu_t(a_t) = \frac{1}{N_t(a_t)} \sum_{s=1}^{t} r_s\,\mathbb{1}\{a_s = a_t\}$$

endfor

§ Gives more probability to better actions (arms)
§ Temperature τ: large for exploration, small for exploitation
§ Difficult to tune
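A minimal Python sketch of this sampling rule; the arm means and the temperature τ = 0.1 are illustrative assumptions.

```python
# Hedged sketch of softmax action selection over empirical means.
import numpy as np

def softmax_bandit(mu, T, tau, rng):
    A = len(mu)
    counts, means = np.zeros(A), np.zeros(A)
    total = 0.0
    for t in range(T):
        logits = means / tau
        p = np.exp(logits - logits.max())    # subtract max for numerical stability
        p /= p.sum()
        a = int(rng.choice(A, p=p))          # sample proportionally to exp(mu_hat / tau)
        r = float(rng.random() < mu[a])
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        total += r
    return total

rng = np.random.default_rng(0)
print(softmax_bandit(np.array([0.3, 0.5, 0.7]), T=10000, tau=0.1, rng=rng))
```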

Page 30

Example of Regret Performance

Page 31

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 32

Problem-Independent Lower Bound

Theorem
Consider the family of multi-armed bandit problems with A Bernoulli arms. For any algorithm and any fixed T, there exists a Bernoulli MAB problem instance such that:

$$R_T = \Omega\big(\sqrt{AT}\big)$$

§ At any finite time T, the regret may be as large as Ω(√T)

Page 33

Problem-Dependent Lower Bound

Theorem
Consider the family of multi-armed bandit problems with A Bernoulli arms, and an algorithm that satisfies $\mathbb{E}[N_T(a)] = o(T^\alpha)$ for any α > 0, any action a, and any Bernoulli MAB problem. Then for any Bernoulli MAB problem with gaps Δ(a) > 0, ∀a ≠ a*, the algorithm suffers regret:

$$\liminf_{T\to\infty} \frac{R_T}{\log T} \ge \sum_{a\neq a^*} \frac{\Delta(a)}{\mathrm{KL}\big(\nu(a), \nu(a^*)\big)}$$

where

$$\mathrm{KL}(p, q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$$

§ No algorithm can achieve a regret smaller than Ω(log T) (asymptotically)
§ The ratio Δ(a)/KL(ν(a), ν(a*)) measures the difficulty of the problem
§ Algorithms such as ε-greedy with the right tuning are optimal!
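As a worked example of the ratio above, the Bernoulli KL divergence and the resulting lower-bound constant can be computed directly; the instance μ = (0.3, 0.5, 0.7) is an illustrative assumption.

```python
# Hedged worked example: the asymptotic lower-bound constant for one instance.
import numpy as np

def kl_bernoulli(p, q):
    """KL(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

mu = np.array([0.3, 0.5, 0.7])
mu_star = mu.max()
const = sum((mu_star - m) / kl_bernoulli(m, mu_star) for m in mu if m < mu_star)
print(const)   # regret grows at least ~ const * log(T), asymptotically
```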

Page 34

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 35

Recipe for Effective Exploration-Exploitation

1. Computation of estimates
2. Evaluation of uncertainty
3. Mechanism to combine estimates and uncertainty
4. Select the best action (according to its combined value)

Page 36

Optimism in the Face of Uncertainty

“Whenever the value of an action is uncertain, consider its largest plausible value, and then select the best action.”

Missing ingredient: uncertainty of our estimates.

Page 37

Measuring Uncertainty

Proposition (Chernoff-Hoeffding Inequality)
Let X_i ∈ [a, b] be n independent r.v. with mean μ = E[X_i]. Then:

$$\mathbb{P}\left(\bigg|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\bigg| > (b-a)\sqrt{\frac{\log(2/\delta)}{2n}}\right) \le \delta$$

Page 38

Recipe of UCB

1. Computation of estimates:

$$\hat\mu_t(a) = \frac{1}{N_t(a)} \sum_{s=1}^{t} r_s\,\mathbb{1}\{a_s = a\}$$

2. Evaluation of uncertainty: with probability 1 − δ,

$$\big|\hat\mu_t(a) - \mu(a)\big| \le \sqrt{\frac{\log(2/\delta)}{2\,N_t(a)}}$$

3. Mechanism to combine estimates and uncertainty:

$$B_t(a) = \hat\mu_t(a) + \rho\sqrt{\frac{\log(2/\delta_t)}{2\,N_t(a)}}$$

4. Select the best action (according to its combined value):

$$a_t = \arg\max_a B_t(a)$$

Page 39

Upper Confidence Bound (UCB) Algorithm

$$a_{t+1} = \arg\max_a \Bigg[ \underbrace{\hat\mu_t(a)}_{\substack{\text{exploitation} \\ \text{(mean reward of action } a\text{)}}} + \underbrace{\rho\sqrt{\frac{\log(2/\delta_t)}{2\,N_t(a)}}}_{\substack{\text{exploration bonus for rare actions} \\ \text{(optimism)}}} \Bigg]$$

[Figure: initial confidence intervals for actions a1, a2, a3, shrinking as more samples are collected over time T.]

Co-designed w/ Pulkit Agrawal

Page 40

UCB: Algorithm

for t = 1, …, T do
  1. Compute the upper-confidence bound

$$B_t(a) = \hat\mu_t(a) + \rho\sqrt{\frac{\log(2/\delta_t)}{2\,N_t(a)}}$$

  2. Take action a_t = argmax_a B_t(a)
  3. Observe reward r_t ~ ν(a_t)
  4. Update statistics for action a_t:

$$N_t(a_t) = N_{t-1}(a_t) + 1, \qquad \hat\mu_t(a_t) = \frac{1}{N_t(a_t)} \sum_{s=1}^{t} r_s\,\mathbb{1}\{a_s = a_t\}$$

endfor
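A minimal Python sketch of this loop with δ_t = 1/t, so the bonus is ρ√(log(2t)/(2N_t(a))); the arm means and the initial one-pull-per-arm pass (to make all counts positive) are illustrative assumptions.

```python
# Hedged sketch of UCB for Bernoulli arms.
import numpy as np

def ucb(mu, T, rho, rng):
    A = len(mu)
    counts, means = np.zeros(A), np.zeros(A)
    total = 0.0
    for t in range(1, T + 1):
        if t <= A:
            a = t - 1                        # pull each arm once to initialize counts
        else:
            bonus = rho * np.sqrt(np.log(2 * t) / (2 * counts))   # delta_t = 1/t
            a = int(np.argmax(means + bonus))
        r = float(rng.random() < mu[a])
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        total += r
    return total

rng = np.random.default_rng(0)
print(ucb(np.array([0.3, 0.5, 0.7]), T=10000, rho=1.0, rng=rng))
```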

Page 41

UCB: Regret

Theorem
Consider a MAB problem with A Bernoulli arms with gaps Δ(a). If UCB is run with ρ = 1 and δ_t = 1/t for T steps, then it suffers regret:

$$R_T = O\left(\sum_{a\neq a^*} \frac{\log T}{\Delta(a)}\right)$$

Consider a 2-action MAB problem; then for any fixed T, in the worst case (w.r.t. Δ) UCB suffers a regret:

$$R_T = O\big(\sqrt{T\log T}\big)$$

§ It (almost) matches the lower bounds
§ It does not require any prior knowledge about the MAB, apart from the range of the r.v.
§ The big-O hides a few numerical constants and T-independent additive terms

Page 42

UCB: Proof Sketch

§ Disclaimer: this is a slightly suboptimal proof, but it provides an easy path.

§ Define the (high-probability) event [statistics]:

$$\mathcal{E} = \left\{ \forall a, t: \big|\hat\mu_t(a) - \mu(a)\big| \le \sqrt{\frac{\log(2/\delta)}{2\,N_t(a)}} \right\}$$

§ By Chernoff-Hoeffding and a union bound: $\mathbb{P}(\mathcal{E}) \ge 1 - TA\delta$

§ If at time t we select action a, then [algorithm] $B_t(a) \ge B_t(a^*)$:

$$\hat\mu_t(a) + \sqrt{\frac{\log(2/\delta)}{2\,N_t(a)}} \ge \hat\mu_t(a^*) + \sqrt{\frac{\log(2/\delta)}{2\,N_t(a^*)}}$$

§ On the event ℰ, we have [math]:

$$\mu(a) + 2\sqrt{\frac{\log(2/\delta)}{2\,N_t(a)}} \ge \mu(a^*)$$

Page 43

UCB: Proof Sketch

§ Assume t is the last time a is selected, so that $N_T(a) = N_{t-1}(a) + 1$ (for T ≥ t); thus:

$$\mu(a) + 2\sqrt{\frac{\log(2/\delta)}{2\,N_T(a)}} \ge \mu(a^*)$$

§ Reordering [math]:

$$N_T(a) \le \frac{2\log(2/\delta)}{\Delta(a)^2}$$

under the event ℰ, and thus with probability at least $1 - TA\delta$.

§ Moving to the expectation [statistics]:

$$\mathbb{E}[N_T(a)] = \mathbb{E}[N_T(a)\,|\,\mathcal{E}]\,\mathbb{P}(\mathcal{E}) + \mathbb{E}[N_T(a)\,|\,\mathcal{E}^c]\,\mathbb{P}(\mathcal{E}^c) \le \frac{2\log(2/\delta)}{\Delta(a)^2} + T^2 A\,\delta$$

§ Trading off the two terms with $\delta = 1/T^2$, we obtain:

$$\mathbb{E}[N_T(a)] \le \frac{4\log T}{\Delta(a)^2} + A$$

Page 44

Tuning the ρ Parameter

Theory
§ ρ < 1: polynomial regret w.r.t. T
§ ρ ≥ 1: logarithmic regret w.r.t. T

Practice: ρ = 0.2 is often the best choice

Recall:

$$a_{t+1} = \arg\max_a \left[\hat\mu_t(a) + \rho\sqrt{\frac{\log(2/\delta_t)}{2\,N_t(a)}}\right]$$

Page 45

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate confidence intervals (c.i.)

Algorithm:
§ Compute the score of each arm a:

$$B_t(a) = \hat\mu_t(a) + \sqrt{\frac{2\,\hat\sigma_t^2(a)\log(1/\delta)}{N_t(a)}} + \frac{8\log(1/\delta)}{3\,N_t(a)}$$

§ Select action a_t = argmax_a B_t(a)
§ Update the statistics N_t(a_t), μ̂_t(a_t), and the empirical variance σ̂_t²(a_t)

Regret:

$$R_T \le O\left(\sum_{a\neq a^*}\left(\frac{\sigma^2(a)}{\Delta(a)} + 1\right)\log T\right)$$

Page 46

Improvements: KL-UCB

Idea: use even tighter c.i. based on the Kullback-Leibler divergence

$$\mathrm{KL}(p, q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$$

Algorithm: compute the score of each arm a (a one-dimensional convex optimization):

$$B_t(a) = \max\Big\{ q \in [0,1] : N_t(a)\,\mathrm{KL}\big(\hat\mu_t(a), q\big) \le \log t + c\log\log t \Big\}$$

Regret: the pulls to suboptimal arms satisfy, for any ε > 0,

$$\mathbb{E}[N_T(a)] \le (1+\varepsilon)\,\frac{\log T}{\mathrm{KL}\big(\nu(a), \nu(a^*)\big)} + C_1\log\log T + \frac{C_2(\varepsilon)}{T^{b(\varepsilon)}}$$

where KL(ν(a), ν(a*)) ≥ 2Δ(a)², so this is never worse than the Hoeffding-based confidence interval.
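A sketch of how the KL-UCB score can be computed by bisection, exploiting that KL(μ̂, q) is increasing in q for q ≥ μ̂; the constant c = 3 and the example pull counts are illustrative assumptions.

```python
# Hedged sketch: the KL-UCB index for one Bernoulli arm via bisection.
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n_pulls, t, c=3.0, iters=50):
    budget = (np.log(t) + c * np.log(max(np.log(t), 1.0))) / n_pulls
    lo, hi = mu_hat, 1.0
    for _ in range(iters):                  # bisection: KL(mu_hat, .) increases on [mu_hat, 1]
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(mu_hat=0.4, n_pulls=25, t=1000))
```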

Page 47

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 48

Measuring Uncertainty

Source: Steve Roberts

§ Assume that the rewards r_t(a) are distributed as Bernoulli for all actions a, with parameter μ(a)
§ Define a prior μ(a) ~ Beta(α₀, β₀)
§ After t rewards, compute the posterior for action a as Beta(α_t(a), β_t(a)) with:

$$\alpha_t(a) = \alpha_0 + \sum_{s=1}^{t} \mathbb{1}\{a_s = a \wedge r_s = 1\}, \qquad \beta_t(a) = \beta_0 + \sum_{s=1}^{t} \mathbb{1}\{a_s = a \wedge r_s = 0\}$$

(so α counts observed successes and β observed failures, making the posterior mean α_t(a)/(α_t(a) + β_t(a)))

Page 49

Recipe of Thompson Sampling*

1. Computation of estimates:

$$\hat\mu_t(a) = \frac{\alpha_t(a)}{\alpha_t(a) + \beta_t(a)}$$

2. Evaluation of uncertainty: the posterior Beta(α_t(a), β_t(a))

3. Mechanism to combine estimates and uncertainty:

$$\theta_t(a) \sim \mathrm{Beta}\big(\alpha_t(a), \beta_t(a)\big)$$

4. Select the best action (according to its combined value):

$$a_t = \arg\max_a \theta_t(a)$$

*aka Posterior Sampling

Page 50

Thompson Sampling: Algorithm

for t = 1, …, T do
  1. Sample one parameter per action from its posterior

$$\theta_t(a) \sim \mathrm{Beta}\big(\alpha_t(a), \beta_t(a)\big)$$

  2. Take action a_t = argmax_a θ_t(a)
  3. Observe reward r_t ~ ν(a_t)
  4. Update statistics for action a_t:

$$\alpha_t(a_t) = \alpha_{t-1}(a_t) + \mathbb{1}\{r_t = 1\}, \qquad \beta_t(a_t) = \beta_{t-1}(a_t) + \mathbb{1}\{r_t = 0\}$$

endfor
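A minimal Python sketch of this loop with uniform Beta(1, 1) priors; the arm means are illustrative assumptions.

```python
# Hedged sketch of Thompson Sampling for Bernoulli arms.
import numpy as np

def thompson(mu, T, rng):
    A = len(mu)
    alpha, beta = np.ones(A), np.ones(A)    # Beta(1, 1) = uniform prior per arm
    total = 0.0
    for t in range(T):
        theta = rng.beta(alpha, beta)       # one posterior sample per arm
        a = int(np.argmax(theta))           # act greedily on the samples
        r = float(rng.random() < mu[a])
        alpha[a] += r                       # success count
        beta[a] += 1 - r                    # failure count
        total += r
    return total

rng = np.random.default_rng(0)
print(thompson(np.array([0.3, 0.5, 0.7]), T=10000, rng=rng))
```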

Page 51

Thompson Sampling: Regret

Theorem
Consider a MAB problem with A Bernoulli arms with gaps Δ(a). If Thompson Sampling is run for T steps, then it suffers regret:

$$R_T = O\left((1+\varepsilon)\sum_{a\neq a^*} \frac{\Delta(a)\,\log T}{\mathrm{KL}\big(\nu(a), \nu(a^*)\big)} + \frac{C(\nu)}{\varepsilon^2}\right) \qquad \forall \varepsilon > 0$$

§ It matches the lower bound
§ It requires defining a prior on the actions

Page 52

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 53

A Simple Recommendation System

§ A RS can recommend specific movies
§ Users arrive at random and no information about the user is available
§ The RS picks a movie to recommend to the user
§ The feedback is whether the user watched the movie or not
§ Objective: Design a RS that maximizes the number of movies watched

Page 54

RS as a Multi-armed Bandit

for t = 1, …, T do
  1. User arrives
  2. Recommend movie a_t
  3. Reward:
     r_t = 1 if the user watches movie a_t, and 0 otherwise
endfor

Issue: too many movies are available to collect enough feedback for each movie separately

Page 55

RS as a Linear Bandit

The model
§ μ(a) = E[r(a)] is the probability that a random user watches movie a
§ Each movie a is characterized by some features x(a) ∈ ℝᵈ (e.g. genre, release date, past rating, income, etc.)
§ Assumption:
• The expected value is a linear function μ(a) = x(a)ᵀθ* (with θ* ∈ ℝᵈ unknown)
• The rewards are noisy observations r_t(a) = μ(a) + ε_t with E[ε_t] = 0

The objective
§ Maximize the sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$

Page 56

Recipe of UCB

1. Computation of estimates:

$$\hat\mu_t(a) = \frac{1}{N_t(a)} \sum_{s=1}^{t} r_s\,\mathbb{1}\{a_s = a\}$$

2. Evaluation of uncertainty:

$$\big|\hat\mu_t(a) - \mu(a)\big| \le \sqrt{\frac{\log(2/\delta)}{2\,N_t(a)}}$$

3. Mechanism to combine estimates and uncertainty:

$$B_t(a) = \hat\mu_t(a) + \rho\sqrt{\frac{\log(2/\delta_t)}{2\,N_t(a)}}$$

4. Select the best action (according to its combined value): a_t = argmax_a B_t(a)

Issue: N_t(a) is likely to be 0 for most a. We need more sample-efficient estimates.

Page 57

The Regret

$$R_T = \max_a \mathbb{E}\left[\sum_{t=1}^{T} r_t(a)\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_t(a_t)\right] = \mathbb{E}\left[\sum_{t=1}^{T} \big(\mu(a^*) - \mu(a_t)\big)\right]$$

Issue: a* is unlikely to ever be selected if T ≪ A

Page 58

Least-Squares Estimate of θ*

§ Least-squares estimate:

$$\hat\theta_t = \arg\min_{\theta\in\mathbb{R}^d} \frac{1}{t}\sum_{s=1}^{t} \big(r_s - x(a_s)^\top\theta\big)^2 + \lambda\|\theta\|^2$$

§ Closed-form solution:

$$V_t = \sum_{s=1}^{t} x(a_s)\,x(a_s)^\top + \lambda I, \qquad b_t = \sum_{s=1}^{t} x(a_s)\,r_s \quad\Longrightarrow\quad \hat\theta_t = V_t^{-1}\,b_t$$

§ Estimate of the value of action a:

$$\hat\mu_t(a) = x(a)^\top \hat\theta_t$$

Page 59

Measuring Uncertainty

Proposition
Let a₁, …, a_t be any sequence of actions adapted to the filtration ℱ_t. If the noise ε is sub-Gaussian with parameter σ and the features are bounded by ‖x(a)‖₂ ≤ L, then for any a, with probability 1 − δ:

$$\big|\hat\mu_t(a) - \mu(a)\big| \le \beta_t \sqrt{x(a)^\top V_t^{-1} x(a)}$$

where

$$\beta_t = \sigma\sqrt{d\,\log\frac{1 + tL^2/\lambda}{\delta}} + \lambda^{1/2}\,\|\theta^*\|_2$$

§ ‖x(a)‖_{V_t^{-1}} measures the correlation between x(a) and the actions selected so far
§ If {x(a)}_a is an orthogonal basis for ℝᵈ, this reduces to the MAB problem and ‖x(a)‖_{V_t^{-1}} = 1/√(N_t(a))

Page 60

Recipe of LinUCB

1. Computation of estimates:

$$\hat\theta_t = V_t^{-1}\,b_t, \qquad \hat\mu_t(a) = x(a)^\top \hat\theta_t$$

2. Evaluation of uncertainty:

$$\big|\hat\mu_t(a) - \mu(a)\big| \le \beta_t \sqrt{x(a)^\top V_t^{-1} x(a)}$$

3. Mechanism to combine estimates and uncertainty:

$$B_t(a) = \hat\mu_t(a) + \beta_t \sqrt{x(a)^\top V_t^{-1} x(a)}$$

4. Select the best action (according to its combined value): a_t = argmax_a B_t(a)

Page 61

LinUCB: Algorithm

for t = 1, …, T do
  1. Compute the upper-confidence bound

$$B_t(a) = \hat\mu_t(a) + \beta_t\sqrt{x(a)^\top V_t^{-1} x(a)}$$

  2. Take action a_t = argmax_a B_t(a)
  3. Observe reward $r_t \sim x(a_t)^\top\theta^* + \varepsilon_t$
  4. Update statistics:

$$V_{t+1} = V_t + x(a_t)\,x(a_t)^\top, \qquad b_{t+1} = b_t + x(a_t)\,r_t, \qquad \hat\theta_{t+1} = V_{t+1}^{-1}\,b_{t+1}$$

endfor
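A minimal Python sketch of this loop. The fixed feature matrix, the randomly drawn θ*, the Gaussian noise level, and the constant β are illustrative assumptions; in particular, β_t is treated as a fixed constant here rather than the full high-probability expression from the proposition.

```python
# Hedged sketch of LinUCB: ridge estimate theta_hat = V^-1 b plus an
# optimism bonus beta * sqrt(x' V^-1 x).
import numpy as np

def linucb(X, theta_star, T, rng, lam=1.0, beta=1.0, sigma=0.1):
    A, d = X.shape
    V, b = lam * np.eye(d), np.zeros(d)
    total = 0.0
    for t in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b
        # per-action bonus: sqrt(x(a)' V^-1 x(a))
        width = np.sqrt(np.einsum("ad,dk,ak->a", X, V_inv, X))
        a = int(np.argmax(X @ theta_hat + beta * width))
        r = X[a] @ theta_star + sigma * rng.standard_normal()   # noisy linear reward
        V += np.outer(X[a], X[a])
        b += X[a] * r
        total += r
    return total

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5)) / np.sqrt(5)     # 50 actions, d = 5 (illustrative)
theta_star = rng.standard_normal(5) / np.sqrt(5)
print(linucb(X, theta_star, T=2000, rng=rng))
```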

Page 62

LinUCB: Regret

Theorem
Consider a linear MAB problem with actions defined in ℝᵈ and unknown parameter θ* ∈ ℝᵈ. If LinUCB is run with δ_t = 1/t for T steps, then it suffers a regret:

$$R_T = O\big(d\sqrt{T}\log T\big)$$

§ It depends on d but not on the number of actions A
§ If A < ∞, we can improve the bound to

$$R_T = O\big(\sqrt{d\,T\log(TA)}\big)$$

Page 63

A Simple Recommendation System

§ A RS can recommend specific movies
§ Users arrive at random and we have information about them
§ The RS picks a movie to recommend to the user
§ The feedback is whether the user watched the movie or not
§ Objective: Design a RS that maximizes the number of movies watched

Page 64

RS as a Multi-armed Bandit

for t = 1, …, T do
  1. User u_t arrives
  2. Recommend movie a_t
  3. Reward:
     r_t = 1 if the user watches movie a_t, and 0 otherwise
endfor

Issue: too many users to collect enough feedback for each user separately

Page 65

RS as a Contextual Linear Bandit

The model
§ μ(u, a) = E[r(u, a)] is the probability that user u watches movie a
§ Each user u and movie a pair is characterized by some features x(u, a) ∈ ℝᵈ (e.g. name, location, genre, release date, past rating, income, etc.)
§ Assumption:
• The expected value is a linear function μ(u, a) = x(u, a)ᵀθ* (with θ* ∈ ℝᵈ unknown)
• The rewards are noisy observations r_t(u, a) = μ(u, a) + ε_t with E[ε_t] = 0

The objective
§ Maximize the sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$

Page 66

The Regret

$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} \max_a r_t(u_t, a)\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_t(u_t, a_t)\right] = \mathbb{E}\left[\sum_{t=1}^{T} \big(\mu(u_t, a_t^*) - \mu(u_t, a_t)\big)\right]$$

where $a_t^* = \arg\max_a \mu(u_t, a)$.

Page 67

Least-Squares Estimate of θ*

§ Least-squares estimate:

$$\hat\theta_t = \arg\min_{\theta\in\mathbb{R}^d} \frac{1}{t}\sum_{s=1}^{t} \big(r_s - x(u_s, a_s)^\top\theta\big)^2 + \lambda\|\theta\|^2$$

§ Closed-form solution:

$$V_t = \sum_{s=1}^{t} x(u_s, a_s)\,x(u_s, a_s)^\top + \lambda I, \qquad b_t = \sum_{s=1}^{t} x(u_s, a_s)\,r_s \quad\Longrightarrow\quad \hat\theta_t = V_t^{-1}\,b_t$$

§ Estimate of the value of action a for user u:

$$\hat\mu_t(u, a) = x(u, a)^\top \hat\theta_t$$

Page 68

ContextualLinUCB: Algorithm

for t = 1, …, T do
  1. Observe context u_t
  2. Compute the upper-confidence bound

$$B_t(u_t, a) = \hat\mu_t(u_t, a) + \beta_t\sqrt{x(u_t, a)^\top V_t^{-1} x(u_t, a)}$$

  3. Take action a_t = argmax_a B_t(u_t, a)
  4. Observe reward $r_t \sim x(u_t, a_t)^\top\theta^* + \varepsilon_t$
  5. Update statistics:

$$V_{t+1} = V_t + x(u_t, a_t)\,x(u_t, a_t)^\top, \qquad b_{t+1} = b_t + x(u_t, a_t)\,r_t, \qquad \hat\theta_{t+1} = V_{t+1}^{-1}\,b_{t+1}$$

endfor
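A minimal Python sketch of the contextual variant: it is identical to LinUCB except that the feature map is evaluated on the observed context u_t at every round. The feature map feat(u, a) below is a hypothetical stand-in for x(u, a), and all problem parameters are illustrative assumptions.

```python
# Hedged sketch of ContextualLinUCB with a hypothetical feature map.
import numpy as np

d, A, T, lam, beta, sigma = 6, 10, 2000, 1.0, 1.0, 0.1
rng = np.random.default_rng(0)
theta_star = rng.standard_normal(d) / np.sqrt(d)    # unknown parameter (illustrative)

def feat(u, a):
    """Hypothetical x(u, a): a deterministic pseudo-random embedding of (context, action)."""
    r = np.random.default_rng(hash((round(float(u), 3), a)) % 2**32)
    return r.standard_normal(d) / np.sqrt(d)

V, b = lam * np.eye(d), np.zeros(d)
for t in range(T):
    u_t = rng.random()                               # observe context
    Xs = np.stack([feat(u_t, a) for a in range(A)])  # features for every action
    V_inv = np.linalg.inv(V)
    width = np.sqrt(np.einsum("ad,dk,ak->a", Xs, V_inv, Xs))
    a_t = int(np.argmax(Xs @ (V_inv @ b) + beta * width))
    r_t = Xs[a_t] @ theta_star + sigma * rng.standard_normal()
    V += np.outer(Xs[a_t], Xs[a_t])
    b += Xs[a_t] * r_t
```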

Page 69

ContextualLinUCB: Regret

Theorem
Consider a contextual linear MAB problem with contexts and actions defined in ℝᵈ and unknown parameter θ* ∈ ℝᵈ. If ContextualLinUCB is run with δ_t = 1/t for T steps, then for any arbitrary sequence of contexts u₁, u₂, …, u_T, it suffers a regret:

$$R_T = O\big(d\sqrt{T}\log T\big)$$

Page 70

Regret and Finite Sample Analysis

§ How many data points (or experiences) are needed in order to get a good approximation of the optimal policy in a reinforcement learning task with an algorithm 𝒜?
• Sample complexity n(δ, ε): the smallest n such that, with probability at least 1 − δ,
  AvgError(𝒜_n) ≤ ε,
  where AvgError(𝒜_n) is the average error made by the algorithm after n steps.
• Regret:

$$\sum_{t=1}^{n} \mathrm{Error}(\mathcal{A}_t)$$

Adapted from Mohammed Amine Bennouna & Moïse Blanchard (MIT)

Page 71

Fitted Q-Iteration

Offline FQI guarantee [Munos, 2003; Munos and Szepesvári, 2008; Antos et al., 2008; Agarwal et al., 2021]

The k-th iterate of offline Fitted Q-Iteration guarantees that, with probability 1 − δ:

$$\big\|Q^* - Q^{\pi_k}\big\| \le O\left(\frac{1}{(1-\gamma)^2}\sqrt{\frac{C\,\log(|\mathcal{F}|/\delta)}{n}}\right) + \frac{2\gamma^k}{1-\gamma}$$

§ Remark: estimation and optimization error terms
§ Assumptions
• Concentrability C, a measure of data coverage (strong assumption)
• Realizability: Q* ∈ ℱ, where ℱ is the function class for the Q-function
• Bellman completeness: for any f ∈ ℱ, 𝒯f ∈ ℱ
§ Ingredients
• Performance difference lemma
• Bernstein's inequality