
Transcript
Page 1

Introduction to Multi-Armed Bandits
The Exploration-Exploitation Dilemma

Cathy Wu
6.246 Reinforcement Learning: Foundations and Methods

Apr 6, 2021

Page 2

References

1. Alessandro Lazaric. INRIA Lille. Reinforcement Learning. 2017, Lecture 6.

2. Aleksandrs Slivkins. Introduction to Multi-Armed Bandits. 2019. Chapters 1-3, 8.

Page 3

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 4

Recall: Q-Learning

Proposition
If the learning rate satisfies the Robbins-Monro conditions in all states and actions $(s, a) \in \mathcal{S}\times\mathcal{A}$,

$$\sum_{t \ge 0} \alpha_t(s, a) = \infty, \qquad \sum_{t \ge 0} \alpha_t(s, a)^2 < \infty,$$

and all state-action pairs are tried infinitely often, then for all $(s, a) \in \mathcal{S}\times\mathcal{A}$:

$$Q_t(s, a) \xrightarrow{\;a.s.\;} Q^*(s, a)$$

Remark: "infinitely often" requires a steady exploration policy.

Page 5

Learning the Optimal Policy

for k = 1, …, K do
  1. Set t = 0
  2. Set initial state s_0
  3. while (s_t not terminal)
     1) Take action a_t = argmax_a Q(s_t, a)
     2) Observe next state s_{t+1} and reward r_t
     3) Compute the temporal difference
        δ_t = r_t + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t)   (Q-learning)
     4) Update the Q-function
        Q(s_t, a_t) ← Q(s_t, a_t) + α(s_t, a_t) δ_t
     5) Set t = t + 1
  endwhile
endfor

⟹ No convergence: the greedy choice in step 1) never explores, so some state-action pairs may never be tried; the action should instead be taken according to a suitable exploration policy. (A runnable sketch follows below.)
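Below is a minimal runnable sketch of this loop in Python. The toy chain MDP, the ε = 0.1 exploration rate, and the 1/N(s,a) learning-rate schedule are illustrative assumptions, not from the lecture; the schedule satisfies the Robbins-Monro conditions above.

```python
# Hedged sketch: tabular Q-learning with an epsilon-greedy exploration policy
# on a hypothetical 5-state chain MDP (illustrative, not from the lecture).
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)

def step(s, a):
    """Chain MDP: action 1 moves right (reward 1 at the last state), action 0 resets."""
    if a == 1:
        s2 = s + 1
        return s2, float(s2 == n_states), s2 == n_states
    return 0, 0.0, False

Q = np.zeros((n_states + 1, n_actions))
counts = np.zeros_like(Q)

for episode in range(500):
    s, done, t = 0, False, 0
    while not done and t < 50:
        # Exploration policy: epsilon-greedy. Pure greedy selection may never
        # try some state-action pairs, which is why the slide flags "No Convergence".
        a = int(rng.integers(n_actions)) if rng.random() < 0.1 else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        counts[s, a] += 1
        alpha = 1.0 / counts[s, a]    # Robbins-Monro: sum alpha = inf, sum alpha^2 < inf
        delta = r + gamma * np.max(Q[s2]) - Q[s, a]    # temporal difference
        Q[s, a] += alpha * delta
        s, t = s2, t + 1
```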

Page 6

Learning the Optimal Policy

for k = 1, …, K do
  1. Set t = 0
  2. Set initial state s_0
  3. while (s_t not terminal)
     1) Take action a_t ~ π(s_t) (a fixed exploration policy, e.g. uniform over actions)
     2) Observe next state s_{t+1} and reward r_t
     3) Compute the temporal difference
        δ_t = r_t + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t)   (Q-learning)
     4) Update the Q-function
        Q(s_t, a_t) ← Q(s_t, a_t) + α(s_t, a_t) δ_t
     5) Set t = t + 1
  endwhile
endfor

⟹ Bad convergence: all state-action pairs are tried infinitely often, but the agent keeps collecting poor reward while it learns.

Page 7

From RL to Multi-armed Bandit

for k = 1, …, K do
  1. Set t = 0
  2. Set initial state s_0
  3. while (s_t not terminal)
     1) Take action a_t
     2) Observe next state s_{t+1} and reward r_t
  endwhile
endfor

Page 8

From RL to Multi-armed Bandit

The problem
§ Set of A actions
§ Reward distribution ν(a) with μ(a) = E[r(a)] (bounded in [0,1] for convenience)

The protocol
for t = 1, …, T do
  1. Take action a_t
  2. Observe reward r_t ~ ν(a_t)
endfor

The objective
§ Maximize the sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$
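A minimal sketch of this protocol in Python; the three Bernoulli means and the uniform placeholder policy are illustrative assumptions.

```python
# Hedged sketch of the bandit protocol: the environment reveals only the
# reward of the selected action. The means are illustrative.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])    # mu(a) = E[r(a)], unknown to the learner
T = 1000

total_reward = 0.0
for t in range(T):
    a_t = int(rng.integers(len(mu)))        # placeholder policy: uniform
    r_t = float(rng.random() < mu[a_t])     # r_t ~ nu(a_t), bounded in [0, 1]
    total_reward += r_t
# Objective: maximize E[sum_t r_t]; the best fixed action would give T * mu.max().
```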

Page 9

A Simple Recommendation System

§ A RS can recommend different genres of movies (e.g. action, adventure, romance, animation)

§ Users arrive at random and no information about the user is available

§ The RS picks a genre to recommend to the user but not the specific movies

§ The feedback is whether the user watched a movie of the recommended genre or not

§ Objective: Design a RS that maximizes the number of movies watched in the recommended genre


Page 10

RS as a Multi-armed Bandit

for t = 1, …, T do
  1. User arrives
  2. Recommend genre a_t
  3. Reward:
     r_t = 1 if the user watches a movie of genre a_t, and 0 otherwise
endfor

Page 11

RS as a Multi-armed Bandit

The model
§ ν(a) is a Bernoulli distribution
§ μ(a) = E[r(a)] is the probability that a random user watches a movie of genre a
§ Assumption: r_t ~ ν(a_t) is a realization of the Bernoulli of genre a_t

The objective
§ Maximize the sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$

Page 12

Other Examples

§ Packet routing
§ Clinical trials
§ Web advertising
§ Health advice
§ Education
§ Computer games
§ Resource mining
§ …

Page 13

The Regret

$$R_T = \max_a \mathbb{E}\left[\sum_{t=1}^{T} r_t(a)\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_t(a_t)\right]$$

The expectation summarizes every possible source of randomness (either in ν or in the algorithm).

Relation to RL: one can think of this as T trajectories (each of length 1).

The regret measures not only the final error, but all the mistakes made over T "iterations."

Page 14

The Regret

§ Number of times action a has been selected after T rounds:

$$N_T(a) = \sum_{t=1}^{T} \mathbb{1}\{a_t = a\}$$

§ Gap: Δ(a) = μ(a*) − μ(a)

§ Regret:

$$\begin{aligned}
R_T &= \max_a \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a)\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(a_t)\Big] \\
&= \max_a T\,\mu(a) - \mathbb{E}\Big[\sum_{t=1}^{T} \mu(a_t)\Big] \\
&= \max_a T\,\mu(a) - \sum_a \mathbb{E}[N_T(a)]\,\mu(a) \\
&= T\,\mu(a^*) - \sum_a \mathbb{E}[N_T(a)]\,\mu(a) \\
&= \sum_{a \neq a^*} \mathbb{E}[N_T(a)]\,\big(\mu(a^*) - \mu(a)\big) \\
&= \sum_{a \neq a^*} \mathbb{E}[N_T(a)]\,\Delta(a)
\end{aligned}$$

Page 15

The Regret

$$R_T = \sum_{a \neq a^*} \mathbb{E}[N_T(a)]\,\Delta(a)$$

Ø We only need to study the expected number of times suboptimal actions are selected
Ø Worst case possible: R_T = O(T)
Ø A good algorithm has R_T = o(T), i.e., R_T/T → 0

Page 16

The Exploration-Exploitation Dilemma

Problem 1: The environment does not reveal the reward of the actions not selected by the learner
Ø The learner should gain information by repeatedly selecting all actions ⟹ exploration

Problem 2: Whenever the learner selects a bad action, it suffers some regret
Ø The learner should reduce the regret by repeatedly selecting the best action ⟹ exploitation

Challenge: The learner should solve two opposite problems ⟹ the exploration-exploitation dilemma!

Page 17

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 18

Explore-then-Commit

[Figure: a horizon of length T split into an explore phase (the first K steps) and an exploit phase (the remaining steps).]

Co-designed w/ Pulkit Agrawal

Page 19

Explore-then-Commit: Algorithm

Explore phase
for t = 1, …, K do
  1. Take action a_t ~ U(A) (or round robin)
  2. Observe reward r_t ~ ν(a_t)
endfor

Compute statistics for each action a:

$$\hat\mu_K(a) = \frac{1}{N_K(a)} \sum_{t=1}^{K} r_t\,\mathbb{1}\{a_t = a\}$$

Exploit phase
for t = K + 1, …, T do
  1. Take action $\hat a^* = \arg\max_a \hat\mu_K(a)$
  2. Observe reward $r_t \sim \nu(\hat a^*)$
endfor
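The following is a minimal Python sketch of this procedure for Bernoulli arms; the arm means (0.3, 0.5, 0.7) and the exploration budget K = 90 are illustrative assumptions, not values from the lecture.

```python
# Hedged sketch of Explore-then-Commit with a round-robin explore phase.
import numpy as np

def explore_then_commit(mu, K, T, rng):
    A = len(mu)
    counts, sums = np.zeros(A), np.zeros(A)
    total = 0.0
    for t in range(K):                       # explore phase (round robin)
        a = t % A
        r = float(rng.random() < mu[a])
        counts[a] += 1
        sums[a] += r
        total += r
    a_hat = int(np.argmax(sums / counts))    # commit to the empirical best arm
    for t in range(K, T):                    # exploit phase
        total += float(rng.random() < mu[a_hat])
    return total

rng = np.random.default_rng(0)
print(explore_then_commit(np.array([0.3, 0.5, 0.7]), K=90, T=1000, rng=rng))
```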

Page 20

Explore-then-Commit: Regret

Theorem
If explore-then-commit is run with parameter K for T steps, then it suffers a regret:

$$R_T \le \sum_{a\neq a^*} \left[ \frac{K}{A}\,\Delta(a) + 2\,(T - K - 1)\exp\!\left(-\frac{K\,\Delta(a)^2}{2}\right) \right]$$

§ Difficult to tune: K should be adjusted depending on T and Δ(a)
§ Worst case w.r.t. Δ(a): $R_T = O(T^{2/3})$ (for $K = T^{2/3}$)
§ Recall: worst possible: $R_T = O(T)$

Page 21

Explore-then-Commit: Regret Analysis

§ Regret decomposition:

$$R_T = \sum_{t=1}^{K} \mathbb{E}\big[\mu(a^*) - \mu(a_t)\big] + \sum_{t=K+1}^{T} \mathbb{E}\big[\mu(a^*) - \mu(\hat a^*)\big]$$

§ During the explore phase:

$$\sum_{t=1}^{K} \mathbb{E}\big[\mu(a^*) - \mu(a_t)\big] = \frac{K}{A} \sum_{a \neq a^*} \Delta(a)$$

§ During the exploit phase:

$$\begin{aligned}
\sum_{t=K+1}^{T} \mathbb{E}\big[\mu(a^*) - \mu(\hat a^*)\big]
&= (T - K - 1) \sum_{a \neq a^*} \mathbb{P}\big(\hat a^* = a\big)\,\Delta(a) \\
&= (T - K - 1) \sum_{a \neq a^*} \mathbb{P}\big(\forall a'\!: \hat\mu_K(a) \ge \hat\mu_K(a')\big)\,\Delta(a) \\
&\le (T - K - 1) \sum_{a \neq a^*} \mathbb{P}\big(\hat\mu_K(a) \ge \hat\mu_K(a^*)\big)\,\Delta(a)
\end{aligned}$$

Recall:

$$\hat\mu_K(a) = \frac{1}{N_K(a)} \sum_{t=1}^{K} r_t\,\mathbb{1}\{a_t = a\}, \qquad \hat a^* = \arg\max_a \hat\mu_K(a)$$

Page 22

Explore-then-Commit: Regret Analysis

Proposition (Hoeffding Inequality)
Let X_i ∈ [a, b] be independent r.v. with common mean μ = E[X_i]. Then, for all ε > 0:

$$\mathbb{P}\big(|\hat X_n - \mu| > \varepsilon\big) \le 2\exp\!\left(-\frac{2n\varepsilon^2}{(b-a)^2}\right)$$

where $\hat X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.
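A quick Monte Carlo illustration of the bound; the sample size n = 100, tolerance ε = 0.1, and Bernoulli(0.5) distribution are illustrative assumptions. The empirical deviation frequency should sit below 2 exp(−2nε²).

```python
# Hedged numerical check of Hoeffding's inequality for [0, 1]-valued samples.
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 20000
x = rng.random((trials, n)) < 0.5                # Bernoulli(0.5) samples in [0, 1]
dev = np.abs(x.mean(axis=1) - 0.5) > eps         # event |X_bar_n - mu| > eps
print("empirical:", dev.mean(), "  bound:", 2 * np.exp(-2 * n * eps**2))
```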

Page 23

Explore-then-Commit: Regret Analysis

§ Probability of error:

$$\begin{aligned}
\mathbb{P}\big(\hat\mu_K(a) \ge \hat\mu_K(a^*)\big)
&= \mathbb{P}\big(\hat\mu_K(a) - \mu(a) \ge \hat\mu_K(a^*) - \mu(a^*) + \Delta(a)\big) \\
&\le \mathbb{P}\Big(\hat\mu_K(a) - \mu(a) \ge \tfrac{\Delta(a)}{2}\Big) + \mathbb{P}\Big(\mu(a^*) - \hat\mu_K(a^*) \ge \tfrac{\Delta(a)}{2}\Big)
\end{aligned}$$

§ Applying the Hoeffding bound for random variables $r_t \in [0, 1]$:

$$\mathbb{P}\big(\hat\mu_K(a) \ge \hat\mu_K(a^*)\big) \le 2\exp\!\left(-\frac{K\,\Delta(a)^2}{2}\right)$$

Page 24

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 25

ε-greedy: Algorithm

for t = 1, …, T do
  1. Take action

$$a_t = \begin{cases} \sim U(\mathcal{A}) & \text{with probability } \varepsilon_t \text{ (explore)} \\ \arg\max_a \hat\mu_t(a) & \text{with probability } 1 - \varepsilon_t \text{ (exploit)} \end{cases}$$

  2. Observe reward r_t ~ ν(a_t)
  3. Update statistics for action a_t:

$$N_t(a_t) = N_{t-1}(a_t) + 1, \qquad \hat\mu_t(a_t) = \frac{1}{N_t(a_t)} \sum_{s=1}^{t} r_s\,\mathbb{1}\{a_s = a_t\}$$

endfor
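A minimal Python sketch of this loop, using the decaying schedule ε_t = t^(−1/3)(A log t)^(1/3) from the theorem on the next page; the arm means, the log(t+1) smoothing (to avoid log 1 = 0), and the forced initial pull of each arm are illustrative practical tweaks.

```python
# Hedged sketch of epsilon-greedy with a decaying exploration rate.
import numpy as np

def eps_greedy(mu, T, rng):
    A = len(mu)
    counts, means = np.zeros(A), np.zeros(A)
    total = 0.0
    for t in range(1, T + 1):
        eps = min(1.0, t ** (-1 / 3) * (A * np.log(t + 1)) ** (1 / 3))
        if rng.random() < eps or counts.min() == 0:    # explore
            a = int(rng.integers(A))
        else:                                          # exploit
            a = int(np.argmax(means))
        r = float(rng.random() < mu[a])
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]         # incremental mean update
        total += r
    return total

rng = np.random.default_rng(0)
print(eps_greedy(np.array([0.3, 0.5, 0.7]), T=10000, rng=rng))
```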

Page 26

ε-greedy: Regret

Theorem
If ε-greedy is run with parameter $\varepsilon_t = t^{-1/3}\,(A \log t)^{1/3}$, then for each round t it suffers a regret:

$$R_t \le \tilde O\big(t^{2/3}\big)$$

§ Same asymptotic regret as explore-then-commit, but now the bound holds at all rounds t
§ Can do better, but the optimal ε depends on knowledge of Δ (difficult to tune)
§ Keeps selecting very bad arms with some probability
§ Sharply separates exploration and exploitation

Page 27

Types of Exploration Strategies

Non-adaptive exploration vs. adaptive exploration:

§ Explore-then-commit: explore and exploit in separate phases (non-adaptive)
§ ε-greedy: exploit + explore, with exploration agnostic to what has been learned so far (non-adaptive)
§ Adaptive exploration: exploit + explore, with exploration based on the exploitation estimates learned so far

Page 28

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 29

Soft-max (aka Exp3): Algorithm

for t = 1, …, T do
  1. Take action

$$a_t \sim \frac{\exp\big(\hat\mu_t(a)/\tau\big)}{\sum_{a'} \exp\big(\hat\mu_t(a')/\tau\big)}$$

  2. Observe reward r_t ~ ν(a_t)
  3. Update statistics for action a_t:

$$N_t(a_t) = N_{t-1}(a_t) + 1, \qquad \hat\mu_t(a_t) = \frac{1}{N_t(a_t)} \sum_{s=1}^{t} r_s\,\mathbb{1}\{a_s = a_t\}$$

endfor

§ Gives more probability to better actions (arms)
§ Temperature τ: large for exploration, small for exploitation
§ Difficult to tune
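A minimal Python sketch of this sampling rule; the arm means and the temperature τ = 0.1 are illustrative assumptions.

```python
# Hedged sketch of softmax action selection over empirical means.
import numpy as np

def softmax_bandit(mu, T, tau, rng):
    A = len(mu)
    counts, means = np.zeros(A), np.zeros(A)
    total = 0.0
    for t in range(T):
        logits = means / tau
        p = np.exp(logits - logits.max())    # subtract max for numerical stability
        p /= p.sum()
        a = int(rng.choice(A, p=p))          # sample proportionally to exp(mu_hat / tau)
        r = float(rng.random() < mu[a])
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        total += r
    return total

rng = np.random.default_rng(0)
print(softmax_bandit(np.array([0.3, 0.5, 0.7]), T=10000, tau=0.1, rng=rng))
```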

Page 30

Example of Regret Performance

Page 31

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 32

Problem-Independent Lower Bound

Theorem
Consider the family of multi-armed bandit problems with A Bernoulli arms. For any algorithm and any fixed T, there exists a Bernoulli MAB problem instance such that:

$$R_T = \Omega\big(\sqrt{AT}\big)$$

§ At any finite time T, the regret may be as large as Ω(√T)

Page 33

Problem-Dependent Lower Bound

Theorem
Consider the family of multi-armed bandit problems with A Bernoulli arms, and an algorithm that satisfies $\mathbb{E}[N_T(a)] = o(T^\alpha)$ for any α > 0, any action a, and any Bernoulli MAB problem. Then for any Bernoulli MAB problem with gaps Δ(a) > 0, ∀a ≠ a*, the algorithm suffers regret:

$$\liminf_{T\to\infty} \frac{R_T}{\log T} \ge \sum_{a\neq a^*} \frac{\Delta(a)}{\mathrm{KL}\big(\nu(a), \nu(a^*)\big)}$$

where

$$\mathrm{KL}(p, q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$$

§ No algorithm can achieve a regret smaller than Ω(log T) (asymptotically)
§ The ratio Δ(a)/KL(ν(a), ν(a*)) measures the difficulty of the problem
§ Algorithms such as ε-greedy with the right tuning are optimal!
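As a worked example of the ratio above, the Bernoulli KL divergence and the resulting lower-bound constant can be computed directly; the instance μ = (0.3, 0.5, 0.7) is an illustrative assumption.

```python
# Hedged worked example: the asymptotic lower-bound constant for one instance.
import numpy as np

def kl_bernoulli(p, q):
    """KL(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

mu = np.array([0.3, 0.5, 0.7])
mu_star = mu.max()
const = sum((mu_star - m) / kl_bernoulli(m, mu_star) for m in mu if m < mu_star)
print(const)   # regret grows at least ~ const * log(T), asymptotically
```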

Page 34

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 35

Recipe for Effective Exploration-Exploitation

1. Computation of estimates
2. Evaluation of uncertainty
3. Mechanism to combine estimates and uncertainty
4. Select the best action (according to its combined value)

Page 36

Optimism in the Face of Uncertainty

“Whenever the value of an action is uncertain, consider its largest plausible value, and then select the best action.”

Missing ingredient: uncertainty of our estimates.

Page 37

Measuring Uncertainty

Proposition (Chernoff-Hoeffding Inequality)
Let X_i ∈ [a, b] be n independent r.v. with mean μ = E[X_i]. Then:

$$\mathbb{P}\left(\bigg|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\bigg| > (b-a)\sqrt{\frac{\log(2/\delta)}{2n}}\right) \le \delta$$

Page 38

Recipe of UCB

1. Computation of estimates:

$$\hat\mu_t(a) = \frac{1}{N_t(a)} \sum_{s=1}^{t} r_s\,\mathbb{1}\{a_s = a\}$$

2. Evaluation of uncertainty: with probability 1 − δ,

$$\big|\hat\mu_t(a) - \mu(a)\big| \le \sqrt{\frac{\log(2/\delta)}{2\,N_t(a)}}$$

3. Mechanism to combine estimates and uncertainty:

$$B_t(a) = \hat\mu_t(a) + \rho\sqrt{\frac{\log(2/\delta_t)}{2\,N_t(a)}}$$

4. Select the best action (according to its combined value):

$$a_t = \arg\max_a B_t(a)$$

Page 39

Upper Confidence Bound (UCB) Algorithm

$$a_{t+1} = \arg\max_a \Bigg[ \underbrace{\hat\mu_t(a)}_{\substack{\text{exploitation} \\ \text{(mean reward of action } a\text{)}}} + \underbrace{\rho\sqrt{\frac{\log(2/\delta_t)}{2\,N_t(a)}}}_{\substack{\text{exploration bonus for rare actions} \\ \text{(optimism)}}} \Bigg]$$

[Figure: initial confidence intervals for actions a1, a2, a3, shrinking as more samples are collected over time T.]

Co-designed w/ Pulkit Agrawal

Page 40

UCB: Algorithm

for t = 1, …, T do
  1. Compute the upper-confidence bound

$$B_t(a) = \hat\mu_t(a) + \rho\sqrt{\frac{\log(2/\delta_t)}{2\,N_t(a)}}$$

  2. Take action a_t = argmax_a B_t(a)
  3. Observe reward r_t ~ ν(a_t)
  4. Update statistics for action a_t:

$$N_t(a_t) = N_{t-1}(a_t) + 1, \qquad \hat\mu_t(a_t) = \frac{1}{N_t(a_t)} \sum_{s=1}^{t} r_s\,\mathbb{1}\{a_s = a_t\}$$

endfor
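A minimal Python sketch of this loop with δ_t = 1/t, so the bonus is ρ√(log(2t)/(2N_t(a))); the arm means and the initial one-pull-per-arm pass (to make all counts positive) are illustrative assumptions.

```python
# Hedged sketch of UCB for Bernoulli arms.
import numpy as np

def ucb(mu, T, rho, rng):
    A = len(mu)
    counts, means = np.zeros(A), np.zeros(A)
    total = 0.0
    for t in range(1, T + 1):
        if t <= A:
            a = t - 1                        # pull each arm once to initialize counts
        else:
            bonus = rho * np.sqrt(np.log(2 * t) / (2 * counts))   # delta_t = 1/t
            a = int(np.argmax(means + bonus))
        r = float(rng.random() < mu[a])
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        total += r
    return total

rng = np.random.default_rng(0)
print(ucb(np.array([0.3, 0.5, 0.7]), T=10000, rho=1.0, rng=rng))
```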

Page 41

UCB: Regret

Theorem
Consider a MAB problem with A Bernoulli arms with gaps Δ(a). If UCB is run with ρ = 1 and δ_t = 1/t for T steps, then it suffers regret:

$$R_T = O\left(\sum_{a\neq a^*} \frac{\log T}{\Delta(a)}\right)$$

Consider a 2-action MAB problem; then for any fixed T, in the worst case (w.r.t. Δ) UCB suffers a regret:

$$R_T = O\big(\sqrt{T\log T}\big)$$

§ It (almost) matches the lower bounds
§ It does not require any prior knowledge about the MAB, apart from the range of the r.v.
§ The big-O hides a few numerical constants and T-independent additive terms

Page 42

UCB: Proof Sketch

§ Disclaimer: this is a slightly suboptimal proof, but it provides an easy path.

§ Define the (high-probability) event [statistics]:

$$\mathcal{E} = \left\{ \forall a, t: \big|\hat\mu_t(a) - \mu(a)\big| \le \sqrt{\frac{\log(2/\delta)}{2\,N_t(a)}} \right\}$$

§ By Chernoff-Hoeffding and a union bound: $\mathbb{P}(\mathcal{E}) \ge 1 - TA\delta$

§ If at time t we select action a, then [algorithm] $B_t(a) \ge B_t(a^*)$:

$$\hat\mu_t(a) + \sqrt{\frac{\log(2/\delta)}{2\,N_t(a)}} \ge \hat\mu_t(a^*) + \sqrt{\frac{\log(2/\delta)}{2\,N_t(a^*)}}$$

§ On the event ℰ, we have [math]:

$$\mu(a) + 2\sqrt{\frac{\log(2/\delta)}{2\,N_t(a)}} \ge \mu(a^*)$$

Page 43

UCB: Proof Sketch

§ Assume t is the last time a is selected, so that $N_T(a) = N_{t-1}(a) + 1$ (for T ≥ t); thus:

$$\mu(a) + 2\sqrt{\frac{\log(2/\delta)}{2\,N_T(a)}} \ge \mu(a^*)$$

§ Reordering [math]:

$$N_T(a) \le \frac{2\log(2/\delta)}{\Delta(a)^2}$$

under the event ℰ, and thus with probability at least $1 - TA\delta$.

§ Moving to the expectation [statistics]:

$$\mathbb{E}[N_T(a)] = \mathbb{E}[N_T(a)\,|\,\mathcal{E}]\,\mathbb{P}(\mathcal{E}) + \mathbb{E}[N_T(a)\,|\,\mathcal{E}^c]\,\mathbb{P}(\mathcal{E}^c) \le \frac{2\log(2/\delta)}{\Delta(a)^2} + T^2 A\,\delta$$

§ Trading off the two terms with $\delta = 1/T^2$, we obtain:

$$\mathbb{E}[N_T(a)] \le \frac{4\log T}{\Delta(a)^2} + A$$

Page 44

Tuning the ρ Parameter

Theory
§ ρ < 1: polynomial regret w.r.t. T
§ ρ ≥ 1: logarithmic regret w.r.t. T

Practice: ρ = 0.2 is often the best choice

Recall:

$$a_{t+1} = \arg\max_a \left[\hat\mu_t(a) + \rho\sqrt{\frac{\log(2/\delta_t)}{2\,N_t(a)}}\right]$$

Page 45

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate confidence intervals (c.i.)

Algorithm:
§ Compute the score of each arm a:

$$B_t(a) = \hat\mu_t(a) + \sqrt{\frac{2\,\hat\sigma_t^2(a)\log(1/\delta)}{N_t(a)}} + \frac{8\log(1/\delta)}{3\,N_t(a)}$$

§ Select action a_t = argmax_a B_t(a)
§ Update the statistics N_t(a_t), μ̂_t(a_t), and the empirical variance σ̂_t²(a_t)

Regret:

$$R_T \le O\left(\sum_{a\neq a^*}\left(\frac{\sigma^2(a)}{\Delta(a)} + 1\right)\log T\right)$$

Page 46

Improvements: KL-UCB

Idea: use even tighter c.i. based on the Kullback-Leibler divergence

$$\mathrm{KL}(p, q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$$

Algorithm: compute the score of each arm a (a one-dimensional convex optimization):

$$B_t(a) = \max\Big\{ q \in [0,1] : N_t(a)\,\mathrm{KL}\big(\hat\mu_t(a), q\big) \le \log t + c\log\log t \Big\}$$

Regret: the pulls to suboptimal arms satisfy, for any ε > 0,

$$\mathbb{E}[N_T(a)] \le (1+\varepsilon)\,\frac{\log T}{\mathrm{KL}\big(\nu(a), \nu(a^*)\big)} + C_1\log\log T + \frac{C_2(\varepsilon)}{T^{b(\varepsilon)}}$$

where KL(ν(a), ν(a*)) ≥ 2Δ(a)², so this is never worse than the Hoeffding-based confidence interval.
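A sketch of how the KL-UCB score can be computed by bisection, exploiting that KL(μ̂, q) is increasing in q for q ≥ μ̂; the constant c = 3 and the example pull counts are illustrative assumptions.

```python
# Hedged sketch: the KL-UCB index for one Bernoulli arm via bisection.
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n_pulls, t, c=3.0, iters=50):
    budget = (np.log(t) + c * np.log(max(np.log(t), 1.0))) / n_pulls
    lo, hi = mu_hat, 1.0
    for _ in range(iters):                  # bisection: KL(mu_hat, .) increases on [mu_hat, 1]
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(mu_hat=0.4, n_pulls=25, t=1000))
```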

Page 47

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 48

Measuring Uncertainty

Source: Steve Roberts

§ Assume that the rewards r_t(a) are distributed as Bernoulli for all actions a, with parameter μ(a)
§ Define a prior μ(a) ~ Beta(α₀, β₀)
§ After t rewards, compute the posterior for action a as Beta(α_t(a), β_t(a)) with:

$$\alpha_t(a) = \alpha_0 + \sum_{s=1}^{t} \mathbb{1}\{a_s = a \wedge r_s = 1\}, \qquad \beta_t(a) = \beta_0 + \sum_{s=1}^{t} \mathbb{1}\{a_s = a \wedge r_s = 0\}$$

(so α counts observed successes and β observed failures, making the posterior mean α_t(a)/(α_t(a) + β_t(a)))

Page 49

Recipe of Thompson Sampling*

1. Computation of estimates:

$$\hat\mu_t(a) = \frac{\alpha_t(a)}{\alpha_t(a) + \beta_t(a)}$$

2. Evaluation of uncertainty: the posterior Beta(α_t(a), β_t(a))

3. Mechanism to combine estimates and uncertainty:

$$\theta_t(a) \sim \mathrm{Beta}\big(\alpha_t(a), \beta_t(a)\big)$$

4. Select the best action (according to its combined value):

$$a_t = \arg\max_a \theta_t(a)$$

*aka Posterior Sampling

Page 50

Thompson Sampling: Algorithm

for t = 1, …, T do
  1. Sample one parameter per action from its posterior

$$\theta_t(a) \sim \mathrm{Beta}\big(\alpha_t(a), \beta_t(a)\big)$$

  2. Take action a_t = argmax_a θ_t(a)
  3. Observe reward r_t ~ ν(a_t)
  4. Update statistics for action a_t:

$$\alpha_t(a_t) = \alpha_{t-1}(a_t) + \mathbb{1}\{r_t = 1\}, \qquad \beta_t(a_t) = \beta_{t-1}(a_t) + \mathbb{1}\{r_t = 0\}$$

endfor
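A minimal Python sketch of this loop with uniform Beta(1, 1) priors; the arm means are illustrative assumptions.

```python
# Hedged sketch of Thompson Sampling for Bernoulli arms.
import numpy as np

def thompson(mu, T, rng):
    A = len(mu)
    alpha, beta = np.ones(A), np.ones(A)    # Beta(1, 1) = uniform prior per arm
    total = 0.0
    for t in range(T):
        theta = rng.beta(alpha, beta)       # one posterior sample per arm
        a = int(np.argmax(theta))           # act greedily on the samples
        r = float(rng.random() < mu[a])
        alpha[a] += r                       # success count
        beta[a] += 1 - r                    # failure count
        total += r
    return total

rng = np.random.default_rng(0)
print(thompson(np.array([0.3, 0.5, 0.7]), T=10000, rng=rng))
```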

Page 51

Thompson Sampling: Regret

Theorem
Consider a MAB problem with A Bernoulli arms with gaps Δ(a). If Thompson Sampling is run for T steps, then it suffers regret:

$$R_T = O\left((1+\varepsilon)\sum_{a\neq a^*} \frac{\Delta(a)\,\log T}{\mathrm{KL}\big(\nu(a), \nu(a^*)\big)} + \frac{C(\nu)}{\varepsilon^2}\right) \qquad \forall \varepsilon > 0$$

§ It matches the lower bound
§ It requires defining a prior on the actions

Page 52

Outline

1. Basic Exploration Strategies
• Explore-then-Commit
• ε-greedy
• Softmax

2. Advanced Strategies
• Lower bounds
• UCB
• Thompson Sampling

3. Linear and Contextual Linear Bandit

Page 53

A Simple Recommendation System

§ A RS can recommend specific movies
§ Users arrive at random and no information about the user is available
§ The RS picks a movie to recommend to the user
§ The feedback is whether the user watched the movie or not
§ Objective: Design a RS that maximizes the number of movies watched

Page 54

RS as a Multi-armed Bandit

for t = 1, …, T do
  1. User arrives
  2. Recommend movie a_t
  3. Reward:
     r_t = 1 if the user watches movie a_t, and 0 otherwise
endfor

Issue: too many movies are available to collect enough feedback for each movie separately

Page 55

RS as a Linear Bandit

The model
§ μ(a) = E[r(a)] is the probability that a random user watches movie a
§ Each movie a is characterized by some features x(a) ∈ ℝᵈ (e.g. genre, release date, past rating, income, etc.)
§ Assumption:
• The expected value is a linear function μ(a) = x(a)ᵀθ* (with θ* ∈ ℝᵈ unknown)
• The rewards are noisy observations r_t(a) = μ(a) + ε_t with E[ε_t] = 0

The objective
§ Maximize the sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$

Page 56

Recipe of UCB

1. Computation of estimates:

$$\hat\mu_t(a) = \frac{1}{N_t(a)} \sum_{s=1}^{t} r_s\,\mathbb{1}\{a_s = a\}$$

2. Evaluation of uncertainty:

$$\big|\hat\mu_t(a) - \mu(a)\big| \le \sqrt{\frac{\log(2/\delta)}{2\,N_t(a)}}$$

3. Mechanism to combine estimates and uncertainty:

$$B_t(a) = \hat\mu_t(a) + \rho\sqrt{\frac{\log(2/\delta_t)}{2\,N_t(a)}}$$

4. Select the best action (according to its combined value): a_t = argmax_a B_t(a)

Issue: N_t(a) is likely to be 0 for most a. We need more sample-efficient estimates.

Page 57

The Regret

$$R_T = \max_a \mathbb{E}\left[\sum_{t=1}^{T} r_t(a)\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_t(a_t)\right] = \mathbb{E}\left[\sum_{t=1}^{T} \big(\mu(a^*) - \mu(a_t)\big)\right]$$

Issue: a* is unlikely to ever be selected if T ≪ A

Page 58

Least-Squares Estimate of θ*

§ Least-squares estimate:

$$\hat\theta_t = \arg\min_{\theta\in\mathbb{R}^d} \frac{1}{t}\sum_{s=1}^{t} \big(r_s - x(a_s)^\top\theta\big)^2 + \lambda\|\theta\|^2$$

§ Closed-form solution:

$$V_t = \sum_{s=1}^{t} x(a_s)\,x(a_s)^\top + \lambda I, \qquad b_t = \sum_{s=1}^{t} x(a_s)\,r_s \quad\Longrightarrow\quad \hat\theta_t = V_t^{-1}\,b_t$$

§ Estimate of the value of action a:

$$\hat\mu_t(a) = x(a)^\top \hat\theta_t$$

Page 59

Measuring Uncertainty

Proposition
Let a₁, …, a_t be any sequence of actions adapted to the filtration ℱ_t. If the noise ε is sub-Gaussian with parameter σ and the features are bounded by ‖x(a)‖₂ ≤ L, then for any a, with probability 1 − δ:

$$\big|\hat\mu_t(a) - \mu(a)\big| \le \beta_t \sqrt{x(a)^\top V_t^{-1} x(a)}$$

where

$$\beta_t = \sigma\sqrt{d\,\log\frac{1 + tL^2/\lambda}{\delta}} + \lambda^{1/2}\,\|\theta^*\|_2$$

§ ‖x(a)‖_{V_t^{-1}} measures the correlation between x(a) and the actions selected so far
§ If {x(a)}_a is an orthogonal basis for ℝᵈ, this reduces to the MAB problem and ‖x(a)‖_{V_t^{-1}} = 1/√(N_t(a))

Page 60

Recipe of LinUCB

1. Computation of estimates:

$$\hat\theta_t = V_t^{-1}\,b_t, \qquad \hat\mu_t(a) = x(a)^\top \hat\theta_t$$

2. Evaluation of uncertainty:

$$\big|\hat\mu_t(a) - \mu(a)\big| \le \beta_t \sqrt{x(a)^\top V_t^{-1} x(a)}$$

3. Mechanism to combine estimates and uncertainty:

$$B_t(a) = \hat\mu_t(a) + \beta_t \sqrt{x(a)^\top V_t^{-1} x(a)}$$

4. Select the best action (according to its combined value): a_t = argmax_a B_t(a)

Page 61

LinUCB: Algorithm

for t = 1, …, T do
  1. Compute the upper-confidence bound

$$B_t(a) = \hat\mu_t(a) + \beta_t\sqrt{x(a)^\top V_t^{-1} x(a)}$$

  2. Take action a_t = argmax_a B_t(a)
  3. Observe reward $r_t \sim x(a_t)^\top\theta^* + \varepsilon_t$
  4. Update statistics:

$$V_{t+1} = V_t + x(a_t)\,x(a_t)^\top, \qquad b_{t+1} = b_t + x(a_t)\,r_t, \qquad \hat\theta_{t+1} = V_{t+1}^{-1}\,b_{t+1}$$

endfor
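A minimal Python sketch of this loop. The fixed feature matrix, the randomly drawn θ*, the Gaussian noise level, and the constant β are illustrative assumptions; in particular, β_t is treated as a fixed constant here rather than the full high-probability expression from the proposition.

```python
# Hedged sketch of LinUCB: ridge estimate theta_hat = V^-1 b plus an
# optimism bonus beta * sqrt(x' V^-1 x).
import numpy as np

def linucb(X, theta_star, T, rng, lam=1.0, beta=1.0, sigma=0.1):
    A, d = X.shape
    V, b = lam * np.eye(d), np.zeros(d)
    total = 0.0
    for t in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b
        # per-action bonus: sqrt(x(a)' V^-1 x(a))
        width = np.sqrt(np.einsum("ad,dk,ak->a", X, V_inv, X))
        a = int(np.argmax(X @ theta_hat + beta * width))
        r = X[a] @ theta_star + sigma * rng.standard_normal()   # noisy linear reward
        V += np.outer(X[a], X[a])
        b += X[a] * r
        total += r
    return total

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5)) / np.sqrt(5)     # 50 actions, d = 5 (illustrative)
theta_star = rng.standard_normal(5) / np.sqrt(5)
print(linucb(X, theta_star, T=2000, rng=rng))
```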

Page 62

LinUCB: Regret

Theorem
Consider a linear MAB problem with actions defined in ℝᵈ and unknown parameter θ* ∈ ℝᵈ. If LinUCB is run with δ_t = 1/t for T steps, then it suffers a regret:

$$R_T = O\big(d\sqrt{T}\log T\big)$$

§ It depends on d but not on the number of actions A
§ If A < ∞, we can improve the bound to

$$R_T = O\big(\sqrt{d\,T\log(TA)}\big)$$

Page 63

A Simple Recommendation System

§ A RS can recommend specific movies
§ Users arrive at random and we have information about them
§ The RS picks a movie to recommend to the user
§ The feedback is whether the user watched the movie or not
§ Objective: Design a RS that maximizes the number of movies watched

Page 64

RS as a Multi-armed Bandit

for t = 1, …, T do
  1. User u_t arrives
  2. Recommend movie a_t
  3. Reward:
     r_t = 1 if the user watches movie a_t, and 0 otherwise
endfor

Issue: too many users to collect enough feedback for each user separately

Page 65

RS as a Contextual Linear Bandit

The model
§ μ(u, a) = E[r(u, a)] is the probability that user u watches movie a
§ Each user u and movie a pair is characterized by some features x(u, a) ∈ ℝᵈ (e.g. name, location, genre, release date, past rating, income, etc.)
§ Assumption:
• The expected value is a linear function μ(u, a) = x(u, a)ᵀθ* (with θ* ∈ ℝᵈ unknown)
• The rewards are noisy observations r_t(u, a) = μ(u, a) + ε_t with E[ε_t] = 0

The objective
§ Maximize the sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} r_t\big]$

Page 66

The Regret

$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} \max_a r_t(u_t, a)\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_t(u_t, a_t)\right] = \mathbb{E}\left[\sum_{t=1}^{T} \big(\mu(u_t, a_t^*) - \mu(u_t, a_t)\big)\right]$$

where $a_t^* = \arg\max_a \mu(u_t, a)$.

Page 67

Least-Squares Estimate of θ*

§ Least-squares estimate:

$$\hat\theta_t = \arg\min_{\theta\in\mathbb{R}^d} \frac{1}{t}\sum_{s=1}^{t} \big(r_s - x(u_s, a_s)^\top\theta\big)^2 + \lambda\|\theta\|^2$$

§ Closed-form solution:

$$V_t = \sum_{s=1}^{t} x(u_s, a_s)\,x(u_s, a_s)^\top + \lambda I, \qquad b_t = \sum_{s=1}^{t} x(u_s, a_s)\,r_s \quad\Longrightarrow\quad \hat\theta_t = V_t^{-1}\,b_t$$

§ Estimate of the value of action a for user u:

$$\hat\mu_t(u, a) = x(u, a)^\top \hat\theta_t$$

Page 68

ContextualLinUCB: Algorithm

for t = 1, …, T do
  1. Observe context u_t
  2. Compute the upper-confidence bound

$$B_t(u_t, a) = \hat\mu_t(u_t, a) + \beta_t\sqrt{x(u_t, a)^\top V_t^{-1} x(u_t, a)}$$

  3. Take action a_t = argmax_a B_t(u_t, a)
  4. Observe reward $r_t \sim x(u_t, a_t)^\top\theta^* + \varepsilon_t$
  5. Update statistics:

$$V_{t+1} = V_t + x(u_t, a_t)\,x(u_t, a_t)^\top, \qquad b_{t+1} = b_t + x(u_t, a_t)\,r_t, \qquad \hat\theta_{t+1} = V_{t+1}^{-1}\,b_{t+1}$$

endfor
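A minimal Python sketch of the contextual variant: it is identical to LinUCB except that the feature map is evaluated on the observed context u_t at every round. The feature map feat(u, a) below is a hypothetical stand-in for x(u, a), and all problem parameters are illustrative assumptions.

```python
# Hedged sketch of ContextualLinUCB with a hypothetical feature map.
import numpy as np

d, A, T, lam, beta, sigma = 6, 10, 2000, 1.0, 1.0, 0.1
rng = np.random.default_rng(0)
theta_star = rng.standard_normal(d) / np.sqrt(d)    # unknown parameter (illustrative)

def feat(u, a):
    """Hypothetical x(u, a): a deterministic pseudo-random embedding of (context, action)."""
    r = np.random.default_rng(hash((round(float(u), 3), a)) % 2**32)
    return r.standard_normal(d) / np.sqrt(d)

V, b = lam * np.eye(d), np.zeros(d)
for t in range(T):
    u_t = rng.random()                               # observe context
    Xs = np.stack([feat(u_t, a) for a in range(A)])  # features for every action
    V_inv = np.linalg.inv(V)
    width = np.sqrt(np.einsum("ad,dk,ak->a", Xs, V_inv, Xs))
    a_t = int(np.argmax(Xs @ (V_inv @ b) + beta * width))
    r_t = Xs[a_t] @ theta_star + sigma * rng.standard_normal()
    V += np.outer(Xs[a_t], Xs[a_t])
    b += Xs[a_t] * r_t
```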

Page 69

ContextualLinUCB: Regret

Theorem
Consider a contextual linear MAB problem with contexts and actions defined in ℝᵈ and unknown parameter θ* ∈ ℝᵈ. If ContextualLinUCB is run with δ_t = 1/t for T steps, then for any arbitrary sequence of contexts u₁, u₂, …, u_T, it suffers a regret:

$$R_T = O\big(d\sqrt{T}\log T\big)$$

Page 70

Regret and Finite Sample Analysis

§ How many data points (or experiences) are needed in order to get a good approximation of the optimal policy in a reinforcement learning task with an algorithm 𝒜?
• Sample complexity n(δ, ε): the smallest n such that, with probability at least 1 − δ,
  AvgError(𝒜_n) ≤ ε,
  where AvgError(𝒜_n) is the average error made by the algorithm after n steps.
• Regret:

$$\sum_{t=1}^{n} \mathrm{Error}(\mathcal{A}_t)$$

Adapted from Mohammed Amine Bennouna & Moïse Blanchard (MIT)

Page 71

Fitted Q-Iteration

Offline FQI guarantee [Munos, 2003; Munos and Szepesvári, 2008; Antos et al., 2008; Agarwal et al., 2021]

The k-th iterate of offline Fitted Q-Iteration guarantees that, with probability 1 − δ:

$$\big\|Q^* - Q^{\pi_k}\big\| \le O\left(\frac{1}{(1-\gamma)^2}\sqrt{\frac{C\,\log(|\mathcal{F}|/\delta)}{n}}\right) + \frac{2\gamma^k}{1-\gamma}$$

§ Remark: estimation and optimization error terms
§ Assumptions
• Concentrability C, a measure of data coverage (strong assumption)
• Realizability: Q* ∈ ℱ, where ℱ is the function class for the Q-function
• Bellman completeness: for any f ∈ ℱ, 𝒯f ∈ ℱ
§ Ingredients
• Performance difference lemma
• Bernstein's inequality