Page 1:

Subsampling, Concentration and Multi-armed bandits

Odalric-Ambrym Maillard, R. Bardenet, S. Mannor, A. Baransi, N. Galichet, J. Pineau, A. Durand

Toulouse, November 09, 2015

O-A. Maillard Subsampling and Bandits 1 / 36

Page 2: Roadmap

1 Sub-sampling concentration: 1.1 Hoeffding-Serfling, 1.2 Bernstein-Serfling and 1.3 empirical Bernstein-Serfling bounds.

2 Sub-sampling for stochastic multi-armed bandits: 2.1 "Best empirical sub-sampled arm" strategy, 2.2 Illustrative experiments, 2.3 Cumulative regret bound and extensions.

Page 3: Sub-sampling concentration (Introduction)

"Concentration inequalities for sampling without replacement", Bardenet and Maillard, Bernoulli, 2015.

Page 4: Sub-sampling

- X = (x1, . . . , xN): a finite population of N real points.
- Sub-sample of size n ≤ N from X: X1, . . . , Xn picked uniformly at random without replacement from X.

Simple problem: approximating the population mean µ = (1/N) ∑_{i=1}^N xi.
- Concentration for partial sums of X1, . . . , Xn.
- Careful: dependency (samples drawn without replacement are not independent).
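The setting above is easy to simulate; a minimal sketch (function names are mine, not from the slides):

```python
import random

def subsample_mean(population, n, rng=None):
    """Estimate the population mean mu from n points drawn
    uniformly at random without replacement from X."""
    rng = rng or random.Random()
    sample = rng.sample(population, n)  # sampling without replacement
    return sum(sample) / n

population = [0.1 * i for i in range(100)]  # X = (x_1, ..., x_N)
mu = sum(population) / len(population)      # mu = (1/N) sum x_i

# With n = N every point is seen exactly once, so the estimate is exact:
full = subsample_mean(population, len(population))
```

Note the key feature of sampling without replacement: at n = N the estimator has zero error, which is exactly what the Serfling-type factors ρn below capture.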

Page 5: Hoeffding's reduction lemma

Lemma (Hoeffding, 1963). Let X = (x1, . . . , xN) be a finite population of N real points, let X1, . . . , Xn denote a random sample without replacement from X, and let Y1, . . . , Yn denote a random sample with replacement from X. If f : R → R is continuous and convex, then

E f(∑_{i=1}^n Xi) ≤ E f(∑_{i=1}^n Yi).

From sampling with to sampling without replacement: we can thus transfer some results for sampling with replacement to the case of sampling without replacement (via Chernoff's method, taking f(x) = exp(λx)).
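The lemma can be checked exactly on a tiny population by enumerating all equally likely samples; a sketch (the convex test function f(s) = s² is my choice):

```python
from itertools import permutations, product

def expected_f_of_sum(samples, f):
    """Average f(sum(sample)) over an explicit list of equally likely samples."""
    return sum(f(sum(s)) for s in samples) / len(samples)

x = [0.0, 1.0, 3.0]   # tiny population, N = 3
n = 2
f = lambda s: s * s   # continuous and convex

without_repl = list(permutations(x, n))   # ordered samples without replacement
with_repl = list(product(x, repeat=n))    # ordered samples with replacement

lhs = expected_f_of_sum(without_repl, f)  # E f(X_1 + X_2)
rhs = expected_f_of_sum(with_repl, f)     # E f(Y_1 + Y_2)
assert lhs <= rhs                         # Hoeffding's reduction lemma
```

Intuitively, sampling without replacement anti-correlates the summands, so a convex function of the sum has smaller expectation.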

Page 6: Comparing bounds on P(n^{-1} ∑_{i=1}^n Xi − µ ≥ 10^{-2}), N = 10^4

[Figure: probability bound versus n ∈ [0, 10000] for the estimate and the Hoeffding, Bernstein, Hoeffding-Serfling and Bernstein-Serfling bounds, on four populations: (a) Gaussian N(0, 1), (b) Log-normal lnN(1, 1), (c) Bernoulli B(0.1), (d) Bernoulli B(0.5).]

Page 7: Serfling's key observation

For 1 ≤ k ≤ N (considering fictitious Xn+1, . . . , XN), define

Z_k = (1/k) ∑_{t=1}^k (Xt − µ)   and   Z*_k = (1/(N − k)) ∑_{t=1}^k (Xt − µ).   (1)

Lemma (Serfling, 1974). The following forward martingale structure holds for {Z*_k}_{k≤N}:

E[Z*_k | Z*_{k−1}, . . . , Z*_1] = Z*_{k−1}.

The following reverse martingale structure holds for {Z_k}_{k≤N}:

E[Z_k | Z_{k+1}, . . . , Z_{N−1}] = Z_{k+1}.

⟹ Structured dependency.

Page 8: A useful result

Theorem (Serfling, 1974). Let a = min_{1≤i≤N} xi and b = max_{1≤i≤N} xi. Then for all λ ∈ R+, it holds

log E exp(λ n Z_n) ≤ ((b − a)²/8) λ² n (1 − (n − 1)/N).

Moreover,

log E exp(λ max_{1≤k≤n} Z*_k) ≤ ((b − a)²/8) λ² (n/(N − n)²) (1 − (n − 1)/N).

Page 9: A useful result (continued)

Theorem (Bardenet and Maillard, 2015). Let a = min_{1≤i≤N} xi and b = max_{1≤i≤N} xi. Then for all λ ∈ R+, it also holds

log E exp(λ n Z_n) ≤ ((b − a)²/8) λ² (n + 1)(1 − n/N).

Moreover,

log E exp(λ max_{1≤k≤n} Z*_k) ≤ ((b − a)²/8) λ² (n/(N − n)²) (1 − (n − 1)/N),

log E exp(λ max_{n≤k≤N−1} Z_k) ≤ ((b − a)²/8) (λ²/n²) (n + 1)(1 − n/N).

(Slight) improvement when n > N/2.

Page 10: A slightly improved Hoeffding-Serfling inequality

Trivial corollary:

Corollary (Bardenet and Maillard, 2015). For all n ≤ N and δ ∈ [0, 1], with probability higher than 1 − δ, it holds

(1/n) ∑_{t=1}^n (Xt − µ) ≤ (b − a) √(ρn log(1/δ) / (2n)),

where we define

ρn = 1 − (n − 1)/N          if n ≤ N/2,
ρn = (1 − n/N)(1 + 1/n)     if n > N/2.   (2)
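The corollary is straightforward to turn into code; a sketch (function names are mine):

```python
import math

def rho(n, N):
    """The rho_n factor from (2)."""
    if n <= N / 2:
        return 1.0 - (n - 1) / N
    return (1.0 - n / N) * (1.0 + 1.0 / n)

def hs_radius(n, N, b_minus_a, delta):
    """Hoeffding-Serfling deviation radius (b-a) sqrt(rho_n log(1/delta) / (2n))."""
    return b_minus_a * math.sqrt(rho(n, N) * math.log(1.0 / delta) / (2.0 * n))
```

As n → N the factor ρn → 0, so the bound vanishes: once the whole population has been seen, the empirical mean is exactly µ. The classical Hoeffding bound (ρn replaced by 1) never vanishes.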

Page 11: Sub-sampling concentration (Bernstein-Serfling)

Page 12: Towards Bernstein-Serfling's inequality

Let σ² = (1/N) ∑_{i=1}^N (xi − µ)², and define

Q*_{k−1} = (1/(N − k + 1)) ∑_{i=1}^{k−1} ((Xi − µ)² − σ²),

Q_{k+1} = (1/(k + 1)) ∑_{i=1}^{k+1} ((Xi − µ)² − σ²).

Lemma (Bardenet and Maillard, 2015).

E[(Xk − µ)² | Z1, . . . , Z_{k−1}] = σ² − Q*_{k−1},

where the Zi are defined in (1). Likewise,

E[(X_{k+1} − µ)² | Z_{k+1}, . . . , Z_{N−1}] = σ² + Q_{k+1}.

Page 13: A Bernstein-Serfling inequality

Corollary (Bardenet and Maillard, 2015). Let n ≤ N and δ ∈ [0, 1]. With probability larger than 1 − 2δ, it holds that

(1/n) ∑_{t=1}^n (Xt − µ) ≤ σ √(2 ρn log(1/δ) / n) + κn (b − a) log(1/δ)/n,

where

ρn = 1 − f_{n−1}                    if n ≤ N/2,
ρn = (1 − fn)(1 + 1/n)              if n > N/2,

κn = 4/3 + √(fn / g_{n−1})          if n ≤ N/2,
κn = 4/3 + √(g_{n+1} (1 − fn))      if n > N/2,   (3)

with fn = n/N and gn = N/n − 1.
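A sketch of κn as I read it off the slide; the groupings under the square roots are my reconstruction of the garbled layout, so treat them as assumptions (valid for 2 ≤ n < N):

```python
import math

def kappa(n, N):
    """kappa_n from (3), with f_k = k/N and g_k = N/k - 1.
    The square-root groupings are reconstructed from the slide layout."""
    f = lambda k: k / N
    g = lambda k: N / k - 1.0
    if n <= N / 2:
        return 4.0 / 3.0 + math.sqrt(f(n) / g(n - 1))  # assumes n >= 2
    return 4.0 / 3.0 + math.sqrt(g(n + 1) * (1.0 - f(n)))
```

In both regimes κn stays of order one, so the (b − a) term in the corollary decays at the fast log(1/δ)/n rate.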

Page 14: A Bernstein-Serfling inequality (discussion)

Improvement over Bernstein
- The factor ρn can give a dramatic improvement.

Proof elements
- Self-bounded property of the variance: study of Z = (1/(b − a)²) ∑_{i=1}^n (Xi − µ)² (cf. Maurer and Pontil, 2006; via the tensorization inequality for the entropy).
- Hoeffding's reduction lemma.

Page 15: Towards an empirical Bernstein-Serfling inequality

σ̂n² = (1/n) ∑_{i=1}^n (Xi − µ̂n)² = (1/n²) ∑_{i,j=1}^n (Xi − Xj)²/2, where µ̂n = (1/n) ∑_{i=1}^n Xi.

Lemma (Bardenet and Maillard, 2015). When sampling without replacement from a finite population X = (x1, . . . , xN) of size N, with range [a, b] and variance σ², the empirical variance σ̂n² using n < N samples satisfies

P(σ ≥ σ̂n + (b − a)(1 + √(1 + ρn)) √(log(3/δ)/(2n))) ≤ δ.

Possible improvement
- Conjecture: replace (1 + √(1 + ρn)) with √(4ρn).
- Difficulty: concentration for self-bounded random variables when sampling without replacement.
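The two expressions for σ̂n² on this slide are the standard variance / pairwise-difference identity; a quick numerical check (function names are mine):

```python
def emp_var_direct(xs):
    """sigma_hat_n^2 = (1/n) sum_i (X_i - mu_hat_n)^2."""
    n = len(xs)
    mu_hat = sum(xs) / n
    return sum((x - mu_hat) ** 2 for x in xs) / n

def emp_var_pairs(xs):
    """Same quantity written as (1/n^2) sum_{i,j} (X_i - X_j)^2 / 2."""
    n = len(xs)
    return sum((a - b) ** 2 for a in xs for b in xs) / (2.0 * n * n)
```

The pairwise form makes it clear that σ̂n² requires no knowledge of µ.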

Page 16: An empirical Bernstein-Serfling inequality

Corollary (Bardenet and Maillard, 2015). For all δ ∈ [0, 1], with probability larger than 1 − 5δ, it holds

(1/n) ∑_{t=1}^n (Xt − µ) ≤ σ̂n √(2 ρn log(1/δ) / n) + κ (b − a) log(1/δ)/n,

where we recall the definition of ρn from (2):

ρn = 1 − (n − 1)/N          if n ≤ N/2,
ρn = (1 − n/N)(1 + 1/n)     if n > N/2,

and κ = 7/3 + 3/√2.
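Everything in this corollary is observable, so it yields a fully data-driven confidence interval; a sketch combining the pieces (function names are mine):

```python
import math

KAPPA = 7.0 / 3.0 + 3.0 / math.sqrt(2.0)  # the constant kappa above

def rho(n, N):
    """rho_n from (2)."""
    if n <= N / 2:
        return 1.0 - (n - 1) / N
    return (1.0 - n / N) * (1.0 + 1.0 / n)

def ebs_radius(sigma_hat, n, N, b_minus_a, delta):
    """Empirical Bernstein-Serfling deviation radius, holding w.p. >= 1 - 5 delta:
    sigma_hat sqrt(2 rho_n log(1/delta) / n) + kappa (b-a) log(1/delta) / n."""
    log_term = math.log(1.0 / delta)
    return (sigma_hat * math.sqrt(2.0 * rho(n, N) * log_term / n)
            + KAPPA * b_minus_a * log_term / n)
```

For small variance the first term dominates and the bound can be far tighter than Hoeffding-Serfling; the second (range) term decays at the fast 1/n rate.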

Page 17: Serfling bounds

[Figure: inverted bound (deviation level at fixed confidence) versus n ∈ [0, 100000] for the Hoeffding-Serfling, Bernstein-Serfling and empirical Bernstein-Serfling bounds, on four populations: (e) Gaussian N(0, 1), (f) Log-normal lnN(1, 1), (g) Bernoulli B(0.1), (h) Bernoulli B(0.5).]

Page 18: Sub-sampling recap

What we did
- Improved Hoeffding-Serfling bound, plus new Bernstein-Serfling and empirical Bernstein-Serfling bounds.
- Improvement over Hoeffding's reduction due to ρn.

Improvement / open question
- Tensorization inequality for the entropy in the case of sampling without replacement?
- Would lead to: (1 + √(1 + ρn)) replaced with √(4ρn).

Page 19: Sub-sampling bandits (Introduction)

"Sub-sampling for multi-armed bandits", Baransi, Maillard, Mannor, ECML, 2014.

Page 20: Stochastic multi-armed bandit setting

Setting. A set of choices (arms) A. Each a ∈ A is associated with an unknown probability distribution νa ∈ D with mean µa. At each round t = 1, . . . , T the player
- first picks an arm At ∈ A based on past observations,
- then receives (and sees) a stochastic payoff Xt ∼ ν_{At}.

Goal and performance. Minimize the regret at round T:

R_T := E[T µ* − ∑_{t=1}^T Xt] = ∑_{a∈A} (µ* − µa) E[N^π_{T,a}],

where µ* = max{µa : a ∈ A}, a* ∈ argmax{µa : a ∈ A}, and

N^π_{T,a} = ∑_{t=1}^T I{At = a}.
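The regret definition translates directly into a simulation harness; a minimal sketch for Bernoulli arms (the policy interface and all names are mine):

```python
import random

def run_bandit(policy, means, T, seed=0):
    """Play T rounds of a Bernoulli bandit and return the pseudo-regret
    sum_a (mu* - mu_a) N_{T,a}."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    rewards = [[] for _ in means]
    for t in range(T):
        a = policy(counts, rewards, rng)
        x = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        rewards[a].append(x)
    mu_star = max(means)
    return sum((mu_star - mu) * n for mu, n in zip(means, counts))

def uniform_policy(counts, rewards, rng):
    """Baseline: pick an arm uniformly at random (linear regret)."""
    return rng.randrange(len(counts))
```

Any strategy discussed below (FTL, BESA, UCB variants) fits the same `policy(counts, rewards, rng)` interface.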

Page 21: Lower performance bound

Theorem (Burnetas and Katehakis, 1996). For any strategy π that is consistent (for any bandit problem, any sub-optimal arm a and any β > 0, E[N^π_{T,a}] = o(T^β)) and any D ⊂ P([0, 1]),

lim inf_{T→∞} R_T / log T ≥ ∑_{a: ∆a > 0} (µ* − µa) / Kinf(νa, µ*),

where Kinf(νa, µ*) := inf{KL(νa ‖ ν) : ν ∈ D has mean > µ*}.
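For Bernoulli arms Kinf(νa, µ*) reduces to the binary KL divergence kl(µa, µ*), so the lower-bound constant is easy to evaluate; a sketch (function names are mine):

```python
import math

def bernoulli_kl(p, q):
    """KL(Ber(p) || Ber(q)), with q clamped away from {0, 1}."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

def lower_bound_constant(means):
    """sum over suboptimal arms of (mu* - mu_a) / Kinf(nu_a, mu*);
    for Bernoulli arms Kinf(nu_a, mu*) = kl(mu_a, mu*)."""
    mu_star = max(means)
    return sum((mu_star - mu) / bernoulli_kl(mu, mu_star)
               for mu in means if mu < mu_star)
```

The theorem says no consistent strategy can beat `lower_bound_constant(means) * log T` asymptotically.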

Page 22: Optimality

Class of optimal algorithms
- Confidence bound: e.g. KL-UCB (Lai and Robbins, 1985)
- Bayesian: e.g. Thompson Sampling (Thompson, 1933)
- Sub-sampling?

Provably optimal finite-time regret for some D
- Discrete distributions, or exponential families of dimension 1.

They need to know D in order to be optimal
- A different algorithm for each D: TS or KL-UCB for Bernoulli, for Poisson, for Exponential, etc.

Page 23: Puzzling experiments (T = 20,000, 50,000 replicates)

10 Bernoulli arms (0.1, 3×{0.05}, 3×{0.02}, 3×{0.01}):

            BESA    kl-UCB   kl-UCB+   TS      Others
Regret      74.4    121.2    72.8      83.4    100-400
Beat BESA   -       1.6%     35.4%     3.1%
Run time    13.9X   2.8X     3.1X      X

[Figure: regret versus time (up to 2 × 10^4) for BESA, KL-UCB, KL-UCB+ and Thompson.]

Others: UCB, MOSS, UCB-Tuned, DMED, UCB-V. (Credit: Akram Baransi)

Page 24: Puzzling experiments (T = 20,000, 50,000 replicates)

Exponential arms (1/5, 1/4, 1/3, 1/2, 1):

            BESA   KL-UCB-exp   UCB-tuned   FTL     10 Others
Regret      53.3   65.7         97.6        306.5   60-110, 120+
Beat BESA   -      5.7%         4.3%        -
Run time    6X     2.8X         X           -

[Figure: regret versus time for BESA, BESAT, KL-UCB-exp and UCB-tuned.]

Others: UCB, MOSS, kl-UCB, UCB-V. (Credit: Akram Baransi)

Page 25: Puzzling experiments (T = 20,000, 50,000 replicates)

Poisson arms with means {1/2 + i/3}_{i=1,...,6}:

            BESA   KL-UCB-Poisson   kl-UCB   FTL 10
Regret      19.4   25.1             150.6    144.6
Beat BESA   -      4.1%             0.7%     -
Run time    3.5X   1.2X             X        -

[Figure: regret versus time for BESA, BESAT, KL-UCB-Poisson and kl-UCB.]

(Credit: Akram Baransi)

Page 26: Puzzling experiments (T = 20,000, 50,000 replicates)

Bernoulli arms, all 0.5 but one at 0.51:

            BESA    KL-UCB   KL-UCB+   TS
Regret      156.7   170.8    165.3     165.1
Beat BESA   -       41.4%    41.6%     40.8%
Run time    19.6X   2.8X     3X        X

[Figure: regret versus time for BESA, KL-UCB, KL-UCB+ and Thompson.]

(Credit: Akram Baransi)

Page 27: A puzzling strategy

BESA
- Competitive regret against the state of the art for various D.
- Same algorithm for all D.
- Not relying on upper confidence bounds, not Bayesian...
- ...and extremely simple to implement.

Questions
- How is this possible?
- Can we prove optimality?
- For which distributions is it optimal?

Page 28: Sub-sampling bandits (Best Empirical Sub-sampling Average)

Page 29: Go back to "Follow the leader"

FTL
1: Play each arm once.
2: At time t, define µ̃_{t,a} = µ̂(X^a_{1:N_{t,a}}) for all a ∈ A.
   - µ̂(X): empirical average of the population X.
   - X^a_{1:N_{t,a}} = {Xs : As = a, s ≤ t}.
3: Choose (breaking ties in favor of the smallest N_t)
   At = argmax_{a'∈{a,b}} µ̃_{t,a'}.

Properties
- Generally bad: linear regret.
- A variant (ε-greedy) performs OK if well-tuned (Auer et al., 2002).
- Optimal for very specific distributions (e.g. deterministic).
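A self-contained sketch of FTL on Bernoulli arms, returning the pseudo-regret from the earlier slides (all names are mine):

```python
import random

def ftl(means, T, seed=0):
    """Follow the leader on Bernoulli arms: play each arm once, then
    always play the arm with the best empirical average, breaking
    ties toward the least-pulled arm. Returns the pseudo-regret."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(T):
        untried = [a for a in range(k) if counts[a] == 0]
        if untried:
            a = untried[0]
        else:
            emp = [sums[i] / counts[i] for i in range(k)]
            best = max(emp)
            a = min((i for i in range(k) if emp[i] == best),
                    key=lambda i: counts[i])
        x = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += x
    mu_star = max(means)
    return sum((mu_star - means[i]) * counts[i] for i in range(k))
```

With deterministic arms FTL is optimal (it pays only the initial pulls); with noisy arms it can lock onto a bad leader forever, which is the linear-regret failure mode above.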

Page 30: Follow the FAIR leader (aka BESA)

Compare two arms based on "equal opportunity", i.e. the same number of observations.

BESA at time t for two arms a, b:
1: Sample I^a_t ∼ Wr(N_{t,a}; N_{t,b}) and I^b_t ∼ Wr(N_{t,b}; N_{t,a}).
   - Wr(n; N): sample n points from {1, . . . , N} without replacement (return the whole set if n > N).
2: Define µ̃_{t,a} = µ̂(X^a_{1:N_{t,a}}(I^a_t)) and µ̃_{t,b} = µ̂(X^b_{1:N_{t,b}}(I^b_t)).
3: Choose (breaking ties in favor of the smallest N_t)
   At = argmax_{a'∈{a,b}} µ̃_{t,a'}.

Questions
- Why does it work?
- When can we prove log(T) regret? Optimality?
- When does it fail?

Page 31: Follow the FAIR leader (aka BESA)

Compare two arms based on "equal opportunity", i.e. the same number of observations.

BESA at time t for two arms a, b:
1: Sample I^a_t ∼ Wr(N_{t,a}; N_{t,b}) and I^b_t ∼ Wr(N_{t,b}; N_{t,a}).
   - Example: N_{t,a} = 3, N_{t,b} = 10: I_{t,a} = {1, 2, 3}, and |I_{t,b}| = 3, sampled without replacement from {1, . . . , 10}.
2: Define µ̃_{t,a} = µ̂(X^a_{1:N_{t,a}}(I^a_t)) and µ̃_{t,b} = µ̂(X^b_{1:N_{t,b}}(I^b_t)).
3: Choose (breaking ties in favor of the smallest N_t)
   At = argmax_{a'∈{a,b}} µ̃_{t,a'}.

Questions
- Why does it work?
- When can we prove log(T) regret? Optimality?
- When does it fail?
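One BESA duel is a few lines; a sketch of steps 1-3 for two arms, sub-sampling each history down to the smaller size as in the example above (the function name is mine; ties broken toward the less-sampled arm as on the slide):

```python
import random

def besa_choice(hist_a, hist_b, rng):
    """One BESA comparison: sub-sample both reward histories without
    replacement down to the smaller size, then pick the arm with the
    best sub-sampled empirical mean (0 for arm a, 1 for arm b)."""
    na, nb = len(hist_a), len(hist_b)
    m = min(na, nb)
    sub_a = rng.sample(hist_a, m)  # if m == na this is the whole history
    sub_b = rng.sample(hist_b, m)
    mu_a = sum(sub_a) / m
    mu_b = sum(sub_b) / m
    if mu_a > mu_b:
        return 0
    if mu_b > mu_a:
        return 1
    return 0 if na <= nb else 1    # tie: favor the smaller history
```

The key design choice: the leading arm is judged on a sub-sample of the same size as the challenger, so the comparison is "fair" and the challenger keeps a non-negligible chance of winning.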

Page 32: Intuition

Assume µb > µa, N_{t,a} = na, N_{t,b} = nb with na ≥ nb. The probability of making one mistake is approximately

P[µ̂(X^a_{1:na}(I(na; nb))) ≥ µ̂(X^b_{1:nb})],   (4)

where I(na; nb) ∼ Wr(na; nb). The probability of making M consecutive mistakes is essentially

P[∀m ∈ [M], µ̂(X^a_{1:na^{(m)}}(I_m(na^{(m)}; nb))) ≥ µ̂(X^b_{1:nb})],   (5)

where for all m ≤ M, I_m(na; nb) ∼ Wr(na; nb) and na^{(m)} = na + m − 1.

For deterministic na, nb: (4) decreases like e^{−2 nb (µb − µa)²}, and (5) like e^{−2 nb M̃ (µb − µa)²}, where M̃ is the number of non-overlapping sub-samples (independent chunks).

Exponential decay of the probability of consecutive mistakes.
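The closing estimate can be tabulated; a sketch of the heuristic e^{−2 nb M̃ ∆²} decay (this is the slide's back-of-the-envelope rate, not a proved finite-time constant):

```python
import math

def consecutive_mistake_bound(n_b, gap, chunks):
    """Heuristic Hoeffding-style rate exp(-2 n_b * chunks * gap^2) for
    `chunks` consecutive mistakes over non-overlapping (hence
    independent) sub-samples of the leading arm's history."""
    return math.exp(-2.0 * n_b * chunks * gap * gap)
```

Each extra independent chunk multiplies the mistake probability by another e^{−2 nb ∆²} factor, which is why long losing streaks for the optimal arm are exponentially unlikely.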

Page 33: Regret bound (slightly simplified statement)

Let A = {*, a} and define

α(M, n) = E_{Z*∼ν_{*,n}} [ (P_{Z∼ν_{a,n}}(Z > Z*) + (1/2) P_{Z∼ν_{a,n}}(Z = Z*))^M ].

Theorem (Regret of the BESA strategy). If there exist α ∈ (0, 1) and c > 0 such that α(M, 1) ≤ c α^M, then

R_T ≤ 11 log(T)/(µ* − µa) + C_{νa,ν*} + O(1),

where C_{νa,ν*} depends on the problem, but not on T.

Example
- Bernoulli µa, µ*: α(M, 1) = O(((µa ∨ (1 − µa))/2)^M).
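For Bernoulli arms and n = 1, α(M, 1) can be evaluated in closed form by conditioning on the single optimal-arm sample Z* ∈ {0, 1}; a sketch under the displayed definition (strict inequality plus half weight on ties; the derivation is mine):

```python
def alpha_bernoulli(mu_a, mu_star, M):
    """Exact alpha(M, 1) for Bernoulli arms, conditioning on Z*."""
    inner_if_one = mu_a / 2.0                  # Z* = 1: P(Z>1)=0, P(Z=1)=mu_a
    inner_if_zero = mu_a + (1.0 - mu_a) / 2.0  # Z* = 0: P(Z>0)=mu_a, P(Z=0)=1-mu_a
    return (mu_star * inner_if_one ** M
            + (1.0 - mu_star) * inner_if_zero ** M)
```

Both inner terms are strictly below 1, so α(M, 1) decays geometrically in M, which is exactly the condition the theorem needs.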

Page 34: Failure of the BESA strategy

Uniform arms X^a ∼ U([0.2, 0.4]), X* ∼ U([0, 1]): α(M, n) → 0.2^n as M → ∞.

Consider BESA with an initial number of pulls n0 = 0, . . . , 10:

BESA     n0 = 0   n0 = 3   n0 = 7   n0 = 8   n0 = 9   n0 = 10
Regret   920.1    216.4    35.4     25.9     17.9     15.4

         UCB    kl-UCB   TS     FTL n0 = 10
Regret   21.2   20.7     13.2   54.3

                     UCB     kl-UCB   TS      FTL n0 = 10
Beat BESA n0 = 0     24.3%   24.3%    24.7%   -
Beat BESA n0 = 3     7.3%    7.3%     7.8%    -
Beat BESA n0 = 7     1.6%    1.6%     1.8%    -
Beat BESA n0 = 10    0.6%    0.6%     0.7%    -

(Credit: Akram Baransi)

Page 35: Regret performance of BESA

Theorem (Regret of the BESA strategy). If there exist α ∈ (0, 1) and c > 0 such that α(M, 1) ≤ c α^M, then

R_T ≤ 11 log(T)/(µ* − µa) + C_{νa,ν*} + O(1),

where C_{νa,ν*} depends on the problem, but not on T. If there exist β ∈ (0, 1) and c > 0 such that α(1, n) ≤ c β^n, then BESA initialized with n_{0,T} ≃ ln(T)/ln(1/β) pulls of each arm gets

R_T ≤ 11 log(T)/(µ* − µa) + n_{0,T} + C_{νa,ν*} + O(1).

Key points
- The first condition holds for a large class: extends FTL.
- The initial number of pulls is less elegant.
- Alternatively: mix with uniform exploration, as in ε-greedy.

Page 36: Sketch of regret analysis

1. Basic concentration gives ∆a ≤ C log(t)^{1/2} / min{N_{t,*}, N_{t,a}}^{1/2}:
   If N_{t,a} < N_{t,*}, then w.h.p. N_{t,a} ≤ (C²/∆²) log(t).
   If N_{t,a} ≥ N_{t,*}, then w.h.p. N_{t,*} ≤ (C²/∆²) log(t) := ut.
   Thus: on the event N_{t,*} ≥ ut, we must have N_{t,a} ≤ ut w.h.p.

2. Show that N_{t,*} ≥ ut w.h.p. (as for Thompson Sampling).
   Let τ*_j be the delay between the j-th time tj and the (j + 1)-th time we play *. Then

   P[N_{t,*} ≤ ut] ≤ ∑_{j=1}^{ut} P[τ*_j ≥ t/ut − 1 =: ℓt]

   ≤ ∑_{j=1}^{ut} P[∀s ∈ {0, . . . , ⌊ℓt/2⌋, ⌈ℓt/2⌉, . . . , ℓt}: a_{tj+s} = a]

   ≤ ∑_{j=1}^{ut} P[∀s ∈ {⌈ℓt/2⌉, . . . , ℓt}: a_{tj+s} = a and N_{tj+s,a} ≥ ⌊ℓt/2⌋ (which is ≥ ut ≥ j for t ≥ c)]

Page 37: Sketch of regret analysis (continued)

Same steps as before, but the last inequality can be refined: for t ≥ c,

   P[N_{t,*} ≤ ut] ≤ ∑_{j=1}^{ut} P[∀s ∈ {⌈ℓt/2⌉, . . . , ℓt}: a_{tj+s} = a and N_{tj+s,a} ≥ N_{tj+s,*} = j]

Page 38: BESA recap

Optimality and near-optimality regions in P([0, 1])?

Properties
- Flexible: needs neither the class of distributions nor the support.
- We can prove log(T) regret for certain classes.
- Optimality (constants) unknown yet, but we are close.
- Exhibited cases where it fails: why, and how to repair it.


Page 40: Thank you

If you want to
- prove "adaptive" optimality of this strategy, or
- extend it to contextual bandits, adversarial bandits, MDPs,

come work with me!