Subsampling, Concentration and Multi-armed bandits
Odalric-Ambrym Maillard, R. Bardenet, S. Mannor, A. Baransi, N. Galichet, J. Pineau, A. Durand
Toulouse, November 9, 2015
O-A. Maillard Subsampling and Bandits 1 / 36
Roadmap

1 Sub-sampling concentration: 1.1 Hoeffding-Serfling, 1.2 Bernstein-Serfling and 1.3 empirical Bernstein-Serfling bounds.
2 Sub-sampling for stochastic multi-armed bandits: 2.1 "Best empirical sub-sampled arm" strategy, 2.2 illustrative experiments, 2.3 cumulative regret bound and extensions.
Sub-sampling concentration: Introduction

"Concentration inequalities for sampling without replacement", Bardenet and Maillard, Bernoulli, 2015.
Sub-sampling

- X = (x_1, ..., x_N), a finite population of N real points.
- Sub-sample of size n ≤ N from X: X_1, ..., X_n picked uniformly at random without replacement from X.

Simple problem: approximating the population mean µ = (1/N) ∑_{i=1}^N x_i.
- Concentration for partial sums of X_1, ..., X_n.
- Careful: dependency.
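As a quick sanity check of the setup above, here is a minimal Python sketch (not from the talk; the population, its distribution and the sizes are made up for illustration) that draws a sub-sample without replacement and compares its mean to µ:

```python
import random

random.seed(0)

# A finite population of N real points (hypothetical data for illustration).
N = 10_000
population = [random.gauss(0.0, 1.0) for _ in range(N)]
mu = sum(population) / N  # the population mean we want to approximate

# Sub-sample of size n <= N, drawn uniformly at random without replacement.
n = 500
subsample = random.sample(population, n)
estimate = sum(subsample) / n  # typically within O(1/sqrt(n)) of mu
```

`random.sample` draws without replacement, which is exactly the sampling scheme studied here; the dependency between the X_i is what the Serfling-type bounds below control.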
Hoeffding's reduction lemma

Lemma (Hoeffding, 1963). Let X = (x_1, ..., x_N) be a finite population of N real points, let X_1, ..., X_n denote a random sample without replacement from X, and let Y_1, ..., Y_n denote a random sample with replacement from X. If f: R → R is continuous and convex, then

  E f(∑_{i=1}^n X_i) ≤ E f(∑_{i=1}^n Y_i).

From sampling with to without replacement: we can thus transfer some results for sampling with replacement to the case of sampling without replacement (via Chernoff).
Comparing bounds on P((1/n) ∑_{t=1}^n X_t − µ ≥ 10⁻²), N = 10⁴

[Figure: probability bound vs. sample size n (0 to 10000) for the Estimate and the Hoeffding, Bernstein, Hoeffding-Serfling and Bernstein-Serfling bounds, on four populations: (a) Gaussian N(0, 1), (b) Log-normal lnN(1, 1), (c) Bernoulli B(0.1), (d) Bernoulli B(0.5).]
Serfling's key observation

For 1 ≤ k ≤ N (considering fictitious X_{n+1}, ..., X_N), define

  Z_k = (1/k) ∑_{t=1}^k (X_t − µ)  and  Z*_k = (1/(N − k)) ∑_{t=1}^k (X_t − µ).   (1)

Lemma (Serfling, 1974). The following forward martingale structure holds for {Z*_k}_{k≤N}:

  E[Z*_k | Z*_{k−1}, ..., Z*_1] = Z*_{k−1}.

The following reverse martingale structure holds for {Z_k}_{k≤N}:

  E[Z_k | Z_{k+1}, ..., Z_{N−1}] = Z_{k+1}.

⟹ Structured dependency.
A useful result

Theorem (Serfling, 1974). Let a = min_{1≤i≤N} x_i and b = max_{1≤i≤N} x_i. Then for all λ ∈ R⁺ it holds that

  log E exp(λ n Z_n) ≤ ((b − a)²/8) · λ² n (1 − (n − 1)/N).

Moreover,

  log E exp(λ max_{1≤k≤n} Z*_k) ≤ ((b − a)²/8) · (λ²/(N − n)²) · n (1 − (n − 1)/N).
A useful result

Theorem (Bardenet and Maillard, 2015). Let a = min_{1≤i≤N} x_i and b = max_{1≤i≤N} x_i. Then for all λ ∈ R⁺ it also holds that

  log E exp(λ n Z_n) ≤ ((b − a)²/8) · λ² (n + 1)(1 − n/N).

Moreover,

  log E exp(λ max_{1≤k≤n} Z*_k) ≤ ((b − a)²/8) · (λ²/(N − n)²) · n (1 − (n − 1)/N),

  log E exp(λ max_{n≤k≤N−1} Z_k) ≤ ((b − a)²/8) · (λ²/n²) · (n + 1)(1 − n/N).

(Slight) improvement when n ≥ N/2.
A slightly improved Hoeffding-Serfling inequality

Trivial corollary:

Corollary (Bardenet and Maillard, 2015). For all n ≤ N and δ ∈ [0, 1], with probability higher than 1 − δ, it holds that

  (1/n) ∑_{t=1}^n (X_t − µ) ≤ (b − a) √(ρ_n log(1/δ) / (2n)),

where we define

  ρ_n = 1 − (n − 1)/N          if n ≤ N/2,
        (1 − n/N)(1 + 1/n)     if n > N/2.   (2)
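The corollary is straightforward to turn into code. A small sketch (my own, not from the paper; the function names are made up) computing the deviation level, and illustrating that ρ_n, hence the bound, vanishes as n → N, unlike the plain Hoeffding rate:

```python
import math

def rho(n, N):
    """Factor rho_n from Eq. (2); assumes 1 <= n <= N."""
    if n <= N / 2:
        return 1.0 - (n - 1) / N
    return (1.0 - n / N) * (1.0 + 1.0 / n)

def hoeffding_serfling_deviation(n, N, b_minus_a, delta):
    """Deviation level eps such that, with probability >= 1 - delta,
    the sub-sample mean exceeds mu by at most eps."""
    return b_minus_a * math.sqrt(rho(n, N) * math.log(1.0 / delta) / (2.0 * n))

# Plain Hoeffding corresponds to rho_n = 1; without replacement the bound
# collapses to 0 once the whole population has been sampled (rho_N = 0).
N, delta = 10_000, 0.05
eps_hs = hoeffding_serfling_deviation(5_000, N, 1.0, delta)
eps_hoeffding = math.sqrt(math.log(1.0 / delta) / (2.0 * 5_000))
```

Here `eps_hs < eps_hoeffding` at n = N/2, and the gap widens as n grows.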
Sub-sampling concentration: Bernstein-Serfling
Towards Bernstein-Serfling's inequality

Let σ² = (1/N) ∑_{i=1}^N (x_i − µ)², and define

  Q*_{k−1} = (1/(N − k + 1)) ∑_{i=1}^{k−1} ((X_i − µ)² − σ²),

  Q_{k+1} = (1/(k + 1)) ∑_{i=1}^{k+1} ((X_i − µ)² − σ²).

Lemma (Bardenet and Maillard, 2015).

  E[(X_k − µ)² | Z_1, ..., Z_{k−1}] = σ² − Q*_{k−1},

where the Z_i are defined in (1). Likewise,

  E[(X_{k+1} − µ)² | Z_{k+1}, ..., Z_{N−1}] = σ² + Q_{k+1}.
A Bernstein-Serfling inequality

Corollary (Bardenet and Maillard, 2015). Let n ≤ N and δ ∈ [0, 1]. With probability larger than 1 − 2δ, it holds that

  (1/n) ∑_{t=1}^n (X_t − µ) ≤ σ √(2 ρ_n log(1/δ) / n) + κ_n (b − a) log(1/δ) / n,

where

  ρ_n = 1 − f_{n−1}                    if n ≤ N/2,
        (1 − f_n)(1 + 1/n)             if n > N/2,

  κ_n = 4/3 + √(f_n / g_{n−1})         if n ≤ N/2,
        4/3 + √(g_{n+1}) (1 − f_n)     if n > N/2,   (3)

with f_n = n/N and g_n = N/n − 1.
A Bernstein-Serfling inequality

Improvement over Bernstein
- The factor ρ_n can give a dramatic improvement.

Proof elements
- Self-bounded property of the variance: study of Z = (1/(b − a)²) ∑_{i=1}^n (X_i − µ)² (cf. Maurer and Pontil, 2006; via the tensorization inequality for the entropy).
- Hoeffding's reduction lemma.
Towards an empirical Bernstein-Serfling inequality

Define the empirical variance

  σ̂²_n = (1/n) ∑_{i=1}^n (X_i − µ̂_n)² = (1/n²) ∑_{i,j=1}^n (X_i − X_j)²/2,  where µ̂_n = (1/n) ∑_{i=1}^n X_i.

Lemma (Bardenet and Maillard, 2015). When sampling without replacement from a finite population X = (x_1, ..., x_N) of size N, with range [a, b] and variance σ², the empirical variance σ̂²_n using n < N samples satisfies

  P(σ ≥ σ̂_n + (b − a)(1 + √(1 + ρ_n)) √(log(3/δ) / (2n))) ≤ δ.

Possible improvement
- Conjecture: replace (1 + √(1 + ρ_n)) with √(4 ρ_n).
- Difficulty: concentration for self-bounded random variables when sampling without replacement.
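The two expressions for σ̂²_n above agree, which is easy to check numerically (a small self-contained sketch, not from the talk; the data is made up):

```python
import random

random.seed(1)
xs = [random.random() for _ in range(50)]  # hypothetical sample
n = len(xs)
mean = sum(xs) / n

# sigma_hat^2 via the usual centered formula...
v1 = sum((x - mean) ** 2 for x in xs) / n
# ...and via the pairwise identity (1/n^2) * sum_{i,j} (x_i - x_j)^2 / 2.
v2 = sum((a - b) ** 2 for a in xs for b in xs) / (2.0 * n * n)
# v1 and v2 coincide up to floating-point error
```

The pairwise form is convenient here because it avoids centering by the (random) empirical mean.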
An empirical Bernstein-Serfling inequality

Corollary (Bardenet and Maillard, 2015). For all δ ∈ [0, 1], with probability larger than 1 − 5δ, it holds that

  (1/n) ∑_{t=1}^n (X_t − µ) ≤ σ̂_n √(2 ρ_n log(1/δ) / n) + κ (b − a) log(1/δ) / n,

where we recall the definition of ρ_n,

  ρ_n = 1 − (n − 1)/N          if n ≤ N/2,
        (1 − n/N)(1 + 1/n)     if n > N/2,

and κ = 7/3 + 3/√2.
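This bound is fully computable from data, since it uses σ̂_n rather than σ. A hedged sketch (my own rendering, with made-up function names) of the bound as stated:

```python
import math

def rho(n, N):
    """Factor rho_n recalled in the corollary."""
    if n <= N / 2:
        return 1.0 - (n - 1) / N
    return (1.0 - n / N) * (1.0 + 1.0 / n)

KAPPA = 7.0 / 3.0 + 3.0 / math.sqrt(2.0)

def empirical_bernstein_serfling(n, N, sigma_hat, b_minus_a, delta):
    """Bound on the centered sub-sample mean (1/n) * sum_t (X_t - mu),
    holding with probability at least 1 - 5*delta."""
    return (sigma_hat * math.sqrt(2.0 * rho(n, N) * math.log(1.0 / delta) / n)
            + KAPPA * b_minus_a * math.log(1.0 / delta) / n)

# For low-variance populations the first (variance) term shrinks, which is
# where the Bernstein-type bound beats the Hoeffding-Serfling one.
low = empirical_bernstein_serfling(100, 10_000, 0.01, 1.0, 0.05)
high = empirical_bernstein_serfling(100, 10_000, 0.5, 1.0, 0.05)
```

When σ̂_n is small only the second-order κ(b − a) log(1/δ)/n term remains, which decays like 1/n rather than 1/√n.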
Serfling bounds

[Figure: inverted bound vs. sample size n (0 to 100000) for the Hoeffding-Serfling, Bernstein-Serfling and empirical Bernstein-Serfling bounds, on four populations: (e) Gaussian N(0, 1), (f) Log-normal lnN(1, 1), (g) Bernoulli B(0.1), (h) Bernoulli B(0.5).]
Sub-sampling recap

What we did
- Improved Hoeffding-Serfling bound, new Bernstein-Serfling and empirical Bernstein-Serfling bounds.
- Improvement over Hoeffding's reduction due to ρ_n.

Improvement / open question
- Tensorization inequality for the entropy in the case of sampling without replacement?
- Would lead to: (1 + √(1 + ρ_n)) replaced with √(4 ρ_n).
Sub-sampling Bandits: Introduction

"Sub-sampling for multi-armed bandits", Baransi, Maillard, Mannor, ECML, 2014.
Stochastic multi-armed bandit setting

Setting: a set of choices A. Each a ∈ A is associated with an unknown probability distribution ν_a ∈ D with mean µ_a. At each round t = 1, ..., T the player
- first picks an arm A_t ∈ A based on past observations,
- then receives (and sees) a stochastic payoff X_t ∼ ν_{A_t}.

Goal and performance: minimize the regret at round T,

  R_T := E[T µ* − ∑_{t=1}^T X_t] = ∑_{a∈A} (µ* − µ_a) E[N^π_{T,a}],

where µ* = max{µ_a ; a ∈ A}, a* ∈ argmax{µ_a ; a ∈ A} and

  N^π_{T,a} = ∑_{t=1}^T I{A_t = a}.
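To make the definition concrete, here is a tiny sketch (a hypothetical two-arm Bernoulli instance, not from the talk) that measures the regret of the uniformly-random policy via the arm-count decomposition above:

```python
import random

random.seed(2)

means = {"a1": 0.5, "a2": 0.3}   # hypothetical arm means
mu_star = max(means.values())

# Play T rounds with a (bad) uniformly random policy; the regret is
# sum over arms of (mu_star - mu_a) * E[N_{T,a}], per the definition above.
T = 1000
pulls = {a: 0 for a in means}
for _ in range(T):
    arm = random.choice(sorted(means))
    pulls[arm] += 1

regret = sum((mu_star - means[a]) * pulls[a] for a in means)
# Expected regret of this policy is ~ (0.5 - 0.3) * T / 2, i.e. linear in T;
# good strategies achieve O(log T) instead.
```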
Lower performance bound

Theorem (Burnetas and Katehakis, 1996). For any strategy π that is consistent (for any bandit problem, sub-optimal arm a and β > 0, it holds that E[N^π_{T,a}] = o(T^β)) and any D ⊂ P([0, 1]),

  lim inf_{T→∞} R_T / log T ≥ ∑_{a: ∆_a > 0} (µ* − µ_a) / K_inf(ν_a, µ*),

where K_inf(ν_a, µ*) := inf{ KL(ν_a || ν) : ν ∈ D has mean > µ* }.
Optimality

Class of optimal algorithms
- Confidence bound: e.g. KL-UCB (Lai-Robbins, 1985)
- Bayesian: e.g. Thompson Sampling (Thompson, 1933)
- Sub-sampling?

Provably optimal finite-time regret for some D
- Discrete or exponential families of dimension 1.

They need to know D in order to be optimal
- A different algorithm for each D: TS or KL-UCB for Bernoulli, for Poisson, for Exponential, etc.
Puzzling experiments (T = 20,000, 50,000 replicates)

10 Bernoulli arms (0.1, 3×{0.05}, 3×{0.02}, 3×{0.01}):

              BESA     kl-UCB   kl-UCB+   TS      Others
  Regret      74.4     121.2    72.8      83.4    100-400
  Beat BESA   -        1.6%     35.4%     3.1%
  Run time    13.9X    2.8X     3.1X      X

[Figure: regret vs. time (up to 2×10⁴) for BESA, KL-UCB, KL-UCB+ and Thompson Sampling.]

Others: UCB, MOSS, UCB-Tuned, DMED, UCB-V. (Credit: Akram Baransi)
Puzzling experiments (T = 20,000, 50,000 replicates)

Exponential arms (1/5, 1/4, 1/3, 1/2, 1):

              BESA   KL-UCB-exp   UCB-tuned   FTL     10 Others
  Regret      53.3   65.7         97.6        306.5   60-110, 120+
  Beat BESA   -      5.7%         4.3%        -
  Run time    6X     2.8X         X           -

[Figure: regret vs. time (up to 2×10⁴) for BESA, BESAT, KL-UCB-exp and UCB-tuned.]

Others: UCB, MOSS, kl-UCB, UCB-V. (Credit: Akram Baransi)
Puzzling experiments (T = 20,000, 50,000 replicates)

Poisson arms ({1/2 + i/3}_{i=1,...,6}):

              BESA   KL-UCB-Poisson   kl-UCB   FTL
  Regret      19.4   25.1             150.6    144.6
  Beat BESA   -      4.1%             0.7%     -
  Run time    3.5X   1.2X             X        -

[Figure: regret vs. time (up to 2×10⁴) for BESA, BESAT, KL-UCB-Poisson and kl-UCB.]

(Credit: Akram Baransi)
Puzzling experiments (T = 20,000, 50,000 replicates)

Bernoulli arms, all 0.5 but one 0.51:

              BESA    KL-UCB   KL-UCB+   TS
  Regret      156.7   170.8    165.3     165.1
  Beat BESA   -       41.4%    41.6%     40.8%
  Run time    19.6X   2.8X     3X        X

[Figure: regret vs. time (up to 2×10⁴) for BESA, KL-UCB, KL-UCB+ and Thompson Sampling.]

(Credit: Akram Baransi)
A puzzling strategy

BESA
- Competitive regret against the state of the art for various D.
- Same algorithm for all D.
- Not relying on upper confidence bounds, not Bayesian...
- ...and extremely simple to implement.

Questions
- How is this possible?
- Can we prove optimality?
- For which distributions is it optimal?
Sub-sampling Bandits: Best Empirical Sub-sampling Average
Go back to "Follow the leader"

FTL
1: Play each arm once.
2: At time t, define µ̃_{t,a} = µ̂(X^a_{1:N_{t,a}}) for all a ∈ A.
   - µ̂(X): empirical average of the population X.
   - X^a_{1:N_{t,a}} = {X_s : A_s = a, s ≤ t}.
3: Choose (breaking ties in favor of the smallest N_t)

  A_t = argmax_{a' ∈ A} µ̃_{t,a'}.

Properties
- Generally bad: linear regret.
- A variant (ε-greedy) performs OK if well-tuned (Auer et al., 2002).
- Optimal for very specific distributions (e.g. deterministic).
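A minimal sketch of FTL on Bernoulli arms (my own illustration; the arm means are made up). With noisy rewards an unlucky early pull of the best arm can lock FTL on a sub-optimal arm forever, which is where the linear regret comes from:

```python
import random

def ftl(means, T, rng):
    """Follow the leader on Bernoulli arms: play each arm once, then
    always play the arm with the best empirical average
    (ties broken in favor of the least-pulled arm)."""
    rewards = {a: [] for a in means}

    def pull(a):
        rewards[a].append(1.0 if rng.random() < means[a] else 0.0)

    for a in means:                      # 1: play each arm once
        pull(a)
    for _ in range(T - len(means)):      # 2-3: follow the empirical leader
        leader = max(rewards,
                     key=lambda a: (sum(rewards[a]) / len(rewards[a]),
                                    -len(rewards[a])))
        pull(leader)
    return {a: len(r) for a, r in rewards.items()}

counts = ftl({"good": 0.9, "bad": 0.8}, 1000, random.Random(0))
```

If the first pull of "good" returns 0 and the first pull of "bad" returns 1, FTL never revisits "good": its empirical average stays at 0 while "bad" keeps a positive average.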
Follow the FAIR leader (aka BESA)

Compare two arms based on "equal opportunity", i.e. the same number of observations.

BESA at time t, for two arms a, b:
1: Sample I^a_t ∼ Wr(N_{t,a}; N_{t,b}) and I^b_t ∼ Wr(N_{t,b}; N_{t,a}).
   - Wr(N; n): sample n points from {1, ..., N} without replacement (return the whole set if n ≥ N).
2: Define µ̃_{t,a} = µ̂(X^a_{1:N_{t,a}}(I^a_t)) and µ̃_{t,b} = µ̂(X^b_{1:N_{t,b}}(I^b_t)).
3: Choose (breaking ties in favor of the smallest N_t)

  A_t = argmax_{a' ∈ {a,b}} µ̃_{t,a'}.

Questions
- Why does it work?
- When can we prove log(T) regret? Optimality?
- When does it fail?
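The two-arm BESA round above can be sketched directly (a hypothetical Python rendering, not the authors' code; Bernoulli rewards are used for concreteness):

```python
import random

def wr(population, n, rng):
    """Wr: sub-sample n points without replacement, or return everything
    when n >= len(population)."""
    if n >= len(population):
        return list(population)
    return rng.sample(population, n)

def besa_round(hist, rng):
    """One BESA comparison between arms 'a' and 'b' given their reward
    histories: each arm is judged on a sub-sample of the size of the
    *other* arm's history, and ties go to the least-pulled arm."""
    sub_a = wr(hist["a"], len(hist["b"]), rng)
    sub_b = wr(hist["b"], len(hist["a"]), rng)
    mean_a, mean_b = sum(sub_a) / len(sub_a), sum(sub_b) / len(sub_b)
    if mean_a == mean_b:
        return "a" if len(hist["a"]) <= len(hist["b"]) else "b"
    return "a" if mean_a > mean_b else "b"

def besa(means, T, rng):
    """Run BESA for T rounds on two Bernoulli arms; return pull counts."""
    hist = {a: [] for a in means}
    for a in hist:                       # play each arm once
        hist[a].append(1.0 if rng.random() < means[a] else 0.0)
    for _ in range(T - len(hist)):
        a = besa_round(hist, rng)
        hist[a].append(1.0 if rng.random() < means[a] else 0.0)
    return {a: len(h) for a, h in hist.items()}

counts = besa({"a": 0.9, "b": 0.1}, 500, random.Random(0))
```

Note there is no confidence bound and no prior anywhere: the only randomness the strategy injects is the sub-sampling itself.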
Example: N_{t,a} = 3, N_{t,b} = 10: then I^a_t = {1, 2, 3} (the whole set), and |I^b_t| = 3, sampled without replacement from {1, ..., 10}.
Intuition

Assume µ_b > µ_a, N_{t,a} = n_a, N_{t,b} = n_b with n_a ≥ n_b. The probability of making one mistake is approximately

  P[ µ̂(X^a_{1:n_a}(I(n_a; n_b))) ≥ µ̂(X^b_{1:n_b}) ],   (4)

where I(n_a; n_b) ∼ Wr(n_a; n_b). The probability of making M consecutive mistakes is essentially

  P[ ∀m ∈ [M], µ̂(X^a_{1:n_a^{(m)}}(I_m(n_a^{(m)}; n_b))) ≥ µ̂(X^b_{1:n_b}) ],   (5)

where for all m ≤ M, I_m(n_a; n_b) ∼ Wr(n_a; n_b) and n_a^{(m)} = n_a + m − 1.

For deterministic n_a, n_b: (4) decreases roughly like exp(−2 n_b (µ_b − µ_a)²), and (5) like exp(−2 n_b M̃ (µ_b − µ_a)²), where M̃ is the number of non-overlapping sub-samples (independent chunks).

⟹ Exponential decay of the probability of consecutive mistakes.
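Probability (4) can be eyeballed by Monte-Carlo (an illustrative sketch with Bernoulli arms; for simplicity both histories are redrawn on each trial, whereas (4) conditions on the realized X^b):

```python
import random

def mistake_prob(mu_a, mu_b, na, nb, trials, rng):
    """Monte-Carlo estimate of (4): the chance that a size-nb sub-sample of
    arm a's na Bernoulli rewards averages at least arm b's nb rewards."""
    mistakes = 0
    for _ in range(trials):
        xa = [1.0 if rng.random() < mu_a else 0.0 for _ in range(na)]
        xb = [1.0 if rng.random() < mu_b else 0.0 for _ in range(nb)]
        sub = rng.sample(xa, nb)  # Wr(na; nb), nb <= na here
        if sum(sub) / nb >= sum(xb) / nb:
            mistakes += 1
    return mistakes / trials

rng = random.Random(5)
# Larger nb -> much smaller mistake probability, in line with the
# exp(-2 nb (mu_b - mu_a)^2) heuristic above.
p_small = mistake_prob(0.2, 0.8, 50, 5, 2000, rng)
p_large = mistake_prob(0.2, 0.8, 50, 20, 2000, rng)
```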
Regret bound (slightly simplified statement)

Let A = {*, a} and define

  α(M, n) = E_{Z* ∼ ν_{*,n}} ( P_{Z ∼ ν_{a,n}}(Z > Z*) + (1/2) P_{Z ∼ ν_{a,n}}(Z = Z*) )^M.

Theorem (Regret of the BESA strategy). If there exist α ∈ (0, 1) and c > 0 such that α(M, 1) ≤ c α^M, then

  R_T ≤ 11 log(T) / (µ* − µ_a) + C_{ν_a,ν*} + O(1),

where C_{ν_a,ν*} depends on the problem, but not on T.

Example
- Bernoulli µ_a, µ*: α(M, 1) = O( ((µ_a ∨ (1 − µ_a)) / 2)^M ).
Failure of the BESA strategy

- Uniform arms X^a ∼ U([0.2, 0.4]), X* ∼ U([0, 1]): α(M, n) → 0.2^n as M → ∞.
- Consider BESA with an initial number of pulls n_0 = 0, ...:

                       n_0 = 0   n_0 = 3   n_0 = 7   n_0 = 8   n_0 = 9   n_0 = 10
  BESA regret          920.1     216.4     35.4      25.9      17.9      15.4

                       UCB     kl-UCB   TS      FTL (n_0 = 10)
  Regret               21.2    20.7     13.2    54.3
  Beat BESA n_0 = 0    24.3%   24.3%    24.7%   -
  Beat BESA n_0 = 3    7.3%    7.3%     7.8%    -
  Beat BESA n_0 = 7    1.6%    1.6%     1.8%    -
  Beat BESA n_0 = 10   0.6%    0.6%     0.7%    -

(Credit: Akram Baransi)
Regret performance of BESA

Theorem (Regret of the BESA strategy). If there exist α ∈ (0, 1) and c > 0 such that α(M, 1) ≤ c α^M, then

  R_T ≤ 11 log(T) / (µ* − µ_a) + C_{ν_a,ν*} + O(1),

where C_{ν_a,ν*} depends on the problem, but not on T. If there exist β ∈ (0, 1) and c > 0 such that α(1, n) ≤ c β^n, then BESA initialized with n_{0,T} ≈ ln(T)/ln(1/β) pulls of each arm gets

  R_T ≤ 11 log(T) / (µ* − µ_a) + n_{0,T} + C_{ν_a,ν*} + O(1).

Key points
- The first condition holds for a large class: extends FTL.
- The initial number of pulls is less elegant.
- Alternatively: mixing with uniform exploration, like ε-greedy.
Sketch of regret analysis

1. Basic concentration gives ∆_a ≲ log(t)^{1/2} / min{N_{t,*}, N_{t,a}}^{1/2}:
   If N_{t,a} < N_{t,*}, then w.h.p. N_{t,a} ≲ log(t)/∆².
   If N_{t,a} ≥ N_{t,*}, then w.h.p. N_{t,*} ≲ log(t)/∆² =: u_t.
   Thus: on the event N_{t,*} ≥ u_t, we must have N_{t,a} ≲ u_t w.h.p.

2. Show that N_{t,*} ≥ u_t w.h.p. (as for Thompson Sampling).
   Let τ*_j be the delay between the j-th time t_j and the (j+1)-th time we play *. Then

   P[N_{t,*} ≤ u_t] ≤ ∑_{j=1}^{u_t} P[τ*_j ≥ t/u_t − 1 =: ℓ_t]
                    ≤ ∑_{j=1}^{u_t} P[∀s ∈ {0, ..., ℓ_t}: a_{t_j+s} = a]
                    ≤ ∑_{j=1}^{u_t} P[∀s ∈ {⌈ℓ_t/2⌉, ..., ℓ_t}: a_{t_j+s} = a, with N_{t_j+s,a} ≥ ⌊ℓ_t/2⌋ (> u_t ≥ j for t ≥ c)].
(Second overlay of the sketch: for t ≥ c, the final event can instead be written {∀s ∈ {⌈ℓ_t/2⌉, ..., ℓ_t}: a_{t_j+s} = a, with N_{t_j+s,a} ≥ N_{t_j+s,*} = j}.)
BESA - Recap

Optimality and near-optimality regions in P([0, 1])?

Properties
- Flexible: needs neither the class of distributions nor the support.
- We can prove log(T) regret for certain classes.
- Optimality (constants) unknown yet, but we are close.
- Exhibited cases where it fails: why, and how to repair it.
Thank you

If you want to
- prove "adaptive" optimality of this strategy, or
- extend it to contextual bandits, adversarial bandits, MDPs:

Come work with me!