Subsampling, Concentration and Multi-armed bandits
Odalric-Ambrym Maillard, R. Bardenet, S. Mannor, A. Baransi, N. Galichet, J. Pineau, A. Durand
Toulouse, November 9, 2015
O-A. Maillard Subsampling and Bandits 1 / 36
Roadmap

1 Sub-sampling concentration: 1.1 Hoeffding-Serfling, 1.2 Bernstein-Serfling and 1.3 empirical Bernstein-Serfling bounds.
2 Sub-sampling for stochastic multi-armed bandits: 2.1 "Best empirical sub-sampled arm" strategy, 2.2 illustrative experiments, 2.3 cumulative regret bound and extensions.
Sub-sampling concentration: Introduction

"Concentration inequalities for sampling without replacement", Bardenet and Maillard, Bernoulli, 2015.
Sub-sampling

- X = (x_1, ..., x_N), a finite population of N real points.
- Sub-sample of size n ≤ N from X: X_1, ..., X_n picked uniformly at random without replacement from X.

Simple problem: approximating the population mean µ = (1/N) ∑_{i=1}^N x_i.
- Concentration for partial sums of X_1, ..., X_n.
- Careful: dependency.
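As a quick sanity check of the setup above, here is a minimal Python sketch (not from the talk; the population, its distribution and the sizes are made up for illustration) that draws a sub-sample without replacement and compares its mean to µ:

```python
import random

random.seed(0)

# A finite population of N real points (hypothetical data for illustration).
N = 10_000
population = [random.gauss(0.0, 1.0) for _ in range(N)]
mu = sum(population) / N  # the population mean we want to approximate

# Sub-sample of size n <= N, drawn uniformly at random without replacement.
n = 500
subsample = random.sample(population, n)
estimate = sum(subsample) / n  # typically within O(1/sqrt(n)) of mu
```

`random.sample` draws without replacement, which is exactly the sampling scheme studied here; the dependency between the X_i is what the Serfling-type bounds below control.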
Hoeffding's reduction lemma

Lemma (Hoeffding, 1963). Let X = (x_1, ..., x_N) be a finite population of N real points, let X_1, ..., X_n denote a random sample without replacement from X, and let Y_1, ..., Y_n denote a random sample with replacement from X. If f: R → R is continuous and convex, then

  E f(∑_{i=1}^n X_i) ≤ E f(∑_{i=1}^n Y_i).

From sampling with to without replacement: we can thus transfer some results for sampling with replacement to the case of sampling without replacement (via Chernoff).
Comparing bounds on P((1/n) ∑_{t=1}^n X_t − µ ≥ 10⁻²), N = 10⁴

[Figure: probability bound vs. sample size n (0 to 10000) for the Estimate and the Hoeffding, Bernstein, Hoeffding-Serfling and Bernstein-Serfling bounds, on four populations: (a) Gaussian N(0, 1), (b) Log-normal lnN(1, 1), (c) Bernoulli B(0.1), (d) Bernoulli B(0.5).]
Serfling's key observation

For 1 ≤ k ≤ N (considering fictitious X_{n+1}, ..., X_N), define

  Z_k = (1/k) ∑_{t=1}^k (X_t − µ)  and  Z*_k = (1/(N − k)) ∑_{t=1}^k (X_t − µ).   (1)

Lemma (Serfling, 1974). The following forward martingale structure holds for {Z*_k}_{k≤N}:

  E[Z*_k | Z*_{k−1}, ..., Z*_1] = Z*_{k−1}.

The following reverse martingale structure holds for {Z_k}_{k≤N}:

  E[Z_k | Z_{k+1}, ..., Z_{N−1}] = Z_{k+1}.

⟹ Structured dependency.
A useful result

Theorem (Serfling, 1974). Let a = min_{1≤i≤N} x_i and b = max_{1≤i≤N} x_i. Then for all λ ∈ R⁺ it holds that

  log E exp(λ n Z_n) ≤ ((b − a)²/8) · λ² n (1 − (n − 1)/N).

Moreover,

  log E exp(λ max_{1≤k≤n} Z*_k) ≤ ((b − a)²/8) · (λ²/(N − n)²) · n (1 − (n − 1)/N).
A useful result

Theorem (Bardenet and Maillard, 2015). Let a = min_{1≤i≤N} x_i and b = max_{1≤i≤N} x_i. Then for all λ ∈ R⁺ it also holds that

  log E exp(λ n Z_n) ≤ ((b − a)²/8) · λ² (n + 1)(1 − n/N).

Moreover,

  log E exp(λ max_{1≤k≤n} Z*_k) ≤ ((b − a)²/8) · (λ²/(N − n)²) · n (1 − (n − 1)/N),

  log E exp(λ max_{n≤k≤N−1} Z_k) ≤ ((b − a)²/8) · (λ²/n²) · (n + 1)(1 − n/N).

(Slight) improvement when n ≥ N/2.
A slightly improved Hoeffding-Serfling inequality

Trivial corollary:

Corollary (Bardenet and Maillard, 2015). For all n ≤ N and δ ∈ [0, 1], with probability higher than 1 − δ, it holds that

  (1/n) ∑_{t=1}^n (X_t − µ) ≤ (b − a) √(ρ_n log(1/δ) / (2n)),

where we define

  ρ_n = 1 − (n − 1)/N          if n ≤ N/2,
        (1 − n/N)(1 + 1/n)     if n > N/2.   (2)
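The corollary is straightforward to turn into code. A small sketch (my own, not from the paper; the function names are made up) computing the deviation level, and illustrating that ρ_n, hence the bound, vanishes as n → N, unlike the plain Hoeffding rate:

```python
import math

def rho(n, N):
    """Factor rho_n from Eq. (2); assumes 1 <= n <= N."""
    if n <= N / 2:
        return 1.0 - (n - 1) / N
    return (1.0 - n / N) * (1.0 + 1.0 / n)

def hoeffding_serfling_deviation(n, N, b_minus_a, delta):
    """Deviation level eps such that, with probability >= 1 - delta,
    the sub-sample mean exceeds mu by at most eps."""
    return b_minus_a * math.sqrt(rho(n, N) * math.log(1.0 / delta) / (2.0 * n))

# Plain Hoeffding corresponds to rho_n = 1; without replacement the bound
# collapses to 0 once the whole population has been sampled (rho_N = 0).
N, delta = 10_000, 0.05
eps_hs = hoeffding_serfling_deviation(5_000, N, 1.0, delta)
eps_hoeffding = math.sqrt(math.log(1.0 / delta) / (2.0 * 5_000))
```

Here `eps_hs < eps_hoeffding` at n = N/2, and the gap widens as n grows.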
Sub-sampling concentration: Bernstein-Serfling
Towards Bernstein-Serfling's inequality

Let σ² = (1/N) ∑_{i=1}^N (x_i − µ)², and define

  Q*_{k−1} = (1/(N − k + 1)) ∑_{i=1}^{k−1} ((X_i − µ)² − σ²),

  Q_{k+1} = (1/(k + 1)) ∑_{i=1}^{k+1} ((X_i − µ)² − σ²).

Lemma (Bardenet and Maillard, 2015).

  E[(X_k − µ)² | Z_1, ..., Z_{k−1}] = σ² − Q*_{k−1},

where the Z_i are defined in (1). Likewise,

  E[(X_{k+1} − µ)² | Z_{k+1}, ..., Z_{N−1}] = σ² + Q_{k+1}.
A Bernstein-Serfling inequality

Corollary (Bardenet and Maillard, 2015). Let n ≤ N and δ ∈ [0, 1]. With probability larger than 1 − 2δ, it holds that

  (1/n) ∑_{t=1}^n (X_t − µ) ≤ σ √(2 ρ_n log(1/δ) / n) + κ_n (b − a) log(1/δ) / n,

where

  ρ_n = 1 − f_{n−1}                    if n ≤ N/2,
        (1 − f_n)(1 + 1/n)             if n > N/2,

  κ_n = 4/3 + √(f_n / g_{n−1})         if n ≤ N/2,
        4/3 + √(g_{n+1}) (1 − f_n)     if n > N/2,   (3)

with f_n = n/N and g_n = N/n − 1.
A Bernstein-Serfling inequality

Improvement over Bernstein
- The factor ρ_n can give a dramatic improvement.

Proof elements
- Self-bounded property of the variance: study of Z = (1/(b − a)²) ∑_{i=1}^n (X_i − µ)² (cf. Maurer and Pontil, 2006; via the tensorization inequality for the entropy).
- Hoeffding's reduction lemma.
Towards an empirical Bernstein-Serfling inequality

Define the empirical variance

  σ̂²_n = (1/n) ∑_{i=1}^n (X_i − µ̂_n)² = (1/n²) ∑_{i,j=1}^n (X_i − X_j)²/2,  where µ̂_n = (1/n) ∑_{i=1}^n X_i.

Lemma (Bardenet and Maillard, 2015). When sampling without replacement from a finite population X = (x_1, ..., x_N) of size N, with range [a, b] and variance σ², the empirical variance σ̂²_n using n < N samples satisfies

  P(σ ≥ σ̂_n + (b − a)(1 + √(1 + ρ_n)) √(log(3/δ) / (2n))) ≤ δ.

Possible improvement
- Conjecture: replace (1 + √(1 + ρ_n)) with √(4 ρ_n).
- Difficulty: concentration for self-bounded random variables when sampling without replacement.
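The two expressions for σ̂²_n above agree, which is easy to check numerically (a small self-contained sketch, not from the talk; the data is made up):

```python
import random

random.seed(1)
xs = [random.random() for _ in range(50)]  # hypothetical sample
n = len(xs)
mean = sum(xs) / n

# sigma_hat^2 via the usual centered formula...
v1 = sum((x - mean) ** 2 for x in xs) / n
# ...and via the pairwise identity (1/n^2) * sum_{i,j} (x_i - x_j)^2 / 2.
v2 = sum((a - b) ** 2 for a in xs for b in xs) / (2.0 * n * n)
# v1 and v2 coincide up to floating-point error
```

The pairwise form is convenient here because it avoids centering by the (random) empirical mean.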
An empirical Bernstein-Serfling inequality

Corollary (Bardenet and Maillard, 2015). For all δ ∈ [0, 1], with probability larger than 1 − 5δ, it holds that

  (1/n) ∑_{t=1}^n (X_t − µ) ≤ σ̂_n √(2 ρ_n log(1/δ) / n) + κ (b − a) log(1/δ) / n,

where we recall the definition of ρ_n,

  ρ_n = 1 − (n − 1)/N          if n ≤ N/2,
        (1 − n/N)(1 + 1/n)     if n > N/2,

and κ = 7/3 + 3/√2.
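This bound is fully computable from data, since it uses σ̂_n rather than σ. A hedged sketch (my own rendering, with made-up function names) of the bound as stated:

```python
import math

def rho(n, N):
    """Factor rho_n recalled in the corollary."""
    if n <= N / 2:
        return 1.0 - (n - 1) / N
    return (1.0 - n / N) * (1.0 + 1.0 / n)

KAPPA = 7.0 / 3.0 + 3.0 / math.sqrt(2.0)

def empirical_bernstein_serfling(n, N, sigma_hat, b_minus_a, delta):
    """Bound on the centered sub-sample mean (1/n) * sum_t (X_t - mu),
    holding with probability at least 1 - 5*delta."""
    return (sigma_hat * math.sqrt(2.0 * rho(n, N) * math.log(1.0 / delta) / n)
            + KAPPA * b_minus_a * math.log(1.0 / delta) / n)

# For low-variance populations the first (variance) term shrinks, which is
# where the Bernstein-type bound beats the Hoeffding-Serfling one.
low = empirical_bernstein_serfling(100, 10_000, 0.01, 1.0, 0.05)
high = empirical_bernstein_serfling(100, 10_000, 0.5, 1.0, 0.05)
```

When σ̂_n is small only the second-order κ(b − a) log(1/δ)/n term remains, which decays like 1/n rather than 1/√n.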
Serfling bounds

[Figure: inverted bound vs. sample size n (0 to 100000) for the Hoeffding-Serfling, Bernstein-Serfling and empirical Bernstein-Serfling bounds, on four populations: (e) Gaussian N(0, 1), (f) Log-normal lnN(1, 1), (g) Bernoulli B(0.1), (h) Bernoulli B(0.5).]
Sub-sampling recap

What we did
- Improved Hoeffding-Serfling bound, new Bernstein-Serfling and empirical Bernstein-Serfling bounds.
- Improvement over Hoeffding's reduction due to ρ_n.

Improvement / open question
- Tensorization inequality for the entropy in the case of sampling without replacement?
- Would lead to: (1 + √(1 + ρ_n)) replaced with √(4 ρ_n).
Sub-sampling Bandits: Introduction

"Sub-sampling for multi-armed bandits", Baransi, Maillard, Mannor, ECML, 2014.
Stochastic multi-armed bandit setting

Setting: a set of choices A. Each a ∈ A is associated with an unknown probability distribution ν_a ∈ D with mean µ_a. At each round t = 1, ..., T the player
- first picks an arm A_t ∈ A based on past observations,
- then receives (and sees) a stochastic payoff X_t ∼ ν_{A_t}.

Goal and performance: minimize the regret at round T,

  R_T := E[T µ* − ∑_{t=1}^T X_t] = ∑_{a∈A} (µ* − µ_a) E[N^π_{T,a}],

where µ* = max{µ_a ; a ∈ A}, a* ∈ argmax{µ_a ; a ∈ A} and

  N^π_{T,a} = ∑_{t=1}^T I{A_t = a}.
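To make the definition concrete, here is a tiny sketch (a hypothetical two-arm Bernoulli instance, not from the talk) that measures the regret of the uniformly-random policy via the arm-count decomposition above:

```python
import random

random.seed(2)

means = {"a1": 0.5, "a2": 0.3}   # hypothetical arm means
mu_star = max(means.values())

# Play T rounds with a (bad) uniformly random policy; the regret is
# sum over arms of (mu_star - mu_a) * E[N_{T,a}], per the definition above.
T = 1000
pulls = {a: 0 for a in means}
for _ in range(T):
    arm = random.choice(sorted(means))
    pulls[arm] += 1

regret = sum((mu_star - means[a]) * pulls[a] for a in means)
# Expected regret of this policy is ~ (0.5 - 0.3) * T / 2, i.e. linear in T;
# good strategies achieve O(log T) instead.
```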
Lower performance bound

Theorem (Burnetas and Katehakis, 1996). For any strategy π that is consistent (for any bandit problem, sub-optimal arm a and β > 0, it holds that E[N^π_{T,a}] = o(T^β)) and any D ⊂ P([0, 1]),

  lim inf_{T→∞} R_T / log T ≥ ∑_{a: ∆_a > 0} (µ* − µ_a) / K_inf(ν_a, µ*),

where K_inf(ν_a, µ*) := inf{ KL(ν_a || ν) : ν ∈ D has mean > µ* }.
Optimality

Class of optimal algorithms
- Confidence bound: e.g. KL-UCB (Lai-Robbins, 1985)
- Bayesian: e.g. Thompson Sampling (Thompson, 1933)
- Sub-sampling?

Provably optimal finite-time regret for some D
- Discrete or exponential families of dimension 1.

They need to know D in order to be optimal
- A different algorithm for each D: TS or KL-UCB for Bernoulli, for Poisson, for Exponential, etc.
Puzzling experiments (T = 20,000, 50,000 replicates)

10 Bernoulli arms (0.1, 3×{0.05}, 3×{0.02}, 3×{0.01}):

              BESA     kl-UCB   kl-UCB+   TS      Others
  Regret      74.4     121.2    72.8      83.4    100-400
  Beat BESA   -        1.6%     35.4%     3.1%
  Run time    13.9X    2.8X     3.1X      X

[Figure: regret vs. time (up to 2×10⁴) for BESA, KL-UCB, KL-UCB+ and Thompson Sampling.]

Others: UCB, MOSS, UCB-Tuned, DMED, UCB-V. (Credit: Akram Baransi)
Puzzling experiments (T = 20,000, 50,000 replicates)

Exponential arms (1/5, 1/4, 1/3, 1/2, 1):

              BESA   KL-UCB-exp   UCB-tuned   FTL     10 Others
  Regret      53.3   65.7         97.6        306.5   60-110, 120+
  Beat BESA   -      5.7%         4.3%        -
  Run time    6X     2.8X         X           -

[Figure: regret vs. time (up to 2×10⁴) for BESA, BESAT, KL-UCB-exp and UCB-tuned.]

Others: UCB, MOSS, kl-UCB, UCB-V. (Credit: Akram Baransi)
Puzzling experiments (T = 20,000, 50,000 replicates)

Poisson arms ({1/2 + i/3}_{i=1,...,6}):

              BESA   KL-UCB-Poisson   kl-UCB   FTL
  Regret      19.4   25.1             150.6    144.6
  Beat BESA   -      4.1%             0.7%     -
  Run time    3.5X   1.2X             X        -

[Figure: regret vs. time (up to 2×10⁴) for BESA, BESAT, KL-UCB-Poisson and kl-UCB.]

(Credit: Akram Baransi)
Puzzling experiments (T = 20,000, 50,000 replicates)

Bernoulli arms, all 0.5 but one 0.51:

              BESA    KL-UCB   KL-UCB+   TS
  Regret      156.7   170.8    165.3     165.1
  Beat BESA   -       41.4%    41.6%     40.8%
  Run time    19.6X   2.8X     3X        X

[Figure: regret vs. time (up to 2×10⁴) for BESA, KL-UCB, KL-UCB+ and Thompson Sampling.]

(Credit: Akram Baransi)
A puzzling strategy

BESA
- Competitive regret against the state of the art for various D.
- Same algorithm for all D.
- Not relying on upper confidence bounds, not Bayesian...
- ...and extremely simple to implement.

Questions
- How is this possible?
- Can we prove optimality?
- For which distributions is it optimal?
Sub-sampling Bandits: Best Empirical Sub-sampling Average
Go back to "Follow the leader"

FTL
1: Play each arm once.
2: At time t, define µ̃_{t,a} = µ̂(X^a_{1:N_{t,a}}) for all a ∈ A.
   - µ̂(X): empirical average of the population X.
   - X^a_{1:N_{t,a}} = {X_s : A_s = a, s ≤ t}.
3: Choose (breaking ties in favor of the smallest N_t)

  A_t = argmax_{a' ∈ A} µ̃_{t,a'}.

Properties
- Generally bad: linear regret.
- A variant (ε-greedy) performs OK if well-tuned (Auer et al., 2002).
- Optimal for very specific distributions (e.g. deterministic).
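A minimal sketch of FTL on Bernoulli arms (my own illustration; the arm means are made up). With noisy rewards an unlucky early pull of the best arm can lock FTL on a sub-optimal arm forever, which is where the linear regret comes from:

```python
import random

def ftl(means, T, rng):
    """Follow the leader on Bernoulli arms: play each arm once, then
    always play the arm with the best empirical average
    (ties broken in favor of the least-pulled arm)."""
    rewards = {a: [] for a in means}

    def pull(a):
        rewards[a].append(1.0 if rng.random() < means[a] else 0.0)

    for a in means:                      # 1: play each arm once
        pull(a)
    for _ in range(T - len(means)):      # 2-3: follow the empirical leader
        leader = max(rewards,
                     key=lambda a: (sum(rewards[a]) / len(rewards[a]),
                                    -len(rewards[a])))
        pull(leader)
    return {a: len(r) for a, r in rewards.items()}

counts = ftl({"good": 0.9, "bad": 0.8}, 1000, random.Random(0))
```

If the first pull of "good" returns 0 and the first pull of "bad" returns 1, FTL never revisits "good": its empirical average stays at 0 while "bad" keeps a positive average.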
Follow the FAIR leader (aka BESA)

Compare two arms based on "equal opportunity", i.e. the same number of observations.

BESA at time t, for two arms a, b:
1: Sample I^a_t ∼ Wr(N_{t,a}; N_{t,b}) and I^b_t ∼ Wr(N_{t,b}; N_{t,a}).
   - Wr(N; n): sample n points from {1, ..., N} without replacement (return the whole set if n ≥ N).
2: Define µ̃_{t,a} = µ̂(X^a_{1:N_{t,a}}(I^a_t)) and µ̃_{t,b} = µ̂(X^b_{1:N_{t,b}}(I^b_t)).
3: Choose (breaking ties in favor of the smallest N_t)

  A_t = argmax_{a' ∈ {a,b}} µ̃_{t,a'}.

Questions
- Why does it work?
- When can we prove log(T) regret? Optimality?
- When does it fail?
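The two-arm BESA round above can be sketched directly (a hypothetical Python rendering, not the authors' code; Bernoulli rewards are used for concreteness):

```python
import random

def wr(population, n, rng):
    """Wr: sub-sample n points without replacement, or return everything
    when n >= len(population)."""
    if n >= len(population):
        return list(population)
    return rng.sample(population, n)

def besa_round(hist, rng):
    """One BESA comparison between arms 'a' and 'b' given their reward
    histories: each arm is judged on a sub-sample of the size of the
    *other* arm's history, and ties go to the least-pulled arm."""
    sub_a = wr(hist["a"], len(hist["b"]), rng)
    sub_b = wr(hist["b"], len(hist["a"]), rng)
    mean_a, mean_b = sum(sub_a) / len(sub_a), sum(sub_b) / len(sub_b)
    if mean_a == mean_b:
        return "a" if len(hist["a"]) <= len(hist["b"]) else "b"
    return "a" if mean_a > mean_b else "b"

def besa(means, T, rng):
    """Run BESA for T rounds on two Bernoulli arms; return pull counts."""
    hist = {a: [] for a in means}
    for a in hist:                       # play each arm once
        hist[a].append(1.0 if rng.random() < means[a] else 0.0)
    for _ in range(T - len(hist)):
        a = besa_round(hist, rng)
        hist[a].append(1.0 if rng.random() < means[a] else 0.0)
    return {a: len(h) for a, h in hist.items()}

counts = besa({"a": 0.9, "b": 0.1}, 500, random.Random(0))
```

Note there is no confidence bound and no prior anywhere: the only randomness the strategy injects is the sub-sampling itself.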
Example: N_{t,a} = 3, N_{t,b} = 10: then I^a_t = {1, 2, 3} (the whole set), and |I^b_t| = 3, sampled without replacement from {1, ..., 10}.
Intuition

Assume µ_b > µ_a, N_{t,a} = n_a, N_{t,b} = n_b with n_a ≥ n_b. The probability of making one mistake is approximately

  P[ µ̂(X^a_{1:n_a}(I(n_a; n_b))) ≥ µ̂(X^b_{1:n_b}) ],   (4)

where I(n_a; n_b) ∼ Wr(n_a; n_b). The probability of making M consecutive mistakes is essentially

  P[ ∀m ∈ [M], µ̂(X^a_{1:n_a^{(m)}}(I_m(n_a^{(m)}; n_b))) ≥ µ̂(X^b_{1:n_b}) ],   (5)

where for all m ≤ M, I_m(n_a; n_b) ∼ Wr(n_a; n_b) and n_a^{(m)} = n_a + m − 1.

For deterministic n_a, n_b: (4) decreases roughly like exp(−2 n_b (µ_b − µ_a)²), and (5) like exp(−2 n_b M̃ (µ_b − µ_a)²), where M̃ is the number of non-overlapping sub-samples (independent chunks).

⟹ Exponential decay of the probability of consecutive mistakes.
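Probability (4) can be eyeballed by Monte-Carlo (an illustrative sketch with Bernoulli arms; for simplicity both histories are redrawn on each trial, whereas (4) conditions on the realized X^b):

```python
import random

def mistake_prob(mu_a, mu_b, na, nb, trials, rng):
    """Monte-Carlo estimate of (4): the chance that a size-nb sub-sample of
    arm a's na Bernoulli rewards averages at least arm b's nb rewards."""
    mistakes = 0
    for _ in range(trials):
        xa = [1.0 if rng.random() < mu_a else 0.0 for _ in range(na)]
        xb = [1.0 if rng.random() < mu_b else 0.0 for _ in range(nb)]
        sub = rng.sample(xa, nb)  # Wr(na; nb), nb <= na here
        if sum(sub) / nb >= sum(xb) / nb:
            mistakes += 1
    return mistakes / trials

rng = random.Random(5)
# Larger nb -> much smaller mistake probability, in line with the
# exp(-2 nb (mu_b - mu_a)^2) heuristic above.
p_small = mistake_prob(0.2, 0.8, 50, 5, 2000, rng)
p_large = mistake_prob(0.2, 0.8, 50, 20, 2000, rng)
```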
Regret bound (slightly simplified statement)

Let A = {*, a} and define

  α(M, n) = E_{Z* ∼ ν_{*,n}} ( P_{Z ∼ ν_{a,n}}(Z > Z*) + (1/2) P_{Z ∼ ν_{a,n}}(Z = Z*) )^M.

Theorem (Regret of the BESA strategy). If there exist α ∈ (0, 1) and c > 0 such that α(M, 1) ≤ c α^M, then

  R_T ≤ 11 log(T) / (µ* − µ_a) + C_{ν_a,ν*} + O(1),

where C_{ν_a,ν*} depends on the problem, but not on T.

Example
- Bernoulli µ_a, µ*: α(M, 1) = O( ((µ_a ∨ (1 − µ_a)) / 2)^M ).
Failure of the BESA strategy

- Uniform arms X^a ∼ U([0.2, 0.4]), X* ∼ U([0, 1]): α(M, n) → 0.2^n as M → ∞.
- Consider BESA with an initial number of pulls n_0 = 0, ...:

                       n_0 = 0   n_0 = 3   n_0 = 7   n_0 = 8   n_0 = 9   n_0 = 10
  BESA regret          920.1     216.4     35.4      25.9      17.9      15.4

                       UCB     kl-UCB   TS      FTL (n_0 = 10)
  Regret               21.2    20.7     13.2    54.3
  Beat BESA n_0 = 0    24.3%   24.3%    24.7%   -
  Beat BESA n_0 = 3    7.3%    7.3%     7.8%    -
  Beat BESA n_0 = 7    1.6%    1.6%     1.8%    -
  Beat BESA n_0 = 10   0.6%    0.6%     0.7%    -

(Credit: Akram Baransi)
Regret performance of BESA

Theorem (Regret of the BESA strategy). If there exist α ∈ (0, 1) and c > 0 such that α(M, 1) ≤ c α^M, then

  R_T ≤ 11 log(T) / (µ* − µ_a) + C_{ν_a,ν*} + O(1),

where C_{ν_a,ν*} depends on the problem, but not on T. If there exist β ∈ (0, 1) and c > 0 such that α(1, n) ≤ c β^n, then BESA initialized with n_{0,T} ≈ ln(T)/ln(1/β) pulls of each arm gets

  R_T ≤ 11 log(T) / (µ* − µ_a) + n_{0,T} + C_{ν_a,ν*} + O(1).

Key points
- The first condition holds for a large class: extends FTL.
- The initial number of pulls is less elegant.
- Alternatively: mixing with uniform exploration, like ε-greedy.
Sketch of regret analysis

1. Basic concentration gives ∆_a ≲ log(t)^{1/2} / min{N_{t,*}, N_{t,a}}^{1/2}:
   If N_{t,a} < N_{t,*}, then w.h.p. N_{t,a} ≲ log(t)/∆².
   If N_{t,a} ≥ N_{t,*}, then w.h.p. N_{t,*} ≲ log(t)/∆² =: u_t.
   Thus: on the event N_{t,*} ≥ u_t, we must have N_{t,a} ≲ u_t w.h.p.

2. Show that N_{t,*} ≥ u_t w.h.p. (as for Thompson Sampling).
   Let τ*_j be the delay between the j-th time t_j and the (j+1)-th time we play *. Then

   P[N_{t,*} ≤ u_t] ≤ ∑_{j=1}^{u_t} P[τ*_j ≥ t/u_t − 1 =: ℓ_t]
                    ≤ ∑_{j=1}^{u_t} P[∀s ∈ {0, ..., ℓ_t}: a_{t_j+s} = a]
                    ≤ ∑_{j=1}^{u_t} P[∀s ∈ {⌈ℓ_t/2⌉, ..., ℓ_t}: a_{t_j+s} = a, with N_{t_j+s,a} ≥ ⌊ℓ_t/2⌋ (> u_t ≥ j for t ≥ c)].
(Second overlay of the sketch: for t ≥ c, the final event can instead be written {∀s ∈ {⌈ℓ_t/2⌉, ..., ℓ_t}: a_{t_j+s} = a, with N_{t_j+s,a} ≥ N_{t_j+s,*} = j}.)
BESA - Recap

Optimality and near-optimality regions in P([0, 1])?

Properties
- Flexible: needs neither the class of distributions nor the support.
- We can prove log(T) regret for certain classes.
- Optimality (constants) unknown yet, but we are close.
- Exhibited cases where it fails: why, and how to repair it.
Thank you

If you want to
- prove "adaptive" optimality of this strategy, or
- extend it to contextual bandits, adversarial bandits, MDPs:

Come work with me!