Online Convex Optimization in the Bandit Setting: Gradient Descent Without a Gradient
-Avinash Atreya
Feb 9 2011
Outline
• Introduction – The Problem
– Example
– Background
– Notation
– Results
• One Point Estimate
• Main Theorem
• Extensions and Related Work
The Problem
At time t: – We need to choose an input vector 𝑥𝑡 ∈ 𝑆 ⊂ ℝ^d
– 𝑆 is a convex set
– Nature reveals only the cost 𝑐𝑡(𝑥𝑡)
– 𝑐𝑡 : ℝ^d → ℝ is convex (not necessarily differentiable)
Our Goal: – Minimize the expected regret:
𝔼[ ∑_{t=1}^{n} 𝑐𝑡(𝑥𝑡) ] − min_{𝑥∈𝑆} ∑_{t=1}^{n} 𝑐𝑡(𝑥)
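As a concrete reading of this objective, a minimal sketch (illustrative names; a finite candidate grid stands in for 𝑆) of how regret would be measured in hindsight, once the cost functions are known:

def regret(plays, costs, candidates):
    # plays: the points x_t we actually chose, one per round.
    # costs: the cost functions c_t, known only in hindsight.
    # candidates: a finite set of points approximating S for the
    # hindsight minimization (for illustration, not part of the algorithm).
    incurred = sum(c(x) for c, x in zip(costs, plays))
    best_fixed = min(sum(c(x) for c in costs) for x in candidates)
    return incurred - best_fixed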
Example
Online advertising spend, chosen each day
Each component of 𝑥𝑡 : spend on a search engine in dollars
At the end of the day we learn the number of clicks
Background
Online Convex Optimization
– We learn the function 𝑐𝑡 after we pick 𝑥𝑡
Bandit Setting
– We learn only the outcome of our action
Online Convex Optimization in Bandit Setting
– We only learn the outcome 𝑐𝑡(𝑥𝑡)
Notation I
𝐷 : diameter, ‖𝑥 − 𝑦‖₂ ≤ 𝐷 ∀𝑥, 𝑦 ∈ 𝑆
𝐺 : gradient upper bound, ‖∇𝑐𝑡(𝑥𝑡)‖₂ ≤ 𝐺 ∀𝑡, 1 ≤ 𝑡 ≤ 𝑛
Notation II
𝐶 : bound on the absolute value of the functions, |𝑐𝑡(𝑥)| ≤ 𝐶 ∀𝑡, ∀𝑥
𝐿 : Lipschitz constant, |𝑐𝑡(𝑥) − 𝑐𝑡(𝑦)| ≤ 𝐿 ‖𝑥 − 𝑦‖₂ ∀𝑡, ∀𝑥, 𝑦 ∈ 𝑆
Notation III
Unit ball 𝔹 and unit sphere 𝕊:
𝔹 = { 𝑥 ∈ ℝ^d : ‖𝑥‖ ≤ 1 }
𝕊 = { 𝑥 ∈ ℝ^d : ‖𝑥‖ = 1 }
Projection onto the convex set 𝑆: 𝑃_𝑆(𝑥) = argmin_{𝑧∈𝑆} ‖𝑥 − 𝑧‖
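For concreteness, a minimal sketch of the Euclidean projection 𝑃_𝑆, specialized (an illustrative assumption) to the case where 𝑆 is a centered ball of radius R, for which the projection has a closed form:

import numpy as np

def project_to_ball(x, radius):
    # P_S(x) = argmin_{z in S} ||x - z|| for S = radius * B (a centered ball):
    # if x lies outside, scale it back to the boundary.
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

For a general convex 𝑆 the projection is itself a small convex optimization problem; the closed form above only covers this special case.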
Key Results
Online Convex Optimization: (Zinkevich)
∑_{t=1}^{n} 𝑐𝑡(𝑥𝑡) − min_{𝑥∈𝑆} ∑_{t=1}^{n} 𝑐𝑡(𝑥) ≤ 𝐷𝐺√𝑛
Bandit Setting
𝔼[ ∑_{t=1}^{n} 𝑐𝑡(𝑥𝑡) ] − min_{𝑥∈𝑆} ∑_{t=1}^{n} 𝑐𝑡(𝑥) ≤ 6𝑛^{5/6}𝑑𝐶
Outline
• Introduction
• One Point Estimate
– Key Challenge
– Projected Gradient Descent
– Expected Gradient Descent
– One Point Estimate
• Main Theorem
• Extensions and Related Work
Key Challenge
Approach
– Projected gradient descent: 𝑥_{𝑡+1} = 𝑃_𝑆(𝑥𝑡 − 𝜈∇𝑐𝑡(𝑥𝑡))
Challenge
– How do we estimate the gradient with only 𝑐𝑡(𝑥𝑡)?
Gradient Estimate
We need at least 𝑑 + 1 points in 𝑑 dimensions
1-d: 𝑓′(𝑥) ≈ (𝑓(𝑥 + 𝛿) − 𝑓(𝑥)) / 𝛿
Prior work exists on using two-point estimates in 𝑑 dimensions
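A tiny sketch of the 1-d two-point estimate above (the test function and 𝛿 are illustrative); in the bandit setting we do not even get these two evaluations per round, which is the difficulty:

def two_point_estimate(f, x, delta=1e-4):
    # f'(x) ~ (f(x + delta) - f(x)) / delta; extending this style of
    # estimate to d dimensions needs about d + 1 evaluations of f.
    return (f(x + delta) - f(x)) / delta

print(two_point_estimate(lambda x: x**2, 3.0))  # roughly 6.0001, true value 6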
Projected Gradient Descent
Due to Zinkevich (seen in class)
𝑥_1 = 0; at time 𝑡 + 1: – 𝑐𝑡 is revealed (convex and differentiable)
– 𝑥_{𝑡+1} = 𝑃_𝑆(𝑥𝑡 − 𝜂∇𝑐𝑡(𝑥𝑡))
Regret bound:
∑_{t=1}^{n} 𝑐𝑡(𝑥𝑡) − min_{𝑥∈𝑆} ∑_{t=1}^{n} 𝑐𝑡(𝑥) ≤ 𝑅𝐺√𝑛
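A minimal sketch of this full-information update; the gradient oracle, projection routine, and step size 𝜂 are placeholders to be supplied:

import numpy as np

def projected_gradient_descent(grad, project, eta, d, n):
    # grad(t, x): gradient of c_t at x (revealed after playing; full information).
    # project(x): Euclidean projection P_S(x) onto the convex set S.
    x = np.zeros(d)                        # x_1 = 0
    plays = []
    for t in range(n):
        plays.append(x)
        x = project(x - eta * grad(t, x))  # x_{t+1} = P_S(x_t - eta * grad c_t(x_t))
    return plays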
Expected Gradient Descent
𝑥_1 = 0; at time 𝑡 + 1:
– 𝑥_{𝑡+1} = 𝑃_𝑆(𝑥𝑡 − 𝜂𝑔𝑡)
– 𝑔𝑡: a random vector
– 𝔼[𝑔𝑡 | 𝑥𝑡] = ∇𝑐𝑡(𝑥𝑡)
The same bound holds in expectation:
𝔼[ ∑_{t=1}^{n} 𝑐𝑡(𝑥𝑡) ] − min_{𝑥∈𝑆} ∑_{t=1}^{n} 𝑐𝑡(𝑥) ≤ 𝑅𝐺√𝑛
Key Challenge Revisited
Challenge – Estimate the gradient ∇𝑐𝑡(𝑥𝑡) from a one-point estimate, i.e., from the single value 𝑐𝑡(𝑥𝑡)
Somewhat easier – Come up with ĉ𝑡, 𝑔𝑡 so that 𝔼[𝑔𝑡 | 𝑥𝑡] = ∇ĉ𝑡(𝑥𝑡)
That is, find a function ĉ𝑡 (close to 𝑐𝑡) whose gradient is easy to estimate, in expectation, using only evaluations of 𝑐𝑡
One Point Estimate I
Fundamental theorem of calculus:
(d/d𝑥) ∫_{−𝛿}^{+𝛿} 𝑐𝑡(𝑥 + 𝑦) d𝑦 = 𝑐𝑡(𝑥 + 𝛿) − 𝑐𝑡(𝑥 − 𝛿)
Uniform random variable 𝑣 ∈ [−1, +1]:
(d/d𝑥) 𝛿 ∫_{−1}^{1} (1/2) 𝑐𝑡(𝑥 + 𝑣𝛿) d𝑣 = (𝑐𝑡(𝑥 + 𝛿) − 𝑐𝑡(𝑥 − 𝛿)) / 2
One Point Estimate II
Random variable 𝑢 ∈ {−1, +1}:
(d/d𝑥) 𝔼_{𝑣~𝒰[−1,1]}[𝑐𝑡(𝑥 + 𝛿𝑣)] = (1/𝛿) 𝔼_{𝑢~{−1,+1}}[𝑐𝑡(𝑥 + 𝛿𝑢) 𝑢]
ĉ𝑡(𝑥) = 𝔼[𝑐𝑡(𝑥 + 𝛿𝑣)] (a smoothed version of 𝑐𝑡) – the function we are looking for!
– 𝑔𝑡 = (1/𝛿) 𝑐𝑡(𝑥 + 𝛿𝑢𝑡) 𝑢𝑡
𝑣 is drawn from the interval, 𝑢 from its end points
One Point Estimate III
𝑑 dimensions – 𝑣 ~ 𝔹 (the unit ball)
– 𝑢 ~ 𝕊 (the unit sphere)
∇𝔼_{𝑣~𝔹}[𝑐𝑡(𝑥 + 𝛿𝑣)] = (𝑑/𝛿) 𝔼_{𝑢~𝕊}[𝑐𝑡(𝑥 + 𝛿𝑢) 𝑢]
Follows from Stokes’ theorem (generalization of fundamental theorem to 𝑑 dimensions)
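A small numerical sketch of this identity (the test function, dimension, 𝛿, and sample size are illustrative): for 𝑐(𝑧) = ‖𝑧‖² the smoothed function has gradient exactly 2𝑥, and the sphere-sampled average recovers it.

import numpy as np

rng = np.random.default_rng(0)
d, delta, m = 3, 0.1, 200_000
x = np.array([0.5, -0.2, 0.1])
c = lambda z: float(z @ z)                       # c(z) = ||z||^2

u = rng.normal(size=(m, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)    # u ~ uniform on the unit sphere
vals = np.array([c(x + delta * ui) for ui in u])
g_hat = (d / delta) * (vals[:, None] * u).mean(axis=0)   # (d/delta) E[c(x + delta u) u]

print(g_hat)   # close to 2 * x = [1.0, -0.4, 0.2]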
Putting Things Together
Expected gradient descent on ĉ𝑡 : (1 − 𝛼)𝑆 → [−𝐶, 𝐶]
– 𝑔𝑡 = (𝑑/𝛿) 𝑐𝑡(𝑥𝑡 + 𝛿𝑢𝑡) 𝑢𝑡 , 𝑢𝑡 ~ 𝕊
– 𝑥_{𝑡+1} = 𝑃_{(1−𝛼)𝑆}(𝑥𝑡 − 𝜂𝑔𝑡)
– 𝔼[𝑔𝑡 | 𝑥𝑡] = ∇ĉ𝑡(𝑥𝑡)
Bound on regret:
𝔼[ ∑_{t=1}^{n} ĉ𝑡(𝑥𝑡) ] − min_{𝑥∈(1−𝛼)𝑆} ∑_{t=1}^{n} ĉ𝑡(𝑥) ≤ 𝑅𝐺√𝑛
Outline
• Introduction
• One point estimate
• Main Theorem
– Algorithm
– Observations
– Proof Sketch
– Results
• Extensions and Related Work
The Algorithm
Bandit-Gradient-Descent(𝛼, 𝛿, 𝜈)
– 𝑥_1 = 0
– At time 𝑡:
• Select 𝑢𝑡 ~ 𝕊
• Play 𝑥𝑡 + 𝛿𝑢𝑡
• Observe 𝑐𝑡(𝑥𝑡 + 𝛿𝑢𝑡)
• 𝑥_{𝑡+1} = 𝑃_{(1−𝛼)𝑆}(𝑥𝑡 − 𝜈 𝑐𝑡(𝑥𝑡 + 𝛿𝑢𝑡) 𝑢𝑡)
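A minimal sketch of Bandit-Gradient-Descent as stated above; the bandit cost oracle and the projection onto (1 − 𝛼)𝑆 are placeholders to be supplied, and 𝛿, 𝜈 are taken as given (𝛼 is implicit in the projection):

import numpy as np

def bandit_gradient_descent(cost, project_shrunk, d, n, delta, nu, seed=0):
    # cost(t, y): bandit oracle returning only the scalar c_t(y).
    # project_shrunk(x): Euclidean projection onto the shrunken set (1 - alpha) * S.
    rng = np.random.default_rng(seed)
    x = np.zeros(d)                              # x_1 = 0
    total = 0.0
    for t in range(n):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                   # u_t drawn uniformly from the unit sphere
        y = x + delta * u                        # play x_t + delta * u_t
        c_val = cost(t, y)                       # observe only c_t(x_t + delta * u_t)
        total += c_val
        # x_{t+1} = P_{(1-alpha)S}(x_t - nu * c_t(y) * u_t); nu absorbs the d/delta
        # factor of the one-point gradient estimate (eta = nu * delta / d).
        x = project_shrunk(x - nu * c_val * u)
    return total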
Terms in the bound
Expected gradient descent bound for ĉ𝑡
Difference between the minimum over (1 − 𝛼)𝑆 and the minimum over 𝑆
Difference between 𝑐𝑡(𝑥) for 𝑥 ∈ (1 − 𝛼)𝑆 and 𝑐𝑡(𝑦) for nearby 𝑦 ∈ 𝑆
(Figure: the shrunken set (1 − 𝛼)𝑆 nested inside 𝑆.)
Observation I
If we take a step of size at most 𝛼𝑟 from 𝑥 ∈ (1 − 𝛼)𝑆, we stay in 𝑆
Bounds on 𝑆: 𝑟𝔹 ⊂ 𝑆 ⊂ 𝑅𝔹
𝑆 contains the origin
Since 𝑟𝔹 ⊂ 𝑆, the ball 𝛼𝑟𝔹 centered at 𝑥 ∈ (1 − 𝛼)𝑆 satisfies 𝑥 + 𝛼𝑟𝔹 ⊂ (1 − 𝛼)𝑆 + 𝛼𝑆 = 𝑆
Observation II
From the expected gradient descent bound (with 𝜂 = 𝜈𝛿/𝑑):
𝔼[ ∑_{t=1}^{n} ĉ𝑡(𝑥𝑡) ] − min_{𝑥∈(1−𝛼)𝑆} ∑_{t=1}^{n} ĉ𝑡(𝑥) ≤ 𝑅𝐺√𝑛
Gradient bound 𝐺:
‖𝑔𝑡‖ = (𝑑/𝛿) |𝑐𝑡(𝑥𝑡 + 𝛿𝑢𝑡)| ‖𝑢𝑡‖ ≤ 𝑑𝐶/𝛿
Regret bound: 𝑅𝑑𝐶√𝑛 / 𝛿
Observation III
The optimum over (1 − 𝛼)𝑆 is near the optimum over 𝑆
From Jensen's inequality:
𝑐𝑡((1 − 𝛼)𝑥 + 𝛼·0) ≤ (1 − 𝛼)𝑐𝑡(𝑥) + 𝛼𝑐𝑡(0)
𝑐𝑡((1 − 𝛼)𝑥) − 𝑐𝑡(𝑥) ≤ 𝛼(𝑐𝑡(0) − 𝑐𝑡(𝑥)) ≤ 2𝛼𝐶
Summing up:
min_{𝑥∈(1−𝛼)𝑆} ∑_{t=1}^{n} 𝑐𝑡(𝑥) − min_{𝑥∈𝑆} ∑_{t=1}^{n} 𝑐𝑡(𝑥) ≤ 2𝛼𝐶𝑛
Observation IV
Lipschitz-type bound across (1 − 𝛼)𝑆 and 𝑆:
For 𝑥 ∈ 𝑆, 𝑦 ∈ (1 − 𝛼)𝑆
|𝑐𝑡(𝑥) − 𝑐𝑡(𝑦)| ≤ (2𝐶 / (𝛼𝑟)) ‖𝑥 − 𝑦‖
This is immediate when Δ = ‖𝑥 − 𝑦‖ > 𝛼𝑟, since the right-hand side then exceeds 2𝐶
Otherwise we pick a point 𝑧 ∈ 𝑆 in the direction of Δ and use Jensen’s inequality
Proof Sketch I
Combining all the observations
𝔼[ ∑_{t=1}^{n} 𝑐𝑡(𝑥𝑡) ] − min_{𝑥∈𝑆} ∑_{t=1}^{n} 𝑐𝑡(𝑥)
≤ 𝑅𝑑𝐶√𝑛 / 𝛿 (expected gradient)
+ 6𝛿𝐶𝑛 / (𝛼𝑟) (effective Lipschitz)
+ 2𝛼𝐶𝑛 (difference in minima)
Proof Sketch II
The bound is of the form
𝑎/𝛿 + 𝑏𝛿/𝛼 + 𝑐𝛼
Setting 𝛿 = (𝑎²/(𝑏𝑐))^{1/3} and 𝛼 = (𝑎𝑏/𝑐²)^{1/3} gives a bound of 3(𝑎𝑏𝑐)^{1/3}
Note: 𝑎 = 𝑅𝑑𝐶√𝑛, 𝑏 = 6𝐶𝑛/𝑟, 𝑐 = 2𝐶𝑛
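To see where the 3(𝑎𝑏𝑐)^{1/3} comes from, plugging these choices back in makes each of the three terms equal:
𝑎/𝛿 = 𝑎(𝑏𝑐/𝑎²)^{1/3} = (𝑎𝑏𝑐)^{1/3}
𝑏𝛿/𝛼 = 𝑏(𝑎²/(𝑏𝑐))^{1/3}(𝑐²/(𝑎𝑏))^{1/3} = 𝑏(𝑎𝑐/𝑏²)^{1/3} = (𝑎𝑏𝑐)^{1/3}
𝑐𝛼 = 𝑐(𝑎𝑏/𝑐²)^{1/3} = (𝑎𝑏𝑐)^{1/3}
so their sum is 3(𝑎𝑏𝑐)^{1/3}.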
Theorem
For 𝑛 ≥ (3𝑅𝑑/(2𝑟))², with
𝜈 = 𝑅/(𝐶√𝑛), 𝛿 = (𝑟𝑅²𝑑²/(12𝑛))^{1/3}, 𝛼 = (3𝑅𝑑/(2𝑟√𝑛))^{1/3},
we can show a bound of
3𝐶𝑛^{5/6} (𝑑𝑅/𝑟)^{1/3}
Outline
• Introduction
• One Point Estimate
• Main Theorem
• Extensions and Related Work
– Bound with a Lipschitz Constant
– Reshaping to Isotropic Position
– Related Work
Bound with a Lipschitz Constant
When each 𝑐𝑡 is 𝐿-Lipschitz, for suitable values of 𝛼, 𝛿, 𝜈 we can show a bound of
2𝑛^{3/4} √(3𝑅𝑑𝐶(𝐿 + 𝐶/𝑟))
Intuition: use the actual Lipschitz constant instead of the effective one
Reshaping
Dependence on 𝑅/𝑟 is not ideal
Transform S to be in its isotropic position – Affine transformation so that covariance = 𝐼
– 𝑟′ = 1, 𝑅′ = 1.01𝑑, 𝐿′ = 𝐿𝑅, 𝐶′ = 𝐶
Related Work
Kleinberg (independently): an 𝑂(𝑛^{3/4}) bound for the same problem
– Phases of length 𝑑 + 1
– Random one-point gradient estimates
– Only oblivious adversaries
Online linear optimization in the bandit setting
– Kalai and Vempala show a bound of 𝑂(√𝑛)