Taming the monster: A fast and simple algorithm for contextual bandits PRESENTED BY Satyen Kale...
Presented by Satyen Kale.
Joint work with Alekh Agarwal, Daniel Hsu, John Langford, Lihong Li and Rob Schapire.
Learning to interact: example #1
Loop:
1. Patient arrives with symptoms, medical history, genome, …
2. Physician prescribes treatment.
3. Patient’s health responds (e.g., improves, worsens).
Goal: prescribe treatments that yield good health outcomes.
Learning to interact: example #2
Loop:
1. User visits website with profile, browsing history, …
2. Website operator chooses content/ads to display.
3. User reacts to content/ads (e.g., click, “like”).
Goal: choose content/ads that yield desired user behavior.
Contextual bandit setting (i.i.d. version)
Set X of contexts/features and K possible actions.
For t = 1, 2, …, T:
0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
1. Observe context x_t. [e.g., user profile, browsing history]
2. Choose action a_t ∈ [K]. [e.g., content/ad to display]
3. Collect reward r_t(a_t). [e.g., indicator of click or positive feedback]
Goal: algorithm for choosing actions a_t that yield high reward.
Contextual setting: use features x_t to choose good actions a_t.
Bandit setting: r_t(a) for a ≠ a_t is not observed.
Hence: exploration vs. exploitation trade-off.
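The interaction protocol above can be simulated on a toy instance. This is a minimal sketch; the names (`draw_from_D`, `run`) and the choice of integer contexts are illustrative assumptions, not part of the talk:

```python
import random

# Toy instance of the i.i.d. contextual bandit loop: integer contexts,
# K actions, and nature drawing a full reward vector each round.
K = 3

def draw_from_D():
    """Nature's draw: a context x and a full reward vector r in [0,1]^K."""
    x = random.randrange(5)                  # context/feature (toy: an integer)
    r = [random.random() for _ in range(K)]  # reward for every action (hidden)
    return x, r

def run(T, choose_action):
    total_reward = 0.0
    for _ in range(T):
        x, r = draw_from_D()   # step 0: nature draws (x_t, r_t)
        a = choose_action(x)   # steps 1-2: observe x_t, choose a_t
        total_reward += r[a]   # step 3: only r_t(a_t) is observed (bandit feedback)
    return total_reward

random.seed(0)
avg = run(1000, lambda x: random.randrange(K)) / 1000
print(0.0 <= avg <= 1.0)  # rewards lie in [0,1]
```

A uniformly random policy earns about 0.5 per round here; the point of the algorithms below is to do better by exploiting the context.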
Learning objective and difficulties
No single action is good in all situations – need to exploit context.
Policy class Π: set of functions (“policies”) from X → [K]
(e.g., advice of experts, linear classifiers, neural networks).
Regret (i.e., relative performance) to policy class Π:

  Regret(T) = max_{π∈Π} Σ_{t=1}^T r_t(π(x_t)) − Σ_{t=1}^T r_t(a_t)

… a strong benchmark if Π contains a policy with high reward.
Difficulties: feedback on an action only informs about the subset of policies choosing that action; explicit bookkeeping is computationally infeasible when Π is large.
Arg max oracle (AMO)
Given fully-labeled data (x_1, r_1), …, (x_t, r_t), AMO returns

  arg max_{π∈Π} Σ_{s=1}^t r_s(π(x_s)).

Abstraction for efficient search of the policy class Π.
In practice: implement using standard heuristics (e.g., convex relaxations, backprop) for cost-sensitive multiclass learning algorithms.
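For a small, explicitly enumerable policy class, the oracle is just an exhaustive arg max over total reward. A toy stand-in sketch (real implementations use a cost-sensitive learner; `amo` and the constant policies here are illustrative):

```python
# Minimal stand-in for the arg max oracle (AMO) over a tiny explicit
# policy class; real implementations use cost-sensitive learning instead.
def amo(policies, data):
    """Return the policy maximizing total reward on fully-labeled data.

    policies: list of functions x -> action
    data: list of (x, r) pairs, r a full reward vector over actions
    """
    return max(policies, key=lambda pi: sum(r[pi(x)] for x, r in data))

# Toy policy class: constant policies, one per action.
K = 2
policies = [lambda x, a=a: a for a in range(K)]
data = [(0, [0.1, 0.9]), (1, [0.2, 0.8])]
best = amo(policies, data)
print(best(0))  # action 1 has higher total reward, so prints 1
```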
Our results
New fast and simple algorithm for contextual bandits:
Optimal regret bound (up to log factors): Õ(√(KT log |Π|)).
Õ(√(KT / log |Π|)) total calls to the argmax oracle (AMO), i.e., amortized o(1) AMO calls per round.
Comparison to previous work:
[Thompson ’33]: no general analysis.
[ACBFS ’02]: Exp4 algorithm; optimal regret, but enumerates policies.
[LZ ’07]: ε-greedy variant; suboptimal regret, one AMO call/round.
[DHKKLRZ ’11]: “monster paper”; optimal regret, O(T^5 K^4) AMO calls/round.
Note: Exp4 also works in the adversarial setting.
Rest of this talk
1. Action distributions, reward estimates via inverse probability weights [oldies but goodies]
2. Algorithm for finding policy distributions that balance exploration/exploitation [new]
3. Warm-start / epoch trick [new]
Basic algorithm structure (same as Exp4)
Start with initial distribution Q_1 over policies Π.
For t = 1, 2, …, T:
0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
1. Observe context x_t.
2a. Compute distribution p_t over actions {1, 2, …, K} (based on Q_t and x_t).
2b. Draw action a_t from p_t.
3. Collect reward r_t(a_t).
4. Compute new distribution Q_{t+1} over policies Π.
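Step 2a projects the policy distribution Q_t onto an action distribution p_t for the observed context. A minimal sketch of one standard way to do this, mixing in a minimum action probability μ so importance weights stay bounded (the helper name and the value of μ are illustrative assumptions):

```python
# Sketch of step 2a: project a distribution Q over policies onto an action
# distribution for context x, smoothed so every action has probability >= mu.
def action_distribution(Q, x, K, mu):
    """Q: list of (policy, weight) pairs with weights summing to 1."""
    p = [0.0] * K
    for pi, w in Q:
        p[pi(x)] += w                        # probability mass of policies picking each action
    return [(1 - K * mu) * p_a + mu for p_a in p]  # smooth toward uniform

Q = [((lambda x: 0), 0.7), ((lambda x: 1), 0.3)]
p = action_distribution(Q, x=0, K=2, mu=0.05)
print([round(v, 2) for v in p])  # [0.68, 0.32] -> smoothed toward uniform
```

The smoothing guarantees p_t(a) ≥ μ for every action, which caps the inverse probability weights used in the reward estimates below.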
Inverse probability weighting (old trick)
Importance-weighted estimate of the reward from round t:

  r̂_t(a) = r_t(a_t) · 1{a = a_t} / p_t(a_t)

Unbiased, and has range & variance bounded by 1/p_t(a).
Can estimate the total reward of any policy π as Σ_{s=1}^t r̂_s(π(x_s)), and hence its regret.
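The unbiasedness of the inverse-probability-weighted estimate can be checked empirically. A sketch for a single round (the distribution and reward values are made up for illustration):

```python
import random

# Inverse probability weighting: r_hat_t(a) = r_t(a_t)/p_t(a_t) if a == a_t,
# else 0. Taking expectation over a_t ~ p_t recovers the true r_t(a).
def ips_estimate(a, a_t, r_obs, p_t):
    return r_obs / p_t[a_t] if a == a_t else 0.0

# Empirical check of unbiasedness for action a = 0 on a toy round.
random.seed(1)
p = [0.25, 0.75]   # action distribution p_t
r = [0.4, 0.8]     # true (hidden) reward vector for this round
n = 200_000
est = 0.0
for _ in range(n):
    a_t = 0 if random.random() < p[0] else 1   # draw a_t from p_t
    est += ips_estimate(0, a_t, r[a_t], p)     # only r[a_t] is observed
est /= n
print(abs(est - r[0]) < 0.02)  # averages to the true reward 0.4
```

Note the trade-off visible here: the estimate's range is 1/p_t(a_t), so rarely-played actions yield high-variance estimates, which is exactly what the (LV) constraint below controls.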
Constructing policy distributions
Optimization problem (OP): find a policy distribution Q such that:
(LR) Low estimated regret – “exploitation”
(LV) Low estimation variance – “exploration”
Theorem: If we obtain the policy distributions Q_t by solving (OP), then with high probability, the regret after T rounds is at most Õ(√(KT log |Π|)).
Feasibility
Feasibility of (OP): implied by minimax argument.
Monster solution [DHKKLRZ’11]: solves variant of (OP) with ellipsoid algorithm, where Separation Oracle = AMO + perceptron + ellipsoid.
Coordinate descent algorithm
INPUT: initial weights Q.
LOOP:
  IF (LR) is violated, THEN replace Q by cQ.
  IF there is a policy π causing (LV) to be violated, THEN
    UPDATE Q(π) = Q(π) + α.
  ELSE
    RETURN Q.
Above, both 0 < c < 1 and α have closed-form expressions.
(Technical detail: actually optimize over sub-distributions Q that may sum to < 1.)
Claim: whether (LV) is violated – and by which policy – can be checked by making one AMO call per iteration.
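A sketch of why one AMO call suffices for the (LV) check: a policy's variance penalty is a sum of inverse propensities, which is exactly its total reward on a synthetic dataset whose reward vector assigns each action its importance weight. The names here (`amo` as an exhaustive-search stand-in, `worst_variance_policy`) are illustrative assumptions:

```python
# The variance proxy for policy pi under smoothed distribution Q_mu is
#   V(pi) = sum_s 1 / Q_mu(pi(x_s) | x_s),
# i.e., pi's total "reward" on synthetic data whose reward vector for x_s is
# [1 / Q_mu(a | x_s) for each action a]. One AMO call finds the worst violator.
def amo(policies, data):
    """Exhaustive-search stand-in for the arg max oracle."""
    return max(policies, key=lambda pi: sum(r[pi(x)] for x, r in data))

def worst_variance_policy(policies, contexts, q_mu, K):
    """q_mu(x) -> smoothed action distribution for context x."""
    synthetic = [(x, [1.0 / q_mu(x)[a] for a in range(K)]) for x in contexts]
    return amo(policies, synthetic)

K = 2
policies = [lambda x, a=a: a for a in range(K)]   # constant policies
q_mu = lambda x: [0.9, 0.1]                       # action 1 is rarely played
pi_star = worst_variance_policy(policies, [0, 1, 2], q_mu, K)
print(pi_star(0))  # the policy always playing the rare action has highest variance: 1
```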
Iteration bound for coordinate descent
# steps of coordinate descent =
Also gives bound on sparsity of Q.
Analysis via a potential function argument.
Warm-start
If we warm-start coordinate descent (initializing with Q_t to obtain Q_{t+1}), then only a small total number of coordinate descent iterations is needed over all T rounds.
Caveat: need one AMO call/round to even check if (OP) is solved.
Epoch trick
Regret analysis: Q_t has low instantaneous expected regret (crucially relying on the i.i.d. assumption). Therefore the same Q_t can be used for O(t) more rounds!
Epoching: split the T rounds into epochs and solve (OP) once per epoch.
Doubling: only update on rounds 2¹, 2², 2³, 2⁴, …
  Total of O(log T) updates, so the overall # of AMO calls is unchanged (up to log factors).
Squares: only update on rounds 1², 2², 3², 4², …
  Total of O(T^{1/2}) updates, keeping the amortized number of AMO calls per round small.
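The two update schedules are easy to write down explicitly. A small sketch (function names are illustrative):

```python
import math

# Epoch-trick schedules: rounds on which (OP) is re-solved.
def doubling_rounds(T):
    """Powers of two up to T -> O(log T) updates."""
    return [2**i for i in range(1, int(math.log2(T)) + 1)]

def squares_rounds(T):
    """Perfect squares up to T -> O(sqrt(T)) updates."""
    return [i * i for i in range(1, math.isqrt(T) + 1)]

T = 100
print(doubling_rounds(T))  # [2, 4, 8, 16, 32, 64]
print(squares_rounds(T))   # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```

Doubling keeps the total number of (OP) solves logarithmic; squares updates more often, which is what the talk's experiments use in spirit while still keeping the amortized oracle cost low.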
Experiments
Algorithm        Loss    Time (seconds)
Epsilon-greedy   0.095   22
Bagging          0.059   339
Linear UCB       0.128   212,000
Online Cover     0.053   17
[Supervised]     0.051   6.9

Bandit problem derived from a classification task (RCV1). Reporting progressive validation loss.
“Online Cover” = variant with stateful AMO.