Battle of Bandits - Online learning from relative preferences
Aadirupa Saha, Computer Science and Automation (CSA) Prof. Aditya Gopalan, Electrical Communication Engineering (ECE) Indian Institute of Science (IISc), Bangalore.
Brief Intro.
Aadirupa Saha: Ph.D. student, Dept. of Computer Science and Automation (CSA), IISc. Advisors: Prof. Aditya Gopalan (ECE, IISc) and Prof. Siddharth Barman (CSA, IISc). Previously M.E. (CSA, IISc), advised by Prof. Shivani Agarwal.
Research interests: Machine Learning, Learning Theory, Optimization.
Currently: Research intern, Google, Mountain View. Collaborators: Ofer Meshi, Branislav Kveton, Craig Boutilier.
09-Sep-19 2
Problem Overview
Learning from Preferences
Introduction – Multi-Armed Bandits (MAB)
Qualcomm Innovation Fellowship India 2018
Example: four slot machines with rewards -$10, $1, $50, -$5.
How fast can we find the arm with the *highest reward*?
Play sequentially (one by one).
μ1 > μ2 > μ3 > μ4 > … > μn -- arm 1 is the *best* arm
More formally: MAB (Learning from single choices)
At round t:
1. Select an arm at from {1, 2, …, n}
2. Observe (noisy) reward rt ~ Dist(𝜇(𝒂𝒕))
and repeat.
Expected regret in T rounds: R_T = T·μ* − E[Σ_{t=1}^T rt], where μ* = max_i μ_i.
Best possible: 𝑂(𝑛 𝑙𝑜𝑔 𝑇)
State of the art
Auer et al. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.
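The MAB protocol above can be sketched with the classic UCB1 index rule from the cited paper; the Bernoulli arm means, horizon, and seed below are illustrative assumptions.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Minimal UCB1 sketch on Bernoulli arms: play each arm once,
    then repeatedly play the arm with the highest optimistic index."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n
    sums = [0.0] * n
    for t in range(1, horizon + 1):
        if t <= n:                      # initialization: play each arm once
            a = t - 1
        else:                           # then maximize the UCB index
            a = max(range(n), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < means[a] else 0.0   # noisy reward
        counts[a] += 1
        sums[a] += r
    return counts

pulls = ucb1([0.2, 0.5, 0.8], horizon=2000)
# the best arm (mean 0.8) should receive the bulk of the pulls
```

Suboptimal arms are pulled only O(log T) times each, which is where the O(n log T) regret comes from.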
Applications:
- Restaurant recommendation
- Search engine optimization
- Information aggregation from preference data
Learning from relative preferences
Absolute vs. Relative preferences:
- Ratings (absolute): "How would you score it out of 5?"
- Rankings (relative): "Do you like movie A over B?"
Often easier (and more accurate) to elicit relative preferences than absolute scores.
Guess the most liked flavour?
Information aggregation from pairwise preference data
Dueling Bandits
Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. ICML 2009.
More Formally: Dueling Bandits (learning from pairwise preferences)
At round t:
1. Select two arms (at, bt)
2. Observe (noisy) comparison xt ∈ {0,1} ~ Bernoulli(P(at, bt))
and repeat.

Preference Matrix P (entry P(i, j) = probability that arm i beats arm j):

      1     2     3     4     5
1   0.50  0.53  0.54  0.56  0.60
2   0.47  0.50  0.53  0.58  0.61
3   0.46  0.47  0.50  0.54  0.57
4   0.44  0.42  0.46  0.50  0.51
5   0.40  0.39  0.43  0.49  0.50
Szörényi et al. Online rank elicitation for Plackett-Luce: A dueling bandits approach. NeurIPS 2015.
Yue and Joachims. Beat the mean bandit. ICML 2011.
Objective: Find the best arm with the minimum possible number of samples (rounds).
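A duel under this model is just a Bernoulli draw from the preference matrix; the sketch below replays the slide's 5x5 matrix (the helper names are ours).

```python
import random

# The 5x5 preference matrix from the slide: P[i][j] = Pr(arm i beats arm j).
P = [
    [0.50, 0.53, 0.54, 0.56, 0.60],
    [0.47, 0.50, 0.53, 0.58, 0.61],
    [0.46, 0.47, 0.50, 0.54, 0.57],
    [0.44, 0.42, 0.46, 0.50, 0.51],
    [0.40, 0.39, 0.43, 0.49, 0.50],
]

def duel(a, b, rng):
    """One round of dueling-bandit feedback: 1 if arm a beats arm b."""
    return 1 if rng.random() < P[a][b] else 0

rng = random.Random(0)
wins = sum(duel(0, 4, rng) for _ in range(10000))
# empirical win rate of arm 0 over arm 4 should be near P[0][4] = 0.6
```

Note the consistency condition P[i][j] + P[j][i] = 1, which the matrix above satisfies.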
Wouldn’t a subset-wise preference make more sense?
Why Subsets?
- Realistic & budget friendly
- More feedback flexibility
- Easy data collection
Main question: Faster information aggregation with subsets?
Prior Art: Almost none!

Batch (offline / non-active) setting:
- Khetan, Ashish, and Sewoong Oh. "Data-driven rank breaking for efficient rank aggregation." Journal of Machine Learning Research 17.193 (2016).

Active setting:
- Chen, Xi, Yuanzhi Li, and Jieming Mao. "A Nearly Instance Optimal Algorithm for Top-k Ranking under the Multinomial Logit Model." SODA 2018.
- Wenbo Ren, Jia Liu, Ness B. Shroff. "PAC Ranking from Pairwise and Listwise Queries: Lower Bounds and Upper Bounds." arXiv (2018).

Assortment Optimization:
- Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. "A near-optimal exploration-exploitation approach for assortment selection." Proceedings of the 2016 ACM Conference on Economics and Computation (EC).
Battling Bandits (learning relatively from subsets)
At round t:
1. Select a set St of k arms
2. Observe (noisy) "subset-wise feedback" f(St) ~ P ("stochastic model")
and repeat.
Objective: Find the best arm with the minimum possible number of samples (rounds).
Choice Modeling (Challenges):
1. Choice modeling: probabilistic modeling of the feedback for arm 𝒂 in set 𝑺, i.e. 𝑷(𝒂|𝑺).
2. Combinatorial structure: a separate outcome distribution per subset needs on the order of C(n, k) parameters -- combinatorially large!
3. How to express relative utilities of arms within subsets?

Subset-wise preference matrix (one row per subset, one column per outcome):

            1     2     3    …   #outcomes
S_1       0.13  0.01  0.05   …     0.22
S_2       0.27  0.12  0.03   …     0.19
S_3       0.04  0.11  0.05   …     0.23
…           …     …     …    …      …
S_C(n,k)  0.23  0.19  0.03  0.19   0.24
Discrete Choice Models: modelling the stochastic preferences of an individual over a group of items in a given context (subset).

Plackett-Luce (PL) choice model: each item i carries a score θ_i > 0, and for any subset S,
P(i | S) = θ_i / Σ_{j ∈ S} θ_j.
Just n parameters -- a parameter reduction from combinatorially many (one distribution per subset) down to n!
Other choice models: Multinomial Probit, Mallows, Nested GEV, etc.
Azari et al., Random utility theory for social choice. NeurIPS 2012.
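A minimal sketch of the PL choice rule above; the score vector `theta` below is an illustrative assumption.

```python
def pl_choice_probs(theta, S):
    """Plackett-Luce choice probabilities:
    P(i | S) = theta[i] / sum_{j in S} theta[j]."""
    z = sum(theta[j] for j in S)
    return {i: theta[i] / z for i in S}

# n = 5 items need only the 5 scores below, no matter which subsets are offered
theta = [1.0, 0.8, 0.6, 0.4, 0.2]
probs = pl_choice_probs(theta, [0, 2, 4])
print(probs[0])   # 1.0 / (1.0 + 0.6 + 0.2) = 5/9 ≈ 0.5556
```

The same n scores determine the choice distribution over every possible subset, which is exactly the parameter reduction claimed above.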
Type of PL feedback -- general top-m ranking.
Example, for a subset of size k = 4:
- Top-m ranking feedback (m = 2): observe the ordered top-2 items of the subset.
- Full ranking feedback (m = 4): observe the full ordering of all 4 items.
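Top-m ranking feedback can be simulated by sampling PL winners sequentially without replacement (the standard PL ranking process); the scores and seed below are illustrative.

```python
import random

def topm_ranking(theta, S, m, rng):
    """Top-m ranking feedback from a Plackett-Luce model: sample the
    winner among the remaining items (probability proportional to its
    score), remove it, and repeat m times."""
    remaining = list(S)
    ranking = []
    for _ in range(m):
        z = sum(theta[i] for i in remaining)
        u = rng.random() * z
        acc = 0.0
        for i in remaining:          # inverse-CDF draw of the winner
            acc += theta[i]
            if u < acc:
                winner = i
                break
        ranking.append(winner)
        remaining.remove(winner)
    return ranking

rng = random.Random(1)
# subset of size k = 4; top-2 ranking feedback
print(topm_ranking([1.0, 0.8, 0.6, 0.4], [0, 1, 2, 3], m=2, rng=rng))
```

With m = 1 this reduces to winner feedback; with m = k it yields the full ranking.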
Work done:
Part I: Learning the (𝜖, 𝛿)-PAC Best Item
Part II: Instance-optimal PAC Best Item
Part III: Learning the (𝜖, 𝛿)-PAC-Optimal Ranking
Part IV: Cost optimization / Regret minimization -- (a) PL model, (b) PSC model
Part-I: Learning (𝜖, 𝛿)-PAC Best Item
In 30th International Conference on Algorithmic Learning Theory (ALT), 2019.
Problem: (𝜖, 𝛿)-PAC Best Item.
Objective: output an item i such that, with probability at least 1 − δ, i is ϵ-optimal (its pairwise win probability against the best item is at least 1/2 − ϵ), using the minimum possible number of samples (rounds).
Result Overview: (𝜖, 𝛿)-sample complexity
1. Sample Complexity Lower Bound: For any ϵ, δ, and any algorithm A, there exists an instance of the PL model where A requires a sample complexity of at least
Ω( (n / (m ϵ²)) · log(1/δ) ) rounds.
2. Proposed algorithms take O( (n / (m ϵ²)) · log(k/δ) ) rounds:
-- Algorithm-1: Divide and Battle
-- Algorithm-2: Halving Battle
Essentially 'independent' of k! Reduces with m!
A. Lower Bound Analysis
PL instances:
[Figure: true PL instance over arms 1, 2, …, n, with arm 1 the optimal arm.]
[Figure: alternative PL instance in which the score of some arm a is raised so that arm a becomes the optimal arm.]
Fundamental Inequality (Kaufmann et al. 2016): Consider two MAB instances ν and ν′ on the arm set [n], where ν_i (resp. ν′_i) is the reward distribution of arm i. Let N_i(τ) be the number of plays of arm i up to any finite stopping time τ, and let E be any event under the sigma-algebra of the algorithm's trajectory. Then
Σ_{i=1}^n E_ν[N_i(τ)] · KL(ν_i, ν′_i) ≥ kl(Pr_ν(E), Pr_ν′(E)),
where kl(x, y) := x log(x/y) + (1 − x) log((1 − x)/(1 − y)) is the binary relative entropy.
Kaufmann et al., On the complexity of best-arm identification in multi-armed bandit models. JMLR 2016.
Lower Bound Analysis:
Arm set: [n] = {1, …, n}. Let E be the event that algorithm A returns item 1.
LHS: bounded via the KL divergences between the true and alternative instances on the arms played.
RHS: kl(Pr_true(E), Pr_alt(E)) ≥ kl(1 − δ, δ) ≥ log(1/(2.4δ)) (Kaufmann et al. 2016).
The result follows by combining the two bounds over the alternative instances.
B. Proposed Algorithms + Guarantees
Rank-Breaking: the idea of extracting pairwise preferences from subset-wise feedback.
Example: consider a subset of size k = 4.
- Upon top-m ranking feedback (m = 2): break the observed top-2 ranking into pairwise duels (each ranked item beats every item below it and every unranked item).
- Upon full ranking feedback (m = 4): break the full ranking into all pairwise duels.
'Strongest' arm := winner of the maximum number of pairwise duels.
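A sketch of rank-breaking as described above, assuming the convention that every ranked item beats every lower-ranked item and every unranked item in the subset; the function name is ours.

```python
def rank_break(ranking, subset):
    """Rank-Breaking: decompose observed top-m ranking feedback on
    `subset` into pairwise duels (winner, loser)."""
    duels = []
    unranked = [i for i in subset if i not in ranking]
    for pos, winner in enumerate(ranking):
        # the item at position `pos` beats everything ranked below it
        # and everything that did not make the top-m at all
        for loser in ranking[pos + 1:] + unranked:
            duels.append((winner, loser))
    return duels

# subset of size k = 4, top-2 ranking feedback (m = 2): item 2 first, then 0
print(rank_break([2, 0], [0, 1, 2, 3]))
# [(2, 0), (2, 1), (2, 3), (0, 1), (0, 3)]
```

With full ranking feedback (m = k = 4), rank-breaking yields all C(4, 2) = 6 pairwise duels.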
Key Lemma (deviations of pairwise win-probability estimates for the PL model): after rank-breaking, the empirical pairwise win probabilities concentrate around the true PL pairwise preferences p_ij = θ_i / (θ_i + θ_j) with high probability.
Proposed Algorithm-1: Divide and Battle (DaB)
- Divide the arms into groups of size k.
- Play each group a fixed number of times and apply Rank-Breaking to the feedback.
- Retain the 'strongest' arm of each group (winner of the most pairwise duels); discard the rest.
- Repeat over the surviving arms, phase by phase, until a single arm remains: the PAC item (output).
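The phased elimination above can be sketched as follows. This is an illustrative simplification assuming winner (m = 1) feedback and a fixed per-group budget in place of the algorithm's (ϵ, δ)-dependent schedule; the PL scores and the budget are assumptions.

```python
import random

def divide_and_battle(theta, k, plays_per_group=2000, seed=0):
    """Sketch of Divide-and-Battle: split arms into groups of size k,
    play each group repeatedly with PL winner feedback, keep each
    group's empirical winner, and recurse until one arm survives."""
    rng = random.Random(seed)
    arms = list(range(len(theta)))
    while len(arms) > 1:
        groups = [arms[i:i + k] for i in range(0, len(arms), k)]
        survivors = []
        for g in groups:
            wins = {i: 0 for i in g}
            for _ in range(plays_per_group):      # battle the group
                z = sum(theta[i] for i in g)
                u = rng.random() * z
                acc = 0.0
                for i in g:                       # PL winner draw
                    acc += theta[i]
                    if u < acc:
                        wins[i] += 1
                        break
            survivors.append(max(g, key=wins.get))  # retain the 'strongest'
        arms = survivors
    return arms[0]

best = divide_and_battle([0.9, 0.3, 0.2, 1.5, 0.4, 0.1], k=3)
# with this budget, the PL-best arm (index 3, score 1.5) should survive
```

Each phase shrinks the candidate set by a factor of roughly k, which is why the total sample complexity stays essentially independent of k.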
Comparisons with existing results:
Algorithm-1 (DaB): …
Algorithm-2 (HB): …
Existing dueling-bandit (k = 2, m = 1) results:
- PLPAC (Szörényi et al., 2015): …
- BTM (Yue and Joachims, 2011): … sub-optimality
But no existing work for top-m feedback!
Szörényi et al. Online rank elicitation for Plackett-Luce: A dueling bandits approach. NeurIPS 2015.
Yue and Joachims. Beat the mean bandit. ICML 2011.
Part-II: Instance-Optimal PAC Best Item
In submission.
What if 𝜖 = 0? Shouldn't the sample complexity depend on the 'hardness' of the problem instance?
Motivation:
[Figure: two PL instances over arms 1, …, n -- one "easy" (large score gaps) and one "hard" (small score gaps). The sample complexity of finding the optimal arm can't be the same for both!]
Instance-Dependent Sample Complexity (pure exploration):
- Lower bound: depends on the instance gaps -- larger for "hard" instances, smaller for "easy" instances.
- We achieved: an instance-dependent sample complexity of matching order (instance-optimal best-item identification).
Instance-optimal Best Item: proposed algorithm -- PAC-Wrapper!
Repeat over sub-phases [assume …]:
- Find an …-optimal best item via the …-PAC Best-Item subroutine.
- Partition the remaining items into batches, and play each batch for … times.
- Prune the items with … and merge the rest.
Item-wise survival time determines the resulting sample complexity.
Sample complexity vs. (𝜖, 𝛿): [plots: varying 𝜖; varying 𝛿]
Sample complexity vs. rank-ordered feedback (m): [plot]
Part-III: Learning the (𝜖, 𝛿)-PAC Best Ranking
In 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
Problem Setting: (𝜖, 𝛿)-PAC Ranking.
True ranking: the ordering of the items by their PL scores.
Objective: with probability at least 1 − δ, predict a full ranking (𝝈) that is ϵ-accurate, using the minimum possible number of samples (rounds).
Result Overview: (𝜖, 𝛿)-sample complexity
1. Sample Complexity Lower Bound: For any ϵ, δ, and any algorithm A satisfying label invariance, there exists an instance of the PL model where A requires a sample complexity of at least … rounds.
2. Proposed algorithms take … rounds:
-- Algorithm-1: Beat-the-Pivot
-- Algorithm-2: Score-and-Rank
Again 'independent' of k! Inverse linear dependency on m!
A. Lower Bound
PL instances:
[Figure: true PL instance over arms 0, 1, …, n−1, constructed such that arms 1, …, n/2 are the best (optimal) arms.]
[Figure: alternative PL instances obtained by permuting which n/2 arms are optimal -- exploiting 'Label Invariance'!]
Lower Bound Analysis:
Arm set for the current setting: … . We have …, and we set …; we then show … .
The result follows using the fundamental inequality (Kaufmann et al. 2016).
Kaufmann et al. On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models. JMLR 2016.
B. Proposed Algorithms + Guarantees
-- Algorithm-1: Beat-the-Pivot
-- Algorithm-2: Score-and-Rank
Proposed Algorithm-1: Beat-the-Pivot (BP)
- Find a pivot: run the …-PAC Best-Item subroutine (Saha & Gopalan, ALT'19) to obtain a PAC item 'b'.
- Divide the remaining arms into groups, placing the pivot b in each group.
- Play each group for … times, applying Rank-Breaking to the feedback.
- Compute each arm's estimated pairwise preference against the pivot b, and rank the arms accordingly (output).
Correctness and sample-complexity guarantee.
Theorem: Beat-the-Pivot finds an (𝜖, 𝛿)-PAC Optimal Ranking with sample complexity … .
Proof idea: if …, then for any 𝑏 ∈ [𝑛], … . Can we estimate the pairwise preferences against b with high confidence?
Guarantees on (𝜖, 𝛿)-sample complexity:
Algorithm-1 (BP): …
Algorithm-2 (SaR): …
Comparison with existing dueling-bandit (k = 2, m = 1) results:
- PLPAC-AMPR (Szörényi et al., 2015): … sub-optimality
Again, no existing work for top-m feedback!
Szörényi et al. Online rank elicitation for Plackett-Luce: A dueling bandits approach. NeurIPS 2015.
Experiments: Kendall-Tau ranking loss between the true and the predicted ranking.
[Plot: Kendall-Tau loss vs. sample size]
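The Kendall-Tau ranking loss counts pairwise disagreements between two rankings; the sketch below uses the common "fraction of discordant pairs" normalization, which may differ from the talk's exact constant.

```python
from itertools import combinations

def kendall_tau_loss(sigma, sigma_star):
    """Kendall-Tau ranking loss: fraction of item pairs on which the
    predicted ranking `sigma` and the true ranking `sigma_star`
    disagree on the relative order."""
    pos = {item: r for r, item in enumerate(sigma)}
    pos_star = {item: r for r, item in enumerate(sigma_star)}
    items = list(pos)
    discordant = sum(
        1 for a, b in combinations(items, 2)
        if (pos[a] - pos[b]) * (pos_star[a] - pos_star[b]) < 0
    )
    return discordant / (len(items) * (len(items) - 1) / 2)

print(kendall_tau_loss([1, 0, 2, 3], [0, 1, 2, 3]))  # one swapped pair -> 1/6
```

The loss is 0 for identical rankings and 1 for exactly reversed ones.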
Part-IV: Cost (Regret Minimization)
Part-IV(a): Regret Minimization for the Plackett-Luce Model
Accepted to Neural Information Processing Systems (NeurIPS), 2019.
Problem Setting -- Objectives:
1. Regret w.r.t. the Best Item: …
2. Regret w.r.t. the Top-k Items: …
Result Overview: cumulative regret in T rounds.
A. Lower bound: …
B. We achieved: …
B. Proposed Algorithm (Regret w.r.t. the Best Item)
Algorithm: MaxMin-UCB
- Maintain the empirical pairwise-preference matrix 𝑃̂ = [𝑝̂ij] (n × n) and its entrywise UCB matrix U = [uij] (n × n).
- Max-Min set building rule: use U to build St, the possible set of good items.
- Play St, observe the subset-wise feedback, and recompute 𝑃̂ and U.
Algorithm: MaxMin-UCB -- Set Building Rule
Apply the max-min rule on U to pick the strongest candidate; if St still has empty slots, recurse (up to m times) to fill the rest. That's it!
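A minimal sketch of the ingredients named above: an entrywise UCB on the empirical pairwise-preference matrix, and a max-min pick over it. The confidence-width formula and the single-item max-min rule here are illustrative assumptions, not the paper's exact MaxMin-UCB rule.

```python
import math

def ucb_matrix(wins, plays, t, alpha=0.5):
    """Entrywise UCB of the empirical pairwise-preference matrix:
    u_ij = p_hat_ij + sqrt(alpha * log(t) / n_ij) (an assumed,
    standard-form confidence width)."""
    n = len(wins)
    U = [[0.5] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if plays[i][j] == 0:
                U[i][j] = 1.0          # optimism for unexplored pairs
            else:
                p_hat = wins[i][j] / plays[i][j]
                U[i][j] = p_hat + math.sqrt(alpha * math.log(t) / plays[i][j])
    return U

def maxmin_item(U):
    """Max-min pick: the item whose *worst* UCB entry is largest."""
    n = len(U)
    return max(range(n),
               key=lambda i: min(U[i][j] for j in range(n) if j != i))

wins  = [[0, 60], [40, 0]]           # arm 0 beat arm 1 in 60 of 100 duels
plays = [[0, 100], [100, 0]]
U = ucb_matrix(wins, plays, t=200)
print(maxmin_item(U))                # arm 0 dominates, so it is picked
```

Optimism on unexplored pairs forces every pair to be compared at least once before the estimates take over.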
[Plots: effect of varying subset size k, under full-ranking and winner feedback]
[Plot: winner-regret performance]
[Plot: top-k-regret performance]
Part-IV(b): Regret Minimization for the Pairwise-Subset Choice Model
In 34th International Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
Result Overview: cumulative regret in T rounds.
Pairwise-subset choice (PSC) model -- parameters: a preference matrix P = [P(a,b)] (n × n).
A. Lower bound: …
B. We achieved: … Matching! Thus optimal.
Lower Bound: reducing Dueling Bandits (DB) to Battling Bandits (BB)
[Diagram: the reduction plays two random arms from the battling set St = {a1, …, ak} and feeds the winner of that duel back to BB.]
2. Regret Setting, (a) PSC model -- Proposed Algorithm (using a Dueling-Bandit blackbox):
[Diagram: the DB blackbox proposes a duel (𝒙𝒕, 𝒚𝒕); the algorithm builds St by replicating the pair k/2 times, plays St, and feeds the winner of the battle back to DB.]
[Plots: comparative regret performance on synthetic datasets]
[Plots: comparative regret performance on real datasets]
Future directions…

(1) Other objectives × feedback mechanisms:

Objective \ Feedback    Best-Item   Top-K   Full-Ranking   …   Cascading
Condorcet-winner        All 4       All 4   All 4
Borda-winner            ??          ??      ??
Copeland-winner
…
Top-cycle
Bank-set

(2) With other choice models + objectives: Plackett-Luce, Multinomial Probit, Mallows model, Nested GEV, …; Contextual; …
More future work: Revenue Maximization (item prices / budgets)
Problem Setup:
- Every item i is priced at 𝑟𝑖.
- Every round, choose a set of items St ⊆ [𝑛] such that … .
- Modeling: users choose according to the Plackett-Luce model -- parameters θ_i, with P(i | S) = θ_i / Σ_{j∈S} θ_j for any subset S.
Objective: revenue maximization.
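Under a PL choice model, a natural one-round objective is the expected revenue of the offered set. The sketch below assumes the buyer always picks one item from the offered set (no outside option), which may differ from the talk's exact setup with budgets.

```python
def expected_revenue(theta, prices, S):
    """Expected one-round revenue when the user buys item i from the
    offered set S with PL probability theta[i] / sum_{j in S} theta[j]
    (a common assortment-optimization objective)."""
    z = sum(theta[j] for j in S)
    return sum(prices[i] * theta[i] / z for i in S)

theta  = [1.0, 0.5, 0.25]    # PL scores (popularity)
prices = [2.0, 4.0, 8.0]     # item prices r_i

print(expected_revenue(theta, prices, [0, 1]))   # (1*2 + 0.5*4)/1.5  ≈ 2.667
print(expected_revenue(theta, prices, [1, 2]))   # (0.5*4 + 0.25*8)/0.75 ≈ 5.333
```

The tension is visible already here: adding a popular cheap item can lower expected revenue by cannibalizing choices of pricier items.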
Thank You!
Questions?