
Machine Learning in the Bandit Setting

Algorithms, Evaluation, and Case Studies

Lihong Li

Machine Learning, Yahoo! Research

SEWM, 2012-05-25


[Diagram: the learning loop. DATA is turned into KNOWLEDGE (e.g., E = mc²) by statistics, ML, and data mining; knowledge yields UTILITY through ACTIONs, which in turn generate MORE DATA. Reinforcement Learning spans this whole loop.]

Outline

Introduction

Basic Solutions

Advanced algorithms

Advanced Offline Evaluation

Conclusions


Yahoo!-User Interaction

[Diagram: the interaction loop between Yahoo! and a user.]
• ACTION: ads, news, ranking, …
• REWARD: click, conversion, revenue, …
• CONTEXT: gender, age, …
• POLICY: the serving strategy

Goal: maximize total REWARD by optimizing the POLICY

Today Module @ Yahoo! Front Page

A small pool of articles chosen by editors

"Featured Article"

Objectives and Challenges

• Objectives
  • (informally) choose the most interesting articles for individual users
  • (formally) maximize click-through rate (CTR)
• Challenges
  • Dynamic content pool → fast learning
  • Sparse user visits → transfer interests among users
  • Partial user feedback → efficient explore/exploit

Challenge: Explore/Exploit

• Observation: only displayed articles get user click feedback

• EXPLOIT (choose good articles) vs. EXPLORE (choose novel articles), both based on article CTR estimates
• How to trade off?
  • … with dynamic article pools
  • … while considering user interests

Insufficient Exploration Example

• One arm always pays $5/round
• Another arm pays $100 a quarter of the time (so $25/round on average)

[Figure: payoffs over rounds 1-8. Observed by a player who explores too little: $5, $5, $0, $0, $0, $5, $5, $5. It turns out… $100, $100, $100, $0, $0, $5, $5, $5: the $100 arm, abandoned after a few unlucky $0 pulls, was actually the better one.]

Contextual Bandit Formulation

Multi-armed contextual bandit [LZ'08]. For $t = 1, 2, \ldots$:

• Observe $K$ arms $A_t$ and "context" $x_t \in \mathbb{R}^d$
• Select $a_t \in A_t$
• Receive reward $r_{t,a_t} \in [0,1]$
• $t \leftarrow t + 1$

Formally, we want to maximize $\sum_{t=1}^{T} r_{t,a_t}$.

In Today Module:
• $A_t$: available articles at time $t$
• $x_t$: user features (age, gender, interests, ...)
• $a_t$: the displayed article at time $t$
• $r_{t,a_t}$: 1 for click, 0 for no click
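To make the protocol concrete, here is a minimal simulation sketch in Python; the arm count, feature dimension, and synthetic click model are illustrative assumptions, not the Today Module system.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 5, 6, 10_000            # arms, context dimension, rounds (illustrative)
theta = rng.normal(size=(K, d))   # hidden per-arm parameters of a synthetic click model

def click_prob(a, x):
    """Synthetic E[r | x, a] via a logistic link (an assumption for this demo)."""
    return 1.0 / (1.0 + np.exp(-x @ theta[a]))

total_reward = 0.0
for t in range(T):
    x_t = rng.normal(size=d)            # observe context x_t
    arms = list(range(K))               # observe the available arms A_t
    a_t = int(rng.choice(arms))         # placeholder policy: pick uniformly at random
    r_t = float(rng.random() < click_prob(a_t, x_t))  # reward revealed only for a_t
    total_reward += r_t                 # rewards of the other arms stay unobserved

print("average reward:", total_reward / T)
```

The bandit algorithms discussed below replace the uniformly random choice of `a_t` with smarter explore/exploit rules.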

Another Example – Display Ads

• $A_t$: eligible ads in the current page view
• $x_t$: page/user features
• $a_t$: the displayed ad(s)
• $r_{t,a_t}$: revenue ($) if clicked/converted, 0 otherwise

Yet Another Example - Ranking

• $A_t$: possible document rankings for query $q_t$
• $x_t$: query/document features
• $a_t$: the displayed ranking for query $q_t$
• $r_{t,a_t}$: 1 if the session succeeds, 0 otherwise

Related Work

• Standard information retrieval and collaborative filtering
  • Also concerned with (personalized) recommendation
  • But with (almost) static users/items:
    • training is often done in batch/offline mode
    • no need for online exploration
• Full reinforcement learning
  • More general: includes bandit problems as special cases
  • Needs to tackle "temporal credit assignment"

Outline

Introduction

Basic Solutions
› Algorithms

› Evaluation

› Experiments

Advanced algorithms

Advanced Offline Evaluation

Conclusions


Prior Bandit Algorithms

• Regret minimization (Herbert Robbins, Tze Leung Lai): the focus of this talk
• Bayesian optimal solution (John Gittins)

Traditional K-armed Bandits

Assumption: CTR (click-through rate) not affected by user features

[Figure: three articles with CTR_1 ≈ μ_1, CTR_2 ≈ μ_2, CTR_3 ≈ μ_3.]

CTR estimates: $\hat\mu_a$ = #clicks / #impressions

ε-greedy:
• with probability 1 − ε: choose article $\arg\max_a \hat\mu_a$
• with probability ε: choose a random article

UCB1: choose article $\arg\max_a \left\{ \hat\mu_a + \alpha / \sqrt{N_a} \right\}$

The more article $a$ has been displayed, the less uncertainty in CTR_a.

No contexts → no personalization.
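A minimal sketch of the two rules above on a simulated context-free bandit. The Bernoulli reward setup and parameter values are illustrative assumptions, and the UCB bonus follows the slide's $\alpha/\sqrt{N_a}$ form rather than the textbook $\sqrt{2\ln t / N_a}$ term.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 3, 50_000
true_ctr = np.array([0.03, 0.05, 0.08])     # unknown to the algorithm

def run(rule, alpha=0.1, eps=0.1):
    clicks = np.zeros(K)                    # per-arm click counts
    pulls = np.zeros(K)                     # per-arm impression counts N_a
    total = 0.0
    for t in range(T):
        mu_hat = clicks / np.maximum(pulls, 1)   # CTR estimates = #clicks / #impressions
        if rule == "eps-greedy":
            a = int(rng.integers(K)) if rng.random() < eps else int(np.argmax(mu_hat))
        else:                               # UCB1-style: optimism bonus shrinks with N_a
            a = int(np.argmax(mu_hat + alpha / np.sqrt(np.maximum(pulls, 1))))
        r = float(rng.random() < true_ctr[a])
        pulls[a] += 1
        clicks[a] += r
        total += r
    return total / T

print("eps-greedy:", run("eps-greedy"))
print("UCB1      :", run("ucb"))
```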

Contextual Bandit Algorithms

• EXP4 [ACFS'02], EXP4.P [BLLRS'11], elimination [ADKLS'12]
  • Strong theoretical guarantees
  • But computationally expensive
• Epoch-greedy [LZ'08]
  • Similar to ε-greedy
  • Simple, general, and less expensive
  • But not the most effective
• This talk: algorithms with compact, parametric models
  • Both efficient and effective
  • Extension of UCB1 to linear models
  • … and to generalized linear models
  • Randomized algorithm with Thompson sampling

LinUCB: UCB for Linear Models

• Linear model assumption: $E[r_a \mid x] = x^\top \theta_a$
• Standard least-squares ridge regression:
  $\hat\theta_a = (D_a^\top D_a + I)^{-1} D_a^\top c_a$, where $D_a$ stacks the observed contexts $x_1^\top, x_2^\top, \ldots$ as rows, $c_a$ stacks the corresponding rewards $r_1, r_2, \ldots$, and $A_a := D_a^\top D_a + I$
• Reward prediction for a new user: $x^\top \hat\theta_a \approx x^\top \theta_a$
• Whether to explore requires quantifying parameter uncertainty. With high probability, the prediction error satisfies
  $\left| x^\top \hat\theta_a - x^\top \theta_a \right| \le \alpha \sqrt{x^\top A_a^{-1} x}$,
  where $\sqrt{x^\top A_a^{-1} x}$ measures how "dissimilar" $x$ is to previous users.

LinUCB: UCB for Linear Models (II)

With high probability: $\left| x^\top \hat\theta_a - x^\top \theta_a \right| \le \alpha \sqrt{x^\top A_a^{-1} x}$

LinUCB always selects an arm with the highest UCB:
$a^* = \arg\max_a \left\{ x^\top \hat\theta_a + \alpha \sqrt{x^\top A_a^{-1} x} \right\}$
(the first term is to exploit; the second is to explore)

Recall UCB1: $a^* = \arg\max_a \left\{ \hat\mu_a + \alpha / \sqrt{N_a} \right\}$

LinRel [Auer 2002] works similarly but in a more complicated way.
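A compact LinUCB sketch following the formulas above, with disjoint linear models per arm and the ridge parameter fixed at 1. The usage loop and its placeholder reward are illustrative assumptions.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model (A_a, b_a) per arm."""
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # A_a = D_a^T D_a + I
        self.b = [np.zeros(d) for _ in range(n_arms)]  # b_a = D_a^T c_a

    def select(self, x):
        ucbs = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta_hat = A_inv @ b_a                    # ridge estimate of theta_a
            ucbs.append(x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))                    # exploit + explore

    def update(self, a, x, r):
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x

# Illustrative usage with random contexts and a placeholder reward signal.
rng = np.random.default_rng(3)
bandit = LinUCB(n_arms=4, d=6, alpha=0.5)
for _ in range(1000):
    x = rng.normal(size=6)
    a = bandit.select(x)
    r = float(rng.random() < 0.05)   # in a real system, the observed click for arm a
    bandit.update(a, x, r)
```

In practice one maintains $A_a^{-1}$ incrementally (for example, with the Sherman-Morrison update) instead of inverting $A_a$ at every step.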

Outline

Introduction

Basic Solutions
› Algorithms

› Evaluation

› Experiments

Advanced algorithms

Advanced Offline Evaluation

Conclusions


Evaluation of Bandit Algorithms

Goal: estimate the average reward of running π on i.i.d. contexts $x$

• Static π: $V(\pi) := E_x\left[ r\big(x, \pi(x)\big) \right]$
• Adaptive π: $V(\pi, T) := \frac{1}{T} E\left[ \sum_{t=1}^{T} r\big(x_t, \pi(x_t, h_t)\big) \right]$

Gold standard:
• Run π in the real system and see how well it works
• … but expensive and risky

Offline Evaluation

• Benefits
  • Cheap and risk-free!
  • Avoids frequent bucket tests
  • Replicable / fair comparisons
• Common in non-interactive learning problems (e.g., classification)
  • Benchmark data organized as (input, label) pairs
• … but not straightforward for interactive learning problems
  • Data in bandits usually consist of (context, arm, reward) triples
  • No reward signal for any other arm $a' \neq a$

Common/Prior Evaluation Approaches

[Diagram: logged data $\{(x_1, a_1, r_1), \ldots, (x_L, a_L, r_L)\}$ → fit a reward simulator $\hat r(x, a) \approx E[r \mid x, a]$ by classification / regression / density estimation → evaluate the bandit algorithm π on simulated rewards.]

• This (difficult) modeling step is often biased → unreliable evaluation
• In contrast, our approach
  • avoids explicit user modeling → simple
  • gives unbiased evaluation results → reliable

Our Evaluation Method: "Replay"

Want to estimate $V(\pi) := E_x\left[ r\big(x, \pi(x)\big) \right]$

Logged data: $\{(x_1, a_1, r_1), \ldots, (x_L, a_L, r_L)\}$
Key requirement for data collection: $a_i \sim \mathrm{unif}(A)$ (logged arms chosen uniformly at random)

For $i = 1, 2, \ldots, L$:
• reveal $x_i$ to the bandit algorithm π
• π chooses $\hat a_i = \pi(x_i)$
• reveal $r_i$ to π only if $\hat a_i = a_i$ (a "match")

Finally, output $\hat V = \frac{K}{L} \sum_{i=1}^{L} r_i \cdot I(\hat a_i = a_i)$
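A sketch of the replay estimator for a static policy π, assuming the logged arms were drawn uniformly at random from $K$ arms. The data layout and the toy log are illustrative.

```python
import numpy as np

def replay_evaluate(policy, log, K):
    """
    Replay offline evaluator.
    policy: callable x -> chosen arm
    log:    iterable of (x, a, r) triples with a drawn uniformly from K arms
    Returns V_hat = (K / L) * sum_i r_i * I(policy(x_i) == a_i).
    """
    L, total = 0, 0.0
    for x, a, r in log:
        L += 1
        if policy(x) == a:        # a "match": the logged reward is revealed
            total += r
    return K * total / L

# Illustrative check on a fake uniformly random log.
rng = np.random.default_rng(2)
K, L = 5, 100_000
fake_log = [(rng.normal(size=3), int(rng.integers(K)), float(rng.random() < 0.05))
            for _ in range(L)]
print(replay_evaluate(lambda x: 0, fake_log, K))   # should be close to 0.05
```

For an adaptive (learning) algorithm, each matched $(x_i, r_i)$ pair would also be fed back to the algorithm before moving on, exactly as in the loop above.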

Theoretical Guarantees

Thm 1: Our estimator is unbiased. Mathematically, $V(\pi) = E[\hat V]$.
So $\hat V$ on average reflects real, online performance.

Thm 2: The estimation error goes to 0 with more data. Mathematically, $\left| V(\pi) - \hat V \right| = O\big(\sqrt{K/L}\big)$.
So accuracy is guaranteed with a large volume of data.

Case Study in Today Module [LCLW'11]

Data:
› Large volume of real user traffic in Today Module

Policies being evaluated:
› EMP [ACE'09]
› SEMP/CEMP: personalized EMP variants
› Use the policies' online bucket CTR as "truth"

Random-bucket data for evaluation:
› 40M visits, K ≈ 20 on average
› Use it to offline-evaluate the policies' CTR

Unbiasedness (Article nCTR)

[Scatter plot: estimated nCTR (offline) against recorded online nCTR, per article. Are they close?]

The offline estimate is indeed unbiased!

Unbiasedness (Daily nCTR)

[Plot: recorded online nCTR and estimated nCTR over ten days in November 2009.]

The offline estimate is indeed unbiased!

Estimation Error

[Plot: nCTR estimation error against the number of data $L$; the error decays roughly like $1/\sqrt{L}$.]

Recall our theoretical error bound, Thm 2: $\left| V(\pi) - \hat V \right| = O\big(\sqrt{K/L}\big)$

Unbiased Offline Evaluation: Recap

What we have shown:
› A principled method for benchmark data collection
› which allows reliable/unbiased evaluation
› of any bandit algorithm

Analogue: UCI, Caltech101, … datasets for supervised learning

The first such benchmark was released by Yahoo!:
http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

2nd and 3rd versions are available for the PASCAL2 Challenge
› ICML 2012 workshop

Outline

Introduction

Basic Solutions
› Algorithms

› Evaluation

› Experiments

Advanced algorithms

Advanced Offline Evaluation

Conclusions


Experiment Setup: Architecture

• Model updated every 5 minutes
• Main metric: overall normalized CTR in the deployment bucket
  • nCTR = CTR × secretNumber (to protect sensitive business information)

[Bucket diagram: a "Learning Bucket" serves 5% of traffic (where explore/exploit happens); a "Deployment Bucket" serves 95% of traffic (exploitation only).]

Experiment Setup: Data

• May 1, 2009 data for parameter tuning
• May 3-9, 2009 data for performance evaluation (33M visits)
• Number of candidate articles per user visit is about 20
• Dimension reduction on user features [CBP+'09]: 6 features
• Linear model: $E[r_{t,a} \mid x_{t,a}] = x_{t,a}^\top \theta_a$
• Data available from Yahoo! Research's Webscope program:
  http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

CTR in Deployment Bucket [LCLS'10]

[Bar chart: deployment-bucket CTR of each algorithm, including a feature-free baseline and a "cheating" policy.]

• UCB-type algorithms do better than their ε-greedy counterparts
• CTR improved significantly when features/contexts are considered

Article CTR Lift

[Scatter plot: article CTR lift for no-context vs. linear-model algorithms; "+" marks ε-greedy, "o" marks UCB.]

Outline

Introduction

Basic Solutions

Advanced algorithms
› Hybrid linear models

› Generalized linear models

› Thompson sampling

› Theory

Advanced Offline Evaluation

Conclusions

LinUCB for Hybrid Linear Models

Previous assumption: $E[r_a \mid x] = x^\top \theta_a$
New assumption: $E[r_a \mid x] = x^\top \theta_a + z_a^\top \beta$
• $\theta_a$: article-specific information (e.g., Californian males like this article)
• $\beta$: information shared by all articles (e.g., teens like articles about Harry Potter)

• Advantage: learns faster when there are few data
• Challenge: seems to require unbounded computational complexity
• Good news! Efficient implementation is made possible by block matrix manipulations

Overall CTR in Deployment Bucket

[Plot: deployment-bucket CTR over time; the early gap shows the advantage of the hybrid model.]

• UCB-type algorithms do better than e-greedy counterparts

• CTR improved significantly when features/contexts are considered

• Hybrid model is better when data are scarce

Outline

Introduction

Basic Solutions

Advanced algorithms
› Hybrid linear models

› Generalized linear models

› Thompson sampling

› Theory

Advanced Offline Evaluation

Conclusions


Extensions to GLMs

Linear models are unnatural for binary events: $E[r_a \mid x] = x^\top \theta_a$

Generalized linear models (GLMs): $E[r_a \mid x] = g^{-1}\big(x^\top \theta_a\big)$, where the "inverse link function" $g^{-1}: \mathbb{R} \to [0,1]$

• Logistic regression: $E[r_a \mid x] = \frac{1}{1 + \exp(-x^\top \theta_a)}$ (the logistic function of $x^\top \theta_a$)
• Probit regression: $E[r_a \mid x] = \Phi\big(x^\top \theta_a\big)$ ($\Phi$: CDF of the standard Gaussian)

Model Fitting in GLMs

• Maintain a Gaussian posterior $N(\mu_a, \Sigma_a)$ over the parameter $\theta_a$
• Update with Bayes' rule on new data $(x, r)$:
  $p(\theta_a) \propto N(\theta_a; \mu_a, \Sigma_a) \cdot \left(1 + \exp\big(-(2r-1)\, x^\top \theta_a\big)\right)^{-1}$
  (current posterior × likelihood)
• Apply the Laplace approximation to obtain the new posterior $N(\mu_a', \Sigma_a')$
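A minimal sketch of this update for logistic rewards: a few Newton steps locate the MAP, and the inverse Hessian there provides the new covariance. The prior, the step count, and the single-observation batch are illustrative assumptions.

```python
import numpy as np

def laplace_update(mu, Sigma, x, r, n_newton=10):
    """
    One Bayesian logistic-regression update via the Laplace approximation.
    Prior N(mu, Sigma); observation (x, r) with r in {0, 1}; y = 2r - 1.
    Returns (mu_new, Sigma_new), the Gaussian approximation of the posterior.
    """
    y = 2 * r - 1
    Sigma_inv = np.linalg.inv(Sigma)
    theta = np.array(mu, dtype=float)
    for _ in range(n_newton):                         # Newton ascent on the log posterior
        s = 1.0 / (1.0 + np.exp(-y * (x @ theta)))    # sigmoid(y * x^T theta)
        grad = -Sigma_inv @ (theta - mu) + (1 - s) * y * x
        hess = -Sigma_inv - s * (1 - s) * np.outer(x, x)
        theta = theta - np.linalg.solve(hess, grad)
    s = 1.0 / (1.0 + np.exp(-y * (x @ theta)))        # curvature at the MAP
    Sigma_new = np.linalg.inv(Sigma_inv + s * (1 - s) * np.outer(x, x))
    return theta, Sigma_new

# Illustrative usage: start from a unit-Gaussian prior and fold in one click.
mu0, Sigma0 = np.zeros(6), np.eye(6)
mu1, Sigma1 = laplace_update(mu0, Sigma0, x=0.1 * np.ones(6), r=1)
```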

UCB Heuristics for GLMs

• Use the posterior $N(\mu_a, \Sigma_a)$ to derive (approximate) upper confidence bounds on $E[r_a \mid x_a]$ [LCLMW'12]:
  › linear: $x_a^\top \mu_a + \alpha \sqrt{x_a^\top \Sigma_a x_a}$
  › logistic: $\dfrac{1 + \alpha\left(\exp\big(\sqrt{x_a^\top \Sigma_a x_a}\big) - 1\right)}{1 + \exp\big(-x_a^\top \mu_a\big)}$
  › probit: $\Phi\big(x_a^\top \mu_a + \alpha \sqrt{x_a^\top \Sigma_a x_a}\big)$

Experiment Setup

• One week of data from June 2009 (34M user visits)
• About 20 candidate articles per user visit
• Features: 20 features obtained by PCA on raw binary user features
• Model updated every 5 minutes
• Main metric: overall (normalized) CTR in the deployment bucket

[Bucket diagram: "Learning Bucket" (5% of traffic, where explore/exploit happens) and "Deployment Bucket" (95% of traffic, exploitation only).]

GLM Comparisons

[Bar charts: deployment-bucket CTR under ε-greedy exploration and under UCB exploration, for linear, logistic, and probit models.]

Obs #1: active exploration is necessary
Obs #2: logistic/probit > linear
Obs #3: UCB > ε-greedy

Outline

Introduction

Basic Solutions

Advanced algorithms
› Hybrid linear models

› Generalized linear models

› Thompson sampling

› Theory

Advanced Offline Evaluation

Conclusions

Limitations of UCB Exploration

• Exploration can be too much: may explore the whole space exhaustively
• Difficult to use prior knowledge
• Exploration is deterministic: poor performance when rewards are delayed
• Deriving an (approximate) UCB is not always easy

Thompson Sampling (1933)

Algorithmic idea: "probability matching"
  Pr(a | x) = Pr(a is optimal for x)

• Randomized action selection (by definition)
• More robust to reward delay
• Straightforward to implement [CL'12]:
  › Maintain the parameter posterior: $\Pr(\theta_a \mid D)$
  › Draw random models: $\tilde\theta_a \sim \Pr(\theta_a \mid D)$
  › Act accordingly: $a(x) = \arg\max_a f(x, a; \tilde\theta_a)$
• Easily combined with other (non-)parametric models
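A sketch of this selection rule on top of per-arm Gaussian posteriors, such as those produced by the Laplace update sketched earlier; the logistic reward model and the toy setup are assumptions.

```python
import numpy as np

def thompson_select(x, posteriors, rng):
    """
    posteriors: dict arm -> (mu_a, Sigma_a), a Gaussian over theta_a.
    Draw one random model per arm, then act greedily w.r.t. the draws.
    """
    best_arm, best_score = None, -np.inf
    for a, (mu_a, Sigma_a) in posteriors.items():
        theta_tilde = rng.multivariate_normal(mu_a, Sigma_a)  # theta~_a ~ Pr(theta_a | D)
        score = 1.0 / (1.0 + np.exp(-x @ theta_tilde))        # f(x, a; theta~_a), logistic
        if score > best_score:
            best_arm, best_score = a, score
    return best_arm

# Illustrative usage with flat priors over 4 arms.
rng = np.random.default_rng(4)
posteriors = {a: (np.zeros(6), np.eye(6)) for a in range(4)}
print(thompson_select(rng.normal(size=6), posteriors, rng))
```

After the reward is observed, only the chosen arm's $(\mu_a, \Sigma_a)$ is updated and the loop repeats.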

Thompson Sampling

• One week of data from Today Module on Yahoo!'s front page
• Logistic regression with Gaussian posteriors

Obs #1: TS is uniformly competitive
Obs #2: TS is more robust to reward delay

Outline

Introduction

Basic Solutions

Advanced algorithms
› Hybrid linear models

› Generalized linear models

› Thompson sampling

› Theory

Advanced Offline Evaluation

Conclusions

Regret-based Competitive Analysis

$\mathrm{Regret}(T) = E\left[\sum_{t=1}^{T} r_{t,a_t^*}\right] - E\left[\sum_{t=1}^{T} r_{t,a_t}\right]$

The first term is the best we could do if we knew all $\theta_a$; the second is what the algorithm achieves.

An algorithm "learns" if $\mathrm{Regret}(T) = O(T^\alpha)$ with $\alpha < 1$.
An algorithm "learns fast" if $\alpha$ is small.

Regret Bounds

• LinUCB [CLRS'11]: $O\big(\sqrt{KdT}\big)$, with a matching lower bound
  › The average reward converges to optimal at the rate $O\big(\sqrt{Kd/T}\big)$.
  › Example: $K = 20$, $d = 50$, $T = 10\mathrm{M}$, so $\sqrt{Kd/T} = 0.01$
• Generalized LinUCB: still open
  › A variant [FCGSz'11]: $O\big(d\sqrt{T}\big)$
• Thompson sampling
  › A variant [L'12]: $O\big(K^{1/3} T^{2/3}\big)$

Outline

Introduction

Basic Solutions

Advanced algorithms

Advanced Offline Evaluation
› Importance weighting

› Doubly robust technique

Conclusions


Extensions

Uniformly random data are sometimes a luxury…
› system/cost constraints, user-experience considerations, …

• A randomized log suffices (by importance weighting)
• Variance reduction with the "doubly robust" technique [DLL'11]
• Better bias/variance tradeoff by soft rejection sampling [DDLL'12]

Offline Evaluation with Non-Uniform Data

Key idea: importance reweighting.

$V(\pi) = E_{(x,r) \sim D}\left[ r_{\pi(x)} \right] = E_{(x,r) \sim D}\left[ \sum_a r_a \cdot I(\pi(x) = a) \right] = E_{(x,r) \sim D}\left[ \sum_a \frac{r_a \cdot I(\pi(x) = a)}{p(a \mid x)}\, p(a \mid x) \right] = E_{(x,r) \sim D,\ a \sim p}\left[ \frac{r_a \cdot I(\pi(x) = a)}{p(a \mid x)} \right]$

We can use a weighted empirical average with an estimated $\hat p(a \mid x)$:

$\hat V = \frac{1}{|S|} \sum_{(x,a,r_a) \in S} \frac{r_a \cdot I(\pi(x) = a)}{\max\{\hat p(a \mid x),\, \tau\}} \approx V(\pi)$

τ controls the bias/variance trade-off [SLLK 2011]
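A sketch of the truncated importance-weighted estimator above. The log format is an assumption, and the propensity estimates $\hat p(a \mid x)$ are taken as already computed.

```python
import numpy as np

def ips_estimate(policy, log, tau=0.01):
    """
    Truncated inverse-propensity-score estimate of V(pi).
    policy: callable x -> arm
    log:    iterable of (x, a, r, p_hat), where p_hat estimates p(a | x),
            the probability that the logging policy chose arm a given x.
    tau:    truncation level; controls the bias/variance trade-off.
    """
    values = [r * (policy(x) == a) / max(p_hat, tau) for x, a, r, p_hat in log]
    return float(np.mean(values))
```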

Results in Today Module Data [SLLK’11]


Outline

Introduction

Basic Solutions

Advanced algorithms

Advanced Offline Evaluation
› Importance weighting

› Doubly robust technique

Conclusions


Doubly Robust Estimation

Importance-weighted formula:

$V(\pi) = E_{(x,r) \sim D,\ a \sim p}\left[ \frac{r_a \cdot I(\pi(x) = a)}{p(a \mid x)} \right] \approx \frac{1}{|S|} \sum_{(x,a,r_a) \in S} \frac{r_a \cdot I(\pi(x) = a)}{\max\{\hat p(a \mid x),\, \tau\}}$

This estimate has high variance if $p(a \mid x)$ is small.

Doubly robust technique, using a reward model $\hat r(x, a)$:

$\hat V_{\mathrm{DR}} = \frac{1}{|S|} \sum_{(x,a,r_a) \in S} \left[ \frac{\big(r_a - \hat r(x, a)\big) \cdot I(\pi(x) = a)}{\max\{\hat p(a \mid x),\, \tau\}} + \hat r\big(x, \pi(x)\big) \right]$

• Unbiased if either $\hat r$ or $\hat p$ is correct
• The DR estimate usually decreases variance [DLL'11]
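A sketch of the doubly robust estimator above, assuming a reward model $\hat r(x, a)$ is available (for example, fitted by regression on the log):

```python
import numpy as np

def dr_estimate(policy, log, r_hat, tau=0.01):
    """
    Doubly robust estimate of V(pi).
    log:   iterable of (x, a, r, p_hat) as in the IPS sketch above.
    r_hat: callable (x, a) -> estimated reward E[r | x, a].
    Unbiased if either r_hat or the propensities p_hat are correct.
    """
    values = []
    for x, a, r, p_hat in log:
        pi_a = policy(x)
        correction = (r - r_hat(x, a)) * (pi_a == a) / max(p_hat, tau)
        values.append(r_hat(x, pi_a) + correction)   # model baseline + IPS correction
    return float(np.mean(values))
```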

Multiclass Classification

K-class classification as a K-armed bandit:

$(x, c) \;\Rightarrow\; (x, r_1, r_2, \ldots, r_K)$, where $r_a = 0$ if $a = c$ and $r_a = 1$ otherwise

Training data:
› In the usual (non-bandit) setting, $D = \{x_i, c_i\}_{i=1,2,\ldots,m}$
› In the bandit setting, $D = \{x_i, a_i, p_i, r_{i,a_i}\}_{i=1,2,\ldots,m}$

[Figure: an $m \times K$ loss matrix with $r_{ij}$ in entry $(i, j)$. In the usual setting the whole row is observed; in the bandit setting only the chosen arm's entry per row is observed, the rest are unobserved.]
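A sketch of this conversion: from a fully labeled multiclass dataset, bandit feedback is simulated by logging one uniformly random arm per example and recording only that arm's loss. The function and argument names are illustrative.

```python
import numpy as np

def to_bandit_log(X, labels, K, seed=0):
    """
    Turn multiclass data {(x_i, c_i)} into bandit data {(x_i, a_i, p_i, r_i)}.
    Convention from the slide: r_a = 0 if a == c, 1 otherwise (a loss).
    """
    rng = np.random.default_rng(seed)
    log = []
    for x, c in zip(X, labels):
        a = int(rng.integers(K))       # logged arm, uniform over the K classes
        p = 1.0 / K                    # its logging probability
        r = 0.0 if a == c else 1.0     # only this entry of the loss row is observed
        log.append((x, a, p, r))
    return log
```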

Experimental Results on UCI Datasets

› Split the data 50/50 into training (fully labeled) and testing (partially labeled) sets
› Train π on the training data, evaluate π on the test data
› Repeated 500 times

Outline

Introduction

Basic Solutions

Advanced algorithms

Advanced Offline Evaluation

Conclusions


Conclusions

• Contextual bandits as a principled formulation for
  • news article recommendation
  • Internet advertising
  • web search
  • ...
• An offline evaluation method for bandit algorithms
  • unbiased
  • accurate compared with online bucket results
• Encouraging results in significant applications
  • strong performance of UCB/TS exploration

Future Work

• Offline evaluation
  • Better use of non-uniform data
  • Extension to full reinforcement learning
• Use of prior knowledge
• Variants of bandits
  • Bandits with budgets
  • Bandits with many arms
  • Bandits with multiple objectives
  • Bandits with submodular rewards
  • Bandits with delayed reward observations
  • …

References

Offline policy evaluation:
› [LCLW] Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. WSDM, 2011.
› [SLLK] Learning from logged implicit exploration data. NIPS, 2010.
› [DLL] Doubly robust policy evaluation and learning. ICML, 2011.
› [DDLL] Sample-efficient nonstationary-policy evaluation for contextual bandits. Under review.

Bandit algorithms:
› [LCLS] A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
› [CLRS] Contextual bandits with linear payoff functions. AISTATS, 2011.
› [BLLRS] Contextual bandit algorithms with supervised learning guarantees. AISTATS, 2011.
› [CL] An empirical evaluation of Thompson sampling. NIPS, 2011.
› [LCLMW] Unbiased offline evaluation of contextual bandit algorithms with generalized linear models. JMLR W&CP, 2012.

Thank You!