An Introduction to Contextual Bandits Algorithm
Ph.D. Candidate: Qing Wang, 2016, Florida International University


Page 1

An Introduction to Contextual Bandits Algorithm

Ph.D. Candidate: Qing Wang, 2016, Florida International University

Page 2

Outline

§ Introduction
§ Motivation
§ Context-free Bandit Algorithms
§ Contextual Bandit Algorithms
§ Our Work
  § Ensemble Contextual Bandits for Personalized Recommendation
  § Personalized Recommendation via Parameter-Free Contextual Bandits
§ Future Work
§ Q&A

Page 3

What is Personalized Recommendation?

§ Personalized recommendation helps users find interesting items based on each user's individual interests.
§ Ultimate goal: maximize user engagement.

Page 4

What is the Cold Start Problem?

§ We do not have enough observations for new items or new users.
§ How do we predict users' preferences if we do not have data?
§ Many practical issues with offline data:
  § Historical user log data is biased.
  § User interests may change over time.

Page 5

Approach: Multi-armed Bandit Algorithm

§ A gambler walks into a casino.
§ A row of slot machines provides random rewards.

Objective: maximize the sum of rewards (money)!

Page 6

Example: News Personalization

§ Recommend news based on users' interests.
§ Goal: maximize the user's click-through rate (CTR).

[1] Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.

Page 7

Example: News Personalization

§ There are a bunch of articles in the news pool.
§ Users arrive sequentially, ready to be served.

(Figure: the pool of news articles.)

[1] Zhou Li, "News personalization with multi-armed bandits".

Page 8

Example: News Personalization

§ At each time step, we want to select one article for the user.

(Figure: the MAB picks article 1 from the news articles and asks the user: "Like it?")

Page 9

Example: News Personalization

§ Goal: maximize CTR.

(Figure: the MAB recommends article 1 from the news articles; "Like it?" - the user answers "Not really!")

Page 10

Example: News Personalization

§ Update the model with the user's feedback.

(Figure: the user's feedback on article 1 ("Not really!") is sent back to the MAB.)

Page 11

Example: News Personalization

§ Update the model once the feedback is given.

(Figure: the MAB now recommends article 2; "Like it?" - the user answers "Yeah!")

Page 12

Example: News Personalization

§ Update the model once the feedback is given.

(Figure: the user's feedback on article 2 ("Yeah!") is sent back to the MAB.)

How about articles 3, 4, 5, ...?

Page 13

Multi-armed Bandit Definition

§ The MAB problem is a classical paradigm in machine learning in which an online algorithm chooses from a set of strategies in a sequence of trials so as to maximize the total payoff of the chosen strategies. [1]

[1] http://research.microsoft.com/en-us/projects/bandits/

Page 14

Application: Clinical Trial

§ Two treatments with unknown effectiveness.

[1] Einstein, A., B. Podolsky, and N. Rosen, 1935, "Can quantum-mechanical description of physical reality be considered complete?", Phys. Rev. 47, 777-780.

Page 15

Web Advertising

§ Where to place the ad?

[1] Tang L, Rosales R, Singh A, et al. "Automatic ad format selection via contextual bandits." Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2013: 1587-1594.

Page 16

Playing Golf with Multiple Balls

[1] Dumitriu, Ioana, Prasad Tetali, and Peter Winkler. "On playing golf with two balls." SIAM Journal on Discrete Mathematics 16.4 (2003): 604-615.

Page 17

Multi-Agent System

§ K agents tracking N (N > K) targets.

[1] Ny, Jerome Le, Munther Dahleh, and Eric Feron. "Multi-agent task assignment in the bandit framework." Decision and Control, 2006. 45th IEEE Conference on. IEEE, 2006.

Page 18

Some Jargon Terms [1]

§ Arm: one idea/strategy
§ Bandit: a group of ideas (strategies)
§ Pull/Play/Trial: one chance to try your strategy
§ Reward: the unit of success we measure after each pull
§ Regret: the performance metric

[1] Bandit Algorithms for Website Optimization: Developing, Deploying, and Debugging. John Myles White, O'Reilly Media, 2012.

Page 19

K-Armed Bandit

§ Each arm $a$:
  § wins (reward = 1) with fixed (unknown) probability $\mu_a$;
  § loses (reward = 0) with fixed (unknown) probability $1 - \mu_a$.
§ All draws are independent given $\mu_1, \dots, \mu_k$.
§ How do we pull arms to maximize the total reward? (We must estimate each arm's probability of winning, $\mu_a$.)

[1] CS246: Mining Massive Data Sets, 2015, Stanford University

Page 20

Model of the K-Armed Bandit

§ Set of $k$ choices (arms).
§ Each choice $a$ is associated with an unknown probability distribution $P_a$ supported on $[0, 1]$.
§ We play the game for $T$ rounds.
§ In each round $t$:
  § we pick some arm $j$;
  § we obtain a random sample $X_t$ from $P_j$ (the reward is independent of previous draws).
§ Goal: maximize $\sum_{t=1}^{T} X_t$ (without knowing the $\mu_a$).
§ However, every time we pull some arm $a$ we get to learn a bit more about $\mu_a$.

[1] CS246: Mining Massive Data Sets, 2015, Stanford University

Page 21

Performance Metric: Regret

§ Let $\mu_a$ be the mean of $P_a$.
§ Payoff/reward of the best arm: $\mu^* = \max\{\mu_a : a = 1, \dots, k\}$.
§ Let $i_1, \dots, i_T$ be the sequence of arms pulled.
§ Instantaneous regret at time $t$: $r_t = \mu^* - \mu_{i_t}$.
§ Total regret: $R_T = \sum_{t=1}^{T} r_t$.
§ Typical goal: an arm-allocation strategy that guarantees $R_T / T \to 0$ as $T \to \infty$ (a small worked example follows).

[1] CS246: Mining Massive Data Sets, 2015, Stanford University
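
As a quick numerical illustration of these definitions (the numbers are made up for the example, not taken from the slides):

% Toy example: k = 2 arms with \mu_1 = 0.5 and \mu_2 = 0.7, so \mu^* = 0.7.
% Suppose that over T = 200 rounds we pull arm 1 in 100 rounds and arm 2 in 100 rounds.
% Each pull of arm 1 contributes instantaneous regret r_t = \mu^* - \mu_1 = 0.2,
% and each pull of arm 2 contributes r_t = \mu^* - \mu_2 = 0. Hence
R_T = \sum_{t=1}^{T} r_t = 100 \cdot 0.2 + 100 \cdot 0 = 20,
\qquad
\frac{R_T}{T} = \frac{20}{200} = 0.1 .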

Page 22

Allocation Strategies

§ If we knew the payoffs, which arm should we pull? The best arm: $\mu^* = \max\{\mu_a : a = 1, \dots, k\}$.
§ What if we only care about estimating the payoffs $\mu_a$?
  § Pick each of the $k$ arms equally often: $T/k$ times each.
  § Estimate: $\hat{\mu}_a = \frac{k}{T} \sum_{j=1}^{T/k} X_{a,j}$.
  § Total regret: $R_T = \frac{T}{k} \sum_{a=1}^{k} (\mu^* - \mu_a)$.

[1] CS246: Mining Massive Data Sets, 2015, Stanford University

Page 23

Exploitation vs. Exploration

§ Tradeoff:
  § With only exploitation (making decisions based on historical data), you will have bad estimates for the "best" items.
  § With only exploration (gathering data about arm payoffs), you will have low user engagement.

Page 24

Algorithms for Exploration & Exploitation

The exploration-exploitation tradeoff:
§ Context-free: 1. ε-greedy [1]  2. UCB1 [2]
§ Contextual: 1. EXP3, EXP4  2. Thompson Sampling [3]  3. LinUCB [4]

[1] Wynn P. On the convergence and stability of the epsilon algorithm. SIAM Journal on Numerical Analysis, 1966, 3(1): 91-122.
[2] Auer P, Cesa-Bianchi N, Fischer P. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 2002, 47(2-3): 235-256.
[3] Agrawal S, Goyal N. Analysis of Thompson sampling for the multi-armed bandit problem. arXiv preprint arXiv:1111.1797, 2011.
[4] Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.

Page 25

ε-Greedy Algorithm

§ It tries to be fair to the two opposing goals of exploration (with probability ε) and exploitation (with probability 1 - ε) by flipping a biased coin at each round.

(Figure: at round t, with probability ε we explore, so each arm - e.g., arm $a$ or arm $b$ - is picked with probability ε/k; with probability 1 - ε we exploit and choose the best arm $a^*$.)

Page 26

ε-Greedy Algorithm

§ For t = 1:T
  § Set $\varepsilon_t = O(1/t)$.
  § With probability $\varepsilon_t$: explore by picking an arm chosen uniformly at random.
  § With probability $1 - \varepsilon_t$: exploit by picking the arm with the highest empirical mean payoff.
§ Theorem [Auer et al. '02]: for a suitable choice of $\varepsilon_t$, the expected instantaneous regret is $O(k/t)$, so the total regret grows only logarithmically in $T$.

A minimal code sketch of this loop follows.
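
The Python sketch below is illustrative only: the Bernoulli arm probabilities in true_probs and the schedule eps_t = min(1, 1/t) are assumptions made for the example, not values from the slides.

import random

def epsilon_greedy(true_probs, T=10000, seed=0):
    """Run epsilon-greedy with eps_t ~ O(1/t) on simulated Bernoulli arms."""
    rng = random.Random(seed)
    k = len(true_probs)
    counts = [0] * k            # m_a: number of pulls of arm a
    means = [0.0] * k           # empirical mean payoff of arm a
    total_reward = 0.0
    for t in range(1, T + 1):
        eps_t = min(1.0, 1.0 / t)                        # decaying exploration rate (one simple choice)
        if rng.random() < eps_t:
            a = rng.randrange(k)                         # explore: uniformly random arm
        else:
            a = max(range(k), key=lambda i: means[i])    # exploit: best empirical arm
        reward = 1.0 if rng.random() < true_probs[a] else 0.0   # simulated Bernoulli payoff
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]      # incremental mean update
        total_reward += reward
    return means, total_reward

# Example: three arms with unknown win probabilities (made up for illustration).
print(epsilon_greedy([0.3, 0.5, 0.7]))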

Page 27

Issues with the ε-Greedy Algorithm

§ Not "elegant": the algorithm explicitly distinguishes between exploration and exploitation.
§ More importantly: exploration makes suboptimal choices (since it picks any arm equally likely).
§ Idea: when exploring/exploiting, we need to compare arms.

Page 28

Example: Comparing Arms

§ Suppose we have done these experiments:
  § Arm 1: 1001110001
  § Arm 2: 1
  § Arm 3: 1101001111
§ Mean arm values: Arm 1: 5/10, Arm 2: 1, Arm 3: 7/10.
§ Which arm would you choose next?
§ Idea: look not only at the mean but also at the confidence!

Page 29

Confidence Intervals

§ A confidence interval is a range of values within which we are sure the mean lies with a certain probability.
  § For example, we could believe $\mu_a$ is within [0.2, 0.5] with probability 0.95.
§ If we have tried an action less often, our estimated reward is less accurate, so the confidence interval is larger.
§ The interval shrinks as we get more information (try the action more often).

Page 30

Confidence-Based Selection

§ Assume we know the confidence intervals.
§ Then, instead of trying the action with the highest mean, we can try the action with the highest upper bound on its confidence interval.

Page 31

Confidence Intervals vs. Number of Pulls

The confidence interval becomes narrower as the number of pulls increases.

[1] Jean-Yves Audibert and Remi Munos, Introduction to Bandits: Algorithms and Theory. ICML 2011, Bellevue (WA), USA.

Page 32

Calculating Confidence Bounds

§ Suppose we fix arm $a$:
  § Let $r_{a,1}, \dots, r_{a,m}$ be the payoffs of arm $a$ in its first $m$ trials.
  § $r_{a,1}, \dots, r_{a,m}$ are i.i.d., taking values in $[0, 1]$.
§ Our estimate: $\hat{\mu}_{a,m} = \frac{1}{m} \sum_{j=1}^{m} r_{a,j}$.
§ We want to find $b$ such that, with high probability, $|\mu_a - \hat{\mu}_{a,m}| \le b$ (and we want $b$ to be as small as possible).
§ Goal: bound $\mathbf{P}\left(|\mu_a - \hat{\mu}_{a,m}| \le b\right)$.
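
The bound the next slide relies on is Hoeffding's inequality; the slide itself only names it, so the standard statement for i.i.d. rewards in [0, 1] is filled in below.

% Hoeffding's inequality for m i.i.d. rewards r_{a,1}, ..., r_{a,m} in [0, 1]:
\mathbf{P}\!\left( \left| \mu_a - \hat{\mu}_{a,m} \right| \ge b \right) \;\le\; 2\, e^{-2 b^2 m}.
% Setting the right-hand side equal to \delta and solving for b gives the confidence radius
b \;=\; \sqrt{ \frac{\ln(2/\delta)}{2m} } ,
% which is the form of the exploration bonus used by the UCB-style algorithm on the next slide.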

Page 33

UCB1 Algorithm

(Based on Hoeffding's inequality.)

§ UCB1 (upper confidence bound sampling) algorithm:
  § Let $\hat{\mu}_1 = \dots = \hat{\mu}_k = 0$ and $m_1 = \dots = m_k = 0$.
    § $\hat{\mu}_a$ is our estimate of the payoff of arm $a$.
    § $m_a$ is the number of pulls of arm $a$ so far.
  § For t = 1:T
    § For each arm $a$, calculate $\mathrm{UCB}(a) = \hat{\mu}_a + \alpha \sqrt{\frac{2 \ln t}{m_a}}$.
    § Pick arm $j = \arg\max_a \mathrm{UCB}(a)$.
    § Pull arm $j$ and observe $y_t$.
    § $m_j = m_j + 1$ and $\hat{\mu}_j = \frac{1}{m_j}\left(y_t + (m_j - 1)\hat{\mu}_j\right)$.
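
A minimal Python sketch of the UCB1 loop above, again on simulated Bernoulli arms; the arm probabilities and the choice alpha = 1 are assumptions made for the example.

import math
import random

def ucb1(true_probs, T=10000, alpha=1.0, seed=0):
    """UCB1 on simulated Bernoulli arms: pull each arm once, then follow the UCB index."""
    rng = random.Random(seed)
    k = len(true_probs)
    counts = [0] * k       # m_a
    means = [0.0] * k      # empirical mean payoff mu_hat_a

    def pull(a):
        return 1.0 if rng.random() < true_probs[a] else 0.0

    # Initialization: pull each arm once so every m_a > 0.
    for a in range(k):
        counts[a] = 1
        means[a] = pull(a)
    for t in range(k + 1, T + 1):
        # UCB index: empirical mean plus an exploration bonus that shrinks with m_a.
        ucb = [means[a] + alpha * math.sqrt(2.0 * math.log(t) / counts[a]) for a in range(k)]
        j = max(range(k), key=lambda a: ucb[a])
        r = pull(j)
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]
    return means, counts

print(ucb1([0.3, 0.5, 0.7]))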

Page 34

UCB1 Algorithm: Discussion

§ The confidence interval grows with the total number of actions $t$ we have taken,
§ but shrinks with the number of times $m_a$ we have tried arm $a$.
§ This ensures each arm is tried infinitely often, yet still balances exploration and exploitation.
§ $\alpha$ plays the role of $\delta$: e.g., $\alpha = 1 + \sqrt{\ln(2/\delta)/2}$.

Recall the main loop:
§ For each arm $a$, calculate $\mathrm{UCB}(a) = \hat{\mu}_a + \alpha \sqrt{\frac{2 \ln t}{m_a}}$.
§ Pick arm $j = \arg\max_a \mathrm{UCB}(a)$.
§ Pull arm $j$ and observe $y_t$.
§ $m_j = m_j + 1$ and $\hat{\mu}_j = \frac{1}{m_j}\left(y_t + (m_j - 1)\hat{\mu}_j\right)$.

Page 35

UCB1 Algorithm: Performance

§ Theorem [Auer et al. 2002]
  § Suppose the optimal mean payoff is $\mu^* = \max_a \mu_a$,
  § and for each arm let $\Delta_a = \mu^* - \mu_a$.
  § Then it holds that
    $E[R_T] \;\le\; 8 \sum_{a : \mu_a < \mu^*} \frac{\ln T}{\Delta_a} \;+\; \left(1 + \frac{\pi^2}{3}\right) \sum_{a=1}^{k} \Delta_a .$
§ So we get $R_T = O(k \ln T)$, and in particular $R_T / T \to 0$.

Page 36

Contextual Bandits

§ A contextual bandit algorithm, in round t:
  § observes the user $u_t$ and a set $A$ of arms, together with their features $x_{t,a}$ (the context);
  § based on the payoffs from previous trials, chooses an arm $a \in A$ and receives payoff $r_{t,a}$;
  § improves its arm-selection strategy with each new observation $(x_{t,a}, a, r_{t,a})$.

Page 37

LinUCB Algorithm [1]

§ A contextual bandit algorithm, in round t:
  § observes the user $u_t$ and a set $A$ of arms, together with their features $x_{t,a}$ (the context);
  § based on the payoffs from previous trials, chooses an arm $a \in A$ and receives payoff $r_{t,a}$;
  § improves its arm-selection strategy with each new observation $(x_{t,a}, a, r_{t,a})$.

[1] Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.

Page 38

LinUCB Algorithm

§ The expected reward of each arm is modeled as a linear function of the context.

Payoff of arm $a$: $E[r_{t,a} \mid x_{t,a}] = x_{t,a}^{\top} \theta_a^*$

§ The goal is to minimize the regret, defined as the difference between the expected reward of the best arms and the expected reward of the selected arms:

$R(T) \;\triangleq\; E\!\left[\sum_{t=1}^{T} r_{t,a_t^*}\right] - E\!\left[\sum_{t=1}^{T} r_{t,a_t}\right]$

$x_{t,a}$ is a d-dimensional feature vector; $\theta_a^*$ is the unknown coefficient vector we aim to learn.

Page 39

LinUCB Algorithm

§ $E[r_{t,a} \mid x_{t,a}] = x_{t,a}^{\top} \theta_a^*$. How do we estimate $\theta_a$?
§ The (ridge) regression solution for $\theta_a$ is
  $\hat{\theta}_a = \arg\min_{\theta} \sum_{m \in D_a} \left( x_{m,a}^{\top} \theta - b_a^{(m)} \right)^2 + \|\theta\|_2^2$
  which gives
  $\hat{\theta}_a = \left( D_a^{\top} D_a + I_d \right)^{-1} D_a^{\top} b_a$

$D_a$ is an m x d matrix whose rows are the m training inputs $x_{t,a}$; $b_a$ is the m-dimensional vector of responses to arm $a$ (click / no-click). A small numerical illustration follows.
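
The closed form above is a standard ridge-regression solve; the NumPy snippet below illustrates it on random data (the dimensions and the data itself are made up for the example).

import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 6                        # m observed impressions of arm a, d context features
D_a = rng.normal(size=(m, d))       # rows are the contexts x_{t,a} seen when arm a was shown
b_a = rng.integers(0, 2, size=m).astype(float)   # click / no-click responses

# theta_hat = (D_a^T D_a + I_d)^{-1} D_a^T b_a  (solve is preferred over an explicit inverse)
A_a = D_a.T @ D_a + np.eye(d)
theta_hat = np.linalg.solve(A_a, D_a.T @ b_a)
print(theta_hat)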

Page 40

LinUCB Algorithm

§ Using similar techniques as for UCB, with probability at least $1 - \delta$:

$\left| x_{t,a}^{\top} \hat{\theta}_a - E[r_{t,a} \mid x_{t,a}] \right| \;\le\; \alpha \sqrt{ x_{t,a}^{\top} \left( D_a^{\top} D_a + I_d \right)^{-1} x_{t,a} }$

§ For a given context, we estimate the reward and the confidence interval, and choose

$a_t \;\triangleq\; \arg\max_{a \in A_t} \Big( \underbrace{x_{t,a}^{\top} \hat{\theta}_a}_{\text{estimated } \mu_a} + \underbrace{\alpha \sqrt{ x_{t,a}^{\top} \left( D_a^{\top} D_a + I_d \right)^{-1} x_{t,a} }}_{\text{confidence interval}} \Big)$

where $\alpha = 1 + \sqrt{\ln(2/\delta)/2}$.

Page 41

LinUCB Algorithm

§ Notation: $A_a \triangleq D_a^{\top} D_a + I_d$.
§ Initialization, for each arm $a$:
  § $A_a = I_d$   // identity matrix, d x d
  § $b_a = \mathbf{0}_d$   // vector of zeros
§ Online algorithm, for t = 1:T:
  § Observe the features of all arms $a$: $x_{t,a} \in \mathbb{R}^d$.
  § For each arm $a$:
    § $\theta_a = A_a^{-1} b_a$   // regression coefficients
    § $p_{t,a} = x_{t,a}^{\top} \theta_a + \alpha \sqrt{ x_{t,a}^{\top} A_a^{-1} x_{t,a} }$
  § Choose arm $a_t = \arg\max_a p_{t,a}$ and observe the reward $r_t$.
  § $A_{a_t} = A_{a_t} + x_{t,a_t} x_{t,a_t}^{\top}$   // update A for the chosen arm $a_t$
  § $b_{a_t} = b_{a_t} + r_t\, x_{t,a_t}$   // update b for the chosen arm $a_t$

A runnable sketch of this loop follows.
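
A minimal Python/NumPy sketch of the LinUCB loop above. The context generator and the hidden coefficient vectors used to simulate clicks are assumptions made for this illustration; only the update rules mirror the pseudocode.

import numpy as np

def linucb(n_arms=3, d=5, T=5000, alpha=1.0, seed=0):
    """Disjoint LinUCB with per-arm ridge statistics A_a, b_a (toy simulation)."""
    rng = np.random.default_rng(seed)
    true_theta = rng.normal(size=(n_arms, d)) * 0.3   # hidden per-arm coefficients (simulation only)
    A = [np.eye(d) for _ in range(n_arms)]            # A_a = D_a^T D_a + I_d
    b = [np.zeros(d) for _ in range(n_arms)]          # b_a
    total_reward = 0.0
    for t in range(T):
        x = rng.normal(size=(n_arms, d))              # one context vector x_{t,a} per arm
        p = np.empty(n_arms)
        for a in range(n_arms):
            theta = np.linalg.solve(A[a], b[a])                       # ridge estimate of theta_a
            bonus = alpha * np.sqrt(x[a] @ np.linalg.solve(A[a], x[a]))
            p[a] = x[a] @ theta + bonus                               # UCB score p_{t,a}
        a_t = int(np.argmax(p))
        click_prob = 1.0 / (1.0 + np.exp(-x[a_t] @ true_theta[a_t])) # simulated user response
        r = float(rng.random() < click_prob)
        A[a_t] += np.outer(x[a_t], x[a_t])            # update only the chosen arm
        b[a_t] += r * x[a_t]
        total_reward += r
    return total_reward / T

print(linucb())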

Page 42

LinUCB: Discussion

§ LinUCB's computational complexity is
  § linear in the number of arms, and
  § at most cubic in the number of features.
§ LinUCB works well for a dynamic arm set (arms come and go).
  § For example, in news article recommendation, editors add/remove articles to/from the pool.

Page 43

Difference between UCB1 and LinUCB

§ UCB1 directly estimates $\mu_a$ through experimentation (without any knowledge about arm $a$).
§ LinUCB estimates $\mu_a$ by regression: $\mu_a = x_{t,a}^{\top} \theta_a^*$.
  § The hope is that we will be able to learn faster by taking the context $x_a$ (user, ad) of arm $a$ into account.
  § $\theta_a^*$ is the unknown coefficient vector we aim to learn.

Page 44

Thompson Sampling

§ A simple, natural Bayesian heuristic.
§ Maintain a belief (distribution) over the unknown parameters.
§ Each time, pull an arm $a$ and observe a reward $r$.

§ Initialize the priors using the belief distribution.
§ For t = 1:T:
  § Sample a random variable X from each arm's belief distribution.
  § Select the arm with the largest X.
  § Observe the result of the selected arm.
  § Update the prior belief distribution of the selected arm.

[1] Agrawal S, Goyal N. Analysis of Thompson sampling for the multi-armed bandit problem. arXiv preprint arXiv:1111.1797, 2011.
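
A minimal Python sketch of Beta-Bernoulli Thompson Sampling as described above, using a Beta(1, 1) prior per arm; the true arm probabilities are made up for the simulation.

import random

def thompson_sampling(true_probs, T=10000, seed=0):
    """Beta-Bernoulli Thompson Sampling with a Beta(1, 1) prior on each arm."""
    rng = random.Random(seed)
    k = len(true_probs)
    successes = [0] * k   # S_i
    failures = [0] * k    # F_i
    for _ in range(T):
        # Sample one draw X from each arm's current Beta(S_i + 1, F_i + 1) belief.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        i = max(range(k), key=lambda a: samples[a])    # arm with the largest sample
        reward = rng.random() < true_probs[i]           # observe the (simulated) result
        if reward:
            successes[i] += 1                           # update the posterior of the played arm
        else:
            failures[i] += 1
    return successes, failures

print(thompson_sampling([0.3, 0.5, 0.7]))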

Page 45

Simple Example

§ Coin toss: $x \sim \mathrm{Bernoulli}(\theta)$.
§ Let's assume that $\theta \sim \mathrm{Beta}(\alpha_1, \alpha_2)$ (the Beta distribution), i.e. the prior is $P(\theta) \propto \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_2 - 1}$.
§ The posterior is $P(\theta \mid X) = \dfrac{P(X \mid \theta)\, P(\theta)}{\int_{\theta'} P(X \mid \theta')\, P(\theta')\, d\theta'}$.

The prior is conjugate!
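
Conjugacy is what makes the update in the following slides a one-line bookkeeping step: after observing S successes and F failures, the posterior is again a Beta distribution.

% Beta prior + Bernoulli likelihood => Beta posterior (conjugacy):
\theta \sim \mathrm{Beta}(\alpha_1, \alpha_2), \quad
x_1, \dots, x_n \mid \theta \;\overset{\text{i.i.d.}}{\sim}\; \mathrm{Bernoulli}(\theta)
\;\;\Longrightarrow\;\;
\theta \mid x_{1:n} \sim \mathrm{Beta}\!\left(\alpha_1 + S,\; \alpha_2 + F\right),
% where S = \sum_i x_i is the number of successes and F = n - S the number of failures.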

Page 46

Thompson Sampling Using the Beta Belief Distribution

§ Theorem [Emilie et al. 2012]
§ Initially assume arm $i$ has prior Beta(1, 1) on $\mu_i$.
§ $S_i$ = number of "successes", $F_i$ = number of "failures"; the belief for arm $i$ is then Beta($S_i$ + 1, $F_i$ + 1).

Page 47

Thompson Sampling Using the Beta Belief Distribution

§ Initialization: Arm 1: Beta(1, 1), Arm 2: Beta(1, 1), Arm 3: Beta(1, 1).

Page 48

Thompson Sampling Using the Beta Belief Distribution

§ For each round:
  § Sample a random variable X from each arm's Beta distribution.

Arm 1: Beta(1, 1), X = 0.7; Arm 2: Beta(1, 1), X = 0.2; Arm 3: Beta(1, 1), X = 0.4.

Page 49

Thompson Sampling Using the Beta Belief Distribution

§ For each round:
  § Sample a random variable X from each arm's Beta distribution.
  § Select the arm with the largest X.

Arm 1: Beta(1, 1), X = 0.7; Arm 2: Beta(1, 1), X = 0.2; Arm 3: Beta(1, 1), X = 0.4 → Arm 1 is selected.

Page 50

Thompson Sampling Using the Beta Belief Distribution

§ For each round:
  § Sample a random variable X from each arm's Beta distribution.
  § Select the arm with the largest X.
  § Observe the result of the selected arm.

Arm 1: Beta(1, 1), X = 0.7; Arm 2: Beta(1, 1), X = 0.2; Arm 3: Beta(1, 1), X = 0.4 → Arm 1 is played: Success!

Page 51

Thompson Sampling Using the Beta Belief Distribution

§ For each round:
  § Sample a random variable X from each arm's Beta distribution.
  § Select the arm with the largest X.
  § Observe the result of the selected arm.
  § Update the prior Beta distribution of the selected arm.

After Arm 1's success, its belief becomes Beta(2, 1); Arm 2 and Arm 3 stay at Beta(1, 1).

Page 52

Our Research 1: Ensemble Contextual Bandits for Personalized Recommendation

[1] Tang, Liang, et al. "Ensemble contextual bandits for personalized recommendation." Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 2014.

Page 53

Problem Statement

§ Problem setting: we have many different recommendation models (or policies):
  § different CTR prediction algorithms;
  § different exploration-exploitation algorithms;
  § different parameter choices.
§ There is no data to do model validation.
§ Problem statement: how do we build an ensemble model that is close to the best individual model in the cold-start situation?

Page 54

How to Ensemble?

§ Classifier ensemble methods do not work in this setting:
  § the recommendation decision is NOT purely based on the predicted CTR.
§ Each individual model only tells us which item to recommend.

Page 55

Ensemble Method

§ Our method: allocate recommendation chances to the individual models.
§ Problem:
  § Better models should get more chances.
  § We do not know which ones are good or bad in advance.
  § Ideal solution: allocate all chances to the best one.

Page 56

Current Practice: Online Evaluation (or A/B Testing)

§ Let π1, π2, ..., πm be the individual models.
§ Deploy π1, π2, ..., πm into the online system at the same time.
§ Dispatch a small percentage of user traffic to each model.
§ After a period, choose the model with the best CTR as the production model.

Page 57

Current Practice: Online Evaluation (or A/B Testing)

§ Let π1, π2, ..., πm be the individual models.
§ Deploy π1, π2, ..., πm into the online system at the same time.
§ Dispatch a small percentage of user traffic to each model.
§ After a period, choose the model with the best CTR as the production model.

If we have too many models, this will hurt the performance of the online system.

Page 58

Our Idea 1 (HyperTS)

§ The CTR of model πi is an unknown random variable, Ri.
§ Goal: maximize the CTR of our ensemble model,
  $\frac{1}{N} \sum_{t=1}^{N} r_t$,
  where $r_t$ is a random reward drawn from $R_{s(t)}$ and $s(t) \in \{1, 2, \dots, m\}$. For each t = 1, ..., N we decide s(t).
§ Solution: Bernoulli Thompson Sampling (flat prior: Beta(1, 1)), with π1, π2, ..., πm as the bandit arms.
§ No tricky parameters. (A code sketch follows.)
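
The sketch below illustrates the HyperTS idea only: a Bernoulli Thompson Sampling layer whose "arms" are the candidate recommendation models. The models interface (each model exposing a recommend(x) method) is an assumption made for this illustration, not the paper's implementation.

import random

class HyperTS:
    """Bernoulli Thompson Sampling over candidate recommendation models (illustrative sketch)."""
    def __init__(self, models, seed=0):
        self.models = models                      # each model is assumed to expose recommend(x)
        self.rng = random.Random(seed)
        self.successes = [0] * len(models)        # per-model click counts
        self.failures = [0] * len(models)         # per-model no-click counts

    def select_model(self):
        # Sample an estimated CTR R_i ~ Beta(S_i + 1, F_i + 1) for each model; pick the largest.
        samples = [self.rng.betavariate(s + 1, f + 1)
                   for s, f in zip(self.successes, self.failures)]
        return max(range(len(self.models)), key=lambda i: samples[i])

    def recommend(self, x):
        k = self.select_model()
        item = self.models[k].recommend(x)        # the chosen model picks the item
        return k, item

    def update(self, k, reward):
        # Only the model that served the request is updated (HyperTS; HyperTSFB shares feedback).
        if reward:
            self.successes[k] += 1
        else:
            self.failures[k] += 1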

Page 59

An Example of HyperTS

In memory, we keep the estimated CTRs R1, R2, ..., Rk, ..., Rm for π1, π2, ..., πm.

Page 60

An Example of HyperTS

§ A user visits.
§ Based on the estimated CTRs R1, R2, ..., Rm, HyperTS selects a candidate model, πk.

Page 61

An Example of HyperTS

§ A user visits, with context features xt.
§ HyperTS selects a candidate model, πk, using the estimated CTRs R1, R2, ..., Rm.
§ πk recommends item A to the user.

Page 62

An Example of HyperTS

§ A user visits, with context features xt.
§ HyperTS selects a candidate model, πk, using the estimated CTRs R1, R2, ..., Rm.
§ πk recommends item A to the user.
§ HyperTS updates the estimate of Rk based on the observed reward rt.

Page 63

Two-Layer Decision

(Figure: a Bernoulli Thompson Sampling layer chooses among the models π1, π2, ..., πk, ..., πm; the selected model πk then chooses among the items A, B, C, ....)

Page 64

Our Idea 2 (HyperTSFB)

§ Limitation of the previous idea:
  § For each recommendation, the user feedback is used by only one individual model (e.g., πk).
§ Motivation:
  § Can we update all of R1, R2, ..., Rm with every piece of user feedback? (Share every user feedback with every individual model.)

Page 65

Our Idea 2 (HyperTSFB)

§ Assume each model can output the probability of recommending any item given xt.
  § E.g., for a deterministic recommendation, it is 1 or 0.
§ For a user visit xt:
  § πk is selected to perform the recommendation (k = 1, 2, ..., or m).
  § Item A is recommended by πk given xt.
  § Receive the user feedback (click or no click), rt.
  § Ask every model π1, π2, ..., πm: what is the probability of recommending A given xt?

Page 66

Our Idea 2 (HyperTSFB)

§ Assume each model can output the probability of recommending any item given xt.
  § E.g., for a deterministic recommendation, it is 1 or 0.
§ For a user visit xt:
  § πk is selected to perform the recommendation (k = 1, 2, ..., or m).
  § Item A is recommended by πk given xt.
  § Receive the user feedback (click or no click), rt.
  § Ask every model π1, π2, ..., πm: what is the probability of recommending A given xt?

Estimate the CTRs of π1, π2, ..., πm via importance sampling (a sketch of the idea follows).
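
One standard way to turn those probabilities into per-model CTR estimates is an importance-sampling (inverse-propensity) average; the paper's exact estimator may differ, so the formula below is only a hedged sketch of the idea. Here $p_i(A_t \mid x_t)$ is the probability that model $\pi_i$ would recommend the shown item $A_t$, and $q(A_t \mid x_t)$ is the probability with which the ensemble actually showed it.

% Importance-sampling sketch of each model's CTR estimate (illustrative form only):
\hat{R}_i \;=\;
\frac{\sum_{t} r_t \, \dfrac{p_i(A_t \mid x_t)}{q(A_t \mid x_t)}}
     {\sum_{t} \dfrac{p_i(A_t \mid x_t)}{q(A_t \mid x_t)}},
% so every observed feedback (x_t, A_t, r_t) contributes to every model's estimate,
% weighted by how likely that model was to have made the same recommendation.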

Page 67

Experimental Setup

§ Experimental data:
  § Yahoo! Today News data logs (randomly displayed).
  § KDD Cup 2012 online advertising dataset.
§ Evaluation methods:
  § Yahoo! Today News: replay (see Lihong Li et al.'s WSDM 2011 paper).
  § KDD Cup 2012 data: simulation by a logistic regression model.

Page 68

Comparative Methods

§ CTR prediction algorithm:
  § logistic regression.
§ Exploitation-exploration algorithms:
  § Random, ε-greedy, LinUCB, Softmax, Epoch-greedy, Thompson sampling.
§ HyperTS and HyperTSFB.

Page 69

Results for Yahoo! News Data

§ Every 100,000 impressions are aggregated into a bucket.

Page 70

Results for Yahoo! News Data (Cont.)

Page 71

Conclusions

§ The performance of the baseline exploitation-exploration algorithms is very sensitive to the parameter setting.
  § In the cold-start situation there is not enough data to tune the parameters.
§ HyperTS and HyperTSFB can get close to the optimal baseline algorithm (with no guarantee of being better than the optimal one), even when some bad individual models are included.
§ For contextual Thompson sampling, the performance depends on the choice of the prior distribution for the logistic regression.
  § For online Bayesian learning, the posterior approximation is not accurate (we cannot store the past data).

Page 72

Our Research 2: Personalized Recommendation via Parameter-Free Contextual Bandits

[1] Tang, Liang, et al. "Personalized recommendation via parameter-free contextual bandits." Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2015.

Page 73

How to Balance the Tradeoff

§ Performance is mainly determined by the tradeoff. Existing algorithms find the tradeoff through user-supplied input parameters and data characteristics (e.g., the variance of the estimated reward).
§ Existing algorithms are all parameter-sensitive: the same algorithm is good when its parameter is good and bad when its parameter is bad.

Page 74

The Chicken-and-Egg Problem of Existing Bandit Algorithms

§ Why do we use bandit algorithms?
  § To solve the cold-start problem (not enough data for estimating user preferences).
§ How do we find the best input parameters?
  § Tune the parameters online or offline.

But if you already have the data or the online traffic to tune the parameters, why do you need bandit algorithms?

Page 75

Our Work

§ Parameter-free:
  § It finds the tradeoff from the data characteristics automatically.
§ Robust:
  § Existing algorithms can perform very badly if the input parameter is not appropriate.

Page 76

Solution

§ Thompson Sampling:
  § Randomly select a model coefficient vector from the posterior distribution and find the "best" item.
  § The prior is an input parameter for computing the posterior.
§ Non-Bayesian Thompson Sampling (our solution):
  § Randomly draw a bootstrap sample, take the MLE of the model coefficients on it, and find the "best" item.
  § Bootstrapping has no input parameter.

Page 77

Bootstrap Bandit Algorithm

Input: a feature vector x of the context.
Algorithm:
if each article has sufficient observations then {
    for each article i = 1, ..., k:
        i.  Di <- randomly sample n_i impression records of article i with replacement   // generate a bootstrap sample
        ii. θi <- MLE coefficients fitted on Di                                          // model estimation on the bootstrap sample
    select the article i* = argmax_i f(x, θi), i = 1, ..., k, to show.   // f is the prediction function
} else {
    randomly select an article that does not yet have sufficient observations to show.
}

A runnable sketch follows.
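
A minimal Python sketch of the bootstrap step above, using logistic regression as the prediction function f. Using scikit-learn's LogisticRegression as the MLE fitter and the min_obs threshold are assumptions made for this illustration; X_i is an (n, d) array of impression contexts and y_i the click labels for article i.

import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_score(X_i, y_i, x, rng):
    """Score article i for context x using one bootstrap replicate of its impression data."""
    n = len(y_i)
    idx = rng.integers(0, n, size=n)               # sample n impressions with replacement
    ys = y_i[idx]
    if ys.min() == ys.max():                       # degenerate bootstrap sample: only one class seen
        return float(ys[0])
    model = LogisticRegression(max_iter=1000).fit(X_i[idx], ys)   # MLE on the bootstrap sample
    return float(model.predict_proba(x.reshape(1, -1))[0, 1])     # f(x, theta_i): estimated CTR

def recommend(articles, x, min_obs=50, seed=0):
    """articles: dict article_id -> (X_i, y_i) impression features and click labels."""
    rng = np.random.default_rng(seed)
    thin = [a for a, (X_i, y_i) in articles.items() if len(y_i) < min_obs]
    if thin:                                       # cold article: explore it first
        return thin[rng.integers(len(thin))]
    scores = {a: bootstrap_score(X_i, y_i, x, rng) for a, (X_i, y_i) in articles.items()}
    return max(scores, key=scores.get)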

Page 78

Online Bootstrap Bandits

§ Why online bootstrap?
  § It is inefficient to generate a bootstrap sample for each recommendation.
§ How to bootstrap online?
  § Keep the coefficients estimated from each bootstrap sample in memory.
  § There is no need to keep the bootstrap samples themselves in memory.
  § When a new observation arrives, incrementally update the estimated coefficients of each bootstrap sample [1] (see the sketch below).

[1] N. C. Oza and S. Russell. Online bagging and boosting. In IEEE International Conference on Systems, Man and Cybernetics, volume 3, pages 2340-2345, 2005.
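
A hedged sketch of the online-bagging trick from Oza and Russell [1] applied here: each of the B replicates keeps its own coefficient vector and, when a new observation arrives, applies an SGD update Poisson(1)-many times instead of storing a bootstrap sample. The logistic-loss SGD update and the learning rate are assumptions for this illustration; the paper's exact online estimator may differ.

import numpy as np

class OnlineBootstrapArm:
    """B bootstrap replicates of a logistic model for one article, updated online (sketch)."""
    def __init__(self, d, B=10, lr=0.05, seed=0):
        self.rng = np.random.default_rng(seed)
        self.thetas = np.zeros((B, d))             # one coefficient vector per bootstrap replicate
        self.lr = lr

    def update(self, x, r):
        # Each replicate sees the new observation Poisson(1)-many times (online bagging).
        for j in range(self.thetas.shape[0]):
            for _ in range(self.rng.poisson(1.0)):
                p = 1.0 / (1.0 + np.exp(-self.thetas[j] @ x))
                self.thetas[j] += self.lr * (r - p) * x      # one SGD step on the logistic loss

    def sample_score(self, x):
        # Thompson-style draw: pick one replicate at random and use its predicted CTR.
        j = self.rng.integers(self.thetas.shape[0])
        return 1.0 / (1.0 + np.exp(-self.thetas[j] @ x))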

Page 79

Experiment Data

§ Two public datasets:
  § News recommendation data (Yahoo! Today News):
    § News displayed on the Yahoo! front page from Oct. 2, 2011 to Oct. 16, 2011.
    § 28,041,015 user visit events.
    § A 136-dimensional feature vector for each event.
  § Online advertising data (KDD Cup 2012, Track 2):
    § The dataset was collected by a search engine and published by KDD Cup 2012.
    § 1 million user visit events.
    § A 1,070,866-dimensional context feature vector.

Page 80

Offline Evaluation Metric and Methods

§ Setup:
  § Overall CTR (average reward of a trial).
§ Evaluation method:
  § The experiment on Yahoo! Today News is evaluated with the replay method [1].
  § The reward on the KDD Cup 2012 ad data is simulated with a weight vector for each ad [2].

[1] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In WSDM, pages 297-306, 2011.
[2] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In NIPS, pages 2249-2257, 2011.

Page 81

Experimental Methods

§ Our method:
  § Bootstrap(B), where B is the number of bootstrap samples.
§ Baselines:
  § Random: randomly selects an arm to pull.
  § Exploit: only exploits, with no exploration.
  § ε-greedy(ε): ε is the probability of exploration.
  § LinUCB(α): pulls the arm with the largest score, as defined by the parameter α.
  § TS(q0): Thompson sampling with logistic regression, where q0⁻¹ is the prior variance and 0 is the prior mean.
  § TSNR(q0): similar to TS(q0), but the logistic regression is not regularized by the prior.

Page 82

Experiment (Yahoo! News Data)

§ All numbers are relative to the random model.

Page 83

Experiment (Ad KDD Cup '12)

§ All numbers are relative to the random model.

Page 84

CTR over Time Buckets (Yahoo! News Data)

Page 85

CTR over Time Buckets (KDD Cup Ads Data)

Page 86

Efficiency

§ Time cost for different bootstrap sample sizes.

Page 87

Summary of Experiments

§ For the contextual bandit problem, ε-greedy and LinUCB can achieve optimal performance, but the input parameters that control the exploration need to be tuned carefully.
§ The probability-matching strategies depend heavily on the choice of the prior.
§ Our proposed algorithm is a safe choice for building predictive models for contextual bandit problems in the cold-start scenario.

Page 88

Conclusion

§ We propose a non-Bayesian Thompson Sampling method for the personalized recommendation problem.
§ We give both theoretical and empirical analyses showing that the performance of Thompson sampling depends on the choice of the prior.
§ We conduct extensive experiments on real datasets to demonstrate the efficacy of the proposed method and compare it with other contextual bandit algorithms.

Page 89

Future Work

§ MAB with similarity information
§ MAB in a changing environment
§ Explore-exploit tradeoff in mechanism design
§ Explore-exploit learning with limited resources
§ Risk vs. reward tradeoff in MAB

[1] http://research.microsoft.com/en-us/projects/bandits/

Page 90

Question and Answer

Thanks!