Transcript of Reinforcement Learning slides (Rhodes, kirlinp/courses/ai/s17/lessons/rl/rl-slides.pdf)
Reinforcement Learning
Environments
• Fully-observable vs. partially-observable
• Single agent vs. multiple agents
• Deterministic vs. stochastic
• Episodic vs. sequential
• Static or dynamic
• Discrete or continuous
What is reinforcement learning?
• Three machine learning paradigms:
– Supervised learning
– Unsupervised learning (overlaps w/ data mining)
– Reinforcement learning
• In reinforcement learning, the agent receives incremental pieces of feedback, called rewards, that it uses to judge whether it is acting correctly or not.
Examples of real-life RL
• Learning to play chess.
• Animals (or toddlers) learning to walk.
• Driving to school or work in the morning.
• Key idea: Most RL tasks are episodic, meaning they repeat many times.
– So unlike in other AI problems where you have one shot to get it right, in RL, it's OK to take time to try different things to see what's best.
n-armed bandit problem
• You have n slot machines.
• When you play a slot machine, it provides you a reward (negative or positive) according to some fixed probability distribution.
• Each machine may have a different probability distribution, and you don't know the distributions ahead of time.
• You want to maximize the amount of reward (money) you get.
• In what order, and how many times, do you play the machines?
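One simple strategy for this problem: keep a running average payout per machine and mostly play the current best, exploring a random machine occasionally. A sketch (the Gaussian payout model and the ε-greedy choice are illustrative assumptions, not from the slides):

```python
import random

def play_bandits(true_means, epsilon=0.1, pulls=10000, seed=0):
    """epsilon-greedy n-armed bandit: explore a random machine with
    probability epsilon, otherwise play the best-looking one."""
    rng = random.Random(seed)
    n = len(true_means)
    estimates = [0.0] * n        # running average reward per machine
    counts = [0] * n
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                            # explore
        else:
            arm = max(range(n), key=lambda i: estimates[i])   # exploit
        payout = rng.gauss(true_means[arm], 1.0)              # stochastic reward
        counts[arm] += 1
        estimates[arm] += (payout - estimates[arm]) / counts[arm]
    return estimates

est = play_bandits([0.0, 0.5, 1.0])
# the estimate for the best machine (index 2) ends up highest
```

The division by `counts[arm]` is the incremental form of a sample average, so each machine's estimate converges to its true mean payout as it gets played.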
RL problems
• Every RL problem is structured similarly.
• We have an environment, which consists of a set of states, and actions that can be taken in various states.
– The environment is often stochastic (there is an element of chance).
• Our RL agent wishes to learn a policy, π, a function that maps states to actions.
– π(s) tells you what action to take in a state s.
What is the goal in RL?
• In other AI problems, the "goal" is to get to a certain state. Not in RL!
• An RL environment gives feedback every time the agent takes an action. This is called a reward.
– Rewards are usually numbers.
– Goal: The agent wants to maximize the amount of reward it gets over time.
– Critical point: Rewards are given by the environment, not the agent.
Mathematics of rewards
• Assume our rewards are r_0, r_1, r_2, …
• What expression represents our total rewards?
• How do we maximize this? Is this a good idea?
• Use discounting: at each time step, the reward is discounted by a factor of γ (called the discount rate).
• Future rewards from time t = r_t + γr_{t+1} + γ²r_{t+2} + ··· = Σ_{k=0}^∞ γ^k r_{t+k}
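For a finite reward sequence this discounted sum can be computed directly; for a constant reward r the series converges to r / (1 - γ) as the horizon grows. A quick check (the function name is mine):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**k * r_{t+k} over a reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Constant reward of 1 per step: the series heads toward 1 / (1 - 0.9) = 10.
total = discounted_return([1.0] * 1000, gamma=0.9)   # very close to 10.0
```

This is why γ < 1 makes the "maximize total reward" objective well-defined even over an infinite horizon: the sum stays finite.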
Markov Decision Processes
• An MDP has a set of states, S, and a set of actions, A(s), for every state s in S.
• An MDP encodes the probability of transitioning from state s to state s' on action a: P(s' | s, a)
• RL also requires a reward function, usually denoted by R(s, a, s') = reward for being in state s, taking action a, and arriving in state s'.
• An MDP is a Markov chain that allows for outside actions to influence the transitions.
• Grass gives a reward of 0.
• Monster gives a reward of -5.
• Pot of gold gives a reward of +10 (and ends the game).
• Two actions are always available:
– Action A: 50% chance of moving right 1 square, 50% chance of staying where you are.
– Action B: 50% chance of moving right 2 squares, 50% chance of moving left 1 square.
– Any movement that would take you off the board moves you as far in that direction as possible or keeps you where you are.
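These dynamics can be written down as an explicit transition model P(s' | s, a) and reward function R(s, a, s'). The slide's actual board picture is not reproduced in this transcript, so the sketch below assumes a hypothetical 4-square board with the monster on square 1 and the gold on the last square:

```python
N = 4                     # hypothetical board length; the slide's layout may differ
GOLD = N - 1              # pot of gold: reward +10, ends the game
REWARDS = [0, -5, 0, 10]  # assumed placement: monster on square 1, grass elsewhere

def transitions(s, action):
    """Return {s': P(s' | s, action)} for the two slide actions,
    clipping any off-board move to the nearest edge square."""
    if s == GOLD:                                  # terminal: no moves
        return {}
    moves = [s + 1, s] if action == 'A' else [s + 2, s - 1]
    probs = {}
    for s2 in moves:
        s2 = max(0, min(N - 1, s2))                # off-board clips to the edge
        probs[s2] = probs.get(s2, 0.0) + 0.5
    return probs

def reward(s, action, s2):
    """R(s, a, s'): the reward for arriving in square s'."""
    return REWARDS[s2]
```

Note that when both 50% outcomes clip to the same square, their probabilities merge, so each row of the model still sums to 1.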
Value functions
• Almost all RL algorithms are based around computing, estimating, or learning value functions.
• A value function represents the expected future reward from either a state, or a state-action pair.
– V^π(s): If we are in state s, and follow policy π, what is the total future reward we will see, on average?
– Q^π(s, a): If we are in state s, and take action a, then follow policy π, what is the total future reward we will see, on average?
Optimal policies
• Given an MDP, there is always a "best" policy, called π*.
• The point of RL is to discover this policy by employing various algorithms.
– Some algorithms can use sub-optimal policies to discover π*.
• We denote the value functions corresponding to the optimal policy by V*(s) and Q*(s, a).
Bellman equations
• The V*(s) and Q*(s, a) functions always satisfy certain recursive relationships for any MDP.
• These relationships, in the form of equations, are called the Bellman equations.
Recursive relationship of V* and Q*:

V*(s) = max_a Q*(s, a)

Q*(s, a) = Σ_{s'} P(s' | s, a) [R(s, a, s') + γV*(s')]

The expected future reward from a state s is equal to the expected future reward obtained by choosing the best action from that state.
The expected future reward obtained by taking an action from a state is the weighted average of the expected future rewards from the new states.
Bellman equations
• No closed-form solution in general.
• Instead, most RL algorithms use these equations in various ways to estimate V* or Q*. An optimal policy can be derived from either V* or Q*.

V*(s) = max_a Σ_{s'} P(s' | s, a) [R(s, a, s') + γV*(s')]

Q*(s, a) = Σ_{s'} P(s' | s, a) [R(s, a, s') + γ max_{a'} Q*(s', a')]
RL algorithms
• A main categorization of RL algorithms is whether or not they require a full model of the environment.
• In other words, do we know P(s' | s, a) and R(s, a, s') for all combinations of s, a, s'?
– If we have this information (uncommon in the real world), we can estimate V* or Q* directly with very good accuracy.
– If we don't have this information, we can estimate V* or Q* from experience or simulations.
Value iteration
• Value iteration is an algorithm that computes an optimal policy, given a full model of the environment.
• The algorithm is derived directly from the Bellman equations (usually for V*, but can use Q* as well).
Value iteration
• Two steps:
• Estimate V(s) for every state.
– For each state:
• Simulate taking every possible action from that state and examine the probabilities for transitioning into every possible successor state. Weight the rewards you would receive by the probabilities that you receive them.
• Find the action that gave you the most reward, and remember how much reward it was.
• Compute the optimal policy by doing the first step again, but this time remember the actions that give you the most reward, not the reward itself.
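The first step above amounts to sweeping the Bellman optimality backup over every state until the values stop changing. A compact sketch (representing the MDP as nested dicts is my own convention, not the slides'):

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-9):
    """Sweep the Bellman optimality backup until the largest change in
    any V[s] drops below theta. P[s][a] maps s' -> probability;
    terminal states report no actions and keep V = 0."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions(s):
                continue                 # terminal: nothing to back up
            best = max(
                sum(p * (R(s, a, s2) + gamma * V[s2])
                    for s2, p in P[s][a].items())
                for a in actions(s)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

# Toy 2-state chain: action 'go' moves deterministically from state 0 to the
# terminal state 1 with reward +10, so V[0] converges to exactly 10.
V = value_iteration(
    states=[0, 1],
    actions=lambda s: ['go'] if s == 0 else [],
    P={0: {'go': {1: 1.0}}},
    R=lambda s, a, s2: 10.0,
)
```

Updating `V[s]` in place during the sweep (rather than from a frozen copy) still converges and usually does so faster.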
Value iteration
• Value iteration maintains a table of V values, one for each state. Each value V[s] eventually converges to the true value V*(s).
• (Same grid-world example as before: grass 0, monster -5, pot of gold +10 ends the game, actions A and B.)
• γ (gamma) = 0.9
V[s] values converge to: 6.47, 7.91, 8.56, 0
How do we use these to compute π(s)?
Computing an optimal policy from V[s]
• Last step of the value iteration algorithm:

π(s) = argmax_a Σ_{s'} P(s' | s, a) [R(s, a, s') + γV[s']]

• In other words, run one last time through the value iteration equation for each state, and pick the action a for each state s that maximizes the expected reward.
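That final sweep can be written directly from the equation. This sketch assumes a dict-based transition model (a convention of the sketch, not the slides):

```python
def greedy_policy(V, states, actions, P, R, gamma=0.9):
    """For each non-terminal state, pick the action maximizing the
    expected one-step reward plus discounted value of the successor."""
    pi = {}
    for s in states:
        if not actions(s):
            continue                     # terminal state: no action needed
        pi[s] = max(
            actions(s),
            key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                              for s2, p in P[s][a].items()),
        )
    return pi

# Two actions from state 0, both reaching terminal state 1; 'good' pays +1.
pi = greedy_policy(
    V={0: 0.0, 1: 0.0},
    states=[0, 1],
    actions=lambda s: ['good', 'bad'] if s == 0 else [],
    P={0: {'good': {1: 1.0}, 'bad': {1: 1.0}}},
    R=lambda s, a, s2: 1.0 if a == 'good' else 0.0,
)
# pi == {0: 'good'}
```

Only the argmax changes between this step and the value-update step: here we remember the best action instead of the best value.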
V[s] values converge to: 6.47, 7.91, 8.56, 0
Optimal policy: A B B - - -
Review
• Value iteration requires a perfect model of the environment.
– You need to know P(s' | s, a) and R(s, a, s') ahead of time for all combinations of s, a, and s'.
– Optimal V or Q values are computed directly from the environment using the Bellman equations.
• Often impossible or impractical.
Simple Blackjack
• Costs $5 to play.
• Infinite deck of shuffled cards, labeled 1, 2, 3.
• You start with no cards. At every turn, you can either "hit" (take a card) or "stay" (end the game). Your goal is to get to a sum of 6 without going over; if you go over 6, you lose the game.
• You make all your decisions first, then the dealer plays the same game.
• If your sum is higher than the dealer's, you win $10 (your original $5 back, plus another $5). If lower, you lose (your original $5). If the same, draw (get your $5 back).
Simple Blackjack
• To set this up as an MDP, we need to remove the 2nd player (the dealer) from the MDP.
• Usually at casinos, dealers have simple rules they have to follow anyway about when to hit and when to stay.
• Is it ever optimal to "stay" from S0-S3?
• Assume that on average, if we "stay" from:
– S4, we win $3 (net -$2).
– S5, we win $6 (net $1).
– S6, we win $7 (net $2).
• Do you even want to play this game?
Simple Blackjack
• What should gamma be?
• Assume we have finished one round of value iteration.
• Complete the second round of value iteration for S1-S6.
Learning from experience
• What if we don't know the exact model of the environment, but we are allowed to sample from it?
– That is, we are allowed to "practice" the MDP as much as we want.
– This echoes real-life experience.
• One way to do this is temporal difference learning.
Temporal difference learning
• We want to compute V(s) or Q(s, a).
• TD learning uses the idea of taking lots of samples of V or Q (from the MDP) and averaging them to get a good estimate.
• Let's see how TD learning works.
Example: Time to drive home
• Suppose for ten days I record how long it takes me to drive home after work.
• On the eleventh day, what should I predict my travel time home to be?
Example: Time to drive home
• Basic TD equation:
V(s) ← V(s) + α(reward - V(s))
• But what if our reward comes in pieces, not all at once?
• total reward = one-step reward + rest of reward
• total reward = r_t + γV(s')
V(s) ← V(s) + α[r_t + γV(s') - V(s)]
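The update can be written as a one-line function. In the driving example γ = 1 and the "reward" is the observed time for the leg of the trip just completed (the state names and times below are hypothetical):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """TD(0): nudge V(s) toward the one-step target r + gamma * V(s_next)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

# Predicted remaining travel times (minutes); the office-to-highway leg
# actually took 15, so the office estimate shifts toward 15 + 20 = 35.
V = {'office': 30.0, 'highway': 20.0, 'home': 0.0}
td_update(V, 'office', 15.0, 'highway')    # V['office'] becomes 30.5
```

The key point: we update the office estimate as soon as we reach the highway, using our current highway estimate, rather than waiting until we actually get home.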
Q-learning
• Q-learning is a temporal difference learning algorithm that learns optimal values for Q (instead of V, as value iteration did).
• The algorithm works in episodes, where the agent "practices" (a.k.a. samples) the MDP to learn which actions obtain the most rewards.
• Like value iteration, the table of Q values eventually converges to Q* (under certain conditions).
• Notice the Q[s, a] update equation is very similar to the driving time update equation.
– (The extra γ max_{a'} Q[s', a'] piece is to handle future rewards.)
– alpha (0 < α <= 1) is called the learning rate; it controls how fast the algorithm learns. In stochastic environments, alpha is usually small, such as 0.1.
• Note: The "choose action" step does not mean you choose the best action according to your table of Q values.
• You must balance exploration and exploitation; like in the real world, the algorithm learns best when you "practice" the best policy often, but sometimes explore other actions that may be better in the long run.
• Often the "choose action" step uses a policy that mostly exploits but sometimes explores.
• One common idea (epsilon-greedy policy):
– With probability 1 - ε, pick the best action (the "a" that maximizes Q[s, a]).
– With probability ε, pick a random action.
• It is also common to start with a large ε and decrease it over time while learning.
• What makes Q-learning so amazing is that the Q-values still converge to the optimal Q* values even though the algorithm itself is not following the optimal policy!
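An episodic sketch of the algorithm described above, with an ε-greedy "choose action" step. The environment interface (a `step` function returning a sampled reward and next state, with `None` marking the end of an episode) is my own convention:

```python
import random

def q_learning(step, start, actions, episodes=5000, alpha=0.1, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning: practice the MDP for many episodes, updating
    Q[s, a] toward r + gamma * max_a' Q[s', a'] after every step."""
    rng = random.Random(seed)
    Q = {}
    q = lambda s, a: Q.get((s, a), 0.0)
    for _ in range(episodes):
        s = start
        while s is not None:
            if rng.random() < epsilon:
                a = rng.choice(actions)                     # explore
            else:
                a = max(actions, key=lambda x: q(s, x))     # exploit
            r, s2 = step(rng, s, a)                         # sample the MDP
            target = r if s2 is None else r + gamma * max(q(s2, x) for x in actions)
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            s = s2
    return Q

# One-step toy environment: from state 0, 'R' pays +1 and 'L' pays 0.
def step(rng, s, a):
    return (1.0, None) if a == 'R' else (0.0, None)

Q = q_learning(step, start=0, actions=['L', 'R'])
# Q[(0, 'R')] approaches 1.0 even though the agent sometimes explores 'L'
```

Note the off-policy property from the slide: the update always backs up the max over next actions, regardless of which action the ε-greedy behavior actually takes next.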
Q-learning with Blackjack
• Update formula:

Q[s, a] ← Q[s, a] + α[r + γ max_{a'} Q[s', a'] - Q[s, a]]

• Sample episodes (states and actions):
S0 → Hit → S3 → Stay → End
S0 → Hit → S3 → Hit → S6 → Stay → End
S0 → Hit → S3 → Hit → S5 → Stay → End
2-Player Q-learning
Normal update equation:

Q[s, a] ← Q[s, a] + α[r + γ max_{a'} Q[s', a'] - Q[s, a]]

Normally we always maximize our rewards. Consider 2-player Q-learning with player A maximizing and player B minimizing (as in minimax).
Why does this break the update equation?
2-Player Q-learning
Player A's update equation:

Q[s, a] ← Q[s, a] + α[r + γ max_{a'} Q[s', a'] - Q[s, a]]

Player B's update equation:

Q[s, a] ← Q[s, a] + α[r + γ min_{a'} Q[s', a'] - Q[s, a]]

Player A's optimal policy output: π(s) = argmax_a Q[s, a]
Player B's optimal policy output: π(s) = argmin_a Q[s, a]
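The two update equations differ only in whether the future term takes a max or a min over the successor state's actions, so they can share one routine. A sketch (the function name and the flag are my own conventions):

```python
def q_update(Q, s, a, r, s2, legal, alpha=0.1, gamma=0.9, minimizing=False):
    """One tabular Q backup; the minimizing player (B) backs up the min
    over the successor's actions, the maximizing player (A) the max."""
    if s2 is None:                       # episode over: no future term
        future = 0.0
    else:
        vals = [Q.get((s2, a2), 0.0) for a2 in legal]
        future = min(vals) if minimizing else max(vals)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * future - old)
    return Q

# Player B's backup pulls in the worst-for-A successor value (-3.0 here):
QB = {('s2', 'p'): 2.0, ('s2', 'q'): -3.0}
q_update(QB, 's1', 'x', 0.0, 's2', ['p', 'q'], alpha=1.0, gamma=1.0,
         minimizing=True)               # QB[('s1', 'x')] becomes -3.0
```

Each player then reads its policy off its table with argmax (A) or argmin (B), as in the equations above.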