Reinforcement Learning
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Based on slides by Dan Klein
Reinforcement Learning
§ Basic idea:
  § Receive feedback in the form of rewards
  § Agent's utility is defined by the reward function
  § Must (learn to) act so as to maximize expected rewards
  § All learning is based on observed samples of outcomes!
[Diagram: the agent chooses actions a; the environment returns the resulting state s and reward r]
Reinforcement Learning
§ Still assume a Markov decision process (MDP):
  § A set of states s ∈ S
  § A set of actions (per state) A
  § A model T(s, a, s')
  § A reward function R(s, a, s')
§ Still looking for a policy π(s)
§ New twist: don't know T or R
  § I.e., we don't know which states are good or what the actions do
  § Must actually try out actions and states to learn
Offline (MDPs) vs. Online (RL)
[Figure: offline solution (solving a known MDP) vs. online learning (acting in the world)]
Model-Based Learning
§ Model-based idea:
  § Learn an approximate model based on experiences
  § Solve for values as if the learned model were correct
§ Step 1: Learn empirical MDP model
  § Count outcomes s' for each s, a
  § Normalize to give an estimate of T̂(s, a, s')
  § Discover each R̂(s, a, s') when we experience (s, a, s')
§ Step 2: Solve the learned MDP
  § For example, use value iteration, as before
Example: Model-Based Learning

Input policy π (grid: A on top; B, C, D in the middle row; E on the bottom). Assume: γ = 1.

Observed episodes (training):
  Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Learned model:
  T(s, a, s'): T(B, east, C) = 1.00; T(C, east, D) = 0.75; T(C, east, A) = 0.25; …
  R(s, a, s'): R(B, east, C) = -1; R(C, east, D) = -1; R(D, exit, x) = +10; …
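A minimal sketch of Step 1 on these four episodes (the data structures and names are illustrative, not from the slides): counting outcomes and normalizing reproduces the learned model above.

```python
from collections import Counter, defaultdict

# The four observed episodes, written as (s, a, s', r) transitions.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

counts = defaultdict(Counter)  # counts[(s, a)][s'] = times s' followed (s, a)
R_hat = {}                     # R_hat[(s, a, s')] = observed reward
for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a)][s2] += 1
        R_hat[(s, a, s2)] = r

# Normalize counts to get the empirical transition model T_hat.
T_hat = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
         for sa, c in counts.items()}

print(T_hat[("C", "east")])        # {'D': 0.75, 'A': 0.25}
print(T_hat[("B", "east")])        # {'C': 1.0}
print(R_hat[("D", "exit", "x")])   # 10
```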
Model-Free Learning
Passive Reinforcement Learning
§ Simplified task: policy evaluation
  § Input: a fixed policy π(s)
  § You don't know the transitions T(s, a, s')
  § You don't know the rewards R(s, a, s')
  § Goal: learn the state values
§ In this case:
  § Learner is "along for the ride"
  § No choice about what actions to take
  § Just execute the policy and learn from experience
  § This is NOT offline planning! You actually take actions in the world.
Direct Evaluation
§ Goal: compute values for each state under π
§ Idea: average together observed sample values
  § Act according to π
  § Every time you visit a state, write down what the sum of discounted rewards turned out to be
  § Average those samples
§ This is called direct evaluation
Example: Direct Evaluation

Input policy π (same grid as before). Assume: γ = 1.

Observed episodes (training):
  Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Output values: A = -10; B = +8, C = +4, D = +10; E = -2
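A minimal sketch of direct evaluation on the same episodes (γ = 1; names are illustrative). Walking each episode back-to-front accumulates the observed return from every visited state, and averaging reproduces the output values above.

```python
from collections import defaultdict

episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

gamma = 1.0
returns = defaultdict(list)  # returns[s] = observed discounted returns from s
for episode in episodes:
    G = 0.0
    for s, a, s2, r in reversed(episode):  # accumulate the return back-to-front
        G = r + gamma * G
        returns[s].append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)  # A = -10.0, B = 8.0, C = 4.0, D = 10.0, E = -2.0
```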
Problems with Direct Evaluation
§ What's good about direct evaluation?
  § It's easy to understand
  § It doesn't require any knowledge of T, R
  § It eventually computes the correct average values, using just sample transitions
§ What's bad about it?
  § Each state must be learned separately
  § So, it takes a long time to learn

Output values (from the example above): A = -10; B = +8, C = +4, D = +10; E = -2
If B and E both go to C under this policy, how can their values be different?
Why Not Use Policy Evaluation?
§ Simplified Bellman updates calculate V for a fixed policy:
  § Each round, replace V with a one-step-look-ahead layer over V:
    V_{k+1}^π(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V_k^π(s') ]
§ This approach fully exploited the connections between the states
  § Unfortunately, we need T and R to do it!
§ Key question: how can we do this update to V without knowing T and R?
  § In other words, how do we take a weighted average without knowing the weights?
[Diagram: one-step look-ahead tree from state s through π(s) to successor states s']
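For contrast, a minimal sketch of that model-based update, assuming T and R are simply handed to us (which is exactly what we don't have here); the data-structure conventions are illustrative:

```python
def policy_evaluation(states, policy, T, R, gamma, iterations=100):
    """Repeatedly apply the one-step look-ahead update for a fixed policy:
    V(s) <- sum_s' T(s, pi(s), s') * [R(s, pi(s), s') + gamma * V(s')].
    T[(s, a)] maps successors s' to probabilities; R[(s, a, s')] is the reward.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        # Terminal successors not listed in `states` contribute value 0.
        V = {s: sum(p * (R[(s, policy[s], s2)] + gamma * V.get(s2, 0.0))
                    for s2, p in T[(s, policy[s])].items())
             for s in states}
    return V
```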
Sample-Based Policy Evaluation?
§ We want to improve our estimate of V by computing these averages:
  V_{k+1}^π(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V_k^π(s') ]
§ Idea: take samples of outcomes s' (by doing the action!) and average:
  sample_i = R(s, π(s), s'_i) + γ V_k^π(s'_i)
  V_{k+1}^π(s) ← (1/n) Σ_i sample_i
[Diagram: from state s, acting via π(s) yields sampled successors s'_1, s'_2, s'_3]
Almost! But we can't rewind time to get sample after sample from state s.
Temporal Difference Learning
§ Big idea: learn from every experience!
  § Update V(s) each time we experience a transition (s, a, s', r)
  § Likely outcomes s' will contribute updates more often
§ Temporal difference learning of values
  § Policy still fixed, still doing evaluation!
  § Move values toward value of whatever successor occurs: running average

Sample of V(s): sample = R(s, π(s), s') + γ V^π(s')
Update to V(s): V^π(s) ← (1 - α) V^π(s) + α · sample
Same update: V^π(s) ← V^π(s) + α (sample - V^π(s))
Exponential Moving Average
§ Exponential moving average
  § The running interpolation update: x̄_n = (1 - α) · x̄_{n-1} + α · x_n
§ Makes recent samples more important:
  x̄_n = [ x_n + (1 - α) x_{n-1} + (1 - α)² x_{n-2} + … ] / [ 1 + (1 - α) + (1 - α)² + … ]
§ Forgets about the past (distant past values were wrong anyway)
§ Decreasing learning rate (alpha) can give converging averages
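A tiny sketch of the forgetting behavior (the numbers are illustrative): each update multiplies every older sample's weight by (1 - α), so recent samples dominate.

```python
def running_average(samples, alpha):
    # x_bar <- (1 - alpha) * x_bar + alpha * x for each new sample x
    x_bar = samples[0]
    for x in samples[1:]:
        x_bar = (1 - alpha) * x_bar + alpha * x
    return x_bar

# The early 0.0 samples are mostly forgotten by the end.
print(running_average([0.0, 0.0, 0.0, 10.0, 10.0], alpha=0.5))  # 7.5
```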
Example: Temporal Difference Learning

Assume: γ = 1, α = 1/2. States as before (A on top; B, C, D in the middle row; E on the bottom).

Initial values: A = 0; B = 0, C = 0, D = 8; E = 0
After observing B, east, C, -2: V(B) ← (1/2)(0) + (1/2)(-2 + V(C)) = -1, so the values become A = 0; B = -1, C = 0, D = 8; E = 0
After observing C, east, D, -2: V(C) ← (1/2)(0) + (1/2)(-2 + V(D)) = 3, so the values become A = 0; B = -1, C = 3, D = 8; E = 0
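A minimal sketch of the TD update replaying the two transitions above (γ = 1, α = 1/2; the dictionary representation is illustrative):

```python
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}
gamma, alpha = 1.0, 0.5

def td_update(V, s, s2, r):
    # Move V(s) toward the sample r + gamma * V(s'): a running average.
    sample = r + gamma * V[s2]
    V[s] = (1 - alpha) * V[s] + alpha * sample

td_update(V, "B", "C", -2)  # V(B): 0.5 * 0 + 0.5 * (-2 + 0) = -1
td_update(V, "C", "D", -2)  # V(C): 0.5 * 0 + 0.5 * (-2 + 8) = 3
print(V)  # {'A': 0.0, 'B': -1.0, 'C': 3.0, 'D': 8.0, 'E': 0.0}
```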
Problems with TD Value Learning
§ TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
§ However, if we want to turn values into a (new) policy, we're sunk:
  π(s) = argmax_a Q(s, a), where Q(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ]
§ Idea: learn Q-values, not values
  § Makes action selection model-free too!
[Diagram: look-ahead from state s over actions a to q-states (s, a) and successors s']
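Concretely: extracting a greedy policy from V requires the expectation over T and R above, while extracting it from Q is a model-free lookup (a sketch; names are illustrative):

```python
def greedy_action(Q, s, actions):
    # pi(s) = argmax_a Q(s, a): no T or R needed.
    return max(actions, key=lambda a: Q[(s, a)])
```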
Active Reinforcement Learning
§ Full reinforcement learning: optimal policies (like value iteration)
  § You don't know the transitions T(s, a, s')
  § You don't know the rewards R(s, a, s')
  § You choose the actions now
  § Goal: learn the optimal policy / values
§ In this case:
  § Learner makes choices!
  § Fundamental tradeoff: exploration vs. exploitation
  § This is NOT offline planning! You actually take actions in the world and find out what happens…
Detour: Q-Value Iteration
§ Value iteration: find successive (depth-limited) values
  § Start with V_0(s) = 0, which we know is right
  § Given V_k, calculate the depth k+1 values for all states:
    V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
§ But Q-values are more useful, so compute them instead (sketched below)
  § Start with Q_0(s, a) = 0, which we know is right
  § Given Q_k, calculate the depth k+1 q-values for all q-states:
    Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
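A minimal sketch of one round of the Q-value update, assuming a known model and the same dictionary conventions as the earlier sketches:

```python
def q_value_iteration_step(Q, states, actions, T, R, gamma):
    """Q_{k+1}(s, a) = sum_s' T(s, a, s') * [R(s, a, s') + gamma * max_a' Q_k(s', a')].
    Q maps every (s, a) pair to its current estimate; actions(s) returns the
    legal actions in s, and terminal states have none (hence default=0.0).
    """
    new_Q = {}
    for s in states:
        for a in actions(s):
            new_Q[(s, a)] = sum(
                p * (R[(s, a, s2)]
                     + gamma * max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0))
                for s2, p in T[(s, a)].items())
    return new_Q
```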
Q-Learning
§ Q-Learning: sample-based Q-value iteration
§ Learn Q(s, a) values as you go
  § Receive a sample (s, a, s', r)
  § Consider your old estimate: Q(s, a)
  § Consider your new sample estimate: sample = r + γ max_{a'} Q(s', a')
§ Incorporate the new estimate into a running average (see the sketch below): Q(s, a) ← (1 - α) Q(s, a) + α · sample
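A minimal sketch of that update (the environment interface and names are assumptions, not from the slides):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q_0(s, a) = 0, as above

def q_learning_update(Q, s, a, s2, r, actions, gamma, alpha):
    # New sample estimate r + gamma * max_a' Q(s', a'), folded into Q(s, a).
    sample = r + gamma * max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```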
Q-Learning Properties
§ Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally!
§ This is called off-policy learning
§ Caveats:
  § You have to explore enough
  § You have to eventually make the learning rate small enough
  § … but not decrease it too quickly
  § Basically, in the limit, it doesn't matter how you select actions (!)
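The two caveats in code, as a sketch: ε-greedy exploration plus a learning rate that decays, but not too quickly. The 1/N(s, a) schedule is one standard choice, not something the slides specify.

```python
import random
from collections import defaultdict

N = defaultdict(int)  # visit counts per (s, a) pair

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # Explore with probability epsilon; otherwise exploit the current Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def learning_rate(s, a):
    # Decays to zero slowly enough that learning doesn't stall:
    # the sum of the alphas diverges while the sum of their squares converges.
    N[(s, a)] += 1
    return 1.0 / N[(s, a)]
```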