Reinforcement Learning
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Based on slides by Dan Klein
Reinforcement Learning
§ Basic idea:
  § Receive feedback in the form of rewards
  § Agent's utility is defined by the reward function
  § Must (learn to) act so as to maximize expected rewards
  § All learning is based on observed samples of outcomes!
[Diagram: the agent chooses actions a; the environment returns the resulting state s and reward r]
Reinforcement Learning
§ Still assume a Markov decision process (MDP):
  § A set of states s ∈ S
  § A set of actions (per state) A
  § A model T(s, a, s')
  § A reward function R(s, a, s')
§ Still looking for a policy π(s)
§ New twist: don't know T or R
  § I.e., we don't know which states are good or what the actions do
  § Must actually try out actions and states to learn
Offline (MDPs) vs. Online (RL)
[Figure: offline solution (solving a known MDP) vs. online learning (acting in the world)]
Model-Based Learning
§ Model-based idea:
  § Learn an approximate model based on experiences
  § Solve for values as if the learned model were correct
§ Step 1: Learn empirical MDP model
  § Count outcomes s' for each s, a
  § Normalize to give an estimate of T̂(s, a, s')
  § Discover each R̂(s, a, s') when we experience (s, a, s')
§ Step 2: Solve the learned MDP
  § For example, use value iteration, as before
Example: Model-Based Learning

Input policy π (grid: A on top; B, C, D in the middle row; E on the bottom). Assume: γ = 1.

Observed episodes (training):
  Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Learned model:
  T(s, a, s'): T(B, east, C) = 1.00; T(C, east, D) = 0.75; T(C, east, A) = 0.25; …
  R(s, a, s'): R(B, east, C) = -1; R(C, east, D) = -1; R(D, exit, x) = +10; …
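A minimal sketch of Step 1 on these four episodes (the data structures and names are illustrative, not from the slides): counting outcomes and normalizing reproduces the learned model above.

```python
from collections import Counter, defaultdict

# The four observed episodes, written as (s, a, s', r) transitions.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

counts = defaultdict(Counter)  # counts[(s, a)][s'] = times s' followed (s, a)
R_hat = {}                     # R_hat[(s, a, s')] = observed reward
for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a)][s2] += 1
        R_hat[(s, a, s2)] = r

# Normalize counts to get the empirical transition model T_hat.
T_hat = {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
         for sa, c in counts.items()}

print(T_hat[("C", "east")])        # {'D': 0.75, 'A': 0.25}
print(T_hat[("B", "east")])        # {'C': 1.0}
print(R_hat[("D", "exit", "x")])   # 10
```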
Model-Free Learning
Passive Reinforcement Learning
§ Simplified task: policy evaluation
  § Input: a fixed policy π(s)
  § You don't know the transitions T(s, a, s')
  § You don't know the rewards R(s, a, s')
  § Goal: learn the state values
§ In this case:
  § Learner is "along for the ride"
  § No choice about what actions to take
  § Just execute the policy and learn from experience
  § This is NOT offline planning! You actually take actions in the world.
Direct Evaluation
§ Goal: compute values for each state under π
§ Idea: average together observed sample values
  § Act according to π
  § Every time you visit a state, write down what the sum of discounted rewards turned out to be
  § Average those samples
§ This is called direct evaluation
Example: Direct Evaluation

Input policy π (same grid as before). Assume: γ = 1.

Observed episodes (training):
  Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Output values: A = -10; B = +8, C = +4, D = +10; E = -2
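A minimal sketch of direct evaluation on the same episodes (γ = 1; names are illustrative). Walking each episode back-to-front accumulates the observed return from every visited state, and averaging reproduces the output values above.

```python
from collections import defaultdict

episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

gamma = 1.0
returns = defaultdict(list)  # returns[s] = observed discounted returns from s
for episode in episodes:
    G = 0.0
    for s, a, s2, r in reversed(episode):  # accumulate the return back-to-front
        G = r + gamma * G
        returns[s].append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)  # A = -10.0, B = 8.0, C = 4.0, D = 10.0, E = -2.0
```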
Problems with Direct Evaluation
§ What's good about direct evaluation?
  § It's easy to understand
  § It doesn't require any knowledge of T, R
  § It eventually computes the correct average values, using just sample transitions
§ What's bad about it?
  § Each state must be learned separately
  § So, it takes a long time to learn

Output values (from the example above): A = -10; B = +8, C = +4, D = +10; E = -2
If B and E both go to C under this policy, how can their values be different?
Why Not Use Policy Evaluation?
§ Simplified Bellman updates calculate V for a fixed policy:
  § Each round, replace V with a one-step-look-ahead layer over V:
    V_{k+1}^π(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V_k^π(s') ]
§ This approach fully exploited the connections between the states
  § Unfortunately, we need T and R to do it!
§ Key question: how can we do this update to V without knowing T and R?
  § In other words, how do we take a weighted average without knowing the weights?
[Diagram: one-step look-ahead tree from state s through π(s) to successor states s']
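For contrast, a minimal sketch of that model-based update, assuming T and R are simply handed to us (which is exactly what we don't have here); the data-structure conventions are illustrative:

```python
def policy_evaluation(states, policy, T, R, gamma, iterations=100):
    """Repeatedly apply the one-step look-ahead update for a fixed policy:
    V(s) <- sum_s' T(s, pi(s), s') * [R(s, pi(s), s') + gamma * V(s')].
    T[(s, a)] maps successors s' to probabilities; R[(s, a, s')] is the reward.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        # Terminal successors not listed in `states` contribute value 0.
        V = {s: sum(p * (R[(s, policy[s], s2)] + gamma * V.get(s2, 0.0))
                    for s2, p in T[(s, policy[s])].items())
             for s in states}
    return V
```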
Sample-Based Policy Evaluation?
§ We want to improve our estimate of V by computing these averages:
  V_{k+1}^π(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V_k^π(s') ]
§ Idea: take samples of outcomes s' (by doing the action!) and average:
  sample_i = R(s, π(s), s'_i) + γ V_k^π(s'_i)
  V_{k+1}^π(s) ← (1/n) Σ_i sample_i
[Diagram: from state s, acting via π(s) yields sampled successors s'_1, s'_2, s'_3]
Almost! But we can't rewind time to get sample after sample from state s.
Temporal Difference Learning
§ Big idea: learn from every experience!
  § Update V(s) each time we experience a transition (s, a, s', r)
  § Likely outcomes s' will contribute updates more often
§ Temporal difference learning of values
  § Policy still fixed, still doing evaluation!
  § Move values toward value of whatever successor occurs: running average

Sample of V(s): sample = R(s, π(s), s') + γ V^π(s')
Update to V(s): V^π(s) ← (1 - α) V^π(s) + α · sample
Same update: V^π(s) ← V^π(s) + α (sample - V^π(s))
Exponential Moving Average
§ Exponential moving average
  § The running interpolation update: x̄_n = (1 - α) · x̄_{n-1} + α · x_n
§ Makes recent samples more important:
  x̄_n = [ x_n + (1 - α) x_{n-1} + (1 - α)² x_{n-2} + … ] / [ 1 + (1 - α) + (1 - α)² + … ]
§ Forgets about the past (distant past values were wrong anyway)
§ Decreasing learning rate (alpha) can give converging averages
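A tiny sketch of the forgetting behavior (the numbers are illustrative): each update multiplies every older sample's weight by (1 - α), so recent samples dominate.

```python
def running_average(samples, alpha):
    # x_bar <- (1 - alpha) * x_bar + alpha * x for each new sample x
    x_bar = samples[0]
    for x in samples[1:]:
        x_bar = (1 - alpha) * x_bar + alpha * x
    return x_bar

# The early 0.0 samples are mostly forgotten by the end.
print(running_average([0.0, 0.0, 0.0, 10.0, 10.0], alpha=0.5))  # 7.5
```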
Example: Temporal Difference Learning

Assume: γ = 1, α = 1/2. States as before (A on top; B, C, D in the middle row; E on the bottom).

Initial values: A = 0; B = 0, C = 0, D = 8; E = 0
After observing B, east, C, -2: V(B) ← (1/2)(0) + (1/2)(-2 + V(C)) = -1, so the values become A = 0; B = -1, C = 0, D = 8; E = 0
After observing C, east, D, -2: V(C) ← (1/2)(0) + (1/2)(-2 + V(D)) = 3, so the values become A = 0; B = -1, C = 3, D = 8; E = 0
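A minimal sketch of the TD update replaying the two transitions above (γ = 1, α = 1/2; the dictionary representation is illustrative):

```python
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}
gamma, alpha = 1.0, 0.5

def td_update(V, s, s2, r):
    # Move V(s) toward the sample r + gamma * V(s'): a running average.
    sample = r + gamma * V[s2]
    V[s] = (1 - alpha) * V[s] + alpha * sample

td_update(V, "B", "C", -2)  # V(B): 0.5 * 0 + 0.5 * (-2 + 0) = -1
td_update(V, "C", "D", -2)  # V(C): 0.5 * 0 + 0.5 * (-2 + 8) = 3
print(V)  # {'A': 0.0, 'B': -1.0, 'C': 3.0, 'D': 8.0, 'E': 0.0}
```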
Problems with TD Value Learning
§ TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
§ However, if we want to turn values into a (new) policy, we're sunk:
  π(s) = argmax_a Q(s, a), where Q(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ]
§ Idea: learn Q-values, not values
  § Makes action selection model-free too!
[Diagram: look-ahead from state s over actions a to q-states (s, a) and successors s']
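Concretely: extracting a greedy policy from V requires the expectation over T and R above, while extracting it from Q is a model-free lookup (a sketch; names are illustrative):

```python
def greedy_action(Q, s, actions):
    # pi(s) = argmax_a Q(s, a): no T or R needed.
    return max(actions, key=lambda a: Q[(s, a)])
```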
Active Reinforcement Learning
§ Full reinforcement learning: optimal policies (like value iteration)
  § You don't know the transitions T(s, a, s')
  § You don't know the rewards R(s, a, s')
  § You choose the actions now
  § Goal: learn the optimal policy / values
§ In this case:
  § Learner makes choices!
  § Fundamental tradeoff: exploration vs. exploitation
  § This is NOT offline planning! You actually take actions in the world and find out what happens…
Detour: Q-Value Iteration
§ Value iteration: find successive (depth-limited) values
  § Start with V_0(s) = 0, which we know is right
  § Given V_k, calculate the depth k+1 values for all states:
    V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
§ But Q-values are more useful, so compute them instead (sketched below)
  § Start with Q_0(s, a) = 0, which we know is right
  § Given Q_k, calculate the depth k+1 q-values for all q-states:
    Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
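A minimal sketch of one round of the Q-value update, assuming a known model and the same dictionary conventions as the earlier sketches:

```python
def q_value_iteration_step(Q, states, actions, T, R, gamma):
    """Q_{k+1}(s, a) = sum_s' T(s, a, s') * [R(s, a, s') + gamma * max_a' Q_k(s', a')].
    Q maps every (s, a) pair to its current estimate; actions(s) returns the
    legal actions in s, and terminal states have none (hence default=0.0).
    """
    new_Q = {}
    for s in states:
        for a in actions(s):
            new_Q[(s, a)] = sum(
                p * (R[(s, a, s2)]
                     + gamma * max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0))
                for s2, p in T[(s, a)].items())
    return new_Q
```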
Q-Learning
§ Q-Learning: sample-based Q-value iteration
§ Learn Q(s, a) values as you go
  § Receive a sample (s, a, s', r)
  § Consider your old estimate: Q(s, a)
  § Consider your new sample estimate: sample = r + γ max_{a'} Q(s', a')
§ Incorporate the new estimate into a running average (see the sketch below): Q(s, a) ← (1 - α) Q(s, a) + α · sample
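A minimal sketch of that update (the environment interface and names are assumptions, not from the slides):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q_0(s, a) = 0, as above

def q_learning_update(Q, s, a, s2, r, actions, gamma, alpha):
    # New sample estimate r + gamma * max_a' Q(s', a'), folded into Q(s, a).
    sample = r + gamma * max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```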
Q-Learning Properties
§ Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally!
§ This is called off-policy learning
§ Caveats:
  § You have to explore enough
  § You have to eventually make the learning rate small enough
  § … but not decrease it too quickly
  § Basically, in the limit, it doesn't matter how you select actions (!)
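The two caveats in code, as a sketch: ε-greedy exploration plus a learning rate that decays, but not too quickly. The 1/N(s, a) schedule is one standard choice, not something the slides specify.

```python
import random
from collections import defaultdict

N = defaultdict(int)  # visit counts per (s, a) pair

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # Explore with probability epsilon; otherwise exploit the current Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def learning_rate(s, a):
    # Decays to zero slowly enough that learning doesn't stall:
    # the sum of the alphas diverges while the sum of their squares converges.
    N[(s, a)] += 1
    return 1.0 / N[(s, a)]
```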