Trust Region Policy Optimization (TRPO)
Value Iteration
• This is similar to what Q-Learning does, the main difference being that we might not know the actual expected reward and instead explore the world and use discounted rewards to model our value function.
• Value iteration is model-based (it assumes known transition and reward models), whereas Q-Learning is model-free.
• Once we have Q(s, a), we can find the optimal policy π* using:

    π*(s) = argmax_a Q(s, a)
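A minimal value-iteration sketch, assuming a made-up two-state, two-action MDP with known transition tensor P and reward matrix R; the last line performs the greedy extraction π*(s) = argmax_a Q(s, a):

```python
import numpy as np

# Made-up toy MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):                      # repeated Bellman optimality backups
    Q = R + gamma * P @ V                  # Q[s,a] = R[s,a] + γ Σ_s' P[s,a,s'] V(s')
    V_new = Q.max(axis=1)                  # V(s) = max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop once the backups have converged
        break
    V = V_new

pi_star = Q.argmax(axis=1)                 # π*(s) = argmax_a Q(s, a)
print("V* =", V, "π* =", pi_star)
```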
Policy Iteration
• We can directly optimize in the policy space.
• The policy space is smaller than the Q-function space.
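For contrast, a policy-iteration sketch on the same made-up toy MDP: it alternates exact policy evaluation with greedy improvement, searching directly over deterministic policies:

```python
import numpy as np

# Same made-up toy MDP as in the value-iteration sketch.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma, n = 0.9, 2

pi = np.zeros(n, dtype=int)                # arbitrary initial deterministic policy
while True:
    # Policy evaluation: solve (I − γ P_π) V = R_π exactly.
    P_pi = P[np.arange(n), pi]
    R_pi = R[np.arange(n), pi]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to the evaluated V.
    pi_new = (R + gamma * P @ V).argmax(axis=1)
    if np.array_equal(pi_new, pi):         # stable policy -> optimal
        break
    pi = pi_new

print("π* =", pi, "V^π* =", V)
```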
Preliminaries
• The following identity expresses the expected return of another policy π̃ in terms of the advantage over π, accumulated over time steps:

    η(π̃) = η(π) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_π(s, a)

• where A_π is the advantage function:

    A_π(s, a) = Q_π(s, a) − V_π(s)

• and ρ_π̃ is the discounted visitation frequency of states under policy π̃:

    ρ_π̃(s) = P(s₀ = s) + γ P(s₁ = s) + γ² P(s₂ = s) + …
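For a finite MDP, ρ_π has the closed form ρ_π = μ₀ᵀ (I − γ P_π)⁻¹; the sketch below, with a made-up transition matrix P_π and start distribution μ₀, checks this against the series definition:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.9, 0.1],          # made-up transition matrix under a fixed π
                 [0.3, 0.7]])
mu0 = np.array([1.0, 0.0])            # initial state distribution

# Closed form: ρ_π = μ₀ᵀ (I − γ P_π)⁻¹, i.e. ρ_π(s) = Σ_t γ^t P(s_t = s).
rho = np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu0)

# Cross-check against the (truncated) series definition.
rho_series, d = np.zeros(2), mu0.copy()
for t in range(1000):
    rho_series += gamma ** t * d      # add the γ^t P(s_t = ·) term
    d = d @ P_pi                      # advance the state distribution one step
print(rho, rho_series)                # the two computations agree
```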
Preliminaries
• To remove the complexity due to ρ_π̃(s), the following local approximation is introduced:

    L_π(π̃) = η(π) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s, a)

• If we have a parameterized policy π_θ, where π_θ(a|s) is a differentiable function of the parameter vector θ, then L_π matches η to first order, i.e.,

    L_π_θ₀(π_θ₀) = η(π_θ₀),    ∇_θ L_π_θ₀(π_θ)|_θ=θ₀ = ∇_θ η(π_θ)|_θ=θ₀

• This implies that a sufficiently small step θ₀ → θ̃ that improves L_π_θ₀ will also improve η, but it does not give us any guidance on how big a step to take.
• To address this issue, Kakade & Langford (2002) proposed conservative policy iteration, which updates via a mixture:

    π_new(a|s) = (1 − α) π_old(a|s) + α π′(a|s)

• where,

    π′ = argmax_π′ L_π_old(π′)

• They derived the following lower bound:

    η(π_new) ≥ L_π_old(π_new) − (2εγ / (1 − γ)²) α²,    ε = max_s |E_a~π′ [A_π(s, a)]|
Preliminaries
• Two policies π and π_new are α-coupled if there is a joint distribution over action pairs (a, ã) | s under which P(a ≠ ã | s) ≤ α for every state s.
• Computationally, this α-coupling means that if we randomly choose a seed for our random number generator, and then we sample from each of π and π_new after setting that seed, the results will agree for at least a fraction 1 − α of seeds.
• Thus α can be considered a measure of disagreement between π and π_new.
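This can be made concrete with the maximal-coupling construction; in the sketch below (made-up action probabilities), both policies consume a shared random draw and emit the same action with probability exactly 1 − α:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])             # π(·|s), made-up probabilities
q = np.array([0.4, 0.3, 0.3])             # π_new(·|s)

overlap = np.minimum(p, q)                # probability mass the two policies share
alpha = 1.0 - overlap.sum()               # = D_TV(p, q), the disagreement level

def coupled_sample():
    """Maximal coupling: both policies agree with probability exactly 1 − α."""
    if rng.random() < 1 - alpha:          # draw from the shared mass
        a = rng.choice(3, p=overlap / overlap.sum())
        return a, a
    # otherwise draw each action from its own leftover (disjoint) mass
    a1 = rng.choice(3, p=np.maximum(p - q, 0) / alpha)
    a2 = rng.choice(3, p=np.maximum(q - p, 0) / alpha)
    return a1, a2

agree = np.mean([a1 == a2 for a1, a2 in (coupled_sample() for _ in range(50_000))])
print(f"agreement ≈ {agree:.3f}, 1 − α = {1 - alpha:.3f}")
```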
Theorem 1
• The previous result was applicable to mixture policies only. Schulman showed that it can be extended to general stochastic policies by using a distance measure called Total Variation divergence between π and π̃:

    D_TV(p ‖ q) = (1/2) Σ_i |p_i − q_i|    for discrete probability distributions p, q

• Let D_TV^max(π, π̃) = max_s D_TV(π(·|s) ‖ π̃(·|s)).
• They proved that for α = D_TV^max(π_old, π_new), the following result holds:

    η(π_new) ≥ L_π_old(π_new) − (4εγ / (1 − γ)²) α²,    ε = max_{s,a} |A_π(s, a)|
• Note the following relation between Total Variation and Kullback–Leibler divergence:

    D_TV(p ‖ q)² ≤ D_KL(p ‖ q)

• Thus the bounding condition becomes:

    η(π̃) ≥ L_π(π̃) − C · D_KL^max(π, π̃),    where C = 4εγ / (1 − γ)²
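A quick numerical sanity check of this relation on randomly drawn discrete distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def d_tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def d_kl(p, q):
    return np.sum(p * np.log(p / q))

for _ in range(5):
    p = rng.dirichlet(np.ones(4))           # random 4-outcome distributions
    q = rng.dirichlet(np.ones(4))
    assert d_tv(p, q) ** 2 <= d_kl(p, q)    # D_TV(p ‖ q)² ≤ D_KL(p ‖ q)
    print(f"TV² = {d_tv(p, q) ** 2:.4f}  ≤  KL = {d_kl(p, q):.4f}")
```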
Algorithm 1
• Iterate: compute all advantages A_π_i(s, a), then update

    π_{i+1} = argmax_π [ L_π_i(π) − C · D_KL^max(π_i, π) ]

• By the bound above, this scheme guarantees a non-decreasing expected return: η(π₀) ≤ η(π₁) ≤ η(π₂) ≤ …
Trust Region Policy Optimization
• For parameterized policies π_θ with parameter vector θ, we are guaranteed to improve the true objective by performing the following maximization:

    maximize_θ [ L_θ_old(θ) − C · D_KL^max(θ_old, θ) ]

• However, using the penalty coefficient C like above results in very small step sizes. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:

    maximize_θ L_θ_old(θ)    subject to    D_KL^max(θ_old, θ) ≤ δ
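A back-of-the-envelope calculation shows why the penalty version forces tiny steps: for a typical discount factor, C is enormous (ε = 1 here is an assumed advantage scale):

```python
gamma = 0.99
epsilon = 1.0                    # assumed scale of the advantage, ε = max|A_π|
C = 4 * epsilon * gamma / (1 - gamma) ** 2
print(C)                         # ≈ 39600: even D_KL^max = 1e-3 incurs a penalty ≈ 39.6
```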
Trust Region Policy Optimization
• The constraint is bounded at every point in state space, which is not practical because of the huge number of constraints. We can use the following heuristic approximation, which bounds the average KL divergence instead:

    D̄_KL^ρ(θ₁, θ₂) := E_s~ρ [ D_KL(π_θ₁(·|s) ‖ π_θ₂(·|s)) ]

• Thus, the optimization problem becomes:

    maximize_θ L_θ_old(θ)    subject to    D̄_KL^ρ_θ_old(θ_old, θ) ≤ δ
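In practice D̄_KL is estimated by a sample average over visited states; a sketch with made-up softmax policies over a batch of 64 states:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Made-up example: 64 sampled states, 4 actions, policies given by logits.
logits_old = rng.normal(size=(64, 4))
logits_new = logits_old + 0.05 * rng.normal(size=(64, 4))  # small parameter step

p_old, p_new = softmax(logits_old), softmax(logits_new)
kl_per_state = np.sum(p_old * np.log(p_old / p_new), axis=1)
mean_kl = kl_per_state.mean()         # Monte Carlo estimate of D̄_KL over s ~ ρ_θ_old

delta = 0.01                          # trust-region radius
print(mean_kl, mean_kl <= delta)      # the step is acceptable only if this holds
```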
Trust Region Policy Optimization
• In terms of expectations, the previous equation can be written as:

    maximize_θ E_s~ρ_θ_old, a~q [ (π_θ(a|s) / q(a|s)) Q_θ_old(s, a) ]
    subject to    E_s~ρ_θ_old [ D_KL(π_θ_old(·|s) ‖ π_θ(·|s)) ] ≤ δ

• where q denotes the sampling distribution.
• This sampling distribution can be generated in two ways:
    a) Single Path method: sample entire trajectories by executing π_θ_old, so that q = π_θ_old.
    b) Vine method: pick a subset of states along sampled trajectories and branch multiple short rollouts from each, so several actions are evaluated per state.
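In the single path case, q = π_θ_old, so the surrogate objective is estimated by importance-weighted averaging over on-policy samples; a toy sketch with made-up per-sample probabilities and Q estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                    # sampled state-action pairs (single path)

# Made-up per-sample quantities, for illustration only.
pi_old = rng.uniform(0.1, 0.9, size=n)      # π_θ_old(a_t|s_t): also the sampling q
pi_new = pi_old * rng.uniform(0.9, 1.1, n)  # π_θ(a_t|s_t) for a candidate θ
q_hat  = rng.normal(size=n)                 # Monte Carlo estimates of Q_θ_old(s_t, a_t)

# Surrogate: E[(π_θ / q) Q], estimated by averaging over the collected samples.
ratios = pi_new / pi_old
surrogate = np.mean(ratios * q_hat)
print("estimated surrogate objective:", surrogate)
```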
Final Algorithm
• Step 1: Use the single path or vine procedures to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values.
• Step 2: By averaging over samples, construct the estimated objective and constraint in Equation (14).
• Step 3: Approximately solve this constrained optimization problem to update the policy's parameter vector θ (a toy end-to-end sketch follows).
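A toy end-to-end sketch of the three steps on a made-up tabular MDP with a softmax policy. Step 3 here is deliberately simplified: a finite-difference gradient plus a backtracking line search that enforces the sampled KL constraint, in place of the conjugate-gradient/natural-gradient procedure the paper uses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-state / 2-action MDP; the policy π_θ is a per-state softmax.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[s, a]
gamma, delta = 0.9, 0.01

def policy(theta):
    """θ is a (2, 2) table of logits; rows are per-state action distributions."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def rollout(theta, T=2000):
    """Step 1 (single path): collect (s, a, return-to-go) triples; the
    discounted return-to-go serves as a crude Monte Carlo Q estimate."""
    pi, s, traj = policy(theta), 0, []
    for _ in range(T):
        a = rng.choice(2, p=pi[s])
        traj.append((s, a, R[s, a]))
        s = rng.choice(2, p=P[s, a])
    samples, g = [], 0.0
    for s, a, r in reversed(traj):
        g = r + gamma * g
        samples.append((s, a, g))
    return samples

def surrogate_and_kl(theta, theta_old, samples):
    """Step 2: sample averages of the surrogate objective and the mean KL."""
    pi, pi_old = policy(theta), policy(theta_old)
    L = np.mean([pi[s, a] / pi_old[s, a] * q for s, a, q in samples])
    kl = np.mean([np.sum(pi_old[s] * np.log(pi_old[s] / pi[s]))
                  for s, _, _ in samples])
    return L, kl

theta = np.zeros((2, 2))
for _ in range(20):
    samples = rollout(theta)                         # Step 1
    L0, _ = surrogate_and_kl(theta, theta, samples)
    # Step 3 (simplified): finite-difference gradient of the surrogate, then a
    # backtracking line search enforcing the KL constraint.
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        t = theta.copy(); t[idx] += 1e-5
        grad[idx] = (surrogate_and_kl(t, theta, samples)[0] - L0) / 1e-5
    step = grad / (np.linalg.norm(grad) + 1e-8)
    for _ in range(10):                              # shrink until feasible
        L, kl = surrogate_and_kl(theta + step, theta, samples)
        if kl <= delta and L > L0:                   # accept improving step
            theta = theta + step
            break
        step *= 0.5

print("learned policy:\n", policy(theta))
```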