Transcript of Trust Region Policy Optimization
Trust Region Policy Optimization
John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel
@ICML 2015
Presenter: Shivam Kalra
CS 885 (Reinforcement Learning)
Prof. Pascal Poupart
June 20th 2018
A taxonomy of RL methods:

Reinforcement Learning
- Action Value Function → Q-Learning
- Policy Gradients → TRPO → PPO
- Actor Critic → A3C, ACKTR
Ref: https://www.youtube.com/watch?v=CKaN5PgkSBc
Policy Gradient

For i = 1, 2, …
- Collect N trajectories for policy $\pi_\theta$
- Estimate advantage function $\hat{A}$
- Compute policy gradient $g$
- Update policy parameters: $\theta = \theta_{old} + \alpha g$
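To make the loop concrete, here is a minimal sketch of one vanilla policy-gradient iteration (not from the slides): it assumes a hypothetical Gym-style `env` with discrete states and actions, a tabular softmax policy, and uses discounted returns-to-go as a crude advantage estimate $\hat{A}$.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def policy_gradient_step(env, theta, n_traj=10, gamma=0.99, alpha=0.01):
    """One iteration of the loop above: collect trajectories with pi_theta,
    estimate A_hat (returns-to-go), compute g, update theta = theta_old + alpha*g."""
    grad = np.zeros_like(theta)
    for _ in range(n_traj):
        s, _ = env.reset()
        states, actions, rewards = [], [], []
        done = False
        while not done:
            probs = softmax(theta[s])                 # tabular softmax policy
            a = int(np.random.choice(len(probs), p=probs))
            s_next, r, terminated, truncated, _ = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s, done = s_next, terminated or truncated
        # discounted returns-to-go serve as a crude, high-variance A_hat
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for s_t, a_t, A_hat in zip(states, actions, returns):
            probs = softmax(theta[s_t])
            dlogpi = -probs
            dlogpi[a_t] += 1.0                        # gradient of log softmax
            grad[s_t] += dlogpi * A_hat
    return theta + alpha * grad / n_traj              # theta_old + alpha * g
```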
Problems of Policy Gradient

For i = 1, 2, …
- Collect N trajectories for policy $\pi_\theta$
- Estimate advantage function $\hat{A}$
- Compute policy gradient $g$
- Update policy parameters: $\theta = \theta_{old} + \alpha g$

Problem: the input data is non-stationary, because the state and reward distributions change whenever the policy changes.
Problems of Policy Gradient

For i = 1, 2, …
- Collect N trajectories for policy $\pi_\theta$
- Estimate advantage function $\hat{A}$
- Compute policy gradient $g$
- Update policy parameters: $\theta = \theta_{old} + \alpha g$

Problem: the advantage estimates are very noisy early in training.

(Cartoon: the Advantage function tells the Policy, "You're bad.")
Problems of Policy Gradient

For i = 1, 2, …
- Collect N trajectories for policy $\pi_\theta$
- Estimate advantage function $\hat{A}$
- Compute policy gradient $g$
- Update policy parameters: $\theta = \theta_{old} + \alpha g$

We need a more carefully crafted policy update: we want improvement, not degradation.

Idea: update the old policy $\pi_{old}$ to a new policy $\pi$ such that the two stay a "trusted" distance apart. Such a conservative policy update allows improvement instead of degradation.
RL to Optimization

- Most of ML is optimization: supervised learning reduces a training loss.
- RL: what is policy gradient optimizing? It favors state-action pairs $(s, a)$ that gave more advantage $\hat{A}$.
- Can we write down an optimization problem that lets us make a small update to a policy $\pi$ based on data sampled from $\pi$ itself (on-policy data)?
Ref: https://www.youtube.com/watch?v=xvRrgxcpaHY (6:40)
What loss to optimize?

- Optimize $\eta(\pi)$, i.e., the expected return of a policy $\pi$.
- We collect data with $\pi_{old}$ and optimize the objective to get a new policy $\pi$:

$$\eta(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\; a_t \sim \pi(\cdot \mid s_t)}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
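As a sanity check on this definition, a minimal Monte Carlo estimator of $\eta(\pi)$ (assuming a hypothetical `rollout` function that samples one trajectory with $\pi$ and returns its reward sequence):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """sum_t gamma^t * r_t for a single trajectory."""
    return float(sum(r * gamma ** t for t, r in enumerate(rewards)))

def estimate_eta(rollout, n_traj=1000, gamma=0.99):
    """Monte Carlo estimate of eta(pi): mean discounted return
    over n_traj trajectories sampled by following pi."""
    return float(np.mean([discounted_return(rollout(), gamma)
                          for _ in range(n_traj)]))
```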
What loss to optimize?

- We can express $\eta(\pi)$ (the expected return of the new policy) as $\eta(\pi_{old})$ (the expected return of the old policy) plus the advantage over the original policy, with trajectories sampled from the new policy [1]:

$$\eta(\pi) = \eta(\pi_{old}) + \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{old}}(s_t, a_t)\right]$$

[1] Kakade, Sham, and John Langford. "Approximately optimal approximate reinforcement learning." ICML, vol. 2, 2002.
What loss to optimize?

- The previous equation can be rewritten in terms of the discounted visitation frequency $\rho_\pi$ [1]:

$$\eta(\pi) = \eta(\pi_{old}) + \sum_{s} \rho_{\pi}(s) \sum_{a} \pi(a \mid s)\, A_{\pi_{old}}(s, a)$$

$$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$$

[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
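For intuition, $\rho_\pi$ can be computed exactly in a small tabular MDP. A sketch (the transition matrix and start distribution below are made up for illustration):

```python
import numpy as np

def discounted_visitation(P_pi, rho0, gamma=0.99):
    """rho_pi(s) = sum_t gamma^t P(s_t = s) for the chain induced by pi.
    P_pi[s, s2] = P(s2 | s) under pi; rho0 is the start-state distribution.
    Closed form: rho = (I - gamma * P_pi^T)^{-1} rho0."""
    n = len(rho0)
    return np.linalg.solve(np.eye(n) - gamma * P_pi.T, rho0)

# tiny two-state example
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
rho0 = np.array([1.0, 0.0])
print(discounted_visitation(P_pi, rho0))  # entries sum to 1 / (1 - gamma)
```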
What loss to optimize?

$$\eta(\pi) = \eta(\pi_{old}) + \sum_{s} \rho_{\pi}(s) \sum_{a} \pi(a \mid s)\, A_{\pi_{old}}(s, a)$$

If the second term (the expected advantage under the new policy) is $\geq 0$, the new expected return is at least the old expected return: improvement from $\pi_{old} \to \pi$ is guaranteed.
New State Visitation is Difficult

$$\eta(\pi) = \eta(\pi_{old}) + \sum_{s} \rho_{\pi}(s) \sum_{a} \pi(a \mid s)\, A_{\pi_{old}}(s, a)$$

Here $\rho_\pi(s)$ is the state visitation under the new policy $\pi$, which we cannot sample before knowing $\pi$. The "complex dependency of $\rho_\pi(s)$ on $\pi$ makes the equation difficult to optimize directly." [1]

[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
New State Visitation is Difficult

$$\eta(\pi) = \eta(\pi_{old}) + \sum_{s} \rho_{\pi}(s) \sum_{a} \pi(a \mid s)\, A_{\pi_{old}}(s, a)$$

Local approximation of $\eta(\pi)$: replace $\rho_\pi$ with the visitation frequency of the old policy [1]:

$$L(\pi) = \eta(\pi_{old}) + \sum_{s} \rho_{\pi_{old}}(s) \sum_{a} \pi(a \mid s)\, A_{\pi_{old}}(s, a)$$

[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
Local approximation of $\eta(\pi)$

$$L(\pi) = \eta(\pi_{old}) + \sum_{s} \rho_{\pi_{old}}(s) \sum_{a} \pi(a \mid s)\, A_{\pi_{old}}(s, a)$$

The approximation is accurate within a step size $\delta$ (the trust region), inside which the state visitation $\rho_{\pi'}(s)$ does not change dramatically. Staying within the trust region gives guaranteed monotonic improvement. [1]

[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
Local approximation of $\eta(\pi)$

- The following bound holds (with $\epsilon = \max_{s,a} |A_{\pi_{old}}(s, a)|$):

$$\eta(\pi) \geq L(\pi) - C\, D_{KL}^{max}(\pi_{old}, \pi), \quad \text{where } C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$$

- Monotonically improving policies can therefore be generated by:

$$\pi = \arg\max_{\pi} \left[ L(\pi) - C\, D_{KL}^{max}(\pi_{old}, \pi) \right]$$
Minorization-Maximization (MM) algorithm

(Figure: the surrogate function $L(\pi) - C\, D_{KL}^{max}(\pi_{old}, \pi)$ lower-bounds the actual function $\eta(\pi)$ and touches it at $\pi_{old}$; maximizing the surrogate at each iteration improves $\eta$.)
Optimization of Parameterized Policies

- Now policies are parameterized: $\pi_\theta(a \mid s)$ with parameters $\theta$.
- Accordingly, the surrogate objective becomes:

$$\arg\max_{\theta} \left[ L(\theta) - C\, D_{KL}^{max}(\theta_{old}, \theta) \right]$$
Optimization of Parameterized Policies

$$\arg\max_{\theta} \left[ L(\theta) - C\, D_{KL}^{max}(\theta_{old}, \theta) \right]$$

In practice, the penalty coefficient $C$ results in very small step sizes. One way to take larger steps is to constrain the KL divergence between the new policy and the old policy, i.e., to use a trust region constraint:

$$\max_{\theta}\; L_{\theta_{old}}(\theta) \quad \text{subject to} \quad \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \leq \delta$$
Solving KL-Penalized Problem

- $\max_{\theta}\; L(\theta) - C \cdot D_{KL}^{max}(\theta_{old}, \theta)$
- Use the mean KL divergence instead of the max, i.e., $\max_{\theta}\; L(\theta) - C \cdot \bar{D}_{KL}(\theta_{old}, \theta)$
- Make a linear approximation to $L$ and a quadratic approximation to the KL term:

$$\max_{\theta}\; g \cdot (\theta - \theta_{old}) - \frac{c}{2}\,(\theta - \theta_{old})^T F (\theta - \theta_{old})$$

$$\text{where } g = \left.\frac{\partial}{\partial\theta} L(\theta)\right|_{\theta=\theta_{old}}, \qquad F = \left.\frac{\partial^2}{\partial\theta^2} \bar{D}_{KL}(\theta_{old}, \theta)\right|_{\theta=\theta_{old}}$$
Solving KL-Penalized Problem

- Make a linear approximation to $L$ and a quadratic approximation to the KL term:

$$\max_{\theta}\; g \cdot (\theta - \theta_{old}) - \frac{c}{2}\,(\theta - \theta_{old})^T F (\theta - \theta_{old})$$

$$\text{where } g = \left.\frac{\partial}{\partial\theta} L(\theta)\right|_{\theta=\theta_{old}}, \qquad F = \left.\frac{\partial^2}{\partial\theta^2} \bar{D}_{KL}(\theta_{old}, \theta)\right|_{\theta=\theta_{old}}$$

- Solution: $\theta - \theta_{old} = \frac{1}{c} F^{-1} g$. We don't want to form the full Hessian matrix $F$.
- We can compute $F^{-1} g$ approximately using the conjugate gradient algorithm without forming $F$ explicitly (see the sketches below).
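How can we use $F$ without ever forming it? The slides do not say, but one common trick (an assumption here, not part of the deck) is a finite-difference Hessian-vector product against the gradient of the KL term:

```python
import numpy as np

def make_fvp(grad_kl, theta_old, eps=1e-5):
    """Build a Fisher-vector product function F v without forming F.
    grad_kl(theta) is a hypothetical helper (e.g. from autodiff) returning
    the gradient of D_KL(theta_old, theta) with respect to theta.
    Since the KL is minimized at theta_old, its Hessian there is F, and
    F v ~= (grad_kl(theta_old + eps * v) - grad_kl(theta_old)) / eps."""
    g0 = grad_kl(theta_old)
    def fvp(v):
        return (grad_kl(theta_old + eps * v) - g0) / eps
    return fvp
```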
Conjugate Gradient (CG)

- The conjugate gradient algorithm approximately solves $x = A^{-1} b$ without explicitly forming the matrix $A$; it only needs matrix-vector products $A v$.
- After $k$ iterations, CG has minimized $\frac{1}{2} x^T A x - b^T x$ over a $k$-dimensional subspace.
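A minimal CG sketch (the textbook algorithm, not TRPO-specific), solving $A x = b$ given only the matrix-vector product, such as the `fvp` from the previous sketch:

```python
import numpy as np

def conjugate_gradient(avp, b, iters=10, tol=1e-10):
    """Approximately solve A x = b given only avp(v) = A v."""
    x = np.zeros_like(b)
    r = b.copy()                  # residual b - A x (x starts at 0)
    p = r.copy()                  # search direction
    rr = r @ r
    for _ in range(iters):
        Ap = avp(p)
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# quick correctness check with an explicit symmetric positive definite A
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: A @ v, b)
print(A @ x, b)  # the two should match
```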
TRPO: KL-Constrained

- Unconstrained (penalized) problem: $\max_{\theta}\; L(\theta) - C \cdot \bar{D}_{KL}(\theta_{old}, \theta)$
- Constrained problem: $\max_{\theta}\; L(\theta)$ subject to $\bar{D}_{KL}(\theta_{old}, \theta) \leq \delta$
- $\delta$ is a hyperparameter that remains fixed over the whole learning process.
- Solve the constrained quadratic problem: compute $F^{-1} g$, then rescale the step to satisfy the KL constraint:

$$\max_{\theta}\; g \cdot (\theta - \theta_{old}) \quad \text{subject to} \quad \frac{1}{2}\,(\theta - \theta_{old})^T F (\theta - \theta_{old}) \leq \delta$$

- Lagrangian: $\mathcal{L}(\theta, \lambda) = g \cdot (\theta - \theta_{old}) - \frac{\lambda}{2}\left[(\theta - \theta_{old})^T F (\theta - \theta_{old}) - \delta\right]$
- Differentiating with respect to $\theta$ gives $\theta - \theta_{old} = \frac{1}{\lambda} F^{-1} g$.
- We want the final step $s$ to satisfy $\frac{1}{2} s^T F s = \delta$.
- Given the candidate step $s_{unscaled}$, rescale to $s = \sqrt{\dfrac{2\delta}{s_{unscaled}^T (F\, s_{unscaled})}}\; s_{unscaled}$.
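Putting those two steps together, a sketch of computing the TRPO step (reusing `conjugate_gradient` and the Fisher-vector product `fvp` from the earlier sketches):

```python
import numpy as np

def trpo_step(fvp, g, delta=0.01, cg_iters=10):
    """Step for max g.s subject to (1/2) s^T F s <= delta:
    s_unscaled = F^{-1} g via CG, then rescale so (1/2) s^T F s = delta."""
    s_unscaled = conjugate_gradient(fvp, g, iters=cg_iters)
    sFs = s_unscaled @ fvp(s_unscaled)           # s^T F s
    return np.sqrt(2.0 * delta / sFs) * s_unscaled
```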
TRPO Algorithm

For i = 1, 2, …
- Collect N trajectories for policy $\pi_\theta$
- Estimate advantage function $\hat{A}$
- Compute policy gradient $g$
- Use CG to compute $F^{-1} g$
- Compute the step $s = \alpha F^{-1} g$ with rescaling and line search (sketched below)
- Apply update: $\theta = \theta_{old} + s$

Each iteration solves $\max_{\theta}\; L(\theta)$ subject to $\bar{D}_{KL}(\theta_{old}, \theta) \leq \delta$.
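The line search is not spelled out on the slide; a common recipe (an assumption, in the spirit of the TRPO paper's implementation) is to backtrack until the surrogate improves and the KL constraint holds. `surrogate` and `kl` are hypothetical helpers:

```python
def line_search(theta_old, full_step, surrogate, kl, delta=0.01, backtracks=10):
    """Backtracking line search over step fractions 1, 1/2, 1/4, ...
    Accept the first step that improves the surrogate objective while
    keeping the mean KL divergence within the trust region."""
    L_old = surrogate(theta_old)
    for k in range(backtracks):
        theta_new = theta_old + (0.5 ** k) * full_step
        if surrogate(theta_new) > L_old and kl(theta_old, theta_new) <= delta:
            return theta_new
    return theta_old  # no acceptable step found; keep the old parameters
```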
Questions?