Algorithms for Inverse Reinforcement Learning (Andrew Ng and Stuart Russell)
Transcript of a slide presentation covering “Algorithms for Inverse Reinforcement Learning” (Ng and Russell) and “Apprenticeship Learning via Inverse Reinforcement Learning” (Abbeel and Ng).
1. Algorithms for Inverse Reinforcement Learning
2. Apprenticeship learning via Inverse Reinforcement Learning
Algorithms for Inverse Reinforcement Learning
Andrew Ng and Stuart Russell
Motivation
● Given: (1) measurements of an agent's behavior over time, in a variety of circumstances; (2) if needed, measurements of the sensory inputs to that agent; (3) if available, a model of the environment.
● Determine: the reward function being optimized.
Why?
● Reason #1: Computational models for animal and human learning.
● “In examining animal and human behavior we must consider the reward function as an unknown to be ascertained through empirical investigation.”
● Particularly true of multiattribute reward functions (e.g. Bee foraging: amount of nectar vs. flight time vs. risk from wind/predators)
Why?
● Reason #2: Agent construction.
● “An agent designer [...] may only have a very rough idea of the reward function whose optimization would generate 'desirable' behavior.”
● e.g. “Driving well”
● Apprenticeship learning: Is recovering the expert's underlying reward function more “parsimonious” than learning the expert's policy?
Possible applications in multi-agent systems
● In multi-agent adversarial games, learning the reward functions that guide opponents' actions, in order to devise strategies against them.
● In mechanism design, learning each agent's reward function from interaction histories in order to manipulate its actions.
● and more?
Inverse Reinforcement Learning (1) – MDP Recap
• An MDP is represented as a tuple $(S, A, \{P_{sa}\}, \gamma, R)$; note that $R$ is bounded by $R_{\max}$.
• Value function for policy $\pi$:
  $V^\pi(s_1) = E[R(s_1) + \gamma R(s_2) + \gamma^2 R(s_3) + \cdots \mid \pi]$
• Q-function:
  $Q^\pi(s, a) = R(s) + \gamma E_{s' \sim P_{sa}(\cdot)}[V^\pi(s')]$
Inverse Reinforcement Learning (1) – MDP Recap
• Bellman Equations:
  $V^\pi(s) = R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s') V^\pi(s')$
  $Q^\pi(s, a) = R(s) + \gamma \sum_{s'} P_{sa}(s') V^\pi(s')$
• Bellman Optimality: $\pi$ is optimal iff $\pi(s) \in \arg\max_{a \in A} Q^\pi(s, a)$ for all $s$
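Since the Bellman equation is linear in $V^\pi$ for a fixed policy, it can be solved exactly. Below is a minimal NumPy sketch of this (the function name and array layout are illustrative, not from the papers):

```python
import numpy as np

def policy_evaluation(P_pi, R, gamma):
    """Solve the Bellman equation V = R + gamma * P_pi @ V exactly.

    P_pi  : (N, N) transition matrix under a fixed policy pi,
            P_pi[s, s'] = P_{s pi(s)}(s')
    R     : (N,) reward vector over states
    gamma : discount factor in [0, 1)
    """
    N = P_pi.shape[0]
    # V = (I - gamma * P_pi)^{-1} R
    return np.linalg.solve(np.eye(N) - gamma * P_pi, R)
```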
Inverse Reinforcement Learning (2): Finite State Space
• Reward function solution set, assuming $a_1 \equiv \pi(s)$ is the optimal action:
  $V^\pi = R + \gamma P_{a_1} V^\pi \;\Rightarrow\; V^\pi = (I - \gamma P_{a_1})^{-1} R$
  $\pi(s) \in \arg\max_{a \in A} Q^\pi(s, a) \;\; \forall s$
  $\Leftrightarrow\; P_{a_1} V^\pi \succeq P_a V^\pi \quad \forall a \in A \setminus a_1$
  $\Leftrightarrow\; (P_{a_1} - P_a)(I - \gamma P_{a_1})^{-1} R \succeq 0 \quad \forall a \in A \setminus a_1$
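As a sketch, the solution-set condition above can be packaged as a family of linear constraint matrices on R. This assumes per-action transition matrices are available; the helper names are illustrative:

```python
import numpy as np

def irl_constraint_matrices(P, gamma, a1=0):
    """Constraint matrices M_a = (P_a1 - P_a)(I - gamma * P_a1)^{-1}.

    P : (num_actions, N, N) stack of per-action transition matrices.
    A reward vector R is in the solution set iff M_a @ R >= 0
    (elementwise) for every a != a1.
    """
    num_actions, N, _ = P.shape
    inv = np.linalg.inv(np.eye(N) - gamma * P[a1])
    return [(P[a1] - P[a]) @ inv for a in range(num_actions) if a != a1]

def is_consistent(P, gamma, R, a1=0, tol=1e-9):
    """Check whether a1 is optimal in every state under reward R."""
    return all((M @ R >= -tol).all()
               for M in irl_constraint_matrices(P, gamma, a1))
```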
Inverse Reinforcement Learning (2): Finite State Space
Many choices of R satisfy these inequalities (e.g. R = 0 always does). Which one might be the best solution?
1. Make deviation from $a_1$ as costly as possible, i.e. maximize
   $\sum_{s \in S} \big( Q^\pi(s, a_1) - \max_{a \in A \setminus a_1} Q^\pi(s, a) \big)$
2. Make the reward function as simple as possible (e.g. penalize $\|R\|_1$)
Inverse Reinforcement Learning (2): Finite State Space
• Linear Programming Formulation:
  maximize $\sum_{i=1}^{N} \min_{a \in \{a_2, \ldots, a_k\}} \big\{ (P_{a_1}(i) - P_a(i)) (I - \gamma P_{a_1})^{-1} R \big\} - \lambda \|R\|_1$
  s.t. $(P_{a_1}(i) - P_a(i)) (I - \gamma P_{a_1})^{-1} R \ge 0 \quad \forall a \in A \setminus a_1, \; \forall i$
       $|R_i| \le R_{\max}, \quad i = 1, \ldots, N$
  where $P_a(i)$ denotes the $i$-th row of $P_a$.
[Figure: given actions $a_1, a_2, \ldots, a_n$ with $a_1$ optimal, the LP picks the R for which the gap between $V^{a_1}$ and $V^{a_2}$ is maximized.]
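A sketch of this LP with scipy.optimize.linprog, using auxiliary variables $t_i$ for the min terms (an epigraph encoding) and $u_i$ for $|R_i|$ in the $\ell_1$ penalty. This is one straightforward encoding under those assumptions, not the authors' code:

```python
import numpy as np
from scipy.optimize import linprog

def irl_lp(P, gamma, lam, Rmax, a1=0):
    """Sketch of the finite-state IRL linear program.

    Decision variables x = [R (N), t (N), u (N)]:
      maximize sum_i t_i - lam * sum_i u_i
      s.t. t_i <= (M_a R)_i   for all a != a1   (epigraph of the min)
           M_a R >= 0         for all a != a1
           -u <= R <= u       (so u_i >= |R_i|)
    where M_a = (P_a1 - P_a)(I - gamma * P_a1)^{-1}.
    """
    num_actions, N, _ = P.shape
    inv = np.linalg.inv(np.eye(N) - gamma * P[a1])
    Ms = [(P[a1] - P[a]) @ inv for a in range(num_actions) if a != a1]

    I, Z = np.eye(N), np.zeros((N, N))
    rows = []
    for M in Ms:
        rows.append(np.hstack([-M, I, Z]))   # t_i - (M R)_i <= 0
        rows.append(np.hstack([-M, Z, Z]))   # -(M R)_i <= 0
    rows.append(np.hstack([I, Z, -I]))       # R - u <= 0
    rows.append(np.hstack([-I, Z, -I]))      # -R - u <= 0
    A_ub = np.vstack(rows)
    b_ub = np.zeros(A_ub.shape[0])

    # linprog minimizes, so negate the objective we want to maximize
    c = np.concatenate([np.zeros(N), -np.ones(N), lam * np.ones(N)])
    bounds = [(-Rmax, Rmax)] * N + [(None, None)] * N + [(0, None)] * N
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:N]   # the recovered reward vector
```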
Inverse Reinforcement Learning (3): Large State Spaces
• Linear approximation of the reward function (in the driving example, basis functions can be collision, staying in the right lane, ... etc.):
  $R(s) = \alpha_1 \phi_1(s) + \alpha_2 \phi_2(s) + \cdots + \alpha_d \phi_d(s)$
• Let $V_i^\pi$ be the value function of policy $\pi$ when the reward is $R = \phi_i$; by linearity,
  $V^\pi = \alpha_1 V_1^\pi + \alpha_2 V_2^\pi + \cdots + \alpha_d V_d^\pi$
• For R to make $\pi(s) \equiv a_1$ optimal:
  $E_{s' \sim P_{sa_1}}[V^\pi(s')] \ge E_{s' \sim P_{sa}}[V^\pi(s')] \quad \forall s, \; \forall a \in A \setminus a_1$
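By this linearity, the per-basis value functions can be computed once and reused for any coefficient vector. A small NumPy sketch, assuming a fixed-policy transition matrix and a basis matrix (names are illustrative):

```python
import numpy as np

def basis_value_functions(P_pi, Phi, gamma):
    """Columns of the result are V_i^pi, the value functions for the
    basis rewards R = phi_i, from solving (I - gamma P_pi) V = phi_i.

    Phi : (N, d) matrix whose column i holds phi_i at each state.
    """
    N = P_pi.shape[0]
    return np.linalg.solve(np.eye(N) - gamma * P_pi, Phi)

# The value under R = Phi @ alpha is then the same linear combination:
# V_pi = basis_value_functions(P_pi, Phi, gamma) @ alpha
```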
Inverse Reinforcement Learning (3): Large State Spaces
• In an infinite or very large state space, it is usually not possible to check all the constraints:
  $E_{s' \sim P_{sa_1}}[V^\pi(s')] \ge E_{s' \sim P_{sa}}[V^\pi(s')]$
• Choose a finite subset $S_0$ of the states
• Linear programming formulation: find the $\alpha_i$ that
  maximize $\sum_{s \in S_0} \min_{a \in \{a_2, \ldots, a_k\}} p\big( E_{s' \sim P_{sa_1}}[V^\pi(s')] - E_{s' \sim P_{sa}}[V^\pi(s')] \big)$
  s.t. $|\alpha_i| \le 1, \quad i = 1, \ldots, d$
• where $p(x) = x$ if $x \ge 0$, and $p(x) = 2x$ otherwise (violated constraints are penalized twice as heavily)
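A sketch of this penalized objective over the sampled states $S_0$; the array layout of the precomputed expectations is an assumption for illustration:

```python
import numpy as np

def p(x):
    """Slide's penalty: linear credit for satisfied constraints,
    twice the slope for violated ones."""
    return np.where(x >= 0, x, 2.0 * x)

def sampled_irl_objective(alpha, EV, a1=0):
    """Objective to maximize over |alpha_i| <= 1 (a sketch).

    EV : (num_actions, |S0|, d) array with
         EV[a, s, i] = E_{s' ~ P_sa}[V_i^pi(s')] at sampled state s.
    """
    diffs = (EV[a1] - EV) @ alpha          # value gaps, (num_actions, |S0|)
    others = np.delete(diffs, a1, axis=0)  # drop a1's all-zero row
    return np.sum(np.min(p(others), axis=0))
```

Since $p$ is monotone increasing, taking the min before or after applying $p$ gives the same result.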
Inverse Reinforcement Learning (4): IRL from Sampled Trajectories
• If the policy $\pi$ is only accessible through a set of sampled trajectories (e.g. the driving demo in the 2nd paper)
• Assume we start from a dummy state $s_0$ (whose next-state distribution is according to $D$).
• In the case that the reward is $R = \phi_i$, for a trajectory with state sequence $(s_0, s_1, s_2, \ldots)$:
  $\hat{V}_i^\pi(s_0) = \phi_i(s_0) + \gamma \phi_i(s_1) + \gamma^2 \phi_i(s_2) + \cdots$
  $\hat{V}^\pi(s_0) = \alpha_1 \hat{V}_1^\pi(s_0) + \cdots + \alpha_d \hat{V}_d^\pi(s_0)$
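A quick sketch of this Monte Carlo estimate for one trajectory, where row t of the input holds $\phi(s_t)$:

```python
import numpy as np

def empirical_basis_values(traj_features, gamma):
    """V-hat_i(s0) = sum_t gamma^t phi_i(s_t), for all i at once.

    traj_features : (T, d) array, row t is phi(s_t) along one trajectory.
    Average the result over many trajectories for a better estimate;
    then V-hat^pi(s0) = alpha @ empirical_basis_values(...).
    """
    T = traj_features.shape[0]
    discounts = gamma ** np.arange(T)   # 1, gamma, gamma^2, ...
    return discounts @ traj_features    # length-d vector
```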
Inverse Reinforcement Learning (4): IRL from Sampled Trajectories
• Assume we have some set of policies $\{\pi_1, \ldots, \pi_k\}$
• Linear programming formulation:
  maximize $\sum_{i=1}^{k} p\big( \hat{V}^{\pi^*}(s_0) - \hat{V}^{\pi_i}(s_0) \big)$
  s.t. $|\alpha_i| \le 1, \quad i = 1, \ldots, d$
• The above optimization gives a new reward R; we then compute the optimal policy $\pi_{k+1}$ based on R, and add it to the set of policies
• and reiterate
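The iterate-and-grow loop can be sketched as follows; estimate_V_hat, solve_alphas, and rl_step are hypothetical helpers standing in for the Monte Carlo estimator above, the LP above, and an RL solver:

```python
def sample_based_irl(V_hat_expert, pi_1, estimate_V_hat, solve_alphas,
                     rl_step, n_iters=20):
    """Outer loop of IRL from sampled trajectories (a sketch).

    estimate_V_hat(pi) -> per-basis Monte Carlo value estimates for pi
    solve_alphas(...)  -> LP: maximize
                          sum_i p(alpha @ (V_hat_expert - V_hat_i))
                          subject to |alpha_j| <= 1
    rl_step(alpha)     -> optimal policy for reward R = sum_j alpha_j phi_j
    """
    policies = [pi_1]
    V_hats = [estimate_V_hat(pi_1)]
    for _ in range(n_iters):
        alpha = solve_alphas(V_hat_expert, V_hats)  # fit a new reward
        pi_new = rl_step(alpha)                     # RL under that reward
        policies.append(pi_new)
        V_hats.append(estimate_V_hat(pi_new))       # grow the set, reiterate
    return alpha, policies
```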
Discrete Gridworld Experiment
● 5x5 grid world
● Agent starts in the bottom-left square.
● Reward of 1 in the upper-right square.
● Actions = N, W, S, E (30% chance of moving in a random direction instead)
Discrete Gridworld Results
Mountain Car Experiment #1
● Car starts in the valley; the goal is at the top of the hill
● Reward is -1 per “step” until the goal is reached
● State = car's x-position & velocity (continuous!)
● Function approximation class: all linear combinations of 26 evenly spaced Gaussian-shaped basis functions
Mountain Car Experiment #2
● Goal is in the bottom of the valley
● Car starts... not sure. Top of hill?
● Reward is 1 in the goal area, 0 elsewhere
● γ = 0.99
● State = car's x-position & velocity (continuous!)
● Function approximation class: all linear combinations of 26 evenly spaced Gaussian-shaped basis functions
Mountain Car Results
[Results figures for Experiment #1 and Experiment #2]
Continuous Gridworld Experiment
● State space is now the continuous grid [0,1] x [0,1]
● Actions: movement of 0.2 in any direction, plus noise in the x and y coordinates of [-0.1, 0.1]
● Reward 1 in the region [0.8,1] x [0.8,1], 0 elsewhere
● γ = 0.9
● Function approximation class: all linear combinations of a 15x15 array of 2-D Gaussian-shaped basis functions
● m = 5000 trajectories of 30 steps each per policy
Continuous Gridworld Results
3%-10% error when comparing fitted reward's optimal policy with the true optimal policy
However, no significant difference in quality of policy (measured using true reward function)
Apprenticeship Learning via Inverse Reinforcement Learning
Pieter Abbeel & Andrew Y. Ng
Algorithm
● For t = 1,2,…
Inverse RL step: Estimate the expert's reward function $R(s) = w^T \phi(s)$ such that under $R(s)$ the expert performs better than all previously found policies $\{\pi_i\}$.
RL step: Compute the optimal policy $\pi_t$ for the estimated reward $R = w^T \phi$.
Courtesy of Pieter Abbeel
Algorithm: IRL step
● Maximize $\gamma$, over $\gamma$ and $w$ with $\|w\|_2 \le 1$
● s.t. $V^w(\pi_E) \ge V^w(\pi_i) + \gamma$, $i = 1, \ldots, t-1$
● $\gamma$ = margin of expert's performance over the performance of previously found policies (distinct from the discount factor).
● $V^w(\pi) = E[\sum_t \gamma^t R(s_t) \mid \pi] = E[\sum_t \gamma^t w^T \phi(s_t) \mid \pi]$
  $= w^T E[\sum_t \gamma^t \phi(s_t) \mid \pi]$
  $= w^T \mu(\pi)$
● $\mu(\pi) = E[\sum_t \gamma^t \phi(s_t) \mid \pi]$ are the “feature expectations”
Courtesy of Pieter Abbeel
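A sketch of this max-margin step using cvxpy (an assumed dependency; names are illustrative). The margin variable is called t here to avoid clashing with the discount factor:

```python
import cvxpy as cp
import numpy as np

def irl_step(mu_expert, mu_policies):
    """Find w with ||w||_2 <= 1 maximizing the margin t such that
    w @ mu_expert >= w @ mu_i + t for every previously found policy."""
    d = len(mu_expert)
    w, t = cp.Variable(d), cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ mu_expert >= w @ mu + t for mu in mu_policies]
    cp.Problem(cp.Maximize(t), constraints).solve()
    return np.asarray(w.value), float(t.value)
```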
Feature Expectation Closeness and Performance
● If we can find a policy $\tilde{\pi}$ such that
  $\|\mu(\pi_E) - \mu(\tilde{\pi})\|_2 \le \epsilon$,
● then for any underlying reward $R^*(s) = w^{*T}\phi(s)$ with $\|w^*\|_2 \le 1$,
● we have that
  $|V^{w^*}(\pi_E) - V^{w^*}(\tilde{\pi})| = |w^{*T}\mu(\pi_E) - w^{*T}\mu(\tilde{\pi})|$
  $\le \|w^*\|_2 \, \|\mu(\pi_E) - \mu(\tilde{\pi})\|_2 \le \epsilon$.
Courtesy of Pieter Abbeel
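A quick numeric sanity check of the Cauchy-Schwarz bound above, on purely synthetic numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)            # any reward with ||w*||_2 <= 1
mu_E = rng.normal(size=d)                   # expert feature expectations
mu_pi = mu_E + 1e-3 * rng.normal(size=d)    # policy with close feature exps

gap = abs(w_star @ mu_E - w_star @ mu_pi)   # |V^{w*}(pi_E) - V^{w*}(pi)|
eps = np.linalg.norm(mu_E - mu_pi)          # ||mu(pi_E) - mu(pi)||_2
assert gap <= eps + 1e-12                   # Cauchy-Schwarz guarantees this
```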
IRL step as Support Vector Machine
The maximum-margin hyperplane separating two sets of points: $\mu(\pi_E)$ on one side, the $\mu(\pi_i)$ of previously found policies on the other.
$|w^{*T}\mu(\pi_E) - w^{*T}\mu(\pi)| = |V^{w^*}(\pi_E) - V^{w^*}(\pi)|$ = the margin between the expert policy's value and the value of the closest previously found policy.
[Figure: iterations in feature-expectation space. Starting from $\mu(\pi_0)$, each weight vector $w^{(j)}$ yields a new policy $\pi_j$ whose feature expectations $\mu(\pi_j)$ move closer to $\mu(\pi_E)$; the utility is $U_w(\pi) = w^T \mu(\pi)$.]
Courtesy of Pieter Abbeel
Gridworld Experiment
● 128 x 128 grid world divided into 64 regions, each of size 16 x 16 (“macrocells”).
● A small number of macrocells have positive rewards.
● For each macrocell, there is one feature $\Phi_i(s)$ indicating whether state s is in macrocell i (sketched after this list).
● The algorithm was also run on the subset of features $\Phi_i(s)$ that correspond to non-zero rewards.
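A sketch of the macrocell indicator features; the coordinate conventions are an assumption for illustration:

```python
import numpy as np

def macrocell_features(x, y, grid=128, cell=16):
    """One-hot phi(s) for a state at integer coordinates (x, y):
    exactly one of the 64 entries is 1, marking the 16x16 macrocell
    of the 128x128 grid that contains the state."""
    n = grid // cell                      # 8 macrocells per side
    idx = (y // cell) * n + (x // cell)   # row-major macrocell index
    phi = np.zeros(n * n)
    phi[idx] = 1.0
    return phi
```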
Gridworld Results
[Figures: Performance vs. # Trajectories; Distance to expert vs. # Iterations]
Car Driving Experiment
● No explicit reward function at all!
● Expert demonstrates the proper policy via 2 min. of driving time on the simulator (1200 data points).
● 5 different “driver types” tried.
● Features: which lane the car is in, distance to the closest car in the current lane.
● Algorithm run for 30 iterations; the final policy was hand-picked.
● Movie Time! (Expert left, IRL right)
Demo-1 Nice
Demo-2 Right Lane Nasty
Car Driving Results