RL via Practice and Critique Advice
Kshitij Judah, Saikat Roy, Alan Fern and Tom Dietterich
PROBLEM: RL takes a long time to learn a good policy.
[Figure: the reinforcement learning loop — the agent receives state and reward from the Environment and emits actions; a Teacher additionally observes the agent's behavior and provides advice.]
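To ground the standard RL loop described above, here is a minimal tabular Q-learning sketch on a toy chain MDP. The environment and all names are illustrative (this is not the paper's domain); it also hints at why unaided RL is slow, since reward is only observed at the goal state:

```python
import random

def q_learning(n_states=5, episodes=200, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on a toy chain MDP: actions move left/right,
    reward 1 is given only on reaching the rightmost (terminal) state."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]  # Q[state][action]; 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection
            a = rng.randrange(2) if rng.random() < eps else max((0, 1), key=lambda a: Q[s][a])
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # standard Q-learning update
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning()
```

After training, the greedy policy moves right in every non-terminal state; early episodes are long because the agent must stumble to the goal before any reward signal propagates back, which is exactly the slowness a teacher's critique is meant to shortcut.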
RESEARCH QUESTION: Can we make RL perform better with outside help, such as critique/advice from a teacher, and if so, how?
DESIDERATA: Non-technical users as teachers; natural interaction methods.

Earlier Work
High-level rules as advice for RL: in the form of programming-language constructs (Maclin and Shavlik 1996), rules about action and utility preferences (Maclin et al. 2005), and logical rules derived from a constrained natural language (Kuhlmann et al. 2004).
Learning by Demonstration (LBD): the user provides full demonstrations of a task that the agent can learn from (Billard et al. 2008). Recent work (Coates, Abbeel, and Ng 2008) adds model learning to improve on demonstrations but does not allow users to provide feedback.
Argall, Browning, and Veloso (2007; 2008) combine LBD with human critiques of behavior (similar to our work here), but there is no autonomous practice.
Real-time feedback from the user: the TAMER framework (Knox and Stone 2009) uses a type of supervised learning to predict the user's reward signal, then selects actions to maximize predicted reward.
Thomaz and Breazeal (2008) instead combine the end-user reward signal with the environmental reward, and use Q-learning.
RL via Practice + Critique Advice
Solution Approach: the agent alternates between GetCritique and GetExperience. GetCritique uses the Advice Interface, through which the user produces critique data C; GetExperience uses the Simulator to produce trajectory data T from autonomous practice; Learn then updates the policy from C and T, and Act executes it.
Features: allows feedback and guidance advice; allows practice; a novel approach to learning from critique advice and practice.
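The alternation between critique and practice can be sketched as a simple skeleton. This is a hypothetical structure, not the paper's code: get_critique, get_experience, and learn stand in for the Advice Interface, the Simulator, and the learning step, and the toy instantiation below is purely illustrative:

```python
def practice_and_critique(policy, get_critique, get_experience, learn, n_sessions=3):
    """Alternate critique collection (C) and autonomous practice (T),
    relearning the policy from both data sets after each round."""
    C, T = [], []
    for _ in range(n_sessions):
        C.extend(get_critique(policy))    # user marks actions good/bad
        T.extend(get_experience(policy))  # practice episodes in the simulator
        policy = learn(policy, C, T)      # update policy from critique + trajectories
    return policy

# toy instantiation: the "policy" is a single number nudged toward the
# critique target 1.0; real C and T would hold labeled actions and trajectories
final = practice_and_critique(
    0.0,
    get_critique=lambda p: [1.0],
    get_experience=lambda p: [p],
    learn=lambda p, C, T: p + 0.5 * (sum(C) / len(C) - p),
)
```

Each round halves the gap to the critique target, so after three rounds the toy policy sits at 0.875.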
Key questions: What are the forms of U and L? How to pick the value of λ?
Answers: optimization using gradient ascent; choosing λ using validation (for each candidate λ, estimate the utility of the resulting policy in the simulator and choose the best).
Problem: Given data sets T and C, how can we update the agent's policy so as to maximize its reward?
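One plausible reading of this setup (an assumption on my part, not the paper's exact equations) is to maximize a combined objective J(θ) = U(θ,T) + λ·L(θ,C) by gradient ascent. The sketch below uses a finite-difference gradient and toy one-parameter stand-ins for U and the critique term:

```python
def gradient_ascent(J, theta, lr=0.1, steps=100, h=1e-5):
    """Maximize J by ascending a forward-difference gradient estimate."""
    for _ in range(steps):
        base = J(theta)
        grad = [(J(theta[:i] + [t + h] + theta[i + 1:]) - base) / h
                for i, t in enumerate(theta)]
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

# toy stand-ins: "utility" peaks at theta = 2, "critique term" peaks at theta = 0,
# and lam trades the two off; with lam = 1 the optimum is theta = 2 / (1 + lam) = 1
lam = 1.0
U = lambda th: -(th[0] - 2.0) ** 2
L = lambda th: -th[0] ** 2
theta = gradient_ascent(lambda th: U(th) + lam * L(th), [0.0])
```

Rerunning this for several values of λ and scoring each resulting policy on held-out data is the natural implementation of "choosing λ using validation."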
Critique Data Loss L(θ,C): let O(s) denote the set of all optimal actions in state s. Any action not in O(s) is suboptimal; all actions in O(s) are equally good.
Learning Goal: find a probabilistic policy that has a high probability of returning an action in O(s) when applied to s. It is not important which action is selected as long as the probability of selecting an action in O(s) is high.
We call this problem Any Label Learning (ALL)
ALL likelihood: the probability that the policy selects some acceptable action at every labeled state, ∏_s Σ_{a ∈ O(s)} π_θ(a|s).
The Multi-Label Learning problem (Tsoumakas and Katakis 2007) differs in that the goal is to learn a classifier that outputs all of the labels in the set and no others.
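As a concrete (assumed) instantiation of the ALL objective, under a linear softmax policy the per-state ALL likelihood is simply the total probability mass the policy puts on O(s). A minimal sketch, with all names my own:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def all_log_likelihood(theta, C):
    """ALL log-likelihood of critique data C.
    C: list of (features, O) pairs, O being the set of acceptable action indices.
    theta: one linear weight vector per action."""
    ll = 0.0
    for x, O in C:
        scores = [sum(w * xi for w, xi in zip(theta[a], x)) for a in range(len(theta))]
        p = softmax(scores)
        ll += math.log(sum(p[a] for a in O))  # any action in O counts as success
    return ll
```

With uniform (all-zero) weights over two actions and a single acceptable action, the per-state likelihood is 0.5, matching the intuition that only the total mass on O(s) matters, not which member is picked.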
Reality: there does not exist an ideal teacher!!!
Expected Any-Label Learning
Ideal Teacher
Key idea: define a user model that induces a distribution over ALL problems.
User model: a distribution over label sets given the critique data, assuming independence among different states.
We introduce two noise parameters and one bias parameter (the probability that an unlabeled action is in O(s)).
Expected ALL likelihood:
Closed form of likelihood:
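The poster's user model and its closed form are not fully reproduced here, so the following is a hedged sketch of one natural reading: assume each action belongs to O(s) independently with a probability determined by its label, with placeholder parameter names alpha, beta, and bias standing in for the poster's (unshown) symbols. Under that assumption, the expected per-state ALL likelihood has a closed form by linearity of expectation:

```python
def membership_prob(label, alpha=0.1, beta=0.1, bias=0.2):
    """Assumed probability that an action is in O(s), given its critique label.
    alpha, beta, bias are illustrative placeholders, not the paper's symbols."""
    if label == "good":
        return 1.0 - alpha   # a "good" label is wrong with probability alpha
    if label == "bad":
        return beta          # a "bad" label is wrong with probability beta
    return bias              # an unlabeled action is optimal with probability bias

def expected_all_likelihood(pi, labels, **noise):
    """E[ sum_{a in O(s)} pi(a|s) ] = sum_a P(a in O(s)) * pi(a|s),
    by linearity of expectation under independent memberships.
    pi: action probabilities under the policy; labels: 'good'/'bad'/None per action."""
    return sum(p * membership_prob(l, **noise) for p, l in zip(pi, labels))
```

This may differ from the paper's exact formulation, but it illustrates why a closed form exists: the expectation distributes over the sum of per-action probabilities.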
Simulated Experiments
Experimental Setup. Domain: RTS tactical micro-management, 5 friendly footmen versus 5 enemy footmen (Wargus AI). Difficulty: fast pace and multiple units acting in parallel. Setup: we provide end-users with an interface that allows them to watch a battle and pause at any moment; the user can then scroll back and forth within the episode and mark any possible action of any agent as good or bad.
Available actions for each military unit are to attack any of the units on the map (enemy or friendly) giving a total of 9 actions per unit.
Two battle maps, which differed only in the initial placement of the units.
Evaluated 3 Learning Systems: Pure RL: Only practice Pure Supervised: Only critique Combined System: Critique + Practice
Goal: Test learning capability with varying amount of critique and practice data
Total of 2 users per map. For each user: divide the critique data into 4 equal-sized segments, creating four data sets per user containing 25%, 50%, 75%, and 100% of their respective critique data.
We provided the combined system with each of these data sets and allowed it to practice for 100 episodes.
[Figures: Map 1, Map 2, and the Advice Interface]
User Studies
The user study involved 10 end-users: 6 with CS backgrounds and 4 without. For each user, the study consisted of teaching both the pure supervised and the combined systems, each on a different map, for a fixed amount of time (Supervised: 30 mins, Combined: 60 mins).
These results show that the users were able to significantly outperform pure RL using both the supervised and combined systems.
The end-users had slightly greater success with the pure supervised system versus the combined system, for two reasons: a large delay experienced while waiting for the practice stages to end, and a policy returned by practice that was sometimes poor and ignored advice, which users found frustrating.
Lesson Learned: such behavior is detrimental to the user experience and overall performance.
Future Work: a better-behaving combined system, and studies where users are not captive during practice stages.
Advice Patterns of Users
Fraction of positive, negative and mixed advice.
[Chart: advice fractions for the Supervised and Combined systems]
Positive (or negative) advice is where the user only gives feedback on the action taken by the agent.
Mixed is where the user not only gives feedback on the agent's action but also suggests alternative actions to the agent.
Expected Utility U(θ,T)
We use likelihood weighting to estimate the utility U(θ,T) of policy π_θ from off-policy trajectories T (Peshkin and Shelton 2002). Let p_θ(τ) be the probability of generating trajectory τ under parameters θ, and let θ' be the parameters of the policy that generated τ. An unbiased utility estimate is given by:

U(θ,T) = (1/|T|) Σ_{τ ∈ T} R(τ) · w(τ), where w(τ) = p_θ(τ) / p_θ'(τ) = ∏_t π_θ(a_t|s_t) / π_θ'(a_t|s_t)

(the transition probabilities cancel in the ratio, so only policy terms remain). The gradient of U(θ,T) has a compact closed form.
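The likelihood-weighted estimate can be sketched in a few lines; function and variable names here are mine, not the paper's. The single-step bandit example below checks that reweighting trajectories from a uniform behavior policy recovers the target policy's expected reward:

```python
import math

def is_utility(trajectories, log_p_theta, log_p_behavior):
    """Importance-sampling utility estimate:
    U_hat = (1/N) * sum over (tau, R) of R * p_theta(tau) / p_behavior(tau).
    Log-probabilities are used for numerical stability."""
    total = 0.0
    for tau, R in trajectories:
        total += R * math.exp(log_p_theta(tau) - log_p_behavior(tau))
    return total / len(trajectories)

# single-step bandit: behavior policy is uniform over actions {0, 1};
# the target policy puts probability 0.8 on action 0, which yields reward 1
trajs = [(0, 1.0), (1, 0.0)]  # (action, reward) pairs, one per behavior action
est = is_utility(
    trajs,
    log_p_theta=lambda a: math.log(0.8 if a == 0 else 0.2),
    log_p_behavior=lambda a: math.log(0.5),
)
```

In this toy case the two trajectories enumerate the behavior policy's outcomes exactly, so the estimate coincides with the target policy's true expected reward of 0.8; with sampled trajectories the estimate would be unbiased but noisy.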