
RL via Practice and Critique Advice
Kshitij Judah, Saikat Roy, Alan Fern and Tom Dietterich

PROBLEM: RL takes a long time to learn a good policy.

(Figure: Reinforcement learning loop. The agent exchanges state, action, and reward with the Environment, while a Teacher observes the agent's behavior and provides advice.)

RESEARCH QUESTION: Can we make RL perform better with outside help, such as critique/advice from a teacher, and if so, how?

DESIDERATA:
Non-technical users as teachers
Natural interaction methods

Earlier Work

High-level rules as advice for RL: in the form of programming-language constructs (Maclin and Shavlik 1996), rules about action and utility preferences (Maclin et al. 2005), and logical rules derived from a constrained natural language (Kuhlmann et al. 2004).

Learning by Demonstration (LBD): the user provides full demonstrations of a task that the agent can learn from (Billard et al. 2008). Recent work (Coates, Abbeel, and Ng 2008) includes model learning to improve on demonstrations, but does not allow users to provide feedback.

Argall, Browning, and Veloso (2007; 2008) combine LBD with human critiques of behavior (similar to our work here), but there is no autonomous practice.

Real-time feedback from the user: the TAMER framework (Knox and Stone 2009) uses a form of supervised learning to predict the user's reward signal, and then selects actions to maximize the predicted reward.

Thomaz and Breazeal (2008) instead combine the end-user reward signal with the environmental reward, and use Q-learning.

RL via Practice + Critique Advice

(Figure: System overview. The agent alternates between getting critique from the teacher through the Advice Interface, which yields critique data C, and getting experience through the Simulator, which yields trajectory data T; both feed a Learn step that updates the policy used to Act.)

Features:
Allows feedback and guidance advice
Allows practice
Novel approach to learn from critique advice and practice

Solution Approach

Problem: Given data sets T and C, how can we update the agent's policy so as to maximize its reward?

(Flowchart: estimate the utility from simulator experience and optimize the policy using gradient ascent, choosing λ using validation.)

Key questions: What are the forms of U and L? How do we pick the value of λ?
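One plausible way the pieces fit together, stated here as an assumption rather than the authors' exact formula: perform gradient ascent on a combined objective that trades off the estimated utility against the critique loss,

\[ J(\theta) \;=\; U(\theta, T) \;-\; \lambda\, L(\theta, C), \qquad \theta \;\leftarrow\; \theta + \eta\, \nabla_\theta J(\theta), \]

where η is a step size and λ, which balances practice experience against critique advice, is chosen by validation as indicated above.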

Critique Data Loss L(θ,C)

Ideal Teacher: let O(s) denote the set of all optimal actions in state s. Any action not in O(s) is suboptimal, and all actions in O(s) are equally good.

Learning Goal: find a probabilistic policy that has a high probability of returning an action in O(s) when applied to state s. It is not important which action is selected, as long as the probability of selecting an action in O(s) is high.

We call this problem Any-Label Learning (ALL).

ALL likelihood:
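A plausible form of this likelihood, stated here as an assumption that follows directly from the learning goal above (treating critiqued states independently), is the probability that the policy picks an optimal action in every critiqued state:

\[ \prod_{s} \pi_\theta\big(O(s) \mid s\big) \;=\; \prod_{s} \sum_{a \in O(s)} \pi_\theta(a \mid s), \]

with the product taken over the states that received critiques; a natural choice for the critique loss L(θ,C) is then the negative log of this quantity.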

The Multi-Label Learning problem (Tsoumakas and Katakis 2007) differs in that the goal is to learn a classifier that outputs all of the labels in each label set and no others.

Reality: an ideal teacher does not exist!

Expected Any-Label Learning


Key idea: define a user model that induces a distribution over ALL problems.

User model: a distribution over the sets O(s) given the critique data; we assume independence among different states.

We introduce two noise parameters and one bias parameter (the probability that an unlabeled action is in O(s)).

Expected ALL likelihood:

Closed form of likelihood:
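A plausible reconstruction of these quantities, stated as an assumption: suppose the user model assigns each action a in state s an independent probability q_s(a) of belonging to O(s) (close to 1 for actions labeled good, close to 0 for actions labeled bad, and equal to the bias parameter for unlabeled actions, with the noise parameters controlling the labeled cases). Then

\[ \mathbb{E}_{O}\!\left[\prod_{s} \sum_{a \in O(s)} \pi_\theta(a \mid s)\right] \;=\; \prod_{s} \sum_{a} q_s(a)\, \pi_\theta(a \mid s), \]

where the closed form on the right follows from linearity of expectation within each state and the assumed independence among states.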

Simulated Experiments

Experimental Setup
Our domain: RTS tactical micro-management, 5 friendly footmen versus 5 enemy footmen (Wargus AI). Difficulty: fast pace and multiple units acting in parallel.
Our setup: we provide end-users with an interface that allows them to watch a battle and pause at any moment. The user can then scroll back and forth within the episode and mark any possible action of any agent as good or bad.
The available actions for each military unit are to attack any of the other units on the map (enemy or friendly), giving a total of 9 actions per unit.

Two battle maps, which differed only in the initial placement of the units.

Evaluated 3 learning systems:
Pure RL: only practice
Pure Supervised: only critique
Combined System: critique + practice

Goal: Test learning capability with varying amounts of critique and practice data.

A total of 2 users per map. For each user: divide the critique data into 4 equal-sized segments, creating four data sets per user containing 25%, 50%, 75%, and 100% of their respective critique data.

We provided the combined system with each of these data sets and allowed it to practice for 100 episodes.
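As a concrete illustration of this protocol, the following minimal Python sketch shows how one user's critique data could be split and fed to the combined system; train_combined_system and evaluate_policy are hypothetical stand-ins for the learner and the simulator-based evaluation, and treating the four data sets as cumulative prefixes of the critique data is an assumption.

# Illustrative sketch of the simulated-experiment protocol (not the authors' code).
def run_protocol(critique_data, train_combined_system, evaluate_policy,
                 n_practice_episodes=100):
    """For one user: build 25%/50%/75%/100% subsets of the critique data,
    train the combined system on each with a fixed practice budget of
    n_practice_episodes, and record the resulting policy's performance."""
    results = {}
    for frac in (0.25, 0.50, 0.75, 1.00):
        subset = critique_data[:int(round(frac * len(critique_data)))]
        policy = train_combined_system(subset, n_practice_episodes)
        results[frac] = evaluate_policy(policy)
    return results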

(Figures: results for Map 1 and Map 2, and the Advice Interface.)

User Studies

The user study involved 10 end-users: 6 with CS backgrounds and 4 without. For each user, the study consisted of teaching both the pure supervised and the combined systems, each on a different map, for a fixed amount of time (Supervised: 30 minutes, Combined: 60 minutes).

These results show that the users were able to significantly outperform pure RL using both the supervised and combined systems.

The end-users had slightly greater success with the pure supervised system than with the combined system, for two reasons that users found frustrating: the large delay experienced while waiting for the practice stages to end, and the policy returned by practice, which was sometimes poor and ignored the advice.

Lesson Learned: such behavior is detrimental to the user experience and overall performance.

Future Work: a better-behaved combined system, and studies where users are not captive during practice stages.

Advice Patterns of Users

(Figure: fraction of positive, negative, and mixed advice given with the Supervised and Combined systems.)

Positive (or negative) advice is where the user only gives feedback on the action taken by the agent.

Mixed advice is where the user not only gives feedback on the agent's action but also suggests alternative actions to the agent.

Expected Utility U(θ,T)

We use likelihood weighting to estimate the utility U(θ,T) of the policy π_θ from off-policy trajectories T (Peshkin and Shelton 2002). Each trajectory in T is weighted by the probability of generating it under π_θ divided by the probability of generating it under the policy that actually produced it. The weighted average of the trajectories' returns is an unbiased utility estimate, and the gradient of this estimate has a compact closed form.
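Written out, a standard form of this estimator (the symbols ξ_i for the i-th trajectory, θ_i for the parameters of the policy that generated it, and R(ξ) for a trajectory's total return are introduced here for illustration) is

\[ \widehat{U}(\theta, T) \;=\; \frac{1}{|T|} \sum_{i=1}^{|T|} w_i\, R(\xi_i), \qquad w_i \;=\; \prod_{t} \frac{\pi_\theta\big(a^i_t \mid s^i_t\big)}{\pi_{\theta_i}\big(a^i_t \mid s^i_t\big)}, \]

since the environment's transition probabilities cancel in the ratio. Its gradient then takes the compact form

\[ \nabla_\theta \widehat{U}(\theta, T) \;=\; \frac{1}{|T|} \sum_{i=1}^{|T|} w_i\, R(\xi_i) \sum_{t} \nabla_\theta \log \pi_\theta\big(a^i_t \mid s^i_t\big). \]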