Interactively Shaping Agents via Human Reinforcement
W. Bradley Knox and Peter Stone
The University of Texas at Austin, Department of Computer Science
The TAMER Framework
Learning agents
©1997-2009 Adam Dorman
Autonomous Learning
• human defines and programs an evaluation function and then steps back
Autonomous Learning
• can be called reinforcement learning
– types:
  • value function approximation
  • policy search
• dominant in research
Kohl and Stone (2004)
Shaping
Def. - creating a desired behavior by reinforcing successive approximations of the behavior
LOOK magazine, 1952
The Shaping Scenario (in this context)
A human trainer observes an agent and manually delivers reinforcement (a scalar value), signaling approval or disapproval.
Why shaping?
Potential benefits over purely autonomous learners:
• No evaluation function needed
• Allows lay users to teach agents the policies that they prefer (no programming!)
• Decreases sample size
• Learns in more challenging domains
Research Question
How can agents harness the information contained in signals of positive and negative evaluation from a human to learn sequential decision-making tasks?
Talk Goals
• the usual
• get suggestions for cognitive science directions
• collaborate
Types of Natural Knowledge Transfer from Humans
• Imitation learning (i.e., Programming by Demonstration)
• Natural Language Advice
• Shaping
Imitation Learning (Schaal et al., 2003)
Def. - agent observes demonstrations from a human expert and learns to imitate the human’s behavior
• often the goal is to generalize to unseen states
Natural Language Advice
Robot soccer: “If player 2 has the ball and is near the opponent’s goal, player 2 should shoot the ball at the goal.”
• Kuhlmann et al., 2004
Def. - using natural language, a human gives advice in the form of conditions and suggested behavior under those conditions
If limited to one form of knowledge transfer...
(Decision diagram; each question branches yes/no)
• Questions:
– cheap samples?
– can define and program an evaluation function?
– requisite expertise and interface to control?
• Outcomes:
– autonomous learning via an evaluation function
– programming by demonstration
– shaping
Outline
• Intro to shaping
• Our approach
• Future work
Previous work on human-shapable agents
• Clicker training for entertainment agents (Blumberg et al., 2002; Kaplan et al., 2002)
• Sophie’s World (Thomaz & Breazeal, 2006)
– RL with reward = environmental (MDP) reward + human reinforcement
• Social software agent Cobot in LambdaMOO (Isbell et al., 2006)
– RL with reward = human reinforcement
The Shaped Agent’s Perspective
• Each time step, agent:
– receives state description
– might receive a real-valued human reinforcement signal
– chooses an action
– does not receive an MDP reward signal
MDP reward vs. Human reinforcement
• MDP reward
– Key problem: credit assignment from sparse, delayed rewards
I won!
But why did I win?
MDP reward vs. Human reinforcement
Reinforcement from a human trainer:
– Trainer has long-term impact in mind
– Reinforcement is within a small temporal window of the targeted behavior
– Credit assignment problem is largely removed
BAD ROBOT!!
I just did something bad…
Teaching an Agent Manually via Evaluative Reinforcement (TAMER)
• TAMER approach:
– Learn a model of human reinforcement
– Directly exploit the model to determine policy
• If greedy:
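A greedy policy over the learned model can be written as follows, where \hat{H}(s, a) denotes the model's predicted human reinforcement for taking action a in state s. The symbol names are assumptions for illustration and are not taken from the slide.

```latex
\pi(s) = \operatorname*{arg\,max}_{a} \hat{H}(s, a)
```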
Teaching an Agent Manually via Evaluative Reinforcement (TAMER)
Learning from targeted human reinforcement is a supervised learning problem, not a reinforcement learning problem.
Tetris
– Drop blocks to make solid horizontal lines, which then disappear
– |state space| > 2^200
– Challenging (NP-hard) but slow
– 21 features extracted from (s, a)
– TAMER model:
  • Linear model over features
  • Gradient descent updates
– Greedy action selection
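The three ingredients listed above (a linear model over features of (s, a), gradient descent updates, and greedy action selection) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name, the step size, the squared-error loss, and the two-feature toy extractor are all assumptions made for the example.

```python
class LinearReinforcementModel:
    """Linear model of human reinforcement over features of (state, action)."""

    def __init__(self, num_features, step_size=0.1):
        self.weights = [0.0] * num_features
        self.step_size = step_size  # assumed value, not from the slides

    def predict(self, features):
        # Predicted human reinforcement: dot product of weights and features.
        return sum(w * f for w, f in zip(self.weights, features))

    def update(self, features, human_reinforcement):
        # One gradient-descent step on the squared error between the
        # trainer's signal and the model's prediction (assumed loss).
        error = human_reinforcement - self.predict(features)
        self.weights = [
            w + self.step_size * error * f
            for w, f in zip(self.weights, features)
        ]


def greedy_action(model, state, actions, extract_features):
    # Choose the action with the highest predicted human reinforcement.
    return max(actions, key=lambda a: model.predict(extract_features(state, a)))


def toy_features(state, action):
    # Hypothetical feature extractor: one indicator feature per action.
    return [1.0, 0.0] if action == "left" else [0.0, 1.0]
```

With the toy extractor, repeatedly calling `update` with +1 for "left" and -1 for "right" drives `greedy_action` to select "left". A Tetris agent would replace `toy_features` with its 21-feature extractor.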
TAMER in action: Tetris
Before training:
Training:
After training:
TAMER Results: Tetris (9 subjects)
Credit assignment
Tasks with several time steps per second
(Figure: distribution of human response delay, P(response delay); Hockley, 1984)
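When several time steps pass per second, one way to spread a single reinforcement signal over recent steps is to weight each step by the probability that the trainer's response delay places the targeted behavior inside that step. The sketch below assumes a Gamma-shaped delay density with shape 2 (chosen only because its CDF has a closed form) and an arbitrary scale. The density, the scale value, and the function names are not taken from the slides.

```python
import math


def delay_cdf(delay, scale=0.3):
    """CDF of an assumed Gamma(shape=2, scale) response-delay density."""
    if delay <= 0:
        return 0.0
    x = delay / scale
    return 1.0 - math.exp(-x) * (1.0 + x)


def credit_weights(step_intervals, feedback_time, scale=0.3):
    """Weight each time step by the probability mass of delays that would
    map the reinforcement received at feedback_time back into that step.

    step_intervals: list of (start, end) times in seconds, one per step.
    """
    weights = []
    for start, end in step_intervals:
        shortest_delay = feedback_time - end
        longest_delay = feedback_time - start
        weights.append(
            delay_cdf(longest_delay, scale) - delay_cdf(shortest_delay, scale)
        )
    return weights
```

For example, `credit_weights([(0.0, 0.2), (0.2, 0.4), (0.4, 0.6)], feedback_time=0.6)` returns one non-negative weight per step. Steps that begin after the feedback arrives receive zero credit.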
TAMER in action: Mountain Car
Before training:
Training:
TAMER Results: Mountain Car (19 subjects)
Contributions of finished work
• new learning paradigm: explicitly defining it and showing its power
– baseline algorithms with great results
• guidance and justification of which algorithms to use
• novel credit assignment method
Publications
• W. Bradley Knox and Peter Stone. TAMER: Training an Agent Manually via Evaluative Reinforcement. In IEEE 7th International Conference on Development and Learning (ICDL-08), August 2008.
• W. Bradley Knox, Ian Fasel, and Peter Stone. Design Principles for Creating Human-Shapable Agents. In AAAI Spring 2009 Symposium on Agents that Learn from Human Teachers, March 2009.
• W. Bradley Knox and Peter Stone. Interactively Shaping Agents via Human Reinforcement: The TAMER Framework. To appear in Proceedings of The Fifth International Conference on Knowledge Capture (KCAP-09), September 2009.
Outline
• Intro to shaping
• Our approach
• Future work
Future work
1. Identify TAMER’s strengths and weaknesses
2. TAMER+R
– Human reinforcement: rich but flawed
– MDP Reward (R): sparse but flawless
– How to use the two signals together?
Future work
3. Extend TAMER to training scenarios that violate our current assumptions
4. What about the human?
– Investigate how humans train via reinforcement.
5. Other cognitive science directions...
The end