Interactively Shaping Agents via Human Reinforcement


Transcript of Interactively Shaping Agents via Human Reinforcement

Page 1: Interactively Shaping Agents via Human Reinforcement

Interactively Shaping Agents via Human Reinforcement

W. Bradley Knox and

Peter Stone

The University of Texas at Austin, Department of Computer Science

The TAMER Framework

Page 2: Interactively Shaping Agents via Human Reinforcement

Learning agents

©1997-2009 Adam Dorman

Page 3: Interactively Shaping Agents via Human Reinforcement

Autonomous Learning

• human defines and programs an evaluation function and then steps back

Page 4: Interactively Shaping Agents via Human Reinforcement

Autonomous Learning

• can be called reinforcement learning
  – types:
    • value function approximation
    • policy search

• dominant in research

Kohl and Stone (2004)

Page 5: Interactively Shaping Agents via Human Reinforcement

Shaping

Def. - creating a desired behavior by reinforcing successive approximations of the behavior

LOOK magazine, 1952

Page 6: Interactively Shaping Agents via Human Reinforcement

The Shaping Scenario (in this context)

A human trainer observes an agent and manually delivers reinforcement (a scalar value), signaling approval or disapproval.

Page 7: Interactively Shaping Agents via Human Reinforcement

Why shaping?

Potential benefits over purely autonomous learners:

• No evaluation function needed
• Allows lay users to teach agents the policies that they prefer (no programming!)
• Decreases sample complexity
• Learns in more challenging domains

Page 8: Interactively Shaping Agents via Human Reinforcement

Research Question

How can agents harness the information contained in signals of positive and negative evaluation from a human to learn sequential decision-making tasks?

Page 9: Interactively Shaping Agents via Human Reinforcement

Talk Goals

• the usual

• get suggestions for cognitive science directions

• collaborate

Page 10: Interactively Shaping Agents via Human Reinforcement

Types of Natural Knowledge Transfer from Humans

• Imitation learning (i.e., programming by demonstration)
• Natural Language Advice
• Shaping

Page 11: Interactively Shaping Agents via Human Reinforcement

Imitation Learning (Schaal et al., 2003)

Def. - agent observes demonstrations from a human expert and learns to imitate the human’s behavior

• often the goal is to generalize to unseen states

Page 12: Interactively Shaping Agents via Human Reinforcement

Natural Language Advice

Robot soccer: “If player 2 has the ball and is near the opponent’s goal, player 2 should shoot the ball at the goal.”

• Kuhlmann et al., 2004

Def. - using natural language, a human gives advice in the form of conditions and suggested behavior under those conditions

Page 13: Interactively Shaping Agents via Human Reinforcement

If limited to one form of knowledge transfer...

[Flowchart: the choice among autonomous learning via an evaluation function, programming by demonstration, and shaping, branching on three yes/no questions: Are samples cheap? Can you define and program an evaluation function? Do you have the requisite expertise and interface to control the agent?]

Page 14: Interactively Shaping Agents via Human Reinforcement

Outline

• Intro to shaping
• Our approach
• Future work

Page 15: Interactively Shaping Agents via Human Reinforcement

Previous work on human-shapable agents

• Clicker training for entertainment agents (Blumberg et al., 2002; Kaplan et al., 2002)

• Sophie’s World (Thomaz & Breazeal, 2006)
  – RL with reward = environmental (MDP) reward + human reinforcement

• Social software agent Cobot in LambdaMoo (Isbell et al., 2006)
  – RL with reward = human reinforcement

Page 16: Interactively Shaping Agents via Human Reinforcement

The Shaped Agent’s Perspective

• Each time step, the agent:
  – receives a state description
  – might receive a real-valued human reinforcement signal
  – chooses an action
  – does not receive an MDP reward signal

(A minimal sketch of this interface follows.)
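To make the per-step loop concrete, here is a minimal Python sketch; the class and every name in it (ShapedAgent, update, select_action) are illustrative placeholders, not the authors' implementation.

```python
import random

class ShapedAgent:
    """Illustrative sketch of the shaped agent's per-step
    interface; all names here are hypothetical."""

    def __init__(self, actions):
        self.actions = actions

    def update(self, reinforcement):
        # Placeholder: a TAMER agent would update its learned
        # model of human reinforcement here.
        pass

    def select_action(self, state):
        # Placeholder policy; TAMER selects greedily with respect
        # to its learned reinforcement model (described later).
        return random.choice(self.actions)

    def step(self, state, human_reinforcement=None):
        # Each time step: receive a state description, possibly a
        # real-valued human reinforcement signal, and choose an
        # action. Note: no MDP reward signal is received.
        if human_reinforcement is not None:
            self.update(human_reinforcement)
        return self.select_action(state)
```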

Page 17: Interactively Shaping Agents via Human Reinforcement

MDP reward vs. human reinforcement

• MDP reward
  – Key problem: credit assignment from sparse, delayed rewards

“I won! But why did I win?”

Page 18: Interactively Shaping Agents via Human Reinforcement

MDP reward vs. human reinforcement

• Reinforcement from a human trainer:
  – Trainer has long-term impact in mind
  – Reinforcement is within a small temporal window of the targeted behavior
  – Credit assignment problem is largely removed

“BAD ROBOT!!” … “I just did something bad…”

Page 19: Interactively Shaping Agents via Human Reinforcement

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)

• TAMER approach:
  – Learn a model of human reinforcement, Ĥ(s, a)
  – Directly exploit the model to determine the policy

• If greedy, take the action predicted to elicit the most reinforcement: a = argmax_a Ĥ(s, a). (A minimal sketch follows.)
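A minimal sketch of that greedy policy, assuming the learned model is available as a callable `h_hat(state, action)` (a hypothetical name):

```python
def greedy_action(h_hat, state, actions):
    """Pick the action predicted to elicit the most human
    reinforcement: argmax over a of h_hat(state, a)."""
    return max(actions, key=lambda a: h_hat(state, a))
```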

Page 20: Interactively Shaping Agents via Human Reinforcement

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)

Learning from targeted human reinforcement
is a supervised learning problem,
not a reinforcement learning problem.
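One way to state the contrast in symbols: the supervised learner regresses directly on the human's near-immediate signal, whereas an RL value method must estimate a long-term return.

```latex
% Supervised target: the human's near-immediate signal h
\hat{H}(s,a) \approx h
% versus an RL value target: the expected discounted return
Q(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a\right]
```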

Page 21: Interactively Shaping Agents via Human Reinforcement

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)

Page 22: Interactively Shaping Agents via Human Reinforcement

Tetris

– Drop blocks to make solid horizontal lines, which then disappear
– |state space| > 2^200
– Challenging (NP-hard) but slow
– 21 features extracted from (s, a)
– TAMER model (a minimal sketch follows this list):
  • Linear model over features
  • Gradient descent updates
– Greedy action selection
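The following is a minimal sketch of that model, assuming a squared-error gradient step; the feature extractor, learning rate, and all names are placeholders, not the authors' code.

```python
import numpy as np

NUM_FEATURES = 21   # features extracted from (s, a), per the slide
ALPHA = 0.02        # learning rate (illustrative value)

weights = np.zeros(NUM_FEATURES)

def h_hat(features):
    # Linear model of human reinforcement over (s, a) features.
    return weights @ features

def update(features, h):
    # One gradient-descent step on the squared error (h_hat - h)^2.
    global weights
    error = h - h_hat(features)
    weights += ALPHA * error * features

def greedy_action(state, actions, extract_features):
    # Choose the action whose features predict the most reinforcement.
    return max(actions, key=lambda a: h_hat(extract_features(state, a)))
```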

Page 23: Interactively Shaping Agents via Human Reinforcement

TAMER in action: Tetris

[Videos: before training, during training, and after training]

Page 24: Interactively Shaping Agents via Human Reinforcement

TAMER Results: Tetris (9 subjects)

Page 25: Interactively Shaping Agents via Human Reinforcement

TAMER Results: Tetris (9 subjects)

Page 26: Interactively Shaping Agents via Human Reinforcement

Credit assignment

Tasks with several time steps per second raise a credit assignment question: which recent time steps was a given reinforcement signal targeting? Each signal can be spread over recent steps, weighted by the probability of each response delay (a simplified sketch follows).

[Plot: distribution of human response delay, P(response delay), after Hockley (1984)]
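A simplified credit-assignment sketch under an assumed response-delay density; the uniform density below is only a stand-in for an empirically grounded one (e.g., fit to data like Hockley's).

```python
def credit_weights(event_ages, delay_pdf):
    # Spread one human reinforcement signal over recent time steps:
    # each step is weighted by the density that the human was
    # responding to it, normalized to sum to 1.
    raw = [delay_pdf(age) for age in event_ages]
    total = sum(raw)
    return [w / total for w in raw] if total > 0 else raw

# Crude stand-in density: responses arrive 0.2-0.8 s after the event.
uniform_pdf = lambda d: 1.0 if 0.2 <= d <= 0.8 else 0.0
ages = [0.1, 0.3, 0.5, 0.9]  # seconds since each recent (s, a) pair
print(credit_weights(ages, uniform_pdf))  # -> [0.0, 0.5, 0.5, 0.0]
```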

Page 27: Interactively Shaping Agents via Human Reinforcement

TAMER in action: Mountain Car

[Videos: before training and during training]

Page 28: Interactively Shaping Agents via Human Reinforcement

TAMER Results: Mountain Car (19 subjects)

Page 29: Interactively Shaping Agents via Human Reinforcement

Contributions of finished work

• new learning paradigm: explicitly defining it and showing its power
  – baseline algorithms with great results

• guidance and justification of which algorithms to use

• novel credit assignment method

Page 30: Interactively Shaping Agents via Human Reinforcement

Publications

• W. Bradley Knox and Peter Stone. TAMER: Training an Agent Manually via Evaluative Reinforcement. In IEEE 7th International Conference on Development and Learning (ICDL-08), August 2008.

• W. Bradley Knox, Ian Fasel, and Peter Stone. Design Principles for Creating Human-Shapable Agents. In AAAI Spring 2009 Symposium on Agents that Learn from Human Teachers, March 2009.

• W. Bradley Knox and Peter Stone. Interactively Shaping Agents via Human Reinforcement: The TAMER Framework. To appear in Proceedings of the Fifth International Conference on Knowledge Capture (KCAP-09), September 2009.

Page 31: Interactively Shaping Agents via Human Reinforcement

Outline

• Intro to shaping
• Our approach
• Future work

Page 32: Interactively Shaping Agents via Human Reinforcement

Future work

1. Identify TAMER’s strengths and weaknesses

2. TAMER+R
  – Human reinforcement: rich but flawed
  – MDP reward (R): sparse but flawless
  – How to use the two signals together? (One illustrative possibility is sketched below.)
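As an illustration only (not a method proposed in the talk), the learned human-reinforcement model could act as a shaping bonus on top of the sparse MDP reward; the function below and its weight w are hypothetical.

```python
def combined_reward(r_mdp, h_hat_value, w=1.0):
    # Hypothetical TAMER+R combination: add the learned
    # human-reinforcement prediction h_hat(s, a) to the MDP
    # reward as a weighted shaping bonus.
    return r_mdp + w * h_hat_value
```

Whether such an additive combination preserves the strengths of both signals is exactly the open question posed above.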

Page 33: Interactively Shaping Agents via Human Reinforcement

Future work

3. Extend TAMER to training scenarios that violate our current assumptions

4. What about the human?
  – Investigate how humans train via reinforcement.

5. Other cognitive science directions...

Page 34: Interactively Shaping Agents via Human Reinforcement

The end