Interactively Shaping Agents via Human Reinforcement

W. Bradley Knox and Peter Stone

The University of Texas at Austin, Department of Computer Science

The TAMER Framework

Learning agents

©1997-2009 Adam Dorman

Autonomous Learning

• human defines and programs an evaluation function and then steps back

Autonomous Learning

• can be called reinforcement learning
– types:
  • value function approximation
  • policy search
• dominant in research

Kohl and Stone (2004)

Shaping

Def. - creating a desired behavior by reinforcing successive approximations of the behavior

LOOK magazine, 1952

The Shaping Scenario (in this context)

A human trainer observes an agent and manually delivers reinforcement (a scalar value), signaling approval or disapproval.

Why shaping?

Potential benefits over purely autonomous learners:

• No evaluation function needed
• Allows lay users to teach agents the policies that they prefer (no programming!)
• Decreases sample size
• Learns in more challenging domains

Research Question

How can agents harness the information contained in signals of positive and negative evaluation from a human to learn sequential decision-making tasks?

Talk Goals

• the usual

• get suggestions for cognitive science directions

• collaborate

Types of Natural Knowledge Transfer from Humans

• Imitation learning (i.e., Programming by Demonstration)
• Natural Language Advice
• Shaping

Imitation Learning

Schaal et al., 2003

Def. - agent observes demonstrations from a human expert and learns to imitate the human’s behavior

• often the goal is to generalize to unseen states

Natural Language Advice

Robot soccer: “If player 2 has the ball and is near the opponent’s goal, player 2 should shoot the ball at the goal.”

• Kuhlmann et al., 2004

Def. - using natural language, a human gives advice in the form of conditions and suggested behavior under those conditions

If limited to one form of knowledge transfer...

[Decision tree; the branch layout is not recoverable from the extracted text]

Questions (each answered yes or no):
• can define and program an evaluation function?
• cheap samples?
• requisite expertise and interface to control?

Outcomes:
• autonomous learning via an evaluation function
• programming by demonstration
• shaping

Outline

• Intro to shaping
• Our approach
• Future work

Previous work on human-shapable agents

• Clicker training for entertainment agents (Blumberg et al., 2002; Kaplan et al., 2002)
• Sophie’s World (Thomaz & Breazeal, 2006)
– RL with reward = environmental (MDP) reward + human reinforcement
• Social software agent Cobot in LambdaMoo (Isbell et al., 2006)
– RL with reward = human reinforcement

The Shaped Agent’s Perspective

• Each time step, agent:
– receives state description
– might receive a real-valued human reinforcement signal
– chooses an action
– does not receive an MDP reward signal

MDP reward vs. Human reinforcement

• MDP reward
– Key problem: credit assignment from sparse, delayed rewards

I won!

But why did I win?

MDP reward vs. Human reinforcement

Reinforcement from a human trainer:
– Trainer has long-term impact in mind
– Reinforcement is within a small temporal window of the targeted behavior
– Credit assignment problem is largely removed

BAD ROBOT!!

I just did something bad…

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)

• TAMER approach:
– Learn a model of human reinforcement
– Directly exploit the model to determine policy
• If greedy:


Learning from targeted human reinforcement is a supervised learning problem, not a reinforcement learning problem.
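The approach above (learn a model of human reinforcement, then act greedily on it) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the tabular model, the running-average update, and the names `HumanReinforcementModel`, `predict`, `update`, and `greedy_action` are all assumptions made for the example. The point it tries to show is that each human signal serves as a regression label for the (state, action) pair it targets, with no discounting or bootstrapping from future values.

```python
from collections import defaultdict


class HumanReinforcementModel:
    """Tabular regression model of human reinforcement (hypothetical helper)."""

    def __init__(self):
        self.mean = defaultdict(float)   # predicted reinforcement per (state, action)
        self.count = defaultdict(int)    # number of labels seen per (state, action)

    def predict(self, state, action):
        return self.mean[(state, action)]

    def update(self, state, action, h):
        """Supervised update: move the running mean toward the label h."""
        key = (state, action)
        self.count[key] += 1
        self.mean[key] += (h - self.mean[key]) / self.count[key]


def greedy_action(model, state, actions):
    """Pick the action with the highest predicted human reinforcement."""
    return max(actions, key=lambda a: model.predict(state, a))


model = HumanReinforcementModel()
actions = ["left", "right"]
# Hypothetical trainer feedback: disapproval of "left", approval of "right".
model.update("s0", "left", -1.0)
model.update("s0", "right", +1.0)
model.update("s0", "right", +0.5)
chosen = greedy_action(model, "s0", actions)
```

Because the trainer's signal is assumed to already reflect long-term consequences, the sketch maximizes the predicted immediate reinforcement rather than a discounted return.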

Tetris

– Drop blocks to make solid horizontal lines, which then disappear

– |state space| > 2^200

– Challenging (NP-hard) but slow

– 21 features extracted from (s, a)
– TAMER model:
  • Linear model over features
  • Gradient descent updates

– Greedy action selection
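The model described on this slide (a linear model over features of (s, a), gradient descent updates, greedy action selection) might look roughly like the sketch below. Everything concrete here is an assumption for illustration: the two-feature extractor `toy_features`, the action names, the step size, and the scripted feedback stand in for the 21 Tetris features and a real human trainer, none of which are specified in the slides.

```python
def predict(weights, features):
    """Linear model: predicted human reinforcement is the dot product w . f."""
    return sum(w * f for w, f in zip(weights, features))


def gradient_step(weights, features, h, step_size=0.1):
    """One gradient descent step on the squared error (h - w . f)^2 / 2."""
    error = h - predict(weights, features)
    return [w + step_size * error * f for w, f in zip(weights, features)]


def greedy_action(weights, state, actions, extract_features):
    """Choose the action whose features score highest under the model."""
    return max(actions, key=lambda a: predict(weights, extract_features(state, a)))


def toy_features(state, action):
    """Hypothetical 2-feature extractor; the Tetris agent used 21 features."""
    return [1.0 if action == "fill_gap" else 0.0,
            1.0 if action == "stack_high" else 0.0]


weights = [0.0, 0.0]
feedback = [("fill_gap", 1.0), ("stack_high", -1.0)] * 20  # scripted trainer
for action, h in feedback:
    weights = gradient_step(weights, toy_features(None, action), h)

best = greedy_action(weights, None, ["fill_gap", "stack_high"], toy_features)
```

After the scripted feedback, the weight on the approved action's feature is positive and the weight on the disapproved action's feature is negative, so greedy selection prefers `fill_gap`.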

TAMER in action: Tetris

Before training:

Training:

After training:

TAMER Results: Tetris (9 subjects)

Credit assignment

Tasks with several time steps per second

Hockley (1984)

P(response delay)
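One way to read this slide: when a task runs at several time steps per second, a human signal cannot be pinned to a single step, so it is spread over recent steps according to a probability distribution over the trainer's response delay. The sketch below illustrates that idea only. The slide does not preserve the actual density, so the uniform delay between 0.2 and 0.8 seconds, the step timing, and the names `uniform_delay_cdf` and `credit_weights` are all assumptions.

```python
def uniform_delay_cdf(delay, low=0.2, high=0.8):
    """CDF of an assumed uniform response-delay density on [low, high] seconds."""
    if delay <= low:
        return 0.0
    if delay >= high:
        return 1.0
    return (delay - low) / (high - low)


def credit_weights(step_intervals, feedback_time, cdf=uniform_delay_cdf):
    """Probability that the feedback was a response to each time step.

    step_intervals: list of (start, end) times in seconds, one per time step.
    A step gets the probability mass of delays between (feedback_time - end)
    and (feedback_time - start).
    """
    weights = []
    for start, end in step_intervals:
        weights.append(cdf(feedback_time - start) - cdf(feedback_time - end))
    return weights


# Five time steps per second; feedback arrives at t = 1.0 s.
steps = [(i * 0.2, (i + 1) * 0.2) for i in range(5)]
weights = credit_weights(steps, feedback_time=1.0)
h = -1.0  # the trainer's scalar reinforcement
targets = [h * w for w in weights]  # per-step share of the reinforcement
```

Under these assumptions the most recent step and the oldest step receive essentially no credit, and the three steps in between share the reinforcement equally.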

TAMER in action: Mountain Car

Before training:

Training:

TAMER Results: Mountain Car (19 subjects)

Contributions of finished work

• new learning paradigm: explicitly defining it and showing its power
– baseline algorithms with great results
• guidance and justification of which algorithms to use
• novel credit assignment method

Publications

• W. Bradley Knox and Peter Stone. TAMER: Training an Agent Manually via Evaluative Reinforcement. In IEEE 7th International Conference on Development and Learning (ICDL-08), August 2008.

• W. Bradley Knox, Ian Fasel, and Peter Stone. Design Principles for Creating Human-Shapable Agents. In AAAI Spring 2009 Symposium on Agents that Learn from Human Teachers, March 2009.

• W. Bradley Knox and Peter Stone. Interactively Shaping Agents via Human Reinforcement: The TAMER Framework. To appear in Proceedings of The Fifth International Conference on Knowledge Capture (KCAP-09), September 2009.

Outline

• Intro to shaping
• Our approach
• Future work

Future work

1. Identify TAMER’s strengths and weaknesses

2. TAMER+R
– Human reinforcement: rich but flawed
– MDP Reward (R): sparse but flawless
– How to use the two signals together?

Future work

3. Extend TAMER to training scenarios that violate our current assumptions

4. What about the human?
– Investigate how humans train via reinforcement.

5. Other cognitive science directions...

The end
