Transcript of ECE 517: Reinforcement Learning in Artificial Intelligence, Lecture 14: Planning and Learning
Source: web.eecs.utk.edu/~itamar/courses/ECE-517/notes/lecture14.pdf

Page 1

ECE 517: Reinforcement Learning in Artificial Intelligence

Lecture 14: Planning and Learning

Dr. Itamar Arel

College of Engineering, Department of Electrical Engineering and Computer Science

The University of Tennessee
Fall 2015

October 27, 2015

Page 2

Final projects – logistics

Projects can be done in groups of up to 3 students

Details on projects will be posted soon
Students are encouraged to propose a topic

Please email me your top three choices for a project along with a preferred date for your presentation

Presentation dates:

Nov. 17, 19, 24 and Dec. 1 + additional time slot (TBD)

Format: 20 min presentation + 5 min Q&A

~5 min for background and motivation

~15 min for description of your work, results, and conclusions

Written report due: Monday, Dec. 7

Format similar to project report


Page 3

Final projects – sample topics

DQN – Playing Atari games using RL

Tetris player using RL (and NN)

Curiosity-based TD learning*

Reinforcement Learning of Local Shape in the Game of Go

AIBO learning to walk

Study of value function definitions for TD learning

Imitation learning in RL


Page 4

Outline

Introduction

Use of environment models

Integration of planning and learning methods

Page 5

Introduction

Earlier we discussed Monte Carlo and temporal-difference methods as distinct alternatives
Then showed how they can be seamlessly integrated by using eligibility traces, such as in TD(λ)
Planning methods: e.g. Dynamic Programming and heuristic search
Rely on knowledge of a model
Model – any information that helps the agent predict the way the environment will behave
Learning methods: Monte Carlo and Temporal Difference learning
Do not require a model

Our goal: Explore the extent to which the two methods can be intermixed

Page 6

The original idea

Page 7

The original idea (cont.)

Page 8

Models

Model: anything the agent can use to predict how the environment will respond to its actions
Distribution models: provide a description of all possibilities (of next states and rewards) and their probabilities
e.g. Dynamic Programming
Example – the sum of a dozen dice: produce all possible sums and their probabilities of occurring
Sample models: produce just one sample experience
In our example – produce individual sums drawn according to this probability distribution (a sketch of both appears below)

Both types of models can be used to (mimic) produce simulated experience

Often sample models are much easier to come by
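To make the dice example concrete, here is a minimal Python sketch (not from the slides; function names are illustrative) contrasting the two kinds of model: the distribution model enumerates every possible sum of a dozen dice with its probability, while the sample model returns just one sum drawn from that same distribution.

```python
import random

def distribution_model(n_dice=12):
    """Distribution model: every possible sum and its probability,
    built by convolving the single-die distribution n_dice times."""
    dist = {0: 1.0}
    for _ in range(n_dice):
        new = {}
        for total, p in dist.items():
            for face in range(1, 7):
                new[total + face] = new.get(total + face, 0.0) + p / 6.0
        dist = new
    return dist                 # e.g. dist[42] is P(sum of 12 dice = 42)

def sample_model(n_dice=12):
    """Sample model: one sum drawn according to that same distribution."""
    return sum(random.randint(1, 6) for _ in range(n_dice))
```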

Page 9

Planning

Planning: any computational process that uses a model to create or improve a policy

Planning in AI:
State-space planning (such as in RL) – search for a policy
Plan-space planning – e.g., partial-order planners, evolutionary methods

We take the following (unusual) view:
All state-space planning methods involve computing value functions, either explicitly or implicitly
They all apply backups to simulated experience

Page 10

Planning (cont.)

Classical DP methods are state-space planning methods

Heuristic search methods are state-space planning methods

Learning methods require only experience as input, and in many cases they can be applied to simulated experience just as well as to real experience

Example: a planning method based on Q-learning:

Random-Sample One-Step Tabular Q-Planning
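A minimal sketch of random-sample one-step tabular Q-planning, assuming a sample model of the form `model(s, a) -> (reward, next_state)` and a dictionary-based Q-table; the interface and parameter names are illustrative, not part of the slides.

```python
import random
from collections import defaultdict

def q_planning(model, states, actions, n_iters=10_000, alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning."""
    Q = defaultdict(float)                         # Q[(s, a)], initialized to 0
    for _ in range(n_iters):
        s = random.choice(states)                  # 1. pick a state-action pair at random
        a = random.choice(actions)
        r, s_next = model(s, a)                    # 2. ask the sample model for reward and next state
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # 3. one-step tabular Q-learning backup
    return Q
```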

Page 11

Learning, Planning, and Acting

Two uses of real experience:
Model learning: to improve the model
Direct RL: to directly improve the value function and policy
Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

Q: What are the advantages/disadvantages of each?

Page 12

Direct vs. Indirect RL

Indirect methods: make fuller use of experience – get better policy with fewer environment interactions
Direct methods: simpler, not affected by bad models

But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel

Q: Which scheme do you think applies to humans?

Page 13

The Dyna-Q Architecture (Sutton 1990)

Page 14

The Dyna-Q Algorithm

[Figure: Dyna-Q pseudocode, with its steps annotated as direct RL, model learning (update), and planning; the planning loop is the random-sample single-step tabular Q-planning method.]
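The figure's pseudocode is roughly the following; a hedged sketch of tabular Dyna-Q, assuming a deterministic environment exposed as `step(s, a) -> (next_state, reward, done)`. The names and the ε-greedy policy are illustrative; the three commented blocks correspond to direct RL, model learning, and planning.

```python
import random
from collections import defaultdict

def dyna_q(step, start_state, actions, episodes=50, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL + model learning + n_planning backups per real step."""
    Q = defaultdict(float)
    model = {}                                            # (s, a) -> (reward, next_state)
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            # epsilon-greedy action selection from the current Q-table
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = step(s, a)
            # (a) direct RL: one-step Q-learning on the real transition
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (b) model learning: remember the observed outcome (deterministic model)
            model[(s, a)] = (r, s_next)
            # (c) planning: random-sample one-step Q-planning over previously seen pairs
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[(ps_next, b)] for b in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```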

Page 15

Dyna-Q on a Simple Maze

rewards = 0 until the goal is reached, when reward = 1

Page 16

Dyna-Q Snapshots: Midway in 2nd Episode

Recall that in a planning context …
Exploration – trying actions that improve the model
Exploitation – behaving in the optimal way given the current model

Balance between the two is always a key challenge!

Page 17

Variations on the Dyna-Q agent

(Regular) Dyna-Q: soft exploration/exploitation with constant rewards
Dyna-Q+: encourages exploration of state-action pairs that have not been visited in a long time (in real interaction with the environment)
If n is the number of steps elapsed between two consecutive visits to (s,a), then the reward grows as a function of n
Dyna-AC: Actor-Critic learning rather than Q-learning

Page 18

More on Dyna-Q+

Uses an “exploration bonus”:
Keeps track of the time since each state-action pair was tried for real
An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting (a concrete form of the bonus is given below)

The agent (indirectly) “plans” how to visit long unvisited states
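For reference, following Sutton & Barto rather than anything stated explicitly on the slide: if $\tau$ denotes the number of time steps since the pair $(s,a)$ was last tried in real interaction, the planning backups in Dyna-Q+ use the bonus-augmented reward

$$ r' = r + \kappa \sqrt{\tau} $$

in place of the modeled reward $r$, for some small constant $\kappa > 0$.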

Page 19

When the Model is Wrong

The maze example was oversimplified

In reality many things could go wrong:
Environment could be stochastic

Model can be imperfect (local minimum, stochasticity or no convergence)

Partial experience could be misleading

When the model is incorrect, the planning process will compute a suboptimal policy

This is actually a learning opportunity
Discovery and correction of the modeling error

Page 20

When the Model is Wrong: Blocking Maze

The changed environment is harder

Page 21

Shortcut Maze

The changed environment is easier

Page 22

Prioritized Sweeping

In the Dyna agents presented, simulated transitions are started in uniformly chosen state-action pairs
Probably not optimal
Which states or state-action pairs should be generated during planning?
Work backwards from states whose values have just changed:
Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
When a new backup occurs, insert predecessors according to their priorities
Always perform backups from first in queue (a sketch follows below)

Moore and Atkeson, 1993; Peng and Williams, 1993
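A minimal sketch of the prioritized-sweeping planning loop described above, assuming a deterministic table model `model[(s, a)] = (reward, next_state)`, a `predecessors` map from each state to the (state, action) pairs known to lead to it, and a defaultdict-based Q-table; the names and the threshold `theta` are illustrative, not from the slides.

```python
import heapq
import itertools

def prioritized_sweeping(Q, model, predecessors, actions, s, a,
                         alpha=0.1, gamma=0.95, theta=1e-4, n_backups=5):
    """Plan outward from the pair (s, a) whose value just changed:
    queue pairs by the size of their pending update, back up the largest first,
    then push their predecessors."""
    pq, tie = [], itertools.count()                # max-heap via negated priorities
    r, s_next = model[(s, a)]
    p = abs(r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
    if p > theta:
        heapq.heappush(pq, (-p, next(tie), (s, a)))
    for _ in range(n_backups):
        if not pq:
            break
        _, _, (ps, pa) = heapq.heappop(pq)         # pair with the largest pending change
        pr, ps_next = model[(ps, pa)]
        target = pr + gamma * max(Q[(ps_next, b)] for b in actions)
        Q[(ps, pa)] += alpha * (target - Q[(ps, pa)])
        # predecessors of ps may now have large pending updates of their own
        for (qs, qa) in predecessors.get(ps, ()):
            qr, _ = model[(qs, qa)]
            qp = abs(qr + gamma * max(Q[(ps, b)] for b in actions) - Q[(qs, qa)])
            if qp > theta:
                heapq.heappush(pq, (-qp, next(tie), (qs, qa)))
    return Q
```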

Page 23

Prioritized Sweeping

Page 24

Prioritized Sweeping vs. Dyna-Q

Both use N = 5 backups per environmental interaction

Page 25

Trajectory Sampling

Trajectory sampling: perform backups along simulated trajectories

This samples from the on-policy distribution
Distribution constructed from experience (visits)

Advantages when function approximation is used

Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored:

Initial states

Reachable under optimal control

Irrelevant states
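By contrast with uniform sweeps over the state-action space, here is a hedged sketch of trajectory sampling as described above: backups are applied only along trajectories simulated from start states under the current ε-greedy policy, so planning effort follows the on-policy distribution. The `model(s, a) -> (reward, next_state, done)` interface and all names are illustrative.

```python
import random
from collections import defaultdict

def trajectory_sampling(model, start_states, actions, n_trajectories=100,
                        max_steps=50, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Perform backups along simulated trajectories rather than uniform sweeps."""
    Q = defaultdict(float)
    for _ in range(n_trajectories):
        s = random.choice(start_states)
        for _ in range(max_steps):
            # epsilon-greedy action under the current value estimates
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            r, s_next, done = model(s, a)            # simulate one step with the model
            target = r + gamma * (0.0 if done else max(Q[(s_next, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if done:
                break
            s = s_next
    return Q
```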

Page 26

Summary

Discussed close relationship between planning and learning
Important distinction between distribution models and sample models
Looked at some ways to integrate planning and learning
Synergy among planning, acting, model learning
Distribution of backups: focus of the computation
Prioritized sweeping
Trajectory sampling: backup along trajectories