
Page 1: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks

www.igi.tu-graz.ac.at/ril-toolbox

The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks

Gerhard Neumann

Master Thesis 2005, Institut für Grundlagen der Informationsverarbeitung (IGI)

Page 2: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Master Thesis:

- Reinforcement Learning Toolbox: a general software tool for Reinforcement Learning
- Benchmark tests of Reinforcement Learning algorithms on three optimal control problems:
  - Pendulum Swing-Up
  - Cart-Pole Swing-Up
  - Acrobot Swing-Up

Page 3: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


RL Toolbox: Features

- Software: C++ class system
- Open source / non-commercial
- Homepage: www.igi.tu-graz.ac.at/ril-toolbox (class reference, manual)
- Runs under Linux and Windows
- More than 40,000 lines of code, more than 250 classes

Page 4: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


RL Toolbox: Features

- Learning in discrete or continuous state spaces
- Learning in discrete or continuous action spaces
- Different kinds of learning algorithms:
  - TD(λ) learning
  - Actor-critic learning
  - Dynamic programming, model-based learning, planning methods
  - Continuous-time RL
  - Policy search algorithms
  - Residual / residual gradient algorithms
- Different function approximators:
  - RBF networks
  - Linear interpolation
  - CMAC tile coding
  - Feed-forward neural networks
- Learning from other (self-coded) controllers
- Hierarchical Reinforcement Learning

Page 5: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Structure of the Learning System

- The agent and the environment: the agent tells the environment which action to execute; the environment performs the internal state transitions.
- The environment defines the learning problem.

Page 6: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Structure of the Learning System

- Linkage to the learning algorithms: all algorithms need the tuple <s_t, a_t, s_t+1> for learning.
- The algorithms are implemented as listeners (see the sketch below).
- The algorithms adapt the agent controller to learn the optimal policy.
- The agent informs all listeners about each step and about when a new episode has started.
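
As a rough illustration of this listener structure, a minimal C++ sketch; the class and method names (State, Action, StepListener, Agent) are placeholders chosen for this example, not the Toolbox's actual class names:

    #include <vector>

    // Hypothetical types standing in for the Toolbox's state and action representations.
    struct State  { /* continuous or discrete state variables */ };
    struct Action { /* continuous or discrete action variables */ };

    // Every learning algorithm observes the steps <s_t, a_t, s_t+1> as a listener.
    class StepListener {
    public:
        virtual ~StepListener() = default;
        virtual void onNewEpisode() = 0;
        virtual void onStep(const State& st, const Action& at, const State& stNext) = 0;
    };

    // The agent executes actions in the environment and notifies all registered listeners.
    class Agent {
    public:
        void addListener(StepListener* listener) { listeners.push_back(listener); }

        void notifyNewEpisode() {
            for (StepListener* l : listeners) l->onNewEpisode();
        }
        void notifyStep(const State& st, const Action& at, const State& stNext) {
            for (StepListener* l : listeners) l->onStep(st, at, stNext);
        }
    private:
        std::vector<StepListener*> listeners;
    };

A learning algorithm then only needs to implement the listener interface to receive every <s_t, a_t, s_t+1> transition and every episode start.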

Page 7: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Reinforcement Learning:

- Agent: state space S, action space A, transition function
- The agent has to optimize the future discounted reward (see the formula below).
- Many possibilities to solve the optimization task:
  - Value-based approaches
  - Genetic search
  - Other optimization algorithms
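
For reference, the future discounted reward the agent optimizes is, in the standard formulation of [Sutt_1999],

    V^{\pi}(s_t) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \right], \qquad 0 \le \gamma < 1.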

Page 8: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Short Overview of the Algorithms:

- Value-based algorithms: calculate the goodness of each state.
- Policy-search algorithms: represent the policy directly and search in the policy parameter space.
- Hybrid methods: actor-critic learning.

Page 9: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Value-Based Algorithms

- Calculate either:
  - the action value function (Q-function), which is used directly for action selection, or
  - the value function (V-function), which needs the transition function for action selection (e.g., by doing state prediction or by using the derivative of the transition function).
- The representation of the V- or Q-function is in most cases independent of the learning algorithm, so any function approximator can be used for the value function.
- Independent V-function and Q-function interfaces
- Different algorithms: TD learning, Advantage Learning, continuous-time RL (the basic TD update is shown below)
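
For reference, the basic TD update underlying the TD-learning family is (standard form, [Sutt_1999]):

    \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t

With eligibility traces (TD(λ)), the same error is also applied to previously visited states, weighted by their traces.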

Page 10: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Policy Search / Policy Gradient Algorithms

- Directly climb the value function with a parameterized policy.
- Calculate the values of N given initial states per simulation (PEGASUS, [NG_2000]).
- Use standard optimization techniques such as gradient ascent, simulated annealing, or genetic algorithms; gradient ascent is used in the Toolbox (see the sketch below).
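
A hedged sketch of the objective behind this (the notation J(θ) is introduced only for this illustration): with policy parameters θ and the N fixed initial states s_1, ..., s_N evaluated per simulation,

    J(\theta) = \frac{1}{N} \sum_{i=1}^{N} V^{\pi_\theta}(s_i), \qquad \theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta),

where the gradient can also be estimated numerically (e.g., by finite differences) when no analytic gradient is available.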

Page 11: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Actor-Critic Methods:

- Learn the value function and an extra policy representation.
- Discrete actor-critic:
  - Stochastic policies: directly represent the action selection probabilities.
  - Similar to TD-Q learning.
- Continuous actor-critic:
  - Directly outputs the continuous control vector.
  - The policy can be represented by any function approximator.
  - Stochastic Real-Valued (SRV) algorithm ([Gulla_1992]); a simplified update is sketched below
  - Policy-Gradient Actor-Critic (PGAC) algorithm
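
As an illustration of the continuous actor-critic idea, a simplified Gaussian-exploration update in the spirit of SRV (using the TD error δ_t as the reinforcement signal; this simplified form is an assumption for this sketch, not the exact rule from [Gulla_1992]):

    a_t = \mu_\theta(s_t) + \sigma \, n_t, \quad n_t \sim N(0, 1), \qquad
    \theta \leftarrow \theta + \alpha \, \delta_t \, \frac{a_t - \mu_\theta(s_t)}{\sigma^2} \, \nabla_\theta \mu_\theta(s_t),

i.e., the actor mean is pushed toward actions that turned out better than expected.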

Page 12: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Policy-Gradient Actor-Critic Algorithm

- Learn the V-function with a standard algorithm.
- Calculate the gradient of the value within a certain time window (k steps in the past, l steps in the future).
- The gradient is then estimated from this window (an illustrative form is sketched below).
- Again, an exact model is needed.
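
As a hedged illustration only (an assumption for this transcript, not necessarily the thesis's exact expression), such a gradient can be written with the chain rule through the model s_{t+1} = f(s_t, a_t):

    \frac{\partial V(s_{t+1})}{\partial \theta} \approx \frac{\partial V}{\partial s_{t+1}} \cdot \frac{\partial f(s_t, a_t)}{\partial a_t} \cdot \frac{\partial a_t}{\partial \theta},

accumulated over the k + l steps of the time window, which also makes clear why the exact model is needed.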

Page 13: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Second Part: Benchmark Tests

- Pendulum Swing-Up: easy task
- Cart-Pole Swing-Up: medium task
- Acrobot Swing-Up: hard task

Page 14: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Benchmark Problems

- Common problems in non-linear control
- The task is to reach an unstable fixed point.
- 2 or 4 continuous state variables, 1 continuous control variable
- Reward: height of the end point at each time step (see the sketch below)
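
As a small illustration of this reward, a minimal sketch for the pendulum case, assuming θ = 0 is the upright position and a pole of length l (names and conventions are chosen for this example, not taken from the Toolbox):

    #include <cmath>

    // Height-based reward for the pendulum swing-up: at every time step the
    // reward is the height of the pendulum's end point, which is maximal when
    // the pendulum balances at the unstable upright fixed point (theta = 0).
    double pendulumReward(double theta, double poleLength) {
        return poleLength * std::cos(theta);  // +l upright, -l hanging down
    }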

Page 15: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Benchmark Tests:

- Test the algorithms on the benchmark problems with different parameter settings; compare the sensitivity to the parameter settings.
- Use different function approximators (FAs):
  - Linear FAs (e.g., RBF networks): typical local representation; curse of dimensionality (e.g., a grid of n centers per dimension needs n^d centers for a d-dimensional state space).
  - Non-linear FAs (e.g., feed-forward neural networks): no exponential dependency on the input state dimension, but harder to learn (no local representation).
- Compare the algorithms with respect to their features and requirements:
  - Is the exact transition function needed?
  - Can the algorithm produce continuous actions?
  - How much computation time is needed?
- Use hierarchical RL, directed exploration strategies, or planning methods to boost learning.

Page 16: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Benchmark Tests: Cart-Pole Task, RBF-network

- Planning boosts performance significantly, but is very time intensive (search depth 5: 120 times longer computation time).
- The PG-AC approach can compete with the standard V-learning approach, but cannot represent sharp decision boundaries.

Page 17: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Benchmark: PG-AC vs V-Planning, Feed Forward NN

- Learning with a FF-NN using the standard planning approach is almost impossible (very unstable performance).
- PG-AC with an RBF critic (time window = 7 time steps) manages to learn the task in almost 1/10 of the episodes of the standard planning approach.


Page 18: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


V-Planning

Cart-Pole Task: higher search depths could improve performance significantly, but at an exponential cost in computation time (see the sketch below).
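
To make the exponential cost concrete, a hedged sketch of depth-limited planning with a learned V-function and a model; the function names and the discrete action set are assumptions for this illustration, not the Toolbox's API. With |A| actions, a search of depth d evaluates on the order of |A|^d states:

    #include <limits>

    struct State { /* state variables */ };

    // Placeholder stand-ins for a learned value function and a model (assumptions).
    double valueFunction(const State& s) { return 0.0; /* learned V(s) */ }
    State  model(const State& s, int action) { return s; /* s' = f(s, a) */ }
    double reward(const State& s, int action) { return 0.0; /* r(s, a) */ }

    // Depth-limited lookahead: returns the best estimated discounted return.
    // The cost grows exponentially with the depth (numActions^depth leaves).
    double plan(const State& s, int depth, int numActions, double gamma) {
        if (depth == 0) return valueFunction(s);
        double best = -std::numeric_limits<double>::infinity();
        for (int a = 0; a < numActions; ++a) {
            State next = model(s, a);
            double value = reward(s, a) + gamma * plan(next, depth - 1, numActions, gamma);
            if (value > best) best = value;
        }
        return best;
    }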

Page 19: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Hierarchical RL

Cart-Pole Task: The Hierarchical Sub-Goal Approach (alpha = 0.6) outperforms the flat approach (alpha = 1.0)

Page 20: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Other general results

- The Acrobot task could not be learned with the flat architecture; the hierarchical architecture manages to swing up, but could not stay on top.
- Nearly all algorithms managed to learn the first two tasks with linear function approximation (RBF networks).
- Non-linear function approximators are very hard to learn:
  - Feed-forward NNs have very poor performance (no locality), but can be used for larger state spaces.
  - Very restrictive parameter settings are required.
- Approaches that use the transition function typically outperform the model-free approaches.
- The policy gradient algorithm (PEGASUS) only worked with linear FAs; with non-linear FAs it could not recover from local maxima.

Page 21: The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks


Literature

[Sutt_1999] R. Sutton and A. Barto: Reinforcement Learning: An Introduction. MIT Press.

[NG_2000] A. Ng and M. Jordan: PEGASUS: A policy search method for large MDPs and POMDPs.

[Doya_1999] K. Doya: Reinforcement learning in continuous time and space.

[Baxter_1999] J. Baxter: Direct gradient-based reinforcement learning: II. Gradient ascent algorithms and experiments.

[Baird_1999] L. Baird: Reinforcement Learning Through Gradient Descent. PhD thesis.

[Gulla_1992] V. Gullapalli: Reinforcement learning and its application to control.

[Coulom_2000] R. Coulom: Reinforcement Learning Using Neural Networks. PhD thesis.