
Online policy iteration for real-time control: A learning system demonstrator

Lucian Buşoniu, Fankai Zhang, Robert Babuška
Delft Center for Systems and Control, Delft University of Technology

Mekelweg 2, 2628 CD Delft, the Netherlands. Phone: +31 (0) 15 27 88573

Email: [email protected], [email protected], [email protected]

Keywords: machine learning, reinforcement learning, real-time control, demonstrator.

A system equipped with a reinforcement learning algorithm can in principle learn how to achieve its goal online, without requiring prior knowledge. The goal is specified by defining a reward function that assigns larger rewards to more desirable situations. Because data is a costly resource in practice, an important challenge is to develop data-efficient online algorithms, which achieve good performance after interacting with the system for only a short time. The online least-squares policy iteration (LSPI) algorithm was proposed in [2], with contributions from coauthors of this abstract. This algorithm extends the data-efficient offline LSPI [3] to the online case. While online LSPI showed highly encouraging learning results [2], most of them were obtained in simulation, with only some preliminary real-time control results.
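For concreteness, the core computation in LSPI is a least-squares policy-evaluation step (LSTD-Q) over a batch of transition samples, followed by a greedy policy improvement [3]. The Python sketch below illustrates this step for a linear Q-function approximator; the feature map, discount factor, and discrete action set are illustrative assumptions, not the ApproxRL toolbox implementation.

    import numpy as np

    gamma = 0.98                # discount factor (assumed value)
    actions = [-1.0, 0.0, 1.0]  # coarse torque choices for the pendulum (assumed)

    def phi(s, a):
        """Illustrative feature vector for state s = (angle, velocity), action a."""
        angle, vel = s
        base = np.array([1.0, np.cos(angle), np.sin(angle), vel, vel ** 2])
        # One block of state features per discrete action.
        feat = np.zeros(base.size * len(actions))
        i = actions.index(a)
        feat[i * base.size:(i + 1) * base.size] = base
        return feat

    def greedy(s, w):
        """Policy improvement: pick the action maximizing the linear Q-function."""
        return max(actions, key=lambda a: phi(s, a) @ w)

    def lstdq(samples, w_old):
        """Policy evaluation: solve A w = b built from (s, a, r, s') samples."""
        n = phi((0.0, 0.0), actions[0]).size
        A = 1e-6 * np.eye(n)    # small regularization keeps A well conditioned
        b = np.zeros(n)
        for s, a, r, s_next in samples:
            f = phi(s, a)
            f_next = phi(s_next, greedy(s_next, w_old))  # greedy next action
            A += np.outer(f, f - gamma * f_next)
            b += r * f
        return np.linalg.solve(A, b)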

In the present work, we provide a real-time control demonstrator for online LSPI, with easily interpretable visualization and reproducible results. To this end, a structured real-time control protocol was designed within the Matlab toolbox for approximate reinforcement learning and dynamic programming [1], and a real-time version of online LSPI was implemented using this protocol. The toolbox was then coupled with a graphical user interface (GUI), see Figure 1, left, which shows "live" the control policy and value function learned (the two lower graphs), together with the performance achieved per trial (the upper graph). Learning can be paused at any time, and the performance of the current policy can be examined by using it to control the system.
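What distinguishes online LSPI from its offline counterpart is that policy improvements are interleaved with control: the system is driven in real time by an exploratory version of the current policy, and the policy is re-optimized from the accumulated samples every few steps rather than once from a fixed batch. The sketch below, building on the lstdq and greedy helpers above, shows the general shape of such a loop; env_step, reset, and all tuning parameters are hypothetical placeholders, not the actual real-time protocol of the demonstrator.

    def online_lspi(reset, env_step, n_trials=20, trial_len=100,
                    improve_every=10, epsilon=0.3,
                    rng=np.random.default_rng(0)):
        """Hypothetical online learning loop in the spirit of online LSPI [2]."""
        w = np.zeros(phi((0.0, 0.0), actions[0]).size)
        samples = []
        for trial in range(n_trials):
            s = reset()                        # reset the system between trials
            for _ in range(trial_len):
                # epsilon-greedy exploration around the current policy
                a = rng.choice(actions) if rng.random() < epsilon else greedy(s, w)
                s_next, r = env_step(s, a)     # apply action, observe reward
                samples.append((s, a, r, s_next))
                if len(samples) % improve_every == 0:
                    w = lstdq(samples, w)      # frequent, small-batch improvement
                s = s_next
        return w

The improvement interval is the key tuning knob in such a loop: shorter intervals let new experience influence the policy sooner, at the cost of solving the least-squares problem more often.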


Figure 1: Left: A screenshot of the demonstrator GUI. Right: Video stills of the inverted pendulum being controlled from the pointing-down to the pointing-up position. Because of limited power, the pendulum must be swung right before being pushed left and stabilized pointing up. (Note the inverted pendulum is constructed from a mass attached off-center to a disk turned by a DC motor.)

The demonstrator currently interfaces with an inverted pendulum system (see Figure 1, right). Online LSPI typically learns to swing up the pendulum very quickly, in under 3 minutes (including the overhead to update the GUI and reset the system in between learning trials). This illustrates the potential of reinforcement learning in real-time applications. In the future, we plan to use this framework to also demonstrate learning control of a complex robot arm and other robotic systems in our lab.

References

[1] L. Buşoniu. ApproxRL: A Matlab toolbox for approximate reinforcement learning and dynamic programming. http://www.dcsc.tudelft.nl/~lbusoniu/repository.php.

[2] L. Buşoniu, D. Ernst, B. De Schutter, and R. Babuška. Online least-squares policy iteration for reinforcement learning control. In Proceedings 2010 American Control Conference (ACC-10), pages 486–491, Baltimore, US, 30 June – 2 July 2010.

[3] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.