
Reinforcement Learning on a Double Linked Inverted Pendulum: Towards a Human Posture Control System

R. Pienaar and A.J. van den Bogert
Department of Biomedical Engineering, The Cleveland Clinic Foundation, Cleveland, OH

INTRODUCTION

• Problem Description

  • Mechanical system consisting of rigid segments articulating with each other via hinge joints (Figure 2)
  • Each segment is 2 m long
  • Each segment has a mass of 10 kg
  • Connection to the ground is by a hinge joint
  • All motion is constrained to two dimensions

• Problem Implementation

  • Software simulation
  • Ground contact link has possible torques of (−400, 0, 400) Nm
  • Free swinging link has possible torques of (−200, 0, 200) Nm
  • A torque action vector is selected and applied to the system for 50 ms (see the sketch below)
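To make the action representation concrete, here is a small Python sketch that enumerates the 3 × 3 = 9 joint-torque action vectors and holds a selected vector for one 50 ms control interval. The simulator object and its set_torques/advance methods are hypothetical placeholders; in the actual setup the dynamics are produced by SD/FAST.

    from itertools import product

    # Possible torques (Nm) for each joint, as described above
    GROUND_TORQUES = (-400.0, 0.0, 400.0)   # hinge connecting link 0 to the ground
    FREE_TORQUES   = (-200.0, 0.0, 200.0)   # hinge between link 0 and link 1

    # The action set is the cross product: 3 x 3 = 9 torque action vectors
    ACTIONS = list(product(GROUND_TORQUES, FREE_TORQUES))

    SIM_STEP = 0.001     # hypothetical inner integration step of the simulator (s)
    N_SUBSTEPS = 50      # 50 x 1 ms inner steps = one 50 ms control interval

    def apply_action(sim, action):
        """Hold one torque action vector constant for a 50 ms control interval.

        `sim` stands in for the SD/FAST-based simulation; `set_torques` and
        `advance` are placeholders for whatever stepping interface the real
        sdbio layer exposes, not its actual API.
        """
        sim.set_torques(ground=action[0], free=action[1])
        for _ in range(N_SUBSTEPS):
            sim.advance(SIM_STEP)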

• Software architecture (Figure 3)

  • "Low level": equations of motion and mechanical behaviour are generated by SD/FAST for each pendulum system
  • "Intermediate level": the sdbio library connects the learning system with SD/FAST; the library was designed to allow the addition of muscle-type actuators and joint geometries (not used in the present system)
  • "Top level": defines the learning system and adaptive controller (an abstract sketch of this layering follows)

• Learning shows "noisy" exponential convergence

• A learning episode continues until the pendulum moves into a "terminal" state

• Learning occurs during both exploration and exploitation

• Once the pendulum has been balanced for a continuous hour of "virtual time", the simulation is considered complete

• Strictly speaking, the pole balancing problem is not Markovian

  • Discretization of the continuous state into a quantized table violates the Markov condition (see the sketch below)
  • Q learning is still able to learn an appropriate balancing strategy
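As an illustration of the quantization that breaks the Markov property, the following sketch bins the four continuous state variables (two angles, two angular velocities) into a table index. The bin counts and ranges are assumptions chosen for the example; the poster does not report the values actually used.

    import math

    # Hypothetical quantization of the continuous pendulum state into a table index.
    N_ANGLE_BINS = 12
    N_VEL_BINS = 8
    ANGLE_RANGE = (-math.pi / 3, math.pi / 3)     # rad, around the upright position
    VEL_RANGE = (-4.0, 4.0)                       # rad/s

    def _bin(value, lo, hi, n_bins):
        """Clamp a continuous value into one of n_bins equal-width bins."""
        if value <= lo:
            return 0
        if value >= hi:
            return n_bins - 1
        return int((value - lo) / (hi - lo) * n_bins)

    def discretize(theta0, omega0, theta1, omega1):
        """Map the continuous double-pendulum state to a single table index.

        Collapsing many distinct continuous states into one cell is what breaks
        the Markov property: the cell alone no longer determines the dynamics.
        """
        return (_bin(theta0, *ANGLE_RANGE, N_ANGLE_BINS),
                _bin(omega0, *VEL_RANGE, N_VEL_BINS),
                _bin(theta1, *ANGLE_RANGE, N_ANGLE_BINS),
                _bin(omega1, *VEL_RANGE, N_VEL_BINS))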

• Learning performance

  • The single link pendulum learns relatively quickly (about one hour of "virtual" time, or five minutes of real time)
  • The double link pendulum requires more time (about 2-3 days of "virtual" time, or an hour of real time)
  • Reducing the action possibilities resulted in faster learning for the double link pendulum in some cases (see Table 1)

• The controller had no a priori knowledge of the system it was to control

  • It could only observe the current state and receive a "reward" value from the environment

• Q Learning was able to balance both the single and double link pendulum systems

• Future work

  • Application to a human postural control model (Figure 5)
  • Generalize learning over higher-resolution spaces
  • Distributed vs. global control

• Symbolic Dynamics Inc. SD/FAST. http://www.symdyn.com, 1996.

• L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

• R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

• C. J. C. H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8:279-292, 1992.

• The purpose of this research is ultimately to develop an adaptive human posture control system

• The initial development "platforms" are inverted pendula

  • The human mechanical system can be simplistically modeled as an inverted pendulum
  • Single link and double link pendula are used
  • The double link pendulum is a non-linear system that is comparatively simple to describe, yet difficult to control

• An adaptive human posture control system can be used to assist patients suffering from paraplegia and other degenerative neuro-muscular disorders

• The control system is Reinforcement Learning (RL) based

  • It does not require a priori information about the plant
  • It learns the plant's system dynamics through exploration and exploitation strategies

SPECIFIC GOALS

• Develop / test a software development environment that can be used for human posture RL control

• Control a double link inverted pendulum through a specific RL algorithm called Q Learning

BACKGROUND

• Reinforcement learning

  • Learn from experience, using exploitation and exploration
  • Learn to maximize the sum of all future rewards
  • Biologically inspired
    • Behavioural learning, conditioned reflexes
    • Feedback does not fight the natural dynamics
    • Consistent with electrophysiological data of dopamine neurons during motor learning
  • Successful applications in artificial intelligence, few in robotics or biomechanics

• Q Learning (see the algorithm in Figure 1)

  • Q(s, a) = expected sum of future rewards for executing action a in system state s
  • The controller learns from experience to control the system; it implicitly learns the system dynamics
  • This results in adaptive optimal control without an explicit system model
  • The Q(s, a) function (continuous or discrete) defines the value (or worth) of taking a particular action a in a particular state s
  • Q learning assumes that the underlying system can be described as a Markov decision process
  • The Markov property implies that, for a given state, a decision on future action depends only on the current state and not on the history that led up to it (a standard formal statement is given below)
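A standard textbook statement of these two points, consistent with the Q-learning update in Figure 1 (cf. Sutton & Barto, 1998), is:

    % Action value: expected (discounted) sum of future rewards after taking
    % action a in state s and following policy \pi thereafter
    Q^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\Big[\textstyle\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\Big|\; s_t = s,\ a_t = a\Big]

    % Markov property: the next-state distribution depends only on the current
    % state and action, not on the earlier history
    \Pr(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) \;=\; \Pr(s_{t+1} \mid s_t, a_t)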


initialize(Q(s,a), random_values)
repeat
    observe(state(s))                        (sensory information)
    forall(actions(a) in state(s))
        find_action(a) with highest(Q(s,a))
    if (random_number > ε)
        execute(action(a))                   (exploit)
    else
        execute(random(action(a)))           (explore)
    observe(new_state(s'))
    receive(reward(r))
    Q(s,a) = Q(s,a) + α[r + γ max_b Q(s',b) − Q(s,a)]
until(converged(Q(s,a)))

α: learning rate
γ: discount factor
ε: exploration rate
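A minimal runnable Python version of the Figure 1 pseudo-code is sketched below. The environment interface (reset/step), the random initialization range, and the parameter values are assumptions for illustration; they are not taken from the study.

    import random
    from collections import defaultdict

    # Illustrative parameter values; the poster does not report the ones actually used.
    ALPHA = 0.1        # learning rate
    GAMMA = 0.99       # discount factor
    EPSILON = 0.1      # exploration rate

    def q_learning(env, actions, n_episodes=10_000):
        """Tabular Q-learning, following the Figure 1 pseudo-code.

        `env` is any object with reset() -> state and step(action) -> (state, reward, done);
        this interface is a hypothetical stand-in for the simulated pendulum, and the
        discrete `state` is assumed to already be a hashable quantized value.
        """
        Q = defaultdict(lambda: random.uniform(-0.01, 0.01))   # initialize Q(s,a) with random values

        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                # Epsilon-greedy selection: exploit the best known action,
                # but try a random one with probability EPSILON.
                if random.random() > EPSILON:
                    a = max(actions, key=lambda act: Q[(s, act)])
                else:
                    a = random.choice(actions)

                s_next, r, done = env.step(a)

                # Q(s,a) = Q(s,a) + alpha * [r + gamma * max_b Q(s',b) - Q(s,a)]
                best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
                Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

                s = s_next
        return Q

Here `actions` would be, for example, the nine torque vectors enumerated earlier, and `state` an index produced by the quantization sketch above.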

[Figure 3 diagram labels: ReinforcementController, sdbio, SD/FAST; system model, muscle description file, mechanical system file.]

[Diagram labels: segment states (Θ0, ω0) and (Θ1, ω1) (Figure 2); ball, hinge, and universal joint symbols; body mass.]

METHODS

Figure 1. Pseudo-code description of the Q-learning algorithm.

Θi: angle of segment i (rad)
ωi: angular velocity of segment i (rad/s)

Figure 2. Schematic overview of the single- and double-link pendulum systems.

Figure 3. Conceptual overview of the software architecture.

RESULTS

DISCUSSION

Figure 5. Future work will apply the controller to the human musculo-skeletal system.
Muscle legend: 1a Quadriceps femoris, 1b Hamstrings, 2a Vasti, 2b Glutei, 3 Sartorius, 4 Gastrocnemius, 5a Tibialis anterior, 5b Tibialis posterior.

CONCLUSION

REFERENCES

Figure 4. Controller GUI (on left) and performance graph (on right)

Table 1. Learning performance for different action vectors.

Ground link action vector (Nm)   Free link action vector (Nm)   Total learning time (days)
(−400, 0, 400)                   (−400, 0, 400)                 5.24
(−200, 0, 200)                   (−100, 0, 100)                 3.96
(−400, 0, 400)                   (−200, 0, 200)                 5.79