Using a controller based on reinforcement learning for a passive dynamic walking robot

    E. Schuitema (+), D. G. E. Hobbelen (*), P. P. Jonker (+), M. Wisse (*), J. G. D. Karssen (*)

Abstract: One of the difficulties with passive dynamic walking is the stability of walking. In our robot, small uneven or tilted parts of the floor disturb the locomotion and must be dealt with by the feedback controller of the hip actuation mechanism. This paper presents a solution to the problem in the form of a controller that is based on reinforcement learning. The control mechanism is studied using a simulation model that is based on a mechanical prototype of a passive dynamic walking robot with a conventional feedback controller. The successful walking results of our simulated walking robot with a controller based on reinforcement learning show that, in addition to retaining the prime principle of our mechanical prototype, new possibilities are quickly found, such as optimization towards various goals (e.g. maximum speed or minimal cost of transport) and adaptation to unknown situations.

Index Terms: Passive Dynamic Walking, Reinforcement Learning, Hip Controller, Biped.

    I. INTRODUCTION

TWO-LEGGED WALKING ROBOTS have a strong attractive appeal due to the resemblance with human beings. Consequently, some major research institutions and private companies have started to develop bipedal (two-legged) robots, which has led to sophisticated machines [14], [8]. To enable economically viable commercialization (e.g. for entertainment), the challenge is now to reduce the design complexity of these early successes, in search for the ideal set of characteristics: stability, simplicity, and energy efficiency.

A promising idea for the simultaneous reduction of complexity and energy consumption, while maintaining or even increasing the stability, is McGeer's concept of passive dynamic walking [9]. On a shallow slope, a system consisting of two legs with well-chosen mass properties can already show stable and sustained walking [6]. No actuators or controls are necessary as the swing leg moves at its natural frequency. Using McGeer's concept as a starting point, we realized a number of 2D and 3D mechanical prototypes with increasing complexity [24], [22], [5]. These prototypes are all powered by hip actuation and the control of these robots is extremely simple; a foot switch per leg triggers a change in desired hip angle, resulting in a swing of the opposite leg.

The authors are with the Faculties of (+) Applied Sciences, Lorentzweg 1, 2628 CJ, and (*) Mechanical Engineering, Mekelweg 2, 2628 CD, Delft University of Technology, Delft, The Netherlands. E-mail: [email protected]

Although passive dynamics combined with this simple controller already stabilizes against small disturbances, larger disturbances, such as an uneven floor, quickly lead to failures [15]. Also, the simple controller does not guarantee optimal efficiency or speed. Consequently, in this paper we elaborate on the introduction of more complex controllers based on learning. A learning controller has several advantages:

• It is model free, so no model of the biped's dynamic system nor of the environment is needed.

• It uses a fully result-driven optimization method.

• It is on-line learning, in principle possible on a real robot.

• It is adaptive, in the sense that when the robot or its environment changes without notice, the controller can adapt until performance is again maximal.

• It can optimize relatively easily towards several goals, such as: minimum cost of transport, largest average forward speed, or both.

Section II gives an overview of the concept of passive dynamic walking, our mechanical prototype (Fig. 1) and the 2D simulation model describing the dynamics of this prototype. This simulation model is used for our learning controller studies. Section III describes the principles of reinforcement learning, their application in a controller for walking and our measurements. In Section IV we conclude that a reinforcement learning based controller provides an elegant and simple control solution for stable and efficient biped walking.

Fig. 1. Meta; a 2D robot based on the principle of passive dynamic walking. This study is based on the simulation model of this prototype.

II. PASSIVE DYNAMIC WALKING

    A. Basic Principles

Since McGeer's work, the idea of passive dynamic walking has gained in popularity. The most advanced fully passive walker, constructed at Cornell University, has two legs and stable three-dimensional dynamics with knees and counter-swinging arms [6].

The purely passive walking prototypes demonstrate convincing walking patterns; however, they require a slope as well as a smooth and well-adjusted walking surface. A small disturbance (e.g. introduced by the manual launch) can still be handled, but larger disturbances quickly lead to a failure [15].

One way to power passive dynamic walkers to walk on a level floor and make them more robust to large disturbances is hip actuation. This type of actuation can supply the necessary energy for maintaining a walking motion and keep the robot from falling forward [23]. The faster the swing leg is swung forward (and then kept there), the more robust the walker is against disturbances. This creates a trade-off between energy consumption and robustness for the amount of hip actuation that is applied.

    B. Mechanical prototype

The combination of passive dynamics and hip actuation has resulted in multiple prototypes made at the Delft Biorobotics Laboratory. The most recent 2D model is Meta (Fig. 1), which is the subject of this study. This prototype is a 2D walker consisting of 7 body parts (an upper body, two upper legs, two lower legs and two feet). It has a total of 5 Degrees of Freedom, located in a hip joint, two knee joints and two ankle joints. The upper body is connected to the upper legs by a bisecting hip mechanism, which passively keeps the upper body at the intermediate angle of the legs [22].

The system is powered by a DC motor that is located at the hip. This actuator is connected to the hip joint through a compliant element, based on the concept of Series Elastic Actuation first introduced by the MIT Leg Lab [13]. Measuring the elongation of this compliant element allows the hip joint to be force controlled. The compliance ensures that the actuator's output impedance is low, which makes it possible to replicate passive dynamic motions. It also ensures that the actuator performs well in the presence of impacts. This actuator construction allows us to apply a desired torque pattern up to a maximum torque of around 10 Nm with a bandwidth of around 20 Hz. These properties should allow the reinforcement learning based controller to be implemented in practice in the near future.
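As a rough illustration of this force control principle (a sketch, not the prototype's actual implementation), the hip torque can be estimated from the measured elongation of the compliant element and regulated with a simple feedback law. The spring stiffness, gains and sample time below are assumed placeholder values.

```python
# Minimal sketch of series elastic torque control, assuming a rotational
# spring constant and simple PI feedback on the torque error.
# All numeric values are placeholders, not the prototype's real parameters.

K_SPRING = 300.0    # [Nm/rad] assumed stiffness of the compliant element
KP, KI = 5.0, 20.0  # assumed feedback gains on the torque error
DT = 0.005          # [s] assumed control period

class SeriesElasticTorqueController:
    def __init__(self):
        self.integral = 0.0

    def motor_command(self, desired_torque, spring_elongation):
        """Estimate the delivered hip torque from the elongation of the
        compliant element (Hooke's law) and drive the torque error to zero."""
        measured_torque = K_SPRING * spring_elongation
        error = desired_torque - measured_torque
        self.integral += error * DT
        return KP * error + KI * self.integral
```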

The prototype is fully autonomous, running on lithium ion polymer batteries. The control platform is a PC/104 stack with a 400 MHz processor and the controllers are implemented through the Matlab Simulink xPC Target environment. The angles of all 5 joints as well as the elongation of the actuator's compliant element are measured in real-time using incremental encoders. Next to these sensors there are two switches underneath the two feet to detect foot contact.

The knee and ankle joints are both fully passive, but the knee joint can be locked to keep the knee extended whenever the robot is standing on the corresponding leg.

The prototype can walk based on a fairly simple control algorithm. The hip angle is PD controlled towards a constant reference hip angle. When the foot switch of the current swing leg makes contact (so that this leg becomes the new stance leg), the reference angle is inverted, effectively pulling the new swing leg forward. Simultaneously, the knee latches of the new swing leg are released briefly. Then, the system simply waits for the new swing leg's foot switch to make contact, assuming that knee extension takes place before heel contact.
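A minimal sketch of this baseline controller is given below. The PD gains, the reference angle and the knee-latch callback are illustrative assumptions, not the values or interfaces of the prototype.

```python
# Sketch of the baseline hip controller: PD control of the hip angle towards
# a constant reference that is inverted at every foot-switch contact.
# Gains, reference angle and the knee-latch interface are assumptions.

KP, KD = 30.0, 1.0   # assumed PD gains
HIP_REF = 0.6        # [rad] assumed magnitude of the constant reference hip angle

class SimpleHipController:
    def __init__(self):
        self.reference = HIP_REF      # desired hip (inter-leg) angle

    def step(self, hip_angle, hip_velocity, swing_foot_contact, release_knee_latch):
        # When the swing foot touches the ground it becomes the new stance leg:
        # invert the reference (pulling the new swing leg forward) and briefly
        # release the knee latch of the new swing leg.
        if swing_foot_contact:
            self.reference = -self.reference
            release_knee_latch()
        # PD law on the hip angle error produces the hip torque command.
        return KP * (self.reference - hip_angle) - KD * hip_velocity
```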

    C. Dynamic system model

The dynamic simulation model that is used in this study was made using the Open Dynamics Engine physics simulator [16]. The model consists of the same 7 body parts as the prototype, modeled by rigid links having a mass and moment of inertia associated with them (Fig. 2). The joints are modeled by stiff spring-damper combinations. The knees are provided with a hyperextension stop and a locking mechanism which is released just after the start of the swing phase. The hip bisecting mechanism that keeps the upper body upright is modeled by introducing a kinematic chain through two added bodies with negligible mass. The floor is, provisionally, assumed to be a rigid, flat, and level surface.

Contact between the foot and the ground is also modeled by a tuned spring-damper combination which is active whenever part of the foot is below the ground. The model of the foot mainly consists of two cylinders at the back and the front of the foot. The spring-damper combination is tuned such that the qualitative motion of the model is similar to a rigid contact model made in Matlab, which has been validated using measurements from a former prototype [22]. A thorough validation of our ODE model against the prototype will be performed in the near future.
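For illustration, a penalty-based contact force of this kind can be computed as follows; the stiffness and damping values are assumptions, not the tuned values of the ODE model.

```python
# Illustrative penalty-based foot contact: a vertical spring-damper force is
# applied whenever a contact cylinder penetrates the ground plane.
# The stiffness and damping values are placeholders.

K_CONTACT = 1.0e5   # [N/m]  assumed contact stiffness
D_CONTACT = 1.0e3   # [Ns/m] assumed contact damping

def vertical_contact_force(foot_point_height, foot_point_vertical_velocity):
    """Return the vertical ground reaction force on one contact cylinder."""
    penetration = -foot_point_height             # positive when below the floor
    if penetration <= 0.0:
        return 0.0                               # no contact, no force
    force = K_CONTACT * penetration - D_CONTACT * foot_point_vertical_velocity
    return max(force, 0.0)                       # ground can only push, never pull
```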

Fig. 2. Two-dimensional 7-link model. Left: the parameter definition; right: the Degrees of Freedom (DoFs). Only the DoFs of the swing leg are given, which are identical to the DoFs of the other leg.

Fig. 3. Learning in a simulator first and downloading the result. (Block diagram: a Trainer and Learning module are first coupled to the Simulator, and the same Trainer and Learning structure is then coupled to the Robot.)

A set of physically realistic parameter values was derived from the prototype; see Table I. These values were used throughout this study.

TABLE I
DEFAULT PARAMETER VALUES FOR THE SIMULATION MODEL.

                              body    upper leg  lower leg  foot
mass             m  [kg]      8       0.7        0.7        0.1
mom. of inertia  I  [kgm2]    0.11    0.005      0.005      0.0001
length           l  [m]       0.45    0.3        0.3        0.06
vert. dist. CoM  c  [m]       0.2     0.15       0.15       0
hor. offset CoM  w  [m]       0.02    0          0          0.015
foot radius      fr [m]       -       -          -          0.02
foot hor. offset fh [m]       -       -          -          0.015

    III. REINFORCEMENT LEARNING BASED CONTROL

    A. Simulation versus on-line learning

A learning controller has several advantages over a normal PID controller: it is model free, so no model of the biped's dynamic system nor of the environment is needed; it uses result-driven optimization; it is adaptive, in the sense that when the robot or its environment changes without notice, the controller can adapt until performance is again maximal; and it can optimize relatively easily towards several goals, such as minimum cost of transport, largest average forward speed, or both. In principle, learning can be performed on-line on the real robot itself. However, the problem with learning robot control through trial and error from scratch is that the robot will fall down quite a few times, that it needs to be re-initialized in its initial state over and over again, and that its behavior needs to be monitored adequately. With a good simulator that adequately describes the real robot, an adaptive and optimizing controller can be learned without tedious human labor to coach the robot and without the robot damaging itself. Moreover, learning occurs at the computer's calculation speed, which usually means several times real time. The final result can be downloaded into the controller of the real robot, after which learning can be continued. Fig. 3 shows the learning controller that first learns on a simulator of the robot, after which its result can be downloaded to the controller of the real robot. Note that the controller is divided into the controller itself and a trainer (internal to the controller on a meta level) that controls the reward assignments.

    B. State of the art

Using reinforcement learning techniques for the control of walking bipeds is not new [3], [4], [12], [20]. Especially interesting for passive dynamics based walkers is Poincaré-based reinforcement learning as discussed in [11], [7], [10].

Other promising current work on simultaneous learning and execution is found in [18], [19]. Due to the mechanical design of their robot, it is able to acquire a robust policy for dynamic bipedal walking from scratch. The trials are implemented on the physical robot itself; a simulator for off-line pre-learning is not necessary. The robot begins walking within a minute and learning converges in about 20 minutes. It quickly and continually adapts to the terrain with every step it takes.

Our approach is based on experiences from various successful mechanical prototypes and is similar to the approach of e.g. [11]. Although we aim for a solution as found in [19], and although our simulated robot also often converges quickly to walking (see Fig. 5), until now we feel more comfortable with the approach of learning a number of controllers from random initializations and downloading the best of their results into the physical robot; see Section III-I. Not found in the literature is the optimization that we applied towards various goals, such as speed and efficiency; see Section III-H. Unlike methods based on Poincaré mapping, our method does not require periodic solutions with a one-footstep period.

    C. Reinforcement learning principles

Reinforcement learning is learning what to do, i.e. how to map situations to actions, so as to maximize a numerical reward signal [17]. In principle it performs a trial-and-error search through a state-action space to optimize the cumulative discounted sum of rewards. This may include rewards delayed over several time steps. In reinforcement learning for control problems, we are trying to find an optimal action selection method or policy $\pi^*$, which gives us the optimal action-value function defined by:

$Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$   (1)

for all $s \in S$ and $a \in A(s)$; this optimal action-value function may be shared by several optimal policies $\pi^*$. Q-learning [21] is an off-policy temporal difference (TD) control algorithm that approximates the optimal action-value function independent of the policy being followed, in our case the $\epsilon$-greedy policy. The update rule for Q-learning, which is applied after every state transition, is:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$   (2)

in which $s_t$ is our state signal, $a_t$ is the chosen action, $s_{t+1}$ is the new state after action $a_t$ has been performed, and $a'$ is an iterator to find the action that gives us the maximum $Q(s_{t+1}, a')$.

• $\alpha$ is the learning rate, constant in our tests, which defines how much of the new estimate is blended with the old estimate.

• $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$.

• $\gamma$ is the rate at which delayed rewards are discounted every time step.

During learning, actions are selected according to the $\epsilon$-greedy policy: there is a $(1 - \epsilon)$ chance of choosing the action that gives us the maximum $Q(s_{t+1}, a)$ (exploitation), and an $\epsilon$ chance of choosing a random action (exploration). When the state signal succeeds in retaining all relevant information about the current learning situation, it is said to have the Markov property.
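For concreteness, the sketch below implements update rule (2) together with $\epsilon$-greedy selection in plain Python, using a simple table instead of the tile-coding approximator described next. The $\alpha$, $\gamma$ and $\epsilon$ values are the ones reported in Section III-D; the zero initialization and the hashable-state assumption are simplifications.

```python
import random
from collections import defaultdict

# Tabular sketch of update rule (2) combined with epsilon-greedy action
# selection. A plain dictionary keeps the update readable; states are assumed
# to be hashable (e.g. discretized tuples).

ALPHA, GAMMA, EPSILON = 0.25, 1.0, 0.05   # values quoted in Section III-D
ACTIONS = list(range(60))                 # indices of the 60 discrete hip torques

Q = defaultdict(float)                    # Q[(state, action)] -> value estimate

def select_action(state):
    """Epsilon-greedy: mostly exploit the current estimate, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """One Q-learning update after a single state transition."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    td_error = reward + GAMMA * best_next - Q[(state, action)]
    Q[(state, action)] += ALPHA * td_error
```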

A standard technique often combined with Q-learning is the use of an eligibility trace. By keeping a record of the visited state-action pairs over the last few time steps, all state-action pairs in the trace are updated, with decaying importance. In this paper we used Q-learning combined with an eligibility trace in the way Watkins [21] first proposed: Watkins' Q($\lambda$).

To approximate our action-value function, CMAC tile coding was used [17], [21], [1], [2], a linear function approximator. For each input and output dimension, values within the dimension-dependent tile width are discretized to one state, creating a tiling. By constructing several randomly shifted tilings, each real-valued state-action pair falls into a number of tilings. The Q-value of a certain state-action pair is then approximated by averaging all tile values that the state-action pair falls into. Throughout this research, we used 10 tilings. The Q-values are all initialized with a random value between 0 and 1. Depending on the random number generator, the initial values can be favorable or unfavorable for finding an actuation pattern for a walking motion.
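A minimal sketch of such a tile-coding approximator is shown below; the tile widths, random offsets and indexing scheme are illustrative assumptions rather than the values used in this study.

```python
import random

# Sketch of CMAC-style tile coding: the real-valued state-action vector
# (6 state dimensions + 1 action dimension) is discretized in 10 randomly
# shifted tilings, and Q(s, a) is approximated by averaging the values of
# the tiles it falls into.

NUM_TILINGS = 10
TILE_WIDTHS = (0.1, 0.5, 0.1, 0.5, 0.1, 0.5, 0.27)   # assumed widths per dimension

rng = random.Random(0)
OFFSETS = [tuple(rng.uniform(0.0, w) for w in TILE_WIDTHS)
           for _ in range(NUM_TILINGS)]

tile_values = {}   # created lazily; initialized randomly in [0, 1] as in the paper

def active_tiles(state_action):
    """One tile index per tiling for the given 7-dimensional (state, action) vector."""
    indices = []
    for t in range(NUM_TILINGS):
        coords = tuple(int((x + o) // w)
                       for x, o, w in zip(state_action, OFFSETS[t], TILE_WIDTHS))
        indices.append((t, coords))
    return indices

def q_value(state_action):
    """Average of all tile values that the state-action pair falls into."""
    values = [tile_values.setdefault(idx, rng.uniform(0.0, 1.0))
              for idx in active_tiles(state_action)]
    return sum(values) / len(values)
```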

    D. Learning with a dynamic system simulator

The state space of the walking biped problem consists of six dimensions: angle and angular velocity of the upper stance leg, the upper swing leg, and the lower swing leg. In order not to learn the same thing twice, symmetry between the left and right leg is implemented by mirroring left and right leg state information when the stance leg changes. In the mirrored case, the chosen hip torque is also mirrored by negation. This defines the state of the robot except for the feet, thereby not fully complying with the Markov property, but coming very close when finding walking cycles. There is one output dimension: the torque to be applied to the hip joint, which was given a range between -8 and 8 Nm, divided into 60 discrete torques to be evaluated in the function approximator when choosing the best action. All dimensions (input and output) were given approximately the same number of discrete states within their range of occurrence during a walking cycle. This boils down to about 100,000 discrete states, or estimating 1,000,000 Q-values when using 10 tilings. The parameters of Q-learning were set to $\alpha = 0.25$, $\gamma = 1.0$, $\epsilon = 0.05$ and $\lambda = 0.92$, while $\epsilon$ decays with time with a discount rate of 0.9999 per second. The values for $\alpha$, $\epsilon$ and $\lambda$ are very common in Q-learning. The choice of $\gamma$ will be explained for each learning problem. A test run was performed after every 20 learning runs, measuring average hip speed, cost of transport and the number of footsteps taken.
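The sketch below illustrates the left/right mirroring and the discrete torque set; the ordering of the state variables, the leg data structure and the sign convention of the mirrored torque are assumptions made for illustration.

```python
# Sketch of the left/right symmetry trick and the discrete action set.
# The paper only specifies that the state holds angle and angular velocity of
# the upper stance leg, the upper swing leg and the lower swing leg.

NUM_TORQUES = 60
TORQUES = [-8.0 + 16.0 * i / (NUM_TORQUES - 1) for i in range(NUM_TORQUES)]  # [Nm]

def learning_state(left_leg, right_leg, left_is_stance):
    """Map raw (left, right) leg readings onto the 6-dimensional learning state
    so the controller never has to learn the mirrored gait separately.
    Each leg argument is assumed to be a dict with 'upper_angle', 'upper_vel',
    'lower_angle' and 'lower_vel' entries."""
    stance, swing = (left_leg, right_leg) if left_is_stance else (right_leg, left_leg)
    return (stance['upper_angle'], stance['upper_vel'],
            swing['upper_angle'], swing['upper_vel'],
            swing['lower_angle'], swing['lower_vel'])

def applied_torque(chosen_torque, left_is_stance):
    """The chosen hip torque is mirrored by negation when the legs swap roles;
    which stance leg corresponds to the sign flip is an assumption."""
    return chosen_torque if left_is_stance else -chosen_torque
```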

    E. Learning to walk

At the start of a learning run, the robot is placed in an initial condition which is known to lead to a stable walking motion with the PD controlled hip actuation: a left leg angle of 0.17 rad, a right leg angle of -0.5 rad (both with respect to the absolute vertical) and an angular velocity of 0.55 rad/s for all body parts. This places the robot in such a state that the first footstep can hardly be missed (see Fig. 5). The learning run ends either when the robot has fallen (ground contact of the head, knees or hip) or when it has made 16 footsteps. The discount factor $\gamma$ was set to 1.0, since time does not play a role in this learning problem. In order to keep the expected total (undiscounted) sum of rewards bounded, the maximum number of footsteps is limited to 16. To learn a stable walking motion, the following rewarding scheme was chosen: a positive reward is given when a footstep occurs, while a negative reward is given per time step when the hip moves backward. A footstep does not count if the hip angle exceeds 1.2 rad, to avoid rewarding overly stretched steps. This scheme leaves a large freedom in the actual actuation pattern, since there is not one best way to finish 16 footsteps when disturbances are small or zero. This kind of reward structure leads to a walking motion very quickly, often under 30 minutes of simulation time, sometimes under 3 minutes, depending on the random initialization of the Q-values and the amount of exploration that was set. Inherently, in all learning problems in this paper, a tradeoff will be made between robustness against disturbances and the goal set by the rewards, simply because the total expected return will be higher when the full run of 16 footsteps is finished. Although the disturbances are self-induced by exploration, an irregular gait and/or the initial condition, the states outside the optimal walking motion may equally well occur because of external disturbances.

Fig. 4. The simulated robot performing steps.
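A sketch of this rewarding scheme is given below; the paper specifies only the signs of the two rewards and the 1.2 rad limit, so the magnitudes used here are placeholders.

```python
# Sketch of the 'learning to walk' reward of Section III-E.
# The two reward magnitudes are assumed placeholders.

FOOTSTEP_REWARD = 1.0      # assumed positive reward per valid footstep
BACKWARD_PENALTY = -1.0    # assumed negative reward per time step of backward hip motion
MAX_HIP_ANGLE = 1.2        # [rad] overly stretched steps are not rewarded

def walking_reward(footstep_occurred, hip_angle, hip_forward_velocity):
    reward = 0.0
    if footstep_occurred and abs(hip_angle) <= MAX_HIP_ANGLE:
        reward += FOOTSTEP_REWARD
    if hip_forward_velocity < 0.0:     # hip moving backward during this time step
        reward += BACKWARD_PENALTY
    return reward
```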

    F. Minimizing cost of transport

To minimize the specific cost of transport (C.o.T.), defined as the amount of energy used per unit of transported system weight (m·g) per distance traveled, the following rewarding scheme was chosen: a reward of +200/m, proportional to the length of the footstep, when a footstep occurs; a reward of -8.3/J.s, proportional to the motor work done per time step; and a reward of -333/s for every time step that the hip moves backward. The first reward is largely deterministic because the angles of both upper legs define the length of the footstep, provided that both feet are touching the floor and that the length of both legs is constant. The second reward is completely deterministic, being calculated from the angular velocities of both upper legs (which are part of the state space) and the hip motor torque (chosen as action). Again no discounting is used ($\gamma = 1.0$). The optimal policy will be the one that maximizes the tradeoff between making large footsteps and spending energy. The negative reward for backward movement of the hip should not occur once a walking cycle has been found, and thus will mostly play a role at the start of the learning process. However, when walking slowly and accidentally becoming unstable on the brim of falling backward, the robot often keeps the leg with the unlocked knee straight and stiff, standing still. Fig. 5 shows the average learning curve of 50 learning episodes (different random seeds), optimizing on minimum cost of transport. The average and minimum cost of transport can be found in Table II.

Fig. 5. Learning curves: averaged number of footsteps taken over learning time in minutes, averaging 50 learning episodes for each optimization problem (curves: Efficient, Fast, Fast and efficient).

TABLE II
AVERAGE AND BEST VALUES FOR HIP SPEED AND COST OF TRANSPORT (COT) FOR ALL THREE OPTIMIZATION PROBLEMS.

                        Optimization   Optimization   Optimization on
                        on speed       on CoT         speed and CoT
Average speed [m/s]     0.554          0.526          0.540
Maximum speed [m/s]     0.582          0.549          0.566
Average CoT [-]         0.175          0.102          0.121
Minimum CoT [-]         0.120          0.078          0.090

    G. Maximizing speed

To maximize the forward speed, the following rewarding scheme was chosen: a reward of 150/m, proportional to the length of the footstep, when a footstep occurs; a reward of -56/s every time step; and a reward of -333/s every time step in which the hip moves backward. Again, no discounting is used, although time does play a role in optimizing this problem: our reward should decrease linearly with time, not exponentially as is the case with discounting. Fig. 5 shows the average learning curve of 50 learning episodes (different random seeds), optimizing on maximum forward speed of the hip. The average and maximum forward speed can be found in Table II.

H. Minimizing C.o.T. and maximizing speed

Both previous reward structures can be blended. All rewards together (proportional footstep length reward, motor work penalty, time step penalty, backward movement penalty) produce a tradeoff between minimum C.o.T. and maximum forward speed. This tradeoff will depend on the exact magnitudes of the rewards for motor work, time step and footstep length. In our test, we used the following reward scheme: a reward of 350/m, proportional to the length of the footstep, when a footstep occurs; a reward of -8.3/J.s, proportional to the motor work done every time step; a reward of -56/s every time step; and a reward of -333/s every time step in which the hip moves backward. Fig. 5 shows the average learning curve of 50 learning episodes (different random seeds), optimizing on minimum cost of transport as well as maximum average forward speed. The average and maximum forward velocity as well as the average and minimum cost of transport can be found in Table II.
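The blended scheme can be sketched as follows, using the coefficients quoted above; the earlier schemes of Sections III-F and III-G use subsets of these terms with different footstep coefficients. The simulation time step is an assumed value.

```python
# Sketch of the blended reward of Section III-H.
# DT is an assumption; the coefficients are the ones quoted in the text.

DT = 0.01   # [s] assumed simulation time step

def blended_reward(footstep_occurred, footstep_length, motor_work, hip_moved_backward):
    reward = 0.0
    if footstep_occurred:
        reward += 350.0 * footstep_length   # [1/m] reward proportional to step length
    reward -= 8.3 * motor_work              # penalty on motor work done this time step
    reward -= 56.0 * DT                     # [1/s] time penalty, favouring speed
    if hip_moved_backward:
        reward -= 333.0 * DT                # [1/s] penalty while the hip moves backward
    return reward
```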

    I. Learning curve, random initialization and ranking

In general the robot learns to walk very quickly, as Fig. 5 shows. A stable walking motion is often found within 20 minutes. In order to verify that the robot is not in a local minimum (i.e. the C.o.T. might suddenly drop at some point), the simulations need to be performed for quite some time. We performed tests with simulation times of 15 hours, showing no performance drop, indicating convergence.

Due to the random initialization, not all learning attempts are equally successful. Some random seeds never lead to a result. Optimizing on minimum cost of transport failed to converge once in our 50 test runs. The walkers even develop their own character. For example, initially some walkers might develop a preference for a short step with their left leg and a large step with their right leg. Some dribble and some tend to walk like Russian elite soldiers. Due to the built-in exploration and the optimization (e.g. towards efficiency), odd behaviors mostly disappear in the long run. A ranking on performance of all results makes it possible to select the best walkers as download candidates for the real robot.

    J. Robustness and adaptivity

To test how robust the controller is against disturbances, we set up a simulation in which, before each run of 16 footsteps, we randomly changed the height of the tiles of the floor. In the worst case each step encounters a different height. The system appears to be able to learn to cope with these disturbances in the floor up to 1.0 cm, which is slightly better than its real mechanical counterpart with a PD controller.
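As an illustration of this robustness experiment, the floor can be rebuilt from randomly perturbed tiles before every episode; the tile length, tile count and interface below are assumptions, only the 1.0 cm height range is taken from the text.

```python
import random

# Sketch of per-episode floor randomization for the robustness experiment.
# Tile length and tile count are assumed values.

TILE_LENGTH = 0.25        # [m] assumed tile length
MAX_HEIGHT_DIFF = 0.010   # [m] height disturbances of up to 1.0 cm were handled
NUM_TILES = 40            # assumed number of tiles per episode

def random_floor_profile():
    """Return a list of (tile_start_x, tile_height) pairs for one episode."""
    return [(i * TILE_LENGTH, random.uniform(-MAX_HEIGHT_DIFF, MAX_HEIGHT_DIFF))
            for i in range(NUM_TILES)]
```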

To illustrate the adaptive behavior, the robot was first placed on a level surface, learning to walk. After some time, a change in the environment was introduced by means of a small ramp. At first, performance drops. After a relatively short time, performance has recovered to its maximum again. Especially when trying to walk with a minimum cost of transport, this behavior is desirable. A learning controller that is not notified of the angle of the ramp will find a new optimum after some time, purely result driven: a desirable feature for autonomously operating robots.

    IV. CONCLUSION

Using a generic learning algorithm, stable walking motions can be found for a passive dynamic walking robot with hip actuation, by learning to control the torque applied at the hip to the upper legs. To test the learning algorithm, a two-dimensional model of a passive dynamic walking biped was used, whose mechanical counterpart is known to walk stably with a PD controller for hip actuation. A dynamic system model of the robot was used to train the learning controller. None of the body dynamics of the mechanical robot were provided to the learning algorithm itself. Using a single learning module, simple ways of optimizing the walking motion towards goals such as minimum cost of transport and maximum forward velocity were demonstrated. Convergence times proved to be acceptable even when optimizing on difficult criteria such as minimum cost of transport. By means of standard and easy to implement Q($\lambda$)-learning, problems are solved which are very difficult to tackle with conventional analysis.

We have verified the robustness of the system against disturbances, leading to the observation that height differences of 1.0 cm can be dealt with. The system can adapt itself quickly to a change in the environment such as a weak ramp.

Q-learning proves to be a very efficient search algorithm for finding the optimal path through a large state-action space using simple rewards, provided they are chosen carefully.

    REFERENCES

[1] J. S. Albus. A theory of cerebellar function. Mathematical Biosciences, 10:25-61, 1971.
[2] J. S. Albus. Brains, Behavior, and Robotics. BYTE Books, McGraw-Hill, Peterborough, NH, Nov. 1981.
[3] H. Benbrahim and J. Franklin. Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems, (22):283-302, 1997.
[4] C.-M. Chew and G. A. Pratt. Dynamic bipedal walking assisted by learning. Robotica, 20(5):477-491, 2002.
[5] S. H. Collins, A. Ruina, R. Tedrake, and M. Wisse. Efficient bipedal robots based on passive-dynamic walkers. Science, 307(5712):1082-1085, 2005.
[6] S. H. Collins, M. Wisse, and A. Ruina. A two legged kneed passive dynamic walking robot. Int. J. of Robotics Research, 20(7):607-615, July 2001.
[7] G. Endo, J. Morimoto, J. Nakanishi, and G. M. W. Cheng. An empirical exploration of a neural oscillator for biped locomotion control. In Proc. 4th IEEE Int. Conf. on Robotics and Automation, pages 3030-3035, Vol. 3, Apr. 26-May 1, 2004.
[8] Y. Kuroki, M. Fujita, T. Ishida, K. Nagasaka, and J. Yamaguchi. A small biped entertainment robot exploring attractive applications. In Proc. IEEE Int. Conf. on Robotics and Automation, pages 471-476, 2003.
[9] T. McGeer. Passive dynamic walking. Int. J. Robot. Res., 9(2):62-82, April 1990.
[10] J. Morimoto, J. Cheng, C. G. Atkeson, and G. Zeglin. A simple reinforcement learning algorithm for biped walking. In Proc. 4th IEEE Int. Conf. on Robotics and Automation, pages 3030-3035, Vol. 3, Apr. 26-May 1, 2004.
[11] J. Morimoto, J. Nakanishi, G. Endo, and G. M. W. Cheng. Acquisition of a biped walking pattern using a Poincare map. In Proc. 4th IEEE/RAS Int. Conf. on Humanoid Robots, pages 912-924, Vol. 2, Nov. 10-12, 2004.
[12] Y. Nakamura, M. Sato, and S. Ishii. Reinforcement learning for biped robot. In Proc. 2nd Int. Symp. on Adaptive Motion of Animals and Machines. www.kimura.is.uec.ac.jp/amam2003/online-proceedings.html, 2003.
[13] G. A. Pratt and M. M. Williamson. Series elastic actuators. In IEEE International Conference on Intelligent Robots and Systems, pages 399-406, 1995.
[14] Y. Sakagami, R. Watanabe, C. Aoyama, S. Matsunaga, N. Higaki, and M. Fujita. The intelligent ASIMO: System overview and integration. In Proc. Int. Conf. on Intelligent Robots and Systems, pages 2478-2483, 2002.
[15] A. L. Schwab and M. Wisse. Basin of attraction of the simplest walking model. In Proc. ASME Design Engineering Technical Conferences, Pennsylvania, 2001. ASME. Paper number DETC2001/VIB-21363.
[16] R. Smith. Open Dynamics Engine. Electronic citation, 2005.
[17] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998. ISBN 0-262-19398-1.
[18] R. Tedrake, T. W. Zhang, M.-F. Fong, and H. S. Seung. Actuating a simple 3D passive dynamic walker. In Proc. IEEE Int. Conf. on Robotics and Automation, 2004.
[19] R. Tedrake, T. W. Zhang, and H. S. Seung. Learning to walk in 20 minutes. In Proc. 14th Yale Workshop on Adaptive and Learning Systems, Yale University, New Haven, CT, 2005.
[20] E. Vaughan, E. Di Paolo, and I. Harvey. The evolution of control and adaptation in a 3D powered passive dynamic walker. In Proc. 9th Int. Conf. on the Simulation and Synthesis of Living Systems, pages 2849-2854, Boston, September 12-15, 2004. MIT Press.
[21] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK, 1989.
[22] M. Wisse, D. G. E. Hobbelen, and A. L. Schwab. Adding the upper body to passive dynamic walking robots by means of a bisecting hip mechanism. IEEE Transactions on Robotics, (submitted), 2005.
[23] M. Wisse, A. L. Schwab, R. Q. v. d. Linde, and F. C. T. v. d. Helm. How to keep from falling forward; elementary swing leg action for passive dynamic walkers. IEEE Transactions on Robotics, 21(3):393-401, 2005.
[24] M. Wisse and J. v. Frankenhuyzen. Design and construction of Mike; a 2D autonomous biped based on passive dynamic walking. In Proc. 2nd International Symposium on Adaptive Motion of Animals and Machines, 2003.
