
Knowledge-Based Systems 57 (2014) 8–27


Biologically inspired layered learning in humanoid robots

0950-7051/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.knosys.2013.12.003

* Corresponding author. Tel.: +98 311 7935620.
E-mail addresses: [email protected] (H. Shahbazi), [email protected] (K. Jamshidi), [email protected] (A.H. Monadjemi), [email protected] (H. Eslami).

Hamed Shahbazi a,*, Kamal Jamshidi b, Amir Hasan Monadjemi b, Hafez Eslami b

a Department of Mechatronic Engineering, University of Isfahan, Isfahan, Iran
b Department of Computer Engineering, University of Isfahan, Isfahan, Iran


Article history: Received 25 December 2012. Received in revised form 17 November 2013. Accepted 2 December 2013. Available online 14 December 2013.

Keywords: Neural networks; Oscillatory neurons; Layered learning; Central pattern generator; Policy gradient algorithm; Humanoid robots

A hierarchical paradigm for bipedal walking which consists of four layers of learning is introduced in this paper. In the Central Pattern Generator layer, Learner-CPGs made of coupled oscillatory neurons are trained to generate basic walking trajectories. The dynamical model of each neuron in the Learner-CPGs is discussed. We then explain how these new neurons are connected with each other to build up a new type of neural network called the Learner-CPG neural network. The training method of these neural networks is the most important contribution of this paper. The proposed two-stage learning algorithm first learns the basic frequency of the input trajectory to find a suitable initial point for the second stage. In the second stage, a mathematical path to the best unknown parameters of the neural network is designed. These neural networks are then trained with some basic trajectories, enabling them to generate new walking patterns based on a policy. A policy of walking is parameterized by policy parameters controlling the central pattern generator variables. The policy learning takes place in a middle layer called the MLR layer. High-level commands originate from a third layer called the HLDU layer. In this layer the focus is on training curvilinear walking in the NAO humanoid robot. The policy should optimize the total payoff of a walking period, which is defined as a combination of smoothness, precision and speed.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

The problem of robot locomotion is where neuroscience and robotics converge. Their common ground is the pattern generators in the spinal cord of vertebrate animals called ‘‘Central Pattern Generators’’ (CPGs). CPGs are neural circuits located in the end parts of the brain and the first parts of the spinal cord of a large number of animals, and they are responsible for generating rhythmic and periodic patterns of locomotion in different parts of their bodies [1]. Although these pattern generators use very simple sensory inputs imported from the sensory systems, they can produce high-dimensional and complex patterns for walking, swimming, jumping, turning and other types of locomotion [2]. The idea that the human nervous system has a layered mechanism for generating complex locomotion patterns from only simple stimulations is a provocative one, which this paper intends to model.

Learning in humanoid robots deals with a large number of challenges. For example, the robot should overcome noisy and nondeterministic situations and reduce unwelcome perturbations [4].

The state space is continuous and multidimensional, thus it is impossible to search systematically in that space. The fact that there is no explicit mapping between intentions and actions in a humanoid robot is a big issue that should be solved [5].

In this paper we intend to train a NAO soccer robot to perform a curvilinear walk using a hierarchical layered learning paradigm. The proposed method uses a basic CPG-based walk controller built of Learner-CPG Neural Networks (LCPGNNs). In this manner, any kind of complex behavior can be trained into a CPG neural network and used in the movement of different types of robots.

In the next section, related works in the field of humanoid robot locomotion and learning are reviewed and the advantages and disadvantages of each method are discussed. In this section we also introduce the NAO platform used in this research. Section 3 is dedicated to the proposed model of layered learning in this work. It introduces each layer of our learning platform and explains the different correlations between the layers. The CPG layer is explained in Section 4. The role of the arms and their coupling with the other joints is explained in this section, as are the feedback pathways, another important concept. The mathematical discussion of the learning algorithm used for Learner-CPG neural networks is presented in Section 5.


There, the two-stage learning algorithm which can train each oscillator neuron and its synaptic connections in an LCPGNN is explained. Section 6 introduces the MLR layer and its learning mechanism. We use reinforcement learning in this layer to find an optimal policy for the CPG layer. Policy parameterization and the payoff function are discussed there as well. Section 7 includes the experimental results; some of the implementations and results in the Webots™ simulator and MATLAB Simulink are presented. In Section 8 the highest layer, HLDU, its functions and capabilities are briefly discussed. Section 9 includes the conclusion and future works.

2. Related works

There are many approaches to bipedal skill learning [6]. As an alternative to methods using pre-recorded trajectories [7,8], ZMP-based approaches [11] or methods using heuristic control laws (e.g. Virtual Model Control (VMC) [12]), CPG-based methods have been introduced, drawing on biological perspectives. They encode rhythmic trajectories as limit cycles of nonlinear dynamical systems. Coupled-oscillator-based CPG implementations offer miscellaneous features such as the stability properties of the limit-cycle behavior (i.e. the ability to overcome perturbations and compensate their effects), the smooth online modulation of trajectories through changes in the parameters of the dynamical system, and entrainment phenomena when the CPG is coupled with a mechanical system. Examples of CPGs applied to biped locomotion are included in [13,14]. Matsubara et al. discussed a CPG-based method for biped walking combined with policy gradient learning [10].

A drawback of the CPG approach is that most of the time these CPGs have to be custom-made for a specific application. There are few techniques for constructing a CPG that generates an arbitrary input trajectory. Righetti et al. presented a generic CPG model [13]: a Programmable Central Pattern Generator (PCPG) that applies dynamical systems and differential equations to develop a training algorithm. The learner model is based on the work of [15], a Hebbian learning method in dynamical Hopf oscillators. The programmable central pattern generator was used to generate walking patterns for a HOAP-2 robot, which can then increase its speed without falling to the ground. Using this type of generic CPG, they trained the PCPGs with sample trajectories of walking patterns of the HOAP-2 robot provided by Fujitsu. Each trajectory is a teaching signal to the corresponding CPG controlling the associated joints.

Hackenberger [16] built on the programmable CPG model of [13] to use a nonlinear feedback policy for balancing a humanoid robot during a walking gait. This system consists of two modules: a polar-based PCPG which reproduces a walking trajectory, and a reinforcement learning agent responsible for modifying the walking patterns. This paradigm uses programmable central pattern generators and enables them to incorporate gyro feedback into the system definitions generating the walking trajectories. Degallier et al. [17] defined a modular generator of movements called Unit Pattern Generators (UPGs) and combined them to build CPGs for robots with many degrees of freedom. They applied this framework to interactive drumming and infant crawling in the iCub humanoid robot.

3. Layered learning architecture

The idea of layered learning in multi-agent systems was introduced in [19] by Stone. He investigated the use of machine learning within a team of 2D soccer-playing agents. Using hierarchical task decomposition, layered learning enables us to learn a specific task at each level of the hierarchy.

Here, a hierarchical learning framework for walk learning in soccer-playing humanoid robots is designed. Our model is composed of four different layers. The designs of these layers are inspired by the biological hierarchy of the human nervous system [22]. These layers are called the HLDU, MLR, CPG and LLJ layers. Fig. 1 illustrates this model. In this section we discuss the overall hierarchical model, the specific function of each layer and the relations between the layers; a schematic code sketch of this pipeline is given after the layer descriptions below.

• HLDU layer: The High Level Decision Unit (HLDU) is a model of the Cerebrum part of the brain cortex. The Cerebrum controls learned behaviors and memory in human beings and makes up about 80% of the brain mass [20]. In this region, high-level commands for different motor behaviors, vision, hearing and speaking are generated. In our model this is the place of decision making, which learns to analyze input images, process them and send commands/vision feedbacks to the next layer. The images captured by the robot cameras determine the local position of the robot with respect to the desired path, and this position generates the immediate speed and precision feedbacks. In the current study, the commands generated by this layer determine a curvilinear path on the ground.
• MLR layer: The Mesencephalic Locomotor Region (MLR) layer is responsible for making suitable policies for the lower parts of the neuronal system. This region is located in the midbrain and has descending pathways to the spinal cord via the reticular formations. It is the center for decisions related to locomotion. Different decisions from higher parts of the brain enter this region, and it produces high-level and low-level electrical stimulation in order to modify the behavior of the central pattern generators [20]. The level of stimulation can modulate the speed of locomotion or the translation of the gaits [3]. This region is modeled as a policy learner which gets parameterized inputs (path commands and vision feedbacks) from the HLDU layer and generates a policy vector for the next layer. A policy of walking is a stimulation of the CPG layer that is formulated as a policy gradient learning problem on some open parameters of the CPG layer. In our previous works [21,22] we did not consider the effects of feedback pathways in this policy vector. In this paper, however, the gyro and foot pressure feedbacks are added to the CPG layer, and consequently the policy should consider the effects of these state variables in the learning process. This value can determine the instantaneous smoothness of walking. Instantaneous smoothness, speed and precision are combined to compute the total payoff of a walking experiment.
• CPG layer: The third layer is the Central Pattern Generator (CPG) layer, which consists of Learner Central Pattern Generator Neural Networks (LCPGNNs). The CPG layer is connected to the LLJ layer and sends motion trajectories generated from a high-level decision command to the PIDs. The fundamental building block of the LCPGNNs is the oscillatory neuron (O-neuron), designed and introduced in this paper. These O-neurons have the property of learning the frequency of a periodic input signal and changing it based on some sensory input. Usually, the frequency of an O-neuron can be controlled by a specific parameter in the state representation. In this paper a learning algorithm is introduced for finding the specific parameters of O-neurons and the synaptic connection weights in an LCPGNN.
• LLJ layer: The Low Level Joint (LLJ) layer is composed of PID controllers located in the robot hardware. This layer is directly connected to the robot and controls each Degree Of Freedom (DOF). Its input is the desired positions of the joints, generated by the CPG layer, and it also receives the previous joint values as feedback. It can calculate the error and generate appropriate voltages to produce the required torques and speeds.
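As a reading aid, the following Python sketch shows how the four layers are chained in one control step. It is a minimal illustration with hypothetical class and method names, not the authors' implementation; the real layers carry the learning machinery described in Sections 4–6.

class HLDULayer:
    """High Level Decision Unit: turns camera images into a path command."""
    def command(self, image):
        # e.g. radius of the desired curvilinear path plus speed/precision feedbacks
        return {"radius": 1.5, "speed_fb": 0.0, "precision_fb": 0.0}

class MLRLayer:
    """Mesencephalic Locomotor Region: maps a command and sensor feedback to a policy vector."""
    def policy(self, command, sensors):
        return [0.0] * 10  # 10 open parameters of the CPG layer (cf. Eq. (40))

class CPGLayer:
    """Learner-CPG neural networks: generate joint trajectories under the current policy."""
    def trajectories(self, policy, t):
        return [0.0] * 10  # desired positions of the 10 walking joints

class LLJLayer:
    """Low Level Joint layer: PID controllers tracking the desired joint positions."""
    def track(self, desired_positions):
        pass  # hardware-side PID control

def control_step(hldu, mlr, cpg, llj, image, sensors, t):
    cmd = hldu.command(image)
    theta = mlr.policy(cmd, sensors)
    llj.track(cpg.trajectories(theta, t))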


Fig. 1. Hierarchical model of layered learning; the layers are called the HLDU, MLR, CPG and LLJ layers. The HLDU layer is a model of the Cerebrum part of the brain cortex. It sends high-level commands to the MLR layer. The MLR layer is responsible for making suitable policies for the lower parts of the system; it learns desired policies and stimulates the CPG layer to execute high-level commands. The CPG layer consists of Learner Central Pattern Generator Neural Networks (LCPGNNs); it is connected to the LLJ layer and sends motion trajectories generated from a high-level decision to the PIDs. The LLJ layer is composed of PID controllers located in the NAO robot hardware. This layer is directly connected to the robot and controls each joint.


4. CPG layer

In this section the building blocks of the CPG layer, the third layer of the hierarchical model, are discussed. First, the new design of oscillatory neural networks is described: the architecture, inputs, outputs and internal dynamics of each neuron are explained. Then the neurons are connected to build up the Learner-CPG Neural Network (LCPGNN). Next, the coupling scheme between joints and its role in walking is discussed. Section 4.4 is devoted to the arms' movement; the effects of moving the arms during a curvilinear walk are investigated and new equations for implementing them are introduced. The last two subsections describe the role of feedback pathways in the CPG layer. In Section 4.5 the gyro feedback is discussed. Section 4.6 is dedicated to the foot pressure feedback pathways and their parameterized equations. Some parameters defined in these last two subsections will be trained in the next layer (the MLR layer).

4.1. Learning in CPG layer

In this model the CPG layer can learn to generate suitable trajectories for the LLJ layer. This layer is built on the idea of the motor primitive approach [17]. This approach greatly reduces the complexity of the learning problem in robots with multiple DOFs. Degallier et al. have applied it in [17] for different robotic tasks such as drumming, crawling and reaching. The main idea of motor primitives is to insert a low-level neural planner between high-level planners and PID motor controllers. This planner is made from a set of trajectories with predefined dynamics. The predefined trajectories in this work are walking trajectories produced by another walking method for the NAO humanoid robot. In [23] a method based on the inverted pendulum and inverse kinematics is discussed to generate walking patterns for NAO.

4.2. O-neurons

The CPG layer is composed of Learner-CPG Neural Networks (LCPGNNs). The fundamental building block of the LCPGNN is the O-neuron, a new design of oscillatory neuron introduced here. This neuron model is inspired by other neuron models such as Perceptrons [20]. Fig. 2 illustrates the schematic diagram of the O-neuron; the internal parts of the neuron are shown in this figure. Each O-neuron has an internal state which is saved in a 3-element vector [a, b, c]. This internal state is updated over time by the neuron dynamics. The O-neuron is biased by two biases v1 and v2 which initiate the dynamics of the neuron. These biases determine the initial frequency and initial phase of the oscillation; they are trained in the learning procedure of the LCPGNN. Each O-neuron receives the Pa, Pb, Pc and Pdc inputs. Pa, Pb and Pc bring the sensory feedbacks into the LCPGNN and modify the output patterns of the network, while Pdc is used to synchronize the neuron with the other neurons in the network. Synchrony is one of the most important properties of O-neurons; a neuron can learn to generate a specific component of a complicated pattern and accelerate or decelerate itself to change the total pattern. This model contains a particular transfer function called the O-function. The O-function receives two of the three internal states, [a, c], and computes an oscillation based on these values. The c value is another output of the neuron, used for synchronization within the network.

The internal dynamics of the neuron are described by the following equations:

a(t+1) = a(t) + (T_1 (a(0)^2 − a(t)^2) a(t) + P_a) dt    (1)
b(t+1) = b(t) + (T_2 (b(0)^2 − b(t)^2) b(t) + P_b) dt    (2)
c(t+1) = c(t) + b(t) dt + P_c + P_dc    (3)
a(0) = 1,  b(0) = 1 · v_1,  c(0) = 1 · v_2    (4)

Here a(t), b(t), c(t) are the three variables of the internal dynamics of the neuron and T_1, T_2 are two timing constants which determine the sensitivity level of the neuron.


Fig. 2. Schematic diagram of the O-neuron. Each neuron receives several inputs: Pa, Pb and Pc bring the sensory feedbacks into the LCPGNN and modify the output patterns of the network, while Pdc is used to synchronize the neuron with other neurons in the network. The bias values initiate the state vector [a, b, c]; the dynamics described in Eqs. (1)–(4) change the internal states; the inputs enter the dynamics and change the output. The output is produced from [a, c]; these values enter the O-function transfer function, which produces an acceptable low-computational oscillation.


The initial values of the variables a(t), b(t), c(t) are constructed from the bias values v1 and v2. For the sake of simplicity we have set a(0) = 1 for all neurons in the network; since there are synaptic weights connected to the outputs of the O-neurons, this setting does not reduce the generality of the system. The dynamical system generates new state variables a(t+1), b(t+1), c(t+1) in the next time episode and sends a(t+1) and c(t+1) to the O-function. The O-function is the transfer function of the neuron, which computes the output pattern of the neuron based on its internal state. This function is described in Eq. (5).

O(a, c) =
  sign(c) · a · (1 − Φ(c) (C_1 − Φ(c) · C_2)),    |c| < π
 −sign(c) · a · (1 − Φ(c) (C_1 − Φ(c) · C_2)),    |c| > π    (5)

In this equation C_1 = 0.5 and C_2 = 0.0416 are two pre-computed constants which shape a desirable low-cost oscillation. The function is built from the Taylor expansion of the cosine. The sign function determines the sign of its input c. When O(·) is used with one input, a is set to 1. Φ(x) is a limiter function which bounds the input x to a linear interval and generates a periodic behavior for it. Eq. (6) presents this function.

Φ(x) =
  (x mod 2π − π/2)^2,     |x mod 2π| < π
  (x mod 2π − 3π/2)^2,    |x mod 2π| > π    (6)

This Φ-function computes the remainder of its input x after division by 2π, then shifts it and calculates its square. For each input x, Φ(x) is computed once, saved, and used twice in computing the O-function. This reduces the computational complexity of the function and allows numerous evaluations of the function during the test period, so the neural networks can be run on different controller platforms.

Each O-neuron is able to learn a specific periodic signal and regenerate it over time according to some sensory inputs. In Eqs. (1) and (2), the dynamics are designed to damp inserted perturbations. In fact, O-neurons form a stable limit cycle which attracts all close trajectories in the phase plane, making it possible for the robot to bypass rough surfaces. This damping property is very useful in robot walking applications.

In Eq. (3) the dynamics of c are linear, combined with the instantaneous inputs Pc and Pdc. The input Pc, a sensory input which can originate from the sensory feedbacks of the system, moves the phase of the pattern to a new one. Pdc is a phase difference coming from another neuron in the network; this phase difference is the means for synchronization of neurons within the network. Using this option in LCPGNNs, a change in one neuron's phase is transmitted to the other neurons.
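The following Python sketch gives one literal reading of Eqs. (1)–(6). It mirrors the paper's variable names; the timing constants T_1 = 1 and T_2 = 0.25 and the constants C_1, C_2 are taken from Section 7, while the class interface itself is only illustrative.

import math

C1, C2 = 0.5, 0.0416  # pre-computed constants of Eq. (5)

def phi(x):
    # Limiter function of Eq. (6): squared distance to pi/2 or 3*pi/2 after reduction mod 2*pi.
    m = x % (2 * math.pi)
    return (m - math.pi / 2) ** 2 if m < math.pi else (m - 3 * math.pi / 2) ** 2

def o_function(c, a=1.0):
    # Transfer function of Eq. (5): a low-cost cosine-like oscillation.
    p = phi(c)
    val = math.copysign(1.0, c) * a * (1 - p * (C1 - p * C2))
    # The paper flips the sign for |c| > pi; this condition is implemented literally here.
    return val if abs(c) < math.pi else -val

class ONeuron:
    def __init__(self, v1, v2, T1=1.0, T2=0.25):
        self.a0, self.b0 = 1.0, v1            # a(0) = 1, b(0) = v1, c(0) = v2 (Eq. (4))
        self.a, self.b, self.c = 1.0, v1, v2
        self.T1, self.T2 = T1, T2

    def step(self, dt, Pa=0.0, Pb=0.0, Pc=0.0, Pdc=0.0):
        # One update of the internal state, Eqs. (1)-(3).
        self.a += (self.T1 * (self.a0 ** 2 - self.a ** 2) * self.a + Pa) * dt
        self.b += (self.T2 * (self.b0 ** 2 - self.b ** 2) * self.b + Pb) * dt
        self.c += self.b * dt + Pc + Pdc
        return o_function(self.c, self.a), self.c  # output value and phase (for coupling)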

4.3. Network of the O-neurons

The LCPGNN is built from O-neurons which are connected and coupled in a network architecture. This neural network can have multiple layers; in Fig. 3 a network with only two layers is presented. In the first layer there are usually several O-neurons coupled with one another. Coupling is achieved by sending the c state to the next neuron in the network; this phase is added to a phase difference value and inserted into the Pdc input of the next neuron. The O-neurons can be connected to a post-synaptic neuron in the next layer. The post-synaptic neuron computes a weighted sum of the outputs of the O-neurons and sends this sum to a limiter transfer function. Each synapse is modeled by a weight link between an O-neuron and the synaptic neuron. The weights are trained in the learning procedure so that the network shapes a particular waveform. The function F is described in Eq. (7).

F(ow) =
  ow,                |ow| < M_1
  sign(ow) · M_1,    |ow| > M_1    (7)

Here ow = Σ_i w_i · O(a_i, c_i) + b_1 is the weighted sum of the inputs. The F function is purely linear in the range between −M_1 and M_1; it prevents the output from becoming larger than a predefined value, which is useful in the walking control application considered here. The b_1 term is a bias value in the synaptic neuron which can bias the output to a specific level. This constant is trained in the first step of learning to simplify the next steps.

A one-dimensional network of neurons can be coupled with another network to shape a multi-output network of O-neurons. In this application a 10-dimensional network is necessary in order to generate suitable walking patterns for all 10 joints involved in walking. Fig. 4a illustrates the connection between LCPGNNs that makes up an n-dimensional network. In this figure, each circle contains an external difference d_ij between the LCPGNNs within the network; these d_ij are trained in the last step of the learning algorithm. For the current application, four O-neurons are set in each LCPGNN; this number is sufficient for learning the walking trajectories [25]. To coordinate the several joints, for each leg and the opposite arm a chain coupling from the hip to the ankle through the first oscillator of each LCPGNN is proposed. This coupling is illustrated in Fig. 4b.
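As an illustration only, the sketch below chains the O-neuron of the previous sketch into a one-dimensional LCPGNN with the weighted sum, bias b_1 and limiter F of Eq. (7). The text only states that the previous neuron's phase plus a difference enters Pdc, so the phase-locking form and gain used here are assumptions.

import math

def limiter_F(ow, M1=1.5):
    # Eq. (7): linear inside [-M1, M1], clipped outside (M1 = 1.5 as in Section 7).
    return ow if abs(ow) < M1 else math.copysign(M1, ow)

class LCPGNN:
    def __init__(self, neurons, weights, deltas, b1=0.0, M1=1.5):
        self.neurons = neurons    # list of ONeuron (first layer)
        self.weights = weights    # synaptic weights w_i
        self.deltas = deltas      # trained phase differences d_i between consecutive neurons
        self.b1, self.M1 = b1, M1

    def step(self, dt, Pa=0.0, Pb=0.0, Pc=0.0):
        ow, prev_phase = self.b1, None
        for neuron, w, d in zip(self.neurons, self.weights, self.deltas):
            # Assumed phase-locking form of the coupling input Pdc toward (previous phase + d_i).
            Pdc = 0.0 if prev_phase is None else 0.1 * (prev_phase + d - neuron.c) * dt
            out, prev_phase = neuron.step(dt, Pa=Pa, Pb=Pb, Pc=Pc, Pdc=Pdc)
            ow += w * out
        return limiter_F(ow, self.M1)  # post-synaptic output sent to one joint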

4.4. Arm movements

To stabilize the robot and increase the performance of the curvilinear walk, the design should include the role of the arms in robot walking.


Fig. 3. One-dimensional network of O-neurons. In the first layer there are usually several O-neurons coupled with one another. Coupling is done by sending the c state to the next neuron in the network; this phase is added to a phase difference value and enters the Pdc input of the next neuron. The O-neurons can be connected to a synaptic neuron in the next layer. The synaptic neuron computes a weighted sum of the outputs of the O-neurons and sends this sum to a limiter transfer function, which bounds the upper and lower output of the synaptic neuron. Each synapse is modeled by a weight link between an O-neuron and the synaptic neuron.

Fig. 4. (a) Connection of several Learner-CPG neural networks. The circle between LCPGNNs contains an external difference d_ij between the LCPGNNs within the network; these d_ij are trained in the last step of the learning algorithm. (b) The coupling scheme of the different LCPGNNs in our application. Here, joints 1 and 6 (the hip pitch joints of the left and right legs) are synchronized together. Joints 1, 2, 3, 4 and 5 in one set and joints 6, 7, 8, 9 and 10 in the other set are synchronized sequentially. Joint 12 (the right shoulder pitch joint) is synchronized with joints 1 and 2, and joint 11 (the left shoulder pitch joint) is synchronized with joints 6 and 7.


The arm joint which moves the arm the most is the shoulder pitch joint (in NAO). This joint tends to move the arm along the direction of the body and can be used to keep the torques generated by the legs in balance. There are two different arm-control rules, for pure rotation and for pure rectilinear walks; here a linear combination of the two is used for a curvilinear walk. In the rectilinear case, the angular acceleration generated by the arms should cancel unwanted accelerations around the Y axis which may be generated by the foot swinging. In a pure rotation case, on the other hand, the angular acceleration generated by the arms must intensify the acceleration around Z, which is used for rotation [9].

For the arms, no direct training of LCPGNNs is used in the shoulder pitch joints. In other words, a shoulder pitch joint is computed indirectly as a linear combination of the other LCPGNNs. Since we intend to compensate the effects of the swing phase, the left arm should be synchronized with the right leg and the right arm with the left leg. Eqs. (8) and (9) are applied to control the arm movements:

P_LShoulder(t) = P_ra1 · R_RHip(t) + (B − P_ra1) · P_RHip(t) + b    (8)
P_RShoulder(t) = P_la1 · R_LHip(t) + (B − P_la1) · P_LHip(t) + b    (9)

In the above equations the left shoulder pitch-joint position P_LShoulder(t) at time t is computed from a linear combination of the right hip-roll joint position R_RHip and the right hip pitch-joint position P_RHip. Here B is a coupling constant and b is a bias value. The coefficients P_la1 and P_ra1 are used in the linear combination and will appear among the policy parameters of the MLR layer.
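A direct transcription of Eqs. (8) and (9) is shown below; the constants B = 2 and b = 1.75 are the values listed in Section 7, and the function name is only illustrative.

def shoulder_pitch(hip_roll, hip_pitch, p_a1, B=2.0, b=1.75):
    # P_shoulder(t) = p_a1 * R_hip(t) + (B - p_a1) * P_hip(t) + b  (Eqs. (8)-(9))
    return p_a1 * hip_roll + (B - p_a1) * hip_pitch + b

# left shoulder follows the right leg, right shoulder follows the left leg:
# P_LShoulder = shoulder_pitch(R_RHip, P_RHip, Pra1)
# P_RShoulder = shoulder_pitch(R_LHip, P_LHip, Pla1)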

4.5. Gyro feedback pathways

To sustain the stability of the robot's locomotion, a gyro feedback pathway is imported into the LCPGNNs and the appropriate coefficients are trained in the MLR layer. According to [13], the vestibular system in humans measures the tilt of the body and can stimulate contralateral muscles to keep balance. The same is applied in humanoid robots using a gyro sensor. The gyro sensor located in the chest of the NAO can compute the tilts of the body in the X, Y and Z directions (here X and Z are the axes of the soccer field). When the tilt is sharp on one side, the controller forces the robot to incline to the opposite side; this keeps the robot balanced while walking. The joints which should be controlled are the hip and ankle joints. The lateral tilts of the body in the frontal and sagittal planes are denoted by ψ_frontal and ψ_sagittal. The feedbacks for the hip and ankle are defined as follows:

F_Pankle = K_f · |ψ_frontal|    (10)
F_Phip = −K_f · |ψ_frontal|    (11)
F_Rankle = K_s · |ψ_sagittal|    (12)
F_Rhip = −K_s · |ψ_sagittal|    (13)

Since we usually need symmetric changes of the trajectories (to prevent instability and falling), the gains K_f and K_s are the same for both feedback pathways on the ankle and hip. In this manner, when the ankle touches the ground, the hip orientation is preserved properly. These feedbacks are projected onto the radius of the limit cycle of all the oscillators in the pitch and roll joints of both the hip and ankle. This projection changes only the amplitudes of the trajectories while the phases are preserved. The projection is shown in the following equation:

P_a(i,k) = Tr(F_k) =
  F_k,     |F_k| ≤ Tr_1
  Tr_1,    |F_k| > Tr_1    (14)

where P_a(i,k) is the first input of the ith neuron in the kth joint, F_k is the feedback term of joint k and k ∈ {Pankle, Rankle, Phip, Rhip}. Here Tr is a function limiting the variation of F_k on P_a(i,k).

The gains K_f and K_s are open parameters which should be trained by the MLR layer; their training is discussed in Section 6.
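The sketch below transcribes Eqs. (10)–(14); the limit Tr_1 = 0.25 follows Section 7. The clipping is written sign-preserving, which is one natural reading of Eq. (14), and the function names are illustrative.

import math

def gyro_feedbacks(psi_frontal, psi_sagittal, Kf, Ks):
    # Feedback terms for the pitch/roll joints of ankle and hip, Eqs. (10)-(13).
    return {
        "Pankle":  Kf * abs(psi_frontal),
        "Phip":   -Kf * abs(psi_frontal),
        "Rankle":  Ks * abs(psi_sagittal),
        "Rhip":   -Ks * abs(psi_sagittal),
    }

def project_feedback(Fk, Tr1=0.25):
    # Eq. (14): the feedback passes through unchanged up to Tr1, then is clipped
    # (sign-preserving clipping assumed).
    return Fk if abs(Fk) <= Tr1 else math.copysign(Tr1, Fk)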

4.6. Foot pressure feedback pathways

There is a great amount of information embedded in the foot pressure sensors which can be added to the model of bipedal walking. Foot sensor feedbacks can determine different states of locomotion and help the robot control its stability and respond to changes occurring in the walking surface. If any part of the walking surface has pits or hills, the pressure sensors tell the robot how to change its walking patterns in order to bypass them. We simulated the effects of the foot sensors and added them to this model to make it efficient. These feedbacks control the phase differences between the joints of both feet. There are four pressure sensors on the bottom of each foot of the NAO robot. In a normal period of walking, the total accumulated force signals over time on both feet are symmetric [24]; this symmetry is observed in tests conducted on the NAO robot. This fact is used to insert the effect of foot sensory feedbacks into the CPG layer. A ‘‘phase deviance’’ equation is proposed in this model to measure the deviance occurring between the two feet of the NAO robot. The deviance can be generated from different origins including a non-smooth walking surface, collision with another robot, collision with the rotating ball, and rotation of the body itself during the walk. Eq. (15) computes the phase deviance.

F_diff(t) = (1 / M_n) ∫_{t−T}^{t} Σ_{i=1}^{4} ( F_sr(i,right)(τ) − F_sr(i,left)(τ) ) dτ    (15)

In Eq. (15), M_n is a normalizing constant which scales the pressure values into appropriate phase difference values, and T is the time interval of a gait period. F_sr(i,j) is the foot pressure of the ith sensor (i can be 1, 2, 3 or 4) in the jth foot (j can be left or right).

According to Eq. (15), in each cycle the differences of the corresponding sensors on both feet are computed and summed over each time period T, i.e. the period of one complete gait, to form a foot sensor difference value F_diff. Since this value is a sum of forces, F_diff is scaled by the divisor M_n to become applicable in the LCPGNN equations (Eq. (16)). To compensate the deviance, the phase deviance term F_diff is imported into the LCPGNNs to update the speed of phase change according to this quantity: the speed of the lagging foot is increased and the speed of the leading foot is decreased until they reach their basic phase difference. Eq. (16) is defined to modify the phase equations in the LCPGNNs.

P_c(i,k) =
  0,                                                 |F_diff(t)| ≤ MAX_N
  P_diff · sign(F_diff(t)) · (|F_diff(t)| − MAX_N),   otherwise    (16)

where P_c(i,k) represents the phase input of the ith neuron of the kth joint of the robot. MAX_N is the maximum value of F_diff(t) in the normal walking state (i.e. in the most common cases) and P_diff is an open parameter which determines the speed of compensation; this parameter will be trained in the MLR layer. Using these new online phase calculation equations, one can modify the phases of the hip pitch and knee pitch joints.
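Below is an illustrative transcription of Eqs. (15) and (16), assuming the caller supplies the per-sensor force samples of one gait period as two sequences of 4-element tuples; M_n = 250 follows Section 7, while MAX_N and P_diff remain free parameters as in the text.

import math

def phase_deviance(right_forces, left_forces, Mn=250.0):
    # Eq. (15): accumulated difference of the 4 corresponding sensors over one gait period,
    # normalized by Mn. right_forces/left_forces: sequences of 4-element force samples.
    total = sum(sum(r[i] - l[i] for i in range(4))
                for r, l in zip(right_forces, left_forces))
    return total / Mn

def phase_input(Fdiff, Pdiff, MAXN):
    # Eq. (16): no phase correction inside the normal band, proportional correction outside.
    if abs(Fdiff) <= MAXN:
        return 0.0
    return Pdiff * math.copysign(1.0, Fdiff) * (abs(Fdiff) - MAXN)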

5. Training algorithm of the neural network

To train different trajectories into the LCPGNNs of the CPG layer, we enter the training trajectories into the system; after a short period of time the parameters of the LCPGNNs are fitted to the desired inputs (i.e. the total squared error converges to zero).


This is done to find the initial-state values of the O-neurons and the post-synaptic weights throughout the network. The neuron in each LCPGNN sends out its c state as a synchronization criterion to the other O-neurons. The major benefit of the proposed model is its ability to encode arbitrary periodic signals as limit cycles in a network of coupled neurons. When all the properties of such a system are realized, it becomes practical to modulate the frequency and amplitude in a smooth manner, to remain stable under perturbations, and to integrate feedback pathways that adaptively regenerate trajectories and optimize different skills in soccer robots.

In this model, learning includes two stages. In the first stage, the system is led to a suitable initial point from which it can start to attain the optimum point. In the second stage, a rapid analytical learning procedure trains all the unknown initial states of the neurons and the synaptic weights of the whole system. These two stages can be started concurrently and help each other to accelerate the learning process. The first stage is focused on finding the fundamental frequency of the input pattern, which is crucial for the second stage. The fundamental frequency is the lowest frequency among all harmonics of the signal. By finding the fundamental frequency, it becomes possible to use only the second stage in the learning procedure.

5.1. First stage: learning the fundamental frequency

One of the most important steps in learning a periodic pattern is to make the pattern as symmetric as possible; a symmetric waveform can be learned better by the neural networks. To do this, the mean value of the signal is computed over time and subtracted from the signal (i.e. the mean value is subtracted from each sample). In effect, the signal is passed through a filter which eliminates only its DC component, making it more symmetric. The extracted mean value is set as the bias b_1 of the synaptic neuron of the LCPGNN, and the filtered signal is used in the next steps.

In the first stage, the extraction of the fundamental frequency of the input pattern is necessary. To do this, a canonical dynamical system is defined based on the dynamics of the O-neurons. This system is presented in [25], where Gams et al. discussed a system (called CDS-ODS) for learning and encoding a periodic signal with no knowledge of its frequency and waveform, able to modulate an input periodic trajectory in response to external events. Their system was used to learn periodic tasks on the arms of a humanoid HOAP-2 robot for the task of drumming. The canonical dynamical system is actually a polar implementation of the PCPGs of [13]. This canonical dynamical system is used here in the LCPGNNs. Eqs. (17)–(21) define the canonical system of the LCPGNNs.

a_i(t+1) = a_i(t) + (ε · e(t) · O(−c_i(t) + π/2)) dt    (17)
b_i(t+1) = b_i(t) − η · e(t) · O(c_i(t)) dt    (18)
c_i(t+1) = c_i(t) + (b_i(t) − τ · e(t) · O(c_i(t))) dt    (19)
e(t) =
  y_in(t) − y_out(t),           t < T_L
  (y_in(t) − y_out(t)) / √t,    t > T_L    (20)
y_out(t) = F( Σ_{i=0}^{Num} a_i(t) · O(−c_i(t) + π/2) )    (21)

In these equations, [a_i(t), b_i(t), c_i(t)] is the state vector of the ith neuron in a one-dimensional network; ε, η, τ are rate constants; the O-function was introduced in Eq. (5); e(t) is an error signal which should converge to zero at the end of the procedure. F is the transfer function of the synaptic neuron. T_L is the time at which the system switches to the convergence state. Num is the number of neurons in each one-dimensional LCPGNN. y_out(t) is the output pattern of this phase and y_in(t) is the input pattern value at time t.
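A compact reading of this first stage is sketched below, reusing o_function and limiter_F from the earlier sketches. The rate constants follow Section 7 (ε = 1, η = 0.25, τ = 0.6, Num = 4, T_L = 33); the initial spread of the b_i frequency estimates is an assumption, since the text does not specify it.

import math

def first_stage(y_in, dt, eps=1.0, eta=0.25, tau=0.6, num=4, TL=33.0, M1=1.5):
    # Adaptive-frequency canonical system of Eqs. (17)-(21): locks onto the fundamental
    # frequency of the teaching signal y_in (an iterable of samples of the filtered signal).
    a = [1.0] * num
    b = [1.0 + 0.5 * i for i in range(num)]  # assumed spread of initial frequency guesses
    c = [0.0] * num
    t = dt
    for y in y_in:
        y_out = limiter_F(sum(a[i] * o_function(-c[i] + math.pi / 2) for i in range(num)), M1)
        e = (y - y_out) if t < TL else (y - y_out) / math.sqrt(t)   # Eq. (20)
        for i in range(num):
            a[i] += eps * e * o_function(-c[i] + math.pi / 2) * dt  # Eq. (17)
            b[i] -= eta * e * o_function(c[i]) * dt                 # Eq. (18)
            c[i] += (b[i] - tau * e * o_function(c[i])) * dt        # Eq. (19)
        t += dt
    return a, b, c  # initial-state vector handed to the second learning stage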

5.2. Second stage: learning the whole pattern

The second stage is the main learning process, which uses initial state values from the first stage. Starting from an appropriate initial state, the second stage can compute all the unknown parameters of the neural network. The training algorithm is a modified version of the Levenberg–Marquardt algorithm proposed in [26]. The total system is modeled as a nonlinear system which solves a Nonlinear Least Squares (NLS) problem. In such an NLS problem, the following parameter vector β collects the unknown parameters of the function g(t_i, β):

β = [w_1, v_1^1, v_2^1, w_2, v_1^2, v_2^2, ..., w_n, v_1^n, v_2^n]    (22)

In this vector, w_i is the synaptic weight between neuron i and the synaptic neuron, and v_1^i, v_2^i are the two biases used as initial states in each O-neuron. The parametric function g can be defined as:

g(t_i, β) = Σ_{∀j=3k, k=1..n} β_j · O(β_{j+1} · t_i + β_{j+2})    (23)

where n is the number of neurons and β_j is the jth parameter of the vector: for j = 3k, β_j = w_i; for j = 3k + 1, β_j = v_1^i; and for j = 3k + 2, β_j = v_2^i. To minimize er(β), the total error of the learning procedure is computed as:

er(β) = Σ_{i=1}^{m} [ y_in(t_i) − g(t_i, β) ]^2    (24)

This learning procedure is an iterative one which starts from the best initial value obtained in the previous stage. In each iteration step, the parameter vector β is replaced by a new estimate β + Δβ. To determine Δβ, g(t_i, β + Δβ) is approximated by its linearization in Eq. (25).

g(t_i, β + Δβ) ≈ g(t_i, β) + J_i Δβ    (25)

where J_i is the gradient of g with respect to β for the input t_i. Setting the derivative of the resulting squared error with respect to Δβ to zero gives:

(J^T J) Δβ = J^T [ y − g(β) ]    (26)

In the Levenberg–Marquardt algorithm, Eq. (26) is replaced by a ‘‘damped version’’:

( J^T J + λ · diag(J^T J) ) Δβ = J^T [ y − g(β) ]    (27)

The damping factor λ is tuned at each iteration. A small λ is used if the reduction of the error is rapid, which brings the algorithm closer to the Gauss–Newton method; for an insufficient decrease of the error, λ is enlarged, making the algorithm similar to the gradient descent method. The most important part is the computation of the Jacobian matrix, which in this approach is computed analytically. By defining z_j = β_{j+1} · t_i + β_{j+2}, g(t_i, β) becomes dependent on z_j and is rewritten as the following equation:

g(t_i, β) = g(z_j, β) = Σ_{∀j=3k, k=0..n} β_j · O(z_j)    (28)

The (i, j)th element of the Jacobian matrix J can be computed as:

J_ij = ∂g(t_i, β)/∂β_j = ∂g(z_j, β)/∂z_j · ∂z_j/∂β_j    (29)

The gradient of z_j with respect to β_j is:

∂z_j/∂β_j =
  t_i,    j mod 3 = 1
  1,      j mod 3 = 2    (30)

Taking the derivative of Eq. (28) and substituting it into Eq. (29), we have:


J_ij =
  O(β_{j+1} · t_i + β_{j+2}),                                                               CD1
  t_i · β_{j−1} · Ψ(β_j · t_i + β_{j+1}) · [ 4 · C_2 · (Ψ(β_j · t_i + β_{j+1}))^2 − 1 ],     CD2
  β_{j−2} · Ψ(β_{j−1} · t_i + β_j) · [ 4 · C_2 · (Ψ(β_{j−1} · t_i + β_j))^2 − 1 ],           CD3
  −t_i · β_{j−1} · Ψ(β_j · t_i + β_{j+1}) · [ 4 · C_2 · (Ψ(β_j · t_i + β_{j+1}))^2 − 1 ],    CD4
  −β_{j−2} · Ψ(β_{j−1} · t_i + β_j) · [ 4 · C_2 · (Ψ(β_{j−1} · t_i + β_j))^2 − 1 ],          CD5    (31)

In this equation C_2 = 0.0416 is a pre-computed constant and the sign function determines the sign of its input. Ψ(x) is defined in Eq. (32).

Ψ(x) =
  x mod 2π − π/2,     |x mod 2π| < π
  x mod 2π − 3π/2,    |x mod 2π| > π    (32)

The Ψ(x) function is the derivative of the Φ(x) function, up to a constant factor that has been absorbed into C_2. CD1, CD2, CD3, CD4 and CD5 are five different conditions of the derivative, described below:

• CD1: j mod 3 = 0
• CD2: j mod 3 = 1 and |β_j · t_i + β_{j+1}| < π
• CD3: j mod 3 = 2 and |β_{j−1} · t_i + β_j| < π
• CD4: j mod 3 = 1 and |β_j · t_i + β_{j+1}| > π
• CD5: j mod 3 = 2 and |β_{j−1} · t_i + β_j| > π
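For orientation, the sketch below shows the second stage as a plain Levenberg–Marquardt loop over the model g(t, β) of Eq. (23). The paper computes the Jacobian analytically via Eqs. (29)–(32); this sketch substitutes a finite-difference Jacobian for brevity, so it illustrates the update rule of Eq. (27) rather than the authors' exact algorithm. It reuses o_function from the O-neuron sketch.

import numpy as np

def g_model(t, beta):
    # Eq. (23): sum over neurons of w * O(v1 * t + v2), with beta = [w, v1, v2, w, v1, v2, ...].
    return sum(beta[j] * o_function(beta[j + 1] * t + beta[j + 2])
               for j in range(0, len(beta), 3))

def lm_fit(ts, ys, beta0, lam=1e-2, iters=100, h=1e-6):
    beta = np.asarray(beta0, dtype=float)
    for _ in range(iters):
        r = np.array([y - g_model(t, beta) for t, y in zip(ts, ys)])   # residuals y - g(beta)
        J = np.zeros((len(ts), len(beta)))
        for j in range(len(beta)):                                     # finite-difference Jacobian
            db = np.zeros_like(beta)
            db[j] = h
            J[:, j] = [(g_model(t, beta + db) - g_model(t, beta)) / h for t in ts]
        A = J.T @ J + lam * np.diag(np.diag(J.T @ J))                  # damped normal equations, Eq. (27)
        delta = np.linalg.solve(A, J.T @ r)
        new_err = sum((y - g_model(t, beta + delta)) ** 2 for t, y in zip(ts, ys))
        if new_err < np.sum(r ** 2):
            beta, lam = beta + delta, lam * 0.5                        # accept: move toward Gauss-Newton
        else:
            lam *= 2.0                                                 # reject: move toward gradient descent
    return beta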

The flowchart of the learning algorithm is shown in Fig. 5. On the left side, several parallel planes are shown, all identical; each plane is devoted to one one-dimensional LCPGNN.

Fig. 5. Flowchart of the learning algorithm of the LCPGNNs. y_in(t) is the input signal of each layer; the extracted bias value is set as the bias b_1 of the synaptic neuron of the LCPGNN. A second process box under the first one searches for a suitable initial state. After a pre-defined number of cycles, the Initial State Vector (ISV) is stored in a special place in memory and the first stage is finished. The second stage starts from an initial state calculated in the first stage and can be started contemporaneously with it. In the fourth step, the error is computed and tested to find whether it is small enough to finish the loop and set the LCPGNN parameters. When all the LCPGNNs are trained, the final step of training starts; in this final step the biases trained in the previous step are sent to the coupling process.

All the planes describe the same learning algorithm described before. The flowchart starts from the top left. The two stages of learning are separated into two process blocks. At the top of Stage 1 there is an initial process for computing the bias b_1: as mentioned earlier, this step filters the input signal y_in(t) and extracts the bias value b_1. Another process box under the first one seeks a suitable initial state. In each cycle, y_in(t) (without its bias) enters the subtraction block and y_out(t) is subtracted from it; the error e(t) is injected into the dynamics of Eqs. (17)–(21). After a pre-defined number of cycles the Initial State Vector (ISV) is stored in a special place in memory and the first stage is completed. The second stage starts from the initial state calculated in the first stage and can be started contemporaneously with it. In each cycle of the second stage, Eq. (31) computes the Jacobian matrix analytically, and the next step computes Δβ using Eq. (27). The third step calculates the new β from Δβ. In the fourth step, the error is computed using Eq. (24) and tested to find whether it is small enough to finish the loop and set the LCPGNN parameters. When all the LCPGNNs are trained, the final step of training begins: the v_2 biases trained in the previous step are sent to the coupling process, where they are used to compute the d_i and d_ij coupling constants shown in Figs. 3 and 4a. These coupling constants determine the phase differences between neurons in the network.

The first v_2 in each LCPGNN is a base for the other neurons in that LCPGNN. Eq. (33) computes the d_i values within each LCPGNN.

d_i = O( v_2^1 − v_2^i )    (33)

The first v_2 in each LCPGNN is also used to synchronize different LCPGNNs.



In the last step of Stage 2 there is a module that puts these first v_2 values in order based on the coupling scheme. Each LCPGNN sends out its first v_2 to the next LCPGNN to be used in the computation of d_ij using Eq. (34).

d_ij = O( v_2^1(i) − v_2^1(j) )    (34)

where v_2^1(k) is the first v_2 in the kth LCPGNN and O is the O-function.

6. MLR policy learning layer

The MLR part of the proposed model is an important layer which is responsible for producing suitable stimulations for the CPG layer. This section discusses the learning process in the MLR layer and its main objective, policy learning. It describes how the previous layer is modeled as a reinforcement learning problem and how some important open parameters are trained with policy gradient learning, using feedback originating from the gyro sensor values together with the speed and precision information coming from the HLDU layer.

6.1. Reinforcement learning in MLR layer

This layer uses a reinforcement learning structure to map each state of the CPG layer to an action of the robot. In this work, stimulation in the biological Mesencephalic Locomotor Region (MLR) is modeled by a policy of locomotion. To train these policies, policy gradient learning is applied. The policy gradient method used here is based on optimizing parameterized policies with respect to a long-term cumulative delayed reward by gradient descent [27]. High-level commands are formulated as a policy gradient learning problem over some free parameters of the CPG layer. These parameters are considered a policy that can be implemented by the robot and evaluated by the payoff function defined in its evaluation system.

If a policy π : (S, U) → A is defined, this function computes which action A should be performed in state S with input U. The vector S is modeled by the internal states of the CPG layer after it has learned its trajectories. U is defined as the sensory input entering the CPG layer from the soccer field; as described earlier, these are the gyro and foot pressure values. Actions A are modifications of the servo motors of the NAO humanoid robot. The policy starts from an initial value and is modified gradually by the MLR learner to reach the best value. The reward function is defined in the HLDU layer; it computes an immediate reward R for each action A. The optimal policy π* : (S, U) → A gives the best mapping between state–input pairs and actions according to a reward function R : A → ℝ. The policy π can be parameterized by a parameter vector θ and written as:

π_θ : (S × U × θ) → A    (35)

The main objective of the learning algorithm is to discover the best parameter vector θ* in order to produce the maximum value of the total payoff function J(θ), defined as:

J(θ) = E{ Σ_{k=0}^{H} γ^k · R_k }    (36)

J(θ) is the expected discounted sum of the rewards R_k from the beginning to a planning horizon H, obtained by implementing policy π_θ; γ ∈ [0, 1] is a discount factor. Policy gradient methods compute the gradient ∇J of the expected return and follow it to reach θ*. The gradient is multiplied by a learning rate α_θ ∈ ℝ+ and added to the current policy vector to generate the next policy vector:

θ_{h+1} = θ_h + α_θ · ∇J_θ    (37)

The main problem in almost all applications that use policy gradient methods is obtaining a good estimator of the policy gradient ∇J. The most popular approach in robotic applications is the finite-difference method, which we use in our layered learning paradigm to compute the policy gradient. In this method the policy parameter θ is varied by small increments Δθ_h = α_θ · [ρ_1, ρ_2, ..., ρ_k]^T, and for each variation θ_h + Δθ_h roll-outs are performed to produce an estimate ΔJ_l ≈ J(θ_h + Δθ_h) − J_ref of the expected return. Here J_ref is a reference value which can be computed in different manners; for instance, J_ref can be the forward-difference estimate J_ref = J(θ_h) or the central-difference estimate J_ref = J(θ_h − Δθ_h). In this way the policy gradient ∇J can be estimated by a regression yielding g_FD using the following equation:

∇J ≈ g_FD = (ΔΘ^T ΔΘ)^{−1} ΔΘ^T · ΔJ    (38)

where ΔΘ = [Δθ_1, Δθ_2, ..., Δθ_M] is the vector of differences in policies and ΔJ = [ΔJ_1, ΔJ_2, ..., ΔJ_M]^T is the vector of all payoff differences for each policy.

The finite-difference method starts from an initial parameter vector θ_0 and tries to compute the partial derivatives for every parameter in the vector. To do so, M new policies near the current policy vector are generated randomly and then evaluated by the robot. The new randomly generated policies are obtained by:

θ_h^k = θ_h + α_θ · [ρ_1, ρ_2, ..., ρ_k]^T    (39)

where h ∈ {0, 1, ..., H} is the update number and k ∈ {0, 1, ..., K} is the roll-out number. Each ρ_j is a perturbation chosen randomly from {−ε_k, 0, ε_k}, with ε_k a small interval defined for each policy parameter. The payoff values computed for each policy vector are used to calculate an estimate of the gradient.
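The following sketch illustrates the finite-difference estimator of Eqs. (37)–(39) under the assumption that a rollout(theta) function returning the payoff J(θ) of one walking experiment is available; K = 10 roll-outs and the learning rate α_θ = 0.05 follow Section 7.

import numpy as np

def fd_policy_gradient(theta, rollout, eps, K=10):
    # Eq. (38): regression estimate of grad J from K randomly perturbed roll-outs.
    theta = np.asarray(theta, dtype=float)
    J_ref = rollout(theta)                                          # forward-difference reference J(theta)
    dTheta = np.random.choice([-1.0, 0.0, 1.0], size=(K, len(theta))) * np.asarray(eps)
    dJ = np.array([rollout(theta + d) - J_ref for d in dTheta])
    g_fd, *_ = np.linalg.lstsq(dTheta, dJ, rcond=None)              # (dTheta^T dTheta)^-1 dTheta^T dJ
    return g_fd

def policy_update(theta, rollout, eps, alpha=0.05, K=10):
    # Eq. (37): one gradient-ascent step on the 10-element policy vector of Eq. (40).
    return np.asarray(theta, dtype=float) + alpha * fd_policy_gradient(theta, rollout, eps, K)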

6.2. Policy parameters

The policy parameter vector consists of the following 10 elements:

• Left/right hip pitch amplitude (lhP–rhP): these numbers determine the amplitudes of the neurons of the left/right hip pitch joints. The base parameters are found in the training phase of the LCPGNNs, and these numbers multiply all of the output values of the neurons in these joints.
• Left/right knee pitch amplitude (lkP–rkP): these numbers determine the amplitudes of the neurons of the left/right knee pitch joints; they are multipliers for the output values of the left/right knee pitch joint neurons.
• Left/right arm coefficients (Pla1–Pra1): these two coefficients are applied in Eqs. (8) and (9) in the linear combination of the roll and pitch joints of the right/left leg used for the left/right shoulder joints.
• Yaw pitch amplitude (yP): this value is used in both yaw-pitch joints (with two different signs) in order to control the amount of rotation of the legs. In this curvilinear walk learning application, yP is multiplied by the normalized rd value (the radius of the circular path sent by the HLDU layer) and is sent to the yaw-pitch joints.
• Gyro feedback gains (Ks–Kf): these are the K_s and K_f gains introduced in Eqs. (10)–(13) and used in the gyro feedback pathways. The best values should be found for these gains in order to stabilize the locomotion.
• Speed of phase adaptation (Pdiff): this value is applied in Eq. (16) to determine the adaptation rate of the phases when the foot sensors detect a deviation between the two feet. This open parameter of the CPG layer determines how much the robot uses its foot sensor feedbacks in order to compensate phase deviance.


The θ parameter vector can be written as:

θ = [lhP, rhP, lkP, rkP, Pla1, Pra1, yP, K_s, K_f, P_diff]    (40)

Fig. 6. The features of the curvilinear reward computation: rd is the radius of a circular path centered at (XC, YC); L(Ki) is the distance traversed in episode Ki (indicating instantaneous speed); E(Ki) is the error of the traversed path in episode Ki (indicating precision).

6.3. Reward function

To train the robot with the policy gradient method, an appropriate reward function is necessary. The quality of a walking period is measured using three fundamental parameters: speed, smoothness and precision of the motion with respect to the desired path. These three parameters are combined in a weighted linear summation by the following reward function:

R_θ(k) = k_L L_θ(k) + k_P P_θ(k) + k_S S_θ(k)    (41)

where L_θ(k) is the length of the path covered in a fixed amount of time (the kth episode), indicating the speed of the robot; P_θ(k) is the precision with respect to the desired path in this episode, measured for parameter set θ; and S_θ(k) is the walking smoothness, another measure of the quality of the learned behavior. The coefficients k_L, k_P and k_S are the weights of these three values in the reward. The total reward is obtained by summing the rewards over all episodes.

J(θ) = Σ_{∀k} γ^k · R_θ(k) + ( U( Σ_{∀k} L_θ(k) − T_re ) − 1 ) · V_punish    (42)

Here the discount factor γ = 1 is used in order to give the same importance to all episodes. U(v) is the unit step function: 1 for values of v greater than 0 and 0 for values smaller than 0. The term Σ_{∀k} γ^k · R_θ(k) − T_re is sent as the input v to U(v) to model the punishment of falling. When the robot falls, Σ_{∀k} γ^k · R_θ(k) is smaller than T_re, so U(·) is 0 and the punishment V_punish is subtracted from the original payoff value. Considering this punishment, the payoff can be computed accurately, and the robot avoids policies which cause it to fall down.

L_θ(k), P_θ(k) and S_θ(k) are computed by Eqs. (43)–(45):

L_θ(k) = (1 / L_N) · √( (X(k) − X(k−1))^2 + (Z(k) − Z(k−1))^2 )    (43)
P_θ(k) = 1 − √( (X(k) − X_desired(k−1))^2 + (Z(k) − Z_desired(k−1))^2 ) / E_N    (44)
S_θ(k) = 1 − √( (∂G_x/∂k)^2 + (∂G_y/∂k)^2 + (∂G_z/∂k)^2 ) / H_N    (45)

In these equations L_θ(k) is the distance covered by the robot during the episode. P_θ(k) is the complement of the Euclidean distance between NAO's position and the circumference of the circle-shaped desired curve with radius rd; this computation is illustrated in Fig. 6. The rd is the HLDU command which is sent to the MLR layer in this curvilinear walking application, and the center of the circular path is assumed to be located at (XC, YC). E_N, L_N and H_N are normalizing constants. The normalized error is subtracted from 1 to give the precision P_θ(k). The value S_θ(k) is defined here as the smoothness for the parameter set θ. To compute this smoothness, the instantaneous roughness is measured during the episode, normalized by the constant H_N and subtracted from 1. The instantaneous roughness is the absolute value of the derivative of the gyro sensor readings in the X, Y and Z planes (G_L is the gyro sensor value in plane L, L = X, Y, Z).
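To make Eqs. (41)–(45) concrete, the following sketch computes the three episode terms, the per-episode reward and the total payoff of Eq. (42). All function and argument names (positions, desired-path points, gyro derivatives) are placeholders chosen for this example; the default constants are the values reported in the next section (kL = 1, kP = 1.5, kS = 1, LN = 2, EN = 1000, HN = 500, Tre = 0.25, Vpunish = 2).

import math

def speed_term(x, z, x_prev, z_prev, LN=2.0):
    """L_theta(k), Eq. (43): normalized distance covered in the episode."""
    return math.hypot(x - x_prev, z - z_prev) / LN

def precision_term(x, z, x_des, z_des, EN=1000.0):
    """P_theta(k), Eq. (44): 1 minus the normalized deviation from the desired path."""
    return 1.0 - math.hypot(x - x_des, z - z_des) / EN

def smoothness_term(dGx, dGy, dGz, HN=500.0):
    """S_theta(k), Eq. (45): 1 minus the normalized gyro roughness."""
    return 1.0 - math.sqrt(dGx ** 2 + dGy ** 2 + dGz ** 2) / HN

def episode_reward(L, P, S, kL=1.0, kP=1.5, kS=1.0):
    """R_theta(k), Eq. (41): weighted sum of speed, precision and smoothness."""
    return kL * L + kP * P + kS * S

def total_payoff(rewards, lengths, gamma=1.0, Tre=0.25, Vpunish=2.0):
    """J(theta), Eq. (42): discounted sum of rewards plus the falling punishment."""
    J = sum((gamma ** k) * r for k, r in enumerate(rewards))
    # Unit step U(.): 1 when the accumulated traverse exceeds Tre, 0 otherwise;
    # (U - 1) * Vpunish subtracts the punishment only when the robot has fallen.
    U = 1.0 if sum(lengths) - Tre > 0 else 0.0
    return J + (U - 1.0) * Vpunish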

7. Experimental results and analysis

In this section, the implementation and experimental results of the biologically inspired layered learning paradigm are presented. The first-stage experiments (the training phase of the CPG layer) were carried out using the Simulink toolbox in MATLAB. In the second stage (the online policy gradient based training of the robot in the MLR layer), an integrated simulation of the NAO robot in Webots Robotstadium was used [18]; as far as the simulation is concerned, the model is very similar to the real robot. Several constant values are used in the CPG layer and MLR layer experiments: γ = 1, ε = 1, τ = 0.6, C1 = 0.5, C2 = 0.0416, M1 = 1.5, g = 0.25, TL = 33, Num = 4, T1 = 1, T2 = 0.25, Mn = 250, B = 2, b = 1.75, K = 10, H = 50, kL = 1, kP = 1.5, kS = 1, Tre = 0.25, Vpunish = 2, LN = 2, EN = 1000, HN = 500, ah = 0.05. Fig. 7 shows an example of a trained LCPGNN in Simulink, programmed with sample trajectories. This example clarifies the design of a LCPGNN, which consists of several O-neurons; the internal design of an O-neuron is presented at the top of the figure. The system is trained with a sample input, and the biases and synaptic weights are set using constant blocks. It can be seen that the first neuron sends out its phase as a synchronization criterion to the other neurons. These neurons have learned the most important harmonics of the input trajectory.

7.1. Training LCPGNNs in the first stage

The first stage of the learning algorithm searches for a suitable starting point in the state space which can appropriately initialize the second stage. Finding this starting point is very important because the second stage is very sensitive to its initial point: it can be trapped in local optima if the first stage does not provide a good starting point, and in many cases the second stage never converges to the input signal without executing the first stage. The most important element that should be revealed by the first stage is the fundamental frequency of the input pattern. Once this element is discovered, the method in the second stage can drive the unknown parameters of the neural network in the direction of the gradient and reach the global optimum of the fitness (least squares error) function. The importance of the fundamental frequency originates from the periodic behavior of the input patterns.
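The canonical dynamical system used in the first stage is defined in Section 5. Purely to illustrate the idea of recovering a fundamental frequency from a periodic teaching signal, the sketch below integrates an adaptive-frequency Hopf oscillator in the spirit of [15]; the specific equations and constants here are assumptions made for the example, not the authors' exact formulation.

import math

def learn_fundamental_frequency(teaching_signal, dt=0.001, eps=0.9,
                                gamma=8.0, mu=1.0, omega0=6.0):
    """Adaptive-frequency Hopf oscillator (in the spirit of [15]): omega
    converges to the fundamental frequency of the periodic input F(t)."""
    x, y, omega = 1.0, 0.0, omega0
    for F in teaching_signal:                  # F(t) sampled every dt seconds
        r = math.sqrt(x * x + y * y) or 1e-9
        dx = gamma * (mu - r * r) * x - omega * y + eps * F
        dy = gamma * (mu - r * r) * y + omega * x
        domega = -eps * F * y / r              # frequency adaptation rule
        x, y, omega = x + dx * dt, y + dy * dt, omega + domega * dt
    return omega

# Example: a 0.5 Hz sinusoid; the returned omega should approach 2*pi*0.5 rad/s.
signal = [math.sin(2 * math.pi * 0.5 * k * 0.001) for k in range(60000)]
print(learn_fundamental_frequency(signal))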


Fig. 7. An example of a trained LCPGNN in Simulink, programmed with sample training trajectories. The design of the O-neuron is presented at the top of the figure.


The canonical dynamical system of the first stage can efficiently extract this fundamental frequency of the walking patterns. Fig. 8 shows an example of a first-stage training run. The input trajectory is the left-hip pitch-joint values (Yin(t), multiplied by 10 to be trained better in the LCPGNN) and the other curve is what the corresponding CPG generates (Yout(t)). It can be seen that this training is fast and efficient: the output converges to a signal with the fundamental frequency of the input pattern. The lower part of the figure shows the evolution of the basic frequency. We have set the convergence time of the system to t = 33 (TL = 33 in Eq. (20)); it can be seen that after this time the system rapidly converges to a specific value.

Fig. 9 shows the Yin and Yout trajectories for all 10 DOFs of the CPG layer (the main DOFs located in both feet of the robot and used in the walking motion) generated in the first stage. The input trajectories are multiplied by 10 to be trained better in the LCPGNNs. Fig. 10 illustrates the evolution of the fundamental frequencies for all the DOFs; the rapid and efficient training process can be noticed here as well.

7.2. Training LCPGNNs in the second stage

The second stage of the learning algorithm is a specific search in the parameter space. All unknown parameters of the LCPGNNs are integrated in a single vector and a cost function is defined over this vector. A key feature of the proposed algorithm is that the gradient of this function can be computed analytically; in many other similar methods the gradient must be computed numerically, which imposes a large computational overhead on the learning process. Essentially, the Levenberg–Marquardt algorithm is a greedy search which follows the maximum gradient at each point to reach the optimum in the parameter space, so the algorithm should begin from an initial point containing the fundamental frequency found in the first stage. In this manner the complete waveform of the input signal can be learned by the network. Fig. 12 illustrates the second-stage learning of the left hip pitch trajectory discussed in the first stage. The figure contains 6 snapshots of the input/output signals which show the evolution of the system's unknown parameters over time. The time difference between the snapshots is approximately 2 cycles (each cycle is computed by the algorithm in the order of O(n)), so it can be concluded that the total learning is very fast and has a linear time complexity.
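The concrete residuals and their analytic Jacobian for the LCPGNN error are derived in Section 5; as a generic reference for the iteration itself, the NumPy sketch below shows a standard Levenberg–Marquardt loop of the kind described here, with the damping parameter lam playing the role of the switching parameter λ plotted later in Fig. 13. The residuals and jacobian callables are placeholders.

import numpy as np

def levenberg_marquardt(residuals, jacobian, b0, iters=50, lam=1.0):
    """Generic Levenberg-Marquardt loop over the parameter vector b.

    residuals(b) -> (n,) error between the network output and the teaching signal
    jacobian(b)  -> (n, m) analytic Jacobian of the residuals w.r.t. b
    """
    b = np.asarray(b0, dtype=float)
    cost = float(np.sum(residuals(b) ** 2))
    for _ in range(iters):
        J = jacobian(b)
        r = residuals(b)
        # Damped normal equations: (J^T J + lam I) delta = -J^T r
        delta = np.linalg.solve(J.T @ J + lam * np.eye(b.size), -J.T @ r)
        b_new = b + delta
        cost_new = float(np.sum(residuals(b_new) ** 2))
        if cost_new < cost:
            b, cost, lam = b_new, cost_new, lam * 0.5   # accept: behave like Gauss-Newton
        else:
            lam *= 2.0                                  # reject: behave like gradient descent
    return b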

To illustrate the efficiency of the proposed learning algorithm (LCPGNNs), the method is compared with 3 other methods: the learning method used in [13], called XY-PCPGs; a similar optimized method in [16], called R-PCPGs; and the 2-stage learning algorithm of [25], called the CDS-ODS method. These methods are compared in Table 1. In this comparison 3 different efficiency criteria are defined. The first one is the Average Convergence Time, which indicates the average time (proportional to the number of training epochs) required for the convergence of the algorithm (i.e. reaching an acceptable training error).


Fig. 8. Upper plot: first stage of the algorithm, with the left hip pitch trajectory fed to the canonical dynamical system in the CPG layer. Lower plot: evolution of the basic frequency over time.

Fig. 9. Training of all the patterns in the first stage.


The second criterion is the Average Training Error, which is the average error during the training process. The third one is the Average Testing Error, which presents the average error of the system after all the parameters have been found and fixed.

As illustrated in Table 1, the proposed method has the shortest convergence time among the compared methods. This means that our algorithm obtains the necessary parameters faster than the others. Despite the faster convergence, the average training error of the LCPGNNs is higher than that of the other methods. Since the LCPGNNs do not have the fundamental frequency at the beginning of the training process, they make very large errors during the training stage, but once they find the fundamental frequency and suitable initial points they rapidly converge to the best final parameters. In other words, this method makes many wrong guesses in the parameter space but quickly finds the right answer. This fact is illustrated in Fig. 11.


Fig. 10. Evolution of the fundamental frequencies in 10 DOFs.

Table 1
Comparison of the learning behavior of the 4 different methods. In this comparison 3 different efficiency criteria are defined. Average Convergence Time indicates the average time (proportional to the number of training epochs) required for the convergence of the algorithm. Average Training Error is the average error during the training process. Average Testing Error presents the average error of the system after all the parameters are found and fixed in the system.

Method     Average convergence time    Average training error    Average testing error
XY-PCPGs   5.5                         27.8048                   29148.41
R-PCPGs    7.5                         66.38                     21379.10
ODS-CDS    14.5                        82.97                     69.54
LCPGNNs    4.1                         1111.60                   19.9867


The average testing error of the LCPGNNs is lower than that of the other methods. This indicates that the method finds its required parameters better than the others and is able to regenerate the teaching trajectories very well.

There are 12 sub-parameters in the main parameter vector of an LCPGNN learning the walking patterns. These parameters and their evolution over time are shown in Fig. 13, together with a plot of the evolution of λ (the switching parameter of Levenberg–Marquardt). Fig. 14 presents all trained trajectories of the left leg of the NAO robot; the values of the 12 parameters for these trajectories are given in Table 2.

7.3. Training the best policy in MLR layer

The second phase of learning consists of learning the best policy for the HLDU layer commands. The commands consist of the features of a curvilinear path on the soccer field ((XC, YC) is the center and rd is the radius of the path). In the experiments conducted here, the constant command (XC = 1, YC = 3, rd = 3) is used; this is shown in Fig. 15. The HLDU layer sends this command to the MLR layer, and the MLR layer tries to learn a suitable policy based on experiments with the simulated robot. The Webots simulation software is used for learning in this layer.

To train the robot, the MLR layer should begin from an initial point in the policy space. Since many of the parameters in θ are multipliers, they are initialized to 1, so in the first learning experiment the initial policy vector is θ = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]. The initial state S0 can be computed from the results of the learning in the CPG layer.

Fig. 16 illustrates the learning curves of two different learning tests with the initial vector θ = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]. In each new experiment a new policy parameter vector is produced using Eq. (39) and this policy is sent to the CPG layer of the robot controller to be executed. The gradient direction points towards the directions with the highest values in the search space. Since the path to the best policy does not matter here, this search is considered a local search. Local search algorithms operate on a single current state and move only to neighborhoods of that state. Although local search algorithms are not systematic, they use very little memory and can usually provide reasonable solutions in high-dimensional or infinite state spaces. The policy gradient method used here searches M neighbors near the current point in the state space and moves in the direction of maximum gradient. The MLR layer saves all the policies searched in the state space together with their associated payoff values, to be used in later learning procedures. During the first test of the learning process (Test1) it is noticed that the payoff never takes negative values.

Fig. 11. Learning curves of the 4 different methods; the LCPGNNs converge faster than the other 3 methods. XY-PCPGs and R-PCPGs are very similar because they both use simple Hebbian learning. CDS-ODS has a long convergence time but it makes very low errors in both the training and testing stages.


Fig. 12. Learning the complete waveform of the left hip pitch trajectory in the second stage.

Fig. 13. Evolution of 12 parameters in a LCPGNN learning the hip pitch trajectory.


This means that in this learning process the robot did not explore many new policies, so it did not fall and did not receive a negative punishment. In the second learning test (Test2) there are many negative payoff values, while the final payoff is higher than in Test1; this means that the robot searched farther points in the policy space and found better policies. Fig. 16 also illustrates the evolution of the payoff, which combines speed, precision and smoothness. The horizontal solid lines in each part indicate Jref; these horizontal lines jump to new values whenever Jref is updated. All the initial points have positive payoff values. The objective is to find the highest peak (a global maximum) of the state space. The policy gradient method does not look beyond the immediate neighbors of the current point, which is considered one of the disadvantages of this method; because of it, the performance of the learning algorithm depends on the initial starting points.

The starting point in the search space is very important and can affect the results of the policy search in the state space: beginning from different start points may lead to very different final best payoff values. Table 3 lists such initial starting points and their associated payoff values in three different tests.


Fig. 14. Left leg trajectories trained in CPG layer.

Table 2
The 12 parameters trained by the algorithm for the 5 DOFs of the left leg trajectories.

Time      b1     b2    b3     b4     b5    b6    b7     b8    b9    b10   b11   b12
Initial   1.00   0.25  1.00   1.00   0.51  1.00  1.00   0.76  1.00  1.00  1.01  1.00
Final Y1  -0.17  0.25  -0.26  -0.01  0.51  1.18  0.01   0.76  1.70  0.00  0.95  -0.39
Final Y2  -0.08  0.26  1.70   0.04   0.51  0.50  0.00   1.24  1.06  0.00  1.34  -0.64
Final Y3  -0.02  0.34  -1.70  -0.06  0.50  0.94  -0.05  0.77  1.70  0.02  1.02  -0.50
Final Y4  0.07   0.25  0.65   0.04   0.51  1.57  0.03   0.77  1.70  0.00  0.12  -1.66
Final Y5  0.17   0.25  -0.26  -0.01  0.76  1.70  0.00   0.91  1.69  0.00  1.00  0.70

Fig. 15. HLDU layer command determining the curvilinear path with (XC = 1, YC = 3, rd = 3). The red square is the initial point of the robot, and the robot should go to the blue (left) goal. The radius of the requested path is shown. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Each test of learning consists of 50 epochs, with 900 iterations per experiment. Since our policy vector has 10 parameters, in each epoch the MLR learner examines M = 10 points near the current point to compute the gradient. In this manner Eq. (38) involves square 10 × 10 matrices, which can be inverted very easily.
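A minimal sketch of one such epoch is given below, in the spirit of finite-difference policy gradient methods [4,27]: M = 10 perturbed policies are evaluated around the current point, a least-squares estimate of the gradient is formed (the normal-equation matrix is the small 10 × 10 matrix mentioned above), and a step of size ah = 0.05 is taken along the normalized gradient. The run_episode_payoff callable and the perturbation range are placeholders for executing a policy on the simulated robot and evaluating Eq. (42).

import numpy as np

def policy_gradient_epoch(theta, run_episode_payoff, M=10, eps=0.05, ah=0.05):
    """One MLR-layer epoch: estimate the payoff gradient from M perturbed
    policies and take a normalized ascent step (cf. Eqs. (38) and (39))."""
    theta = np.asarray(theta, dtype=float)
    J0 = run_episode_payoff(theta)                          # payoff of the current policy
    dTheta = np.random.uniform(-eps, eps, (M, theta.size))  # M random perturbations
    dJ = np.array([run_episode_payoff(theta + d) - J0 for d in dTheta])
    # Least-squares gradient estimate: solve dTheta @ g ~= dJ
    # (the normal-equation matrix dTheta.T @ dTheta is only 10 x 10 here).
    g, *_ = np.linalg.lstsq(dTheta, dJ, rcond=None)
    norm = np.linalg.norm(g)
    if norm > 0.0:
        theta = theta + ah * g / norm                       # step along the gradient direction
    return theta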

The final optimal policy in Table 3, for the particular command of Fig. 16, is θ = [0.9, 1.0, 1.09, 0.94, 0.95, 1.0, 0.99, 1.39, 1.50, 1.59]. Snapshots of the execution of this optimal policy in the Robotstadium environment are shown in Fig. 17. At the beginning of the experiment the NAO soccer player is behind the middle circle of the field; it then slowly walks and rotates towards the right of the soccer field. A key point of the curvilinear walk is that the robot never stops rotating.

Having trained the optimal policy in the simulation environment, it is possible to transfer the proposed layered controllers (containing optimal policies of different behaviors) to the real NAO robot. In this manner the learning phase is completed on the simulated robot and the results of learning are used on the real robot, so there is no need to execute an exhaustive and destructive learning process on the real robot. Fig. 18 illustrates such a curvilinear walk (similar to the simulated path) on the real NAO robot; snapshots are marked from 1 to 16, indicating the time indices. In this experiment NAO walks on two types of surface (carpet and plastic), and the change of surface induces a small perturbation to the robot. In snapshot number (10) the robot reaches the border between the two surfaces; it can be seen that at this point the robot rejects the induced perturbation and keeps itself balanced.

In this optimal vector it is observed that the hip pitch coefficients are lhP = 0.9, rhP = 1.0 and the knee pitch coefficients are lkP = 1.09, rkP = 0.94. It can be concluded that in order for a NAO humanoid robot to follow a circular path to the right (with the command of Fig. 16), it should increase the amplitude of its left knee and decrease that of the opposite knee; to keep itself balanced it should reverse this pattern in its hip pitch joints.


Fig. 16. Learning curves for two different learning tests (H = 50, ah = 0.05); the solid lines in each part indicate Jref.

Table 3
Some examples of policies for three different tests with 3 initial vectors.

Test  lhP    rhP    lkP    rkP    Pla1   Pra1   yP     Ks     Kf     Pdiff  Payoff
1     1      1      1      1      1      1      1      1      1      1      1.47
      0.96   0.99   1.05   0.97   1.0    1.003  0.99   0.96   0.994  1.055  2.24
      1.016  1.007  1.006  0.96   0.99   1.055  0.97   1.00   1.002  0.99   2.43
      1.00   1.03   1.014  0.97   0.98   1.086  0.96   0.996  0.99   0.95   2.58
      1.00   1.03   1.018  0.977  0.989  1.087  0.966  0.99   0.99   0.95   2.65

2     1      1      1.1    0.9    1.05   0.95   1      0.8    0.9    0.7    1.854
      1.02   1.0    1.07   0.91   1.06   0.945  1.01   0.82   0.86   0.68   2.07
      1.017  1.01   1.05   0.89   1.11   0.93   1.00   0.815  0.90   0.67   2.47
      1.017  1.014  1.06   0.89   1.10   0.936  1.011  0.81   0.90   0.67   2.49
      1.011  1.019  1.06   0.89   1.111  0.93   1.01   0.817  0.9    0.66   2.67

3     0.9    1      1.1    0.95   0.95   1      1      1.4    1.5    1.6    2.63
      0.9    1.0    1.09   0.94   0.95   1.0    0.99   1.39   1.50   1.59   2.79

Fig. 17. Snapshots of the final curvilinear walking in the Robotstadium simulation environment. In this experiment the optimal policy learned in the MLR layer is used. The command of the HLDU layer is (XC = 1, YC = 3, rd = 3).


This behavior is shown in Fig. 19, which plots the outputs of the CPG layer for the optimal policy: during the experiment the left knee pitch value is greater than the right knee pitch value, while the left hip pitch value is smaller than the right hip pitch value.


Fig. 18. Snapshots of curvilinear walking on the real NAO. In snapshot number 10 the robot faces the border between the two surfaces; it can reject the induced perturbation.

Fig. 19. Comparison of the CPG layer outputs for the optimal policy: during the experiment the left knee pitch value is greater than the right knee pitch value, and the left hip pitch value is smaller than the right hip pitch value.


In the optimal policy Pla1 = 0.95 and Pra1 = 1.0. These coefficients are used in Eqs. (8) and (9) to synchronize the body CPGs with the arms; in these two equations B = 2. Fig. 20 illustrates the coupling of the shoulder pitch with the hip pitch and hip roll joints. Since Pra1 = 1.0 and (B − Pra1) = 1, the left shoulder pitch is coupled equally with both the hip pitch and hip roll joints, whereas the right arm (with Pla1 = 0.95 and (B − Pla1) = 1.05) is mostly coupled with the hip pitch joint. This shows that when the robot turns to the right it should use the right arm to reject the unwanted momenta around the Y axis generated by the foot swinging.

In the optimal policy Ks and Kf are determined as (1.39, 1.50). These coefficients are used in Eqs. (10)–(13) to compute the gyro feedback pathways. The gyro feedbacks are applied to the ankle and hip joints to prevent unstable gaits.


Fig. 20. Arm movements according to their associated couplings with the hip pitch and the hip roll joints.

Fig. 21. Gyro sensor effects on joint number 5. Pai,5 is the radius of the first oscillator in the left ankle joint; GyroSensorX is the frontal gyro feedback; BasicQ5 is the original value of the left ankle pitch without feedback; Q5Gyro is the output of the ankle pitch joint with gyro feedback.


This feedback is used in the variations of Pai,k (Eq. (14)). Fig. 21 shows one example of the gyro feedback for joint number 5 (ankle pitch): the variation of Pai,k based on the frontal feedback FP_ankle is shown, and the figure compares the output of the ankle pitch joint with and without gyro feedback. BasicQ5 is the original value of the left ankle pitch without feedback, and Q5Gyro is the output of the ankle pitch joint with gyro feedback. It can be seen that Q5Gyro is very close to BasicQ5 in a normal walking pattern; in other words, the gyro feedbacks tune the CPG layer only when the robot deals with perturbations.
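Eqs. (10)–(14) define the actual gyro feedback pathways; purely for illustration, the sketch below shows a simple proportional correction of the kind that keeps Q5Gyro close to BasicQ5 during unperturbed walking. The single-gain form is an assumption made for this example, not the authors' equations.

def ankle_pitch_with_gyro(basic_q5, gyro_x, Ks=1.39):
    """Hypothetical proportional stand-in for the gyro pathways of Eqs. (10)-(14).

    basic_q5 : CPG output of the left ankle pitch without feedback (BasicQ5)
    gyro_x   : frontal gyro reading (GyroSensorX)
    During unperturbed walking gyro_x stays near zero, so the corrected command
    stays close to BasicQ5; a perturbation excites the gyro and produces a
    compensating offset (Q5Gyro).
    """
    return basic_q5 + Ks * gyro_x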

Another parameter determined in the optimal policy is Pdiff = 1.59, which is a rate coefficient for Fdiff, the phase deviance value defined in Eq. (15). The CPG layer computes this value over time based on the foot pressure differences. During a normal bipedal walk this value stays within a small range, so in Eq. (16) the speed of phase change remains unchanged. This is shown in Fig. 22, which plots the sum of the foot pressure values of both feet: in the normal case of walking the changes are largely rhythmic, so Fdiff oscillates with a small amplitude.

8. HLDU layer

The HLDU layer is only briefly discussed here. Since this layer deals with high level decision making in soccer robots, it contains complex processing units. The inputs and outputs of this layer are described for the present application, together with future perspectives and capabilities of this layer for learning advanced behaviors in soccer-playing humanoid robots.


Fig. 22. Fdiff values during a normal experiment of walking.


The main objective of this layer is to decide about strategic goals in soccer, i.e. where the robot is now, where it should go next, where the ball is, where it should be kicked, where the goal frame is, where the other soccer-playing robots are and how the robot can travel to another point on the soccer field. It can learn to generate teamwork strategies and cooperation techniques between players; in [20] these tasks are defined for 2D soccer-playing agents. In this research we focused only on one specific task of this layer which is very useful in a soccer match: the special commands generated by this layer determine a curvilinear path on the ground. To simplify the learning process we focused on a circular path that can be parameterized by a center and a radius.

This layer receives camera images as input, processes them and sends commands to the MLR layer so that appropriate policies can be built. In addition to these commands, some other feedback is generated in this layer describing the local position and speed of the robot; this feedback is used in the payoff function of the MLR layer. Learning could also be employed in this layer to optimize the processing of the images and the calculation of position and speed. In the current work a very simple decision making system is used in this layer to calculate the desired values. The layer computes the vision feedback for the MLR layer; on the real NAO this computation would involve complex image processing algorithms, so we use simulation to simplify the problem. In this case learning is performed in a simulator and the trained parameters are then used on the real NAO. In the simulator we can use server-based commands to obtain the positions and errors; in this work Webots(TM) is used to train the robot.
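As a minimal sketch of the HLDU–MLR interface described in this section, the structures below capture the curvilinear-path command and the vision-derived feedback; the class and field names are placeholders introduced for this example, not identifiers from the authors' implementation.

from dataclasses import dataclass

@dataclass
class CurvilinearCommand:
    """HLDU -> MLR command: a circular path on the field."""
    xc: float   # X coordinate of the circle center
    yc: float   # Y coordinate of the circle center
    rd: float   # radius of the requested path

@dataclass
class VisionFeedback:
    """HLDU -> MLR feedback used in the MLR payoff function."""
    x: float        # estimated robot position (X)
    z: float        # estimated robot position (Z)
    speed: float    # estimated local speed of the robot

# Example: the constant command used in the experiments of Section 7.3.
command = CurvilinearCommand(xc=1.0, yc=3.0, rd=3.0)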

9. Conclusions

A hierarchical model for learning bipedal walking skills using learner central pattern generator neural networks was introduced; these networks are trained in two stages. In the first stage the CPG layer is trained with NAO's basic walking trajectories in order to find the fundamental frequencies. In the second stage a learning method based on the Levenberg–Marquardt algorithm with an analytical gradient calculation is applied; it starts from the initial point found in the first stage and finalizes the learning process. At the end of the algorithm, phase differences are computed to be used as synchronization constants between the neurons of the proposed oscillatory neural networks.

The proposed method of learning in central pattern generators has several advantages over the previous methods. It has the shortest convergence time among the compared methods. Despite the faster convergence, the average training error of the LCPGNNs is higher than that of the other methods: since the LCPGNNs do not have the fundamental frequency at the beginning of the training process they make very large errors during the training stage, but once they find the fundamental frequency and suitable initial points they rapidly converge to the best final parameters. As described above, this method makes many wrong guesses in the parameter space but quickly finds the right answer. The average testing error of our learning algorithm is lower than that of the other methods, which indicates that it finds its required parameters better than the others and is able to regenerate the teaching trajectories very well.

The MLR layer is trained to learn the optimal policy of curvilinear walking based on a high level command. The mathematical mapping of sensory feedback to the different control layers of the framework was introduced. This framework can be used in other bio-inspired robots to teach them hierarchical behaviors. The tests in this research were limited to one desired behavior in each layer, with the focus on curvilinear walking; other types of behaviors can be learned in this framework.

References

[1] Pierre A. Guertin, The mammalian central pattern generator for locomotion, Brain Res. Rev. 62 (2009) 45–56.

[2] A.H. Cohen, Control principles for locomotion looking toward biology, J. Biomech. Eng. (2003) 41–51.

[3] A.J. Ijspeert, Central pattern generators for locomotion control in animals and robots: a review, Neural Networks 21 (2008) 642–653.

[4] N. Kohl, P. Stone, Policy gradient reinforcement learning for fast quadrupedal locomotion, in: IEEE International Conference on Robotics and Automation, 2004.

[5] M. Sato, Y. Nakamura, S. Ishi, Reinforcement learning for biped locomotion, in: International Conference on Artificial Neural Networks, 2002.

[6] J.J. Kuffner, S. Kagami, Dynamically-stable motion planning for humanoid robots, Auton. Robots 12 (2002) 105–118.

[7] M. Hebbel, R. Kosse, W. Nistico, Modeling and learning walking gaits of biped robots, in: IEEE-RAS International Conference on Humanoid Robots, 2006.

[8] C. Niehaus, T. Rofer, T. Laue, Gait optimization on a humanoid robot using particle swarm optimization, in: IEEE-RAS International Conference on Humanoid Robots, 2007.

[9] A. Cherubini, F. Giannone, L. Locchi, M. Lombardo, G. Oriolo, Policy gradient learning for a humanoid soccer robot, Robot. Auton. Syst. 57 (2009) 808–818.

[10] T. Matsubara et al., Learning CPG-based biped locomotion with a policy gradient method, Robot. Auton. Syst. 54 (2006) 911–920.

[11] J. Strom, J. Morimoto, J. Nakanishi, M. Sato, K. Doya, Omnidirectional walking using ZMP and preview control for the NAO humanoid robot, in: LNAI 5949, Springer-Verlag, Berlin Heidelberg, 2009, pp. 378–389.

[12] J. Pratt, P. Dilworth, G. Pratt, Virtual model control of a biped walking robot, in: IEEE International Conference on Robotics and Automation, 1997.

[13] L. Righetti, A.J. Ijspeert, Programmable central pattern generators: an application to biped locomotion control, in: IEEE International Conference on Robotics and Automation, Orlando, Florida, 2006.

[14] T. Zielinska, Biological inspiration used for robots motion synthesis, J. Physiol. – Paris 103 (2009).

[15] L. Righetti, J. Buchli, A.J. Ijspeert, Hebbian learning in adaptive frequency oscillators, Physica D (2005) 105–116.

[16] F. Hackenberger, Balancing Central Pattern Generator Based Humanoid Robot Gait Using Reinforcement Learning, Graz University of Technology, 2007.

[17] S. Degallier, L. Righetti, L. Gay, A.J. Ijspeert, Toward simple control for complex, autonomous robotic applications: combining discrete and rhythmic motor primitives, Auton. Robots 31 (2011) 155–181.

[18] P. Marc, Nao programming for the Robotstadium on-line contest, EPFL Biologically Inspired Robot Group, Ecole Polytechnique Federale de Lausanne (EPFL), 2010.

[19] P. Stone, Layered Learning in Multi-Agent Systems, PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1998.

[20] C.Y.-F. Ho, B.W.-K. Ling, H.-K. Lam, M.H.U. Nasir, Global convergence and limit cycle behavior of weights of perceptron, IEEE Trans. Neural Networks 19 (2008) 938–947.

[21] H. Shahbazi, K. Jamshidi, A.H. Monadjemi, Curvilinear bipedal walk learning in Nao humanoid robot using a CPG based policy gradient method, Appl. Mech. Mater. 110–116 (2011) 5161–5166.

[22] H. Shahbazi, K. Jamshidi, A.H. Monadjemi, Modeling of locomotor region in spinal cord for a humanoid robot, in: CICIS'11, IASBS, 2011.

[23] J.J. Alcaraz-Jimenez, D. Herrero-Perez, H. Martinez-Barbera, Motion planning for omnidirectional dynamic gait in humanoid soccer robots, Phys. Agents 5 (2011) 25–34.

[24] C. Azevedo, H. Poignet, B. Espiau, Artificial locomotion control: from human to robots, Robot. Auton. Syst. 47 (2004) 203–223.

[25] A. Gams, A.J. Ijspeert, S. Schaal, On-line learning and modulation of periodic movements with nonlinear dynamical systems, Auton. Robots 27 (2009) 3–23.

[26] D. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, Appl. Math. 11 (1963) 431–441.

[27] J. Peters, S. Schaal, Policy gradient methods for robotics, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, 2006.