
340 IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, VOL. 2, NO. 4, DECEMBER 2010

Multilevel Darwinist Brain (MDB): Artificial Evolution in a Cognitive Architecture for Real Robots

Francisco Bellas, Member, IEEE, Richard J. Duro, Senior Member, IEEE, Andrés Faiña, and Daniel Souto

Abstract—The multilevel Darwinist brain (MDB) is a cognitive architecture that follows an evolutionary approach to provide autonomous robots with lifelong adaptation. It has been tested in real robot on-line learning scenarios, obtaining successful results that reinforce the evolutionary principles that constitute the main original contribution of the MDB. This preliminary work has led to a series of improvements in the computational implementation of the architecture so as to achieve realistic operation in real time, which was the biggest problem of the approach due to the high computational cost induced by the evolutionary algorithms that make up the MDB core. The current implementation of the architecture is able to provide an autonomous robot with real-time learning capabilities and the capability for continuously adapting to changing circumstances in its world, both internal and external, with minimal intervention of the designer. This paper aims at providing an overview of the architecture and its operation and at defining what is required on the path towards a real cognitive robot following a developmental strategy. The design, implementation, and basic operation of the MDB cognitive architecture are presented through some successful real robot learning examples to illustrate the validity of this evolutionary approach.

Index Terms—Adaptive systems, artificial neural networks, autonomous robotics, cognitive architecture, developmental robotics, evolutionary computation.

I. INTRODUCTION

SOME types of cognitive functions, such as anticipation and planning, may be achieved by internally simulating the robot's interaction with the environment through its actions and their consequences [1]. However, this would require the existence of good enough updated models of the world and of the robot itself. Models must adapt to changing circumstances and they must be remembered and generalized in order to be reused. Actions and sequences of actions, on the other hand, must be generated in real time so that the robot can cope with the environment and "survive."

A control system such as the one that is necessary for autonomy goes beyond traditional control in terms of its specifications or requirements. These additional

Manuscript received March 01, 2010; revised August 14, 2010; accepted September 30, 2010. Date of publication October 14, 2010; date of current version December 10, 2010. This work was partially funded by the Xunta de Galicia through Project 09DPI012166PR and the European Regional Development Funds.

The authors are with the Integrated Group for Engineering Research, University of Coruña, Coruña 15001, Spain (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TAMD.2010.2086453

requirements imply the ability to learn the control function from scratch, the ability to change or adapt it to new circumstances, the ability to interact with the world in real time while performing the aforementioned processes, and, in some instances, even the ability to change the objectives that guide the control system. Consequently, and to differentiate them from traditional control systems, these structures are usually called cognitive systems or cognitive architectures [2].

This work tries to provide an overview of what would be required as a first step towards a real cognitive robot and, in this line, one that can learn throughout its lifetime in an adaptive manner. An approach will be described that considers evolution, and in particular neuroevolution [3], an intrinsic part of the cognitive system that allows a robot to learn different tasks and objectives. This approach, called the multilevel Darwinist brain (MDB), has been extensively tested in different real robotics applications [4], [5], yielding new requirements and improvements that have been implemented and tested. The MDB is not intended as a biologically plausible path but, rather, as a computationally effective way of providing the required functionality in real-time robotics. Consequently, the crucial aspects of this work are those related to "reality": real time, real world, and real robots, mainly considering that evolutionary techniques are highly time consuming and do not seem adequate for real-time operation in robots.

Notwithstanding these previous comments, given the structure of the MDB, it is tempting to draw some parallels with large-scale neural ensembles. For instance, multiple motor cortical populations are present and compete for access to action selection as carried out by the basal ganglia through internal sensorimotor loops. Concepts such as emotion in action selection (amygdala) could be easily considered.

The paper has been organized as follows. Section II discusses the elements that must be considered in the design of a cognitive architecture for real robots following a developmental approach. Section III introduces the principles, basic elements, operation, and computational implementation of the MDB architecture. Section IV is devoted to the presentation of two illustrative application examples implemented with the MDB in real robots. Finally, Section V provides some conclusions as well as some indications of the work that still needs to be done.

II. COGNITIVE ARCHITECTURES FOR REAL ROBOTS

In robotics, it has been quite common to study animal behavior as a reference for developing cognitive architectures for real robots, trying to reproduce the learning patterns observed in nature [6]. Animals acquire many competences during their lifetime by interacting with the world. Their secret weapon is an autonomously operating and nonpredetermined open-ended lifelong learning brain-body system.

1943-0604/$26.00 © 2010 IEEE

It seems that, in autonomous robotics, instead of designing intelligent robots, for a long time researchers have been designing just controllers. Initially, years were spent on directly programmed classical control schemes [7]. Later, different authors applied deliberative approaches based on symbolic representations of knowledge using preprogrammed models and objectives [8]. Once this approach became too complex, other authors resorted to reactive approaches [9] that, to deal with complex tasks, required too much designer intervention. The main conclusion extracted from these initial approaches to the problem is that all preprogrammed control systems are very limited in terms of autonomy and, consequently, inadequate for real robotics. As in the case of nature, open-ended lifelong learning systems present the potential of achieving a more realistic level of autonomy. Here, this concept will be taken as a mandatory objective for the design of a cognitive architecture.

A. Control Versus Cognition

Cognition may be defined as "the mental process of knowing, including aspects such as awareness, perception, reasoning, and judgment" [11]. From a computational perspective, cognition can be considered as "a collection of emerging information technologies inspired by the qualitative nature of biologically-based information processing and information extraction found in the nervous system, human reasoning, human decision-making, and natural selection" [12]. Therefore, on one hand, we have the mental process of knowing, which in mathematical terms can be considered as extracting models from data. These models can be employed in the process of decision making so that appropriate actions may be taken as a function of sensing and motivation. On the other hand, we have the decision-making process itself, and, in robotics, a decision is always related to an action or sequence of actions. In a sense, the models must be used in order to decide the appropriate actions so as to fulfill the robot's motivations. It is in how the model making and the action determination processes take place that cognitive architectures differ from each other [2].

B. Developmental Robotics

The main objective of the developmental robotics field is to create "open-ended, autonomous learning systems that continually adapt to their environment" [12], as opposed to constructing robots that carry out particular, predefined tasks. The main inspiration within this field has been taken from complex biological organisms and the developmental process they follow during their lifetime: under the control of a developmental program, they develop mental capabilities through autonomous real-time interactions with their environments, using their own sensors and effectors [13].

The philosophy behind developmental robotics is that learning occurs by taking small steps and building on what is already known [14]. As commented by Lisa Meeden, "under a developmental process, a system can continually advance what it knows by placing itself into situations where it almost knows something, and then learning it. Applied repeatedly, such a developmental process can potentially lead to much more complex, general-purpose behavior than has been achieved to date." This developmental robotics approach has been followed in the design of the MDB cognitive architecture, but the problem has been addressed by making use of some of the concepts of traditional cognition, introducing ontogenetic evolutionary processes for the on-line adaptation of the knowledge-bearing structures.

In addition to the mental aspects commented on above, cognitive developmental robotics (CDR) includes all the topics related to body development [15]. However, fetal sensorimotor development, voluntary movement acquisition, spatial perception, and body/motor representation or understanding are problems beyond the scope of this paper. Here, the focus of attention will be placed on the development of a cognitive architecture to be applied to "existing" physical robots. Embodiment, adaptive motivation, open-ended lifelong learning, and autonomous knowledge acquisition are some of the typical CDR topics on which this approach will concentrate, dealing with all the implementation problems that arise when using a developmental approach in real-time operation. As will be discussed later, some of the practical requirements for an efficient computational implementation may imply modifying or relaxing previous theoretical assumptions.

C. Embodied Cognition

A cognitive system makes no sense without its link to the real (or virtual) world in which it is immersed and which provides the data to be manipulated into information and knowledge, thus requiring the capability of acting and sensing, and of doing so in a timely and unconstrained manner. Consequently, to design a cognitive system for real autonomy, that is, one capable of thinking things out before acting, it seems necessary to start with a typical deliberative structure that includes models as an intrinsic element. Its general structure could be like the one shown in Fig. 1 (top). Obviously, this structure assumes preset goals for which the models and the action selector are designed. Thus, to make it independent from preset goals, a satisfaction model, which is basically a learnable utility function, must be added, as shown in Fig. 1 (middle). However, deliberative structures usually require quite a bit of time in order to decide on an action, and embodied systems must often act very fast with whatever information is available in order to survive. To allow for this, it would be necessary to include a reactive part in the deliberative mechanism that would be seamlessly linked to it. This is what is shown in the bottom part of Fig. 1 through the introduction of the concept of behavior as an independent processing element capable of producing sequences of actions for the agent as a response to some external or internal stimuli. This way, when the system is interacting with the world, it is using the behavior-based part. On the other hand, the deliberative part of the mechanism, in its own time frame, adapts the behavior module in order to maximize satisfaction. Thus, Fig. 1 (bottom) can be seen as a behavior-based deliberative structure. Its operation implies two different time scales: the execution time scale, where the reactive behaviors are directly applied, and the deliberation time scale, in which the new behaviors are learned through interaction with the environment. This scheme represents a starting point for the design of an embodied cognitive system for real autonomous robots and, as will be explained later in detail, it is the basic structure of the MDB.

Fig. 1. Conceptual evolution from a traditional deliberative architecture (top) towards an architecture that allows for the intrinsic change of goals or motivation by introducing a satisfaction model (middle) and, finally, to an architecture that permits fast reactive real-time behavior while preserving the deliberative characteristics by considering the selection of behaviors instead of simple actions (bottom).

D. Cognitive Model

Starting from the general concepts stated in the previous sections, a cognitive model based on a particularization of the standard abstract architectures for agents [16] was used for the design of the MDB architecture. In this case, a utilitarian cognitive model is adopted. It starts from the premise that, to carry out any task, a motivation (defined as the need or desire that makes an agent act) must exist that guides the behavior as a function of its degree of satisfaction. The external perception e(t) of an agent is made up of the sensory information it is capable of acquiring through its sensors from the environment in which it operates (like distances, shapes, colors, sounds, etc.). The environment can change due to the actions of the agent or to factors uncontrolled by the agent. Consequently, the external perception can be expressed as a function of the last action a(t-1) performed by the agent, the sensory perception e(t-1) it had of the external world in the previous time instant, and a description X(t) of the events occurring in the environment that are not due to its actions, through a function W:

e(t) = W(e(t-1), a(t-1), X(t))

The internal perception i(t) of an agent is made up of the sensory information provided by its internal sensors, its proprioception (like battery level, stress level in terms of CPU load, motivation level, etc.). Internal perception can be written in terms of the last action a(t-1) performed by the agent, the sensory perception i(t-1) it had from the internal sensors in the previous time instant, and other internal events Y(t) not caused by the actions of the agent, through a function I:

i(t) = I(i(t-1), a(t-1), Y(t))

The satisfaction s(t) of the agent may be defined as a magnitude or vector that represents the degree of fulfilment of the motivation or motivations of the agent, and it can be related to its internal and external perceptions through a function S. As a first approximation, the social aspects of the robot's development, that is, the events over which the agent has no control, will be ignored and the problem reduced to the interactions of the agent with the world and itself. Thus, generalizing:

s(t) = S(e(t), i(t)) = S(W(e(t-1), a(t-1)), I(i(t-1), a(t-1)))

The main objective of this utilitarian cognitive architecture is the satisfaction of the motivation of the agent, which, without any loss of generality, may be expressed as the maximization of the value of the satisfaction s(t) in each instant of time, and the satisfaction can be expressed as a function of the functions W and I acting over the external perceptions, the internal perceptions, and the previous actions:

max over a of s(t) = S(W(e(t-1), a), I(i(t-1), a))

According to the previous expression, to solve this maximization problem, the only parameter the agent can modify is the action it performs, as the external and internal perceptions should not be manipulated (doing this would lead to distorted perceptions, as in altered mental states). That is, the cognitive architecture must explore the possible action space in order to maximize the resulting satisfaction. To obtain a system that can be applied in real time, the optimization of the action must be carried out internally (without interaction with the environment). W, I, and S are theoretical functions that must be somehow obtained. These functions correspond to what are traditionally called the following:

• world model (W): function that relates the external perception before and after applying an action;
• internal model (I): function that relates the internal perception before and after applying an action;
• satisfaction model (S): function that provides a predicted satisfaction from predicted perceptions provided by the world and internal models.

Fig. 2. Functional diagram of the cognitive model.

Fig. 2 displays a functional diagram of this cognitive model, indicating the relationships among all the elements involved. This diagram is useful to realize that action evaluation is a sequential process, and that satisfaction prediction requires the prior execution of the "perceptual" models.
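To make this sequential evaluation concrete, the following toy Python sketch (not from the paper; all three model functions are invented stand-ins) runs the world and internal models before the satisfaction model, and then optimizes the action internally over a small candidate set:

```python
# Toy sketch of the sequential action-evaluation pipeline of Fig. 2.
# All three models are invented stand-ins, not the MDB's learned models.

def world_model(e, a):
    """Stand-in world model W: predicts the next external perception."""
    return e + 0.5 * a

def internal_model(i, a):
    """Stand-in internal model I: acting slightly drains an internal resource."""
    return i - 0.1 * abs(a)

def satisfaction_model(e_next, i_next):
    """Stand-in satisfaction model S over the predicted perceptions."""
    return -abs(1.0 - e_next) + 0.2 * i_next

def evaluate_action(e, i, a):
    # Sequential evaluation: the "perceptual" models W and I must be
    # executed before the satisfaction model S can be applied.
    e_next = world_model(e, a)
    i_next = internal_model(i, a)
    return satisfaction_model(e_next, i_next)

def select_action(e, i, candidate_actions):
    # Internal optimization of the action: no environment interaction needed.
    return max(candidate_actions, key=lambda a: evaluate_action(e, i, a))

best = select_action(e=0.0, i=1.0, candidate_actions=[-1.0, 0.0, 1.0, 2.0])
```

Note that the candidate-set search here merely stands in for the action optimization process; in the MDB, this exploration of the action space is carried out by evolutionary means.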

As commented before, the main starting point in the design of a developmental cognitive architecture is that the acquisition of knowledge should be automatic and occur during the agent's lifetime. Thus, it is necessary to establish that the three models W, I, and S must be obtained at execution time as the agent interacts with the world. To be able to carry out this modeling process, information must be extracted from the real data the agent has after each interaction with the environment. Hereafter, these data will be called action–perception pairs and are made up of the sensorial data in instant t, the action applied in instant t, the sensorial data in instant t+1, and the satisfaction in t+1. This way, we have all the perceptual information before and after applying an action.
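A minimal data-structure sketch of an action–perception pair and a bounded short-term memory might look as follows; the fixed-capacity FIFO policy is an illustrative assumption, and the MDB's actual memory management is more elaborate:

```python
# Minimal sketch of an action-perception pair and a bounded short-term
# memory (STM). The fixed-capacity FIFO policy is an illustrative assumption.
from collections import deque
from dataclasses import dataclass

@dataclass
class ActionPerceptionPair:
    sensing_t: tuple        # sensorial data in instant t
    action_t: float         # action applied in instant t
    sensing_t1: tuple       # sensorial data in instant t+1
    satisfaction_t1: float  # satisfaction in instant t+1

class ShortTermMemory:
    def __init__(self, capacity):
        self.pairs = deque(maxlen=capacity)  # oldest pairs are discarded

    def store(self, pair):
        self.pairs.append(pair)

stm = ShortTermMemory(capacity=2)
stm.store(ActionPerceptionPair((0.0,), 1.0, (0.5,), 0.1))
stm.store(ActionPerceptionPair((0.5,), 1.0, (1.0,), 0.4))
stm.store(ActionPerceptionPair((1.0,), 0.0, (1.0,), 0.5))  # evicts the oldest
```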

The model organization displayed in Fig. 2 allows for an intrinsically adaptive operation. If the sensorial information changes (dynamic environment, hardware failures), the world and internal models can be updated or replaced without any consequence in terms of the architecture. In addition, if the motivation changes, the satisfaction model would change while the action selection method remains unchanged.

Summarizing, for every interaction of the agent with its environment, two processes must be solved:

1) the modeling of the functions W, I, and S (the world, internal, and satisfaction models) using the information in the action–perception pairs;

2) the optimization of the action using the models available at that time, trying to maximize the predicted satisfaction provided by the satisfaction model.

To create models is to try to minimize the difference between the reality that is being modeled and the predictions provided by the model. Consequently, it is clear that a cognitive architecture must involve some type of minimization strategy or algorithm.

Fig. 3. MDB elements and their relations.

As the search spaces are not really known beforehand and are usually very complex, here we propose using, as the algorithm for on-line modeling, one of the most powerful stochastic multipoint search techniques: artificial evolution.
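As an illustration of modeling-as-minimization, the following toy evolutionary search (a simple truncation-selection scheme, not the MDB's actual algorithm) fits the single parameter of a world model of the form e(t+1) = k * e(t) to stored samples by minimizing prediction error:

```python
# Toy "modeling as minimization": a truncation-selection evolutionary search
# fits the single parameter k of a world model e(t+1) = k * e(t) to samples
# taken from the STM, by minimizing the mean squared prediction error.
import random

samples = [(1.0, 0.8), (2.0, 1.6), (3.0, 2.4)]  # (e_t, e_t1) pairs; true k = 0.8

def error(k):
    # Fitness to minimize: prediction error over the stored samples.
    return sum((k * e_t - e_t1) ** 2 for e_t, e_t1 in samples) / len(samples)

def evolve(generations=200, pop_size=10, seed=0):
    rng = random.Random(seed)
    population = [rng.uniform(-2.0, 2.0) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=error)                  # best individuals first
        parents = population[: pop_size // 2]       # truncation selection
        children = [p + rng.gauss(0.0, 0.1) for p in parents]  # mutation
        population = parents + children             # elitist replacement
    return min(population, key=error)

k_best = evolve()
```

A stochastic multipoint search of this kind needs no gradient information and no prior knowledge of the shape of the search space, which is precisely why it suits models whose structure is not known beforehand.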

III. MULTILEVEL DARWINIST BRAIN (MDB)

MDB is a general cognitive architecture, first presented in [17], that follows a developmental robotics approach for the automatic acquisition of knowledge in a real robot through the interaction with its environment, so that it can autonomously adapt its behavior to achieve its design objectives. The background idea of the MDB of applying artificial evolution for knowledge acquisition takes inspiration from classical biopsychological theories by Changeux [18], Conrad [19], and Edelman [20] in the field of cognitive science relating the brain and its operation through a Darwinist process. All of these theories lead to the same concept of a cognitive structure based on the brain adapting its neural connections in real time through evolutionary or selectionist processes.

Fig. 3 displays a block diagram of the current implementation of the MDB. It follows a generalization of the cognitive model described in the previous section through the addition of two new elements.


1) Behavior structures: as commented before, they generalize the concept of single action used in the cognitive model. A behavior represents a decision element able to provide actions or sequences of actions according to the particular sensorial inputs. That is, the robot could have a behavior for "wall-following," another for "wandering," etc.

2) Memory elements: a short-term and a long-term memory are required in the learning processes. They will be discussed later in detail.

As presented in Fig. 3, the MDB is structured into two different time scales, one devoted to the execution of the actions in the environment (reactive part) and the other dealing with the learning of the models and behaviors (deliberative part). The operation of the MDB can be described in terms of these two scales.

1) Execution time scale: The following steps are continuously repeated in a sequential manner.

1.1) There is a current behavior, selected in the deliberative process, which chooses, based on its perception, the next action to be applied.
1.2) The selected action is applied to the environment through the actuators of the robot, obtaining a new set of perceptions.

2) Deliberation time scale: these processes also take place in different time scales, and, consequently, it must be pointed out that they are not sequential.

2.1) The acting and sensing values obtained after the execution of an action in the environment in the execution time scale provide a new action–perception pair that is stored in a short-term memory (STM).
2.2) The evolutionary model learning processes (for world, internal, and satisfaction models) try to find functions that generalize the real samples stored in the STM. Each evolutionary process has been represented by two blocks in Fig. 3, one related to the evolution itself (labelled evolver) and the other one representing the population of each evolution (labelled Base). The computation time required for each evolutionary process may be different, depending on the complexity of the models. The practical implementation of such processes will require attention to avoid incoherence during the interplay of the elements involved in real-time operation.
2.3) The best models in a given instant of time are taken as the current world model (WM), current internal model (IM), and current satisfaction model (SM), and are used by the behavior evolver to select the best behavior with regard to the predicted satisfaction of the motivation (behavior proposer block in Fig. 3). Therefore, another evolutionary process has been added to the cognitive model presented above that is continuously obtaining new behaviors for the robot using the best three models it has for the evaluation of the individuals. Consequently, the best behavior in a given instant of time is the one that provides the highest level of satisfaction on average for all the samples stored in the STM. The blocks labelled current WM, current IM, current SM, and behavior proposer are really included in the behavior evolver within the individual evaluation stage.
2.4) The behavior evolver is continuously proposing new behaviors that are better adapted to the STM contents. Upon request, the behavior selector provides the best one to the reactive part of the MDB (the one operating in the execution time scale), the current behavior block in Fig. 3, which replaces the one there and which should be better adapted to the STM and, consequently, to the current "reality" of the robot.
2.5) The block labelled long-term memory (LTM) in Fig. 3 stores those models and behaviors that have provided successful and stable results in their application to a given task, in order to be reused directly in other problems or as seeds for new evolutionary learning processes.
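The interplay of the two time scales can be rendered in a deliberately simplified, single-threaded Python sketch. The behaviors, the environment update, and the deliberation schedule below are all invented placeholders, and the real architecture runs the two scales concurrently:

```python
# Deliberately simplified, single-threaded rendering of the two MDB time
# scales. Behaviors, the environment update, and the deliberation schedule
# are invented placeholders; the real architecture runs these concurrently.

def wandering_behavior(perception):
    return 0.0                       # placeholder initial behavior

def improved_behavior(perception):
    return 1.0 - perception          # placeholder "better adapted" behavior

class MDBLoop:
    def __init__(self):
        self.current_behavior = wandering_behavior
        self.stm = []                # stored action-perception pairs
        self.perception = 0.0

    def execution_step(self):
        action = self.current_behavior(self.perception)        # step 1.1
        new_perception = self.perception + 0.5 * action        # step 1.2 (toy env)
        self.stm.append((self.perception, action, new_perception))  # step 2.1
        self.perception = new_perception

    def deliberation_step(self):
        # Steps 2.2-2.4 collapsed into one move: pretend the behavior evolver
        # produced a behavior better adapted to the STM and swap it in.
        self.current_behavior = improved_behavior

mdb = MDBLoop()
for iteration in range(10):          # each iteration = one interaction
    mdb.execution_step()
    if iteration == 4:               # deliberation on its own, slower schedule
        mdb.deliberation_step()
```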

Each time the robot executes an action during real-time operation, a new action–perception pair is obtained. This real information is the most relevant one in the MDB, as all the learning processes depend on the number and quality of action–perception pairs. Consequently, each interaction of the robot with the environment has been taken as the basic time unit within the MDB and called an iteration. As more iterations take place, the MDB acquires more information from the real environment and thus the model learning processes should produce better models and, consequently, the behaviors obtained using these models should be more reliable, and the actions provided by them more appropriate to fulfil the motivations.

The MDB diagram represented in Fig. 3 follows the structure shown in Fig. 1 (bottom) for an embodied cognitive architecture. It includes concurrent reactive and deliberative processes. The following three subsections describe the most important aspects of the MDB in practical terms: evolution, memories, and implementation.

A. Lifelong Learning by Evolution

The main difference of the MDB with respect to other cognitive architectures for real robots lies in the way knowledge is acquired through evolutionary techniques [2], [15]. The models resulting from this learning are usually complex because the real world is dynamic and the robot state, the environment, and the objective may change in time. To achieve the desired neural adaptation through evolution established by the Darwinist theories on which this architecture is based, it was decided to use artificial neural networks (ANNs) as the representation for the models, mainly due to their suitability for being adapted through evolutionary processes. There is no limitation regarding the type of ANN that can be used in the MDB. Regular feedforward, radial basis function, recurrent or delay-based networks, or spiking neural networks may be useful depending on the type of model that needs to be learned. Consequently, the acquisition of knowledge in the MDB is a neuroevolutionary process, with an evolutionary algorithm devoted to learning the parameters of the ANN. Neuroevolution is a reference learning tool due to its robustness and adaptability to dynamic environments and nonstationary tasks, as commented in [3].


BELLAS et al.: MULTILEVEL DARWINIST BRAIN (MDB): ARTIFICIAL EVOLUTION IN A COGNITIVE ARCHITECTURE FOR REAL ROBOTS 345

Here, the modeling is not an optimization process, but a lifelong learning process, taking into account that the best generalization for all times or, at least, for an extended period of time is sought, which is different from minimizing an error function in a given instant of time. Consequently, the modeling technique selected must allow for gradual application, as the information becomes known progressively and in real time. Evolutionary techniques permit this gradual learning process by controlling the number of generations of evolution for a given content of the STM. Thus, if evolutions last just a few generations per iteration, gradual learning by all the individuals is achieved.

To obtain general modeling properties in the MDB, the population of the evolutionary algorithms must be preserved between iterations (represented in Fig. 3 through the world, internal, and satisfaction model base blocks that are connected to the evolutionary blocks), leading to a sort of inertia learning effect where what is being learned is not the contents of the STM in a given instant of time, but of the sets of STMs that were previously seen. In addition, the dynamics of the real environments where the MDB will be applied imply that the architecture must be intrinsically adaptive. This strategy of evolving for a few generations and preserving populations between iterations permits a quick adaptation of the models to the dynamics of the environment, as a collection of possible solutions is present in the populations and they can be easily adapted to the new situation.
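This evolve-a-few-generations-per-iteration strategy can be illustrated with a toy model: a single slope parameter rather than the MDB's ANNs, and a simple mutate-and-select scheme that stands in for the real evolutionary algorithm. The point of the sketch is that the population array persists across iterations, so fitness improves gradually over successive STM contents instead of being optimized exhaustively for any one of them; all names and parameter values are assumptions.

```java
import java.util.Random;

// Toy sketch of gradual evolution: the population survives between
// iterations and only a few generations run per iteration.
public class GradualEvolution {
    static final Random RNG = new Random(42);

    // mean squared error of slope a over STM samples {x, y}
    static double fitness(double a, double[][] stm) {
        double err = 0;
        for (double[] s : stm) { double d = a * s[0] - s[1]; err += d * d; }
        return err / stm.length;
    }

    // evolve for just a few generations, starting from the preserved population
    static double[] evolveFewGenerations(double[] population, double[][] stm, int generations) {
        for (int g = 0; g < generations; g++) {
            for (int i = 0; i < population.length; i++) {
                double mutant = population[i] + RNG.nextGaussian() * 0.1;
                if (fitness(mutant, stm) < fitness(population[i], stm))
                    population[i] = mutant;   // keep the better individual
            }
        }
        return population;
    }

    public static void main(String[] args) {
        double[][] stm = { {1, 2}, {2, 4}, {3, 6} };   // samples of y = 2x
        double[] pop = { 0.0, 0.5, 1.0 };              // preserved between iterations
        double before = fitness(pop[0], stm);
        for (int iteration = 0; iteration < 20; iteration++)
            pop = evolveFewGenerations(pop, stm, 3);   // only 3 generations each time
        System.out.println(fitness(pop[0], stm) <= before);
    }
}
```

Because selection never accepts a worse individual, the error over the STM is non-increasing across iterations, which is the smooth learning curve the text refers to.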

In the case of behaviors, in the current version of the MDB they are also represented by ANNs and, consequently, they can be viewed as neural behavior controllers that provide the action the robot must apply in the environment according to its sensorial inputs. The learning of behaviors follows the same gradual principles as that of the models in order to avoid the fluctuations in the evolution caused by an exhaustive optimization for each particular content of the STM. Again, a behavior will improve with iterations and not within iterations.

In the first versions of the MDB, the evolution of the models was carried out using standard canonical genetic algorithms where the ANNs were represented by simple two-layer perceptron models [17]. These tools provided successful results when dealing with simple environments and tasks but, as soon as real world learning problems were faced, they were insufficient. Standard genetic/evolutionary algorithms, when applied to these tasks, tend to converge towards homogeneous populations, that is, populations where all of the individuals are basically the same. In static problems this would not be a drawback if this convergence took place after reaching the optimum. Unfortunately, there is no way to guarantee this, and diversity may be severely reduced long before the global optimum is achieved. In dynamic environments, where even the objective may change, this is quite a severe drawback. In addition, the episodic nature of real-world problems implies that whatever perceptual streams the robot receives could contain information corresponding to different learning processes or models that are intermingled (periodically or not), that is, learning samples need not arise in an orderly and appropriate manner. Some of these sequences of samples are related to different sensorial or perceptual modalities and might not overlap in their information content; others correspond to the same modalities but should be assigned to different models. The problem that arises is how to learn all of these different models, the samples of which are perceived as partial sequences that appear randomly intermingled with those of the others.

In order to deal with the particularities of the learning processes involved in the MDB, it was necessary to develop a new neuroevolutionary algorithm able to deal with general dynamic problems, that is, combining both memory elements and the preservation of diversity. This algorithm was called the promoter-based genetic algorithm (PBGA) [26]. In this algorithm, the chromosome is endowed with an internal or genotypic memory, and diversity is preserved using a genotype-phenotype encoding that prevents the loss of relevant information throughout the generations. One of the main features of the PBGA is that it automatically adjusts the number of neurons required for each layer. The practical operation and details of the algorithm are presented in [22], where a discussion on how this algorithm outperforms others like NEAT [21] in nonstationary conditions is provided. An analysis of how the algorithm can be improved with the incorporation of an external memory, in this case an LTM, is presented in [23].

Summarizing this subsection, the MDB implements four parallel neuroevolutionary processes during operation, using the STM contents as the fitness function for learning the models and behaviors. The cognitive architecture is transparent to the particular type of algorithm or ANN but, taking into account the features of real world learning, a neuroevolutionary algorithm, the PBGA, that outperforms the existing ones in these conditions has been designed.

B. Remembering Facts, Situations, and Behaviors

The management of the real data represented by the action–perception pairs that are stored in the STM ("the facts") is critical in the real time learning processes of the MDB. The quality of the learned models depends on what is stored in this memory and the way it changes. On the other hand, the particular conditions of the environment and the robot itself throughout its lifetime ("the situations") and the actions taken ("the behaviors") in those cases must be stored in an LTM to learn from experience.

1) STM: The STM is a memory element that stores data obtained from the real time interaction of the agent with its environment. The internal models the agent creates during the learning process should predict and generalize all the data stored in the STM. Thus, what is learned and how it is learned depends on the contents of the STM over time. Obviously, it is not realistic to store all the samples acquired throughout an agent's lifetime. The STM must be limited in size and, consequently, a replacement strategy is required in order to store the information the agent considers most relevant in a given instant of time. The replacement strategy should be dynamic and adaptable to the needs of the agent and, therefore, it must be subject to external regulation. For this reason, a replacement strategy has been designed that labels the samples using four basic features related to saliency and temporal relevance.

• The point in time a sample is stored (t): it favors the elimination of the oldest samples, maximizing the learning of the most recently acquired information.



• The distance between samples (d): measured as the Euclidean distance between the action–perception pair vectors, this parameter favors the storage of samples from all over the feature space in order to achieve a general model.

• The complexity of a sample to be learned (c): this parameter favors the storage of the samples that are hardest to learn. The error provided by the current models when predicting a sample is used to calculate it.

• The relevance of a sample (r): this parameter favors the storage of the most particular samples, that is, those that, even though they may eventually be learned by the models very well, initially presented large errors.

Thus, each sample stored in the STM has a label (l) that is calculated every iteration as a function of these four basic terms, that is, l = f(t, d, c, r). Different functions require different management strategies. The regulation of these four features can be carried out by the cognitive mechanism or by other parts of the memory system so as to improve the learning and generalization properties.
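A minimal sketch of this label-based replacement might look as follows, denoting the four features t, d, c, and r (time, distance, complexity, relevance). The paper leaves the function f open; the plain weighted sum used here, and the unit weights, are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of label-based STM replacement: each sample carries the four
// features and a label computed from them; the lowest-label sample is
// evicted when the STM is full. The weighted-sum f is an assumption.
public class StmReplacement {
    static class Sample {
        double t, d, c, r;   // time, distance, complexity, relevance
        Sample(double t, double d, double c, double r) { this.t = t; this.d = d; this.c = c; this.r = r; }
        double label(double wt, double wd, double wc, double wr) {
            return wt * t + wd * d + wc * c + wr * r;
        }
    }

    // replace the lowest-label sample when the STM is full
    static void store(List<Sample> stm, int capacity, Sample s) {
        if (stm.size() == capacity) {
            int worst = 0;
            for (int i = 1; i < stm.size(); i++)
                if (stm.get(i).label(1, 1, 1, 1) < stm.get(worst).label(1, 1, 1, 1))
                    worst = i;
            stm.remove(worst);
        }
        stm.add(s);
    }

    public static void main(String[] args) {
        List<Sample> stm = new ArrayList<>();
        store(stm, 2, new Sample(1, 0.2, 0.1, 0.1));   // lowest label
        store(stm, 2, new Sample(2, 0.9, 0.8, 0.7));
        store(stm, 2, new Sample(3, 0.9, 0.9, 0.9));   // evicts the first sample
        System.out.println(stm.size());                 // 2
        System.out.println(stm.get(0).t);               // 2.0
    }
}
```

Different weight choices implement different management strategies; for instance, weighting only t reproduces the purely temporal FIFO strategy used later in the experiments.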

2) LTM: The LTM is a higher level memory element, because it stores information obtained after the analysis of the real data stored in the STM. From a psychological point of view, the LTM stores the knowledge acquired by the agent during its lifetime, its experience. This knowledge is represented in the MDB by the models and behaviors.

In an initial approach, which has provided successful results, it has been considered that a model must be stored in the LTM if it predicts the contents of the STM with high accuracy during an extended period of time (iterations in the MDB). In the case of a behavior, it should be stored if during its application it has led to a relevant increase in the satisfaction obtained. These models are considered relevant for the robot's operation in a given context and should not be forgotten. Each model is stored together with its context, that is, the existing STM where it performed properly. It would not be efficient to store models obtained over equivalent STMs, as it is assumed that they predict the same "reality" of the robot. Determining whether a model should be stored in the LTM is not evident, as models are generalizations of situations and, in the present case, where they are implemented as artificial neural networks, it is not easy to see whether a model is the same as or similar to another one. Thus, every time a new model is a candidate for inclusion in the LTM, it must be phenotypically compared to the rest of the models in the LTM. This is achieved by simply performing cross predictions of their associated STMs. That is, to compare two models, each is run over the context of the other to see how similar they are.
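The cross-prediction test can be sketched as follows, with models reduced to plain functions and an assumed similarity threshold; in the MDB itself each candidate ANN would be run over the other model's stored STM context.

```java
import java.util.function.DoubleUnaryOperator;

// Sketch of the LTM cross-prediction test: two models are considered
// equivalent when each one predicts the other's STM context well.
// The threshold value is an assumed parameter.
public class LtmCrossPrediction {
    // mean squared prediction error of a model over an STM context {x, y}
    static double error(DoubleUnaryOperator model, double[][] context) {
        double err = 0;
        for (double[] s : context) {
            double d = model.applyAsDouble(s[0]) - s[1];
            err += d * d;
        }
        return err / context.length;
    }

    // "the same" model: each predicts the other's context accurately
    static boolean equivalent(DoubleUnaryOperator a, double[][] ctxA,
                              DoubleUnaryOperator b, double[][] ctxB, double threshold) {
        return error(a, ctxB) < threshold && error(b, ctxA) < threshold;
    }

    public static void main(String[] args) {
        double[][] ctx1 = { {1, 2.0}, {2, 4.1} };   // roughly y = 2x
        double[][] ctx2 = { {3, 6.1}, {4, 7.9} };   // also roughly y = 2x
        DoubleUnaryOperator m1 = x -> 2.0 * x;
        DoubleUnaryOperator m2 = x -> 2.02 * x;
        DoubleUnaryOperator m3 = x -> -x;           // a genuinely different model
        System.out.println(equivalent(m1, ctx1, m2, ctx2, 0.05));  // true: store only one
        System.out.println(equivalent(m1, ctx1, m3, ctx2, 0.05));  // false: keep both
    }
}
```

When the test returns true, storing both models would be redundant, which is exactly the inefficiency the text describes.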

From a practical point of view, the addition of the LTM to the MDB avoids the need to relearn the models and behaviors in a problem with a real agent in a dynamic situation every time the agent changes into different states (different environments or different operation schemas). The models and behaviors stored in the LTM in a given instant of time are introduced into their corresponding evolving populations as seeds so that, if the agent returns to a previously learned situation, the model or behavior will be present in the population and the prediction will soon be accurate. If the new situation is similar to one the agent has learned before, seeding the evolving population with the LTM will allow the evolutionary process to reach a solution very fast.

3) Memory Interplay: A mutual regulation system has been developed to control the interaction between these memories in the MDB. There are two main undesirable effects in the learning process that can be avoided with a correct management system.

First of all, as mentioned before, the replacement strategy of the STM favors the storage of relevant samples. But what is considered relevant could change in time (change of motivation or environment) and, consequently, the information stored in the STM should also change so that the new models and behaviors generated correspond to the new situation. If no regulation is introduced, when situations change, the STM will be polluted by information from previous situations (there is a mixture of information) and, consequently, the models and behaviors that are generated do not correspond to any one of them.

These intermediate situations can be detected by the replacement strategy of the LTM, as it is continuously testing the models and behaviors to be stored in the LTM. Thus, if it detects a model or behavior that suddenly and repeatedly fails in the predictions of the samples stored in the STM, it is possible to assume that a change of context has occurred. This detection will produce a regulation of the parameters controlling the replacement in the STM so that it purges the older context. It can even become a completely temporal strategy for a while. This purge will allow new data to fill the STM, so that the models and behaviors can be correctly generated. It is a clear case of LTM monitoring affecting the operation of the STM and thus the learning processes.
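This LTM-to-STM regulation can be sketched as a simple failure counter. The error threshold and the number of repeated failures that signal a context change are assumed tuning parameters, not values from the paper, and the purge here is the simplest possible one (clearing the STM outright).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of context-change detection: a stored model that used to
// predict well suddenly fails repeatedly, so the old STM context is
// purged. Threshold and failure count are assumed parameters.
public class ContextChangeDetector {
    static final double ERROR_THRESHOLD = 0.5;  // assumed
    static final int FAILURES_FOR_CHANGE = 3;   // assumed
    int consecutiveFailures = 0;

    // returns true when a context change is detected; the STM is then
    // purged so that data from the new situation can fill it
    boolean observe(double predictionError, List<double[]> stm) {
        consecutiveFailures = predictionError > ERROR_THRESHOLD ? consecutiveFailures + 1 : 0;
        if (consecutiveFailures >= FAILURES_FOR_CHANGE) {
            stm.clear();                        // purge the older context
            consecutiveFailures = 0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ContextChangeDetector det = new ContextChangeDetector();
        List<double[]> stm = new ArrayList<>();
        for (int i = 0; i < 10; i++) stm.add(new double[] { i, 2.0 * i });
        boolean changed = false;
        for (double e : new double[] { 0.1, 0.9, 0.8, 0.9 })  // sudden repeated failures
            changed |= det.observe(e, stm);
        System.out.println(changed);      // true
        System.out.println(stm.size());   // 0: old context purged
    }
}
```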

The other undesirable effect that must be avoided is continuous storage in the LTM. This happens when the data stored in the STM are not general enough and the models or behaviors seem to be different although they are not. The replacement strategy of the LTM can detect whether the agent's situation has changed and, consequently, after a change of situation it can detect whether the number of attempts to enter the LTM is high. In such a case, the parameters of the replacement strategy of the STM are regulated so that they favor information that is more general, by strengthening parameters such as distance, relevance, or complexity and reducing the influence of time.

Summarizing, a strategy for avoiding model learning in intermediate situations and another for avoiding the overload of the LTM are required. These two strategies are necessary in the interplay between the memories, together with the management mechanisms for each one of them individually. Hence, a dynamic memory structure arises that improves the efficiency in the use of memory resources, minimizing the number of models and behaviors stored in the LTM without affecting performance and allowing them to be as general as possible. This last fact is quite important because, as the models and behaviors are used as seeds in the evolution processes, the more general they are the better they will adapt to new situations. A more detailed description of the memory elements within the MDB can be found in [24].

C. Real-Time Operation

The computational implementation of all the elements that make up a complex architecture like the MDB is the key to its success in real robotic systems. There are several aspects that must be carefully designed and implemented to obtain a tool that is practical in terms of reliability and computational cost. The current version of the MDB has been developed in JAVA and, in its object-oriented design, there are four basic packages that constitute the computational core of the architecture: robot, evolutionary algorithm, model, and memory.

1) Robot: As a first design requirement, already in the initial versions of the MDB, it was imposed that the basic onboard processor of current real robots (usually a microcontroller) should be used just to execute actions and to collect the sensorial information in real time. Regarding the two different time scales shown in Fig. 3, this means that the onboard processor is in charge of the execution time scale elements. Thus, the deliberative part of the MDB is always executed in a separate processor (on or off the robot), and the communications between these two structures are carried out using the standard TCP/IP protocol.

The second basic design requirement related to the robot is that the MDB should be as independent from the particular hardware as possible. That is, it is assumed that the MDB receives sensorial information and provides actions to be applied in a robot, but its core cannot include any particularity about the robot. To this end, inspiration was taken from the Player/Stage project [25], which uses a network server for robot control that provides an interface to the robot's sensors and actuators over the IP network.

The robot package includes all the classes implementing the previous two requirements. The designer must create a simple configuration file including a description of the robot's sensors and actuators and the IP port where it is connected. On the other hand, for each particular robot or simulator, an interface program must be developed to provide sensorial information and to capture the actions obtained by the MDB using the standard TCP/IP protocol. The onboard processor's computational load is minimized and, consequently, these data are always in raw format. If any kind of processing is required, it will be carried out by a dedicated perception package.

2) Evolutionary Algorithm: This package includes all the classes in charge of executing the evolutionary processes for the models and behaviors, which make up the core of the architecture. One of the main drawbacks of the application of evolutionary algorithms in real robotics is the computational cost they imply, which makes them apparently unsuitable for real time operation. This problem has been addressed through the following design decisions.

• First, as commented in Section III-A, the evolution of the models and behaviors only lasts a few generations per iteration in order to obtain a smooth learning curve. This obviously reduces the computational cost in between interactions with the world and makes a real time implementation feasible.

• In addition, the MDB is intrinsically concurrent, and each evolutionary process runs on an independent thread that is automatically assigned to a different processor when available. If a local area network is available, the processes may be executed automatically on different computers over the network.
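The concurrent execution of the evolutionary processes can be sketched with standard Java threading. The evolve method below is only a placeholder computation (not the MDB's algorithms), and thread-to-processor assignment is left to the JVM, as the text describes.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of running the four evolutionary processes (world, internal,
// and satisfaction models plus behaviors) on independent threads.
public class ConcurrentEvolvers {
    // placeholder for evolving one model for a few generations;
    // returns the best (lowest) fitness found
    static double evolve(String name, int generations) {
        double best = Double.MAX_VALUE;
        for (int g = 0; g < generations; g++) best = Math.min(best, 1.0 / (g + 1));
        return best;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Double>> results = List.of(
            pool.submit(() -> evolve("world", 5)),
            pool.submit(() -> evolve("internal", 5)),
            pool.submit(() -> evolve("satisfaction", 5)),
            pool.submit(() -> evolve("behavior", 5)));
        for (Future<Double> f : results)
            System.out.println(f.get());   // best fitness of each process
        pool.shutdown();
    }
}
```

The same structure extends naturally to remote execution: replacing the local thread pool with workers reached over the network is transparent to the submitting code, which matches the MDB's distributed design.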

The behavior evolution module uses the current models to evaluate candidate behaviors. Taking into account the distributed execution of each model evolution, a coherence protocol has been implemented that ensures an updated evaluation of the individuals, always using the most recent models. A similar procedure has been implemented for the update of the current behavior in the reactive part of the MDB, which is replaced every time a better one is obtained while ensuring coherent action selection.

On the other hand, the MDB is independent from the particular type of evolutionary algorithm and ANN. This is very easy to support with the object-oriented design of the architecture. In fact, the architecture has been tested with the different algorithms implemented in the JEAF library [26] by simply changing the class in the configuration file.

3) Model: To create a new experiment using the MDB, the first step is to set up the model configuration. This implies formally describing the world, internal, and satisfaction models, that is, the knowledge representation within the architecture must be chosen. This selection is very relevant because the success or failure of learning depends highly on the complexity of the models. What the designer must do is simply indicate the inputs and outputs corresponding to each model. This information corresponds to the external and internal sensorial information and to the action space. Note that the robot configuration file already includes all the real sensors and actuators but, here, virtual sensors and actuators that process the raw data provided by the robot may be used as inputs and outputs to the models, allowing the designer to manipulate the sensorial information freely.

A general procedure was developed to automate model configuration based on the division of independent sensorial information into different models. For example, in the case of world models, for a robot with four infrared sensors and two light sensors, the MDB will create two models as a first approach: the first one relating the four infrared inputs in instants t and t+1 and the action applied, and the second one relating the two light inputs with the action in the same way.

What is relevant for the computational implementation is that several concurrent evolutionary processes may be required for all the submodels making up the world, internal, and satisfaction models, as well as for the behaviors. As commented above, a mechanism has been implemented that automatically executes these concurrent processes in different, and even remote, threads.

4) Memory: As discussed in Section III-B1, the replacement strategy of the STM determines the type of learning achieved, and it must be adjusted depending on the complexity of the model. Taking into account the previously explained subdivision into models, using a single STM for all the models is not feasible. Hence, in the current implementation of the MDB, each model evolution has its own STM with its particular replacement strategy. All the classes required for this management are included in the memory package. The corresponding configuration file only requires establishing the type of replacement strategy, while the creation and execution of the STM are automatic. Regarding the LTM, the design and implementation follow basically the same principles: there is a different LTM for each evolutionary process, which is executed concurrently with it.

Fig. 4. Experimental setup with the Hermes II robot, an objective block, and a teacher that guides the learning process.

To summarize this section, the following implementation aspects within the MDB, all related to the operation of the architecture in real robots, should be highlighted:

• object-oriented design and JAVA implementation;

• hardware/simulator independence through the use of a TCP/IP middleware approach;

• time scale independence: reactive elements are executed onboard and deliberative elements in remote or onboard computers through TCP/IP communication;

• automatic concurrent execution of the evolutionary processes in remote processors;

• automatic division of STM and LTM memories and execution according to the evolutionary processes;

• persistent evolutionary processes that never stop, although the fitness function can change;

• easy integration with evolutionary and ANN libraries.

Once the principles and operation of the MDB have been presented, the next section will be devoted to a series of results that were obtained applying it to real robot problems.

IV. APPLICATION RESULTS

This section describes two representative application examples that summarize the behavior of the MDB in real robot learning. The first one is very simple but highly conceptual, and it is focused on the developmental features of the architecture. The second one is more complex in learning terms, and it includes all the elements of the architecture working together.

A. Learning Basic Skills

The first experiment was carried out using the Hermes II hexapod robot (see Fig. 4), which has six legs with two degrees of freedom (swing and lift), six infrared sensors (one placed on top of each leg), two whiskers, inclinometers, and six force sensors. In the first part of the example, we want the Hermes II robot to learn to walk. The motion of each leg can be described through three parameters (for the swing and lift motions): the initial phase, which establishes the starting point of the leg's motion; the frequency, which increases or decreases the speed of the movement; and the sweep amplitude. In this case, all of the parameters are fixed except the initial phase of the swing motion for each leg. The different combinations of phases lead to different gaits, some useful, some useless, and some even completely impractical. The mechanism must allow the robot to develop an efficient gait so that it can fulfil its motivations. A developmental approach has been followed in the execution of the experiment, where the teacher guides the learning process step by step.

The robot is placed in a standing pose at a random point of an empty environment, and an object (a block) is placed one meter away from it (see these elements in Fig. 4). The mechanism selects the gait that must be applied and the robot uses it during a fixed time (24 seconds). A gait is defined by the initial phase of the swing motion in the six legs. Through its infrared sensors, using the time integration virtual sensor presented in [27], the robot always has an indication, in general noisy, of the distance to the block.

In the MDB, the designer has to define the motivation of the robot in measurable terms and the particular models required according to the robot's features. In this case, the motivation is very simple and general: minimize the distance to the block. In the sensorial map of the robot, this implies a maximization of the detection in the two front infrared sensors. A single world model was used, as there is only one type of sensorial information (infrared data). The world model has seven inputs: the distance to the block provided by the virtual sensor (in a range from 0 to 10) and the six input phases applied to the legs (in a range from −5 to 5, corresponding to the real limits of −45° to 45°). The unique output of this world model is the predicted distance to the block. In this case, the ANN that represents the world model is a multilayer perceptron with two hidden layers of four neurons each. No LTM was considered in this first experiment.

With such a setup, the experiment was started and, each time the robot fell or lost the block, the teacher was in charge of placing all the elements again in the correct positions (shown in Fig. 4). The MDB was run for 300 iterations until the gait was successful and the robot reached the block consistently. In this first experiment, a simple genetic algorithm was used for the evolution of the world models. The algorithm considered 700 individuals, 57 genes (corresponding to the weights and biases of the 7-4-4-1 neural network), 60% crossover, and 2% mutation. No internal sensors were used in this experiment. For the sake of simplicity, in this case, the satisfaction is directly the predicted distance to the block. The behaviors are, in this case, simple actions, but they have been obtained using another genetic algorithm with 120 individuals, six genes (direct encoding of the input phases), 60% crossover, and 6% mutation. The STM was limited to 40 action–perception pairs and it worked with a purely temporal replacement strategy, that is, each sample was labeled with the iteration (t) in which it was acquired and, when the STM is full, the oldest samples are eliminated following a first-in–first-out (FIFO) strategy. Fig. 5 displays the variation in time of the standard mean squared error (MSE) of the distance predicted by the best world model as compared to the STM action–perception pairs. Specifically, the MSE is calculated using

MSE = (1/N) Σ_{i=1}^{N} (o_i − y_i)²

where o_i is the output predicted by the model, y_i is the real output of the corresponding action–perception pair, and N is the STM size. As shown in the figure, the error decreases clearly



Fig. 5. Evolution of the MSE in the STM prediction for the best world model.

Fig. 6. Efficiency of the gaits applied by the robot in each iteration of the MDB, with tendency line.

but with a continuous oscillation as a consequence of the constant variation of the STM which, in each iteration, replaces one action–perception pair, thus modifying the fitness criteria.
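The MSE over the STM contents is straightforward to compute. In this sketch, o[i] is the output predicted by the model for pair i, y[i] is the real output, and the array length plays the role of the STM size N; the class and variable names are illustrative.

```java
// Sketch of the MSE computed over the STM contents, used to score the
// best world model against the stored action-perception pairs.
public class StmMse {
    // o[i]: output predicted by the model, y[i]: real output of pair i
    static double mse(double[] o, double[] y) {
        double sum = 0;
        for (int i = 0; i < o.length; i++) {
            double d = o[i] - y[i];
            sum += d * d;
        }
        return sum / o.length;   // N = STM size
    }

    public static void main(String[] args) {
        double[] predicted = { 1.0, 2.0, 3.0 };
        double[] real = { 1.0, 2.0, 4.0 };
        System.out.println(mse(predicted, real));   // 1/3, from a single unit error
    }
}
```

Because one pair is replaced in the STM every iteration, this quantity changes even for a fixed model, which is the source of the oscillation visible in Fig. 5.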

In order to understand the evolutionary learning process that occurs in the MDB and how it affects the actions that are applied, a gait efficiency parameter was defined as the distance covered by the robot in the vertical direction towards the objective in a fixed simulation time (d_v), weighted by the horizontal distance (d_h) that its trajectory is separated from a straight line (directly subtracted) and normalized by the maximum possible distance (d_max):

efficiency = (d_v − d_h) / d_max

That is, a gait is taken as better if the robot goes straight to the block without any lateral deviation. It is important to point out that this measure is never used in the cognitive mechanism; it is just a way of clarifying the presentation of results. Fig. 6 displays the behavior of this efficiency throughout the 300 iterations of the robot's lifetime. It can be observed that the curve tends to 1, as expected. Initially, the gaits are poor and the robot moves in irregular trajectories. This is reflected in the efficiency graph by the large variations in the efficiency from one instant to the next. Sometimes, by chance, it reaches the block; other times it ends up very far away from it. Note that, whatever the result of the action, it does produce a real action–perception pair, which is useful data with which to improve the models. As the interaction progresses, the robot learns to reach the block without any deviation in a consistent manner, and the efficiency tends to one. Comparing Figs. 5 and 6, it can be seen that, although

Fig. 7. Representation of the gaits obtained through iterations.

the learning of the models is a noisy process with large oscillations, the resulting actions that are in the background of Fig. 6 improve in a more continuous and "natural" trend.
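The efficiency measure can be expressed directly in code. The parameter names dv (vertical progress towards the block), dh (lateral deviation from a straight line), and dmax (maximum possible distance) are placeholders for the symbols in the prose definition, which were lost in this copy.

```java
// Sketch of the gait efficiency measure: vertical progress minus
// lateral deviation, normalized by the maximum possible distance.
public class GaitEfficiency {
    static double efficiency(double dv, double dh, double dmax) {
        return (dv - dh) / dmax;
    }

    public static void main(String[] args) {
        // straight to the block: full vertical progress, no lateral deviation
        System.out.println(efficiency(1.0, 0.0, 1.0));    // 1.0
        // meandering gait: partial progress with lateral deviation
        System.out.println(efficiency(0.75, 0.25, 1.0));  // 0.5
    }
}
```

A perfectly straight approach yields an efficiency of 1, which is the value the curve in Fig. 6 tends to as learning progresses.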

In the three graphs of Fig. 7, we represent the temporal occurrence of the end of the swing motion for each leg (considered as a swing angle of −45°, that is, the highest reverse turn) during the 20 s. The top graph corresponds to iteration 6, and we can see that the swings are completely out of phase because the legs reach the end point at different instants of time. The resulting gait is not appropriate for walking in a straight line and the robot turns, leading to a low efficiency value. The middle graph corresponds to iteration 87, where the resulting gait is more efficient than before, according to the level of error for that iteration (see Fig. 5). Finally, the bottom graph shows the combination of phases corresponding to iteration 300. As we can see, the initial phases are equal in groups of three and the resulting gait is quite good. This combination of phases leads to a very common and efficient gait called the tripod gait, where three legs move in phase and the other three legs in counter-phase, resulting in a very fast and stable straight line motion.

At this point, the Hermes II robot had learned to walk and thus, in a developmental learning process, it was decided to use the MDB to provide it with the basic skill of turning using the combination of initial phases obtained (tripod gait) in the previous case. In this case, the same block was placed on a semicircumference in front of the robot at a random distance between


Fig. 8. Iterations between two consecutive captures of the object.

Fig. 9. The left image shows the path followed by the Hermes II robot in the first iterations. The right image shows the path when the behavior is successful.

50 and 100 cm, and the MDB should provide the robot with the best combination of amplitudes in the swing motion in order to reach it. The rest of the parameters of the gait are fixed. If the robot reaches the block (distance of less than 20 cm) or if it loses it (distance larger than 100 cm), the teacher places it again in a new position within the semicircumference.

The world model now has three inputs, the distance and angle of the robot with respect to the block (provided by the virtual sensor applied before) and the amplitude of turn. The outputs are the predicted distance and angle. In this case, an explicit satisfaction model was used with these two outputs of the world model as inputs and with just one output, the predicted satisfaction. The motivation of the robot was again the maximization of the infrared sensing in the two front sensors. Consequently, the robot had to reach the block (minimizing distance) with low deviation (minimizing angle). The world models had two hidden layers with four neurons each and the satisfaction models with three neurons. The population in the genetic algorithms was 600 individuals for the world models and 300 for the satisfaction models. The STM size was 40 and the management strategy was purely temporal (first in, first out), as in the previous case.
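The purely temporal replacement strategy can be sketched as a fixed-capacity FIFO buffer of action-perception pairs. The class and field names are illustrative; the MDB's actual STM also supports other replacement strategies:

```python
from collections import deque

class ShortTermMemory:
    """Purely temporal (FIFO) short-term memory of action-perception
    pairs; capacity 40 matches this experiment's STM size."""
    def __init__(self, capacity=40):
        self.samples = deque(maxlen=capacity)  # oldest sample dropped first

    def add(self, action, perception):
        self.samples.append((action, perception))

    def __len__(self):
        return len(self.samples)

stm = ShortTermMemory(capacity=40)
for i in range(50):                      # 50 iterations of interaction
    stm.add(action=i, perception=(i, i))
print(len(stm))        # capacity caps the content at 40
print(stm.samples[0])  # oldest surviving pair is from iteration 10
```

The buffer's contents are what the genetic algorithms score candidate models against, so purging old samples directly shapes the fitness landscape.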

Fig. 8 provides an alternative view of the learning evolution. We have represented the number of iterations between two consecutive captures of the object. It can be clearly seen how, in the first stages of the behavior, there is a large delay from one capture to the next because the models are poor and the selected actions are not successful. The tendency changes at about iteration 200 and the number of iterations between two consecutive captures decreases to one, implying that the robot has acquired the turning skill.

The left image of Fig. 9 displays the path followed by the real robot with the strategies applied in iterations 53, 54, 55,

Fig. 10. Interaction between teacher and robot using the Pioneer 2 (left) and the AIBO (right) robot. The top images correspond to the learning stage and the bottom images to the induced behavior.

and 56. As indicated in Fig. 8, these iterations correspond to the first stages of the mechanism, where the number of iterations required to reach the object is large. In fact, the block remains in the same position during the application of these four strategies and the robot never turns towards it. The right image of Fig. 9 displays the path followed in iterations 421, 422, 423, and 424. In this case, as the block is reached by the robot, it is moved by the teacher.

To conclude this first experiment, it must be pointed out that the robot was able to autonomously generate a tripod gait and modulate the amplitudes of the legs in order to turn to reach an objective through continuous interaction with the environment, using its own sensors and a very simple motivation. This is very important because the mechanism allows the robot to find the best solution according to the limitations of its environment and its sensorial and actuation apparatus. In fact, the robot is adapting and surviving in this particular world. In addition, the computational implementation of the MDB performed successfully even with a highly limited real robot.

B. Induced Behavior

To show the behavior of the MDB in a more complex task that is guided by a teacher, a typical example of induced behavior has been reproduced. This experiment was carried out using two different physical agents to demonstrate the robustness of the architecture implementation and its transparency with respect to the particular hardware: a Pioneer 2 wheeled robot and Sony's AIBO.

The task the physical agent must carry out is simple: learn to obey the commands of a teacher that, initially, guides the robot towards an object located in its neighborhood. Fig. 10 displays the experimental setup for both agents. In the case of the Pioneer 2 robot (left images of Fig. 10), the target is a black cylinder that must be caught, and in the case of the AIBO robot the target is a pink ball (right images of Fig. 10). The Pioneer 2 robot is a wheeled robot that has a sonar sensor array around its body and a laptop placed on its top platform. The laptop provides two more sensors, a microphone and the numerical keyboard, and the MDB runs on it as explained in Section III-C. The


Fig. 11. Representation of the models used in this experiment.

AIBO robot is a dog-like robot with a richer set of sensors and actuators. Its digital camera, the microphones, and the speaker were used for this example. In this case, the MDB is executed remotely on a PC and communicates with the robot through a TCP/IP protocol over a wireless connection.
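The paper only states that the remote MDB and the AIBO exchange data over TCP/IP; the message framing below (newline-delimited JSON) and the function names are assumptions purely to illustrate one action-perception exchange:

```python
import json
import socket

def send_msg(sock, obj):
    """Send one newline-delimited JSON message (framing is an assumption;
    the paper only specifies TCP/IP over a wireless link)."""
    sock.sendall((json.dumps(obj) + "\n").encode())

def recv_msg(sock_file):
    """Read one newline-delimited JSON message from a socket file object."""
    return json.loads(sock_file.readline())

# Loopback demonstration of one MDB <-> robot exchange.
mdb, robot = socket.socketpair()
send_msg(mdb, {"action": 3})                 # MDB selects an action
print(recv_msg(robot.makefile())["action"])  # robot decodes it: 3
```

In the real setup the robot side would execute the decoded action and send back its perceptions (distance, angle, feedback) over the same connection.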

Fig. 11 displays a schematic view of the current world and satisfaction models (with their respective numbers of inputs and outputs) that arise in this experiment at a given instant. The sensory meaning of the inputs and outputs of these models in both physical agents is the following.

Command (One Input) for the Pioneer 2 Robot:
• group of seven possible values according to the seven musical notes;
• provided by the teacher through a musical keyboard;
• sensed by the robot using the microphone of the laptop;
• translated to a discrete numerical range from −9 to 9 (linear relation for the first teacher and a random association for the second one).

Command (One Input) for the AIBO Robot:
• group of seven possible values according to seven spoken words: hard right, medium right, right, straight, left, medium left, and hard left;
• the teacher speaks directly;
• sensed using the stereo microphones of the robot;
• speech recognition using Sphinx software, translated into a discrete numerical range from −9 to 9 (linear relation for the first teacher and a random association for the second one).

Action (One Input) for the Pioneer 2 Robot:
• group of seven possible actions: turn hard right, turn medium right, turn right, follow straight, turn left, turn medium left, and turn hard left, encoded with a discrete numerical range from −9 to 9;
• the selected action is decoded as linear and angular speed.

Action (One Input) for the AIBO Robot:
• group of seven possible actions: turn hard right, turn medium right, turn right, follow straight, turn left, turn medium left, and turn hard left, encoded with a discrete numerical range from −9 to 9;
• the selected action is decoded as linear speed, angular speed, and displacement.

Human Feedback (One Output/Input) for the Pioneer 2 Robot:
• discrete numerical range that depends on the degree of fulfillment of a command, from 0 (disobey) to 5 (obey);
• provided by the teacher directly to the MDB using the numerical keyboard of the laptop.

Human Feedback (One Output/Input) for the AIBO Robot:
• group of five possible values according to five spoken words: well done, good dog, ok, pay attention, and bad dog;
• the teacher speaks directly;
• sensed using the stereo microphones of the robot;
• speech recognition using Sphinx software, translated into a discrete numerical range from 0 to 5.

Satisfaction (One Output) for the Pioneer 2 Robot: continuous numerical range from 0 to 11 that is automatically calculated after applying an action. It depends on:
• the degree of fulfillment of a command, from 0 (disobey) to 5 (obey);
• the distance increase, from 0 (no increase) to 3 (max);
• the angle with respect to the object, from 0 (back turned) to 3 (robot frontally facing the object).

Satisfaction (One Output) for the AIBO Robot: continuous numerical range from 0 to 11 that is automatically calculated after applying an action. It depends on:
• the degree of fulfillment of a command, from 0 (disobey) to 5 (obey);
• the distance increase, from 0 (no increase) to 3 (max);
• the angle with respect to the object, from 0 (back turned) to 3 (robot frontally facing the object).

Distance and Angle (Two Outputs/Inputs) for the Pioneer 2 Robot:
• sensed by the robot using the sonar array sensor;
• measured from the robot to the black cylinder and encoded directly in cm and degrees, transformed to a range [0:10].

Distance and Angle (Two Outputs/Inputs) for the AIBO Robot:
• sensed by the robot using the images provided by the color camera;
• color segmentation process and area calculation taken from Tekkotsu software [28];
• encoded in cm and degrees and transformed to a range [0:10];
• measured from the robot to the pink ball.
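Both robots map seven commands onto a discrete range of seven values (−9 to 9 in the paper), linearly for the first teacher and via a consistent random association for the second. The exact spacing of the seven values and the command-to-value direction below are assumptions for illustration:

```python
import random

COMMANDS = ["hard right", "medium right", "right", "straight",
            "left", "medium left", "hard left"]
LEVELS = [-9, -6, -3, 0, 3, 6, 9]  # assumed spacing of the 7 discrete values

# First teacher: linear relation between command and encoding.
linear = dict(zip(COMMANDS, LEVELS))

# Second teacher: a consistent but random association.
rng = random.Random(0)
shuffled = LEVELS[:]
rng.shuffle(shuffled)
random_assoc = dict(zip(COMMANDS, shuffled))

print(linear["straight"])                       # 0 under the linear encoding
print(sorted(random_assoc.values()) == LEVELS)  # same 7 values, remapped
```

What matters for the experiment is only that each teacher's mapping is internally consistent, so the communications model can learn it from interaction.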

In this example, the internal sensors of the robots were not considered and, consequently, internal models were not used. The flow of the learning process is as follows: the teacher observes the relative position of the robot with respect to the object and provides a command that guides it towards the object. Initially, the robot has no idea of what each command means in regards to the actions it applies. After sensing the command, the robot acts and, depending on the degree of obedience, the teacher provides a reward or a punishment as a pleasure or pain signal. The motivation of the physical agent in this experiment is to maximize being rewarded by the teacher.

Consequently, to carry out this task, the robot just needs to follow the commands of the teacher, and a world model with that command as sensory input is obtained (top world model of Fig. 11) to select the action. From this point forward, this model will be called the communications model. The satisfaction model (top satisfaction model of Fig. 11) is trivial and it is not used


in the first part of the experiment (it is displayed in Fig. 11 for coherence), as the satisfaction is directly related to the output of the communications model, that is, the reward or punishment.

Regarding the models corresponding to the remaining sensors of the robot, a second world model was simultaneously obtained (bottom world model of Fig. 11) that uses the distance and angle to the object as sensory inputs. Obviously, this model is relating information different from the teacher's commands during the performance of the task. If the commands produce any regularities in the information provided by other sensors in regards to the satisfaction obtained, these models can be applied when operating without a teacher. That is, if at a given instant of time the teacher stops providing commands, the communications model will not have any sensory input and cannot be used to select the actions, leaving this task in the hands of other models that do have inputs. For this second case, the satisfaction model is more complex, relating the satisfaction value to the distance and angle, directly associated with rewards or punishments. The highest satisfaction (value 1) corresponds to the minimum distance and angle.
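The fallback from the communications model to the induced models when the teacher goes silent can be sketched as a simple availability check on the model's sensory input (names and structure are illustrative, not the MDB's actual decision logic):

```python
def select_model(command, models):
    """If the teacher issues a command, the communications model drives
    action selection; otherwise fall back to the world (plus satisfaction)
    models induced from the remaining sensors."""
    if command is not None:
        return models["communications"]
    return models["world"]

models = {"communications": "comms-model", "world": "world-model"}
print(select_model(command=3, models=models))     # comms-model
print(select_model(command=None, models=models))  # world-model
```

This is what makes the behavior persist without a teacher: action selection simply migrates to whichever models still have valid inputs.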

The four models are represented by multilayer perceptron ANNs (with 2-3-3-1 neurons for the communications model, 3-6-6-2 neurons for the world model, and 2-3-3-1 neurons for the second satisfaction model). They were adjusted by means of the PBGA genetic algorithm [22], which automatically provided the mentioned sizes of the ANNs. Summarizing, in this case the MDB executes four evolutionary processes over four different model populations every iteration. These processes run concurrently in the current version of the MDB. The STM has a size of 10 action-perception pairs in all the experiments, and the label L explained in Sections III-B1 and III-B3 was calculated using one expression in a stable context and a different one if a context change is detected.
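The four concurrent evolutionary processes can be sketched as one generation per model population per MDB iteration, with fitness measured as prediction error over the STM. The toy GA below reduces each model to a scalar predictor and uses threads; the population names, operators, and concurrency scheme are illustrative assumptions, not the PBGA of [22]:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def evolve_one_generation(population, stm, seed):
    """One generation of a toy GA for one model population: fitness is
    the mean squared prediction error over the STM samples."""
    rng = random.Random(seed)
    def fitness(model):
        return sum((model - target) ** 2 for target in stm) / len(stm)
    ranked = sorted(population, key=fitness)
    parents = ranked[: len(population) // 2]             # keep the better half
    children = [p + rng.gauss(0, 0.1) for p in parents]  # mutated offspring
    return parents + children

stm = [1.0] * 10  # STM of 10 targets (all 1.0 purely for illustration)
rng = random.Random(0)
populations = {name: [rng.uniform(-1, 1) for _ in range(20)]
               for name in ["communications", "world",
                            "satisfaction-1", "satisfaction-2"]}

# The four model populations evolve concurrently every MDB iteration.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {n: pool.submit(evolve_one_generation, p, stm, i)
               for i, (n, p) in enumerate(populations.items())}
    populations = {n: f.result() for n, f in futures.items()}
print(all(len(p) == 20 for p in populations.values()))  # True
```

Because fitness is always evaluated against the current STM content, the evolving populations track the robot's most recent experience rather than a fixed training set.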

Fig. 12 displays the evolution of the mean squared error (calculated using the same expression explained above for the experiment shown in Fig. 5) provided by the current models (communications, world, and satisfaction) predicting the STM as iterations of the MDB take place in both physical agents (the top graph corresponds to the AIBO robot experiment and the bottom graph to the Pioneer 2 robot one). The error clearly decreases in all cases and in a very similar way for both agents (except at the beginning, during the first 10 iterations, while the STM is being filled up). This means that the MDB works similarly on two very different real platforms and that the MDB is able to provide a real modeling of the environment, the communications, and the satisfaction of the physical agents. As the error values in Fig. 12 show, both robots learned to follow teacher commands in an accurate way in about 20 iterations (from a practical point of view, this means about 10 min of real time) and, what is more relevant, the operation without a teacher was successful using the induced world and satisfaction models. In this kind of real robot example, the main measure that must be considered in order to judge the goodness of an experiment is the time consumed in the learning process to achieve perfect obedience. Fig. 10 displays a real execution of actions in both robots. In the pictures with a teacher, the robot is following

Fig. 12. Evolution of the mean squared error of the outputs provided by the current models (predicted distance, angle, satisfaction, and human feedback) for the AIBO robot (top) and the Pioneer 2 robot (bottom) experiments.

Fig. 13. Evolution of the mean squared error provided by the outputs of the current communications model (predicted human feedback) and satisfaction model (predicted satisfaction) compared to the STM content as iterations of the MDB take place when a dynamic language and reward policy is applied.

commands; otherwise, it is performing the behavior without any commands, just using its induced models. It can be clearly seen that the behavior is basically the same, although a little less efficient without teacher commands (as the robot has learned to decrease its distance to the object, but not the fastest way to do it).

With the aim of showing the adaptive capabilities of the MDB in real robot operation, Fig. 13 represents the evolution of the standard MSE provided by the current communications model during 200 iterations for the experiment with the Pioneer 2 robot (human feedback curve in the figure). Focusing our analysis on the communications model, in the first 70 iterations the teacher


provides commands using the same encoding (language) applied in the previous experiment. This encoding is not preestablished and the teacher can make use of any correspondence it wants as long as it is consistent. From iteration 70 to iteration 160, another teacher appears using a different language (a different and more complex relationship between musical notes) and, finally, from iteration 160 to iteration 200 the original teacher returns. As shown in Fig. 13, in the first 70 iterations the error decreases fast to a level of 0.17, which results in a very accurate prediction of the rewards. Consequently, the robot successfully follows the commands of the teacher. When the second teacher appears, the error level increases because the STM starts to store samples of the new language and the previous models fail in the prediction. At this point, as commented before, the LTM management system detects this mixed situation (it detects an unstable model) and induces a change in the parameters of the STM replacement strategy, towards a FIFO strategy. The increase in the value of the error stops in about 10 iterations and, once the STM has been purged of samples from the first teacher's language, the error decreases again (0.13 at iteration 160). The error level between iterations 70 and 160 is not as stable as in the first iterations. This happens because the language used by the second teacher is more complex than the previous one, that is, its relationship to the encoding variable is nonlinear; in addition, it must be pointed out that the evolution graphs obtained from real robots oscillate, in general, much more than in simulated experiments due to the broad range of noise sources in real environments. But the practical result is that at about iteration 160 the robot follows the new teacher's commands successfully again, adapting itself to the teacher's characteristics. When the original teacher returns using the original language (iteration 160 of Fig. 13), the adaptation is very fast because the communications models stored in the LTM during the first iterations are introduced as seeds in the evolutionary processes.
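The detect-purge-reseed cycle described above can be sketched as follows. The spike detector (window and threshold), the error values, and all names are illustrative assumptions; only the overall mechanism (flag an unstable model, purge the STM via FIFO, seed the population from the LTM) comes from the text:

```python
def detect_context_change(errors, window=5, threshold=2.0):
    """Flag an unstable model when the recent prediction error jumps well
    above the preceding baseline (window and threshold are assumptions)."""
    if len(errors) < 2 * window:
        return False
    baseline = sum(errors[-2 * window:-window]) / window
    recent = sum(errors[-window:]) / window
    return recent > threshold * baseline

errors = [0.17] * 10 + [0.60] * 5    # second teacher appears: error spikes
print(detect_context_change(errors))  # True

# On a detected change: the STM FIFO purges old-context samples over the
# next iterations, and LTM models are introduced as evolutionary seeds.
ltm = ["model_teacher_A"]
population = ["rand1", "rand2", "rand3"]
if detect_context_change(errors):
    population[: len(ltm)] = ltm     # LTM models introduced as seeds
print(population[0])  # model_teacher_A
```

Seeding from the LTM is what makes readaptation at iteration 160 so fast: the first teacher's language does not have to be relearned from scratch.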

Regarding the satisfaction model curve represented in Fig. 13, it corresponds to an equivalent experiment in which a change in the rewards provided by the teacher was carried out. From the initial iteration until iteration 70, the teacher rewards reaching the object and, as shown in the graph, the error level is low (1.4%). From iteration 70 to 160, the teacher changes its behavior and punishes reaching the object, rewarding escaping from it. There is a clear increase in the error level due to the complexity of the new situation (high ambiguity of possible solutions, that is, there are more directions for escaping than for reaching the object). At iteration 160, the teacher returns to the first behavior and, as expected, the error level quickly decreases to the original levels, obtaining a successful adaptive behavior.

V. CONCLUSION

This paper presents the MDB cognitive architecture for robots. It follows a developmental approach to provide real robots with autonomous lifelong learning capabilities. The knowledge acquisition is carried out by means of neuroevolutionary processes that use the real data obtained during the operation of the robots as the fitness function. The computational implementation of the architecture includes several improvements to maximize the efficiency and reliability of its practical application to real robots. The experiments carried out with the MDB have confirmed its capabilities for real-time learning of basic skills and more complex behaviors in dynamic environments.

These results open a very promising line of research involving evolutionary cognitive structures in which several aspects may be considered and improved. The current version of the MDB does not take into account the social aspects of autonomous operation, a very important issue that must be studied in depth. Furthermore, the use of internal sensors and, consequently, internal models must be analyzed and considered as an intrinsic part of the robot representation. The possibility of a dynamic change of motivations adapted to the robot behavior and environmental conditions is an aspect that should be included in the architecture in order to produce really autonomous and adaptive systems. Research opportunities may also be found in the control of the short and long term memories, especially in terms of deciding what goes in or is forgotten and how to produce LTM representations that are not directly a consequence of STM related models but rather generalizations of knowledge already present in the LTM.

Regarding the immediate future work, new experiments with real robots are being carried out involving complex sequences of actions to study online behavior learning and adaptation in depth. In addition, other representations for the models apart from ANNs are being tested and analyzed. Finally, robots with a larger and redundant sensorial and actuation repertoire are being considered in order to determine the suitability of the mechanism in these contexts. We expect the range of behaviors and models to increase exponentially but, at the same time, we expect the mechanism to cope better due to the fact that it will have more information and options to achieve an objective.

REFERENCES

[1] R. Cotterill, Enchanted Looms: Conscious Networks in Brains and Computers. Cambridge, U.K.: Cambridge Univ. Press, 2000.

[2] D. Vernon, G. Metta, and G. Sandini, "A survey of artificial cognitive systems: Implications for the autonomous development of mental capabilities in computational agents," IEEE Trans. Evol. Comput., vol. 11, no. 2, pp. 151–180, Apr. 2007.

[3] X. Yao, "Evolving artificial neural networks," Proc. IEEE, vol. 87, no. 9, pp. 1423–1447, Sep. 1999.

[4] F. Bellas, A. Lamas, and R. J. Duro, "Adaptive behavior through a Darwinist machine," Lecture Notes Artif. Intell., vol. 2159, pp. 86–89, 2001.

[5] F. Bellas, J. A. Becerra, and R. J. Duro, "Induced behavior in a real agent using the multilevel Darwinist brain," Lecture Notes Comput. Sci., vol. 3562, pp. 425–434, 2005.

[6] G. A. Bekey, Autonomous Robots: From Biological Inspiration to Implementation and Control. Cambridge, MA: MIT Press, 2005.

[7] J. A. Farrell and M. M. Polycarpou, Adaptive Approximation Based Control: Unifying Neural, Fuzzy and Traditional Adaptive Approximation Approaches. New York: Wiley, 2006.

[8] N. Nilsson, Principles of Artificial Intelligence. San Mateo, CA: Morgan Kaufmann, 1980.

[9] R. Arkin, Behavior-Based Robotics. Cambridge, MA: MIT Press, 1998.

[10] The American Heritage Dictionary of the English Language, 4th ed. Houghton Mifflin Company, 2000.

[11] L. M. Brasil, F. M. de Azevedo, J. M. Barreto, and M. Noirhomme-Fraiture, "Complexity and cognitive computing," Lecture Notes Comput. Sci., vol. 1415, pp. 408–417, 1998.

[12] D. Blank, J. Marshall, and L. Meeden, "What is it like to be a developmental robot?," Newslett. Autonom. Mental Develop. Tech. Committee, vol. 4, no. 1, p. 7, 2007.

[13] J. Weng, J. McClelland, A. Pentland, O. Sporns, I. Stockman, M. Sur, and E. Thelen, "Autonomous mental development by robots and animals," Science, vol. 291, no. 5504, pp. 599–600, 2001.

[14] L. Meeden and D. Blank, "Editorial: Introduction to developmental robotics," Connect. Sci., vol. 18, no. 2, pp. 93–96, 2006.


[15] M. Asada, K. Hosoda, Y. Kuniyoshi, H. Ishiguro, T. Inui, Y. Yoshikawa, M. Ogino, and C. Yoshida, "Cognitive developmental robotics: A survey," IEEE Trans. Autonom. Mental Develop., vol. 1, no. 1, pp. 12–34, May 2009.

[16] M. R. Genesereth and N. Nilsson, Logical Foundations of Artificial Intelligence. San Mateo, CA: Morgan Kaufmann, 1987.

[17] R. J. Duro, J. Santos, F. Bellas, and A. Lamas, "On line Darwinist cognitive mechanism for an artificial organism," in Proceedings Supplement Book SAB2000. New York: International Society for Adaptive Behavior, 2000, pp. 215–224.

[18] J. Changeux, P. Courrege, and A. Danchin, "A theory of the epigenesis of neural networks by selective stabilization of synapses," in Proc. Nat. Acad. Sci., 1973, vol. 70, pp. 2974–2978.

[19] M. Conrad, "Evolutionary learning circuits," J. Theoret. Biol., vol. 46, 1974.

[20] G. Edelman, Neural Darwinism. The Theory of Neuronal Group Selection. New York: Basic Books, 1987, pp. 167–188.

[21] K. O. Stanley and R. Miikkulainen, "Evolving neural networks through augmenting topologies," Evol. Comput., vol. 10, pp. 99–127, 2002.

[22] F. Bellas, J. A. Becerra, and R. J. Duro, "Using promoters and functional introns in genetic algorithms for neuroevolutionary learning in non-stationary problems," Neurocomputing, vol. 72, pp. 2134–2145, 2009.

[23] F. Bellas, J. A. Becerra, and R. J. Duro, "Internal and external memory in neuroevolution for learning in non-stationary problems," Lecture Notes Artif. Intell., vol. 5040, pp. 62–72, 2008.

[24] F. Bellas and R. J. Duro, "Introducing long term memory in an ANN based multilevel Darwinist brain," Lecture Notes Comput. Sci., vol. 2686, pp. 590–597, 2003.

[25] T. H. J. Collett, B. A. MacDonald, and B. P. Gerkey, "Player 2.0: Toward a practical robot programming framework," in Proc. Annu. Sci. Meeting Exhibt. (ACRA), Sydney, Australia, Dec. 2005.

[26] P. Caamano, R. Tedin, and J. A. Becerra, Java Evolutionary Algorithm Framework [Online]. Available: http://www.gii.udc.es/jeaf

[27] F. Bellas, J. A. Becerra, J. Santos, and R. J. Duro, "Applying synaptic delays for virtual sensing and actuation in mobile robots," in Proc. IJCNN 2000, Como, Italy, 2000, pp. 6144–6153.

[28] Tekkotsu Homepage [Online]. Available: http://www.tekkotsu.org/

Francisco Bellas (M'10) received the B.Sc. and M.Sc. degrees in physics from the University of Santiago de Compostela, Spain, in 1999 and 2001, respectively, and the Ph.D. degree in computer science from the University of A Coruña, Coruña, Spain, in 2003.

He is currently a Profesor Contratado Doctor at the University of A Coruña. He is a member of the Integrated Group for Engineering Research at the University of A Coruña. His current research interests are related to evolutionary algorithms applied to artificial neural networks, multiagent systems, and robotics.

Richard J. Duro (M'94–SM'04) received the B.Sc., M.Sc., and Ph.D. degrees in physics from the University of Santiago de Compostela, Spain, in 1988, 1989, and 1992, respectively.

He is currently a Profesor Titular in the Department of Computer Science and head of the Integrated Group for Engineering Research at the University of A Coruña, Coruña, Spain. His research interests include higher order neural network structures, signal processing, and autonomous and evolutionary robotics.

Andrés Faiña received the M.Sc. degree in industrial engineering from the University of A Coruña, Coruña, Spain, in 2006. He is currently working towards the Ph.D. degree in the Department of Industrial Engineering at the same university.

He is currently a Researcher at the Integrated Group for Engineering Research. His interests include modular and self-reconfigurable robotics, mobile robotics, and electronic and mechanical design.

Daniel Souto received the M.Sc. degree in industrial engineering from the University of A Coruña, Coruña, Spain, in 2007. He is working towards the Ph.D. degree in the Department of Industrial Engineering at the same university.

He is currently a Researcher at the Integrated Group for Engineering Research. His research activities are related to automatic design and mechanical design of robots.