

Real-time Adaptation Technique to Real Robots: An Experiment with a Humanoid Robot

Shotaro Kamio, Hitoshi Iba
Graduate School of Frontier Science, The University of Tokyo,
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan.
{kamio, iba}@iba.k.u-tokyo.ac.jp

Abstract - We introduce a technique that allows a real robot to execute real-time learning, in which GP and RL are integrated. In our former research, we showed the result of an experiment with the real robot "AIBO" and proved that the technique performed better than the traditional Q-learning method. Based on the proposed technique, we can acquire common programs using GP that are applicable to various types of robots. We then execute reinforcement learning with the acquired program on a real robot. In this way, the robot can adapt to its own operational characteristics and learn effective actions. In this paper, we show experimental results in which a humanoid robot, "HOAP-1", learns to solve the box-moving task effectively.

1 Introduction

We can use machine learning techniques when applying a robot to a task for which appropriate actions are unknown in advance. In such situations, the robot can learn appropriate actions in a trial-and-error manner in a real environment. Well-known techniques for this purpose are Genetic Programming (GP) [11], Genetic Algorithms (GA) [14] and Reinforcement Learning (RL) [17].

GP can directly generate programs to control the robot. There are many studies in which GP is applied to control real robots [1, 19]. We can also use a GA in combination with a Neural Network (NN) to control robots [14]. However, the evaluation of real robots demands a significant amount of time because of their mechanical actions, and we have to repeat the evaluations of many individuals over several generations in both GP and GA. For example, Andersson et al. spent 15 hours evaluating 111 generations to acquire a galloping behavior by GP [1], and Floreano et al. spent 10 days over 240 generations to evolve the motion of going toward a light source by a GA with an NN [5]. Therefore, in most of these studies, the learning was conducted through simulation and the results were applied to real robots.

RL is capable of finding optimal actions from interactions with the environment. To obtain optimal actions using RL, it is necessary to repeat learning trials time and again. The enormous learning time of the task with a real robot is a critical problem. Accordingly, most studies deal with problems in which an immediate reward follows an action [10], or load the results learned with a simulator into a real robot [2].

Learning by simulation requires the simulator to represent agents, the environment and their interactions precisely. However, there are many tasks for which it is difficult to make the simulator precise. Learning with an imprecise simulator does not lead to effective performance in a real environment. Furthermore, there are certain variations due to minor errors in the manufacturing process or to changes over time, even if the real robots are modeled exactly the same way. The above approach, i.e., learning with a simulator and applying the result to a real robot, therefore has limitations. Learning with a real robot is unavoidable in order to acquire optimal actions.

Now that various kinds of robots have been developed, it is very important to carry out a task by employing those different robots. In this case, it will take much longer if each different robot has to learn the task independently. Instead, we can use some general knowledge for the part common to various robots and then fine-tune specific parameters for a particular robot, thereby establishing a more effective learning scheme.

In our previous paper [8], we proposed a technique that allows a real robot to execute real-time learning in which GP and RL are integrated. Our technique does not need a precise simulator because learning is done with a real robot. In other words, the precision requirement can easily be met as long as the task is expressed in a proper way. As a result, we can greatly reduce the cost of making the simulator highly precise. Since GP can generate programs, our technique has the possibility of producing more complex behaviors than the reactive behaviors of RL.

Based on the proposed technique, we can apply the program, which has been acquired with the same simulator via GP as described in [8], to different types of robots. The program will adapt to the operational characteristics of real robots and learn effective actions. In our former research, we used the robot "AIBO" (an entertainment four-legged robot by SONY) and proved the effectiveness of our proposed technique. In this research, we perform an experiment with a humanoid robot, "HOAP-1" (manufactured by Fujitsu Automation Limited). We provide experimental results to confirm the adaptation ability of our methodology.

This paper is organized as follows. The next section explains the task of the experiment. After that, Section 3 explains our proposed technique and the settings in this experiment. Section 4 presents an experimental result with the humanoid robot "HOAP-1", and Section 5 discusses the results of the comparison and future research. Finally, a conclusion is given in Section 6.


Figure 1: The robot "HOAP-1", the box and the goal marker.

Table 1: The specification of the robot "HOAP-1".
    Height:          about 48 cm
    Weight:          about 6 kg, including 0.7 kg of battery
    Joint mobility:  6 DOF/foot × 2, 4 DOF/arm × 2
    Sensors:         joint angle sensor, 3-axis acceleration sensor, 3-axis gyro sensor, foot load sensor


2 Task Definition

The task is the box-moving task, the goal of which is to move a box to a pre-defined destination area. It is the same task as in our previous work with AIBO [8].

The robot used in this paper is "HOAP-1", manufactured by Fujitsu Automation Limited. It is a humanoid with 20 degrees of freedom (Fig. 1). The specification of the robot is shown in Table 1. The robot executes actions according to commands given by a host computer (running the RT-Linux operating system). It is also equipped with a CCD camera in its head, which provides image data for the purpose of environmental understanding.

The target object, i.e., the box, has wheels on the bottom so that it can be easily pushed. Note that a humanoid's arms are so weak that it has difficulty carrying anything heavy. Thus, we have chosen to fix the arm position and use the push action for the sake of simplicity. The goal position is marked with a red marker. When the humanoid has pushed the box in front of this red marker, we regard the task as having been achieved successfully.

Moving the box to an arbitrary position is generally difficult. The moving behavior is achieved when the robot pushes the box with its knees while walking. A humanoid robot is quite different from AIBO in the sense that it stands on one foot while walking, during which its direction may be diverted unexpectedly by disturbances. In particular, if the box is in front of one leg, the box may be pushed forward or sometimes moved outside the leg area. This is unpredictable because of the physical interactions and frictions. It is very difficult to construct a precise simulator which expresses this movement. The robot must, therefore, learn these actions in a real environment.

Figure 2: The flow of the algorithm. The GP loop (population learning) creates initial individuals, evaluates them, performs selection, and reproduces individuals by crossover and mutation; the evolved program is then passed to reinforcement learning (individual learning). A dotted feedback path from the RL stage back to the GP loop is discussed in Sect. 5.3.


3 Real-time Adaptation Technique

In our former research [8], we proposed a technique that integrates GP and RL. The algorithm is shown in Fig. 2. This technique enables us (1) to speed up learning in a real robot and (2) to adapt to a real robot using the programs acquired in a simulator.

The proposed technique consists of two stages (a GP part and an RL part).

Step 1. Carry out GP in a simplified simulator, and formulate programs that have the standard actions required for executing a task.

Step 2. Conduct individual learning (= RL) after loading the programs obtained in Step 1 above.

In the first step above, the programs for the standard actions required of a real robot to execute a task are created through the GP process. The learning process of RL can be sped up in the second step because the state space is divided into partial spaces under the judgment standards obtained in the first step. Moreover, preliminary learning with a simulator allows us to anticipate that the robot performs target-oriented actions from the beginning of the second stage.


3.1 RL part conducted by the real robot

As one can well imagine, a humanoid robot is very different from the robot AIBO used in our former research. The difference includes not only the shapes of the robots (one has two legs and the other has four), but also the viewing angle of the CCD camera and the dynamics of behaviors.

However, our technique can treat the programs for both AIBO and HOAP-1 in the same manner because the fundamental actions (i.e., forward movement and turning) required for the task are common to both of them. Those actions are represented in our simplified simulator. GP is applied using this simple simulator so as to evolve such common programs (the details of GP are described in Sect. 3.2). At the stage of reinforcement learning, effective adaptation is achieved according to the operational characteristics of a particular robot.

3.1.1 Action set

The robot can choose one of seven actions: Forward (6 steps), Left-turn, Right-turn, Left-sidestep (one step to the left), Right-sidestep (one step to the right), the combination of Left-turn and Right-sidestep, and the combination of Right-turn and Left-sidestep. However, these actions are far from ideal. For instance, the robot tends to move a little backward during a Right-turn or Left-turn. Thus, it is necessary for the robot to adapt to these motion characteristics. As mentioned before, although the robot has arms, they are so weak that we cannot rely on them to move the box. The robot moves the box to the goal using only these seven actions.
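
For reference, the seven real actions can be written down as a simple enumeration. The Python sketch below only lists the actions named above; the identifier names are ours, not the paper's.

    from enum import Enum, auto

    class RealAction(Enum):
        """The seven pre-defined real actions of HOAP-1 (Sect. 3.1.1)."""
        FORWARD = auto()                    # forward walk of 6 steps
        LEFT_TURN = auto()
        RIGHT_TURN = auto()
        LEFT_SIDESTEP = auto()              # one step to the left
        RIGHT_SIDESTEP = auto()             # one step to the right
        LEFT_TURN_RIGHT_SIDESTEP = auto()   # combined action
        RIGHT_TURN_LEFT_SIDESTEP = auto()   # combined action

    print(len(RealAction))                  # 7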

As is often the case with a real robot, any action gives rise to some error. For example, it is inevitably affected by the slightest roughness of the floor or by friction changes due to balance shifts. Thus, even though the robot starts from the same position under the same conditions, it does not necessarily follow the same path.

In addition, every action takes approximately ten seconds. It is, therefore, desirable that the learning time be as short as possible.

3.1.2 State Space

The state space is structured based on the positions at which the box and the goal marker are seen in the CCD image, as described in [8]. This recognition is performed after every action.

Figures 3(a) and 3(b) are the projections of the box state and the goal marker state onto the ground surface. These state spaces are constructed from rough directions and rough distances of the objects. We used four levels of distance for the box (i.e., "proximate", "near", "middle", "far") and three levels for the goal marker ("near", "middle", "far").
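
As an illustration of this coarse discretization, the following sketch maps a detected object position to one of the distance/direction labels of Fig. 3. The numeric thresholds are hypothetical, chosen only to show the idea; the paper specifies the labels but not the boundaries.

    def discretize(distance_m, bearing_deg, levels):
        """Map a measured distance and bearing to a coarse state label.
        'levels' is a list of (threshold_m, name) pairs in increasing order;
        the thresholds below are illustrative, not the paper's values."""
        dist_label = levels[-1][1]                  # default: farthest level
        for threshold, name in levels:
            if distance_m < threshold:
                dist_label = name
                break
        if bearing_deg < -15.0:
            side = "left"
        elif bearing_deg > 15.0:
            side = "right"
        else:
            side = "center"
        return f"{dist_label} {side}"

    BOX_LEVELS = [(0.25, "proximate"), (0.6, "near"), (1.2, "middle"), (float("inf"), "far")]
    GOAL_LEVELS = [(0.6, "near"), (1.2, "middle"), (float("inf"), "far")]

    print(discretize(0.5, -20.0, BOX_LEVELS))   # -> "near left"
    print(discretize(2.0, 5.0, GOAL_LEVELS))    # -> "far center"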

We have to pay special attention to the position of the robot's legs. Depending on the physical relationship between the box and the legs of the robot, the box moves forward or deviates to a side. If an appropriate state space is not defined, then the Markov property of the environment, which is a premise of RL, cannot be met; therefore, optimal actions cannot be found. Consequently, we have defined the "proximate straight left" and "proximate straight right" states at the frontal positions of the legs. These states do not exist in the simulation because the interaction between the legs of the robot and the box is very difficult to simulate.

Figure 3: States in the real robot. (a) States of the box: the distance levels "proximate", "near", "middle", "far" combined with the directions "left", "center", "right", plus "proximate straight left", "proximate straight right", "lost", "lost into left" and "lost into right". (b) States of the goal marker: the same layout without the "proximate" level. The front of the robot is toward the top of each figure.


The state is defined to be "lost" when the robot misses the box or the goal position. Remember that the CCD camera of our robot is fixed in a forward direction and is not movable. As a result, the robot often misses the box or the goal. To recover from this, we use the following strategy. The robot records the missing direction (i.e., left or right) when it has missed the box or goal. The robot can use this information for a certain number of steps so as to generate the "lost into left" or "lost into right" states, which mean that the target disappeared in either the left or the right direction in the last few steps.
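
A minimal sketch of this recovery strategy follows, assuming the robot keeps the last seen side of the target and a hypothetical persistence of a few steps (the paper does not give the exact number).

    class LostTracker:
        """Generate "lost", "lost into left" or "lost into right" states from a
        stream of camera observations (Sect. 3.1.2 recovery strategy).
        The persistence length is an assumption, not the paper's value."""

        def __init__(self, persistence=3):
            self.persistence = persistence
            self.last_side = None        # "left" or "right": side where target was last seen
            self.steps_since_lost = 0

        def update(self, visible, bearing_deg=0.0):
            if visible:
                self.last_side = "left" if bearing_deg < 0.0 else "right"
                self.steps_since_lost = 0
                return None                               # target in view: no lost state
            self.steps_since_lost += 1
            if self.last_side and self.steps_since_lost <= self.persistence:
                return f"lost into {self.last_side}"      # remembered direction
            return "lost"

    tracker = LostTracker()
    print(tracker.update(True, bearing_deg=-10.0))        # None (box visible on the left)
    print(tracker.update(False))                          # "lost into left"
    print(tracker.update(False), tracker.update(False))   # still "lost into left"
    print(tracker.update(False))                          # "lost" after persistence expires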


Figure 4: States for the box and the goal marker in the simulator ("center", "left", "right", "lost", "lost into left" and "lost into right", relative to the agent). The area box_ahead is not a state but the place where if_box_ahead executes its first argument.

As shown in Figs. 3(a) and 3(b), there are 17 states for the box and 14 states for the goal marker. The state of this environment is represented by their cross product. Hence, there are 238 states in total.

When the goal marker is "near center" and the box is near the front of the robot (i.e., in one of the states "box proximate center", "box proximate straight left", "box proximate straight right" or "box near center"), the robot recognizes that it has completed the task.
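
Since a Q-table over this state space is just a flat array, the cross product can be indexed directly. The indexing below is our own illustration of the 17 × 14 = 238 count; it is not code from the paper.

    N_BOX_STATES = 17      # Fig. 3(a)
    N_GOAL_STATES = 14     # Fig. 3(b)

    def state_index(box_state, goal_state):
        """Flatten a (box, goal) state pair into a single index 0..237."""
        assert 0 <= box_state < N_BOX_STATES
        assert 0 <= goal_state < N_GOAL_STATES
        return box_state * N_GOAL_STATES + goal_state

    print(N_BOX_STATES * N_GOAL_STATES)    # 238 states in total
    print(state_index(16, 13))             # 237, the highest index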

3.2 GP part conducted by the simulated robot

The simulator in our experiment uses a robot expressed as a circle on a two-dimensional plane, as well as a box and a goal marker fixed on the plane. The task is completed when the robot has pushed the box forward to the goal marker [8].

We defined three actions ("move forward", "turn left", "turn right") in the action set. The state space in the simulator is simplified as shown in Fig. 4. While the actions of the real robot are not ideal, the simulator actions are ideal, i.e., the "move forward" action moves the robot truly straight forward and the "turn left" action purely turns left. Of all the states, the "lost into left" and "lost into right" states are similar to those used by the real robot. These are the states produced when the box or the goal is not in view and, in the preceding step, it was either at the left or at the right.

Such actions with a state division are similar to those of the real robot, but are not identical. In addition, physical parameters such as box weight and friction are not measured, nor is the shape of the robot taken into account. Therefore, this simulator is very simple and can be built at low cost.

The operational characteristics of the box expressed in the simulator are as follows:

1. The box moves forward if it comes into contact with the front of the robot and the robot moves ahead (footnote 1).

2. If the box is near the center of the robot when the robot turns, then the box remains near the center of the robot after the turn (footnote 2).

Footnote 1: This means the robot is able to push the box with the Forward action.

Figure 5: Action nodes (e.g., turn_left, move_forward and turn_right under a box_where node) pick a real action according to the Q-values for the real robot's state.


We used a terminal set = {move_forward, turn_left, turn_right} and a function set = {if_box_ahead, box_where, goal_where, prog2}. The terminal nodes correspond to the "move forward", "turn left" and "turn right" actions, respectively, in the simulator. The function nodes box_where and goal_where are functions of six arguments, and they execute one of the six arguments depending on the states of the box and the goal marker as seen through the robot's eyes (Fig. 4). Further details of the GP settings are described in [8].
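
To make this node vocabulary concrete, the sketch below interprets a small program built from these terminal and function nodes. The nested-tuple encoding, the two-argument form of if_box_ahead and the ordering of the six branch arguments are our assumptions for this illustration; only the node names come from the paper and [8].

    # Illustrative interpreter for programs over the terminal set
    # {move_forward, turn_left, turn_right} and the function set
    # {if_box_ahead, box_where, goal_where, prog2}.
    SIMULATOR_STATES = ["center", "left", "right", "lost", "lost into left", "lost into right"]

    def run(node, percept, act):
        """percept: {'box': state, 'goal': state, 'box_ahead': bool};
        act: callback executing a terminal action on the simulator or robot."""
        if isinstance(node, str):                       # terminal node: execute the action
            act(node)
        elif node[0] == "prog2":                        # evaluate both children in order
            run(node[1], percept, act)
            run(node[2], percept, act)
        elif node[0] == "if_box_ahead":                 # 1st argument if the box is directly ahead
            run(node[1] if percept["box_ahead"] else node[2], percept, act)
        elif node[0] == "box_where":                    # six arguments, one per box state
            run(node[1 + SIMULATOR_STATES.index(percept["box"])], percept, act)
        elif node[0] == "goal_where":                   # six arguments, one per goal-marker state
            run(node[1 + SIMULATOR_STATES.index(percept["goal"])], percept, act)

    # A tiny hand-written program: push if the box is ahead, otherwise turn toward it.
    program = ("if_box_ahead",
               "move_forward",
               ("box_where", "move_forward", "turn_left", "turn_right",
                "turn_left", "turn_left", "turn_right"))

    run(program, {"box": "left", "goal": "center", "box_ahead": False}, print)   # -> turn_left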

3.3 Integration of GP and RL

Reinforcement learning is executed to adapt the actions acquired via GP to the operational characteristics of a real robot. More precisely, this is aimed at revising the move_forward, turn_left and turn_right actions of the simulator so as to achieve their optimal counterparts in a real environment.

We applied the Q(λ)-learning method to the real robot. Q(λ)-learning is a variant of Q-learning and can learn more efficiently than normal Q-learning. This method records a trace of the visited states and the actions taken. When a temporal difference (TD) error occurs, the Q-values involved in the trace are assigned credit or blame for the error. We implemented "naive Q(λ)" with a replacing trace as described in [17].
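
A minimal sketch of one naive Q(λ) update with a replacing trace, in the spirit of [17], is given below. The ε-greedy policy and all numeric parameter values here are illustrative assumptions; the paper chose its parameters from preliminary experiments.

    import random
    from collections import defaultdict

    ALPHA, GAMMA, LAMBDA, EPSILON = 0.3, 0.9, 0.8, 0.1   # illustrative values only

    def choose_action(Q, state, actions):
        """Epsilon-greedy selection over one Q-table (an assumption of this sketch)."""
        if random.random() < EPSILON:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def q_lambda_update(Q, E, state, action, reward, next_state, actions):
        """Naive Q(lambda) with a replacing trace: the TD error is propagated to
        every state-action pair kept in the eligibility trace E."""
        best_next = max(Q[(next_state, a)] for a in actions)
        delta = reward + GAMMA * best_next - Q[(state, action)]
        E[(state, action)] = 1.0                 # replacing trace
        for key in list(E):
            Q[key] += ALPHA * delta * E[key]
            E[key] *= GAMMA * LAMBDA             # decay all traces

    Q, E = defaultdict(float), defaultdict(float)
    actions = ["Forward", "Left-sidestep", "Right-sidestep"]
    a = choose_action(Q, "near center", actions)
    q_lambda_update(Q, E, "near center", a, reward=0.0,
                    next_state="proximate center", actions=actions)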

A Q-table, in which Q-values are listed, is allocated to each of the move_forward, turn_left and turn_right action nodes. The states used in the Q-tables are those of the real robot. Therefore, the actual actions selected with the Q-tables can vary depending on the state, even if the same action node is executed by the real robot. Figure 5 illustrates this situation.

For the initialization of a Q-table, we allow a specific action to be chosen more frequently. For instance, in the case of the Q-table for move_forward, the Forward action tends to be selected more often, as do the preferred actions in the Q-tables for turn_left and turn_right.

Footnote 2: This means that the robot is able to change the direction of the box only by the Left-turn and Right-turn actions.


Table 2: Action nodes and their selectable real actions.
    action node     real actions which the Q-table can select
    move_forward    Forward (*), Left-sidestep, Right-sidestep
    turn_left       Left-turn (*), Left-turn + sidestep (**), Left-sidestep
    turn_right      Right-turn (*), Right-turn + sidestep (**), Right-sidestep
    (*)  The action which the Q-table prefers to select.
    (**) Preferred actions if the box is in a state of "proximate center", "near center", "middle center" or "far center".

However, when the box is in a state of "proximate center", "near center", "middle center" or "far center", the actions combining a Left(Right)-turn with a sidestep are chosen with greater frequency. This maintains consistency with the simulator results: because turn actions are always accompanied by a small backward motion, when the robot takes a combined turn-and-sidestep action in one of these states, the next state remains the same. In the implementation, an initial value of 0.0001 was entered into the respective Q-tables so that the preferred actions were selected for each Q-table, while 0.0 was entered for the other actions. The robot can learn optimal actions in this setting because Q-values converge regardless of the initial values [17]. The actions preferred by each action node are summarized in Table 2. In addition, each Q-table is arranged to limit the selectable actions. This reflects the idea that, for example, "turn right" actions need not be learned in the turn_left node.

Some translation of states is required to run a GP individual on the real robot. We translated the states "proximate straight left" and "proximate straight right", which exist only on the real robot, into the "center" state in the function nodes of the GP. When the box is in the "proximate center" state for the real robot, the if_box_ahead node executes its first argument.
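
The initialization and the state translation just described can be sketched as follows. The dictionary layout and helper names are ours, and for brevity the keys use only the box part of the state, whereas the real Q-tables are indexed by the full box-goal cross product.

    SELECTABLE = {                  # real actions each node's Q-table may select (Table 2)
        "move_forward": ["Forward", "Left-sidestep", "Right-sidestep"],
        "turn_left":    ["Left-turn", "Left-turn + sidestep", "Left-sidestep"],
        "turn_right":   ["Right-turn", "Right-turn + sidestep", "Right-sidestep"],
    }
    CENTER_BOX_STATES = {"proximate center", "near center", "middle center", "far center"}

    def preferred_action(node, box_state):
        """Plain Forward/turn is preferred; in "center" box states the combined
        turn + sidestep is preferred instead (Table 2, note **)."""
        if node != "move_forward" and box_state in CENTER_BOX_STATES:
            return SELECTABLE[node][1]
        return SELECTABLE[node][0]

    def init_q_table(node, box_states):
        """Preferred actions start at 0.0001, all others at 0.0 (Sect. 3.3)."""
        return {(s, a): (0.0001 if a == preferred_action(node, s) else 0.0)
                for s in box_states for a in SELECTABLE[node]}

    # Real-robot-only states are presented to the GP function nodes as "center".
    GP_STATE_TRANSLATION = {"proximate straight left": "center",
                            "proximate straight right": "center"}

    table = init_q_table("turn_left", ["near center", "near left"])
    print(table[("near center", "Left-turn + sidestep")])   # 0.0001 (preferred here)
    print(table[("near left", "Left-turn")])                # 0.0001 (preferred here)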

As for the Q(λ)-learning parameters, the reward was set to 1.0 when the goal was achieved and 0.0 for all other states. The learning rate α, the discount factor γ and the trace-decay parameter λ were determined from preliminary experiments.

4 Experimental Results with a humanoid robot HOAP-1

An experiment was performed using the humanoid robot. In this experiment, the starting state was limited to arrangements in which both the box and the goal position were visible. This was justified by the following reason: even if learning is performed from an arrangement in which either the box or the goal is "lost", it cannot be predicted that the box or goal position will subsequently become visible. Thus, there would be substantial variation in the state transitions, and a long time would be required for the learning.

As a single trial, learning was performed until the robot protruded from the field of the experiment or a predetermined number of steps (i.e., 30 steps) was exceeded. The learning in the real robot was performed in about six hours.
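
One trial can be expressed as a simple loop with the two stopping conditions above plus task completion. The callback names are placeholders for the real robot interface, not part of the paper.

    MAX_STEPS = 30     # predetermined step limit per trial (Sect. 4)

    def run_trial(do_learning_step, robot_in_field, task_completed):
        """Run one learning trial until the task is completed, the robot leaves
        the experiment field, or the step limit is exceeded."""
        for step in range(1, MAX_STEPS + 1):
            do_learning_step()
            if task_completed():
                return "completed", step
            if not robot_in_field():
                return "left field", step
        return "step limit", MAX_STEPS

    # toy usage with dummy callbacks: the "task" completes on the third step
    progress = {"n": 0}
    def do_learning_step(): progress["n"] += 1
    print(run_trial(do_learning_step, lambda: True, lambda: progress["n"] >= 3))
    # -> ('completed', 3)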

Just at the start of learning: The robot succeeded in completing the task in many situations. This is because the robot acted relatively well using the program evolved with the simulator, although its operational characteristics differ from those of AIBO.

However, the robot took a long time in some situations. This shows that the actions acquired with the simulator are not always optimal in a real environment because of the differences between the simulator and the real robot. These differences necessitate the box states added for the real robot, i.e., "proximate straight left" and "proximate straight right". The operational characteristics in these states are unknown to the robot before learning.

After six hours (after about 1800 steps): Noticeably improved actions were observed. The robot selected appropriate actions in these situations. Fig. 6 shows such a successful action sequence. It completed the task much faster than it did before learning.

The same improvement was observed as described in our previous studies on AIBO [8]. This underscores the effectiveness of our approach. Since GP succeeded in learning some general knowledge, in the sense that its usage is not limited to a particular robot, it is applicable to both AIBO and HOAP-1.

5 Discussion

5.1 Measurement of Improvement

We performed a quantitative comparison so as to investigate how efficiently the robot performed after learning. For this comparison, we randomly selected six situations: four situations were selected from the states added for the real robot ("box proximate straight left" and "box proximate straight right"), and the other two were selected from different states. We measured the number of steps needed to complete the task both before and after learning in these situations. These tests were executed with a greedy policy in order to ensure that the robot always selects the best action in each state.


Figure 6: Successful action series. The goal is at the bottom center of each figure.


Table 3 shows the results. After the learning, the robot completed the task more efficiently than before in four out of the six situations (#2, #3, #5 and #6). In particular, a great improvement was observed in situation #2. These results show that our technique worked very well and that the robot learned efficient actions.

There is another point to consider in terms of efficiency. We had to deal with the "state-action deviation" problem [2] when applying Q(λ)-learning to this experiment. This is the problem that optimal actions cannot be achieved due to the dispersion of state transitions. More precisely, if the state is composed only of the images, there are often no remarkable differences in image values within an action.

Table 3: Comparison of the number of steps before and after learning.
    #   situation                                                    before learning   after learning
    1   Box: proximate straight right / Goal marker: middle right          7.3              8.5
    2   Box: proximate straight right / Goal marker: middle left          10.7              5.6
    3   Box: proximate straight left  / Goal marker: middle right         16.3             12.8
    4   Box: proximate straight left  / Goal marker: far right            14.0             15.3
    5   Box: near left                / Goal marker: far center           14.7             12.3
    6   Box: middle right             / Goal marker: far center           11.3             10.9

As a solution to this problem, the same action should be repeated until the current state changes. The Q-values are then updated on the state change (for details, see [8]).
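
A sketch of this action-repetition rule is shown below, assuming hypothetical callbacks for observing the state and executing a real action; the repetition cap is our safety guard, not from the paper.

    def act_until_state_changes(observe_state, execute_action, action, max_repeat=10):
        """Repeat the same real action until the observed state changes, so that
        Q-values are updated only on state transitions (state-action deviation [2])."""
        start_state = observe_state()
        for _ in range(max_repeat):
            execute_action(action)
            new_state = observe_state()
            if new_state != start_state:
                return start_state, new_state      # update Q-values on this transition
        return start_state, observe_state()

    # toy usage: the camera reports a change after the second execution
    readings = iter(["near left", "near center"])
    world = {"state": "near left"}
    def observe_state():   return world["state"]
    def execute_action(a): world["state"] = next(readings)

    print(act_until_state_changes(observe_state, execute_action, "Forward"))
    # -> ('near left', 'near center')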

In Q(λ)-learning, as in usual reinforcement learning, the agent learns optimal actions in order to maximize the sum of the discounted rewards it receives until completing the task [17]. This sum, called the expected discounted return, can be written as follows:

R_t = \sum_{k=0}^{T-t-1} \gamma^{k} r_{t+k+1}    (1)

where t is the current time step, T is the last time step, and \gamma is the discount factor (0 \le \gamma \le 1). r_{t+k+1} denotes the reward received k+1 time steps in the future. In this experiment, the reward is defined as r = 1.0 only when the task is completed, and r = 0.0 otherwise. In this situation, the equation can be written simply as:

R_t = \gamma^{T-t-1}    (2)
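
The step from Eq. (1) to Eq. (2) is immediate once the reward definition is substituted; the short derivation below is added here for completeness.

    R_t = \sum_{k=0}^{T-t-1} \gamma^{k} r_{t+k+1}
        = \gamma^{T-t-1} r_{T}        % all rewards vanish except the final one, r_T = 1.0
        = \gamma^{T-t-1}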

This forces the agent to minimize T (i.e., the number of steps) in achieving the task. Here, a step corresponds to a state transition because of the treatment of the "state-action deviation" problem. Therefore, we also have to compare the counts of state transitions in completing the task so as to investigate how efficiently the robot behaves.

Table 4 shows the counts of state transitions in these situations. This table illustrates that the performance in situation #4 was improved after learning in terms of the count of state transitions. This is evidence that the real-time learning process of our technique is very effective.

In situation #1, unpredictable movement of the box was observed many times. This unpredictability resulted in slower convergence, which means that it took much longer to learn.


Table 4: Comparison of the counts of state transitions (situations as in Table 3).
    #   before learning   after learning
    1         7.0               8.0
    2        10.3               4.0
    3        16.0              12.5
    4        14.0              13.5
    5        14.7               8.3
    6        10.3               9.9

The number of steps did not necessarily become smaller after learning in situation #4, whereas the number of state transitions became smaller. One reason seems to be that the division of the state space is not appropriate. Since the division is fixed during Q(λ)-learning, we cannot expect much improvement in the case of an incorrect state-space division.

5.2 Related Works

There are many studies combining evolutionary algorithms and RL [13, 3]. Although the approaches differ from our proposed technique, we have seen several studies in which GP and RL were combined [7, 4]. In these traditional techniques, Q-learning was adopted as the RL component, and a GP individual represented the structure of the state space to be searched. It is reported that the search efficiency of the QGP method [7] was improved compared to traditional Q-learning. The techniques used in these studies, however, are also a type of population learning using numerous individuals. RL must be executed for numerous individuals in the population because RL is inside the GP loop. An inordinate amount of time would be required for learning if the whole process were directly applied to a real robot. As a result, no studies using these techniques have been reported with a real robot.

Noise in simulators is often essential to overcome the differences between a simulator and the real environment [16]. The robot that learned with our technique, however, showed sufficient performance in the noisy real environment, even though it had learned in an ideal simulator. One reason seems to be that the coarse state division can absorb the image-processing noise. We plan to conduct a comparative experiment on the robustness produced by our technique versus that produced by noisy simulators.

As described in Sect. 5.1, it seems that the division of the state space is not appropriate in some situations. It is difficult for the robot to improve its actions in such situations because the division is fixed during the learning process. Takahashi et al. [18] proposed two methods of segmenting a state space automatically. In the first method, the real robot moves in its environment and samples data. After that, it segments the state space by constructing local models of the inputs. They pointed out that the method requires uniformly sampled data to construct an appropriate state space. The robot in our experiment takes about ten seconds for one action.

Under this condition, uniform sampling is not reasonable because it would take an enormous amount of time. Although the second method segments the state space incrementally on-line, it also seems to require sampling a large amount of data to construct a sufficient state space. It may be time-consuming for a real robot, but it is still an interesting approach.

We used several pre-defined actions for the humanoid. This is a shortcut to investigate the applicability of evolutionary methods to such high-level functions as solving a task. In contrast, there are related studies on evolving low-level functions of a humanoid. Nordin et al. have developed a humanoid robot, "ELVIS" [15]. The software of this robot is built mainly on GP. They experimented with the evolution of stereoscopic vision [6] and sound localization [9]. They also reported results on hand-eye coordination [12].

5.3 Future Research

We chose only several discrete actions in this study. Although this is simple and easy to use, continuous actions will be more realistic in other applications. In that situation, for example, a "turn left 30.0 degrees" action at the beginning of RL could become "turn left 31.5 degrees" after learning, depending on the operational characteristics of the robot. We plan to conduct an experiment with such continuous actions.

We intend to apply the technique to more complicated tasks such as multi-agent problems. It might be possible to handle a multi-agent task with heterogeneous robots by extending our approach. For this purpose, we would use simulation-based learning to acquire the common programs applicable to various types of robots. After that, the real robots would learn their effective actions in spite of their different operational characteristics. This may be possible while they are carrying out a cooperative task in a real-world situation.

Another extension is to adapt the simulation parameters or modify the state space by using the information available from the real environment. Such simulator tuning would require the feedback process shown as dotted lines in Fig. 2. This would enable GP to evolve more effective programs. Note that programs evolved by GP cannot be run without a state space. However, we can give a rough state space initially and then modify it gradually according to the robot's characteristics, which will establish a more effective learning scheme.

6 Conclusion

We have introduced a real-time adaptation technique for real robots. This approach is based on our previously proposed method with AIBO. We applied the same evolved programs to a humanoid robot.


We confirmed that, after six hours of learning, effective adaptation was established according to the robot's operational characteristics so as to solve the task effectively. Our approach was successful in acquiring a common program, which will be applicable to heterogeneous robots.

Bibliography

[1] Bjorn Andersson, Per Svensson, Mats Nordahl, and Peter Nordin. On-line evolution of control for a four-legged robot using genetic programming. In Stefano Cagnoni et al., editors, Real-World Applications of Evolutionary Computing, LNCS 1803, pages 319-326. Springer-Verlag, 2000.

[2] Minoru Asada, Shoichi Noda, Sukoya Tawaratsumida, and Koh Hosoda. Purposive behavior acquisition for a real robot by vision-based reinforcement learning. Machine Learning, 23:279-303, 1996.

[3] Marco Dorigo and Marco Colombetti. Robot Shaping: An Experiment in Behavior Engineering. MIT Press, 1998.

[4] Keith L. Downing. Adaptive genetic programs via reinforcement learning. In Proc. of the Third Annual Genetic Programming Conference, 1998.

[5] D. Floreano and F. Mondada. Evolution of homing navigation in a real mobile robot. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 26(3):396-407, 1996.

[6] Christopher T. M. Graae, Peter Nordin, and Mats Nordahl. Stereoscopic vision for a humanoid robot using genetic programming. In Stefano Cagnoni et al., editors, Real-World Applications of Evolutionary Computing, LNCS 1803, pages 12-21. Springer-Verlag, 2000.

[7] Hitoshi Iba. Multi-agent reinforcement learning with genetic programming. In Proc. of the Third Annual Genetic Programming Conference, 1998.

[8] Shotaro Kamio, Hideyuki Mitsuhasi, and Hitoshi Iba. Integration of genetic programming and reinforcement learning for real robots. In Proc. of the Genetic and Evolutionary Computation Conference (GECCO 2003), 2003.

[9] Rikard Karlsson, Peter Nordin, and Mats Nordahl. Sound localization for a humanoid robot by means of genetic programming. In Stefano Cagnoni et al., editors, Real-World Applications of Evolutionary Computing, LNCS 1803, pages 65-76. Springer-Verlag, 2000.

[10] Hajime Kimura, Toru Yamashita, and Shigenobu Kobayashi. Reinforcement learning of walking behavior for a four-legged robot. In 40th IEEE Conference on Decision and Control, 2001.

[11] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.

[12] William B. Langdon and Peter Nordin. Evolving hand-eye coordination for a humanoid robot with machine code genetic programming. In J. Miller et al., editors, Proc. of the 4th European Conference, EuroGP 2001, LNCS 2038, pages 313-324, Lake Como, Italy, April 18-20, 2001. Springer-Verlag.

[13] David E. Moriarty, Alan C. Schultz, and John J. Grefenstette. Evolutionary algorithms for reinforcement learning. Journal of Artificial Intelligence Research, 11:199-229, 1999.

[14] Stefano Nolfi and Dario Floreano. Evolutionary Robotics: The Biology, Intelligence, and Technology of Self-Organizing Machines. The MIT Press, 2000.

[15] J. P. Nordin and M. Nordahl. An evolutionary architecture for a humanoid robot. In The Fourth International Symposium on Artificial Life and Robotics (AROB 4th '99), Oita, Japan, 1999.

[16] Alan C. Schultz, Connie Loggia Ramsey, and John J. Grefenstette. Simulation-assisted learning by competition: Effects of noise differences between training model and target environment. In Proc. of the Seventh International Conference on Machine Learning, pages 211-215, San Mateo, 1990. Morgan Kaufmann.

[17] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[18] Yasutake Takahashi, Minoru Asada, Shoichi Noda, and Koh Hosoda. Sensor space segmentation for mobile robot learning. In Proceedings of the ICMAS'96 Workshop on Learning, Interaction and Organizations in Multiagent Environment, 1996.

[19] Kohsuke Yanai and Hitoshi Iba. Multi-agent robot learning by means of genetic programming: Solving an escape problem. In Y. Liu et al., editors, Evolvable Systems: From Biology to Hardware, Proceedings of the 4th International Conference on Evolvable Systems (ICES 2001), Tokyo, October 3-5, 2001, pages 192-203. Springer-Verlag, Berlin, Heidelberg, 2001.