Learning Optimal Navigation Actions for Foresighted Robot Behavior During Assistance Tasks

AbdElMoniem Bayoumi Maren Bennewitz

Abstract— We present an approach to learn optimal navigation actions for assistance tasks in which the robot aims at efficiently reaching the final navigation goal of a human where service has to be provided. Always following the human at a close distance might hereby result in inefficient trajectories, since people regularly do not move on the shortest path to their destination (e.g., they move to grab the phone or make a note). Therefore, a service robot should infer the human's intended navigation goal and compute its own motion based on that prediction. We developed an approach that applies reinforcement learning to get a Q-function that determines for each pair of the robot's and human's relative positions the best navigation action for the robot. Our approach applies a prediction of the human's motion based on a softened Markov decision process (MDP). This MDP is independent from the navigation learning framework and is learned beforehand on previously observed trajectories. We thoroughly evaluated our method in simulation and on a real robot. As the experimental results show, our approach leads to foresighted navigation behavior and significantly reduces the path length and completion time compared to naive following strategies.

I. INTRODUCTION

Following humans with mobile robots is a challenging problem and is needed in several applications, such as in industrial settings where robots are deployed as transportation systems, in home scenarios, especially for elderly people, in stores with autonomous shopping carts, or in scenes where robotic wheelchairs should navigate next to an accompanying pedestrian. Various solutions have been proposed for such problem instances, as discussed in the next section. However, these solutions mainly focus on following the human at a certain distance, irrespective of the human's intended destination. Accordingly, the robot will keep on following the human even if he does not move on the shortest path to his goal. This inefficiency might arise since humans are easily interrupted by unexpected events during navigation. For example, consider a home-assistance robot performing a delivery task and following the human when suddenly the phone rings and needs to be grabbed, or the doorbell rings. Or it might be that in the middle of the task an elderly person decides to rest for a while. In such situations, a robot applying a naive navigation strategy can only follow the human since the destination is initially unknown to the robot. This leads to inefficient navigation behavior causing unnecessary battery consumption or wear.

All authors are with the Institute of Computer Science, University of Bonn, Germany. This work has been supported by the German Academic Exchange Service (DAAD) and the Egyptian Ministry for Higher Education as well as by the European Commission under contract number FP7-610532-SQUIRREL.

Fig. 1. The human moves through the environment between different possible designated locations (top), where they stay for a while and might need the help of a mobile robotic assistant. The task of the robot is to efficiently reach the initially unknown destination of the human, who might not move on the shortest trajectory. Our learning framework generates optimal navigation actions based on predicted motions of the human (bottom), which results in more foresighted behavior than just following the human at a close distance.

In this paper, we present a reinforcement learning framework that generates foresighted navigation actions in the described situations and is able to deal with such unexpected human behaviors during the tasks. Our approach constantly predicts the human's intended destination. Based on that prediction, the robot follows the learned navigation strategy and, thus, reaches the predicted navigation goal efficiently.

An overview of our framework is depicted in Fig. 1. We rely on a set of previously observed human trajectories between possible destinations where the human stays for a while and might need the help of the robot, i.e., for general assistance tasks, social interaction, or delivery tasks. Given these trajectories, we apply a technique to learn a prediction model that is used to reason about the future motions and the target destination of the human. The output of our learning framework is a table of Q-values that encodes the best navigation action for the robot based on the current robot's and human's positions as well as the predicted human's motion.

As we show in the experiments, our approach leads to foresighted navigation behavior. The robot focuses on efficiently reaching the human's final destination, which is initially unknown to the robot. We demonstrate that a robot executing the learned navigation strategy can successfully handle cases in which the human's trajectory contains detours. Our method results in paths with significantly shorter path length and significantly reduced completion time compared to naive following strategies. To the best of our knowledge, we present the first solution for applications involving people-following tasks that considers the efficiency of the generated robot paths.

II. RELATED WORK

The task of human following or target tracking has been thoroughly investigated using control theory. For example, Huang [1] presented two control models for both the linear and angular velocity of a robot. These control models focus on generating velocity commands that allow the robot to follow a human while ensuring smoothness of the resulting trajectory. Harmati and Skrzypczyk [2] applied a fuzzy logic controller and proposed a game-theory-based method for a team of robots tracking a target. In this approach, each robot has knowledge about the positions of its teammates and optimizes its control signals using a cost function that also takes into account the predicted decisions of the other robots. Similarly, Nascimento et al. [3] developed an approach using a nonlinear predictive formation control model in a distributed architecture for collaborating robots. Each robot shares information about its pose with the rest of the team and optimizes its control signals under a prediction of the next states of the target as well as of the other team members. Pradhan et al. [4] proposed a path planning method with a navigation function that uses predictive fields of moving obstacles to follow a target.

Furthermore, there are approaches that aim at following a human at a fixed distance. For example, Nishimura et al. [5] developed a modified shopping cart to follow a person within a certain range. Prassler et al. [6] considered side-by-side following within the application of a robotic wheelchair following a human. Also Kuderer and Burgard [7] considered the wheelchair application and developed an approach to predict the human trajectory taking into account information about obstacles in the environment. The authors compute the robot's trajectory so that the distance to the desired relative position along the predicted path is minimized. In this way, the robot can act foresightedly during the following task and local minima resulting from obstacles in the environment are avoided. Morales et al. [8] analyzed side-by-side trajectories of humans in order to predict the trajectory of a human walking beside a robot. Using the learned utility function, the optimal robot trajectory can then be planned locally.

Further approaches predict human trajectories to generate robot motions that do not interfere with people. Bennewitz et al. [9] proposed to predict human trajectories based on learned motion patterns to avoid interferences in tight environments. Laugier et al. [10] use hidden Markov models to represent the motion of dynamic obstacles and avoid collisions. Ziebart et al. [11] developed an approach using a Markov decision process (MDP) to learn from previously observed trajectories. The authors learn a Q-table to predict the human's trajectories for collision-free navigation. In our work, we use the method of Ziebart et al. to learn such a Q-table; however, we aim at making use of the prediction to learn how to efficiently follow a human.

Regarding the use of machine learning techniques in related problems, Goldhoorn et al. [12] proposed solving a hide-and-seek game using reinforcement learning with the aim to find a human and follow him/her to a goal location. The reward here relies on minimizing the distance between the following robot and the human during the following task. Tipaldi and Arras [13] focus on encountering a person in the environment, i.e., the robot has to find the person. The authors present a model that learns the spatio-temporal behavior of humans and use this model within an MDP to increase the encounter probability.

As opposed to all the approaches discussed above, we present a learning framework that enables a service robot to efficiently reach the intended destination of the human. Using our approach, the robot is not required to follow the human at a fixed distance, which might lead to inefficient trajectories. Instead, the robot predicts the navigation goal, constantly updates it, and adapts its actions accordingly.

III. MOTION PREDICTION

In this section, we present the motion prediction technique applied in our learning framework. We predict the human's trajectory based on the trajectory observed so far and use the prediction in the state space representation of our learning approach for generating foresighted robot actions. Our technique for motion prediction is based on the work of Ziebart et al. [11]. This approach models the sequence of motion actions performed by a human as a softened Markov decision process (MDP) whose state space corresponds to the cells of a discretized grid map of the environment. The authors propose to train a prediction model using a softened version of the Bellman equation and value iteration to get a Q-table that represents the most likely motion action performed by the human at a certain position. This softened version uses the soft-maximum function instead of the ordinary maximum to be able to reason about the distribution of the trajectories instead of considering only one single trajectory. The soft-maximum function is defined as

$$\operatorname{softmax}_x f(x) = \log \sum_x e^{f(x)} \quad (1)$$

and is used within the computation of the state and action values V(s) and Q(s, a):

$$Q(s, a) = R(s, a) + V(T(s, a)) \quad (2)$$
$$V(s) = \operatorname{softmax}_a Q(s, a) \quad (3)$$

Page 3: Learning Optimal Navigation Actions for Foresighted Robot ... · present the first solution for applications involving people following tasks that considers the efficiency of the

Here, s and a represent the human's current state and corresponding action, respectively, T(s, a) is the transition function, and R(s, a) is the reward after executing action a at the current state s. Eq. (2) and Eq. (3) are used to train the motion predictor based on a set of training trajectories with a reward function that also takes into account obstacle locations. Ziebart et al. use the obtained Q-table to predict the future destination of the trajectory observed so far, as explained in the following.
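As an illustration, a minimal sketch of such a softened value iteration on a grid MDP could look as follows; the deterministic transition table, the array layout, and the fixed iteration count are our own simplifications, not details taken from the paper:

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(reward, transitions, n_iters=200):
    """Softened value iteration in the spirit of Eq. (1)-(3).

    reward:      array of shape (S, A) with R(s, a) for every grid cell and action
    transitions: integer array of shape (S, A) with T(s, a), the successor cell
    Returns the Q-table of shape (S, A) and the value function V of shape (S,).
    """
    S, A = reward.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = reward + V[transitions]   # Q(s, a) = R(s, a) + V(T(s, a))   (Eq. 2)
        V = logsumexp(Q, axis=1)      # V(s) = softmax_a Q(s, a)         (Eq. 1, 3)
    return Q, V
```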

Let ζ_{A→B} denote the observed trajectory of the human from the initial state A to the current state B and ζ_{B→C} the future trajectory from B to the unknown destination C. The probability P(dest C | ζ_{A→B}) of a certain destination C given the observed trajectory ζ_{A→B} can then be computed with Bayes' rule, where the likelihood P(ζ_{A→B} | dest C) intuitively depends on the ratio of the reward of ζ_{A→B} and the expected value of ζ_{B→C} to the value of the whole trajectory to the destination ζ_{A→C}:

$$P(\text{dest } C \mid \zeta_{A \to B}) \stackrel{\text{Bayes}}{=} \frac{P(\zeta_{A \to B} \mid \text{dest } C)\,P(\text{dest } C)}{P(\zeta_{A \to B})} \quad (4)$$

$$= \frac{\dfrac{e^{R(\zeta_{A \to B}) + V(B \to C)}}{e^{V(A \to C)}}\, P(\text{dest } C)}{\sum_D \dfrac{e^{R(\zeta_{A \to B}) + V(B \to D)}}{e^{V(A \to D)}}\, P(\text{dest } D)} \quad (5)$$

Here, D corresponds to a destination among the set of possible destinations according to the training trajectories. The prior distribution P(dest D) is known from the training set¹. Furthermore, the reward of the trajectory R(ζ) is the sum of all individual rewards of state-action pairs according to ζ:

$$R(\zeta) = \sum_{(s, a) \in \zeta} R(s, a) \quad (6)$$

and V(X → Y) denotes the softmax value function of the trajectory from state X to state Y.
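To illustrate Eq. (5), the destination posterior can be computed in log space for numerical stability. The following sketch assumes the trajectory reward and the softmax values have already been obtained; the helper name and array layout are hypothetical:

```python
import numpy as np
from scipy.special import logsumexp

def destination_posterior(R_AB, V_B, V_A, log_prior):
    """Posterior over candidate destinations given the observed trajectory, cf. Eq. (5).

    R_AB:      reward R(zeta_{A->B}) of the trajectory observed so far (scalar)
    V_B:       array with V(B -> D) for every candidate destination D
    V_A:       array with V(A -> D) for every candidate destination D
    log_prior: array with log P(dest D) estimated from the training set
    """
    log_numerators = R_AB + V_B - V_A + log_prior
    return np.exp(log_numerators - logsumexp(log_numerators))  # normalize over destinations
```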

We can then compute the probability of the future trajectory ζ_{B→C} to the unknown future destination C given the so far observed trajectory ζ_{A→B}:

$$P(\zeta_{B \to C} \mid \zeta_{A \to B}) = P(\zeta_{B \to C} \mid \text{dest } C)\, P(\text{dest } C \mid \zeta_{A \to B}) \quad (7)$$

$$= \frac{e^{R(\zeta_{B \to C})}}{e^{V(B \to C)}}\, P(\text{dest } C \mid \zeta_{A \to B}) \quad (8)$$

$$= e^{R(\zeta_{B \to C}) - V(B \to C)}\, P(\text{dest } C \mid \zeta_{A \to B}) \quad (9)$$

The term P(dest C | ζ_{A→B}) is hereby computed using Eq. (5). This posterior probability is used to weight the conditional probability P(ζ_{B→C} | dest C) of the expected future trajectory given the destination C.

In our implementation, we assume that the human can move one step per time step in any of the eight possible directions corresponding to the neighboring cells or remain at the same state. Accordingly, our action space consists of nine actions. We set the reward to be inversely proportional to the distance from obstacles within a certain range of the human.

¹When using a motion capture system, possible destinations can be identified as places where the person frequently stays for a while without moving much. The distribution P(dest D) can be learned by counting how often the person moves between possible destinations and might depend on the starting location.

Thus, using the probability distribution about the future trajectory, we can predict the position of the human at a certain time step given its trajectory history. Note that since there is uncertainty in the prediction, the robot does not know the human's destination in advance and, thus, cannot directly plan a path towards it. Accordingly, we use the information about possible destinations in our navigation framework and learn optimal navigation actions as described in the following section.
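As an illustration of how such a prediction could be used, the following sketch rolls the greedy policy of the most probable destination forward for i steps. This is a deliberate simplification of the distribution-based prediction above, and all names are hypothetical:

```python
import numpy as np

def predict_position(s, i, Q_dest, transitions, posterior):
    """Predict the human's grid cell i time steps ahead.

    Q_dest:      Q_dest[d] is the Q-table of the prediction MDP conditioned on destination d
    transitions: transitions[s, a] gives the successor cell of action a in cell s
    posterior:   destination posterior, e.g., from destination_posterior() above
    """
    d = int(np.argmax(posterior))           # most likely destination
    for _ in range(i):
        a = int(np.argmax(Q_dest[d][s]))    # most likely motion action in cell s
        s = int(transitions[s, a])
    return s
```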

IV. LEARNING NAVIGATION ACTIONS

A. Reinforcement Learning

To model a problem as a reinforcement learning task, one needs to define a state space S that describes the relevant aspects of the situation the agent encounters and an action space A that represents the set of actions from which the agent can choose according to its policy π : S → A. The state of the agent at a time step s_t is transformed to the new state s_{t+1} after executing the action a_t according to the transition function T : S × A → S, and the agent gets an immediate reward r_t ∈ R.

The learning process is iterated multiple times until convergence is reached, where each iteration is called a learning episode. A learning episode is considered complete when a terminal state is reached at time T. During each learning episode, the value of each state-action pair Q^π(s, a) following the policy π is updated according to the achieved return R_t with

$$R_t = \sum_{i=t+1}^{T} r_i \quad (10)$$

as follows [14]:

$$Q^\pi(s_t, a_t) = E_\pi\{R_t \mid s_t, a_t\} \quad (11)$$

Thus, Q^π(s_t, a_t) denotes the expected return of taking action a_t in state s_t and following the policy π afterwards. The optimal policy maximizes the expected return, which corresponds to always choosing the action with the maximum Q-value. To update the Q-values, we apply Sarsa reinforcement learning:

$$Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t) \right] \quad (12)$$

Here, the parameters α and γ correspond to the step size and the discounting factor, respectively. In our implementation, we use a variant called Sarsa(λ) that looks several steps into the future and considers the rewards of these future steps to update Q(s_t, a_t) [15], unlike Sarsa, which looks only one step ahead. In this way, a more accurate estimate of R_t can be obtained. Throughout the learning phase, we use an ε-greedy action selection. This selection method chooses the action with the highest Q-value with probability 1 − ε and all non-greedy actions with equal probability.
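A minimal sketch of one Sarsa(λ) learning episode with accumulating eligibility traces and the described ε-greedy selection might look as follows; the environment interface, the hyperparameter values, and the defaultdict-based Q-table are assumptions for illustration:

```python
import random
from collections import defaultdict
import numpy as np

N_ACTIONS = 9  # eight moves to the neighboring cells plus standing still

def epsilon_greedy(Q, s, eps):
    """Greedy action with probability 1 - eps; otherwise a uniformly chosen non-greedy action."""
    greedy = int(np.argmax(Q[s]))
    if random.random() < eps:
        return random.choice([a for a in range(N_ACTIONS) if a != greedy])
    return greedy

def sarsa_lambda_episode(Q, env, alpha=0.1, gamma=0.95, lam=0.9, eps=0.4):
    """One Sarsa(lambda) learning episode with accumulating eligibility traces.

    Q:   defaultdict(lambda: np.zeros(N_ACTIONS)) keyed by the state of Eq. (13)
    env: hypothetical interface with reset() -> s and step(s, a) -> (s_next, r, done)
    """
    traces = defaultdict(float)                 # eligibility trace per (state, action)
    s = env.reset()
    a = epsilon_greedy(Q, s, eps)
    done = False
    while not done:
        s_next, r, done = env.step(s, a)
        a_next = epsilon_greedy(Q, s_next, eps)
        target = r if done else r + gamma * Q[s_next][a_next]
        delta = target - Q[s][a]                # TD error, cf. Eq. (12)
        traces[(s, a)] += 1.0
        for (si, ai), e in traces.items():      # propagate the error along the traces
            Q[si][ai] += alpha * delta * e
            traces[(si, ai)] = gamma * lam * e
        s, a = s_next, a_next
    return Q
```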


B. State Space S

We use a discretized representation of the environment in the form of a grid map in which static obstacles are represented as occupied cells. The state representation for reinforcement learning includes the relative distance between the current 2D position of the human x^h_t and the current 2D position of the following robot x^r_t at time t, as well as the relative distance between x^r_t and the predicted position of the human x^h_{t+i} after i further time steps:

$$s_t = \begin{bmatrix} x^h_t - x^r_t \\ x^h_{t+i} - x^r_t \end{bmatrix}. \quad (13)$$

Here, we compute x^h_{t+i} according to Eq. (9). We use relative positions instead of global positions to reduce the learning problem. The corresponding state space is much more compact than one that considers all possible combinations of current and predicted global positions. As shown in the experiments, this generalization works well in practice.
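As a small illustration, the state of Eq. (13) can be encoded, e.g., as a tuple of the two relative offsets so that it can directly index a Q-table; the encoding below is our own choice:

```python
def build_state(x_h, x_r, x_h_pred):
    """State of Eq. (13) as a hashable tuple of relative grid offsets.

    x_h, x_r, x_h_pred are (row, col) grid coordinates of the human, the robot,
    and the predicted human position.
    """
    rel_now = (x_h[0] - x_r[0], x_h[1] - x_r[1])             # x^h_t - x^r_t
    rel_pred = (x_h_pred[0] - x_r[0], x_h_pred[1] - x_r[1])  # x^h_{t+i} - x^r_t
    return rel_now + rel_pred
```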

C. Action Set A and Q-Table

The action space consists of a set of discrete moving actions to the eight neighboring cells in addition to standing still. The Q-table computed by our framework contains entries Q(s, a) for each state-action pair, where s is defined as in Eq. (13) and a is one of the nine navigation actions.

We assume that the human moves between different locations where he or she stays for a while and might need the help of the robot. One of them is the starting position of the human before moving to the next destination. Accordingly, we learn an independent Q-table for each of these designated locations within the map.
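For illustration, the per-location Q-tables can be organized, e.g., as a dictionary keyed by the designated start location; the structure and names below are assumptions, not the authors' implementation:

```python
import numpy as np

# All eight moves to the neighboring cells plus standing still (0, 0).
ACTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def greedy_action(q_tables, start_location, state):
    """Pick the highest-valued action from the Q-table of the current designated start location.

    q_tables is assumed to map each designated location to a dict from the state
    of Eq. (13) to a Q-value array over the nine actions.
    """
    q = q_tables[start_location].get(state, np.zeros(len(ACTIONS)))
    return ACTIONS[int(np.argmax(q))]
```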

D. Reward R

We designed the reward function to combine the length of the shortest path to the predicted human position with the difference between the distance traveled so far by the human and the distance traveled by the following robot. The first term aims at following the human, whereas the second term is for preferring shorter paths during the task. Accordingly, we define the intermediate reward r_t at time t as follows

$$r_t = \begin{cases} 10{,}000 & \text{if } t = T \\ -\operatorname{dist}_{A^*}(x^r_t, x^h_{t+i}) + (\operatorname{trav}^h_t - \operatorname{trav}^r_t) & \text{otherwise,} \end{cases} \quad (14)$$

where T is the final time step and dist_{A*} is a function that applies the A* algorithm on the grid map to compute the length of the shortest path between the current robot position x^r_t and the predicted position of the human x^h_{t+i}. Furthermore, trav_t refers to the distance traveled until time step t. The final state with time step T is reached when the robot is sufficiently close to the human after he has reached the next destination, which is the case when the human stays for a while close to a designated destination. The reward function leads to the generation of efficient navigation actions that minimize the cost of reaching the human's destination.
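A direct transcription of Eq. (14) could look as follows, assuming a helper astar_dist that returns the A* path length on the grid map; the function and argument names are illustrative:

```python
def reward(t, T, x_r, x_h_pred, trav_h, trav_r, astar_dist):
    """Immediate reward of Eq. (14).

    astar_dist(a, b) is assumed to return the A* shortest-path length between
    grid cells a and b on the map; trav_h and trav_r are the distances traveled
    so far by the human and the robot.
    """
    if t == T:                # terminal state: robot close to the reached destination
        return 10_000.0
    return -astar_dist(x_r, x_h_pred) + (trav_h - trav_r)
```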


Fig. 2. Maps used for the experiments: (a) Environment with a fixed start position and three possible destinations with overlapping trajectories, making the prediction more difficult. (b) Environment with several designated locations between which the human moves.

V. EXPERIMENTAL RESULTS

A. Environment Setup

We evaluated our approach in simulated and real-world experiments in two environments, each of size 4.8 m × 3.6 m. The first one contains three possible destinations reachable with overlapping trajectories from an initial position, while the other one contains trajectories between multiple designated locations (see Fig. 2). We used a map resolution of 60 cm, which yields reasonable navigation actions for the real robot with a diameter of 45 cm. Note that even if the environment consists of 48 grid cells, the size of the actual state space is much bigger since we do not only consider the relative distance between the current positions of the robot and the human but also the predicted difference in a certain number of time steps (see Eq. (13)).

Currently, we assume perfect knowledge about the robot's and the human's trajectory and that both are moving approximately with the same velocity, which only affects the prediction.

B. Trajectory Generation for Training and Testing

We randomly generated human trajectories for both the training and test phases, composed of straight line segments of 25 cm length (in the future we will use real data acquired with a motion capture system). The orientation of the segments relative to the destination is chosen uniformly from the interval [0°, 60°]. For the map in Fig. 2 (a), our training set consists of 60 trajectories, 20 for each of the three possible destinations. For the map in Fig. 2 (b), we generated a training set consisting of 15 trajectories for each of the possible combinations between the designated locations. For testing, we used five randomly generated trajectories for each motion class.

C. Parameters

During the learning of the Q-table, we used an initial value of 0.4 for ε of the greedy action selection to allow for exploring the state space and decreased the value to 0.2 after a certain number of iterations. For the execution, we used a value of 0.05, i.e., the robot chooses the action with the highest Q-value with probability 0.95. The Q-tables for our experiments were learned from a minimum of 12,000 learning episodes (learning was stopped after convergence). We consider a learning episode as successful if the distance of the robot to the human's destination is smaller than 1.2 m within a maximum number of 100 time steps; otherwise the episode is aborted. We experimentally determined a value of 3 for i for the prediction in Eq. (13) and Eq. (14) to work best with the chosen map resolution of 60 cm.
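Collected in one place, the parameter choices above amount to roughly the following configuration; the dictionary and its key names are our own illustrative summary, not the authors' code:

```python
# Illustrative summary of the parameters reported in this section.
PARAMS = {
    "epsilon_initial": 0.4,      # exploration rate at the start of learning
    "epsilon_later": 0.2,        # reduced after a number of iterations
    "epsilon_execution": 0.05,   # greedy action chosen with probability 0.95
    "min_learning_episodes": 12_000,
    "success_distance_m": 1.2,   # robot close enough to the human's destination
    "max_episode_steps": 100,
    "prediction_horizon_i": 3,   # i in Eq. (13) and Eq. (14)
    "grid_resolution_m": 0.6,
}
```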

In the learning episodes and test runs, we generated the robot's starting position randomly within a range of 60 cm around the human's initial position. In the test runs, we abort the execution in case the robot does not reach the human's destination within 25 time steps after the human arrives there.

D. Evaluation Metrics

In order to evaluate our approach, we computed the saving w.r.t. the path length by comparing the distance traveled by the robot according to our learned navigation actions to that of a naive following method, in which the robot follows the shortest path to the position of the human at each time step. Furthermore, we compared the completion time of our approach to that of a 'wait-and-observe-first' strategy, in which the robot waits until the human reaches its destination and only then starts moving along the shortest path to that destination. To show statistical significance of our results, we applied a two-tailed paired t-test with a significance level of α = 0.05.
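The gain and the significance test can be computed, for instance, as in the following sketch using per-run measurements and SciPy's paired t-test; the helper name is illustrative:

```python
import numpy as np
from scipy import stats

def evaluate(ours, baseline, alpha=0.05):
    """Relative saving of our strategy over a baseline and its significance.

    ours / baseline: per-run path lengths (or completion times) of the two
    strategies on the same test runs.
    """
    gain = 100.0 * (np.mean(baseline) - np.mean(ours)) / np.mean(baseline)
    t_stat, p_value = stats.ttest_rel(ours, baseline)   # two-tailed paired t-test
    return gain, p_value < alpha
```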

E. Experiments in Simulation

We performed 250 runs for each test trajectory to alleviate the randomness arising from the generation of the initial position of the robot as well as the randomness involved in the generation of the actions. As can be seen from Tab. I (first and third row, first column), our approach significantly outperforms the naive following strategy w.r.t. the trajectory length in both environments. Note that in the environment of Fig. 2 (b), in some cases the robot can directly predict the correct navigation goal after one step and immediately move there. Depending on the human's and the robot's start location, the robot then reaches this destination faster than the human.

Furthermore, we added a number of non-goal-directed trajectories to the test set (25%) of the environment in Fig. 2 (a) to show the effectiveness of the robot behavior generated by our learning framework and its ability to handle unexpected trajectories. These trajectories were not included in the training of the Q-table for the navigation actions. In the corresponding runs, the human trajectory leads around the obstacle in the left part of the map (similar to the human's trajectory in Fig. 5). This can be seen as the case where the human moves to fetch some items and then continues walking to the actual destination. If such scenarios are included in the test set, we even achieve an overall average gain of 19.1% (see Tab. I, second row, first column).


Fig. 3. Experiment in which the human first makes a large detour before walking towards his intended destination. (a) The robot does not follow the human but waits first. (b) Only as the human continues moving in the direction of the actual destination, the robot follows and, thus, avoids the detour that would have resulted from closely following the human.

When comparing our approach to the 'wait-and-observe-first' strategy regarding completion time, we achieve an average gain of 13.1% to 14.6% (Tab. I, second column).

In 8% of the test runs, the robot was caught in local minima and did not reach the destination within a fixed time limit while choosing actions according to the Q-table. The corresponding runs are not included in the evaluation. Note that we applied only the Q-table for action selection in our approach and did not combine it with moving on the shortest path to the actual destination after the human arrived there, which would be reasonable in practice and improve the performance in some cases, in particular when facing the local minimum problem mentioned above.

We performed an additional experiment to illustrate the strength of our approach. Here, the human was only walking to his final destination after performing a larger detour (see Fig. 3 (a)). This unexpected behavior was correctly handled by the robot, as it did not follow the human but waited. Only as the human continued his path towards the destination, the robot started moving again (see Fig. 3 (b)). Thus, the robot could avoid the large detour that would have resulted from closely following the human.

F. Experiments with a Real Robot

We also carried out experiments with a real robot (Robotino by Festo) to test the learned navigation strategy. To detect the position of the human and localize the robot, we used an external motion capture (MoCap) system. We focused on the scenario where the trajectory of the human does not directly lead to his/her destination.

TABLE I
GAIN IN TRAVELED DISTANCE AND COMPLETION TIME

Environment   Test trajectories   Gain in distance   Gain in time   Statistically significant (α = 0.05): Dist. / Time
Fig. 2 (a)    Basic set           7.9%               13.1%          Yes / Yes
Fig. 2 (a)    With cycles         19.1%              14.6%          Yes / Yes
Fig. 2 (b)    Basic set           18.3%              14.2%          Yes / Yes

Fig. 4. (a) The human starts walking towards his (unknown) destination and the robot executes navigation actions to follow him. (b) The human does not move directly to his destination but walks in a cycle around an object. (c) The robot, behaving according to our learned policy, ignores this cycle and waits. (d) The human continues walking towards his destination, followed by the robot.


Fig. 5. Trajectories of the human and the robot corresponding to the experiment shown in Fig. 4, recorded by the MoCap system. (a) The robot's trajectory is rather efficient, as the robot correctly predicts that the human will continue moving to one of the destinations in the area on top and therefore waits for the human halfway. (b) Both human and robot resume the path to the destination.

As shown in Fig. 4, the human performed an inefficient trajectory by walking around the obstacle in the left part instead of directly going to his/her goal location. As can be seen, the follower robot ignored this cycle and waited instead, before proceeding to follow the human afterwards to the intended navigation goal. The corresponding path (see Fig. 5) is much shorter than the human's trajectory.

VI. CONCLUSIONS

We developed an approach to generating foresighted navigation behavior for an assistance robot. We consider the scenario in which the human moves between different designated locations where he or she might interact with the robot. Thus, the task of the robot is to reach these places while minimizing trajectory length and completion time. Our framework relies on a learned prediction model for the human's motions that is used to reason about the future trajectory and target destination. Based on that prediction and the current robot position, we apply reinforcement learning to generate the optimal navigation action for the robot. As the experiments carried out in simulation and with a real mobile robot demonstrate, our approach leads to foresighted navigation behavior, resulting in significantly shorter paths and a significantly reduced completion time compared to naive following strategies.

We are currently extending our system to use the robot's on-board sensors for people tracking, which will lead to occlusions and uncertainty about the human's position. So the robot will also need to learn when to consider active re-localization of the human based on its predicted motions.

REFERENCES

[1] L. Huang, "Control approach for tracking a moving target by a wheeled mobile robot with limited velocities," Control Theory & Applications, IET, vol. 3, no. 12, pp. 1565–1577, 2009.
[2] I. Harmati and K. Skrzypczyk, "Robot team coordination for target tracking using fuzzy logic controller in game theoretic framework," Robotics and Autonomous Systems, vol. 57, no. 1, pp. 75–86, 2009.
[3] T. P. Nascimento, A. P. Moreira, and A. G. Scolari Conceição, "Multi-robot nonlinear model predictive formation control: Moving target and target absence," Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1502–1515, 2013.
[4] N. Pradhan, T. Burg, S. Birchfield, and U. Hasirci, "Indoor navigation for mobile robots using predictive fields," in Proc. of the American Control Conference, 2013.
[5] S. Nishimura, H. Takemura, and H. Mizoguchi, "Development of attachable modules for robotizing daily items – Person following shopping cart robot," in Proc. of the IEEE Int. Conf. on Robotics and Biomimetics (ROBIO), 2007.
[6] E. Prassler, D. Bank, B. Kluge, and M. Hägele, "Key technologies in robot assistants: Motion coordination between a human and a mobile robot," Transactions on Control, Automation and Systems Engineering, vol. 4, no. 1, 2002.
[7] M. Kuderer and W. Burgard, "An approach to socially compliant leader following for mobile robots," in Social Robotics, ser. Lecture Notes in Computer Science, M. Beetz, B. Johnston, and M.-A. Williams, Eds., vol. 8755. Springer International Publishing, 2014.
[8] Y. Morales, S. Satake, R. Huq, D. Glas, T. Kanda, and N. Hagita, "How do people walk side-by-side? Using a computational model of human behavior for a social robot," in ACM/IEEE Int. Conf. on Human-Robot Interaction, 2012.
[9] M. Bennewitz, W. Burgard, G. Cielniak, and S. Thrun, "Learning motion patterns of people for compliant robot motion," Int. Journal of Robotics Research (IJRR), vol. 24, no. 1, 2005.
[10] C. Laugier, S. Petti, D. Vasquez, M. Yguel, T. Fraichard, and O. Aycard, "Steps towards safe navigation in open and dynamic environments," in Autonomous Navigation in Dynamic Environments. Springer, 2007, pp. 45–82.
[11] B. D. Ziebart, N. Ratliff, G. Gallagher, C. Mertz, K. Peterson, J. A. Bagnell, M. Hebert, A. K. Dey, and S. Srinivasa, "Planning-based prediction for pedestrians," in Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2009, pp. 3931–3936.
[12] A. Goldhoorn, A. Garrell, R. Alquezar, and A. Sanfeliu, "Continuous real time POMCP to find-and-follow people by a humanoid service robot," in Proc. of the IEEE-RAS Int. Conf. on Humanoid Robots (Humanoids), 2014.
[13] G. Tipaldi and K. Arras, "I want my coffee hot! Learning to find people under spatio-temporal constraints," in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2011.
[14] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning. MIT Press, 1998.
[15] M. Wiering and J. Schmidhuber, "Fast online Q(λ)," Machine Learning, vol. 33, no. 1, pp. 105–115, October 1998.