



High-level robot behavior control using POMDPs

Joelle Pineau and Sebastian Thrun
Robotics Institute and Computer Science Department
Carnegie Mellon University
5000 Forbes Ave, Pittsburgh PA 15213
{jpineau,thrun}@cs.cmu.edu

Abstract

This paper describes a robot controller which uses probabilistic decision-making techniques at the highest level of behavior control. The POMDP-based robot controller has the ability to incorporate noisy and partial sensor information, and can arbitrate between information gathering and performance-related actions. The complexity of the robot control domain requires a POMDP model that is beyond the capability of current exact POMDP solvers; we therefore present a hierarchical variant of the POMDP model which exploits structure in the problem domain to accelerate planning. This POMDP controller is implemented and tested onboard a mobile robot in the context of an interactive service task. During the course of experiments conducted in an assisted living facility, the robot successfully demonstrated that it could autonomously provide guidance and information to elderly residents with mild physical and cognitive disabilities.

Introduction

High-level robot control has been a popular topic in AI, and decades of research have led to a reputable collection of architectures (e.g., (Arkin 1998; Brooks 1985; Gat 1996)). However, existing architectures rarely take uncertainty into account during planning. In this paper we describe a high-level robot control system that uses probabilistic decision-making to act under uncertainty.

Partially Observable Markov Decision Processes (POMDPs) (Sondik 1971) are techniques for calculating optimal control actions under uncertainty. They extend the well-known Markov Decision Processes (MDPs) (Howard 1960) to domains where considerations of noise and state uncertainty are crucial to good performance. They are useful for a wide range of real-world domains where joint planning and tracking is necessary, and have been successfully applied to problems of robot navigation (AAAI 1998; Simmons & Koenig 1995; Nourbakhsh, Powers, & Birchfield 1995; Roy & Thrun 2000) and robot interaction (Darrell & Pentland 1996; Roy, Pineau, & Thrun 2000).

In this paper we describe a system that uses POMDPs at the highest level of behavior control, in contrast with existing POMDP applications in robotics where POMDP control is limited to specialized modules. We propose a robot control architecture where a POMDP performs high-level control by arbitrating between information gathering and performance-related actions, as well as negotiating over goals from different specialized modules. The POMDP also incorporates high-level uncertainty obtained through both navigation sensors (e.g. laser range-finder) and interaction sensors (e.g. speech recognition and touchscreen).

Unfortunately, POMDPs of the size necessary for good robot control are an order of magnitude larger than today's best exact POMDP algorithms can tackle (Kaelbling, Littman, & Cassandra 1998). However, the robot controller domain yields a highly structured POMDP, where certain actions are only applicable in certain situations. To exploit this structure, we developed a hierarchical version of POMDPs, which breaks down the decision making problem into a collection of smaller problems that can be solved more efficiently. Our approach is similar to the MAXQ decomposition for MDPs (Dietterich 2000), but defined over POMDPs (where states are unobserved).

Finally, we apply the high-level POMDP robot controller to the real-world task of guiding elderly people in a nursing home. In systematic experiments, the robot successfully demonstrated that it could autonomously provide guidance for elderly residents, and we found the POMDP approach to be highly effective.

Review of POMDPs

This section provides a brief overview of the essential concepts in POMDPs (see (Kaelbling, Littman, & Cassandra 1998) for a more complete description of the POMDP problem formulation).

A POMDP model is an n-tuple, $\langle S, A, O, b_0(s), T(s,a,s'), O(s,a,o), R(s,a) \rangle$, consisting of:

- States: A set of states, $S = \{s_1, s_2, \ldots\}$, describes the problem domain. The domain is assumed to be in a specific state $s_t$ at any point in time.
- Actions: A set of actions, $A = \{a_1, a_2, \ldots\}$, describes the agent's interaction with the domain. At any point in time the agent applies an action, $a_t$, through which it affects the domain.
- Observations: A set of observations, $O = \{o_1, o_2, \ldots\}$, describes the agent's perception of the domain. A received observation, $o_t$, may only partially reflect the current state.
- Rewards: A set of numerical costs/rewards, $R(s_t, a_t)$, describes the reinforcement received by the agent throughout its interaction with the domain.

To fully characterize a specific POMDP model, the following probability distributions must be specified:

- Initial state probability distribution:

  $$b_0(s) := \Pr(s_0 = s) \qquad (1)$$

  is the probability that the domain is in state $s$ at time $t = 0$. This distribution is defined over all states in $S$.
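To make these components concrete, the following minimal sketch (ours, not from the paper) stores a tabular POMDP in numpy arrays indexed exactly as in equations (1)-(3); all class and field names are illustrative.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    """Tabular POMDP; field names are illustrative, not the paper's notation."""
    b0: np.ndarray  # b0[s]       = Pr(s_0 = s),          shape (|S|,)
    T: np.ndarray   # T[s, a, s'] = Pr(s' | s, a),        shape (|S|, |A|, |S|)
    Z: np.ndarray   # Z[s', a, o] = Pr(o | s', a),        shape (|S|, |A|, |O|)
    R: np.ndarray   # R[s, a]     = immediate reward,     shape (|S|, |A|)

    def validate(self) -> None:
        # Each distribution of equations (1)-(3) must sum to one.
        assert np.allclose(self.b0.sum(), 1.0)
        assert np.allclose(self.T.sum(axis=2), 1.0)
        assert np.allclose(self.Z.sum(axis=2), 1.0)
```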


- State transition probability distribution:

  $$T(s, a, s') := \Pr(s_t = s' \mid s_{t-1} = s, a_{t-1} = a) \qquad (2)$$

  is the probability of transitioning to state $s'$, given that the agent is in state $s$ and selects action $a$. This is defined for all $(s, a, s')$ triplets and assumes $\sum_{s' \in S} T(s, a, s') = 1$, $\forall (s, a)$.

- Observation probability distribution:

  $$O(s, a, o) := \Pr(o_t = o \mid s_t = s, a_{t-1} = a) \qquad (3)$$

  is the probability that the agent will perceive observation $o$, given that it is in state $s$ and has applied action $a$. This is defined for all $(s, a, o)$ triplets and assumes $\sum_{o \in O} O(s, a, o) = 1$, $\forall (s, a)$.

At any given point in time, the system is assumed to be in some state $s_t$, which may not be completely observable, but is partially observable through observation $o_t$. In general, it is not possible to determine the current state with complete certainty. Instead, a belief distribution is maintained to succinctly represent the history of the agent's interaction (both applied and perceived) with the domain:

- Belief:

  $$b_t(s) := \Pr(s_t = s \mid o_t, a_{t-1}, o_{t-1}, \ldots, a_0, b_0) \qquad (4)$$

  describes the probability that, at time $t$, the agent is in state $s_t$, given the history $\{o_t, a_{t-1}, o_{t-1}, \ldots, a_0\}$ and the initial belief $b_0$. This distribution is defined over all states in $S$.

Assuming a POMDP model as defined above, we now discuss two interesting problems in POMDPs. The first is state tracking, that is, the computation of the belief state at each time step. The second is the optimization of an action selection policy.

State Tracking

To operate in its domain and apply a belief-conditioned policy, an agent must constantly update its belief distribution. Equation 5 defines the update rule for computing a posterior belief, $b'$, given the belief at the previous time step, $b$, and the latest action/observation pair, $(a, o)$:

$$b'(s') = \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)} \qquad (5)$$

For most POMDP domains, state tracking is easy relative to the problem of computing a useful action selection policy. For very large-scale problems the belief updating may become problematic; this has been addressed by earlier work in the context of dynamic Bayesian networks (Boyen & Koller 1998; Poupart & Boutilier 2000) and probabilistic tracking (Thrun et al. 2000b).
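As an illustration, equation 5 translates directly into a few lines of numpy. This sketch reuses the tabular POMDP container introduced earlier (our notation, not the paper's code).

```python
import numpy as np

def belief_update(model: "POMDP", b: np.ndarray, a: int, o: int) -> np.ndarray:
    """One step of equation (5): b'(s') ∝ O(s',a,o) * sum_s T(s,a,s') b(s)."""
    # Predict: push the belief through the transition model for action a.
    predicted = b @ model.T[:, a, :]          # shape (|S|,)
    # Correct: weight each successor state by the observation likelihood.
    unnormalized = model.Z[:, a, o] * predicted
    evidence = unnormalized.sum()             # Pr(o | a, b), the denominator
    if evidence == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / evidence
```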

Computing a Policy

A POMDP policy, $\pi$, is an action-selection strategy. It is formally defined as a mapping from belief to action selection:

$$\pi_{POMDP} : b_t \mapsto a \qquad (6)$$

In this, POMDPs differ significantly from MDPs, where the policy is a mapping from state to action:

$$\pi_{MDP} : s_t \mapsto a \qquad (7)$$

The goal of planning, or POMDP problem solving, is to learn an action-selection policy that maximizes the (possibly discounted) sum of expected future reward up to some time $T$:

$$E\left[ \sum_{\tau=0}^{T} r(\tau) \right] \qquad (8)$$

where $r(\tau)$ is the reward at time $\tau$.

The most straightforward approach to finding POMDP policies remains the value iteration approach, where iterations of dynamic programming are applied to compute increasingly more accurate values for each belief state $b$:

$$V : B \mapsto \Re \qquad (9)$$

After convergence, the value is the expected sum of all (possibly discounted) future pay-offs the agent expects to receive up to time $T$, if the correct belief is $b$. A remarkable result by Sondik (Sondik 1971) shows that for a finite-horizon problem, the value function is a piecewise linear, convex, and continuous function of the belief, with finitely many linear pieces.

Unfortunately however, the exact algorithms used to compute optimal policies are bounded by a doubly exponential computational growth in the planning horizon, and in practice are often at least exponential. More specifically, a single step of value iteration is on the order of:

$$|\Gamma_t| = O\!\left(|A| \cdot |\Gamma_{t-1}|^{|O|}\right) \qquad (10)$$

where $|\Gamma_{t-1}|$ represents the number of components necessary to represent the value function at iteration $t-1$. This points to the need for more efficient algorithms.
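To get a feel for equation (10), one can unroll the recurrence for a toy problem; the numbers below assume the worst case, with no pruning of dominated pieces.

```python
# Worst-case growth of |Γ_t| = |A| * |Γ_{t-1}|^|O| for a toy problem
# with 5 actions, 3 observations, and one linear piece at the horizon.
n_actions, n_obs, pieces = 5, 3, 1
for t in range(1, 5):
    pieces = n_actions * pieces ** n_obs
    print(f"iteration {t}: up to {pieces:.3g} linear pieces")
# prints 5, 625, ~1.22e9, ~9.1e27: doubly exponential in the horizon
```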

Recent efforts have focused on the development of efficient algorithms that use value-approximation techniques to generate near-optimal policies for large domains (Hauskrecht 2000; Littman, Cassandra, & Kaelbling 1995). However, many of these algorithms, though scalable and highly effective in many domains, generally gain computational advantage at the expense of information-gathering considerations, thus making them inappropriate for domains where information gains are crucial to good performance.

An alternative approach for scaling decision-making is to exploit hierarchical structure and partition a complex problem into many smaller problems. The idea of leveraging structure has been extensively studied in the context of Markov Decision Processes (MDPs) (Singh 1992; Dayan & Hinton 1993; Kaelbling 1993; Dean & Lin 1995; McGovern et al. 1998; Parr & Russell 1998; Dietterich 2000). These algorithms do not extend naturally to POMDPs because they define structured hierarchies in terms of modular subtasks with fully observable start and termination conditions. We now describe a novel algorithm to perform hierarchical POMDP planning and execution.

Hierarchical POMDPs

The basic idea of the hierarchical POMDP is to partition the action space (not the state space, since the state is not fully observable) into smaller chunks. Therefore the cornerstone of our hierarchical algorithm is an action hierarchy. We assume that it is provided by a designer and represents the structural prior knowledge included to facilitate the problem solution. Figure 1 illustrates the basic concept of an action hierarchy.

Formally, an action hierarchy is a tree, where each leaf is labeled by an action from the target POMDP problem's action set. Each action $a \in A$ (henceforth called a primitive action) must be attached to at least one leaf (e.g. RingDoorBell, GotoPatientRoom, etc.). At each internal node (shown as circles in figure 1) we introduce an abstract action. Each of these provides an abstraction of the actions in the nodes directly below it in the hierarchy (e.g. Contact is an abstraction of RingDoorBell and GotoPatientRoom). Throughout this paper we use the notation $\bar{a}_h$ to denote an abstract action.

[Figure 1: Sample Action Hierarchy. A tree of abstract actions (internal nodes: Act, Contact, Assist, Remind, Move, Inform, Rest) over the primitive actions RingDoorBell, GotoPatientRoom, GuidetoPhysio, CheckUserPresent, TerminateGuidance, ConfirmGuidetoPhysio, RemindPhysioAppt, RemindVitamin, Recharge, GotoHome, UpdateChecklist, ConfirmDone, ConfirmGoHome, TellTime, TellWeather, ConfirmWantTime, ConfirmWantWeather, and VerifyInfoRequest.]

A key step towards hierarchical problem solving is to use the action hierarchy to translate the original full POMDP task into a collection of smaller POMDPs. The goal is to achieve a collection of POMDPs that individually are smaller than the original POMDP, yet collectively define a complete policy.

We propose that each internal node $\bar{a}_h$ in the action hierarchy spans a corresponding subtask $P_h$. The subtask is a well-defined POMDP composed of:

- a state space $S_h$, identical to the full original state space $S$;
- an observation space $O_h$, identical to the full original observation space $O$;
- an action space $A_h$, containing the children nodes (both primitive and abstract) immediately below $\bar{a}_h$ in the subtask.

For example, figure 1 shows a problem that has been divided into seven subtasks, where subtask $P_{Remind}$ has action set $A_{Remind} = \{a_{RemindPhysioAppt}, a_{RemindVitamin}, a_{UpdateChecklist}, \bar{a}_{Contact}\}$, and so on.
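To illustrate, the hierarchy can be represented as a simple tree whose internal nodes are abstract actions and whose leaves are primitive actions. The Remind fragment below follows our reading of figure 1 (names from the figure; the exact membership is partly inferred).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node in the action hierarchy; leaves are primitive actions."""
    name: str
    children: list["Node"] = field(default_factory=list)

    @property
    def is_primitive(self) -> bool:
        return not self.children

# Fragment of figure 1: the Remind subtask and its children.
contact = Node("Contact", [Node("RingDoorBell"), Node("GotoPatientRoom")])
remind = Node("Remind", [contact,
                         Node("RemindPhysioAppt"),
                         Node("RemindVitamin"),
                         Node("UpdateChecklist")])

# The action set A_Remind of the spanned subtask is the immediate children.
print([child.name for child in remind.children])
```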

Given that the action hierarchy spans a collection of separate POMDP subtasks, we can independently optimize a policy for each subtask, such that we obtain a collection of corresponding local policies:

DEFINITION: Given $P_h$, a POMDP subtask with action set $A_h$, we say that $\pi_h$, a POMDP policy defined over action subset $A_h$, is a local policy.

Using this representation, we introduce two key algorithms: a hierarchical planning algorithm which is used to optimize local policies for all subtasks, and a hierarchical execution algorithm which is used to extract a global action policy from the collection of local policies.

Hierarchical POMDP Execution

We first present the execution algorithm, which assumes that local policies have been computed for all subtasks, and from these extracts a global policy mapping belief to action. Rather than computing a full global policy, we propose an online algorithm that consults the local policies at every time step. Before doing the policy lookup, the belief is first updated using equation 5 (except at time $t = 0$, where the initial belief is $b_0$). The execution algorithm then begins by invoking the policy of the top subtask in the hierarchy, and then operates through a sequence of recursive calls such that the hierarchy is traversed in a top-down manner, invoking a sequence of local policies until a primitive action is selected. The precise procedure is described in the EXECUTE function of table 1. The routine is initially invoked using the top subtask as the first argument, $h$, and the current belief as the second argument, $b_t$.

EXECUTE(h, b_t)
    Let a* = π_h(b_t)
    If a* is a primitive action
        Return a*
    Else if a* is an abstract action (i.e. ā_k)
        Let P_k be the subtask spanned by ā_k
        Return EXECUTE(P_k, b_t)
    end
end

Table 1: Execution function
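In code, table 1 is a short recursion. The sketch below assumes the tree representation from the earlier fragment, plus a caller-supplied `local_policy(subtask, belief)` standing in for the solved π_h of each subtask (an assumption of this sketch, not the paper's interface).

```python
from typing import Callable

def execute(subtask: Node, belief, local_policy: Callable) -> str:
    """Table 1 as code: top-down lookup, repeated from the root every step.

    `local_policy(subtask, belief)` must return one of `subtask`'s child
    nodes; here it is a hypothetical stand-in for the solved pi_h.
    """
    chosen = local_policy(subtask, belief)
    if chosen.is_primitive:
        return chosen.name          # a primitive action: hand it to the robot
    # `chosen` is an abstract action; descend into the subtask it spans.
    return execute(chosen, belief, local_policy)
```

Called once per time step on the freshly updated belief, this reproduces the polling behavior discussed next.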

It is important to emphasize that the full top-down trace through the hierarchy is repeated at every time step. This is a departure from many hierarchical MDP planning algorithms, which operate within a given subtask for multiple time steps until a terminal state is reached. It is however consistent with Dietterich's polling approach (Dietterich 2000), and in our approach is strongly motivated by the fact that the partial observability characteristic of POMDPs would limit the detection of terminal states, thus preventing us from using the execution approach common to hierarchical MDPs.

Hierarchical POMDP Planning

We now describe the planning algorithm, which is responsible for optimizing the collection of local policies. Table 2 describes the planning algorithm in pseudo-code.

PLAN(h)
    forall primitive actions a_p in h
        PARAMETERIZE(a_p)
    end
    forall abstract actions ā_k in h
        Let P_k be the subtask spanned by ā_k
        PLAN(P_k)
        PARAMETERIZE(ā_k)
    end
    SOLVE(h)
end

Table 2: Planning function

Table2: Planningfunction

The recursive PLAN(h) routine is responsible for traversing the hierarchy and is initially invoked using the top subtask in the hierarchy as the argument. However, it traverses the hierarchy in a bottom-up manner. Namely, each subtask is solved only after all the subtasks directly below it in the hierarchy have been solved. The routine uses a simple two-part process, applied in succession to all subtasks in the hierarchy.

- PART 1 (PARAMETERIZE): Infer the necessary parameters for the given subtask.
- PART 2 (SOLVE): Apply policy optimization to the subtask.

The SOLVE function contains the actual policy optimization subroutine, which is implemented as a call to an exact POMDP planning algorithm: the incremental pruning algorithm. This exact POMDP solution is described in detail in (Cassandra, Littman, & Zhang 1997); implemented code can be obtained from (Cassandra 1999). The assumption is that each subtask is sufficiently small to be solved exactly, yet the full POMDP problem is much too large for an exact solution.
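The bottom-up traversal of table 2 is equally compact. In the sketch below, `parameterize` and `solve` are caller-supplied stubs standing in for equations (11)-(17) and for the exact incremental pruning solver respectively; this is our sketch, not the authors' code.

```python
from typing import Callable

def plan(subtask: Node, parameterize: Callable, solve: Callable) -> None:
    """Table 2 as code: parameterize children first, then solve this subtask.

    `parameterize(node)` stands in for equations (11)-(14) on primitive
    children and (15)-(17) on abstract children (which need their own local
    policy, hence the recursion happens first); `solve(subtask)` stands in
    for the exact solver, e.g. incremental pruning.
    """
    for child in subtask.children:
        if not child.is_primitive:
            plan(child, parameterize, solve)  # solve child subtasks first
        parameterize(child)                   # then infer this child's parameters
    solve(subtask)                            # finally solve this subtask exactly
```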

The PARAMETERIZE routine is used to infer the necessary model parameters (transition, observation, and reward) for each subtask. We assume we have a known model of the original full POMDP task, and use these to infer a subset of parameters for each subtask. We consider two separate issues: that of defining parameters for primitive actions (i.e. PARAMETERIZE($a_p$)), and that of defining parameters for abstract actions (i.e. PARAMETERIZE($\bar{a}_k$)). In general, the parameters that are conditioned on primitive actions can be directly extracted from the original parameter set. We formally define:

Initial belief:
$$b_h(s) := b_0(s), \quad \forall s \in S \qquad (11)$$

Transition probabilities:
$$T_h(s, a_p, s') := T(s, a_p, s'), \quad \forall (s, s') \in S, \; \forall a_p \in A_h \qquad (12)$$

Observation probabilities:
$$O_h(s, a_p, o) := O(s, a_p, o), \quad \forall s \in S, \; \forall o \in O, \; \forall a_p \in A_h \qquad (13)$$

Reward function:
$$R_h(s, a_p) := R(s, a_p), \quad \forall s \in S, \; \forall a_p \in A_h \qquad (14)$$

Parameterization of the abstract actions is the main motivation for adopting a bottom-up procedure during planning. A key insight is the fact that the algorithm uses the local policy learned for subtask $P_k$ when modelling the corresponding abstract action $\bar{a}_k$ in the context of the higher-level subtask. Going back to the example in figure 1, the goal is to first learn a local policy for subtask $P_{Contact}$, and then use this policy to infer model parameters for action $\bar{a}_{Contact}$, such that it is then possible to proceed and apply SOLVE($P_{Remind}$). Equations 15-17 describe the exact procedure for inferring those parameters.

Transition probabilities:
$$T(s, \bar{a}_k, s') := T(s, \pi_k(s), s'), \quad \forall (s, s') \in S, \; \forall \bar{a}_k \qquad (15)$$

Observation probabilities:
$$O(s, \bar{a}_k, o) := O(s, \pi_k(s), o), \quad \forall s \in S, \; \forall o \in O, \; \forall \bar{a}_k \qquad (16)$$

Reward function:
$$R(s, \bar{a}_k) := R(s, \pi_k(s)), \quad \forall s \in S, \; \forall \bar{a}_k \qquad (17)$$
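In the tabular representation, equations 15-17 amount to indexing the full model with the child's local policy. The sketch below assumes that policy has been projected down to a per-state action choice `pi_k[s]` (one possible reading; the paper's local policies are belief-conditioned).

```python
import numpy as np

def parameterize_abstract(model: "POMDP", pi_k: np.ndarray):
    """Equations (15)-(17): the abstract action ā_k behaves, in each state s,
    like the action pi_k[s] its local policy would select there.
    """
    s_idx = np.arange(model.T.shape[0])
    T_bar = model.T[s_idx, pi_k, :]   # T(s, ā_k, s') = T(s, π_k(s), s')
    Z_bar = model.Z[s_idx, pi_k, :]   # O(s, ā_k, o)  = O(s, π_k(s), o)
    R_bar = model.R[s_idx, pi_k]      # R(s, ā_k)     = R(s, π_k(s))
    return T_bar, Z_bar, R_bar
```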

An important assumption of our approach, implicit to the above discussion of local policy optimization, is that each subtask must contain some discriminative reward information. This means that we cannot have a uniform reward function over all actions and/or states, otherwise local policies could not be meaningfully optimized. This is a common assumption in hierarchical reinforcement learning (Dietterich 2000), and the assumption is easily met in the robot control domain, where the cost of accomplishing various tasks naturally provides the necessary local reward structure. In domains where this assumption is not met, we can (as suggested in (Dietterich 2000)) introduce reward shaping to provide subtasks with the required local reward information.

Stateand Observation AbstractionThe hierarchicalalgorithm, as describedso far, proposesto reduce the computationaloverhead of POMDP plan-ning by partitioning the action set. In terms of compu-tational complexity, the numberof linear piecesrepresent-ing an exact POMDPvaluefunction is recursively given by:: gO$3: � �½*�:#i:#: gO$�<��: j kOj ,

, and can be enumeratedin time:�I*�:#�V: � :#i:#: gO$�<>��: j k\j ,. Thereforeour hierarchicalPOMDPal-

gorithm,by solvingsubtaskswith reducedactionsets,cansig-nificantly reducethecomputationalcomplexity of computingPOMDPpolicies. However thesesavings arepartially offsetby the fact that we now have to computemany policies, in-steadof just one. We now discusshow in many domainsit ispossibleto further reducecomputationalcostsby alsoapply-ing stateandobservationabstraction.Thekey ideais to definereducedstateandobservationsetsfor eachsubtaskandapplyplanningusingthissmallerrepresentation.

The formal conditions under which to apply exact state and observation abstraction are directly related to the model parameters. We consider a POMDP subtask $P_h$, with action set $A_h$, state set $S_h$ and observation set $O_h$, where the state set is spanned by state features $X_h = \{X_1, X_2, \ldots, X_n\}$ and the observation set is spanned by features $Z_h = \{Z_1, Z_2, \ldots, Z_m\}$. We now consider a case where the state features can be divided into two disjoint sets, $X^+$ and $X^-$, and similarly observation features can be divided into two disjoint sets, $Z^+$ and $Z^-$. We say that state features $X^-$ and observation features $Z^-$ are irrelevant to subtask $P_h$ if, $\forall a \in A_h$:

$$R(X^+, X^-, a) = R(X^+, a)$$
$$\Pr(X^{+\prime}, X^{-\prime} \mid X^+, X^-, a) = \Pr(X^{+\prime} \mid X^+, a)\,\Pr(X^{-\prime} \mid X^+, X^-, a) \qquad (18)$$
$$\Pr(Z^+, Z^- \mid X^+, X^-, a) = \Pr(Z^+ \mid X^+, a)\,\Pr(Z^- \mid X^-, a)$$

Figure 2 illustrates these constraints in the form of a dynamic belief network. In essence, we see that state features in set $X^-$ have no effect on the reward function, and furthermore provide no transition or observation information regarding those features (namely $X^+$) that do have an effect on the reward function. Consequently, state features $X^-$ and observation features $Z^-$ can have no effect on the value function, and therefore can be safely ignored. It is important to note that the feature irrelevance condition must hold for all actions (primitive and abstract) in a given subtask.
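The observation factorization in condition 18 can be checked numerically when the joint tables are available. This is a sketch under the assumption that, for a fixed action, the observation model is given as an array indexed [X+, X-, Z+, Z-]; all names are hypothetical.

```python
import numpy as np

def z_factorizes(P_joint: np.ndarray) -> bool:
    """Check Pr(Z+, Z- | X+, X-, a) == Pr(Z+ | X+, a) * Pr(Z- | X-, a),
    where P_joint is indexed [x_plus, x_minus, z_plus, z_minus] for one a.
    """
    p_zp = P_joint.sum(axis=3)               # Pr(Z+ | X+, X-)
    p_zm = P_joint.sum(axis=2)               # Pr(Z- | X+, X-)
    if not np.allclose(p_zp, p_zp[:, :1, :]):    # Z+ must not depend on X-
        return False
    if not np.allclose(p_zm, p_zm[:1, :, :]):    # Z- must not depend on X+
        return False
    # The joint must equal the product of the two single-feature marginals.
    product = p_zp[:, :1, :, None] * p_zm[:1, :, None, :]
    return np.allclose(P_joint, product)
```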

For general POMDP planning, applying abstraction is equivalent to finding a minimum-size representation for the problem, but once the problem is specified there is little opportunity for further abstraction. In the context of hierarchical planning however, abstraction can be applied independently to each subtask (thus reducing $|S|$ and $|O|$) without influencing the policy optimization any further than what is attributed to the action decomposition. The overall resulting computational savings can be tremendous (several orders of magnitude).

[Figure 2: Dynamic belief network illustrating the state/observation abstraction conditions over features $X^+$, $X^-$, $Z^+$, $Z^-$ and reward $R$ (dashed lines indicate state/observation features which can safely be ignored).]

It is important to note that state and observation abstraction is a natural consequence of the action decomposition. In fact, we observe that action decomposition and state/observation abstraction often work hand in hand. In general, a domain with a structured state set will lend itself to hierarchical planning without significant performance loss. Furthermore, the same structure that gives rise to a good action decomposition often allows substantial state and observation abstraction.

A Real-World Application Domain

In this section, we present results from the real-world implementation and testing of our hierarchical POMDP robot controller in the context of an interactive robot-human task. This implementation was conducted as part of a larger project dedicated to the development of a prototype nursing assistant robot. The overall goal of the project is to develop personalized robotic technology that can play an active role in providing improved care and services to non-institutionalized elderly people. The target user is an elderly individual with mild cognitive and/or physical impairment.

Robot Platform

The robot Pearl (shown in figure 3 interacting with some users) is the primary test-bed for the behavior management system. The robot uses a differential drive system and is equipped with two on-board Pentium PCs, wireless Ethernet, a SICK laser range finder, sonar sensors, a microphone for speech recognition, speakers for speech synthesis, a touch-sensitive graphical display, an actuated head unit, and a stereo camera system.

On the software side, the robot features an off-the-shelf autonomous mobile robot navigation system (Burgard et al. 1999; Thrun et al. 2000a), speech recognition software (Ravishankar 1996), speech synthesis software (Black, Taylor, & Caley 1999), fast image capture and compression software for online video streaming, and face detection and tracking software (Rowley, Baluja, & Kanade 1998). A final software component is a prototype of a flexible reminder system using advanced planning and scheduling techniques (McCarthy & Pollack 2002). For information on additional research involving this robot, the reader is referred to (Montemerlo et al. 2002).

[Figure 3: Pearl, the robotic nursing assistant, interacting with elderly users at a nursing facility.]

The robot's environment is a retirement resort located in Oakmont, PA. All experiments so far primarily involved people with relatively mild cognitive, perceptual, or physical impairments, though in need of professional assistance.

Task Description

From the many services such a robot could provide (see (Engelberger 1999; Lacey & Dawson-Howe 1998)), the work reported here has focused on the task of reminding people of events (e.g., appointments) and guiding them through their environment. At present, nursing staff in assisted living facilities spend significant amounts of time escorting elderly people walking from one location to another. The number of activities requiring navigation is large, ranging from regular daily events (e.g., meals), appointments (e.g., doctor appointments, physiotherapy, hair cuts), social events (e.g., visiting friends, cinema), to simply walking for the purpose of exercising. Many elderly people move at extremely slow speeds (e.g., 5 cm/sec), making the task of helping people around one of the most labor-intensive in assisted living facilities. Furthermore, the help provided is often not of a physical nature, as elderly people usually select walking aids over physical assistance by nurses, thus preserving some independence. Instead, nurses often provide important cognitive help, in the form of reminders, guidance and motivation, in addition to valuable social interaction.

The particular task we selected requires the robot to navigate to a person's room, alert them, inform them of an upcoming event or appointment, and inquire about their willingness to be assisted. It then involves a lengthy phase where the robot guides a person, carefully monitoring the person's progress and adjusting the robot's velocity and path accordingly. Finally, the robot also serves the secondary purpose of providing information to the person upon request, such as information about upcoming community events, weather reports, TV schedules, etc.

From an AI point of view, several factors make this task a challenging one. In addition to the well-developed topic of robot navigation (Kortenkamp, Bonasso, & Murphy 1998), the task involves significant interaction with people. The robot Pearl interacts mainly through speech and visual displays. When it comes to speech, many elderly have difficulty understanding even simple sentences, and more importantly, articulating an appropriate response in a computer-understandable way. Those difficulties arise from perceptual and cognitive deficiencies, often involving a multitude of factors such as articulation, comprehension, and mental agility. In addition, people's walking abilities vary drastically from person to person. People with walking aids are usually an order of magnitude slower than people without, and people often stop to chat or catch breath along the way. It is therefore imperative that the robot adapt to individual people, an aspect of people interaction that has been poorly explored in AI and robotics. Finally, safety concerns are much higher when dealing with the elderly population, especially in crowded situations (e.g., dining areas).

POMDP Modelling

The robot interface domain for the selected task was modelled using 576 states, which are described using a collection of multi-valued state features. Those states were not directly observable by the robot interface manager; however, the robot was able to perceive 18 distinct observations. The state and observation features are listed in table 3.

Observations were perceived through 5 different modalities; in many cases the listed observations constitute a summary of more complex sensor information. For example, in the case of a user-emitted speech signal, a keyword filter was applied to the output of the speech recognizer (e.g. "Give me the weather forecast for tomorrow." → USpeechKeyword=weather); for the laser sensor, the raw laser data was processed and correlated to a map to determine when the robot had reached a known landmark (e.g. ULaser=robotAtHome). In general the speech recognition and touchscreen input were used as redundant sensors to each other, passing in very much the same information, but assumed to have a greater degree of reliability when coming from the touchscreen. The Reminder observations were received from a high-level intelligent scheduling module (see (McCarthy & Pollack 2002) for details about this component).
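A toy version of such a keyword filter might look as follows; the vocabulary is illustrative, not the deployed Sphinx post-processing.

```python
# Map raw recognizer output to one of the discrete Speech observation values
# of table 3; the keyword list is illustrative, not the deployed vocabulary.
KEYWORDS = {"time": "time", "weather": "weather", "go": "go",
            "yes": "yes", "no": "no"}

def speech_observation(recognized: str) -> str:
    for word in recognized.lower().split():
        if word in KEYWORDS:
            return KEYWORDS[word]
    return "unknown"

print(speech_observation("Give me the weather forecast for tomorrow."))  # weather
```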

In response to the observations, the robot could select from 19 distinct actions, falling into three broad categories:

- Communicate = {RemindPhysioAppt, RemindVitamin, UpdateChecklist, CheckUserPresent, TerminateGuidance, TellTime, TellWeather, ConfirmGuideToPhysio, VerifyInfoRequest, ConfirmWantTime, ConfirmWantWeather, ConfirmGoHome, ConfirmDone}
- Move = {GotoPatientRoom, GuideToPhysio, GoHome}
- Other = {DoNothing, RingDoorBell, RechargeBattery}

Each discrete action enumerated above invoked a well-defined sequence of operations on the part of the robot (e.g. GiveWeather → USpeechSynthesis="Tomorrow's weather should be sunny, with a high of 80."). The actions in the Communicate category involved a combination of redundant speech synthesis and touchscreen display, where the selected information or question was presented to the user through both modalities simultaneously. Given the sensory limitations common in our target population, we found the use of redundant audio-visual output important for both communication to and from the robot. The actions in the Move category were translated into a sequence of motor commands by a motion planner, which uses dynamic programming to plan a path from the robot's current position to its destination.

State features          Feature values
RobotLocation           home, room, physio
UserLocation            room, physio
UserPresent             yes, no
ReminderGoal            none, physio, vitamin, checklist
UserMotionGoal          none, toPhysioWithRobot
UserInfoGoal            none, wantTime, wantWeather

Observation features    Feature values
Speech                  yes, no, time, weather, go, unknown
Touchscreen             t_yes, t_no, t_time, t_weather, t_go
Laser                   atRoom, atPhysio, atHome
Reminder                g_none, g_physio, g_vitamin, g_checklist

Table 3: Component description for human-robot interaction scenario

The POMDP model parameters were selected by a designer. The reward structure, also hand-crafted, reflects the relative costs of applying actions in terms of robot resources (e.g. robot motion actions are typically costlier than spoken verification questions), as well as reflecting the appropriateness of the action with respect to the state. For example, we use:

- positive rewards for correctly satisfying a goal, e.g. R(a = TerminateGuidance) = +50 if s(UserMotionGoal = toPhysioWithRobot) and s(RobotLocation = physio) and s(UserLocation = physio);
- a large negative reward for applying an action unnecessarily, e.g. R(a = GuideToPhysio) = -200 if s(UserMotionGoal = none);
- a small negative reward for verification questions, e.g. R(a = ConfirmGuideToPhysio) = -2 given any state condition;

and so on. This reward function satisfies the local reward assumption for each subtask, assuming the action hierarchy proposed in figure 1.
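These hand-crafted entries can be encoded as (action, state predicate, reward) rules; the values below follow our decoding of the examples above, with feature names from table 3.

```python
# Hand-crafted reward rules in the spirit of the examples above; each rule is
# (action, predicate over the state's feature dict, reward). The specific
# values are our reading of the paper's examples, not an exhaustive table.
REWARD_RULES = [
    ("TerminateGuidance",
     lambda s: s["UserMotionGoal"] == "toPhysioWithRobot"
               and s["RobotLocation"] == "physio"
               and s["UserLocation"] == "physio", +50),
    ("GuideToPhysio", lambda s: s["UserMotionGoal"] == "none", -200),
    ("ConfirmGuideToPhysio", lambda s: True, -2),
]

def reward(state: dict, action: str, default: float = 0.0) -> float:
    for act, predicate, r in REWARD_RULES:
        if act == action and predicate(state):
            return r
    return default
```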

Finally, we also applied state/observation abstraction to this domain, following the conditions specified in figure 2. We were thus able to eliminate between one and five state features for each subtask. In general, lower-level subtasks yielded more abstraction (e.g. subtask $P_{Inform}$ required only the UserInfoGoal state feature and the {Speech, Touchscreen} observation features), whereas higher subtasks required access to most state features.

Experimental Results

The planning problem described here is far beyond the reach of exact POMDP solvers. Using the hierarchical POMDP, the high-level decision making problem in Pearl is tractable, and a near-optimal control policy can be computed off-line. Thus, during execution time the controller simply monitors the state (calculates the posterior) and looks up the appropriate control.

Observation                            True State           Action               Reward
pearl hello                            request begun        say hello            100
pearl what is like                     start meds           ask repeat           -100
pearl what time is it for will the     want time            say time             100
pearl was on abc                       want tv              ask which station    -1
pearl was on abc                       want abc             say abc              100
pearl what is on nbc                   want nbc             confirm channel nbc  -1
pearl yes                              want nbc             say nbc              100
pearl go to the that pretty good what  send robot           ask robot where      -1
pearl that that hello be               send robot bedroom   confirm robot place  -1
pearl the bedroom any i                send robot bedroom   go to bedroom        100
pearl go it eight a hello              send robot           ask robot where      -1
pearl the kitchen hello                send robot kitchen   go to kitchen        100

Table 4: An example dialog with a test subject. The ask and confirm actions (set in bold in the original) are clarification actions, generated by the POMDP because of high uncertainty in the speech signal.

An important property of the computed POMDP policy is the inclusion of information-gathering actions. These actions have the specific purpose of gathering state-disambiguating data, as opposed to getting closer to the goal. In the domain described here, information-gathering actions are used to clarify or confirm the user's intent.

Table 4 shows an example dialog between the robot and a test subject (using a different task domain, developed for earlier in-lab experiments). Because of the uncertainty management in POMDPs, the robot chooses to ask a clarification question on three occasions. The number of such questions depends on the clarity of a person's speech, as detected by the Sphinx speech recognition system.

In the nursing home environment, we tested the robot in five separate experiments, each lasting one full day. The first three days focused on open-ended interactions with a large number of elderly users, during which the robot interacted verbally and spatially with elderly people with the specific task of delivering sweets. This allowed us to gauge people's initial reactions to the robot.

Following this, we performed two days of formal experiments using the exact domain described in table 3. During these experiments, the robot autonomously led 12 full guidances, involving 6 different elderly people. Figure 4 shows an example guidance experiment, involving an elderly person who uses a walking aid. The sequence of images illustrates the major stages of a successful delivery: from contacting the person, explaining to her the reason for the visit, walking her through the facility, and providing information after the successful delivery, in this case on the weather.

In all guidance experiments, the task was performed to completion. Post-experimental debriefings illustrated a uniformly high level of excitement on the side of the elderly. Overall, only a few problems were detected during the operation. None of the test subjects showed difficulties understanding the major functions of the robot. They all were able to operate the robot after less than five minutes of introduction. However, initial flaws with a poorly adjusted speech recognition system led to occasional confusion, which was fixed during the course of this project. An additional problem arose from the robot's initial inability to adapt its velocity to people's walking pace, which was found to be crucial for the robot's effectiveness.

[Figure 4: Example of a successful guidance experiment: (a) Pearl approaching the elderly person; (b) reminding her of an appointment; (c) guidance through the corridor; (d) entering the physiotherapy dept.; (e) asking for the weather forecast; (f) Pearl leaves. Pearl picks up the patient outside her room, reminds her of a physiotherapy appointment, walks the person to the department, and responds to a request for the weather report. In this experiment, the interaction took place through speech and the touch-sensitive display.]

We are currently engaged in experimental work that builds on the results presented here, including comparing our hierarchical POMDP approach to alternative solutions. Future experiments will include carrying out longer and more complex scenarios in the nursing home, where the robot will carry on tasks such as taking residents for walks, in support of varied physical and social activities.

Discussion

This paper describes a POMDP approach to high-level robot behavior control and presents an algorithmic approach to POMDPs which uses the structured nature of the robot planning domain to facilitate planning. The POMDP-based controller has been tested successfully in experiments in an assisted living facility.

This work demonstrates that POMDPs have matured to a level that makes them applicable to real-world robot control tasks. Furthermore, our experimental results suggest that uncertainty matters in high-level decision making. These findings challenge a long-held view in mainstream AI that uncertainty is irrelevant, or at best can be handled uniformly, at the higher levels of robot control (Giacomo 1998; Lakemeyer 2000). We conjecture instead that when robots interact with people, uncertainty is pervasive and has to be considered at all levels of decision making, not solely in low-level perceptual routines.


References

AAAI. 1998. Fall symposium on planning with partially observable Markov decision processes. See http://www.cs.duke.edu/~mlittman/talks/pomdp-symposium.html.
Arkin, R. 1998. Behavior-Based Robotics. MIT Press.
Black, A.; Taylor, P.; and Caley, R. 1999. The Festival speech synthesis system. 1.4 edition.
Boyen, X., and Koller, D. 1998. Tractable inference for complex stochastic processes. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 33-42.
Brooks, R. A. 1985. A robust layered control system for a mobile robot. Technical Report AI Memo 864, MIT.
Burgard, W.; Cremers, A. B.; Fox, D.; Hähnel, D.; Lakemeyer, G.; Schulz, D.; Steiner, W.; and Thrun, S. 1999. Experiences with an interactive museum tour-guide robot. Artificial Intelligence.
Cassandra, A.; Littman, M. L.; and Zhang, N. L. 1997. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, 54-61.
Cassandra, A. 1999. http://www.cs.brown.edu/research/ai/pomdp/code/.
Darrell, T., and Pentland, A. 1996. Active gesture recognition using partially observable Markov decision processes. In Proceedings of the 13th IEEE Intl. Conference on Pattern Recognition (ICPR).
Dayan, P., and Hinton, G. 1993. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, volume 5, 271-278. San Francisco, CA: Morgan Kaufmann.
Dean, T., and Lin, S. H. 1995. Decomposition techniques for planning in stochastic domains. In Proceedings of the 1995 International Joint Conference on Artificial Intelligence.
Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13:227-303.
Engelberger, G. 1999. Handbook of Industrial Robotics. John Wiley and Sons.
Gat, E. 1996. ESL: A language for supporting robust plan execution in embedded autonomous agents. AAAI Fall Symposium on Plan Execution.
Giacomo, G. D., ed. 1998. AAAI Fall Symposium on Cognitive Robotics.
Hauskrecht, M. 2000. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research 13:33-94.
Howard, R. A. 1960. Dynamic Programming and Markov Processes. MIT Press.
Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101:99-134.
Kaelbling, L. P. 1993. Hierarchical reinforcement learning: Preliminary results. In Machine Learning: Proceedings of the 1993 International Conference (ICML).
Kortenkamp, D.; Bonasso, R. P.; and Murphy, R., eds. 1998. AI-based Mobile Robots: Case studies of successful robot systems. MIT Press.
Lacey, G., and Dawson-Howe, K. M. 1998. The application of robotics to a mobility aid for the elderly blind. Robotics and Autonomous Systems 23.
Lakemeyer, G., ed. 2000. Second International Workshop on Cognitive Robotics.
Littman, M. L.; Cassandra, A. R.; and Kaelbling, L. P. 1995. Learning policies for partially observable environments: Scaling up. In Proceedings of the Twelfth International Conference on Machine Learning.
McCarthy, C. E., and Pollack, M. 2002. A plan-based personalized cognitive orthotic. In Proceedings of the 6th International Conference on AI Planning & Scheduling.
McGovern, A.; Precup, D.; Ravindran, B.; Singh, S.; and Sutton, R. S. 1998. Hierarchical optimal control of MDPs. In Proceedings of the 10th Yale Workshop on Adaptive and Learning Systems, 186-191.
Montemerlo, M.; Pineau, J.; Roy, N.; Thrun, S.; and Verma, V. 2002. Experiences with a mobile robotic guide for the elderly. In Proceedings of the National Conference on Artificial Intelligence (AAAI'02).
Nourbakhsh, I.; Powers, R.; and Birchfield, S. 1995. Dervish: An office-navigating robot. AI Magazine Summer:53-60.
Parr, R., and Russell, S. 1998. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, volume 10.
Poupart, P., and Boutilier, C. 2000. Value-directed belief state approximation for POMDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence.
Ravishankar, M. 1996. Efficient Algorithms for Speech Recognition. Ph.D. Dissertation, Carnegie Mellon.
Rowley, H. A.; Baluja, S.; and Kanade, T. 1998. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(1).
Roy, N., and Thrun, S. 2000. Coastal navigation with mobile robots. In Advances in Neural Information Processing Systems, volume 12.
Roy, N.; Pineau, J.; and Thrun, S. 2000. Spoken dialog management for robots. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.
Simmons, R., and Koenig, S. 1995. Probabilistic navigation in partially observable environments. In Proceedings of the 1995 International Joint Conference on Artificial Intelligence, 1080-1087.
Singh, S. 1992. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning 8:323-339.
Sondik, E. J. 1971. The Optimal Control of Partially Observable Markov Processes. Ph.D. Dissertation, Stanford University.
Thrun, S.; Beetz, M.; Bennewitz, M.; Burgard, W.; Cremers, A. B.; Dellaert, F.; Fox, D.; Hähnel, D.; Rosenberg, C.; Roy, N.; Schulte, J.; and Schulz, D. 2000a. Probabilistic algorithms and the interactive museum tour-guide robot Minerva. International Journal of Robotics Research 19(11).
Thrun, S.; Fox, D.; Burgard, W.; and Dellaert, F. 2000b. Robust Monte Carlo localization for mobile robots. Artificial Intelligence 101:99-141.