
Monte-Carlo Policy Synthesis in POMDPs with Quantitative and Qualitative Objectives

Abdullah Al Redwan Newaz, Swarat Chaudhuri, and Lydia E. Kavraki
The authors are with the Department of Computer Science, Rice University, Houston, TX, 77005 USA,

{redwan.newaz, swarat, kavraki}@rice.edu

Abstract—Autonomous robots operating in uncertain environments often face the problem of planning under a mix of formal, qualitative requirements, for example the assertion that the robot reaches a goal location safely, and optimality criteria, for example that the path to the goal is as short or energy-efficient as possible. Such problems can be modeled as Partially Observable Markov Decision Processes (POMDPs) with quantitative and qualitative objectives. In this paper, we present a new policy synthesis algorithm, called Policy Synthesis with Statistical Model Checking (PO-SMC), for such POMDPs. While previous policy synthesis approaches for this setting use symbolic tools (for example satisfiability solvers) to meet the qualitative requirements, our approach is based on Monte Carlo sampling and uses statistical model checking to ensure that the qualitative requirements are satisfied with high confidence. An appeal of statistical model checking is that it can handle rich temporal requirements such as safe-reachability, while being far more scalable than symbolic methods. The safe-reachability property combines the safety and reachability requirements into a single qualitative requirement. While our use of sampling introduces approximations that symbolic approaches do not require, we present theoretical results that show that the error due to approximation is bounded. Our experimental results demonstrate that PO-SMC consistently performs orders of magnitude faster than existing symbolic methods for policy synthesis under qualitative and quantitative requirements.

I. INTRODUCTION

Integrating perception and planning under uncertainty is one of the key challenges in robot motion planning. The Partially Observable Markov Decision Process (POMDP) framework provides a standard way of modeling imperfect controllers and sensors in a single framework [1].

The central algorithmic problem in the POMDP setting is the automatic synthesis of a policy: a function that maps an agent's belief about the current state of the world to an action. In the traditional definition of the problem, this policy maximizes the agent's (robot's) expected long-term reward [2]–[5]. However, in many applications, there is value in complementing this quantitative objective with formal requirements that are guaranteed to be satisfied with high probability. For instance, consider the problem, illustrated in Fig. 1, of devising reliable navigation for a simulated Fetch robot in a cluttered kitchen. Suppose we have a safe-reachability requirement that asserts that the robot must pick up a cup from the table (i.e., reach a desired configuration), while avoiding, to the extent possible, a set of unknown obstacles and an unsafe location (illustrated by the red cross in the figure).

Fig. 1: A simulated Fetch robot is required to navigate through a kitchen to pick up the glass on the table. The robot has imperfect vision and actuation. The robot needs to avoid the red cross cell as much as possible while avoiding all obstacles with high probability: concrete block, rack, and plant. The quantitative objective is to avoid the red cross and reach the goal location with maximum reward, whereas the qualitative objective is to probabilistically guarantee collision avoidance with uncertain obstacles.

In some cases, the optimal policy computed by considering quantitative requirements may violate this safe-reachability requirement. A natural planning objective for this setting is optimal safe-reachability, which asserts that the robot must pick up the cup while definitely avoiding the obstacles and minimizing the number of steps into the undesirable location.

More generally, it is natural to pose the problem of synthesizing POMDP policies that satisfy a qualitative, formal requirement (in particular safe-reachability), and have the maximal long-term reward among all policies that meet this requirement. This problem has been studied in recent work. Specifically, Wang et al. [6] consider the problem under the usual quantitative requirements, as well as the rich qualitative requirement of safe-reachability. Their method symbolically represents sets of beliefs using logical formulas, and uses a constraint solver to ensure that a policy satisfies the qualitative requirements. A second method, by Chatterjee et al. [7], frames the problem of synthesizing policies that satisfy a safety requirement as Expectation Optimization with Probabilistic Guarantee (EOPG), and solves it using a sampling-based method.


While these methods are promising, much more remains to be done on the problem. In particular, while the former approach handles the rich requirement of safe-reachability, it faces scalability challenges: the method needs to iteratively update sets of beliefs using Bayes' rule, and this can be challenging as the sets of beliefs rapidly grow in size and complexity. On the other hand, the latter approach is restricted to safety, as opposed to safe-reachability, properties.

In this paper, we present a new algorithm, called Policy Synthesis with Statistical Model Checking (PO-SMC), for policy synthesis in POMDPs under quantitative objectives as well as the qualitative objective of safe-reachability¹. The method is based on Monte Carlo sampling, and is more scalable than the symbolic method of Wang et al. [6], while also handling more expressive requirements than Chatterjee et al. [7]. The central technical idea in our approach is the use of Statistical Model Checking (SMC) [8] to ensure that the policy satisfies the qualitative requirements. In SMC, one checks that a system (for example a policy) satisfies a property through repeated Monte Carlo simulation. The PO-SMC algorithm embeds SMC, as well as periodic symbolic correctness checking, into a Monte Carlo tree search for optimal policies. Specifically, the algorithm constructs a search tree of histories by simulating state transitions and observations from a black-box simulator; this history tree is used to efficiently explore the belief space. SMC and symbolic methods are used to guarantee that the search converges to policies that satisfy the qualitative requirement with confidence above a threshold, as well as the quantitative optimality criterion.

We present theoretical results that ensure an asymptotic speed of convergence for PO-SMC and bound the errors introduced by the algorithm. The PO-SMC algorithm is simple to implement, and we evaluate it in a kitchen environment from previous work by Wang et al. [6]. Our experiments demonstrate that PO-SMC performs well on large-scale POMDPs with up to 10^10 states, and that it significantly outperforms Wang et al.'s symbolic policy synthesis algorithm.

II. RELATED WORK

A. POMDPs

The curse of dimensionality and the curse of history are two key challenges in the POMDP setting. Typically, POMDP planners approximate the value function of policies in the continuous belief space either using point-based planners [2], [5], [6], [9]–[11] or using sampling-based planners [1], [3], [4], [12], [13] to make the POMDP problem tractable. Sampling-based planners scale efficiently to higher-dimensional problems. Inspired by this fact, our work improves the performance of a sampling-based planner by coupling it with a statistical model checker (SMC).

B. Objective functions for POMDPs

Recently, there has been growing interest in constrained POMDPs [14], chance-constrained POMDPs (CC-POMDPs) [15], and expectation optimization with probabilistic guarantee (EOPG) in POMDPs [7].

¹While we focus on safe-reachability properties, the approach can in principle be extended to other classes of temporal properties.

It has been shown in [16] that POMDPs with boolean objectives are crucial for safety-critical domains where we want robots that can safely accomplish their task. Broadly, we can divide POMDP planning approaches into three categories based on whether the objectives are quantitative, qualitative, or both.

a) Qualitative objective: In the qualitative setting, properties are specified either in boolean logic or in temporal logic. Temporal logic algorithms often generate a transition system in the belief space using a deterministic automaton. Symbolic methods work well in small-scale environments [16], [17]. However, in a large-scale environment, to break the curse of history, the belief space is typically approximated by sampling-based methods [18]–[20].

b) Quantitative objective: POMDP solvers that account for quantitative objectives maximize either the expected reward or the utility of agent actions [21]–[23]. In the literature, some approaches leverage finite-state controllers for policy synthesis. In [24], the authors proposed a policy synthesis method for POMDPs with a set of finite-state controllers (FSCs). Grid-based abstraction of the belief space is popular for POMDP policy synthesis with temporal logics. The grid-based abstraction technique proposed in [25] transforms the belief space of a POMDP into a fully observable but continuous-space MDP and then approximates its solution based on a finite set of grid points.

c) Quantitative and qualitative objectives: In many application domains, however, either the quantitative or the qualitative objective alone is not enough [6], [7], [15]; we need both. POMDPs with both these objectives were first introduced in [17]. Recently, the PBPS algorithm proposed in [6] solves this problem for POMDPs with safe-reachability and quantitative objectives. Given the goal-constrained belief space of a POMDP, PBPS can efficiently produce a valid policy by combining policy iteration for the quantitative objective with Bounded Policy Synthesis (BPS) for boolean objectives [16]. However, because of the symbolic representation of the representative belief space, this method does not scale well in large-scale environments.

Table I compares the scope of our method with other existing methods for policy synthesis.

Property | Symbolic | Sampling-based
Safety | Norman et al. [25] (Quant) | RAMCP [7] (Qual + Quant)
Reachability | Chatterjee et al. [26] (Qual) | SARSOP [5] (Quant)
Safe-Reachability | PBPS [6] (Qual + Quant) | PO-SMC (Qual + Quant)

TABLE I: Scope of the proposed algorithm. Here Quant and Qual denote quantitative and qualitative objectives, respectively.

The Partially Observable UCT (PO-UCT) search method was first introduced along with the POMCP algorithm in [3]. In the literature, two distinct approaches have utilized PO-UCT search to satisfy objectives.


The first approach is a variant of chance-constrained POMDPs in which the goal is to ensure that the probability of satisfying the safety property is above a given threshold with at least a specified probability [7]. The second approach [19] utilizes the POMCP algorithm to synthesize POMDP policies that satisfy a boolean objective with high probability. However, it is unclear how to extend these approaches to combined quantitative and qualitative objectives.

III. SYSTEM OVERVIEW

In this section, we present an overview of our approach.

A. POMDP with Model Checking

[Fig. 2 block diagram: the Perception, State Estimator, Planner, Model Checker, and Controller modules, connected by the observation z, belief b, action a (checked against the requirement φ), and next state s′.]

Fig. 2: Overview of our system. By combining (statistical) model checking with conventional POMDP planning, we are able to synthesize policies that satisfy the safe-reachability requirement.

Conventionally, the POMDP planning framework comprises four building blocks: the State Estimator, the Planner, the Controller, and the Perception module. In this work, we use an additional functional block: the (statistical) model checker. The purpose of this module is to check whether all actions of a given policy satisfy a given logical requirement (the POMDP's qualitative objective).

In this framework, we perform complete alternating steps of action planning and verification. In the POMDP setting, uncertainty about the current system state is formalized using the notion of a belief b: a probability distribution over states. As depicted in Fig. 2, we begin with a belief b (statistically estimated in our algorithm) from the State Estimator; the Planner plans an action a based on the reward function; the Model Checker then computes the satisfaction probability of the safe-reachability requirement φ for a, and provides feedback to the Planner if the probability lies below a threshold value; the Controller executes only those actions that satisfy the requirement and drives the robot to a new state s′. The Perception module processes an observation z upon arriving at s′ and forwards this information to the State Estimator to update the robot's belief b.
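To make this interaction concrete, the following minimal Python sketch shows one planning/verification cycle of Fig. 2. The objects `planner`, `model_checker`, `controller`, `perception`, and `state_estimator`, and their methods, are hypothetical stand-ins for the corresponding modules rather than the paper's implementation.

```python
def planning_cycle(belief, planner, model_checker, controller,
                   perception, state_estimator, phi, threshold):
    """One hypothetical plan-check-execute-observe-update cycle of Fig. 2."""
    while True:
        action = planner.plan(belief)                     # plan w.r.t. the reward function
        p_sat = model_checker.check(belief, action, phi)  # estimate P(action satisfies phi)
        if p_sat >= threshold:
            break                                         # action accepted by the checker
        planner.feedback(action, p_sat)                   # otherwise ask the planner to replan
    next_state = controller.execute(action)               # drive the robot to a new state s'
    observation = perception.observe(next_state)          # process observation z at s'
    return state_estimator.update(belief, action, observation)  # updated belief b
```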

In prior symbolic approaches to policy synthesis with respect to qualitative properties, one explicitly represents each step transition of the POMDP model and then enumerates all possible outcomes to verify a given action. However, in many cases, it is infeasible to obtain the probability distribution in explicit form. In contrast, the Monte Carlo method (statistical model checking) presented here uses simulated experience to verify a given action, and scales significantly better.

B. Statistical Model Checking (SMC)

Verifying that a policy satisfies a qualitative requirement is a central task in our framework. This task can be accomplished using symbolic methods; however, such methods are often too expensive in terms of runtime overhead and memory usage in large-scale environments. SMC provides an intermediate framework between testing and exhaustive, symbolic verification by relying on statistics of the POMDP model. It only requires a black-box simulator and can be applied to problems that can be simulated and checked against state-based properties. It has the ability to check properties of a model that cannot even be expressed in classical temporal logic [27], [28].

In our setting, SMC computes the probability that a policy satisfies the query property. As in PBPS, we have a safe-reachability property φ to check, and our goal is to synthesize a valid policy π that satisfies this property. PBPS performs point-based backup on a small set of representative beliefs and then extracts a valid policy by enumerating all possible observations over that belief space. In this work, we approximate the belief space using a sampling-based method and then synthesize a valid policy via SMC. Since the SMC algorithm requires simulations to compute the probability of satisfaction, the robot generates a set of trajectories from multiple simulations and feeds these trajectories to the SMC. Given a probability threshold p_π, the SMC then performs hypothesis testing on these trajectories to decide whether the policy π satisfies the property φ or not. Note that, unlike PBPS, which enumerates all possible observations for a representative belief set, SMC enumerates over the most likely observations [27]. The details of SMC are explained further in Section V.
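As a rough illustration of this step, the sketch below estimates the satisfaction probability of φ from a batch of simulated trajectories and compares it against the threshold p_π. The helpers `simulate_trajectory` and `satisfies` are hypothetical and injected as parameters; a full SMC procedure would instead use a sequential hypothesis test or the Chernoff-Hoeffding sample size of Eq. (4) below.

```python
def smc_accepts(policy, simulate_trajectory, satisfies, phi, p_threshold, n=1000):
    """Monte Carlo estimate of P(policy |= phi), compared against p_threshold.

    simulate_trajectory(policy) -> one simulated trajectory (hypothetical rollout helper).
    satisfies(trajectory, phi) -> bool (hypothetical property check).
    """
    successes = sum(satisfies(simulate_trajectory(policy), phi) for _ in range(n))
    p_hat = successes / n
    return p_hat >= p_threshold, p_hat
```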

IV. PROBLEM FORMULATION

In this section, we state our policy synthesis problem. We start by defining POMDPs.

Definition 1 (POMDP, M). A Partially Observable Markov Decision Process (POMDP) is a tuple M = ⟨S, A, Z, I, T, R, O, γ⟩, where S is a finite set of states; A is a finite set of actions; Z is a finite set of observations; I is the initial distribution over states; T(s, a, s′), where s, s′ ∈ S and a ∈ A, is a probabilistic transition function; R(s, a) is a real-valued reward function; O(s′, a, z), where z ∈ Z, is a probabilistic observation function; and 0 < γ < 1 is a discount factor.
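For concreteness, the tuple of Definition 1 can be held in a simple container; the sketch below is only an illustrative encoding (the field names are ours, not taken from the paper's code).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class POMDP:
    states: List[str]                             # S: finite set of states
    actions: List[str]                            # A: finite set of actions
    observations: List[str]                       # Z: finite set of observations
    initial: Dict[str, float]                     # I: initial distribution over states
    transition: Callable[[str, str, str], float]  # T(s, a, s'): transition probability
    reward: Callable[[str, str], float]           # R(s, a): real-valued reward
    observe: Callable[[str, str, str], float]     # O(s', a, z): observation probability
    gamma: float = 0.95                           # discount factor, 0 < gamma < 1
```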

In the remainder of the paper, we denote by M a fixed, arbitrary POMDP.

Let the state, action, observation, and reward at time point t be given by the random variables s_t, a_t, z_t, and r_t, respectively. The initial state s_0 follows the distribution I. T(s, a, s′) = P(s_{t+1} = s′ | s_t = s, a_t = a) is the probability of moving to a state s′ after taking the action a in the state s. O(s′, a, z) = P(z_{t+1} = z | s_{t+1} = s′, a_t = a) specifies the probability of observing z after taking the action a and reaching the state s′. The agent's (robot's) reward for taking the action a from the state s is given by R(s, a).

Page 4: Monte-Carlo Policy Synthesis in POMDPs with Quantitative ...rss2019.informatik.uni-freiburg.de/papers/0102_FI.pdfobjectives. In this paper, we present a new policy synthesis algorithm,

Let a history h in M be a sequence of actions and observations: ⟨a_1, z_1, . . . , a_t, z_t⟩. We denote the set of all histories in M by H, the "empty" history by ε, and the history at time point t during the execution of the POMDP by the random variable h_t.

The agent's action-selection behaviour is described by a policy.

Definition 2 (Policy, π). A policy is a function π(h) that maps a history h to a probability distribution over actions: (π(h))(a) = P(a_{t+1} = a | h_t = h).

We capture uncertainty about the current state of the system using the notion of a belief state, or simply a belief. Formally, the belief b_h given a history h is a probability distribution over POMDP states.

Let b^π_t and r^π_t be random variables denoting the belief state and the agent's reward at time point t, respectively, under the policy π. The agent starts with the known initial belief b^π_{t=0} := I, and its return from time point t under the policy π is given by

R^π_t = Σ_{i=t}^{∞} γ^{i−t} r^π_i.

The value function for the policy π(h) is given by V^π(h) = E[R^π_t | h_t = h]. We denote by V^π the value function of π given the initial empty history: V^π := V^π(ε). The goal of policy synthesis with purely quantitative objectives is to find a policy that maximizes V^π. On the other hand, in policy synthesis modulo qualitative objectives, we seek to find a policy for which the probability of satisfying a requirement is above a threshold bound.
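As a small illustration of these definitions, the discounted return of one simulated episode and a Monte Carlo estimate of V^π can be computed as below; `rollout` is a hypothetical helper that returns the reward sequence of one episode under π.

```python
def discounted_return(rewards, gamma):
    """R_t = sum_{i >= t} gamma^(i-t) * r_i for a finite simulated episode."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

def estimate_value(rollout, policy, gamma, episodes=1000):
    """Monte Carlo estimate of V^pi: average discounted return over simulated episodes."""
    returns = [discounted_return(rollout(policy), gamma) for _ in range(episodes)]
    return sum(returns) / len(returns)
```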

In this paper, we restrict ourselves to the requirement of (probabilistic) safe-reachability, defined as follows.

Definition 3 (Safe-reachability requirement, φ [6]). Given the set of states induced by a policy π, a set of unsafe states unsafe, and a set of goal states goal, a safe-reachability requirement is a tuple φ = ⟨safe, reach⟩. Given a probability threshold ξ, we say that a policy π satisfies φ, and write π |= φ, if P(∀t : s_t ∉ unsafe ∧ ∃t : s_t ∈ goal) ≥ ξ.

The aim of safe-reachability is to ensure that, with probability at least ξ, a goal state is eventually reached without ever visiting an unsafe state. It combines safety and reachability requirements into a single requirement. From now on, we denote by φ a fixed, arbitrary safe-reachability requirement as in the above definition.
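Concretely, the event inside the probability in Definition 3 can be checked on a single simulated state trajectory as below (the function name is ours); averaging this indicator over many sampled trajectories, as in Section V, yields the empirical satisfaction probability that is compared against ξ.

```python
def trajectory_satisfies(states, unsafe, goal):
    """True iff a simulated state trajectory never visits an unsafe state
    and eventually visits a goal state (the event inside Definition 3)."""
    reached_goal = False
    for s in states:
        if s in unsafe:
            return False            # safety violated at some time step
        if s in goal:
            reached_goal = True     # reachability achieved
    return reached_goal
```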

Now we state our problem.

Definition 4 (Policy synthesis). The policy synthesis problem, given M and φ, is to find a policy π* such that

π* = argmax_π V^π   subject to   π* |= φ.   (1)

An alternative formulation as EOPG: Since we are using a probabilistic bound on the safe-reachability property φ, we can also formulate our original problem as an Expectation Optimization with Probabilistic Guarantee (EOPG) problem [7]. The EOPG problem is a variant of CC-POMDPs [15], where the goal is to optimize the expected rewards while ensuring that the risk factor is below a given threshold with at least a specified probability. However, in our case, this formulation needs to deal with multiple chance constraints: one for safety and another for reachability.

V. POLICY SYNTHESIS WITH STATISTICAL MODEL CHECKING (PO-SMC)

Our policy synthesis method is based on the Monte-Carlo Tree Search (MCTS) algorithm [29]. At its base, MCTS consists of four steps: Selection, Expansion, Simulation, and Backup. In our case, the Planner and the Model Checker in Fig. 2 use the same MCTS algorithm but with different objectives. We modify the Simulation and Backup steps of MCTS to synthesize a valid policy that satisfies both our quantitative and qualitative objectives. We now elaborate on the belief space sampling procedure and the proposed PO-SMC algorithm in the next two subsections.

A. Belief state approximation

In large-scale environments, we do not want the computation of the belief to be proportional to the number of states, which motivates us to approximate the belief b by sampling states from a probabilistic generator G [3], [4].

More precisely, a generator G is a procedure that, given a POMDP state s and an action a at time point t, produces a sample of a successor state s′, observation z, and reward r: s′ ∼ P(s_{t+1} | s_t = s, a_t = a), z ∼ P(z_{t+1} | s_{t+1} = s′, a_t = a), and r ∼ P(r_{t+1} | s_t = s, a_t = a).
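A black-box generator of this kind can be wrapped as a single sampling function. The sketch below assumes explicit access to T, O, and R (via a container like the illustrative one after Definition 1) purely for illustration; in practice G can be any simulator and need not expose these functions.

```python
import random

def make_generator(pomdp):
    """Wrap explicit POMDP dynamics as a black-box generator G(s, a) -> (s', z, r)."""
    def G(s, a):
        # sample s' ~ T(s, a, .)
        s_next = random.choices(
            pomdp.states, weights=[pomdp.transition(s, a, sp) for sp in pomdp.states])[0]
        # sample z ~ O(s', a, .)
        z = random.choices(
            pomdp.observations, weights=[pomdp.observe(s_next, a, o) for o in pomdp.observations])[0]
        r = pomdp.reward(s, a)      # reward for taking a in s
        return s_next, z, r
    return G
```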

We call the samples s′ particles. Rather than representing beliefs exactly, we approximate them using sets of m particles (line 1 in Algorithm 2), and this brings a significant improvement in performance. We call the approximate belief the particle belief and also denote it by b. By using an unweighted particle filter, we are able to reduce the computational cost from quadratic to linear in the number of particles. Even though a higher number of particles yields improved accuracy, in practice the number of particles is much smaller than the number of POMDP states. We use a particle reinvigoration technique to update the belief state, where new particles can be introduced by adding artificial noise to existing particles [3].
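A minimal sketch of such an unweighted particle-filter update with reinvigoration is given below. Here `add_noise` is a hypothetical, domain-specific perturbation, and the rejection-style matching of observations is one simple realization under our assumptions, not necessarily the exact implementation used in the paper.

```python
import random

def update_particle_belief(particles, a, z, G, m, reinvigorate=0.05):
    """Unweighted particle-filter update of the belief after executing a and observing z."""
    new_particles = []
    while len(new_particles) < m:
        s = random.choice(particles)               # sample a state from the current particle belief
        s_next, z_sim, _ = G(s, a)                 # simulate one step with the black-box generator
        if z_sim == z:
            new_particles.append(s_next)           # keep particles consistent with the real observation
        elif random.random() < reinvigorate:
            new_particles.append(add_noise(s_next))  # hypothetical artificial-noise reinvigoration
    return new_particles
```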

B. PO-SMC

In the MCTS algorithm there are two policies: one that is being improved and eventually becomes the valid policy, and one that is more exploratory and is used to generate behavior. We call the former the tree policy, denoted by π, and the latter the rollout policy, denoted by π_rollout. The distribution induced by the rollout policy is called the rollout distribution.

In the synthesis phase, the PO-SMC algorithm creates a tree representation of policies π. This tree representation is essentially the tree policy of MCTS. The tree π is rooted in the empty history. For a state s sampled from the current particle belief b, our goal is to estimate the action-value function V_a, which defines the expected return for taking the action a from the state s. This is done using a set of Monte-Carlo simulations following π_rollout (SIMULATE, line 7 in Algorithm 1).


To find the action-value function for each action a, we create a node N_a ← (a, Λ_a) after calling the SMC subroutine (lines 2–3 in Algorithm 1), where Λ_a is the rollout statistics that contain sufficient information about the rollout episodes. For each node N_a in π, we estimate the value V_a with a set of Monte-Carlo simulations (SIMULATE, line 7 in Algorithm 1). To approximate the belief b for a given action a, we sample a set S′ of particles from G(s, a) (line 1 in Algorithm 2). Thus, we represent the belief space by m particles. The simulation length d is chosen empirically such that the estimation error is sufficiently small.

Algorithm 1 PO-SMC
Input: Empty policy tree π, probabilistic generator G, particle belief b, sample size m, simulation depth d, number of iterations n, probability threshold ξ
Output: Valid policy π*

1: for each a ∈ A do
2:     Λ_a ← SMC(π, G, a, m, d)                          ▷ Model checker
3:     N_a ← (a, Λ_a)
4:     V_a ← 0
5:     for i = 1, 2, ..., n do
6:         s ∼ b
7:         V_a ← V_a + SIMULATE(π ∪ {N_a}, N_a, s)
8:     end for
9: end for
10: a* ← argmax_{a : P(a, π_rollout |= φ) ≥ ξ} [ V_a + c·sqrt(log N(b) / N(b, a)) ]
11: b ← b ∪ S′                                            ▷ Λ_a := (S′, π, Γ)
12: N(b) ← N(b) + 1
13: N(b, a*) ← N(b, a*) + 1
14: π* ← π ∪ {N_{a*}}                                     ▷ Update tree policy
15: return π*

Algorithm 2 SMC
Input: Policy tree π, probabilistic generator G, action a, sample size m, simulation depth d
Output: Rollout statistics (S′, π, Γ)

1: S′ ← {s′ | s′ ∼ G(s, a)}                               ▷ Samples from the generator
2: Γ ← ∅                                                  ▷ Initialize trajectory
3: for each N ∈ π do
4:     for each s ∈ S′ do
5:         r ← 0                                          ▷ Track reward
6:         p ← 0                                          ▷ Track probability
7:         for i = 1, 2, ..., d do
8:             r ← r + γ·SIMULATE(π, N, s)
9:             p ← p + SIMULATE(π, N, s)
10:        end for
11:        p ← p/d
12:        Γ ← Γ ∪ {⟨s, a, p, r⟩}                          ▷ Update trajectory
13:    end for
14: end for
15: return (S′, π, Γ)

Moreover, Λ_a shares the same simulations for computing the probability of satisfying the property φ. Given a probability threshold ξ, to construct a valid policy π for a given action a we need to keep track not only of the average reward r but also of the cumulative probability measure p of the set of simulated trajectories Γ satisfying the property φ (lines 5–12 in Algorithm 2). Γ is a sequence of tuples ⟨s, a, p, r⟩ such that, given an action a at each step, the next state s′ with its associated reward r is sampled from G(s, a), together with the probability p of satisfying the property φ under π_rollout. Note that, unlike [6], [7], PO-SMC does not bootstrap as a means of calculating the probability of satisfying φ: it is easier to generate many sample episodes starting from the current belief state and average the returns from the current belief alone, ignoring bootstrapping. Thus, after the rollout episodes, the robot has enough statistics about each action and the extent to which it satisfies the quantitative and the qualitative objectives, respectively.

N(b) and N(b, a) are the visitation counts for a belief node and an intermediate belief-action node, respectively (line 10 in Algorithm 1). c is an exploration constant (line 10 in Algorithm 1) in the tree policy that balances trying actions to improve the policy against exploiting current knowledge of the environment. In Section VII, we discuss how to select c. When the search is complete after iterating the value function n times (line 4 in Algorithm 1), the robot selects the action a (line 10 in Algorithm 1) based not only on the greatest reward but also on the probability p of satisfying the property φ.
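Line 10 of Algorithm 1 can thus be read as a UCB1-style selection restricted to actions whose estimated satisfaction probability clears ξ. A sketch under that reading, assuming per-action statistics (value estimate, satisfaction probability, visitation counts) have already been collected; the names are illustrative.

```python
import math

def select_action(stats, N_b, c, xi):
    """UCB1-style selection restricted to actions that meet the qualitative threshold.

    stats maps each action a to (V_a, p_a, N_ba): the value estimate, the estimated
    probability that a (followed by the rollout policy) satisfies phi, and the
    visitation count of the belief-action node. N_b is the belief-node count (>= 1).
    """
    best_action, best_score = None, -math.inf
    for a, (V_a, p_a, N_ba) in stats.items():
        if p_a < xi:
            continue                                         # prune actions violating the threshold
        bonus = c * math.sqrt(math.log(N_b) / max(N_ba, 1))  # exploration bonus of line 10
        if V_a + bonus > best_score:
            best_action, best_score = a, V_a + bonus
    return best_action
```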

Based on this action a, the robot arrives at a state s′ and receives a real observation z according to O(s′, a, z). At this point the node N_{a*} becomes the root of the new search tree π*, and the process then repeats from this node.

VI. ALGORITHM ANALYSIS

In this section, we analyze the convergence property and the approximation error of our algorithm. We use the term confidence level (denoted by δ), which is the percentage of all possible samples that can be expected to include the true belief parameter.

A. Convergence analysis

We divide the convergence analysis of the proposed algorithm into two parts: a) convergence of Algorithm 1 for the quantitative objective and b) convergence of Algorithm 1 for the qualitative objective.

Our analysis uses the data structure of the history-based Markov Decision Process (MDP) M_H [3] derived from the POMDP M.

Definition 5 (History-based MDP, M_H [3]). The history-based MDP for M is a tuple M_H = ⟨H, A, T_H, R_H⟩, where H is the set of histories, A is the set of actions of M, and the probabilistic transition function T_H : H × A × H → [0, 1] is defined as

T_H(h, a, h·a·z) = Σ_{s,s′} b(s, h) T(s, a, s′) O(s′, a, z).

The reward function R_H is defined as R_H(h, a) = Σ_s b(s, h) R(s, a). A policy π(h) in M_H is a function mapping histories h to probability distributions over actions. The value function for the policy π(h) in M_H, defined in the natural way, is denoted by V_H^π(h).

The following property holds:

Lemma 1 ([3]). The value function of the history-based MDP is equal to the value function of the POMDP:

∀h ∈ H, ∀π : V_H^π(h) = V^π(h).   (2)

The rollout distribution of M_H is the distribution over full histories obtained when sampling states from G.

Lemma 2 ([3]). Let D_H(h, a, z | π) be the history-based MDP rollout distribution and D(h, a, z | π) be the POMDP rollout distribution. For any rollout policy π, the POMDP rollout distribution is equal to the history-based MDP rollout distribution:

∀h ∈ H, ∀a ∈ A, ∀z ∈ Z, ∀π : D_H(h, a, z | π) = D(h, a, z | π).   (3)

Theorem 1. For the quantitative objective (optimal planning), Algorithm 1 with a suitable choice of exploration constant c converges to the optimal value function.

Proof. In the absence of the qualitative objective, Algorithm 1 behaves the same as the POMCP algorithm [3]. Since the POMCP algorithm is proven to converge to the optimal value function for a suitable choice of c, by Lemma 1 and Lemma 2, Algorithm 1 also converges to the optimal value function for a suitable choice of c.

Theorem 2. Let Y be the random variable that estimates the probability of satisfying the safe-reachability property φ for a given policy π. There exists a fixed sample size m sufficient to approximate Y with high confidence level δ.

Proof. Let (y_1, y_2, ..., y_m) be independent trials sampled from the rollout distribution D(h, a, z | π). For the safe-reachability property, we perform hypothesis testing on an indicator that takes value 1 with probability p and 0 with probability (1 − p). We can then estimate the probability of satisfying a given property φ as

Y = (1/m) · Σ_{i=1}^{m} y_i.

Given the desired error ε and confidence level δ, the Chernoff-Hoeffding bound provides an exponential bound for the approximation: P(|Y − p| < ε) ≥ 1 − 2·exp(−2mε²). In this setting, the history-based MDP converges to a near-optimal policy π after sampling a number of trials m polynomial in 1/ε and log(1/δ), namely

m = ln(2/δ) / (2ε²).   (4)
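Equation (4) directly gives the number of simulated trials needed for a target accuracy; a small worked sketch (the example values of ε and δ are illustrative, not the exact settings of Section VII):

```python
import math

def smc_sample_size(eps, delta):
    """Chernoff-Hoeffding sample size m = ln(2/delta) / (2 * eps^2) from Eq. (4)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Example: estimating the satisfaction probability within eps = 0.025 at
# confidence level delta = 0.05 requires m = 2952 independent simulated trials.
print(smc_sample_size(0.025, 0.05))   # 2952
```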

Corollary 1. Algorithm 1 with a suitable choice of exploration constant c will converge to the optimal value function for both quantitative and qualitative objectives.

Proof. Theorem 1 and Theorem 2 provide necessary and sufficient conditions for the convergence of Algorithm 1. If there are not enough samples, Algorithm 1 will not converge. However, using Theorem 2, we can calculate the number of samples required to satisfy the qualitative objective first. Once we satisfy the qualitative objective, we can satisfy the quantitative objective using Theorem 1. The exploration constant c in Theorem 1 determines how fast the algorithm converges to a solution given a fixed sample size m. Therefore, if there exists an optimal value function that satisfies both our quantitative and qualitative objectives, then given enough computational budget, Algorithm 1 asymptotically converges to the optimal value function.

B. Error bound

In order to evaluate the quality of the policy synthesized by our algorithm, we need an error measurement. We also divide the error analysis of the proposed algorithm into two parts. First, we transform the qualitative-objective satisfaction problem into a probability estimation problem: given a probabilistic generator G and the safe-reachability property φ, compute the probability measure p of the measurable set of execution trajectories satisfying the property. We denote the approximation error w.r.t. the qualitative objective by E^k_φ. Theorem 3, together with Lemma 3, bounds the approximation error of satisfying the qualitative objective as computed in Algorithm 2. Second, let V*_π(h) be the optimal value function and V^k_π(h) be the value function of PO-SMC for the quantitative objective after k backup iterations. As PO-SMC continues to improve π, the value of V^k_π gets closer to V*_π. We denote the approximation error w.r.t. the quantitative objective by E^k_v, such that E^k_v = |V*_π(h) − V^k_π(h)|.

Lemma 3. Let P_max be the maximum probability of a bad estimate for the qualitative objective. Given a history-MDP M_H with C observations sampled at each step, the approximation error for the safe-reachability property φ using the SMC method is bounded as follows:

E^k_φ ≤ (|A| · C)^k · exp(−C·ε² / (16·P²_max)),

where k is the number of steps, ε is the approximation error, and |A| is the cardinality of the action set.

Proof. The Chernoff-Hoeffding bound has been used to estimate the approximation error of a given property φ in [30]. However, in [30] the property φ is a quantitative objective; here we show how to estimate the approximation error for the safe-reachability property φ, which is a qualitative objective. Using a Chernoff-Hoeffding bound, we obtain that, at each step, the probability of a single bad estimate for the safe-reachability property φ is exp(−C·ε² / (16·P²_max)). The probability of a bad estimate grows by a factor of the action-set size |A| and the observation width C at each step. Therefore, the probability of some bad estimate after k steps is bounded by (|A| · C)^k · exp(−C·ε² / (16·P²_max)).

Lemma 4. Let R_max be the maximum achievable reward for a policy π with the quantitative objective. Given a history-MDP M_H with a permissible class of observations O, after executing actions a ∈ A for k steps with synchronous backups over a sampled belief set, the error associated with the quantitative objective is bounded as follows:

E^k_v ≤ ε/(1 − γ) + 2·R_max·δ/(1 − γ)² + 2·γ^k·R_max/(1 − γ),

where δ is the confidence level.

Proof. Lemma 4 indicates that the approximation error for the quantitative objective has three main sources: the PO-SMC backup, the estimation of the belief space by a finite set of beliefs, and the finite number of backup iterations. This analysis is similar to that for the MCVI algorithm [1].

Theorem 3. The error introduced by Algorithm 1 for the safe-reachability objective is bounded by (|A| · C)^k · exp(−C·ε² / (16·P²_max)), and the error for the quantitative objective is bounded by ε/(1 − γ) + 2·R_max·δ/(1 − γ)² + 2·γ^k·R_max/(1 − γ).

VII. EXPERIMENTS

In this section, we demonstrate the benefit of PO-SMC for synthesizing POMDP policies. Our primary focus is to investigate when and to what extent the following two claims hold: 1) PO-SMC can synthesize policies more rapidly than the state-of-the-art algorithm PBPS, and 2) PO-SMC improves on baseline policies. We use a general-purpose laptop with a 2.5 GHz Intel Core i7 processor and 16 GB of memory to report the running time of all experiments. We developed the PO-SMC algorithm on top of the POMCP implementation in [31].

A. Navigation task

To provide a direct comparison with previous work, we begin with a simulated navigation task in the kitchen environment taken directly from [6]. Following the previous work, we discretize the workspace into W grid cells. The robot's objective is to eventually pick up a cup from storage while navigating through the cluttered kitchen environment. The robot can pick up the cup only if it navigates to a specific cell while avoiding X uncertain obstacles. Furthermore, to encapsulate the quantitative objective, we add certain regions in the environment that the robot needs to avoid as much as possible; the reward assigned to those locations is −10, and the reward is −1 otherwise. The robot can determine neither the obstacles nor the undesirable locations before they are encountered.

a) Parameter selection: We compute the number of states in each task from the combinations of obstacles per cell, similar to [6]. In the largest test (X = 7) there are more than 10^10 belief states, whereas PBPS can synthesize policies for up to 10^5 belief states. We set γ = 1 for this comparison. To apply PO-SMC in such large-scale environments, we mainly need to tune the following hyperparameters: the exploration constant, the timeout, the policy horizon, and the simulation depth. In our experiments, we manually tuned these parameters to achieve near-optimal performance. To obtain G, we assume a linear and holonomic motion model of the robot, where the observations are disturbed by zero-mean additive Gaussian noise. We select the exploration constant c empirically: we look for the value of c that converges to a solution not only as fast as possible but also with higher rewards. When c = 0, Algorithm 1 behaves greedily. For c = 5, Algorithm 1 converges faster than for other values and with higher average rewards. Therefore, we choose the exploration constant c = 5 for all the experiments. We fixed the policy horizon bound to 30, as in PBPS. The timeout for each action is 60 seconds and the simulation length is 90. For our simulations, we determine that the minimum number of particles is 2401 and the maximum number of particles is 9604. We obtain these numbers by varying ε in Eqn. (4) from 0.013 to 0.027 with constant δ = 0.05.

b) Results: Given a known initial state and a choice of eight different actions (move and look in the four directions, to navigate and observe respectively), both model checking algorithms verify whether the policy is safe and reachable and proceed accordingly. Table II illustrates the performance comparison between PBPS and PO-SMC in worst-case scenarios. The results were averaged over 5000 runs, and "−" means the problem size is too large for the algorithm. While PBPS works nearly perfectly in small-scale environments with exact confidence on its estimates, its performance drops dramatically when it encounters larger environments or more obstacles (more than 1). PO-SMC outperforms PBPS in most cases, especially in large-scale environments. One noticeable difference between these methods is that an exact valid policy takes a long time to evaluate because of the vast number of belief states. This is evident in the performance of PBPS, which performs worse than PO-SMC as it tends to consider every belief state in the goal-constrained set.

Grid size | Obstacles | PBPS reward | PO-SMC reward | PBPS time (s) | PO-SMC time (s)
13×3 | 1 | −10 | −6 | 18.16 | 116.813
13×3 | 2 | −14 | −14 | 347.281 | 136.893
13×3 | 3 | −31 | −31 | 3611.684 | 361.975
14×4 | 6 | − | −47 | Timeout | 1353.714
14×4 | 7 | − | −52 | Timeout | 2812.118

TABLE II: PBPS performs better than PO-SMC in smaller-scale environments. However, PO-SMC is significantly faster when presented with a larger number of obstacles. In very large-scale environments, PBPS fails to find a valid policy.

B. Baseline comparison

One hypothesis for why PO-SMC might be helpful is that it provides smart guidance for exploration. In this subsection, we examine whether PO-SMC can overcome the limitations of the baseline POMCP algorithm, which synthesizes a coarse policy by trading safety for performance. In particular, this experiment is designed to capture the importance of model checking with the specified safe-reachability property.

Fig. 4 shows the performance comparison between POMCP and PO-SMC with 2, 3, 6, and 7 obstacles. We notice two common sources of suboptimality for the baseline POMCP algorithm.


(a) step 0 (b) step 2 (c) step 4 (d) step 10

Fig. 3: Simulation steps: In this execution, there is no difference between the policies generated by PBPS and PO-SMC. The proposed PO-SMC algorithm, however, is about 10 times faster than the PBPS algorithm for this experiment.

[Fig. 4 bar chart: average discounted rewards of PO-SMC vs. POMCP for 2, 3, 6, and 7 obstacles.]

Fig. 4: PO-SMC vs. POMCP: these results provide empirical evidence that model checking against given properties is a more robust way to make fast, consistent progress, compared to using a fixed penalty.

First, the PO-UCT search used to guide POMCP can be misleading in cases where it is preferable to move over the obstacles rather than avoid them. Second, the discreteness of the actions sometimes leads to circuitous executions in which the episode ends before the robot reaches the target.

These failure modes are especially common when the reward is not differentiable and not carefully designed, since the cumulative reward for reaching the goal while hitting obstacles might be higher in some contexts. We generate these results by assigning a negative reward of −100 for hitting an obstacle and a reward of +100 for reaching the goal. We set the discount factor γ = 0.95 and repeat each experiment 100 times. It is evident from Fig. 4 that PO-SMC outperforms POMCP in every test case.

C. Validation Experiments

In this final experiment, we execute the policies constructed by PO-SMC and PBPS for the kitchen domain with the 4×4 grid. We conducted this experiment in the V-REP simulator with the Bullet physics engine [32]. We execute the policy in each case using the URDF model of the Fetch from Fetch Robotics [33]. Fig. 3 shows the simulation steps while executing the policies. To provide the benchmark, we consider a linear and holonomic motion model of the Fetch. As in the previous navigation task, the robot starts from a known location in Fig. 3a. The robot needs to pick up the cup from the yellow table while avoiding the uncertain obstacles: the brown concrete block, the green plant, and the white rack. As it is required to avoid the red cross cell, the robot navigates between the concrete block and the plant. We illustrate this behavior in Figs. 3b–3c. Finally, the robot reaches the goal location in Fig. 3d after 10 time steps.

VIII. CONCLUSION

This paper presents a novel algorithm for synthesizing policies in POMDPs with quantitative and qualitative objectives. Our approach combines PO-UCT search with SMC to synthesize policies that satisfy safe-reachability and quantitative objectives simultaneously. Our experimental results demonstrate the potential of the proposed algorithm for motion planning under uncertainty. We also provide a rigorous analysis of the convergence and error bounds of the algorithm. Our algorithm uses sampling-based synthesis instead of symbolic synthesis, and thus opens up a range of new opportunities for complex robot motion planning, especially in large-scale environments. Currently, however, our algorithm requires a considerable number of samples to synthesize a policy with high confidence; improving the sampling strategy is an open question for future work.

IX. ACKNOWLEDGEMENT

This project has been supported in part by NSF 1514372 and NSF 1830549.


REFERENCES

[1] H. Bai, D. Hsu, and W. S. Lee, "Integrated perception and planning in the continuous space: A POMDP approach," The International Journal of Robotics Research, pp. 1288–1302, 2014.

[2] J. Pineau, G. Gordon, S. Thrun et al., "Point-based value iteration: An anytime algorithm for POMDPs," in International Joint Conferences on Artificial Intelligence, 2003, pp. 1025–1032.

[3] D. Silver and J. Veness, "Monte-Carlo planning in large POMDPs," in Advances in Neural Information Processing Systems, 2010, pp. 2164–2172.

[4] A. Somani, N. Ye, D. Hsu, and W. S. Lee, "DESPOT: Online POMDP planning with regularization," in Advances in Neural Information Processing Systems, 2013, pp. 1772–1780.

[5] H. Kurniawati, D. Hsu, and W. S. Lee, "SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces," in Robotics: Science and Systems IV, 2009.

[6] Y. Wang, S. Chaudhuri, and L. E. Kavraki, "Point-based policy synthesis for POMDPs with boolean and quantitative objectives," 2019, pp. 1860–1867.

[7] K. Chatterjee, A. Elgyutt, P. Novotny, and O. Rouille, "Expectation optimization with probabilistic guarantees in POMDPs with discounted-sum objectives," in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018, pp. 4692–4699.

[8] A. Legay, B. Delahaye, and S. Bensalem, "Statistical model checking: An overview," in International Conference on Runtime Verification, 2010, pp. 122–135.

[9] S. Ross, B. Chaib-Draa et al., "AEMS: An anytime online search algorithm for approximate policy refinement in large POMDPs," in International Joint Conferences on Artificial Intelligence, 2007, pp. 2592–2598.

[10] K. Sun and V. Kumar, "Stochastic 2-D motion planning with a POMDP framework," arXiv preprint arXiv:1810.00204, 2018.

[11] D. M. Roijers, S. Whiteson, and F. A. Oliehoek, "Point-based planning for multi-objective POMDPs," in International Joint Conferences on Artificial Intelligence, 2015, pp. 1666–1672.

[12] Y. Luo, H. Bai, D. Hsu, and W. S. Lee, "Importance sampling for online planning under uncertainty," The International Journal of Robotics Research, pp. 162–181, 2019.

[13] H. Bai, D. Hsu, M. J. Kochenderfer, and W. S. Lee, "Unmanned aircraft collision avoidance using continuous-state POMDPs," Robotics: Science and Systems, pp. 1–8, 2012.

[14] P. Poupart, A. Malhotra, P. Pei, K.-E. Kim, B. Goh, and M. Bowling, "Approximate linear programming for constrained partially observable Markov decision processes," in Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence, 2015, pp. 3342–3348.

[15] P. Santana, S. Thiebaux, and B. Williams, "RAO*: An algorithm for chance-constrained POMDPs," in Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence, 2016.

[16] Y. Wang, S. Chaudhuri, and L. E. Kavraki, "Bounded policy synthesis for POMDPs with safe-reachability objectives," in International Conference on Autonomous Agents and Multiagent Systems, 2016, pp. 238–246.

[17] K. Chatterjee, M. Chmelík, and J. Davies, "A symbolic SAT-based algorithm for almost-sure reachability with small strategies in POMDPs," in Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence, 2016, pp. 3225–3232.

[18] S. Haesaert, P. Nilsson, C. Vasile, R. Thakker, A. Agha-mohammadi, A. Ames, and R. Murray, "Temporal logic control of POMDPs via label-based stochastic simulation relations," 2018, pp. 271–276.

[19] X. Zhang, B. Wu, and H. Lin, "Supervisor synthesis of POMDP based on automata learning," arXiv preprint arXiv:1703.08262, 2017.

[20] C.-I. Vasile, K. Leahy, E. Cristofalo, A. Jones, M. Schwager, and C. Belta, "Control in belief space with temporal logic specifications," in Decision and Control (CDC), 2016 IEEE 55th Conference on, 2016, pp. 7419–7424.

[21] J. Marecki and P. Varakantham, "Risk-sensitive planning in partially observable environments," in International Conference on Autonomous Agents and Multiagent Systems, 2010, pp. 1357–1368.

[22] K. Lesser and A. Abate, "Multiobjective optimal control with safety as a priority," IEEE Transactions on Control Systems Technology, pp. 1015–1027, 2018.

[23] A. Jain and S. Niekum, "Efficient hierarchical robot motion planning under uncertainty and hybrid dynamics," in Conference on Robot Learning, 2018, pp. 757–766.

[24] S. Junges, N. Jansen, R. Wimmer, T. Quatmann, L. Winterer, J. Katoen, and B. Becker, "Finite-state controllers of POMDPs using parameter synthesis," pp. 519–529, 2018.

[25] G. Norman, D. Parker, and X. Zou, "Verification and control of partially observable probabilistic systems," Real-Time Systems, pp. 354–402, 2017.

[26] K. Chatterjee, M. Chmelík, and U. Topcu, "Sensor synthesis for POMDPs with reachability objectives," in Twenty-Eighth International Conference on Automated Planning and Scheduling, 2018, pp. 47–55.

[27] K. Sen, M. Viswanathan, and G. Agha, "VESTA: A statistical model-checker and analyzer for probabilistic systems," in Second International Conference on the Quantitative Evaluation of Systems, 2005, pp. 251–252.

[28] A. Legay, S. Sedwards, and L.-M. Traonouez, "Estimating rewards & rare events in nondeterministic systems," Electronic Communications of the EASST, vol. 72, 2015.

[29] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, 2018.

[30] R. Lassaigne and S. Peyronnet, "Approximate planning and verification for large Markov decision processes," International Journal on Software Tools for Technology Transfer, pp. 457–467, 2015.

[31] P. Emami, A. J. Hamlet, and C. Crane, "POMDPy: An extensible framework for implementing POMDPs in Python," 2015. [Online]. Available: https://github.com/pemami4911/POMDPy

[32] E. Rohmer, S. P. N. Singh, and M. Freese, "V-REP: A versatile and scalable robot simulation framework," in Proc. of the International Conference on Intelligent Robots and Systems (IROS), 2013, pp. 1321–1326.

[33] Fetch Robotics, "Open ROS components for robots from Fetch Robotics," https://github.com/fetchrobotics/fetch_ros, 2018.