
A Composable Specification Language for Reinforcement Learning Tasks

Kishor Jothimurugan, Rajeev Alur, Osbert Bastani
University of Pennsylvania

{kishor,alur,obastani}@cis.upenn.edu

Abstract

Reinforcement learning is a promising approach for learning control policies for robot tasks. However, specifying complex tasks (e.g., with multiple objectives and safety constraints) can be challenging, since the user must design a reward function that encodes the entire task. Furthermore, the user often needs to manually shape the reward to ensure convergence of the learning algorithm. We propose a language for specifying complex control tasks, along with an algorithm that compiles specifications in our language into a reward function and automatically performs reward shaping. We implement our approach in a tool called SPECTRL, and show that it outperforms several state-of-the-art baselines.

1 Introduction

Reinforcement learning (RL) is a promising approach to learning control policies for robotics tasks [5, 21, 16, 15]. A key shortcoming of RL is that the user must manually encode the task as a real-valued reward function, which can be challenging for several reasons. First, for complex tasks with multiple objectives and constraints, the user must manually devise a single reward function that balances different parts of the task. Second, the state space must often be extended to encode the reward—e.g., adding indicators that keep track of which subtasks have been completed. Third, oftentimes, different reward functions can encode the same task, and the choice of reward function can have a large impact on the convergence of the RL algorithm. Thus, users must manually design rewards that assign “partial credit” for achieving intermediate goals, known as reward shaping [17].

For example, consider the task in Figure 1, where the state is the robot position and its remaining fuel, the action is a (bounded) robot velocity, and the task is

“Reach target q, then reach target p, while maintaining positive fuel and avoiding obstacle O”.

To encode this task, we would have to combine rewards for (i) reaching q, and then reaching p (where “reach x” denotes the task of reaching an ε box around x—the regions corresponding to p and q are denoted by P and Q respectively), (ii) avoiding region O, and (iii) maintaining positive fuel, into a single reward function. Furthermore, we would have to extend the state space to keep track of whether q has been reached—otherwise, the control policy would not know whether the current goal is to move towards q or p. Finally, we might need to shape the reward to assign partial credit for getting closer to q, or for reaching q without reaching p.

We propose a language for users to specify control tasks. Our language allows the user to specify objectives and safety constraints as logical predicates over states, and then compose these primitives sequentially or as disjunctions.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.


Figure 1: Example control task. The blue dashed trajectory satisfies the specification φex (ignoring the fuel budget), whereas the red dotted trajectory does not satisfy φex as it passes through the obstacle.

For example, the above task can be expressed as

φex = achieve (reach q; reach p) ensuring (avoid O ∧ fuel > 0), (1)

where fuel is the component of the state space keeping track of how much fuel is remaining.

The principle underlying our approach is that in many applications, users have in mind a sequence of high-level actions that are needed to accomplish a given task. For example, φex may encode the scenario where the user wants a quadcopter to fly to a location q, take a photograph, and then return back to its owner at position p, while avoiding a building O and without running out of battery. Alternatively, a user may want to program a warehouse robot to go to the next room, pick up a box, and then bring this item back to the first room. In addition to specifying sequences of tasks, users can also specify choices between multiple tasks (e.g., bring back any box).

Another key aspect of our approach is to allow the user to specify a task without providing the low-level sequence of actions needed to accomplish the task. Instead, analogous to how a compiler generates machine code from a program written by the user, we propose a compiler for our language that takes the user-provided task specification and generates a control policy that achieves the task. RL is a perfect tool for doing so—in particular, our algorithm compiles the task specification to a reward function, and then uses state-of-the-art RL algorithms to learn a control policy. Overall, the user provides the high-level task structure, and the RL algorithm fills in the low-level details.

A key challenge is that our specifications may encode rewards that are not Markov—e.g., in φex, the robot needs memory that keeps track of whether its current goal is reach q or reach p. Thus, our compiler automatically extends the state space using a task monitor, which is an automaton that keeps track of which subtasks have been completed.¹ Furthermore, this automaton may have nondeterministic transitions; thus, our compiler also extends the action space with actions for choosing state transitions. Intuitively, there may be multiple points in time at which a subtask is considered completed, and the robot must choose which one to use.

Another challenge is that the naïve choice of rewards—i.e., reward 1 if the task is completed and 0 otherwise—can be very sparse, especially for complex tasks. Thus, our compiler automatically performs two kinds of reward shaping based on the structure of the specification—it assigns partial credit (i) for partially accomplishing intermediate subtasks, and (ii) for completing more subtasks. For deterministic MDPs, our reward shaping is guaranteed to preserve the optimal policy; we empirically find it also works well for stochastic MDPs.

We have implemented our approach in a tool called SPECTRL,² and evaluated the performance of SPECTRL compared to a number of baselines. We show that SPECTRL learns policies that solve each task in our benchmark with a success rate of at least 97%. In summary, our contributions are:

• We propose a language for users to specify RL tasks (Section 2).

• We design an algorithm for compiling a specification into an RL problem, which can be solved using standard RL algorithms (Section 3).

• We have implemented SPECTRL, and empirically demonstrated its benefits (Section 4).

Related work. Imitation learning enables users to specify tasks by providing demonstrations of the desired task [18, 1, 26, 20, 10]. However, in many settings, it may be easier for the user to directly specify the task—e.g., when programming a warehouse robot, it may be easier to specify waypoints describing paths the robot should take than to manually drive the robot to obtain demonstrations.

¹ Intuitively, this construction is analogous to compiling a regular expression to a finite state automaton.
² SPECTRL stands for SPECifying Tasks for Reinforcement Learning.


Also, unlike imitation learning, our language allows the user to specify global safety constraints on the robot. Indeed, we believe our approach complements imitation learning, since the user can specify some parts of the task in our language and others using demonstrations.

Another approach is for the user to provide a policy sketch—i.e., a string of tokens specifying a sequence of subtasks [2]. However, tokens have no meaning, except that equal tokens represent the same task. Thus, policy sketches cannot be compiled to a reward function, which must be provided separately.

Our specification language is based on temporal logic [19], a language of logical formulas for specifying constraints over (typically, infinite) sequences of events happening over time. For example, temporal logic allows the user to specify that a logical predicate must be satisfied at some point in time (e.g., “eventually reach state q”) or that it must always be satisfied (e.g., “always avoid an obstacle”). In our language, these notions are represented using the achieve and ensuring operators, respectively. Our language restricts temporal logic in a way that enables us to perform reward shaping, and also adds useful operators such as sequencing that allow the user to easily express complex control tasks.

Algorithms have been designed for automatically synthesizing a control policy that satisfies a given temporal logic formula; see [4] for a recent survey, and [12, 25, 6, 9] for applications to robotic motion planning. However, these algorithms are typically based on exhaustive search over control policies. Thus, as with finite-state planning algorithms such as value iteration [22], they cannot be applied to tasks with continuous state and action spaces that can be solved using RL.

Reward machines have been proposed as a high-level way to specify tasks [11]. In their work, the user provides a specification in the form of a finite state machine along with reward functions for each state. Then, they propose an algorithm for learning multiple tasks simultaneously by applying the Q-learning updates across different specifications. At a high level, these reward machines are similar to the task monitors defined in our work. However, we differ from their approach in two ways. First, in contrast to their work, the user only needs to provide a high-level logical specification; we automatically generate a task monitor from this specification. Second, our notion of task monitor has a finite set of registers that can store real values; in contrast, their finite state reward machines cannot store quantitative information.

The most closely related work is [13], which proposes a variant of temporal logic called truncated LTL, along with an algorithm for compiling a specification written in this language to a reward function that can be optimized using RL. However, they do not use any analog of the task monitor, which we demonstrate is needed to handle non-Markovian specifications. Finally, [24] allows the user to separately specify objectives and safety constraints, and then uses RL to learn a policy. However, they do not provide any way to compose rewards, and do not perform any reward shaping. Also, their approach is tied to a specific RL algorithm. We show empirically that our approach substantially outperforms both these approaches.

Finally, an alternative approach is to manually specify rewards for sub-goals to improve performance. However, many challenges arise when implementing sub-goal based rewards—e.g., how does achieving a sub-goal count compared to violating a constraint, how to handle sub-goals that can be achieved in multiple ways, how to ensure the agent does not repeatedly obtain a reward for a previously completed sub-goal, etc. As tasks become more complex and deeply nested, manually specifying rewards for sub-goals becomes very challenging. Our system is designed to automatically solve these issues.

2 Task Specification Language

Markov decision processes. A Markov decision process (MDP) is a tuple (S, D, A, P, T), where S ⊆ R^n are the states, D is the initial state distribution, A ⊆ R^m are the actions, P : S × A × S → [0, 1] are the transition probabilities, and T ∈ N is the time horizon. A rollout ζ ∈ Z of length t is a sequence ζ = s0 −a0→ s1 −a1→ · · · −at−1→ st, where si ∈ S and ai ∈ A. Given a (deterministic) policy π : Z → A, we can generate a rollout using ai = π(ζ0:i). Optionally, an MDP can also include a reward function R : Z → R.³

³ Note that we consider rollout-based rewards rather than state-based rewards. Most modern RL algorithms, such as policy gradient algorithms, can use rollout-based rewards.


Specification language. Intuitively, a specification φ in our language is a logical formula specifying whether a given rollout ζ successfully accomplishes the desired task—in particular, it can be interpreted as a function φ : Z → B, where B = {true, false}, defined by

φ(ζ) = I[ζ successfully achieves the task],

where I is the indicator function. Formally, the user first defines a set of atomic predicates P0, where every p ∈ P0 is associated with a function ⟦p⟧ : S → B such that ⟦p⟧(s) indicates whether s satisfies p. For example, given x ∈ S, the atomic predicate

⟦reach x⟧(s) = (‖s − x‖∞ < 1)

indicates whether the robot is in a state near x, and given a rectangular region O ⊆ S, the atomic predicate

⟦avoid O⟧(s) = (s ∉ O)

indicates if the robot is avoiding O. In general, the user can define a new atomic predicate as an arbitrary function ⟦p⟧ : S → B. Next, predicates b ∈ P are conjunctions and disjunctions of atomic predicates. In particular, the syntax of predicates is given by⁴

b ::= p | (b1 ∧ b2) | (b1 ∨ b2),

where p ∈ P0. Similar to atomic predicates, each predicate b ∈ P corresponds to a function ⟦b⟧ : S → B, defined recursively by ⟦b1 ∧ b2⟧(s) = ⟦b1⟧(s) ∧ ⟦b2⟧(s) and ⟦b1 ∨ b2⟧(s) = ⟦b1⟧(s) ∨ ⟦b2⟧(s). Finally, the syntax of our specifications is given by⁵

φ ::= achieve b | φ1 ensuring b | φ1;φ2 | φ1 or φ2,

where b ∈ P. Intuitively, the first construct means that the robot should try to reach a state s such that ⟦b⟧(s) = true. The second construct says that the robot should try to satisfy φ1 while always staying in states s such that ⟦b⟧(s) = true. The third construct says the robot should try to satisfy task φ1 and then task φ2. The fourth construct means that the robot should try to satisfy either task φ1 or task φ2. Formally, we associate a function ⟦φ⟧ : Z → B with φ recursively as follows:

⟦achieve b⟧(ζ) = ∃ i < t, ⟦b⟧(si)
⟦φ ensuring b⟧(ζ) = ⟦φ⟧(ζ) ∧ (∀ i < t, ⟦b⟧(si))
⟦φ1; φ2⟧(ζ) = ∃ i < t, (⟦φ1⟧(ζ0:i) ∧ ⟦φ2⟧(ζi:t))
⟦φ1 or φ2⟧(ζ) = ⟦φ1⟧(ζ) ∨ ⟦φ2⟧(ζ),

where t is the length of ζ. A rollout ζ satisfies φ if ⟦φ⟧(ζ) = true, which is denoted ζ ⊨ φ.
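To make these definitions concrete, the following Python sketch shows one way to represent predicates as Boolean functions over states and to evaluate the Boolean semantics recursively over a finite rollout (given as its list of states). The helper names are ours and the exact boundary convention for the prefix ζ0:i is glossed over; this is an illustration, not SPECTRL's actual API.

```python
# Illustrative encoding: a predicate is a function from an MDP state to bool,
# and a rollout is represented by its list of states [s0, ..., st].

def reach(x, eps=1.0):
    """Atomic predicate: the position components of s lie in an eps-box around x."""
    return lambda s: max(abs(si - xi) for si, xi in zip(s, x)) < eps

def avoid(lo, hi):
    """Atomic predicate: the state lies outside the rectangle [lo, hi]."""
    return lambda s: not all(l <= si <= h for si, l, h in zip(s, lo, hi))

def conj(b1, b2):   # b1 ∧ b2
    return lambda s: b1(s) and b2(s)

def disj(b1, b2):   # b1 ∨ b2
    return lambda s: b1(s) or b2(s)

# Boolean semantics, defined recursively on the structure of the specification.

def sat_achieve(b, states):
    """⟦achieve b⟧(ζ): some state of the rollout satisfies b."""
    return any(b(s) for s in states)

def sat_ensuring(sat_phi, b, states):
    """⟦φ ensuring b⟧(ζ): φ holds and b holds at every state."""
    return sat_phi(states) and all(b(s) for s in states)

def sat_seq(sat_phi1, sat_phi2, states):
    """⟦φ1; φ2⟧(ζ): some split point gives a prefix satisfying φ1
    and a suffix satisfying φ2."""
    return any(sat_phi1(states[: i + 1]) and sat_phi2(states[i:])
               for i in range(len(states)))

def sat_or(sat_phi1, sat_phi2, states):
    """⟦φ1 or φ2⟧(ζ): either subformula holds."""
    return sat_phi1(states) or sat_phi2(states)
```

For instance, the constraint in φex could be written as conj(avoid((4, 4), (6, 6)), lambda s: s[2] > 0), with s[2] the fuel component of the state.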

Problem formulation. Given an MDP and a specification φ, our goal is to compute

π∗ ∈ arg max_π Pr_{ζ∼Dπ}[⟦φ⟧(ζ) = true],    (2)

where Dπ is the distribution over rollouts generated by π. In other words, we want to learn a policy π∗ that maximizes the probability that a generated rollout ζ satisfies φ.
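Objective (2) can be estimated empirically by sampling rollouts from a candidate policy and checking each against φ. A minimal sketch, assuming simple env_reset/env_step interfaces (our assumptions, not part of the paper) and a Boolean evaluator such as the sat_* helpers above:

```python
def estimate_satisfaction(env_reset, env_step, policy, spec_sat, horizon, n=100):
    """Monte Carlo estimate of Pr[⟦φ⟧(ζ) = true] under a given policy.

    env_reset() -> initial state; env_step(s, a) -> next state (assumed interfaces).
    policy(states_so_far) -> action; spec_sat(states) -> bool (Boolean semantics of φ).
    """
    successes = 0
    for _ in range(n):
        s = env_reset()
        states = [s]
        for _ in range(horizon):
            a = policy(states)
            s = env_step(s, a)
            states.append(s)
        successes += spec_sat(states)
    return successes / n
```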

3 Compilation and Learning Algorithms

In this section, we describe our algorithm for reducing the above problem (2) for a given MDP (S, D, A, P, T) and a specification φ to an RL problem specified as an MDP with a reward function. At a high level, our algorithm extends the state space S to keep track of completed subtasks and constructs a reward function R : Z → R encoding φ. A key feature of our algorithm is that the user has control over the compilation process—we provide a natural default compilation strategy, but the user can extend or modify our approach to improve the performance of the RL algorithm. We give proofs in Appendix B.

Quantitative semantics. So far, we have associated specifications φ with Boolean semantics (i.e., ⟦φ⟧(ζ) ∈ B). A naïve strategy is to assign rewards to rollouts based on whether they satisfy φ:

R(ζ) = 1 if ζ ⊨ φ,   and   R(ζ) = 0 otherwise.

⁴ Formally, a predicate is a string in the context-free language generated by this context-free grammar.
⁵ Here, achieve and ensuring correspond to the “eventually” and “always” operators in temporal logic.


[Figure 2 diagram: monitor states q1–q4 with registers x1–x4 (initialized to x1 ← 0, x2 ← 0, x3 ← ∞, x4 ← ∞); transitions guarded by s ∈ Q, min{x1, x3, x4} > 0, and s ∈ P, with updates x1 ← 1 − d∞(s, q) and x2 ← 1 − d∞(s, p); final-state reward ρ : min{x1, x2, x3, x4}.]

Figure 2: An example of a task monitor. States are labeled with rewards (prefixed with “ρ :”). Transitions are labeled with transition conditions (prefixed with “Σ :”), as well as register update rules. A transition from q2 to q4 is omitted for clarity. Also, u denotes the two updates x3 ← min{x3, d∞(s, O)} and x4 ← min{x4, fuel(s)}.

However, it is usually difficult to learn a policy to maximize this reward due to its discrete nature. A common strategy is to provide a shaped reward that quantifies the “degree” to which ζ satisfies φ. Our algorithm uses an approach based on quantitative semantics for temporal logic [7, 8, 14]. In particular, we associate an alternate interpretation of a specification φ as a real-valued function ⟦φ⟧q : Z → R. To do so, the user provides quantitative semantics for atomic predicates p ∈ P0—in particular, they provide a function ⟦p⟧q : S → R that quantifies the degree to which p holds for s ∈ S. For example, we can use

⟦reach x⟧q(s) = 1 − d∞(s, x)

⟦avoid O⟧q(s) = d∞(s, O),

where d∞ is the L∞ distance between points, with the usual extension to sets. These semantics should satisfy ⟦p⟧q(s) > 0 if and only if ⟦p⟧(s) = true, and a larger value of ⟦p⟧q should correspond to an increase in the “degree” to which p holds. Then, the quantitative semantics for predicates b ∈ P are ⟦b1 ∧ b2⟧q(s) = min{⟦b1⟧q(s), ⟦b2⟧q(s)} and ⟦b1 ∨ b2⟧q(s) = max{⟦b1⟧q(s), ⟦b2⟧q(s)}. Assuming ⟦p⟧q satisfies the above properties, then ⟦b⟧q > 0 if and only if ⟦b⟧ = true.

In principle, we could now define quantitative semantics for specifications φ:

⟦achieve b⟧q(ζ) = max_{i<t} ⟦b⟧q(si)
⟦φ ensuring b⟧q(ζ) = min{⟦φ⟧q(ζ), ⟦b⟧q(s0), ..., ⟦b⟧q(st−1)}
⟦φ1; φ2⟧q(ζ) = max_{i<t} min{⟦φ1⟧q(ζ0:i), ⟦φ2⟧q(ζi:t)}
⟦φ1 or φ2⟧q(ζ) = max{⟦φ1⟧q(ζ), ⟦φ2⟧q(ζ)}.

Then, it is easy to show that ⟦φ⟧(ζ) = true if and only if ⟦φ⟧q(ζ) > 0, so we could define a reward function R(ζ) = ⟦φ⟧q(ζ). However, one of our key goals is to extend the state space so the policy knows which subtasks have been completed. On the other hand, the semantics ⟦φ⟧q quantify over all possible ways that subtasks could have been completed in hindsight (i.e., once the entire trajectory is known). For example, there may be multiple points in a trajectory when a subtask reach q could be considered as completed. Below, we describe our construction of the reward function, which is based on ⟦φ⟧q, but applied to a single choice of time steps on which each subtask is completed.
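These quantitative semantics also admit a direct recursive implementation over a rollout, mirroring the min/max structure above. As in the Boolean sketch earlier, the rollout is treated as its list of states, and the helper names are ours, not SPECTRL's.

```python
def q_reach(x):
    """⟦reach x⟧q(s) = 1 − d∞(s, x) over the position components."""
    return lambda s: 1.0 - max(abs(si - xi) for si, xi in zip(s, x))

def q_conj(b1q, b2q):   # ⟦b1 ∧ b2⟧q = min of the two values
    return lambda s: min(b1q(s), b2q(s))

def q_achieve(bq, states):
    """⟦achieve b⟧q(ζ) = max over states of ⟦b⟧q."""
    return max(bq(s) for s in states)

def q_ensuring(q_phi, bq, states):
    """⟦φ ensuring b⟧q(ζ) = min of ⟦φ⟧q(ζ) and ⟦b⟧q at every state."""
    return min([q_phi(states)] + [bq(s) for s in states])

def q_seq(q_phi1, q_phi2, states):
    """⟦φ1; φ2⟧q(ζ) = max over split points of min(prefix value, suffix value)."""
    return max(min(q_phi1(states[: i + 1]), q_phi2(states[i:]))
               for i in range(len(states)))

def q_or(q_phi1, q_phi2, states):
    """⟦φ1 or φ2⟧q(ζ) = max of the two values."""
    return max(q_phi1(states), q_phi2(states))
```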

Task monitor. Intuitively, a task monitor is a finite-state automaton (FSA) that keeps track of which subtasks have been completed and which constraints are still satisfied. Unlike an FSA, its transitions may depend on the state s ∈ S of a given MDP. Also, since we are using quantitative semantics, the task monitor has to keep track of the degree to which subtasks are completed and the degree to which constraints are satisfied; thus, it includes registers that keep track of these values. A key challenge is that the task monitor is nondeterministic; as we describe below, we let the policy resolve the nondeterminism, which corresponds to choosing which subtask to complete on each step.

Formally, a task monitor is a tuple M = (Q, X, Σ, U, ∆, q0, v0, F, ρ). First, Q is a finite set of monitor states, which are used to keep track of which subtasks have been completed. Also, X is a finite set of registers, which are variables used to keep track of the degree to which the specification holds so far. Given an MDP (S, D, A, P, T), an augmented state is a tuple (s, q, v) ∈ S × Q × V, where V = R^X—i.e., an MDP state s ∈ S, a monitor state q ∈ Q, and a vector v ∈ V encoding the value of each register in the task monitor. An augmented state is analogous to a state of an FSA.

The transitions ∆ of the task monitor depend on the augmented state; thus, they need to specify two pieces of information: (i) conditions on the MDP states and registers for the transition to be enabled, and (ii) how the registers are updated. To handle (i), we consider a set Σ of predicates over S × V, and to handle (ii), we consider a set U of functions u : S × V → V. Then, ∆ ⊆ Q × Σ × U × Q is a finite set of (nondeterministic) transitions, where (q, σ, u, q′) ∈ ∆ encodes augmented transitions (s, q, v) −a→ (s′, q′, u(s, v)), where s −a→ s′ is an MDP transition, which can be taken as long as σ(s, v) = true. Finally, v0 ∈ R^X is the vector of initial register values, F ⊆ Q is a set of final monitor states, and ρ is a reward function ρ : S × F × V → R.

Given an MDP (S, D, A, P, T) and a specification φ, our algorithm constructs a task monitor Mφ = (Q, X, Σ, U, ∆, q0, v0, F, ρ) whose states and registers keep track of which subtasks of φ have been completed. Our task monitor construction algorithm is analogous to compiling a regular expression to an FSA. More specifically, it is analogous to algorithms for compiling temporal logic formulas to automata [23]. We detail this algorithm in Appendix A. The underlying graph of a task monitor constructed from any given specification is acyclic (ignoring self loops), and final states correspond to sink vertices with no outgoing edges (except a self loop).
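A task monitor can be represented directly as the tuple above. The following dataclass sketch fixes one possible in-memory representation; the field names and the dictionary encoding of register valuations are our choices, not SPECTRL's.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[float, ...]          # MDP state s ∈ S
Regs = Dict[str, float]           # register valuation v ∈ R^X

@dataclass
class Transition:
    source: str                                  # q
    guard: Callable[[State, Regs], bool]         # σ(s, v)
    update: Callable[[State, Regs], Regs]        # u(s, v)
    target: str                                  # q′

@dataclass
class TaskMonitor:
    states: List[str]                            # Q
    registers: List[str]                         # X
    transitions: List[Transition]                # ∆ (may be nondeterministic)
    init_state: str                              # q0
    init_regs: Regs                              # v0
    final_states: List[str]                      # F
    reward: Callable[[State, str, Regs], float]  # ρ(s, q, v), meaningful for q ∈ F

    def enabled(self, s: State, q: str, v: Regs) -> List[Transition]:
        """Transitions out of q whose guard holds in the augmented state (s, q, v)."""
        return [d for d in self.transitions if d.source == q and d.guard(s, v)]
```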

As an example, the task monitor for φex is shown in Figure 2. It has monitor states Q = {q1, q2, q3, q4} and registers X = {x1, x2, x3, x4}. The monitor states encode when the robot (i) has not yet reached q (q1), (ii) has reached q, but has not yet returned to p (q2 and q3), and (iii) has returned to p (q4); q3 is an intermediate monitor state used to ensure that the constraints are satisfied before continuing. Register x1 records ⟦reach q⟧q(s) = 1 − d∞(s, q) when transitioning from q1 to q2, and x2 records ⟦reach p⟧q(s) = 1 − d∞(s, p) when transitioning from q3 to q4. Register x3 keeps track of the minimum value of ⟦avoid O⟧q(s) = d∞(s, O) over states s in the rollout, and x4 keeps track of the minimum value of ⟦fuel > 0⟧q(s) over states s in the rollout.

Augmented MDP. Given an MDP, a specification φ, and its task monitor Mφ, our algorithm constructs an augmented MDP, which is an MDP with a reward function (S̄, s̄0, Ā, P̄, R̄, T). Intuitively, if π̄∗ is a good policy (one that achieves a high expected reward) for the augmented MDP, then rollouts generated using π̄∗ should satisfy φ with high probability.

In particular, we have S̄ = S × Q × V and s̄0 = (s0, q0, v0). The transitions P̄ are based on P and ∆. However, the task monitor transitions ∆ may be nondeterministic. To resolve this nondeterminism, we require that the policy decides which task monitor transitions to take. In particular, we extend the actions Ā = A × Aφ to include a component Aφ = ∆ indicating which one to take at each step. An augmented action (a, δ) ∈ Ā, where δ = (q, σ, u, q′), is only available in augmented state s̄ = (s, q, v) if σ(s, v) = true. Then, the augmented transition probability is given by

P̄((s, q, v), (a, (q, σ, u, q′)), (s′, q′, u(s, v))) = P(s, a, s′).

Next, an augmented rollout of length t is a sequence ζ̄ = (s0, q0, v0) −a0→ · · · −at−1→ (st, qt, vt) of augmented transitions. The projection proj(ζ̄) = s0 −a0→ · · · −at−1→ st of ζ̄ is the corresponding (normal) rollout. Then, the augmented rewards

R̄(ζ̄) = ρ(sT, qT, vT) if qT ∈ F,   and   R̄(ζ̄) = −∞ otherwise,

are constructed based on F and ρ. The augmented rewards satisfy the following property.

Theorem 3.1. For any MDP, specification φ, and rollout ζ of the MDP, ζ satisfies φ if and only if there exists an augmented rollout ζ̄ such that (i) R̄(ζ̄) > 0, and (ii) proj(ζ̄) = ζ.

Thus, if we use RL to learn an optimal augmented policy π̄∗ over augmented states, then π̄∗ is more likely to generate augmented rollouts ζ̄ such that proj(ζ̄) satisfies φ.
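Putting the pieces together, the augmented MDP can be realized as a thin wrapper around the original environment and a task monitor (as in the TaskMonitor sketch above); the policy's extra action component selects one of the currently enabled monitor transitions. The environment interfaces below are assumptions for illustration, and the terminal-only reward is a simplification of the rollout-based R̄.

```python
class AugmentedMDP:
    """Sketch of the augmented MDP with S̄ = S × Q × V; not SPECTRL's actual code."""

    def __init__(self, env_reset, env_step, monitor, horizon):
        self.env_reset, self.env_step = env_reset, env_step
        self.monitor, self.horizon = monitor, horizon

    def reset(self):
        self.t = 0
        s = self.env_reset()
        self.aug = (s, self.monitor.init_state, dict(self.monitor.init_regs))
        return self.aug

    def step(self, action, choice):
        """action: MDP action a; choice: index selecting an enabled monitor transition."""
        s, q, v = self.aug
        enabled = self.monitor.enabled(s, q, v)
        # Every monitor state has a self loop with the true guard (by construction),
        # so `enabled` is never empty.
        delta = enabled[choice % len(enabled)]
        s_next = self.env_step(s, action)            # MDP transition s → s′
        self.aug = (s_next, delta.target, delta.update(s, v))
        self.t += 1
        done = self.t >= self.horizon
        # Reward R̄: ρ at final monitor states, −∞ otherwise (before shaping).
        if done:
            s_T, q_T, v_T = self.aug
            reward = (self.monitor.reward(s_T, q_T, v_T)
                      if q_T in self.monitor.final_states else float("-inf"))
        else:
            reward = 0.0
        return self.aug, reward, done
```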

Reward shaping. As discussed before, our algorithm constructs a shaped reward function that provides “partial credit” based on the degree to which φ is satisfied. We have already described one step of reward shaping—i.e., using quantitative semantics instead of the Boolean semantics.


However, the augmented rewards R̄ are −∞ unless a run reaches a final state of the task monitor. Thus, our algorithm performs an additional step of reward shaping—in particular, it constructs a reward function Rs that gives partial credit for accomplishing subtasks in the MDP.

For a non-final monitor state q, let α : S × Q × V → R be defined by

α(s, q, v) = max_{(q,σ,u,q′)∈∆, q′≠q} ⟦σ⟧q(s, v).

Intuitively, α quantifies how “close” an augmented state s̄ = (s, q, v) is to transitioning to another augmented state with a different monitor state. Then, our algorithm assigns partial credit to augmented states where α is larger.

However, to ensure that a good policy according to the shaped rewards Rs is also a good policy according to R̄, it does so in a way that preserves the ordering of the cumulative rewards for rollouts—i.e., for two length-T augmented rollouts ζ̄ and ζ̄′, it guarantees that if R̄(ζ̄) > R̄(ζ̄′), then Rs(ζ̄) > Rs(ζ̄′).

To this end, we assume that we are given a lower bound Cℓ on the final reward achieved when reaching a final monitor state—i.e., Cℓ < R̄(ζ̄) for all ζ̄ with final augmented state s̄T = (sT, qT, vT) such that qT ∈ F is a final monitor state. Furthermore, we assume that we are given an upper bound Cu on the absolute value of α over non-final monitor states—i.e., Cu ≥ |α(s, q, v)| for any augmented state such that q ∉ F.

Now, for any q ∈ Q, let dq be the length of the longest path from q0 to q in the graph of Mφ (ignoring self loops in ∆) and D = max_{q∈Q} dq. Given an augmented rollout ζ̄, let s̄i = (si, qi, vi) be the first augmented state in ζ̄ such that qi = qi+1 = ... = qT. Then, the shaped reward is

Rs(ζ̄) = max_{i≤j<T} α(sj, qT, vj) + 2Cu · (dqT − D) + Cℓ   if qT ∉ F,   and   Rs(ζ̄) = R̄(ζ̄) otherwise.

If qT ∉ F, then the first term of Rs(ζ̄) computes how close ζ̄ was to transitioning to a new monitor state. The second term ensures that moving closer to a final state always increases reward. Finally, the last term ensures that rewards R̄(ζ̄) for qT ∈ F are always higher than rewards for qT ∉ F. The following theorem follows straightforwardly.

Theorem 3.2. For two augmented rollouts ζ̄, ζ̄′, (i) if R̄(ζ̄) > R̄(ζ̄′), then Rs(ζ̄) > Rs(ζ̄′), and (ii) if ζ̄ and ζ̄′ end in distinct non-final monitor states qT and q′T such that dqT > dq′T, then Rs(ζ̄) ≥ Rs(ζ̄′).
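The shaped reward Rs can be computed directly from an augmented rollout once Cℓ, Cu, and the depths dq are known. The sketch below assumes the rollout is given as a list of augmented states (si, qi, vi) and reuses the TaskMonitor sketch above; it is illustrative, not SPECTRL's implementation.

```python
def shaped_reward(aug_states, monitor, alpha, depth, C_l, C_u):
    """Shaped reward Rs for one augmented rollout (a sketch).

    aug_states: [(s0, q0, v0), ..., (sT, qT, vT)]
    alpha(s, q, v): the partial-credit term α defined above
    depth: dict mapping each monitor state q to d_q (longest path from q0)
    C_l, C_u: the bounds Cℓ and Cu assumed above
    """
    s_T, q_T, v_T = aug_states[-1]
    if q_T in monitor.final_states:
        return monitor.reward(s_T, q_T, v_T)       # Rs = R̄ on final monitor states
    # First index i with q_i = q_{i+1} = ... = q_T.
    i = len(aug_states) - 1
    while i > 0 and aug_states[i - 1][1] == q_T:
        i -= 1
    D = max(depth.values())
    # Max of α over the states since the last monitor-state change (the final
    # state is included here, glossing over the j < T boundary in the text).
    partial = max(alpha(s, q_T, v) for s, _, v in aug_states[i:])
    return partial + 2 * C_u * (depth[q_T] - D) + C_l
```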

Reinforcement learning. Once our algorithm has constructed an augmented MDP, it can use any RL algorithm to learn an augmented policy π̄ : S̄ → Ā for the augmented MDP:

π̄∗ ∈ arg max_{π̄} E_{ζ̄∼Dπ̄}[Rs(ζ̄)]

where Dπ̄ denotes the distribution over augmented rollouts generated by policy π̄. We solve this RL problem using augmented random search (ARS), a state-of-the-art RL algorithm [15].

After computing π̄∗, we can convert π̄∗ to a projected policy π∗ = proj(π̄∗) for the original MDP by integrating π̄∗ with the task monitor Mφ, which keeps track of the information needed for π∗ to make decisions. More precisely, proj(π̄∗) includes internal memory that keeps track of the current monitor state and register value (qt, vt) ∈ Q × V. It initializes this memory to the initial monitor state q0 and initial register valuation v0. Given an augmented action (a, (q, σ, u, q′)) = π̄∗((st, qt, vt)), it updates this internal memory using the rules qt+1 = q′ and vt+1 = u(st, vt).

Finally, we use a neural network architecture similar to neural module networks [3, 2], where different neural networks accomplish different subtasks in φ. In particular, an augmented policy π̄ is a set of neural networks {Nq | q ∈ Q}, where Q are the monitor states in Mφ. Each Nq takes as input (s, v) ∈ S × V and outputs an augmented action Nq(s, v) = (a, a′) ∈ R^{k+2}, where k is the out-degree of q in Mφ; then, π̄(s, q, v) = Nq(s, v).
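This modular policy can be sketched as a dictionary from monitor states to small networks whose output is split into the MDP action and scores over the outgoing monitor transitions. Since ARS only needs forward passes, the sketch below uses plain NumPy MLPs; layer sizes, the omission of biases, and how the transition choice is decoded are our simplifications, not necessarily SPECTRL's exact architecture.

```python
import numpy as np

class MonitorStatePolicy:
    """One small MLP per monitor state q; a sketch of the modular augmented policy."""

    def __init__(self, monitor, state_dim, action_dim, hidden=30, seed=0):
        rng = np.random.default_rng(seed)
        self.nets = {}
        in_dim = state_dim + len(monitor.registers)
        for q in monitor.states:
            # Out-degree of q (self loops included), so the output is action + k scores.
            k = sum(d.source == q for d in monitor.transitions)
            out_dim = action_dim + k
            self.nets[q] = [rng.normal(0, 0.1, (in_dim, hidden)),
                            rng.normal(0, 0.1, (hidden, hidden)),
                            rng.normal(0, 0.1, (hidden, out_dim))]
        self.action_dim = action_dim
        self.registers = monitor.registers

    def __call__(self, s, q, v):
        x = np.concatenate([np.asarray(s, dtype=float),
                            np.asarray([v[r] for r in self.registers], dtype=float)])
        W1, W2, W3 = self.nets[q]
        h = np.maximum(x @ W1, 0.0)                 # two ReLU hidden layers
        h = np.maximum(h @ W2, 0.0)
        out = h @ W3
        action = np.tanh(out[: self.action_dim])    # bounded MDP action
        choice = int(np.argmax(out[self.action_dim:]))  # which enabled transition
        return action, choice
```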


Figure 3: Learning curves for φ1, φ2, φ3 and φ4 (top, left to right), and φ5, φ6 and φ7 (bottom, left to right), for SPECTRL (green), TLTL (blue), CCE (yellow), and SPECTRL without reward shaping (purple). The x-axis shows the number of sample trajectories, and the y-axis shows the probability of satisfying the specification (estimated using samples). To exclude outliers, we omitted one best and one worst run out of the 5 runs. The plots are the average over the remaining 3 runs with error bars indicating one standard deviation around the average.

4 Experiments

Setup. We implemented our algorithm in a tool SPECTRL⁶, and used it to learn policies for a variety of specifications. We consider a dynamical system with states S = R^2 × R, where (x, r) ∈ S encodes the robot position x and its remaining fuel r, actions A = [−1, 1]^2 where an action a ∈ A is the robot velocity, and transitions f(x, r, a) = (x + a + ε, r − 0.1 · |x1| · ‖a‖), where ε ∼ N(0, σ²I) and the fuel consumed is proportional to the product of speed and distance from the y-axis. The initial state is s0 = (5, 0, 7), and the horizon is T = 40.
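These benchmark dynamics are simple enough to state directly in code. A sketch of one step of the system follows; the noise scale sigma and the choice of Euclidean norm for ‖a‖ are our assumptions, since they are not pinned down in the text above.

```python
import numpy as np

def env_step(s, a, sigma=0.05, rng=None):
    """One step of the 2D robot-with-fuel benchmark: s = (x1, x2, r), a ∈ [−1, 1]²."""
    rng = rng if rng is not None else np.random.default_rng()
    x = np.asarray(s[:2], dtype=float)
    r = float(s[2])
    a = np.clip(np.asarray(a, dtype=float), -1.0, 1.0)
    eps = rng.normal(0.0, sigma, size=2)                      # ε ~ N(0, σ²I)
    x_next = x + a + eps
    r_next = r - 0.1 * abs(x[0]) * float(np.linalg.norm(a))   # fuel ∝ |x1| · ‖a‖
    return (float(x_next[0]), float(x_next[1]), r_next)

s0 = (5.0, 0.0, 7.0)   # initial state (position and fuel); horizon T = 40
```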

In Figure 3, we consider the following specifications, where O = [4, 6]× [4, 6]:

• φ1 = achieve reach (5, 10) ensuring (avoid O)

• φ2 = achieve reach (5, 10) ensuring (avoid O ∧ (r > 0))

• φ3 = achieve (reach [(5, 10); (5, 0)]) ensuring avoid O

• φ4 = achieve (reach (5, 10) or reach (10, 0); reach (10, 10)) ensuring avoid O

• φ5 = achieve (reach [(5, 10); (5, 0); (10, 0)]) ensuring avoid O

• φ6 = achieve (reach [(5, 10); (5, 0); (10, 0); (10, 10)]) ensuring avoid O

• φ7 = achieve (reach [(5, 10); (5, 0); (10, 0); (10, 10); (0, 0)]) ensuring avoid O

where the abbreviation achieve (b; b′) denotes achieve b; achieve b′ and the abbreviation reach [p1; p2] denotes reach p1; reach p2. For all specifications, each Nq has two fully connected hidden layers with 30 neurons each and ReLU activations, and a tanh function as its output layer. We compare our algorithm to [13] (TLTL), which directly uses the quantitative semantics of the specification as the reward function (with ARS as the learning algorithm), and to the constrained cross-entropy method (CCE) [24], which is a state-of-the-art RL algorithm for learning policies to perform tasks with constraints. We used neural networks with two hidden layers and 50 neurons per layer for both baselines.

Results. Figure 3 shows learning curves of SPECTRL (our tool), TLTL, and CCE. In addition, it shows SPECTRL without reward shaping (Unshaped), which uses the rewards R̄ instead of Rs. These plots demonstrate the ability of SPECTRL to outperform state-of-the-art baselines. For specifications φ1, ..., φ5, the curve for SPECTRL gets close to 100% in all executions, and for φ6 and φ7, it gets close to 100% in 4 out of 5 executions. The performance of CCE drops when multiple constraints (here, obstacle and fuel) are added (i.e., φ2). TLTL performs similarly to SPECTRL on tasks φ1, φ3, and φ4 (at least in some executions), but SPECTRL converges faster for φ1 and φ4.

⁶ The implementation can be found at https://github.com/keyshor/spectrl_tool.


Figure 4: Sample complexity curves (left) with the number of nested sequencing operators on the x-axis and the average number of samples to converge on the y-axis. Learning curve for the cartpole example (right).

Since TLTL and CCE use a single neural network to encode the policy as a function of state, they perform poorly in tasks that require memory—i.e., φ5, φ6, and φ7. For example, to satisfy φ5, the action that should be taken at s = (5, 0) depends on whether (5, 10) has been visited. In contrast, SPECTRL performs well on these tasks since its policy is based on the monitor state.

These results also demonstrate the importance of reward shaping. Without it, ARS cannot learn unless it randomly samples a policy that reaches a final monitor state. Reward shaping is especially important for specifications that include many sequencing operators (φ; φ′)—i.e., specifications φ5, φ6, and φ7.

Figure 4 (left) shows how sample complexity grows with the number of nested sequencing operators (φ1, φ3, φ5, φ6, φ7). Each curve indicates the average number of samples needed to learn a policy that achieves a satisfaction probability ≥ τ. SPECTRL scales well with the size of the specification.

Cartpole. Finally, we applied SPECTRL to a different control task—namely, to learn a policy for the version of cart-pole in OpenAI Gym, in which we used continuous actions instead of discrete actions. The specification is to move the cart to the right and move back left without letting the pole fall. The formal specification is given by

φ = achieve (reach 0.5; reach 0.0) ensuring balance

where the predicate balance holds when the vertical angle of the pole is smaller than π/15 in absolute value. Figure 4 (right) shows the learning curve for this task averaged over 3 runs of the algorithm, along with the three baselines. TLTL is able to learn a policy to perform this task, but it converges more slowly than SPECTRL; CCE is unable to learn a policy satisfying this specification.

5 Conclusion

We have proposed a language for formally specifying control tasks and an algorithm to learn policies to perform tasks specified in the language. Our algorithm first constructs a task monitor from the given specification, and then uses the task monitor to assign shaped rewards to runs of the system. Furthermore, the monitor state is also given as input to the controller, which enables our algorithm to learn policies for non-Markovian specifications. Finally, we implemented our approach in a tool called SPECTRL, which enables users to program what the agent needs to do at a high level; then, it automatically learns a policy that tries to best satisfy the user intent. We also demonstrate that SPECTRL can be used to learn policies for complex specifications, and that it can outperform state-of-the-art baselines.

Acknowledgements. We thank the reviewers for their insightful comments. This work was partially supported by NSF grant CCF 1723567 and by AFRL and DARPA under Contract No. FA8750-18-C-0090.

References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, pages 1–8, 2004.


[2] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning, pages 166–175, 2017.

[3] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.

[4] Roderick Bloem, Krishnendu Chatterjee, and Barbara Jobstmann. Graph games and reactive synthesis. In Handbook of Model Checking, pages 921–962. Springer, 2018.

[5] Steve Collins, Andy Ruina, Russ Tedrake, and Martijn Wisse. Efficient bipedal robots based on passive-dynamic walkers. Science, 307(5712):1082–1085, 2005.

[6] Samuel Coogan, Ebru Aydin Gol, Murat Arcak, and Calin Belta. Traffic network control from temporal logic specifications. IEEE Transactions on Control of Network Systems, 3(2):162–172, 2015.

[7] Jyotirmoy V. Deshmukh, Alexandre Donzé, Shromona Ghosh, Xiaoqing Jin, Garvit Juniwal, and Sanjit A. Seshia. Robust online monitoring of signal temporal logic. Formal Methods in System Design, 51(1):5–30, 2017.

[8] Georgios E. Fainekos and George J. Pappas. Robustness of temporal logic specifications for continuous-time signals. Theoretical Computer Science, 410(42):4262–4291, 2009.

[9] Keliang He, Morteza Lahijanian, Lydia E. Kavraki, and Moshe Y. Vardi. Reactive synthesis for finite tasks under resource constraints. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5326–5332, 2017.

[10] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

[11] Rodrigo Toro Icarte, Toryn Klassen, Richard Valenzano, and Sheila McIlraith. Using reward machines for high-level task specification and decomposition in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pages 2112–2121, 2018.

[12] Hadas Kress-Gazit, Georgios E. Fainekos, and George J. Pappas. Temporal-logic-based reactive mission and motion planning. IEEE Transactions on Robotics, 25(6):1370–1381, 2009.

[13] Xiao Li, Cristian-Ioan Vasile, and Calin Belta. Reinforcement learning with temporal logic rewards. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3834–3839, 2017.

[14] Oded Maler and Dejan Ničković. Monitoring properties of analog and mixed-signal circuits. International Journal on Software Tools for Technology Transfer, 15(3):247–268, 2013.

[15] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems, pages 1800–1809, 2018.

[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[17] Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, pages 278–287, 1999.

[18] Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 663–670, 2000.

[19] Amir Pnueli. The temporal logic of programs. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science, pages 46–57, 1977.


[20] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.

[21] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, pages 387–395, 2014.

[22] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[23] M. Y. Vardi and P. Wolper. Reasoning about infinite computations. Information and Computation, 115(1):1–37, 1994.

[24] Min Wen and Ufuk Topcu. Constrained cross-entropy method for safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 7450–7460, 2018.

[25] Tichakorn Wongpiromsarn, Ufuk Topcu, and Richard M. Murray. Receding horizon temporal logic planning. IEEE Transactions on Automatic Control, 57(11):2817–2830, 2012.

[26] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 1433–1438, 2008.


A Task Monitor Construction Algorithm

We describe our algorithm for constructing a task monitor Mφ for a given specification φ. Our construction algorithm proceeds recursively on the structure of φ. Implicitly, our algorithm maintains the property that every monitor state q has a self-transition (q, true, u, q); here, the update function u is the identity by default, but may be modified as part of the construction.

Notation. We use ⊔ to denote the disjoint union. Given v ∈ R^X, v′ ∈ R^X′, we define v′′ = v ⊕ v′ ∈ R^{X⊔X′} to be their concatenation—i.e.,

v′′(x) = v(x) if x ∈ X,   and   v′′(x) = v′(x) otherwise.

Given v ∈ R^X and Y ⊆ X, we define v ↓Y ∈ R^Y to be the restriction of v to Y. Given v ∈ R^X and Y ⊇ X, we define v′ = extend(v)Y ∈ R^Y to be

v′(y) = v(y) if y ∈ X,   and   v′(y) = 0 otherwise.

We drop the subscript Y when it is clear from context. Finally, given v ∈ R^X and k ∈ R, we define v′ = v[x ↦ k] ∈ V to be

v′(x′) = v(x′) if x′ ≠ x,   and   v′(x′) = k otherwise.

Also, recall that a predicate b ∈ P is defined over states s ∈ S; it can straightforwardly be extended to a predicate in Σ over (s, v) ∈ S × V by ignoring v. Note that every predicate b ∈ P is a negation-free Boolean combination of atomic predicates p ∈ P0.

Finally, the definition of Σ depends on the set X of registers in the task monitor. When necessary, we use the notation ΣX to make this dependence explicit. Also, for X ⊆ X′, any σ ∈ ΣX can be interpreted as a predicate in ΣX′ by ignoring the components of X′ not in X.
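Treating a register valuation v ∈ R^X as a dictionary from register names to values (as in the TaskMonitor sketch of Section 3), the notation above corresponds to a few one-line helpers; this is an illustrative encoding, not SPECTRL's.

```python
def concat(v, v_prime):
    """v ⊕ v′: concatenation of valuations over disjoint register sets."""
    out = dict(v_prime)
    out.update(v)          # entries of v take precedence on its own registers X
    return out

def restrict(v, Y):
    """v ↓Y: restriction of v to the registers in Y."""
    return {x: v[x] for x in Y}

def extend(v, Y):
    """extend(v)_Y: extend v to Y ⊇ X, setting new registers to 0."""
    return {y: v.get(y, 0.0) for y in Y}

def substitute(v, x, k):
    """v[x ↦ k]: update register x to value k."""
    out = dict(v)
    out[x] = k
    return out
```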

Objectives. Consider the case φ = achieve b, where b ∈ P. For this specification, our algorithm constructs the following task monitor:

[Diagram: a two-state task monitor with initial state q1 and final state q2; the register is initialized by x ← 0, the transition from q1 to q2 has condition Σ : b and update x ← ⟦b⟧q(s), and the reward at q2 is ρ : x.]

The initial state is marked with an arrow into the state. Final states are double circles. Predicates σ ∈ Σ labeling a transition appear prefixed by “Σ :”. Rewards ρ labeling a state appear prefixed by “ρ :”. Self loops are associated with the true predicate (omitted). Updates u ∈ U are by default the identity function. Intuitively, the state q1 on the left encodes when subtask b is not yet completed, the state q2 on the right encodes when b is completed, and the register x records the degree to which b is satisfied upon completion.
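Using the TaskMonitor and Transition classes sketched in Section 3, this base case can be written down directly; here b_bool and b_quant stand for ⟦b⟧ and ⟦b⟧q, and the state names q1, q2 and register name x are arbitrary labels of ours.

```python
def monitor_achieve(b_bool, b_quant):
    """Task monitor for φ = achieve b (a sketch of the construction above)."""
    transitions = [
        # Self loops with the true guard and identity update.
        Transition("q1", lambda s, v: True, lambda s, v: dict(v), "q1"),
        Transition("q2", lambda s, v: True, lambda s, v: dict(v), "q2"),
        # Completing the subtask: guard is b, register x records its degree.
        Transition("q1", lambda s, v: b_bool(s),
                   lambda s, v: {**v, "x": b_quant(s)}, "q2"),
    ]
    return TaskMonitor(
        states=["q1", "q2"],
        registers=["x"],
        transitions=transitions,
        init_state="q1",
        init_regs={"x": 0.0},
        final_states=["q2"],
        reward=lambda s, q, v: v["x"],     # ρ : x at the final state q2
    )
```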

Constraints. Consider the case φ = φ1 ensuring b, where b ∈ P . Let

Mφ1 = (Q1, X1, Σ, U1, ∆1, q01, v01, F1, ρ1).

Then, Mφ is the product of Mφ1 and M ensuring b, defined as

Mφ = (Q1, X1 ⊔ {xb}, Σ, U, ∆, q01, v0, F1, ρ),

where M ensuring b is a single-state monitor whose only state is both initial and final, with one register xb initialized by xb ← ∞, a self loop with update xb ← min(xb, ⟦b⟧q(s)), and reward ρ : xb.


Figure 5: Overview of monitor construction for sequencing and choice operators.

Then, we have (q, σ, u′, q′) ∈ ∆ if and only if there is a transition (q, σ, u, q′) ∈ ∆1 such that

u′(s, v) = extend(u(s, v ↓X1))[xb ↦ min(v(xb), ⟦b⟧q(s))].

Furthermore, the initial register valuation v0 = extend(v01)[xb ↦ ∞] and the reward function ρ is

ρ(s, q, v) = min{ρ1(s, q, v ↓X1), v(xb)}.

Intuitively, xb encodes the minimum degree to which b is satisfied during a rollout.
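The ensuring construction can likewise be sketched as a transformation of an existing TaskMonitor (using the classes from Section 3): add a fresh register xb that tracks the running minimum of ⟦b⟧q, wrap every update, and take the min with xb in the reward. This is a sketch of the construction above, not SPECTRL's code.

```python
def monitor_ensuring(m, b_quant, reg="xb"):
    """Task monitor for φ1 ensuring b, given the monitor m for φ1."""
    def wrap(update):
        # u′(s, v) = extend(u(s, v ↓X1))[xb ↦ min(v(xb), ⟦b⟧q(s))]
        def u_prime(s, v):
            v1 = {x: v[x] for x in m.registers}            # v ↓X1
            out = dict(update(s, v1))
            out[reg] = min(v[reg], b_quant(s))
            return out
        return u_prime

    transitions = [Transition(d.source, d.guard, wrap(d.update), d.target)
                   for d in m.transitions]
    return TaskMonitor(
        states=list(m.states),
        registers=list(m.registers) + [reg],
        transitions=transitions,
        init_state=m.init_state,
        init_regs={**m.init_regs, reg: float("inf")},      # xb ← ∞
        final_states=list(m.final_states),
        reward=lambda s, q, v: min(m.reward(s, q, {x: v[x] for x in m.registers}),
                                   v[reg]),                # min{ρ1, xb}
    )
```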

Sequencing. Third, consider φ = φ1; φ2. Intuitively, Mφ is constructed by concatenating the registers of Mφ1 and Mφ2 (extending the update functions u as needed), and adding transitions (q, σ, u, q0) from each final state q of Mφ1 to the initial state q0 of Mφ2, where σ = true and u is the identity on registers of Mφ1 and sets the registers of Mφ2 to their initial values. A subtle issue is that transitioning from φ1 to φ2 takes one time step, yet it should take zero time steps. Therefore, we add transitions from each final state of Mφ1 to all successors of the initial state of Mφ2.

More precisely, let

Mφ1 = (Q1, X1, Σ, U1, ∆1, q01, v01, F1, ρ1)
Mφ2 = (Q2, X2, Σ, U2, ∆2, q02, v02, F2, ρ2).

Assume without loss of generality that X2 ⊆ X1. Then,

Mφ = (Q1 ⊔ Q2, X1 ⊔ {xR}, Σ, U, ∆, q01, v0, F2, ρ).

Here, ∆ = ∆′1 ∪ ∆′2 ∪ ∆1→2, where (q, σ, u′, q′) ∈ ∆′i if there exists (q, σ, u, q′) ∈ ∆i such that

u′(s, v) = u(s, v ↓Xi) ⊕ v ↓X\Xi,

and (q, σ′ ∧ σR, u′, q′) ∈ ∆1→2 if q ∈ F1 and there exists (q02, σ, u, q′) ∈ ∆2 such that the atomic predicate σR is given by

⟦σR⟧(s, v) = (ρ1(s, q, v ↓X1) > 0),

the predicate σ′ is given by

⟦σ′⟧(s, v) = ⟦σ⟧(s, v02),


and u′ is given by

u′(s, v) = extend(u(s, v02))[xR ↦ ρ1(s, q, v ↓X1)].

The initial register valuation is v0 = extend(v01), and the reward function ρ is

ρ(s, q, v) = min{ρ2(s, q, v ↓X2), v(xR)}

for all q ∈ F2.

Choice. Consider the case φ = φ1 or φ2. Intuitively, Mφ is constructed by combining the initial states of Mφ1 and Mφ2 into a single initial state q0, and concatenating their registers. The transitions from q0 are the union of the transitions from the initial states of Mφ1 and Mφ2. More precisely, let

Mφ1 = (Q1, X1, Σ, U1, ∆1, q01, v01, F1, ρ1)
Mφ2 = (Q2, X2, Σ, U2, ∆2, q02, v02, F2, ρ2).

This construction assumes that there are self loops on the initial states of Mφ1 and Mφ2 . Then,

Mφ = (Q, X1 ⊔ X2, Σ, U, ∆, q0, v01 ⊕ v02, F1 ⊔ F2, ρ).

Here,

Q = (Q1 \ {q01}) ⊔ (Q2 \ {q02}) ⊔ {q0},

and ∆ = ∆′1 ∪ ∆′2 ∪ ∆0, where (q, σ, u′, q′) ∈ ∆′i if q ≠ q0 and there is a transition (q, σ, u, q′) ∈ ∆i such that

u′(s, r, v) = extend(u(s, r, v ↓Xi)).

Also, let (q01, ⊤, u01, q01) ∈ ∆1 and (q02, ⊤, u02, q02) ∈ ∆2 be the self loops on the initial states of Mφ1 and Mφ2, respectively. Let

u0(s, r, v) = u01(s, r, v ↓X1) ⊕ u02(s, r, v ↓X2).

Then, (q0, σ, u′, q) ∈ ∆0 if either (i) (q0, σ, u′, q) = (q0, ⊤, u0, q0), or (ii) there exists i ∈ {1, 2} such that (q0i, σ, u, q) ∈ ∆i, where q ∈ Qi \ {q0i} and

u′(s, r, v) = extend(u(s, r, v ↓Xi)).

The reward function ρ for q ∈ Fi is given by:

ρ(s, q, v) = ρi(s, q, v ↓Xi).

B Proofs of Theorems

B.1 Proof of Theorem 3.1

First, the following lemma follows by structural induction:

Lemma B.1. For σ ∈ Σ, ⟦σ⟧(s, v) = true if and only if ⟦σ⟧q(s, v) > 0.

Next, let GM denote the underlying state transition graph of a task monitor M. Then,

Lemma B.2. The task monitors constructed by our algorithm satisfy the following properties:

1. The only cycles in GM are self loops.

2. The final states are precisely those states from which there are no outgoing edges except for self loops in GM.

3. In GM, every state is reachable from the initial state and for every state there is a final state that is reachable from it.

4. For any pair of states q and q′, there is at most one transition from q to q′.

5. There is a self loop on every state q given by a transition (q, ⊤, u, q) for some update function u, where ⊤ denotes the true predicate.

The first three properties ensure progress when switching from one monitor state to another. The last two properties enable simpler composition of task monitors. The proof follows by structural induction. Theorem 3.1 now follows by structural induction on φ and Lemmas B.1 and B.2.


B.2 Proof of Theorem 3.2

(i) Let ζ̄, ζ̄′ be two augmented rollouts such that R̄(ζ̄) > R̄(ζ̄′). There are three cases to consider:

• Case A. Both ζ̄ and ζ̄′ end in final monitor states: In this case, Rs(ζ̄) = R̄(ζ̄) > R̄(ζ̄′) = Rs(ζ̄′), as required.

• Case B. ζ̄ ends in a final monitor state but ζ̄′ ends in a monitor state q′T ∉ F: In this case,

Rs(ζ̄′) = max_{i′≤j<T} α(s′j, q′T, v′j) + 2Cu · (dq′T − D) + Cℓ
       ≤ max_{i′≤j<T} α(s′j, q′T, v′j) − 2Cu + Cℓ    (dq′T ≤ D − 1)
       ≤ Cℓ    (Cu is an upper bound on α)
       < R̄(ζ̄)    (Cℓ is a lower bound on R̄)
       = Rs(ζ̄).

• Case C. ζ̄ ends in a non-final monitor state: In this case, R̄(ζ̄) = −∞ and hence the claim is vacuously true.

(ii) Let ζ̄, ζ̄′ be two augmented rollouts ending in distinct (non-final) monitor states qT and q′T such that dqT > dq′T. Then,

Rs(ζ̄) = max_{i≤j<T} α(sj, qT, vj) + 2Cu · (dqT − D) + Cℓ
      ≥ max_{i≤j<T} α(sj, qT, vj) + 2Cu + 2Cu · (dq′T − D) + Cℓ    (dqT ≥ dq′T + 1 and Cu ≥ 0)
      ≥ Cu + 2Cu · (dq′T − D) + Cℓ    (Cu is an upper bound on −α)
      ≥ max_{i′≤j<T} α(s′j, q′T, v′j) + 2Cu · (dq′T − D) + Cℓ    (Cu is an upper bound on α)
      = Rs(ζ̄′).
