
European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France.

Configurable Markov Decision Processes

Alberto Maria Metelli∗ [email protected]

Mirco Mutti∗ [email protected]

Marcello Restelli [email protected]

Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy

Abstract

In many real-world problems, there is the possibility to configure, to a limited extent, some environmental parameters to improve the performance of a learning agent. In this paper, we propose a novel framework, Configurable Markov Decision Processes (Conf-MDPs), to model this new type of interaction with the environment. Furthermore, we provide a new learning algorithm, Safe Policy-Model Iteration (SPMI), to jointly and adaptively optimize the policy and the environment configuration. After having introduced our approach, we present the experimental evaluation in two explicative problems to show the benefits of the environment configurability on the performance of the learned policy.

1. Introduction

Markov Decision Processes (MDPs) (Puterman, 2014) are a popular formalism to model sequential decision-making problems. Solving an MDP means finding a policy, i.e., a prescription of actions, that maximizes a given utility function. Typically, the environment dynamics is assumed to be fixed, unknown and out of the control of the agent. However, there exist several real-world scenarios in which the environment is partially controllable and, therefore, it might be beneficial to configure some of its features in order to select the most convenient MDP to solve. For instance, a human car driver has at her/his disposal a number of possible vehicle configurations she/he can act on (e.g., seasonal tires, stability and vehicle attitude, engine model, automatic speed control, parking aid system) to improve the driving style or quicken the process of learning a good driving policy. Another example is the interaction between a student and an automatic teaching system: the teaching model can be tailored to improve the student’s learning experience (e.g., increasing or decreasing the difficulty of the questions or the speed at which the concepts are presented). It is worth noting that the active entity in the configuration process might be the agent itself or an external supervisor guiding the learning process. In a more abstract sense, the possibility to act on the environmental parameters has essentially two benefits: i) it allows improving the agent’s performance; ii) it may allow speeding up the learning process.

In this paper, we propose a framework to model a Configurable Markov Decision Process (Conf-MDP), i.e., an MDP in which the environment can be configured to a limited extent. In principle, any of the Conf-MDP’s parameters can be tuned, but we restrict our attention to the transition model and we focus on the problem of identifying the environment that allows achieving the highest possible performance. At an intuitive level, there exists a tight connection between environment and policy: variations of the environment induce modifications of the optimal policy.

∗. Equal contribution.

© 2018.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.


Furthermore, even for the same task, in the presence of agents having access to different policy spaces, the optimal environment might be different.1

The spirit of this work is to investigate and exercise the tight connection between policy and model, pursuing the goal of improving the final policy performance.

After having introduced the definition of Conf-MDP, we propose a method to jointly and adaptively optimize the policy and the transition model, named Safe Policy-Model Iteration (SPMI). The algorithm adopts a safe learning approach (Pirotta et al., 2013) based on the maximization of a lower bound on the guaranteed performance improvement, yielding a sequence of model-policy pairs with monotonically increasing performance. The safe learning perspective makes our approach suitable for critical applications where performance degradation during learning is not allowed (e.g., industrial scenarios where extensive exploration of the policy space might damage the machinery).

In the standard Reinforcement Learning (RL) framework (Sutton and Barto, 1998), the usage of a lower bound to guide the choice of the policy was first introduced by Conservative Policy Iteration (CPI) (Kakade and Langford, 2002), improved by Safe Policy Iteration (SPI) (Pirotta et al., 2013) and subsequently exploited by Ghavamzadeh et al. (2016); Abbasi-Yadkori et al. (2016); Papini et al. (2017). These methods revealed their potential thanks to their preference for small policy updates, which prevents moving too far away from the current policy in a single step and avoids premature convergence to suboptimal policies. A similar rationale is at the basis of Relative Entropy Policy Search (REPS) (Peters et al., 2010) and, more recently, Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) and Proximal Policy Optimization (PPO) (Schulman et al., 2017). In order to introduce our framework and highlight its benefits, we limit our analysis to the scenario in which the model space (and the policy space) is known. However, when the model space is unknown, we could resort to a sample-based version of SPMI, which could be derived by adapting that of SPI (Pirotta et al., 2013).

We start in Section 2 by recalling some basic notions about MDPs and providing the definition of Conf-MDP. In Section 3 we derive the performance improvement bound and we outline the main features of SPMI (Section 4).2 Then, we present the experimental evaluation (Section 5) in two explicative domains, representing simple abstractions of the main applications of Conf-MDPs, with the purpose of showing how configuring the transition model can be beneficial for the final policy performance. A detailed analysis of the Conf-MDP framework, along with a more extensive theoretical and empirical study of SPMI, can be found in (Metelli et al., 2018).

2. Preliminaries

A discrete-time Markov Decision Process (MDP) (Puterman, 2014) is defined as M = (S, A, P, R, γ, µ), where S is the state space, A is the action space, P(s′|s, a) is a Markovian transition model that defines the conditional distribution of the next state s′ given the current state s and the current action a, γ ∈ (0, 1) is the discount factor, R(s, a) ∈ [0, 1] is the reward for performing action a in state s, and µ is the distribution of the initial state.

1. In general, a modification of the environment (e.g., changing the configuration of a car) is more expensive and more constrained than a modification of the policy.

2. The proofs of all the lemmas and theorems can be found in Appendix B.


A policy π(a|s) defines the probability distribution of an action a given the current state s. We now formalize the Configurable Markov Decision Process (Conf-MDP).

Definition 2.1 A Configurable Markov Decision Process is a tuple CM = (S, A, R, γ, µ, P, Π), where (S, A, R, γ, µ) is an MDP without the transition model, and P and Π are the model and policy spaces.

More specifically, Π is the set of policies the agent has access to and P is the set of available environment configurations (transition models). The performance of a model-policy pair (P, π) ∈ P × Π is evaluated through the expected return, i.e., the expected discounted sum of the rewards collected along a trajectory:

J^{P,π}_µ = 1/(1−γ) ∫_S d^{P,π}_µ(s) ∫_A π(a|s) R(s, a) da ds, (1)

where d^{P,π}_µ is the γ-discounted state distribution (Sutton et al., 2000), defined recursively as d^{P,π}_µ(s) = (1−γ) µ(s) + γ ∫_S d^{P,π}_µ(s′) P^π(s|s′) ds′. We can also define the γ-discounted state-action distribution as δ^{P,π}_µ(s, a) = π(a|s) d^{P,π}_µ(s). While solving an MDP consists in finding a policy π* that maximizes J^{P,π}_µ under the given fixed environment P, solving a Conf-MDP consists in finding a model-policy pair (P*, π*) such that (P*, π*) ∈ arg max_{P∈P, π∈Π} J^{P,π}_µ.

For control purposes, the state-action value function, or Q-function, is introduced as the expected return starting from a state s and performing action a:

Q^{P,π}(s, a) = R(s, a) + γ ∫_S P(s′|s, a) V^{P,π}(s′) ds′. (2)

For learning the transition model we introduce the state-action-next-state value function, or U-function, defined as the expected return starting from state s, performing action a, and landing in state s′:

U^{P,π}(s, a, s′) = R(s, a) + γ V^{P,π}(s′), (3)

where V^{P,π} is the state value function, or V-function. These three functions are tightly connected through the relations V^{P,π}(s) = ∫_A π(a|s) Q^{P,π}(s, a) da and Q^{P,π}(s, a) = ∫_S P(s′|s, a) U^{P,π}(s, a, s′) ds′. Furthermore, we define the policy advantage function A^{P,π}(s, a) = Q^{P,π}(s, a) − V^{P,π}(s), which quantifies how much an action is better than the others, and the model advantage function A^{P,π}(s, a, s′) = U^{P,π}(s, a, s′) − Q^{P,π}(s, a), which quantifies how much a next state is better than the other ones.

In order to evaluate the one-step improvement in performance attained by a new policy π′ or a new model P′ when the current policy is π and the current model is P, we introduce the relative advantage functions (Kakade and Langford, 2002):

A^{P,π′}_{P,π}(s) = ∫_A π′(a|s) A^{P,π}(s, a) da,    A^{P′,π}_{P,π}(s, a) = ∫_S P′(s′|s, a) A^{P,π}(s, a, s′) ds′,

and the corresponding expected values under the γ-discounted distributions:

A^{P,π′}_{P,π,µ} = ∫_S d^{P,π}_µ(s) A^{P,π′}_{P,π}(s) ds,    A^{P′,π}_{P,π,µ} = ∫_S ∫_A δ^{P,π}_µ(s, a) A^{P′,π}_{P,π}(s, a) da ds.
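To make the above definitions concrete, the following is a minimal tabular sketch (not taken from the paper) of how these quantities can be computed when S and A are finite; the array shapes, the helper names (value_functions, discounted_distributions, relative_advantages) and the use of NumPy are illustrative assumptions. The same helpers are reused in the sketches of the next sections.

```python
import numpy as np

def value_functions(P, R, pi, gamma):
    """V, Q and U for a pair (P, pi); P has shape (S, A, S), R and pi have shape (S, A)."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sap->sp', pi, P)            # state kernel P^pi(s'|s)
    r_pi = np.einsum('sa,sa->s', pi, R)              # expected immediate reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum('sap,p->sa', P, V)     # Eq. (2)
    U = R[:, :, None] + gamma * V[None, None, :]     # Eq. (3)
    return V, Q, U

def discounted_distributions(P, pi, mu, gamma):
    """gamma-discounted state and state-action distributions d^{P,pi}_mu and delta^{P,pi}_mu."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sap->sp', pi, P)
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    return d, d[:, None] * pi

def relative_advantages(P, R, pi, gamma, mu, pi_new, P_new):
    """Expected relative advantages A^{P,pi'}_{P,pi,mu} and A^{P',pi}_{P,pi,mu}."""
    V, Q, U = value_functions(P, R, pi, gamma)
    d, delta = discounted_distributions(P, pi, mu, gamma)
    A_pol = Q - V[:, None]                            # policy advantage A^{P,pi}(s, a)
    A_mod = U - Q[:, :, None]                         # model advantage A^{P,pi}(s, a, s')
    adv_pi = d @ np.einsum('sa,sa->s', pi_new, A_pol)               # A^{P,pi'}_{P,pi,mu}
    adv_P = np.sum(delta * np.einsum('sap,sap->sa', P_new, A_mod))  # A^{P',pi}_{P,pi,mu}
    return adv_pi, adv_P
```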


3. Performance Improvement

The goal of this section is to provide a lower bound on the performance improvement obtained by moving from a model-policy pair (P, π) to another pair (P′, π′).3 We start by providing a bound on the difference of the γ-discounted distributions under different model-policy pairs.

Proposition 3.1 Let (P, π) and (P′, π′) be two model-policy pairs; the ℓ1-norm of the difference between the γ-discounted state distributions can be upper bounded as:

∥d^{P′,π′}_µ − d^{P,π}_µ∥_1 ≤ γ/(1−γ) (D^{π′,π}_E + D^{P′,P}_E),

where D^{π′,π}_E = E_{s∼d^{P,π}_µ} ∥π′(·|s) − π(·|s)∥_1 and D^{P′,P}_E = E_{(s,a)∼δ^{P,π}_µ} ∥P′(·|s, a) − P(·|s, a)∥_1.

We now exploit the previous result to obtain a lower bound on the performance improvement as an effect of the policy and model updates.

Theorem 3.1 (Decoupled Bound) The performance improvement of the model-policy pair (P′, π′) over (P, π) can be lower bounded as:

J^{P′,π′}_µ − J^{P,π}_µ ≥ B(P′, π′) = (A^{P′,π}_{P,π,µ} + A^{P,π′}_{P,π,µ}) / (1−γ) − γ ΔQ^{P,π} D / (2(1−γ)²),

where the first term is the advantage, the second term is the dissimilarity penalization, D is a dissimilarity term defined as D = D^{π′,π}_E (D^{π′,π}_∞ + D^{P′,P}_∞) + D^{P′,P}_E (D^{π′,π}_∞ + γ D^{P′,P}_∞), and D^{π′,π}_∞ = sup_{s∈S} ∥π′(·|s) − π(·|s)∥_1, D^{P′,P}_∞ = sup_{s∈S,a∈A} ∥P′(·|s, a) − P(·|s, a)∥_1, ΔQ^{P,π} = sup_{s,s′∈S, a,a′∈A} |Q^{P,π}(s′, a′) − Q^{P,π}(s, a)|.

The bound is composed of two terms, like in (Kakade and Langford, 2002; Pirotta et al., 2013): the first term, the advantage, represents how much gain in performance can be locally obtained by moving from (P, π) to (P′, π′), whereas the second term, the dissimilarity penalization, discourages updates towards model-policy pairs that are too far away.
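For a finite Conf-MDP, all the terms of the decoupled bound can be computed exactly. The sketch below is a hedged illustration, not the authors' implementation; it reuses the helper functions value_functions, discounted_distributions and relative_advantages from the Section 2 sketch and assumes the same array shapes.

```python
import numpy as np

def decoupled_bound(P, R, pi, gamma, mu, pi_new, P_new):
    """Lower bound B(P', pi') of Theorem 3.1 for a finite Conf-MDP (sketch)."""
    _, Q, _ = value_functions(P, R, pi, gamma)        # helpers from the Section 2 sketch
    d, delta = discounted_distributions(P, pi, mu, gamma)
    adv_pi, adv_P = relative_advantages(P, R, pi, gamma, mu, pi_new, P_new)

    # Expected and sup dissimilarities between the new and the current policy / model.
    D_pi_E = d @ np.abs(pi_new - pi).sum(axis=1)
    D_P_E = np.sum(delta * np.abs(P_new - P).sum(axis=2))
    D_pi_inf = np.abs(pi_new - pi).sum(axis=1).max()
    D_P_inf = np.abs(P_new - P).sum(axis=2).max()
    delta_Q = Q.max() - Q.min()

    D = D_pi_E * (D_pi_inf + D_P_inf) + D_P_E * (D_pi_inf + gamma * D_P_inf)
    advantage = (adv_P + adv_pi) / (1 - gamma)
    penalization = gamma * delta_Q * D / (2 * (1 - gamma) ** 2)
    return advantage - penalization
```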

4. Safe Policy Model Iteration

In this section, we present a new safe learning algorithm for the Conf-MDP framework, Safe Policy-Model Iteration (SPMI), inspired by (Pirotta et al., 2013) and capable of learning the policy and the model simultaneously.4

4.1 The Algorithm

Following (Pirotta et al., 2013), we define the policy and model improvement update rules: π′ = α π̄ + (1 − α) π and P′ = β P̄ + (1 − β) P, where α, β ∈ [0, 1], and π̄ ∈ Π and P̄ ∈ P are the target policy and the target model respectively. Extending the rationale of (Pirotta et al., 2013) to our context, we aim to determine the values of α and β which jointly maximize the decoupled bound (Theorem 3.1). In the following we will abbreviate B(P′, π′) with B(α, β).

3. The proofs of the reported propositions along with other related results can be found in Appendix B.
4. In the context of Conf-MDPs we believe that knowing the model of the configurable part of the environment is a reasonable requirement.


Table 1: The four possible optimal (α, β) pairs; the optimal pair is the one yielding the maximum bound value (all values are clipped to [0, 1]).

β* = 0:  α*_0 = (1−γ) A^{P,π̄}_{P,π,µ} / (γ ΔQ^{P,π} D^{π̄,π}_∞ D^{π̄,π}_E)
β* = 1:  α*_1 = α*_0 − 1/2 (D^{P̄,P}_E / D^{π̄,π}_E + D^{P̄,P}_∞ / D^{π̄,π}_∞)
α* = 0:  β*_0 = (1−γ) A^{P̄,π}_{P,π,µ} / (γ² ΔQ^{P,π} D^{P̄,P}_∞ D^{P̄,P}_E)
α* = 1:  β*_1 = β*_0 − 1/(2γ) (D^{π̄,π}_E / D^{P̄,P}_E + D^{π̄,π}_∞ / D^{P̄,P}_∞)

Theorem 4.1 For any π̄ ∈ Π and P̄ ∈ P, the decoupled bound is optimized for (α*, β*) = arg max_{α,β} {B(α, β) : (α, β) ∈ V}, where V = {(α*_0, 0), (α*_1, 1), (0, β*_0), (1, β*_1)} and the values of α*_0, α*_1, β*_0 and β*_1 are reported in Table 1.

The theorem expresses the fact that the optimal (α, β) pair lies on the boundary of [0, 1] × [0, 1], i.e., either one between policy and model is moved while the other is kept unchanged, or one is moved and the other is set to the target. Algorithm 1 reports the basic structure of SPMI. The procedures PolicyChooser and ModelChooser are in charge of selecting the target policy and model (see Section 4.3).
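As an illustration of Theorem 4.1, the following sketch evaluates the four candidate pairs of Table 1 and returns the maximizer of B(α, β). The scalar inputs (advantages, dissimilarities, ΔQ) are assumed to be precomputed, e.g., with the sketches above; the targets are assumed to differ from the current policy and model, so that no dissimilarity term is zero. The function name and signature are illustrative.

```python
import numpy as np

def spmi_step_sizes(adv_pi, adv_P, D_pi_E, D_pi_inf, D_P_E, D_P_inf, delta_Q, gamma):
    """Pick the optimal (alpha, beta) among the four candidates of Table 1 (Theorem 4.1).

    adv_pi, adv_P: expected relative advantages of the target policy / target model;
    D_*: expected (E) and sup (inf) dissimilarities between current and target.
    The targets are assumed to differ from the current pair, so no D_* term is zero."""

    def bound(alpha, beta):
        # B(alpha, beta) for pi' = alpha*pi_bar + (1-alpha)*pi, P' = beta*P_bar + (1-beta)*P.
        adv = (alpha * adv_pi + beta * adv_P) / (1 - gamma)
        diss = (alpha**2 * D_pi_E * D_pi_inf
                + alpha * beta * (D_pi_E * D_P_inf + D_pi_inf * D_P_E)
                + gamma * beta**2 * D_P_E * D_P_inf)
        return adv - gamma * delta_Q * diss / (2 * (1 - gamma)**2)

    a0 = (1 - gamma) * adv_pi / (gamma * delta_Q * D_pi_inf * D_pi_E)
    a1 = a0 - 0.5 * (D_P_E / D_pi_E + D_P_inf / D_pi_inf)
    b0 = (1 - gamma) * adv_P / (gamma**2 * delta_Q * D_P_inf * D_P_E)
    b1 = b0 - (D_pi_E / D_P_E + D_pi_inf / D_P_inf) / (2 * gamma)
    clip = lambda x: float(np.clip(x, 0.0, 1.0))
    candidates = [(clip(a0), 0.0), (clip(a1), 1.0), (0.0, clip(b0)), (1.0, clip(b1))]
    return max(candidates, key=lambda ab: bound(*ab))
```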

4.2 Policy and Model Spaces

Algorithm 1 Safe Policy Model Iteration

initialize π_0, P_0
for i = 0, 1, 2, . . . until ε-convergence do
    π̄_i = PolicyChooser(π_i)
    P̄_i = ModelChooser(P_i)
    V_i = {(α*_{0,i}, 0), (α*_{1,i}, 1), (0, β*_{0,i}), (1, β*_{1,i})}
    α*_i, β*_i = arg max_{α,β} {B(α, β) : (α, β) ∈ V_i}
    π_{i+1} = α*_i π̄_i + (1 − α*_i) π_i
    P_{i+1} = β*_i P̄_i + (1 − β*_i) P_i
end for
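A possible rendering of the outer loop of Algorithm 1 in code is sketched below; the chooser and update callables, as well as the ε-convergence test on the bound value, are assumptions for illustration rather than the paper's exact implementation.

```python
def spmi(P0, pi0, policy_chooser, model_chooser, update, eps=1e-4, max_iter=100_000):
    """Sketch of Algorithm 1. policy_chooser / model_chooser return the target policy
    and target model; update returns (alpha*, beta*, bound_value), i.e., the maximizer
    of B(alpha, beta) over the four candidates of Table 1 and the corresponding value."""
    P, pi = P0, pi0
    for _ in range(max_iter):
        pi_bar = policy_chooser(P, pi)
        P_bar = model_chooser(P, pi)
        alpha, beta, bound_value = update(P, pi, P_bar, pi_bar)
        pi = alpha * pi_bar + (1 - alpha) * pi     # convex policy update
        P = beta * P_bar + (1 - beta) * P          # convex model update
        if bound_value <= eps:                     # stop when no guaranteed improvement is left
            break
    return P, pi
```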

The selection of the target policy and model is a rather crucial component of the algorithm, since the quality of the updates largely depends on it. To adopt an effective target selection strategy we have to know which degrees of freedom the policy and model spaces offer. Focusing on the model space first, it is easy to discriminate two macro-classes in real-world scenarios. In some cases, there are almost no constraints on the direction in which to update the model. In other cases, only a limited portion of the model, typically a set of parameters inducing the transition probabilities, can be accessed. While we can naturally model the first scenario as an unconstrained model space, to represent the second scenario we limit the model space to the convex hull co(P), where P is a set of extreme (or vertex) models. Since only the convex combination coefficients can be controlled, we refer to the latter as a parametric model space. It is noteworthy that the same dichotomy can be symmetrically extended to the policy space, although the need to limit the agent in the direction of policy updates is less significant in our perspective.
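A parametric model space of this kind can be represented explicitly as a convex combination of vertex models; the following short sketch (with assumed array shapes) is one way to instantiate a model of the convex hull.

```python
import numpy as np

def convex_model(vertex_models, omega):
    """P(omega) = sum_i omega_i * P_i, a point of the convex hull of the vertex models.

    vertex_models: array of shape (k, S, A, S); omega: convex coefficients of shape (k,)."""
    omega = np.asarray(omega, dtype=float)
    assert np.all(omega >= 0) and np.isclose(omega.sum(), 1.0), "omega must lie in the simplex"
    return np.einsum('k,ksap->sap', omega, vertex_models)
```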

4.3 Target Choice

To deal with unconstrained spaces, it is quite natural to adopt the target selection strategy presented in (Pirotta et al., 2013), introducing the concept of greedy model as P+(·|s, a) ∈ arg max_{s′∈S} U^{P,π}(s, a, s′), i.e., the model that maximizes the relative advantage in each state-action pair. At each step, the greedy policy and the greedy model w.r.t. Q^{P,π} and U^{P,π} are selected as targets.


[Figure 1: Expected return for the Student-Teacher domain for different update strategies and target selections. Three panels (a)–(c) report expected return and policy dissimilarity over iterations for SPI, SMI, SPMI, SPI+SMI, SMI+SPI, SPMI-greedy, SPMI-sup and SPMI-alt.]

When we are not free to choose the greedy model, as in the parametric setting, we select the vertex model that maximizes the expected relative advantage (greedy choice). The greedy strategy is based on local information and is not guaranteed to provide a model-policy pair maximizing the bound. However, testing all the model-policy pairs is highly inefficient in the presence of large model-policy spaces. A reasonable compromise is to select, as a target, the model that yields the maximum bound value between the greedy target P̄_i ∈ arg max_{P∈P} A^{P,π}_{P_i,π,µ} and the previous target P̄_{i−1} (the same procedure can be employed for the policy). This procedure, named persistent choice, effectively avoids the oscillating behavior that is common with the greedy choice.

5. Experimental Evaluation

The goal of this section is to show the benefits of configuring the environment while the policy learning goes on. The experiments are conducted on two explicative domains: the Student-Teacher domain (unconstrained model space) and the Racetrack Simulator (parametric model space). We compare different target choices (greedy and persistent, see Section 4.3) and different update strategies. Specifically, SPMI, which adaptively updates policy and model, is compared with some alternative model learning approaches: SPMI-alt(ernated), in which model and policy updates are forced to alternate; SPMI-sup, which uses a looser bound obtained from Theorem 3.1 by replacing D^{⋆′,⋆}_E with D^{⋆′,⋆}_∞; SPI+SMI,5 which optimizes policy and model in sequence; and SMI+SPI, which does the opposite.

5.1 Student-Teacher domain

The Student-Teacher domain is a simple model of concept learning, inspired by (Rafferty et al., 2011). A student (agent) learns to perform consistent assignments to literals as a result of the statements (e.g., “A+C=3”) provided by an automatic teacher (environment, e.g., an online platform). The student has a limited policy space, as she/he can only change the values of the literals by a finite quantity, but it is possible to configure the difficulty of the teacher’s statements by selecting the number of literals in each statement (see Appendix D.1).

We consider the illustrative example in which there are two binary literals and the student can change only one literal at a time. In Figure 1, we show the behavior of the different update strategies starting from a uniform initialization.

5. SMI (Safe Model Iteration) is SPMI without policy updates.


[Figure 2: Expected return and coefficient of the high-speed stability vertex model for different update strategies in track T1, and expected return for track T2. The three panels (a)–(c) compare SPMI, SPMI-sup, SPMI-alt, SPMI-greedy, SMI+SPI and SPI+SMI over iterations.]

We first notice (Figure 1 (a)) that learning both policy and model is convenient, since the performance of SPMI at convergence is higher than that of SPI (only the policy is learned) and SMI (only the model is learned). Moreover, we observe that the greedy target selection, while being the best local choice maximizing the advantage, might result in an unstable behavior that slows down the convergence of the algorithm, whereas the persistent target selection enforces more stability (Figure 1 (b)). Finally, we can see that the joint and adaptive strategy of SPMI outperforms SPMI-sup and SPMI-alt (Figure 1 (c)). Notably, both SPMI and SPMI-sup perform the policy updates and the model updates in sequence, since it is more convenient for the student to learn an almost optimal policy with no intervention on the teacher and then refine the teacher model to gain further reward. For this reason, the alternated model-policy update (SPMI-alt) results in poorer performance in an early stage.

5.2 Racetrack simulator

The Racetrack simulator is a simple abstraction of a car driving problem. The driver (agent) has to optimize a driving policy to run the vehicle on the track, reaching the finish line as fast as possible. During the process, the agent can configure two vehicle settings to improve her/his driving performance: the vehicle stability and the engine boost (see Appendix D.2).

In Figure 2 (a), we highlight the effectiveness of SPMI w.r.t. all the alternatives on track T1, in which only the vehicle stability can be modified. Notably, comparing SPMI with the sequential approaches, SPI+SMI and SMI+SPI, we can easily deduce that it is not valuable to configure the vehicle stability, i.e., to update the model, while the driving policy is still really rough. In Figure 2 (b), we show the trend of the model coefficient related to high-speed stability. While the optimal configuration results in a mixed model for vehicle stability, SPMI exploits the maximal high-speed stability to efficiently learn the driving policy in an early stage; SPI+SMI and SPMI-greedy, instead, execute a sequence of policy updates and then directly lead the model to the optimal configuration. Figure 2 (c) shows how the previous considerations generalize to an example on a morphologically different track (T2), in which also the engine boost can be configured. The learning process is characterized by a long exploration phase, both in the model and in the policy space, in which the driver cannot lead the vehicle to the finish line to collect any reward. Then, we observe a fast growth in expected return once the agent has acquired enough information to reach the finish line.


6. Discussion and Conclusions

In this paper, we proposed a novel framework (Conf-MDP) to model a set of real-world scenarios in which the environment dynamics can be partially modified, and a safe learning approach (SPMI) to jointly and adaptively learn the model and the policy. Conf-MDPs allow modelling many relevant sequential decision-making problems that, we believe, cannot be effectively addressed using traditional frameworks.

Why not a unique agent? Representing the environment configurability in the agent model when the environment is under the control of a supervisor is certainly inappropriate. Even when the environment configuration is carried out by the agent, this approach would require the inclusion of “configuration actions” in the action space to allow the agent to configure the environment directly as a part of the policy optimization. However, in our framework, the environment configuration is performed just once at the beginning of the episode. Moreover, with configuration actions the agent is not really learning a probability distribution over actions, i.e., a policy, but a probability distribution over state-state couples, i.e., a state kernel. This formulation prevents distinguishing, during the process, the effects of the policy from those of the model, making it difficult to finely constrain the configurations, to limit the feasible model space, and to recover, a posteriori, the optimal model-policy pair.

Why not a multi-agent system? When there is no supervisor, the agent is the only learning entity and the environment is completely passive. In the presence of a supervisor, it would be misleading to adopt a cooperative multi-agent approach. The supervisor acts externally, at a different level, and could possibly be totally transparent to the learning agent. Indeed, the supervisor does not operate inside the environment but is in charge of selecting the most suitable configuration, whereas the agent needs to learn the optimal policy for the given environment.

The second contribution of this paper is the formulation of a safe approach, suitable to manage critical tasks, to solve a learning problem in the context of the newly introduced Conf-MDP framework. To this purpose, we proposed a novel tight lower bound on the performance improvement and an algorithm (SPMI) optimizing this bound to learn the policy and the model configuration simultaneously. We then presented an empirical study to show the effectiveness of SPMI in our context and to uphold the introduction of the Conf-MDP framework.

We believe that this work is only a first step in solving these kinds of problems: many future research directions are open. Clearly, a first extension could tackle the problem from a sample-based perspective, removing the requirement of knowing the full model space. Furthermore, we could consider different learning approaches, like policy search methods, suitable for continuous state-action spaces.

References

Yasin Abbasi-Yadkori, Peter L Bartlett, and Stephen J Wright. A fast and reliable policy improvement algorithm. In Artificial Intelligence and Statistics, pages 1338–1346, 2016.


Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of ICML’17, pages 22–31, 2017.

Bruce Lee Bowerman. Nonstationary Markov decision processes and related topics in nonstationary Markov chains. PhD thesis, Iowa State University, 1974.

Thiago P Bueno, Denis D Maua, Leliane N Barros, and Fabio G Cozman. Modeling Markov decision processes with imprecise probabilities using probabilistic logic programming. In Proceedings of the Tenth International Symposium on Imprecise Probability: Theories and Applications, pages 49–60, 2017.

Torpong Cheevaprawatdomrong, Irwin E Schochetman, Robert L Smith, and Alfredo Garcia. Solution and forecast horizons for infinite-horizon nonhomogeneous Markov decision processes. Mathematics of Operations Research, 32(1):51–72, 2007.

Kamil Andrzej Ciosek and Shimon Whiteson. OFFER: Off-environment reinforcement learning. In AAAI, pages 1819–1825, 2017.

Karina Valdivia Delgado, Leliane Nunes de Barros, Fabio Gagliardi Cozman, and Ricardo Shirota. Representing and solving factored Markov decision processes with imprecise probabilities. Proceedings ISIPTA, Durham, United Kingdom, pages 169–178, 2009.

Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In Conference on Robot Learning, pages 482–495, 2017.

Alfredo Garcia and Robert L Smith. Solving nonstationary infinite horizon dynamic optimization problems. Journal of Mathematical Analysis and Applications, 244(2):304–317, 2000.

Archis Ghate and Robert L Smith. A linear programming approach to nonstationary infinite-horizon Markov decision processes. Operations Research, 61(2):413–425, 2013.

Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pages 2298–2306, 2016.

Robert Givan, Sonia Leach, and Thomas Dean. Bounded parameter Markov decision processes. In Sam Steel and Rachid Alami, editors, Recent Advances in AI Planning, pages 234–246. Springer Berlin Heidelberg, 1997.

Moshe Haviv and Ludo Van der Heyden. Perturbation bounds for the stationary probabilities of a finite Markov chain. Advances in Applied Probability, 16(4):804–818, 1984.

Wallace J Hopp, James C Bean, and Robert L Smith. A new optimality criterion for nonhomogeneous Markov decision processes. Operations Research, 35(6):875–883, 1987.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, volume 2, pages 267–274, 2002.


Alberto Maria Metelli, Mirco Mutti, and Marcello Restelli. Configurable Markov decision processes. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3491–3500, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/metelli18a.html.

Yaodong Ni and Zhi-Qiang Liu. Bounded-parameter partially observable Markov decision processes. In Proceedings of the Eighteenth International Conference on Automated Planning and Scheduling, pages 240–247. AAAI Press, 2008.

Takayuki Osogami. Robust partially observable Markov decision process. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of ICML’15, pages 106–115, 2015.

Matteo Papini, Matteo Pirotta, and Marcello Restelli. Adaptive batch size for safe policy gradients. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3594–3603. Curran Associates, Inc., 2017.

Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI, pages 1607–1612. Atlanta, 2010.

Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe policy iteration. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of ICML’13, pages 307–315, 2013.

Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.

Anna N Rafferty, Emma Brunskill, Thomas L Griffiths, and Patrick Shafto. Faster teaching by POMDP planning. In AIED, pages 280–287. Springer, 2011.

Jay K Satia and Roy E Lave Jr. Markovian decision processes with uncertain transition probabilities. Operations Research, 21(3):728–740, 1973.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML’15, pages 1889–1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

Chelsea C White III and Hany K Eldeib. Markov decision processes with imprecise transition probabilities. Operations Research, 42(4):739–749, 1994.


Appendix A. Related Works

Although in the standard framework the environment is assumed to be fixed, several exceptions can be found in the literature, especially in the context of Markov Decision Processes with imprecise probabilities (MDPIPs) (Satia and Lave Jr, 1973; White III and Eldeib, 1994; Bueno et al., 2017) and non-stationary environments (Bowerman, 1974; Hopp et al., 1987). In the former case, the transition kernel is known only under uncertainty. Therefore, it cannot be specified using a conditional probability distribution, but it must be defined through a set of probability distributions (Delgado et al., 2009). In this context, Bounded-parameter Markov Decision Processes (BMDPs) consider a special case in which upper and lower bounds on the transition probabilities are specified (Givan et al., 1997; Ni and Liu, 2008). A common approach is to solve a minimax problem to find a robust policy maximizing the expected return under the worst possible transition model (Osogami, 2015). In non-stationary environments, the transition probabilities (and possibly also the reward function) change over time (Bowerman, 1974). Several works tackle the problem of defining an optimality criterion (Hopp et al., 1987) and finding optimal policies in non-stationary environments (Garcia and Smith, 2000; Cheevaprawatdomrong et al., 2007; Ghate and Smith, 2013). Although the environment is no longer fixed, neither Markov decision processes with imprecise probabilities nor non-stationary Markov decision processes admit the possibility to dynamically alter the environmental parameters. The environment can also be modified to favor the learning process. This instance has been previously addressed in (Ciosek and Whiteson, 2017; Florensa et al., 2017), where the transition model and the initial state distribution are altered in order to reach a faster convergence to the optimal policy. However, in both cases the environment modification is only simulated, while the underlying environment dynamics remains unchanged.

Appendix B. Proofs and Derivations

In this appendix, we provide the proofs of the theorems and lemmas presented in the paper; we also provide some comments on the intermediate results. Given a model-policy pair (P, π), we indicate with P^π the state kernel function defined as:

P^π(s′|s) = ∫_A π(a|s) P(s′|s, a) da. (4)

Proposition B.1 Let (P, π) and (P′, π′) be two model-policy pairs; the ℓ1-norm of the difference between the γ-discounted state distributions can be upper bounded as:

∥d^{P′,π′}_µ − d^{P,π}_µ∥_1 ≤ γ/(1−γ) D^{P′^{π′},P^π}_E,

where D^{P′^{π′},P^π}_E = E_{s∼d^{P,π}_µ} ∥P′^{π′}(·|s) − P^π(·|s)∥_1.

Proof Exploiting the recursive equation of the γ-discounted state distribution, we can write the difference between the distributions as follows:

d^{P′,π′}_µ(s) − d^{P,π}_µ(s)
  = (1−γ)µ(s) + γ ∫_S d^{P′,π′}_µ(s′) P′^{π′}(s|s′) ds′ − (1−γ)µ(s) − γ ∫_S d^{P,π}_µ(s′) P^π(s|s′) ds′
  = γ ∫_S d^{P′,π′}_µ(s′) P′^{π′}(s|s′) ds′ − γ ∫_S d^{P,π}_µ(s′) P^π(s|s′) ds′ ± γ ∫_S d^{P,π}_µ(s′) P′^{π′}(s|s′) ds′
  = γ ∫_S (d^{P′,π′}_µ(s′) − d^{P,π}_µ(s′)) P′^{π′}(s|s′) ds′ + γ ∫_S d^{P,π}_µ(s′) (P′^{π′}(s|s′) − P^π(s|s′)) ds′. (5)

Applying the ℓ1-norm to equation (5) we can state the following:

∥d^{P′,π′}_µ − d^{P,π}_µ∥_1
  ≤ γ ∫_S |∫_S (d^{P′,π′}_µ(s′) − d^{P,π}_µ(s′)) P′^{π′}(s|s′) ds′| ds + γ ∫_S |∫_S d^{P,π}_µ(s′) (P′^{π′}(s|s′) − P^π(s|s′)) ds′| ds (6)
  ≤ γ ∫_S |d^{P′,π′}_µ(s′) − d^{P,π}_µ(s′)| ∫_S P′^{π′}(s|s′) ds ds′ + γ ∫_S d^{P,π}_µ(s′) ∫_S |P′^{π′}(s|s′) − P^π(s|s′)| ds ds′ (7)
  ≤ γ ∥d^{P′,π′}_µ − d^{P,π}_µ∥_1 + γ E_{s∼d^{P,π}_µ} ∥P′^{π′}(·|s) − P^π(·|s)∥_1. (8)

Rearranging (8) and dividing by 1 − γ yields

∥d^{P′,π′}_µ − d^{P,π}_µ∥_1 ≤ γ/(1−γ) E_{s∼d^{P,π}_µ} ∥P′^{π′}(·|s) − P^π(·|s)∥_1 = γ/(1−γ) D^{P′^{π′},P^π}_E.

In (6) we exploited the subadditivity of the norm, ∥x + y∥ ≤ ∥x∥ + ∥y∥, and (8) derives from (7) by observing that ∫_S P′^{π′}(s|s′) ds = 1.

Proposition 3.1 Let (P, π) and (P′, π′) be two model-policy pairs; the ℓ1-norm of the difference between the γ-discounted state distributions can be upper bounded as:

∥d^{P′,π′}_µ − d^{P,π}_µ∥_1 ≤ γ/(1−γ) (D^{π′,π}_E + D^{P′,P}_E),

where D^{π′,π}_E = E_{s∼d^{P,π}_µ} ∥π′(·|s) − π(·|s)∥_1 and D^{P′,P}_E = E_{(s,a)∼δ^{P,π}_µ} ∥P′(·|s, a) − P(·|s, a)∥_1.

Proof We prove this proposition from the preceding result by decomposing the expression E_{s∼d^{P,π}_µ} ∥P′^{π′}(·|s) − P^π(·|s)∥_1 as follows:

∥P′^{π′}(·|s) − P^π(·|s)∥_1 = ∫_S |P′^{π′}(s′|s) − P^π(s′|s)| ds′
  = ∫_S |∫_A (P′(s′|s, a)π′(a|s) − P(s′|s, a)π(a|s)) da| ds′
  = ∫_S |∫_A (P′(s′|s, a)π′(a|s) − P(s′|s, a)π(a|s) ± P′(s′|s, a)π(a|s)) da| ds′
  = ∫_S |∫_A ((π′(a|s) − π(a|s)) P′(s′|s, a) + π(a|s)(P′(s′|s, a) − P(s′|s, a))) da| ds′
  ≤ ∫_S ∫_A |π′(a|s) − π(a|s)| P′(s′|s, a) da ds′ + ∫_S ∫_A π(a|s) |P′(s′|s, a) − P(s′|s, a)| da ds′.

We now take the expectation w.r.t. d^{P,π}_µ and exploit the monotonicity property:

E_{s∼d^{P,π}_µ} ∥P′^{π′}(·|s) − P^π(·|s)∥_1
  ≤ ∫_S d^{P,π}_µ(s) ∫_S ∫_A |π′(a|s) − π(a|s)| P′(s′|s, a) da ds′ ds + ∫_S d^{P,π}_µ(s) ∫_S ∫_A π(a|s) |P′(s′|s, a) − P(s′|s, a)| da ds′ ds (9)
  ≤ ∫_S d^{P,π}_µ(s) ∫_A |π′(a|s) − π(a|s)| da ds + ∫_S ∫_A δ^{P,π}_µ(s, a) ∫_S |P′(s′|s, a) − P(s′|s, a)| ds′ da ds (10)
  = E_{s∼d^{P,π}_µ} ∥π′(·|s) − π(·|s)∥_1 + E_{(s,a)∼δ^{P,π}_µ} ∥P′(·|s, a) − P(·|s, a)∥_1
  = D^{π′,π}_E + D^{P′,P}_E,

where (10) follows from (9) by observing that ∫_S P′(s′|s, a) ds′ = 1 and that d^{P,π}_µ(s)π(a|s) = δ^{P,π}_µ(s, a). The claim then follows from Proposition B.1.

It is worth noting that when P = P′ the bound resembles Corollary 3.2 in (Pirotta et al., 2013), but it is tighter, as:

E_{s∼d^{P,π}_µ} ∥π′(·|s) − π(·|s)∥_1 ≤ sup_{s∈S} ∥π′(·|s) − π(·|s)∥_1;

in particular, the bound of (Pirotta et al., 2013) might yield a large value whenever there exist states in which the policies are very different, even if those states are rarely visited according to d^{P,π}_µ. In the context of policy learning, a lower bound employing the same dissimilarity index D^{π′,π}_E in the penalization term has been previously proposed in (Achiam et al., 2017).


We introduce the coupled relative advantage function:

A^{P′,π′}_{P,π}(s) = ∫_S ∫_A π′(a|s) P′(s′|s, a) A^{P,π}(s, a, s′) da ds′, (11)

where A^{P,π}(s, a, s′) = U^{P,π}(s, a, s′) − V^{P,π}(s). A^{P′,π′}_{P,π} represents the one-step improvement attained by the new model-policy pair (P′, π′) over the current one (P, π), i.e., the local gain in performance yielded by selecting an action with π′ and the next state with P′. The corresponding expectation under the γ-discounted distribution is given by A^{P′,π′}_{P,π,µ} = ∫_S d^{P,π}_µ(s) A^{P′,π′}_{P,π}(s) ds. Now we have all the elements to express the performance improvement in terms of the relative advantage functions and the γ-discounted distributions.

Theorem B.1 The performance improvement of the model-policy pair (P′, π′) over (P, π) is given by:

J^{P′,π′}_µ − J^{P,π}_µ = 1/(1−γ) ∫_S d^{P′,π′}_µ(s) A^{P′,π′}_{P,π}(s) ds.

This theorem is the natural extension of the result proposed by Kakade and Langford (2002), but, unfortunately, it cannot be directly exploited in an algorithm, as the dependence of d^{P′,π′}_µ on the candidate model-policy pair (P′, π′) is highly nonlinear and difficult to treat. We aim to obtain, from this result, a lower bound on J^{P′,π′}_µ − J^{P,π}_µ that can be efficiently computed using information on the current pair (P, π).

Proof Let us start from the definition of J^{P′,π′}_µ:

(1−γ) J^{P′,π′}_µ = ∫_S ∫_A d^{P′,π′}_µ(s) π′(a|s) R(s, a) da ds
  = ∫_S d^{P′,π′}_µ(s) ∫_A π′(a|s) R(s, a) da ds ± ∫_S d^{P′,π′}_µ(s) V^{P,π}(s) ds (12)
  = ∫_S d^{P′,π′}_µ(s) ∫_A π′(a|s) R(s, a) da ds
    + ∫_S ((1−γ)µ(s′) + γ ∫_S ∫_A d^{P′,π′}_µ(s) π′(a|s) P′(s′|s, a) da ds) V^{P,π}(s′) ds′
    − ∫_S d^{P′,π′}_µ(s) V^{P,π}(s) ds (13)
  = ∫_S d^{P′,π′}_µ(s) (∫_A π′(a|s) ∫_S P′(s′|s, a) (R(s, a) + γ V^{P,π}(s′)) ds′ da − V^{P,π}(s)) ds + (1−γ) ∫_S µ(s′) V^{P,π}(s′) ds′ (14)
  = ∫_S d^{P′,π′}_µ(s) A^{P′,π′}_{P,π}(s) ds + (1−γ) J^{P,π}_µ, (15)

where we have exploited the recursive formulation of d^{P′,π′}_µ to rewrite (12) into (13), and (15) follows from (14) by observing that ∫_S µ(s′) V^{P,π}(s′) ds′ = J^{P,π}_µ and using the definition U^{P,π}(s, a, s′) = R(s, a) + γ V^{P,π}(s′).


Theorem B.2 (Coupled Bound) The performance improvement of the model-policy pair (P′, π′) over (P, π) can be lower bounded as:

J^{P′,π′}_µ − J^{P,π}_µ ≥ A^{P′,π′}_{P,π,µ} / (1−γ) − γ ΔA^{P′,π′}_{P,π} D^{P′^{π′},P^π}_E / (2(1−γ)²),

where the first term is the advantage, the second term is the dissimilarity penalization, and ΔA^{P′,π′}_{P,π} = sup_{s,s′∈S} |A^{P′,π′}_{P,π}(s′) − A^{P′,π′}_{P,π}(s)|.

Proof Exploiting the bound on the difference between the γ-discounted state distributions (Proposition B.1), we can easily attain the performance improvement bound:

J^{P′,π′}_µ − J^{P,π}_µ = 1/(1−γ) ∫_S d^{P′,π′}_µ(s) A^{P′,π′}_{P,π}(s) ds
  = 1/(1−γ) ∫_S d^{P,π}_µ(s) A^{P′,π′}_{P,π}(s) ds + 1/(1−γ) ∫_S (d^{P′,π′}_µ(s) − d^{P,π}_µ(s)) A^{P′,π′}_{P,π}(s) ds (16)
  ≥ A^{P′,π′}_{P,π,µ} / (1−γ) − 1/(1−γ) |∫_S (d^{P′,π′}_µ(s) − d^{P,π}_µ(s)) A^{P′,π′}_{P,π}(s) ds| (17)
  ≥ A^{P′,π′}_{P,π,µ} / (1−γ) − 1/(1−γ) ∥d^{P′,π′}_µ − d^{P,π}_µ∥_1 ΔA^{P′,π′}_{P,π} / 2 (18)
  ≥ A^{P′,π′}_{P,π,µ} / (1−γ) − γ/(1−γ)² E_{s∼d^{P,π}_µ} ∥P′^{π′}(·|s) − P^π(·|s)∥_1 ΔA^{P′,π′}_{P,π} / 2, (19)

where (17) follows from (16) by observing that b ≥ −|b|, (18) follows from (17) by applying Corollary 2.4 of (Haviv and Van der Heyden, 1984), and (19) is obtained by using Proposition B.1.

The coupled bound, however, is not suitable to be used in an algorithm, as it does not separate the contribution of the policy and that of the model. In practice, an agent cannot directly update the kernel function P^π, since the environment model can only partially be controlled, whereas, in many cases, we can assume full control over the policy. For this reason, it is convenient to derive a bound in which the policy and model effects are decoupled.

Lemma 1 The following equality relates the coupled relative advantage function and the relative advantage functions:

A^{P′,π′}_{P,π}(s) = A^{P,π′}_{P,π}(s) + ∫_A π′(a|s) A^{P′,π}_{P,π}(s, a) da.

Proof

A^{P′,π′}_{P,π}(s) = ∫_A ∫_S π′(a|s) P′(s′|s, a) U^{P,π}(s, a, s′) ds′ da − V^{P,π}(s)
  = ∫_A ∫_S π′(a|s) P′(s′|s, a) U^{P,π}(s, a, s′) ds′ da − V^{P,π}(s) ± ∫_A ∫_S π′(a|s) P(s′|s, a) U^{P,π}(s, a, s′) ds′ da
  = ∫_A ∫_S π′(a|s) P(s′|s, a) U^{P,π}(s, a, s′) ds′ da − V^{P,π}(s) + ∫_A ∫_S π′(a|s) (P′(s′|s, a) − P(s′|s, a)) U^{P,π}(s, a, s′) ds′ da
  = ∫_A π′(a|s) Q^{P,π}(s, a) da − V^{P,π}(s) + ∫_A π′(a|s) ∫_S (P′(s′|s, a) − P(s′|s, a)) U^{P,π}(s, a, s′) ds′ da (20)
  = A^{P,π′}_{P,π}(s) + ∫_A π′(a|s) A^{P′,π}_{P,π}(s, a) da, (21)

where, to get (21), we applied to the first addendum of (20) the definition of A^{P,π′}_{P,π}(s), observing that A^{P,π′}_{P,π}(s) = ∫_A π′(a|s) A^{P,π}(s, a) da = ∫_A π′(a|s) (Q^{P,π}(s, a) − V^{P,π}(s)) da, and similarly to the second addendum of (20) the fact that:

A^{P′,π}_{P,π}(s, a) = ∫_S P′(s′|s, a) A^{P,π}(s, a, s′) ds′ = ∫_S P′(s′|s, a) (U^{P,π}(s, a, s′) − Q^{P,π}(s, a)) ds′.

Lemma 2 The following bound relates the coupled relative advantage function and the relative advantage functions:

|A^{P′,π′}_{P,π,µ} − (A^{P′,π}_{P,π,µ} + A^{P,π′}_{P,π,µ})| ≤ γ D^{π′,π}_E D^{P′,P}_∞ ΔQ^{P,π} / 2,

where D^{P′,P}_∞ = sup_{s∈S,a∈A} ∥P′(·|s, a) − P(·|s, a)∥_1 and ΔQ^{P,π} = sup_{s,s′∈S, a,a′∈A} |Q^{P,π}(s′, a′) − Q^{P,π}(s, a)|.

Proof We can rewrite the expected coupled relative advantage A^{P′,π′}_{P,π,µ} exploiting Lemma 1:

A^{P′,π′}_{P,π,µ} = ∫_S d^{P,π}_µ(s) A^{P′,π′}_{P,π}(s) ds
  = ∫_S d^{P,π}_µ(s) (A^{P,π′}_{P,π}(s) + ∫_A π′(a|s) A^{P′,π}_{P,π}(s, a) da) ds
  = ∫_S d^{P,π}_µ(s) A^{P,π′}_{P,π}(s) ds + ∫_S ∫_A d^{P,π}_µ(s) π(a|s) A^{P′,π}_{P,π}(s, a) da ds + ∫_S d^{P,π}_µ(s) ∫_A (π′(a|s) − π(a|s)) A^{P′,π}_{P,π}(s, a) da ds
  = A^{P,π′}_{P,π,µ} + A^{P′,π}_{P,π,µ} + ∫_S d^{P,π}_µ(s) ∫_A (π′(a|s) − π(a|s)) A^{P′,π}_{P,π}(s, a) da ds. (22)

From equation (22) we can straightforwardly state the following inequalities:

A^{P′,π′}_{P,π,µ} ≥ A^{P,π′}_{P,π,µ} + A^{P′,π}_{P,π,µ} − |∫_S d^{P,π}_µ(s) ∫_A (π′(a|s) − π(a|s)) A^{P′,π}_{P,π}(s, a) da ds|,
A^{P′,π′}_{P,π,µ} ≤ A^{P,π′}_{P,π,µ} + A^{P′,π}_{P,π,µ} + |∫_S d^{P,π}_µ(s) ∫_A (π′(a|s) − π(a|s)) A^{P′,π}_{P,π}(s, a) da ds|;

we then bound the right-hand side:

|A^{P′,π′}_{P,π,µ} − (A^{P,π′}_{P,π,µ} + A^{P′,π}_{P,π,µ})| ≤ |∫_S d^{P,π}_µ(s) ∫_A (π′(a|s) − π(a|s)) A^{P′,π}_{P,π}(s, a) da ds|
  ≤ ∫_S ∫_A d^{P,π}_µ(s) |(π′(a|s) − π(a|s)) A^{P′,π}_{P,π}(s, a)| da ds
  ≤ ∫_S d^{P,π}_µ(s) ∫_A |π′(a|s) − π(a|s)| da ds · ΔA^{P′,π}_{P,π} / 2
  ≤ E_{s∼d^{P,π}_µ} ∥π′(·|s) − π(·|s)∥_1 · ΔA^{P′,π}_{P,π} / 2
  ≤ D^{π′,π}_E ΔA^{P′,π}_{P,π} / 2.

We conclude by bounding the term ΔA^{P′,π}_{P,π} / 2:

ΔA^{P′,π}_{P,π} / 2 ≤ ∥A^{P′,π}_{P,π}∥_∞
  ≤ sup_{s∈S,a∈A} ∫_S (P′(s′|s, a) − P(s′|s, a)) U^{P,π}(s, a, s′) ds′ (23)
  = γ sup_{s∈S,a∈A} ∫_S (P′(s′|s, a) − P(s′|s, a)) V^{P,π}(s′) ds′ (24)
  ≤ γ sup_{s∈S,a∈A} ∥P′(·|s, a) − P(·|s, a)∥_1 ΔV^{P,π} / 2 (25)
  ≤ γ D^{P′,P}_∞ ΔQ^{P,π} / 2,

where (24) follows from (23) by observing that ∫_S (P′(s′|s, a) − P(s′|s, a)) U^{P,π}(s, a, s′) ds′ = ∫_S (P′(s′|s, a) − P(s′|s, a)) (R(s, a) + γ V^{P,π}(s′)) ds′, in which the R(s, a) term cancels because both P′(·|s, a) and P(·|s, a) integrate to one, and (25) is obtained by observing that ΔV^{P,π} ≤ ΔQ^{P,π}. Putting everything together we get the lemma.


Lemma 3 Let (P, π) and (P′, π′) be two model-policy pairs; it holds that:

ΔA^{P′,π′}_{P,π} / 2 ≤ (D^{π′,π}_∞ + γ D^{P′,P}_∞) ΔQ^{P,π} / 2,

where D^{P′,P}_∞ = sup_{s∈S,a∈A} ∥P′(·|s, a) − P(·|s, a)∥_1, D^{π′,π}_∞ = sup_{s∈S} ∥π′(·|s) − π(·|s)∥_1 and ΔQ^{P,π} = sup_{s,s′∈S, a,a′∈A} |Q^{P,π}(s′, a′) − Q^{P,π}(s, a)|.

Proof Let us start by rewriting the expression of the coupled relative advantage A^{P′,π′}_{P,π}:

A^{P′,π′}_{P,π}(s) = ∫_A π′(a|s) (R(s, a) + γ ∫_S P′(s′|s, a) V^{P,π}(s′) ds′) da − V^{P,π}(s)
  = ∫_A π′(a|s) (R(s, a) + γ ∫_S P′(s′|s, a) V^{P,π}(s′) ds′) da − ∫_A π(a|s) (R(s, a) + γ ∫_S P(s′|s, a) V^{P,π}(s′) ds′) da
  = ∫_A (π′(a|s) − π(a|s)) R(s, a) da + γ ∫_A ∫_S (π′(a|s) P′(s′|s, a) − π(a|s) P(s′|s, a)) V^{P,π}(s′) ds′ da
  = ∫_A (π′(a|s) − π(a|s)) R(s, a) da + γ ∫_A ∫_S (π′(a|s) − π(a|s)) P(s′|s, a) V^{P,π}(s′) ds′ da + γ ∫_A ∫_S π′(a|s) (P′(s′|s, a) − P(s′|s, a)) V^{P,π}(s′) ds′ da
  = ∫_A (π′(a|s) − π(a|s)) (R(s, a) + γ ∫_S P(s′|s, a) V^{P,π}(s′) ds′) da + γ ∫_A ∫_S π′(a|s) (P′(s′|s, a) − P(s′|s, a)) V^{P,π}(s′) ds′ da
  = ∫_A (π′(a|s) − π(a|s)) Q^{P,π}(s, a) da + γ ∫_A ∫_S π′(a|s) (P′(s′|s, a) − P(s′|s, a)) V^{P,π}(s′) ds′ da;

the claim then follows straightforwardly, noticing that sup_{s∈S} ΔQ^{P,π}(s, ·) ≤ ΔQ^{P,π} and recalling that ΔV^{P,π} ≤ ΔQ^{P,π}:

ΔA^{P′,π′}_{P,π} / 2 ≤ ∥A^{P′,π′}_{P,π}∥_∞ ≤ D^{π′,π}_∞ sup_{s∈S} ΔQ^{P,π}(s, ·) / 2 + γ D^{P′,P}_∞ ΔV^{P,π} / 2 ≤ (D^{π′,π}_∞ + γ D^{P′,P}_∞) ΔQ^{P,π} / 2.


Theorem 3.1 (Decoupled Bound) The performance improvement of the model-policy pair (P′, π′) over (P, π) can be lower bounded as:

J^{P′,π′}_µ − J^{P,π}_µ ≥ B(P′, π′) = (A^{P′,π}_{P,π,µ} + A^{P,π′}_{P,π,µ}) / (1−γ) − γ ΔQ^{P,π} D / (2(1−γ)²),

where the first term is the advantage, the second term is the dissimilarity penalization, D is a dissimilarity term defined as D = D^{π′,π}_E (D^{π′,π}_∞ + D^{P′,P}_∞) + D^{P′,P}_E (D^{π′,π}_∞ + γ D^{P′,P}_∞), and D^{π′,π}_∞ = sup_{s∈S} ∥π′(·|s) − π(·|s)∥_1, D^{P′,P}_∞ = sup_{s∈S,a∈A} ∥P′(·|s, a) − P(·|s, a)∥_1, ΔQ^{P,π} = sup_{s,s′∈S, a,a′∈A} |Q^{P,π}(s′, a′) − Q^{P,π}(s, a)|.

Proof Exploiting the bounds on the coupled expected relative advantage A^{P′,π′}_{P,π,µ} (Lemma 2), on the difference between the γ-discounted state distributions (Proposition 3.1) and on ΔA^{P′,π′}_{P,π} (Lemma 3), we can state the following:

J^{P′,π′}_µ − J^{P,π}_µ ≥ A^{P′,π′}_{P,π,µ} / (1−γ) − ∥d^{P′,π′}_µ − d^{P,π}_µ∥_1 / (1−γ) · ΔA^{P′,π′}_{P,π} / 2
  ≥ (A^{P,π′}_{P,π,µ} + A^{P′,π}_{P,π,µ}) / (1−γ) − γ/(1−γ)² (D^{π′,π}_E + D^{P′,P}_E)(D^{π′,π}_∞ + γ D^{P′,P}_∞) ΔQ^{P,π} / 2 − γ/(1−γ) D^{π′,π}_E D^{P′,P}_∞ ΔQ^{P,π} / 2
  ≥ (A^{P,π′}_{P,π,µ} + A^{P′,π}_{P,π,µ}) / (1−γ) − γ/(1−γ)² (D^{π′,π}_E D^{π′,π}_∞ + D^{π′,π}_E D^{P′,P}_∞ + D^{π′,π}_∞ D^{P′,P}_E + γ D^{P′,P}_∞ D^{P′,P}_E) ΔQ^{P,π} / 2.

With a factorization of the last expression we get the result.

Theorem 4.1 For any π̄ ∈ Π and P̄ ∈ P, the decoupled bound is optimized for (α*, β*) = arg max_{α,β} {B(α, β) : (α, β) ∈ V}, where V = {(α*_0, 0), (α*_1, 1), (0, β*_0), (1, β*_1)} and the values of α*_0, α*_1, β*_0 and β*_1 are reported in Table 1.

Proof Let us write the update coefficients explicitly in the decoupled bound (Theorem 3.1):

J^{P′,π′}_µ − J^{P,π}_µ ≥ B(α, β) = (α A^{P,π̄}_{P,π,µ} + β A^{P̄,π}_{P,π,µ}) / (1−γ)
  − γ/(1−γ)² (α² D^{π̄,π}_E D^{π̄,π}_∞ + αβ D^{π̄,π}_E D^{P̄,P}_∞ + αβ D^{π̄,π}_∞ D^{P̄,P}_E + γ β² D^{P̄,P}_∞ D^{P̄,P}_E) ΔQ^{P,π} / 2;

we now take the derivatives w.r.t. α and β to find the stationary points:

∂B/∂α = A^{P,π̄}_{P,π,µ} / (1−γ) − γ/(1−γ)² (2α D^{π̄,π}_E D^{π̄,π}_∞ + β D^{π̄,π}_E D^{P̄,P}_∞ + β D^{π̄,π}_∞ D^{P̄,P}_E) ΔQ^{P,π} / 2,
∂B/∂β = A^{P̄,π}_{P,π,µ} / (1−γ) − γ/(1−γ)² (α D^{π̄,π}_E D^{P̄,P}_∞ + α D^{π̄,π}_∞ D^{P̄,P}_E + 2γβ D^{P̄,P}_∞ D^{P̄,P}_E) ΔQ^{P,π} / 2.

When the target policy is different from the current one and, symmetrically, the target model is different from the current model, the linear system of the derivatives admits a unique solution. We compute the second-order derivatives to discover the nature of such a point:

∂²B/∂α² = −γ/(1−γ)² D^{π̄,π}_E D^{π̄,π}_∞ ΔQ^{P,π},
∂²B/∂α∂β = ∂²B/∂β∂α = −γ/(1−γ)² (D^{π̄,π}_E D^{P̄,P}_∞ + D^{π̄,π}_∞ D^{P̄,P}_E) ΔQ^{P,π} / 2,
∂²B/∂β² = −γ²/(1−γ)² D^{P̄,P}_∞ D^{P̄,P}_E ΔQ^{P,π}.

From the second-order derivatives we can write the Hessian matrix:

H(α, β) = −γ ΔQ^{P,π} / (2(1−γ)²) · [ 2 D^{π̄,π}_E D^{π̄,π}_∞ ,  D^{π̄,π}_E D^{P̄,P}_∞ + D^{π̄,π}_∞ D^{P̄,P}_E ;
                                     D^{π̄,π}_E D^{P̄,P}_∞ + D^{π̄,π}_∞ D^{P̄,P}_E ,  2γ D^{P̄,P}_∞ D^{P̄,P}_E ],

having trace and determinant:

tr(H(α, β)) = −γ ΔQ^{P,π} / (2(1−γ)²) (2 D^{π̄,π}_E D^{π̄,π}_∞ + 2γ D^{P̄,P}_∞ D^{P̄,P}_E) ≤ 0,

det(H(α, β)) = (γ ΔQ^{P,π} / (2(1−γ)²))² ((4γ − 2) D^{π̄,π}_E D^{π̄,π}_∞ D^{P̄,P}_∞ D^{P̄,P}_E − (D^{π̄,π}_E D^{P̄,P}_∞)² − (D^{π̄,π}_∞ D^{P̄,P}_E)²)
  ≤ (γ ΔQ^{P,π} / (2(1−γ)²))² (2 D^{π̄,π}_E D^{π̄,π}_∞ D^{P̄,P}_∞ D^{P̄,P}_E − (D^{π̄,π}_E D^{P̄,P}_∞)² − (D^{π̄,π}_∞ D^{P̄,P}_E)²)
  = −(γ ΔQ^{P,π} / (2(1−γ)²))² (D^{π̄,π}_E D^{P̄,P}_∞ − D^{π̄,π}_∞ D^{P̄,P}_E)² ≤ 0.

When P̄ ≠ P and π̄ ≠ π we observe that the Hessian matrix is indefinite, since both the trace and the determinant are negative. This means that the unique stationary point is a saddle point, which is uninteresting for optimization purposes. On the other hand, B(α, β) is a quadratic function, hence continuous on the compact set [0, 1]², and therefore, by the Weierstrass theorem, it admits a global maximum (and minimum). Since such a point is not a stationary point, it must lie on the boundary of [0, 1]².

Then, by setting to zero the equations ∂B/∂α|_{β=0}, ∂B/∂α|_{β=1}, ∂B/∂β|_{α=0}, ∂B/∂β|_{α=1}, we obtain the following optimal values (which are clipped to lie in the interval [0, 1]):

α*_0 = (1−γ) / (γ ΔQ^{P,π}) · A^{P,π̄}_{P,π,µ} / (D^{π̄,π}_∞ D^{π̄,π}_E),

α*_1 = (1−γ) / (γ ΔQ^{P,π}) · A^{P,π̄}_{P,π,µ} / (D^{π̄,π}_∞ D^{π̄,π}_E) − 1/2 (D^{P̄,P}_E / D^{π̄,π}_E + D^{P̄,P}_∞ / D^{π̄,π}_∞),

β*_0 = (1−γ) / (γ² ΔQ^{P,π}) · A^{P̄,π}_{P,π,µ} / (D^{P̄,P}_∞ D^{P̄,P}_E),

β*_1 = (1−γ) / (γ² ΔQ^{P,π}) · A^{P̄,π}_{P,π,µ} / (D^{P̄,P}_∞ D^{P̄,P}_E) − 1/(2γ) (D^{π̄,π}_E / D^{P̄,P}_E + D^{π̄,π}_∞ / D^{P̄,P}_∞).

Instead, for γ ∈ (0, 1), the Hessian is singular when either the target policy or the target model is equal to the current one. Those cases can be treated separately and clearly yield maximum points: when P̄ = P we have α* = α*_0, and when π̄ = π we have β* = β*_0.

We report the values of the decoupled bound B(α, β) (Theorem 3.1) in correspondence of the optimal coefficients:

B(α*_0, 0) = (1−γ) (A^{P,π̄}_{P,π,µ})² / (2γ ΔQ^{P,π} D^{π̄,π}_∞ D^{π̄,π}_E),

B(0, β*_0) = (1−γ) (A^{P̄,π}_{P,π,µ})² / (2γ² ΔQ^{P,π} D^{P̄,P}_∞ D^{P̄,P}_E),

B(1, β*_1) = A^{P,π̄}_{P,π,µ} + (1−γ) (A^{P̄,π}_{P,π,µ})² / (2γ² ΔQ^{P,π} D^{P̄,P}_∞ D^{P̄,P}_E) − A^{P̄,π}_{P,π,µ} (D^{π̄,π}_E / D^{P̄,P}_∞ + D^{π̄,π}_∞ / D^{P̄,P}_E)
  + ΔQ^{P,π} / (2(1−γ)) · (1/2 D^{P̄,P}_E D^{π̄,π}_E D_1 + 1/2 D^{P̄,P}_∞ D^{π̄,π}_E D_1 − 1/4 D^{P̄,P}_∞ D^{P̄,P}_E D_1² − γ D^{π̄,π}_E D^{π̄,π}_∞),

B(α*_1, 1) = A^{P̄,π}_{P,π,µ} + (1−γ) (A^{P,π̄}_{P,π,µ})² / (2γ ΔQ^{P,π} D^{π̄,π}_∞ D^{π̄,π}_E) − A^{P,π̄}_{P,π,µ} / 2 (D^{P̄,P}_E / D^{π̄,π}_∞ + D^{P̄,P}_∞ / D^{π̄,π}_E)
  + γ ΔQ^{P,π} / (2(1−γ)) · (1/2 D^{π̄,π}_E D^{P̄,P}_E D_2 + 1/2 D^{π̄,π}_∞ D^{P̄,P}_E D_2 − 1/4 D^{π̄,π}_∞ D^{π̄,π}_E D_2² − γ D^{P̄,P}_E D^{P̄,P}_∞),

where

D_1 = D^{π̄,π}_E / D^{P̄,P}_E + D^{π̄,π}_∞ / D^{P̄,P}_∞,    D_2 = D^{P̄,P}_E / D^{π̄,π}_E + D^{P̄,P}_∞ / D^{π̄,π}_∞.


[Figure 3: An example of Conf-MDP with local maxima. States A, B, C, D, E with rewards 0, 1, −M, 1, 0; the transition probabilities, reported on the edges, are of the form ωp + (1−ω)(1−p) and ω(1−p) + (1−ω)p.]

Appendix C. Examples of Conf-MDPs

In this section we report two examples of Conf-MDPs having some interesting behaviors. In all figures, the transition probabilities are reported on the edges and the reward is written below the state name.

C.1 An example of Conf-MDP with local optima

Let us consider the Conf-MDP represented in Figure 3, where ω ∈ [0, 1] is the parameter, p ∈ [0, 1] is a small fixed probability (say 0.1) and M is a large positive number. In each state there is only one action available (i.e., all policies are optimal). The vertex models are obtained for ω ∈ {0, 1}. For both target models there is a small probability to get the punishment −M, since for ω = 0 the probability to reach state C from B is p and for ω = 1 state B is reachable from A with probability p. We expect that by mixing the two target models we can only worsen the performance. It is simple to realize that the expected return is a cubic function of ω. We report its expression for p = 0.1 and γ = 1:

$$J^{P_\omega}_\mu = \frac{1}{2}\Big( 0.512\,\omega^3 + (0.64M - 1.088)\,\omega^2 - (0.64M + 0.296)\,\omega + 1.981 - 0.09M \Big).$$

We can find the stationary points by looking at the derivative:

$$\frac{\partial J^{P_\omega}_\mu}{\partial \omega} = 0.768\,\omega^2 + (0.64M - 1.088)\,\omega - 0.32M - 0.148.$$

For M sufficiently large, the derivative has one sign variation, thus it has two roots of opposite sign, with expression:

$$\omega_{1,2} = \frac{1}{24}\Big( 17 - 10M \pm 10\sqrt{M^2 - M + 4} \Big).$$

Clearly, we are interested only in the solutions within [0, 1], thus we discard the negative one. It is simple to see that the positive solution is approximately 1/2 for M sufficiently large, since:

$$\lim_{M\to+\infty} \frac{1}{24}\Big( 17 - 10M + 10\sqrt{M^2 - M + 4} \Big) = \frac{1}{2}.$$

However, looking at the second derivative we realize that this is a minimum point, since

$$\frac{\partial^2 J^{P_\omega}_\mu}{\partial \omega^2} = 1.536\,\omega + 0.64M - 1.088\,\Big|_{\omega=\frac{1}{2}} > 0.$$


Figure 4: An example of Conf-MDP with mixed optimal model. (In the original figure, the states A, B, C, D have rewards 0, 0, 1, 0 written below the state names, and the transition probabilities on the edges are mixtures of the form ωp + (1−ω)(1−p) and ω(1−p) + (1−ω)p.)

Notice that, in the unfortunate case in which SPMI is initialized at this value of ω, the expected relative advantage (which coincides with the gradient) is zero for both the vertex models and therefore there would be no update. Since the interior stationary point is a minimum, the maximum must lie on the border, specifically either at ω = 0 or ω = 1. It is simple to see that $J^{P_1}_\mu > J^{P_0}_\mu$. Moreover, if we compute the value of the gradient at ω = 0 and at ω = 1, along the direction pointing towards the interior of [0, 1], we realize that in both cases the value is negative. Having a negative advantage, SPMI would never make any step, even when the model is initialized at the lower-performance vertex ω = 0.
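These observations can be checked numerically from the expressions above; the following Python sketch (with a hypothetical value of M) locates the interior stationary point, verifies that it is a minimum, and evaluates the slope of the expected return along the only feasible direction at each vertex model:

    import numpy as np

    M = 1000.0  # hypothetical large punishment magnitude

    def J(w):
        # Expected return for p = 0.1 and gamma = 1, as given above.
        return 0.5 * (0.512 * w**3 + (0.64 * M - 1.088) * w**2
                      - (0.64 * M + 0.296) * w + 1.981 - 0.09 * M)

    def dJ(w):
        # Derivative of J with respect to omega.
        return 0.768 * w**2 + (0.64 * M - 1.088) * w - 0.32 * M - 0.148

    # Interior stationary point (the positive root), close to 1/2 for large M.
    w_star = (17 - 10 * M + 10 * np.sqrt(M**2 - M + 4)) / 24
    print("stationary point:", w_star, " J:", J(w_star))
    print("second derivative:", 1.536 * w_star + 0.64 * M - 1.088)  # positive: a minimum

    # Slope of J along the only feasible direction at each vertex model:
    # at omega = 0 the feasible direction is +1, at omega = 1 it is -1.
    print("slope at omega=0 towards the interior:", dJ(0.0))   # negative
    print("slope at omega=1 towards the interior:", -dJ(1.0))  # negative for large M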

C.2 An example of Conf-MDP with mixed optimal model

We consider the Conf-MDP represented in Figure 4. As in the previous case, the parameter is ω ∈ [0, 1] and p ∈ [0, 1] is a fixed probability. We want to show that there exists no value of ω such that Pω maximizes the value function in all states, while there exists one value of ω maximizing the expected return. It is simple to compute the value function in each state:

$$\begin{aligned} V^{P_\omega}(A) &= \gamma^2 \big( \omega p + (1-\omega)(1-p) \big)\big( \omega(1-p) + (1-\omega)p \big), \\ V^{P_\omega}(B) &= \gamma \big( \omega(1-p) + (1-\omega)p \big), \\ V^{P_\omega}(C) &= 1, \\ V^{P_\omega}(D) &= 0. \end{aligned}$$

Since the initial state is A, we have $J^{P_\omega}_\mu = V^{P_\omega}(A)$, which is maximized for ω = 1/2. However, there is no value of ω for which the value function of every state is maximized. As shown in Figure 5, while $V^{P_\omega}(A)$ is maximal at ω = 1/2, $V^{P_\omega}(B)$ is maximal at ω = 1. All values of ω ∈ [1/2, 1] are indeed Pareto optimal (Figure 6).

With some tedious calculations we can determine the expressions of the expected relative advantage functions:

$$A^{P_1}_{P_\omega,\mu} = \gamma^2 (1-\omega)(1-2\omega)(1-2p)^2, \qquad A^{P_0}_{P_\omega,\mu} = -\gamma^2 \omega (1-2\omega)(1-2p)^2.$$

We clearly see that they both vanish for ω = 1/2.
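The following Python sketch evaluates the value functions and the two expected relative advantages over a grid of ω, for hypothetical values of γ and p, confirming that $V^{P_\omega}(A)$ peaks at ω = 1/2, $V^{P_\omega}(B)$ at ω = 1, and that both advantages vanish at ω = 1/2:

    import numpy as np

    gamma, p = 0.95, 0.1  # hypothetical values of the fixed quantities

    def V_A(w):
        return gamma**2 * (w * p + (1 - w) * (1 - p)) * (w * (1 - p) + (1 - w) * p)

    def V_B(w):
        return gamma * (w * (1 - p) + (1 - w) * p)

    def adv_P1(w):  # expected relative advantage of the omega = 1 vertex
        return gamma**2 * (1 - w) * (1 - 2 * w) * (1 - 2 * p)**2

    def adv_P0(w):  # expected relative advantage of the omega = 0 vertex
        return -gamma**2 * w * (1 - 2 * w) * (1 - 2 * p)**2

    ws = np.linspace(0.0, 1.0, 1001)
    print("argmax V(A):", ws[np.argmax(V_A(ws))])   # approximately 1/2
    print("argmax V(B):", ws[np.argmax(V_B(ws))])   # 1
    print("advantages at omega = 1/2:", adv_P1(0.5), adv_P0(0.5))  # both zero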

Appendix D. Environment description

In this appendix we provide a more detailed description of the environments used in the experimental evaluation.


Figure 5: The state value function of the Conf-MDP in Figure 4 as a function of the parameter ω (curves $V^{P_\omega}(A)$, $V^{P_\omega}(B)$, $V^{P_\omega}(C)$, $V^{P_\omega}(D)$).

Figure 6: The state value function of states A and B (the only ones varying with the parameter) of the Conf-MDP in Figure 4, plotted as $V^{P_\omega}(B)$ against $V^{P_\omega}(A)$. The green continuous line is the Pareto frontier.

D.1 Student-Teacher domain

The teaching/learning process involves two entities that interact: the teacher and the student (learner). We assume both entities share the same goal, i.e., maximizing the learning. The teaching model, however, should be suited to the specific learning policy of the student. For instance, not all students have the same skills and are able to capture the information provided by the teacher with the same speed and effectiveness. Thus, the teaching model should be tailored to meet the student's needs. Given the goal of maximizing learning, a teaching model induces an optimal learning policy (within the space of the policies that a certain student can play). Symmetrically, a learning policy determines an optimal teaching model (within the space of models available to the teacher). The question we want to answer in this experiment is: can we dynamically adapt the teaching model to the learning policy and the learning policy to the teaching model, so as to maximize the learning?

We take inspiration from (Rafferty et al., 2011) and formalize the teaching/learning process as an MDP in which the student is the agent and the teacher is the environment. To fit our framework to this context, we can think of the teacher as an online learning platform that can be configured by the student in order to improve the learning experience. As in (Rafferty et al., 2011), we test the model on the “alphabet arithmetic” task, a concept-learning task in which literals are mapped to numbers.

We consider n literals L1, ..., Ln, to which the student can assign the values {0, ..., m}. The teacher, at each time step, provides an “example”, i.e., an equation where a number of distinct literals (from 2 to p ≤ n) sum to a numerical answer. The set of all possible examples is given by:

$$\mathcal{E} = \Big\{ \textstyle\sum_{i \in I} L_i = l \,:\, I \subseteq \{1, \dots, n\},\ 2 \le |I| \le p,\ l \in \{0, \dots, |I|\,m\} \Big\}.$$


The student reacts to an example by performing an action, i.e., an assignment of literals. The set of all assignments, i.e., actions, is given by:

$$\mathcal{A} = \big\{ L_1 = l_1, L_2 = l_2, \dots, L_n = l_n \,:\, l_i \in \{0, \dots, m\},\ i = 1, \dots, n \big\},$$

thus $|\mathcal{A}| = (m+1)^n$. In order to model the student policy space, we assume that a student can modify an arbitrary number of literals under the constraint that two consecutive assignments satisfy $\sum_{i=1}^n |l'_i - l_i| \le k$. This models the learning limitations of the student, in particular how hard it is for the student to capture the teacher's information. We assume that the teacher can provide any example. The set of states is the combination of an assignment and an example, i.e., $\mathcal{S} = \mathcal{E} \times \mathcal{A}$.

The goal of the student is to perform assignments that are consistent with the teacher's examples (within its limitations on the possible assignments). So, while the student is learning the optimal policy, it can configure the teacher to provide more suitable examples. The reward is 1 when the assignment is consistent and 0 when it is not, and the horizon H is finite. Notice that we do not have a goal state, differently from (Rafferty et al., 2011), so the problem becomes fully observable. We assume that, at the beginning, both the policy and the model are uniform distributions over the allowed actions/states.
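To make the construction concrete, the following Python sketch enumerates the example set E and the action set A for a small hypothetical instance (how these parameters map onto the problem labels of Table 3 is not spelled out here), together with the consistency check used for the reward and the constraint on consecutive assignments:

    from itertools import combinations, product

    # Hypothetical small instance, not necessarily one of the settings in Table 3.
    n, m, p, k = 2, 1, 2, 1

    literals = list(range(n))

    # Example set E: subsets I of literals (2 <= |I| <= p) with a numerical answer l.
    E = [(I, l) for size in range(2, p + 1)
                for I in combinations(literals, size)
                for l in range(size * m + 1)]

    # Action set A: all assignments of values {0, ..., m} to the n literals.
    A = list(product(range(m + 1), repeat=n))
    assert len(A) == (m + 1) ** n

    def consistent(assignment, example):
        # Reward 1 if the assignment satisfies the example equation, 0 otherwise.
        I, l = example
        return int(sum(assignment[i] for i in I) == l)

    def reachable(prev, new, k=k):
        # A student can only move to assignments within total variation k.
        return sum(abs(a - b) for a, b in zip(prev, new)) <= k

    print(len(E), len(A), len(E) * len(A))  # |E|, |A|, |S| = |E| x |A|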

D.2 Racetrack Simulator

The Racetrack Simulator is a basic abstraction of an autonomous car driving problem. In this context the autonomous driver, i.e., the learning agent, has to optimize a driving policy in order to run the vehicle to the track finish line as fast as possible. The vehicle and the track naturally compose the model of the learning process; however, there is the possibility to tune a set of vehicle parameters, such as the aerodynamic profile (which affects the vehicle stability) and the engine setting. Therefore, to maximize the performance, the driving policy of the agent and the model configuration have to be jointly considered. It is noteworthy that a specific model parametrization (vehicle setting) induces an optimal driving policy and, on the other hand, a driving policy determines an optimal model parametrization. Moreover, a model-policy pair that is optimal for a specific track may not be optimal for a (morphologically) different track. Then, the question we aim to answer with this experiment is the following: can we learn the optimal model-policy pair for a given track by dynamically adapting the vehicle parametrization to the driving policy and, conversely, the driving policy to the vehicle parametrization during the learning process?

We formalize the learning process as an MDP in which the driver is the agent and the environment is composed of the track and the vehicle. The track is represented by a grid of positions, where each grid point is of type roadway, wall, initial position, or goal position. A state in the learning process belongs to the set:

$$\mathcal{S} = \big\{ (x, y, v_x, v_y) : x \in \{0, \dots, x_{\max}\},\ y \in \{0, \dots, y_{\max}\},\ v_x \in \{v_{\min}, \dots, v_{\max}\},\ v_y \in \{v_{\min}, \dots, v_{\max}\} \big\},$$

where (x, y) corresponds to a grid position and $(v_x, v_y)$ are the speeds along the coordinate axes. At each step the agent can increment or decrement the speed along a coordinate direction or do nothing, thus the action space is the following:

$$\mathcal{A} = \big\{ \text{keep},\ \text{increment } v_x,\ \text{increment } v_y,\ \text{decrement } v_x,\ \text{decrement } v_y \big\}.$$


The learning process starts at the state corresponding to the initial position with zero velocities; the agent collects reward 1 when it reaches a state corresponding to the goal position within the finite horizon H, and 0 reward in any other case.

The transition model associates a success probability with every action: a failed action causes a random action to occur instead of the one selected by the agent. This probability models the stability of the vehicle; the more unstable the vehicle, the harder it is for the agent to drive it (i.e., to have the selected action executed). The model also associates a failure probability with every action: a failure represents a breakdown of the vehicle and directly causes the end of the episode. This feature represents the pressure on the vehicle engine: the more performance the driver asks for, the more likely it is to break down. We formalize the transition model as a convex combination of a set of vertex models, which correspond to vehicle configurations pushed towards the limit in terms of the aspects described above. For our purpose, we define a model dichotomy related to vehicle stability: P highspeed (P hs) trades stability at lower speed for more stability (i.e., higher action success probability) in high-speed situations, while P lowspeed (P ls) provides more stability in low-speed situations and poor stability at higher speed. We also define a model dichotomy related to engine boost: P boost (P b) guarantees higher engine performance and lower reliability (i.e., higher failure probability); on the contrary, P noboost (P nb) provides higher reliability but poor engine performance. In Figure 7 we propose a graphical representation of the features of these extreme models.

Figure 7: Graphical representation of the racetrack extreme models (left: stability as a function of speed for P highspeed and P lowspeed; right: failure probability as a function of boost for P boost and P noboost).

Considering any possible combination of stability and engine setting, we define the model set (set of vertex models) P = {P hs b, P hs nb, P ls b, P ls nb}. Each model in this set is obtained by taking, for each state-action pair, the product of the transition probabilities of the components (e.g., P hs b(s, a) = P hs(s, a) × P b(s, a)). Then, we derive the model space as the convex hull of the vertices in the model set:

Pω = { Pω = ω0 P hs b + ω1 P hs nb + ω2 P ls b + ω3 P ls nb }.

While the agent is learning the optimal driving policy, the model parametrization can be configured (by selecting a vector (ω0, ω1, ω2, ω3)), trying to fit the vehicle settings to the driving policy and, simultaneously, trying to fit the policy-settings pair to the morphology of the track. At the beginning of the learning process, we assume the policy to be a uniform distribution over the action space and the model to be (0, 0.5, 0, 0.5), which we can consider the most conservative parametrization in our context. We also report in Figure 8 an illustrative representation of the tracks used in the experiments.
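The following Python sketch illustrates this construction on hypothetical toy transition matrices (not the actual Racetrack dynamics): the four vertex models are obtained as element-wise products of the stability and engine components, renormalized so that each row remains a distribution (the renormalization is our own implementation choice), and the configurable model is their convex combination:

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 6, 5  # hypothetical toy sizes, not the real Racetrack

    def random_model(n_states, n_actions, rng):
        # Hypothetical placeholder dynamics: P[s, a, s'] is a distribution over s'.
        P = rng.random((n_states, n_actions, n_states))
        return P / P.sum(axis=-1, keepdims=True)

    # Component models: stability dichotomy and engine dichotomy.
    P_hs, P_ls = random_model(n_states, n_actions, rng), random_model(n_states, n_actions, rng)
    P_b, P_nb = random_model(n_states, n_actions, rng), random_model(n_states, n_actions, rng)

    def combine(P1, P2):
        # Element-wise product of the components, renormalized so that each (s, a)
        # row is again a distribution (the renormalization is our own choice here).
        P = P1 * P2
        return P / P.sum(axis=-1, keepdims=True)

    vertices = [combine(P_hs, P_b), combine(P_hs, P_nb),
                combine(P_ls, P_b), combine(P_ls, P_nb)]

    omega = np.array([0.0, 0.5, 0.0, 0.5])  # the conservative initial parametrization
    P_omega = sum(w * P for w, P in zip(omega, vertices))

    # Convex combinations of transition models are still transition models.
    assert np.allclose(P_omega.sum(axis=-1), 1.0)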


Figure 8: Basic representation of the tracks used in the Racetrack Simulator. From left to right: T1, T3, T4, and T2 just below. Each position has a type label: 1 for initial states, 2 for goal states, 3 for walls, 4 for roadway positions.

Appendix E. Experiments details

In this appendix, we report the hyper-parameters used for each experiment and some additional plots.

E.1 Student-Teacher domain

In Table 2 we report the number of iterations to convergence for the different problem settings we considered. We can see that SPMI is the first or the second to converge in most of the cases. In Table 3 we report the hyper-parameters we used for the runs in the different problem settings. In Figures 9, 10, 11 we provide numerous plots showing different interesting metrics for the problem settings we considered in the experimental evaluation.


Table 2: Number of steps to convergence for the update strategies in different problem settings of the Student-Teacher domain. In bold the best algorithm and underlined the second best. The run has been stopped after 50000 iterations.

Problem   SPMI    SPMI-sup  SPMI-alt  SPI+SMI  SMI+SPI
2-1-1-2   16234   18054     30923     22130    7705
2-1-2-2   2839    3194      5678      2839     12973
2-2-1-2   20345   18287     >50000    39722    10904
2-2-2-2   12025   14315     >50000    >50000   15257
2-3-1-2   14187   13391     11772     >50000   12183
3-1-1-2   15410   17929     22707     31122    14257
3-1-2-2   3313    3313      8434      3313     22846
3-1-3-2   2945    3435      5891      2945     18090

Table 3: Additional information and hyper-parameters used for the different problem settings of the Student-Teacher domain.

Problem   |S|   |A|   horizon   ∆Q^{P,π}                 γ      ε
2-1-1-2   12    4     10        (1−γ^10)/(1−γ) ≃ 9.56    0.99   0
2-1-2-2   12    4     10        9.56                     0.99   0
2-2-1-2   45    8     10        9.56                     0.99   0
2-2-2-2   45    8     10        9.56                     0.99   0
2-3-1-2   112   16    10        9.56                     0.99   0
3-1-1-2   72    9     10        9.56                     0.99   0
3-1-2-2   72    9     10        9.56                     0.99   0
3-1-3-2   72    9     10        9.56                     0.99   0


Figure 9: Several statistics of the persistent target choice for different update strategies in the case of Student-Teacher domain 2-1-1-2 (panels: expected return, bound, α, β, policy advantage, model advantage, policy distance, model distance, policy dissimilarity, model dissimilarity, policy changes, model changes, as a function of the iteration; curves: SPMI, SPMI-sup, SPMI-alt).

Figure 10: Several statistics of the greedy target choice for different update strategies in the case of Student-Teacher domain 2-1-1-2 (same panels as Figure 9; curves: SPMI, SPMI-sup, SPMI-alt).


Figure 11: Several statistics of the persistent target choice for different update strategies in the case of Student-Teacher domain 2-3-1-2 (same panels as Figure 9; curves: SPMI, SPMI-sup, SPMI-alt, SPI+SMI, SMI+SPI).


E.2 Racetrack Simulator

In this section, we provide additional information about the experiments in the Racetrack Simulator environment (Table 4), along with a more detailed presentation of the results reported in the Experimental Evaluation section of the paper (Figure 12 and Figure 13). Moreover, we present the performance obtained in some cases not mentioned in the paper (Figure 14).

Table 4: Additional information and hyper-parameters used for the different problem settings of the Racetrack Simulator. In the cases with two vertices, only stability vehicle configurations are considered.

Track   vertices   |S|   |A|   horizon   ∆Q^{P,π}   γ     ε
T1      2          675   5     20        1          0.9   0
T3      2          450   5     20        1          0.9   0
T4      2          675   5     20        1          0.9   0
T2      2          450   5     20        1          0.9   0
T2      4          451   5     20        1          0.9   0

Figure 12: Several statistics of different update strategies and target choices in the case of the Racetrack Simulator on T1, considering vehicle stability only (panels: expected return, bound, α, β, policy advantage, model advantage, policy distance, model distance, policy dissimilarity, model dissimilarity, high speed stability, low speed stability, as a function of the iteration; curves: SPMI, SPMI-sup, SPMI-greedy, SPMI-alt).


Figure 13: Several statistics of different update strategies in the case of the Racetrack Simulator on T2, considering vehicle stability and engine setting (panels: expected return, bound, α, β, policy advantage, model advantage, policy distance, model distance, policy dissimilarity, model dissimilarity, policy changes, model changes, and the four model coefficients hs nb, ls nb, hs b, ls b, as a function of the iteration; curves: SPMI, SPMI-sup, SPMI-alt).

Figure 14: Expected return of the Racetrack Simulator on T2, T3, T4 for different update strategies, considering the vehicle stability configuration only (curves: SPMI, SPMI-sup, SPMI-alt).
