
politecnico di milano

Facoltà di Ingegneria

Scuola di Ingegneria Industriale e dell’Informazione

Dipartimento di Elettronica, Informazione e Bioingegneria

Master of Science in

Computer Science and Engineering

Adversarial Imitation Learning under Covariate Shift

Supervisor:

marcello restelli

Co-Supervisor (University of Illinois at Chicago):

brian d. ziebart

Master Graduation Thesis by:

andrea tirinzoni

Student Id n. 849827

Academic Year 2016-2017


To my dear friend Pino


A C K N O W L E D G M E N T S

I would like to thank my advisor, Professor Marcello Restelli, for his help in the realization of this thesis and his precious suggestions. I would also like to thank my advisor at the University of Illinois at Chicago, Professor Brian Ziebart, for giving me the opportunity to work on this project and supporting me during its development.

I am grateful to all my friends for their continuous support during this long journey. Also, many thanks to all who shared the last months in Chicago with me.

Last but not least, my sincere gratitude goes to my parents. Without their economic and moral support, I would not be writing this document.


C O N T E N T S

Abstract
1 introduction
  1.1 Problem Description
  1.2 Contribution
  1.3 Document Outline
2 background
  2.1 Markov Decision Processes
  2.2 Partially Observable Markov Decision Processes
    2.2.1 Point-based Value Iteration
  2.3 Directed Information Theory
  2.4 Imitation Learning
    2.4.1 Behavioral Cloning
    2.4.2 Inverse Reinforcement Learning
3 related work
  3.1 Feature Matching
  3.2 Maximum Causal Entropy IRL
  3.3 Maximum Margin Planning
  3.4 Adversarial Inverse Optimal Control
4 adversarial formulation
  4.1 Problem Definition
  4.2 Formulation
  4.3 Learning Algorithm
    4.3.1 Gradient Estimation
    4.3.2 Gradient Descent
5 multiple-mdp optimization
  5.1 Problem Definition
  5.2 Approximate Dynamic Programming
    5.2.1 Dynamic Program Properties
  5.3 Modified Point-Based Value Iteration
    5.3.1 Modified Value Backup
    5.3.2 Modified Belief Expansion
  5.4 Application to Adversarial Imitation Learning
6 experiments
  6.1 Gridworld
  6.2 Taxi Problem
7 conclusions
bibliography


L I S T O F F I G U R E S

Figure 6.1  A 10x10 gridworld with 2x2 blocks. White states have high reward, while black states have low reward.

Figure 6.2  The state distribution induced by the expert's policy under (a) test dynamics, and (b) train dynamics. White denotes high probability, black denotes low probability, and red circles denote absorbing cells.

Figure 6.3  Mean test loss for different numbers of trajectories in the case without covariate shift.

Figure 6.4  Mean test loss for different numbers of trajectories in the covariate shift case.

Figure 6.5  The original taxi environment (a) and the modified one we use to generate demonstrations (b).

Figure 6.6  Mean test loss (a) and mean test reward (b) for different numbers of trajectories in the case without covariate shift.

Figure 6.7  Mean test loss (a) and mean test reward (b) for different numbers of trajectories in the covariate shift case.


A C R O N Y M S

IRL Inverse Reinforcement Learning

IOC Inverse Optimal Control

RL Reinforcement Learning

BC Behavioral Cloning

MDP Markov Decision Process

POMDP Partially Observable Markov Decision Process

PWLC Piece-Wise Linear and Convex

PBVI Point-Based Value Iteration

AIL Adversarial Imitation Learning

AIOC Adversarial Inverse Optimal Control

ME Maximum Entropy Inverse Reinforcement Learning

FM Feature Matching


A B S T R A C T

Imitation learning, the problem of estimating a policy that reproduces demonstrated behavior, has become an essential methodology for training intelligent agents to solve complex tasks. A powerful approach to solving imitation problems is inverse reinforcement learning, which attempts to rationalize given trajectories by recovering the unknown reward function being optimized by the expert. Most of the research in this field has focused on estimating a policy capable of imitating the demonstrator under the unknown state distribution of the latter, given samples distributed in the same way. In this work, we analyze the case where there is a shift between these two distributions. We propose an adversarial formulation, based on inverse reinforcement learning, that is able to produce a single deterministic policy minimizing a general loss function with respect to the unknown expert's policy. We prove that covariate shift leads to an NP-hard optimization sub-problem, the computation of a deterministic policy maximizing the total expected reward from two different Markov decision processes. We propose a tractable approximation by reducing the latter to the optimal control of partially observable Markov decision processes. We evaluate the performance of our approach on two common reinforcement learning benchmarks and show its advantages over other state-of-the-art algorithms.


S O M M A R I O

Imitation learning, the problem of estimating a policy that reproduces a demonstrated behavior, has become an essential method for training intelligent agents to solve complex tasks. A powerful approach to imitation problems is inverse reinforcement learning, which tries to rationalize the given trajectories by recovering the unknown reward function optimized by the expert. Most of the research in this field has concentrated on estimating a policy capable of imitating the demonstrator on the latter's unknown state distribution, given samples distributed in the same way. In this work, we analyze the case in which there is a difference between these two distributions. We propose a formulation, based on inverse reinforcement learning, able to produce a single deterministic policy that minimizes a general loss function with respect to the expert's unknown policy. We prove that the difference between the distributions generates an NP-hard optimization problem, namely the computation of a deterministic policy that maximizes the total reward from two different decision processes. We propose a tractable approximation by reducing the latter to the optimal control of partially observable decision processes. We evaluate the performance of our approach on two common reinforcement learning problems and show its advantages over other state-of-the-art algorithms.


1 I N T R O D U C T I O N

Imitation learning, also known as learning from demonstrations [1], is the problem of estimating a control policy capable of performing a task demonstrated by an expert. When building intelligent agents using Reinforcement Learning (RL) [2], it is often difficult to design a reward function that makes it possible to learn complex tasks. On the other hand, it is much easier to directly show how to perform the task and let the agent infer the best policy from such demonstrations. This is the main motivation behind imitation learning. Consider, for example, the problem of learning how to autonomously drive a car [3, 4]. A driver typically takes into consideration several variables in order to decide what is the best thing to do, such as the speed limit on the current road, the presence of other cars or pedestrians, and so on. Intuitively, a reward function that allows the agent to learn how to drive should somehow weigh all these variables, but finding the right weights so as to obtain the desired behavior is very difficult. In practice, this leads to a trial-and-error approach to tuning the reward function that might take a long time. Instead, it seems much wiser to have a person show the driving task and make the agent learn from such demonstrations.

problem description

The two main approaches to imitation learning are Behavioral Cloning (BC) [5] and Inverse Reinforcement Learning (IRL) [3, 6, 7]. The former uses supervised techniques to directly infer a policy from the data. Since sequential prediction problems violate the i.i.d. assumption made by supervised algorithms, BC typically does not generalize well and requires a large number of samples. IRL, on the other hand, poses the problem in the Markov Decision Process (MDP) formalism and tries to recover the unknown reward function being optimized by the expert. Then, the best imitation policy is computed using RL and the recovered reward function. As a consequence of these more general assumptions, IRL generally performs much better than BC and has been successfully applied to a variety of real-world problems [3, 8].

However, IRL approaches still suffer from some limitations. To the best of our knowledge, no state-of-the-art algorithm handles the case where there is a shift between the demonstrator's state distribution and that of the provided data. Such a situation arises rather frequently in practice. Consider, for example, that the environmental dynamics under which demonstrations are generated might be noisy and, thus, slightly different from the expected ones. Going back to the autonomous driving example, the human might provide demonstrations on a car he has never used before, or in a place he has never seen. The expert still knows how to drive properly, but he might make sub-optimal decisions due to the fact that the environment is slightly different from the one he is used to. If we simply ignore this shift, IRL algorithms are likely to consider such decisions as optimal, and the inferred policy will be a poor imitator of the expert. Furthermore, the typical approach used in machine learning to handle covariate shift [9, 10], i.e., importance sampling, is not easily applicable in our case due to the fact that neither of the two distributions is known (the expert's policy is not given). These examples motivate us to explicitly deal with this problem.

We suppose the expert's policy is optimal under some given dynamics (which we call test dynamics), but we only have access to a finite number of demonstrations generated under the same policy but different known dynamics (which we call train dynamics). Our goal is to imitate the expert's policy under the test dynamics. Building on top of the framework described in [11], we propose a new formulation to solve the imitation learning under covariate shift problem. Our formulation is a zero-sum game where the learner tries to minimize an imitation loss under the test dynamics, while the adversary tries to maximize such loss while being constrained to match the feature counts derived from the data under the train dynamics. As in [11], our algorithm does not make any particular assumption on the loss function (as long as it decomposes over states). This constitutes a significant improvement over other imitation learning algorithms, which typically require convexity. We show that the introduction of covariate shift into [11] leads to an NP-hard optimization problem, that is, the computation of a deterministic policy maximizing the total expected reward from two different MDPs. We propose a tractable approximation by reducing the latter to the optimal control of partially observable Markov decision processes [12, 13].

contribution

Our contribution is two-fold. First, we propose an approximate solution to the NP-hard problem of computing the optimal deterministic policy maximizing the total reward from different processes. This is a general problem, not necessarily related only to IRL, whose solution can be re-used in many contexts. To this end, we analyze and solve it from the most general perspective possible, so as to ease its re-usability. Then, we adopt this result to solve our imitation learning under covariate shift problem.


document outline

The rest of this document is organized as follows. In Chapter 2, we provide the mathematical background needed to understand this work. Focus is given to MDPs, the mathematical model that specifies our decision-making settings, POMDPs, which we adopt to approximate the NP-hard sub-problem of our formulation, and causally conditioned probability distributions, which compactly represent our stochastic processes. Furthermore, we provide a deeper description of imitation learning, behavioral cloning, and inverse reinforcement learning. In Chapter 3, we introduce the most important state-of-the-art IRL algorithms (feature matching, maximum entropy, and maximum margin planning), which we use for comparison. We also detail the adversarial approach we extend to derive our formulation. Chapter 4 formally defines the problem we are trying to solve, presents our adversarial formulation, and derives our learning algorithm. In Chapter 5, we analyze the NP-hard problem of computing the optimal policy for different MDPs. We propose a modified POMDP algorithm, analyze its performance, and specify its application to our formulation. In Chapter 6, we demonstrate the benefits of our approach on two common reinforcement learning benchmarks, gridworld and the taxi problem [14], by comparing it to other state-of-the-art algorithms. Finally, in Chapter 7 we draw some conclusions on this work.


2 B A C K G R O U N D

This chapter introduces the concepts that are used extensively throughout this document. Section 2.1 starts by describing MDPs, the mathematical tool to model environments where reinforcement learning and inverse reinforcement learning are applied. Then, Section 2.2 provides an overview of POMDPs, which we employ in Chapter 5 to solve an important sub-problem of our formulation. Since these two topics are well known and well described in the literature, we simply provide a quick overview, while referring the reader to other sources for more detailed explanations. Section 2.3 quickly introduces directed information theory and causally conditioned probability distributions, which allow for a more concise representation of our stochastic processes. Finally, Section 2.4 describes the problem of imitation learning, focusing on the two main methodologies to solve it: behavioral cloning and inverse reinforcement learning.

markov decision processes

This section provides a quick introduction to MDPs. Since in this document we restrict ourselves to finite-state, finite-horizon MDPs, we focus on that case. A description of the more general settings can be found in [15] and [2].

MDPs are discrete-time stochastic control processes used to model complex decision-making problems under uncertainty. At each time step of the process, the agent is in some state, takes some action, and observes a reward for taking that action in that particular state. The agent's task is to find the sequence of actions that maximizes the long-term total reward. Definition 2.1 formalizes this concept.

Definition 2.1. An MDP is a tuple ⟨S, A, τ, R, p_0⟩, where:

• S is a finite set of states;

• A is a finite set of actions;

• τ are the state transition dynamics, where τ(s_{t+1} | s_t, a_t) is the probability of the next state being s_{t+1} given that the agent executes action a_t in state s_t;


• R is a reward function, where R(s_t) specifies the reward that the agent receives for entering state s_t¹;

• p_0 is a probability distribution over initial states, where p_0(s) is the probability that s is the state where the process starts.

A control policy π(a_t | s_t) is a probability distribution over actions given states. The process starts at time t = 1, where the first state is drawn from the initial distribution p_0. Then, the agent selects the first action a_1 according to its policy π(a_1 | s_1) and makes a transition to another state as specified by the dynamics τ(s_2 | s_1, a_1). The agent then selects the second action, and so on until the process ends in state s_T. The goal is to find the policy π⋆ that maximizes the sum of expected rewards, that is:

$$\pi^{\star} = \arg\max_{\pi}\; \mathbb{E}\Big[\sum_{t=1}^{T} R(S_t) \,\Big|\, \tau, \pi\Big] \tag{2.1}$$

It is possible to prove that the knowledge of the current state is sufficient for acting optimally in the future; that is, knowing the past history of state-action sequences does not add any further information. A process where this is the case is said to be Markovian. Furthermore, the optimal policy in an MDP is always deterministic, i.e., it can be specified as a mapping π⋆(s_t) from states to actions returning the best action a_t for each state s_t.

For a particular policy π, the state value function represents the total expected reward that is achieved from each state s_t by following π:

$$V^{\pi}(s_t) = R(s_t) + \mathbb{E}\Big[\sum_{i=t+1}^{T} R(S_i) \,\Big|\, \tau, \pi\Big] \tag{2.2}$$

while the state-action value function represents the total expected reward that is achieved by executing action a_t from state s_t and then following π:

$$Q^{\pi}(s_t, a_t) = R(s_t) + \mathbb{E}_{s_{t+1} \sim \tau(\cdot \mid s_t, a_t)}\Big[\sum_{i=t+1}^{T} R(S_i) \,\Big|\, \tau, \pi\Big] \tag{2.3}$$

These two quantities are related by the so-called Bellman expectation equations [16], as specified in Theorem 2.1.

¹ Notice that the reward is usually specified as a function R(s, a) of state and action, but in this document we consider it as a function of the state alone. The extension is trivial.


Theorem 2.1. Let M = ⟨S, A, τ, R, p_0⟩ be an MDP and π(a_t | s_t) be a policy. Then we can compute the state value function V^π(s_t) and the state-action value function Q^π(s_t, a_t) for π by solving the following dynamic program:

$$\begin{aligned}
V^{\pi}(s_T) &= R(s_T) \\
V^{\pi}(s_t) &= \sum_{a_t} \pi(a_t \mid s_t)\, Q^{\pi}(s_t, a_t) &\quad& \forall t = T-1, \ldots, 1 \\
Q^{\pi}(s_t, a_t) &= R(s_t) + \sum_{s_{t+1}} \tau(s_{t+1} \mid s_t, a_t)\, V^{\pi}(s_{t+1}) &\quad& \forall t = T-1, \ldots, 1
\end{aligned} \tag{2.4}$$

The following theorem states Bellman's optimality equations [16], which represent one of the most important results in this field and allow the computation of the optimal policy π⋆ by means of dynamic programming.

Theorem 2.2. Let M = ⟨S, A, τ, R, p_0⟩ be an MDP. Then we can compute the optimal policy π⋆(s_t) for M by solving the following dynamic program:

$$\begin{aligned}
V^{\pi^{\star}}(s_T) &= R(s_T) \\
V^{\pi^{\star}}(s_t) &= \max_{a_t} Q^{\pi^{\star}}(s_t, a_t) &\quad& \forall t = T-1, \ldots, 1 \\
\pi^{\star}(s_t) &= \arg\max_{a_t} Q^{\pi^{\star}}(s_t, a_t) &\quad& \forall t = T-1, \ldots, 1
\end{aligned} \tag{2.5}$$

where Q^{π⋆} is computed as specified in 2.4.
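As a concrete illustration of Theorems 2.1 and 2.2, the backward induction can be written directly over arrays. The following Python sketch is only illustrative: the argument names and array layouts (tau[s, a, s2] for the dynamics, r[s] for the state reward) are assumptions of this sketch, not notation taken from the thesis.

import numpy as np

def solve_finite_horizon_mdp(tau, r, T):
    # tau[s, a, s2]: probability of reaching s2 when taking action a in state s
    # r[s]: reward for entering state s; T: horizon length
    S, A, _ = tau.shape
    V = np.zeros((T, S))
    policy = np.zeros((T, S), dtype=int)
    V[T - 1] = r                                  # V(s_T) = R(s_T)
    for t in range(T - 2, -1, -1):
        # Q(s_t, a_t) = R(s_t) + sum_{s_{t+1}} tau(s_{t+1} | s_t, a_t) V(s_{t+1})
        Q = r[:, None] + np.einsum('sap,p->sa', tau, V[t + 1])
        policy[t] = Q.argmax(axis=1)              # pi*(s_t) = argmax_{a_t} Q(s_t, a_t)
        V[t] = Q.max(axis=1)                      # V(s_t) = max_{a_t} Q(s_t, a_t)
    return policy, V

Replacing the max and argmax with a policy-weighted sum over actions gives the evaluation program of Theorem 2.1.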

partially observable markov decision processes

This section quickly introduces POMDPs, focusing on the properties we use in this work. For a more detailed description, we refer the reader to [17].

POMDPs [12, 13] provide an extension of MDPs to partially observable environments, i.e., those where the agent cannot directly observe the state but only receives partial information about it. Definition 2.2 formalizes this new model.

Definition 2.2. A Partially Observable Markov Decision Process (POMDP) is a tuple ⟨S, A, Ω, τ, O, R, b_0, γ⟩, where:

• S is a finite set of states;

• A is a finite set of actions;

• Ω is a finite set of observations, where each observation gives partial information about the state the agent is in;


• τ are the state transition dynamics, where τ(s_{t+1} | s_t, a_t) is the probability of the next state being s_{t+1} given that we execute action a_t from state s_t;

• O are the conditional observation probabilities, where O(o_{t+1} | s_{t+1}, a_t) is the probability of observing o_{t+1} after taking action a_t and transitioning to state s_{t+1};

• R is the reward function, where R(s_t, a_t) specifies the utility the agent obtains after taking action a_t from state s_t;

• b_0 is the initial state probability distribution (also called initial belief state);

• γ ∈ [0, 1] is the discount factor, used to discount future rewards in infinite-horizon processes.

The process starts at time t = 1, where the first state s_1 is drawn from b_0. However, the agent cannot directly observe this state but receives an observation o_1. Then, the agent selects an action a_1, transitions to the unobserved state s_2 according to τ(s_2 | s_1, a_1), receives an observation o_2 according to O(o_2 | s_2, a_1), and finally obtains a reward R(s_1, a_1). Then, the process is repeated (forever in the infinite-horizon case). The agent's goal is to select the sequence of actions maximizing the total (discounted) reward over time.

In order to solve the problem, the agent keeps, at each time step, a distribution over states S called the belief state. A belief state is an |S|-dimensional vector where each component represents the probability that the agent is in that particular state. Then, the partially observable MDP is reduced to a continuous fully observable MDP where the state space is B, the space of all belief states (if we have |S| states, B is the (|S|−1)-dimensional simplex). In such an MDP, the state-transition dynamics are:

$$\tau(b_{t+1} \mid b_t, a_t) = \sum_{o \in \Omega} \Pr(b_{t+1} \mid b_t, a_t, o)\, \Pr(o \mid a_t, b_t) \tag{2.6}$$

and the reward function is:

$$R(b_t, a_t) = \sum_{s \in S} R(s, a_t)\, b_t(s) \tag{2.7}$$

According to the Bellman optimality equations, the value function for the optimal policy of the fully observable MDP can be written as:

$$V_t(b_t) = \max_{a_t \in A} \Big[ R(b_t, a_t) + \gamma \sum_{b_{t+1} \in B} \tau(b_{t+1} \mid b_t, a_t)\, V_{t+1}(b_{t+1}) \Big] \tag{2.8}$$


This is a function of a continuous variable and cannot be computed by standard value iteration [2]. However, it turns out that this function is piece-wise linear and convex in b [12] and can be written as:

$$V_t(b_t) = \max_{\alpha \in V_t} \; b_t^{\intercal} \alpha \tag{2.9}$$

where V_t is a set of vectors representing the normal directions to the hyperplanes describing the function (we call them alpha-vectors).

This formulation allows for efficient algorithms to compute the value function (and the associated optimal policy). The most common approach, and the one we adopt in this document, is Point-Based Value Iteration (PBVI), which we describe in the next section.

Point-based Value Iteration

PBVI [18] approximates the optimal value function of a POMDP by considering only a restricted set of representative belief states B and computing one alpha-vector for each of them. Then, the value function is represented as the set V containing all alpha-vectors learned by the algorithm. The motivation behind PBVI is that most alpha-vectors in the full set exactly describing V are dominated by some others. This means that, when taking the maximum over actions, such vectors are never used and can be safely pruned. Since the total number of alpha-vectors grows exponentially with the number of actions and observations, pruning is necessary to make the problem tractable. In PBVI, this is done implicitly by computing the dominating alpha-vector at each belief in B. Thus, no dominated alpha-vector is ever added to V.

The algorithm proceeds iteratively by alternating two main procedures:

• value backup: given the current belief set B and the current value function V, computes an update V′ of V by applying some backup equations similar to the Bellman optimality equations (a modified version of 2.8);

• belief expansion: adds new belief states to the current belief set B. For each belief in B, this is done by taking each action and simulating a step forward, thus leading to a new belief state. Then, only the farthest belief from the current set B is retained.

These two procedures are then repeated for a specified number of iterations. The more iterations are executed, the better the value function approximation (i.e., the number of alpha-vectors that are computed grows with the number of iterations).
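To make the value backup step concrete, the following Python sketch performs one point-based backup at a single belief, in the standard form used by PBVI; the array layouts (tau[s, a, s2] for the dynamics, O[s2, a, o] for the observation probabilities, R[s, a] for the reward) and the function name are assumptions of this sketch, not the exact procedure of [18].

import numpy as np

def pbvi_backup(b, alphas, tau, O, R, gamma):
    # b: belief over states, shape (S,); alphas: list of alpha-vectors, each shape (S,)
    S, A, _ = tau.shape
    n_obs = O.shape[2]
    best_alpha, best_value = None, -np.inf
    for a in range(A):
        alpha_a = R[:, a].astype(float).copy()          # immediate reward of action a
        for o in range(n_obs):
            # gamma * sum_{s2} tau(s2 | s, a) O(o | s2, a) alpha(s2), for every alpha
            candidates = [gamma * tau[:, a, :].dot(O[:, a, o] * alpha) for alpha in alphas]
            # keep only the candidate that dominates at this particular belief
            alpha_a += max(candidates, key=lambda g: g.dot(b))
        if alpha_a.dot(b) > best_value:
            best_alpha, best_value = alpha_a, alpha_a.dot(b)
    return best_alpha                                    # alpha-vector maximal at b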


directed information theory

In this document, we make use of causally conditioned probability distributions to represent our stochastic processes. Such distributions arise naturally from directed information theory [19], which associates a direction with the flow of information in a system.

Given two random vectors Y_{1:T} and X_{1:T}, the causally conditioned probability distribution of Y given X is:

$$p(Y_{1:T} \,\|\, X_{1:T}) = \prod_{t=1}^{T} p(Y_t \mid X_{1:t}, Y_{1:t-1}) \tag{2.10}$$

Notice the difference with respect to the conditional probability of Y given X: in the causally conditioned case, only limited knowledge about the conditioning variable X is available at each time step (X_{1:t} rather than the whole X_{1:T}):

$$p(Y_{1:T} \mid X_{1:T}) = \prod_{t=1}^{T} p(Y_t \mid X_{1:T}, Y_{1:t-1}) \tag{2.11}$$

Using this notation, we can compactly represent our processes as:

$$\tau(S_{1:T} \,\|\, A_{1:T-1}) = \prod_{t=1}^{T} p(S_t \mid S_{1:t-1}, A_{1:t-1}) \overset{(M)}{=} \prod_{t=1}^{T} p(S_t \mid S_{t-1}, A_{t-1}) \tag{2.12}$$

$$\pi(A_{1:T-1} \,\|\, S_{1:T-1}) = \prod_{t=1}^{T-1} p(A_t \mid S_{1:t}, A_{1:t-1}) \overset{(M)}{=} \prod_{t=1}^{T-1} p(A_t \mid S_t) \tag{2.13}$$

where the last equalities (M) hold only under the Markovian assumption. The product of these two quantities represents the joint distribution of state-action sequences:

$$p(S_{1:T}, A_{1:T-1}) = \tau(S_{1:T} \,\|\, A_{1:T-1})\, \pi(A_{1:T-1} \,\|\, S_{1:T-1}) \tag{2.14}$$

and can be used to concisely write the total expected reward as:

$$\mathbb{E}\Big[\sum_{t=1}^{T} R(S_t) \,\Big|\, \tau, \pi\Big] = \sum_{S_{1:T},\, A_{1:T-1}} p(S_{1:T}, A_{1:T-1})\, R(S_{1:T}) \tag{2.15}$$

where R(S1:T ) is the sum of rewards received along the state sequence S1:T .
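For very small state and action spaces, the factorization of Equation 2.14 and the expectation of Equation 2.15 can be checked by direct enumeration. The Python sketch below assumes array layouts tau[s, a, s2], pi[s, a], p0[s] and r[s]; it is a brute-force illustration, not an algorithm used in this work.

import itertools
import numpy as np

def trajectory_probability(states, actions, tau, pi, p0):
    # Markovian case of Equation 2.14: product of one-step policy and dynamics terms
    prob = p0[states[0]]
    for t in range(len(actions)):
        prob *= pi[states[t], actions[t]] * tau[states[t], actions[t], states[t + 1]]
    return prob

def expected_total_reward(tau, pi, p0, r, T):
    # Equation 2.15 by enumeration (exponential in T, usable only for tiny problems)
    S, A, _ = tau.shape
    total = 0.0
    for states in itertools.product(range(S), repeat=T):
        for actions in itertools.product(range(A), repeat=T - 1):
            total += trajectory_probability(states, actions, tau, pi, p0) * sum(r[s] for s in states)
    return total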


imitation learning

In imitation learning, an agent tries to mimic another agent's behavior. The imitating agent is usually called the "learner", while the imitated one is referred to as the "expert" or "demonstrator". The environment for this context is generally modeled as a set of states in which agents can be located, and a set of actions that they can take to perform state transitions. Then, the problem of imitation learning reduces to finding a policy, that is, a mapping from states to actions (or a probability distribution over actions given states), that best approximates the expert's policy. The latter is known only through demonstrations, i.e., the expert shows how to behave (by taking actions in different states) and the learner has to find the policy that best reproduces the demonstrated state-action sequences.

The two most common approaches to imitation learning are behavioral cloning and inverse reinforcement learning, which are described in the next two sections.

Behavioral Cloning

BC [5] is the most common approach to imitation learning. The main problem is reduced to a supervised learning one, where a mapping from states, or features of such states, to actions is learned by adopting classification techniques. This method has been successfully adopted in several different fields. Robotics is the most common, where excellent results have been achieved in a wide variety of tasks. Among the notable examples, Pomerleau [20] implemented an autonomous land vehicle that learns to follow a road given demonstrated trajectories (from camera images) and using a neural network. LeCun et al. [4] proposed a system for learning obstacle avoidance from human demonstrations (again in the form of images) and using a convolutional neural network. Sammut et al. [21] designed an algorithm to learn how to fly an aircraft given human demonstrations from simulation software and using decision trees.

Although behavioral cloning has proven to perform very well on some specific tasks, it provides poor results when the goal is to maximize a long-term utility. The main issue with behavioral cloning (and with the supervised techniques adopted) is that samples (typically state-action pairs) are supposed to be independent. This is clearly not the case when sequential decision making is demonstrated. For example, the state the agent is in depends on the previous state and action. Thus, when the demonstrator's behavior aims at maximizing a long-term reward, behavioral cloning algorithms tend to generalize poorly and fail at reproducing the optimal performance. Furthermore, they usually require a large number of samples to learn good imitation policies.


Inverse Reinforcement Learning

The idea for tackling imitation learning problems in which the demonstrator is showing sequential decision-making behavior is to formalize them as MDPs. In this context, the expert is supposed to be optimizing an unknown long-term reward function, and imitation learning reduces to estimating this function. This problem is known as IRL or Inverse Optimal Control (IOC) [3, 6, 7].

In optimal control and RL [2], the agent is given a model of the environment and a reward function, and must generate optimal behavior by finding a policy that maximizes the long-term reward². In IRL, on the other hand, the agent is given trajectories showing the expert's (optimal) policy together with a model of the environment, and must recover the reward function being optimized.

Unlike behavioral cloning, IRL attempts to rationalize demonstrated sequential decisions by estimating the utility function that is being maximized by the expert. Since the whole field of reinforcement learning [2] is based on the idea that "the reward function is the most succinct, robust, and transferable definition of the task" [7], its recovery seems wiser than directly learning a mapping from states to actions. The estimated reward function can subsequently be used to learn the best control policy via classic RL.

We can formalize the IRL problem as follows [6]. Given:

• a model of the environment the expert is acting in (i.e., the state-transition dynamics τ), and

• a set of trajectories ζ_i, where each trajectory is a sequence (s_1, a_1, s_2, ...) of states and actions generated by executing the expert's (optimal) policy π⋆ under τ,

estimate the reward function R⋆ being optimized by the expert. More specifically, the problem is reduced to estimating a reward function that makes the demonstrated behavior optimal (i.e., rationalizes such behavior). Formally, the estimated reward function R must satisfy:

$$\mathbb{E}\Big[\sum_{t=1}^{T} R(S_t) \,\Big|\, \tau, \pi^{\star}\Big] \geq \mathbb{E}\Big[\sum_{t=1}^{T} R(S_t) \,\Big|\, \tau, \pi\Big] \quad \forall \pi \tag{2.16}$$

However, this formulation has several challenges. First, it constitutes an ill-posed problem, since it is easy to prove that there exist many solutions (actually, infinitely many) [7]. As an example, a constant reward (e.g., a reward that is always zero) makes every policy optimal. Nevertheless, it is very unlikely that such a function matches the one that is sought. Second, it is not possible to explicitly compute the left-hand side, since the expert's policy π⋆ is not given but is demonstrated from sample trajectories. Furthermore, this formulation makes the assumption that the expert is optimal (i.e., π⋆ is optimal). When this is not the case, the problem becomes infeasible. Last, we can solve the inequalities only by enumerating all possible policies, which is computationally impractical.

² To be precise, in RL the agent is not explicitly given these two quantities but has the ability to take actions in the environment and observe the resulting state and reward.

Many algorithms have been proposed to tackle such difficulties. We defer the description of the most important ones to the next chapter.

Page 22: Adversarial Imitation Learning under Covariate Shift · 2017-10-27 · politecnico di milano Facoltà di Ingegneria Scuola di Ingegneria Industriale e dell’Informazione Dipartimento

3 R E L A T E D   W O R K

In Chapter 2, we introduced and formally defined IRL, one of the main approaches to imitation learning, while describing its main challenges. Since our work is based on IRL, in this chapter we present the state-of-the-art algorithms which we use for comparison, focusing on how they tackle such complications. Section 3.1 describes feature matching, one of the first IRL algorithms, whose underlying assumptions represent the foundations of many successive works. Then, Section 3.2 describes maximum entropy IRL, which provides a principled way of estimating the demonstrated policy, while Section 3.3 presents maximum margin planning, which casts the main problem into a maximum margin one. Finally, Section 3.4 introduces the adversarial approach that we extend to build our framework.

feature matching

Abbeel and Ng [3] represent rewards as linear combinations of given features of the states:

R(s) = wᵀφ(s) (3.1)

Given a policy π, they define the feature expectations of π as:

$$\mu(\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} \phi(s_t) \,\Big|\, \pi\Big] \tag{3.2}$$

Their main intuition is that, if the feature expectations of a policy π match those of the expert's policy π⋆:

$$\|\mu(\pi) - \mu(\pi^{\star})\| \leq \varepsilon \tag{3.3}$$

then π is guaranteed to perform as well as π⋆ for all rewards with ‖w‖_1 ≤ 1. They propose an iterative procedure that looks for a policy satisfying the condition of Equation 3.3. The algorithm keeps a set of policies π^(i) together with the corresponding feature expectations μ^(i). At each iteration, the following quadratic program is solved to find the weights w that maximally separate the expert's feature expectations μ_E from the ones in the above-mentioned set:

$$\max_{w : \|w\|_2 \leq 1} \; \min_{i} \; w^{\intercal}\big(\mu_E - \mu^{(i)}\big) \tag{3.4}$$


Notice that this is equivalent to finding a maximum margin hyperplane separating μ_E from all μ^(i). Then, the optimal policy for the new weights is computed together with its feature expectations, and the algorithm is iterated until the two sets are separated by a margin less than ε. Finally, the output is one of the learned policies, if the demonstrator is optimal, or a mixture of some of them, if the demonstrator is sub-optimal.

Although this algorithm always achieves the feature matching property (which provides a way to solve the degeneracy problem of reward functions), it is not guaranteed to recover a reward that is similar to the true one.
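A basic building block of this procedure is computing the feature expectations μ(π) of Equation 3.2 for a candidate policy. The Python sketch below does so by propagating the state distribution over a truncated horizon; the array layouts (tau[s, a, s2], pi[s, a], p0[s], phi[s, k]) and the truncation are assumptions of this sketch, not part of the algorithm of [3].

import numpy as np

def feature_expectations(tau, pi, p0, phi, gamma, T):
    S, A, _ = tau.shape
    # state-to-state transition matrix induced by the (stochastic) policy
    P = np.einsum('sa,sap->sp', pi, tau)
    d = p0.copy()                        # state distribution at the current time step
    mu = np.zeros(phi.shape[1])
    for t in range(T):
        mu += (gamma ** t) * d.dot(phi)  # accumulate discounted expected features
        d = d.dot(P)                     # propagate the distribution one step forward
    return mu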

maximum causal entropy irl

While Abbeel and Ng's algorithm [3] solves the degeneracy problem by matching the empirical sum of features, which leads to a (mixture) policy that achieves a performance very close to that of the demonstrator even without recovering the true reward function, it introduces another ambiguity: many policies, or mixtures of policies, match the feature expectations. No principled way to choose among them is proposed by the authors.

Ziebart et al. [8, 22] solve the above-mentioned ambiguity by employing the principle of maximum causal entropy. Their formulation seeks a probability distribution over actions given states that maximizes the causal entropy while matching the feature expectations:

$$\begin{aligned}
&\arg\max_{\pi} \; H(A_{1:T} \,\|\, S_{1:T}) \\
&\text{such that } \mu(\pi) = \mu_E
\end{aligned} \tag{3.5}$$

where H(A_{1:T} ‖ S_{1:T}) = E[− log π(A_{1:T} ‖ S_{1:T})] denotes the causal entropy of the distribution π [23]. The authors prove that solving this optimization problem reduces to minimizing the worst-case prediction log-loss and yields a stochastic policy of the form:

$$\pi(a_t \mid s_t) = e^{\,Q(s_t, a_t) - V(s_t)} \tag{3.6}$$

where:

$$Q(s_t, a_t) = w^{\intercal}\phi(s_t) + \mathbb{E}_{\tau(\cdot \mid s_t, a_t)}\big[V(S_{t+1})\big] \tag{3.7}$$

$$V(s_t) = \operatorname*{softmax}_{a_t} Q(s_t, a_t) \tag{3.8}$$

Page 24: Adversarial Imitation Learning under Covariate Shift · 2017-10-27 · politecnico di milano Facoltà di Ingegneria Scuola di Ingegneria Industriale e dell’Informazione Dipartimento

3.3 maximum margin planning 15

and the softmax function is defined as:

$$\operatorname*{softmax}_{x} f(x) = \log \sum_{x} e^{f(x)} \tag{3.9}$$

Then, the reward weights achieving such a probability distribution can be computed by adopting a convex optimization procedure. The authors show how the resulting algorithm allows for efficient inference and successfully apply it to the problem of modeling route preferences and inferring the destination given partial trajectories.
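The recursion of Equations 3.6-3.8 can be run with a soft version of value iteration. The following Python sketch computes the resulting stochastic policy over a finite horizon; the array layouts (tau[s, a, s2], phi[s, k]) and the choice of terminal value V(s_T) = w^T phi(s_T) are assumptions of this sketch.

import numpy as np
from scipy.special import logsumexp

def max_causal_ent_policy(tau, phi, w, T):
    S, A, _ = tau.shape
    r = phi.dot(w)                       # R(s) = w^T phi(s)
    V = r.copy()                         # terminal value (an assumed convention)
    pi = np.zeros((T - 1, S, A))
    for t in range(T - 2, -1, -1):
        Q = r[:, None] + np.einsum('sap,p->sa', tau, V)   # Equation 3.7
        V = logsumexp(Q, axis=1)                          # Equation 3.8 (softmax)
        pi[t] = np.exp(Q - V[:, None])                    # Equation 3.6
    return pi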

maximum margin planning

Ratliff et al. [24] propose a different approach to selecting a policy (and corresponding reward weights) that makes the expert better than all alternatives. They cast the problem into a maximum margin one, where the goal is to find a hyperplane separating the demonstrator's feature expectations from those of any other policy by a structured margin. The resulting formulation is the quadratic program:

$$\begin{aligned}
\min_{w} \;& \frac{1}{2}\|w\|^{2} + C\xi \\
\text{s.t.} \;& w^{\intercal}\mu(\pi^{\star}) \geq w^{\intercal}\mu(\pi) + L(\pi^{\star}, \pi) - \xi \quad \forall \pi
\end{aligned} \tag{3.10}$$

where L denotes a loss function comparing two policies and ξ is a slack variable used to soften the margin (whose effect is controlled by the constant C). The rationale behind the inclusion of a loss function into the maximum margin is that the latter should be larger for policies that are very different from the demonstrated one. The authors allow the usage of any loss function, as long as it can be factored into state-action pairs.

One drawback of the quadratic program of 3.10 is that it has a very large number of constraints. In order to make learning practical, the authors solve it using an approximate algorithm based on subgradient methods.

Unlike feature matching and maximum causal entropy IRL, this algorithm does not try to find a policy that achieves the same performance as the expert, but directly estimates the reward weights w. The policy used for imitation can subsequently be computed by optimizing over the learned reward function.
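In its simplest form, one subgradient step can be sketched as follows. The helper loss_augmented_feature_expectations is hypothetical: it stands in for a planner that returns the feature expectations of the policy that is optimal for the loss-augmented reward w^T φ(s) plus the state loss. The step follows the subgradient reg·w + μ(π_w) − μ_E of the regularized hinge objective; this is a sketch of the general scheme, not the exact algorithm of [24].

def mmp_subgradient_step(w, mu_E, loss_augmented_feature_expectations, lr, reg):
    # loss-augmented planning gives the most violated competitor policy
    mu_aug = loss_augmented_feature_expectations(w)
    grad = reg * w + mu_aug - mu_E      # subgradient of the regularized hinge objective
    return w - lr * grad                # one descent step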

adversarial inverse optimal control

Chen et al. [11] propose an adversarial framework for a more general imitation learning and IRL setting, that is, the case where learner and demonstrator are acting in different environments. They consider a demonstrator acting according to the (unknown) policy π̃ under (known) dynamics τ̃, and a learner acting under different (known) dynamics τ. Then, the main idea is to find a policy π minimizing a loss function comparing the learner's and demonstrator's state sequences. Formally:

$$\arg\min_{\pi} \; \mathbb{E}\Big[\sum_{t=1}^{T} \mathrm{loss}(S_t, \tilde{S}_t) \,\Big|\, \tau, \pi, \tilde{\tau}, \tilde{\pi}\Big] \tag{3.11}$$

However, the demonstrator's policy π̃ is not known and the expectation cannot be computed explicitly. In order to provide an estimate of such a policy, an adversary is introduced to find the policy π̌ that maximizes such imitative loss. To prevent the adversary from choosing very bad policies, they consider the set of all stochastic policies that match the empirical sum of features of the demonstrated trajectories [3] and allow π̌ to be picked only from that set. Thus, the idea to deal with the feature-matching ambiguity is, in this case, to choose the worst-case (i.e., loss-maximizing) policy from the restricted set. The final formulation is a zero-sum game between the learner, looking for the best (stochastic) policy to minimize the loss, and the adversary, trying to maximize such loss by selecting another (stochastic) policy in a constrained way:

$$\min_{\pi} \; \max_{\check{\pi} \in \Theta} \; \mathbb{E}\Big[\sum_{t=1}^{T} \mathrm{loss}(S_t, \check{S}_t) \,\Big|\, \tau, \pi, \tilde{\tau}, \check{\pi}\Big] \tag{3.12}$$

where Θ is the set of all policies matching the expert’s feature expectations:

$$\check{\pi} \in \Theta \;\Leftrightarrow\; \mathbb{E}\Big[\sum_{t=1}^{T} \phi(\check{S}_t) \,\Big|\, \tilde{\tau}, \check{\pi}\Big] = \mu_E \tag{3.13}$$

In order to solve this problem, they reduce the constrained zero-sum game to one that is parameterized by a vector of Lagrange multipliers and solve it using the double oracle method [25], while optimizing over such parameters using a simple convex optimization procedure [26].
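The double oracle method repeatedly solves the zero-sum game restricted to the deterministic policies generated so far, and the equilibrium of such a finite matrix game can be obtained with a small linear program. The Python sketch below shows this standard subroutine; the function name and interface are illustrative and are not taken from [11] or [25].

import numpy as np
from scipy.optimize import linprog

def solve_zero_sum_game(payoff):
    # payoff[i, j]: value gained by the column (maximizing) player when the row
    # (minimizing) player plays i and the column player plays j
    n_rows, n_cols = payoff.shape
    # variables: row mixed strategy p (n_rows entries) and the game value v;
    # minimize v subject to payoff^T p <= v, p >= 0, sum(p) = 1
    c = np.zeros(n_rows + 1)
    c[-1] = 1.0
    A_ub = np.hstack([payoff.T, -np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    A_eq = np.zeros((1, n_rows + 1))
    A_eq[0, :n_rows] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n_rows + [(None, None)])
    return res.x[:n_rows], res.x[-1]    # row player's mixed strategy and game value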

This algorithm provides several advantages over existing methods. First, any loss function can be used (as long as it can be decomposed over states). Second, it provides another principled way to solve the feature matching ambiguity: simply pick the worst-case policy. This allows the algorithm to generalize well to new situations. Finally, the algorithm is proven to be Fisher consistent, i.e., the learned policy minimizes the loss under the assumption that the feature set is sufficiently rich.


Notice, however, that this approach assumes the demonstrator's distribution matches the data distribution, which simplifies the estimation of the expert's policy. Thus, it is not suitable to deal with our imitation learning under covariate shift problem and needs to be adapted accordingly. We describe how to achieve this in the next chapter.


4 A D V E R S A R I A L   F O R M U L A T I O N

This chapter describes the adversarial formulation we adopt to solve the imitation learning under covariate shift problem. Section 4.1 formalizes this problem, while Section 4.2 details our approach. Finally, Section 4.3 describes our learning algorithm and points out its main challenges.

problem definition

We start by formalizing the imitation learning under covariate shift problem. Consider a finite set of states S, a finite set of actions A, and a probability distribution p_0(s) over initial states. Suppose the expert acts according to the unknown policy π̃(A_{1:T−1} ‖ S_{1:T−1}), which is optimal for the unknown reward function R⋆(s) under the known test dynamics τ(S_{1:T} ‖ A_{1:T−1}). Furthermore, suppose R⋆ can be written as a linear combination of given state features φ [3]:

$$R^{\star}(s) = w^{\star} \cdot \phi(s) \tag{4.1}$$

where φ and w⋆ are K-dimensional vectors, and · denotes the inner product operator. We are given a finite set of N trajectories under the known train dynamics τ̃(S_{1:T} ‖ A_{1:T−1}) and the expert's policy π̃, where each trajectory is a sequence:

$$\zeta_i = (s_1, a_1, s_2, \ldots, a_{T-1}, s_T \mid \tilde{\pi}, \tilde{\tau})_i \tag{4.2}$$

Then, our goal is to estimate a policy π minimizing the expected loss with respect to the expert's policy π̃ under the test dynamics τ. Formally:

$$\arg\min_{\pi} \; \mathbb{E}\Big[\sum_{t=1}^{T} \mathrm{loss}(S_t, \tilde{S}_t) \,\Big|\, \tau, \pi, \tilde{\pi}\Big] \tag{4.3}$$

Minimizing this loss requires knowledge of the unknown demonstrator's state distribution:

$$p(s_t) = \sum_{s_{t-1}} \sum_{a_{t-1}} \tau(s_t \mid s_{t-1}, a_{t-1})\, \tilde{\pi}(a_{t-1} \mid s_{t-1})\, p(s_{t-1}) \tag{4.4}$$

while we only have data under:

$$\tilde{p}(s_t) = \sum_{s_{t-1}} \sum_{a_{t-1}} \tilde{\tau}(s_t \mid s_{t-1}, a_{t-1})\, \tilde{\pi}(a_{t-1} \mid s_{t-1})\, \tilde{p}(s_{t-1}) \tag{4.5}$$


Notice that, although we supposed the demonstrator's policy to be optimal and the reward to be linear in given features, we do not require these conditions to hold, as we show in our experiments.

Of fundamental importance for our algorithm is the empirical sum of features observed on a certain trajectory, which allows us to estimate the expert's feature expectations under the train dynamics τ̃:

$$\mu_E = \frac{1}{N} \sum_{i=1}^{N} \sum_{s \in \zeta_i} \phi(s) \tag{4.6}$$
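The empirical estimate of Equation 4.6 is straightforward to compute from the demonstrated trajectories. In the Python sketch below, trajectories is a list of state-index sequences extracted from the ζ_i and phi[s, k] holds the state features; both layouts are assumptions of this sketch.

import numpy as np

def empirical_feature_expectations(trajectories, phi):
    mu = np.zeros(phi.shape[1])
    for states in trajectories:
        for s in states:                # sum the features of every visited state
            mu += phi[s]
    return mu / len(trajectories)       # average over the N demonstrated trajectories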

formulation

We now describe how to adapt the adversarial framework described in [11] to handle covariate shift. Our new formulation is a constrained zero-sum game where the learner chooses a stochastic policy π that minimizes a given loss under the test dynamics τ, whereas the adversary chooses a stochastic policy π̌ that maximizes such loss while matching the feature expectations under the train dynamics τ̃:

$$\min_{\pi} \; \max_{\check{\pi} \in \Theta} \; \mathbb{E}\Big[\sum_{t=1}^{T} \mathrm{loss}(S_t, \check{S}_t) \,\Big|\, \tau, \pi, \check{\pi}\Big] \tag{4.7}$$

where Θ is the set of all stochastic policies matching the expert's feature expectations (defined in Equation 4.6):

$$\check{\pi} \in \Theta \;\Leftrightarrow\; \mathbb{E}\Big[\sum_{t=1}^{T} \phi(\check{S}_t) \,\Big|\, \tilde{\tau}, \check{\pi}\Big] = \mu_E \tag{4.8}$$

The main idea of this new formulation is to have the adversary estimate the expert's policy by choosing the loss maximizer among the set of all policies matching the feature expectations under the train dynamics. Then, the learner chooses a policy minimizing the loss with respect to such estimate under the test dynamics. As proved in [11], when features are sufficiently rich, the constraint set contains only the expert's policy, and the learner achieves the minimum theoretical loss. Notice that the difference in embodiment considered in [11] is removed, and the new formulation deals with covariate shift instead.

As can be seen from this formulation, we do not require the true reward to be linear in given features, nor do we need to know such features. Such assumptions are only needed to provide guarantees on the final expected reward. As long as the chosen features constrain the adversary's choice well, we are sure the learned policy is a good imitator of the expert's.


As described in [11], we reduce the constrained zero-sum game of Equation 4.7 to an unconstrained zero-sum game parameterized by a vector of Lagrange multipliers. Theorem 4.1 formulates our new game.

Theorem 4.1. The constrained zero-sum game of Equation 4.7 can be reduced to the following optimization problem:

$$\min_{w} \, \min_{\pi} \, \max_{\check{\pi}} \; \mathbb{E}\Big[\sum_{t=1}^{T} \mathrm{loss}(S_t, \check{S}_t) \,\Big|\, \tau, \pi, \check{\pi}\Big] + w^{\intercal}\Big(\mathbb{E}\Big[\sum_{t=1}^{T} \phi(\check{S}_t) \,\Big|\, \tilde{\tau}, \check{\pi}\Big] - \mu_E\Big) \tag{4.9}$$

Proof. This result follows immediately from [11] and [27].

learning algorithm

We now present the learning algorithm we employ to solve the optimization problem of Equation 4.9. We consider the outermost minimization over Lagrange multipliers:

$$\min_{w} f(w) \tag{4.10}$$

where the function f is defined as:

$$f(w) = \min_{\pi} \, \max_{\check{\pi}} \; \mathbb{E}\Big[\sum_{t=1}^{T} \mathrm{loss}(S_t, \check{S}_t) \,\Big|\, \tau, \pi, \check{\pi}\Big] + w^{\intercal}\Big(\mathbb{E}\Big[\sum_{t=1}^{T} \phi(\check{S}_t) \,\Big|\, \tilde{\tau}, \check{\pi}\Big] - \mu_E\Big) + \frac{\lambda}{2}\|w\|^{2}$$

Notice that we add a regularization term to the objective function. This is motivated in [8] by the fact that a small amount of regularization (controlled by the parameter λ) helps in dealing with the uncertainty in the data.

Theorem 4.2 states a fundamental result that allows us to find a simple algorithm to solve the problem of Equation 4.10.

Theorem 4.2. The function f(w) of Equation 4.10 is convex.

Proof. We start by rewriting f in the more concise form:

$$f(w) = \min_{\pi} \, \max_{\check{\pi}} \; L(\pi, \check{\pi}) + w^{\intercal}\big(F(\check{\pi}) - \mu_E\big) + \frac{\lambda}{2}\|w\|^{2} \tag{4.11}$$

where L and F are, respectively:

$$L(\pi, \check{\pi}) = \mathbb{E}\Big[\sum_{t=1}^{T} \mathrm{loss}(S_t, \check{S}_t) \,\Big|\, \tau, \pi, \check{\pi}\Big] \tag{4.12}$$

$$F(\check{\pi}) = \mathbb{E}\Big[\sum_{t=1}^{T} \phi(\check{S}_t) \,\Big|\, \tilde{\tau}, \check{\pi}\Big] \tag{4.13}$$


In order to prove that f is convex, we need to show that [26]:

$$f(\theta w_1 + (1-\theta) w_2) \leq \theta f(w_1) + (1-\theta) f(w_2) \quad \forall \theta \in [0, 1],\; \forall w_1, \forall w_2 \tag{4.14}$$

Since the regularization term is a well-known convex function, we prove the above-mentioned property only for the remaining part of f. Then, convexity follows from the fact that the sum of convex functions is also convex [26]. By expanding Equation 4.14 we obtain:

$$\begin{aligned}
&\min_{\pi} \max_{\check{\pi}}\; L(\pi, \check{\pi}) + (\theta w_1 + (1-\theta) w_2)^{\intercal}\big(F(\check{\pi}) - \mu_E\big) \\
&\;= \min_{\pi} \max_{\check{\pi}}\; L(\pi, \check{\pi}) + (\theta w_1 + (1-\theta) w_2)^{\intercal} F(\check{\pi}) - (\theta w_1 + (1-\theta) w_2)^{\intercal} \mu_E \\
&\;\leq \theta\Big[\min_{\pi} \max_{\check{\pi}}\; L(\pi, \check{\pi}) + w_1^{\intercal}\big(F(\check{\pi}) - \mu_E\big)\Big] + (1-\theta)\Big[\min_{\pi} \max_{\check{\pi}}\; L(\pi, \check{\pi}) + w_2^{\intercal}\big(F(\check{\pi}) - \mu_E\big)\Big] \\
&\;= \min_{\pi} \max_{\check{\pi}}\; \theta L(\pi, \check{\pi}) + \theta w_1^{\intercal} F(\check{\pi}) - \theta w_1^{\intercal} \mu_E \;+\; \min_{\pi} \max_{\check{\pi}}\; (1-\theta) L(\pi, \check{\pi}) + (1-\theta) w_2^{\intercal} F(\check{\pi}) - (1-\theta) w_2^{\intercal} \mu_E \\
&\;= \min_{\pi} \max_{\check{\pi}}\; \theta L(\pi, \check{\pi}) + \theta w_1^{\intercal} F(\check{\pi}) \;+\; \min_{\pi} \max_{\check{\pi}}\; (1-\theta) L(\pi, \check{\pi}) + (1-\theta) w_2^{\intercal} F(\check{\pi}) \;-\; (\theta w_1 + (1-\theta) w_2)^{\intercal} \mu_E
\end{aligned}$$

Notice that the terms $(\theta w_1 + (1-\theta) w_2)^{\intercal} \mu_E$ cancel out from both sides. By applying the triangle inequality of the max function, we can now write:

$$\begin{aligned}
&\min_{\pi} \max_{\check{\pi}}\; L(\pi, \check{\pi}) + (\theta w_1 + (1-\theta) w_2)^{\intercal} F(\check{\pi}) \\
&\;= \min_{\pi} \max_{\check{\pi}}\; \theta L(\pi, \check{\pi}) + (1-\theta) L(\pi, \check{\pi}) + \theta w_1^{\intercal} F(\check{\pi}) + (1-\theta) w_2^{\intercal} F(\check{\pi}) \\
&\;\leq \min_{\pi} \max_{\check{\pi}}\; \theta L(\pi, \check{\pi}) + \theta w_1^{\intercal} F(\check{\pi}) \;+\; \min_{\pi} \max_{\check{\pi}}\; (1-\theta) L(\pi, \check{\pi}) + (1-\theta) w_2^{\intercal} F(\check{\pi}) \\
&\;= \theta\Big[\min_{\pi} \max_{\check{\pi}}\; L(\pi, \check{\pi}) + w_1^{\intercal} F(\check{\pi})\Big] + (1-\theta)\Big[\min_{\pi} \max_{\check{\pi}}\; L(\pi, \check{\pi}) + w_2^{\intercal} F(\check{\pi})\Big]
\end{aligned}$$

which concludes the proof.

Given the result of Theorem 4.2, we can minimize the function f(w) using standard convex optimization techniques [26]. Before doing that, we need to show how we can compute the gradient of f.

Gradient Estimation

It is easy to show that the gradient of f with respect to the Lagrange multipliers w is:

$$\nabla_{w} f(w) = \mathbb{E}\Big[\sum_{t=1}^{T} \phi(\check{S}_t) \,\Big|\, \check{\pi}^{\star}, \tilde{\tau}\Big] - \mu_E + \lambda w \tag{4.15}$$


where π̌⋆ is the adversary's equilibrium strategy. Thus, in order to estimate the gradient of f, we need to compute a Nash equilibrium of the zero-sum game of Equation 4.7. That is, we want to solve the optimization problem:

$$\min_{\pi} \; \max_{\check{\pi}} \; \mathbb{E}\Big[\sum_{t=1}^{T} \mathrm{loss}(S_t, \check{S}_t) \,\Big|\, \tau, \pi, \check{\pi}\Big] + \mathbb{E}\Big[\sum_{t=1}^{T} w^{\intercal}\phi(\check{S}_t) \,\Big|\, \tilde{\tau}, \check{\pi}\Big] \tag{4.16}$$

Notice that we can neglect the feature expectations μ_E since they are constant terms and do not affect the equilibrium strategy.

As described in [11], we consider deterministic policies as the basic strategies of each player and denote them by the letter δ. Then, we represent stochastic policies as mixtures of deterministic ones. Since there are two different definitions of mixture policy, we specify the one we consider in Definition 4.1.

Definition 4.1. Given a set P of N deterministic policies δ_1, δ_2, ..., δ_N and mixing coefficients α_1, α_2, ..., α_N, such that α_i ≥ 0 and Σ_{i=1}^{N} α_i = 1, the mixture policy π of P is the stochastic policy where, at the first time step, one of the N deterministic policies in P is chosen with probability α_i and then deterministically used throughout the whole process.

Notice the difference between Definition 4.1 and the more general case where mixtures are fully stochastic policies obtained by mixing the deterministic ones at each time step (and not only at the beginning of the process). Although more restrictive, our definition is necessary to allow the computation of expectations under the mixture policy as a linear combination of the expectations under deterministic policies (which is necessary when we want to use the mixture policy to match feature counts). Notice also that there always exists a stochastic policy that achieves the same expectation as our mixture policy; thus, we are able to match the features even when the demonstrator's policy is stochastic.

Since the payoff matrix that would result from these assumptions is exponential in the number of states [11], we use the double oracle method to iteratively build the matrix and find a Nash equilibrium. The algorithm we employ is exactly the one described in [11], thus we do not specify it here. The only significant difference is in the two subroutines that compute the best responses of each player. We now show how this can be achieved.

learner's best response Given the adversary's mixed strategy π̌, the learner's best response is the deterministic policy δ given by:

$$\mathrm{BR}_{\min}(\check{\pi}) = \arg\min_{\delta} \; \mathbb{E}\Big[\sum_{t=1}^{T} \mathrm{loss}(S_t, \check{S}_t) \,\Big|\, \tau, \delta, \check{\pi}\Big] \tag{4.17}$$


This can be solved as a simple optimal control problem by considering an MDP with dynamics τ and reward:

$$R(s_t) = \mathbb{E}\big[-\mathrm{loss}(s_t, \check{S}_t) \,\big|\, \tau, \check{\pi}\big] \tag{4.18}$$

Notice that the minus sign in front of the loss function is necessary since we are minimizing. Notice also that the feature-matching terms are not controlled by the learner's strategy, thus they can be safely omitted. Value iteration can be used to efficiently solve this problem, as proposed in [11].
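Concretely, the induced MDP can be built and solved with the same backward induction used for ordinary MDPs. The Python sketch below assumes the adversary's mixed strategy has already been summarized by its state marginals at each time step (adversary_marginals[t, s]); this summary, the array layouts and the function name are assumptions of this sketch.

import numpy as np

def learner_best_response(tau, loss, adversary_marginals, T):
    # tau[s, a, s2]: test dynamics; loss[s, s_adv]: state loss
    S, A, _ = tau.shape
    # R_t(s) = E[-loss(s, adversary state at time t)], cf. Equation 4.18
    R = -adversary_marginals.dot(loss.T)          # shape (T, S)
    V = R[T - 1].copy()
    policy = np.zeros((T - 1, S), dtype=int)
    for t in range(T - 2, -1, -1):
        Q = R[t][:, None] + np.einsum('sap,p->sa', tau, V)
        policy[t] = Q.argmax(axis=1)              # deterministic best response
        V = Q.max(axis=1)
    return policy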

adversary's best response The computation of the adversary's best response, on the other hand, represents the main difficulty of this work. Given the learner's mixed strategy π, the adversary's best response is the deterministic policy δ̌ given by:

$$\mathrm{BR}_{\max}(\pi) = \arg\max_{\check{\delta}} \; \mathbb{E}\Big[\sum_{t=1}^{T} \mathrm{loss}(S_t, \check{S}_t) \,\Big|\, \tau, \pi, \check{\delta}\Big] + \mathbb{E}\Big[\sum_{t=1}^{T} w^{\intercal}\phi(\check{S}_t) \,\Big|\, \tilde{\tau}, \check{\delta}\Big] \tag{4.19}$$

This is the problem of finding the optimal deterministic policy that maximizes the total expected reward over two different MDPs, which has been shown to be NP-hard [28]. Since the solution of this problem is a substantial part of this work and should be analyzed in a more general context, its description is deferred to the next chapter. At this point, the reader should simply assume the existence of a procedure to tractably compute BR_max.

Gradient Descent

We are now ready to describe our learning algorithm. This is shown in Algorithm 1. We adopt a simple gradient descent procedure which uses the double oracle method to estimate the gradient. We stop when the L1-norm of the gradient divided by the number of features is less than a given threshold ε. Since the gradient is the difference between the feature expectations under the estimated policy π̌⋆ and the expert's feature expectations, we are sure that, upon termination, all features are, on average, matched with an error of at most ε.


Algorithm 1: Gradient Descent for Imitation Learning under Covariate Shift
Input: learning rate η, convergence threshold ε, regularization weight λ
Output: weights w, imitation policy δ

Initialize weights: w ← N(0, 1)
while (1/K) ‖∇_w f(w)‖_1 > ε do
    (π⋆, π̌⋆) ← doubleOracle(w)
    ∇_w f(w) ← E[Σ_{t=1}^{T} φ(Š_t) | π̌⋆, τ̃] − μ_E + λw
    w ← w − η ∇_w f(w)
end
Compute imitation policy: δ ← BR_min(π̌⋆)

After gradient descent terminates, the final estimate π⋆ of the expert's policy is used to compute the best (deterministic) imitation policy. Furthermore, the final weights w returned by our algorithm can be used to estimate the reward function being optimized by the expert as:

R(s) = wᵀφ(s) (4.20)

An analysis of the relationship to existing IRL methods can be found in [11].
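For concreteness, a minimal Python sketch of the outer loop of Algorithm 1 is given below. It is not the thesis implementation: double_oracle and feature_expectations are hypothetical callables standing in for the double oracle subroutine and for the computation of E[Σ_t φ(Š_t) | π⋆, τ̃] under the train dynamics.

import numpy as np

def adversarial_il(double_oracle, feature_expectations, mu_E, K,
                   eta=0.1, eps=1e-3, lam=0.01, max_iters=1000, seed=0):
    """Gradient descent of Algorithm 1 (sketch).

    double_oracle(w)          -> adversary's equilibrium mixed strategy pi_star
    feature_expectations(pi)  -> K-vector of feature expectations under the train dynamics
    mu_E                      -> empirical feature expectations of the demonstrations
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(K)
    pi_star = double_oracle(w)
    for _ in range(max_iters):
        grad = feature_expectations(pi_star) - mu_E + lam * w
        if np.abs(grad).sum() / K <= eps:    # stopping rule of Algorithm 1
            break
        w = w - eta * grad
        pi_star = double_oracle(w)
    # the imitation policy is then delta = BR_min(pi_star), not shown here
    return w, pi_star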


5 MULTIPLE-MDP OPTIMIZATION

In Chapter 4, we showed that a difficult sub-problem arising from our adversarial formulation is the computation of a deterministic policy maximizing the total expected reward over two different MDPs. In this chapter, we analyze the more general problem in which we are given multiple MDPs and must find a deterministic policy that maximizes the total reward over all of them. We start by formally defining this problem in Section 5.1. Then, in Section 5.2, we show how its solution can be reduced to the optimal control of POMDPs [12], and in Section 5.3 we propose a modified version of PBVI [18] that solves this reduction. Finally, Section 5.4 shows how we can use this result to tractably find the adversary's best response in our adversarial framework.

problem definition

Suppose we are given a set of D MDPs M = {M_1, M_2, ..., M_D}, all defined over state space S and action space A. Each MDP M_j ∈ M is a tuple ⟨S, A, τ_j, R_j, p_0⟩, where τ_j(S_{1:T} || A_{1:T−1}) are the state-transition dynamics, R_j(S_t, A_t, S_{t+1}) is the reward function, and p_0(S) is the initial state distribution. Our goal is to find a deterministic policy π that maximizes the total expected reward from all MDPs in M. Formally, we want to solve the optimization problem:

argmax_π Σ_{j=1}^D E[ Σ_{t=1}^{T−1} R_j(S_t, A_t, S_{t+1}) | τ_j, π ]      (5.1)

Unfortunately, this problem is NP-hard [28]. It is easy to show that the policy achieving the maximum is non-Markovian, i.e., it depends on the whole state-action sequence and not only on the current state. The next section proposes a tractable approximation that makes the policy Markovian by introducing a new continuous variable.

approximate dynamic programming

The optimization problem of Equation 5.1 is not practical to solve using classic dynamic programming since the optimal policy is non-Markovian. Therefore, we propose an approximate dynamic program that makes the optimal policy Markovian by incorporating knowledge of the state-action sequence into a new continuous variable. We call the latter "belief state", in analogy to POMDPs. Our main result is given in Theorem 5.1.

Theorem 5.1. The optimization problem of Equation 5.1 can be solved by considering the following dynamic program. We define a "belief state" vector b incorporating knowledge of the transition probabilities from all MDPs in M as:

b_t := [ 1 / Σ_{j=1}^D τ_j(s_{1:t} || a_{1:t−1}) ] · [ τ_1(s_{1:t} || a_{1:t−1}), τ_2(s_{1:t} || a_{1:t−1}), …, τ_D(s_{1:t} || a_{1:t−1}) ]ᵀ      (5.2)

Then, the state-action value and state value functions are, respectively:

Q(s_t, a_t, b_t) = Σ_{j=1}^D b_{t,j} Σ_{s_{t+1}} τ_j(s_{t+1} | s_t, a_t) [ R_j(s_t, a_t, s_{t+1}) + V(s_{t+1}, b_{t+1}) ]      (5.3)

V(s_t, b_t) = max_{a_t} Q(s_t, a_t, b_t)      (5.4)

The next belief state b_{t+1} can be computed from b_t, given s_t, a_t and s_{t+1}, as:

b_{t+1} = [ b_t ⊙ [ τ_1(s_{t+1} | s_t, a_t), τ_2(s_{t+1} | s_t, a_t), …, τ_D(s_{t+1} | s_t, a_t) ]ᵀ ] / [ Σ_{j=1}^D b_{t,j} τ_j(s_{t+1} | s_t, a_t) ]      (5.5)

where the symbol ⊙ denotes the point-wise (element-wise) product of two vectors. Finally, the optimal deterministic policy is:

π⋆(s_t, b_t) = argmax_{a_t} Q(s_t, a_t, b_t)      (5.6)

Proof. Suppose we reach time step T − 1 after observing state sequence s_{1:T−1} and action sequence a_{1:T−2}. It only remains to pick the best last action a_{T−1}, which leads to the final state s_T. The contribution of this last decision to the total expected reward is:

Σ_{j=1}^D E[ R_j(S_{T−1}, A_{T−1}, S_T) | τ_j, π ]
    = Σ_{j=1}^D τ_j(s_{1:T−1} || a_{1:T−2}) π(a_{1:T−2} || s_{1:T−2}) Σ_{a_{T−1}} π(a_{T−1} | a_{1:T−2}, s_{1:T−1}) Σ_{s_T} τ_j(s_T | s_{T−1}, a_{T−1}) R_j(s_{T−1}, a_{T−1}, s_T)
    = Σ_{a_{T−1}} π(a_{1:T−1} || s_{1:T−1}) Σ_{j=1}^D τ_j(s_{1:T−1} || a_{1:T−2}) Σ_{s_T} τ_j(s_T | s_{T−1}, a_{T−1}) R_j(s_{T−1}, a_{T−1}, s_T)

We want to find the best action a_{T−1} that maximizes this expectation, namely:

argmax_{a_{T−1}} Σ_{j=1}^D τ_j(s_{1:T−1} || a_{1:T−2}) Σ_{s_T} τ_j(s_T | s_{T−1}, a_{T−1}) R_j(s_{T−1}, a_{T−1}, s_T)
    = argmax_{a_{T−1}} [ Σ_{j=1}^D τ_j(s_{1:T−1} || a_{1:T−2}) Σ_{s_T} τ_j(s_T | s_{T−1}, a_{T−1}) R_j(s_{T−1}, a_{T−1}, s_T) ] / [ Σ_{j=1}^D τ_j(s_{1:T−1} || a_{1:T−2}) ]
    = argmax_{a_{T−1}} Σ_{j=1}^D b_{T−1,j} Σ_{s_T} τ_j(s_T | s_{T−1}, a_{T−1}) R_j(s_{T−1}, a_{T−1}, s_T)      (5.7)

where we define the belief state b_t as:

b_t := [ 1 / Σ_{j=1}^D τ_j(s_{1:t} || a_{1:t−1}) ] · [ τ_1(s_{1:t} || a_{1:t−1}), …, τ_D(s_{1:t} || a_{1:t−1}) ]ᵀ
    = [ 1 / Σ_{j=1}^D Π_{i=1}^t τ_j(s_i | s_{i−1}, a_{i−1}) ] · [ Π_{i=1}^t τ_1(s_i | s_{i−1}, a_{i−1}), …, Π_{i=1}^t τ_D(s_i | s_{i−1}, a_{i−1}) ]ᵀ      (5.8)

Notice that the last equality holds since we are considering only Markovian dynamics. Furthermore, the dependence on the whole state-action sequence has now been removed: the maximization over actions depends only on the current state and belief state. We now define the state-action value function at time step T − 1 as the part of 5.7 that is being maximized:

Q(s_{T−1}, a_{T−1}, b_{T−1}) = Σ_{j=1}^D b_{T−1,j} Σ_{s_T} τ_j(s_T | s_{T−1}, a_{T−1}) R_j(s_{T−1}, a_{T−1}, s_T)      (5.9)

and the state value function at time step T − 1 as:

V(s_{T−1}, b_{T−1}) = max_{a_{T−1}} Q(s_{T−1}, a_{T−1}, b_{T−1})      (5.10)

Finally, the computation of the optimal action at time step T − 1 can be reformulated in terms of the state-action value function as:

π⋆(s_{T−1}, b_{T−1}) = argmax_{a_{T−1}} Q(s_{T−1}, a_{T−1}, b_{T−1})      (5.11)

We now prove the update rule for the belief state. If the current belief state is b_t, we take action a_t from state s_t, and we end up in state s_{t+1}, then the i-th component of the next belief state is:

b_{t+1,i} = [ b_{t,i} τ_i(s_{t+1} | s_t, a_t) ] / [ Σ_{j=1}^D b_{t,j} τ_j(s_{t+1} | s_t, a_t) ]      (5.12)

To see this, we substitute the definition of b_t in Equation 5.12:

b_{t+1,i} = { [ τ_i(s_{1:t} || a_{1:t−1}) / Σ_{j=1}^D τ_j(s_{1:t} || a_{1:t−1}) ] τ_i(s_{t+1} | s_t, a_t) } / { Σ_{j=1}^D [ τ_j(s_{1:t} || a_{1:t−1}) / Σ_{k=1}^D τ_k(s_{1:t} || a_{1:t−1}) ] τ_j(s_{t+1} | s_t, a_t) }
    = [ τ_i(s_{1:t} || a_{1:t−1}) τ_i(s_{t+1} | s_t, a_t) ] / [ Σ_{j=1}^D τ_j(s_{1:t} || a_{1:t−1}) τ_j(s_{t+1} | s_t, a_t) ]
    = τ_i(s_{1:t+1} || a_{1:t}) / Σ_{j=1}^D τ_j(s_{1:t+1} || a_{1:t})      (5.13)

and we find that the last expression is exactly the i-th component of b_{t+1}. We now move to time step T − 2. The state-action value function can be easily modified to account for the immediate reward and the value of the next state:

Q(s_{T−2}, a_{T−2}, b_{T−2}) = Σ_{j=1}^D b_{T−2,j} Σ_{s_{T−1}} τ_j(s_{T−1} | s_{T−2}, a_{T−2}) [ R_j(s_{T−2}, a_{T−2}, s_{T−1}) + V(s_{T−1}, b_{T−1}) ]

where the state value function is again:

V(s_{T−2}, b_{T−2}) = max_{a_{T−2}} Q(s_{T−2}, a_{T−2}, b_{T−2})      (5.14)


and the optimal action is:

π⋆(s_{T−2}, b_{T−2}) = argmax_{a_{T−2}} Q(s_{T−2}, a_{T−2}, b_{T−2})      (5.15)

We can now continue this procedure backward up to the first time step, thus concluding the proof.

Theorem 5.1 defines a dynamic program that can be used to solve the optimization problem of Equation 5.1. However, its solution is still not trivial: in order to make everything Markovian, we introduced a continuous variable that prevents us from using any tabular representation of policies and value functions. The simplest solution is belief discretization: for a D-dimensional belief state vector, we partition the space R^D into a finite set of hypercubes and discretize the belief over this partition. Then, we can solve the dynamic program of Theorem 5.1 using a tabular representation. However, the number of discretized belief states needed to obtain a satisfactory approximation grows exponentially with the belief dimension D, so this algorithm is practical only for low-dimensional beliefs. The next section analyzes some properties of this formulation that allow a much more efficient solution.
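Independently of how the value function is represented, the belief update of Equation 5.5 is the basic operation of this dynamic program. A minimal numpy sketch is shown below; it is illustrative only, and tau_stack is a hypothetical array stacking the D transition models.

import numpy as np

def belief_update(b, s, a, s_next, tau_stack):
    """Equation 5.5: update the D-dimensional belief after observing (s, a, s_next).

    b:         (D,) current belief
    tau_stack: (D, S, A, S) array with tau_stack[j, s, a, s2] = tau_j(s2 | s, a)
    """
    step_probs = tau_stack[:, s, a, s_next]   # tau_j(s_next | s, a) for j = 1, ..., D
    unnormalized = b * step_probs             # point-wise product of Equation 5.5
    return unnormalized / unnormalized.sum()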

Dynamic Program Properties

We analyze some properties of the dynamic program introduced in Theorem 5.1, focusing on its relation to POMDPs.

Let us start by commenting on the meaning of the belief state. Although the name is chosen in analogy to POMDPs, the two definitions are not directly related. In a POMDP, the belief state is an |S|-dimensional vector whose i-th component represents the probability of the process being in the i-th state; thus, the belief state vector belongs to the (|S| − 1)-dimensional simplex. In our case, the belief state is a D-dimensional vector whose i-th component represents the (normalized) probability of the state-action sequence observed up to the current time under the i-th dynamics. Since our belief is normalized, it belongs to the (D − 1)-dimensional simplex.

The other main difference with respect to POMDPs is in the value function V. In our case, V is a function of both the state s_t and the belief state b_t, and it represents the expected future reward obtained by starting from s_t and executing the optimal policy π⋆, given that the state-transition probabilities are encoded in b_t. In a POMDP, V is a function of the belief state b alone, and it represents the expected future reward the agent obtains considering that the unknown current state is distributed according to b.


Although many differences exist, our value function has a property that allows us to solve the dynamic program of Theorem 5.1 using (modified) POMDP algorithms. This property is given in Theorem 5.2.

Theorem 5.2. The value function defined in Equation 5.4 is piecewise linear and convex (PWLC) in the belief state b_t.

Proof. We prove the theorem by induction. By replacing Q with its definition, we can write the value function at time step T − 1 as:

V(s_{T−1}, b_{T−1}) = max_{a_{T−1}} Σ_{j=1}^D b_{T−1,j} Σ_{s_T} τ_j(s_T | s_{T−1}, a_{T−1}) R_j(s_{T−1}, a_{T−1}, s_T)
    = max_{a_{T−1}} bᵀ_{T−1} [ Σ_{s_T} τ_1(s_T | s_{T−1}, a_{T−1}) R_1(s_{T−1}, a_{T−1}, s_T), …, Σ_{s_T} τ_D(s_T | s_{T−1}, a_{T−1}) R_D(s_{T−1}, a_{T−1}, s_T) ]ᵀ      (5.16)

The function that is maximized in 5.16 is clearly linear in b_{T−1} for any choice of state s_{T−1} and action a_{T−1}. Thus, V(s_{T−1}, b_{T−1}) is the maximum over a set of linear functions, i.e., it is PWLC.

Suppose now that the value function is PWLC at time step t + 1. This means that it can be written as a maximization over a set of hyperplanes:

V(s_{t+1}, b_{t+1}) = max_k bᵀ_{t+1} α_k(s_{t+1})      (5.17)

where the α_k(s_{t+1}) are the normal directions of such hyperplanes (we call them alpha-vectors in the remainder, again in analogy to POMDPs). We need to prove that this implies that the state value function is PWLC at time step t. The latter can be written as:

V(s_t, b_t) = max_{a_t} bᵀ_t [ Σ_{s_{t+1}} τ_1(s_{t+1} | s_t, a_t) [ R_1(s_t, a_t, s_{t+1}) + V(s_{t+1}, b_{t+1}) ], …, Σ_{s_{t+1}} τ_D(s_{t+1} | s_t, a_t) [ R_D(s_t, a_t, s_{t+1}) + V(s_{t+1}, b_{t+1}) ] ]ᵀ
    = max_{a_t} Σ_{s_{t+1}} { bᵀ_t [ τ_1(s_{t+1} | s_t, a_t) R_1(s_t, a_t, s_{t+1}), …, τ_D(s_{t+1} | s_t, a_t) R_D(s_t, a_t, s_{t+1}) ]ᵀ + V(s_{t+1}, b_{t+1}) · bᵀ_t [ τ_1(s_{t+1} | s_t, a_t), …, τ_D(s_{t+1} | s_t, a_t) ]ᵀ }      (5.18)

By expanding the definition of b_{t+1} in the PWLC representation of V(s_{t+1}, b_{t+1}) given in 5.17, we get:

V(s_{t+1}, b_{t+1}) = [ 1 / Σ_{j=1}^D b_{t,j} τ_j(s_{t+1} | s_t, a_t) ] max_k Σ_{j=1}^D b_{t,j} τ_j(s_{t+1} | s_t, a_t) α_{k,j}(s_{t+1})      (5.19)

If we now substitute this into 5.18, we notice that the normalization term of 5.19 and the last term of 5.18 cancel out. Thus, we have:

V(s_t, b_t) = max_{a_t} Σ_{s_{t+1}} { bᵀ_t [ τ_1(s_{t+1} | s_t, a_t) R_1(s_t, a_t, s_{t+1}), …, τ_D(s_{t+1} | s_t, a_t) R_D(s_t, a_t, s_{t+1}) ]ᵀ + max_k bᵀ_t [ τ_1(s_{t+1} | s_t, a_t) α_{k,1}(s_{t+1}), …, τ_D(s_{t+1} | s_t, a_t) α_{k,D}(s_{t+1}) ]ᵀ }
    = max_{a_t} { bᵀ_t [ Σ_{s_{t+1}} τ_1(s_{t+1} | s_t, a_t) R_1(s_t, a_t, s_{t+1}), …, Σ_{s_{t+1}} τ_D(s_{t+1} | s_t, a_t) R_D(s_t, a_t, s_{t+1}) ]ᵀ + Σ_{s_{t+1}} max_k bᵀ_t [ τ_1(s_{t+1} | s_t, a_t) α_{k,1}(s_{t+1}), …, τ_D(s_{t+1} | s_t, a_t) α_{k,D}(s_{t+1}) ]ᵀ }      (5.20)

We see that the function being maximized over actions is the sum of a linear term and |S| PWLC functions of b_t, i.e., it is again a PWLC function of b_t. Since the maximum over a set of PWLC functions is PWLC, this concludes the proof.

Theorem 5.2 proves a fundamental property: as for POMDPs, our value function is PWLC. This means that the alpha-vectors, i.e., the normal directions of the hyperplanes describing the function, are sufficient to fully represent it. Algorithms for solving POMDPs rely on this fact to efficiently approximate the value function and, thus, the optimal policy. Since the PWLC property of V is the only requirement for applying such algorithms, we have reduced the solution of our problem to that of a partially observable domain.
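As an illustration of this representation (not taken from the thesis), evaluating a PWLC value function stored as per-state sets of alpha-vectors is a one-line maximization; alpha_sets below is a hypothetical container mapping each state to the alpha-vectors stored for it.

import numpy as np

def pwlc_value(s, b, alpha_sets):
    """Evaluate V(s, b) = max_k b . alpha_k(s) for a PWLC value function.

    alpha_sets: dict mapping each state s to a (K_s, D) array whose rows are
                the alpha-vectors stored for that state.
    """
    alphas = alpha_sets[s]                 # (K_s, D)
    return float(np.max(alphas @ b))       # maximum over the stored hyperplanes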

Many algorithms have been proposed in the POMDP literature. In this work, we adopt point-based procedures, which approximate the value function by storing a finite set of representative belief states and computing one alpha-vector for each of them. Since, as shown at the beginning of this section, there are a few differences between our dynamic program and that of a POMDP, we need to extend such algorithms to our case. The next section shows how this can be done for PBVI [18].

modified point-based value iteration

In the previous section, we proved that the value function defined by our dynamic program is PWLC. This allows the reduction of our problem to the optimal control of a POMDP. In this section, we show how we can adapt PBVI [18], one of the most common and efficient algorithms for solving POMDPs, to our specific case.

We show the skeleton of our modified algorithm, which is very similar to the original one, in Algorithm 2, while we defer the description of the modified subroutines to the next two sections.


Algorithm 2: Modified Point-Based Value Iteration
Input: number of iterations N
Output: value function V = {V_1, V_2, ..., V_T}

Initialize belief set: B = {b_1}
for i = 1 to N do
    V_T(s_T) = ∅  ∀ s_T ∈ S
    for t = T − 1 to 1 do
        V_t = modifiedValueBackup(V_{t+1}, B)
    end
    B = modifiedBeliefExpansion(B)
end

Notice immediately two differences with respect to the original algorithm presented in [18]. First, the initial belief state used to initialize B is not a distribution over states (as in POMDPs), but is defined as:

b_1 = [ 1 / Σ_{j=1}^D p_0(s_1) ] · [ p_0(s_1), p_0(s_1), …, p_0(s_1) ]ᵀ = [ 1/D, 1/D, …, 1/D ]ᵀ      (5.21)

where the last equality holds since we consider all MDPs to have the same initial state distribution. The second main difference is that, since we are considering a finite-horizon problem, our value function is represented by a different set of alpha-vectors at each time step. Thus, we write V = {V_1, V_2, ..., V_T} and we run T − 1 backups before applying the modified belief expansion. Furthermore, since our value function depends on the state, we have a different set of alpha-vectors for each state and time step, and we write each V_t as the set V_t = {V_t(s_t) | ∀ s_t ∈ S}. Notice that V_T(s_T) = ∅ ∀ s_T ∈ S, since there is no action to take at the last time step.

Modified Value Backup

Value backup updates the current value function given the current belief set B by computing, for each b ∈ B, the alpha-vector achieving the maximum over actions. Since our value function is quite different from that of a POMDP, we need to adapt the original procedure. We use a notation similar to the one used in [18], so that the reader can easily relate the new formulation to the original one.


Our goal is to compute the alpha-vectors describing V(s_t, b) for each b ∈ B. We denote the set containing such vectors as V_t(s_t). Notice that, from now on, we drop the time subscript from b since we keep a time-independent set of belief states. We consider the representation of V(s_t, b) given in Equation 5.20, which best highlights the different terms forming the alpha-vectors. Then, generalizing from the original algorithm [18], we define the following quantities for each state s_t ∈ S:

• Γ_t^{a_t,⋆}: the vector multiplying b in the first linear term of V(s_t, b) for action a_t;

• Γ_t^{a_t,s_{t+1}}: the set of all vectors multiplying b in the inner maximization of V(s_t, b) for state s_{t+1} and action a_t;

• Γ_t^{a_t,b}: the alpha-vector that is multiplied with b to compute the outer maximum of V(s_t, b) for action a_t and belief b ∈ B.

To simplify the notation, we define τ(s_{t+1} | s_t, a_t) as the vector containing τ_j(s_{t+1} | s_t, a_t) for each j = 1, ..., D, and we do the same for the reward functions by defining R(s_t, a_t, s_{t+1}). Given these quantities, the modified procedure is shown in Algorithm 3.

Algorithm 3: Modified Value Backup
Input: value function V_{t+1}, belief set B
Output: updated value function V_t = {V_t(s_t) | ∀ s_t ∈ S}

foreach s_t ∈ S do
    Γ_t^{a_t,⋆} = Σ_{s_{t+1}} τ(s_{t+1} | s_t, a_t) ⊙ R(s_t, a_t, s_{t+1})
    Γ_t^{a_t,s_{t+1}} = { τ(s_{t+1} | s_t, a_t) ⊙ α(s_{t+1}) | ∀ α ∈ V_{t+1}(s_{t+1}) }
    Γ_t^{a_t,b} = Γ_t^{a_t,⋆} + Σ_{s_{t+1}} argmax_{α ∈ Γ_t^{a_t,s_{t+1}}} bᵀα
    V_t(s_t) = { argmax_{Γ_t^{a_t,b}, ∀ a_t} bᵀ Γ_t^{a_t,b} | ∀ b ∈ B }
end
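The following numpy sketch mirrors Algorithm 3 for tabular problems. It is illustrative rather than the thesis implementation: V_next is assumed to map each state to the array of alpha-vectors stored for it, and tau_stack and R_stack are hypothetical (D, S, A, S) arrays stacking the dynamics and rewards of the D MDPs.

import numpy as np

def modified_value_backup(V_next, B, tau_stack, R_stack):
    """One backup of Algorithm 3 (sketch).

    V_next:    dict s2 -> (K, D) array of alpha-vectors (K may be 0 at the last step)
    B:         list of (D,) belief points
    tau_stack: (D, S, A, S) stacked dynamics
    R_stack:   (D, S, A, S) stacked rewards
    Returns a dict s -> (|B|, D) array with one alpha-vector per belief point.
    """
    D, S, A, _ = tau_stack.shape
    V_t = {}
    for s in range(S):
        # Gamma^{a,*}: sum over s2 of the point-wise product tau(s2|s,a) (.) R(s,a,s2)
        gamma_star = (tau_stack[:, s, :, :] * R_stack[:, s, :, :]).sum(axis=2).T   # (A, D)
        new_alphas = []
        for b in B:
            best_val, best_alpha = -np.inf, None
            for a in range(A):
                alpha_ab = gamma_star[a].copy()
                for s2 in range(S):
                    cand = V_next.get(s2, np.zeros((0, D)))
                    if cand.shape[0] == 0:
                        continue                                # no future term at the last step
                    gammas = tau_stack[:, s, a, s2] * cand      # Gamma^{a,s2}, shape (K, D)
                    alpha_ab += gammas[np.argmax(gammas @ b)]   # inner argmax of Algorithm 3
                val = float(alpha_ab @ b)
                if val > best_val:
                    best_val, best_alpha = val, alpha_ab
            new_alphas.append(best_alpha)
        V_t[s] = np.vstack(new_alphas)
    return V_t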

Modified Belief Expansion

Belief expansion updates the current belief set B by simulating every action and adding only the resulting belief state that is farthest from B (if not already present). This is done for each b ∈ B, so the current belief set is at most doubled at each expansion. However, our definition of belief state is very different from that of a POMDP: we cannot simply take an action from a certain belief, since we also need to know the state. Thus, we modify the original belief expansion to simulate every action from every state, compute the updated belief state according to Equation 5.5, and finally add the new belief that has the maximum distance from B, where the distance is defined as the minimum L1 norm from a belief in B. Our modified procedure is shown in Algorithm 4. Notice that we write updateBelief(b, s, a, s′) to concisely represent the belief update of Equation 5.5.

Algorithm 4: Modified Belief Expansion
Input: belief set B
Output: expanded belief set B′

Initialize expanded belief set: B′ ← B
foreach b ∈ B do
    foreach s ∈ S do
        foreach a ∈ A do
            Sample s′_j according to τ_j(· | s, a)  ∀ j = 1, ..., D
            for j = 1 to D do
                b′_{s,a,j} = updateBelief(b, s, a, s′_j)
                dist(b′_{s,a,j}) = min_{b′′ ∈ B} ‖b′_{s,a,j} − b′′‖_1
            end
        end
    end
    if argmax_{b′_{s,a,j}} dist(b′_{s,a,j}) ∉ B′ then
        B′ = B′ ∪ { argmax_{b′_{s,a,j}} dist(b′_{s,a,j}) }
    end
end
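A minimal Python sketch of this expansion step is given below, again only as an illustration under the same hypothetical tau_stack representation used earlier.

import numpy as np

def modified_belief_expansion(B, tau_stack, rng=None):
    """One expansion step of Algorithm 4 (sketch).

    B:         list of (D,) belief points
    tau_stack: (D, S, A, S) stacked dynamics
    """
    rng = rng if rng is not None else np.random.default_rng()
    D, S, A, _ = tau_stack.shape
    B_new = list(B)
    for b in B:
        best_dist, best_belief = -1.0, None
        for s in range(S):
            for a in range(A):
                for j in range(D):
                    s_next = rng.choice(S, p=tau_stack[j, s, a])   # simulate the j-th dynamics
                    unnorm = b * tau_stack[:, s, a, s_next]        # belief update of Equation 5.5
                    if unnorm.sum() == 0.0:
                        continue
                    b_cand = unnorm / unnorm.sum()
                    dist = min(np.abs(b_cand - bb).sum() for bb in B)   # L1 distance from B
                    if dist > best_dist:
                        best_dist, best_belief = dist, b_cand
        if best_belief is not None and not any(np.allclose(best_belief, bb) for bb in B_new):
            B_new.append(best_belief)
    return B_new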


application to adversarial imitation learning

We finally show how the algorithm presented in this chapter can be applied to our adversarial formulation. Recall that we need to compute the adversary's best response as:

BR_max(π) = argmax_δ E[ Σ_{t=1}^T loss(Ŝ_t, Š_t) | τ, π, δ ] + E[ Σ_{t=1}^T wᵀφ(Š_t) | τ̃, δ ]      (5.22)

It is easy to see that this problem can be reduced to a multiple-MDP optimization: we are looking for the optimal deterministic policy δ that maximizes the total expected reward from two MDPs. The first MDP has the test dynamics τ and reward defined by:

R(s_t) = E[ loss(Ŝ_t, s_t) | τ, π ]      (5.23)

The second MDP has the train dynamics τ̃ (under which the demonstrations were collected) and reward given by the Lagrangian potential terms:

R(s_t) = wᵀφ(s_t)      (5.24)

The only missing step is to generalize the reward from a function of the state alone to a function of state, action, and next state (as adopted in this chapter). Given a reward function R over states, this is easily done by considering:

R(s_t, a_t, s_{t+1}) = R(s_t)                    if t < T − 1
R(s_t, a_t, s_{t+1}) = R(s_t) + R(s_{t+1})       if t = T − 1      (5.25)

i.e., the reward of the final state is folded into the last transition.

All we need to do now is convert our reward functions as specified in Equation 5.25 and apply modifiedPBVI to obtain the optimal value function. Then, we can simply obtain the optimal action for each state s_t and belief b by taking the action associated with the hyperplane achieving the maximum at b.
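A minimal sketch of the reward conversion of Equation 5.25 for tabular rewards is shown below; it simply repeats the state reward at every step and folds the final state's reward into the last transition.

import numpy as np

def state_reward_to_transition_reward(R_state, T):
    """Equation 5.25 for a tabular state reward R_state of shape (S,).

    Returns a (T-1, S, S) array R with R[t, s, s2] = R(s_t = s, s_{t+1} = s2); the
    reward does not depend on the action, so the action axis is omitted here.
    """
    S = R_state.shape[0]
    R = np.tile(R_state[None, :, None], (T - 1, 1, S))   # R(s_t) at every transition
    R[T - 2] += R_state[None, :]                          # add R(s_T) on the last transition
    return R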


6 EXPERIMENTS

In this chapter, we evaluate our algorithm on two common RL benchmarks: gridworld, described in Section 6.1, and the taxi problem, described in Section 6.2. In each experiment, we analyze both the typical case where train and test dynamics coincide and the case where covariate shift is present. We compare our Adversarial Imitation Learning (AIL) method to four other algorithms: a simple BC approach that computes a policy by estimating the expert's state-action frequencies from the data, Feature Matching (FM), Maximum Entropy Inverse Reinforcement Learning (ME), and Adversarial Inverse Optimal Control (AIOC). Since FM and ME recover a reward function, we compute their imitation policies by solving an RL problem under the test dynamics, while for AIOC we consider the loss-minimizing policy.

gridworld

In this section, we evaluate our approach on the well-known RL problem gridworld. We now briefly describe the version we adopt, which is very similar to the one considered in [29].

In gridworld, an agent navigates an N×N grid. At each time step, the agent can take one of five possible actions: go north, go east, go south, go west, or stay in the current cell. Each action succeeds with probability 0.7, while another random action is taken with probability 0.3. The agent starts in one of the four corners, with probability 0.25 each, and navigates for a fixed number of time steps. We group the states into B×B blocks (where N must be a multiple of B) and we suppose the reward function is constant within each block. We consider an (N/B)^2-dimensional feature vector where each component is 1 if the agent is located in the corresponding block and 0 otherwise, and we write the reward as a linear combination of these features. An example of a gridworld with such a reward function is shown in Figure 6.1.

[Figure 6.1: A 10×10 gridworld with 2×2 blocks. White states have high reward, while black states have low reward.]

In this experiment, we consider the original gridworld dynamics (described above) as the test dynamics, and we compute the expert's policy π to be optimal under them. To introduce covariate shift, we change the behavior of a small number of cells, which we call "absorbing" cells: whenever the agent falls into one of these cells, she cannot leave it anymore, no matter which actions she takes. We consider the test dynamics augmented with absorbing cells as our train dynamics. Intuitively, an agent trying to maximize the above-mentioned reward function should avoid such cells as much as possible, since they might prevent her from reaching other high-reward states. Thus, the expert's policy π, which acts unaware of absorbing cells, becomes sub-optimal when executed under the train dynamics. Consequently, the state distributions induced by π under the two dynamics differ. An example is shown in Figure 6.2. As we can see from Figure 6.2b, it is still possible to infer the agent's goal (i.e., to reach the grid center). However, the randomly-placed absorbing cells make the problem harder, and many demonstrations are likely not to reach the goal.

We adopt the Euclidean distance between two states as our loss function:

loss(s^{(1)}, s^{(2)}) = √( (s^{(1)}_x − s^{(2)}_x)^2 + (s^{(1)}_y − s^{(2)}_y)^2 )      (6.1)
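As an illustration (the exact encoding is not specified here, so the 0-indexed coordinates are an assumption), the block-indicator features and the loss of Equation 6.1 can be written in a few lines of Python.

import numpy as np

def block_features(x, y, N=10, B=2):
    """(N/B)^2-dimensional one-hot feature: the index of the BxB block containing (x, y)."""
    n_blocks = N // B
    phi = np.zeros(n_blocks * n_blocks)
    phi[(y // B) * n_blocks + (x // B)] = 1.0
    return phi

def euclidean_loss(s1, s2):
    """Equation 6.1: Euclidean distance between two cells s = (x, y)."""
    return float(np.hypot(s1[0] - s2[0], s1[1] - s2[1]))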

[Figure 6.2: The state distribution induced by the expert's policy under (a) test dynamics and (b) train dynamics. White denotes high probability, black denotes low probability, and red circles denote absorbing cells.]

We start by analyzing the easier case without covariate shift. We generate different numbers of demonstrations under the expert's optimal policy and the test dynamics. For each of them, we run all algorithms 20 times and average the results. Figure 6.3 shows the mean test loss, normalized by the minimum theoretical loss, as a function of the number of demonstrations. Standard deviation bars are not reported for clarity. Notice that, when covariate shift is not present, AIL and AIOC coincide, so the performance of the latter is not reported. From Figure 6.3, we see that our algorithm always performs better than all alternatives. As expected for a simple problem like the one considered here, FM and ME obtain similar results, while BC needs many more samples to learn. In the limit, all three of these algorithms learn a policy very close to the demonstrated one. However, since none of them explicitly minimizes the chosen loss, our approach is always able to perform slightly better. As the number of demonstrations increases, AIL gets very close to the minimum theoretical loss.

We now consider the case with covariate shift. This time, we generate different numbers of demonstrations under the expert's optimal policy and the train dynamics. We show the normalized mean test loss (where the mean is still taken over 20 runs) in Figure 6.4. As expected, FM and ME now behave much less predictably. ME is still able to converge to a very good policy, because we made sure the task remains recoverable even under covariate shift, although it needs more trajectories; FM, on the other hand, fails to do so. BC keeps almost the same behavior, since it does not use the dynamics at all. Similarly, our approach is not much affected by the shift, since it explicitly deals with it; furthermore, AIL now makes much better use of limited data than the other methods. Finally, AIOC minimizes the same loss function but cannot handle the shift: it performs worse when few trajectories are provided, but still converges to the same result, since the demonstrations are only slightly sub-optimal.


[Figure 6.3: Mean test loss for different numbers of trajectories in the case without covariate shift. x-axis: number of trajectories (10^0 to 10^4, log scale); y-axis: normalized test loss; curves: AIL, BC, FM, ME.]

[Figure 6.4: Mean test loss for different numbers of trajectories in the covariate shift case. x-axis: number of trajectories (10^0 to 10^4, log scale); y-axis: normalized test loss; curves: AIOC, AIL, BC, FM, ME.]


[Figure 6.5: The original taxi environment (a) and the modified one we use to generate demonstrations (b). P marks the passenger, D the destination, and T the taxi.]

taxi problem

The taxi problem was first introduced in [14] as a benchmark for hierarchical RL algorithms. We briefly describe it here.

A taxi agent acts in the 5×5 grid shown in Figure 6.5a. There are four particular locations in this grid, denoted as R(ed), G(reen), B(lue), and Y(ellow), where a passenger and her destination can be located. The agent's goal is to move to the passenger's position, pick her up, move to her destination, and finally drop her off. There are six actions the taxi can take: go-north, go-east, go-south, go-west, pick-up, and drop-off. The reward is +20 for successfully dropping off the passenger at her destination and −1 for each time step the taxi takes to do so; there is also a penalty of −10 for illegal pick-up or drop-off actions. If the agent hits a wall, the action has no effect and the reward is still −1. We suppose the taxi position (T), the passenger location (P), and her destination (D) are initially randomized among the four special positions. An example of a starting state is shown in Figure 6.5a. We consider stochastic transition dynamics where, similarly to gridworld, the four movement actions have probability 0.7 of succeeding; in case of failure, another random movement action is selected. Pick-up and drop-off are deterministic and thus always succeed. A state of this problem is encoded as a 4-tuple (x, y, p, d), where (x, y) are the taxi coordinates in the grid, p is the passenger location (either R, G, B, Y, or T), and d is the destination (either R, G, B, or Y). In total, there are 500 states and 6 actions.
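For illustration (the thesis does not prescribe a particular ordering), the 500 states can be indexed from the 4-tuple as follows.

def taxi_state_index(x, y, p, d):
    """Map the taxi state (x, y, p, d) to an index in {0, ..., 499}.

    (x, y): taxi coordinates in the 5x5 grid (0-indexed);
    p:      passenger location, one of 'R', 'G', 'B', 'Y', or 'T' (in the taxi);
    d:      destination, one of 'R', 'G', 'B', 'Y'.
    """
    passenger = ["R", "G", "B", "Y", "T"].index(p)
    destination = ["R", "G", "B", "Y"].index(d)
    return ((x * 5 + y) * 5 + passenger) * 4 + destination   # 25 * 5 * 4 = 500 states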


[Figure 6.6: Mean test loss (a) and mean test reward (b) for different numbers of trajectories in the case without covariate shift. x-axis: number of trajectories (10^0 to 10^4, log scale); curves: AIL, BC, FM, ME.]

In order to introduce covariate shift, we modify the original environment by changing the position of the walls as shown in Figure 6.5b. We consider the original environment as the test dynamics and the modified one as the train dynamics. Intuitively, executing the expert's policy (which is optimal under the test dynamics) in the modified environment produces highly sub-optimal trajectories; in many cases, the expert would not even be able to solve the task were it not for the stochasticity in the dynamics. Thus, we expect covariate shift to have a much larger impact than it has on gridworld.

In this experiment, we adopt a 28-dimensional feature vector: the first 25 components encode the agent's (x, y) position, the 26th component is 1 if the taxi is empty, the 27th is 1 if the passenger has been picked up, and the 28th is 1 if she has been dropped off. Notice that the reward function is not linear in these features. Furthermore, we adopt the following loss function:

loss(s^{(1)}, s^{(2)}) = 1 if (s^{(1)}_p = T) xor (s^{(2)}_p = T)
loss(s^{(1)}, s^{(2)}) = 1 if (s^{(1)} = DROPPED-OFF) xor (s^{(2)} = DROPPED-OFF)
loss(s^{(1)}, s^{(2)}) = 0 otherwise      (6.2)

Intuitively, the loss is 1 if one of the two agents has already picked up the passenger and the other has not, or if one of them has already dropped her off and the other has not; it is 0 otherwise.
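For concreteness, a minimal Python version of Equation 6.2 is given below; the dictionary-based state representation is only an assumption made for this sketch.

def taxi_loss(s1, s2):
    """Equation 6.2: disagree on 'picked up' or on 'dropped off' -> loss 1, else 0.

    Each state is a dict with keys 'p' (passenger location, 'T' if in the taxi)
    and 'dropped_off' (True once the passenger has been delivered).
    """
    picked_up_differs = (s1["p"] == "T") != (s2["p"] == "T")
    dropped_off_differs = s1["dropped_off"] != s2["dropped_off"]
    return 1.0 if (picked_up_differs or dropped_off_differs) else 0.0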

Once again, we start by analyzing the case without covariate shift. As for gridworld, we generate different numbers of demonstrations and, for each of them, we run all algorithms 20 times. We show the normalized mean test loss and the normalized mean test reward in Figures 6.6a and 6.6b, respectively. Since the loss function we adopt is strictly related to the agent's goal, all algorithms converge to the minimum theoretical loss and maximum theoretical reward. As expected, FM and ME behave very similarly, while BC needs many more trajectories. Notice how our AIL approach achieves better results than all alternatives when limited data is provided. We do not report the performance of AIOC, since it coincides with AIL in this setting.

[Figure 6.7: Mean test loss (a) and mean test reward (b) for different numbers of trajectories in the covariate shift case. x-axis: number of trajectories (10^0 to 10^4, log scale); curves: AIOC, AIL, BC, FM, ME.]

We now introduce covariate shift as described before. The normalized mean test loss and normalized mean test reward are shown in Figures 6.7a and 6.7b, respectively. We immediately see that our approach now outperforms all alternatives. Due to the highly sub-optimal demonstrations, FM and ME fail to converge to the optimal result, with the latter being slightly better. As for gridworld, BC is not very affected by the shift and, in the limit, converges to a nearly optimal solution; however, it requires too many trajectories to be used in practice. AIOC converges to the same result as AIL, but takes more than ten times as many demonstrations to do so. Finally, our AIL approach is only slightly affected by covariate shift and still converges to the optimal solution using a reasonable number of trajectories.


7 CONCLUSIONS

In this work, we analyzed the problem of imitation learning under covariate shift. This is the problem of learning a policy from demonstrations whose state distribution differs from the one induced by the expert's policy at test time. In particular, we focused on the case where only the dynamics differ, while the policy remains the same. We motivated the importance of explicitly dealing with covariate shift by showing how it naturally arises in many real-world learning scenarios.

Building on existing inverse reinforcement learning methods, we proposed an adversarial formulation that handles such a distribution shift. We considered a zero-sum game between an adversary, which tries to provide a worst-case estimate of the expert's policy while being forced to match the empirical feature expectations, and a learner, which tries to minimize a given imitation loss with respect to that estimate. We reduced the constrained zero-sum game to an unconstrained one by employing the method of Lagrange multipliers, and we proposed a simple convex optimization procedure to solve it.

We showed that, due to the introduction of covariate shift, the problem of finding an equilibrium becomes NP-hard, since it requires the computation of a deterministic policy maximizing the total expected reward over different Markov decision processes. We proposed a tractable approximate dynamic program for the latter, reduced its solution to the optimal control of partially observable Markov decision processes, and modified point-based value iteration to efficiently handle our specific case.

We evaluated our AIL approach on two common reinforcement learning benchmarks, gridworld and the taxi problem, comparing it to other state-of-the-art algorithms: feature matching, maximum entropy, AIOC, and, as a weak baseline, a simple behavioral cloning approach. In both cases, we showed that our approach performs slightly better than feature matching and maximum entropy when no shift is present (since these two algorithms do not explicitly minimize any loss), while it clearly outperforms them once covariate shift is introduced. AIOC provides good results even under distribution shift, but needs many more trajectories than AIL to do so, and behavioral cloning is always outperformed by all alternatives. We concluded that our approach successfully deals with imitation learning under covariate shift.


BIBLIOGRAPHY

[1] Brenna D. Argall et al. "A Survey of Robot Learning from Demonstration." In: Robot. Auton. Syst. 57.5 (May 2009), pp. 469–483. issn: 0921-8890. doi: 10.1016/j.robot.2008.10.024.

[2] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. 1st. Cambridge, MA, USA: MIT Press, 1998. isbn: 0262193981.

[3] Pieter Abbeel and Andrew Y. Ng. "Apprenticeship Learning via Inverse Reinforcement Learning." In: Proceedings of the Twenty-first International Conference on Machine Learning. ICML '04. Banff, Alberta, Canada: ACM, 2004. isbn: 1-58113-838-5. doi: 10.1145/1015330.1015430.

[4] Yann LeCun et al. "Off-road obstacle avoidance through end-to-end learning."

[5] Michael Bain and Claude Sammut. "A Framework for Behavioural Cloning." 2001.

[6] Stuart Russell. "Learning agents for uncertain environments." In: Proceedings of the eleventh annual conference on Computational learning theory. ACM, 1998, pp. 101–103.

[7] Andrew Y. Ng and Stuart J. Russell. "Algorithms for Inverse Reinforcement Learning." In: Proceedings of the Seventeenth International Conference on Machine Learning. ICML '00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 663–670. isbn: 1-55860-707-2.

[8] Brian D. Ziebart et al. "Maximum Entropy Inverse Reinforcement Learning." 2008.

[9] Bianca Zadrozny. "Learning and evaluating classifiers under sample selection bias." In: Proceedings of the twenty-first international conference on Machine learning. ACM, 2004, p. 114.

[10] Hidetoshi Shimodaira. "Improving predictive inference under covariate shift by weighting the log-likelihood function." In: Journal of Statistical Planning and Inference 90.2 (2000), pp. 227–244.

[11] Xiangli Chen et al. "Adversarial Inverse Optimal Control for General Imitation Learning Losses and Embodiment Transfer." In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence. UAI '16. Jersey City, New Jersey, USA: AUAI Press, 2016, pp. 102–111. isbn: 978-0-9966431-1-5.

[12] Edward J. Sondik. "The Optimal Control of Partially Observable Markov Processes over the Infinite Horizon: Discounted Costs." In: Oper. Res. 26.2 (Apr. 1978), pp. 282–304. issn: 0030-364X. doi: 10.1287/opre.26.2.282.

[13] Anthony R. Cassandra, Leslie Pack Kaelbling, and Michael L. Littman. "Acting optimally in partially observable stochastic domains." 1994.

[14] Thomas G. Dietterich. "Hierarchical reinforcement learning with the MAXQ value function decomposition."

[15] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[16] Richard Bellman. Dynamic Programming. 1st ed. Princeton, NJ, USA: Princeton University Press, 1957.

[17] Olivier Sigaud and Olivier Buffet. Markov Decision Processes in Artificial Intelligence. Wiley-IEEE Press, 2010. isbn: 1848211678, 9781848211674.

[18] Joelle Pineau, Geoff Gordon, and Sebastian Thrun. "Point-based Value Iteration: An Anytime Algorithm for POMDPs." In: Proceedings of the 18th International Joint Conference on Artificial Intelligence. IJCAI '03. Acapulco, Mexico: Morgan Kaufmann Publishers Inc., 2003, pp. 1025–1030.

[19] James Massey. "Causality, feedback and directed information." Citeseer.

[20] Dean A. Pomerleau. ALVINN, an Autonomous Land Vehicle in a Neural Network. Tech. rep.

[21] Claude Sammut et al. "Learning to Fly." In: Proceedings of the Ninth International Conference on Machine Learning. Morgan Kaufmann, 1992, pp. 385–393.

[22] Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. "Modeling Interaction via the Principle of Maximum Causal Entropy." 2010.

[23] Gerhard Kramer. "Directed information for channels with feedback." PhD thesis. University of Manitoba, Canada, 1998.

[24] Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. "Maximum Margin Planning." In: Proceedings of the 23rd International Conference on Machine Learning. ICML '06. Pittsburgh, Pennsylvania, USA: ACM, 2006, pp. 729–736. isbn: 1-59593-383-2. doi: 10.1145/1143844.1143936.

[25] H. Brendan McMahan, Geoffrey J. Gordon, and Avrim Blum. "Planning in the Presence of Cost Functions Controlled by an Adversary." In: Proceedings of the Twentieth International Conference on Machine Learning. ICML '03. Washington, DC, USA: AAAI Press, 2003, pp. 536–543. isbn: 1-57735-189-4.

[26] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. New York, NY, USA: Cambridge University Press, 2004. isbn: 0521833787.

[27] Kaiser Asif et al. "Adversarial Cost-sensitive Classification." In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. UAI '15. Amsterdam, Netherlands: AUAI Press, 2015, pp. 92–101. isbn: 978-0-9966431-0-8.

[28] Nikos Vlassis, Michael L. Littman, and David Barber. "On the computational complexity of stochastic controller optimization in POMDPs." In: ACM Transactions on Computation Theory (TOCT) 4.4 (2012), p. 12.

[29] Sergey Levine, Zoran Popovic, and Vladlen Koltun. "Nonlinear inverse reinforcement learning with Gaussian processes." In: Advances in Neural Information Processing Systems. 2011, pp. 19–27.