GENERATIVE ADVERSARIAL SELF-IMITATION LEARNING


Under review as a conference paper at ICLR 2019


Anonymous authors
Paper under double-blind review

ABSTRACT

This paper explores a simple regularizer for reinforcement learning by proposing Generative Adversarial Self-Imitation Learning (GASIL), which encourages the agent to imitate past good trajectories via the generative adversarial imitation learning framework. Instead of directly maximizing rewards, GASIL focuses on reproducing past good trajectories, which can potentially make long-term credit assignment easier when rewards are delayed. GASIL can be easily combined with any policy gradient objective by using GASIL as a learned reward shaping function. Our experimental results show that GASIL improves the performance of proximal policy optimization on 2D Point Mass and MuJoCo environments with delayed reward and stochastic dynamics.

1 INTRODUCTION

A major component of reinforcement learning (RL) is the temporal credit assignment problem, which amounts to figuring out which action in a state leads to a better outcome in the future. Different RL algorithms have different forms of objectives to solve this problem. For example, policy gradient approaches learn to directly adapt the policy to optimize the RL objective (i.e., maximizing cumulative rewards), while value-based approaches (e.g., Q-Learning (Watkins & Dayan, 1992)) estimate long-term future rewards and induce a policy from them. Policies optimized for different objectives may have different learning dynamics, which end up with different sub-optimal policies in complex environments, though all of these objectives are designed to maximize cumulative rewards.

In this paper, we explore a simple regularizer for RL, called Generative Adversarial Self-Imitation Learning (GASIL). Instead of directly maximizing rewards, GASIL aims to imitate past good trajectories that the agent has generated, using the generative adversarial imitation learning framework (Ho & Ermon, 2016). GASIL solves the temporal credit assignment problem by learning a discriminator which discriminates between the agent’s current trajectories and good trajectories in the past, while the policy is trained to make it hard for the discriminator to distinguish between the two types of trajectories by imitating good trajectories. GASIL can potentially make long-term temporal credit assignment easier when the reward signal is delayed, because reproducing certain trajectories is often much easier than maximizing long-term delayed rewards. GASIL can be interpreted as an optimal reward learning algorithm (Singh et al., 2009; Sorg et al., 2010), where the discriminator acts as a learned reward function which provides dense rewards for the agent to reproduce relatively better trajectories. Thus, it can be used as a shaped reward function and combined with any RL algorithm.

Our empirical results on 2D Point Mass and OpenAI Gym MuJoCo tasks (Brockman et al., 2016; Todorov et al., 2012) show that GASIL improves the performance of proximal policy optimization (PPO) (Schulman et al., 2017), especially when rewards are delayed. We also show that GASIL is robust to stochastic dynamics to some extent in practice.

2 RELATED WORK

Generative adversarial learning Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been increasingly popular for generative modeling. In GANs, a discriminator is trained to discriminate whether a given sample is drawn from the data distribution or the model distribution. A generator (i.e., model) is trained to “fool” the discriminator by generating samples that are close to the real data. This adversarial play allows the model distribution to match the data distribution. This approach has been very successful for image generation and manipulation (Radford et al., 2015;


Reed et al., 2016; Zhu et al., 2017). Recently, Ho & Ermon (2016) proposed generative adversarial imitation learning (GAIL), which extends this idea to imitation learning. In GAIL, a discriminator is trained to discriminate between optimal trajectories (or expert trajectories) and policy trajectories, while the policy learns to fool the discriminator by imitating optimal trajectories. Our work further extends this idea to RL. Unlike the GAN or GAIL setting, however, optimal trajectories are not available to the agent in RL. Instead, our GASIL treats “relatively better trajectories” that the policy has generated as optimal trajectories that the agent should imitate.

Reward learning Singh et al. (2009) discussed the problem of learning an internal reward function that is useful across a distribution of environments in an evolutionary context. Similarly, Sorg et al. (2010) introduced the optimal reward problem under the motivation that the true reward function defined in the environment may not be optimal for learning, and there exists an optimal reward function that allows learning the desired behavior much more quickly. This claim is consistent with the idea of reward shaping (Ng et al., 1999), which helps learning without changing the optimal policy. There have been a few attempts to learn such an internal reward function without domain-specific knowledge in the deep RL context (Sorg et al., 2010; Guo et al., 2016; Zheng et al., 2018). Our work is closely related to this line of work in that GASIL learns a discriminator which acts as an internal reward function that allows the agent to learn to maximize external rewards more easily.

Self-imitation There has been a line of work that introduces the notion of learning and inducing a good policy by focusing on good experiences that the agent has generated. For example, episodic control (Lengyel & Dayan, 2008; Blundell et al., 2016; Pritzel et al., 2017) and the nearest neighbor policy (Mansimov & Cho, 2017) construct a non-parametric policy directly from past experience by retrieving similar states in the past and following the best decision made in the past. Instead, our work aims to learn a parametric policy from past good experiences. Self-imitation has been shown to be useful for program synthesis (Liang et al., 2016; Abolafia et al., 2018), where the agent is trained to reproduce the K best programs it has generated. Our work proposes a different objective based on the generative adversarial learning framework and evaluates it on RL benchmarks. More recently, Goyal et al. (2018) proposed to learn a generative model of the states preceding high-value states (i.e., top-K trajectories) and update a policy to follow the generated trajectories. In contrast, our GASIL directly learns to imitate past good trajectories without learning a generative model. GASIL can be viewed as a generative adversarial extension of self-imitation learning (Oh et al., 2018), which updates the policy and the value function towards past better trajectories. Contemporaneously with our work, Gangwani et al. (2018) independently proposed the same method as our GASIL. Most of the previous works listed above, including ours, may not guarantee policy improvement in certain types of stochastic environments due to their bias towards positive outcomes, though they have been shown to work well on existing benchmarks when used as regularizers. Looking forward, handling stochasticity in this type of approach with stronger theoretical guarantees would be an interesting future direction.

3 BACKGROUND

Throughout the paper, we consider a finite state space S and a finite action space A. The goal of RL is to find a policy π ∈ Π : S × A → [0, 1] which maximizes the discounted sum of rewards: $\eta(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where γ is a discount factor and $r_t$ is the reward at time-step t.

Alternatively, we can re-write the RL objective η(π) in terms of occupancy measure. The occupancy measure ρπ ∈ D : S × A → R is defined as $\rho_\pi(s, a) = \pi(a|s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$. Intuitively, it is a joint distribution of states and actions visited by the agent following the policy π. It is shown that there is a one-to-one correspondence between the set of policies (Π) and the set of valid occupancy measures (D = {ρπ : π ∈ Π}) (Syed et al., 2008). This allows us to write the RL objective in terms of occupancy measure as follows:

$$\eta(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] = \sum_{s,a} \rho_\pi(s, a)\, r(s, a) \tag{1}$$

where r(s, a) is the reward for choosing action a in state s. Thus, policy optimization amounts to finding an optimal occupancy measure which maximizes rewards, due to the one-to-one correspondence between them.
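As a concrete illustration of Equation 1, the following toy sketch (our own example, not from the paper) numerically checks the identity on a randomly generated two-state MDP: the return obtained from a Bellman solve matches the occupancy-measure-weighted sum of rewards.

```python
import numpy as np

# Toy numerical check of Eq. (1) on a random 2-state, 2-action MDP.
# All quantities below are made up for illustration.
rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition probabilities P[s, a, s']
r = rng.standard_normal((nS, nA))               # reward r(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)        # policy pi(a|s)
mu0 = np.array([1.0, 0.0])                      # initial state distribution

P_pi = np.einsum("sa,sap->sp", pi, P)           # state transition matrix under pi
r_pi = np.einsum("sa,sa->s", pi, r)             # expected per-state reward under pi

# Left-hand side: eta(pi) = E_pi[sum_t gamma^t r_t] via a Bellman solve.
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
eta_lhs = mu0 @ V

# Right-hand side: rho_pi(s, a) = pi(a|s) * sum_t gamma^t P(s_t = s | pi).
d = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, mu0)   # discounted state visitation
rho = pi * d[:, None]
eta_rhs = np.sum(rho * r)

assert np.isclose(eta_lhs, eta_rhs)
```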


3.1 POLICY GRADIENT

Policy gradient methods compute the gradient of the RL objective $\eta(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. Since η(πθ) is non-differentiable with respect to the parameter θ when the dynamics of the environment are unknown, policy gradient methods rely on the score function estimator to get an unbiased gradient estimator of η(πθ). A typical form of the policy gradient objective is given by:

$$J^{PG}(\theta) = \mathbb{E}_{\pi_\theta}\left[\log \pi_\theta(a_t|s_t) A_t\right] \tag{2}$$

where πθ is a policy parameterized by θ, and At is an advantage estimate at time t. Intuitively, the policy gradient objective either increases the probability of the action when the return is higher than expected (At > 0) or decreases the probability when the return is lower than expected (At < 0).
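For reference, a minimal sketch of a surrogate loss whose gradient matches Equation 2 is shown below (our own illustration in PyTorch, not the authors' code; `log_probs` and `advantages` are assumed to be precomputed):

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # log_probs: log pi_theta(a_t|s_t) for a batch of visited state-action pairs.
    # advantages: advantage estimates A_t, treated as constants (no gradient flows through them).
    # Minimizing this surrogate with an optimizer performs ascent on Eq. (2).
    return -(log_probs * advantages.detach()).mean()
```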

3.2 GENERATIVE ADVERSARIAL IMITATION LEARNING

Generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016) is an imitation learning algorithm which aims to learn a policy that can imitate expert trajectories using the idea from generative adversarial networks (GAN) (Goodfellow et al., 2014). More specifically, the objective of GAIL for maximum entropy IRL (Ziebart et al., 2008) is defined as:

$$\operatorname*{argmin}_{\theta}\, \operatorname*{argmax}_{\phi}\; L_{GAIL}(\theta, \phi) = \mathbb{E}_{\pi_\theta}[\log D_\phi(s, a)] + \mathbb{E}_{\pi_E}[\log(1 - D_\phi(s, a))] - \lambda H(\pi_\theta) \tag{3}$$

where πθ, πE are a policy parameterized by θ and an expert policy, respectively. Dφ(s, a) : S × A → [0, 1] is a discriminator parameterized by φ. H(π) = Eπ[− log π(a|s)] is the entropy of the policy. Similar to GANs, the discriminator and the policy play an adversarial game by either maximizing or minimizing the objective LGAIL, and the gradient of each component is given by:

$$\nabla_\phi L_{GAIL} = \mathbb{E}_{\tau_\pi}[\nabla_\phi \log D_\phi(s, a)] + \mathbb{E}_{\tau_E}[\nabla_\phi \log(1 - D_\phi(s, a))] \tag{4}$$
$$\nabla_\theta L_{GAIL} = \mathbb{E}_{\tau_\pi}[\nabla_\theta \log D_\phi(s, a)] - \lambda \nabla_\theta H(\pi_\theta) \tag{5}$$
$$\hphantom{\nabla_\theta L_{GAIL}} = \mathbb{E}_{\tau_\pi}[\nabla_\theta \log \pi_\theta(a|s)\, Q(s, a)] - \lambda \nabla_\theta H(\pi_\theta), \tag{6}$$

where Q(s, a) = Eτπ[log Dφ(s, a) | s0 = s, a0 = a], and τπ, τE are trajectories sampled from πθ and πE, respectively. Intuitively, the discriminator Dφ is trained to discriminate between the policy’s trajectories (τπ) and the expert’s trajectories (τE) through a cross entropy loss. On the other hand, the policy πθ is trained to fool the discriminator by generating trajectories that are close to the expert trajectories according to the discriminator. Since log Dφ(s, a) is non-differentiable with respect to θ in Equation 5, the score function estimator is used to compute the gradient, which leads to a form of policy gradient (Equation 6) using the discriminator as a reward function.
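A minimal sketch of the discriminator update in Equation 4 is given below (PyTorch, with our own names and architecture, not the authors' implementation). Following the sign convention of Equations 3-4, policy samples are labeled 1 and expert samples 0, so ascending Equation 4 corresponds to minimizing this cross entropy loss:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Outputs a logit; the sigmoid of the logit is D_phi(s, a)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))

def discriminator_step(disc, optimizer, policy_obs, policy_act, expert_obs, expert_act):
    bce = nn.BCEWithLogitsLoss()
    policy_logits = disc(policy_obs, policy_act)
    expert_logits = disc(expert_obs, expert_act)
    # Cross-entropy form of Eq. (4): push D_phi -> 1 on policy pairs, D_phi -> 0 on expert pairs.
    loss = bce(policy_logits, torch.ones_like(policy_logits)) + \
           bce(expert_logits, torch.zeros_like(expert_logits))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```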

It has been shown that GAIL amounts to minimizing the Jensen-Shannon divergence between the policy’s occupancy measure and the expert’s (Ho & Ermon, 2016; Goodfellow et al., 2014) as follows:

$$\operatorname*{argmin}_{\theta}\, \operatorname*{argmax}_{\phi}\; L_{GAIL}(\theta, \phi) = \operatorname*{argmin}_{\theta}\; D_{JS}(\rho_{\pi_\theta} \,\|\, \rho_{\pi_E}) - \lambda H(\pi_\theta) \tag{7}$$

where DJS(p||q) = DKL(p||(p + q)/2) + DKL(q||(p + q)/2) denotes the Jensen-Shannon divergence, a measure of distance between two distributions, which is minimized when p = q.

4 GENERATIVE ADVERSARIAL SELF-IMITATION LEARNING

The main idea of Generative Adversarial Self-Imitation Learning (GASIL) is to update the policy to imitate past good trajectories using the GAIL framework (see Section 3.2 for GAIL). We describe the details of GASIL in Section 4.1 and make a connection between GASIL and reward learning in Section 4.2, which leads to a combination of GASIL with policy gradient in Section 4.3.

4.1 ALGORITHM

The key idea of GASIL is to treat good trajectories collected by the agent as trajectories that the agent should imitate, as described in Algorithm 1. More specifically, GASIL performs the following two updates in each iteration.


Algorithm 1 Generative Adversarial Self-Imitation Learning

Initialize policy parameter θ
Initialize discriminator parameter φ
Initialize good trajectory buffer B ← ∅
for each iteration do
    Sample policy trajectories τπ ∼ πθ
    Update good trajectory buffer B using τπ
    Sample good trajectories τE ∼ B
    Update the discriminator parameter φ via gradient ascent with:
        ∇φ L_GASIL = E_{τπ}[∇φ log Dφ(s, a)] + E_{τE}[∇φ log(1 − Dφ(s, a))]    (8)
    Update the policy parameter θ via gradient descent with:
        ∇θ L_GASIL = E_{τπ}[∇θ log πθ(a|s) Q(s, a)] − λ ∇θ H(πθ),
            where Q(s, a) = E_{τπ}[log Dφ(s, a) | s0 = s, a0 = a]    (9)
end for
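The structure of Algorithm 1 can be summarized in code roughly as follows (a schematic sketch with the individual updates passed in as callables; all names and signatures are our own illustration, not the authors' code):

```python
from typing import Callable, List, Tuple

Trajectory = List[Tuple[object, object]]  # a list of (state, action) pairs

def gasil_iteration(
    sample_trajectories: Callable[[], List[Tuple[Trajectory, float]]],          # rollouts with returns
    update_buffer: Callable[[Trajectory, float], None],                         # keep top-K episodes in B
    sample_buffer: Callable[[], List[Trajectory]],                              # tau_E ~ B
    discriminator_step: Callable[[List[Trajectory], List[Trajectory]], float],  # ascent on Eq. (8)
    policy_step: Callable[[List[Trajectory]], float],                           # descent on Eq. (9)
) -> None:
    rollouts = sample_trajectories()                 # tau_pi ~ pi_theta
    for trajectory, episode_return in rollouts:
        update_buffer(trajectory, episode_return)    # update B using tau_pi
    good_trajectories = sample_buffer()
    policy_trajectories = [trajectory for trajectory, _ in rollouts]
    discriminator_step(policy_trajectories, good_trajectories)
    policy_step(policy_trajectories)
```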

Updating good trajectory buffer (B) GASIL maintains a good trajectory buffer B = {τi} that contains a few trajectories (τi) that obtained high rewards in the past. Each trajectory consists of a sequence of states and actions: s0, a0, s1, a1, ..., sT. We define ‘good trajectories’ as any trajectories whose discounted sum of rewards is higher than the expected return of the current policy. Though there can be many different ways to obtain such trajectories, we propose to store the top-K episodes according to the return $R = \sum_{t=0}^{\infty} \gamma^t r_t$.
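A minimal sketch of such a top-K buffer is shown below (our own illustration; the class name and interface are assumptions, not the paper's code):

```python
import heapq

class GoodTrajectoryBuffer:
    """Keeps the K highest-return episodes seen so far."""
    def __init__(self, k: int):
        self.k = k
        self._heap = []       # min-heap of (return, counter, trajectory)
        self._counter = 0     # tie-breaker so trajectories themselves are never compared

    def add(self, trajectory, episode_return: float) -> None:
        item = (episode_return, self._counter, trajectory)
        self._counter += 1
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, item)
        elif episode_return > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)   # evict the lowest-return episode

    def trajectories(self):
        return [trajectory for _, _, trajectory in self._heap]
```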

Updating discriminator (Dφ) and policy (πθ) The agent learns to imitate good trajectories contained in the good trajectory buffer B using generative adversarial imitation learning. More formally, the discriminator (Dφ(s, a)) and the policy (πθ(a|s)) are updated via the following objective:

$$\operatorname*{argmin}_{\theta}\, \operatorname*{argmax}_{\phi}\; L_{GASIL}(\theta, \phi) = \mathbb{E}_{\tau_\pi}[\log D_\phi(s, a)] + \mathbb{E}_{\tau_E \sim B}[\log(1 - D_\phi(s, a))] - \lambda H(\pi_\theta) \tag{10}$$

where τπ, τE are trajectories sampled from the policy πθ and the good trajectory buffer B, respectively. Intuitively, the discriminator is trained to discriminate between good trajectories and the policy’s trajectories, while the policy is trained to make it difficult for the discriminator to distinguish them by imitating good trajectories.
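The policy-side update of Equation 9 can be sketched as an ordinary policy-gradient step that treats −log Dφ(s, a) as a per-step reward (as discussed in Section 4.2); the snippet below is our own simplification with illustrative names, omitting the entropy term for brevity:

```python
import torch
import torch.nn.functional as F

def gasil_policy_loss(log_pi: torch.Tensor, disc_logits: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    # log_pi: [T] log pi_theta(a_t|s_t) along one policy trajectory.
    # disc_logits: [T] discriminator logits for the same (s_t, a_t) pairs.
    with torch.no_grad():
        step_reward = -F.logsigmoid(disc_logits)          # -log D_phi(s, a)
        returns = torch.zeros_like(step_reward)
        running = torch.zeros(())
        for t in reversed(range(step_reward.shape[0])):   # discounted sum of future rewards
            running = step_reward[t] + gamma * running
            returns[t] = running
    # Minimizing this surrogate performs gradient descent on Eq. (9) (entropy term omitted).
    return -(log_pi * returns).mean()
```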

4.2 CONNECTION TO REWARD LEARNING

The discriminator in GASIL can be interpreted as a reward function that the policy optimizes, because Equation 9 uses the score function estimator to maximize the reward given by −log Dφ(s, a). In other words, the policy is updated to maximize the discounted sum of rewards given by the discriminator rather than the true reward from the environment. Since the discriminator is also learned, GASIL can be viewed as an instance of an optimal reward learning algorithm (Sorg et al., 2010). A potential benefit of GASIL is that the optimal discriminator can provide intermediate rewards to the policy along good trajectories, even if the true reward from the environment is delayed. In such a scenario, GASIL can allow the agent to learn more easily than with the true reward function. Indeed, as we will show in Section 5.4, GASIL performs significantly better than a state-of-the-art policy gradient baseline in a delayed reward setting.

4.3 COMBINING WITH POLICY GRADIENT

As the discriminator can be interpreted as a learned internal reward function, it can be easily combined with any RL algorithm. In this paper, we explore a combination of the GASIL objective and the policy gradient objective (Equation 2) as follows:

$$\nabla_\theta J^{PG} - \alpha \nabla_\theta L_{GASIL} = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) A^{\alpha}_t + \lambda \nabla_\theta H(\pi_\theta)\right] \tag{11}$$

where $A^{\alpha}_t$ is an advantage estimate computed using the modified reward function $r_\alpha(s, a) \triangleq r(s, a) - \alpha \log D_\phi(s, a)$. Intuitively, the discriminator is used to shape the reward function to encourage the policy to imitate good trajectories.
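A sketch of this shaped reward is given below (our own function names; `disc` is assumed to return a logit whose sigmoid is Dφ(s, a), and shapes are assumed to match):

```python
import torch

def shaped_reward(env_reward: torch.Tensor, disc, obs: torch.Tensor, act: torch.Tensor,
                  alpha: float = 0.1, eps: float = 1e-8) -> torch.Tensor:
    with torch.no_grad():
        d = torch.sigmoid(disc(obs, act)).clamp(min=eps, max=1.0 - eps)
    # State-action pairs that the discriminator judges close to the good trajectories
    # get D_phi -> 0, so the bonus -alpha * log D_phi is large.
    return env_reward - alpha * torch.log(d)
```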


5 EXPERIMENTS

The experiments are designed to answer the following questions: (1) What is learned by GASIL? (2) Is GASIL better than the behavior cloning approach? (3) Is GASIL competitive with the policy gradient method? (4) Is GASIL complementary to the policy gradient method when combined with it?

5.1 IMPLEMENTATION DETAILS

We implemented the following agents:

• PPO: Proximal policy optimization (PPO) baseline (Schulman et al., 2017).

• PPO+BC: PPO with additional behavior cloning to top-K trajectories.

• PPO+SIL: PPO with self-imitation learning (Oh et al., 2018).

• PPO+GASIL: Our method using both the discriminator and the true reward (Section 4.3).

The details of the network architectures and hyperparameters are described in the appendix. Our implementation is based on OpenAI’s PPO and GAIL implementations (Dhariwal et al., 2017).

5.2 2D POINT MASS

Figure 1: Learning curve on 2D point mass (PPO vs. PPO+GASIL). See text for details.

To better understand how GASIL works, we implemented a simple 2D point mass environment with continuous actions that determine the velocity of the agent in a 2D space, as illustrated in Figure 2. In this environment, the agent should collect as many blue/green objects as possible, which give positive rewards (5 and 10 respectively), while avoiding distractor objects (orange) that give negative rewards (-5). There is also an actuation cost proportional to the L2-norm of the action, which discourages large velocities.

The result in Figure 1 shows that PPO tends to learn a sub-optimal policy quickly. Although PPO+GASIL learns slowly in the early stage, it finds a better policy at the end of learning compared to PPO.

Figure 2 visualizes the learning progress of GASIL together with the learned discriminator. It shows that the initial top-K trajectories collect several positive objects as well as distractors in the top area of the environment. This encourages the policy to explore the top area, because GASIL encourages the agent to imitate those top-K trajectories. As visualized in the third row of Figure 2, the discriminator learns to assign higher rewards to state-actions that are close to the top-K trajectories, which strongly encourages the policy to imitate such trajectories. As training progresses and the policy improves, the agent finds better trajectories that avoid distractors while collecting positive rewards. The good trajectory buffer is updated accordingly as the agent collects such trajectories, which is in turn used to train the discriminator. The interaction between the policy and the discriminator converges to a sub-optimal policy which collects two green objects.

In contrast, Figure 3 visualizes the learning progress of PPO. Even though the agent collected the same top-K trajectories at the beginning as in PPO+GASIL (compare the first columns of Figure 2 and Figure 3), the policy trained with the PPO objective quickly converges to a sub-optimal policy which collects only one green object, depending on initial positions. We conjecture that this is because the policy gradient objective (Eq 2) with the true reward function strongly encourages collecting nearby positive rewards and discourages collecting negative rewards. Thus, once the agent learns a sub-optimal behavior as shown in Figure 3, the true reward function discourages further exploration due to the distractors (orange objects) near the green objects and the actuation cost.

On the other hand, our GASIL objective neither explicitly encourages nor discourages the agent to collect positive/negative rewards, because the discriminator gives the agent internal rewards according to how close the agent’s trajectories are to the top-K trajectories, regardless of whether it collects certain objects or not. Though this can possibly slow down learning, it can often help find a better policy in the end, depending on the task, as shown in this experiment. This result also implies that directly learning to maximize the true reward, as in the policy gradient method, may not always lead to the best behavior due to the learning dynamics.


Figure 2: Visualization of the GASIL policy on 2D point mass. The first two rows show the agent’s trajectories and top-K trajectories at different training steps from left to right. The third row visualizes the learned discriminator reward at the corresponding training steps. Each arrow shows the best action at each position of the agent, i.e., the action for which the discriminator gives the highest reward. The transparency of each arrow represents the magnitude of the discriminator reward (higher transparency represents lower reward).

Figure 3: Visualization of the PPO policy (policy trajectories and top-K trajectories) on 2D point mass. Compared to GASIL (Figure 2), PPO tends to prematurely learn a worse policy.

5.3 MUJOCO

To further investigate how well GASIL performs on complex control tasks, we evaluated it on OpenAI Gym MuJoCo tasks (Brockman et al., 2016; Todorov et al., 2012).[1] The result in Figure 4 shows that GASIL improves PPO on most of the tasks. This indicates that the GASIL objective can be complementary to the PPO objective, and the learned reward acts as a useful reward shaping that makes learning easier.

It is also shown that GASIL significantly outperforms the behavior cloning baseline (‘PPO+BC’) on most of the tasks. Behavior cloning has been shown to require a larger amount of samples than GAIL to imitate well (Ho & Ermon, 2016). This can be even more crucial in the RL context because there are not many good trajectories in the buffer (e.g., 1K-10K samples). Besides, GASIL also outperforms self-imitation learning (‘PPO+SIL’) (Oh et al., 2018), showing that our generative adversarial approach is more sample-efficient than self-imitation learning. In fact, self-imitation learning can be viewed as a type of behavior cloning with different sample weights according to the advantages, which may be the reason why GASIL is more sample-efficient. Another possible reason

[1] The demo video of the learned policies is available at https://youtu.be/AwrtIUS2_pc.


Figure 4: Learning curves on OpenAI Gym MuJoCo tasks (Swimmer-v2, Hopper-v2, HalfCheetah-v2, Walker2d-v2, Ant-v2, Humanoid-v2) averaged over 10 independent runs, comparing PPO, PPO+BC, PPO+SIL, and PPO+GASIL. The x-axis and y-axis correspond to the number of steps and average reward.

Figure 5: Learning curves on stochastic Walker2d-v2 averaged over 10 independent runs, comparing PPO and PPO+GASIL. The leftmost plot shows the learning curves on the original task without any noise in the environment. The other plots show learning curves on the stochastic Walker2d-v2 task where Gaussian noise with standard deviation of {0.05, 0.1, 0.5} (from left to right) is added to the observation at each step independently.

would be that GASIL generalizes better than the behavior cloning method under non-stationary data (i.e., the good trajectory buffer changes over time).

We further investigated how robust GASIL is to the stochasticity of the environment by adding Gaussian noise to the observation at each step on Walker2d-v2. The result in Figure 5 shows that the gap between PPO and PPO+GASIL is larger when noise is added to the environment. This result suggests that GASIL can be robust to stochastic environments to some extent in practice.
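This noisy-observation setup can be reproduced with a simple wrapper like the one below (our own sketch, assuming the classic Gym API where step returns (obs, reward, done, info)):

```python
import gym
import numpy as np

class NoisyObservationWrapper(gym.Wrapper):
    """Adds i.i.d. Gaussian noise to every observation."""
    def __init__(self, env, noise_std: float = 0.1):
        super().__init__(env)
        self.noise_std = noise_std

    def _noisy(self, obs):
        return obs + np.random.normal(0.0, self.noise_std, size=np.shape(obs))

    def reset(self, **kwargs):
        return self._noisy(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._noisy(obs), reward, done, info
```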

5.4 DELAYED MUJOCO

OpenAI Gym MuJoCo tasks provide dense reward signals to the agent according to the agent’s progress along desired directions. In order to see how useful GASIL is under more challenging reward structures, we modified the tasks by delaying the reward of the MuJoCo tasks for 20 steps. In other words, the agent receives an accumulated reward only after every 20 steps or when the episode terminates. This modification makes it much more difficult for the agent to learn due to the delayed reward signal.
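A sketch of this modification as an environment wrapper is shown below (our own illustration, again assuming the classic Gym step API):

```python
import gym

class DelayedRewardWrapper(gym.Wrapper):
    """Accumulates rewards and releases them every `delay` steps or at episode end."""
    def __init__(self, env, delay: int = 20):
        super().__init__(env)
        self.delay = delay
        self._accumulated = 0.0
        self._t = 0

    def reset(self, **kwargs):
        self._accumulated, self._t = 0.0, 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._accumulated += reward
        self._t += 1
        if done or self._t % self.delay == 0:
            delayed_reward, self._accumulated = self._accumulated, 0.0
        else:
            delayed_reward = 0.0
        return obs, delayed_reward, done, info
```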

The result in Figure 6 shows that GASIL is much more helpful on delayed-reward MuJoCo tasks compared to the non-delayed ones in Figure 4, improving PPO on all tasks by a large margin. This result demonstrates that GASIL is especially useful for dealing with delayed rewards, because the discriminator gives dense reward signals to the policy even though the true reward is extremely delayed.

5.5 EFFECT OF HYPERPARAMETERS

Figure 7 shows the effect of GASIL hyperparameters on Walker2d-v2. Specifically, Figure 7a shows the effect of the size of the good trajectory buffer in terms of the maximum number of steps in the buffer. It turns out that the agent performs poorly when the buffer size is too small (500 steps) or too large (5000 steps).


Figure 6: Learning curves on delayed-reward versions of OpenAI Gym MuJoCo tasks (DelayedSwimmer-v2, DelayedHopper-v2, DelayedHalfCheetah-v2, DelayedWalker2d-v2, DelayedAnt-v2, DelayedHumanoid-v2) averaged over 10 independent runs, comparing PPO, PPO+BC, PPO+SIL, and PPO+GASIL. The x-axis and y-axis correspond to the number of steps and average reward.

Figure 7: Effect of GASIL hyperparameters on Walker2d-v2. (a) Buffer size: good trajectory buffer sizes of 500, 1000, 2000, 3000, and 5000 steps. (b) Discriminator updates: 1, 5, 20, and 40 discriminator updates per batch.

Although it is always useful to have more samples for imitation learning in general, the average return of the good trajectories decreases as the size of the buffer increases. This indicates that there is a trade-off between the number of samples and the quality of good trajectories.

Figure 7b shows the effect of the number of discriminator updates with a fixed number of policy updates per batch. It is shown that too small or too large a number of discriminator updates hurts the performance. This result is also consistent with GANs (Goodfellow et al., 2014), where the balance between the discriminator and the generator (i.e., the policy) is crucial for performance.

6 DISCUSSIONS

Alternative ways of training the discriminator We presented a simple way of training the discriminator: discriminating between top-K trajectories and policy trajectories. However, there can be many different ways of defining good trajectories and training the discriminator. Developing a more principled way of training the discriminator with strong theoretical guarantees would be an important direction for future work.

Dealing with multi-modal trajectories In the experiments, we used a Gaussian policy with an independent covariance. This type of parameterization has been shown to have difficulties in learning diverse behaviors (Haarnoja et al., 2018; 2017). In GASIL, we observed that the good trajectory buffer (B) often contains multi-modal trajectories because they are collected by different policies with different parameters over time. We observed that a Gaussian policy struggles to imitate them reliably. In fact, there have been recent studies (Hausman et al., 2017; Li et al., 2017) that aim to imitate multi-modal behaviors using the GAIL framework. We believe that combining such methods would further improve the performance.


Model-based approach We used a model-free GAIL framework which requires policy gradient for training the policy. However, our idea can be extended to model-based GAIL (MGAIL) (Baram et al., 2017), where the policy is updated by directly backpropagating through a learned discriminator and a learned dynamics model. Since MGAIL has been shown to be more sample-efficient than GAIL, we expect that a model-based counterpart of GASIL would also improve the performance.

7 CONCLUSION

This paper proposed Generative Adversarial Self-Imitation Learning (GASIL) as a simple regularizer for RL. The main idea is to imitate good trajectories that the agent has collected using the generative adversarial learning framework. We demonstrated that GASIL significantly improves existing state-of-the-art baselines across many control tasks, especially when rewards are delayed. Extending this work towards a more principled generative adversarial learning approach with theoretical guarantees would be an interesting research direction.

REFERENCES

Daniel A Abolafia, Mohammad Norouzi, and Quoc V Le. Neural program synthesis with priority queue training. arXiv preprint arXiv:1801.03526, 2018.

Nir Baram, Oron Anschel, Itai Caspi, and Shie Mannor. End-to-end differentiable adversarial imitation learning. In International Conference on Machine Learning, pp. 390–399, 2017.

Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Tanmay Gangwani, Qiang Liu, and Jian Peng. Learning self-imitating diverse policies. arXiv preprint arXiv:1805.10309, 2018.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Anirudh Goyal, Philemon Brakel, William Fedus, Timothy Lillicrap, Sergey Levine, Hugo Larochelle, and Yoshua Bengio. Recall traces: Backtracking models for efficient reinforcement learning. arXiv preprint arXiv:1804.00379, 2018.

Xiaoxiao Guo, Satinder P. Singh, Richard L. Lewis, and Honglak Lee. Deep learning for reward design to improve Monte Carlo tree search in Atari games. In IJCAI, 2016.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In ICML, 2017.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

Karol Hausman, Yevgen Chebotar, Stefan Schaal, Gaurav Sukhatme, and Joseph J Lim. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 1235–1245, 2017.

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.


Máté Lengyel and Peter Dayan. Hippocampal contributions to control: the third way. In Advances in Neural Information Processing Systems, pp. 889–896, 2008.

Yunzhu Li, Jiaming Song, and Stefano Ermon. InfoGAIL: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems, pp. 3815–3825, 2017.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. arXiv preprint arXiv:1611.00020, 2016.

Elman Mansimov and Kyunghyun Cho. Simple nearest neighbor policy method for continuous control tasks. In NIPS Deep Reinforcement Learning Symposium, 2017.

Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.

Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. In ICML, 2018.

Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. arXiv preprint arXiv:1703.01988, 2017.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

Satinder Singh, Richard L Lewis, and Andrew G Barto. Where do rewards come from. In Proceedings of the Annual Conference of the Cognitive Science Society, pp. 2601–2606, 2009.

Jonathan Sorg, Richard L Lewis, and Satinder P Singh. Reward design via online gradient ascent. In Advances in Neural Information Processing Systems, pp. 2190–2198, 2010.

Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, pp. 1032–1039. ACM, 2008.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. arXiv preprint arXiv:1804.06459, 2018.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.


A EXPERIMENTS ON ATARI GAMES

Figure 8: Learning curves on hard exploration Atari games averaged over 3 independent runs. x-axis and y-axiscorrespond to the number of steps and average reward.

                              Montezuma   Freeway    Hero   PrivateEye   Gravitar   Frostbite
PPO                                  20        34   30645          145       2406         915
PPO+GASIL (Ours)                    629        34   21830        15099       2141        8276
A2C+SIL (Oh et al., 2018)          2500        34   33069         8684       2722        6439

Table 1: Comparison to A2C+SIL (Oh et al., 2018) on hard exploration Atari games.

To see how well GASIL performs with a richer observation space, we evaluated it on the hard exploration Atari games used in Oh et al. (2018). The result in Figure 8 shows that GASIL significantly improves PPO on Frostbite, Montezuma’s Revenge, and PrivateEye. This suggests that GASIL is a useful RL regularizer that can be generally applied to a variety of domains. We further compared PPO+GASIL with A2C+SIL (Oh et al., 2018) in Table 1. It turns out that PPO+GASIL does not outperform A2C+SIL, though this is not a fair comparison as they use different actor-critic algorithms (i.e., A2C and PPO). In fact, GAIL (Ho & Ermon, 2016) has not been shown to be efficient in this type of domain, where observations are high-dimensional. Thus, we conjecture that GASIL is more beneficial than SIL particularly for continuous control with simple observation spaces, as shown in our MuJoCo experiments.


B HYPERPARAMETERS

Hyperparameters and architectures used for the MuJoCo experiments are described in Table 2. We performed a random search over the range of hyperparameters specified in Table 2. For GASIL+PPO on Humanoid-v2, the policy is trained with PPO (α = 0) for the first 2M steps, and α is increased to 0.02 until 3M steps. For the rest of the tasks, including all delayed-MuJoCo tasks, we used a fixed α throughout training.

Table 2: GASIL hyperparameters on MuJoCo.

Hyperparameter                                   Value
Architecture                                     FC(64)-FC(64)
Learning rate                                    {0.0003, 0.0001, 0.00005, 0.00003}
Horizon                                          2048
Number of epochs                                 10
Minibatch size                                   64
Discount factor (γ)                              0.99
GAE parameter                                    0.95
Entropy regularization coefficient (λ)           0

Discriminator minibatch size                     128
Number of discriminator updates per batch        {1, 5, 10, 20}
Discriminator learning rate                      {0.0003, 0.0001, 0.00002, 0.00001}
Size of good trajectory buffer (steps)           {1000, 10000}
Scale of discriminator reward (α)                {0.02, 0.1, 0.2, 1}
