
Mix & Match – Agent Curricula for Reinforcement Learning

Wojciech Marian Czarnecki * 1 Siddhant M. Jayakumar * 1 Max Jaderberg 1 Leonard Hasenclever 1

Yee Whye Teh 1 Simon Osindero 1 Nicolas Heess 1 Razvan Pascanu 1

Abstract

We introduce Mix & Match (M&M) – a training framework designed to facilitate rapid and effective learning in RL agents, especially those that would be too slow or too challenging to train otherwise. The key innovation is a procedure that allows us to automatically form a curriculum over agents. Through such a curriculum we can progressively train more complex agents by, effectively, bootstrapping from solutions found by simpler agents. In contradistinction to typical curriculum learning approaches, we do not gradually modify the tasks or environments presented, but instead use a process to gradually alter how the policy is represented internally. We show the broad applicability of our method by demonstrating significant performance gains in three different experimental setups: (1) We train an agent able to control more than 700 actions in a challenging 3D first-person task; using our method to progress through an action-space curriculum we achieve both faster training and better final performance than one obtains using traditional methods. (2) We further show that M&M can be used successfully to progress through a curriculum of architectural variants defining an agent's internal state. (3) Finally, we illustrate how a variant of our method can be used to improve agent performance in a multitask setting.

1 Introduction

The field of deep reinforcement learning has seen significant advances in the recent past. Innovations in environment design have led to a range of exciting, challenging and visually rich 3D worlds (e.g. Beattie et al., 2016; Kempka et al., 2016; Brockman et al., 2016).

*Equal contribution. 1DeepMind, London, UK. Correspondence to: Wojciech M. Czarnecki <[email protected]>, Siddhant M. Jayakumar <[email protected]>.

Figure 1. Scheme of Mix & Match – each box represents a policy. The blue πmm is the control policy, optimised with the true RL objective. White boxes represent policies used for the curriculum, while the red policy is the final agent.

These have in turn led to the development of more complex agent architectures and necessitated massively parallelisable policy gradient and Q-learning based RL algorithms (e.g. Mnih et al., 2016; Espeholt et al., 2018). While the efficacy of these methods is undeniable, the problems we consider increasingly require more powerful models, complex action spaces and challenging training regimes for learning to occur.

Curriculum learning is a powerful instrument in the deep learning toolbox (e.g. Bengio et al., 2009; Graves et al., 2017). In a typical setup, one trains a network sequentially on related problems of increasing difficulty, with the end goal of maximizing final performance on a desired task. However, such task-oriented curricula pose some practical difficulties for reinforcement learning. For instance, they require a certain understanding of and control over the generation process of the environment, such that simpler variants of the task can be constructed. And in situations where this is possible, it is not always obvious how to construct useful curricula – simple intuitions from learning in humans do not always apply to neural networks. Recent work (e.g. Sutskever & Zaremba, 2014; Graves et al., 2017) proposes randomised or automated curricula to circumvent some of these issues with some success.



In this paper, instead of curricula over task variants, we consider an alternate formulation – namely a curriculum over variants of agents. We are interested in training a single final agent, and in order to do so we leverage a series of intermediate agents that differ structurally in the way in which they construct their policies (Fig. 1). Agents in such curricula are not arranged according to architectural complexity, but rather training complexity. While these complexity measures are often aligned, they are sometimes orthogonal (e.g. it is often faster to train a complex model on two distinct tasks than a simpler model on them jointly). In contrast to a curriculum over tasks, our approach can be applied to a wide variety of problems where we do not have the ability to modify the underlying task specification or design. However, in domains where traditional curricula are applicable, these two methods can easily be combined.

The primary contribution of this work is thus to motivate and provide a principled approach for training with curricula over agents.

Mix & Match: An overview

In the Mix & Match framework, we treat multiple agents of increasing learning complexity as one M&M agent, which acts with a mixture of policies from its constituent agents (Fig. 1). Consequently it can be seen as an ensemble or a mixture-of-experts agent, which is used solely for the purpose of training. Additionally, knowledge transfer (i.e. distillation) is used such that we encourage the complex agents to match the simpler ones early on. The mixing coefficient is controlled such that ultimately only the complex, target agent is used for generating experience. Note that we consider the complexity of an agent not just in terms of the depth or size of its network (see Section 4.2), but with reference to the difficulty in training it from scratch (see Sections 4.1 and 4.3). We also note that while analogous techniques to ours can be applied to train mixtures of experts/policies (i.e. maximising performance across agents), this is not the focus of the present research; here our focus is to train a final target agent.

Training with the Mix & Match framework confers several potential benefits – for instance performance maximisation (either with respect to score or data efficiency), or enabling effective learning in otherwise hard-to-train models. With reference to this last point, M&M might be particularly beneficial in settings where real-world constraints (inference speed, memory) demand the use of certain specific final models.

2 Related work

Curriculum learning is a long-standing idea in machine learning, with mentions as early as the work of Elman (Elman, 1993). In its simplest form, pretraining and finetuning is a form of curriculum, widely explored (e.g. Simonyan & Zisserman, 2014). More explicitly, several works look at the importance of a curriculum for neural networks (e.g. Bengio et al., 2009). In many works, the focus is on constructing a sequence of tasks of increasing difficulty. More recent work (Graves et al., 2017; Sutskever & Zaremba, 2014), however, looks at automating task selection or employing a mixture of difficulties at each stage in training. We propose to extend this idea and apply it instead to training agents in curricula – in keeping with the spirit of recent ideas of mixtures of tasks (here, models).

The recent work on Net2Net (Chen et al., 2016) proposes a technique to increase the capacity of a model without changing the underlying function it represents. In order to achieve this, the architectures have to be supersets/subsets of one another and be capable of expressing identity mappings. Follow-up work (Wei et al., 2016) extends these ideas further. Both approaches can be seen as implicitly constructing a form of curriculum over the architecture, as a narrow architecture is first trained, then morphed into a wider one.

Related to this idea is the concept of knowledge transfer or distillation (Hinton et al., 2015; Ba & Caruana, 2014) – a technique for transferring the functional behaviour of a network into a different model, regardless of the structure of the target or source networks. While initially proposed for model compression (Buciluǎ et al., 2006; Ba & Caruana, 2014), in (Parisotto et al., 2016; Rusu et al., 2016) distillation is used to compress multiple distinct policies into a single one. Distral (Teh et al., 2017) instead focuses on learning independent policies, which use a co-distilled centralised agent as a communication channel.

Our work borrows and unifies several of these threads, with a focus on online, end-to-end training of model curricula from scratch.

3 Method details

We first introduce some notation to more precisely describe our framework. Let us assume we are given a sequence of trainable agents¹ (with corresponding policies π1, ..., πK, each parametrised by some θi ⊂ θ, which can share some parameters) ordered according to the complexity of interest (i.e. π1 can be a policy using a tiny neural network while πK is the very complex one). The aim is to train πK, while all remaining agents are there to induce faster/easier learning.

¹ For simplicity of notation we omit the time dependence of all random variables; however, we do consider a time-extended setting. In particular, when we talk about the policy of agent i we refer to this agent's policy at a given time (which will change in the next step).


Furthermore, let us introduce the categorical random variable c ∼ Cat(1, ..., K | α) (with probability mass function p(c = i) = αi) which will be used to select a policy at a given time:

$$\pi_{mm}(a \mid s) = \sum_{i=1}^{K} \alpha_i \, \pi_i(a \mid s).$$

The point of Mix & Match is to allow curriculum learning; consequently we need the probability mass function (pmf) of c to change over time. Initially the pmf should have α1 = 1, and near the end of training αK = 1, thus allowing a curriculum of policies from the simple π1 to the target one, πK. Note that αi has to be adjusted in order to control the learning dynamics and to maximise overall learning performance, rather than an immediate increase. Consequently it should be trained in a way which maximises a long-lasting increase of performance (as opposed to gradient-based optimisation, which tends to be greedy and focus on immediate rewards).

We further note that mixing of policies is necessary but not sufficient to obtain curriculum learning – even though (for a non-Dirac-delta c) gradients always flow through multiple policies, there is nothing causing them to actually share knowledge. In fact, this sort of mixture of experts is inherently competitive rather than cooperative (Jacobs et al., 1991). In order to address this issue we propose using a distillation-like cost D, which will align the policies together:

$$L_{mm}(\theta) = \sum_{i,j=1}^{K} D\big(\pi_i(\cdot \mid \cdot, \theta_i),\ \pi_j(\cdot \mid \cdot, \theta_j),\ i, j, \alpha\big).$$

The specific implementation of the above cost will vary from application to application. In the following sections we look at a few possible approaches.

The final optimisation problem we consider is just a weighted sum of the original RL loss L_RL (i.e. A3C (Mnih et al., 2016)), applied to the control policy πmm, and the knowledge transfer loss:

$$L(\theta) = L_{RL}(\theta \mid \pi_{mm}) + \lambda L_{mm}(\theta).$$
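For concreteness, the objective can be assembled as in the minimal PyTorch-style sketch below. This is an illustration rather than the authors' implementation: `rl_loss_fn` and `transfer_loss_fn` are stand-in callables, and only the policy mixing and the weighted sum follow the equations above.

```python
import torch

def mixture_policy(policy_probs, alphas):
    """pi_mm(a|s) = sum_i alpha_i * pi_i(a|s).

    policy_probs: list of K tensors of shape [batch, num_actions], one per constituent agent.
    alphas: tensor of shape [K] on the probability simplex (the pmf of the selector c).
    """
    stacked = torch.stack(policy_probs, dim=0)            # [K, batch, num_actions]
    return (alphas.view(-1, 1, 1) * stacked).sum(dim=0)   # [batch, num_actions]

def mm_objective(policy_probs, alphas, rl_loss_fn, transfer_loss_fn, lam):
    """L(theta) = L_RL(theta | pi_mm) + lambda * L_mm(theta)."""
    pi_mm = mixture_policy(policy_probs, alphas)
    return rl_loss_fn(pi_mm) + lam * transfer_loss_fn(policy_probs, alphas)
```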

We now describe in more detail each module required to implement Mix & Match, starting with policy mixing, then knowledge transfer, and finally α adjustment.

3.1 Policy mixing

There are two equivalent views of the proposed policy mixing element – one can either think about having the categorical selector random variable c described before, or an explicit mixing of the policies. The expected gradients of both are the same:

$$\mathbb{E}_{i \sim c}\left[\nabla_\theta \pi_i(a \mid s, \theta)\right] = \sum_{i=1}^{K} \alpha_i \nabla_\theta \pi_i(a \mid s, \theta) = \nabla_\theta \sum_{i=1}^{K} \alpha_i \pi_i(a \mid s, \theta) = \nabla_\theta \pi_{mm}(a \mid s, \theta),$$

however, if one implements the method by actually sampling from c and then executing a given policy, the resulting single gradient update will be different from the one obtained by explicitly mixing the policy. From this perspective it can be seen as a Monte Carlo estimate of the mixed policy; thus, for the sake of variance reduction, we use explicit mixing in all experiments in this paper.
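The two views can be contrasted directly. In the sketch below (hypothetical helper names, PyTorch), sampling the selector c yields a Monte Carlo estimate of the mixed policy, while explicit mixing computes the expectation in closed form; the latter is what the paper uses for variance reduction.

```python
import torch

def act_by_sampling_c(policy_probs, alphas):
    """Monte Carlo view: draw c ~ Cat(alpha) and act with a single constituent policy."""
    c = torch.distributions.Categorical(probs=alphas).sample()
    return policy_probs[c.item()]

def act_with_explicit_mixture(policy_probs, alphas):
    """Explicit view: act with the full mixture; same expected gradient, lower variance."""
    stacked = torch.stack(policy_probs, dim=0)            # [K, batch, num_actions]
    return (alphas.view(-1, 1, 1) * stacked).sum(dim=0)
```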

3.2 Knowledge transfer

For simplicity we consider the case of K = 2, but all of the following methods have a natural extension to an arbitrary number of policies. Also, for notational simplicity we drop the dependence of the losses or policies on θ when it is obvious from context.

Consider the problem of ensuring that the final policy π2 matches the simpler policy π1, while having access to samples from the control policy πmm. For simplicity, we define our M&M loss over the trajectories directly, similarly to the unsupervised auxiliary losses (Jaderberg et al., 2017b; Mirowski et al., 2017); thus we put:

$$L_{mm}(\theta) = \frac{1-\alpha}{|S|} \sum_{s \in S} \sum_{t=1}^{|s|} D_{KL}\big(\pi_1(\cdot \mid s_t)\ \|\ \pi_2(\cdot \mid s_t)\big), \qquad (1)$$

where the trajectories (s ∈ S) are sampled from the control policy. The 1 − α term is introduced so that the distillation cost disappears when we switch to π2.² This is similar to the original policy distillation (Rusu et al., 2016); however, here the control policy is a mixture of π2 (the student) and π1 (the teacher).
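A minimal sketch of Eq. (1) for a single on-policy trajectory follows, assuming per-step action probabilities from π1 and π2 are available as PyTorch tensors; whether gradients are also allowed to flow into the teacher π1 is a separate design choice (the appendix notes that the authors do allow it).

```python
import torch

def mm_distillation_loss(pi1_probs, pi2_probs, alpha, eps=1e-8):
    """(1 - alpha) * mean_t KL( pi_1(.|s_t) || pi_2(.|s_t) ) along one sampled trajectory.

    pi1_probs, pi2_probs: tensors of shape [T, num_actions] (probabilities, not logits).
    """
    kl_per_step = (pi1_probs * (torch.log(pi1_probs + eps)
                                - torch.log(pi2_probs + eps))).sum(dim=-1)
    return (1.0 - alpha) * kl_per_step.mean()
```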

One can use a memory buffer to store S and ensure that targets do not drift too much (Ross et al., 2011). In such a setting, under reasonable assumptions, one can prove the convergence of π2 to π1 given enough capacity and experience.

Remark 1. Let us assume we are given a set of N trajectories from some predefined mix πmm = (1 − α)π1 + απ2 for any fixed α ∈ (0, 1), and a big enough neural network with a softmax output layer as π2. Then in the limit as N → ∞, the minimisation of Eq. 1 converges to π1 if the optimiser used is globally convergent when minimising cross entropy over a finite dataset.

² It can also be justified as a distillation of the mixture policy; see the appendix for a derivation.


Proof. Given in the appendix.

In practice we have found that minimising this loss in an online manner (i.e. using the current on-policy trajectory as the only sample for Eq. (1)) works well in the considered applications.

3.3 Adjusting α through training

An important component of the proposed method is how to set the values of α through time. For simplicity let us again consider the case of K = 2, where one needs just a single α (as c now comes from a Bernoulli distribution), which we treat as a function of time t.

Hand-crafted schedule Probably the most common approach is to define a schedule by hand. Unfortunately, this requires per-problem fitting, which might be time consuming. Furthermore, while designing an annealing schedule is simple (given that we provide enough flat regions so that RL training is stable), following this path might miss the opportunity to learn a better policy using non-monotonic switches in α.
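For reference, a hand-crafted schedule can be as simple as a flat warm-up followed by a linear ramp; the sketch below is purely illustrative (the step counts are arbitrary placeholders, not values used in the paper).

```python
def handcrafted_alpha(step, warmup_steps=1_000_000, anneal_steps=5_000_000):
    """alpha = 0 (act with pi_1) during warm-up, then ramp linearly towards 1 (act with pi_K)."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / float(anneal_steps))
```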

Online hyperparameter tuning Since α changes through time, one cannot use typical hyperparameter tuning techniques (like grid search or simple Bayesian optimisation), as the space of possible values is exponential in the number of timesteps (α = (α(1), ..., α(T)) ∈ Δ_{K−1}^T, where Δ_k denotes a k-dimensional simplex). One possible technique to achieve this goal is the recently proposed Population Based Training (PBT) (Jaderberg et al., 2017a), which keeps a population of agents, trained in parallel, in order to optimise hyperparameters through time (without the need of ever reinitialising networks). For the rest of the paper we rely on PBT for α adaptation, and discuss it in more detail in the next section.

3.4 Population based training and M&M

Population Based Training (PBT) is a recently proposed learning scheme which performs online adaptation of hyperparameters in conjunction with parameter optimisation and a form of online model selection. As opposed to many classical hyperparameter optimisation schemes, the ability of PBT to modify hyperparameters throughout a single training run makes it possible to discover powerful adaptive strategies, e.g. auto-tuned learning rate annealing schedules.

The core idea is to train a population of agents in parallel, which periodically query each other to check how well they are doing relative to others. Badly performing agents copy the weights (neural network parameters) of stronger agents and perform local modifications of their hyperparameters. This way, poorly performing agents are used to explore the hyperparameter space.

From a technical perspective, one needs to define two functions – eval, which measures how strong a current agent is, and explore, which defines how to perturb the hyperparameters. As a result of such runs we obtain agents maximising the eval function. Note that when we refer to an agent in the PBT context we actually mean the M&M agent, which is already a mixture of constituent agents.

We propose to use one of two schemes, depending on the characteristics of the problem we are interested in. If the models considered show a clear benefit (in terms of performance) from switching from the simple to the more complex model, then all one needs to do is provide eval with the performance (i.e. reward over k episodes) of the mixed policy. For the explore function for α, we randomly add or subtract a fixed value (truncating between 0 and 1). Thus, once there is a significant benefit to switching to the more complex one, PBT will do it automatically. On the other hand, often we want to switch from an unconstrained architecture to some specific, heavily constrained one (where there may not be an obvious benefit in performance from switching). In such a setting, as is the case when training a multitask policy from constituent single-task policies, we can make eval an independent evaluation job which only looks at the performance of an agent with αK = 1. This way we directly optimise for the final performance of the model of interest, but at the cost of additional evaluations needed for PBT.
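The two PBT ingredients for α might look roughly as follows; the 0.05 step and the 30-episode window mirror the appendix, while everything else (names, plumbing) is an assumption of this sketch.

```python
import random

def explore_alpha(alpha, step=0.05):
    """Randomly nudge alpha up or down and truncate to [0, 1]."""
    return min(1.0, max(0.0, alpha + random.choice([-step, step])))

def eval_control_policy(recent_returns, window=30):
    """eval variant 1: average return of the control (mixture) policy over recent episodes."""
    tail = recent_returns[-window:]
    return sum(tail) / max(1, len(tail))

def eval_final_agent(evaluate_with_alpha_one):
    """eval variant 2: score an independent evaluation run of the agent with alpha_K = 1."""
    return evaluate_with_alpha_one()
```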

4 Experiments

We now test and analyse our method in three sets of RL experiments. We train all agents with a form of batched actor-critic with an off-policy correction (Espeholt et al., 2018), using DeepMind Lab (Beattie et al., 2016) as an environment suite. This environment offers a range of challenging 3D, first-person view based tasks (see the appendix) for RL agents. Agents perceive 96 × 72 pixel RGB observations and can move, rotate, jump and tag built-in bots.

We start by demonstrating how M&M can be used to scale to a large and complex action space. We follow this with results on scaling the complexity of the agent architecture, and finally on the problem of learning a multitask policy. In all of the following sections we do not force α to approach 1; instead we initialise it around 0 and analyse its adaptation through time. Unless otherwise stated, the eval function returns the rewards of the control policy averaged over the last 30 episodes. Note that even though in the experimental sections we use K = 2, the actual curriculum goes through potentially infinitely many agents, being a result of mixing between π1 and π2. Further technical details and descriptions of all tasks are provided in the appendix.


(a) M&M for action space progression (b) M&M for architecture progression (c) M&M for multitask progression

Figure 2. Schemes of the various settings M&M can be applied to, explored in this paper. Violet nodes represent modules that are shared between the πi, grey ones are separate modules for the helper agents, and red ones are modules which are unique to the final agent. Blue nodes are the ones that are exposed to the environment – control policy(ies) and value function(s).


4.1 Curricula over number of actions used

DeepMind Lab provides the agent with a complex action space, represented as a 6-dimensional vector. Two of these action groups are very high resolution (the rotation and looking up/down actions), allowing up to 1025 values. The remaining four groups are low-resolution actions, such as the ternary action of moving forward, backward or not moving at all, shooting or not shooting, etc. If naively approached, this leads to around 4 · 10^13 possible actions at each timestep.

Even though this action space is defined by the environment, practitioners usually use an extremely reduced subset of the available actions (Mnih et al., 2016; Espeholt et al., 2018; Jaderberg et al., 2017b; Mirowski et al., 2017) – from 9 to 23 preselected ones. When referring to action spaces we mean the subset of possible actions for which the agent's policy provides a non-zero probability. Smaller action spaces significantly simplify the exploration problem and introduce a strong inductive bias into the action space definition. However, having such a tiny subset of possible movements can be harmful for the final performance of the agent. Consequently, we apply M&M to this problem of scaling action spaces. We use 9 actions to construct π1, the simple policy (called Small action space). This is only used to guide learning of our final agent, which in this case uses 756 actions – these are all possible combinations of the available actions in the environment (when limiting the agent to 5 values of rotation about the z-axis, and 3 values about the x-axis). Similarly to research in continuous control using diagonal Gaussian distributions (Heess et al., 2017), we use a factorised policy (and thus assume conditional independence given the state) to represent the joint distribution

$$\pi_2(a_1, a_2, \ldots, a_6 \mid s) := \prod_{j=1}^{6} \pi_j(a_j \mid s),$$

which we refer to as Big action space. In order to be able to mix these two policies we map π1's actions onto the corresponding ones in the action space of π2 (which is a strict superset of π1).
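To make the two policies mixable in practice, the big policy can be represented as a factorised distribution over the six action groups, and the 9 small-space actions can be lifted into the same joint space. The sketch below is an illustrative assumption about shapes and the index mapping, not the paper's code.

```python
import torch

def big_policy_probs(group_logits):
    """Factorised policy: pi_2(a_1, ..., a_6 | s) = prod_j pi_j(a_j | s).

    group_logits: list of 6 tensors, each of shape [batch, n_j] (logits per action group).
    Returns the joint distribution flattened to shape [batch, n_1 * ... * n_6].
    """
    joint = None
    for logits in group_logits:
        probs = torch.softmax(logits, dim=-1)
        joint = probs if joint is None else (joint.unsqueeze(-1) * probs.unsqueeze(1)).flatten(1)
    return joint

def lift_small_policy(small_probs, small_to_big_index, big_size):
    """Embed the 9-action policy into the big joint action space (zero mass elsewhere)."""
    lifted = torch.zeros(small_probs.shape[0], big_size)
    lifted[:, small_to_big_index] = small_probs
    return lifted
```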

We use a simple architecture of a convolutional network followed by an LSTM, analogous to previous works in this domain (Jaderberg et al., 2017b). For M&M we share all elements of the two agents apart from the final linear transformation into the policy/value functions (Fig. 2a). Full details of the experimental hyperparameters can be found in the appendix, and on each figure we show the average over 3 runs for each result.

We see that the small action space leads to faster learning but hampers final performance as compared to the big action space (Fig. 3 and Fig. 4). Mix & Match applied to this setting gets the best of both worlds – it learns fast, and not only matches, but surpasses the final performance of the big action space. One possible explanation for this increase is the better exploration afforded by the small action space early on, which allows agents to fully exploit their flexibility of movement.

We further compare two variants of our method. We first investigate M&M (Shared Head) – in this approach, we share weights in the final layer for those actions that are common to both policies. This is achieved by masking the factorised policy π2 and renormalising accordingly. We further consider a variant of our distillation cost – when computing the KL between π1 and π2, one can also mask this loss such that π2 is not penalised for assigning non-zero probabilities to actions outside the support of π1 – M&M (Masked KL). Consistently across the tested levels, both the Shared Head and Masked KL approaches achieve comparable or worse performance than the original formulation. It is worth noting, however, that if M&M were to be applied to a non-factorised complex action space, the Masked KL might prove beneficial, as it would then be the only signal encouraging the agent to explore the new actions.


Figure 3. Comparison of M&M and its variations to baselines applied to the problem of scaling to complex action spaces. Each figure shows the results for a single DMLab level. Each curve shows the mean over 3 populations, each consisting of 10 agents, with random seeds and hyperparameters as described in the appendix. M&M represents the formulation with mixing and a KL distillation cost. The masked KL version only incorporates a KL cost on those actions present in both policies, while the shared head variant shares weights for these common actions directly. All variants of M&M outperform the baselines in data efficiency and performance.

Figure 4. Human normalised score through training in the action space experiments, averaged across levels (for per-level curves see Fig. 3).


Figure 5. Left: An example α value through time from a single run of an action space experiment. Right: Progression of the marginal distribution of actions taken by the agent. Notice how the collision entropy $-\mathbb{E}\big[\log \sum_a \pi_{mm}^2(a \mid s)\big]$ grows over time.

When plotting α through time (Fig. 5, left) we see that the agent switches fully to the big action space early on, thus showing that the small action space was useful only for the initial phase of learning. This is further confirmed by looking at how varied the actions taken by the agent are through training. Fig. 5 (right) shows how the marginal distribution over actions evolves through time. We see that new actions are unlocked through training, and further that the final distribution is more entropic than the initial one.

4.2 Curricula over agent architecture

Another possible curriculum is over the main computational core of the agent. We use an architecture analogous to the one used in the previous sections, but for the simple or initial agent we substitute the LSTM with a linear projection from the processed convolutional signal onto a 256-dimensional latent space. We share both the convolutional modules as well as the policy/value function projections (Fig. 2b). We use a 540-element action space, and a factorised policy as described in the previous section.
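Schematically, the curriculum over cores can be wired as below: a shared convolutional encoder, two interchangeable cores (a linear/feedforward one for π1 and an LSTM for π2), and a shared policy head applied to either core's output. Module names and sizes are assumptions loosely following the appendix, not the exact configuration.

```python
import torch
import torch.nn as nn

class TwoCoreAgent(nn.Module):
    """Shared encoder and policy head; a feedforward core (for pi_1) and an LSTM core (for pi_2)."""

    def __init__(self, num_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(hidden), nn.ReLU())
        self.ff_core = nn.Linear(hidden, hidden)            # core of the simple agent
        self.lstm_core = nn.LSTMCell(hidden, hidden)        # core of the target agent
        self.policy_head = nn.Linear(hidden, num_actions)   # shared head (value head analogous)

    def forward(self, obs, lstm_state):
        z = self.encoder(obs)                               # shared convolutional features
        h_ff = torch.relu(self.ff_core(z))
        h_lstm, c_lstm = self.lstm_core(z, lstm_state)
        pi1 = torch.softmax(self.policy_head(h_ff), dim=-1)
        pi2 = torch.softmax(self.policy_head(h_lstm), dim=-1)
        return pi1, pi2, (h_lstm, c_lstm)
```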

We ran experiments on four problems in the DM Lab environment, focusing on various navigation tasks. On one hand, reactive policies (which can be represented solely by an FF policy) should learn reasonably quickly to move around and explore, while on the other hand, recurrent networks (which have memory) are needed to maximise the final performance – by either learning to navigate new maze layouts (Explore Object Location Small) or avoiding (seeking) explored unsuccessful (successful) paths through the maze.

As one can see from the average human normalised performance plot (Fig. 7), M&M applied to the transition between FF and LSTM cores does lead to a significant improvement in final performance (a 20% increase in human normalised performance over the tasks of interest). It is, however, no longer as fast as the FF counterpart. In order to investigate this phenomenon we ran multiple ablation experiments (Fig. 6). In the first one, denoted FF+LSTM, we use a skip connection which simply adds the activations of the FF core and LSTM core before passing them to a single linear projection for the policy/value heads. This enriched architecture does improve the performance of the LSTM-only model; however, it usually learns even slower and has very similar learning dynamics to M&M.


Figure 6. Comparison of the M&M agent and various baselines on four DM Lab levels. Each curve represents the average of 3 independent runs of 10 agents each (used for population based training). FF and LSTM represent the feedforward and LSTM baselines respectively, while FF+LSTM is a model with both cores and a skip connection. FF+LSTM is thus a significantly bigger model than the others, possibly explaining the outlier on the LT level. The LSTM&LSTM experiment shows M&M applied with two LSTM agents.

Figure 7. Human normalised score through training in the agent core experiments, averaged across levels (for per-level curves see Fig. 6).

Consequently, this strongly suggests that M&M's lack of an initial speedup comes from the fact that it is architecturally more similar to the skip connection architecture. Note that FF+LSTM is, however, a significantly bigger model (which appears to be helpful on the LT Horseshoe Color task).

Another question of interest is whether the benefit truly comes from the two core types, or simply through some sort of regularisation effect introduced by the KL cost. To test this hypothesis we also ran an M&M-like model but with 2 LSTM cores (instead of the feedforward). This model significantly underperforms all other baselines in speed and performance. This seems to suggest that the distillation or KL cost on its own is not responsible for any benefits we are seeing, and rather it is the full proposed Mix & Match method.

Finally, if we look at the progression of the mixing coefficient (α) through time (Fig. 8), we notice once again quick switches on the navigation-like tasks (all curves except the green one). However, there are two interesting observations to be made. First, the laser tag level, which requires a lot of reactiveness in the policy, takes much longer to switch to the LSTM core (however, it does so eventually). This might be related to the complexity of the level, which has pickup gadgets as well as many opponents, making memory useful much later in training.

Figure 8. Example α values through time from single runs of four DMLab levels. We find that our PBT optimisation finds curricula of different lengths as suited to the target problem.

Secondly, for the simple goal-finding task in a fixed maze (Nav maze static 01, the blue curve) the agent first rapidly switches to the LSTM, but then, more or less mid-training, switches back to the mixture policy (α ≈ 0.5), before finally switching completely again towards the end of training. This particular behaviour is possible due to the use of unconstrained α adaptation with PBT – thus, depending on the current performance, the agent can go back and forth through the curriculum, which for this particular problem seems to be needed.

4.3 Curricula for multitask

As a final proof of concept we consider the task of learning a single policy capable of solving multiple RL problems at the same time. The basic approach for this sort of task is to train a model in a mixture of environments or, equivalently, to train a shared model in multiple environments in parallel (Teh et al., 2017; Espeholt et al., 2018). However, this sort of training can suffer from two drawbacks. First, it is heavily reward-scale dependent, and will be biased towards high-reward environments. Second, environments that are easy to train provide a lot of updates for the model and consequently can also bias the solution towards themselves.

To demonstrate this issue we use three DeepMind Lab environments.


Figure 9. Performance of M&M applied to the multitask domain on the three problems considered. Note that this is the performance of an agent trained jointly on all these tasks. The x-axis counts frames across tasks and the y-axis shows score per episode.

One is Explore Object Locations Small, which has high rewards and a steep initial learning curve (due to lots of reward signal coming from gathering apples). The two remaining ones are challenging laser tag levels (described in detail in the appendix). In both of these problems training is hard, as the agent is interacting with other bots as well as complex mechanics (pick-up bonuses, tagging floors, etc.).

We see in Fig. 9 that the multitask solution focuses on solving the navigation task, while performing comparatively poorly on the more challenging problems. To apply M&M to this problem we construct one agent per environment (each acting as π1 from the previous sections) and then one centralised "multitask" agent (π2 from the previous sections). Crucially, agents share convolutional layers but have independent LSTMs. Training is done in a multitask way, but the control policy in each environment is again a mixture between the task-specific πi (the specialist) and πmt (the centralised agent); see Fig. 2c for details. Since it is no longer beneficial to switch to the centralised policy, we use the performance of πmt (i.e. the central policy) as the optimisation criterion (eval) for PBT, instead of the control policy.
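In code, the multitask wiring reduces to a per-task mixture between that task's specialist and the central agent, with the PBT eval targeting the central agent alone; the names and data layout below are assumptions, and only the mixing and evaluation choice follow the text.

```python
def multitask_control_policy(task_id, specialist_probs, central_probs, alpha):
    """Control policy on task k: (1 - alpha) * pi_k (specialist) + alpha * pi_mt (central).

    specialist_probs: mapping from task id to that task's policy probabilities
    (array-like objects supporting scalar arithmetic, e.g. torch tensors).
    """
    return (1.0 - alpha) * specialist_probs[task_id] + alpha * central_probs

def pbt_eval_central_agent(central_only_returns):
    """eval for PBT: average return of the central agent alone (alpha = 1),
    as measured by a separate evaluation worker."""
    return sum(central_only_returns) / max(1, len(central_only_returns))
```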

We evaluate the performance of both the mixture and the centralised agent independently. Fig. 9 shows the per-task performance of the proposed method. One can notice much more uniform performance – the M&M agent learns to play well in both challenging laser tag environments, while slightly sacrificing performance on the single navigation task. One of the reasons for this success is the fact that knowledge transfer is done in policy space, which is invariant to reward scaling. While the agent can still focus purely on high-reward environments once it has switched to using only the central policy, this inductive bias in training with M&M ensures a much higher minimum score.

5 Conclusions

We have demonstrated that the proposed method – Mix & Match – is an effective training framework to both improve final performance and accelerate the learning process for complex agents in challenging environments. This is achieved by constructing an implicit curriculum over agents of different training complexities. The collection of agents is bound together as a single composite whole using a mixture policy. Information can be shared between the components via shared experience or shared architectural elements, and also through a distillation-like KL-matching loss. Over time the component weightings of this mixture are adapted such that at the end of training we are left with a single active component consisting of the most complex agent – our main agent of interest from the outset. From an implementation perspective, the proposed method can be seen as a simple wrapper (the M&M wrapper) that is compatible with existing agent architectures and training schemes; as such, it could easily be introduced as an additional element in conjunction with a wide variety of on- or off-policy RL algorithms. In particular we note that, despite our focus on policy-based agents in this paper, the principles behind Mix & Match are also easily applied to value-based approaches such as Q-learning.

By leveraging M&M training, we are able to train complex agents much more effectively and in much less time than is possible if one were to attempt to train such an agent without the support of our methods. The diverse applications presented in this paper support the generality of our approach. We believe our training framework could help the community unlock the potential of powerful, but hitherto intractable, agent variants.

Acknowledgements

We would like to thank Raia Hadsell, Koray Kavukcuoglu, Lasse Espeholt and Iain Dunning for their invaluable comments, advice and support.


Figure 10. Example tasks of interest from the DM Lab environment (Beattie et al., 2016). From left: Nav maze static 02 – the task involves finding apples (+1 reward) and a final goal (+10 reward) in a fixed maze. Every time the agent finds the goal it respawns in a random location, and all objects respawn too. Explore Object Locations Small – the task is to navigate through a 3D maze and eat all the apples (each gives +1 reward). Once this is completed, the task restarts. Each new episode differs in terms of apple locations, map layout as well as visual theme. Lt Horseshoe Color – the task is a game of laser tag, where the player tries to tag as many of the highly skilled built-in bots as possible, while using pick-up gadgets (which enhance tagging capabilities).

Appendix

A Network architectures

The default network architecture consists of:

• Convolutional layer with 16 8x8 kernels of stride 4

• ReLU

• Convolutional layer with 32 4x4 kernels of stride 2

• ReLU

• Linear layer with 256 neurons

• ReLU

• Concatenation with the one-hot encoded last action and last reward

• LSTM core with 256 hidden units

– Linear layer projecting onto policy logits, followed by softmax

– Linear layer projecting onto baseline

Depending on the experiment, some elements are shared and/or replaced as described in the text.

B PBT (Jaderberg et al., 2017a) details

In all experiments PBT controls the adaptation of three hyperparameters: α, the learning rate and the entropy cost regularisation. We use populations of size 10.

The explore operator for the learning rate and entropy regularisation is a perturbation operator, which randomly multiplies the corresponding value by 1.2 or 0.8. For α it is an additive operator, which randomly adds or subtracts 0.05 and truncates the result to the [0, 1] interval. Exploration is executed with probability 25%, independently, each time a worker is ready.

The exploit operator copies all the weights and hyperparameters from a randomly selected agent if its performance is significantly better.

A worker is deemed ready to undergo adaptation every 300 episodes.

We use a t-test with a p-value threshold of 5% to decide whether a given agent's performance is significantly better than another's, applied to the returns averaged over the last 30 episodes.
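A compact sketch of these operators follows; the perturbation factors, the 0.05 step, the 25% probability and the t-test mirror the text, while the surrounding plumbing is left abstract. The text can be read as applying exploration with 25% probability per ready event or per hyperparameter; the sketch assumes the latter, and uses `scipy.stats.ttest_ind` as one possible realisation of the significance test.

```python
import random
from scipy import stats

def explore(hypers):
    """Perturb hyperparameters; each is perturbed independently with probability 0.25."""
    new = dict(hypers)
    if random.random() < 0.25:
        new["learning_rate"] *= random.choice([0.8, 1.2])
    if random.random() < 0.25:
        new["entropy_cost"] *= random.choice([0.8, 1.2])
    if random.random() < 0.25:
        new["alpha"] = min(1.0, max(0.0, new["alpha"] + random.choice([-0.05, 0.05])))
    return new

def should_exploit(my_returns, other_returns, window=30, p_threshold=0.05):
    """Copy the other worker only if its recent returns are significantly better (t-test)."""
    mine, theirs = my_returns[-window:], other_returns[-window:]
    _, p_value = stats.ttest_ind(theirs, mine)
    return (sum(theirs) / len(theirs) > sum(mine) / len(mine)) and p_value < p_threshold
```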

Initial distributions of hyperparameters are as follows:

• learning rate: loguniform(1e-5, 1e-3)

• entropy cost: loguniform(1e-4, 1e-2)

• alpha: loguniform(1e-3, 1e-2)

B.1 Single task experiments

The eval function uses πmm rewards.

B.2 Multitask experiments

The eval function uses πmt rewards, which requires a separate evaluation worker per learner.

C M&M details

λ for the action space experiments is set to 1.0, and for the agent core and multitask experiments to 100.0. In all experiments we allow backpropagation through both policies, so that the teacher is also regularised towards the student (and thus does not diverge too quickly), which is similar to the Distral work.


While in principle we could also transfer knowledge between value functions, we did not find it especially helpful empirically, and since it introduces an additional weight to be adjusted, we have not used it in the reported experiments.

D IMPALA (Espeholt et al., 2018) details

We use 100 CPU actors per learner. Each learner is trained with a single K80 GPU card. We use the V-trace correction with truncation as described in the original paper.

Agents are trained with a fixed unroll of 100 steps. Optimisation is performed using RMSProp with a decay of 0.99 and an epsilon of 0.1. The discount factor is set to 0.99, the baseline fitting cost is 0.5, and rewards are clipped at 1. Action repeat is set to 4.
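These settings can be collected into a single configuration record; this is just a restatement of the listed values as an assumed config dict, with illustrative key names that are not taken from the authors' code.

```python
# Assumed configuration record restating the IMPALA training settings above.
impala_config = {
    "actors_per_learner": 100,
    "learner_gpu": "K80",
    "unroll_length": 100,
    "optimizer": "RMSProp",
    "rmsprop_decay": 0.99,
    "rmsprop_epsilon": 0.1,
    "discount": 0.99,
    "baseline_loss_weight": 0.5,
    "reward_clipping": 1.0,
    "action_repeat": 4,
    "off_policy_correction": "v-trace (with truncation)",
}
```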

E Environments

We ran DM Lab using 96 × 72 × 3 RGB observations, at 60 fps.

E.1 Explore Object Locations Small

The task is to find all apples (each giving 1 point) in a procedurally generated maze, where each episode has a different maze, apple locations and visual theme. Collecting all apples resets the environment.

E.2 Nav Maze Static 01/02

Nav Maze Static 01 is a fixed-geometry maze with apples (worth 1 point) and one calabash (worth 10 points; collecting it resets the environment). The agent spawns in a random location, but walls, theme and object positions are held constant.

The only difference for Nav Maze Static 02 is that it is significantly bigger.

E.3 LaserTag Horseshoe Color

A laser tag level against 6 built-in bots in a wide horseshoe-shaped room. There are 5 Orb Gadgets and 2 Disc Gadgets located in the middle of the room, which can be picked up and used for more efficient tagging of opponents.

E.4 LaserTag Chasm

A laser tag level in a square room with Beam Gadgets, Shield Pickups (50 health) and Overshield Pickups (50 armor) hanging above a tagging floor (chasm) splitting the room in half. Jumping is required to reach the items. Falling into the chasm causes the agent to lose 1 point. There are 4 built-in bots.

F Proofs

First let us recall the loss of interest:

$$L_{mm}(\theta) = \frac{1-\alpha}{|S|} \sum_{s \in S} \sum_{t=1}^{|s|} D_{KL}\big(\pi_1(\cdot \mid s_t)\ \|\ \pi_2(\cdot \mid s_t)\big), \qquad (2)$$

where each s ∈ S comes from πmm = (1 − α)π1 + απ2.

Proposition 1. Let us assume we are given a set of N trajectories from some predefined mix πmm = (1 − α)π1 + απ2 for any fixed α ∈ (0, 1), and a big enough neural network with a softmax output layer as π2. Then in the limit as N → ∞, the minimisation of Eq. 1 converges to π1 if the optimiser used is globally convergent when minimising cross entropy over a finite dataset.

Proof. For DN denoting the set of N sampled trajectories over the state space S, let us denote by SN the set of all states in DN, meaning that SN = ∪DN. Since π2 is a softmax-based policy, it assigns non-zero probability to all actions in every state. Consequently πmm does so too, as α ∈ (0, 1). Thus we have

$$\lim_{N \to \infty} S_N = S.$$

Due to following the mixture policy, the actual dataset DN gathered can consist of multiple replicas of each element in SN, in different proportions than one would achieve when following π1. Note however that if we use an optimiser which is capable of minimising the cross entropy over a finite dataset, it can also minimise loss (1) over DN, and thus in particular over SN, which is its strict subset. Since the network is big enough, it will converge to 0 training error:

$$\forall_{s \in S_N}\ \lim_{t \to \infty} D_{KL}\big(\pi_1(\cdot \mid s)\ \|\ \pi_2(\cdot \mid s, \theta_t)\big) = 0,$$

where θt is the solution at the t-th iteration of the optimiser used. Combining the two statements above, we get that in the limit of N and t

$$\forall_{s \in S}\ D_{KL}\big(\pi_1(\cdot \mid s)\ \|\ \pi_2(\cdot \mid s, \theta_t)\big) = 0 \iff \pi_1 = \pi_2.$$

While global convergence might sound like a very strong property, it holds for example when both the teacher and student policies are linear. In general, for deep networks it is hypothesised that if they are big enough and well initialised, they do converge to arbitrarily small training error even if trained with simple gradient descent; thus the above proposition is not too restrictive for deep RL.


G On α-based scaling of the knowledge transfer loss

Let us take a closer look at the proposed loss

$$\ell_{mm}(\theta) = (1-\alpha)\,D_{KL}\big(\pi_1(\cdot \mid s)\ \|\ \pi_2(\cdot \mid s)\big) = (1-\alpha)\,H\big(\pi_1(\cdot \mid s),\ \pi_2(\cdot \mid s)\big) - (1-\alpha)\,H\big(\pi_1(\cdot \mid s)\big),$$

and more specifically at the 1 − α factor. The intuitive justification for this quantity is that it leads to the DKL term gradually disappearing as the M&M agent switches to the final agent. However, one can provide another explanation. Let us instead consider the divergence between the mixed policy and the target policy (which also has the property of being 0 once the agent switches):

$$\hat{\ell}_{mm}(\theta) = D_{KL}\big(\pi_{mm}(\cdot \mid s)\ \|\ \pi_2(\cdot \mid s)\big) = H\big(\pi_{mm}(\cdot \mid s),\ \pi_2(\cdot \mid s)\big) - H\big(\pi_{mm}(\cdot \mid s)\big)$$

$$= H\big((1-\alpha)\pi_1(\cdot \mid s) + \alpha\pi_2(\cdot \mid s),\ \pi_2(\cdot \mid s)\big) - H\big(\pi_{mm}(\cdot \mid s)\big)$$

$$= (1-\alpha)\,H\big(\pi_1(\cdot \mid s),\ \pi_2(\cdot \mid s)\big) + \alpha H\big(\pi_2(\cdot \mid s)\big) - H\big(\pi_{mm}(\cdot \mid s)\big)$$

$$= (1-\alpha)\,H\big(\pi_1(\cdot \mid s),\ \pi_2(\cdot \mid s)\big) - \big(H(\pi_{mm}(\cdot \mid s)) - \alpha H(\pi_2(\cdot \mid s))\big),$$

where the third line uses the linearity of the cross entropy in its first argument.

One can notice that both losses consist of two terms, one being the cross entropy between π1 and π2 and the other being a form of entropy regulariser. Furthermore, these two losses differ only with respect to the regularisation:

$$\ell_{mm}(\theta) - \hat{\ell}_{mm}(\theta) = -(1-\alpha)\,H\big(\pi_1(\cdot \mid s)\big) + \big(H(\pi_{mm}(\cdot \mid s)) - \alpha H(\pi_2(\cdot \mid s))\big)$$

$$= H\big(\pi_{mm}(\cdot \mid s)\big) - \big(\alpha H(\pi_2(\cdot \mid s)) + (1-\alpha)\,H(\pi_1(\cdot \mid s))\big),$$

but since entropy is concave, this quantity is non-negative, meaning that

$$\ell_{mm}(\theta) \geq \hat{\ell}_{mm}(\theta),$$

therefore

$$-(1-\alpha)\,H\big(\pi_1(\cdot \mid s)\big) \geq -\big(H(\pi_{mm}(\cdot \mid s)) - \alpha H(\pi_2(\cdot \mid s))\big).$$

Thus the proposed scheme is almost equivalent to minimising the KL between the mixed policy and π2, but simply with a more severe regularisation factor (and it is thus an upper bound of $\hat{\ell}_{mm}$).

Further research and experiments are needed to assess the quantitative differences between these costs, though. In the preliminary experiments we ran, the difference was hard to quantify – both methods behaved similarly well.

H On the knowledge transfer loss

Throughout this paper we focused on using the Kullback-Leibler divergence for knowledge transfer, DKL(p‖q) = H(p, q) − H(p). For many distillation-related methods this is actually equivalent to minimising the cross entropy (as p is constant); in the M&M case the situation is more complex. When both p and q are learning, DKL provides a two-way effect – on one hand q is pulled towards p, and on the other p is mode-seeking towards q while at the same time being pushed towards the uniform distribution (entropy maximisation). This has two effects: first, it makes it harder for the teacher to get too far ahead of the student (similarly to (Teh et al., 2017; Zhang et al., 2017)); second, the additional entropy term makes it expensive to keep using the teacher, and so switching is preferred.

Another element which has not been covered in depth in this paper is the possibility of deep distillation. Apart from matching policies, one could include inner-activation matching (Parisotto et al., 2016), which could be beneficial for deeper models which do not share modules. Furthermore, to speed up convergence of the distillation one could use Sobolev Training (Czarnecki et al., 2017) and match both the policy and its Jacobian matrix. Since policy matching was enough for the current experiments, none of these methods has been used in this paper; however, for much bigger models and more complex domains they might be a necessity, as M&M depends on the ability to rapidly transfer knowledge between agents.
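As one possible instantiation of the deep-distillation idea mentioned above, an inner-activation matching term in the spirit of Parisotto et al. (2016) could be added alongside the policy KL; this is a speculative sketch, not something evaluated in this paper, and all names and shape assumptions are illustrative.

```python
import torch

def activation_matching_loss(teacher_acts, student_acts):
    """Mean squared error between selected hidden activations of teacher and student.

    teacher_acts, student_acts: lists of tensors with matching shapes (an assumption;
    a learned projection may be needed when the two architectures differ).
    Gradients are stopped on the teacher here, though one could also let them flow
    both ways, as is done for the policy KL in this paper.
    """
    return sum(((t.detach() - s) ** 2).mean() for t, s in zip(teacher_acts, student_acts))
```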

References

Ba, Jimmy and Caruana, Rich. Do deep nets really need to be deep? In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 2654–2662, 2014.
Beattie, Charles, Leibo, Joel Z., Teplyashin, Denis, Ward, Tom, Wainwright, Marcus, Küttler, Heinrich, Lefrancq, Andrew, Green, Simon, Valdés, Víctor, Sadik, Amir, Schrittwieser, Julian, Anderson, Keith, York, Sarah, Cant, Max, Cain, Adam, Bolton, Adrian, Gaffney, Stephen, King, Helen, Hassabis, Demis, Legg, Shane, and Petersen, Stig. DeepMind Lab. CoRR, 2016.
Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In ICML, 2009.
Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Buciluǎ, Cristian, Caruana, Rich, and Niculescu-Mizil, Alexandru. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541. ACM, 2006.
Chen, Tianqi, Goodfellow, Ian J., and Shlens, Jonathon. Net2Net: Accelerating learning via knowledge transfer. ICLR, abs/1511.05641, 2016.
Czarnecki, Wojciech M., Osindero, Simon, Jaderberg, Max, Swirszcz, Grzegorz, and Pascanu, Razvan. Sobolev training for neural networks. In Advances in Neural Information Processing Systems, pp. 4281–4290, 2017.
Elman, Jeffrey. Learning and development in neural networks: The importance of starting small. In Cognition, pp. 71–99, 1993.
Espeholt, Lasse, Soyer, Hubert, Munos, Rémi, Simonyan, Karen, Mnih, Volodymyr, Ward, Tom, Doron, Yotam, Firoiu, Vlad, Harley, Tim, Dunning, Iain, Legg, Shane, and Kavukcuoglu, Koray. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures, 2018.
Graves, Alex, Bellemare, Marc G., Menick, Jacob, Munos, Rémi, and Kavukcuoglu, Koray. Automated curriculum learning for neural networks. CoRR, 2017.
Heess, Nicolas, Sriram, Srinivasan, Lemmon, Jay, Merel, Josh, Wayne, Greg, Tassa, Yuval, Erez, Tom, Wang, Ziyu, Eslami, Ali, Riedmiller, Martin, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
Jacobs, Robert A., Jordan, Michael I., Nowlan, Steven J., and Hinton, Geoffrey E. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
Jaderberg, Max, Dalibard, Valentin, Osindero, Simon, Czarnecki, Wojciech M., Donahue, Jeff, Razavi, Ali, Vinyals, Oriol, Green, Tim, Dunning, Iain, Simonyan, Karen, Fernando, Chrisantha, and Kavukcuoglu, Koray. Population based training of neural networks. CoRR, 2017a.
Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z., Silver, David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. ICLR, 2017b.
Kempka, Michał, Wydmuch, Marek, Runc, Grzegorz, Toczek, Jakub, and Jaśkowski, Wojciech. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In Computational Intelligence and Games (CIG), 2016 IEEE Conference on, pp. 1–8. IEEE, 2016.
Li, Yuanzhi and Yuan, Yang. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pp. 597–607, 2017.
Mirowski, Piotr, Pascanu, Razvan, Viola, Fabio, Soyer, Hubert, Ballard, Andrew J., Banino, Andrea, Denil, Misha, Goroshin, Ross, Sifre, Laurent, Kavukcuoglu, Koray, et al. Learning to navigate in complex environments. ICLR, 2017.
Mnih, Volodymyr, Badia, Adrià Puigdomènech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
Parisotto, Emilio, Ba, Lei Jimmy, and Salakhutdinov, Ruslan. Actor-Mimic: Deep multitask and transfer reinforcement learning. ICLR, 2016.
Ross, Stéphane, Gordon, Geoffrey, and Bagnell, Drew. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635, 2011.
Rusu, Andrei A., Colmenarejo, Sergio Gómez, Gülçehre, Çağlar, Desjardins, Guillaume, Kirkpatrick, James, Pascanu, Razvan, Mnih, Volodymyr, Kavukcuoglu, Koray, and Hadsell, Raia. Policy distillation. 2016.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
Sutskever, Ilya and Zaremba, Wojciech. Learning to execute. CoRR, 2014.
Teh, Yee, Bapst, Victor, Czarnecki, Wojciech M., Quan, John, Kirkpatrick, James, Hadsell, Raia, Heess, Nicolas, and Pascanu, Razvan. Distral: Robust multitask reinforcement learning. In NIPS, 2017.
Wei, Tao, Wang, Changhu, Rui, Yong, and Chen, Chang Wen. Network morphism. In Proceedings of The 33rd International Conference on Machine Learning, pp. 564–572, 2016.
Zhang, Ying, Xiang, Tao, Hospedales, Timothy M., and Lu, Huchuan. Deep mutual learning. arXiv preprint arXiv:1706.00384, 2017.