

ACE: An Actor Ensemble Algorithm for Continuous Control with Tree Search

Shangtong Zhang 1 *, Hao Chen 2, Hengshuai Yao 2

1 Department of Computing Science, University of Alberta; 2 Reinforcement Learning for Autonomous Driving Lab, Noah's Ark Lab, Huawei

[email protected], {hao.chen, hengshuai.yao}@huawei.com

Abstract

In this paper, we propose an actor ensemble algorithm, named ACE, for continuous control with a deterministic policy in reinforcement learning. In ACE, we use an actor ensemble (i.e., multiple actors) to search for the global maxima of the critic. Besides the ensemble perspective, we also formulate ACE in the option framework by extending the option-critic architecture with deterministic intra-option policies, revealing a relationship between ensemble and options. Furthermore, we perform a look-ahead tree search with those actors and a learned value prediction model, resulting in a refined value estimation. We demonstrate a significant performance boost of ACE over DDPG and its variants in challenging physical robot simulators.

Introduction

In this paper, we propose an actor ensemble algorithm, named ACE, for continuous control in reinforcement learning (RL). In continuous control, a deterministic policy (Silver et al. 2014) is a recent approach, which is a mapping from state to action. In contrast, a stochastic policy is a mapping from state to a probability distribution over the actions.

Recently, neural networks have achieved great success as function approximators in various challenging domains (Tesauro 1995; Mnih et al. 2015; Silver et al. 2016). A deterministic policy parameterized by a neural network is usually trained via gradient ascent to maximize the critic, which is a state-action value function parameterized by a neural network (Silver et al. 2014; Lillicrap et al. 2015; Barth-Maron et al. 2018). However, gradient ascent can easily be trapped by local maxima during the search for the global maxima. We utilize the ensemble technique to mitigate this issue. We train multiple actors (i.e., deterministic policies) in parallel, each with a different initialization. In this way, each actor is in charge of maximizing the state-action value function in a local area. Different actors may be trapped in different local maxima. By considering the maximum state-action value of the actions proposed by all the actors, we are more likely to find the global maxima of the state-action value function than with a single actor.

* Work done during an internship at Huawei.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

ACE fits in with the option framework (Sutton, Precup, and Singh 1999). First, each option has its intra-option policy, which maximizes the return in a certain area of the state space. Similarly, an actor in ACE maximizes the critic in a certain area of the domain of the critic. It may be difficult for a single actor to maximize the critic over the whole domain due to the complexity of the manifold of the critic. In contrast, the search is easier if we ask an actor to find the best action only in a local neighborhood of the action space. Second, we chain the outputs of all the actors to the critic, enabling a selection over the locally optimal action values. In this way, the critic works similarly to the policy over options in the option framework. We quantify this similarity between ensemble and options by extending the option-critic architecture (OCA, Bacon, Harb, and Precup 2017) with deterministic intra-option policies. In particular, we provide the Deterministic Intra-option Policy Gradient theorem, based on which we show that the actor ensemble in ACE is a special case of the general option-critic framework.

To make the state-action value function more accurate, which is essential in the actor selection, we perform a look-ahead tree search with the multiple actors. Look-ahead tree search has achieved great success in various discrete-action problems (Knuth and Moore 1975; Browne et al. 2012; Silver et al. 2016; Oh, Singh, and Lee 2017; Farquhar et al. 2018). Recently, look-ahead tree search was extended to continuous-action problems. For example, Mansley, Weinstein, and Littman (2011) combined planning with adaptive discretization of a continuous action space, resulting in a performance boost in continuous bandit problems. Nitti, Belle, and De Raedt (2015) utilized probabilistic programming for planning in continuous action spaces. Yee, Lisy, and Bowling (2016) used kernel regression to generalize values between explored and unexplored actions, resulting in a new Monte Carlo tree search algorithm. However, to the best of our knowledge, a general tree search algorithm for continuous-action problems is still missing. In ACE, we use the multiple actors as meta-actions to perform a tree search with the help of a learned value prediction model (Oh, Singh, and Lee 2017; Farquhar et al. 2018).

We demonstrate the superiority of ACE over DDPG and its variants empirically in Roboschool 1, a challenging physical robot environment.

1 https://github.com/openai/roboschool



In the rest of this paper, we first present some preliminaries of RL. Then we detail ACE and show some empirical results, after which we discuss some related work, followed by closing remarks.

Preliminaries

We consider a Markov Decision Process (MDP) which consists of a state space S, an action space A, a reward function r : S × A → R, a transition kernel p : S × A × S → [0, 1], and a discount factor γ ∈ [0, 1]. We use π : S × A → [0, 1] to denote a stochastic policy. At each time step t, an agent is at state s_t and selects an action a_t ∼ π(·|s_t). Then the environment gives the reward r_{t+1} and leads the agent to the next state s_{t+1} according to p(·|s_t, a_t). We use q_π to denote the state-action value of a policy π, i.e., q_π(s, a) ≐ E_π[Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s, a_0 = a]. In an RL problem, we are usually interested in finding an optimal policy π* s.t. q_{π*}(s, a) ≥ q_π(s, a) for all (π, s, a). All the optimal policies share the same optimal state-action value function q*, which is the unique fixed point of the Bellman optimality operator T,

T Q(s, a) ≐ r(s, a) + γ E_{s′∼p(·|s,a)}[max_{a′} Q(s′, a′)]   (1)

where we use Q to indicate an estimation of q*. With a tabular representation, Q-learning (Watkins and Dayan 1992) is a commonly used method to find this fixed point. The per-step update is

ΔQ(s_t, a_t) ∝ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)   (2)
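To make the update in Equation (2) concrete, the following is a minimal sketch of tabular Q-learning in Python; the environment interface (env.reset(), env.step(a), env.actions) is a hypothetical stand-in, not part of the paper.

import collections
import random

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q maps (state, action) pairs to value estimates, initialized to zero.
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy over the estimated values
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            # Equation (2): move Q(s, a) toward the one-step bootstrapped target
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q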

Recently, Mnih et al. (2015) used a deep convolutional neural network θ to parameterize an estimation Q, resulting in the Deep-Q-Network (DQN). At each time step t, DQN samples a transition (s_t, a_t, r_{t+1}, s_{t+1}) from a replay buffer (Lin 1992) and performs stochastic gradient descent to minimize the loss

(1/2) ( r_{t+1} + γ max_a Q_{θ⁻}(s_{t+1}, a) − Q_θ(s_t, a_t) )²

where θ⁻ is a target network (Mnih et al. 2015), which is a copy of θ and is synchronized with θ periodically.
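As an illustration of this loss, here is a small PyTorch sketch of the DQN update with a target network θ⁻; the network sizes and the synchronization schedule are illustrative assumptions, not the settings of Mnih et al. (2015).

import copy
import torch
import torch.nn as nn

state_dim, num_actions, gamma = 4, 3, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
target_net = copy.deepcopy(q_net)  # theta^-, a periodically synchronized copy of theta
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_loss(s, a, r, s_next):
    # 1/2 (r + gamma * max_a' Q_{theta^-}(s', a') - Q_theta(s, a))^2, averaged over the batch
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1, keepdim=True).values
    q_sa = q_net(s).gather(1, a)
    return 0.5 * (target - q_sa).pow(2).mean()

# One update on a batch of random transitions (placeholders for replay buffer samples).
s = torch.randn(32, state_dim); a = torch.randint(num_actions, (32, 1))
r = torch.randn(32, 1); s_next = torch.randn(32, state_dim)
loss = dqn_loss(s, a, r, s_next)
opt.zero_grad(); loss.backward(); opt.step()
target_net.load_state_dict(q_net.state_dict())  # periodic synchronization of theta^-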

Continuous Action Control

In continuous control problems, it is not straightforward to apply Q-learning. The basic idea of the deterministic policy algorithms (Silver et al. 2014) is to use a deterministic policy µ : S → A to approximate the greedy action argmax_a Q(s, a). The deterministic policy is trained via gradient ascent according to the chain rule. The gradient per step is

∇_{θ^µ} Q(s_t, µ(s_t)) = ∇_a Q(s, a)|_{s=s_t, a=µ(s_t)} ∇_{θ^µ} µ(s)|_{s=s_t}   (3)

where we assume µ is parameterized by θ^µ. With this actor µ, we are able to update the state-action value function Q as usual. To be more specific, assuming Q is parameterized by θ^Q, we perform gradient descent to minimize the following loss

(1/2) ( r_{t+1} + γ Q(s_{t+1}, µ(s_{t+1})) − Q(s_t, a_t) )²

Recently, Lillicrap et al. (2015) used neural networks to parameterize µ and Q, resulting in the Deep Deterministic Policy Gradient (DDPG) algorithm. DDPG is an off-policy control algorithm with experience replay and a target network.
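A minimal PyTorch sketch of these two updates is given below, with autograd applying the chain rule of Equation (3); the layer sizes are placeholders, and the target network and replay buffer of DDPG are omitted for brevity.

import torch
import torch.nn as nn

state_dim, action_dim, gamma = 8, 2, 0.99
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def q(s, a):
    return critic(torch.cat([s, a], dim=1))

def ddpg_update(s, a, r, s_next):
    # Critic: regress Q(s, a) toward r + gamma * Q(s', mu(s')).
    with torch.no_grad():
        target = r + gamma * q(s_next, actor(s_next))
    critic_loss = 0.5 * (q(s, a) - target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: gradient ascent on Q(s, mu(s)), i.e., Equation (3) via the chain rule.
    actor_loss = -q(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

s = torch.randn(32, state_dim); a = torch.rand(32, action_dim) * 2 - 1
r = torch.randn(32, 1); s_next = torch.randn(32, state_dim)
ddpg_update(s, a, r, s_next)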

In the on-policy setting, the gradient in Equation (3) guarantees policy improvement thanks to the Deterministic Policy Gradient theorem (Silver et al. 2014). In the off-policy setting, the policy improvement of this gradient is based on the Off-policy Policy Gradient theorem (OPG, Degris, White, and Sutton 2012). However, the errata of Degris, White, and Sutton (2012) show that OPG only holds for tabular function representations and certain linear function approximations. So far no policy improvement is guaranteed for DDPG. However, DDPG and its variants have gained great success in various challenging domains (Lillicrap et al. 2015; Barth-Maron et al. 2018; Tassa et al. 2018). This success may be attributed to the gradient ascent via the chain rule in Equation (3).

Option

An option ω is a triple (I_ω, π_ω, β_ω), and we use Ω to denote the option set. We use I_ω ⊆ S to denote the initiation set of ω, indicating where the option ω can be initiated. In this paper, we consider I_ω ≡ S for all ω, meaning that each option can be initiated at all states. We use π_ω : S × A → [0, 1] to denote the intra-option policy of ω. Once the option ω is committed, the action selection is based on π_ω. We use β_ω : S → [0, 1] to denote the termination function of ω. At each time step t, the agent terminates the previous option ω_{t−1} with probability β_{ω_{t−1}}(s_t). In this paper, we consider the call-and-return option execution model (Sutton, Precup, and Singh 1999), where an agent executes an option until the option terminates.

An MDP augmented with options forms a Semi-MDP (Sutton, Precup, and Singh 1999). We use π_Ω : S × Ω → [0, 1] to denote the policy over options. We use Q_Ω : S × Ω → R to denote the option-value function and V_Ω : S → R to denote the value function of π_Ω. Furthermore, we use U : Ω × S → R to denote the option value upon arrival at the state-option pair (ω, s′) and Q_U : S × Ω × A → R to denote the value of executing an action a in the context of a state-option pair (s, ω). They are related as

Q_Ω(s, ω) = Σ_a π_ω(a|s) Q_U(s, ω, a)

Q_U(s, ω, a) = r(s, a) + γ E_{s′∼p(·|s,a)} U(ω, s′)   (4)

U(ω, s′) = (1 − β_ω(s′)) Q_Ω(s′, ω) + β_ω(s′) V_Ω(s′)   (5)

V_Ω(s) = Σ_ω π_Ω(ω|s) Q_Ω(s, ω)   (6)
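The following toy numerical sketch (Python/NumPy, with randomly generated option components) iterates Equations (4)-(6) to their fixed point for fixed policies, just to illustrate how Q_Ω, Q_U, U, and V_Ω are related; the sizes and distributions are arbitrary assumptions.

import numpy as np

nS, nO, nA, gamma = 2, 2, 2, 0.9
rng = np.random.default_rng(0)
r = rng.uniform(0, 1, (nS, nA))                   # r(s, a)
p = rng.dirichlet(np.ones(nS), (nS, nA))          # p(s'|s, a), shape [s, a, s']
pi_w = rng.dirichlet(np.ones(nA), (nS, nO))       # intra-option policies pi_w(a|s), shape [s, w, a]
beta = rng.uniform(0, 1, (nS, nO))                # termination functions beta_w(s), shape [s, w]
pi_Omega = rng.dirichlet(np.ones(nO), nS)         # policy over options pi_Omega(w|s), shape [s, w]

Q_Omega = np.zeros((nS, nO))
for _ in range(1000):                             # fixed-point iteration of Equations (4)-(6)
    V_Omega = (pi_Omega * Q_Omega).sum(axis=1)                    # Equation (6)
    U = (1 - beta) * Q_Omega + beta * V_Omega[:, None]            # Equation (5), U(w, s') stored as [s', w]
    Q_U = r[:, None, :] + gamma * np.einsum('san,nw->swa', p, U)  # Equation (4), Q_U(s, w, a)
    Q_Omega = (pi_w * Q_U).sum(axis=2)                            # Equation (4), Q_Omega(s, w)
print(np.round(Q_Omega, 3))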

Bacon, Harb, and Precup (2017) proposed a policy gradient method, the Option-Critic Architecture (OCA), to learn stochastic intra-option policies π_ω (parameterized by θ^π) and termination functions β_ω (parameterized by θ^β). The objective is to maximize the expected discounted return per episode, i.e., Q_Ω(s_0, ω_0). Based on their Intra-option Policy Gradient Theorem and Termination Gradient Theorem, the per-step updates for θ^π and θ^β are

Δθ^π ∝ ∇_{θ^π} log π_{ω_t}(a_t|s_t) Q_U(s_t, ω_t, a_t)

Δθ^β ∝ −∇_{θ^β} β_{ω_{t−1}}(s_t) ( Q_Ω(s_t, ω_{t−1}) − V_Ω(s_t) )

And at each time step t, OCA takes one gradient descent step minimizing

(1/2) ( g_t − Q_U(s_t, ω_t, a_t) )²   (7)

to update the critic Q_U, where

g_t ≐ r_{t+1} + γ (1 − β_{ω_t}(s_{t+1})) Q_Ω(s_{t+1}, ω_t) + γ β_{ω_t}(s_{t+1}) max_{ω′} Q_Ω(s_{t+1}, ω′)

The update target g_t is also used in Intra-option Q-learning (Sutton, Precup, and Singh 1999).
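As a small worked illustration of the target g_t, the sketch below computes it from toy numbers; the values are made up purely for the example.

import numpy as np

def oca_target(r_next, beta_next, q_omega_next, omega, gamma=0.99):
    # g_t: continue the current option with probability 1 - beta_{omega_t}(s_{t+1}),
    # otherwise switch to the best option at s_{t+1}.
    continue_value = q_omega_next[omega]
    switch_value = q_omega_next.max()
    return r_next + gamma * ((1 - beta_next) * continue_value + beta_next * switch_value)

q_omega_next = np.array([1.0, 2.0, 0.5])  # toy Q_Omega(s_{t+1}, .), 3 options
g_t = oca_target(r_next=0.3, beta_next=0.25, q_omega_next=q_omega_next, omega=0)
print(g_t)  # 0.3 + 0.99 * (0.75 * 1.0 + 0.25 * 2.0) = 1.5375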

Model-based RL

In RL, a transition model usually takes as input a state-action pair and generates the immediate reward and the next state. A transition model reflects the dynamics of an environment and can either be given or learned. A transition model can be used to generate imaginary transitions for training the agent to increase data efficiency (Sutton 1990; Yao et al. 2009; Gu et al. 2016). A transition model can also be used to reason about the future. When making a decision, an agent can perform a look-ahead tree search (Knuth and Moore 1975; Browne et al. 2012) with a transition model to maximize the possible rewards. A look-ahead tree search can be performed with either a perfect model (Coulom 2006; Sturtevant 2008; Silver et al. 2016) or a learned model (Weber et al. 2017).

A latent state is an abstraction of the original state. A latent state is also referred to as an abstract state (Oh, Singh, and Lee 2017) or an encoded state (Farquhar et al. 2018). A latent state can be used as the input of a transition model. Correspondingly, the transition model then predicts the next latent state, instead of the original next state. A latent state is particularly useful for a high-dimensional state space (e.g., images), where a latent state is usually a low-dimensional vector.

Recently, some works demonstrated that learning a value prediction model instead of a transition model is effective for a look-ahead tree search. For example, VPN (Oh, Singh, and Lee 2017) predicted the value of the next latent state, instead of the next latent state itself. Although this value prediction was explicitly done in two phases (predicting the next latent state and then predicting its value), the loss of predicting the next latent state was not used in training. Instead, only the loss of the value prediction for the next latent state was used. Oh, Singh, and Lee (2017) showed this value prediction model is particularly helpful for non-deterministic environments. TreeQN (Farquhar et al. 2018) adopted a similar idea, where only the outcome value of a look-ahead tree search is grounded in the loss. Farquhar et al. (2018) showed that grounding the predicted next latent state did not bring a performance boost. Although a value prediction model predicts much less information than a transition model, VPN and TreeQN demonstrated improved performance over baselines in challenging domains. In ACE, we followed these works and built a value prediction model similar to TreeQN.

The Actor Ensemble Algorithm

As discussed earlier, it is important for the actor to maximize the critic in DDPG. Lillicrap et al. (2015) trained the actor via gradient ascent. However, gradient ascent can easily be trapped by local maxima or saddle points.

To mitigate this issue, we propose an ensemble approach. In our work, we have N actors µ_1, ..., µ_N. At each time step t, ACE selects the action over the proposals by all the actors,

a_t ≐ argmax_{a ∈ {µ_i(s_t)}_{i=1,...,N}} Q(s_t, a)   (8)

Our actors are trained in parallel. Assuming those actors are parameterized by θ, each actor µ_i is trained such that, given a state s, the action a = µ_i(s) maximizes Q(s, a). We adopt a gradient ascent similar to DDPG. The gradient at time step t is

∇_θ Q(s_t, µ_i(s_t)) = ∇_a Q(s, a)|_{s=s_t, a=µ_i(s_t)} ∇_θ µ_i(s)|_{s=s_t}   (9)

for all i ∈ {1, ..., N}. ACE initializes each actor independently, so that the actors are likely to cover different local maxima of Q(s, a). By considering the maximum action value of the proposed actions, the action in Equation (8) is more likely to find the global maxima of the critic Q. To train the critic Q, ACE takes one gradient descent step at each time step t minimizing

(1/2) ( r_{t+1} + γ max_{i∈{1,...,N}} Q(s_{t+1}, µ_i(s_{t+1})) − Q(s_t, a_t) )²   (10)

Note that our actors are not independent. They reinforce each other by influencing the critic update, which in turn gives them better policy gradients.
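The sketch below (PyTorch) shows the action selection of Equation (8) and the updates of Equations (9) and (10) for a small ensemble; exploration noise, the replay buffer, the shared encoder, and the target network are omitted, and all layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

state_dim, action_dim, N, gamma = 8, 2, 5, 0.99

def make_actor():
    return nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim), nn.Tanh())

actors = nn.ModuleList([make_actor() for _ in range(N)])  # independently initialized actors
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actors.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def q(s, a):
    return critic(torch.cat([s, a], dim=1))

def select_action(s):
    # Equation (8): evaluate every actor's proposal and keep the one with the highest Q.
    proposals = torch.stack([mu(s) for mu in actors], dim=1)                       # [batch, N, action_dim]
    values = torch.stack([q(s, proposals[:, i]) for i in range(N)], dim=1).squeeze(-1)
    best = values.argmax(dim=1)
    return proposals[torch.arange(s.shape[0]), best]

def update(s, a, r, s_next):
    # Equation (10): the critic target takes the max over the actors' proposals at s'.
    with torch.no_grad():
        next_q = torch.stack([q(s_next, mu(s_next)) for mu in actors], dim=1)
        target = r + gamma * next_q.max(dim=1).values
    critic_loss = 0.5 * (q(s, a) - target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Equation (9): every actor ascends the critic at its own proposed action.
    actor_loss = -sum(q(s, mu(s)).mean() for mu in actors)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

s = torch.randn(32, state_dim); a = torch.rand(32, action_dim) * 2 - 1
r = torch.randn(32, 1); s_next = torch.randn(32, state_dim)
update(s, a, r, s_next)
print(select_action(s).shape)  # torch.Size([32, 2])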

An Option Perspective

Intuitively, each actor in ACE is similar to an option. To quantify the relationship between ACE and the option framework, we first extend OCA with deterministic intra-option policies, referred to as OCAD. For each option ω, we use µ_ω : S → A to denote its intra-option policy, which is a deterministic policy. The intra-option policies µ_ω are parameterized by θ. The termination functions β_ω are parameterized by ν. We have

Theorem 1 (Deterministic Intra-option Policy Gradient) Given a set of Markov options with deterministic intra-option policies, the gradient of the expected discounted return objective w.r.t. θ is:

Σ_{s,ω} ρ_Ω(s, ω | s_0, ω_0) ∇_a Q_U(s, ω, a)|_{a=µ_ω(s)} ∇_θ µ_ω(s)

where ρ_Ω(s, ω | s_0, ω_0) = Σ_{k=0}^∞ P^(k)_γ(s, ω | s_0, ω_0) is the limiting state-option pair distribution. Here P^(k)_γ represents the γ-discounted probability of transitioning to (s, ω) from (s_0, ω_0) in k steps.

Theorem 2 (Termination Policy Gradient) Given a set of Markov options with deterministic intra-option policies, the gradient of the expected discounted return objective w.r.t. ν is:

Σ_{ω,s′} ρ_Ω(s′, ω | s_1, ω_0) ∇_ν β_ω(s′) ( V_Ω(s′) − Q_Ω(s′, ω) )

The proof of Theorem 1 follows a similar scheme to Sutton et al. (2000), Silver et al. (2014), and Bacon, Harb, and Precup (2017). The proof of Theorem 2 follows a similar scheme to Bacon, Harb, and Precup (2017). The conditions and proofs of both theorems are detailed in the Supplementary Material. The critic update of OCAD remains the same as in OCA (Equation 7).

We now show that the actor ensemble in ACE is a special setting of OCAD. The gradient update of the actors in ACE (Equation 9) can be justified via Theorem 1, and the critic update in ACE (Equation 10) is equivalent to the critic update in OCAD (Equation 13). We first consider a special setting where β_ω(s) ≡ 1 for all (ω, s), which means each option terminates at every time step. In this setting, the value of Q_U(s, ω, a) does not depend on ω (Equations 4 and 5). Based on this observation, we rewrite Q_U(s, ω, a) as Q(s, a). We have

Q_Ω(s, ω) = Q_U(s, ω, µ_ω(s)) = Q(s, µ_ω(s))   (11)

With the three Q's being the same, we rewrite the intra-option policy gradient update in OCAD according to Theorem 1 as

∇_a Q(s, a)|_{s=s_t, a=µ_{ω_t}(s_t)} ∇_θ µ_{ω_t}(s)|_{s=s_t}   (12)

And we rewrite the critic update in OCAD (Equation 7) as

(1/2) ( r_{t+1} + γ max_{ω′} Q(s_{t+1}, µ_{ω′}(s_{t+1})) − Q(s_t, a_t) )²   (13)

Now the actor-critic updates in OCAD (Equations 12 and 13) recover the actor-critic updates in ACE (Equations 9 and 10), revealing the relationship between the ensemble in ACE and OCAD.

Note that in the intra-option policy update of OCAD (Equation 12) only one intra-option policy is updated at each time step, while in the actor ensemble update (Equation 9) all actors are updated. Based on the intra-option policy update of OCAD, we propose a variant of ACE, named Alternative ACE (ACE-Alt), where only the selected actor is updated at each time step. In practice, we add exploration noise to each action a_t and use experience replay to stabilize the training of the neural network function approximator like DDPG, resulting in off-policy learning.

Model-based Enhancements

To refine the state-action value function estimation, which is essential for actor selection, we utilize a look-ahead tree search method with a learned value prediction model similar to TreeQN. TreeQN was developed for discrete action spaces. We extend TreeQN to continuous control problems via the actor ensemble by searching over the actions proposed by the actors.

Formally speaking, we first define the following learnable functions:
• f_enc : S → R^n, an encoding function that transforms a state into an n-dimensional latent state, parameterized by θ_enc
• f_rew : R^n × A → R, a reward prediction function that predicts the immediate reward given a latent state and an action, parameterized by θ_rew
• f_trans : R^n × A → R^n, a transition function that predicts the next latent state given a latent state and an action, parameterized by θ_trans
• f_q : R^n × A → R, a value prediction function that computes the value for a pair of a latent state and an action, parameterized by θ_q
• f_{µ_i} : R^n → A, an actor that computes an action given a latent state, parameterized by θ_{µ_i}, for each i ∈ {1, ..., N}

We use θ^Q to denote {θ_enc, θ_rew, θ_trans, θ_q} and θ^µ to denote {θ_enc, θ_{µ_1}, ..., θ_{µ_N}}. Note the encoding function is shared in our implementation.
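A sketch of these components as PyTorch modules is given below; the layer sizes follow the Roboschool parameterization described later (a 400-dimensional latent state and 300 hidden units), but the residual connections of f_trans and other implementation details are omitted, so this is an illustrative assumption rather than the released implementation.

import torch
import torch.nn as nn

class ACEModel(nn.Module):
    def __init__(self, state_dim, action_dim, num_actors=5, latent_dim=400, hidden=300):
        super().__init__()
        self.f_enc = nn.Linear(state_dim, latent_dim)                           # f_enc: S -> R^n
        self.f_rew = nn.Sequential(nn.Linear(latent_dim + action_dim, hidden),
                                   nn.Tanh(), nn.Linear(hidden, 1))             # f_rew(z, a): predicted reward
        self.f_trans = nn.Sequential(nn.Linear(latent_dim + action_dim, hidden),
                                     nn.Tanh(), nn.Linear(hidden, latent_dim))  # f_trans(z, a): next latent state
        self.f_q = nn.Sequential(nn.Linear(latent_dim + action_dim, hidden),
                                 nn.Tanh(), nn.Linear(hidden, 1))               # f_q(z, a): value prediction
        self.actors = nn.ModuleList([
            nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                          nn.Linear(hidden, action_dim), nn.Tanh())             # f_mu_i(z): action in [-1, 1]^m
            for _ in range(num_actors)])

    def encode(self, s):
        # The encoder is shared by theta^Q and theta^mu.
        return torch.tanh(self.f_enc(s))

model = ACEModel(state_dim=26, action_dim=6)
z = model.encode(torch.randn(1, 26))
print(model.actors[0](z).shape)  # torch.Size([1, 6])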

We use f_{µ_i}(z_{t|0}) to represent µ_i(s_t) and f_q(z_{t|0}, a_{t|0}) to represent Q(s_t, a_t), where z_{t|0} = f_enc(s_t) is a latent state and a_{t|0} = a_t. Furthermore, f_q(z_{t|0}, a_{t|0}) can also be decomposed into the sum of the predicted immediate reward and the value of the predicted next latent state, i.e.,

f_q(z_{t|0}, a_{t|0}) ← f_rew(z_{t|0}, a_{t|0}) + γ f_q(z_{t|1}, a_{t|1})   (14)

where

z_{t|1} = f_trans(z_{t|0}, a_{t|0})   (15)

a_{t|1} = argmax_{a ∈ {f_{µ_i}(z_{t|1})}_{i=1,...,N}} f_q(z_{t|1}, a)   (16)

We apply Equation (14) recursively d times with state prediction and action selection as in Equations (15, 16), resulting in a new estimator f^d_q(z_{t|0}, a_{t|0}) for Q(s_t, a_t), which is defined as

f^d_q(z_{t|l}, a_{t|l}) = f_q(z_{t|l}, a_{t|l})                                            if d = 0
f^d_q(z_{t|l}, a_{t|l}) = f_rew(z_{t|l}, a_{t|l}) + γ f^{d−1}_q(z_{t|l+1}, a_{t|l+1})       if d > 0   (17)

where

z_{t|0} ≐ f_enc(s_t),   a_{t|0} ≐ a_t
z_{t|l} ≐ f_trans(z_{t|l−1}, a_{t|l−1})   (l ≥ 1)
a_{t|l} ≐ argmax_{a ∈ {f_{µ_i}(z_{t|l})}_{i=1,...,N}} f^{d−l}_q(z_{t|l}, a)   (l ≥ 1)

The look-ahead tree search and backup process (Equation 17) are illustrated in Figure 1. The value of f^d_q(z_{t|l}, a_{t|l}) stands for the state-action value estimation for the predicted latent state and action after l steps from t, with Equation (14) applied d times.
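The recursion of Equation (17) can be written compactly as below (Python); the stand-in functions at the bottom exist only to make the sketch runnable and are not the paper's trained networks.

import torch

def tree_q(z, a, d, f_q, f_rew, f_trans, actors, gamma=0.99):
    # f^d_q of Equation (17): expand the learned value prediction model d steps,
    # at each predicted latent state pick the best action among the actors' proposals,
    # and back up predicted rewards plus the leaf value.
    if d == 0:
        return f_q(z, a)
    z_next = f_trans(z, a)
    values = torch.stack([tree_q(z_next, mu(z_next), d - 1, f_q, f_rew, f_trans, actors, gamma)
                          for mu in actors], dim=1)        # [batch, N, 1]
    return f_rew(z, a) + gamma * values.max(dim=1).values  # maximization (backup) over actors

# Stand-in functions with the right shapes, purely for demonstration.
latent_dim, action_dim, N = 4, 2, 3
f_q = lambda z, a: z.sum(dim=1, keepdim=True) + a.sum(dim=1, keepdim=True)
f_rew = lambda z, a: a.mean(dim=1, keepdim=True)
f_trans = lambda z, a: torch.tanh(z + a.sum(dim=1, keepdim=True))
actors = [lambda z, i=i: torch.tanh(z[:, :action_dim] + i) for i in range(N)]

z0 = torch.randn(5, latent_dim)
a0 = torch.rand(5, action_dim) * 2 - 1
print(tree_q(z0, a0, d=2, f_q=f_q, f_rew=f_rew, f_trans=f_trans, actors=actors).shape)  # torch.Size([5, 1])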

As f^d_q is fully differentiable w.r.t. θ^Q, we plug in f^d_q whenever we need Q. We also ground the predicted reward in the first recursive expansion as suggested by Farquhar et al. (2018). To summarize, given a transition (s_t, a_t, r_{t+1}, s_{t+1}), the gradients for θ^Q and θ^µ are

∇_{θ^Q} ( (1/2) ( f^d_q(z_{t|0}, a_{t|0}) − ( r_{t+1} + γ max_i f^d_q(z_{t+1|0}, f_{µ_i}(z_{t+1|0})) ) )² + (1/2) ( f_rew(z_{t|0}, a_{t|0}) − r_{t+1} )² )

and

Σ_{i=1}^N ∇_a f^d_q(z_{t|0}, a)|_{a=f_{µ_i}(z_{t|0})} ∇_{θ^µ} f_{µ_i}(z_{t|0})

respectively. ACE also utilizes experience replay and a target network similar to DDPG. The pseudo-code of ACE is provided in the Supplementary Material.

Experiments

We designed experiments to answer the following questions:
• Does ACE outperform baseline algorithms?
• If so, how do the components of ACE contribute to the performance boost?

We used twelve continuous control tasks from Roboschool 2, a free port of Mujoco by OpenAI. A state in Roboschool contains joint information of a robot (e.g., speed or angle) and is presented as a vector of real numbers. An action in Roboschool is a vector, with each dimension being a real number in [−1, 1]. All the implementations are made publicly available. 3

ACE Architecture

In this section we describe the parameterization of f_enc, f_rew, f_trans, f_q and {f_{µ_i}}_{i=1,...,N} for the Roboschool tasks. For each state s, we first transformed it into a latent state z ∈ R^400 by f_enc, which was parameterized as a single neural network layer. This latent state z was used as the input for all other functions (i.e., f_rew, f_q, f_trans, f_{µ_1}, ..., f_{µ_N}).

The networks for f_rew, f_q, f_trans are single-hidden-layer networks with 400 + m input units and 300 hidden units, taking as input the concatenation of a latent state and an m-dimensional action. In particular, the network for f_trans used two residual connections similar to Farquhar et al. (2018). The networks for {f_{µ_i}}_{i=1,...,N} are single-hidden-layer networks with 400 input units and 300 hidden units, and all the N networks of f_{µ_i} shared a common first layer. The architecture of ACE is illustrated in Figure 4(a).

We used tanh as the activation function for the hidden units. (This selection will be further discussed in the next section.) We set the number of actors to 5 (i.e., N = 5) and the planning depth to 1 (i.e., d = 1).

2 http://www.mujoco.org/
3 https://github.com/ShangtongZhang/DeepRL

Baseline Algorithms

DDPG  In DDPG, Lillicrap et al. (2015) used two separate networks to parameterize the actor and the critic. Each network had 2 hidden layers with 400 and 300 hidden units respectively. Lillicrap et al. (2015) used the ReLU activation function (Nair and Hinton 2010) and applied an L2 regularization to the critic. However, our analysis experiments found that the tanh activation function outperformed ReLU with L2 regularization. So throughout all our experiments, we always used the tanh activation function (without L2 regularization) for all algorithms. All other hyper-parameter values were the same as in Lillicrap et al. (2015). All the other compared algorithms inherited the hyper-parameter values from DDPG without tuning. We used the same Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein 1930) as Lillicrap et al. (2015) for exploration in all the compared algorithms.

Wide-DDPG  ACE had more parameters than DDPG. To investigate the influence of the number of parameters, we implemented Wide-DDPG, where the hidden units were doubled (i.e., the two hidden layers had 800 and 600 units respectively). Wide-DDPG had a number of parameters comparable to ACE and the same depth as ACE.

Shared-DDPG  DDPG used separate networks for the actor and critic, while the actor and critic in ACE shared a common representation layer. To investigate the influence of a shared representation, we implemented Shared-DDPG, where the actor and the critic shared a common bottom layer in the same manner as ACE.

Ensemble-DDPG  To investigate the influence of the tree search in ACE, we removed the tree search in ACE by setting d = 0, giving an ensemble version of DDPG, called Ensemble-DDPG. We still used 5 actors in Ensemble-DDPG.

TM-ACE  To investigate the usefulness of the value prediction model, we implemented Transition-Model-ACE (TM-ACE), where we learn a transition model instead of a value prediction model. To be more specific, f_trans and f_rew were trained to fit sampled transitions from the replay buffer to minimize the squared loss of the predicted reward and the predicted next latent state. This model was then used for a look-ahead tree search. The pseudo-code of TM-ACE is detailed in the Supplementary Material.

The architectures of all the above algorithms are illustrated in the Supplementary Material.

Results

For each task, we trained each algorithm for 1 million steps. At every 10 thousand steps, we performed 20 deterministic evaluation episodes without exploration noise and computed the mean episode return. We report the best evaluation performance during training in Table 1, which is averaged over 5 independent runs. The full evaluation curves are reported in the Supplementary Material.

Figure 1: An example of the tree search (Equation 17) with d = 2 and N = 2. We use f²_q(z_{t|0}, a_{t|0}) as the estimation of Q(s_t, a_t). Here, z_{t|l} (l = 1, 2) represents the predicted latent state after l steps from t; a and b are actions proposed by the two actors and depend on the latent state. Arrows represent the transition kernel f_trans, and brackets represent maximization (backup) operations. Reward prediction is omitted for simplicity.

In summary, either ACE or ACE-Alt was placed among the best algorithms in 11 out of the 12 games. ACE itself was placed among the best algorithms in 8 games taking ACE-Alt into comparison. Without considering ACE-Alt, ACE was placed among the best algorithms in 10 games. ACE-Alt itself was placed among the best algorithms in 7 games taking ACE into comparison. Without considering ACE, ACE-Alt was placed among the best algorithms in 10 games. Overall, ACE was slightly better than ACE-Alt. However, ACE-Alt enjoyed lower variance than ACE. We conjecture this is because ACE had more off-policy learning than ACE-Alt. Off-policy learning improved data efficiency but increased variance.

Wide-DDPG outperformed DDPG in only 1 game, indicating that naively increasing the number of parameters does not guarantee a performance improvement. Shared-DDPG outperformed DDPG in only 2 games (lost in 2 games and tied in 8 games), showing shared representation contributes little to the overall performance of ACE. Ensemble-DDPG outperformed DDPG in 6 games (lost in 3 games and tied in 3 games), indicating the DDPG agent benefits from an actor ensemble. This may be attributed to the fact that multiple actors are more likely to find the global maxima of the critic. ACE further outperformed Ensemble-DDPG in 9 games, indicating the agent benefits from the look-ahead tree search with a value prediction model. In contrast, TM-ACE outperformed Ensemble-DDPG in only 2 games (lost in 3 games and tied in 7 games), indicating that a value prediction model is better than a transition model for a look-ahead tree search. This is also consistent with the results observed earlier in VPN and TreeQN.

In conclusion, the actor ensemble and the look-ahead tree search with a learned value prediction model are key to the performance boost.

ACE and ACE-Alt increase performance in terms of environment steps while requiring more computation than vanilla DDPG. We benchmarked the wall time for the algorithms. The results are reported in the Supplementary Material. We also verified the diversity of the actors in ACE and ACE-Alt in the Supplementary Material.

Ensemble Size and Planning Depth

In this section, we investigate how the ensemble size N and the planning depth d in ACE influence the performance. We performed experiments in HalfCheetah with various N and d and used the same evaluation protocol as before. As a large N and d induced a significant computation increase, we only used N up to 10 and d up to 2. The results are reported in Figure 2.

To summarize, N = 5 and d = 1 achieved the best performance. We hypothesize there is a trade-off in the selection of both the ensemble size and the planning depth. On the one hand, a single actor can easily be trapped in local maxima during training. The more actors we have, the more likely we are to find the global maxima. On the other hand, all the actors share the same encoding function with the critic. A large number of actors may dominate the training of the encoding function and damage the critic learning. So a medium ensemble size is likely to achieve the best performance. A possible solution is to normalize the gradient according to the ensemble size, and we leave this for future work. With a perfect model, the more planning steps we have, the more accurate a Q estimation we can get. However, with a learned value prediction model, error compounds during unrolling. So a medium planning depth is likely to achieve the best performance. Similar phenomena were also observed in multi-step Dyna planning (Yao et al. 2009).

Figure 2: Evaluation curves of ACE in HalfCheetah with different N and d. Each curve is averaged over 5 independent runs, and standard errors are plotted in shadow.


               ACE            ACE-Alt       TM-ACE         Ensemble-DDPG  Shared-DDPG    Wide-DDPG      DDPG
Ant            1041 (70.8)    983 (36.8)    1031 (55.6)    1026 (87.2)    796 (16.8)     871 (19.9)     875 (14.2)
HalfCheetah    1667 (40.4)    1023 (60.4)   800 (28.8)     812 (49.4)     771 (79.6)     733 (52.5)     703 (37.3)
Hopper         2136 (86.4)    1923 (88.3)   1586 (85.0)    1972 (63.4)    2047 (76.7)    2090 (118.6)   2133 (99.0)
Humanoid       380 (56.1)     441 (90.1)    61 (6.8)       76 (11.4)      53 (2.2)       54 (1.0)       54 (1.7)
HF             311 (30.3)     289 (20.8)    126 (39.6)     85 (5.6)       53 (1.0)       55 (1.6)       53 (1.4)
HFH            22 (2.4)       20 (2.2)      -4 (2.1)       2 (7.5)        6 (6.4)        15 (3.2)       15 (2.5)
IDP            7555 (1610.9)  9356 (1.1)    7549 (1613.7)  4102 (1923.2)  7549 (1613.6)  7548 (1618.7)  5662 (1945.9)
IP             417 (212.8)    1000 (0.0)    415 (213.4)    1000 (0.0)     1000 (0.0)     1000 (0.0)     1000 (0.0)
IPS            892 (0.1)      892 (0.2)     891 (0.2)      891 (0.2)      891 (0.4)      546 (308.7)    891 (0.4)
Pong           12 (0.3)       11 (0.1)      6 (0.7)        8 (0.9)        4 (0.2)        4 (0.1)        5 (0.8)
Reacher        16 (0.7)       17 (0.2)      17 (0.5)       17 (0.3)       20 (0.7)       15 (2.2)       18 (0.9)
Walker2d       1659 (65.9)    1864 (21.4)   1086 (97.0)    1142 (99.5)    1142 (146.3)   1185 (121.2)   815 (11.4)

Table 1: Best evaluation performance during training. Mean and standard error are reported. Bold numbers indicate the best performance. Scores are averaged over 5 independent runs. Numbers are rounded for the ease of display. HF, HFH, IDP, IP, and IPS stand for HumanoidFlagrun, HumanoidFlagrunHarder, InvertedDoublePendulum, InvertedPendulum, and InvertedPendulumSwingup respectively.

Related Work

Continuous-action RL  Klissarov et al. (2017) extended OCA to continuous action problems with the Proximal Policy Option Critic (PPOC) algorithm. However, PPOC considered stochastic intra-option policies, and each intra-option policy was trained via a policy search method. In ACE, we consider deterministic intra-option policies, and the intra-option policies are optimized under the same objective as OCA.

Gu et al. (2016) parameterized the Q function in a quadratic form to deal with continuous control problems. In this way, the global maxima can be determined analytically. However, in general, the optimal Q value does not necessarily take this quadratic form. In ACE, we use an actor ensemble to search for the global maxima of the Q function. Gu et al. (2016) utilized a transition model to generate imaginary data, which is orthogonal to ACE.

Ensemble in RL  Wiering and Van Hasselt (2008) designed four ensemble methods combining five RL algorithms with a voting scheme based on the value functions of the different RL algorithms. Hans and Udluft (2010) used a network ensemble to improve the performance of Fitted Q-Iteration. Osband et al. (2016) used a Q ensemble to approximate Thompson sampling, resulting in improved exploration and a performance boost in challenging video games. Huang et al. (2017) used both an actor ensemble and a critic ensemble in continuous control problems. However, to our best knowledge, the present work is the first to relate ensembles with options and to use an ensemble for a look-ahead tree search in continuous control problems.

Closing Remarks

In this paper, we propose the ACE algorithm for continuous control problems. From an ensemble perspective, ACE utilizes an actor ensemble to search for the global maxima of a critic function. From an option perspective, ACE is a special option-critic algorithm with deterministic intra-option policies. Thanks to the actor ensemble, ACE is able to perform a look-ahead tree search with a learned value prediction model in continuous control problems, resulting in a significant performance boost in challenging simulated robot control tasks.

Acknowledgement

The authors thank Bo Liu for feedback on the first draft of this paper. We also appreciate the insightful reviews of the anonymous reviewers.

References

Bacon, P.-L.; Harb, J.; and Precup, D. 2017. The option-critic architecture. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
Barth-Maron, G.; Hoffman, M. W.; Budden, D.; Dabney, W.; Horgan, D.; Muldal, A.; Heess, N.; and Lillicrap, T. 2018. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617.
Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games.
Coulom, R. 2006. Efficient selectivity and backup operators in Monte-Carlo tree search. In Proceedings of the International Conference on Computers and Games.
Degris, T.; White, M.; and Sutton, R. S. 2012. Off-policy actor-critic. arXiv preprint arXiv:1205.4839.
Farquhar, G.; Rocktaschel, T.; Igl, M.; and Whiteson, S. 2018. TreeQN and ATreeC: Differentiable tree-structured models for deep reinforcement learning. arXiv preprint arXiv:1710.11417.
Gu, S.; Lillicrap, T.; Sutskever, I.; and Levine, S. 2016. Continuous deep Q-learning with model-based acceleration. In Proceedings of the 33rd International Conference on Machine Learning.
Hans, A., and Udluft, S. 2010. Ensembles of neural networks for robust reinforcement learning. In Proceedings of the 9th International Conference on Machine Learning and Applications.
Huang, Z.; Zhou, S.; Zhuang, B.; and Zhou, X. 2017. Learning to run with actor-critic ensemble. arXiv preprint arXiv:1712.08987.
Klissarov, M.; Bacon, P.-L.; Harb, J.; and Precup, D. 2017. Learning options end-to-end for continuous action tasks. arXiv preprint arXiv:1712.00004.
Knuth, D. E., and Moore, R. W. 1975. An analysis of alpha-beta pruning. Artificial Intelligence.
Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Lin, L.-J. 1992. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
Mansley, C. R.; Weinstein, A.; and Littman, M. L. 2011. Sample-based planning for continuous action Markov decision processes. In Proceedings of the 21st International Conference on Automated Planning and Scheduling.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature.
Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning.
Nitti, D.; Belle, V.; and De Raedt, L. 2015. Planning in discrete and continuous Markov decision processes by probabilistic programming. In Proceedings of the 17th Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
Oh, J.; Singh, S.; and Lee, H. 2017. Value prediction network. In Advances in Neural Information Processing Systems.
Osband, I.; Blundell, C.; Pritzel, A.; and Van Roy, B. 2016. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems.
Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning.
Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.
Sturtevant, N. 2008. An analysis of UCT in multi-player games. In Proceedings of the International Conference on Computers and Games.
Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence.
Sutton, R. S. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning.
Tassa, Y.; Doron, Y.; Muldal, A.; Erez, T.; Li, Y.; Casas, D. d. L.; Budden, D.; Abdolmaleki, A.; Merel, J.; Lefrancq, A.; et al. 2018. DeepMind control suite. arXiv preprint arXiv:1801.00690.
Tesauro, G. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM.
Uhlenbeck, G. E., and Ornstein, L. S. 1930. On the theory of the Brownian motion. Physical Review.
Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine Learning.
Weber, T.; Racaniere, S.; Reichert, D. P.; Buesing, L.; Guez, A.; Rezende, D. J.; Badia, A. P.; Vinyals, O.; Heess, N.; Li, Y.; et al. 2017. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203.
Wiering, M. A., and Van Hasselt, H. 2008. Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).
Yao, H.; Bhatnagar, S.; Diao, D.; Sutton, R. S.; and Szepesvari, C. 2009. Multi-step Dyna planning for policy evaluation and control. In Advances in Neural Information Processing Systems.
Yee, T.; Lisy, V.; and Bowling, M. H. 2016. Monte Carlo tree search in continuous action spaces with execution uncertainty. In Proceedings of the 25th International Joint Conference on Artificial Intelligence.


Supplementary Material

Proof of Theorem 1

Under mild conditions (Bacon, Harb, and Precup 2017), the Markov chain underlying π_Ω is aperiodic and ergodic. We use the following augmented process defined by Bacon, Harb, and Precup (2017), which is homogeneous.

P^(1)_γ(s_{t+1}, ω_{t+1} | s_t, ω_t) = γ p(s_{t+1} | s_t, µ_{ω_t}(s_t)) ( (1 − β_{ω_t}(s_{t+1})) I_{ω_t=ω_{t+1}} + β_{ω_t}(s_{t+1}) π_Ω(ω_{t+1} | s_{t+1}) )

P^(k)_γ(s_{t+k}, ω_{t+k} | s_t, ω_t) = Σ_{s_{t+1}} Σ_{ω_{t+1}} P^(1)_γ(s_{t+1}, ω_{t+1} | s_t, ω_t) P^(k−1)_γ(s_{t+k}, ω_{t+k} | s_{t+1}, ω_{t+1})

We compute the gradient as

∇_θ Q_Ω(s, ω)
 = ∇_θ Q_U(s, ω, µ_ω(s))   (Equation 11)
 = ∇_θ r(s, µ_ω(s)) + γ ∇_θ Σ_{s′} p(s′ | s, µ_ω(s)) U(ω, s′)   (Equation 4)
 = ∇_θ µ_ω(s) ∇_a r(s, a)|_{a=µ_ω(s)} + γ Σ_{s′} ∇_θ p(s′ | s, µ_ω(s)) U(ω, s′) + γ Σ_{s′} p(s′ | s, µ_ω(s)) ∇_θ U(ω, s′)
 = ∇_θ µ_ω(s) ∇_a r(s, a)|_{a=µ_ω(s)} + γ Σ_{s′} ∇_θ µ_ω(s) ∇_a p(s′ | s, a)|_{a=µ_ω(s)} U(ω, s′) + γ Σ_{s′} p(s′ | s, µ_ω(s)) ∇_θ U(ω, s′)
 = ∇_θ µ_ω(s) ∇_a ( r(s, a) + γ Σ_{s′} p(s′ | s, a) U(ω, s′) )|_{a=µ_ω(s)} + γ Σ_{s′} p(s′ | s, µ_ω(s)) ∇_θ U(ω, s′)
 = ∇_θ µ_ω(s) ∇_a Q_U(s, ω, a)|_{a=µ_ω(s)} + γ Σ_{s′} p(s′ | s, µ_ω(s)) ∇_θ U(ω, s′)   (Equation 4)   (18)

From Equation (5), we have

∇_θ U(ω, s′) = (1 − β_ω(s′)) ∇_θ Q_Ω(s′, ω) + β_ω(s′) ∇_θ V_Ω(s′)
 = (1 − β_ω(s′)) ∇_θ Q_Ω(s′, ω) + β_ω(s′) Σ_{ω′} π_Ω(ω′ | s′) ∇_θ Q_Ω(s′, ω′)
 = Σ_{ω′} ( (1 − β_ω(s′)) I_{ω=ω′} + β_ω(s′) π_Ω(ω′ | s′) ) ∇_θ Q_Ω(s′, ω′)   (19)

Plugging Equation (19) into Equation (18) and using the definition of P^(1)_γ, we have

∇_θ Q_Ω(s, ω) = ∇_θ µ_ω(s) ∇_a Q_U(s, ω, a)|_{a=µ_ω(s)} + Σ_{s′,ω′} P^(1)_γ(s′, ω′ | s, ω) ∇_θ Q_Ω(s′, ω′)   (20)

Expanding ∇_θ Q_Ω(s_0, ω_0) with Equation (20) recursively and applying the augmented process, we end up with

∇_θ Q_Ω(s_0, ω_0) = Σ_{s,ω} ρ_Ω(s, ω | s_0, ω_0) ∇_a Q_U(s, ω, a)|_{a=µ_ω(s)} ∇_θ µ_ω(s)

Proof of Theorem 2

Under mild conditions (Bacon, Harb, and Precup 2017), the Markov chain underlying π_Ω is aperiodic and ergodic. We use the following augmented process defined by Bacon, Harb, and Precup (2017), which is homogeneous.

P^(1)_γ(s_{t+1}, ω_t | s_t, ω_{t−1}) = γ p(s_{t+1} | s_t, µ_{ω_{t−1}}(s_t)) ( (1 − β_{ω_{t−1}}(s_{t+1})) I_{ω_{t−1}=ω_t} + β_{ω_{t−1}}(s_{t+1}) π_Ω(ω_t | s_{t+1}) )

P^(k)_γ(s_{t+k}, ω_{t+k−1} | s_t, ω_{t−1}) = Σ_{s_{t+1}} Σ_{ω_t} P^(1)_γ(s_{t+1}, ω_t | s_t, ω_{t−1}) P^(k−1)_γ(s_{t+k}, ω_{t+k−1} | s_{t+1}, ω_t)

We have

∇_ν Q_Ω(s′, ω) = ∇_ν Q_U(s′, ω, µ_ω(s′))   (Equation 11)
 = ∇_ν ( r(s′, µ_ω(s′)) + γ Σ_{s″} p(s″ | s′, µ_ω(s′)) U(ω, s″) )   (Equation 4)
 = γ Σ_{s″} p(s″ | s′, µ_ω(s′)) ∇_ν U(ω, s″)   (21)


The gradient of U(ω, s′) w.r.t. ν is

∇_ν U(ω, s′) = Q_Ω(s′, ω) ∇_ν (1 − β_ω(s′)) + (1 − β_ω(s′)) ∇_ν Q_Ω(s′, ω) + β_ω(s′) ∇_ν V_Ω(s′) + V_Ω(s′) ∇_ν β_ω(s′)   (Equation 5 and the product rule of calculus)
 = Q_Ω(s′, ω) ∇_ν (1 − β_ω(s′)) + (1 − β_ω(s′)) ∇_ν Q_Ω(s′, ω) + β_ω(s′) ∇_ν Σ_{ω′} π_Ω(ω′ | s′) Q_Ω(s′, ω′) + V_Ω(s′) ∇_ν β_ω(s′)   (Equation 6)
 = ∇_ν β_ω(s′) ( V_Ω(s′) − Q_Ω(s′, ω) ) + (1 − β_ω(s′)) ∇_ν Q_Ω(s′, ω) + β_ω(s′) Σ_{ω′} π_Ω(ω′ | s′) ∇_ν Q_Ω(s′, ω′)
 = ∇_ν β_ω(s′) ( V_Ω(s′) − Q_Ω(s′, ω) ) + Σ_{s″} Σ_{ω′} ( (1 − β_ω(s′)) I_{ω′=ω} + β_ω(s′) π_Ω(ω′ | s′) ) γ p(s″ | s′, µ_{ω′}(s′)) ∇_ν U(ω′, s″)   (Equation 21)
 = ∇_ν β_ω(s′) ( V_Ω(s′) − Q_Ω(s′, ω) ) + Σ_{s″} Σ_{ω′} P^(1)_γ(ω′, s″ | ω, s′) ∇_ν U(ω′, s″)   (Definition of P^(1)_γ)   (22)

Applying Equation (22) recursively and using the augmented process, we have

∇_ν U(ω_0, s_1) = Σ_{ω,s′} ρ_Ω(s′, ω | s_1, ω_0) ∇_ν β_ω(s′) ( V_Ω(s′) − Q_Ω(s′, ω) )

The ACE Algorithm

Algorithm 1: ACE
Input:
  N: number of actors
  ε_t: a noise process
  d: plan depth
  α, β: two step sizes
  f_q, f_enc, f_rew, f_trans, f_{µ_1}, ..., f_{µ_N}: parameterized by θ^Q and θ^µ  // see Section Model-based Enhancements
Output:
  The parameters θ^Q and θ^µ

Initialize the replay buffer D
for each time step t do
  Observe the state s_t
  z_t ← f_enc(s_t)
  a_t ← argmax_{a ∈ {f_{µ_i}(z_t)}_{i=1,...,N}} f^d_q(z_t, a) + ε_t  // see Equation 17 for f^d_q
  Execute the action a_t, get reward r_{t+1} and next state s_{t+1}
  Store (s_t, a_t, r_{t+1}, s_{t+1}) into D
  Sample a batch of transitions B from D
  for each transition (s, a, r, s′) in B do
    z ← f_enc(s)
    z′ ← f_enc(s′)
    y ← 0 if s′ is terminal, otherwise y ← max_i f^d_q(z′, f_{µ_i}(z′))
    y ← r + γ y
    r̄ ← f_rew(z, a)
    θ^Q ← θ^Q − α ∇_{θ^Q} ( (1/2)(f^d_q(z, a) − y)² + (1/2)(r̄ − r)² )
    θ^µ ← θ^µ + β Σ_{i=1}^N ∇_b f^d_q(z, b)|_{b=f_{µ_i}(z)} ∇_{θ^µ} f_{µ_i}(z)
  end
end


ACE with a transition model

TM-ACE only performs a look-ahead tree search in decision making. During training, TM-ACE uses only f_q instead of f^d_q.

Algorithm 2: TM-ACE
Input:
  N: number of actors
  ε_t: a noise process
  d: plan depth
  α, β: two step sizes
  f_q, f_enc, f_rew, f_trans, f_{µ_1}, ..., f_{µ_N}: parameterized by θ^Q and θ^µ  // see Section Model-based Enhancements
Output:
  The parameters θ^Q and θ^µ

Initialize the replay buffer D
for each time step t do
  Observe the state s_t
  z_t ← f_enc(s_t)
  a_t ← argmax_{a ∈ {f_{µ_i}(z_t)}_{i=1,...,N}} f^d_q(z_t, a) + ε_t  // see Equation 17 for f^d_q
  Execute the action a_t, get reward r_{t+1} and next state s_{t+1}
  Store (s_t, a_t, r_{t+1}, s_{t+1}) into D
  Sample a batch of transitions B from D
  for each transition (s, a, r, s′) in B do
    z ← f_enc(s)
    z′ ← f_enc(s′)
    y ← 0 if s′ is terminal, otherwise y ← max_i f_q(z′, f_{µ_i}(z′))
    y ← r + γ y
    r̄ ← f_rew(z, a)
    θ^Q ← θ^Q − α ∇_{θ^Q} ( (1/2)(f_q(z, a) − y)² + (1/2)(r̄ − r)² + (1/2)||f_trans(z, a) − z′||² )
    θ^µ ← θ^µ + β Σ_{i=1}^N ∇_b f_q(z, b)|_{b=f_{µ_i}(z)} ∇_{θ^µ} f_{µ_i}(z)
  end
end

Wall time comparison

We benchmarked the wall time for ACE and DDPG with an i7-8750H CPU (2.20GHz).

                              environment steps per second
DDPG                          73.63
ACE with d = 1, N = 5         27.82
ACE with d = 1, N = 10        16.12
ACE with d = 2, N = 5         6.34

Table 2: Wall time for different algorithms in HalfCheetah


Evaluation curves

Figure 3: Evaluation curves of the compared algorithms in the Roboschool tasks. The x-axis is the training steps, and the y-axis is the raw evaluation scores. Each curve is averaged over 5 independent runs, and standard errors are plotted as shadow.

Diversity of the actors

To verify the diversity of the actors, we evaluated each actor separately at the middle of training. We report the average episode return of 20 deterministic episodes for each actor in ACE and ACE-Alt. In both ACE and ACE-Alt, the actors can roughly be grouped into four groups, indicating that those actors are seeking different policies. It is also observed that actors in ACE-Alt have more overall variance than actors in ACE, which is expected since transitions are not shared among different actors in ACE-Alt.

          actor 1        actor 2        actor 3        actor 4        actor 5
ACE       714.8 (14.7)   775.5 (4.9)    550.8 (54.9)   771.2 (5.0)    645.7 (41.9)
ACE-Alt   418.8 (30.1)   631.3 (96.8)   302.7 (44.0)   945.2 (44.9)   916.2 (16.0)

Table 3: Evaluation performance of individual actors in ACE and ACE-Alt (with d = 2, N = 5) in HalfCheetah


Architectures of the algorithms

Figure 4: Architectures of the compared algorithms. (a) ACE and TM-ACE (b) DDPG and Wide-DDPG (c) Shared-DDPG (d) Ensemble-DDPG. Circles are inputs, dashed boxes are latent states, and solid boxes are outputs. A boxed a_i is an action proposed by the i-th actor, while a circled a is the actual action that the agent performed (which may include exploration noise). The shared layer of a_1, ..., a_N is not displayed for simplicity.