
Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions

Michael Chang 1 Sidhant Kaushik 1 S. Matthew Weinberg 2 Thomas L. Griffiths 2 Sergey Levine 1

Abstract

This paper¹ seeks to establish a framework for directing a society of simple, specialized, self-interested agents to solve what traditionally are posed as monolithic single-agent sequential decision problems. What makes it challenging to use a decentralized approach to collectively optimize a central objective is the difficulty in characterizing the equilibrium strategy profile of non-cooperative games. To overcome this challenge, we design a mechanism for defining the learning environment of each agent for which we know that the optimal solution for the global objective coincides with a Nash equilibrium strategy profile of the agents optimizing their own local objectives. The society functions as an economy of agents that learn the credit assignment process itself by buying and selling to each other the right to operate on the environment state. We derive a class of decentralized reinforcement learning algorithms that are broadly applicable not only to standard reinforcement learning but also for selecting options in semi-MDPs and dynamically composing computation graphs. Lastly, we demonstrate the potential advantages of a society's inherent modular structure for more efficient transfer learning.

You know that everything you think and do is thought and done by you. But what's a "you"? What kinds of smaller entities cooperate inside your mind to do your work?

(Minsky, 1988)

¹Department of Computer Science, University of California, Berkeley, USA. ²Department of Computer Science, Princeton University, USA. Correspondence to: Michael Chang <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

¹Code, talk, and blog here.

Figure 1. We study how a society of primitive agents can be abstracted as a super-agent. The incentive mechanism is the abstraction barrier that relates the optimization problems of the super-agent with those of its constituent primitive agents.

1. Introduction

Biological processes, corporations, and ecosystems – physically decentralized, yet in some sense functionally unified. A corporation, for example, optimizes for maximizing profits as if it were a single rational agent. But this agent abstraction is an illusion: the corporation is simply a collection of human agents, each solving their own optimization problems, most not even knowing of the existence of many of their colleagues. But the human as the decision-making agent is also simply an abstraction of the trillions of cells making their own simpler decisions. The society of agents is itself an agent. What mechanisms bridge between these two levels of abstraction, and under what framework can we develop learning algorithms for studying the self-organizing nature of intelligent societies that pervade so much of the world?

Both the monolithic and the multi-agent optimization frameworks in machine learning offer a language for representing only one of the levels of abstraction but not the relation between both. The monolithic framework, the most commonly used in much of modern machine learning, considers a single agent that optimizes a single objective in an environment, whether it be minimizing classification loss or maximizing return. The multi-agent framework considers multiple agents that each optimize their own independent objective and each constitute each other's learning environments. What distinguishes the multi-agent from the monolithic is the presence of multiple independent optimization problems. The difficulty of interpreting a learner in the monolithic framework as a society of simpler components is that all components are still globally coupled together by the same optimization problem without independent local


optimization problems themselves, as are the weights in a neural network trained by backpropagation. The difficulty of interpreting a multi-agent system under a global optimization problem is the computational difficulty of computing a Nash equilibrium (Daskalakis et al., 2009), even for general two-player games (Chen et al., 2009).

To better understand the relationship between the society and the agent, this paper makes four contributions, each at a different level of abstraction. At the highest level, we define the societal decision-making framework to relate the local optimization problem of the agent to the global optimization problem of the society in the following restricted setting. Each agent is specialized to transform the environment from one state to another. The agents bid in an auction at each state, and the auction winner transforms the state into another state, which it sells to the agents at the next time-step, thereby propagating a series of economic transactions. This framework allows us to ask which properties of the auction mechanism and of the society enable the global solution to the Markov decision process (MDP) that the society solves to emerge implicitly as a consequence of the agents optimizing their own independent auction utilities.

At the second level, we present a solution to this question by introducing the cloned Vickrey society, which guarantees that the dominant strategy equilibrium of the agents coincides with the optimal policy of the society. We prove this result by leveraging the truthfulness property of the Vickrey auction (Vickrey, 1961) and showing that initializing redundant agents makes the primitives' economic transactions robust against market bubbles and suboptimal equilibria.

At the third level, we propose a class of decentralized reinforcement learning algorithms for optimizing the MDP objective of the society as an emergent consequence of the agents optimizing their own auction utilities. These algorithms treat the auction utilities as optimization objectives themselves, thereby learning a societal policy that is global in space and time using only credit assignment for learnable parameters that is local in space and time.

At the fourth level, we empirically investigate various implementations of the cloned Vickrey society under our decentralized reinforcement learning algorithm and find that a particular set of design choices, which we call the credit conserving Vickrey implementation, yields the best performance at both the societal and agent levels.

Finally, we demonstrate that the societal decision-making framework, along with its solution, the algorithm that learns the solution, and the implementation of this algorithm, is a broadly applicable perspective on self-organization, not only for standard reinforcement learning but also for selecting options in semi-MDPs (Sutton et al., 1999) and composing functions in dynamic computation graphs (Chang et al., 2018). Moreover, we show evidence that the local credit assignment mechanisms of societal decision-making produce more efficient learning than the global credit assignment mechanisms of the monolithic framework.

2. Related Work

Describing an intelligent system as the product of interactions among many individual agents dates as far back as the Republic (Plato, 380 B.C.), in which Plato analyzes the human mind via an analogy to a political state. This theme continued into the early foundations of AI in the 1980s and 1990s through cognitive models such as the Society of Mind (Minsky, 1988) and Braitenberg vehicles (Braitenberg, 1986) and engineering successes in robotics (Brooks, 1991) and in visual pattern recognition (Selfridge, 1988).

The closest works to ours are the algorithms developed around that same time period that sought, as we do, to leverage a multi-agent society for achieving a global objective, starting as early as the bucket brigade algorithm (Holland, 1985), in which agents bid in a first-price auction to operate on the state and auction winners directly paid their bid to the winners from the previous step. Prototypical self-referential learning mechanisms (Schmidhuber, 1987) improved the bucket brigade by imposing credit conservation in the economic transactions. The neural bucket brigade (Schmidhuber, 1989) adapted the bucket brigade to learning neural network weights, where payoffs corresponded to weight changes. Baum (1996) observed that the optimal choice for an agent's bid should be equivalent to the optimal Q-value for executing that agent's transformation and developed the Hayek architecture for introducing new agents and removing agents that have gone broke. Kwee et al. (2001) added external memory to the Hayek architecture.

However, to date there has been, to the best of our knowledge, no proof that the bid-updating schemes proposed in these works simultaneously optimize a global objective of the society in a decision-making context. Sutton (1988) provides a convergence proof for temporal difference methods that share some properties with the bucket brigade credit assignment scheme, but importantly does not take the competition between the individual agents into account. But it is precisely the competition among agents in multi-agent learning that makes their equilibria nontrivial to characterize (Mazumdar et al., 2019). Our work offers an alternative auction mechanism for which we prove that the optimal solution for the global objective does coincide with a Nash equilibrium of the society. We follow similar motivations to Balduzzi (2014), which investigates incentive mechanisms for training a society of rational discrete-valued neurons. In contrast to other works that decouple the computation graph (Srivastava et al., 2013; Goyal et al., 2019; Peng et al., 2019; Pathak et al., 2019) but optimize a global objective, our work considers optimizing local objectives only.


We consider economic transactions between time-steps, as opposed to within a single time-step (Ohsawa et al., 2018).

3. Preliminaries

To set up a framework for societal decision-making, we relate Markov decision processes (MDPs) and auctions under a unifying language. We define an environment as a tuple that specifies an input space, an output space, and additional parameters for specifying an objective. An agent is a function that maps the input space to the output space. An objective is a functional that maps the learner to a real number. Given an environment and an objective, the problem the agent solves is to maximize the value of the objective.

In the MDP environment, the input space is the state spaceS and the output space is the action space A. The agent is apolicy π : S → A. The transition function T : S ×A → S ,the reward function r : S × A → R, and discount fac-tor γ are additional parameters that specify the objective:the return J(π) = Eτ∼pπ(τ)

[∑Tt=0 γ

tr (st, at)], where

pπ(τ) = p(s0)∏Tt=0 π(at|st)

∏T−1t=0 T (st+1|st, at). The

agent solves the problem of finding π∗ = arg maxπ J(π).For any state s, the optimal action for maximizing J(π)is π∗(s) = arg maxaQ

∗(s, a), where the optimal Qfunction Q∗(s, a) is recursively defined as Q∗(s, a) =Es′∼T (s,a) [r(s, a) + γmaxa′ Q

∗(s′, a′)|s, a].
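As a concrete illustration of these definitions, the short sketch below computes $Q^*$ for a small tabular MDP by value iteration; the example transition tensor, rewards, and tolerance are hypothetical choices of ours, not anything specified in the paper.

import numpy as np

def optimal_q(T, r, gamma=0.9, tol=1e-8):
    """Compute Q*(s, a) for a tabular MDP by iterating the Bellman optimality backup.

    T: transition tensor of shape (S, A, S), T[s, a, s'] = P(s' | s, a)
    r: reward matrix of shape (S, A)
    """
    S, A, _ = T.shape
    Q = np.zeros((S, A))
    while True:
        # Bellman optimality backup: Q(s,a) <- r(s,a) + gamma * E_{s'}[max_a' Q(s',a')]
        Q_new = r + gamma * T @ Q.max(axis=1)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# Example: a deterministic 2-state, 2-action MDP with hypothetical rewards.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[0, 1, 1] = T[1, 0, 0] = T[1, 1, 1] = 1.0
r = np.array([[0.0, 1.0], [0.0, 0.5]])
Q_star = optimal_q(T, r)
pi_star = Q_star.argmax(axis=1)  # greedy policy: pi*(s) = argmax_a Q*(s, a)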

In the auction environments we consider, the input space is a single auction item $s$ and the output space is the bidding space $\mathcal{B}$. Instead of a single agent, each of $N$ agents $\psi^{1:N}$ competes to bid for the auction item via its bidding policy $\psi^i : s \to \mathcal{B}$. Let $\mathbf{b}$ be the vector of bids produced by $\psi^{1:N}$. The vector $\mathbf{v}_s$ of each agent's valuations for auction item $s$ and the auction mechanism – allocation rule $X : \mathcal{B}^N \to [0, 1]^N$ and pricing rule $P : \mathcal{B}^N \to \mathbb{R}^N_{\geq 0}$ – are additional parameters that specify each agent's objective: the utility $U^i_s(\psi^{1:N}) = v^i_s \cdot X^i(\mathbf{b}) - P^i(\mathbf{b})$, where $X^i(\mathbf{b})$ is the proportion of $s$ allocated to $i$, and $P^i(\mathbf{b})$ is the scalar price $i$ pays. Each agent $i$ independently solves the problem of finding $\psi^{i*} = \arg\max_{\psi^i} U^i_s(\psi^{1:N})$. The independent optimization of objectives distinguishes a multi-agent problem from a single-agent one and makes multi-agent problems generally difficult to analyze, because an agent's optimal policy depends on the strategies of other agents.

However, if an auction is dominant strategy incentive compatible (DSIC), bidding one's own valuation is optimal, independent of the other players' bidding strategies. That is, truthful bidding is the unique dominant strategy. Notably, the Vickrey auction (Vickrey, 1961), which sets $P^i(\mathbf{b})$ to be the second-highest bid $\max_{j \neq i} b^j$ and $X^i(\mathbf{b}) = 1$ if $i$ wins, with $X^i(\mathbf{b})$ and $P^i(\mathbf{b})$ equal to $0$ and $0$ respectively if $i$ loses, is DSIC, which means the dominant strategy equilibrium occurs when every agent bids truthfully, making the Vickrey auction straightforward to analyze. Another attractive property of the Vickrey auction is that the dominant strategy equilibrium automatically maximizes the social welfare $\sum_{i=1}^{N} v^i \cdot X^i(\mathbf{b})$ (Roughgarden, 2016), which selects the bidder with the highest valuation as the winner. The existence of dominant strategies in the Vickrey auction removes the need for agents to recursively model others, giving the Vickrey auction the practical benefit of running in linear time (Roughgarden, 2016).
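To make the mechanism concrete, the following is a minimal sketch of the second-price (Vickrey) allocation and pricing rules described above; the function names are ours, and the example numbers are chosen only to mirror the Market Bandit rewards used later in Section 7.1.1.

import numpy as np

def vickrey_auction(bids):
    """Second-price sealed-bid auction.

    bids: array of shape (N,), one bid per agent.
    Returns (winner index, price the winner pays = second-highest bid).
    """
    bids = np.asarray(bids, dtype=float)
    winner = int(bids.argmax())                     # X^i(b) = 1 for the highest bidder
    price = float(np.delete(bids, winner).max())    # P^i(b) = max_{j != i} b^j
    return winner, price

def utilities(valuations, bids):
    """U^i_s = v^i_s * X^i(b) - P^i(b): the winner earns valuation minus second price, losers get 0."""
    winner, price = vickrey_auction(bids)
    u = np.zeros(len(bids))
    u[winner] = valuations[winner] - price
    return u

# Truthful bidding is a dominant strategy: with valuations [0.2, 0.4, 0.6, 0.8],
# bidding truthfully gives the highest-valuation agent utility 0.8 - 0.6 = 0.2.
print(utilities(np.array([0.2, 0.4, 0.6, 0.8]), np.array([0.2, 0.4, 0.6, 0.8])))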

4. Societal Decision-Making

The perspective of this paper is that a society of agents can be abstracted as an agent that itself solves an optimization problem at a global level as an emergent consequence of the optimization problems its constituent agents solve at the local level. To make this abstraction precise, we now introduce the societal decision-making framework for analyzing and developing algorithms that relate the global decision problem of a society to the local decision problems of its constituent agents. We use primitive and society to distinguish between the agents at the local and global levels, respectively, which we define in the context of their local and global environments and objectives:

Definition 4.1. A primitive $\omega$ is a tuple $(\psi, \phi_\mathcal{T})$ of a bidding policy $\psi : \mathcal{S} \to \mathcal{B}$ and a transformation $\phi_\mathcal{T} : \mathcal{S} \to \mathcal{S}$.

Definition 4.2. A society $\Omega$ is a set of primitives $\omega^{1:N}$.

The global environment is an MDP that we call the global MDP, with state space $\mathcal{S}$ and discrete action space $\mathcal{A} = \{1, \ldots, N\}$ that indexes the primitives $\omega^{1:N}$. The local environment is an auction that we call the local auction, with auction item $s \in \mathcal{S}$ and bidding space $\mathcal{B} = [0, \infty)$.

The connection between the local and global environments is as follows. Each state in the global MDP is an auction item for a different local auction. The winning primitive $\omega$ of the auction at state $s$ transforms $s$ into the next state $s'$ of the global MDP using its transformation $\phi_\mathcal{T}$, parameterized by the global MDP's transition function $\mathcal{T}$. For each primitive $i$ at each state $s$, its local objective is the utility $U^i_s(\psi^{1:N})$, and its local problem is to maximize $U^i_s(\psi^{1:N})$.

The global objective is the return $J(\pi_\Omega)$ in the global MDP of the global policy $\pi_\Omega$. The global problem for the society is to maximize $J(\pi_\Omega)$. We define the optimal societal Q-function $Q^*_\Omega(s, \omega)$ as the expected return received from $\omega$ invoking its transformation $\phi_\mathcal{T}$ on $s$ and the society activating primitives optimally with respect to $J(\pi_\Omega)$ afterward.
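The correspondence between local auctions and the global MDP can be summarized in code. The sketch below, with class and function names of our own choosing, runs one global step: every primitive bids on the current state, a Vickrey auction selects the winner, and the winner's transformation produces the next state.

from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Primitive:
    """A primitive omega = (psi, phi): a bidding policy and a state transformation."""
    bid: Callable[[object], float]        # psi: S -> B, here a deterministic bid for simplicity
    transform: Callable[[object], object] # phi_T: S -> S

def society_step(primitives, state):
    """One step of the global MDP driven by a local Vickrey auction over the state."""
    bids = np.array([p.bid(state) for p in primitives])
    winner = int(bids.argmax())                       # the auction winner is activated
    price = float(np.delete(bids, winner).max())      # Vickrey price: second-highest bid
    next_state = primitives[winner].transform(state)  # the winner operates on the environment state
    return winner, price, next_state

Iterating society_step from an initial state traces out the global policy $\pi_\Omega$ induced by the primitives' bidding policies.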

Since all decisions made at the societal level are an emergent consequence of decisions made at the primitive level, the societal decision-making framework is a self-organization perspective on a broad range of sequential decision problems. If each transformation $\phi_\mathcal{T}$ specifies a literal action, then societal decision-making is a decentralized re-framing of standard reinforcement learning (RL).


Societal decision-making also encompasses the decision problem of choosing $\phi_\mathcal{T}$s as options in semi-MDPs (Sutton et al., 1999) as well as choosing $\phi_\mathcal{T}$s as functions in a computation graph (Chang et al., 2018; Rosenbaum et al., 2017; Alet et al., 2018).

We are interested in auction mechanisms and learning algorithms for optimizing the global objective as an emergent consequence of optimizing the local objectives. Translating problems from one level of abstraction to another would provide a recipe for engineering a multi-agent system to achieve a desired global outcome and permit theoretical expectations on the nature of the equilibrium of the society, while giving us free choice over the architectures and learning algorithms of the primitive agents. To this end, we next present an auction mechanism for which the dominant strategy equilibrium of the primitives coincides with the optimal policy of the society, which we develop into a class of decentralized RL algorithms in later sections.

5. Mechanism Design for the Society

We first observe that, to produce the optimal global policy, the optimal bidding strategy for each primitive at each local auction must be to bid its societal Q-value. By defining each primitive's valuation of a state as its optimal societal Q-value at that state, we show that the Vickrey auction ensures that the dominant strategy equilibrium profile of the primitives coincides with the optimal global policy. Then we show that a market economy perspective on societal decision-making overcomes the need to assume knowledge of optimal Q-values, although it weakens the dominant strategy equilibrium to a Nash equilibrium. Lastly, we explain that adding redundant primitives to the society mitigates market bubbles by enforcing credit conservation. Proofs are in the Appendix.

5.1. Optimal Bidding

We state what was observed informally by Baum (1996):

Proposition 5.1. Assume that at each state $s$ the local auction allocates $X^i(\mathbf{b}) = 1$ if $i$ wins and $X^i(\mathbf{b}) = 0$ if $i$ loses. Then all primitives $\omega^i$ bidding their optimal societal Q-values $Q^*_\Omega(s, \omega^i)$ collectively induce an optimal global policy.

This proposition makes the problem of self-organization concrete: getting the optimal behavior in the global MDP to emerge from the optimal behavior in the local auctions can be reduced to incentivizing the primitives to bid their optimal societal Q-values at every state.

5.2. Dominant Strategies for Optimal Bidding

To incentivize the primitives to bid optimally, we propose to define the primitives' valuations $v^i_s$ for each state $s$ as their optimal societal Q-values $Q^*_\Omega(s, \omega^i)$ and to use the Vickrey auction mechanism for each local auction.

Theorem 5.2. If the valuations $v^i_s$ for each state $s$ are the optimal societal Q-values $Q^*_\Omega(s, \omega^i)$, then the society's optimal global policy coincides with the primitives' unique dominant strategy equilibrium under the Vickrey mechanism.

Then the utility $U^i_s(\psi^{1:N})$ at each state $s$ that induces the optimal global policy, which we refer to equivalently as $U^i_s(\omega^{1:N})$ for the winning primitive $\omega^i$ and $U^j_s(\omega^{1:N})$ for losing primitives $\omega^j$, is given by

$$U^i_s(\omega^{1:N}) = Q^*_\Omega(s, \omega^i) - \max_{j \neq i} b^j_s \qquad (1)$$

and by $U^j_s(\omega^{1:N}) = 0$ for losing primitives.

5.3. Economic Transactions for Propagating $Q^*_\Omega$

We have so far defined optimal bidding with respect to societal decision-making and characterized the utilities as functions of $Q^*_\Omega$ for which such bidding is a dominant strategy. We now propose to redefine the utilities without knowledge of $Q^*_\Omega$ by viewing the society as a market economy.

Monolithic frameworks for solving MDPs, such as directly optimizing the objective $J(\pi)$ with policy gradient methods, are analogous to command economies, in which all production – the transformation of past states $s_t$ into future states $s_{t+1}$ – and wealth distribution – the credit assignment of reward signals to parameters – derive directly from a single central authority – the MDP objective. In contrast, under the societal decision-making framework, the optimal global policy does not derive directly from the MDP objective, but rather emerges implicitly as the equilibrium of the primitives optimizing their own local objectives. We thus redefine the valuations $v^i_s$ following the analogy of a market economy, in which production and wealth distribution are governed by the economic transactions between the primitives.

Specifically, we couple the local auctions at consecutive time-steps in the same game by defining the valuation $v^i_{s_t}$ of primitive $\omega^i$ for winning the auction item $s_t$ as the revenue it can receive in the auction at the next time-step by selling the product $s_{t+1}$ of executing its transformation $\phi^i_\mathcal{T}$ on $s_t$. This compensation comes as the environment reward plus the (discounted) winning bid at the next time-step:

$$\underbrace{U^i_{s_t}(\omega^{1:N})}_{\text{utility}} \;=\; \underbrace{r(s_t, \omega^i) + \gamma \cdot \max_k b^k_{s_{t+1}}}_{\text{revenue, or valuation } v^i_{s_t}} \;-\; \underbrace{\max_{j \neq i} b^j_{s_t}}_{\text{price}} \qquad (2)$$

Analogous to a market economy, the revenue $\omega^i$ receives for producing $s_{t+1}$ from $s_t$ depends on the price the winning primitive $\omega^k$ at $t+1$ is willing to bid for $s_{t+1}$. In turn, $\omega^k$ sells $s_{t+2}$ to the winning primitive at $t+2$, and so on. Ultimately, currency is grounded in the reward. Wealth is distributed based on what future primitives decide to bid for the fruits of the labor of information processing carried out by past primitives transforming one state to another.
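As a minimal sketch of Equation 2 (with function and argument names that are ours, not the paper's), the utility of the primitive that wins the auction at time t can be computed from the reward it received and the bids at the two consecutive auctions:

import numpy as np

def market_utility(bids_t, bids_tp1, winner, reward, gamma=0.99):
    """Equation 2: utility of the winning primitive at time t in the Market MDP.

    bids_t:   bids at the auction for state s_t
    bids_tp1: bids at the auction for the produced state s_{t+1}
    winner:   index of the primitive that won the auction at time t
    reward:   environment reward r(s_t, omega^winner)
    """
    revenue = reward + gamma * float(np.max(bids_tp1))          # valuation v^i_{s_t}
    price = float(np.delete(np.asarray(bids_t), winner).max())  # Vickrey price: second-highest bid at t
    return revenue - price                                      # losing primitives receive utility 0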


Figure 2. The cloned Vickrey society. In this market economy of primitive agents, wealth is distributed not directly from the global MDP objective but based on what future primitives decide to bid for the fruits of the labor of information processing carried out by past primitives transforming one state to another. The primitive $\omega^{0'}_t$ that wins the auction at time $t$ receives an environment reward $r(s_t, \omega^{0'}_t)$ as well as a payment $b^{1'}_{t+1}$ from $\omega^{1'}_{t+1}$ for transforming $s_t$ to $s_{t+1}$. By the Vickrey auction, the price $\omega^{1'}_{t+1}$ pays to transform $s_{t+1}$ is the second-highest bid $b^1_{t+1}$ at time $t+1$. Because each primitive $\omega^i$ and its clone $\omega^{i'}$ have the same valuations, their bids are equivalent and so credit is conserved through time.

Definition 5.1. A Market MDP is a global MDP in which all utilities are defined as in Equation 2.

As valuations now depend on the strategies of future primitives, the dominant strategy equilibrium from Theorem 5.2 must be weakened to a Nash equilibrium in the general case:

Proposition 5.3. In a Market MDP, it is a Nash equilibrium for every primitive to bid $Q^*_\Omega(s, \omega^i)$. Moreover, if the Market MDP is finite-horizon, then bidding $Q^*_\Omega(s, \omega^i)$ is the unique Nash equilibrium that survives iterated deletion of weakly dominated strategies.

5.4. Redundancy for Credit Conservation

In general, credit is not conserved in the Market MDP: the winning primitive at time $t-1$ gets paid an amount equal to the highest bid at time $t$, but the winner at time $t$ only pays an amount equal to the second-highest bid at time $t$. At time $t$, $\omega^i$ could bid arbitrarily higher than $\max_{j \neq i} b^j$ without penalty, which distorts the valuations for primitives that bid before time $t$, creating a market bubble.

To prevent market bubbles, we propose a modification to the society, which we call a cloned society, for enforcing credit conservation: for every transformation $\phi_\mathcal{T}$, initialize at least two primitives that share the same $\phi_\mathcal{T}$.

Lemma 5.4. For a cloned society, at the Nash equilibrium specified in Proposition 5.3, what the winning primitive $\omega^i$ at time $t$ receives from the winning primitive $\omega^k$ at $t+1$ is exactly what $\omega^k$ pays: $b^k_{s_{t+1}}$.

We now state our main result.

Theorem 5.5. Define a cloned Vickrey society as a cloned society that solves a Market MDP. Then it is a Nash equilibrium for every primitive in the cloned Vickrey society to bid $Q^*_\Omega(s, \omega^i)$. In addition, the price that the winning primitive pays for winning is equivalent to what it bid.

The significance of Theorem 5.5 is that guaranteeing truthful bidding of societal Q-values decouples the analysis of the local problem within each time-step from that of the global problem across time-steps, which opens the possibility for designing learners to reach a Nash equilibrium that we know exists. Without such a separation of these two levels of abstraction, the entire society must be analyzed as a repeated game through time – a non-trivial challenge.

6. From Equilibria to Learning Objectives

So far the discussion has centered around the quality of the equilibria, which assumes the primitives know their own valuations. We now propose a class of decentralized RL algorithms for learning optimal bidding behavior without assuming knowledge of the primitives' valuations.

Instead of learning to optimize the MDP objective directly, we propose to train the primitives' parameterized bidding policies to optimize their utilities in Equation 2, yielding a class of decentralized RL algorithms for optimizing the global RL objective that is agnostic to the choice of RL algorithm used to train each primitive. By Theorem 5.5, truthful bidding by all agents is a global optimum of each agent's local learning problem that also serves as a global optimum for the society as a whole. In the special case where the transformation $\phi_\mathcal{T}$ is a literal action, this class of decentralized RL algorithms can serve as an alternative to any standard algorithm for solving discrete-action MDPs. An on-policy learning algorithm is presented in Appendix D.
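The sketch below illustrates one decentralized training loop of this flavor. It is our own simplification, not the on-policy algorithm of Appendix D: each primitive keeps a tabular bid estimate per state in place of a learned bidding policy, and only the auction winner updates, nudging its bid toward the revenue it realized under Equation 2. The environment is assumed to expose reset() and step(winner) returning (next_state, reward, done).

import numpy as np
from collections import defaultdict

def train_society(env, num_primitives, episodes=1000, gamma=0.99, lr=0.1):
    """Decentralized training: each primitive adjusts its own bid using only its local utility signal."""
    # bids[i][s] is primitive i's current bid for state s (a tabular stand-in for psi^i).
    bids = [defaultdict(float) for _ in range(num_primitives)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            b_t = np.array([bids[i][s] for i in range(num_primitives)])
            winner = int(b_t.argmax())
            s_next, reward, done = env.step(winner)   # the winner's transformation acts on the state
            b_tp1 = np.array([bids[i][s_next] for i in range(num_primitives)])
            # Winner's revenue under Equation 2: reward plus the discounted highest bid at the next auction.
            target = reward + (0.0 if done else gamma * b_tp1.max())
            # Only the winner updates, using purely local information (a bandit-style update).
            bids[winner][s] += lr * (target - bids[winner][s])
            s = s_next
    return bids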

6.1. Local Credit Assignment in Space and Time

The global problem requires a solution that is global in space, because the society must collectively work together, and global in time, because the society must maximize expected return across time-steps. But an interesting property of using the auction utility in Equation 2 as an RL objective is that, given redundant primitives, it takes the form of the Bellman equation, thereby implicitly coupling all primitives together in space and time. Thus each primitive need only optimize for its immediate utility at each time-step, without needing to optimize for its own future utilities. This class of decentralized RL algorithms thus implicitly finds a global solution in space and time using only credit assignment that is local in space, because transactions are only between individual agents, and local in time, because each primitive need only solve a contextual bandit problem at each time-step.
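One way to see this, shown below as our own rearrangement rather than a derivation taken from the paper, is that with a clone present the price the winner pays equals its own bid (Theorem 5.5), so the winner's utility in Equation 2 becomes a temporal-difference-style error:

$$U^i_{s_t} \;=\; \underbrace{r(s_t, \omega^i) + \gamma \max_k b^k_{s_{t+1}}}_{\text{Bellman-style target}} \;-\; \underbrace{b^i_{s_t}}_{\text{price, equal to the winner's own bid given a clone}}$$

At the truthful equilibrium, where every bid equals $Q^*_\Omega$, the target is (in expectation) exactly the Bellman optimality backup for the societal Q-function, which is why optimizing only immediate utilities can still propagate value through time.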


6.2. Redundancy for Avoiding Suboptimal Equilibria

A benefit of casting local auction utilities as RL objectives is the practical development of learning algorithms that need not assume oracle knowledge of valuations, but unless care is taken with each primitive's learning environment, the society may not converge to the globally optimal Nash equilibrium described in Section 5. As an example of a suboptimal equilibrium, in a Market MDP with two primitives, even if $v^1 = 1$ and $v^2 = 2$, it is a Nash equilibrium for $b^1 = 100$ and $b^2 = 0$: the first primitive wins and pays the second-highest bid of 0, while the second primitive cannot profitably win without bidding, and thus paying, far more than its valuation. Without sufficient competitive pressure not to overbid or underbid, a rogue winner lacks the risk of losing to other primitives with similarly close valuations. Fortunately, the redundancy of a cloned Vickrey society serves a dual purpose of not only preventing market bubbles but also introducing competitive pressure into the primitives' learning environments in the form of other clones.

7. Experiments

Now we study how well the cloned Vickrey society can recover the optimal societal Q-function as its Nash equilibrium. We compare several implementations of the cloned Vickrey society against baselines across simple tabular environments in Section 7.1, where the transformations $\phi_\mathcal{T}$ are literal actions. Then, in Section 7.2, we demonstrate the broad applicability of the cloned Vickrey society for learning to select options in semi-MDPs and for composing functions in a computation graph. Moreover, we show evidence for the advantages of the societal decision-making framework over its monolithic counterpart in transferring to new tasks.

Although our characterization of the equilibrium and the learning algorithm are agnostic to the implementation of the primitive, in our experiments the bidding policy $\psi$ is implemented as a neural network that maps the state to the parameters of a Beta distribution, from which a bid is sampled. We use proximal policy optimization (PPO) (Schulman et al., 2017) to optimize the bidding policy parameters.
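One possible instantiation of such a bidding policy is sketched below in PyTorch; the layer sizes are hypothetical and this is not necessarily the exact architecture used in the paper. The network maps a state to the two positive concentration parameters of a Beta distribution, from which a bid in [0, 1] is sampled along with the log-probability needed by a policy-gradient method such as PPO.

import torch
import torch.nn as nn

class BiddingPolicy(nn.Module):
    """psi: maps a state to a Beta distribution from which a bid in [0, 1] is sampled."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 2), nn.Softplus(),  # two positive Beta concentration parameters
        )

    def forward(self, state):
        alpha, beta = (self.net(state) + 1e-3).unbind(dim=-1)  # keep parameters strictly positive
        return torch.distributions.Beta(alpha, beta)

# Sampling a bid and its log-probability, as needed for a policy-gradient update such as PPO.
policy = BiddingPolicy(state_dim=4)
dist = policy(torch.randn(4))
bid = dist.sample()
log_prob = dist.log_prob(bid)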

7.1. Numerical Simulations

In the following simulations we ask: (1) How closely do the bids the primitives learn match their optimal societal Q-values? (2) Does the solution to the global objective emerge from the competition among the primitives? (3) How does redundancy affect the solutions the primitives converge to?

We first consider the Market Bandit, the simplest global MDP with one state, and compare the cloned Vickrey society with societies that use different auction mechanisms. Next, we consider the Chain environment, a sparse-reward multi-step global MDP designed to study market bubbles. Last, we consider the Duality environment, a multi-step global MDP designed to study suboptimal equilibria. We use solitary society to refer to a society without redundant clones.


Figure 3. Market Bandits. We compare the cloned Vickrey society (Cloned Vickrey Auction) against solitary societies that use the first-price auction mechanism (first price auction), the Vickrey auction mechanism (Vickrey Auction), and a mechanism whose utility is only the environment reward (Environment Reward). The dashed line in (a) indicates truthful bidding of valuations. The cloned Vickrey society's bids are closest to the true valuations, which also translates into the best global policy (b).

7.1.1. MARKET BANDITS AND AUCTION MECHANISMS

This simulation primarily studies question (1) by comparing the auction mechanism of the cloned Vickrey society against other mechanisms in eliciting bids that match optimal societal Q-values. We compare against first price auction, based on Holland (1985), a solitary society that uses the first-price auction mechanism for the local auction, in which the winning primitive pays a price equal to its own bid rather than the second-highest bid; against Vickrey Auction, a solitary version of the cloned Vickrey society; and against Environment Reward, a baseline solitary society whose utility function uses only the environment reward, with no price term $P^i(\mathbf{b})$.

The environment is a four-armed Market Bandit whose arms correspond to transformations $\phi$ that deterministically yield reward values of 0.2, 0.4, 0.6, and 0.8. Solitary societies thus have four primitives, while cloned societies have eight. Since we are mainly concerned with question (1), we stochastically drop out a subset of primitives at every round to give each primitive a chance to win and learn its value.

In the Market Bandit, the arm rewards directly specify the primitives' valuations, so if the primitives successfully learned their valuations, we would expect them to bid values exactly at 0.2, 0.4, 0.6, and 0.8. Figure 3a shows that the primitives in the cloned Vickrey society learn to bid most closely to their true valuations, which also translates into the best global policy in Figure 3b, answering question (2). To address question (3), we observe that the solitary Vickrey society's bids are more spread out than those of the cloned Vickrey society because there is no competitive pressure to learn the valuations exactly. This is not detrimental when the global MDP has only one time-step, but the next section shows that the lack of redundancy creates market bubbles that result in a suboptimal global policy.


        $\omega_t$ receives    $\omega_{t+1}$ pays
CCV     $b'_{t+1}$             $b'_{t+1}$
BB      $b_{t+1}$              $b_{t+1}$
V       $b_{t+1}$              $b'_{t+1}$

Figure 4. Implementations. The table shows the bid price that temporally consecutive winners $\omega_{t+1}$ and $\omega_t$ pay and receive, respectively, under three possible implementations of the cloned Vickrey society: CCV, BB, and V, with tradeoffs depicted in the Venn diagram. We use $b_{t+1}$ and $b'_{t+1}$ to denote the highest and second-highest bids at time $t+1$, respectively.
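The three payment schemes can be stated compactly in code. This is our own bookkeeping sketch of the table above, not the authors' implementation: given the highest and second-highest bids at time t+1, it returns what the winner at time t receives and what the winner at time t+1 pays.

def settlement(scheme, highest_tp1, second_tp1):
    """Amount omega_t receives and omega_{t+1} pays at time t+1 under each implementation."""
    if scheme == "CCV":   # credit conserving Vickrey: both sides use the second-highest bid
        return second_tp1, second_tp1
    if scheme == "BB":    # bucket brigade: both sides use the highest bid (a first-price payment)
        return highest_tp1, highest_tp1
    if scheme == "V":     # Vickrey: receive the highest bid but pay only the second-highest
        return highest_tp1, second_tp1
    raise ValueError(f"unknown scheme: {scheme}")

# Example: with bids of 0.9 and 0.7 at time t+1, the V scheme creates 0.2 of unbacked credit
# (receives 0.9, pays 0.7), which is the market-bubble failure mode discussed in Section 5.4.
print(settlement("V", 0.9, 0.7))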

7.1.2. MARKET MDPS

Now we consider MDPs that involve multiple time-steps to understand how learning is affected when primitives' valuations are defined by the bids of future primitives. Redundancy theoretically makes what the winner $\omega_t$ receives and what $\omega_{t+1}$ pays equivalent, but the stochasticity of the bid distribution yields various possible implementations of the cloned Vickrey society in the Market MDP: bucket brigade (BB), Vickrey (V), and credit conserving Vickrey (CCV), summarized in Figure 4, with different pros and cons. Note that the solitary bucket brigade society is a multiple time-step analog of the first-price auction in Section 7.1.1. We aim to empirically test which implementation of the cloned Vickrey society yields the best local and global performance.

The Chain environment and market bubbles. A primary purpose of the Chain environment (Figure 6a) is to study the effect of redundancy on mitigating market bubbles in the bidding behavior, and how that affects global optimality. The Chain environment is a sparse-reward, finite-horizon environment that initializes the society at $s_0$ on the left, yields a terminal reward of 0.8 if the society enters the goal state $s_5$ on the right, and ends the episode after 20 steps if the society does not reach the goal. Chain thus tests the ability of the society to propagate utility from future agents to past agents in the absence of an immediate environment reward signal. We compare the bidding behaviors of the BB, V, and CCV implementations of the cloned Vickrey society, as well as those of their solitary counterparts, in Figure 5, and the society's global learning curve in Figure 8a.
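For concreteness, a minimal Chain-style environment matching this description might look as follows; the state indexing, the mapping from primitive index to move, and the method names are our own illustrative choices. It exposes the same reset()/step(winner) interface assumed by the training-loop sketch in Section 6.

class ChainEnv:
    """Sparse-reward chain: start at s0, terminal reward 0.8 at the goal s5, horizon of 20 steps."""
    def __init__(self, n_states=6, horizon=20, goal_reward=0.8):
        self.n_states, self.horizon, self.goal_reward = n_states, horizon, goal_reward

    def reset(self):
        self.state, self.t = 0, 0
        return self.state

    def step(self, winner):
        # winner indexes the activated primitive: even indices move left, odd move right
        # (so a clone shares its original's move under this hypothetical indexing).
        move = -1 if winner % 2 == 0 else 1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        self.t += 1
        reached_goal = self.state == self.n_states - 1
        done = reached_goal or self.t >= self.horizon
        reward = self.goal_reward if reached_goal else 0.0
        return self.state, reward, done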

To answer question (1), we observe that redundancy indeed prevents market bubbles, with the cloned CCV implementation bidding closest to the optimal societal Q-values. Details are in the caption of Figure 5. When we consider the society's global learning curve in Figure 8a, the answers to questions (2) and (3) go hand-in-hand: the solitary societies fail to find the globally optimal policy, and the cloned CCV implementation has the highest sample efficiency.

(a) CCV solitary (b) BB solitary (c) V solitary

(d) CCV clone (e) BB clone (f) V clone

Figure 5. Learned Bidding Strategies for Chain. We organize the analysis by distinguishing between the credit-conserving (CCV and BB) and the non-credit-conserving (V) implementations. The solitary CCV (a) and BB (b) implementations learn to bid very close to 0: CCV because the valuation for a primitive at $t$ is only the second-highest bid at $t+1$, resulting in a rapid decay in the valuations leftwards down the chain; BB because each primitive is incentivized to pay as low a price for winning as possible. The cloned CCV (d) and BB (e) implementations learn to implement a form of return decomposition (Arjona-Medina et al., 2019) that redistributes the terminal reward into a series of positive payoffs back through the chain, each agent getting paid for contributing to moving the society closer to the goal state, where the CCV implementation's bids are closer to the optimal societal Q-values than those of the BB implementation. Because both the solitary (c) and cloned (f) versions of the V implementation do not conserve credit, they learn to bid close to the optimal societal Q-values, but both suffer from market bubbles in which the primitive for going left bids higher than the primitive for going right, even though the optimal global policy is to keep moving right.

The Duality environment and suboptimal equilibria. The Chain environment experiments suggested a connection between lack of redundancy and globally suboptimal equilibria, the subtleties of which we explore further in the Duality environment (Figure 6b). The CCV implementation has yielded the best performance so far but does not guarantee Bellman optimality (Figure 4) without redundant primitives. We show that in the Duality environment, without redundant primitives, the dominant strategy equilibrium would lead the society to get stuck in a self-loop at $s_1$ indefinitely, even though the globally optimal solution would be to cycle back and forth between $s_0$ and $s_1$. The reasoning for this is explained in Appendix F.1.2. Figure 8b indeed shows that a society with redundant primitives learns a better equilibrium than one without.

7.2. Semi-MDPs and Computation Graphs

One benefit of framing global decision-making from the perspective of local economic transactions is that the same societal decision-making framework and learning algorithms can be used regardless of the type of transformation $\phi_\mathcal{T}$.


Figure 6. Multi-Step MDPs. (a) In the Chain environment, the society starts at state $s_0$ and the goal state is $s_5$. Only activating primitive $\omega^1$ at state $s_4$ yields reward. The optimal global policy is to move directly right by continually activating $\omega^1$. Without credit conservation, the society may get stuck going back and forth between $s_0$ and $s_4$ without reaching the goal. (b) In the Duality environment, the society starts at state $s_0$. $s_{-1}$ is an absorbing state with perpetual negative rewards. The optimal societal policy is to cycle between $s_0$ and $s_1$ to receive unbounded reward, but without redundant primitives, the society may end up in a suboptimal perpetual self-loop at $s_1$.

(a) solitary: s−1 (b) solitary: s0 (c) solitary: s1

(d) cloned: s−1 (e) cloned: s0 (f) cloned: s1

Figure 7. CCV Bidding Curves for Duality. Each column shows the bidding curves of the solitary (top row) and cloned (bottom row) CCV societies for states $s_{-1}$, $s_0$, and $s_1$. Without redundant primitives to force the second-highest and highest valuations to be equal, the dominant strategy of truthful bidding may not coincide with the globally optimal policy, because the solitary CCV implementation does not guarantee Bellman optimality. The bidding curves in (c) show that $\omega^1$ learns a best response of bidding higher than primitive $\omega^0$ at state $s_1$, even though it would be globally optimal for the society if $\omega^0$ won at $s_1$. Adding redundant primitives forces the second-highest and highest valuations to be equal, causing $\omega^0$ to learn to bid highly as well at $s_1$, which results in a more optimal return, as shown in Figure 8b.

We now show in the Two Rooms environment that the cloned Vickrey society can learn more efficiently than a monolithic counterpart to select among pre-trained options to solve a gym-minigrid (Chevalier-Boisvert et al., 2018) navigation task that involves two rooms separated by a bottleneck state. We also show in the Mental Rotation environment that the cloned Vickrey society can learn to dynamically compose computation graphs of pre-specified affine transformations for classifying spatially transformed MNIST digits. We use the cloned CCV society for these experiments.

7.2.1. TRANSFERRING WITH OPTIONS IN SEMI-MDPS

We construct a two-room environment, Two Rooms, which requires opening the red door and reaching either a green goal or a blue goal.

(a) Chain (b) Duality

Figure 8. Multi-Step MDP Global Learning Curves. We observe that cloned societies are more robust against suboptimal equilibria than solitary societies. Furthermore, the cloned CCV implementation achieves the best sample efficiency, suggesting that truthful bidding and credit conservation are important properties to enforce for enabling the optimal global policy to emerge.

The transformations $\phi_\mathcal{T}$ are subpolicies that have been pre-trained to open the red door, reach the green goal, and reach the blue goal. In the pre-training task, only reaching the green goal gives a non-zero terminal reward; reaching the blue goal gives no reward. In the transfer task, the rewards for reaching the green and blue goals are switched. We compared the cloned Vickrey society against a non-hierarchical monolithic baseline that selects among the low-level gym-minigrid actions, as well as a hierarchical monolithic baseline that selects among the pre-trained subpolicies. Both baseline policies sample from a Categorical distribution and are trained with PPO. The cloned Vickrey society is more sample efficient in learning the pre-training task and significantly faster in adapting to the transfer task (Figure 10).

Our hypothesized explanation for this efficiency is that the local credit assignment mechanisms of a society parallelize learning across space and time, a property that does not hold for the global credit assignment mechanisms of a monolithic learner. To test this hypothesis, we observe in the bottom-right of Figure 10 that not only has a higher percentage of the hierarchical monolithic baseline's weights shifted during transfer compared to our method, but they have also shifted to a larger degree, which suggests that the hierarchical monolithic baseline's weights are more globally coupled and perhaps thereby slower to transfer.


Figure 9. Mental Rotation. The cloned Vickrey society learns to transform the image into a form that can be classified correctly by a pre-trained classifier by composing two of six possible affine transformations of rotation and translation. Clones are indicated by an apostrophe. In this example, the society activated primitive $\omega^{2'}$ to translate the digit up and then primitive $\omega^1$ to rotate the digit clockwise. Though the bidding policies $\psi^i$ and $\psi^{i'}$ of the clones $\omega^i$ and $\omega^{i'}$ have the same parameters, their sampled bids may be different because the bidding policies are stochastic.

Figure 10. Two Rooms. The cloned Vickrey society adapts more quickly than the hierarchical monolithic baseline in both the pre-training and the transfer tasks. The bottom-right figure, which is a histogram of the absolute values of how much the weights have shifted from fine-tuning on the transfer task, shows that more weights shift, and to a larger degree, in the hierarchical monolithic baseline than in our method. This seems to suggest that the cloned Vickrey society is a more modular learner than the hierarchical monolithic baseline. The non-hierarchical monolithic baseline does not learn to solve the task from scratch.

While this analysis is not comprehensive, it is a suggestive result that motivates future work for further study.

7.2.2. COMPOSING DYNAMIC COMPUTATION GRAPHS

We adapt the Image Transformations task from Chang et al. (2018) as what we call the Mental Rotation environment (Figure 9), in which MNIST images have been transformed with a composition of one of two rotations and one of four translations. There are 60,000 × 8 = 480,000 possible unique inputs, meaning that learning an optimal global policy involves training primitives across 480,000 local auctions. The transformations $\phi_\mathcal{T}$ are pre-specified affine transformations. The society must transform the image into a form that can be classified correctly by a pre-trained classifier, with a terminal reward of 1 if the predicted label is correct and 0 otherwise. The cloned Vickrey society converges to a mean return of 0.933 with a standard deviation of 0.014.

8. Conclusion

This work formally defines the societal decision-making framework and proves the optimality of a society – the cloned Vickrey society – for solving a problem that was first posed in the AI literature in the 1980s: to specify the incentive structure that causes the global solution of a society to emerge as the equilibrium strategy profile of self-interested agents. For training the society, we further proposed a class of decentralized reinforcement learning algorithms whose global objective decouples in space and time into simpler local objectives. We have demonstrated the generality of this framework for selecting actions, options, and computations, as well as its potential advantages for transfer learning.

The generality of the societal decision-making framework opens much opportunity for future work in decentralized reinforcement learning. A society's inherent modular structure suggests the potential of reformulating problems with global credit-assignment paths into problems with much more local credit-assignment paths that offer more parallelism and independence in the training of different components of a learner. Understanding the learning dynamics of multi-agent societies continues to be an open problem of research. It would be exciting to explore algorithms for constructing and learning societies in which the primitives are themselves societies. We hope that the societal decision-making framework and its associated decentralized reinforcement learning algorithms provide a foundation for future work exploring the potential of creating AI systems that reflect the collective intelligence of multi-agent societies.


Acknowledgements

The authors are grateful to the anonymous reviewers, Michael Janner, and Glen Berseth for their feedback on this paper, as well as to Aviral Kumar, Sam Toyer, Anirudh Goyal, Marco Cusumano-Towner, Abhishek Gupta, and Dinesh Jayaraman for helpful discussions. This research was supported by AFOSR grant FA9550-18-1-0077, NSF CCF-1717899, and Google Cloud Platform. MC is supported by the National Science Foundation Graduate Research Fellowship Program. SK is supported by Berkeley Engineering Student Services.

References

Alet, F., Lozano-Perez, T., and Kaelbling, L. P. Modular meta-learning. arXiv preprint arXiv:1806.10166, 2018.

Arjona-Medina, J. A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., and Hochreiter, S. RUDDER: Return decomposition for delayed rewards. In Advances in Neural Information Processing Systems, pp. 13544–13555, 2019.

Balduzzi, D. Cortical prediction markets. arXiv preprint arXiv:1401.1465, 2014.

Baum, E. B. Toward a model of mind as a laissez-faire economy of idiots. In ICML, pp. 28–36, 1996.

Bellman, R. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

Braitenberg, V. Vehicles: Experiments in Synthetic Psychology. 1986.

Brooks, R. A. Intelligence without representation. Artificial Intelligence, 47(1-3):139–159, 1991.

Chang, M. B., Gupta, A., Levine, S., and Griffiths, T. L. Automatically composing representation transformations as a means for generalization. arXiv preprint arXiv:1807.04640, 2018.

Chen, X., Deng, X., and Teng, S.-H. Settling the complexity of computing two-player Nash equilibria. Journal of the ACM (JACM), 56(3):1–57, 2009.

Chevalier-Boisvert, M., Willems, L., and Pal, S. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.

Daskalakis, C., Goldberg, P. W., and Papadimitriou, C. H. The complexity of computing a Nash equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009.

Goyal, A., Sodhani, S., Binas, J., Peng, X. B., Levine, S., and Bengio, Y. Reinforcement learning with competitive ensembles of information-constrained primitives. arXiv preprint arXiv:1906.10667, 2019.

Holland, J. H. Properties of the bucket brigade. In Proceedings of an International Conference on Genetic Algorithms and their Applications, pp. 1–7, 1985.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kwee, I., Hutter, M., and Schmidhuber, J. Market-based reinforcement learning in partially observable worlds. In International Conference on Artificial Neural Networks, pp. 865–873. Springer, 2001.

Mazumdar, E., Ratliff, L. J., Jordan, M. I., and Sastry, S. S. Policy-gradient algorithms have no guarantees of convergence in continuous action and state multi-agent settings. arXiv preprint arXiv:1907.03712, 2019.

Minsky, M. Society of Mind. Simon and Schuster, 1988.

Ohsawa, S., Akuzawa, K., Matsushima, T., Bezerra, G., Iwasawa, Y., Kajino, H., Takenaka, S., and Matsuo, Y. Neuron as an agent, 2018. URL https://openreview.net/forum?id=BkfEzz-0-.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8026–8037, 2019.

Pathak, D., Lu, C., Darrell, T., Isola, P., and Efros, A. A. Learning to control self-assembling morphologies: A study of generalization via modularity. In Advances in Neural Information Processing Systems, pp. 2295–2305, 2019.

Peng, X. B., Chang, M., Zhang, G., Abbeel, P., and Levine, S. MCP: Learning composable hierarchical control with multiplicative compositional policies. arXiv preprint arXiv:1905.09808, 2019.

Plato. Republic. 380 B.C.

Rosenbaum, C., Klinger, T., and Riemer, M. Routing networks: Adaptive selection of non-linear functions for multi-task learning. arXiv preprint arXiv:1711.01239, 2017.

Roughgarden, T. Twenty Lectures on Algorithmic Game Theory. Cambridge University Press, 2016.

Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

Schmidhuber, J. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403–412, 1989.


Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Selfridge, O. G. Pandemonium: A paradigm for learning, pp. 115–122. MIT Press, Cambridge, MA, USA, 1988. ISBN 0262010976.

Srivastava, R. K., Masci, J., Kazerounian, S., Gomez, F., and Schmidhuber, J. Compete to compute. In Advances in Neural Information Processing Systems, pp. 2310–2318, 2013.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

Vickrey, W. Counterspeculation, auctions, and competitive sealed tenders. The Journal of Finance, 16(1):8–37, 1961.