Learning to Ground Multi-Agent Communication with Autoencoders


Toru Lin (MIT CSAIL) [email protected]

Minyoung Huh (MIT CSAIL) [email protected]

Chris Stauffer (Facebook AI) [email protected]

Ser-Nam Lim (Facebook AI) [email protected]

Phillip Isola (MIT CSAIL) [email protected]

Abstract

Communication requires having a common language, a lingua franca, between agents. This language could emerge via a consensus process, but it may require many generations of trial and error. Alternatively, the lingua franca can be given by the environment, where agents ground their language in representations of the observed world. We demonstrate a simple way to ground language in learned representations, which facilitates decentralized multi-agent communication and coordination. We find that a standard representation learning algorithm, autoencoding, is sufficient for arriving at a grounded common language. When agents broadcast these representations, they learn to understand and respond to each other's utterances and achieve surprisingly strong task performance across a variety of multi-agent communication environments.

1 Introduction

An essential aspect of communication is that each pair of speaker and listener must share a common understanding of the symbols being used [9]. For artificial agents interacting in an environment, with a communication channel but without an agreed-upon communication protocol, this raises the question: how can meaningful communication emerge before a common language has been established?

To address this challenge, prior works have used supervised learning [19], centralized learning [15, 18, 30], or differentiable communication [7, 18, 30, 34, 43]. Yet, none of these mechanisms is representative of how communication emerges in nature, where animals and humans have evolved communication protocols without supervision and without a centralized coordinator [37]. The communication model that most closely resembles language learning in nature is a fully decentralized model, where agents' policies are independently optimized. However, decentralized models perform poorly even in simple communication tasks [18], even with additional inductive biases [11].

We tackle this challenge by first making the following observations on why emergent communication is difficult in a decentralized multi-agent reinforcement learning setting. A key problem that prevents agents from learning meaningful communication is the lack of a common grounding in communication symbols [3, 11, 20]. In nature, the emergence of a common language is thought to be aided by physical biases and embodiment [31, 44] (we can only produce certain vocalizations, these sounds can only be heard a certain distance away, these sounds bear similarity to natural sounds in the environment, etc.), yet artificial communication protocols are not a priori grounded in aspects of the environment

Project page, code, and videos can be found at https://toruowo.github.io/marl-ae-comm/.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

arXiv:2110.15349v1 [cs.LG] 28 Oct 2021


dynamics. This poses a severe exploration problem, as the chances of a consistent protocol being found and rewarded are extremely small [18]. Moreover, before a communication protocol is found, the random utterances transmitted between agents add to the already high variance of multi-agent reinforcement learning, making the learning problem even more challenging [11, 30].

To overcome the grounding problem, an important question to ask is: do agents really need to learn language grounding from scratch through random exploration in an environment where success is determined by chance? Perhaps nature has a different answer; previous studies in cognitive science and evolutionary linguistics [21, 38, 41, 42] have provided evidence for the hypothesis that communication first started from sounds whose meanings are grounded in the physical environment, and that creatures then adapted to make sense of those sounds and make use of them. Inspired by language learning in natural species, we propose a novel framework for grounding multi-agent communication: first ground speaking through learned representations of the world, then learn listening to interpret these grounded utterances. Surprisingly, even with the simple representation learning task of autoencoding, our approach eases the learning of communication in fully decentralized multi-agent settings and greatly improves agents' performance in multi-agent coordination tasks that are nearly unsolvable without communication.

The contributions of our work can be summarized as follows:

• We formulate communication grounding as a representation learning problem and propose to use observation autoencoding to learn a common grounding across all agents.

• We experimentally validate that this is an effective approach for learning decentralized communication in MARL settings: a communication model trained with a simple autoencoder can consistently outperform baselines across various MARL environments.

• In turn, our work highlights the need to rethink the problem of emergent communication, where we demonstrate the essential need for visual grounding.

2 Related Work

In multi-agent reinforcement learning (MARL), achieving successful emergent communication with decentralized training and a non-differentiable communication channel is an important yet challenging task that has not been satisfactorily addressed by existing works. Due to the non-stationary and non-Markovian transition dynamics in multi-agent settings, straightforward implementations of standard reinforcement learning methods such as Actor-Critic [24] and DQN [33] perform poorly [18, 30].

Centralized learning is often used to alleviate the problem of high variance in MARL, for example by learning a centralized value function that has access to the joint observation of all agents [10, 15, 30]. However, it turns out that MARL models are unable to solve tasks that rely on emergent communication, even with centralized learning and shared policy parameters across the agents [18]. Eccles et al. [11] provide an analysis that illustrates how MARL with communication poses a more difficult exploration problem than standard MARL, which is confirmed by empirical results in [18, 30]: communication exacerbates the sparse and high-variance reward signal in MARL.

Many works therefore resort to differentiable communication [7, 18, 30, 34, 43], where agents are allowed to directly optimize each other's communication policies through gradients. Among them, Choi et al. [7] explore a high-level idea that is similar to ours: generating messages that the model itself can interpret. However, these approaches impose a strong constraint on the nature of communication, which limits their applicability to many real-world multi-agent coordination tasks.

Jaques et al. [22] propose a method that allows independently trained RL agents to communicate and coordinate. However, the proposed method requires that an agent either has access to the policies of other agents or stays in close proximity to other agents. These constraints make it difficult for the same method to be applied to a wider range of tasks, such as those in which agents are not embodied or do not observe others directly. Eccles et al. [11] attempt to solve the same issue by introducing inductive biases for positive signaling and positive listening, but the implementation requires extensive task-specific hyperparameter tuning, and its effectiveness is limited.

It is also worth noting that, while a large number of existing works on multi-agent communication take structured state information as input [5, 16, 17, 18, 34, 36], we train agents to learn a communication protocol directly from raw pixel observations. This presents additional challenges due to


the unstructured and ungrounded nature of pixel data, as shown in [3, 7, 26]. To our knowledge, this work is the first to effectively use representation learning to aid communication learning from pixel inputs in a wide range of MARL task settings.

3 Preliminaries

We model multi-agent reinforcement learning (MARL) with communication as a partially observable general-sum Markov game [28, 40], where each agent can broadcast information to a shared communication channel. Each agent receives a partial observation of the underlying world state at every time step, including all information communicated in the shared channel. This observation is used to learn an appropriate policy that maximizes the agent's environment reward. In this work, we parameterize the policy function using a deep neural network.

Formally, decentralized MARL with communication can be expressed as a partially observable Markov game M = ⟨S, A, C, O, T, R, γ⟩, where N is the number of agents, S is the state space, and A = {A_1, ..., A_N}, C = {C_1, ..., C_N}, and O = {O_1, ..., O_N} are the sets of action, communication, and observation spaces, respectively.

At time step t, an agent k observes a partial view o_t^(k) of the underlying true state s_t, as well as the set of messages communicated at the previous time step, c_{t-1} = {c_{t-1}^(1), ..., c_{t-1}^(N)}. The agent then chooses an action a_t^(k) ∈ A_k and a subsequent message to broadcast, c_t^(k) ∈ C_k. Given the joint actions of all N agents, a_t = {a_t^(1), ..., a_t^(N)} ∈ (A_1, ..., A_N), the transition function T : S × A_1 × ... × A_N → Δ(S) maps the current state s_t and the set of agent actions a_t to a distribution over the next state s_{t+1}. Since the transition function T is non-deterministic, we denote the set of probability distributions over S as Δ(S). Finally, each agent receives an individual reward r_t^(k) ∈ R(s_t, a_t), where R : S × A_1 × ... × A_N → R.

In our work, we consider a fully cooperative setting in which the objective of each agent is to maximize the total expected return of all agents:

    maximize_{π : S → A × C}  E[ Σ_{t ∈ T} Σ_{k ∈ N} γ^t R(s_t, a_t) | (a_t, c_t) ∼ π^(k), s_t ∼ T(s_{t-1}) ]    (1)

for some finite time horizon T and discount factor γ.

In MARL, the aforementioned objective function is optimized using policy gradients. Specifically, we use asynchronous advantage actor-critic (A3C) [32] with Generalized Advantage Estimation [39] to optimize our policy network. The policy network in A3C outputs a distribution over the actions and the discounted future returns. Any other policy optimization algorithm could be used in lieu of A3C. We do not use centralized training or self-play, and only consider decentralized training where each agent is parameterized by an independent policy.
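To make the per-agent objective concrete, the sketch below shows an A3C-style loss with Generalized Advantage Estimation computed over a single agent's rollout; each agent would optimize such a loss with its own parameters and optimizer, with no gradients flowing between agents. This is a minimal sketch assuming PyTorch and a terminating rollout (no bootstrap value); the helper names (gae, a3c_loss) and loss coefficients are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of T steps."""
    advantages = torch.zeros_like(rewards)
    gae_t = 0.0
    for t in reversed(range(len(rewards))):
        # assume the episode terminates at the end of the rollout (no bootstrap)
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae_t = delta + gamma * lam * gae_t
        advantages[t] = gae_t
    return advantages

def a3c_loss(logits, actions, values, rewards, gamma=0.99):
    """Policy-gradient + value + entropy loss for a single agent's rollout."""
    adv = gae(rewards, values.detach(), gamma)
    returns = adv + values.detach()
    log_probs = F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(log_probs * adv).mean()
    value_loss = F.mse_loss(values, returns)
    entropy = -(F.softmax(logits, dim=-1) * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    return policy_loss + 0.5 * value_loss - 0.01 * entropy
```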

4 Grounding Representation for Communication with Autoencoders

The main challenge of learning to communicate in fully decentralized MARL settings is that there is no grounded information to which agents can associate their symbolic utterances. This lack of grounding creates a dissonance across agents and poses a difficult exploration problem; the gradient signals received by agents are therefore largely inconsistent. As the time horizon, communication space, and number of agents grow, this grounding problem becomes even more pronounced. This difficulty is highlighted in numerous prior works, with empirical results showing that agents often fail to use the communication channel at all during decentralized learning [3, 11, 20, 30].

We propose a simple and surprisingly effective approach to mitigate this issue: using a self-supervised representation learning task to learn a common grounding [45] across all agents. Having such a grounding enables speakers to communicate messages that are understandable to the listeners and convey meaningful information about entities in the environment, though agents need not use the same symbols to mean the same things. Specifically, we train each agent to independently learn to auto-encode its own observation and use the learned representation for communication. This approach offers the benefits of allowing fully decentralized training without needing additional architectural


Figure 1: Overview: The overall schematic of our multi-agent system. All agents share the same individual model architecture, but each agent is independently trained to learn to auto-encode its own observation and use the learned representation for communication. At each time step, each agent observes an image representation of the environment as well as messages broadcasted by other agents during the last time step. The image pixels are processed through an Image Encoder; the broadcasted messages are processed through a Message Encoder; the image features and the message features are concatenated and passed through a Policy Network to predict the next action. The image features are also used to generate the next communication messages using the Communication Autoencoder.

bias or supervision signal. In Section 5, we show the effectiveness of our approach on a variety of MARL communication tasks.

An overview of our method is shown in Figure 1, which illustrates the communication flow among agents at some arbitrary time step t. All agents share the same individual model architecture, and each agent consists of two modules: a speaker module and a listener module. We describe the architecture details of a single agent k below.

4.1 Speaker Module

At each time step t, the speaker module takes in the agent's observation o_t^(k) and outputs the agent's next communication message c_t^(k).

Image Encoder: Given the raw pixel observation, the module first uses an image encoder to embed the pixels into a low-dimensional feature, o_t^(k) → φ_im(o_t^(k)) ∈ R^128. The image encoder is a convolutional neural network with 4 convolutional layers, and the output of this network is spatially pooled. We use the same image encoder in the listener module.
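As a concrete reference, a sketch of such an encoder is shown below, assuming PyTorch. The channel widths, kernel sizes, and choice of average pooling are illustrative assumptions (the exact architecture is given in the appendix); only the 4-convolutional-layer structure and the 128-dimensional spatially pooled output follow the description above.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """4 conv layers followed by spatial pooling to a 128-d feature."""
    def __init__(self, in_channels=3, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # spatial pooling

    def forward(self, obs):                   # obs: (B, C, H, W) pixels
        h = self.conv(obs)
        return self.pool(h).flatten(1)        # phi_im(o): (B, 128)
```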

Communication Autoencoder: The goal of the communication autoencoder is to take the current state observation and generate the next message. We use an autoencoder to learn a mapping from the feature space of the image encoder to communication symbols, i.e. φ_im(o_t^(k)) → c_t^(k). The autoencoder consists of an encoder and a decoder, both parameterized by a 3-layer MLP. The decoder tries to reconstruct the input state features from the communication message, c_t^(k) → φ̂_im(o_t^(k)). The communication messages are quantized before being passed through the decoder. We use a straight-through estimator to differentiate through the quantization [1]. The auxiliary objective of our model is to minimize the reconstruction loss ‖φ_im(o_t^(k)) − φ̂_im(o_t^(k))‖₂². This loss is optimized jointly with the policy gradient loss from the listener module.
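Below is a minimal sketch of such a communication autoencoder, assuming PyTorch. The binary quantization and the message length are illustrative choices on our part; the paper specifies only that the 128-d image feature is mapped to quantized message symbols, with gradients passed through a straight-through estimator, and that the decoder reconstructs the feature.

```python
import torch
import torch.nn as nn

class CommAutoencoder(nn.Module):
    """Maps the 128-d image feature to a discrete message and reconstructs the feature."""
    def __init__(self, feat_dim=128, msg_len=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, msg_len))      # 3-layer MLP encoder
        self.dec = nn.Sequential(nn.Linear(msg_len, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim))     # 3-layer MLP decoder

    def forward(self, phi_im):
        logits = self.enc(phi_im)
        hard = (logits > 0).float()               # quantize to discrete {0, 1} symbols
        msg = hard + logits - logits.detach()     # straight-through gradient estimator
        recon = self.dec(msg)                     # reconstruction of phi_im from the message
        ae_loss = (recon - phi_im).pow(2).mean()  # reconstruction loss, trained jointly with RL
        return msg, ae_loss
```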

4.2 Listener Module

While the goal of the speaker module is to output grounded communication based on the agent's private observation o_t^(k), the goal of the listener module is to learn an optimal action policy based on both the observation o_t^(k) and the communicated messages c_{t-1}. At each time step t, the listener module outputs the agent's next action a_t^(k).



Figure 2: MarlGrid Environment: We introduce two new grid environments: FindGoal (left) and RedBlueDoors (right). These environments are adapted from the GridWorld environment [6, 35]. Environment states are randomized at every episode and are partially observable to the agents. In FindGoal, the task is to reach the green goal location. Each agent receives a reward of 1 when it reaches the goal, and an additional reward of 1 when all 3 agents reach the goal within the time frame. In RedBlueDoors, the task is ordinal, meaning that the ordering of actions matters. A reward of 1 is given to both agents if and only if the red door is opened first and then the blue door.

Message Encoder: The message encoder linearly projects all messages communicated at the previous time step, c_{t-1}, using a shared embedding layer. The information across all agent message embeddings is combined through concatenation and passed through a 3-layer MLP. The resulting message feature has a fixed dimension of 128, i.e. φ_comm(c_{t-1}) ∈ R^128.

Policy Network: Each agent uses an independent policy head, which is a standard GRU [8] policy with a linear layer. The GRU policy concatenates the encoded image features and the message features, φ_t^(k) = φ_im(o_t^(k)) ∘ φ_comm(c_{t-1}), where ∘ is the concatenation operator across the feature dimension. The GRU policy predicts a distribution over the actions, a ∼ π(φ_t^(k)), and the corresponding expected returns. The predicted action distribution and expected returns are used for computing the policy gradient loss. This loss is jointly optimized with the autoencoder reconstruction loss from the speaker module.
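The listener components can be summarized in a sketch like the one below, assuming PyTorch. The hidden sizes, message length, and action-space size are illustrative assumptions; only the shared linear message embedding, the 3-layer MLP producing a 128-d message feature, the feature concatenation, and the GRU policy head with action and value outputs follow the description above.

```python
import torch
import torch.nn as nn

class Listener(nn.Module):
    """Message encoder + GRU policy head (hyperparameters are illustrative)."""
    def __init__(self, n_agents, msg_len=10, feat_dim=128, n_actions=7):
        super().__init__()
        self.embed = nn.Linear(msg_len, 32)       # shared linear projection per message
        self.msg_mlp = nn.Sequential(nn.Linear(32 * n_agents, 128), nn.ReLU(),
                                     nn.Linear(128, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))   # 3-layer MLP -> 128-d feature
        self.gru = nn.GRUCell(feat_dim * 2, 128)  # GRU policy over concatenated features
        self.pi = nn.Linear(128, n_actions)       # action logits
        self.v = nn.Linear(128, 1)                # expected-return (value) estimate

    def forward(self, phi_im, messages, h):
        # messages: (B, n_agents, msg_len) broadcast at the previous time step
        m = torch.cat([self.embed(messages[:, i]) for i in range(messages.size(1))], dim=-1)
        phi_comm = self.msg_mlp(m)                         # (B, 128) message feature
        x = torch.cat([phi_im, phi_comm], dim=-1)          # concatenate image and message features
        h = self.gru(x, h)
        return self.pi(h), self.v(h), h
```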

The same setup is used for all experiments. The exact details of the network architecture and the corresponding training details are in the Appendix.

5 Experiments

In this section, we demonstrate that autoencoding is a simple and effective representation learning algorithm to ground communication in MARL. We evaluate our method on various multi-agent environments and quantitatively show that our method outperforms baseline methods. We then provide further ablations and analyses of the learned communication.

5.1 Environments

We introduce three multi-agent communication environments: CIFAR Game, FindGoal, and RedBlueDoors. Our work focuses on fully cooperative scenarios, but can also be extended to competitive or mixed scenarios.

Our environments cover a wide range of communication task settings, including (1) referential or non-referential, (2) ordinal or non-ordinal, and (3) two-agent versus generalized multi-agent. A referential game, often credited to the Lewis signaling game [27], refers to a setup in which agents communicate through a series of message exchanges to solve a task. In contrast to non-referential games, constructing a communication protocol is critical to solving the task: one can only arrive at a solution through communication. Referential games are regarded as a grounded learning environment, and therefore communication in MARL has been studied mainly through the lens of referential games [13, 26]. Lastly, ordinal tasks refer to a family of problems where the ordering of the actions is critical for solving the task. The difference between ordinal and non-ordinal settings is illustrated in Figure 2, where the blue agent must wait for the red agent to open the door to successfully receive a reward. In contrast, non-ordinal tasks could benefit from shared information, but it is not necessary to complete the task. We now describe the environments used in our work in more detail:


CIFAR Game: We design CIFAR Game following the setup of the Multi-Step MNIST Game in [18], but with the CIFAR-10 dataset [25] instead. This is a non-ordinal, two-agent, referential game. In CIFAR Game, each agent independently observes a randomly drawn image from the CIFAR-10 dataset, and the goal is to communicate the observed image to the other agent within 5 environment time steps. At each time step, each agent broadcasts a set of communication symbols of length l. At the final time step, each agent must choose a class label from the 10 possible choices. At the end of the episode, an agent receives a reward of 0.5 for each correctly guessed class label, and both agents receive a reward of 1 only when both images are classified correctly.
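As a concrete illustration of this reward rule, here is one reading of it as a small function. This is a hedged sketch, not the environment's actual code; in particular, treating the 0.5 terms as shared between the two agents is our assumption.

```python
def cifar_game_reward(correct_1: bool, correct_2: bool) -> float:
    """One reading of the CIFAR Game reward at the final time step:
    0.5 per correctly guessed class label, plus a bonus of 1 when both
    images are classified correctly; the same reward goes to both agents."""
    reward = 0.5 * (int(correct_1) + int(correct_2))
    if correct_1 and correct_2:
        reward += 1.0
    return reward
```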

MarlGrid Environments: The second and third environments we consider are FindGoal (Figure 2, left) and RedBlueDoors (Figure 2, right). Both environments are adapted from the GridWorld environment [6, 35], and environment states are randomized at every episode.

FindGoal is a non-ordinal, multi-agent, non-referential game. We use N = 3 agents, and at each time step, each agent observes a partial view of the environment centered at its current position. The task of the agents is to reach the green goal location as fast as possible. Each agent receives an individual reward of 1 for completing the task and an additional reward of 1 when all agents have reached the goal. Hence, the optimal strategy of an agent is to communicate the goal location once it observes the goal. If all agents learn a sufficiently optimized search algorithm, they can maximize their reward without communication.

RedBlueDoors is an ordinal, two-agent, non-referential game. The environment consists of a red door and a blue door, both initially closed. The task of the agents is to open both doors, but unlike in the previous two games, the ordering of actions executed by the agents matters. A reward of 1 is given to both agents if and only if the red door is opened first and then the blue door. This means that any time the blue door is opened first, both agents receive a reward of 0, and the episode ends immediately. Hence, the optimal strategy for the agents is to convey the information that the red door was opened. Since it is possible to solve the task through visual observation or by a single agent that opens both doors, communication is not necessary.
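The ordinal reward can be summarized as a small function; the boolean flags below are hypothetical bookkeeping for illustration, not the MarlGrid implementation.

```python
def red_blue_doors_reward(red_open: bool, blue_open: bool, red_opened_first: bool):
    """Team reward and done flag for RedBlueDoors as described above (a sketch)."""
    if blue_open and not red_opened_first:
        return 0.0, True    # wrong order: zero reward, episode ends immediately
    if red_open and blue_open and red_opened_first:
        return 1.0, True    # correct order: both agents receive reward 1
    return 0.0, False       # otherwise, no reward yet and the episode continues
```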

Compared to CIFAR Game, the MarlGrid environments have a higher-dimensional observation space and a more complex action space. The fact that these environments are non-referential exacerbates the visual-language grounding problem, since communication can only exist in the form of cheap talk (i.e., costless, nonbinding, nonverifiable communication [14] that has no direct effect on the game state and agent payoffs). We hope to show from this set of environments that autoencoding can be used as a surprisingly simple and adaptable representation learning task to ground communication. It requires little effort to implement and almost no change across environments. Most importantly, as we will see in Section 5.4, autoencoded representations show an impressive improvement over communication trained with reinforcement learning.

5.2 Baselines

To evaluate the effectiveness of grounded communication, we compare our method (ae-comm) against the following baselines: (1) a no-comm baseline, where agents are trained without a communication channel; (2) an rl-comm baseline*, where we do not make a distinction between the listener module and the speaker module, and the communication policy is learned in the same way as the environment policy; (3) an rl-comm-with-biases baseline, where inductive biases for positive signaling and positive listening are added to rl-comm training, as proposed in [11]; (4) an ae-rl-comm baseline, where the communication policy is learned by an additional policy network trained on top of the autoencoded representation in the speaker module.

5.3 The Effectiveness of Grounded Communication

In Table 1 and Figure 3, we compare the task performance of ae-comm agents with that of the baseline agents. We report all results with 95% confidence intervals, evaluated on 100 episodes per seed over 10 random seeds.

In the CIFAR Game environment, all three baselines (no-comm, rl-comm, rl-comm-with-biases) obtain an average reward close to that of random guessing throughout training.

*The rl-comm baseline can be seen as an A3C version of RIAL [18] without parameter sharing.


[Figure 3 (three panels): environment reward vs. train iterations for RedBlueDoors and CIFAR Game, and episode length vs. train iterations for FindGoal, with curves for no-comm, rl-comm, rl-comm w/ bias, and ae-comm.]

Figure 3: Comparison with baselines: Comparison between our method that uses autoencoded communication (ae-comm), a baseline that is trained without communication (no-comm), a baseline where the communication policy is trained using reinforcement learning (rl-comm), and another baseline where inductive biases for positive signaling and positive listening are added to rl-comm training (rl-comm-with-biases). For FindGoal, we visualize the amount of time it takes for all agents to reach the goal, as all methods can reach the goal within the time frame. For each set of results, we report the mean and 95% confidence intervals evaluated on 100 episodes per seed over 10 random seeds.

In comparison, ae-comm agents achieve a much higher reward on average. Since this environment is a referential game, our results directly indicate that ae-comm agents learn to communicate more effectively than the baseline agents.

methods                 CIFAR (avg. r ↑)   RedBlueDoors (avg. r ↑)   FindGoal (avg. t ↓)
no-comm                 0.082 ± 0.009      0.123 ± 0.096             169.0 ± 26.8
rl-comm                 0.099 ± 0.013      0.174 ± 0.009             184.8 ± 25.2
rl-comm w/ bias [11]    0.142 ± 0.019      0.729 ± 0.072             158.0 ± 12.3
ae-comm (ours)          0.348 ± 0.041      0.984 ± 0.002             103.5 ± 20.2

Table 1: Comparison with baselines: We compute the average reward for the CIFAR Game and RedBlueDoors environments, and the average episode length for the FindGoal environment. For each set of results, we report the mean and 95% confidence intervals evaluated on 100 episodes per seed over 10 random seeds.

The RedBlueDoors environment poses a challenging multi-agent coordination problem, since the reward is extremely sparse. As shown in Figure 3(b), neither the no-comm agents nor the rl-comm agents were able to learn a successful coordination strategy. Although rl-comm-with-biases agents outperform the other two baselines, they do not learn an optimal strategy that guarantees an average reward close to 1. In contrast, ae-comm agents converge to an optimal strategy after 150k training iterations.

In the FindGoal environment, agents are able to solve the task without communication, but their performance can be improved with communication. Therefore, we use episode length instead of reward as the performance metric for this environment. To resolve the ambiguity in Figure 3(c), we also include numerical results in Table 1. While all agents are able to obtain full rewards, Figure 3 shows that ae-comm agents complete the episode much faster than the other agents. We further verify that this improvement is indeed a result of successful communication in Section 5.5.

Our results indicate that a communication model trained with an autoencoding task consistently outperforms the baselines across all environments. The observation that communication trained purely with reinforcement learning does not work well is consistent with observations made in prior works [11, 18, 30]. Furthermore, our results with autoencoders – a task that is often considered trivial – highlight that we as a community may have overlooked a critical representation learning component in MARL communication. In Section 5.4, we provide a more detailed discussion of the difficulty of training emergent communication via policy optimization.


[Figure 4 (three panels): environment reward vs. train iterations for RedBlueDoors and CIFAR Game, and episode length vs. train iterations for FindGoal, with curves for ae-comm and ae-rl-comm.]

Figure 4: Representation learning with reinforcement learning: Comparison between a speaker module trained with only an autoencoding task (ae-comm) and one trained with both the autoencoding task and reinforcement learning (ae-rl-comm). We observed that further training a policy on top of the autoencoder representation degrades performance across all environments (except in FindGoal, where the performance of ae-rl-comm and ae-comm stayed about the same).

5.4 The Role of Autoencoding

Given the success of agents trained with autoencoders, it is natural to ask whether a better communication protocol can emerge from the speaker module by jointly training it with a reinforcement learning policy. To this end, we train a GRU policy on top of the autoencoded representation (ae-rl-comm) and compare it against our previous model that was trained just with an autoencoder (ae-comm). The communication policy head is independent of the environment action policy head.
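For concreteness, a minimal sketch of this ae-rl-comm speaker variant is given below; the latent and message sizes follow Appendix A.2, but the exact wiring of the communication head (and the use of sampled binary symbols) is our assumption for illustration.

# Hypothetical sketch of the ae-rl-comm speaker: a GRU-based communication
# policy trained with RL on top of the autoencoded representation, with a
# head that is separate from the environment action head.
import torch
import torch.nn as nn


class CommPolicyHead(nn.Module):
    def __init__(self, latent_dim: int = 32, hidden: int = 128, msg_len: int = 10):
        super().__init__()
        self.gru = nn.GRUCell(latent_dim, hidden)
        self.logits = nn.Linear(hidden, msg_len)      # one logit per binary symbol

    def forward(self, ae_latent, h):
        h = self.gru(ae_latent, h)
        probs = torch.sigmoid(self.logits(h))
        msg = torch.bernoulli(probs)                  # sampled discrete message, trained with RL
        return msg, probs, h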

Surprisingly, we observed in Figure 4 that the model trained jointly with reinforcement learning consistently performed worse (except in FindGoal, where the performance of ae-rl-comm and ae-comm stayed about the same). We hypothesize that the lack of correlation between the visual observation and the communication rewards hurts the agents’ performance. This lack of visual-reward grounding could introduce conflicting gradient updates to the action policy, and thereby exacerbate the high-variance problem that already exists in reinforcement learning. In other words, optimization is harder whenever the joint-exploration problem for learning speaker and listener policies is introduced. Our observation suggests that specialized reward design and training at the level of [2] might be required for decentralized MARL communication. This prompts us to rethink how to address the lack of visual grounding in communication policies; this work serves as a first step in that direction.

5.5 Analyzing the Effects of Communication Signals on Agent Behavior

Communication Embedding To analyze whether the agents have learned a meaningful visual grounding for communication, we first visualize the communication embedding. In Figure 5, we visualize the communication symbols transmitted by the agents trained on RedBlueDoors. The communication symbols are discrete with a length of l = 10 (1024 possible embedding choices), and we use approximately 4096 communication samples across 10 episodic runs. We embed the communication symbols using t-SNE [46] and further cluster them using DBSCAN [12]. In the figure, we visualize clusters by observing the correspondence between image states and the communication symbols produced by agents in those states. For example, we observed that a specific communication cluster corresponded to an environment state in which the red door was opened; this suggests a communication action where one agent signals the other agent to open the blue door and complete the task.

[Figure 5: t-SNE embedding of communication messages, with example observations shown for a cluster where no doors are visible and a cluster where the red door was opened.]

Figure 5: Communication clusters: 4096 communication messages are embedded into a low-dimensional representation using t-SNE [46] and clustered using DBSCAN [12]. We visualize the images corresponding to the communication messages. We observed that the message clusters correspond to various meaningful phases throughout the task. The communication symbol of the purple cluster corresponds to when no doors are visible to either agent, and the light green cluster corresponds to when the red door is opened.

Entropy of Action Distribution To measure whether communicated information directly influences other agents’ actions, we visualize the entropy of the action distribution during episode rollouts. Suppose one agent shares information that is vital to solving the task. In that case, a decrease in entropy should be observed in the action distributions of other agents, as they act more deterministically towards solving the task.

As shown in Figure 6, we visualize the entropy of the action distribution across 256 random episodic runs using policy parameters from a fully trained ae-comm model. The entropies are aligned using environment milestone events: for FindGoal, this is when the first agent reaches the goal; for RedBlueDoors, this is when the red door is opened. Since the identity of the agents that solve the task first does not matter, entropy plots are computed with respect to the listener agents (i.e., agents that receive vital information from others). In FindGoal, this corresponds to the last agent to reach the goal; in RedBlueDoors, this corresponds to the agent opening the blue door. For both environments, we see a sharp fall-off in entropy as soon as the first agents finish the task. In contrast, agents trained without autoencoding act randomly regardless of whether other agents have completed the task. This reaffirms that the agents trained with an autoencoder can effectively transmit information to other agents.
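Both analyses above use standard tooling; the sketch below illustrates them, assuming the communication messages and per-step action logits have already been collected into arrays (the placeholder data, perplexity, and DBSCAN parameters are our choices, not the paper’s).

# Hypothetical sketch of the two analyses: (1) embed and cluster collected
# communication messages with t-SNE and DBSCAN, (2) compute per-step action
# entropy from a listener agent's policy logits.
import numpy as np
import torch
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

# (1) messages: array of shape (num_samples, 10) of binary communication symbols.
messages = np.random.randint(0, 2, size=(4096, 10)).astype(np.float32)   # placeholder data
embedded = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(messages)
cluster_ids = DBSCAN(eps=2.0, min_samples=20).fit_predict(embedded)       # -1 marks noise points

# (2) action_logits: tensor of shape (num_steps, num_actions) from a rollout.
action_logits = torch.randn(512, 6)                                       # placeholder data
entropy = torch.distributions.Categorical(logits=action_logits).entropy() # shape (num_steps,)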

methods                 CIFAR (gain ∆r ↑)   RedBlueDoors (gain ∆r ↑)
rl-comm                 0.017 ± 0.022       -0.030 ± 0.155
rl-comm w/ bias [11]    0.060 ± 0.029        0.552 ± 0.155
ae-comm (ours)          0.266 ± 0.058        0.807 ± 0.185

Table 2: Performance gain with communication: Positive listening as measured by the increase in reward after adding a communication channel. The performance of ae-comm agents improves more than that of the baselines.

Positive Signaling and Positive Listening We additionally investigate the two metrics suggested by [29] for measuring the effectiveness of communication: positive signaling and positive listening. Since ae-comm agents have to communicate their learned representation, the presence of the representation learning task loss means that ae-comm agents are intrinsically optimized for positive signaling (i.e., sending messages that are related to their observation or action). In Table 2, we report the increase in average reward when adding a communication channel, comparing ae-comm with its baselines; this metric is suggested by [29] to be a sufficient metric for positive listening (i.e., communication messages influence the behavior of agents). We observe that agent task performance improves most substantially for ae-comm with the addition of a communication channel.

6 Discussion and Societal Impacts

We present a framework for grounding multi-agent communication through autoencoding, a simple self-supervised representation learning task. Our method allows agents to learn non-differentiable communication in fully decentralized settings and does not impose constraints on input structures (e.g., state inputs or pixel inputs) or task nature (e.g., referential or non-referential). Our results demonstrate that agents trained with the proposed method achieve much better performance on a suite of coordination tasks compared to baselines.


[Figure 6 (two panels): action entropy over the course of an episode for RedBlueDoors and FindGoal, comparing agents with and without autoencoder grounding; dotted lines mark the milestone events.]

Figure 6: Policy entropy with communication: We visualize the entropy of the action policy throughout the task (lower is better). The graph is generated with 256 random episodes. For FindGoal, the entropy is measured on the last agent to enter the goal, and for RedBlueDoors the entropy is measured on the agent that opens the blue door (the second door). All 256 runs are aligned to the dotted red lines, which correspond to the times at which the first and second agents enter the goal (FindGoal) and at which the red and blue doors are opened (RedBlueDoors). The model trained with an autoencoder transmits messages that are effectively used by other agents.

We believe this work on multi-agent communication is of importance to our society for two reasons. First, it extends a computational framework under which scientific inquiries concerning language acquisition, language evolution, and social learning can be made. Second, unlike works in which agents can only learn latent representations of other agents through passive observations [47, 48], it opens up new ways for artificial learning agents to improve their coordination and cooperative skills, increasing their reliability and usability when deployed to real-world tasks and interacting with humans.

However, we highlight two constraints of our work. First, our method assumes that all agents have the same model architecture and the same autoencoding loss, while real-world applications may involve heterogeneous agents. Second, our method may fail in scenarios where agents need to be more selective about the information communicated, since they are designed to communicate their observed information indiscriminately. We believe that testing the limits of these constraints, and relaxing them, will be important steps for future work.

We would also like to clarify that communication between the agents in our work is only part of what true “communication” should eventually entail. We position our work as one partial step toward true communication, in that our method provides a strong bias toward positive signaling and allows decentralized agents to coordinate their behavior toward common goals by passing messages amongst each other. It does not address other aspects of communication, such as joint optimization between a speaker and a listener. In the appendix, we highlight the difficulty of training emergent communication via policy optimization, alongside empirical results.

Another limitation of this work, which is also a concern regarding its potential negative societal impact, is that the environments we consider are fully cooperative. If the communication method we present in this work is to be deployed in the real world, we need to either ensure the environment is free of adversaries, or conduct additional research on robust counter-strategies against adversaries, which could use better communication policies to lie, spread misinformation, or maliciously manipulate other agents.

Acknowledgements

We sincerely thank all the anonymous reviewers for their extensive discussions on and valuable contributions to this paper. We thank Lucy Chai and Xiang Fu for helpful comments on the manuscript. We thank Jakob Foerster for providing inspiring advice on multi-agent training. Additionally, TL would like to thank Sophie and Sofia; MH would like to thank Sally, Leo and Mila; PI would like to thank Moxie and Momo. This work was supported by a grant from Facebook.


References

[1] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[2] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
[3] B. Bogin, M. Geva, and J. Berant. Emergence of communication in an interactive world with consistent speakers. arXiv preprint arXiv:1809.00549, 2018.
[4] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[5] K. Cao, A. Lazaridou, M. Lanctot, J. Z. Leibo, K. Tuyls, and S. Clark. Emergent communication through negotiation. arXiv preprint arXiv:1804.03980, 2018.
[6] M. Chevalier-Boisvert, L. Willems, and S. Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.
[7] E. Choi, A. Lazaridou, and N. de Freitas. Compositional obverter communication learning from raw visual input. arXiv preprint arXiv:1804.02341, 2018.
[8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[9] A. Dafoe, E. Hughes, Y. Bachrach, T. Collins, K. R. McKee, J. Z. Leibo, K. Larson, and T. Graepel. Open problems in cooperative AI. arXiv preprint arXiv:2012.08630, 2020.
[10] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau. TarMAC: Targeted multi-agent communication. In International Conference on Machine Learning, pages 1538–1546. PMLR, 2019.
[11] T. Eccles, Y. Bachrach, G. Lever, A. Lazaridou, and T. Graepel. Biases for emergent communication in multi-agent reinforcement learning. arXiv preprint arXiv:1912.05676, 2019.
[12] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
[13] K. Evtimova, A. Drozdov, D. Kiela, and K. Cho. Emergent communication in a multi-modal, multi-step referential game. arXiv preprint arXiv:1705.10369, 2017.
[14] J. Farrell. Cheap talk, coordination, and entry. The RAND Journal of Economics, pages 34–39, 1987.
[15] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[16] J. Foerster, F. Song, E. Hughes, N. Burch, I. Dunning, S. Whiteson, M. Botvinick, and M. Bowling. Bayesian action decoder for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pages 1942–1951. PMLR, 2019.
[17] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. Learning to communicate to solve riddles with deep distributed recurrent Q-networks. arXiv preprint arXiv:1602.02672, 2016.
[18] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. arXiv preprint arXiv:1605.06676, 2016.
[19] L. Graesser, K. Cho, and D. Kiela. Emergent linguistic phenomena in multi-agent communication games. arXiv preprint arXiv:1901.08706, 2019.
[20] S. Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, 1990.
[21] J. R. Hurford. Biological evolution of the Saussurean sign as a component of the language acquisition device. Lingua, 77(2):187–222, 1989.
[22] N. Jaques, A. Lazaridou, E. Hughes, C. Gulcehre, P. Ortega, D. Strouse, J. Z. Leibo, and N. De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, pages 3040–3049. PMLR, 2019.
[23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[24] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
[25] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[26] A. Lazaridou, K. M. Hermann, K. Tuyls, and S. Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. arXiv preprint arXiv:1804.03984, 2018.
[27] D. Lewis. Convention: A philosophical study. John Wiley & Sons, 2008.
[28] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163. Elsevier, 1994.
[29] R. Lowe, J. Foerster, Y.-L. Boureau, J. Pineau, and Y. Dauphin. On the pitfalls of measuring emergent communication. arXiv preprint arXiv:1903.05168, 2019.
[30] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017.
[31] B. MacWhinney. The emergence of language from embodiment. Psychology Press, 2013.
[32] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937. PMLR, 2016.
[33] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[34] I. Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[35] K. Ndousse. MarlGrid. https://github.com/kandouss/marlgrid, 2020.
[36] M. Noukhovitch, T. LaCroix, A. Lazaridou, and A. Courville. Emergent communication under competition. arXiv preprint arXiv:2101.10276, 2021.
[37] M. A. Nowak and D. C. Krakauer. The evolution of language. Proceedings of the National Academy of Sciences, 96(14):8028–8033, 1999.
[38] D. K. Roy and A. P. Pentland. Learning words from sights and sounds: A computational model. Cognitive Science, 26(1):113–146, 2002.
[39] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[40] L. S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
[41] L. Steels. The synthetic modeling of language origins. Evolution of Communication, 1(1):1–34, 1997.
[42] L. Steels and F. Kaplan. AIBO’s first words: The social learning of language and meaning. Evolution of Communication, 4(1):3–32, 2000.
[43] S. Sukhbaatar, A. Szlam, and R. Fergus. Learning multiagent communication with backpropagation. arXiv preprint arXiv:1605.07736, 2016.
[44] M. Tomasello. Origins of human communication. MIT Press, 2010.
[45] T. Udagawa and A. Aizawa. A natural language corpus of common grounding under continuous and partially-observable context. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7120–7127, 2019.
[46] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
[47] A. Xie, D. P. Losey, R. Tolsma, C. Finn, and D. Sadigh. Learning latent representations to influence multi-agent interaction. arXiv preprint arXiv:2011.06619, 2020.
[48] L. Zintgraf, S. Devlin, K. Ciosek, S. Whiteson, and K. Hofmann. Deep interactive Bayesian reinforcement learning via meta-learning. arXiv preprint arXiv:2101.03864, 2021.


A Implementation Details

A.1 Environment

All environments used in this work were implemented using OpenAI Gym [4]. Below, we describe the MarlGrid environments in more detail.

FindGoal The game map is a 15 × 15 grid world. On each episode reset, 1 goal tile and 25 obstacle tiles are randomly placed on the map. Then, 3 agents are randomly placed on the remaining empty space. Each agent can only observe a 7 × 7 partial view of the map centered on the agent. Each agent has 5 actions: up, right, down, left, stay. The maximum episode length allowed is 512 time steps.

RedBlueDoors The game map is a 10 × 10 grid world. On each episode reset, 1 blue door and 1 red door are placed at random positions on either the leftmost side or the rightmost side of the map; the doors must be on opposite sides. Then, 2 agents are randomly placed on the remaining empty space. Each agent can only observe a 3 × 3 partial view of the map centered on the agent. Each agent has 6 actions: up, right, down, left, stay, open door. The maximum episode length allowed is 2048 time steps.

A.2 Model

Image Encoder The Image Encoder is a convolutional neural network with 4 convolutional layers. Each layer has a kernel size of 3, a stride of 2, and padding of 1, and outputs 32 channels. ELU activation is applied to each convolutional layer. A 2D adaptive average pooling is applied over the output of the convolutional layers. The final output has 32 channels, a height of 3, and a width of 3.

Communication Autoencoder The Communication Autoencoder takes as input the output of the Image Encoder. The encoder is a 3-layer MLP with hidden units [128, 64, 32] and ReLU activation. The decoder is a 3-layer MLP with hidden units [32, 64, 128] and ReLU activation. The output communication message is a 1D vector of length 10.

Message Encoder The Message Encoder first projects all input messages using an embedding layer of size 32, then concatenates and passes the message embeddings through a 3-layer MLP with hidden units [32, 64, 128] and ReLU activation. The dimension of the output message feature is 128.

Policy Network Each policy network consists of a GRU policy with hidden size 128, a linear layer mapping GRU outputs to policy logits for the environment action, and a linear layer mapping GRU outputs to the baseline value function.
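A minimal PyTorch sketch of these modules is given below. Layer sizes follow the descriptions above; the 10-symbol message head, its binarization (via a straight-through estimator, as in [1]), and the decoder’s reconstruction target are assumptions on our part, and the Message Encoder is omitted for brevity.

import torch
import torch.nn as nn


class ImageEncoder(nn.Module):
    """Four 3x3 conv layers (stride 2, padding 1, 32 channels, ELU) + 3x3 adaptive average pooling."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(4):
            layers += [nn.Conv2d(c, 32, kernel_size=3, stride=2, padding=1), nn.ELU()]
            c = 32
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d((3, 3))       # output: 32 channels x 3 x 3

    def forward(self, obs):
        return self.pool(self.conv(obs)).flatten(1)    # (batch, 288)


class CommunicationAutoencoder(nn.Module):
    """Encoder MLP with hidden units [128, 64, 32]; decoder MLP with hidden units [32, 64, 128]."""
    def __init__(self, feat_dim: int = 288, msg_len: int = 10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.msg_head = nn.Linear(32, msg_len)         # assumed mapping to 10 message logits
        self.decoder = nn.Sequential(
            nn.Linear(msg_len, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),                  # assumed reconstruction of the image features
        )

    def forward(self, feat):
        logits = self.msg_head(self.encoder(feat))
        soft = torch.sigmoid(logits)
        msg = (soft > 0.5).float() + soft - soft.detach()   # straight-through binarization (assumed)
        return msg, self.decoder(msg)


class PolicyNetwork(nn.Module):
    """GRU with hidden size 128, plus linear heads for action logits and a value baseline."""
    def __init__(self, in_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRUCell(in_dim, hidden)
        self.action_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, x, h):
        h = self.gru(x, h)
        return self.action_head(h), self.value_head(h), h

In this reading, the autoencoding loss is a reconstruction error (e.g., mean squared error) between the decoder output and the image features, and only the discretized 10-symbol message is broadcast to the other agents.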

B Training Details

B.1 Hyperparameters

For all experiments, we use the Adam optimizer [23] with a learning rate of 0.0001 and parameters β1 = 0.9, β2 = 0.999, ε = 10^-8. We performed a sweep over the following hyperparameter values to select the combination that gives the best results: (0.01, 0.001, 0.0001, 0.00001) for the learning rate, (0.9, 0.99, 0.995) for β1, (0.995, 0.999) for β2, and (10^-7, 10^-8) for ε.
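For reference, the selected configuration corresponds to the following optimizer setup (the model variable below is a placeholder, not part of the released code).

import torch

model = torch.nn.Linear(8, 8)  # placeholder module; in practice, the agent networks described above
# Adam with the selected hyperparameters: lr = 1e-4, betas = (0.9, 0.999), eps = 1e-8.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)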

B.2 Compute Resources

We ran all experiments on an internal cluster that consists of 4 NVIDIA GeForce RTX 2080 Ti GPUs.


methods             CIFAR (avg. r ↑)   RedBlueDoors (avg. r ↑)   FindGoal (avg. t ↓)
no-comm             0.082 ± 0.009      0.123 ± 0.096             169.0 ± 26.8
fc-comm             0.107 ± 0.021      0.382 ± 0.022             188.3 ± 33.6
latent-comm-1       0.129 ± 0.034      0.714 ± 0.073             160.0 ± 13.0
latent-comm-2       0.092 ± 0.012      0.739 ± 0.080             228.6 ± 21.3
full-latent-comm    0.346 ± 0.070      0.912 ± 0.046             111.1 ± 26.1
ae-comm (ours)      0.348 ± 0.041      0.984 ± 0.002             103.5 ± 20.2

Table 3: Comparison with additional baselines. We compute the average reward for the CIFAR Game and RedBlueDoors environments, and the average episode length for the FindGoal environment. For each set of results, we report the mean and 95% confidence intervals evaluated on 100 episodes per seed over 10 random seeds.

C Additional Experiments

C.1 fc-comm and latent-comm Baselines

In addition to the baselines presented in Section 5.2, we add two more baselines: fc-comm and latent-comm. Below, we describe them in more detail and report comparisons between these additional baselines and our method (ae-comm) on all three environments in Table 3.

For fc-comm, agents transmit discrete states that take the same form as the autoencoded messages in ae-comm. These discrete states come from the policy network without autoencoding. For fair comparison, a fully connected layer is added on top of the policy network states, such that the learned messages for fc-comm are discretized and have the same shape and range as the messages in ae-comm.

For latent-comm, agents transmit latent states from before the policy head as communication vectors. For fair comparison, for latent-comm we again restrict agents to use the same fixed-size discrete communication channel (except for full-latent-comm, which we explain in the next paragraph). The latent activation vectors in the policy nets are high-dimensional; reducing them to the size of the communication channel can be achieved in multiple ways. In latent-comm-1, we do so by reducing the size of the last hidden layer of the policy net so that it matches the size of the communication channel. In latent-comm-2, we instead use the original policy architecture and only transmit the first N features of the last hidden layer of the policy net, where N is the size of the communication channel. In all cases, we quantize the outputs identically and use a straight-through estimator [1] to differentiate through the quantization. Their performance is better than that of the fc-comm baseline; since these latent states are continuously trained with the policy reward, we hypothesize that they contain more useful and relevant information.

As an “upper bound”, we also compare all methods to full-latent-comm, a method that communicates the full, high-dimensional and continuous latent vector from the last hidden layer of the policy net. As shown in the table, the performance of full-latent-comm is close to that of ae-comm. Clearly, if agents can directly observe each other’s full internal states, they can coordinate their behavior well. Our contribution is to show that similar performance can be achieved even when agents are only allowed to make low-dimensional and discrete utterances – a setting that more closely models communication in nature and may have greater practical utility (e.g., low-bandwidth communication channels between robots).

C.2 Training Emergent Communication via Policy Optimization

Making policy optimization work on top of autoencoding is a difficult yet essential problem to solve. We hypothesize that optimization is harder whenever the joint-exploration problem for learning speaker and listener policies is introduced. We do not have a solution yet, but our empirical results confirm that the solution is not as simple as running more training iterations. We have run training for up to 1 million iterations and did not see an improvement in the result. We have also tried training the autoencoding component and the RL component alternatingly, i.e., freezing one while training the other until the loss stabilizes, yet performance does not improve whenever the RL objective is optimized. As an example, we include a failed training graph in Figure 7.

Figure 7: Difficulty of training policy optimization: We tried training the autoencoding component and the RL component alternatingly, i.e., freezing one while training the other until the loss stabilizes. The alternating training is visualized as different segments in this training graph, starting from the segment trained with the autoencoding objective. Performance did not improve after extended training.

[Figure 8 (two panels): autoencoder loss together with environment reward vs. train iterations for (a) RedBlueDoors ↑, and autoencoder loss together with episode length vs. train iterations for (b) FindGoal ↓.]

Figure 8: Performance and representation: Both agent performance and autoencoder loss improve over the course of learning.

C.3 Autoencoder Loss

In Figure 8, we observe that agent task performance improves as the representation learning task loss decreases. In particular, the agent task performance does not improve drastically until the representation task loss is sufficiently low.
