
Coordination in Adversarial Sequential Team Games via Multi-agent Deep Reinforcement Learning

Andrea Celli, Politecnico di Milano
[email protected]

Marco Ciccone, Politecnico di Milano
[email protected]

Raffaele Bongo, Politecnico di Milano
raffaele.bongo@mail.polimi.it

Nicola Gatti, Politecnico di Milano
nicola.gatti@polimi.it

Abstract

Many real-world applications involve teams of agents that have to coordinate their actions to reach a common goal against potential adversaries. This paper focuses on zero-sum games where a team of players faces an opponent, as is the case, for example, in Bridge, collusion in poker, and collusion in bidding. The possibility for the team members to communicate before gameplay—that is, coordinate their strategies ex ante—makes the use of behavioral strategies unsatisfactory. We introduce Soft Team Actor-Critic (STAC) as a solution to the team's coordination problem that does not require any prior domain knowledge. STAC allows team members to effectively exploit ex ante communication via exogenous signals that are shared among the team. STAC reaches near-optimal coordinated strategies both in perfectly observable and partially observable games, where previous deep RL algorithms fail to reach optimal coordinated behaviors.

1 Introduction

In many strategic interactions agents have similar goals and have incentives to team up, and share their final reward. In these settings, coordination between team members plays a crucial role. The benefits from team coordination (in terms of team's expected utility) depend on the communication possibilities among team members. This work focuses on ex ante coordination, where team members have an opportunity to discuss and agree on tactics before the game starts, but will be unable to communicate during the game, except through their publicly-observed actions. Consider, as an illustration, a poker game where some players are colluding against some identified target players and will share the final winnings after the game. Non-recreational applications are ubiquitous as well. This is the case, for instance, of collusion in bidding, where communication during the auction is illegal, and coordinated swindling in public.

Ex ante coordination enables the team to obtain significantly higher returns (up to a linear factor in the size of the game-tree) than the return they would obtain by abstaining from coordination [3, 13].

Finding an equilibrium with ex ante coordination is NP-hard and inapproximable [3, 13]. Celli and Gatti [13] introduced the first algorithm to compute optimal coordinated strategies for a team playing against an adversary. At its core, it is a column generation algorithm exploiting a hybrid representation of the game, where team members play joint normal-form actions while the adversary employs sequence-form strategies [62]. More recently, Farina et al. [19] proposed a variation of fictitious play, named Fictitious Team-Play (FTP), to compute an approximate solution to the problem. Both approaches require iteratively solving mixed-integer linear programs (MILPs), which significantly limits the scalability of these techniques to large problems, with the biggest instances solved via FTP being on the order of 800 infosets per player.¹ The biggest crux of these tabular approaches is the need for an explicit representation of the sequential game, which may not be exactly known to players, or could be too big to even be stored in memory. For this reason, extremely large games are usually abstracted by bucketing similar states together. The problem with this approach is that abstraction procedures are domain-specific, require extensive domain knowledge, and fail to generalize to novel situations.

¹ From a practical perspective, the hardness of the problem usually reduces to either solving an LP with an exponential number of possible actions (i.e., team's joint plans), or to solving best-response MILPs on a compact action space.

Working paper. Last update: December 18, 2019.
arXiv:1912.07712v1 [cs.AI] 16 Dec 2019

Popular equilibrium computation techniques such as fictitious play (FP) [7, 51] and counterfactual regret minimization (CFR) [65] are not model-free, and therefore are unable to learn in a sample-based fashion (i.e., directly from experience). The same holds true for the techniques recently employed to achieve expert-level poker playing, which are heavily based on poker-specific domain knowledge [43, 8, 9, 10]. There exist model-free variations of FP and CFR [26, 11]. However, these algorithms fail to model coordinated team behaviors even in simple toy problems.

At the other end of the spectrum from tabular equilibrium computation techniques are Multi-Agent Reinforcement Learning (MARL) algorithms [12, 27]. These techniques do not require perfect knowledge of the environment and are sample-based by nature, but applying them to imperfect-information sequential games presents a number of difficulties: i) hidden information and arbitrary action spaces make players non-homogeneous; ii) players' policies may be conditioned only on local information. The second point is of crucial importance since we will show that, even in simple settings, it is impossible to reach an optimal coordinated strategy without conditioning policies on exogenous signals. Conditioning policies on the observed histories allows for some sort of implicit coordination in perfectly observable games (i.e., games where each player observes and remembers the other players' actions), but this does not hold in settings with partial observability, which are ubiquitous in practice.

Our contributions In this paper, we propose Soft Team Actor-Critic (STAC), an end-to-end framework to learn coordinated strategies directly from experience, without requiring perfect knowledge of the game tree. We design an ex ante signaling framework which is principled from a game-theoretic perspective, and show that team members are able to associate shared meanings to signals that are initially uninformative. We show that our algorithm achieves strong performance on standard imperfect-information benchmarks, and on coordination games where known deep multi-agent RL algorithms fail to reach optimal solutions.

2 Preliminaries

In this section, we provide a brief overview of reinforcement learning and extensive-form games. For details, see Sutton and Barto [59], Shoham and Leyton-Brown [53].

2.1 Reinforcement Learning

At each state $s \in S$, an agent selects an action $a \in A$ according to a policy $\pi: S \to \Delta(A)$, which maps states into probability distributions over the set of available actions. This causes a transition on the environment according to the state transition function $T: S \times A \to \Delta(S)$. The agent observes a reward $r_t$ for having selected $a_t$ at $s_t$. In partially observable environments, the agent observes $o(s_t, a_t, s_{t+1}) \in \Omega$, drawn according to an observation function $O: S \times A \to \Delta(\Omega)$. The agent's goal is to find the policy that maximizes her expected discounted return. Formally, by letting $R_t := \sum_{i=0}^{\infty} \gamma^i r_{t+i}$, the optimal policy $\pi^*$ is such that $\pi^* \in \arg\max_\pi \mathbb{E}_\pi[R_0]$.

Value-based algorithms compute $\pi^*$ by estimating $v_\pi(s) := \mathbb{E}_\pi[R_t \mid S_t = s]$ (state-value function) or $Q_\pi(s, a) := \mathbb{E}_\pi[R_t \mid S_t = s, A_t = a]$ (action-value function) via temporal-difference learning, and producing a series of $\varepsilon$-greedy policies.
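As a concrete reading of these definitions, the following minimal sketch (plain Python, not tied to the paper's implementation) computes a discounted return and performs one tabular TD(0) update toward the state-value function; the step size, discount, and toy state space are arbitrary assumptions.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_i gamma^i * r_{t+i}, folded from the end of the trajectory."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One temporal-difference step toward v_pi(s) for a tabular value function V."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = np.zeros(4)                      # toy 4-state value table
td0_update(V, s=0, r=1.0, s_next=1)  # single sampled transition
print(discounted_return([1.0, 0.0, 2.0]))
```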

A popular alternative is to employ policy gradient methods. The main idea is to directly adjust the parameters $\theta$ of a differentiable and parametrized policy $\pi_\theta$ via gradient ascent on a score function $J(\pi_\theta)$. Let $\rho_\pi(s_t, a_t)$ denote the state-action marginals of the trajectory distributions induced by a policy $\pi(a_t \mid s_t)$. Then, the gradient of the score function may be rewritten as follows via the policy gradient theorem [60]: $\nabla_\theta J(\theta) = \mathbb{E}_{(s,a) \sim \rho_\pi}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a)\right]$. One could learn an approximation of the true action-value function by, e.g., temporal-difference learning, and alternate between policy evaluation and policy improvement using the value function to obtain a better policy [2, 59]. In this case, the policy is referred to as the actor, and the value function as the critic, leading to a variety of actor-critic algorithms [25, 34, 42, 49, 52].
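The estimator above can be written compactly in code. The following illustrative PyTorch sketch performs one actor update, with a learned critic standing in for $Q^\pi$; network sizes, the batch, and the learning rate are assumptions made only for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy dimensions, chosen only for illustration.
obs_dim, n_actions = 4, 3
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim + n_actions, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

s = torch.randn(8, obs_dim)                          # batch of sampled states
dist = torch.distributions.Categorical(logits=actor(s))
a = dist.sample()                                    # actions drawn from pi_theta(.|s)
q = critic(torch.cat([s, F.one_hot(a, n_actions).float()], dim=-1)).squeeze(-1).detach()

# Sample-based estimate of -J(theta); its gradient is the negated policy gradient
# E[grad log pi_theta(a|s) * Q^pi(s, a)], with the critic approximating Q^pi.
loss = -(dist.log_prob(a) * q).mean()
opt.zero_grad(); loss.backward(); opt.step()
```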

2.2 Extensive-Form Games

Extensive-form games (EFGs) are a model of sequential interaction involving a set of players $\mathcal{P}$. Exogenous stochasticity is represented via a chance player (also called nature), which selects actions with a fixed, known probability distribution. A history $h \in H$ is a sequence of actions from all players (chance included) taken from the start of an episode. In imperfect-information games, each player only observes her own information states. For instance, in a poker game a player observes her own cards, but not those of the other players. An information state $s_t$ of player $i$ comprises all the histories which are consistent with player $i$'s previous observations. We focus on EFGs with perfect recall, i.e., at every $t$ each player has perfect knowledge of the sequence $s^i_1, a^i_1, \ldots, s^i_t$.

There are two causes for information states: i) private information determined by the environment (e.g., hands in a poker game), and ii) limitations in the observability of other players' actions. When all information states are caused by i), we say the game is perfectly observable.

A policy which maps information states to probability distributions is often referred to as a behavioral strategy. Given a behavioral strategy profile $\pi = (\pi_i)_{i \in \mathcal{P}}$, player $i$'s incentive to deviate from $\pi_i$ is quantified as $e_i(\pi) := \max_{\pi'_i} \mathbb{E}_{\pi'_i, \pi_{-i}}[R_{0,i}] - \mathbb{E}_{\pi}[R_{0,i}]$, where $\pi_{-i} = (\pi_j)_{j \in \mathcal{P} \setminus \{i\}}$. The exploitability of $\pi$ is defined as $e(\pi) := \frac{1}{|\mathcal{P}|} \sum_{i \in \mathcal{P}} e_i(\pi)$.² A strategy profile $\pi$ is a Nash equilibrium (NE) [44] if $e(\pi) = 0$.
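To make the definition concrete, the small sketch below computes the exploitability of a mixed-strategy profile in a two-player zero-sum matrix game (a far simpler setting than the EFGs considered here; the payoff matrix and strategies are purely illustrative).

```python
import numpy as np

def exploitability(A, x, y):
    """e(pi) = average best-response gain; A[i, j] is the row player's payoff,
    the column player receives -A[i, j], and x, y are mixed strategies."""
    v = x @ A @ y                      # row player's expected value under (x, y)
    e_row = np.max(A @ y) - v          # row player's best-response improvement
    e_col = np.max(-(A.T @ x)) + v     # column player's improvement against x
    return 0.5 * (e_row + e_col)

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies
print(exploitability(A, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.0: a NE
```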

In principle, a player $i$ could adopt a deterministic policy $\sigma_i: S \to A$, selecting a single action at each information state. From a game-theoretic perspective, $\sigma_i$ is an action of the equivalent tabular (i.e., normal-form) representation of the EFG. We refer to such actions as plans. Let $\Sigma_i$ be the set of all plans of player $i$. The size of $\Sigma_i$ grows exponentially in the number of information states. Finally, a normal-form strategy $x_i$ is a probability distribution over $\Sigma_i$. We denote by $\mathcal{X}_i$ the set of the normal-form strategies of player $i$.

3 Team’s Coordination: A Game-Theoretic Perspective

A team is a set of players sharing the same objectives [3, 63]. To simplify the presentation, we focus on the case in which a team of two players (denoted by $T_1$, $T_2$) faces a single adversary ($A$) in a zero-sum game. This happens, e.g., in the card-playing phase of Bridge, where a team of two players, called the "defenders", plays against a third player, the "declarer".

Team members are allowed to communicate before the beginning of the game, but are otherwise unable to communicate during the game, except via publicly-observed actions. This joint planning phase gives team members an advantage: for instance, the team members could skew their strategies to use certain actions to signal about their state (for example, that they have particular cards). In other words, by having agreed on each member's planned reaction under any possible circumstance of the game, information can be silently propagated in the clear, by simply observing public information.

A powerful, game-theoretic way to think about ex ante coordination is through the notion of a coordination device. In the planning phase, before the game starts, team members can identify a set of joint plans. Then, just before the play, the coordination device draws one of such plans according to an appropriate probability distribution, and team members will act as specified in the selected joint plan.

Let $\Sigma_T = \Sigma_{T_1} \times \Sigma_{T_2}$ and denote by $R_{t,T}$ the return of the team from time $t$. The game is such that, for any $t$, $R_{t,A} = -R_{t,T}$. An optimal ex ante coordinated strategy for the team is defined as follows (see also Celli and Gatti [13], Farina et al. [19]):

² In constant-sum games (i.e., games where $\sum_{i \in \mathcal{P}} r_{i,t} = k$ for each $t$) the exploitability is defined as $e(\pi) = \frac{\sum_{i \in \mathcal{P}} e_i(\pi) - k}{|\mathcal{P}|}$.


Definition 1 (TMECor) A pair $\zeta = (\pi_A, x_T)$, with $x_T \in \Delta(\Sigma_T)$, is a team-maxmin equilibrium with coordination device (TMECor) iff

$e_A(\zeta) := \max_{\pi'_A} \mathbb{E}_{\pi'_A, (\sigma_1, \sigma_2) \sim x_T}[R_{0,A}] - \mathbb{E}_{\pi_A, (\sigma_1, \sigma_2) \sim x_T}[R_{0,A}] = 0,$

and

$e_T(\zeta) := \max_{(\sigma'_1, \sigma'_2) \in \Sigma_T} \mathbb{E}_{\pi_A, (\sigma'_1, \sigma'_2)}[R_{0,T}] - \mathbb{E}_{\pi_A, (\sigma_1, \sigma_2) \sim x_T}[R_{0,T}] = 0.$

In an approximate version, $\varepsilon$-TMECor, neither the team nor the opponent can gain more than $\varepsilon$ by deviating from their strategy, assuming that the other does not deviate.

By sampling a recommendation from a joint probability distribution over $\Sigma_T$, the coordination device introduces a correlation between the strategies of the team members that is otherwise impossible to capture using behavioral strategies (i.e., decentralized policies). Coordinated strategies (i.e., strategies in $\Delta(\Sigma_T)$) may lead to team's expected returns which are arbitrarily better than the expected returns achievable via behavioral strategies. This is further illustrated by the following example.

Example 1 A team of two players ($T_1$, $T_2$) faces an adversary ($A$) who has the goal of minimizing the team's utility. Each player of the game has two available actions and has to select one without having observed the choice of the other players. Team members receive a payoff of $K$ only if they both guess correctly the action taken by the adversary, and mimic that action (e.g., when $A$ plays left, the team is rewarded $K$ only if $T_1$ plays $\ell$ and $T_2$ plays $L$). Team's rewards are depicted in the leaves of the tree in Figure 1.

Figure 1: Extensive-form game with a team. The adversary $A$ moves first; $T_1$ then chooses between $\ell$ and $r$, and $T_2$ between $L$ and $R$. The team receives $K$ only in the two leaves where its actions match $A$'s move, and 0 otherwise.

If team members didn’t have the opportunity of communicating before the game, then the best theycould do (i.e., the NE of the game) is randomizing between their available actions according to auniform probability. This leads to an expected return for the team of K4.

When coordinated strategies are employed, team members could decide to play with probability 1/2 the deterministic policies $\{\ell, L\}$ and $\{r, R\}$. This is precisely the TMECor of the game, guaranteeing the team an expected return of $K/2$.
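A quick sanity check of these two values (a throwaway enumeration we add only for illustration; it is not part of the paper's experimental code):

```python
import itertools

K = 100.0
def team_payoff(adv, t1, t2):
    # K only when both team members mimic the adversary's action
    return K if t1 == adv and t2 == adv else 0.0

acts = [0, 1]
# Independent uniform randomization (the NE of Example 1): expected return K/4.
ne_value = sum(team_payoff(a, t1, t2)
               for a, t1, t2 in itertools.product(acts, acts, acts)) / 8
# Ex ante coordination: joint plans (0, 0) and (1, 1), each with probability 1/2.
# Against either adversary action the team earns K/2 in expectation (the TMECor value).
tmecor_value = min(sum(team_payoff(a, p, p) for p in acts) / 2 for a in acts)
print(ne_value, tmecor_value)   # 25.0 and 50.0 for K = 100
```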

Classical multi-agent RL algorithms employ behavioral strategies and, therefore, are unable to attain the optimal coordinated expected return even in simple settings such as the one depicted in the previous example (see also Section 5). However, working in the space of joint plans $\Sigma_T$ is largely impractical, since its size grows exponentially in the size of the game instance. In the following section we propose a technique to foster ex ante coordination without explicitly working over $\Sigma_T$.

4 Soft Team Actor-Critic (STAC)

In this section, we describe Soft Team Actor-Critic (STAC), a scalable sample-based technique to approximate the team's ex ante coordinated strategies. Intuitively, STAC mimics the behavior of a coordination device via an exogenous signaling scheme. The STAC framework allows team members to associate shared meanings to initially uninformative signals.


First, we present a brief description of SAC [25] and a multi-agent adaptation of the original algorithm which may be employed to compute the team's behavioral strategies. Then, we describe a way to introduce a signaling scheme in this framework, which allows team members to play over ex ante coordinated strategies.

4.1 Soft Actor-Critic

Soft Actor-Critic (SAC) [25] is an off-policy deep RL algorithm based on the maximum entropy (maxEnt) principle. The algorithm exploits an actor-critic architecture with separate policy and value function networks.

The key characteristic of SAC is that the actor's goal is learning the policy which maximizes the expected reward while also maximizing its entropy in each visited state. This is achieved via the following maximum entropy score function:

$$J(\pi) := \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r_t + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right], \qquad (1)$$

where $\alpha > 0$ is a temperature parameter that weights the importance of the entropy term. In practice, stochastic policies are favored since the agent gains an extra reward proportional to the policy entropy. This results in better exploration, preventing the policy from prematurely converging to a bad local optimum. Moreover, SAC is designed to improve sample efficiency by reusing previously collected data stored in a replay buffer.
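As a concrete reading of Eq. (1), the sketch below estimates the entropy-augmented objective for a batch of visited states with a discrete-action policy (PyTorch; the temperature value and tensor shapes are illustrative assumptions, not the paper's settings).

```python
import torch
from torch.distributions import Categorical

def maxent_objective(policy_logits, rewards, alpha=0.2):
    """Monte-Carlo estimate of E[r_t + alpha * H(pi(.|s_t))] over a batch of states."""
    entropy = Categorical(logits=policy_logits).entropy()  # H(pi(.|s_t)) per state
    return (rewards + alpha * entropy).mean()

batch_logits = torch.randn(32, 4)   # policy outputs for 32 states, 4 actions
batch_rewards = torch.rand(32)      # rewards observed at those states
print(maxent_objective(batch_logits, batch_rewards))
```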

4.2 Multi-Agent Soft Actor-Critic

SAC was originally designed for the single-agent setting. To ease training in multi-agent environments, we adopt the framework of centralized training with decentralized execution, which allows team members to exploit extra information during training, as long as it is not exploited at test time. This learning paradigm can be naturally applied to actor-critic policy gradient methods since they decouple the policy (i.e., the actor) of each agent from its Q-value function (i.e., the critic). The actor has only access to the player's private state, which makes the policies decentralized. This framework is a standard paradigm for multi-agent planning [31, 45] and has been successfully employed by the multi-agent deep RL community to encourage the emergence of cooperative behaviors [20, 21, 36].

Actors In imperfect-information games, team members are not homogeneous. Intuitively, this is because: i) they have different private information, and ii) they play at different decision points of the EFG. Then, the sets of observations $\Omega_{T_1}$ and $\Omega_{T_2}$ are disjoint. This raises the need to have two separate policy networks, one for each team member, and to optimize them separately [15].

Critic In our setting, the critic has access to the complete team state, which contains the private information of each team member. In the card game of Bridge, for instance, team members (actors) cannot observe each other's cards. However, at training time, the critic can exploit this additional information. From a game-theoretic perspective, the critic works on an EFG with a finer information structure, such that if $T_1$ and $T_2$ were to be considered as a single player, the resulting player would have perfect recall. Then, since team members' observations are shared, and since rewards are homogeneous by definition of team, we exploit parameter sharing and employ a single critic for the whole team.

4.3 Signal Conditioning

Ex ante coordinated strategies are defined over the exponentially large joint space of deterministic policies $\Sigma_T$. Instead of working over $\Sigma_T$, STAC agents play according to decentralized policies conditioned on an exogenous signal, drawn just before the beginning of the game. To do so, STAC introduces in the learning process an approximate coordination device, which is modeled as a fictitious player:

Definition 2 (Signaler) Given a set of signals $\Xi$ and a probability distribution $x_s \in \Delta(\Xi)$, a signaler is a non-strategic player which draws $\xi \sim x_s$ at the beginning of each episode, and subsequently communicates $\xi$ to team members.
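A minimal sketch of such a signaler (the class name and interface are our own, introduced only for illustration):

```python
import numpy as np

class Signaler:
    """Non-strategic player: draws xi ~ x_s before each episode and shares it
    with all team members (the adversary never observes it)."""
    def __init__(self, n_signals, x_s=None, seed=0):
        # Uniform x_s by default, matching the assumption made in the text.
        self.x_s = np.full(n_signals, 1.0 / n_signals) if x_s is None else np.asarray(x_s)
        self.rng = np.random.default_rng(seed)

    def draw(self):
        return int(self.rng.choice(len(self.x_s), p=self.x_s))

signaler = Signaler(n_signals=3)
xi = signaler.draw()   # communicated to T1 and T2 at the start of the episode
```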


Algorithm 1 Soft Team Actor-Critic
Require: $\theta_1, \theta_2, \phi, \psi$ (initial parameters)
1: $\bar{\psi} \leftarrow \psi$ (initialize target network weights)
2: $\mathcal{D} \leftarrow \emptyset$ (initialize an empty replay pool)
3: for each iteration do
4:   $\xi \sim x_s$ (the signaler draws $\xi$)
5:   for each environment step do
6:     $a_t \sim \pi_\phi(a_t \mid s_t; \xi)$ (sample an action from the policy, conditioned on the signal)
7:     $s_{t+1} \sim T(s_{t+1} \mid s_t, a_t)$ (sample a transition from the environment)
8:     $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r_t, s_{t+1}, \xi)\}$ (store the transition in the replay pool)
9:   end for
10:  for each gradient step do
11:    $\psi \leftarrow \psi - \lambda_V \nabla_\psi J_V(\psi)$ (update the $V$-function parameters)
12:    $\theta_i \leftarrow \theta_i - \lambda_Q \nabla_{\theta_i} J_Q(\theta_i)$ for $i \in \{1, 2\}$ (update the $Q$-function parameters)
13:    $\phi \leftarrow \phi - \lambda_\pi \nabla_\phi J_\pi(\phi)$ (update policy weights)
14:    $\bar{\psi} \leftarrow \tau \psi + (1 - \tau) \bar{\psi}$ (periodically update target network weights)
15:  end for
16: end for

We assume that the number of signals is fixed, and that $x_s(\xi) = \frac{1}{|\Xi|}$ for all $\xi \in \Xi$.

At the beginning of the game, signals are completely uninformative, since they are drawn from a uniform probability distribution. At training time, STAC allows team members to reach a shared consensus about their joint behavior for each of the possible signals. This is achieved as follows (see also Algorithm 1):

Policy evaluation step STAC exploits a value-conditioner network whose parameters are produced via a hypernetwork [24], conditioned on the observed signal $\xi$. The role of the value-conditioner network is to perform action-value and state-value estimates taking $\xi$ into account. This is a natural step, since we expect team members to follow ad hoc policies for different observed signals. In this phase, signal encodings are fixed. Indeed, learning the hypernetworks' parameters has proven to worsen the performance of the architecture and, in some settings, prevented STAC from converging.

Policy improvement step The policy module is composed of a policy conditioner network whose parameters are produced by a fixed number of hypernetworks conditioned on the observed signal. The policy module takes as input the local state of a team member, and outputs a probability distribution over the available actions. The signals' hypernetworks are shared by all team members. This proved to be fundamental for associating a shared meaning to signals. The policy conditioner network is updated by minimizing the expected Kullback-Leibler divergence, as in the original SAC.
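The PyTorch sketch below illustrates the general idea of a signal-conditioned policy module in which a hypernetwork produces the parameters of the action head from the signal. The exact wiring in the paper (independent linear transformations for weights and an additional MLP for biases, shared across team members) differs, so the structure and layer sizes here are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignalConditionedPolicy(nn.Module):
    """Illustrative policy conditioner: a hypernetwork maps the one-hot signal
    to the weights and bias of the action head applied to local-state features."""
    def __init__(self, n_signals, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.n_actions = n_actions
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.hyper_w = nn.Linear(n_signals, hidden * n_actions)  # signal -> head weights
        self.hyper_b = nn.Linear(n_signals, n_actions)           # signal -> head bias

    def forward(self, obs, signal_onehot):
        h = self.encoder(obs)                                     # local-state features
        W = self.hyper_w(signal_onehot).view(-1, self.n_actions, h.shape[-1])
        b = self.hyper_b(signal_onehot)
        logits = torch.bmm(W, h.unsqueeze(-1)).squeeze(-1) + b    # signal-specific head
        return F.softmax(logits, dim=-1)                          # pi(. | s; xi)

policy = SignalConditionedPolicy(n_signals=3, obs_dim=8, n_actions=4)
obs = torch.randn(5, 8)
sig = F.one_hot(torch.randint(0, 3, (5,)), num_classes=3).float()
probs = policy(obs, sig)    # shape (5, 4), one action distribution per state
```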

Algorithm Algorithm 1 describes STAC's main steps. STAC employs a parametrized state-value function $V_\psi(s_t; \xi)$, a soft Q-function $Q_\theta(s_t, a_t; \xi)$, and a non-Markovian policy $\pi_\phi(a_t \mid s_t; \xi)$. Each of these networks is conditioned on the observed signal $\xi$.

For each iteration, the signaler draws a signal $\xi \sim x_s$ (Line 4). Signals are sampled only before the beginning of each game. Therefore, $\xi$ is kept constant for the entire trajectory (i.e., until a terminal state is reached). Records stored in the replay buffer are augmented with the observed signal $\xi$ for the current trajectory. We refer to these tuples (Line 8) as SARSAS records (i.e., a SARSA record with the addition of a signal).

After a fixed number of samples is collected and stored in the replay buffer, a batch is drawn from $\mathcal{D}$. Since $x_s$ may be, in principle, an arbitrary distribution over $\Xi$, we decouple the sampling process from the choice of $x_s$ by ensuring the batch is composed of the same number of samples for each of the available signals. Then, parameters are updated as in SAC [25] (Lines 11-14). STAC employs a target value network $V_{\bar{\psi}}(s_t; \xi)$ to stabilize training [41]. In Line 12, STAC uses clipped double-Q for learning the value function. Specifically, taking the minimum Q-value between the two approximators has been shown to reduce positive bias in the policy improvement step [22, 61].
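One possible way to implement the signal-balanced batch described above (a sketch with a hypothetical record layout; it samples with replacement for simplicity and assumes at least one record per signal has been collected):

```python
import random
from collections import defaultdict

def sample_balanced_batch(replay, n_signals, batch_size, rng=random):
    """Draw the same number of SARSAS records (s, a, r, s_next, xi) per signal,
    decoupling the updates from the signal distribution x_s."""
    by_signal = defaultdict(list)
    for record in replay:
        by_signal[record[-1]].append(record)      # group records by their signal xi
    per_signal = batch_size // n_signals
    batch = []
    for xi in range(n_signals):
        # assumes at least one stored record per signal; with-replacement sampling
        batch.extend(rng.choices(by_signal[xi], k=per_signal))
    rng.shuffle(batch)
    return batch
```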


5 Experimental Evaluation

5.1 Experimental Setting

Our performance metric will be the team's worst-case payoff, i.e., the team's expected payoff against a best-responding adversary.

Baselines Neural Fictitious Self-Play (NFSP) is a sample-based variation of fictitious play introduced by Heinrich and Silver [26]. For each player $i$, a transition buffer is exploited by a DQN agent to approximate a best response against $\pi_{-i}$ (i.e., the average strategies of the opponents). A second transition buffer is used to store data via reservoir sampling, and is used to approximate $\pi_i$ by classification. The first baseline (NFSP-independent) is an instantiation of an independent NFSP agent for each team member. The second baseline (NFSP-coordinated-payoffs) is equivalent to NFSP-independent but, this time, team members observe coordinated payoffs (i.e., they share their rewards and split the total in equal parts). The last baseline is the multi-agent variation of SAC presented in Section 4.2, which allows us to assess the importance of the signaling framework on top of this architecture.

Game Instances We start by evaluating STAC on instances of the coordination game presented in Example 1. We also consider the setting in which there's an imbalance in the team's payoffs, i.e., instead of receiving $K$ for playing both $\{\ell, L\}$ and $\{r, R\}$, team members receive $K/2$ and $K$, respectively.

Then, we consider a three-player Leduc poker game instance [56], which is a common benchmark for imperfect-information games. We employ an instance with rank 3 and two suits of identically-ranked cards. The game proceeds as follows: each player antes 1 chip to play. Then, there are two betting rounds, with bets limited to 2 chips in the first, and 4 chips in the second. After the first betting round, a public card is revealed from the deck. The reward to each player is the number of chips they had after the game, minus the number of chips they had before the game. Team members' rewards are computed by summing the rewards of $T_1$ and $T_2$, and then dividing the total in equal parts.

Architecture details Both STAC’s critics and policies are parameterized with a multi-layer per-ceptron (MLP) with ReLU non-linearities, followed by a linear output layer that maps to the agents’actions. The number of hidden layers and neurons depend on the instance of the game. For thecoordination game we used a single hidden layer and 64 neurons, while for Leduc-Poker threehidden layers with 128 neurons each. Conditioned on the external signal, the hypernetworks generatethe weights and biases of the MLP’s layers online, via independent linear transformations, and anadditional MLP respectively. We used Adam optimizer [30] with learning rates of 10´2 for the policynetwork and 0.5ˆ 10´3 for the value netowrks. At each update iteration we sample a batch of 128trajectories from the buffer of size 105.

5.2 Main Experimental Results

First, we focus on settings without perfect observability. As expected, algorithms employing Markovian policies fail to reach the optimal worst-case payoff with ex ante coordination (see Figure 2). NFSP-coordinated-payoffs is shown to guarantee a worst-case payoff close to $K/4$, which is the optimal value in the absence of coordination. Multi-agent SAC exhibits a cyclic behavior. This is caused by team members alternating the deterministic policies $\{\ell, L\}$ and $\{r, R\}$. The team is easily exploited as soon as the adversary realizes that team members are always playing on the same side of the game.

Introducing signals allows team members to reach near-optimal worst-case expected payoffs (Figure 3). The right plot of Figure 3 shows STAC's performance on a coordination game structured as the one in Example 1 but with the team's strictly positive payoffs equal to $K$ and $K/2$. In this case, a uniform $x_s$ over a binary signaling space cannot lead to the optimal coordinated behavior. However, if an additional signal is added to $\Xi$ (i.e., $|\Xi| = 3$), team members learn to play toward the terminal state with payoff $K/2$ after observing two out of three signals. When the remaining signal is observed, team members play to reach the terminal state with payoff $K$. This strategy is exactly the TMECor of the underlying EFG. This suggests that even if $x_s$ is fixed to a uniform distribution, having a reasonable number of signals could allow good approximations of the optimal team's coordinated behaviors.


Figure 2: Performances of the baseline algorithms on the coordination game of Example 1 with $K = 100$. The red line is the optimal worst-case expected payoff with coordination. Left: NFSP with coordinated payoffs. Right: Multi-agent SAC.

Figure 3: Performances of STAC on different coordination games. Left: STAC with $|\Xi| = 2$ on the balanced coordination game. Right: STAC with $|\Xi| = 3$ on the coordination game with $\{K, K/2\}$ strictly positive payoffs. Both games have $K = 100$, and the optimal worst-case reward is marked with the red line.

In perfectly observable settings such as Leduc poker, reaching a reasonable level of team coordination is less demanding. We observed that, most of the time, it is enough to condition policies on the observed history of play to reach some form of implicit coordination among team members. However, adding a signaling framework proved to be beneficial for team members even in this setting. This can be observed in Figure 4.

6 Related Works

Learning how to coordinate multiple independent agents [18, 5] via reinforcement learning requires tackling multiple concurrent challenges, e.g., non-stationarity, alter-exploration, and shadowed equilibria [39]. There is a rich literature of algorithms proposed for learning cooperative behaviors among independent learners. Most of them are based on heuristics encouraging the coordination of agents' policies [4, 6, 32, 33, 37, 38, 48].

Figure 4: Comparison on the Leduc poker instance (NFSP with correlated payoffs, Multi-agent SAC, Independent NFSP, and STAC).


Thanks to the recent successes of deep RL in single-agent environments [40, 54, 55], MARL is recently experiencing a new wave of interest, and some old ideas have been adapted to leverage the power of function approximators [46, 47]. Several successful variants of the actor-critic framework based on the centralized training - decentralized execution paradigm have been proposed [36, 20, 21, 57]. These works encourage the emergence of coordination and cooperation by learning a centralized Q-function that exploits additional information available only during training. Other approaches factorize the shared value function into an additive decomposition of the individual values of the agents [58], or combine them in a non-linear way [50], enforcing monotonic action-value functions. More recent works showed the emergence of complex coordinated behaviors across team members in real-time games [29, 35], even with fully independent asynchronous learning, by employing population-based training [28].

In Game Theory, player’s coordination is usually modeled via the notion of Correlated equilibrium(CE) [1], where agents make decisions following a recommendation function, i.e., a correlationdevice. Learning a CE of EFGs is a challenging problem as actions spaces grow exponentially in thesize of the game tree. A number of works in the MARL literature address this problem (see, e.g.,[16, 17, 23, 64]). Differently from these works, we are interested in the computation of TMECor, asdefined by [13].

In our work, we extend the SAC [25] algorithm to model the correlation device explicitly. By sampling a signal at the beginning of each episode, we show that the team members are capable of learning how to associate meaning to the uninformative signal. Concurrently to our work, Chen et al. [14] proposed a similar approach based on exogenous signals. Coordination is incentivized by enforcing a mutual information regularization, while we make use of hypernetworks [24] to condition both the policy and the value networks to guide players towards learning a TMECor.

7 Discussion

Equilibrium computation techniques for computing a team's ex ante coordinated strategies usually assume a perfect knowledge of the EFG, which has to be represented explicitly [13, 19]. We have introduced STAC, which is an end-to-end deep reinforcement learning approach to learning approximate TMECor in imperfect-information adversarial team games. Unlike prior game-theoretic approaches to compute TMECor, STAC does not require any prior domain knowledge.

STAC fosters coordination via exogenous signals that are drawn before the beginning of the game. Initially, signals are completely uninformative. However, team members are able to learn a shared meaning for each signal, associating specific behaviors to different signals. Our preliminary experiments showed that STAC allows team members to reach near-optimal team coordination even in settings that are not perfectly observable, where other deep RL methods failed to learn coordinated behaviors. On the other hand, we observed that, in problems with perfect observability, a form of implicit coordination may be reached even without signaling, by conditioning policies on the observed history of play.

References

[1] Robert J Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1(1):67–96, 1974.

[2] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 1983.

[3] Nicola Basilico, Andrea Celli, Giuseppe De Nittis, and Nicola Gatti. Team-maxmin equilibrium: efficiency bounds and algorithms. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[4] Daan Bloembergen, Michael Kaisers, and Karl Tuyls. Lenient frequency adjusted Q-learning. In Proc. of 22nd Belgium-Netherlands Conf. on Artif. Intel., 2010.

[5] Craig Boutilier. Sequential optimality and coordination in multiagent systems. In IJCAI, volume 99, pages 478–485, 1999.


[6] Michael Bowling and Manuela Veloso. Rational and convergent learning in stochastic games. In International Joint Conference on Artificial Intelligence, volume 17, pages 1021–1026. Lawrence Erlbaum Associates Ltd, 2001.

[7] George W Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376, 1951.

[8] Noam Brown and Tuomas Sandholm. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems, pages 689–699, 2017.

[9] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.

[10] Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.

[11] Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. arXiv preprint arXiv:1811.00164, 2018.

[12] Lucian Bu, Robert Babu, Bart De Schutter, et al. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.

[13] Andrea Celli and Nicola Gatti. Computational results for extensive-form adversarial team games. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[14] Liheng Chen, Hongyi Guo, Haifeng Zhang, Fei Fang, Yaoming Zhu, Ming Zhou, Weinan Zhang, Qing Wang, and Yong Yu. Signal instructed coordination in team competition. arXiv preprint arXiv:1909.04224, 2019.

[15] Xiangxiang Chu and Hangjun Ye. Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning. CoRR, abs/1710.00336, 2017. URL http://arxiv.org/abs/1710.00336.

[16] Ludek Cigler and Boi Faltings. Reaching correlated equilibria through multi-agent learning. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 509–516. International Foundation for Autonomous Agents and Multiagent Systems, 2011.

[17] Ludek Cigler and Boi Faltings. Decentralized anti-coordination through multi-agent learning. Journal of Artificial Intelligence Research, 47:441–473, 2013.

[18] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI, 1998(746-752):2, 1998.

[19] Gabriele Farina, Andrea Celli, Nicola Gatti, and Tuomas Sandholm. Ex ante coordination and collusion in zero-sum multi-player extensive-form games. In Advances in Neural Information Processing Systems, pages 9638–9648, 2018.

[20] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.

[21] Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[22] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

[23] Amy Greenwald and Keith Hall. Correlated Q-learning. In ICML, pages 242–249, 2003. URL http://www.aaai.org/Library/ICML/2003/icml03-034.php.

[24] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. CoRR, 2016. URL http://arxiv.org/abs/1609.09106.

[25] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861–1870, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/haarnoja18b.html.


[26] Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. CoRR, 2016. URL http://arxiv.org/abs/1603.01121.

[27] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems, 33(6):750–797, 2019.

[28] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

[29] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.

[30] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[31] Landon Kraemer and Bikramjit Banerjee. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82–94, 2016.

[32] Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning. Citeseer, 2000.

[33] Martin Lauer and Martin Riedmiller. Reinforcement learning for stochastic cooperative multi-agent systems. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3, pages 1516–1517. Citeseer, 2004.

[34] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[35] Siqi Liu, Guy Lever, Nicholas Heess, Josh Merel, Saran Tunyasuvunakool, and Thore Graepel. Emergent coordination through competition. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BkG8sjR5Km.

[36] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.

[37] Laëtitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 64–69. IEEE, 2007.

[38] Laëtitia Matignon, Guillaume Laurent, and Nadine Le Fort-Piat. A study of FMQ heuristic in cooperative multi-agent games. 2008.

[39] Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(1):1–31, 2012.

[40] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. URL http://arxiv.org/abs/1312.5602.

[41] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[42] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[43] Matej Moravcík, Martin Schmid, Neil Burch, Viliam Lisy, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.

[44] John Nash. Non-cooperative games. The Annals of Mathematics, 54(2):286–295, September 1951.


[45] Frans A Oliehoek, Matthijs TJ Spaan, and Nikos Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32:289–353, 2008.

[46] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2681–2690. JMLR.org, 2017.

[47] Gregory Palmer, Karl Tuyls, Daan Bloembergen, and Rahul Savani. Lenient multi-agent deep reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '18, pages 443–451, Richland, SC, 2018. International Foundation for Autonomous Agents and Multiagent Systems. URL http://dl.acm.org/citation.cfm?id=3237383.3237451.

[48] Liviu Panait and Sean Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, 2005.

[49] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

[50] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4295–4304, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/rashid18a.html.

[51] Julia Robinson. An iterative method of solving a game. Annals of Mathematics, pages 296–301, 1951.

[52] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[53] Yoav Shoham and Kevin Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2008.

[54] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[55] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.

[56] Finnegan Southey, Michael P Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes' bluff: Opponent modelling in poker. arXiv preprint arXiv:1207.1411, 2012.

[57] Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.

[58] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinícius Flores Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning. CoRR, abs/1706.05296, 2017. URL http://arxiv.org/abs/1706.05296.

[59] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[60] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[61] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[62] Bernhard Von Stengel and Françoise Forges. Extensive-form correlated equilibrium: Definition and computational complexity. Mathematics of Operations Research, 33(4):1002–1022, 2008.

[63] Bernhard von Stengel and Daphne Koller. Team-maxmin equilibria. Games and Economic Behavior, 21(1):309–321, 1997.


[64] Chongjie Zhang and Victor Lesser. Coordinating multi-agent reinforcement learning with limited communication. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pages 1101–1108. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

[65] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, pages 1729–1736, 2008.
