
A Biologically Inspired Architecture for Multiagent Games

Fernanda M. Eliott
Computer Science Division

Technological Institute of Aeronautics (ITA)

São José dos Campos, Brazil

Carlos H. C. Ribeiro
Computer Science Division

Technological Institute of Aeronautics (ITA)

São José dos Campos, Brazil

Abstract—This paper reports modifications to a biologically inspired robotic architecture originally designed to work in single-agent contexts. Several adaptations have been applied to the architecture, seeking as a result a model-free artificial agent able to accomplish shared goals in a multiagent environment, based on sensorial information translated into homeostatic variable values and on a rule database, which play roles respectively in temporal credit assignment and in action-state space exploration. The new architecture was tested in a well-known benchmark game, and the results were compared to those of the multiagent RL algorithm WoLF-PHC. We verified that the proposed architecture can produce coordinated behaviour equivalent to WoLF-PHC in stationary domains, and is also able to learn cooperation in non-stationary domains. The proposal is a first step towards an artificial agent that cooperates as a result of a biologically plausible computational model of morality.

Keywords- biologically inspired architectures, multiagent systems, game theory, reinforcement learning.

I. INTRODUCTION

Important engineering applications have resulted from the embodiment of biological concepts. In addition, biologically inspired artifices may trigger conceivable analogies to areas such as Philosophy and Psychology. We consider in this paper a way of achieving cooperative behaviour by modifying a biologically inspired computing architecture, making it able to fulfil tasks in multiagent (MA) environments so as to maximize the average reinforcement, rather than maximizing the agent's own reinforcement. In [9], a behaviour-based control architecture was instantiated over simulated primary emotions and hormonal processes maintained by a homeostatic system. The latter was inspired by the Somatic Marker hypothesis from [5]: the mechanism that would assist us in making fast decisions under low time and computational burden, supporting predictions of what might occur thereafter. In [11][12] the computational architecture was improved, giving rise to ALEC (Asynchronous Learning by Emotion and Cognition). It was influenced by the Clarion Model [18], an architecture intended to model cognitive processes from a psychological standpoint. ALEC was designed to use data from the sensors of a Khepera robot [14], and its multi-task abilities were constructed in the context of a single robot. ALEC is based on a homeostatic system and two levels that participate in decision-making. The bottom level is a backpropagation [23] feedforward artificial neural network (ANN) employing the Q-Learning algorithm [21] to learn and calculate the utility values of behaviour-state pairs. The top level is composed of independently kept rules that counterbalance the bottom level by providing different suggestions for action selection.

We propose in this paper modifications to ALEC to make it suitable for MA coordination tasks, hence the name Multi-A for the proposed architecture. Multi-A is intended to learn through its actions, reinforcements and environment, including other agents. Although in its validating experiments ALEC was considered as having a homeostatic system fed by the sensors of a Khepera robot, the idea was to build it in such a way that it is independent of specific robotic sensors; as a result, it should be easier to customize the system to different environments.

A. Related Work

The application of model-free techniques such as Q-learning to multiagent games is an open area of research. An important issue is how to accomplish cooperative behaviour in general-sum games when the Pareto-optimal solution is not a Nash equilibrium [13][15]. As a matter of fact, [13] contrasts the performance of classic MA learning algorithms when facing aspects that can be problematic to handle, such as: the number of states, agents and actions per agent; single, several and shadowed optimal equilibria; and deterministic versus stochastic games. With the aim of conceiving an investigative foundation and establishing an algorithm to be contrasted with Multi-A, we analyzed some classic MA algorithms: Correlated-Q [8]; Awesome [2]; CMLeS [7]; Manipulator [17]; M-Qubed [3][4] and WoLF-PHC [1]. The essential prerequisite for choosing a benchmark MA learning algorithm was its ability to handle stochastic and repeated general-sum games; moreover, it was important to have published results that illustrate different kinds of difficulties, as those results would be compared to those of Multi-A. We then decided to adopt WoLF-PHC as a benchmark. It was implemented according to the description in [1] and follows the “Win or Learn Fast” principle: learn fast when losing and cautiously when winning. It uses a variable learning rate, which helps make it robust against the alter-exploration problem, i.e. the perturbation caused to the agent's learning process by environmental exploration performed by other agents [13]. Reference [1] verified the convergence of WoLF-PHC to optimal policies in several general-sum stochastic games; likewise, [3] and [13] present different analyses of WoLF-PHC in self-play.
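For illustration, the sketch below shows the core of a WoLF-PHC update for a single state, following the description in [1]: a Q-learning step, an estimate of the average policy, and a policy hill-climbing step whose size depends on whether the agent is "winning". Variable names and default values (alpha, delta_win, delta_lose) are our own assumptions, and the simplex projection is simplified by clipping and renormalizing; this is a minimal sketch, not the implementation used in the experiments.

```python
import numpy as np

def wolf_phc_update(Q, pi, pi_avg, counts, s, a, r, s_next,
                    alpha=0.1, gamma=0.9, delta_win=0.01, delta_lose=0.04):
    """One WoLF-PHC update for state s after taking action a and observing (r, s_next).

    Q, pi, pi_avg map each state to a NumPy array over actions; counts maps
    each state to its visit count (used for the average-policy estimate).
    """
    n_actions = len(pi[s])

    # Q-learning update of the action value.
    Q[s][a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s][a])

    # Update the estimate of the average policy played so far in s.
    counts[s] += 1
    pi_avg[s] += (pi[s] - pi_avg[s]) / counts[s]

    # "Win or Learn Fast": small step while winning, large step while losing.
    winning = pi[s] @ Q[s] > pi_avg[s] @ Q[s]
    delta = delta_win if winning else delta_lose

    # Hill-climb the policy towards the currently greedy action, then renormalize.
    best = int(np.argmax(Q[s]))
    for b in range(n_actions):
        step = delta if b == best else -delta / (n_actions - 1)
        pi[s][b] = np.clip(pi[s][b] + step, 0.0, 1.0)
    pi[s] /= pi[s].sum()
```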


Using WoLF-PHC as a reference, we tested Multi-A in a benchmark game. We verified that the proposed architecture can produce coordinated behaviour equivalent to WoLF-PHC in stationary domains, and is also able to learn cooperation in non-stationary domains. The proposal is a first step towards an artificial agent that cooperates as a result of a biologically plausible computational model of morality.

II. MULTI-A ARCHITECTURE

The general scheme of Multi-A is illustrated in Figure 1. It is based on four major modules that operate as follows:

1) Sensory Module: stores the information the agent has about its environment. After the sensory and homeostatic variables are updated, a well-being index is calculated from an equation that qualitatively estimates the current situation of the agent. When the sensory variables present a certain configuration, an action is performed, and right after the execution of the action the sensory and homeostatic variables are updated and the well-being is recalculated.

2) Cognitive Module: stores and manipulates rules that consist of interval-based specifications of the sensory space and a mapped recommended action to be performed in face of such specifications. The range for rule specification may be, for example, quantified in intervals of 0.2 over the sensory variable values. The set of rules is intended to assist in cases of excessive generalization produced by the Learning Module.

3) Learning Module (adaptive system): uses one artificial neural network (ANN) per available action, together with the Q-learning algorithm [21], to estimate the utility value of the current sensory data and action. Learning, i.e. correction of the ANN weights, is made through the Backpropagation algorithm [11], employing the well-being as the target value (a minimal sketch of this step is given after this list).

4) Action Selection Module (AS): receives from the Learning Module the Q-values for each action, and then gathers the actions suggested by the rules that match the current sensory data (if any rule contemplates the current values of the sensory variables). At the beginning of a simulation, AS uses a high exploration rate over the action space.
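The sketch below illustrates the Learning Module of item 3: one small feedforward network per action, with the well-being acting as the reinforcement term of the Q-learning target. The class name, network size and learning rates are our own assumptions; the exact ALEC/Multi-A network details are not reproduced here.

```python
import numpy as np

class LearningModule:
    """One small feedforward ANN per action; Q-learning uses the well-being as reward."""

    def __init__(self, n_sensors, n_actions, n_hidden=8, lr=0.1, gamma=0.9):
        self.lr, self.gamma, self.n_actions = lr, gamma, n_actions
        # Per-action weights: hidden layer (with bias input) and linear output layer.
        self.nets = [(np.random.uniform(-0.5, 0.5, (n_hidden, n_sensors + 1)),
                      np.random.uniform(-0.5, 0.5, n_hidden))
                     for _ in range(n_actions)]

    def _forward(self, a, x):
        W_h, w_o = self.nets[a]
        h = np.tanh(W_h @ np.append(x, 1.0))      # hidden activations, bias appended
        return float(w_o @ h), h

    def q_values(self, x):
        """Utility estimates for every action given the current sensory vector x."""
        return np.array([self._forward(a, x)[0] for a in range(self.n_actions)])

    def update(self, x, a, well_being, x_next):
        """One Q-learning step: the well-being W plays the role of the reward."""
        q, h = self._forward(a, x)
        target = well_being + self.gamma * float(np.max(self.q_values(x_next)))
        err = target - q                          # TD error
        W_h, w_o = self.nets[a]
        grad_h = err * w_o * (1.0 - h ** 2)       # error backpropagated through tanh
        self.nets[a] = (W_h + self.lr * np.outer(grad_h, np.append(x, 1.0)),
                        w_o + self.lr * err * h)
```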

The total number of homeostatic variables and sensory inputs must be defined according to the domain. For the sake of generality, let m be the number of sensory variables (values in [0.0, 1.0]). Added to a bias, these variables feed the adaptive (ANN) and rule systems. There are also n homeostatic variables Hi, with values in [−1.0, 1.0], supplied by the sensory data. The values of the homeostatic variables are the result of an operation involving reinforcements and values of sensory variables, and the application domain where the architecture operates determines the nature of that operation. They provide input values to the equation that indicates the valence (well-being) of choices and environment, and are updated at each iteration step.

Figure 1. General scheme of the Multi-A architecture.

A. Well-being

The well-being W represents the current situation of the agent w.r.t. its interaction with the environment and other agents. It is calculated from the homeostatic variables, but with normalizing weights so that the final value falls in the range [−1.0, 1.0]. It thus produces the target value brought to the Learning Module for correcting the ANN weights through an updating algorithm (in our case, standard backpropagation). Additionally, W supports the cognitive system in calculating the likelihood of success or failure of a rule: if it is greater than or equal to a parameter RV, a rule is created (and if a rule that fits the current sensory data already exists, its success rate is updated).

More specifically, RV is a threshold on W, and consequently it is set within the same range [−1.0, 1.0]. All sensory data/action pairs that produce a well-being equal to or above RV are added to the rule set. Thereby, RV is a parameter that indirectly determines the influence of rules on the decision process: higher values of RV induce a smaller effect of rules on the decision process, since only the pairs that resulted in high values of well-being are initially kept in the set of rules.

W is calculated according to Equation 1:

W = Σ_{i=1}^{n} a_i H_i    (1)

where n is the number of homeostatic variables H. The weights a_i are set according to the relevance of each homeostatic variable to the task.

The reinforcements are normalized to the range [−1.0, 1.0], since the homeostatic variables fit the range [−1.0, 1.0]: as the homeostatic variables are created from reinforcements and sensory variables, the reinforcements are incorporated into the well-being through them.
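A minimal sketch of Equation 1 and of the comparison against the threshold RV is given below. The weight values in the example are illustrative only; the paper does not prescribe them.

```python
def well_being(H, a):
    """Equation 1: W = sum_i a_i * H_i.  The weights a_i are assumed normalized
    (e.g. sum of |a_i| <= 1) so that W stays within [-1.0, 1.0]."""
    assert len(H) == len(a)
    return sum(ai * hi for ai, hi in zip(a, H))

# Example with the four homeostatic variables of Section III (illustrative weights).
H = {'HM': 0.5, 'HC': 1.0, 'HD': -0.2, 'HN': 0.0}
a = {'HM': 0.2, 'HC': 0.4, 'HD': 0.2, 'HN': 0.2}
W = well_being(list(H.values()), list(a.values()))  # 0.1 + 0.4 - 0.04 + 0.0 = 0.46
rule_worthy = W >= 0.6                              # compared against the threshold RV (here RV = 0.6)
```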


B. Cognitive Module

Homeostatic variables provide guidelines about the environment. For example, if there is a homeostatic variable related to positive reinforcements, high values of that variable indicate that the agent has been through a situation that gives a positive reinforcement. Thus, the rule system can store the set of associated sensory data/action pairs that led to a goal state in domains where a positive reinforcement is only associated with such states. More specifically, during learning the action/state pair (indicated via the sensory variables) that leads to a faster drive towards the goal state (by supplying levels of well-being equal to or above RV) is stored via the rules, increasing the success count for that sensory data/action pair. Rules stored from past positive situations that became inadequate (e.g., by leading to frequent collisions against other agents) are deleted through updates of their success rates. In the early stages of exploration, the agent may produce sensory values that will not come about again; stored rules fitting those situations become useless and are deleted as the maximum allowed number of rules in the set is reached.

When an action is successful in a state (well-being greater than or equal to RV), a rule consistent with the current sensorial values of Multi-A and the action taken is created (if not already existent) and added to the set of rules. The existence of conflicting rules is allowed: the same description of the input sensorial values but with different actions. If a rule is used and the outcome is a well-being below RV, its failure rate is increased; otherwise its success rate is increased. Whenever a stored rule matches the sensory variable values, its recency is updated. Once in the set of rules, a rule can be manipulated: reduced, expanded or deleted. However, manipulations of a rule are only allowed if the rule fits the current sensory values and its suggested action has been performed at least MEx times. If a rule is not enforced a minimum number of MEx times, it is deleted only when the set of rules is full (the cardinality of this set being a design parameter). New rules replace the ones that were applied most remotely in time.
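The rule lifecycle described above can be summarized as in the sketch below: creation when W >= RV, success/failure bookkeeping, recency updates, and replacement of the least recently matched rule when the set is full. The data layout and names are our assumptions, and the MEx-gated reduction/expansion of rules is left out of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    intervals: tuple      # quantified sensory description (lower bounds of 0.2-wide bins)
    action: int
    success: int = 0
    failure: int = 0
    uses: int = 0         # how many times the rule's suggestion was actually performed
    last_seen: int = 0    # recency: last iteration step at which the rule matched

class CognitiveModule:
    def __init__(self, max_rules=100, RV=0.6, MEx=10):
        self.rules, self.max_rules, self.RV, self.MEx = [], max_rules, RV, MEx

    def matching(self, intervals):
        """Rules whose interval description fits the current (quantified) sensory data."""
        return [r for r in self.rules if r.intervals == intervals]

    def after_action(self, intervals, action, W, t):
        """Bookkeeping after an action: recency, success/failure rates, rule creation."""
        for r in self.matching(intervals):
            r.last_seen = t
            if r.action == action:
                r.uses += 1
                if W >= self.RV:
                    r.success += 1
                else:
                    r.failure += 1
        if W >= self.RV and not any(r.action == action for r in self.matching(intervals)):
            if len(self.rules) >= self.max_rules:
                # When the set is full, the new rule replaces the least recently matched one.
                self.rules.remove(min(self.rules, key=lambda r: r.last_seen))
            self.rules.append(Rule(intervals, action, success=1, uses=1, last_seen=t))
```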

C. Action Selection Module (AS)

The Action Selection Module receives values from both the Cognitive and Learning Modules. The Learning Module, through the ANN, delivers the utility values for the pairs (current sensory data, actions). If there is a rule that fits the current state, the Cognitive Module provides suggestions for action selection: each rule has the same weight CR. From the data sent by both Modules, the AS Module assigns values FV_i to the available actions through Equation 2:

FV_i = Q_i + CR × AC_i    (2)

where FV_i is the value of action i; CR is the weight of a rule that suggests action i; AC_i is the number of rules fitting the current sensory values that have i as their suggested action; and Q_i is the Q-value of action i. The action with maximum FV_i is selected with higher probability (e.g. using an ε-greedy strategy).

The value of the constant CR determines the importance given to the recommendation of a rule. There may be several rules that indicate the same action, and also conflicting rules. If a prominent number of rules support the same action, the Q-values provided by the ANN may turn out to be irrelevant in Equation 2, and thereby the same action will be performed continuously. The value of the multiplicative constant CR should therefore be low enough to allow for a balance between actions driven by rules (more specific) and by the Q-values (more general). The values of CR, MEx and RV must be defined according to the domain, while respecting the observations above.
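A short sketch of the action-selection step of Equation 2 with ε-greedy exploration; the helper names are ours.

```python
import random

def select_action(q_values, matching_rules, CR=0.2, epsilon=0.1):
    """Equation 2: FV_i = Q_i + CR * AC_i, followed by epsilon-greedy selection."""
    n_actions = len(q_values)
    # AC_i: number of matching rules whose suggestion is action i.
    AC = [sum(1 for r in matching_rules if r.action == i) for i in range(n_actions)]
    FV = [q_values[i] + CR * AC[i] for i in range(n_actions)]
    if random.random() < epsilon:
        return random.randrange(n_actions)                 # explore
    return max(range(n_actions), key=lambda i: FV[i])      # exploit the highest FV_i
```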

III. EXPERIMENTAL SETUP AND RESULTS

Each simulation comprised a number of trials, each corresponding to one game played from its beginning to its end, when a goal was reached. Thus the total number of steps (iterations) per trial was not always the same: during learning, agents can take more steps to solve the game. The total number of trials in each simulation was 50,000, and the number of simulations was 50. The tested games were those already reported in [1] and [3]: the Coordination Game and the Gridworld Game. The data depicted in the figures are the mean reinforcement taken at intervals of 100 trials. The mean reinforcement corresponds to two agents in self-play (i.e., playing against each other and using the same architecture). Multi-A was compared to the WoLF-PHC algorithm; since Multi-A operates on reinforcements within the interval [−1, 1], the original values of the game scores were normalized to this range. The state of a Multi-A agent is determined by the values of its sensory entries, suggesting a fair number of trials to train the homeostatic system.
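The evaluation protocol above (50 simulations of 50,000 trials, with the mean reinforcement averaged over windows of 100 trials and over the two agents in self-play) could be summarized as below; the environment and agent constructors and the per-trial routine are placeholders supplied by the caller.

```python
import numpy as np

def run_protocol(play_one_trial, make_env, make_agents,
                 n_sims=50, n_trials=50_000, window=100):
    """Mean reinforcement per 100-trial window, averaged over both agents and all simulations.

    play_one_trial(env, agents) is expected to run one game from start to goal
    and return the reinforcements obtained by the two agents in that trial.
    """
    curves = []
    for _ in range(n_sims):
        env, agents = make_env(), make_agents()
        returns = np.empty(n_trials)
        for t in range(n_trials):
            returns[t] = np.mean(play_one_trial(env, agents))   # mean over the two agents
        curves.append(returns.reshape(-1, window).mean(axis=1)) # 500 windows of 100 trials
    return np.mean(curves, axis=0)
```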

The number of sensory variables used in the experiments reported herein was 6, and the number of homeostatic variables was 4. The 6 sensory variables are:

• Clearance: maximum when there was no collision, low otherwise.

• Obstacle Density: high when there was a collision (either against an obstacle or against another agent); zero otherwise. Depending on the application, Obstacle Density may be used to differentiate kinds of collision, such as against an obstacle or against another agent.

• Movement: represents the number of steps the agent has been moving around during a trial. It is decreased when the agent stops (since standing still is usually a bad option: while the agent is still, other agents may have time to take advantage of the environment) and increased otherwise. The agent stops if there is a collision, be it against an obstacle or another agent.

• Energy: reflects whether the agent has been finding its goal often. It starts in the 1st trial of the 1st simulation with maximum value, but is decreased step by step. It only grows when the agent receives a positive reinforcement: the latter is added to the Energy value.


• Target Proximity and Target Direction: both simulate the light intensity sensors from ALEC [12], with the light source replaced by the target state. These sensory variables provide high values in the goal state and (with a smaller value) in the neighbouring states (but not diagonal to the target); otherwise the variables are zeroed. In the 1st trial of the 1st simulation these variables are always zero, as the agent still does not know where the target is: the sensory data and actions that lead to a target state have to be learned. Once the first positive reinforcement is achieved, those variables change their values. Notice that the target state and neighbour state discrimination provide incomplete information about the global localization of the agent in the environment.

The 4 homeostatic variables are (a sketch of their update follows the list):

• HM: related to the sensory variable Movement. In a multiagent task a decision should be taken fast, as there are other agents who can take advantage of any delay, so this variable is expected to reach its maximum and minimum values quickly. It is decreased on any collision and increased otherwise.

• HC: related to the sensory variable Clearance. It equals −1 when there is a collision and 1 otherwise.

• HD: related to the sensory variable Energy; it reflects for how long the agent has not received positive reinforcements.

• HN: fed by negative reinforcements; in their absence it equals zero.
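One way the four homeostatic variables above could be refreshed from the step outcome is sketched below. The increment sizes and the sign convention for HD are our own guesses; the paper does not give the exact update magnitudes.

```python
def update_homeostasis(H, collided, reward, step_delta=0.25):
    """Refresh HM, HC, HD, HN (all kept in [-1, 1]) from the latest step outcome.

    H: dict with keys 'HM', 'HC', 'HD', 'HN'; collided: collision in this step;
    reward: normalized reinforcement in [-1, 1].  Increment sizes are assumptions.
    """
    clip = lambda v: max(-1.0, min(1.0, v))
    # HM: decreased on any collision, increased otherwise; reaches its extremes quickly.
    H['HM'] = clip(H['HM'] + (-step_delta if collided else step_delta))
    # HC: -1 when there is a collision, 1 otherwise.
    H['HC'] = -1.0 if collided else 1.0
    # HD: tracks how long the agent has gone without positive reinforcement.
    H['HD'] = clip(H['HD'] + (step_delta if reward > 0 else -0.05))
    # HN: fed by negative reinforcements; zero in their absence.
    H['HN'] = clip(reward) if reward < 0 else 0.0
    return H
```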

A. The Coordination Game

Each of two agents has 4 action options in a grid world with 3 × 3 = 9 states. The game ends when at least one agent reaches its target position, receiving reinforcement R = 1. When both agents try to move to the same state, they remain still and both receive R = −0.01. The agents have to learn how to coordinate their paths to the target position so that both get reinforcement R = 1.

Figure 2 shows that both WoLF-PHC and Multi-A learn to coordinate their paths to the target state in self-play. The mean reinforcement of Multi-A is slightly lower since, in contrast with WoLF-PHC, Multi-A does not use complete global state information, and there is perceptual ambiguity caused by the internal sensory readings and by the adopted range for rule descriptions (sensory variable values quantified in intervals of 0.2: {[0, 0.2); [0.2, 0.4); [0.4, 0.6); [0.6, 0.8); [0.8, 1.0]}). The parameters set for Multi-A were MEx = 10, CR = 0.2, RV = 0.6.
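A minimal sketch of the coordination-game dynamics described above (3 × 3 grid, simultaneous moves, R = 1 on reaching a target, R = −0.01 when both agents try to enter the same cell). The grid indexing and move encoding are illustrative assumptions.

```python
MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}

def coordination_step(positions, actions, targets, size=3):
    """Simultaneous move of two agents on the 3x3 grid; returns (positions, rewards, done)."""
    proposed = []
    for (r, c), move in zip(positions, actions):
        dr, dc = MOVES[move]
        proposed.append((min(max(r + dr, 0), size - 1), min(max(c + dc, 0), size - 1)))
    if proposed[0] == proposed[1]:
        # Both agents try to enter the same state: they stay still and are penalized.
        return positions, (-0.01, -0.01), False
    rewards = tuple(1.0 if p == t else 0.0 for p, t in zip(proposed, targets))
    done = any(p == t for p, t in zip(proposed, targets))   # game ends on reaching a target
    return proposed, rewards, done
```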

Figure 3 shows the mean reinforcement for two agents under different architectures, Multi-A and WoLF-PHC, playing in the same simulation. Both learn to coordinate their actions in order to achieve their own goals. It is interesting to note that the two learning architectures managed to play together despite using different learning strategies. The parameters set in Multi-A were MEx = 10, CR = 0.15, RV = 0.6.

Figure 2. The Coordination Game: mean reinforcement of Multi-A and WoLF-PHC in self-play.

Figure 3. The Coordination Game: mean reinforcement of Multi-A and WoLF-PHC playing as colleagues in the same simulation.

B. The Gridworld Game

The difference from the previous game is that the only state that allows multiple simultaneous occupation is the target, since both agents have the same state as target. Figure 4 illustrates the game and the initial positions, which have a barrier with a 50% chance of closing. One of the players must learn to leave the starting position through the free state numbered 8, and the other must try to go northbound (with the risk of colliding against the barrier); they must then coordinate their paths to the target, otherwise they will be stuck trying to go to the same place (state 8), colliding against each other. Both algorithms learned to manage the task in 3 steps (the minimum required) and settled upon the same strategy. As learning proceeds, the average reinforcement converges to 75%: the agent that continually leaves the starting position via the free state always wins, whereas the agent that repeatedly tries the barrier reaches the target in approximately 50% of the trials (that is, only when the barrier opens). Figure 5 shows that, in fact, the average reinforcement for both algorithms reaches 75%. The parameters for Multi-A were MEx = 10, CR = 0.2 and RV = 0.2.

C. The Gridworld Game - Second Version

A second version of the grid world game was created to test the algorithms under a non-stationary condition. Now the barriers have a 50% closing probability only in the 1st step of each trial; during all the remaining steps of the trial the barriers remain open. For the games previously described both algorithms reached very similar performance, but this game produces different outcomes.


Figure 4. The Gridworld game. The barrier is illustrated in red colour.

Figure 5. The Gridworld game: mean reinforcement of Multi-A and WoLF-PHC in self-play.

Figure 6. The Gridworld Game - Version 2: mean reinforcement of WoLF-PHC in self-play with different exploration rates.

WoLF-PHC has the same behaviour and performance as in the original game; see Figure 6, WoLF-PHC(2000). Different exploration rates were set in order to detect whether it could perform differently; the conclusion was that either the algorithm behaves the same as in the original game (the Gridworld Game) or there is no effective learning. WoLF-PHC(2000) corresponds to an exploration rate decreasing linearly from 0.5 at trial 1 down to 0.0001 at trial 2000. Several other exploration strategies were tested and generated similar results. WoLF-PHC(−) had a zero exploration rate.

The results for Multi-A in the second version of the Gridworld game are shown in Figure 7. The parameters were CR = 0.15 and RV = 0.2. As explained in the description of the Cognitive Module, manipulations of a rule are only allowed if the rule fits the current sensory values and its suggested action has been performed at least MEx times. With the intention of evaluating the impact of the set of rules on the decision process, we tested different values of MEx and obtained different outcomes. In general, the rules eventually play an exploratory role: keeping the set of rules dynamic (rules being reduced, expanded or deleted) is likely to impact the selection of actions. Consequently, the agent might not maintain its action policy for long, thus frustrating the expectation of action selection from another agent and subsequently affecting their best-response policy (when the strategy for solving the task is opponent-dependent [1]).

The lower the maximum allowed number of rules in the set, the greater the likelihood that operations are applied to the set (since rules can be deleted all the time to open space for new rules), consequently resulting in changes in action selection. When there is a high exploration rate it may be convenient to keep a small value of MEx so that the set of rules adapts quickly during learning. However, depending on the application, at some point it may be appropriate to increase the value of MEx or even prohibit any change to the set of rules, so that the agent can keep an action policy and 'commit' itself, in return getting the same commitment from the other agent and enabling the emergence and maintenance of cooperation. In Figure 7, MA(MEx10) stands for Multi-A with MEx = 10, whereas MEx(−) indicates that the Cognitive Module is not used (rules are never created). Different MEx values cause diverse outcomes, summarized as follows:

1- The only tested MEx that produced consistent outcomes regarding the game result was MEx = 10. Although receiving negative reinforcement because of colliding with each other, the two agents learned to always try to go to the free state in the 1st step. Thus both are delayed, but able to get to the target state at the same time. As a result they both always win the game, and a trial lasts 4 steps instead of 3. Thus the average reinforcement converges to the maximum value minus one collision penalty, resulting in 0.99. All simulations converged to this very same ending (actually, only two failed), but some of them took longer: among all simulations, the first to converge did so by trial 1100 and the last by trial 34900. As MEx is small, the sets of rules of the two agents can be quite different from simulation to simulation, producing that variance.

2- MEx = 50, MEx = 100 and MEx(−): all produced agents that do not know how to deal with collisions. As they change their paths during the simulation (because of perceptual ambiguity caused by the internal sensory readings and by the adopted range for rule descriptions), and as there are collisions, they have difficulty reaching the target state.

3- MEx = 15, MEx = 20, MEx = 25 and MEx = 40: we observed three kinds of resulting action policies: the first similar to those observed for MEx = 50, MEx = 100 and MEx(−); the second the same as for MEx = 10; and a third where both agents first collide into a wall and then go to the target. For the latter, the final mean reinforcement was maximum, since there were no further negative reinforcements provided by collisions against a wall; that strategy is a good one only if there is certainty and cooperative expectation about the other agents. MEx = 40 and MEx = 20 were omitted from the graph for better visualization.

Figure 7. The Gridworld Game - Version 2: mean reinforcement of Multi-A with different values of MEx.

With the aim of ensuring that the Cognitive System works together with the Learning Module, instead of determining the selected actions all by itself, another simulation was run without the activation of the Learning Module. In this case the agent performance was unsatisfactory.

IV. FINAL REMARKS AND FUTURE WORK

We proposed in this paper Multi-A, a multiagent version of a biologically inspired computational agent for multiagent games. Multi-A was tested in two benchmark coordination games, producing results similar to those of the WoLF-PHC algorithm but without using complete global localization information. In a modified, non-stationary version of a coordination game, Multi-A produced a higher mean reinforcement. Such behaviour is especially encouraging because the original idea was just to adapt the ALEC computational model for single agents to a general MA context via the inclusion of an ad hoc module designed to bring up cooperation or to model knowledge about other agents; such a module has not yet been fully implemented.

Issues still unanswered about Multi-A are: a) will the agent be mixed (cooperative and competitive, acting according to a certain pattern of reciprocity) or purely cooperative? b) will it work with agents operating under different learning algorithms (not just in self-play) in other games? For this second issue, the results we obtained in the Coordination Game suggest that Multi-A and WoLF-PHC can coordinate their action selections. Consequently, depending on the task, Multi-A is not restricted to self-play, which is an interesting finding already achieved with the proposed architecture.

Together with answering those questions and improving Multi-A, our original purpose was to move towards an artificial agent that cooperates as a result of a biologically plausible computational model of morality. This will be achieved through the inclusion of the additional module mentioned above, yet to be devised and implemented in the project.

ACKNOWLEDGMENTS

The authors thank CNPq and FAPESP for the financial support.

REFERENCES

[1] M. Bowling, M. Veloso, “Multiagent learning using a variable learning rate”, Artificial Intelligence, Vol. 136, 2002, pp. 215-250.

[2] V. Conitzer, T. Sandholm, “Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents”, Machine Learning, Vol. 67 (1-2), 2007, pp. 23-43.

[3] J. Crandall, Learning Successful Strategies in Repeated General-Sum Games, Ph.D. thesis, Brigham Young University, 2005.

[4] J. Crandall, M. Goodrich, “Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning”, Machine Learning, Vol. 82 (3), 2011, pp. 281-314.

[5] A. Damásio, Descartes' Error: Emotion, Reason and the Human Brain (O Erro de Descartes: emoção, razão e cérebro humano), Portugal, Fórum da Ciência, Publicações Europa-América, 1995.

[6] A. Damasio, H. Damasio and A. Bechara, “Emotion, decision making and the orbitofrontal cortex”, Cerebral Cortex, Oxford University Press, Oxford, Vol. 10 (3), 2000, pp. 295-307.

[7] D. Chakraborty, P. Stone, “Convergence, targeted optimality, and safety in multiagent learning”, Proceedings of ICML 2010, pp. 191-198.

[8] A. Greenwald, K. Hall, “Correlated-Q learning”, Proceedings of the International Conference on Machine Learning (ICML), 2003.

[9] S. Gadanho, Reinforcement Learning in Autonomous Robots: An Empirical Investigation of the Role of Emotions, PhD Thesis, Edinburgh University, 1999.

[10] S. Gadanho, L. Custódio, “Learning behavior-selection in a multi-goal robot task”, Technical Report RT-701-02, Instituto de Sistemas e Robótica, IST, Lisbon, 2002a.

[11] S. Gadanho, L. Custódio, “Asynchronous learning by emotions and cognition”, Proceedings of the Seventh International Conference on Simulation of Adaptive Behavior, From Animals to Animats, 2002b.

[12] S. Gadanho, “Learning behavior-selection by emotions and cognition in a multi-goal robot task”, Journal of Machine Learning Research (JMLR), (4), 2003, pp. 385-412.

[13] L. Matignon, G. Laurent and N. Le Fort-Piat, “Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems”, The Knowledge Engineering Review, Cambridge University Press, 27 (1), 2012, pp. 1-31.

[14] F. Mondada, E. Franzi and P. Ienne, “Mobile robot miniaturization: A tool for investigation in control algorithms”, in Yoshikawa and Miyazaki (eds), Experimental Robotics III, Lecture Notes in Control and Information Sciences, London, Springer-Verlag, 1994.

[15] J. Nash, “Equilibrium points in n-person games”, Proceedings of the National Academy of Sciences, 36 (1), 1950, pp. 48-49.

[16] M. Osborne, A. Rubinstein, A Course in Game Theory, Cambridge, MA: MIT Press, 1994.

[17] R. Powers, Y. Shoham, “Learning against opponents with bounded memory”, Proceedings of IJCAI 2005.

[18] R. Sun, T. Peterson, “Autonomous learning of sequential tasks: experiments and analysis”, IEEE Transactions on Neural Networks, Vol. 9 (6), 1998, pp. 1217-1234.

[19] R. Sun, “The CLARION cognitive architecture: extending cognitive modeling to social simulation”, in Ron Sun (ed.), Cognition and Multi-Agent Interaction, Cambridge University Press, 2006.

[20] R. Sutton, A. Barto, Reinforcement Learning, The MIT Press, 1998.

[21] C. Watkins, Learning from Delayed Rewards, PhD Thesis, Cambridge University, 1989.

[22] C. Watkins, P. Dayan, “Technical note: Q-Learning”, Machine Learning, (8), 1992, pp. 279.

[23] P. Werbos, “Beyond regression: new tools for prediction and analysis in the behavioral sciences”, PhD Thesis, Harvard University, 1974.
