2013 BRICS Congress on Computational Intelligence & 11th Brazilian Congress on Computational Intelligence
978-1-4799-3194-1/13 $31.00 © 2013 IEEE
DOI 10.1109/BRICS-CCI-CBIC.2013.45
A Biologically Inspired Architecture for Multiagent Games
Fernanda M. Eliott
Computer Science Division
Technological Institute of Aeronautics (ITA)
São José dos Campos, Brazil
Carlos H. C. Ribeiro
Computer Science Division
Technological Institute of Aeronautics (ITA)
São José dos Campos, Brazil
Abstract—This paper reports modifications to a biologically inspired robotic architecture originally designed to work in single-agent contexts. Several adaptations have been applied to the architecture, seeking as a result a model-free artificial agent able to accomplish shared goals in a multiagent environment, from sensorial information translated into homeostatic variable values and a rule database that play roles, respectively, in temporal credit assignment and action-state space exploration. The new architecture was tested in a well-known benchmark game, and the results were compared to those of the multiagent RL algorithm WoLF-PHC. We verified that the proposed architecture can produce coordinated behaviour equivalent to WoLF-PHC in stationary domains, and is also able to learn cooperation in non-stationary domains. The proposal is a first step towards an artificial agent that cooperates as a result of a biologically plausible computational model of morality.
Keywords- biologically inspired architectures, multiagent systems, game theory, reinforcement learning.
I. INTRODUCTION
Important engineering applications have resulted from the
embodiment of biological concepts. Besides, biologically
inspired artifacts may trigger conceivable analogies to areas such as
Philosophy and Psychology. We consider in this paper a way of
achieving cooperative behaviour by modifying a biologically
inspired computing architecture, making it able to fulfil tasks
in multiagent (MA) environments by maximizing the average
reinforcement rather than the agent's own rein-
forcement. In [9], a behaviour-based control architecture was
instantiated over simulated primary emotions and hormonal
processes maintained by a homeostatic system. The latter was
inspired by the Somatic Marker hypothesis from [5]: a mechanism
that would assist us in making fast decisions under low time and
computational burden, supporting predictions of what might
occur next. In [11], [12] the computational architecture
was improved giving rise to ALEC (Asynchronous Learning
by Emotion and Cognition). It was influenced by the Clar-
ion Model [18], an architecture intended to model cognitive
processes in a psychological approach. ALEC was designed
to use data from sensors of a Khepera robot [14] and its
multi-task abilities were constructed in the context of a single
robot. ALEC is based on a homeostatic system and two levels
participating in the processes of decision-making. The bottom
level is a backpropagation [23] feedforward artificial neural
network (ANN) employing the Q-Learning algorithm [21]
to learn and calculate the utility values of behaviour-state
pairs. The top level is composed of independently kept rules
that counterbalance the other level by providing different
suggestions of action selection.
We propose in this paper modifications to ALEC to make
it suitable to MA coordinating tasks, hence the name Multi-
A for the proposed architecture. Multi-A is intended to learn
through its actions, reinforcements and environment, including
other agents. Although in its validating experiments ALEC
had a homeostatic system fed by the
sensors of a Khepera robot, the idea was to build it in
such a way that it is independent of specific robotic sensors;
as a result, it should be easier to customize the system to different
environments.
A. Related Work
Application of model-free techniques such as Q-learning in
multiagent games is an open area of research. An important
issue is how to accomplish cooperative behaviour in general-
sum games when the Pareto-optimal solution is not a Nash
equilibrium [13] [15]. As a matter of fact, [13] contrasts the
performance of classic MA learning algorithms when facing
aspects that can be problematic to handle, such as: number
of states, agents and actions per agent; single, several and
shadowed optimal equilibrium; deterministic versus stochastic
games. With the aim of conceiving an investigative foundation
and establishing an algorithm to be contrasted with Multi-A,
we analyzed some classic MA algorithms: Correlated-Q [8];
Awesome [2]; CMLeS [7]; Manipulator [17]; M-Qubed [3] [4]
and WoLF-PHC [1]. The essential prerequisite for choosing a
benchmarking MA learning algorithm was its ability to handle
stochastic and repeated general-sum games; moreover it was
important to have published results that illustrate different
kinds of difficulties, as those results would be compared to
those by Multi-A. We then decided to consider WoLF-PHC
as a benchmark. It was developed according to the description
from [1] and follows the “Win or Learn Fast” principle: learn
fast when losing and carefully when winning. It uses a
variable learning rate to help make it robust against the alter-
exploration problem, i.e. the perturbation caused to an agent's
learning process by the environmental exploration of other
agents [13]. Reference [1] verified the convergence of WoLF-
PHC to optimal policies in several general-sum stochastic
games; likewise [3] and [13] set forth different analyses of
WoLF-PHC in self-play.
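For concreteness, the PHC update with the WoLF variable learning rate can be sketched as follows. This is a minimal tabular sketch in our own notation, assuming the two step sizes δ_w < δ_l from [1]; it is not the original implementation, and details such as the exact policy-projection step are simplified:

```python
import numpy as np

def wolf_phc_update(Q, pi, avg_pi, counts, s, a, r, s2, alpha=0.1,
                    gamma=0.9, delta_w=0.01, delta_l=0.04):
    """One WoLF-PHC step: Q-learning plus policy hill-climbing with a
    variable learning rate (learn fast when losing, slowly when winning)."""
    # Standard Q-learning update for the visited pair (s, a).
    Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s][a])
    # Update the empirical average policy for state s.
    counts[s] += 1
    avg_pi[s] += (pi[s] - avg_pi[s]) / counts[s]
    # "Winning" if the current policy outperforms the average policy.
    winning = pi[s] @ Q[s] > avg_pi[s] @ Q[s]
    delta = delta_w if winning else delta_l
    # Hill-climb: move probability mass toward the greedy action.
    greedy = int(np.argmax(Q[s]))
    n = len(pi[s])
    for i in range(n):
        if i == greedy:
            pi[s][i] = min(1.0, pi[s][i] + delta)
        else:
            pi[s][i] = max(0.0, pi[s][i] - delta / (n - 1))
    pi[s] /= pi[s].sum()  # renormalize to a valid distribution
```

In self-play, both agents run this same update on their own tables; the asymmetric step sizes are what damp the oscillations that plain PHC exhibits.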
Using WoLF-PHC as reference, we tested Multi-A in a
benchmark game. We verified that the proposed architecture
can produce coordinated behaviour equivalent to WoLF-PHC
in stationary domains, and is also able to learn cooperation in
non-stationary domains. The proposal is a first step towards
an artificial agent that cooperates as a result of a biologically
plausible computational model of morality.
II. MULTI-A ARCHITECTURE
The general scheme of Multi-A is illustrated in Figure 1. It
is based on four major modules that operate as follows:
1) Sensory Module: stores the information the agent has
about its environment. After the update of the sensory
and homeostatic variables, a well-being index is cal-
culated from an equation that qualitatively estimates
the current situation of the agent. When the sensory
variables present a certain configuration, an action is
performed, and right after the execution of the action,
sensory and homeostatic variables are updated and the
well-being is recalculated.
2) Cognitive Module: stores and manipulates rules that
consist of interval-based specifications of the sensory
space and a mapped recommended action to be per-
formed in face of such specifications. The range for rule
specification may be, for example, quantified in intervals
of 0.2 over the sensory variable values. The set of rules
is supposed to assist in cases of excessive generalization
produced by the Learning Module.
3) Learning Module (adaptive system): uses an artificial
neural network (ANN) for each available action, and the
Q-learning algorithm [21] to estimate the utility value
for the current sensory data and action. Learning or
correction of the weights of the ANN is made through
the Backpropagation algorithm [23], employing the well-
being as the target value.
4) Action Selection Module (AS): receives from the Learn-
ing Module the Q-values for each action, and then gath-
ers actions suggested by the rules that match with the
current sensory data (if there is any rule that contemplates
the current values of the sensory variables). During the
beginning of a simulation, AS uses a high exploration
rate for the action space.
The total number of homeostatic variables and sensory
inputs must be defined according to the domain. For the sake
of generality let m be the number of sensory variables (values
in [0.0, 1.0]). Added to a bias, these variables feed the adaptive
(ANN) and rule systems. There are also n homeostatic vari-
ables Hi with values in [−1.0, 1.0] supplied by the sensory
data. The values for the homeostatic variables are the result
of an operation involving reinforcements and values of sensory
variables, and the application domain where the architecture
operates will determine the nature of such operation. They
provide input values to the equation that indicates the valence
Figure 1. General scheme of Multi-A architecture.
(well-being) of choices and environment, and are updated at
each iteration step.
A. Well-being
The well-being W represents the current situation of the
agent w.r.t. its interaction with the environment and other
agents. It is calculated from the homeostatic variables, but
with normalizing weights so that the final value falls in the
range [−1.0, 1.0]. It thus produces the target value brought to
the Learning Module for correcting the ANN weights through
an updating algorithm (in our case, standard backpropagation).
Additionally, W supports the cognitive system in calculating
the likelihood of success or failure of a rule: if it is greater
than or equal to a parameter RV a rule is created (and if a rule
that fits the current sensory data already exists, its successful
rate will be updated).
More specifically, RV is a threshold to W , consequently
it is set within the same range [−1.0, 1.0]. All pairs sensory
data/action that produce well-being equal or above RV will
be added to the rule set. Thereby, RV is a parameter that
indirectly determines the influence of rules on the decision
process: higher values of RV induce a lesser effect of rules
on the decision process - since just the pairs that resulted into
high values of well-being will be initially kept in the set of
rules.
W is calculated according to Equation 1:

W = \sum_{i=1}^{n} a_i H_i    (1)
where n is the number of homeostatic variables H . The
weights ai are set according to the relevance of each homeo-
static variable to the task.
The reinforcements are normalized to the range [−1.0, 1.0], since the homeostatic variables fit the same range: as
homeostatic variables are created from reinforcements and
sensory variables, the reinforcements are incorporated to the
well-being via them.
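In code, Equation 1 is a plain weighted sum. A minimal sketch (function and argument names are ours), assuming the weights are normalized so that W stays within [−1.0, 1.0]:

```python
def well_being(H, a):
    """Eq. 1: W = sum_i a_i * H_i.  H holds the homeostatic variables,
    each in [-1, 1]; the weights a are assumed normalized (sum of |a_i|
    at most 1) so that W also falls in [-1, 1]."""
    assert len(H) == len(a), "one weight per homeostatic variable"
    return sum(ai * hi for ai, hi in zip(a, H))
```

With equal weights a_i = 1/n, W is simply the mean of the homeostatic variables, which already respects the required range.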
B. Cognitive Module
Homeostatic variables provide guidelines about the environ-
ment. For example: if there is a homeostatic variable related
to positive reinforcements, high values of that homeostatic
variable indicate the agent has been through a situation that
gives a positive reinforcement. Thus, the rule system can store
the set of associated sensory data/action pairs that led to a goal state
in domains where a positive reinforcement is only associated
with such states. More specifically, during learning the pair
action/state (indicated via sensory variables) that leads to a
faster drive towards the goal state (by supplying levels of
well-being equal or above RV ) will be stored via the rules,
increasing the count of success for that sensory data/action
pair. Stored rules from past positive situations but that became
inadequate (e.g., by leading to frequent collisions against
other agents) will be deleted through updates of their success
rates. In the early stages of exploration, the agent may produce
sensory values that will not come about again, and stored rules
fitting that situation will become useless and be deleted once
the maximum allowed number of rules in the set is reached.
When an action is successful in a state (well-being greater
than or equal to RV ), a rule consistent with the current
sensorial values of Multi-A and the action taken will be created
(if not already existent), and added to the set of rules. The
existence of conflicting rules is allowed: the same description
of the input sensorial values but with different actions. If a
rule is used and the outcome is a well-being below RV , its
failure rate will be increased; otherwise the success rate will
be increased. Whenever a stored rule matches the sensory
variable values, its recency is updated. Once in the set of rules,
the rule can be manipulated: reduced, expanded or deleted.
However, manipulations of a rule are only allowed if the rule
fits the current sensory values and its suggested action has
been performed at least MEx times. If a rule has not been
applied at least MEx times, it will be deleted only if
the set of rules is complete (the cardinality of this set being
a design parameter). New rules replace the ones that were
applied more remotely in time.
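The rule life cycle described above can be sketched as a capacity-bounded store. Field names and the exact bookkeeping below are our assumptions (the paper does not give the implementation); the sketch keeps the stated invariants: creation at well-being ≥ RV, success/failure counts, recency, deletion only after MEx applications, and eviction of the most remotely applied rule when the set is full:

```python
class RuleBase:
    """Capacity-bounded store of (sensory interval, action) rules with
    success/failure counts, recency, and an application counter (MEx)."""
    def __init__(self, capacity, RV, MEx):
        self.capacity, self.RV, self.MEx = capacity, RV, MEx
        self.rules = {}   # key: (sensory_bin, action) -> stats dict
        self.clock = 0    # logical time, used for recency

    def observe(self, sensory_bin, action, well_being):
        """Create or update the rule matching the executed pair."""
        self.clock += 1
        key = (sensory_bin, action)
        if key not in self.rules:
            if well_being < self.RV:
                return                      # only promising pairs enter
            if len(self.rules) >= self.capacity:
                self._evict()
            self.rules[key] = {"success": 0, "failure": 0,
                               "applied": 0, "last_used": self.clock}
        r = self.rules[key]
        r["applied"] += 1
        r["last_used"] = self.clock
        if well_being >= self.RV:
            r["success"] += 1
        else:
            r["failure"] += 1
        # A rule may be manipulated (here: deleted) only after MEx uses.
        if r["applied"] >= self.MEx and r["failure"] > r["success"]:
            del self.rules[key]

    def _evict(self):
        # Full set: replace the rule applied most remotely in time.
        oldest = min(self.rules, key=lambda k: self.rules[k]["last_used"])
        del self.rules[oldest]
```

Reduction and expansion of a rule's sensory intervals are omitted here; only creation, reinforcement, failure-driven deletion and recency-based eviction are sketched.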
C. Action Selection Module (AS)
The Action Selection Module receives values from both
the Cognitive and Learning Modules. The Learning Module,
through the ANN, delivers the utility values for the pairs
(current sensory data, actions). If there is a rule that fits the
current state, the Cognitive Module provides suggestions of
action selection: each rule will have the same weight CR.
From the data sent by both Modules, the AS Module will
assign values FVi to the available actions through equation 2:
FVi = Qi + CR×ACi (2)
where FVi is the value of action i; CR is the weight of a
rule that suggests action i; ACi is the number of rules fitting
the current sensory values that have i as action suggestion and
Qi is the Q-value of action i.
The action with maximum FVi is selected with higher
probability (e.g. using an ε-greedy strategy).
The value of the constant CR determines the importance
given to the recommendation of a rule. There may be several
rules that indicate the same action and also conflicting rules. If
a prominent number of rules support the same action, the Q-
values provided by the ANN may turn out to be irrelevant
in Equation 2, thereby the same action will be performed
continuously. The value of the multiplicative constant CRshould be low enough to allow for a balance between the
application of actions driven by rules (more specific) and by
the Q-values (more general). The values of CR, MEx, and RV
must be defined according to the domain while respecting the
observations above.
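Equation 2 combined with ε-greedy selection can be sketched as follows (names are ours; the rng argument is an assumption added for testability):

```python
import random

def select_action(q_values, rule_action_counts, CR, epsilon, rng=random):
    """Eq. 2: FV_i = Q_i + CR * AC_i, where AC_i counts the matching
    rules suggesting action i; the max-FV action is then taken with
    probability 1 - epsilon, otherwise a uniformly random action."""
    fv = [q + CR * ac for q, ac in zip(q_values, rule_action_counts)]
    if rng.random() < epsilon:
        return rng.randrange(len(fv))       # explore uniformly
    return max(range(len(fv)), key=fv.__getitem__)
```

Note how a large CR or many agreeing rules can dominate the Q-values, which is exactly the imbalance the text warns against.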
III. EXPERIMENTAL SETUP AND RESULTS
Each simulation comprised a number of trials, i.e., repetitions
of the game. Each trial lasted one complete game: from its
beginning to its end, when a goal was reached. Thus
the total number of steps or iterations for each trial was not
always the same: during learning, agents can spend more time
steps until solving the game. The total number of trials for
each simulation was 50,000. The number of simulations was
50. The tested games were those already reported in [1] and
[3]: the Coordination Game and the Gridworld Game. The
depicted data in the figures are the mean reinforcement taken
at intervals of 100 trials. The mean reinforcement corresponds
to two agents in self play (i.e., playing against each other and
using the same architecture). Multi-A was compared to the
WoLF-PHC algorithm, and as it operates on reinforcements
within the interval [−1, 1], the original values of the game
scores were normalized to this range. The state of a Multi-
A agent is determined by the values of its sensory inputs,
which suggests that a fair number of trials is needed to train
the homeostatic system.
The number of sensory variables used in the experiments
reported herein was 6 and the number of homeostatic variables
was 4. The 6 sensory variables are:
• Clearance: a maximum value when there was no collision,
low otherwise.
• Obstacle Density: high when there was collision (either
against an obstacle or another agent); zero otherwise.
Depending on the application, Obstacle Density may be
used to differentiate kinds of collision, such as against an
obstacle or against another agent.
• Movement: represents the number of steps the agent has
been moving around during a trial. It is decreased when
the agent stops (since that is usually a bad option: the
agent being still, other agents might have time to take
advantage of the environment) and increased otherwise.
The agent stops if there is a collision, be it against an
obstacle or another agent.
• Energy: reflects whether the agent has been finding its goal
often. It starts in the 1st trial of the 1st simulation with
the maximum value, but is decreased step-by-step. It only
grows when the agent receives positive reinforcement:
the latter is added to the Energy value.
• Target Proximity and Target Direction: both simulate the
light intensity sensors from ALEC [12]: light source is
replaced by target state. Those sensory variables provide
high values in the goal state and (with a smaller value)
in the neighboring states (but not diagonal to the target).
Otherwise the variables will be zeroed. In the 1st trial
of the 1st simulation these variables are always zeroed
as the agent still does not know where the target is:
the sensory data and actions that lead to a target state
have to be learned. Once the first positive reinforcement
is achieved, those variables change their values. Notice
that the target state and neighbor state discrimination
provide incomplete environmental information about the
global localization of the agent in the environment.
The 4 homeostatic variables are:
• HM : related to the sensory variable Movement. In a
multiagent task a decision should be taken fast, as there
are other agents who can take advantage of any delay. So
this variable is expected to reach its maximum and minimum
values quickly. It is decreased at any collision and
increased otherwise.
• HC: related to the sensory variable Clearance. It equals
−1 when there is collision and is 1 otherwise.
• HD: related to the sensory variable Energy, it reflects for
how long the agent has not received positive reinforce-
ments.
• HN : a variable fed by negative reinforcements; in their
absence it equals zero.
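One plausible reading of these update rules, as a sketch (the exact step magnitudes are not specified in the text; the step sizes below are our assumptions):

```python
def clamp(x, lo=-1.0, hi=1.0):
    """Keep a homeostatic variable inside its allowed range [-1, 1]."""
    return max(lo, min(hi, x))

def update_homeostasis(H, collided, reward, step=0.5):
    """Update the four homeostatic variables H = [HM, HC, HD, HN].
    `step` is an assumed magnitude for the HM increment/decrement."""
    HM, HC, HD, HN = H
    # HM: moves quickly to its extremes; down on collision, up otherwise.
    HM = clamp(HM - step if collided else HM + step)
    # HC: -1 exactly when there is a collision, +1 otherwise.
    HC = -1.0 if collided else 1.0
    # HD: decays while no positive reinforcement arrives, recovers with it.
    HD = clamp(HD + reward if reward > 0 else HD - 0.05)
    # HN: fed by negative reinforcements, zero in their absence.
    HN = clamp(reward) if reward < 0 else 0.0
    return [HM, HC, HD, HN]
```

Any schedule with the same qualitative behaviour (fast saturation of HM, slow decay of HD) would fit the description equally well.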
A. The Coordination Game
Each of two agents has 4 options of actions in a grid world
with 3× 3 = 9 states. The game ends when at least one agent
reaches its target position, receiving reinforcement R = 1.
When both agents try to go to the same state, they remain
still and both receive R = −0.01. The agents have to learn
how to coordinate their paths to the target position so they
both will get reinforcement R = 1.
Figure 2 shows that both WoLF-PHC and Multi-A learn to
coordinate their paths to the target state in self-play. The mean
reinforcement of Multi-A is slightly lower, since in contrast
with WoLF-PHC, Multi-A does not use complete global state
information, and there is perceptual ambiguity caused by the
internal sensory readings and by the adopted range for rules
description (sensory variable values quantified in intervals of
0.2 : {[0; 0.2); [0.2; 0.4); [0.4; 0.6); [0.6; 0.8); [0.8; 1.0]}). The
parameters set for Multi-A were: MEx = 10, CR = 0.2, RV = 0.6.
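To make the setup concrete, a toy version of the Coordination Game might look as follows. The grid layout, start cells and target cells below are illustrative assumptions, not the paper's exact board; only the stated rules (simultaneous moves, R = −0.01 when both try to enter the same cell, R = 1 on reaching one's target) are taken from the text:

```python
class CoordinationGame:
    """Toy 3x3 coordination game: both agents move simultaneously;
    trying to enter the same cell leaves both in place with R = -0.01,
    and reaching one's own target yields R = 1, ending the trial."""
    MOVES = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}  # N, S, E, W

    def __init__(self, starts=((0, 0), (2, 0)), targets=((2, 2), (0, 2))):
        self.pos = list(starts)
        self.targets = targets
        self.done = False

    def step(self, a1, a2):
        nxt = []
        for p, a in zip(self.pos, (a1, a2)):
            dx, dy = self.MOVES[a]
            x, y = p[0] + dx, p[1] + dy
            # Moves off the grid leave the agent where it is.
            nxt.append((x, y) if 0 <= x < 3 and 0 <= y < 3 else p)
        if nxt[0] == nxt[1]:                 # conflict: both stay still
            return (-0.01, -0.01)
        self.pos = nxt
        rewards = tuple(1.0 if self.pos[i] == self.targets[i] else 0.0
                        for i in range(2))
        self.done = any(r == 1.0 for r in rewards)
        return rewards
```

Crossing paths through the centre is where the −0.01 conflicts arise, so coordinated agents learn complementary routes.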
Figure 3 shows the mean reinforcement for two agents under
different architectures: Multi-A and WoLF-PHC. Both learn to
coordinate their actions in order to achieve their own goals.
It is interesting to note that the two learning architectures
managed to play despite using different learning strategies. The
parameters set in Multi-A were MEx = 10, CR = 0.15, RV = 0.6.
Figure 2. The Coordination Game: mean reinforcement of Multi-A and WoLF-PHC in self-play.
Figure 3. The Coordination Game: mean reinforcement of Multi-A and WoLF-PHC playing as colleagues in the same simulation.
B. The Gridworld Game
The difference from the previous game is that the only state
that allows multiple simultaneous occupation is the target,
since both agents have the same state as target. Figure 4
illustrates the game and the initial positions; near them there
is a barrier with a 50% chance of closing. One of the players
must learn to leave the starting position through the free state
numbered 8, and the other must try to go northbound (with
the risk of colliding against the barrier); they must then coordinate
their paths to the target, otherwise they will be stuck trying to go
to the same place (state 8), colliding against each other. Both
algorithms learned to manage the task in 3 steps (the minimum
required) and settled upon the same strategy. As learning
proceeds, the average reinforcement converges to 75%. The
agent which continually leaves the starting position via the
free state always wins, whereas the agent that repeatedly tries
the barrier reaches the target in approximately 50% of the
trials (that is, only when the barrier opens). Figure 5 shows
that, in fact, the average reinforcement for both algorithms
reaches 75%. The parameters for Multi-A were MEx = 10, CR = 0.2 and RV = 0.2.
C. The Gridworld Game - Second Version
A second version of the grid world game was created to
test the algorithms under a non-stationary condition. Now the
barriers have 50% closing probability only in the 1st step
of each trial. During all the remaining steps of the trial,
the barriers will remain opened. For the games previously
Figure 4. The Gridworld game. The barrier is illustrated in red.
Figure 5. The Gridworld game: mean reinforcement of Multi-A and WoLF-PHC in self-play.
Figure 6. The Gridworld Game - Version 2: mean reinforcement of WoLF-PHC in self-play with different exploration rates.
described both algorithms reached very similar performance,
however this game originates different outcomes.
WoLF-PHC has the same behaviour and performance as
in the original game – see Figure 6, WoLF-PHC(2000).
Different exploration rates were set, aiming to detect whether it
could perform differently; the final conclusion was that either
the algorithm behaves the same as in the original game (the
Gridworld Game) or there is no effective learning. WoLF-PHC(2000)
corresponds to an exploration rate decreasing linearly over trials
1 to 2000, starting at 0.5 and reaching 0.0001. Several
other exploration strategies were tested and generated similar
results. WoLF-PHC(−) had a zero exploration rate.
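The WoLF-PHC(2000) schedule described above (linear decay from 0.5 at trial 1 to 0.0001 by trial 2000) can be written, for instance, as:

```python
def exploration_rate(trial, start=0.5, end=0.0001, last_trial=2000):
    """Linearly decreasing epsilon: `start` at trial 1, `end` from
    `last_trial` onward (the WoLF-PHC(2000) schedule)."""
    if trial >= last_trial:
        return end
    frac = (trial - 1) / (last_trial - 1)  # 0 at trial 1, 1 at last_trial
    return start + frac * (end - start)
```

WoLF-PHC(−) corresponds to replacing this schedule with a constant zero.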
The results for Multi-A in the Gridworld game, second
version, are shown in Figure 7. The parameters were CR = 0.15 and RV = 0.2. As explained in the Cognitive Module description, manip-
ulations of a rule are only allowed if the rule fits the current
sensory values, and its suggestion of action is performed for
at least MEx times. With the intention of evaluating the
impact of the set of rules in the decision process, we tested
different values of MEx and obtained different outcomes.
In general, the rules eventually played an exploratory role: the
fact of keeping the set of rules dynamic (rules being reduced,
expanded or deleted) is likely to impact on the selection of
actions. Consequently, the agent might not maintain its action
policy for too long, thus frustrating the expectation of action
selection from another agent and subsequently causing impact
on their best-response policy (when the strategy for solving
the task is opponent-dependent [1]).
The lower the maximum allowed number of rules in the set
of rules, the greater the likelihood that operations are applied
to the set (since rules can be deleted all the time to open space
to new rules), consequently resulting in changes on the action
selection. When there is a high rate of exploration it may be
convenient to keep a small value of MEx so that the set of
rules adapts quickly to learning. However, depending on the
application, at some point it may be appropriate to increase
the value of MEx or even prohibit any change on the set of
rules, so the agent can keep an action policy and ’commit’
itself, in response getting the same from another agent and
enabling the emergence and maintenance of cooperation. In
Figure 7, MA(MEx10) stands for Multi-A with MEx = 10,
whereas MEx(−) indicates no usage of the Cognitive Module
(rules are never created). Different MEx values cause diverse
outcomes, summarized as follows:
1- The only tested MEx that produced consistent outcomes
regarding the game result was MEx = 10. Although re-
ceiving negative reinforcement for colliding with each other,
the two agents learned to always try to go to the
free state in the 1st step. Thus both will be delayed, but able
to get to the target state at the same time. As a result they both
always win the game, and a trial lasts 4 steps instead of
3. Thus, the average reinforcement converges to the
maximum value minus one collision penalty, resulting in 0.99.
All simulations converged to the very same ending (actually,
only two failed), but some of them took longer: of all
simulations, the first converged by trial 1100
and the last by trial 34900. As MEx is small, the set of
rules of both agents can be quite different from simulation to
simulation, producing that variance.
2- MEx = 50, MEx = 100 and MEx(−): all produced
agents that do not know how to deal with collisions. Because they
change their paths during the simulation (owing to perceptual
ambiguity caused by the internal sensory readings and by the
adopted range for rule descriptions), whenever there are collisions
they have difficulties trying to achieve the target state.
3- MEx = 15, MEx = 20, MEx = 25 and MEx = 40.
We observed here three kinds of action policies as outcomes:
first, similar to the ones observed for
MEx = 50, MEx = 100 and MEx(−); second, the same as
Figure 7. The Gridworld Game - Version 2: mean reinforcement of Multi-A with different values of MEx.
for MEx = 10; and a third where both agents first collide
into a wall and then go to the target. For the latter the
final mean reinforcement was maximum since there were no
further negative reinforcements provided by collisions against
a wall – that strategy is a good one only if there is certainty
and cooperative expectation about other agents. MEx = 40 and MEx = 20 were omitted from the graph for better
visualization.
With the aim of ensuring that the Cognitive Module works
together with the Learning Module instead of determining
the selected actions all by itself, another simulation was made
without the activation of the Learning Module. In this case the
agent performance was unsatisfactory.
IV. FINAL REMARKS AND FUTURE WORK
We proposed in this paper Multi-A, a version of a biologically
inspired computational agent designed for multiagent
games. Multi-A was tested in two benchmark coordination
games producing results similar to those of the WoLF-PHC
algorithm but without using complete global localization infor-
mation. In a modified non-stationary version of a coordination
game, Multi-A produced a higher mean reinforcement. Such
a behaviour is especially encouraging because the original idea
was just to adapt the ALEC computational model for single
agents to a general MA context; an ad hoc module designed
to bring up cooperation or to model knowledge about other
agents has not yet been fully implemented.
Issues still unanswered about Multi-A are: a) will the agent
be mixed (cooperative and competitive, acting according
to a certain pattern of reciprocity) or purely cooperative? b)
will it work with agents operating under different learning
algorithms (not just in self-play) in other games? For this
second issue, the results we obtained in the Coordination Game
suggest that Multi-A and WoLF-PHC can coordinate their
action selections. Consequently, depending on the task, Multi-
A is not restricted to self-play; this is an interesting finding
already achieved for the proposed architecture.
Together with answering those questions and improving
Multi-A, our original purpose was to move towards an
artificial agent that cooperates as a result of a biologically plau-
sible computational model of morality. This will be achieved
through the inclusion of the additional module mentioned
above, yet to be devised and implemented in the project.
ACKNOWLEDGMENTS
The authors thank CNPq and FAPESP for the financial
support.
REFERENCES
[1] M. Bowling, M. Veloso, "Multiagent learning using a variable learning rate", Artificial Intelligence, Vol. 136, 2002, pp. 215-250.
[2] V. Conitzer, T. Sandholm, "AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents", Machine Learning, Vol. 67 (1-2), 2007, pp. 23-43.
[3] J. Crandall, Learning Successful Strategies in Repeated General-sum Games, Ph.D. thesis, Brigham Young University, 2005.
[4] J. Crandall, M. Goodrich, "Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning", Machine Learning, Vol. 82 (3), 2011, pp. 281-314.
[5] A. Damásio, Descartes' Error: Emotion, Reason and the Human Brain (O Erro de Descartes: emoção, razão e cérebro humano), Portugal, Fórum da Ciência, Publicações Europa-América, 1995.
[6] A. Damasio, H. Damasio and A. Bechara, "Emotion, decision making and the orbitofrontal cortex", Cerebral Cortex, Oxford University Press, Oxford, Vol. 10 (3), 2000, pp. 295-307.
[7] D. Chakraborty, P. Stone, "Convergence, targeted optimality, and safety in multiagent learning", ICML 2010, pp. 191-198.
[8] A. Greenwald, K. Hall, "Correlated-Q learning", Proceedings of the International Conference on Machine Learning (ICML), 2003.
[9] S. Gadanho, Reinforcement Learning in Autonomous Robots: an Empirical Investigation of the Role of Emotions, PhD Thesis, Edinburgh University, 1999.
[10] S. Gadanho, L. Custódio, "Learning behavior-selection in a multi-goal robot task", Technical Report RT-701-02, Instituto de Sistemas e Robótica, IST, Lisbon, 2002a.
[11] S. Gadanho, L. Custódio, "Asynchronous learning by emotions and cognition", Proceedings of the Seventh International Conference on Simulation of Adaptive Behavior, From Animals to Animats, 2002b.
[12] S. Gadanho, "Learning behavior-selection by emotions and cognition in a multi-goal robot task", Journal of Machine Learning Research, (4), 2003, pp. 385-412.
[13] L. Matignon, G. Laurent and N. Le Fort-Piat, "Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems", The Knowledge Engineering Review, Cambridge University Press, 27 (1), 2012, pp. 1-31.
[14] F. Mondada, E. Franzi and P. Ienne, "Mobile robot miniaturization: A tool for investigation in control algorithms", Yoshikawa and Miyazaki (eds), Experimental Robotics III, Lecture Notes in Control and Information Sciences, London, Springer-Verlag, 1994.
[15] J. Nash, "Equilibrium points in n-person games", Proceedings of the National Academy of Sciences, 36 (1), 1950, pp. 48-49.
[16] M. Osborne, A. Rubinstein, A Course in Game Theory, Cambridge, MA: MIT Press, 1994.
[17] R. Powers, Y. Shoham, "Learning against opponents with bounded memory", Proceedings of IJCAI 2005.
[18] R. Sun, T. Peterson, "Autonomous learning of sequential tasks: experiments and analysis", IEEE Transactions on Neural Networks, Vol. 9 (6), 1998, pp. 1217-1234.
[19] R. Sun, "The CLARION cognitive architecture: extending cognitive modeling to social simulation", Ron Sun (ed.), Cognition and Multi-Agent Interaction, Cambridge University Press, 2006.
[20] R. Sutton, A. Barto, Reinforcement Learning, The MIT Press, 1998.
[21] C. Watkins, Learning from Delayed Rewards, PhD Thesis, Cambridge University, 1989.
[22] C. Watkins, P. Dayan, "Technical note: Q-learning", Machine Learning, (8), 1992, pp. 279-292.
[23] P. Werbos, "Beyond regression: new tools for prediction and analysis in the behavioral sciences", PhD Thesis, Harvard University, 1974.