
Multi-Agent Learning Mini-Tutorial

Gerry Tesauro

IBM T.J. Watson Research Center
http://www.research.ibm.com/infoecon
http://www.research.ibm.com/massdist

Outline

- Statement of the problem
- Tools and concepts from RL & game theory
- "Naïve" approaches to multi-agent learning: ordinary single-agent RL; no-regret learning; fictitious play; evolutionary game theory
- "Sophisticated" approaches: minimax-Q (Littman), Nash-Q (Hu & Wellman); tinkering with learning rates: WoLF (Bowling), multiple-timescale Q-learning (Leslie & Collins); "strategic teaching" (Camerer talk)
- Challenges and opportunities

Normal single-agent learning

Assume the environment has observable states, characterizable expected rewards and state transitions, and that all of the above is stationary (MDP-ish).

- Non-learning, theoretical solution to a fully specified problem: the DP formalism
- Learning: solve by trial and error without a full specification: RL + exploration, Monte Carlo, ...

Multi-Agent Learning

Problem: an agent tries to solve its learning problem while the other agents in the environment are simultaneously trying to solve their own learning problems.

Non-learning, theoretical solution to a fully specified problem: game theory.

Basics of game theory

A game is specified by: players (1…N), actions, and payoff matrices (functions of the joint actions). In a two-player matrix game, rows index A's action, columns index B's action, and each cell holds A's payoff and B's payoff.

If the payoff matrices are identical, the game is cooperative; otherwise it is non-cooperative (zero-sum = purely competitive).

[Slide figure: two 3×3 payoff matrices with rows and columns indexed by S, P, R (0 on the diagonal, 1 off-diagonal), illustrating a Rock-Paper-Scissors-style game.]
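As a minimal sketch (not from the slides), the cooperative vs. zero-sum distinction can be checked mechanically. The Rock-Paper-Scissors payoffs below use the conventional 0 / +1 / -1 values for tie / win / loss, which is an assumption, since the slide's matrices are only partially recoverable:

```python
# Sketch: representing a two-player matrix game as nested lists and
# classifying it. Payoff values are the conventional RPS numbers.
RPS = ["R", "P", "S"]

# A's payoffs: 0 for a tie, +1 for a win, -1 for a loss.
payoff_A = [[0, -1, 1],
            [1, 0, -1],
            [-1, 1, 0]]
# Zero-sum: B's payoff is the negation of A's.
payoff_B = [[-a for a in row] for row in payoff_A]

def classify(u_A, u_B):
    """Classify a bimatrix game as cooperative, zero-sum, or general-sum."""
    if u_A == u_B:
        return "cooperative (identical payoffs)"
    if all(a + b == 0 for ra, rb in zip(u_A, u_B) for a, b in zip(ra, rb)):
        return "zero-sum (purely competitive)"
    return "general-sum"

print(classify(payoff_A, payoff_B))  # zero-sum (purely competitive)
```

If both players were handed `payoff_A`, the same check would report the cooperative case.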

Basic lingo… (2)

- Games with no states: (bi)matrix games
- Games with states: stochastic games / Markov games (state transitions are functions of the joint actions)
- Games with simultaneous moves: normal form
- Games with alternating turns: extensive form
- Number of rounds = 1: one-shot game
- Number of rounds > 1: repeated game
- Deterministic action choice: pure strategy
- Non-deterministic action choice: mixed strategy

Basic Analysis

- A joint strategy x is Pareto-optimal if no x' exists that improves everybody's payoffs.
- An agent's strategy xi is a dominant strategy if it is best regardless of the others' actions.
- xi is a best response to the others' x-i if it maximizes the agent's payoff given x-i.
- A joint strategy x is an equilibrium if each agent's strategy is simultaneously a best response to everyone else's strategy, i.e. no one has an incentive to deviate (Nash, correlated).
- A Nash equilibrium always exists, but there may be exponentially many of them, and they are not easy to compute.
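The best-response and equilibrium definitions can be made concrete with a brute-force check over pure strategies. The sketch below uses the standard textbook Prisoner's Dilemma payoffs, which are an illustrative choice, not taken from the slides:

```python
# Sketch: brute-force best responses and pure-strategy Nash check for a
# bimatrix game (rows = player 1's actions, columns = player 2's actions).
u1 = [[3, 0],   # action 0 = Cooperate, action 1 = Defect
      [5, 1]]
u2 = [[3, 5],
      [0, 1]]

def best_responses_1(j):
    """Player 1's best responses to player 2 playing column j."""
    col = [u1[i][j] for i in range(len(u1))]
    m = max(col)
    return [i for i, v in enumerate(col) if v == m]

def best_responses_2(i):
    """Player 2's best responses to player 1 playing row i."""
    row = u2[i]
    m = max(row)
    return [j for j, v in enumerate(row) if v == m]

def pure_nash():
    """Joint actions where each action is a best response to the other."""
    return [(i, j)
            for i in range(len(u1)) for j in range(len(u1[0]))
            if i in best_responses_1(j) and j in best_responses_2(i)]

print(pure_nash())  # [(1, 1)] -- mutual defection
```

Here Defect is a dominant strategy for both players, so (Defect, Defect) is the unique equilibrium even though (Cooperate, Cooperate) Pareto-dominates it.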

What about imperfect information games?

Nash equilibrium requires knowledge of all payoffs. For imperfect-information games, the corresponding concept is Bayes-Nash equilibrium (Nash plus Bayesian inference over the hidden information). It is even more intractable than regular Nash.

Can we make game theory more tractable?

An active area of research:
- Symmetric games: payoffs are invariant under swapping of the player labels. One can look for symmetric equilibria, where all agents play the same mixed strategy.
- Network games: an agent's payoff depends only on interactions with a small number of neighbors.
- Summarization games: payoffs are simple summarization functions of the population's joint actions (e.g. voting).

Summary: pros and cons of game theory

Game theory provides a nice conceptual/theoretical framework for thinking about multi-agent learning. It is appropriate provided that:
- the game is stationary and fully specified;
- there is enough computer power to compute an equilibrium;
- the other agents can be assumed to be game theorists too;
- the equilibrium coordination problem can be solved.

These conditions rarely hold in real applications, so multi-agent learning is not only a fascinating problem, it may be the only viable option.

Naïve Approaches to Multi-Agent Learning

Basic idea: the agent adapts while ignoring the non-stationarity of the other agents' strategies.

1. Fictitious play: the agent observes the time-average frequency of the other players' action choices and models

    prob(action k) = (# times k observed) / (total # observations)

The agent then plays a best response to this model.

Variants of fictitious play: exponential recency weighting, "smoothed" best response (~softmax), small adjustment toward the best response, ...

What if all agents use fictitious play?

- Strict Nash equilibria are absorbing points for fictitious play.
- The typical result is limit-cycle behavior of the strategies, with the period increasing as the number of rounds grows.
- In certain cases, the product of the empirical distributions converges to Nash even though actual play cycles (the Matching Pennies example).
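A minimal simulation of mutual fictitious play in Matching Pennies illustrates this last point: actual play cycles, yet each player's empirical action frequency approaches the mixed Nash equilibrium (1/2, 1/2). The initial pseudo-counts and round count below are illustrative choices:

```python
import random

# Sketch: two fictitious-play agents in Matching Pennies.
# Player 1 (the matcher) wins when actions match; player 2 wins otherwise.
random.seed(0)
u1 = [[1, -1], [-1, 1]]   # matcher's payoffs
u2 = [[-1, 1], [1, -1]]   # mismatcher's payoffs

def best_response(u, opp_counts):
    # Expected payoff of each own action against the empirical model,
    # with ties broken at random.
    vals = [sum(u[a][b] * opp_counts[b] for b in range(2)) for a in range(2)]
    m = max(vals)
    return random.choice([a for a, v in enumerate(vals) if v == m])

counts1 = [1, 1]  # player 2's empirical model of player 1 (smoothed start)
counts2 = [1, 1]  # player 1's empirical model of player 2
for _ in range(20000):
    a1 = best_response(u1, counts2)
    a2 = best_response(u2, counts1)
    counts1[a1] += 1
    counts2[a2] += 1

freq1 = counts1[0] / sum(counts1)
print(round(freq1, 2))  # close to 0.5, the equilibrium frequency
```

Inspecting the raw action sequence would show the cycling: long runs of the same action, with run lengths growing over time.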

More Naïve Approaches…

2. Evolutionary game theory: "replicator dynamics" models a large population of agents using different strategies, in which the fittest agents breed more copies. Let x = the population strategy vector and xk = the fraction of the population playing strategy k. The growth rate is then

    dxk/dt = xk [ u(ek, x) - u(x, x) ]

where ek denotes pure strategy k and u(·,·) is the expected payoff of one strategy against another. The same equation can also be derived from an "imitation" model. Nash equilibria are fixed points of this equation, but not necessarily attractors (they may be unstable or only neutrally stable).
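The fixed-point (but not attractor) behavior can be seen by integrating the replicator equation numerically. The sketch below uses Rock-Paper-Scissors with conventional 0 / +1 / -1 payoffs and simple Euler steps, both illustrative choices:

```python
# Sketch: Euler integration of the replicator equation
#   dx_k/dt = x_k [ u(e_k, x) - u(x, x) ]
# for Rock-Paper-Scissors. The uniform mixture (1/3, 1/3, 1/3) is the
# Nash equilibrium and a fixed point, but it is only neutrally stable:
# nearby trajectories orbit it rather than converging to it.
U = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def step(x, dt=0.01):
    # Fitness of each pure strategy against the population mixture x.
    fitness = [sum(U[k][j] * x[j] for j in range(3)) for k in range(3)]
    avg = sum(x[k] * fitness[k] for k in range(3))
    return [x[k] + dt * x[k] * (fitness[k] - avg) for k in range(3)]

x = [1 / 3, 1 / 3, 1 / 3]   # start exactly at the NE: nothing moves
for _ in range(1000):
    x = step(x)
print([round(v, 3) for v in x])  # [0.333, 0.333, 0.333]
```

Starting the same integration slightly off the uniform mixture instead produces a closed orbit around it, which is the "fixed point but not attractor" behavior described above.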

Many possible dynamic behaviors: limit cycles, attractors, unstable fixed points; also saddle points, chaotic orbits, ...

[Slide figures: phase portraits of these behaviors, and replicator dynamics applied to auction bidding strategies.]

More Naïve Approaches…

3. Iterated gradient ascent (Singh, Kearns and Mansour): again a myopic adaptation to the other players' current strategies. Each player i adjusts its mixed strategy along the gradient of its own payoff:

    dxi/dt = ∂ui(xi, x-i) / ∂xi

Since u is linear in xi and x-i, this gives a coupled system of linear equations. The analysis for two-player, two-action games shows that play either converges to a Nash fixed point on the boundary (at least one pure strategy) or enters a limit cycle.
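The limit-cycle case can be sketched in Matching Pennies, whose only Nash equilibrium is interior. Parameterizing the mixed strategies by p = Pr[player 1 plays Heads] and q = Pr[player 2 plays Heads], player 1's expected payoff works out to u1(p, q) = 4pq - 2p - 2q + 1 (and u2 = -u1), so the gradient dynamics are dp/dt = 4q - 2 and dq/dt = 2 - 4p. The step size and starting point below are illustrative:

```python
import math

# Sketch: iterated gradient ascent in Matching Pennies. The joint
# strategy orbits the mixed Nash point (1/2, 1/2) instead of
# converging to it.
def clip(v):
    # Keep probabilities in [0, 1].
    return max(0.0, min(1.0, v))

p, q = 0.8, 0.5
for _ in range(2000):
    dp, dq = 4 * q - 2, 2 - 4 * p   # payoff gradients for each player
    p, q = clip(p + 0.01 * dp), clip(q + 0.01 * dq)

# Distance from the equilibrium stays bounded away from zero: a cycle,
# not convergence.
final_dist = math.hypot(p - 0.5, q - 0.5)
print(round(final_dist, 3))
```

Plotting the (p, q) trajectory would show the orbit around (1/2, 1/2) directly.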

Further Naïve Approaches…

4. Dumb single-agent learning: use a single-agent algorithm in a multi-agent problem and hope that it works.

- No-regret learning by pricebots (Greenwald & Kephart)
- Simultaneous Q-learning by pricebots (Tesauro & Kephart)

In many cases this actually works: the learners converge either exactly or approximately to self-consistent optimal strategies.
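As a toy illustration of the "hope that it works" approach (a sketch under assumed parameters, not the pricebot setup from the slides), two independent epsilon-greedy Q-learners can play a repeated 2x2 coordination game, each treating the other as part of a stationary environment:

```python
import random

# Sketch: two independent epsilon-greedy Q-learners in a repeated
# coordination game (both receive payoff 1 if actions match, else 0).
# Each agent ignores the other's non-stationarity, yet in practice the
# pair usually locks in to one of the coordinated equilibria.
random.seed(0)
ALPHA, EPS = 0.1, 0.1           # learning rate and exploration rate
Q1, Q2 = [0.0, 0.0], [0.0, 0.0]  # one Q-value per action (stateless game)

def choose(Q):
    if random.random() < EPS:
        return random.randrange(2)          # explore
    return 0 if Q[0] >= Q[1] else 1         # exploit

for _ in range(5000):
    a1, a2 = choose(Q1), choose(Q2)
    r = 1.0 if a1 == a2 else 0.0            # common payoff
    Q1[a1] += ALPHA * (r - Q1[a1])
    Q2[a2] += ALPHA * (r - Q2[a2])

coordinated = Q1.index(max(Q1)) == Q2.index(max(Q2))
print(coordinated, [round(v, 2) for v in Q1], [round(v, 2) for v in Q2])
```

Once the agents happen to coordinate, the shared reward reinforces the same action for both, so the joint greedy policy becomes self-consistent, which is the flavor of result the slide describes.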

“Sophisticated” approaches

These take into account the possibility that the other agents' strategies might change.

5. Multi-agent Q-learning:
- Minimax-Q (Littman): a convergent algorithm for two-player zero-sum stochastic games.
- Nash-Q (Hu & Wellman): a convergent algorithm for two-player general-sum stochastic games; requires the use of a Nash equilibrium solver.

More sophisticated approaches...

6. Varying learning rates:
- WoLF, "Win or Learn Fast" (Bowling): the agent reduces its learning rate when performing well and increases it when doing badly. This improves the convergence of IGA and of policy hill-climbing.
- Multi-timescale Q-learning (Leslie): different agents use different power laws t^-n for learning-rate decay; this achieves simultaneous convergence where ordinary Q-learning doesn't.
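The WoLF idea can be sketched on top of the Matching Pennies gradient dynamics from earlier (u1 = 4pq - 2p - 2q + 1, u2 = -u1). The winning test used here, comparing the current strategy's payoff against what the equilibrium strategy (1/2) would earn, and the particular step sizes are assumptions for illustration:

```python
import math

# Sketch: "Win or Learn Fast" gradient ascent in Matching Pennies.
# Each player takes small steps when winning and larger steps when
# losing. Plain gradient ascent orbits the Nash point (1/2, 1/2);
# the WoLF step sizes make the orbit spiral inward instead.
ETA_WIN, ETA_LOSE = 0.005, 0.02
p, q = 0.8, 0.5                      # initial mixed strategies

def clip(v):
    return max(0.0, min(1.0, v))

for _ in range(20000):
    grad_p = 4 * q - 2               # d u1 / d p
    grad_q = 2 - 4 * p               # d u2 / d q
    # "Winning" = current strategy beats the equilibrium strategy (0.5)
    # against the opponent's current play; the payoff gap reduces to
    # (p - 0.5) * grad_p for player 1, and similarly for player 2.
    eta1 = ETA_WIN if (p - 0.5) * grad_p > 0 else ETA_LOSE
    eta2 = ETA_WIN if (q - 0.5) * grad_q > 0 else ETA_LOSE
    p, q = clip(p + eta1 * grad_p), clip(q + eta2 * grad_q)

print(round(math.hypot(p - 0.5, q - 0.5), 3))  # small: spiraled in to the NE
```

With equal step sizes the same loop reproduces the limit cycle; only the win/lose asymmetry produces convergence, which is the point of the WoLF modification.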

More sophisticated approaches...

7. "Strategic teaching": recognizes that the other players' strategies are adaptive.

"A strategic teacher may play a strategy which is not myopically optimal (such as cooperating in Prisoner's Dilemma) in the hope that it induces adaptive players to expect that strategy in the future, which triggers a best-response that benefits the teacher." (Camerer, Ho and Chong)

Theoretical Research Challenges

What is the proper theoretical formulation?

- "No short-cut" hypothesis: massive online search a la Deep Blue to maximize expected long-term reward.
- (Bayesian) model and predict the behavior of the other players, including how they learn based on my actions (beware of infinite model recursion).
- Trial-and-error exploration vs. continual Bayesian inference, using all evidence, over all uncertainties (Boutilier: Bayesian exploration).
- When can you get away with simpler methods?

Real-World Opportunities Multi-agent systems where you can’t do

game theory (covers everything :-)) Electronic marketplaces (Kephart) Mobile networks (Chang) Self-managing computer systems (Kephart) Teams of robots (Bowling, Stone) Video games Military/counter-terrorism applications