
NO-REGRET LEARNING IN GAMES

Panayotis Mertikopoulos¹

¹French National Center for Scientific Research (CNRS)

Laboratoire d’Informatique de Grenoble (LIG)

NPCG ’19 – Paris, April 16, 2019


Outline

Background

N-player games

No-regret learning in games

Learning with limited feedback


The Kelly auction

Proportionally fair allocation of resources to different clients [Kelly, 1998]:

[Diagram: clients 1, 2, 3 each submit a bid x_i to a common pool of resources; client i receives the share x_i/(x₁ + x₂ + x₃) of the resources.]

Resources could be processor cores, bandwidth, or even anonymous web traffic

[Variant: advertisers 1, 2, 3 bid for website traffic; advertiser i bids x_i and receives the share x_i/(x₁ + x₂ + x₃) of the impressions.]


Online decision processes

Agents called to take repeated decisions with minimal information:

repeat at each epoch t = 1, 2, …
    choose action X_t
    get payoff u_t(X_t)
until end

Main question: How to choose a “good” action at each epoch?

▸ Uncertain world: no beliefs, feedback, knowledge of future, etc.

▸ Obliviousness: are payoffs affected by the agent’s previous actions?

▸ Optimality: what is “optimal” in this setting?


Regret minimization

Performance often quantified by the agent’s regret

Reg(T) = max_{x∈X} ∑_{t=1}^{T} [u_t(x) − u_t(X_t)]

No regret: Reg(T) = o(T)
"The sequence of chosen actions is as good as the best fixed action in hindsight."

Prolific literature:
▸ Economics [Hannan, Blackwell, Hart & Mas-Colell, …]
▸ Machine learning & computer science [Littlestone & Warmuth, Vovk, …]
▸ Online learning & optimization [Cesa-Bianchi & Lugosi, Zinkevich, …]
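To make the definition concrete, here is a minimal Python sketch (the payoff table and play sequence are hypothetical) that computes Reg(T) against the best fixed action in hindsight:

```python
import numpy as np

def regret(payoffs, plays):
    """Reg(T) = max_x sum_t u_t(x) - sum_t u_t(X_t).

    payoffs: (T, K) array with payoffs[t, x] = u_t(x) for each of K actions
    plays:   length-T sequence of chosen actions X_t
    """
    T = payoffs.shape[0]
    realized = payoffs[np.arange(T), plays].sum()   # sum_t u_t(X_t)
    best_fixed = payoffs.sum(axis=0).max()          # max_x sum_t u_t(x)
    return best_fixed - realized

# toy example: two actions, three rounds
u = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(regret(u, [1, 1, 0]))  # regret = 2 - 0 = 2
```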


Multi-agent learning

▸ Multiple agents, individual objectives

Example: place a bid in a repeated auction

▸ Payoffs determined by actions of all agents

Example: outcome of auction revealed

▸ Agents receive payoffs, adjust actions, and the process repeats

Example: change bid if unsatisfied


No-regret and equilibrium

The golden rule:

No-regret learning leads to equilibrium∗

∗If it's OK to:

✗ Assign positive weight only to strictly dominated strategies [Viossat & Zapechelnyuk, 2013]

✗ Be arbitrarily far from equilibrium infinitely often [too many to list]

✗ ⋯


When does no-regret learning converge to Nash equilibrium?


N-player games

The game
▸ Finite set of players i ∈ N = {1, …, N}
▸ Each player selects an action x_i from a compact convex set X_i
▸ Reward of player i determined by payoff function u_i : X ≡ ∏_i X_i → ℝ

Examples
▸ Finite games (mixed extensions)
▸ Power control/allocation problems
▸ Traffic routing
▸ Generative adversarial networks (two-player zero-sum games)
▸ Divisible good auctions (Kelly, …)
▸ Cournot oligopolies
▸ ⋯


Kelly auctions

The Kelly auction as an N-player game:

▸ Players: i = 1, …, N [bidders]

▸ Resources: s ∈ S = {1, …, S} [websites]

▸ Action spaces: X_i = {x_i ∈ ℝ^S₊ : ∑_s x_is ≤ b_i} [b_i: budget of the i-th bidder]

▸ Resource allocation ratio:

ρ_is(x) = q_s x_is / (c_is + ∑_{j∈N} x_js) [c_is: entry barrier]

▸ Payoff functions:

u_i(x) = ∑_{s∈S} [g_i ρ_is(x) − x_is] [utility from resources minus cost of bids]
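As a quick illustration, here is a minimal Python sketch of the allocation ratio and payoffs above; the numeric instance (bids, quantities q_s, barriers c_is, marginal utilities g_i) is made up for the example:

```python
import numpy as np

def allocation(x, q, c):
    """rho[i, s] = q_s * x[i, s] / (c[i, s] + sum_j x[j, s])."""
    totals = x.sum(axis=0)            # sum_j x_js, one total per resource
    return q * x / (c + totals)       # broadcasts over players

def payoff(x, q, c, g):
    """u_i(x) = sum_s [g_i * rho_is(x) - x_is]."""
    rho = allocation(x, q, c)
    return (g[:, None] * rho - x).sum(axis=1)

# toy instance: 3 bidders, 2 resources (all parameters illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(3, 2))   # bids
q = np.array([1.0, 2.0])                 # resource quantities
c = np.full((3, 2), 0.1)                 # entry barriers
g = np.array([2.0, 3.0, 4.0])            # marginal utilities
print(payoff(x, q, c, g))
```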


Nash equilibrium

Nash equilibrium
Action profile x∗ = (x∗_1, …, x∗_N) ∈ X that is unilaterally stable:

u_i(x∗_i; x∗_{−i}) ≥ u_i(x_i; x∗_{−i}) for every player i ∈ N and every deviation x_i ∈ X_i

Individual payoff gradients

V_i(x) = ∇_{x_i} u_i(x_i; x_{−i})

Interpretation: direction of individually steepest payoff ascent

Variational characterization
If x∗ is a Nash equilibrium, then

⟨V_i(x∗), x_i − x∗_i⟩ ≤ 0 for all i ∈ N, x_i ∈ X_i

Intuition: u_i(x_i; x∗_{−i}) is decreasing along all rays emanating from x∗_i


Geometric interpretation

[Figure: the feasible region X, with the tangent cone TC(x∗) and normal cone NC(x∗) at a point x∗; the payoff gradient V(x∗) points outward, into NC(x∗).]

At Nash equilibrium, individual payoff gradients are outward-pointing


Monotonicity

A key assumption for games is monotonicity:

⟨V(x′) − V(x), x′ − x⟩ ≤ 0 for all x, x′ ∈ X (MC)

Equivalently: H(x) ≼ 0, where H is the game's Hessian matrix:

H_ij(x) = ∇_{x_j}∇_{x_i} u_i(x) + (∇_{x_i}∇_{x_j} u_j(x))⊺

Interpretation: concavity for games

Examples: Kelly auctions, Cournot oligopolies, routing, power control, …

Close relatives:
▸ Stable games [Hofbauer & Sandholm, 2009]
▸ Contractive games [Sandholm, 2015]
▸ Dissipative games [Sorin & Wan, 2016]
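To make (MC) concrete, here is a minimal sketch that numerically probes monotonicity by sampling pairs of profiles; the quadratic two-player game used here (u₁ = −x₁² + x₁x₂, u₂ = −x₂² − x₁x₂) is an illustrative choice, not one from the slides:

```python
import numpy as np

def V(x):
    # payoff field of the toy game: V = (du1/dx1, du2/dx2)
    x1, x2 = x
    return np.array([-2 * x1 + x2, -2 * x2 - x1])

def probe_monotone(V, dim=2, trials=10_000, seed=0):
    """Check <V(x') - V(x), x' - x> <= 0 on random pairs (necessary, not sufficient)."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, xp = rng.uniform(-1, 1, dim), rng.uniform(-1, 1, dim)
        if np.dot(V(xp) - V(x), xp - x) > 1e-12:
            return False
    return True

print(probe_monotone(V))  # True: the toy game passes the sampled test
```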


Theorem (Rosen, 1965)
If a game is strictly monotone, it admits a unique Nash equilibrium.

[+ extensions to {…}-monotone games, generalized equilibrium problems,…]


How to achieve no regret?

Take a gradient step and project: [Zinkevich, ICML 2003]

X_{t+1} = Π(X_t + γ_t ∇u_t(X_t)) (OGD)

[Figure: two steps of (OGD) on the feasible region X: from the current point, take a gradient step γ∇u(X) and project back onto X with Π.]

Reg(T) = O(T^{1/2}) for suitable γ_t; optimal in T [Abernethy et al, 2008]

…but what about convergence?
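For concreteness, here is a minimal sketch of (OGD) with Euclidean projection; the time-varying objective and the box constraint are illustrative assumptions:

```python
import numpy as np

def ogd(grad, project, x0, T, gamma=lambda t: 1 / np.sqrt(t)):
    """Online gradient ascent: X_{t+1} = Pi(X_t + gamma_t * grad_t(X_t))."""
    x, xs = np.asarray(x0, float), []
    for t in range(1, T + 1):
        xs.append(x.copy())
        x = project(x + gamma(t) * grad(t, x))
    return np.array(xs)

# illustrative instance: maximize u_t(x) = -||x - a_t||^2 over the box [0, 1]^2
targets = np.array([[0.2, 0.8], [0.8, 0.2]])
grad = lambda t, x: -2 * (x - targets[t % 2])      # gradient of u_t
project = lambda x: np.clip(x, 0.0, 1.0)           # projection onto the box
print(ogd(grad, project, [0.5, 0.5], T=1000)[-1])  # hovers near (0.5, 0.5)
```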


A dynamical systems viewpoint

Vector flow of V (simplest case: no constraints, smooth, etc.):

dX_i/dt = −V_i(X(t)) (GD)

Energy function:

E(x) = ½‖x − x∗‖²

Lyapunov property:

dE/dt = −⟨V(X(t)), X(t) − x∗⟩ ≤ 0

Distance to solutions is (weakly) decreasing along trajectories of (GD)


Cycles

Roadblock: the energy might be a constant of motion [Hofbauer et al, 2009]

[Figure: Hamiltonian flow of f(x₁, x₂) = x₁x₂: closed orbits cycling around the interior equilibrium.]
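The cycling is easy to reproduce: for the bilinear game u₁ = x₁x₂ = −u₂, the continuous dynamics are a rigid rotation about the equilibrium, so the energy is exactly conserved. A minimal sketch using the closed-form orbit:

```python
import numpy as np

# continuous-time dynamics for u1 = x1*x2 = -u2:
# dx1/dt = x2 (ascent for player 1), dx2/dt = -x1 (ascent for player 2)
def flow(x0, times):
    """Exact solution: the field is a rotation about the equilibrium (0, 0)."""
    x1, x2 = x0
    c, s = np.cos(times), np.sin(times)
    return np.stack([c * x1 + s * x2, -s * x1 + c * x2], axis=1)

traj = flow([1.0, 0.0], np.linspace(0, 20, 2001))
print(np.ptp(np.linalg.norm(traj, axis=1)))  # ~0: energy is a constant of motion
```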


Poincaré recurrence

Cycles are an example of recurrence:

Definition (Poincaré, 1890s)
A dynamical system is Poincaré recurrent if almost all solution trajectories return arbitrarily close to their starting point infinitely many times.

Theorem (M, Papadimitriou, Piliouras, SODA 2018; bare-bones version)
(GD) is recurrent in all bilinear saddle-point problems with an interior solution.


OGD in games

OGD as a forward (Euler) scheme:

X⁺ = X − γV(X)


Energy no longer a constant:

‖X⁺ − x∗‖² = ‖X − x∗‖² − 2γ⟨V(X), X − x∗⟩ + γ²‖V(X)‖²

where the middle term is the decrease inherited from (GD) and the last term is the discretization error. In a bilinear game the (GD) term vanishes, so the energy grows at every step:

…even worse


[Figure: the OGD iterates on the same bilinear problem spiral outward from the equilibrium.]
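The outward spiral is easy to verify numerically: for the same bilinear toy game, each explicit Euler step multiplies the squared distance to the equilibrium by exactly 1 + γ². A minimal sketch:

```python
import numpy as np

def euler(x0, gamma=0.1, T=200):
    """Explicit Euler on the cycling field (x2, -x1): energy grows each step."""
    x, norms = np.asarray(x0, float), []
    for _ in range(T):
        norms.append(np.linalg.norm(x))
        x = x + gamma * np.array([x[1], -x[0]])  # one Euler step of the cycle dynamics
        # each step multiplies ||x||^2 by exactly 1 + gamma^2
    return norms

n = euler([1.0, 0.0])
print(n[0], n[-1])  # the iterates spiral outward
```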


Time averages: a very different story

No-regret captures behavior of time-averaged process:

X̄_t = (1/t) ∑_{s=1}^{t} X_s

[Figure: the time-averaged trajectory X̄_t converges to the interior equilibrium of the bilinear game, even though X_t itself cycles.]
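Continuing the toy example: sampling the cycling orbit and averaging shows the trajectory staying on the circle while its running average converges to the interior equilibrium (0, 0):

```python
import numpy as np

# sample the cycling (GD) orbit of the bilinear toy game and average it
t = np.arange(1, 20001) * 0.01
orbit = np.stack([np.cos(t), -np.sin(t)], axis=1)          # X(t) stays on the unit circle
avg = np.cumsum(orbit, axis=0) / np.arange(1, 20001)[:, None]
print(np.linalg.norm(orbit[-1]), np.linalg.norm(avg[-1]))  # ~1.0 vs ~0.0
```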


Convergence to equilibrium

Behavior is different under strict monotonicity:

‖X_{t+1} − x∗‖² = ‖X_t − x∗‖² − 2γ_t⟨V(X_t), X_t − x∗⟩ + γ_t²‖V(X_t)‖²

where the drift term −2γ_t⟨V(X_t), X_t − x∗⟩ is < 0 whenever X_t is not Nash, and γ_t²‖V(X_t)‖² is the discretization error.

Can the drift overcome the discretization error?

Theorem (M & Zhou, MathProg 2019)
▸ Assume: the game is strictly monotone, ∑_t γ_t = ∞, ∑_t γ_t² < ∞
▸ Then: X_t converges to a Nash equilibrium from any initial condition

In strictly monotone games, no-regret ↝ Nash equilibrium
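A minimal sketch of the theorem's conclusion on an illustrative strictly monotone game (the quadratic toy game from the monotonicity example), using the Robbins-Monro schedule γ_t = 1/t:

```python
import numpy as np

def V(x):  # payoff field of the toy monotone game (equilibrium at the origin)
    return np.array([-2 * x[0] + x[1], -2 * x[1] - x[0]])

x = np.array([1.0, 1.0])
for t in range(1, 100001):
    x = x + (1.0 / t) * V(x)   # sum gamma_t = inf, sum gamma_t^2 < inf
print(np.linalg.norm(x))       # -> 0: converges to the Nash equilibrium
```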


Feedback

(OGD) requires gradient information, which may be difficult to come by:
▸ Other players' actions unknown
▸ Measurement errors
▸ Stochastic utilities (realized vs. expected gradients)
▸ ⋯

Imperfect gradient feedback:

V̂_t = V(x_t) + U_t

with the following hypotheses:

[H1] Zero-mean error: E[U_t | F_{t−1}] = 0 [⟹ E[V̂_t | F_{t−1}] = V(x_t)]
[H2] Finite mean squared error: E[‖U_t‖²_* | F_{t−1}] ≤ σ² [⟹ E[‖V̂_t‖²_* | F_{t−1}] ≤ V²]


Learning with imperfect gradients

Algorithm 1: Stochastic gradient descent
Require: step-size sequence γ_t > 0
1: choose X ∈ X  # initialization
2: for t = 1, 2, … do
3:   oracle query at state X returns V̂  # gradient feedback
4:   set X ← Π(X + γ_t V̂)  # new state
5: end for
6: return X

Guarantees:

▸ E[Reg(T)] = O(√T) [folk]

▸ Strict monotonicity ⟹ X_t converges to Nash (a.s.) [M & Zhou, 2019]
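A minimal runnable rendering of Algorithm 1, assuming the toy payoff field from earlier and a zero-mean Gaussian oracle noise satisfying (H1)-(H2); it illustrates the second guarantee:

```python
import numpy as np

def sgd(V_oracle, project, x0, T, gamma=lambda t: 1.0 / t):
    x = np.asarray(x0, float)
    for t in range(1, T + 1):
        v_hat = V_oracle(x)                 # noisy gradient feedback (H1, H2)
        x = project(x + gamma(t) * v_hat)   # X <- Pi(X + gamma_t * V_hat)
    return x

rng = np.random.default_rng(1)
V = lambda x: np.array([-2 * x[0] + x[1], -2 * x[1] - x[0]])
V_oracle = lambda x: V(x) + rng.normal(0, 0.5, 2)    # zero-mean, bounded variance
project = lambda x: np.clip(x, -2.0, 2.0)
print(sgd(V_oracle, project, [1.5, -1.5], T=50000))  # close to x* = (0, 0)
```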


No gradient feedback whatsoever

In many cases, even stochastic gradients are out of reach:
▸ Multi-armed bandits (clinical trials, …)
▸ Other players' actions unknown (auctions, …)
▸ ⋯

Possible fixes:
▸ Two-time-scale approach: fast samples, slow updates [can be slow!]
▸ Multiple-point estimates [needs synchronization!]
▸ Simultaneous perturbation stochastic approximation [Spall, 1997]


Simultaneous perturbation stochastic approximation

Estimate u′(x) at a target point x ∈ ℝ:

u′(x) ≈ [u(x + δ) − u(x − δ)] / (2δ)

Pick z = ±1 with probability 1/2. Then:

E[u(x + δz) z] = ½ u(x + δ) − ½ u(x − δ)

⟹ Estimate u′(x) up to O(δ) by sampling u at X̂ = x + δz and looking at (1/δ) u(X̂) z

Algorithm 2: Single-point estimator of ∇u at X
1: Draw z uniformly from S^d
2: Play X̂ = X + δz
3: Get û = u(X̂)
4: Set V̂ = (d/δ) û z
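A minimal sketch of the single-point estimator; the smooth payoff u(x) = −‖x‖² is an illustrative choice for which the averaged estimate can be checked against the true gradient −2x:

```python
import numpy as np

def spsa_estimate(u, x, delta, rng):
    """One-point gradient estimate: V_hat = (d / delta) * u(x + delta * z) * z."""
    d = len(x)
    z = rng.normal(size=d)
    z /= np.linalg.norm(z)          # z uniform on the unit sphere
    return (d / delta) * u(x + delta * z) * z

u = lambda x: -np.sum(x**2)         # toy payoff, true gradient -2x
x, rng = np.array([0.3, -0.2]), np.random.default_rng(2)
est = np.mean([spsa_estimate(u, x, 0.05, rng) for _ in range(100000)], axis=0)
print(est, -2 * x)                  # the average is close to the true gradient
```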


Learning with bandit feedback

[Figure: one round of bandit feedback on the feasible region X: from the pivot X, play the query point X̂ = X + δz, take the step γ(d/δ)ûz, and project back with Π to obtain the next pivot.]


Bandit gradient descent

Algorithm 3: Multi-agent gradient ascent with bandit feedback

Require: step-size γ_t > 0, query radius δ_t > 0, safety ball B_r(p) ⊆ X
1: choose X ∈ X  # initialization
2: repeat at each stage t = 1, 2, …
3:   draw Z uniformly from S^d  # perturbation direction
4:   set W ← Z − r⁻¹(X − p)  # feasibility adjustment
5:   play X̂ ← X + δ_t W  # choose action
6:   receive û ← u(X̂)  # get payoff
7:   set V̂ ← (d/δ_t) û ⋅ Z  # estimate gradient
8:   update X ← Π(X + γ_t V̂)  # update pivot
9: until end
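A minimal single-player sketch of Algorithm 3, assuming the illustrative concave payoff u(x) = −‖x‖² over the ball B₂(0) (so p = 0, r = 2); the step and query schedules anticipate the convergence analysis below:

```python
import numpy as np

def bandit_ascent(u, p, r, x0, T, rng):
    """Single-player sketch of Algorithm 3 with safety ball B_r(p)."""
    d, x = len(x0), np.asarray(x0, float)
    def project(y):
        norm = np.linalg.norm(y - p)
        return y if norm <= r else p + (y - p) * (r / norm)
    for t in range(1, T + 1):
        gamma, delta = 0.5 / t, 0.5 / t ** (1 / 3)  # illustrative schedules
        z = rng.normal(size=d)
        z /= np.linalg.norm(z)                      # perturbation direction
        w = z - (x - p) / r                         # feasibility adjustment
        x_hat = x + delta * w                       # played action
        v_hat = (d / delta) * u(x_hat) * z          # one-point gradient estimate
        x = project(x + gamma * v_hat)              # update pivot
    return x

rng = np.random.default_rng(3)
u = lambda x: -np.sum(x ** 2)                       # toy concave payoff, x* = 0
print(bandit_ascent(u, np.zeros(2), 2.0, np.array([1.0, 1.0]), 50000, rng))
```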


Challenges

Key difficulty:
▸ One-point estimates may be biased (no more than O(δ) accuracy)
▸ Bias can be eliminated by taking δ_t → 0, but then the variance explodes: E[‖V̂_t‖²] = O(1/δ_t²)
▸ Stochastic approximation analysis requires bounded variance
▸ Bias-variance dilemma: accuracy vs. stability?


Convergence analysis

Must balance the step-size γ_t against the query radius δ_t:

▸ lim_{t→∞} γ_t = lim_{t→∞} δ_t = 0  # vanishing noise and bias
▸ ∑_{t=1}^∞ γ_t = ∞  # the process doesn't stop
▸ ∑_{t=1}^∞ γ_t²/δ_t² < ∞  # variance control
▸ lim_{t→∞} γ_t δ_t = 0  # bias control

Theorem (Bravo, Leslie & M, NIPS 2018)
1. Under strict monotonicity, X_t converges to a Nash equilibrium with probability 1.
2. Under strong monotonicity (H(x) ≺ −βI) and γ_t ∝ 1/t, δ_t ∝ 1/t^{1/3}, we have:

E[‖X_t − x∗‖²] = O(1/t^{1/3})
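As a sanity check, the schedules in part 2 satisfy the balance conditions above (quick verification):

```latex
% with \gamma_t = 1/t and \delta_t = t^{-1/3}:
\sum_{t} \gamma_t = \sum_{t} \frac{1}{t} = \infty, \qquad
\sum_{t} \frac{\gamma_t^2}{\delta_t^2} = \sum_{t} t^{-4/3} < \infty, \qquad
\gamma_t \delta_t = t^{-4/3} \to 0.
```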


Convergence rate

Speed of convergence in a repeated Kelly auction

[Figure: plot of the distance to equilibrium, ‖X_t − x∗‖², versus iterations in a repeated Kelly auction.]


Conclusions and perspectives

Conclusions
▸ No-regret learning does not guarantee stability by itself ✗

▸ No-regret learning plus suitable monotonicity does ✓

▸ Convergence to equilibrium does not require gradient feedback ✓

Open questions
▸ Faster rates?
▸ Delayed payoff observations?
▸ Beyond monotonicity?
▸ ???


NetEcon 2019
The 14th Workshop on the Economics of Networks, Systems and Computation

Phoenix, Arizona, 28th June 2019

In conjunction with ACM EC 2019 & SIGMETRICS 2019

Keynote speakers: Itai Ashlagi *** David Parkes *** Nicolas Stier

Topics: Networks & … learning, resource pricing, market design, auctions

https://netecon19.inria.fr