Response Regret
Martin Zinkevich, AAAI Fall Symposium, November 5th, 2005
This work was supported by NSF Career Grant #IIS-0133689.
Outline
Introduction
Repeated Prisoners' Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion
The Prisoner’s Dilemma
Two prisoners (Alice and Bob) are caught for a small crime. They make a deal not to squeal on each other for a large crime.
Then the authorities meet with each prisoner separately and offer a pardon for the small crime if the prisoner turns his/her partner in for the large crime.
Each has two options: Cooperate with his/her fellow prisoner, or Defect from the deal.
Bimatrix Game

                   Bob Cooperates                  Bob Defects
Alice Cooperates   Alice: 1 year, Bob: 1 year      Alice: 6 years, Bob: 0 years
Alice Defects      Alice: 0 years, Bob: 6 years    Alice: 5 years, Bob: 5 years
Bimatrix Game

                   Bob Cooperates   Bob Defects
Alice Cooperates   -1,-1            -6,0
Alice Defects       0,-6            -5,-5
Nash Equilibrium

                   Bob Cooperates   Bob Defects
Alice Cooperates   -1,-1            -6,0
Alice Defects       0,-6            -5,-5

The unique Nash equilibrium is mutual defection (-5,-5): whatever the other does, each player is better off defecting.
The Problem
Each player, acting to slightly improve his/her own circumstances, hurts the other, so that if both acted "irrationally", they would both do better.
A Better Model for Real Life
Consequences for misbehavior
These improve life
A better model: infinitely repeated games
The Goal
Can we come up with algorithms with performance guarantees in the presence of other intelligent agents, guarantees that take delayed consequences into account?
Side effect: a goal for reinforcement learning in infinite POMDPs.
Regret Versus Standard RL
Guarantees of performance during learning.
No guarantee for the "final" policy... for now.
A New Measure of Regret
Traditional Regret measures immediate consequences
Response Regret measures delayed effects
Outline
Introduction
Repeated Prisoners' Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion
Repeated Bimatrix Game
                   Bob Defects   Bob Cooperates
Alice Defects      -5,-5          0,-6
Alice Cooperates   -6,0          -1,-1
Finite State Machine (for Bob)

[State diagram: a two-state machine for Bob. Each state is labeled with the action Bob plays there (cooperate or defect); each transition is labeled with the action of Alice that triggers it.]
Grim Trigger

[State diagram: Bob starts in the cooperate state and stays there while Alice cooperates; if Alice ever defects, Bob moves to the defect state and stays there forever (looping on Alice *).]
Always Cooperate

[State diagram: a single state in which Bob cooperates, looping on any action of Alice (Alice *).]
Always Defect

[State diagram: a single state in which Bob defects, looping on any action of Alice (Alice *).]
Tit-for-Tat

[State diagram: two states. Bob cooperates after Alice cooperates and defects after Alice defects; each transition is labeled with Alice's most recent action.]
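These machines are small enough to express directly in code. Below is a minimal Python sketch (the names and data layout are mine, not from the talk): each strategy is an initial state, a map from state to Bob's action, and a transition map keyed by Alice's action.

```python
# Each strategy: (initial_state, play, step)
#   play[s]    -> the action Bob takes in state s ('C' or 'D')
#   step[s][a] -> Bob's next state after observing Alice's action a

def tit_for_tat():
    play = {'coop': 'C', 'defect': 'D'}
    step = {'coop':   {'C': 'coop', 'D': 'defect'},
            'defect': {'C': 'coop', 'D': 'defect'}}
    return 'coop', play, step

def grim_trigger():
    play = {'coop': 'C', 'defect': 'D'}
    step = {'coop':   {'C': 'coop',   'D': 'defect'},
            'defect': {'C': 'defect', 'D': 'defect'}}  # absorbing defect state
    return 'coop', play, step

def always(action):                  # Always Cooperate / Always Defect
    return 's', {'s': action}, {'s': {'C': 's', 'D': 's'}}

def run(machine, alice_actions):
    """Bob's action sequence against a given sequence of Alice's actions."""
    state, play, step = machine
    bob = []
    for a in alice_actions:
        bob.append(play[state])
        state = step[state][a]
    return bob

print(run(tit_for_tat(),  list('CDCC')))  # ['C', 'C', 'D', 'C']
print(run(grim_trigger(), list('CDCC')))  # ['C', 'C', 'D', 'D']
```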
Discounted Utility

[Figure: a play of the repeated game viewed as a stopping process. After each step, play continues (GO) with probability 2/3 and stops (STOP) with probability 1/3; per-step payoffs for both players accumulate along the sampled play.]
Discounted Utility

The expected total reward of that process is Σ_{t=1}^∞ γ^{t-1} u_t, where u_t is the utility received at step t and γ = Pr[GO].
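The equivalence between γ-discounting and the GO/STOP process can be sanity-checked numerically; a small sketch (mine, with γ = 2/3 and an arbitrary utility sequence):

```python
import random

gamma = 2/3
u = [-1, 0, -5, -5, -5]          # arbitrary per-step utilities u_1..u_5

# Closed form: sum over t of gamma^(t-1) * u_t
closed = sum(gamma**t * ut for t, ut in enumerate(u))

# Stopped process: collect u_t, then continue (GO) w.p. gamma, else STOP
def one_run():
    total = 0.0
    for ut in u:
        total += ut
        if random.random() > gamma:   # STOP with probability 1 - gamma
            break
    return total

mc = sum(one_run() for _ in range(200_000)) / 200_000
print(closed, mc)                     # agree up to Monte Carlo noise
```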
Optimal Value Functions for FSMs

V_γ^*(s): discounted utility of the OPTIMAL policy from state s
V_0^*(s): immediate maximum utility at state s
V_γ^*(B): discounted utility of the OPTIMAL policy given belief B over states
V_0^*(B): immediate maximum utility given belief B over states

(Pr[GO] = γ, Pr[STOP] = 1 - γ.)
Best Responses, Discounted Utility

If γ > 1/5, a policy is a best response to grim trigger iff it always cooperates when playing grim trigger.

[State diagram: Grim Trigger, as above.]
Best Responses, Discounted Utility

Similarly, if γ > 1/5, a policy is a best response to tit-for-tat iff it always cooperates when playing tit-for-tat.

[State diagram: Tit-for-Tat, as above.]
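The 1/5 threshold follows from comparing two candidate policies against grim trigger: cooperating forever earns -1 each step, worth -1/(1-γ), while defecting immediately earns 0 once and then -5 forever, worth -5γ/(1-γ); cooperation wins exactly when γ > 1/5. A quick sketch of that comparison (mine, using the payoffs above):

```python
def coop_forever(gamma):
    return -1 / (1 - gamma)              # -1 at every step

def defect_now(gamma):
    return 0 - 5 * gamma / (1 - gamma)   # 0 once, then -5 forever

for gamma in (0.10, 0.20, 0.30, 0.90):
    print(gamma, coop_forever(gamma) > defect_now(gamma))
# 0.1 False, 0.2 False (exact tie), 0.3 True, 0.9 True
```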
Knowing Versus Learning
Given a known FSM for the opponent, we can determine the optimal policy (for some γ) from an initial state.
However, if it is an unknown FSM, by the time we learn what it is, it will be too late to act optimally.
Grim Trigger or Always Cooperate?

[State diagrams: Grim Trigger and Always Cooperate side by side. As long as Alice cooperates, the two machines behave identically, so they cannot be distinguished without defecting.]
For learning, optimality from the initial state is a bad goal.
Deterministic Infinite SMs

Represent any deterministic policy
De-randomization

[Figure: an infinite tree-shaped state machine whose states are labeled C or D.]
New Goal
Can a measure of regret allow us to play like tit-for-tat in the Infinitely Repeated Prisoner’s Dilemma?
In addition, it should be possible for one algorithm to minimize regret against all possible opponents (finite and infinite SMs).
Outline
Introduction
Repeated Prisoners' Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion
Traditional Regret: Rock-Paper-Scissors

                      Bob plays Rock   Bob plays Paper   Bob plays Scissors
Alice plays Rock      Tie              Bob wins $1       Alice wins $1
Alice plays Paper     Alice wins $1    Tie               Bob wins $1
Alice plays Scissors  Bob wins $1      Alice wins $1     Tie
Traditional Regret: Rock-Paper-Scissors

                      Bob plays Rock   Bob plays Paper   Bob plays Scissors
Alice plays Rock      0,0              -1,1              1,-1
Alice plays Paper     1,-1             0,0               -1,1
Alice plays Scissors  -1,1             1,-1              0,0
Rock-Paper-Scissors: Bob plays BR to Alice's Last

[Figure: successive rounds of play in which Bob's machine plays the best response to Alice's previous action.]
Utility of the Algorithm

Define u_t to be the utility of ALG at time t. Define u_0^{ALG} to be:

u_0^{ALG} = (1/T) Σ_{t=1}^T u_t

Here: u_0^{ALG} = (1/5)(0 + 1 + (-1) + 1 + 0) = 1/5
Rock-Paper-Scissors: Visit Counts for Bob's Internal States

[Figure: Bob's three internal states, visited 3 times, 1 time, and 1 time. u_0^{ALG} = 1/5.]

Rock-Paper-Scissors: Frequencies

[Figure: the same states with empirical frequencies 3/5, 1/5, and 1/5. u_0^{ALG} = 1/5.]

Rock-Paper-Scissors: Dropped in According to the Frequencies

[Figure: dropped into Bob's states according to those frequencies, Alice's three actions earn expected utilities 0, 2/5, and -2/5. u_0^{ALG} = 1/5.]
Traditional Regret

Consider B to be the empirical frequency with which states were visited. Define u_0^{ALG} to be the average utility of the algorithm. The traditional regret of ALG is:

R_0 = V_0^*(B) - u_0^{ALG}

Here: R_0 = (2/5) - (1/5) = 1/5
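The worked numbers can be reproduced directly. A sketch (mine; it assumes the three visited states correspond to Bob playing rock, paper, and scissors with frequencies 3/5, 1/5, 1/5, which matches the 0, 2/5, -2/5 figures above):

```python
# Alice's RPS payoffs: rows = Alice's action, cols = Bob's action,
# order (rock, paper, scissors).
U = [[ 0, -1,  1],
     [ 1,  0, -1],
     [-1,  1,  0]]

B = [3/5, 1/5, 1/5]    # empirical frequencies of Bob's internal states
u0_alg = 1/5           # average realized utility of the algorithm

# Expected utility of each of Alice's actions against the belief B
values = [sum(p * U[a][b] for b, p in enumerate(B)) for a in range(3)]
v0_star = max(values)              # V_0^*(B) = 2/5 (play paper)

print(values)                      # [0.0, 0.4, -0.4] up to float noise
print(v0_star - u0_alg)            # R_0 = 2/5 - 1/5 = 1/5
```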
Traditional Regret

Goal: regret approaches zero almost surely. There exists an algorithm that achieves this against all opponents.
What Algorithm?

Gradient Ascent with Euclidean Projection (Zinkevich, 2003):

p_{t+1} = P(p_t + η_t u_t), where P is Euclidean projection onto the probability simplex (when p_i strictly positive)
What Algorithm?

Exponentially Weighted Experts (Littlestone + Warmuth, 1994):

p_{i,t+1} ∝ exp(η Σ_{t'≤t} u_{i,t'})

And a close relative.
What Algorithm?

Regret Matching (play each action with probability proportional to its positive cumulative regret):

p_{i,t+1} ∝ max(0, Σ_{t'≤t} (u_{i,t'} - u_{t'}))
What Algorithm?
Lots of them!
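As one concrete instance, here is a minimal sketch (mine) of the exponentially weighted experts strategy for the full-information setting: the probability of each action grows exponentially with its cumulative utility, which is enough to make the average regret vanish.

```python
import math

def exp_weights(cum_u, eta=0.1):
    """Mixed strategy with p_i proportional to exp(eta * cumulative u_i)."""
    w = [math.exp(eta * u) for u in cum_u]
    z = sum(w)
    return [wi / z for wi in w]

# Feed it the per-action utility vectors from a few RPS rounds
cum = [0.0, 0.0, 0.0]          # cumulative u_i for rock / paper / scissors
for u_vec in ([0, 1, -1], [-1, 0, 1], [0, 1, -1]):
    cum = [c + u for c, u in zip(cum, u_vec)]
print(exp_weights(cum))        # mass shifts toward the best action so far
```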
Extensions to Traditional Regret
(Foster and Vohra, 1997)

Into the past: keep a short history. Optimal against "BR to Alice's Last."
Extensions to Traditional Regret

(Auer et al.) Only see u_t, not u_{i,t}. Use an unbiased estimator of u_{i,t}:

û_{i,t} = u_{i,t} · 1[i_t = i] / p_{i,t}
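A sketch (mine) of that estimator: only the chosen action's utility is seen, and dividing by the probability of having chosen it makes every coordinate unbiased, since E[û_{i,t}] = p_{i,t} · u_{i,t}/p_{i,t} = u_{i,t}.

```python
import random

def estimated_vector(p, chosen, observed_u):
    """u_hat_{i,t} = u_{i,t} * 1[i_t = i] / p_{i,t}: zero everywhere except
    the chosen coordinate, which is scaled up by 1/p."""
    return [observed_u / p[i] if i == chosen else 0.0 for i in range(len(p))]

# Empirical check of unbiasedness
p, u = [0.5, 0.3, 0.2], [1.0, -1.0, 0.5]
n, acc = 100_000, [0.0, 0.0, 0.0]
for _ in range(n):
    i = random.choices(range(3), weights=p)[0]   # the action actually played
    acc = [a + e for a, e in zip(acc, estimated_vector(p, i, u[i]))]
print([a / n for a in acc])                      # close to [1.0, -1.0, 0.5]
```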
Outline
Introduction
Repeated Prisoners' Dilemma
Tit-for-Tat
Grim Trigger
Traditional Regret
Response Regret
Conclusion
This Talk
Do you want to? Even then, is it possible?
Traditional Regret: Prisoner's Dilemma

[Figure: Tit-for-Tat state machine for Bob.]

Play (Alice, Bob): CC, DC, DD, DD, DD, DD, DD, DD, DD, DD
Traditional Regret: Prisoner's Dilemma

[Figure: Tit-for-Tat state machine with empirical state frequencies: Bob cooperates (0.2), Bob defects (0.8).]

Expected immediate utility: Alice defects: -4; Alice cooperates: -5
Traditional Regret

                   Bob Cooperates   Bob Defects
Alice Cooperates   -1,-1            -6,0
Alice Defects       0,-6            -5,-5
The New Dilemma
Traditional regret forces greedy, short-sighted behavior.
A new concept is needed.
A New Measurement of Regret

[Figure: Tit-for-Tat state machine with empirical state frequencies: Bob cooperates (0.2), Bob defects (0.8).]

Use V_γ^*(B) instead of V_0^*(B).
Response Regret

Consider B to be the empirical distribution over states visited. Define u_0^{ALG} to be the average utility of the algorithm. Traditional regret is:

R_0 = V_0^*(B) - u_0^{ALG}

Response regret is:

R_γ = V_γ^*(B) - ?
Averaged Discounted Utility

Utility of the algorithm at time t': u_{t'}

Discounted utility from time t: Σ_{t'=t}^∞ γ^{t'-t} u_{t'}

Averaged discounted utility from 1 to T:

u_γ^{ALG} = (1/T) Σ_{t=1}^T Σ_{t'=t}^∞ γ^{t'-t} u_{t'}

Dropped in at random but playing optimally: V_γ^*(B). Response regret:

R_γ = V_γ^*(B) - u_γ^{ALG}
Response Regret

Consider B to be the empirical distribution over states visited.

Traditional regret is: R_0 = V_0^*(B) - u_0^{ALG}

Response regret is: R_γ = V_γ^*(B) - u_γ^{ALG}
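A sketch (mine) of u_γ^{ALG}; truncating the inner sum at the end of the observed sequence, rather than extending it to infinity, is an assumption of this sketch, not part of the definition.

```python
def averaged_discounted_utility(u, gamma):
    """(1/T) sum_{t=1}^T sum_{t'=t}^T gamma^(t'-t) u_{t'}
    (inner sum truncated at T instead of running to infinity)."""
    T = len(u)
    return sum(gamma**(tp - t) * u[tp]
               for t in range(T) for tp in range(t, T)) / T

# Alice defects once against tit-for-tat, then mutual defection:
# joint play CC, DC, DD, DD, ... gives Alice per-step utilities:
u = [-1, 0] + [-5] * 8
print(averaged_discounted_utility(u, 2/3))
```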
Comparing Regret Measures: when Bob Plays Tit-for-Tat

[Figure: Tit-for-Tat state machine with empirical state frequencies: Bob cooperates (0.2), Bob defects (0.8).]

Play (Alice, Bob): CC, DC, DD, DD, DD, DD, DD, DD, DD, DD

R_0 = 1/10 (defect)
R_{1/5} = 0 (any policy)
R_{2/3} = 203/30 ≈ 6.76 (always cooperate)
Comparing Regret Measures: when Bob Plays Tit-for-Tat

[Figure: Tit-for-Tat state machine with empirical state frequencies: Bob cooperates (1.0), Bob defects (0.0).]

Play (Alice, Bob): CC, CC, CC, CC, CC, CC, CC, CC, CC, CC

R_0 = 1 (defect)
R_{1/5} = 0 (any policy)
R_{2/3} = 0 (always cooperate / tit-for-tat / grim trigger)
Comparing Regret Measures: when Bob Plays Grim Trigger

[Figure: Grim Trigger state machine with empirical state frequencies: Bob cooperates (0.2), Bob defects (0.8).]

Play (Alice, Bob): CC, DC, DD, DD, DD, DD, DD, DD, DD, DD

R_0 = 1/10 (defect)
R_{1/5} = 0 (grim trigger / tit-for-tat / always defect)
R_{2/3} = 11/30 (grim trigger / tit-for-tat)
Comparing Regret Measures: when Bob Plays Grim Trigger

[Figure: Grim Trigger state machine with empirical state frequencies: Bob cooperates (1.0), Bob defects (0.0).]

Play (Alice, Bob): CC, CC, CC, CC, CC, CC, CC, CC, CC, CC

R_0 = 1 (defect)
R_{1/5} = 0 (always cooperate / always defect / tit-for-tat / grim trigger)
R_{2/3} = 0 (always cooperate / tit-for-tat / grim trigger)
Regrets

vs Tit-for-Tat:
  Alice: CDDDDDDDDD, Bob: CCDDDDDDDD  →  R_0 = 0.1, R_{1/5} = 0, R_{2/3} ≈ 6.76
  Alice: CCCCCCCCCC, Bob: CCCCCCCCCC  →  R_0 = 1,   R_{1/5} = 0, R_{2/3} = 0

vs Grim Trigger:
  Alice: CDDDDDDDDD, Bob: CCDDDDDDDD  →  R_0 = 0.1, R_{1/5} = 0, R_{2/3} ≈ 0.36
  Alice: CCCCCCCCCC, Bob: CCCCCCCCCC  →  R_0 = 1,   R_{1/5} = 0, R_{2/3} = 0
What it Measures

Constant opportunities → high response regret
A few drastic mistakes → low response regret
Convergence implies a Nash equilibrium of the repeated game
Philosophy
Response regret cannot be known without knowing the opponent.
Response regret can be estimated while playing the opponent, so that, in the limit, the estimate will be exact almost surely.
Determining Utility of a Policy in a State

If I want to know the discounted utility of using a policy P from the third state visited...

Use the policy P from the third time step ad infinitum, and take the discounted reward.

[Timeline: states S1 S2 S3 S4 S5; P begins at S3.]
Determining Utility of a Policy in a State in Finite Time

Start using the policy P from the third time step; after each step, with probability γ, continue using P. Take the total reward over the time steps P was used.

In EXPECTATION, the same as before.

[Timeline: states S1 S2 S3 S4 S5; P begins at S3.]
Determining Utility of a Policy in a State in Finite Time Without ALWAYS Using It

With probability ε, start using the policy P from the third time step; after each step, with probability γ, continue using P. Take the total reward over the time steps P was used and multiply it by 1/ε.

In EXPECTATION, the same as before. Any finite number of policies can be estimated at the same time this way.

[Timeline: states S1 S2 S3 S4 S5; P begins at S3.]
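A minimal sketch (mine; the tit-for-tat opponent, payoffs, and all parameter choices are illustrative assumptions) of the three-slide protocol: with probability ε start running P at the target step, keep running it with probability γ per step, and rescale the collected reward by 1/ε, so the expectation matches the discounted value.

```python
import random

PAYOFF = {('C', 'C'): -1, ('C', 'D'): -6, ('D', 'C'): 0, ('D', 'D'): -5}

def one_sample(eps, gamma, start_t, policy='C', default='D', horizon=1000):
    """One importance-weighted sample of the discounted value of always
    playing `policy` from the state reached at step `start_t`, while the
    baseline behavior plays `default`.  E[sample] = that discounted value."""
    bob_next = 'C'                    # tit-for-tat: echo Alice's last action
    started = random.random() < eps   # commit to trying P on this trajectory?
    using_p, total = False, 0.0
    for t in range(horizon):
        if started and t == start_t:
            using_p = True
        a = policy if using_p else default
        r = PAYOFF[(a, bob_next)]
        bob_next = a
        if using_p:
            total += r
            if random.random() > gamma:   # with prob 1-gamma, stop running P
                break
    return total / eps if started else 0.0

n = 50_000
est = sum(one_sample(eps=0.1, gamma=2/3, start_t=2) for _ in range(n)) / n
print(est)  # about -8: cooperating from Bob's "defect" state costs -6, then -1 forever
```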
Traditional Regret

Goal: regret approaches zero almost surely. There exists an algorithm that achieves this against all opponents.

Response Regret

Goal: regret approaches zero almost surely. There exists an algorithm that achieves this against all opponents.
A Hard Environment: The Combination Lock Problem

[State diagram: a combination-lock machine. Bob defects (Bd) in every state along a chain; only one particular sequence of Alice's actions (mixing Ac and Ad) advances through the chain to the single state where Bob cooperates (Bc), and a wrong action resets progress.]
SPEED!

Response regret takes time to minimize (the combination lock problem).

Current work: restricting the adversary's choice of policies. In particular, if the number of policies is N, then the regret is linear in N and polynomial in 1/(1-γ).
Related Work

Other work: de Farias and Megiddo 2004; Browning, Bowling, and Veloso 2004; Bowling and McCracken 2005.

Episodic solutions face problems similar to the Finitely Repeated Prisoner's Dilemma.
What is in a Name?
Why not Consequence Regret?
Questions?
Thanks to: Avrim Blum (CMU), Michael Bowling (U Alberta), Amy Greenwald (Brown), Michael Littman (Rutgers), Rich Sutton (U Alberta)
Always Cooperate

[Figure: Always Cooperate state machine for Bob.]

Play: Alice cooperates once and then defects thereafter; Bob always cooperates.

R_0 = 1/10
R_{1/5} = 1/10
R_{2/3} = 1/10
Practice

Using these estimation techniques, it is possible to minimize response regret (make it approach zero almost surely in the limit) in an ARBITRARY environment.

Similar to the Folk Theorems, it is also possible to converge to socially optimal behavior if γ is close enough to 1. (???)
Traditional Regret: Prisoner's Dilemma

[Figure: Tit-for-Tat state machine for Bob.]
Possible Outcomes

Alice cooperates, Bob cooperates: Alice: 1 year, Bob: 1 year
Alice defects, Bob cooperates: Alice: 0 years, Bob: 6 years
Alice cooperates, Bob defects: Alice: 6 years, Bob: 0 years
Alice defects, Bob defects: Alice: 5 years, Bob: 5 years
Bimatrix Game

                   Bob Cooperates                  Bob Defects
Alice Cooperates   Alice: 1 year, Bob: 1 year      Alice: 6 years, Bob: 0 years
Alice Defects      Alice: 0 years, Bob: 6 years    Alice: 5 years, Bob: 5 years
Repeated Bimatrix Game
The same one-shot game is played repeatedly.
Either average reward or discounted reward is considered.
One Slide Summary

Problem: Prisoner's Dilemma. Solution: Infinitely Repeated Prisoner's Dilemma.
Same Problem: Traditional Regret. Solution: Response Regret.
Formalism for FSMs (S, A, Ω, O, u, T)

States S
Finite actions A
Finite observations Ω
Observation function O: S → Ω
Utility function u: S × A → R (or u: S × O → R)
Transition function T: S × A → S

V_γ^*(s) = max_{a∈A} [u(s,a) + γ V_γ^*(T(s,a))]
Beliefs

Suppose s is a state in S:
T(s,a): next state; O(s): observation; u(s,a): value
V_γ^*(s) = max_{a∈A} [u(s,a) + γ V_γ^*(T(s,a))]

Suppose B is a distribution over states:
T(B,a,o): next belief; O(B,o): probability of observation o; u(B,a): expected value
V_γ^*(B) = max_{a∈A} [u(B,a) + γ Σ_{o∈Ω} O(B,o) V_γ^*(T(B,a,o))]
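For a deterministic, fully observed FSM, the first recursion can be solved by fixed-point iteration. A minimal sketch (mine) computing V_γ^*(s) for the grim trigger machine with γ = 2/3:

```python
GAMMA = 2/3
PAYOFF = {('C', 'C'): -1, ('C', 'D'): -6, ('D', 'C'): 0, ('D', 'D'): -5}

bob_plays = {'coop': 'C', 'defect': 'D'}       # grim trigger's two states
T = {('coop', 'C'): 'coop',     ('coop', 'D'): 'defect',
     ('defect', 'C'): 'defect', ('defect', 'D'): 'defect'}

# Iterate V(s) <- max_a [ u(s,a) + gamma * V(T(s,a)) ] to convergence
V = {s: 0.0 for s in bob_plays}
for _ in range(200):
    V = {s: max(PAYOFF[(a, bob_plays[s])] + GAMMA * V[T[(s, a)]]
                for a in ('C', 'D'))
         for s in bob_plays}

print(V)  # V(coop) ≈ -3 (cooperate forever), V(defect) ≈ -15 (defect forever)
```

Running the same iteration on a belief state would use the second recursion, with the sum over observations weighted by O(B,o).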