Regret Minimization: Algorithms and Applications - New York University
Adaptive Regret Minimization in Bounded Memory Games
Jeremiah Blocki, Nicolas Christin, Anupam Datta, Arunesh Sinha
1
GameSec 2013 – Invited Paper
Motivating Example: Cheating Game
4
[Figure: timeline over Semester 1, Semester 2, Semester 3]
Motivating Example: Speeding Game
5
[Figure: timeline over Week 1, Week 2, Week 3]
Motivating Example: Speeding Game
6
Questions:
o Appropriate game model for this interaction?
o Defender strategies?
Actions and outcomes:
o Defender: High Inspection, Low Inspection
o Adversary: Speed, Behave
Game Elements
8
o Repeated interaction
o Two players: Defender and Adversary
o Imperfect information: Defender only observes the outcome
o Short-term adversaries
o Adversary incentives unknown to Defender (last presentation! [JNTP 13])
o Adversary may be uninformed/irrational
Repeated Game Model? Stackelberg?
Additional Game Elements
9
o History-dependent actions
  o Adversary adapts behavior following an unknown strategy
  o How should the defender respond?
o History-dependent rewards
  o Point system
  o Reputation of defender depends both on its history and on the current outcome
Standard Regret Minimization? Repeated Game Model?
Outline
10
Motivation
Background
o Standard definition of regret
o Regret minimization algorithms
o Limitations
Our Contributions
o Bounded Memory Games
o Adaptive Regret
o Results
Speeding Game: Repeated Game Model
11
Defender's (D) expected utility:
                  Speed   Behave
High Inspection   0.19    0.7
Low Inspection    0.2     1
Regret Minimization Example
13
Experts: What should I do — Low Inspection or High Inspection?
[Figure: each expert's daily advice (High/Low Inspection) against the adversary's play (Speed/Behave), Days 1-3]
Regret Minimization Example
14
Defender's utility over Day 1, Day 2, Day 3 (payoff matrix as on slide 11):
Aristotle: 0.19 + 1 + 0.7 = 1.89
Plato: 0.2 + 1 + 1 = 2.2
Defender: 1.59
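The per-day sums above can be checked with a short script. The payoff matrix is the one from the slides; the day-by-day advice sequences are reconstructed from the utility sums and are an assumption:

```python
# Payoffs from the slides: (defender action, adversary action) -> utility.
payoff = {
    ("High Inspection", "Speed"): 0.19, ("High Inspection", "Behave"): 0.7,
    ("Low Inspection",  "Speed"): 0.2,  ("Low Inspection",  "Behave"): 1.0,
}

# Advice per day, reconstructed from the utility sums on the slide (assumed).
aristotle = ["High Inspection", "Low Inspection", "High Inspection"]
plato     = ["Low Inspection",  "Low Inspection", "Low Inspection"]
adversary = ["Speed", "Behave", "Behave"]   # adversary's actual play

def utility(advice):
    """Total defender utility from following this advice each day."""
    return sum(payoff[(d, a)] for d, a in zip(advice, adversary))

print(round(utility(aristotle), 2))  # 1.89
print(round(utility(plato), 2))      # 2.2
```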
Regret Minimization Example
15
Utility: Aristotle 1.89, Plato 2.2
Regret: 0.59
Regret Minimization Example
16
Regret
Regret Minimization Algorithm (A)
17
lim_{T→∞} Regret(A, Experts, T) / T ≤ 0
Regret Minimization: Basic Idea
18
Weights: Low Inspection 1.0, High Inspection 1.0
Choose action probabilistically based on weights.
Regret Minimization: Basic Idea
19
Updated weights: Low Inspection 1.5, High Inspection 0.5
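The weight-update idea on these two slides is the multiplicative-weights scheme. A minimal sketch, assuming the payoff matrix from the slides; the exponential update rule and the learning rate `eta` are illustrative choices, not taken from the talk:

```python
import random

# Payoff matrix from the slides: (defender action, adversary action) -> utility.
payoff = {
    ("High Inspection", "Speed"): 0.19, ("High Inspection", "Behave"): 0.7,
    ("Low Inspection",  "Speed"): 0.2,  ("Low Inspection",  "Behave"): 1.0,
}

def multiplicative_weights(adversary_plays, eta=0.3, seed=0):
    """Keep one weight per defender action (start at 1.0, 1.0 as on the slide),
    play proportionally to the weights, then boost each action by a factor
    that grows with the reward it would have earned this round."""
    rng = random.Random(seed)
    actions = ["Low Inspection", "High Inspection"]
    weights = dict.fromkeys(actions, 1.0)
    total = 0.0
    for a in adversary_plays:
        z = sum(weights.values())
        d = rng.choices(actions, weights=[weights[x] / z for x in actions])[0]
        total += payoff[(d, a)]
        for x in actions:                       # reward-weighted update
            weights[x] *= (1.0 + eta) ** payoff[(x, a)]
    return total, weights

total, w = multiplicative_weights(["Behave"] * 50)
# Low Inspection pays more against Behave, so its weight pulls ahead.
```

Against a behaving adversary the weight on Low Inspection grows round after round, matching the trend on the following slides.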
Speeding Game Example
20
Defender's strategy:
o Nash Equilibrium: Low Inspection
o Regret Minimization: Low Inspection
o Dominant Strategy: Low Inspection
Weights: Low Inspection 1.5, High Inspection 0.5
Speeding Game Example
21
Defender's strategy:
o Nash Equilibrium: Low Inspection
o Regret Minimization: Low Inspection
o Dominant Strategy: Low Inspection
Weights: Low Inspection 1.7, High Inspection 0.3
Speeding Game Example
22
Defender's strategy:
o Nash Equilibrium: Low Inspection
o Regret Minimization: Low Inspection
o Dominant Strategy: Low Inspection
Weights: Low Inspection 1.9, High Inspection 0.1
Philosophical Argument
24
"See! My advice was better!"
"We need a better game model!"
Unmodeled Game Elements
29
o Adversary incentives unknown to Defender (last presentation! [JNTP 13])
o Adversary may be uninformed/irrational
o History-dependent rewards
  o Point system
  o Reputation of defender depends both on its history and on the current outcome
o History-dependent actions
  o Adversary adapts behavior following an unknown strategy
  o How should the defender respond?
Outline
30
Motivation
Background
Our Contributions
o Bounded Memory Games
o Adaptive Regret
o Results
Bounded Memory Games
32
State s: Encodes last m outcomes
States: can capture history dependent rewards
State transition on outcome O_i: (O_{i−m}, …, O_{i−2}, O_{i−1}) → (O_{i−m+1}, …, O_{i−1}, O_i)
r(d, a, s) – Defender payoff when actions d, a are played at state s
Bounded Memory Games
33
State s: Encodes last m outcomes
Current outcome is only dependent on current actions
s = (O_{i−m+1}, …, O_i); r(d, a, s) – Defender payoff when actions d, a are played at state s
Bounded Memory Games - Experts
34
Expert advice may depend on the last m outcomes. Example: "If no violations have been detected in the last m rounds then play High Inspection, otherwise Low Inspection."
Fixed Defender Strategy: a mapping State → Action
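The bounded memory-m state and the expert above fit directly into a few lines of code; the outcome names `"ok"` and `"violation"` are illustrative assumptions:

```python
from collections import deque

m = 3  # memory: the state is the last m outcomes

def new_state():
    # Start with m "ok" outcomes; maxlen makes appends drop the oldest.
    return deque(["ok"] * m, maxlen=m)

def expert_advice(state):
    """Expert from the slide: High Inspection iff no violations were
    detected in the last m rounds, otherwise Low Inspection."""
    return "High Inspection" if "violation" not in state else "Low Inspection"

state = new_state()
print(expert_advice(state))          # High Inspection
state.append("violation")            # newest outcome pushes out the oldest
print(expert_advice(state))          # Low Inspection
for _ in range(m):
    state.append("ok")               # violation ages out after m rounds
print(expert_advice(state))          # High Inspection
```

The `deque` with `maxlen=m` is exactly the sliding window of the last m outcomes that defines the game's state.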
Outline
35
Motivation Background Our Contributions
Bounded Memory Games Adaptive Regret Results
k-Adaptive Strategy
36
Decision tree for the next k rounds
[Figure: tree over Day 1, Day 2, Day 3, branching on the adversary's Speed/Behave choices]
k-Adaptive Strategy
37
Decision tree for the next k rounds
[Figure: timeline over Week 1, Week 2, Week 3]
o I will never speed while I am on vacation.
o I will speed until I get caught. If I ever get a ticket then I will stop.
o I will keep speeding until I get two tickets. If I ever get two tickets then I will stop.
o If violations have been detected in the last 7 rounds then play High Inspection, otherwise Low Inspection.
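One of the adaptive strategies above ("I will speed until I get caught") is a tiny two-state machine. A sketch; modeling a "ticket" simply as the defender having played High Inspection in the previous round is an assumption the talk does not fix:

```python
def speed_until_caught():
    """'I will speed until I get caught. If I ever get a ticket then I
    will stop.' Here a 'ticket' is assumed to mean the defender inspected
    last round; after that the adversary behaves forever."""
    caught = False
    def act(last_defender_action):
        nonlocal caught
        if last_defender_action == "High Inspection":
            caught = True            # ticket observed: switch state for good
        return "Behave" if caught else "Speed"
    return act

adversary = speed_until_caught()
print(adversary("Low Inspection"))   # Speed
print(adversary("High Inspection"))  # Behave
print(adversary("Low Inspection"))   # Behave (stays reformed)
```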
k-Adaptive Regret
38
Regret(D, Expert, T) = Σ_{i=1}^{T} (r_i' − r_i)

Defender: initial state …, O_{−1}, O_0; actions (a_1, d_1), (a_2, d_2), …, (a_{k+1}, d_{k+1}), …; outcomes O_1, O_2, …, O_{k+1}, …; rewards r_1, r_2, …, r_{k+1}, …
Expert: initial state …, O_{−1}, O_0; actions (a_1, d_1'), (a_2', d_2'), …, (a_{k+1}', d_{k+1}'), …; outcomes O_1', O_2', …, O_{k+1}', …; rewards r_1', r_2', …, r_{k+1}', …
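The regret sum above is just a pairwise difference of reward sequences; the reward numbers below are made-up illustrations:

```python
def regret(expert_rewards, defender_rewards):
    """Regret(D, Expert, T) = sum over i of (r_i' - r_i), where r_i' is the
    reward the expert's counterfactual run earns in round i and r_i is the
    defender's actual reward."""
    return sum(rp - r for rp, r in zip(expert_rewards, defender_rewards))

# Made-up reward sequences for three rounds:
print(round(regret([1.0, 0.7, 1.0], [0.2, 0.7, 0.19]), 2))  # 1.61
```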
k-Adaptive Regret Minimization
39
Definition: An algorithm D is a γ-approximate k-adaptive regret minimization algorithm if, for any bounded memory-m game and any fixed set of experts EXP,
lim_{T→∞} max_{E∈EXP} Regret(D, E, T) / T ≤ γ.
Outline
40
Motivation Background Bounded Memory Games Adaptive Regret Results
k-Adaptive Regret Minimization
41
Definition: An algorithm D is a γ-approximate k-adaptive regret minimization algorithm if, for any bounded memory-m game and any fixed set of experts EXP,
lim_{T→∞} max_{E∈EXP} Regret(D, E, T) / T ≤ γ.
Theorem: For any γ > 0 there is an (inefficient) γ-approximate k-adaptive regret minimization algorithm.
Inefficient Regret Minimization Algorithm
42
• Use standard regret minimization algorithm for repeated games of imperfect information [AK04, McMahanB04, K05, FKM05]
[Figure: each fixed strategy f_1, f_2, … of the Bounded Memory-m Game becomes one action of a Repeated Game]
Inefficient Regret Minimization Algorithm
43
• Use standard regret minimization algorithm for repeated games of imperfect information [AK04, McMahanB04, K05, FKM05]
[Figure: fixed strategies f_1, f_2, … of the Bounded Memory-m Game as actions of a Repeated Game]
Expected reward in the original game given:
1. Defender follows fixed strategy f_2 for the next mkt rounds of the original game
2. Defender sees the sequence of k-adaptive adversaries below
Current outcome depends only on the current actions.
Inefficient Regret Minimization Algorithm
Start State → Stage_i (m·k·t rounds)
Real Game: … O_1 … O_m …
Repeated Game: … O_1 … O_m …
44
Inefficient Regret Minimization Algorithm
Start State → Stage_i (m·k·t rounds)
Real Game: … O_1 … O_m …
Repeated Game: … O_1 … O_m …
45
• After m rounds in Stage_i, View 1 and View 2 must converge to the same state
Inefficient Regret Minimization Algorithm
46
[Figure: fixed strategies f_1, f_2, … of the Bounded Memory-m Game as actions of a Repeated Game]
Standard regret minimization algorithms maintain a weight for each expert.
Inefficient: exponentially many fixed strategies!
• Use standard regret minimization algorithm for repeated games of imperfect information [AK04, McMahanB04, K05, FKM05]
Summary of Technical Results
47
(easier at the top)
                       Imperfect Information               Perfect Information
Oblivious Regret       Hard (Theorem 1), APX (Theorem 5)   APX (Theorem 4)
k-Adaptive Regret      Hard (Theorem 1)                    Hard (Remark 2)
Fully Adaptive Regret  X (Theorem 6)                       X (Theorem 6)

X – no regret minimization algorithm exists
Hard – no efficient regret minimization algorithm unless RP = NP
APX – efficient approximate regret minimization algorithm
Summary of Technical Results
48
(easier at the top)
                       Imperfect Information               Perfect Information
Oblivious Regret       Hard (Theorem 1), APX (Theorem 5)   APX (Theorem 4)
k-Adaptive Regret      Hard (Theorem 1), APX (New!)        Hard (Remark 2), APX (New!)
Fully Adaptive Regret  X (Theorem 6)                       X (Theorem 6)

X – no regret minimization algorithm exists
Hard – no efficient regret minimization algorithm unless RP = NP
APX – efficient approximate regret minimization algorithm (in n, k)
Summary of Technical Results
49
(Table as on the previous slide.)
Ideas: implicit weight representation + dynamic programming
Warning! f(k) is a very large constant!
Implicit Weights: Outcome Tree
50
[Figure: outcome tree branching on Behave/Speed, O(ln n / γ) nodes]
w_uv – weight on edge (u, v)
How often is edge (u, v) relevant?
Implicit Weights: Outcome Tree
51
Expert E
[Figure: outcome tree branching on Behave/Speed, O(ln n / γ) nodes]
w_E = Σ_{(u,v)∈E} R_uv · w_uv
Open Questions
52
o Perfect Information: efficient γ-approximate k-adaptive regret minimization algorithm when k = 0 and γ = 0?
o γ-approximate k-adaptive regret minimization algorithm with a more efficient running time?
(Summary table as above, with k-Adaptive Regret now APX in both information settings.)
Thanks for Listening!
THEOREM 3
Unless RP = NP there is no efficient regret minimization algorithm for Bounded Memory Games, even against an oblivious adversary.
Reduction from MAX 3-SAT (7/8 + ε) [Hastad01]; similar to the reduction in [EKM05] for MDPs.
54
THEOREM 3: SETUP
Defender Actions A: {0,1} × {0,1}
m = O(log n)
States: two states for each variable: S0 = {s_1, …, s_n}, S1 = {s'_1, …, s'_n}
Intuition: a fixed strategy corresponds to a variable assignment
55
THEOREM 3: OVERVIEW
The adversary picks a clause uniformly at random for the next n rounds.
The defender can earn reward 1 by satisfying this unknown clause in the next n rounds.
The game will "remember" if a reward has already been given, so the defender cannot earn a reward multiple times during n rounds.
56
THEOREM 3: STATE TRANSITIONS
57
Adversary Actions B: {0,1}x{0,1,2,3}
b = (b1,b2)
g(a,b) = b1
f(a, b) = S1 if a2 = 1 or b2 = a1 (reward already given); S0 otherwise (no reward given)
THEOREM 3: REWARDS
58
b = (b1,b2)
No reward whenever B plays b2 = 2
r(a, b, s) =
  1 if s ∈ S0 and a1 = b2
  −5 if s ∈ S1 and f(a, b) = S0 and b2 ≠ 3
  0 otherwise
No reward whenever s ∈ S1.
THEOREM 3: OBLIVIOUS ADVERSARY
(d1, …, dn) – binary De Bruijn sequence of order n
1. Pick a clause C uniformly at random
2. For i = 1, …, n: play b = (di, b2), where
   b2 = 1 if xi ∈ C; 0 if ¬xi ∈ C; 3 if i = n; 2 otherwise
3. Repeat Step 1
59
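The construction assumes a binary De Bruijn sequence. A standard greedy "prefer-one" generator (Martin's construction, not from the paper) sketches how to build one:

```python
def de_bruijn_binary(k):
    """Greedy 'prefer-one' construction of a cyclic binary De Bruijn
    sequence of order k: every length-k binary string occurs exactly once
    as a cyclic window of the result. Start from k zeros and repeatedly
    append a 1 if the resulting k-window is new, else a 0; stop when stuck."""
    seq = [0] * k
    seen = {tuple(seq)}
    while True:
        for bit in (1, 0):                      # prefer appending a 1
            window = tuple(seq[-(k - 1):] + [bit]) if k > 1 else (bit,)
            if window not in seen:
                seen.add(window)
                seq.append(bit)
                break
        else:
            break                               # no extension possible: done
    return seq[: 2 ** k]                        # cyclic sequence of length 2^k

s = de_bruijn_binary(3)
print(s)        # [0, 0, 0, 1, 1, 1, 0, 1]
print(len(s))   # 8
```

Read cyclically, the eight windows of `[0,0,0,1,1,1,0,1]` are 000, 001, 011, 111, 110, 101, 010, 100 — each length-3 pattern exactly once.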
ANALYSIS
The defender can never be rewarded from s ∈ S1. Get reward ⇒ transition to s ∈ S1. The defender is punished for leaving S1 unless the adversary plays b2 = 3 (i.e., when i = n).
60
f(a, b) = S1 if a2 = 1 or b2 = a1; S0 otherwise
r(a, b, s) = 1 if s ∈ S0 and a1 = b2; −5 if s ∈ S1 and f(a, b) = S0 and b2 ≠ 3; 0 otherwise
THEOREM 3: ANALYSIS
φ – assignment satisfying a ρ fraction of clauses
f_φ – average score ρ/n
Claim: No strategy (fixed or adaptive) can obtain an average expected score better than ρ*/n.
Regret minimization algorithm: run until expected average regret < ε/n ⇒ expected average score > (ρ* − ε)/n
61