
Adaptive Regret Minimization in Bounded Memory Games

Jeremiah Blocki, Nicolas Christin, Anupam Datta, Arunesh Sinha

GameSec 2013 – Invited Paper

Motivating Example: Cheating Game

[Timeline: Semester 1, Semester 2, Semester 3]

Motivating Example: Speeding Game

[Timeline: Week 1, Week 2, Week 3]

Motivating Example: Speeding Game

Example

Actions: High Inspection, Low Inspection (Defender); Speed, Behave (Adversary)
Outcomes: determined by the current actions

Questions:
o What is an appropriate game model for this interaction?
o What are good Defender strategies?

Game Elements

o Repeated Interaction
o Two Players: Defender and Adversary
o Imperfect Information
  o Defender only observes the outcome
o Short-Term Adversaries
o Adversary Incentives Unknown to Defender
  o Last presentation! [JNTP 13]
  o Adversary may be uninformed/irrational

Repeated Game Model? Stackelberg?

Additional Game Elements

o History-dependent Actions
  o Adversary adapts behavior following an unknown strategy
  o How should the Defender respond?
o History-dependent Rewards
  o Point System
  o Reputation of the Defender depends both on its history and on the current outcome

Standard Regret Minimization? Repeated Game Model?

Outline

Motivation
Background
  Standard Definition of Regret
  Regret Minimization Algorithms
  Limitations
Our Contributions
  Bounded Memory Games
  Adaptive Regret
  Results

Speeding Game: Repeated Game Model

Example: Defender's (D) Expected Utility

                          Speed   Behave
  High Inspection         0.19    0.7
  Low Inspection          0.2     1

Regret Minimization Example

Example

Experts

[Figure: over three days, the Defender asks "What should I do?" while two experts each recommend High or Low Inspection and the adversary chooses Speed or Behave.]

Regret Minimization Example

Example: Defender's Utility

                          Speed   Behave
  High Inspection         0.19    0.7
  Low Inspection          0.2     1

Experts: Aristotle and Plato each advise the Defender (High or Low Inspection) on Day 1, Day 2, Day 3.

  Aristotle's advice would have earned: 0.19 + 1 + 0.7 = 1.89
  Plato's advice would have earned:     0.2 + 1 + 1 = 2.2
  Defender's utility:                   1.59

Regret Minimization Example (continued)

  Regret of the Defender relative to the best expert: 0.59

Regret Minimization Algorithm (A)

An algorithm A is regret minimizing if, for the given set of Experts,

  lim_{T→∞} Regret(A, Experts, T) / T ≤ 0

Regret Minimization: Basic Idea

Weights: Low Inspection 1.0, High Inspection 1.0
Choose an action probabilistically based on the weights.

(Payoff matrix as above.)

Regret Minimization: Basic Idea

Updated weights after one round: High Inspection 0.5, Low Inspection 1.5

(Payoff matrix as above; Low Inspection earned more, so its weight increased.)
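To make the weight update concrete, here is a minimal multiplicative-weights ("hedge") sketch in Python. The learning rate ETA, the exact update rule, and the full-information feedback are assumptions for illustration; the 0.5/1.5 numbers on the slide come from the talk's own (unspecified) update.

    import random

    # Minimal multiplicative-weights sketch for the two defender actions above.
    # Assumed: learning rate ETA and full-information feedback (the defender
    # sees what each action would have earned); payoffs are the slide's matrix.
    PAYOFF = {("High", "Speed"): 0.19, ("High", "Behave"): 0.7,
              ("Low",  "Speed"): 0.2,  ("Low",  "Behave"): 1.0}
    ETA = 1.0
    weights = {"High": 1.0, "Low": 1.0}

    def choose_action():
        """Pick an action with probability proportional to its weight."""
        total = sum(weights.values())
        return "High" if random.random() < weights["High"] / total else "Low"

    def update(adversary_action):
        """Grow each action's weight according to the utility it would have earned."""
        for d in weights:
            weights[d] *= 1 + ETA * PAYOFF[(d, adversary_action)]

    print(choose_action())   # round 1: uniform over High and Low
    update("Speed")          # suppose the driver sped this round
    print(weights)           # Low Inspection's weight is now larger, and keeps growing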

Speeding Game

Example (payoff matrix as above)

Defender's Strategy:
  o Nash Equilibrium: Low Inspection (Low Inspection is a dominant strategy)
  o Regret Minimization: converges to Low Inspection

The weights drift toward Low Inspection over the rounds:
  High Inspection 0.5, Low Inspection 1.5
  High Inspection 0.3, Low Inspection 1.7
  High Inspection 0.1, Low Inspection 1.9
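The drift matters because of the adaptive drivers from the motivating example. The sketch below is a hypothetical simulation, not the paper's algorithm: it assumes the slide's payoffs, short-term "speed until caught" drivers who are replaced every WEEK rounds, and a full-information multiplicative-weights defender. Committing to High Inspection deters speeding and earns roughly 0.63 per round, while the regret-minimizing defender drifts to Low Inspection and earns far less, even though its external regret is small.

    import random

    # Illustrative simulation only; the adversary model and constants are assumptions.
    PAYOFF = {("High", "Speed"): 0.19, ("High", "Behave"): 0.7,
              ("Low",  "Speed"): 0.2,  ("Low",  "Behave"): 1.0}
    WEEK, ROUNDS, ETA = 7, 2000, 0.1

    def run(defender):
        total, caught = 0.0, False
        for t in range(ROUNDS):
            if t % WEEK == 0:
                caught = False                     # a fresh driver arrives
            adv = "Behave" if caught else "Speed"  # speed until caught, then stop
            d = defender(adv)
            if d == "High" and adv == "Speed":
                caught = True                      # ticket issued
            total += PAYOFF[(d, adv)]
        return total / ROUNDS

    def always_high(adv):
        return "High"                              # commit to High Inspection

    weights = {"High": 1.0, "Low": 1.0}
    def regret_minimizer(adv):
        """Standard multiplicative-weights defender (full-information update)."""
        d = "High" if random.random() < weights["High"] / sum(weights.values()) else "Low"
        for a in weights:
            weights[a] *= 1 + ETA * PAYOFF[(a, adv)]
        return d

    print("Commit to High Inspection  :", round(run(always_high), 2))        # about 0.63
    print("Standard regret minimization:", round(run(regret_minimizer), 2))  # much lower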

Philosophical Argument

"See! My advice was better!"
"We need a better game model!"

Unmodeled Game Elements

o Adversary Incentives Unknown to Defender
  o Last presentation! [JNTP 13]
  o Adversary may be uninformed/irrational
o History-dependent Rewards
  o Point System
  o Reputation of the Defender depends both on its history and on the current outcome
o History-dependent Actions
  o Adversary adapts behavior following an unknown strategy
  o How should the Defender respond?

Outline

Motivation
Background
Our Contributions
  Bounded Memory Games
  Adaptive Regret
  Results

Bounded Memory Games

State s: encodes the last m outcomes
States can capture history-dependent rewards.

Transition on outcome Oi:  (Oi-m, …, Oi-2, Oi-1)  →  (Oi-m+1, …, Oi-1, Oi)

r(d, a, s) - Defender payoff when actions d, a are played at state s

Bounded Memory Games

State s: encodes the last m outcomes
The current outcome depends only on the current actions.

r(d, a, s) - Defender payoff when actions d, a are played at state s

Bounded Memory Games - Experts

Expert advice may depend on the last m outcomes.

Fixed Defender Strategy (a map from State to Action):
  If no violations have been detected in the last m rounds, play High Inspection; otherwise play Low Inspection.
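A minimal sketch of this encoding in Python: a state is just the window of the last m outcomes, and the fixed strategy above is a function of that window. The outcome labels ("Ticket"/"NoTicket") are assumptions for illustration.

    # Sketch: a state of a bounded memory-m game is the tuple of the last m outcomes.
    M = 3

    def next_state(state, outcome):
        """Slide the window: drop the oldest outcome, append the newest."""
        return state[1:] + (outcome,)

    def expert_advice(state):
        """Fixed strategy above: no violation detected in the last m rounds
        -> High Inspection, otherwise Low Inspection."""
        return "High Inspection" if "Ticket" not in state else "Low Inspection"

    state = ("NoTicket",) * M
    for outcome in ["NoTicket", "Ticket", "NoTicket", "NoTicket"]:
        print(state, "->", expert_advice(state))
        state = next_state(state, outcome)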

Outline

Motivation
Background
Our Contributions
  Bounded Memory Games
  Adaptive Regret
  Results

k-Adaptive Strategy

A k-adaptive adversary commits to a decision tree for the next k rounds.

[Figure: decision tree over Day 1, Day 2, Day 3; at each node the adversary chooses Speed or Behave depending on the outcomes observed so far.]

k-Adaptive Strategy

Decision tree for the next k rounds (Week 1, Week 2, Week 3). Example k-adaptive adversary strategies:

o "I will never speed while I am on vacation."
o "I will speed until I get caught. If I ever get a ticket then I will stop."
o "I will keep speeding until I get two tickets. If I ever get two tickets then I will stop."

Example Defender strategy: if violations have been detected in the last 7 rounds, play High Inspection; otherwise play Low Inspection.
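As an illustration, the "speed until caught" strategy above can be written down as such a decision tree; the outcome labels and the dictionary representation are assumptions for this sketch, not the paper's notation.

    OUTCOMES = ("Ticket", "NoTicket")   # assumed outcome labels

    def behave_tree(depth):
        """Subtree that plays Behave no matter what is observed."""
        if depth == 0:
            return None
        return {"action": "Behave",
                "children": {o: behave_tree(depth - 1) for o in OUTCOMES}}

    def speed_until_caught(k):
        """Decision tree for 'I will speed until I get caught; if I ever get a
        ticket then I will stop', covering the next k rounds."""
        if k == 0:
            return None
        return {"action": "Speed",
                "children": {"Ticket": behave_tree(k - 1),
                             "NoTicket": speed_until_caught(k - 1)}}

    tree = speed_until_caught(3)
    print(tree["action"])                           # Speed on day 1
    print(tree["children"]["Ticket"]["action"])     # Behave after a ticket
    print(tree["children"]["NoTicket"]["action"])   # keep speeding if not caught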

k-Adaptive Regret

Regret(D, Expert, T) = Σ_{i=1..T} (ri' - ri)

Defender's play:
  Initial state: …, O-1, O0
  Actions:   (a1, d1), (a2, d2), …, (ak+1, dk+1), …
  Outcomes:  O1, O2, …, Ok+1, …
  Rewards:   r1, r2, …, rk+1, …

Expert's play (the k-adaptive adversary adapts, so its later actions may differ):
  Initial state: …, O-1, O0
  Actions:   (a1, d1'), (a2', d2'), …, (ak+1', dk+1'), …
  Outcomes:  O1', O2', …, Ok+1', …
  Rewards:   r1', r2', …, rk+1', …

k-Adaptive Regret Minimization

Definition: An algorithm D is a γ-approximate k-adaptive regret minimization algorithm if, for any bounded memory-m game and any fixed set of experts EXP,

  lim_{T→∞} ( max_{E ∈ EXP} Regret(D, E, T) / T ) ≤ γ.

Outline

Motivation
Background
Bounded Memory Games
Adaptive Regret
Results

k-Adaptive Regret Minimization

Definition: An algorithm D is a γ-approximate k-adaptive regret minimization algorithm if, for any bounded memory-m game and any fixed set of experts EXP,

  lim_{T→∞} ( max_{E ∈ EXP} Regret(D, E, T) / T ) ≤ γ.

Theorem: For any γ > 0 there is an inefficient γ-approximate k-adaptive regret minimization algorithm.

Inefficient Regret Minimization Algorithm

• Use a standard regret minimization algorithm for repeated games of imperfect information [AK04, McMahanB04, K05, FKM05].

[Diagram: each fixed strategy f1, f2, … of the bounded memory-m game becomes a single action of a repeated game.]

Inefficient Regret Minimization Algorithm

43

, ¿ ,…, , …

f1

f2

Bounded Memory-mGame

fixed

str

ateg

y

Repeated Game

Expected reward in original game given:

1. Defender follows fixed strategy f2 for next mkt rounds of original game

2. Defender sees sequence of k-adaptive adversaries below

• Use standard regret minimization algorithm for repeated games of imperfect information [AK04, McMahanB04,K05,FKM05]

Current outcome is only dependent on current actions

Inefficient Regret Minimization Algorithm

Start State, then Stage_i (m·k·t rounds):

[Diagram: one round of the repeated game corresponds to one stage of m·k·t rounds of the real game; both views produce the outcome sequence …, O1, …, Om, …]

• After m rounds in Stage_i, View 1 and View 2 must converge to the same state.
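A tiny check of the convergence claim, under the assumed window encoding of states: since a state is just the last m outcomes, two views that start in different states agree after m common rounds.

    M = 3

    def next_state(state, outcome):
        return state[1:] + (outcome,)

    view1 = ("Ticket", "NoTicket", "Ticket")       # arbitrary, different start states
    view2 = ("NoTicket", "NoTicket", "NoTicket")
    for o in ["Ticket", "NoTicket", "NoTicket"]:   # m = 3 shared outcomes
        view1, view2 = next_state(view1, o), next_state(view2, o)
    print(view1 == view2)                          # True: the views have converged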

Inefficient Regret Minimization Algorithm

[Diagram as above: fixed strategies f1, f2, … of the bounded memory-m game as actions of a repeated game, solved with a standard regret minimization algorithm for repeated games of imperfect information.]

Standard regret minimization algorithms maintain a weight for each expert.
Inefficient: exponentially many fixed strategies!
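A back-of-the-envelope sketch of the blow-up: if every fixed strategy (a map from states to defender actions) becomes one expert, the expert set is exponential in the number of states, which is itself exponential in m. The action and outcome labels are assumed.

    from itertools import product

    defender_actions = ["High Inspection", "Low Inspection"]
    outcomes = ["Ticket", "NoTicket"]   # assumed labels
    m = 3                               # memory of the game

    states = list(product(outcomes, repeat=m))                    # |O|**m states
    fixed_strategies = list(product(defender_actions, repeat=len(states)))

    print(len(states))            # 8
    print(len(fixed_strategies))  # 256 = |D| ** (|O| ** m): one weight per expert is infeasible

This is exactly the cost that the implicit weight representation on the following slides avoids.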

Summary of Technical Results

                          Imperfect Information               Perfect Information
  Oblivious Regret        Hard (Theorem 1), APX (Theorem 5)   APX (Theorem 4)
  k-Adaptive Regret       Hard (Theorem 1)                    Hard (Remark 2)
  Fully Adaptive Regret   X (Theorem 6)                       X (Theorem 6)

Easier (top rows) to harder (bottom rows).

  X    - no regret minimization algorithm exists
  Hard - unless RP = NP, no regret minimization algorithm is efficient in n
  APX  - an efficient approximate regret minimization algorithm exists

Summary of Technical Results

                          Imperfect Information               Perfect Information
  Oblivious Regret        Hard (Theorem 1), APX (Theorem 5)   APX (Theorem 4)
  k-Adaptive Regret       Hard (Theorem 1), APX (New!)        Hard (Remark 2), APX (New!)
  Fully Adaptive Regret   X (Theorem 6)                       X (Theorem 6)

Easier (top rows) to harder (bottom rows).

  X    - no regret minimization algorithm exists
  Hard - unless RP = NP, no regret minimization algorithm is efficient in n
  APX  - an efficient approximate regret minimization algorithm (efficient in n, k)

Summary of Technical Results

[Table as above; Fully Adaptive Regret remains X (Theorem 6) in both the imperfect and perfect information settings.]

Ideas: implicit weight representation + dynamic programming
Warning! f(k) is a very large constant!

Implicit Weights: Outcome Tree

[Figure: an outcome tree with O(ln n / γ) nodes; branches are labeled Behave / Speed.]

Edge weight w_uv: how often is edge (u, v) relevant?

Implicit Weights: Outcome Tree

Expert E

[Figure: the same outcome tree with O(ln n / γ) nodes; the edges used by expert E are marked.]

The weight of expert E is maintained implicitly:

  w_E = Σ_{(u,v) ∈ E} R_uv · w_uv
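A minimal sketch of my reading of the implicit-weight idea: keep per-edge statistics on one shared outcome tree and recover any expert's cumulative score on demand, instead of storing a weight per expert. The data layout and the toy edges are assumptions; the actual algorithm additionally needs dynamic programming over the tree to sample actions from these implicit weights.

    from collections import defaultdict

    w = defaultdict(int)   # w[(u, v)]: how often edge (u, v) has been relevant
    R = {}                 # R[(u, v)]: reward associated with edge (u, v)

    def record(edge, reward):
        """Update the shared statistics after a round in which `edge` was relevant."""
        w[edge] += 1
        R[edge] = reward

    def implicit_score(expert_edges):
        """w_E = sum of R_uv * w_uv over the edges (u, v) used by the expert."""
        return sum(R[e] * w[e] for e in expert_edges if e in R)

    # Toy usage with hypothetical outcome-tree edges:
    record(("root", "Behave"), 1.0)
    record(("root", "Speed"), 0.2)
    record(("root", "Behave"), 1.0)
    print(implicit_score([("root", "Behave")]))   # 2.0
    print(implicit_score([("root", "Speed")]))    # 0.2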

Open Questions

Perfect Information: is there an efficient γ-approximate k-adaptive regret minimization algorithm when k = 0 and γ = 0?

Is there a γ-approximate k-adaptive regret minimization algorithm with a more efficient running time?

                          Imperfect Information               Perfect Information
  Oblivious Regret        Hard (Theorem 1), APX (Theorem 5)   APX (Theorem 4)
  k-Adaptive Regret       Hard (Theorem 1), APX               Hard (Remark 2), APX
  Fully Adaptive Regret   X (Theorem 6)                       X (Theorem 6)

Thanks for Listening!

THEOREM 3

Unless RP = NP there is no efficient regret minimization algorithm for Bounded Memory Games, even against an oblivious adversary.

Reduction from MAX 3-SAT (7/8 + ε) [Hastad01]; similar to the reduction in [EKM05] for MDPs.

THEOREM 3: SETUP

Defender Actions A: {0,1} x {0,1}
m = O(log n)
States: two states for each variable: S0 = {s1, …, sn}, S1 = {s'1, …, s'n}
Intuition: a fixed strategy corresponds to a variable assignment.

THEOREM 3: OVERVIEW

The adversary picks a clause uniformly at random for the next n rounds.
The Defender can earn reward 1 by satisfying this unknown clause during those n rounds.
The game "remembers" whether a reward has already been given, so the Defender cannot earn a reward multiple times during the n rounds.

THEOREM 3: STATE TRANSITIONS

Adversary Actions B: {0,1} x {0,1,2,3},  b = (b1, b2)

g(a, b) = b1

f(a, b) = S1  if a2 = 1 or b2 = a1   (reward already given)
          S0  otherwise              (no reward given)

THEOREM 3: REWARDS

b = (b1, b2)

r(a, b, s) =  1   if s ∈ S0 and a1 = b2
             -5   if s ∈ S1 and f(a, b) = S0 and b2 ≠ 3
              0   otherwise

No reward is given whenever B plays b2 = 2, and no reward is given whenever s ∈ S1.
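A literal Python transcription of g, f and r as reconstructed above, as a sketch only: the state is abstracted to the labels "S0"/"S1", and the slide's "a = b2" is read as a1 = b2 (both are my assumptions).

    def g(a, b):
        """Outcome rule from the slides: g(a, b) = b1."""
        return b[0]

    def f(a, b):
        """Next state: S1 if a2 = 1 or b2 = a1 (reward already given), else S0."""
        a1, a2 = a
        b1, b2 = b
        return "S1" if (a2 == 1 or b2 == a1) else "S0"

    def r(a, b, s):
        """Defender reward at state s (a = defender action, b = adversary action)."""
        a1, a2 = a
        b1, b2 = b
        if s == "S0" and a1 == b2:                       # clause literal satisfied from S0
            return 1
        if s == "S1" and f(a, b) == "S0" and b2 != 3:    # punished for leaving S1
            return -5
        return 0

    print(r((1, 0), (0, 1), "S0"))   # 1
    print(r((0, 0), (0, 1), "S1"))   # -5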

THEOREM 3: OBLIVIOUS ADVERSARY

(d1, …, dn) - a binary De Bruijn sequence of order n

1. Pick a clause C uniformly at random.
2. For i = 1, …, n: play b = (di, b2), where
     b2 = 1  if xi ∈ C
          0  if ¬xi ∈ C
          3  if i = n
          2  otherwise
3. Repeat Step 1.
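For reference, a standard Lyndon-word construction of a binary de Bruijn sequence (every length-n binary string appears as a substring of the cyclic sequence). The slides do not say how (d1, …, dn) is generated, so this is only one common way to produce such a sequence, not necessarily the paper's.

    def de_bruijn(k, n):
        """Return a de Bruijn sequence B(k, n) as a list of symbols 0..k-1
        (Lyndon-word concatenation); for the binary case use k = 2."""
        a = [0] * (k * n)
        seq = []

        def db(t, p):
            if t > n:
                if n % p == 0:
                    seq.extend(a[1:p + 1])
            else:
                a[t] = a[t - p]
                db(t + 1, p)
                for j in range(a[t - p] + 1, k):
                    a[t] = j
                    db(t + 1, t)

        db(1, 1)
        return seq

    print(de_bruijn(2, 3))   # [0, 0, 0, 1, 0, 1, 1, 1]: every 3-bit string appears cyclically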

ANALYSIS

The Defender can never be rewarded from s ∈ S1.
Getting a reward ⇒ transition to s ∈ S1.
The Defender is punished for leaving S1, unless the adversary plays b2 = 3 (i.e., when i = n).

f(a, b) = S1  if a2 = 1 or b2 = a1
          S0  otherwise

r(a, b, s) =  1   if s ∈ S0 and a1 = b2
             -5   if s ∈ S1 and f(a, b) = S0 and b2 ≠ 3
              0   otherwise

THEOREM 3: ANALYSIS

φ - an assignment satisfying a ρ fraction of clauses
fφ - the corresponding fixed strategy, with average score ρ/n

Claim: no strategy (fixed or adaptive) can obtain an average expected score better than ρ*/n.

Regret Minimization Algorithm: run until the expected average regret is < ε/n; the expected average score is then > (ρ* - ε)/n.