Page 1:

A reinforcement learning algorithm for sampling design in Markov random fields

Mathieu BONNEAU Nathalie PEYRARD Régis SABBADIN

INRA-MIA Toulouse. E-Mail: {mbonneau,peyrard,sabbadin}@toulouse.inra.fr

MSTGA, Bordeaux, July 2012

Page 2:

Plan

1. Problem statement.

2. General approach.

3. Formulation using a dynamic model.

4. Reinforcement learning solution.

5. Experiments.

6. Conclusions.

Page 3:

PROBLEM STATEMENT

Adaptive sampling

[Figure: regular grid of nodes X(1), …, X(9) with neighbourhood edges]

• Adaptive selection of variables to observe for reconstruction of the random vector X = (X(1), …, X(n)).

• c(A): cost of observing variables X(A).

• B: initial budget.

• Observations are reliable.

Page 4:

PROBLEM STATEMENT

Adaptive sampling

• Adaptive selection of variables to observe for reconstruction of the random vector X = (X(1), …, X(n)).

• c(A, x(A)): cost of observing variables X(A) in state x(A).

• B: initial budget.

• Observations are reliable.

Problem: find a sampling policy that adaptively selects the variables to observe in order to optimize the quality of the reconstruction of X while respecting the initial budget.


Page 6:

DEFINITION: Adaptive sampling policy

For any sampling plans A1, …, At and observations x(A1), …, x(At), an adaptive sampling policy δ is a function giving the next variable(s) to observe:

δ((A1, x(A1)), …, (At, x(At))) = At+1.
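As an illustration of this definition, a policy can be coded as a function from the history so far to the next set of variables. A minimal Python sketch, assuming the toy example on the next slide (the History and Policy names are hypothetical):

```python
from typing import Callable

# A history is a list of (A_t, x(A_t)) pairs: the sampled sites and their
# observed states. A policy maps the history so far to the next sites A_{t+1}.
History = list[tuple[tuple[int, ...], tuple[int, ...]]]
Policy = Callable[[History], tuple[int, ...]]

def example_policy(history: History) -> tuple[int, ...]:
    """Toy policy matching the slides' example: observe site 8 first,
    then sites 6 and 7 after the first observation."""
    if not history:
        return (8,)
    return (6, 7)
```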

Page 7:

- - Example - -

A1 = δ1 = {8}, x(A1) = 0
A2 = δ2((8, 0)) = {6, 7}, x(A2) = (1, 3)


[Figure: grid of nodes X(1), …, X(9); node 8 observed first, then nodes 6 and 7]

Page 8:

DEFINITIONS

Vocabulary:

• A history {(A1, x(A1)), …, (AH, x(AH))} is a trajectory followed when applying δ.
• τδ: the set of all reachable histories of δ.
• c(δ) ≤ B: the cost of any history respects the initial budget.

Page 9:

GENERAL APPROACH

Page 10:

GENERAL APPROACH

1. Find a distribution ℙ that describes well the phenomenon under study.

2. Define the value of an adaptive sampling policy.

3. Define an approximate resolution method for finding a near-optimal policy.

Page 11:

STATE OF THE ART

1. Find a distribution ℙ that describes well the phenomenon under study: X is a continuous random vector; ℙ is a multivariate Gaussian joint distribution.

2. Define the value of an adaptive sampling policy: entropy-based criterion; kriging variance.

3. Define an approximate resolution method for finding a near-optimal policy: greedy algorithm.

Page 12:

1. Find a distribution ℙ that describes well the phenomenon under study.
   State of the art: X is a continuous random vector; ℙ is a multivariate Gaussian joint distribution.
   OUR CONTRIBUTION: X is a discrete random vector; ℙ is a Markov random field distribution.

2. Define the value of an adaptive sampling policy.
   State of the art: entropy-based criteria; kriging variance.
   OUR CONTRIBUTION: Maximum Posterior Marginals (MPM).

3. Define an approximate resolution method for finding a near-optimal policy.
   State of the art: greedy algorithm.
   OUR CONTRIBUTION: reinforcement learning.


Page 14:

Formulation using a dynamic model

An adapted framework for reinforcement learning

Page 15:

• Summarize the knowledge on X in a random vector S of length n.

• Observing variables updates our knowledge on X, i.e. makes S evolve (a sketch follows below).

• Example: s = (-1, …, k, …, -1) means that variable X(i) was observed in state k (at position i), the other variables being unobserved (-1).

[Figure: dynamic model s1 → s2 → s3 → … → sH+1 under sampling actions A1, A2, A3, …, with final utility U(sH+1)]
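A minimal sketch of this state representation and its update, assuming 0-based site indices (initial_state and update_state are hypothetical names, not the authors' code):

```python
import numpy as np

def initial_state(n: int) -> np.ndarray:
    """State vector S: entry i is -1 while X(i) is unobserved."""
    return -np.ones(n, dtype=int)

def update_state(s: np.ndarray, sites: tuple[int, ...],
                 observations: tuple[int, ...]) -> np.ndarray:
    """Observing X(A) in state x(A) writes the observed values into S."""
    s = s.copy()
    for i, x in zip(sites, observations):
        s[i] = x
    return s

# Following the slides' example (0-based indices here):
s = initial_state(9)
s = update_state(s, (8,), (0,))      # observe X(8) = 0
s = update_state(s, (6, 7), (1, 3))  # then observe X(6) = 1, X(7) = 3
```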


Page 17:

Reinforcement learning solution

Page 18:

Find the optimal policy: the Q-function

Q*(st, At) = "the expected value of the history when starting in st, observing variables X(At) and then following the optimal policy δ*".

Compute Q* → compute δ*.
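The slide's formula is not in the transcript; a standard finite-horizon formulation consistent with the quoted description would be:

\[
Q^*(s_t, A_t) = \mathbb{E}\left[\, U(s_{H+1}) \mid s_t, A_t, \delta^* \,\right],
\qquad
\delta^*(s_t) \in \operatorname*{argmax}_{A_t} Q^*(s_t, A_t).
\]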

Page 19:

Find the optimal policy: the Q-function

• How to compute Q*: classical solution (Q-learning, …)

1. Initialize Q.
2. Simulate a history: at each step, take the greedy action with probability 1-ε and a random action with probability ε.

[Figure: simulated trajectory s1 → s2 → s3 with actions A1, A2 and final utility U(s3)]

Page 20:

3. Update Q(s1, A1), Q(s2, A2), … along the simulated history.
4. Repeat many times: Q converges to Q* (a sketch follows below).
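A minimal sketch of this classical scheme, assuming the generic tabular Q-learning update (initial_state, actions, simulate_step, and utility are hypothetical placeholders, not the authors' code):

```python
import random
from collections import defaultdict

def q_learning(initial_state, actions, simulate_step, utility,
               horizon, episodes=10_000, eps=0.1, alpha=0.05):
    """Tabular Q-learning with epsilon-greedy exploration, for a
    finite-horizon problem rewarded only by U(s_{H+1}) at the end."""
    Q = defaultdict(float)  # keyed by (state, action); states must be hashable
    for _ in range(episodes):
        s = initial_state()
        for t in range(horizon):
            acts = actions(s)
            if random.random() < eps:                 # explore
                a = random.choice(acts)
            else:                                     # exploit
                a = max(acts, key=lambda a: Q[(s, a)])
            s_next = simulate_step(s, a)
            if t == horizon - 1:                      # end of the history
                target = utility(s_next)
            else:                                     # bootstrap on next state
                target = max(Q[(s_next, a2)] for a2 in actions(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```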

Page 21:

Alternative approach

• Linear approximation of the Q-function.

• Choice of the functions Φi.
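The approximation formula itself is not transcribed; the usual linear form, consistent with the weights w1, …, wH and features Φi used in the following slides, is:

\[
Q_t(s_t, A_t) \;\approx\; \sum_i w_{t,i}\, \Phi_i(s_t, A_t).
\]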

Page 22:

LSDP Algorithm

• Linear approximation of the Q-function: define a weight vector for each decision step t = 1, …, H.

• Compute the weights using "backward induction".

Page 23:

LSDP Algorithm: application to sampling

1. Computation of Φi(st, At).

2. Computation of …

3. Computation of …

Page 24:

LSDP Algorithm: application to sampling

• We fix |At| = 1 and use the approximation: …

Page 25:

Experiments

Page 26:

Experiments

• Regular grid with first-order neighbourhood.

• The X(i) are binary variables.

• ℙ is a Potts model with β = 0.5.

• Simple cost: observing each variable costs 1 (a simulation sketch follows below).

[Figure: regular grid of nodes X(1), …, X(9) with first-order neighbourhood edges]
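For concreteness, a minimal Gibbs-sampling sketch of such a binary Potts model on a grid (the simulation method is an assumption; gibbs_potts and its parameters are hypothetical):

```python
import numpy as np

def gibbs_potts(shape=(10, 10), beta=0.5, sweeps=200, rng=None):
    """Sample a binary Potts model on a grid with first-order
    neighbourhood: P(x) ∝ exp(β · #{agreeing neighbour pairs})."""
    rng = np.random.default_rng(rng)
    h, w = shape
    x = rng.integers(0, 2, size=shape)
    for _ in range(sweeps):
        for i in range(h):
            for j in range(w):
                # states of the first-order neighbours of site (i, j)
                neigh = [x[a, b]
                         for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= a < h and 0 <= b < w]
                # conditional log-probabilities of states 0 and 1
                logp = np.array([beta * sum(n == k for n in neigh)
                                 for k in (0, 1)])
                p = np.exp(logp - logp.max())
                x[i, j] = rng.choice(2, p=p / p.sum())
    return x

field = gibbs_potts((10, 10), beta=0.5)  # one realisation of X, n = 100
```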

Page 27:

Experiments

• Comparison between:
  - Random policy
  - BP-max heuristic: at each time step, observe the variable with the largest max marginal
  - LSPI policy (a common reinforcement learning algorithm)
  - LSDP policy

• Using score: …

Page 28:

Experiment: 100 variables (n=100)

Page 29:

LSDP and BP-max: no observations

[Figure: action values for the LSDP policy; max marginals]

Page 30:

LSDP and BP-max: one observation

[Figure: action values for the LSDP policy; max marginals]

Page 31:

LSDP and BP-max: two observations

[Figure: action values for the LSDP policy; max marginals]

Page 32:

Experiment: 100 variables, constrained moves

• Allowed to visit the second-order neighbourhood only!


Page 34:

Experiment: 100 variables, constrained moves

Page 35:

Experiment: 200 variables, different costs

                                     Random   BP-Max   LSDP    Min Cost
Policy value                         60.27%   61.77%   64.8%   64.58%
Mean number of observed variables    26.65    19       27.3    38

Initial budget = 38

Page 36:

Experiment: 200 variables, different costs (continued)

• Cost repartition:

[Figure: cost repartition for the Random, BP-max, and LSDP policies]

Page 37:

Experiment: 100 variables, different costs

                                     Random   BP-Max   LSDP
Policy value                         65.3%    66.2%    67%
Mean number of observed variables    15.4     15.4     15.4
WR 1                                 65.9%    65.4%    67%
WR 0                                 61%      63.2%    64%

• Initial budget = 30
• Cost of 1 when the observed variable is in state 0
• Cost of 3 when the observed variable is in state 1

Page 38:

Conclusions

• An adapted framework for adaptive sampling of discrete random variables.

• LSDP: a reinforcement learning approach for finding a near-optimal policy:
  - adaptation of a common reinforcement learning algorithm to the adaptive sampling problem;
  - computation of a near-optimal policy "off-line";
  - design of new policies that outperform simple heuristics and a usual RL method.

• Possible application: weed sampling in crop fields.

Page 39:

THANK YOU!


Page 41:

Reconstruction of X(R) and trajectory value

Example: A1 = δ1 = {8}, x(A1) = 0; A2 = δ2((8, 0)) = {6, 7}, x(A2) = (4, 2).

• Maximum Posterior Marginal for reconstruction: …

• Quality of trajectory: …
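The formulas are not transcribed; the standard MPM reconstruction of the unobserved variables R given the observations, consistent with the MPM criterion named earlier, is:

\[
x^*(i) \in \operatorname*{argmax}_{x(i)} \; \mathbb{P}\big(X(i) = x(i) \mid x(A_1), \ldots, x(A_H)\big), \qquad i \in R.
\]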

Page 42:

LSDP Algorithm

• Linear approximation of the Q-function.

• How to compute w1, …, wH: simulate histories s1, a1, …, sH, aH, sH+1 and record the final utilities U(sH+1).

[Figure: simulated trajectories s1 → s2 → … → sH → sH+1 with actions a1, …, aH and final utility U(sH+1)]

Page 43:

• At the last decision step, the simulated pairs (sH, aH) and the utilities U(sH+1) form a LINEAR SYSTEM whose solution gives wH.

Page 44:

• Moving backwards, the trajectories truncated at step H-1 form, in turn, a linear system whose solution gives wH-1, and so on down to w1 (see the sketch below).
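A minimal sketch of one way to realise this backward "linear system" step as least squares (an interpretation of the slides, not the authors' exact algorithm; phi and the trajectory format are hypothetical):

```python
import numpy as np

def lsdp_weights(trajectories, utilities, phi, horizon):
    """Backward induction for the linear Q-approximation
    Q_t(s, a) ≈ phi(s, a) @ w[t].
    trajectories[m][t-1] = (s_t, a_t, actions available in s_t);
    utilities[m] = U(s_{H+1}) for trajectory m."""
    w = [None] * (horizon + 1)
    for t in range(horizon, 0, -1):
        X, y = [], []
        for traj, U in zip(trajectories, utilities):
            s, a, _ = traj[t - 1]
            X.append(phi(s, a))
            if t == horizon:
                y.append(U)  # last step: target is the final utility
            else:
                # earlier steps: target is the best approximate Q at step t+1
                s1, _, acts1 = traj[t]
                y.append(max(phi(s1, a1) @ w[t + 1] for a1 in acts1))
        # least-squares solution of the linear system X w_t ≈ y
        w[t], *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return w[1:]
```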