
Bayesian Reinforcement Learning in Continuous POMDPs

Stéphane Ross¹, Brahim Chaib-draa² and Joelle Pineau¹

¹School of Computer Science, McGill University, Canada
²Department of Computer Science, Laval University, Canada

December 19th, 2007



Motivation

How should robots make decisions when:
- the environment is partially observable and continuous,
- the model of sensors and actuators is poor,
- parts of the model have to be learned entirely during execution (e.g. users' preferences/behavior),

so as to maximize expected long-term rewards?

Typical Examples :

[Figure omitted: example robot applications; image credit: Rottmann]

Solution: Bayesian Reinforcement Learning!


Partially Observable Markov Decision Processes

POMDP: $(S, A, T, R, Z, O, \gamma, b_0)$

- $S$: set of states
- $A$: set of actions
- $T(s,a,s') = \Pr(s'|s,a)$: transition probabilities
- $R(s,a) \in \mathbb{R}$: immediate rewards
- $Z$: set of observations
- $O(s',a,z) = \Pr(z|s',a)$: observation probabilities
- $\gamma$: discount factor
- $b_0$: initial state distribution

Belief monitoring via Bayes' rule:
$$b_t(s') = \eta\, O(s', a_{t-1}, z_t) \sum_{s \in S} T(s, a_{t-1}, s')\, b_{t-1}(s)$$

Value function:
$$V^*(b) = \max_{a \in A} \Big[ R(b,a) + \gamma \sum_{z \in Z} \Pr(z|b,a)\, V^*(\tau(b,a,z)) \Big]$$
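For concreteness, here is a minimal NumPy sketch of the belief update above for a finite POMDP. The array layout (T indexed as [s, a, s'], O as [s', a, z]) is an illustrative choice, not something specified on the slides.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """One Bayes-rule belief update: b'(s') = eta * O(s',a,z) * sum_s T(s,a,s') b(s).

    b : (|S|,) current belief over states
    T : (|S|, |A|, |S|) transition probabilities, T[s, a, s2] = Pr(s2 | s, a)
    O : (|S|, |A|, |Z|) observation probabilities, O[s2, a, z] = Pr(z | s2, a)
    """
    predicted = b @ T[:, a, :]                 # sum_s T(s, a, s') b(s)
    unnormalized = O[:, a, z] * predicted      # multiply by the observation likelihood
    return unnormalized / unnormalized.sum()   # eta is the normalizing constant
```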


Bayesian Reinforcement Learning

General idea:
- Define prior distributions over all unknown parameters.
- Maintain posteriors via Bayes' rule as experience is acquired.
- Plan with respect to the posterior distribution over models.

This allows us to:
- Learn the system while efficiently achieving the task.
- Trade off exploration and exploitation optimally.
- Account for model uncertainty during planning.
- Include prior knowledge explicitly.


Bayesian RL in Finite MDPs

In finite MDPs with $T$ unknown ([Dearden 99], [Duff 02], [Poupart 06]):

To learn $T$: maintain counts $\phi^a_{ss'}$ of the number of times the transition $s \xrightarrow{a} s'$ has been observed, starting from a prior $\phi_0$.

The counts define a Dirichlet prior/posterior over $T$.

Planning according to $\phi$ is itself an MDP problem:
- $S'$: physical state ($s \in S$) + information state ($\phi$)
- $T'(s, \phi, a, s', \phi') = \Pr(s', \phi' \mid s, \phi, a) = \Pr(s' \mid s, \phi, a)\, \Pr(\phi' \mid \phi, s, a, s') = \dfrac{\phi^a_{ss'}}{\sum_{s'' \in S} \phi^a_{ss''}}\, I(\phi', \phi + \delta^a_{ss'})$

$$V^*(s, \phi) = \max_{a \in A} \Big[ R(s,a) + \gamma \sum_{s' \in S} \frac{\phi^a_{ss'}}{\sum_{s'' \in S} \phi^a_{ss''}}\, V^*(s', \phi + \delta^a_{ss'}) \Big]$$
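As a hedged illustration of the count-based posterior (my own sketch, not the authors' code), a tabular Dirichlet transition model could be maintained as follows; the symmetric prior value is an arbitrary choice.

```python
import numpy as np

class DirichletTransitionModel:
    """Counts phi[a, s, s'] defining a Dirichlet posterior over the unknown T."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        # phi_0: symmetric prior pseudo-counts (illustrative choice)
        self.phi = np.full((n_actions, n_states, n_states), prior_count)

    def update(self, s, a, s_next):
        # One observed transition s --a--> s' increments the matching count.
        self.phi[a, s, s_next] += 1.0

    def expected_transition(self, s, a):
        # Posterior mean Pr(s' | s, a) = phi^a_{ss'} / sum_{s''} phi^a_{ss''}
        counts = self.phi[a, s]
        return counts / counts.sum()
```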


Bayesian RL in Finite POMDPs

In finite POMDPs with $T, O$ unknown ([Ross 07]):

Let:
- $\phi^a_{ss'}$: number of times the transition $s \xrightarrow{a} s'$ has been observed.
- $\psi^a_{sz}$: number of times $z$ has been observed in $s$ after doing $a$.

Given the action-observation sequence, use Bayes' rule to maintain a belief over $(s, \phi, \psi)$.

⇒ Decision making under partial observability of $(s, \phi, \psi)$ is itself a POMDP:
- $S'$: physical state ($s \in S$) + information state ($\phi, \psi$)
- $P'(s, \phi, \psi, a, s', \phi', \psi', z) = \Pr(s', \phi', \psi', z \mid s, \phi, \psi, a)$
  $= \Pr(s' \mid s, \phi, a)\, \Pr(z \mid \psi, s', a)\, \Pr(\phi' \mid \phi, s, a, s')\, \Pr(\psi' \mid \psi, a, s', z)$
  $= \dfrac{\phi^a_{ss'}}{\sum_{s'' \in S} \phi^a_{ss''}}\, \dfrac{\psi^a_{s'z}}{\sum_{z' \in Z} \psi^a_{s'z'}}\, I(\phi', \phi + \delta^a_{ss'})\, I(\psi', \psi + \delta^a_{s'z})$
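The following sketch (my own, assuming small finite state and observation sets and hashable count tables) shows how the belief over hypotheses $(s, \phi, \psi)$ could be propagated exactly with the expression above.

```python
from collections import defaultdict

def increment(table, i, j):
    """Return a copy of a tuple-of-tuples count table with entry (i, j) incremented."""
    return tuple(
        tuple(c + 1 if (r == i and k == j) else c for k, c in enumerate(row))
        for r, row in enumerate(table)
    )

def bapomdp_belief_update(belief, a, z):
    """Exact belief update over hypotheses (s, phi, psi) after doing a and observing z.

    belief : dict {(s, phi, psi): prob}; phi[a][s][s'] and psi[a][s'][z] are
             nested tuples of counts (so hypotheses can be used as dict keys).
    """
    new_belief = defaultdict(float)
    for (s, phi, psi), p in belief.items():
        for s_next in range(len(phi[a])):
            t = phi[a][s][s_next] / sum(phi[a][s])        # E[Pr(s'|s,a)] under Dirichlet
            o = psi[a][s_next][z] / sum(psi[a][s_next])   # E[Pr(z|s',a)] under Dirichlet
            if t * o == 0.0:
                continue
            phi_next = phi[:a] + (increment(phi[a], s, s_next),) + phi[a + 1:]
            psi_next = psi[:a] + (increment(psi[a], s_next, z),) + psi[a + 1:]
            new_belief[(s_next, phi_next, psi_next)] += p * t * o
    total = sum(new_belief.values())
    return {h: q / total for h, q in new_belief.items()}
```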


Example

Tiger domain with unknown sensor accuracy. Suppose the prior is $\psi_0 = (5, 3)$ and $b_0 = (0.5, 0.5)$.

The action-observation sequence is: {(Listen, l), (Listen, l), (Listen, l), (Right, -)}

- $b_0$: $\Pr(L, \langle 5,3 \rangle) = \frac{1}{2}$, $\Pr(R, \langle 5,3 \rangle) = \frac{1}{2}$
- $b_1$: $\Pr(L, \langle 6,3 \rangle) = \frac{5}{8}$, $\Pr(R, \langle 5,4 \rangle) = \frac{3}{8}$
- $b_3$: $\Pr(L, \langle 8,3 \rangle) = \frac{7}{9}$, $\Pr(R, \langle 5,6 \rangle) = \frac{2}{9}$
- $b_4$: $\Pr(L, \langle 8,3 \rangle) = \frac{7}{18}$, $\Pr(L, \langle 5,6 \rangle) = \frac{2}{18}$, $\Pr(R, \langle 8,3 \rangle) = \frac{7}{18}$, $\Pr(R, \langle 5,6 \rangle) = \frac{2}{18}$
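A quick way to check the listening updates above is to run them by hand; the snippet below (an illustrative reconstruction with hypothetical helper names) reproduces $b_3$ exactly.

```python
from fractions import Fraction

# Hypotheses are (tiger_side, (correct, incorrect)) with a (5, 3) count prior over
# the listening accuracy and a uniform initial belief over {L, R}.
belief = {('L', (5, 3)): Fraction(1, 2), ('R', (5, 3)): Fraction(1, 2)}

def listen_hear_left(belief):
    """Bayes update after (Listen, l): hearing 'left' is a correct observation when
    the tiger is on the Left and an incorrect one when it is on the Right."""
    new_belief = {}
    for (side, (c, i)), p in belief.items():
        accuracy = Fraction(c, c + i)          # expected accuracy under the counts
        if side == 'L':
            new_belief[(side, (c + 1, i))] = p * accuracy
        else:
            new_belief[(side, (c, i + 1))] = p * (1 - accuracy)
    total = sum(new_belief.values())
    return {h: p / total for h, p in new_belief.items()}

for _ in range(3):
    belief = listen_hear_left(belief)
print(belief)  # {('L', (8, 3)): Fraction(7, 9), ('R', (5, 6)): Fraction(2, 9)}
```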

[Figure omitted: posterior densities over the sensor accuracy, showing $b_0(L,\cdot)$, $b_3(L,\cdot)$, $b_3(R,\cdot)$ and $b_4(L,\cdot)$.]


Continuous Domains

In robotics, continuous domains are common (continuous states, actions and observations).

We could discretize the problem and apply the current method, but:
- combinatorial explosion or poor precision;
- it can require lots of training data (every small cell must be visited).

Can we extend Bayesian RL to continuous domains?


Bayesian RL in Continuous Domains?

We can no longer use counts (Dirichlet distributions) to learn about the model.

Instead, we assume a parametric form for the transition and observation models.

For instance, in the Gaussian case: $S \subseteq \mathbb{R}^m$, $A \subseteq \mathbb{R}^n$, $Z \subseteq \mathbb{R}^p$, and
$$s_{t+1} = g_T(s_t, a_t, X_t), \qquad z_{t+1} = g_O(s_{t+1}, a_t, Y_t),$$
where $X_t \sim N(\mu_X, \Sigma_X)$, $Y_t \sim N(\mu_Y, \Sigma_Y)$, and $g_T, g_O$ are arbitrary (possibly non-linear) functions.

We assume $g_T, g_O$ are known, but that the parameters $\mu_X, \Sigma_X, \mu_Y, \Sigma_Y$ are unknown.

The relevant sufficient statistics depend on the parametric form.


Bayesian RL in Continuous Domains?

$\mu, \Sigma$ can be learned by maintaining the sample mean $\hat{\mu}$ and sample covariance $\hat{\Sigma}$.

These define a Normal-Wishart posterior over $(\mu, \Sigma)$:
$$\mu \mid \Sigma = R \sim N(\hat{\mu}, R/\nu), \qquad \Sigma^{-1} \sim \mathrm{Wishart}(\alpha, \tau^{-1})$$

where:
- $\nu$: number of observations behind $\hat{\mu}$
- $\alpha$: degrees of freedom of $\hat{\Sigma}$
- $\tau = \alpha \hat{\Sigma}$

These can be updated easily after an observation $X = x$:
$$\hat{\mu}' = \frac{\nu \hat{\mu} + x}{\nu + 1}, \qquad \nu' = \nu + 1, \qquad \alpha' = \alpha + 1, \qquad \tau' = \tau + \frac{\nu}{\nu + 1}(\hat{\mu} - x)(\hat{\mu} - x)^{\mathsf{T}}$$
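A small NumPy sketch of these recursions (my own transcription of the update rules above; the class name and accessor are illustrative):

```python
import numpy as np

class NormalWishartPosterior:
    """Hyperparameters (mu_hat, nu, alpha, tau) of the posterior over (mu, Sigma)."""

    def __init__(self, mu_hat, nu, alpha, tau):
        self.mu_hat = np.asarray(mu_hat, dtype=float)
        self.nu, self.alpha = float(nu), float(alpha)
        self.tau = np.asarray(tau, dtype=float)    # tau = alpha * Sigma_hat

    def update(self, x):
        """Incorporate one observation X = x using the recursions on this slide."""
        x = np.asarray(x, dtype=float)
        d = self.mu_hat - x
        self.tau = self.tau + (self.nu / (self.nu + 1.0)) * np.outer(d, d)   # tau'
        self.mu_hat = (self.nu * self.mu_hat + x) / (self.nu + 1.0)          # mu_hat'
        self.nu += 1.0                                                       # nu'
        self.alpha += 1.0                                                    # alpha'

    def sigma_hat(self):
        # Point estimate of the covariance, Sigma_hat = tau / alpha.
        return self.tau / self.alpha
```

Note that $\tau$ must be updated before $\hat{\mu}$, since its recursion uses the previous sample mean.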


Bayesian RL in Continuous POMDP

Let us define:
- $\phi = (\hat{\mu}_X, \nu_X, \alpha_X, \tau_X)$: the posterior over $(\mu_X, \Sigma_X)$
- $\psi = (\hat{\mu}_Y, \nu_Y, \alpha_Y, \tau_Y)$: the posterior over $(\mu_Y, \Sigma_Y)$
- $U$: the update function of $\phi, \psi$, i.e. $U(\phi, x) = \phi'$ and $U(\psi, y) = \psi'$

Bayes-Adaptive Continuous POMDP: $(S', A', Z', P', R')$
- $S' = S \times \mathbb{R}^{|X| + |X|^2 + 2} \times \mathbb{R}^{|Y| + |Y|^2 + 2}$
- $A' = A$
- $Z' = Z$
- $P'(s, \phi, \psi, a, s', \phi', \psi', z) = I(g_T(s,a,x), s')\, I(g_O(s',a,y), z)\, I(\phi', U(\phi, x))\, I(\psi', U(\psi, y))\, f_{X|\phi}(x)\, f_{Y|\psi}(y)$
- $R'(s, \phi, \psi, a) = R(s, a)$

where $x = (\nu_X + 1)\hat{\mu}'_X - \nu_X \hat{\mu}_X$ and $y = (\nu_Y + 1)\hat{\mu}'_Y - \nu_Y \hat{\mu}_Y$.


Bayesian RL in Continuous POMDP

Monte Carlo belief monitoring (one extra assumption: $g_O(s, a, \cdot)$ is a one-to-one transformation of $Y$):

1. Sample $(s, \phi, \psi) \sim b_t$.
2. Sample $(\mu_X, \Sigma_X) \sim NW(\phi)$.
3. Sample $X \sim N(\mu_X, \Sigma_X)$.
4. Compute $s' = g_T(s, a_t, X)$.
5. Find the unique $Y$ such that $z_{t+1} = g_O(s', a_t, Y)$.
6. Compute $\phi' = U(\phi, X)$ and $\psi' = U(\psi, Y)$.
7. Sample $(\mu_Y, \Sigma_Y) \sim NW(\psi)$.
8. Add weight $f(Y \mid \mu_Y, \Sigma_Y)$ to particle $b_{t+1}(s', \phi', \psi')$.
9. Repeat until $K$ particles are in $b_{t+1}$.
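These steps map naturally onto a particle filter. The sketch below is one possible reading, assuming (as in the experiments later) the additive observation model $g_O(s', a, Y) = s' + Y$, so that step 5 has the closed-form solution $Y = z - s'$; the SciPy sampling calls are my choices, not the authors'.

```python
import numpy as np
from scipy.stats import multivariate_normal, wishart

def sample_normal_wishart(mu_hat, nu, alpha, tau, rng):
    """Draw (mu, Sigma) from the Normal-Wishart posterior of the previous slide."""
    precision = wishart(df=alpha, scale=np.linalg.inv(tau)).rvs(random_state=rng)
    Sigma = np.linalg.inv(precision)
    mu = rng.multivariate_normal(mu_hat, Sigma / nu)
    return mu, Sigma

def monte_carlo_belief_update(particles, a, z, g_T, U, K, rng):
    """Steps 1-9 above, assuming g_O(s', a, Y) = s' + Y (so Y = z - s' in step 5).

    particles : list of ((s, phi, psi), weight) pairs approximating b_t,
                with phi and psi given as (mu_hat, nu, alpha, tau) tuples.
    U         : hyperparameter update function, U(phi, x) -> phi'.
    """
    hyps, weights = zip(*particles)
    probs = np.asarray(weights, dtype=float) / np.sum(weights)
    new_particles = []
    for _ in range(K):                                          # step 9: K particles
        s, phi, psi = hyps[rng.choice(len(hyps), p=probs)]      # step 1
        mu_X, Sigma_X = sample_normal_wishart(*phi, rng)        # step 2
        x = rng.multivariate_normal(mu_X, Sigma_X)              # step 3
        s_next = g_T(s, a, x)                                   # step 4
        y = z - s_next                                          # step 5 (additive g_O)
        phi_next, psi_next = U(phi, x), U(psi, y)               # step 6
        mu_Y, Sigma_Y = sample_normal_wishart(*psi, rng)        # step 7
        w = multivariate_normal(mu_Y, Sigma_Y).pdf(y)           # step 8: particle weight
        new_particles.append(((s_next, phi_next, psi_next), w))
    return new_particles
```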


Bayesian RL in Continuous POMDP

Monte Carlo online planning (receding horizon control):

[Figure omitted: lookahead search tree expanding actions $a_1, a_2, \ldots, a_n$ and observations $o_1, o_2, \ldots, o_n$ from the current belief $b_0$, producing successor beliefs $b_1, b_2, b_3, \ldots$]
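The slide only depicts the search tree, so the following is merely a hedged sketch of how such a receding-horizon lookahead could be organized under my own assumptions: a finite candidate set of actions, a few sampled observations per action, and recursion to a fixed depth.

```python
import numpy as np

def lookahead(belief, depth, actions, simulate, expected_reward, gamma, n_obs, rng):
    """Depth-limited Monte Carlo lookahead returning (best_action, value).

    simulate(belief, a, rng) -> (z, next_belief): samples an observation from the
        belief and returns the Monte Carlo-updated belief (previous slide).
    expected_reward(belief, a): expected immediate reward under the belief.
    """
    if depth == 0:
        return None, 0.0
    best_a, best_v = None, -np.inf
    for a in actions:
        future = 0.0
        for _ in range(n_obs):
            _, next_belief = simulate(belief, a, rng)
            future += lookahead(next_belief, depth - 1, actions, simulate,
                                expected_reward, gamma, n_obs, rng)[1]
        value = expected_reward(belief, a) + gamma * future / n_obs
        if value > best_v:
            best_a, best_v = a, value
    return best_a, best_v
```

In a receding-horizon loop, only the root action is executed; once the real observation arrives, the belief is updated and the lookahead is repeated from the new belief.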


Experiments

Simple robot navigation task:
- $S$: $(x, y)$ position
- $A$: $(v, \theta)$, with velocity $v \in [0, 1]$ and angle $\theta \in [0, 2\pi]$
- $Z$: noisy $(x, y)$ position

$$g_T(s, a, X) = s + v \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} X, \qquad g_O(s', a, Y) = s' + Y$$

$$R(s, a) = I(\|s - s_{\mathrm{GOAL}}\|_2 < 0.25), \qquad \gamma = 0.85$$


Robot Navigation Task

We choose the exact parameters:
$$\mu_X = \begin{bmatrix} 0.8 \\ 0.3 \end{bmatrix}, \quad \Sigma_X = \begin{bmatrix} 0.04 & -0.01 \\ -0.01 & 0.01 \end{bmatrix}, \quad \mu_Y = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \Sigma_Y = \begin{bmatrix} 0.01 & 0 \\ 0 & 0.01 \end{bmatrix}$$

and start with a prior based on 10 "artificial" samples:
$$\hat{\mu}_X = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \hat{\Sigma}_X = \begin{bmatrix} 0.04 & -0.01 \\ -0.01 & 0.16 \end{bmatrix}, \quad \hat{\mu}_Y = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \hat{\Sigma}_Y = \begin{bmatrix} 0.16 & 0 \\ 0 & 0.16 \end{bmatrix}$$

such that $\phi_0 = (\hat{\mu}_X, 10, 9, 9\hat{\Sigma}_X)$ and $\psi_0 = (\hat{\mu}_Y, 10, 9, 9\hat{\Sigma}_Y)$.

Each time the robot reaches the goal, a new goal is chosen at random at a distance of 5 units from the previous goal.
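To make the setup concrete, a minimal simulator of this task using the exact parameters and the prior above might look as follows (my own sketch; function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Exact (hidden) noise parameters and prior statistics from this slide.
mu_X, Sigma_X = np.array([0.8, 0.3]), np.array([[0.04, -0.01], [-0.01, 0.01]])
mu_Y, Sigma_Y = np.zeros(2), 0.01 * np.eye(2)
mu_X_hat, Sigma_X_hat = np.array([1.0, 0.0]), np.array([[0.04, -0.01], [-0.01, 0.16]])
mu_Y_hat, Sigma_Y_hat = np.zeros(2), 0.16 * np.eye(2)
phi_0 = (mu_X_hat, 10.0, 9.0, 9.0 * Sigma_X_hat)   # (mu_hat, nu, alpha, tau)
psi_0 = (mu_Y_hat, 10.0, 9.0, 9.0 * Sigma_Y_hat)

def g_T(s, a, x):
    """s' = s + v * Rot(theta) * x, with a = (v, theta)."""
    v, theta = a
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return s + v * rot @ x

def step(s, a, goal):
    """One simulated environment step: returns (s', z, r)."""
    s_next = g_T(s, a, rng.multivariate_normal(mu_X, Sigma_X))
    z = s_next + rng.multivariate_normal(mu_Y, Sigma_Y)   # g_O(s', a, Y) = s' + Y
    r = float(np.linalg.norm(s_next - goal) < 0.25)
    return s_next, z, r
```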


Robot Navigation Task

Average evolution of the return over time:

[Figure omitted: average return vs. training steps (0 to 250), comparing the prior model, the exact model, and the learning agent.]


Robot Navigation Task

Average accuracy of the model over time:

[Figure omitted: weighted L1 model error (WL1) vs. training steps (0 to 250).]

Model accuracy is measured as follows:
$$WL1(b) = \sum_{(s, \phi, \psi)} b(s, \phi, \psi) \big[ \|\mu_\phi - \mu_X\|_1 + \|\Sigma_\phi - \Sigma_X\|_1 + \|\mu_\psi - \mu_Y\|_1 + \|\Sigma_\psi - \Sigma_Y\|_1 \big]$$
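A sketch of how WL1 could be computed over a weighted particle belief; interpreting $\|\cdot\|_1$ on matrices as the entrywise sum of absolute values is an assumption on my part, as is taking $\tau/\alpha$ as the covariance estimate of each particle.

```python
import numpy as np

def wl1(particles, mu_X, Sigma_X, mu_Y, Sigma_Y):
    """Weighted L1 model error over particles [((s, phi, psi), weight), ...],
    where phi = (mu_hat, nu, alpha, tau), so the covariance estimate is tau / alpha."""
    total_w = sum(w for _, w in particles)
    error = 0.0
    for (s, phi, psi), w in particles:
        mu_phi, _, alpha_X, tau_X = phi
        mu_psi, _, alpha_Y, tau_Y = psi
        error += (w / total_w) * (
            np.abs(mu_phi - mu_X).sum() + np.abs(tau_X / alpha_X - Sigma_X).sum()
            + np.abs(mu_psi - mu_Y).sum() + np.abs(tau_Y / alpha_Y - Sigma_Y).sum()
        )
    return error
```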


Conclusion

We presented a new framework for planning in partially observable and continuous domains with uncertain model parameters.

The optimal policy maximizes the long-term return given the prior over model parameters.

Monte Carlo methods can be used for more tractable approximate belief monitoring and planning.

Interesting future applications include human-computer interaction and robotics.
