
Bayesian Reinforcement Learning in Continuous POMDPs

Stéphane Ross¹, Brahim Chaib-draa² and Joelle Pineau¹

¹School of Computer Science, McGill University, Canada
²Department of Computer Science, Laval University, Canada

December 19th, 2007



Motivation

How should robots make decisions when:
- the environment is partially observable and continuous,
- the model of sensors and actuators is poor,
- parts of the model have to be learned entirely during execution (e.g. users' preferences/behavior),

so as to maximize expected long-term rewards?

Typical Examples :

[Figure omitted: example robot applications; image credit: Rottmann]

Solution: Bayesian Reinforcement Learning!


Partially Observable Markov Decision Processes

POMDP: $(S, A, T, R, Z, O, \gamma, b_0)$

- $S$: set of states
- $A$: set of actions
- $T(s,a,s') = \Pr(s'|s,a)$: transition probabilities
- $R(s,a) \in \mathbb{R}$: immediate rewards
- $Z$: set of observations
- $O(s',a,z) = \Pr(z|s',a)$: observation probabilities
- $\gamma$: discount factor
- $b_0$: initial state distribution

Belief monitoring via Bayes' rule:
$$b_t(s') = \eta\, O(s', a_{t-1}, z_t) \sum_{s \in S} T(s, a_{t-1}, s')\, b_{t-1}(s)$$

Value function:
$$V^*(b) = \max_{a \in A} \Big[ R(b,a) + \gamma \sum_{z \in Z} \Pr(z|b,a)\, V^*(\tau(b,a,z)) \Big]$$
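For concreteness, here is a minimal NumPy sketch of the belief update above for a finite POMDP. The array layout (T indexed as [s, a, s'], O as [s', a, z]) is an illustrative choice, not something specified on the slides.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """One Bayes-rule belief update: b'(s') = eta * O(s',a,z) * sum_s T(s,a,s') b(s).

    b : (|S|,) current belief over states
    T : (|S|, |A|, |S|) transition probabilities, T[s, a, s2] = Pr(s2 | s, a)
    O : (|S|, |A|, |Z|) observation probabilities, O[s2, a, z] = Pr(z | s2, a)
    """
    predicted = b @ T[:, a, :]                 # sum_s T(s, a, s') b(s)
    unnormalized = O[:, a, z] * predicted      # multiply by the observation likelihood
    return unnormalized / unnormalized.sum()   # eta is the normalizing constant
```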


Bayesian Reinforcement Learning

General idea:
- Define prior distributions over all unknown parameters.
- Maintain posteriors via Bayes' rule as experience is acquired.
- Plan with respect to the posterior distribution over models.

This allows us to:
- Learn the system while efficiently achieving the task.
- Trade off exploration and exploitation optimally.
- Account for model uncertainty during planning.
- Include prior knowledge explicitly.


Bayesian RL in Finite MDPs

In finite MDPs with $T$ unknown ([Dearden 99], [Duff 02], [Poupart 06]):

To learn $T$: maintain counts $\phi^a_{ss'}$ of the number of times the transition $s \xrightarrow{a} s'$ has been observed, starting from a prior $\phi_0$.

The counts define a Dirichlet prior/posterior over $T$.

Planning according to $\phi$ is itself an MDP problem:
- $S'$: physical state ($s \in S$) + information state ($\phi$)
- $T'(s, \phi, a, s', \phi') = \Pr(s', \phi' \mid s, \phi, a) = \Pr(s' \mid s, \phi, a)\, \Pr(\phi' \mid \phi, s, a, s') = \dfrac{\phi^a_{ss'}}{\sum_{s'' \in S} \phi^a_{ss''}}\, I(\phi', \phi + \delta^a_{ss'})$

$$V^*(s, \phi) = \max_{a \in A} \Big[ R(s,a) + \gamma \sum_{s' \in S} \frac{\phi^a_{ss'}}{\sum_{s'' \in S} \phi^a_{ss''}}\, V^*(s', \phi + \delta^a_{ss'}) \Big]$$
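As a hedged illustration of the count-based posterior (my own sketch, not the authors' code), a tabular Dirichlet transition model could be maintained as follows; the symmetric prior value is an arbitrary choice.

```python
import numpy as np

class DirichletTransitionModel:
    """Counts phi[a, s, s'] defining a Dirichlet posterior over the unknown T."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        # phi_0: symmetric prior pseudo-counts (illustrative choice)
        self.phi = np.full((n_actions, n_states, n_states), prior_count)

    def update(self, s, a, s_next):
        # One observed transition s --a--> s' increments the matching count.
        self.phi[a, s, s_next] += 1.0

    def expected_transition(self, s, a):
        # Posterior mean Pr(s' | s, a) = phi^a_{ss'} / sum_{s''} phi^a_{ss''}
        counts = self.phi[a, s]
        return counts / counts.sum()
```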


Bayesian RL in Finite POMDPs

In finite POMDPs with $T, O$ unknown ([Ross 07]):

Let:
- $\phi^a_{ss'}$: number of times the transition $s \xrightarrow{a} s'$ has been observed.
- $\psi^a_{sz}$: number of times $z$ has been observed in $s$ after doing $a$.

Given the action-observation sequence, use Bayes' rule to maintain a belief over $(s, \phi, \psi)$.

⇒ Decision making under partial observability of $(s, \phi, \psi)$ is itself a POMDP:
- $S'$: physical state ($s \in S$) + information state ($\phi, \psi$)
- $P'(s, \phi, \psi, a, s', \phi', \psi', z) = \Pr(s', \phi', \psi', z \mid s, \phi, \psi, a)$
  $= \Pr(s' \mid s, \phi, a)\, \Pr(z \mid \psi, s', a)\, \Pr(\phi' \mid \phi, s, a, s')\, \Pr(\psi' \mid \psi, a, s', z)$
  $= \dfrac{\phi^a_{ss'}}{\sum_{s'' \in S} \phi^a_{ss''}}\, \dfrac{\psi^a_{s'z}}{\sum_{z' \in Z} \psi^a_{s'z'}}\, I(\phi', \phi + \delta^a_{ss'})\, I(\psi', \psi + \delta^a_{s'z})$
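The following sketch (my own, assuming small finite state and observation sets and hashable count tables) shows how the belief over hypotheses $(s, \phi, \psi)$ could be propagated exactly with the expression above.

```python
from collections import defaultdict

def increment(table, i, j):
    """Return a copy of a tuple-of-tuples count table with entry (i, j) incremented."""
    return tuple(
        tuple(c + 1 if (r == i and k == j) else c for k, c in enumerate(row))
        for r, row in enumerate(table)
    )

def bapomdp_belief_update(belief, a, z):
    """Exact belief update over hypotheses (s, phi, psi) after doing a and observing z.

    belief : dict {(s, phi, psi): prob}; phi[a][s][s'] and psi[a][s'][z] are
             nested tuples of counts (so hypotheses can be used as dict keys).
    """
    new_belief = defaultdict(float)
    for (s, phi, psi), p in belief.items():
        for s_next in range(len(phi[a])):
            t = phi[a][s][s_next] / sum(phi[a][s])        # E[Pr(s'|s,a)] under Dirichlet
            o = psi[a][s_next][z] / sum(psi[a][s_next])   # E[Pr(z|s',a)] under Dirichlet
            if t * o == 0.0:
                continue
            phi_next = phi[:a] + (increment(phi[a], s, s_next),) + phi[a + 1:]
            psi_next = psi[:a] + (increment(psi[a], s_next, z),) + psi[a + 1:]
            new_belief[(s_next, phi_next, psi_next)] += p * t * o
    total = sum(new_belief.values())
    return {h: q / total for h, q in new_belief.items()}
```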


Example

Tiger domain with unknown sensor accuracy. Suppose the prior is $\psi_0 = (5, 3)$ and $b_0 = (0.5, 0.5)$.

The action-observation sequence is: {(Listen, l), (Listen, l), (Listen, l), (Right, -)}

- $b_0$: $\Pr(L, \langle 5,3 \rangle) = \frac{1}{2}$, $\Pr(R, \langle 5,3 \rangle) = \frac{1}{2}$
- $b_1$: $\Pr(L, \langle 6,3 \rangle) = \frac{5}{8}$, $\Pr(R, \langle 5,4 \rangle) = \frac{3}{8}$
- $b_3$: $\Pr(L, \langle 8,3 \rangle) = \frac{7}{9}$, $\Pr(R, \langle 5,6 \rangle) = \frac{2}{9}$
- $b_4$: $\Pr(L, \langle 8,3 \rangle) = \frac{7}{18}$, $\Pr(L, \langle 5,6 \rangle) = \frac{2}{18}$, $\Pr(R, \langle 8,3 \rangle) = \frac{7}{18}$, $\Pr(R, \langle 5,6 \rangle) = \frac{2}{18}$
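A quick way to check the listening updates above is to run them by hand; the snippet below (an illustrative reconstruction with hypothetical helper names) reproduces $b_3$ exactly.

```python
from fractions import Fraction

# Hypotheses are (tiger_side, (correct, incorrect)) with a (5, 3) count prior over
# the listening accuracy and a uniform initial belief over {L, R}.
belief = {('L', (5, 3)): Fraction(1, 2), ('R', (5, 3)): Fraction(1, 2)}

def listen_hear_left(belief):
    """Bayes update after (Listen, l): hearing 'left' is a correct observation when
    the tiger is on the Left and an incorrect one when it is on the Right."""
    new_belief = {}
    for (side, (c, i)), p in belief.items():
        accuracy = Fraction(c, c + i)          # expected accuracy under the counts
        if side == 'L':
            new_belief[(side, (c + 1, i))] = p * accuracy
        else:
            new_belief[(side, (c, i + 1))] = p * (1 - accuracy)
    total = sum(new_belief.values())
    return {h: p / total for h, p in new_belief.items()}

for _ in range(3):
    belief = listen_hear_left(belief)
print(belief)  # {('L', (8, 3)): Fraction(7, 9), ('R', (5, 6)): Fraction(2, 9)}
```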

[Figure omitted: posterior densities over the sensor accuracy, showing $b_0(L,\cdot)$, $b_3(L,\cdot)$, $b_3(R,\cdot)$ and $b_4(L,\cdot)$.]


Continuous Domains

In robotics, continuous domains are common (continuous states, actions and observations).

We could discretize the problem and apply the current method, but:
- combinatorial explosion or poor precision;
- it can require lots of training data (every small cell must be visited).

Can we extend Bayesian RL to continuous domains?


Bayesian RL in Continuous Domains?

We can no longer use counts (Dirichlet distributions) to learn about the model.

Instead, we assume a parametric form for the transition and observation models.

For instance, in the Gaussian case: $S \subseteq \mathbb{R}^m$, $A \subseteq \mathbb{R}^n$, $Z \subseteq \mathbb{R}^p$, and
$$s_{t+1} = g_T(s_t, a_t, X_t), \qquad z_{t+1} = g_O(s_{t+1}, a_t, Y_t),$$
where $X_t \sim N(\mu_X, \Sigma_X)$, $Y_t \sim N(\mu_Y, \Sigma_Y)$, and $g_T, g_O$ are arbitrary (possibly non-linear) functions.

We assume $g_T, g_O$ are known, but that the parameters $\mu_X, \Sigma_X, \mu_Y, \Sigma_Y$ are unknown.

The relevant sufficient statistics depend on the parametric form.


Bayesian RL in Continuous Domains?

$\mu, \Sigma$ can be learned by maintaining the sample mean $\hat{\mu}$ and sample covariance $\hat{\Sigma}$.

These define a Normal-Wishart posterior over $(\mu, \Sigma)$:
$$\mu \mid \Sigma = R \sim N(\hat{\mu}, R/\nu), \qquad \Sigma^{-1} \sim \mathrm{Wishart}(\alpha, \tau^{-1})$$

where:
- $\nu$: number of observations behind $\hat{\mu}$
- $\alpha$: degrees of freedom of $\hat{\Sigma}$
- $\tau = \alpha \hat{\Sigma}$

These can be updated easily after an observation $X = x$:
$$\hat{\mu}' = \frac{\nu \hat{\mu} + x}{\nu + 1}, \qquad \nu' = \nu + 1, \qquad \alpha' = \alpha + 1, \qquad \tau' = \tau + \frac{\nu}{\nu + 1}(\hat{\mu} - x)(\hat{\mu} - x)^{\mathsf{T}}$$
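A small NumPy sketch of these recursions (my own transcription of the update rules above; the class name and accessor are illustrative):

```python
import numpy as np

class NormalWishartPosterior:
    """Hyperparameters (mu_hat, nu, alpha, tau) of the posterior over (mu, Sigma)."""

    def __init__(self, mu_hat, nu, alpha, tau):
        self.mu_hat = np.asarray(mu_hat, dtype=float)
        self.nu, self.alpha = float(nu), float(alpha)
        self.tau = np.asarray(tau, dtype=float)    # tau = alpha * Sigma_hat

    def update(self, x):
        """Incorporate one observation X = x using the recursions on this slide."""
        x = np.asarray(x, dtype=float)
        d = self.mu_hat - x
        self.tau = self.tau + (self.nu / (self.nu + 1.0)) * np.outer(d, d)   # tau'
        self.mu_hat = (self.nu * self.mu_hat + x) / (self.nu + 1.0)          # mu_hat'
        self.nu += 1.0                                                       # nu'
        self.alpha += 1.0                                                    # alpha'

    def sigma_hat(self):
        # Point estimate of the covariance, Sigma_hat = tau / alpha.
        return self.tau / self.alpha
```

Note that $\tau$ must be updated before $\hat{\mu}$, since its recursion uses the previous sample mean.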


Bayesian RL in Continuous POMDP

Let us define:
- $\phi = (\hat{\mu}_X, \nu_X, \alpha_X, \tau_X)$: the posterior over $(\mu_X, \Sigma_X)$
- $\psi = (\hat{\mu}_Y, \nu_Y, \alpha_Y, \tau_Y)$: the posterior over $(\mu_Y, \Sigma_Y)$
- $U$: the update function of $\phi, \psi$, i.e. $U(\phi, x) = \phi'$ and $U(\psi, y) = \psi'$

Bayes-Adaptive Continuous POMDP: $(S', A', Z', P', R')$
- $S' = S \times \mathbb{R}^{|X| + |X|^2 + 2} \times \mathbb{R}^{|Y| + |Y|^2 + 2}$
- $A' = A$
- $Z' = Z$
- $P'(s, \phi, \psi, a, s', \phi', \psi', z) = I(g_T(s,a,x), s')\, I(g_O(s',a,y), z)\, I(\phi', U(\phi, x))\, I(\psi', U(\psi, y))\, f_{X|\phi}(x)\, f_{Y|\psi}(y)$
- $R'(s, \phi, \psi, a) = R(s, a)$

where $x = (\nu_X + 1)\hat{\mu}'_X - \nu_X \hat{\mu}_X$ and $y = (\nu_Y + 1)\hat{\mu}'_Y - \nu_Y \hat{\mu}_Y$.


Bayesian RL in Continuous POMDP

Monte Carlo belief monitoring (one extra assumption: $g_O(s, a, \cdot)$ is a one-to-one transformation of $Y$):

1. Sample $(s, \phi, \psi) \sim b_t$.
2. Sample $(\mu_X, \Sigma_X) \sim NW(\phi)$.
3. Sample $X \sim N(\mu_X, \Sigma_X)$.
4. Compute $s' = g_T(s, a_t, X)$.
5. Find the unique $Y$ such that $z_{t+1} = g_O(s', a_t, Y)$.
6. Compute $\phi' = U(\phi, X)$ and $\psi' = U(\psi, Y)$.
7. Sample $(\mu_Y, \Sigma_Y) \sim NW(\psi)$.
8. Add weight $f(Y \mid \mu_Y, \Sigma_Y)$ to particle $b_{t+1}(s', \phi', \psi')$.
9. Repeat until $K$ particles are in $b_{t+1}$.
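These steps map naturally onto a particle filter. The sketch below is one possible reading, assuming (as in the experiments later) the additive observation model $g_O(s', a, Y) = s' + Y$, so that step 5 has the closed-form solution $Y = z - s'$; the SciPy sampling calls are my choices, not the authors'.

```python
import numpy as np
from scipy.stats import multivariate_normal, wishart

def sample_normal_wishart(mu_hat, nu, alpha, tau, rng):
    """Draw (mu, Sigma) from the Normal-Wishart posterior of the previous slide."""
    precision = wishart(df=alpha, scale=np.linalg.inv(tau)).rvs(random_state=rng)
    Sigma = np.linalg.inv(precision)
    mu = rng.multivariate_normal(mu_hat, Sigma / nu)
    return mu, Sigma

def monte_carlo_belief_update(particles, a, z, g_T, U, K, rng):
    """Steps 1-9 above, assuming g_O(s', a, Y) = s' + Y (so Y = z - s' in step 5).

    particles : list of ((s, phi, psi), weight) pairs approximating b_t,
                with phi and psi given as (mu_hat, nu, alpha, tau) tuples.
    U         : hyperparameter update function, U(phi, x) -> phi'.
    """
    hyps, weights = zip(*particles)
    probs = np.asarray(weights, dtype=float) / np.sum(weights)
    new_particles = []
    for _ in range(K):                                          # step 9: K particles
        s, phi, psi = hyps[rng.choice(len(hyps), p=probs)]      # step 1
        mu_X, Sigma_X = sample_normal_wishart(*phi, rng)        # step 2
        x = rng.multivariate_normal(mu_X, Sigma_X)              # step 3
        s_next = g_T(s, a, x)                                   # step 4
        y = z - s_next                                          # step 5 (additive g_O)
        phi_next, psi_next = U(phi, x), U(psi, y)               # step 6
        mu_Y, Sigma_Y = sample_normal_wishart(*psi, rng)        # step 7
        w = multivariate_normal(mu_Y, Sigma_Y).pdf(y)           # step 8: particle weight
        new_particles.append(((s_next, phi_next, psi_next), w))
    return new_particles
```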


Bayesian RL in Continuous POMDP

Monte Carlo online planning (receding horizon control):

[Figure omitted: lookahead search tree expanding actions $a_1, a_2, \ldots, a_n$ and observations $o_1, o_2, \ldots, o_n$ from the current belief $b_0$, producing successor beliefs $b_1, b_2, b_3, \ldots$]
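The slide only depicts the search tree, so the following is merely a hedged sketch of how such a receding-horizon lookahead could be organized under my own assumptions: a finite candidate set of actions, a few sampled observations per action, and recursion to a fixed depth.

```python
import numpy as np

def lookahead(belief, depth, actions, simulate, expected_reward, gamma, n_obs, rng):
    """Depth-limited Monte Carlo lookahead returning (best_action, value).

    simulate(belief, a, rng) -> (z, next_belief): samples an observation from the
        belief and returns the Monte Carlo-updated belief (previous slide).
    expected_reward(belief, a): expected immediate reward under the belief.
    """
    if depth == 0:
        return None, 0.0
    best_a, best_v = None, -np.inf
    for a in actions:
        future = 0.0
        for _ in range(n_obs):
            _, next_belief = simulate(belief, a, rng)
            future += lookahead(next_belief, depth - 1, actions, simulate,
                                expected_reward, gamma, n_obs, rng)[1]
        value = expected_reward(belief, a) + gamma * future / n_obs
        if value > best_v:
            best_a, best_v = a, value
    return best_a, best_v
```

In a receding-horizon loop, only the root action is executed; once the real observation arrives, the belief is updated and the lookahead is repeated from the new belief.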


Experiments

Simple robot navigation task:
- $S$: $(x, y)$ position
- $A$: $(v, \theta)$, with velocity $v \in [0, 1]$ and angle $\theta \in [0, 2\pi]$
- $Z$: noisy $(x, y)$ position

$$g_T(s, a, X) = s + v \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} X, \qquad g_O(s', a, Y) = s' + Y$$

$$R(s, a) = I(\|s - s_{\mathrm{GOAL}}\|_2 < 0.25), \qquad \gamma = 0.85$$


Robot Navigation Task

We choose the exact parameters:
$$\mu_X = \begin{bmatrix} 0.8 \\ 0.3 \end{bmatrix}, \quad \Sigma_X = \begin{bmatrix} 0.04 & -0.01 \\ -0.01 & 0.01 \end{bmatrix}, \quad \mu_Y = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \Sigma_Y = \begin{bmatrix} 0.01 & 0 \\ 0 & 0.01 \end{bmatrix}$$

and start with a prior based on 10 "artificial" samples:
$$\hat{\mu}_X = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \hat{\Sigma}_X = \begin{bmatrix} 0.04 & -0.01 \\ -0.01 & 0.16 \end{bmatrix}, \quad \hat{\mu}_Y = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \hat{\Sigma}_Y = \begin{bmatrix} 0.16 & 0 \\ 0 & 0.16 \end{bmatrix}$$

such that $\phi_0 = (\hat{\mu}_X, 10, 9, 9\hat{\Sigma}_X)$ and $\psi_0 = (\hat{\mu}_Y, 10, 9, 9\hat{\Sigma}_Y)$.

Each time the robot reaches the goal, a new goal is chosen at random at a distance of 5 units from the previous goal.
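To make the setup concrete, a minimal simulator of this task using the exact parameters and the prior above might look as follows (my own sketch; function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Exact (hidden) noise parameters and prior statistics from this slide.
mu_X, Sigma_X = np.array([0.8, 0.3]), np.array([[0.04, -0.01], [-0.01, 0.01]])
mu_Y, Sigma_Y = np.zeros(2), 0.01 * np.eye(2)
mu_X_hat, Sigma_X_hat = np.array([1.0, 0.0]), np.array([[0.04, -0.01], [-0.01, 0.16]])
mu_Y_hat, Sigma_Y_hat = np.zeros(2), 0.16 * np.eye(2)
phi_0 = (mu_X_hat, 10.0, 9.0, 9.0 * Sigma_X_hat)   # (mu_hat, nu, alpha, tau)
psi_0 = (mu_Y_hat, 10.0, 9.0, 9.0 * Sigma_Y_hat)

def g_T(s, a, x):
    """s' = s + v * Rot(theta) * x, with a = (v, theta)."""
    v, theta = a
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return s + v * rot @ x

def step(s, a, goal):
    """One simulated environment step: returns (s', z, r)."""
    s_next = g_T(s, a, rng.multivariate_normal(mu_X, Sigma_X))
    z = s_next + rng.multivariate_normal(mu_Y, Sigma_Y)   # g_O(s', a, Y) = s' + Y
    r = float(np.linalg.norm(s_next - goal) < 0.25)
    return s_next, z, r
```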


Robot Navigation Task

Average evolution of the return over time:

[Figure omitted: average return vs. training steps (0 to 250), comparing the prior model, the exact model, and the learning agent.]


Robot Navigation Task

Average accuracy of the model over time:

[Figure omitted: weighted L1 model error (WL1) vs. training steps (0 to 250).]

Model accuracy is measured as follows:
$$WL1(b) = \sum_{(s, \phi, \psi)} b(s, \phi, \psi) \big[ \|\mu_\phi - \mu_X\|_1 + \|\Sigma_\phi - \Sigma_X\|_1 + \|\mu_\psi - \mu_Y\|_1 + \|\Sigma_\psi - \Sigma_Y\|_1 \big]$$
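A sketch of how WL1 could be computed over a weighted particle belief; interpreting $\|\cdot\|_1$ on matrices as the entrywise sum of absolute values is an assumption on my part, as is taking $\tau/\alpha$ as the covariance estimate of each particle.

```python
import numpy as np

def wl1(particles, mu_X, Sigma_X, mu_Y, Sigma_Y):
    """Weighted L1 model error over particles [((s, phi, psi), weight), ...],
    where phi = (mu_hat, nu, alpha, tau), so the covariance estimate is tau / alpha."""
    total_w = sum(w for _, w in particles)
    error = 0.0
    for (s, phi, psi), w in particles:
        mu_phi, _, alpha_X, tau_X = phi
        mu_psi, _, alpha_Y, tau_Y = psi
        error += (w / total_w) * (
            np.abs(mu_phi - mu_X).sum() + np.abs(tau_X / alpha_X - Sigma_X).sum()
            + np.abs(mu_psi - mu_Y).sum() + np.abs(tau_Y / alpha_Y - Sigma_Y).sum()
        )
    return error
```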


Conclusion

We presented a new framework for planning in partially observable and continuous domains with uncertain model parameters.

The optimal policy maximizes the long-term return given the prior over model parameters.

Monte Carlo methods can be used for more tractable approximate belief monitoring and planning.

Interesting future applications include human-computer interaction and robotics.
