Bayesian Reinforcement Learning in Continuous POMDPs


Transcript of Bayesian Reinforcement Learning in Continuous POMDPs

  • Bayesian Reinforcement Learning in Continuous POMDPs

    Stéphane Ross¹, Brahim Chaib-draa² and Joelle Pineau¹

    ¹School of Computer Science, McGill University, Canada
    ²Department of Computer Science, Laval University, Canada

    May 23rd, 2008


  • Motivation

    Robots have to make decisions under:
    - Imperfect actuators
    - Noisy sensors
    - Poor/approximate model

    How to maximize long-term rewards? [Rottmann]


  • Continuous POMDP

    States: S ⊆ ℝᵐ
    Actions: A ⊆ ℝⁿ
    Observations: Z ⊆ ℝᵖ
    Rewards: R(s, a) ∈ ℝ

    Gaussian model for the transition/observation functions:

    s_t = g_T(s_{t−1}, a_{t−1}, X_t),   X_t ∼ N(µ_X, Σ_X)
    z_t = g_O(s_t, a_{t−1}, Y_t),       Y_t ∼ N(µ_Y, Σ_Y)

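    A minimal sketch of how samples can be drawn from this generative model; the names g_T and g_O follow the slide, while the signature and everything else here is an illustrative assumption:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_step(s, a, g_T, g_O, mu_X, Sigma_X, mu_Y, Sigma_Y):
        """Draw one (s_t, z_t) pair: Gaussian noise pushed through g_T and g_O."""
        X = rng.multivariate_normal(mu_X, Sigma_X)   # transition noise X_t
        s_next = g_T(s, a, X)
        Y = rng.multivariate_normal(mu_Y, Sigma_Y)   # observation noise Y_t
        z = g_O(s_next, a, Y)
        return s_next, z
    ```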

  • Example

    Simple Robot Navigation Task:

    [x′]   [x]       [cos θ  −sin θ] [X₁]
    [y′] = [y] + v · [sin θ   cos θ] [X₂]

    [z_x]   [x]   [Y₁]
    [z_y] = [y] + [Y₂]

    +1 reward when ||s − s_GOAL||₂ < d

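    A sketch of these dynamics in code; the rotation, additive observation noise, and threshold reward come from the slide, while the exact signatures (passing θ and v explicitly) are illustrative assumptions:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def g_T(s, theta, v, mu_X, Sigma_X):
        """Transition: motion noise X, rotated into the heading frame, scaled by v."""
        X = rng.multivariate_normal(mu_X, Sigma_X)
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        return s + v * (R @ X)

    def g_O(s, mu_Y, Sigma_Y):
        """Observation: true position plus additive Gaussian noise Y."""
        return s + rng.multivariate_normal(mu_Y, Sigma_Y)

    def reward(s, s_goal, d):
        """+1 within Euclidean distance d of the goal, else 0."""
        return 1.0 if np.linalg.norm(s - s_goal) < d else 0.0
    ```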

  • Problem

    In practice, µ_X, Σ_X, µ_Y, Σ_Y are unknown.

    Need to trade off between:
    - Learning the model
    - Identifying the state
    - Gathering rewards


  • Bayesian Reinforcement Learning

    [Diagram: the agent's learning cycle — from the current prior/posterior, select an action; receive an observation; compute the new posterior, which becomes the current one.]

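    Read as code, the cycle is just the following loop; `plan`, `env.step`, and `update_posterior` are hypothetical stand-ins for the components introduced on the next slides:

    ```python
    def run_episode(env, belief, plan, update_posterior, n_steps=250):
        """Bayes-adaptive interaction loop: act, observe, update the posterior."""
        total_return = 0.0
        for _ in range(n_steps):
            a = plan(belief)                         # action from current prior/posterior
            z, r = env.step(a)                       # observation (and reward) from the world
            belief = update_posterior(belief, a, z)  # new posterior becomes the current one
            total_return += r
        return belief, total_return
    ```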

  • Bayesian Reinforcement Learning

    The planning problem is representable as a new POMDP:
    - States: (s, θ)
    - Actions: a ∈ A
    - Observations: z ∈ Z
    - Rewards: R(s, θ, a) = R(s, a)

    Joint transition-observation probabilities:
    Pr(s′, θ′, z | s, θ, a) = Pr(s′, z | s, a, θ) · I_θ(θ′)


  • Bayesian Reinforcement Learning

    Belief State = Posterior

    Belief update:
    b_az(s′, θ) ∝ ∫_S b(s, θ) Pr(s′, z | s, a, θ) ds

    Optimal policy by solving:
    V*(b) = max_{a∈A} [ ∫_S R(s, a) Pr(s | b) ds + γ ∫_Z Pr(z | b, a) V*(b_az) dz ]


  • Belief Update

    Bayesian learning of (µ, Σ):
    - Normal-Wishart prior ⇒ Normal-Wishart posterior
    - Parametrized by (n, µ̂, Σ̂)

    Start with the prior (n₀, µ̂₀, Σ̂₀).

    Posterior update after observing X = x:
    n′ = n + 1
    µ̂′ = (n µ̂ + x) / (n + 1)
    Σ̂′ = ((n − 1)/n) Σ̂ + (1/(n + 1)) (x − µ̂)(x − µ̂)ᵀ

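    These recursions translate directly to code; this is a straight transcription of the slide's update (assuming n ≥ 1 so the (n − 1)/n factor is defined):

    ```python
    import numpy as np

    def nw_update(n, mu_hat, Sigma_hat, x):
        """One conjugate Normal-Wishart update of (n, mu_hat, Sigma_hat) after X = x."""
        mu_new = (n * mu_hat + x) / (n + 1)
        Sigma_new = ((n - 1) / n) * Sigma_hat \
                    + np.outer(x - mu_hat, x - mu_hat) / (n + 1)
        return n + 1, mu_new, Sigma_new
    ```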

  • Belief Update

    But X is not directly observable:
    Pr(µ, Σ | z) ∝ ∫ Pr(µ, Σ | x) Pr(z | x) Pr(x) dx

    Approximate the infinite mixture by a finite mixture.

    Particle filter:
    - Use particles of the form (s, φ, ψ)
    - φ, ψ: Normal-Wishart posterior parameters for X, Y


  • Particle Filter

    [Figure: illustration of the particle filter update over (s, φ, ψ) particles.]

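    A simplified bootstrap-filter sketch of one such update; `propagate` and `obs_lik` are hypothetical helpers, and the paper's exact weighting (including the ψ update for the observation noise) is more involved:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def particle_filter_step(particles, weights, a, z, propagate, obs_lik):
        """One step over hyper-particles (s, phi, psi): propagate, reweight, resample.
        propagate(s, a, phi) samples s' and returns phi updated with the sampled
        transition noise (e.g. via nw_update above); obs_lik scores Pr(z | s', a, psi)."""
        new_particles, w = [], []
        for (s, phi, psi), wi in zip(particles, weights):
            s_new, phi_new = propagate(s, a, phi)
            new_particles.append((s_new, phi_new, psi))
            w.append(wi * obs_lik(z, s_new, a, psi))
        w = np.array(w) / np.sum(w)
        idx = rng.choice(len(new_particles), size=len(new_particles), p=w)  # resample
        return [new_particles[i] for i in idx], np.full(len(idx), 1.0 / len(idx))
    ```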

  • Online Planning

    Monte Carlo Online Planning (Receding Horizon Control):

    [Diagram: lookahead tree rooted at the current belief b₀ — branching on actions a₁, …, aₙ and sampled observations o₁, …, oₙ into successor beliefs b₁, b₂, b₃, …, expanded to a fixed horizon.]

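    A depth-limited Monte Carlo sketch of this lookahead; `expected_reward`, `sample_obs`, and `belief_update` are hypothetical helpers, and a real implementation would control the branching far more carefully:

    ```python
    def mc_plan(belief, actions, depth, gamma,
                expected_reward, sample_obs, belief_update, n_obs=5):
        """Estimate Q(b, a) by sampled-observation lookahead; return the best action.
        Under receding horizon control, only this first action is executed and the
        planner is rerun from the updated belief."""
        def value(b, d):
            if d == 0:
                return 0.0
            return max(q_value(b, a, d) for a in actions)

        def q_value(b, a, d):
            q = expected_reward(b, a)               # estimate of ∫ R(s, a) Pr(s | b) ds
            for _ in range(n_obs):                  # sample z ~ Pr(. | b, a)
                z = sample_obs(b, a)
                q += gamma * value(belief_update(b, a, z), d - 1) / n_obs
            return q

        return max(actions, key=lambda a: q_value(belief, a, depth))
    ```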

  • Simple Robot Navigation Task

    Average evolution of the return over time:

    [Plot: average return (y-axis, 0–1) vs. training steps (x-axis, 0–250), comparing three agents: Prior model, Exact model, and Learning.]


  • Simple Robot Navigation Task

    Average accuracy of the model over time:

    [Plot: WL1 error (y-axis, 0–1) vs. training steps (x-axis, 0–250).]

    Model accuracy is measured as follows:
    WL1(b) = Σ_{(s,φ,ψ)} b(s, φ, ψ) [ ||µ_φ − µ_X||₁ + ||Σ_φ − Σ_X||₁ + ||µ_ψ − µ_Y||₁ + ||Σ_ψ − Σ_Y||₁ ]

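    This metric is straightforward to compute over the particle set; a sketch, assuming the entrywise L1 norm for the covariance terms and the particle layout used above:

    ```python
    import numpy as np

    def wl1(particles, weights, mu_X, Sigma_X, mu_Y, Sigma_Y):
        """Belief-weighted L1 distance between posterior means and true parameters."""
        total = 0.0
        for (s, phi, psi), w in zip(particles, weights):
            _, mu_phi, Sigma_phi = phi
            _, mu_psi, Sigma_psi = psi
            total += w * (np.abs(mu_phi - mu_X).sum() + np.abs(Sigma_phi - Sigma_X).sum()
                          + np.abs(mu_psi - mu_Y).sum() + np.abs(Sigma_psi - Sigma_Y).sum())
        return total
    ```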

  • Conclusion

    - Presented a framework for optimal control under model and state uncertainty.
    - Monte Carlo approximations enable efficient tracking and planning.
    - The framework extends easily to unknown rewards and mixture-of-Gaussians models.


  • Future Work

    - What if g_T, g_O are unknown?
    - What if (µ, Σ) change over time?
    - More efficient planning algorithms.
    - Apply to a real robot.


  • Thank you!

    Questions?

