Gaussian Processes for Fast Policy Optimisation of POMDP-based Dialogue Managers
M. Gašić, F. Jurčíček, S. Keizer, F. Mairesse, B. Thomson, K. Yu, S. Young
Cambridge University Engineering Department | {mg436, fj228, sk561, farm2, brmt2, ky219, sjy}@eng.cam.ac.uk
Acknowledgements
This research was partly funded by the UK EPSRC under grant agreement EP/F013930/1 and by the EU FP7 Programme under grant agreement 216594 (CLASSiC project: www.classic-project.org).
Real-world Problem: CamInfo Dialogue
Kernel Function
POMDP-based Dialogue Management
• Models the uncertainty about the dialogue state by maintaining a distribution over all possible dialogue states in every turn – the belief state
• Maximises the overall dialogue success by optimising the Q-function Q(a,b(s)) – the highest expected long-term reward for action a taken in belief state b(s)
Problem: To keep optimisation tractable, standard methods discretise the belief space into a small number of points, which loses information.
Solution: Model the continuous nature of the belief state and incorporate prior knowledge about the domain to speed up the learning process.
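To illustrate the information loss, the sketch below maps belief points to the nearest point of a small grid, roughly as a grid-based method would. The grid, the example beliefs and the helper name `nearest_grid_point` are assumptions made for illustration; they are not taken from the poster.

```python
import numpy as np

# Hypothetical grid over a two-state belief simplex (illustrative assumption).
GRID = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])

def nearest_grid_point(belief, grid=GRID):
    """Return the index of the grid point closest to `belief` (Euclidean distance)."""
    distances = np.linalg.norm(grid - np.asarray(belief), axis=1)
    return int(np.argmin(distances))

# Two different beliefs collapse onto the same grid point,
# so a grid-based policy has to treat them identically.
b1 = [0.70, 0.30]
b2 = [0.55, 0.45]
same_point = nearest_grid_point(b1) == nearest_grid_point(b2)
```

Here both beliefs map to the middle grid point, although the first is noticeably more confident than the second.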
Gaussian Processes in Reinforcement Learning
Toy problem: VoiceMail Dialogue
Conclusion
• GP-Sarsa can obtain the optimal policy faster than standard methods, provided the kernel function is chosen adequately
• The uncertainty estimate that GP-Sarsa provides can be used in an Active Learning framework to speed up the learning process further
Results on CamInfo
• Gaussian Process (GP) – a non-parametric Bayesian model for function approximation
• Given prior function correlations and some noisy function observations, it estimates the posterior of any function value
• GP-Sarsa is an online reinforcement learning algorithm that models the Q-function as a Gaussian Process
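The second bullet can be sketched with plain GP regression in NumPy: given a prior correlation function and a few noisy observations, compute the posterior mean and variance at new inputs. This is ordinary GP regression, not GP-Sarsa itself, which works from rewards rather than direct function observations. The squared-exponential kernel, its length scale, the noise variance and the toy data are all assumptions for illustration.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=0.3):
    """Squared-exponential prior correlation between two sets of 1-D inputs."""
    diff = np.asarray(x1)[:, None] - np.asarray(x2)[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.01):
    """Posterior mean and variance of the function values at x_test."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)    # train/test correlations
    K_ss = rbf_kernel(x_test, x_test)    # prior covariance at test inputs
    mean = K_s.T @ np.linalg.solve(K, np.asarray(y_train))
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)

# Toy data (assumed): three noisy observations of an unknown function.
x_train = [0.1, 0.5, 0.9]
y_train = [1.0, 0.0, -1.0]
mean, var = gp_posterior(x_train, y_train, x_test=[0.5, 3.0])
```

The variance is small at the observed input 0.5 and returns to the prior variance at the distant input 3.0, which is the kind of uncertainty measure the poster refers to.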
If the Q-function value were known for one belief state-action pair, what would the Q-function value be in another belief state for the same action?
• Prior knowledge about Q-function correlations is incorporated in the kernel function.
• Kernel hyper-parameters can be learned from the data labelled with belief states, actions and rewards. In that way
the kernel captures correlations found in the data.
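One way such a kernel could look is sketched below: a polynomial kernel on belief states, multiplied by an indicator that the two actions are the same, so that Q-values are only correlated across belief states for the same action. The factorised form, the function names and the hyper-parameters `degree` and `offset` are assumptions for illustration, not the exact kernel used on the poster.

```python
import numpy as np

def polynomial_kernel(b1, b2, degree=2, offset=1.0):
    """Polynomial kernel between two belief vectors (hyper-parameters assumed)."""
    return (np.dot(b1, b2) + offset) ** degree

def q_kernel(b1, a1, b2, a2):
    """Prior correlation between Q(a1, b1) and Q(a2, b2).

    Assumption: Q-values of different actions are uncorrelated a priori,
    so the belief kernel is multiplied by 1 for equal actions and 0 otherwise.
    """
    same_action = 1.0 if a1 == a2 else 0.0
    return polynomial_kernel(b1, b2) * same_action

k_same = q_kernel([0.8, 0.2], "save", [0.6, 0.4], "save")
k_diff = q_kernel([0.8, 0.2], "save", [0.6, 0.4], "delete")
```

In this sketch `degree` and `offset` play the role of the kernel hyper-parameters that could be fitted to data.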
Results on VoiceMail
Comparison:
• GP-Sarsa with various kernel functions
• Grid-based Monte Carlo Control algorithm
• Exact POMDP solution
In order to estimate the speed of convergence, the policy was evaluated after every training batch:
Hidden Information State Dialogue Manager
• POMDP-based Dialogue Manager that can tractably maintain a belief state for real-world problems
• Optimises policy in a reduced summary space
CamInfo Domain
• Tourist Information domain for Cambridge, UK
Comparison:
• GP-Sarsa with polynomial kernel
• Active learning GP-Sarsa – during exploration selects the action that has the highest uncertainty estimated by the
Gaussian Process
• Grid-based Monte Carlo Control Algorithm
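The active learning variant in the comparison above can be sketched as an action selection rule: when exploring, pick the action whose Q-value estimate has the highest variance; otherwise pick the action with the highest mean. How the system decides when to explore is not stated here, so the `exploring` flag is left to the caller; the function name and example numbers are assumptions.

```python
import numpy as np

def select_action(q_mean, q_var, exploring):
    """Return the index of the chosen action.

    q_mean, q_var: per-action posterior mean and variance of Q(a, b(s)),
    as estimated by the Gaussian Process for the current belief state.
    """
    if exploring:
        # Active learning: try the action the GP is least certain about.
        return int(np.argmax(np.asarray(q_var)))
    # Exploitation: take the action with the highest expected Q-value.
    return int(np.argmax(np.asarray(q_mean)))

# Illustrative values for three actions (assumed).
q_mean = [1.0, 0.2, 0.5]
q_var = [0.1, 0.9, 0.3]
explore_choice = select_action(q_mean, q_var, exploring=True)
exploit_choice = select_action(q_mean, q_var, exploring=False)
```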
Figure: VoiceMail dialogue. System prompt: "Would you like to save or delete the message?" System response: "Your message is deleted." Labels: belief state b(s) and b'(s), action a, (immediate) reward r.
Figure axes: Q-function value, action, belief state.
The user asks the system to save or delete the message.
The user input is corrupted with noise, so the true
dialogue state is unknown.
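A minimal sketch of how a belief over the user's intention could be tracked under noisy input is given below. It assumes the user's goal stays fixed during the dialogue and that each observation is correct with probability `p_correct`; both the noise model and the value 0.8 are assumptions for illustration.

```python
import numpy as np

STATES = ("save", "delete")  # possible user goals in the toy problem

def update_belief(belief, observation, p_correct=0.8):
    """Bayes update of the belief over the user's goal after a noisy observation."""
    likelihood = np.array(
        [p_correct if s == observation else 1.0 - p_correct for s in STATES]
    )
    unnormalised = likelihood * np.asarray(belief)
    return unnormalised / unnormalised.sum()

belief = np.array([0.5, 0.5])            # uniform prior over "save" / "delete"
belief = update_belief(belief, "save")   # first noisy observation
belief = update_belief(belief, "save")   # a second, consistent observation
```

Repeated consistent observations push the belief towards one state, but it never becomes fully certain while the input channel is noisy.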
A Gaussian Process models every Q-function value Q(a,b(s)) as a Gaussian distributed random variable. The variance of the
Gaussian distribution provides a measure of uncertainty about the approximation.
Which action leads to success?
Standard approach: Q(a,b(s)) value
GP-Sarsa: distribution of Q(a,b(s)) value (figure label: Action a)