Gaussian Processes for Fast Policy Optimisation of POMDP-based Dialogue Managers



M. Gašić, F. Jurčíček, S. Keizer, F. Mairesse, B. Thomson, K. Yu, S. Young

Cambridge University Engineering Department | {mg436, fj228, sk561, farm2, brmt2, ky219, sjy}@eng.cam.ac.uk

Acknowledgements

This research was partly funded by the UK EPSRC under grant agreement EP/F013930/1 and by the EU FP7 Programme under grant agreement 216594 (CLASSiC project: www.classic-project.org).

POMDP-based Dialogue Management

• Models the uncertainty about the dialogue state by maintaining a distribution over all possible dialogue states in every turn – the belief state

• Maximises the overall dialogue success by optimising the Q-function Q(a,b(s)) – the highest expected long-term reward for action a taken in belief state b(s)
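A minimal sketch of how a belief state could be maintained for a two-state save/delete dialogue. The state names, the error rate of 0.3, the function name `update_belief` and the assumption that the user's goal stays fixed between turns are all illustrative and not taken from the poster.

```python
# Hypothetical two-state example: the user's goal is either "save" or "delete".
STATES = ("save", "delete")


def update_belief(belief, observation, error_rate=0.3):
    """Bayes update of the belief state after one noisy observation.

    Assumption: the user's goal does not change between turns, so the
    transition model is the identity and only the observation model matters.
    """
    unnormalised = {}
    for s in STATES:
        # Probability of hearing `observation` when the true goal is `s`.
        likelihood = 1.0 - error_rate if observation == s else error_rate
        unnormalised[s] = likelihood * belief[s]
    total = sum(unnormalised.values())
    return {s: p / total for s, p in unnormalised.items()}


# Start from a uniform belief and hear "delete" twice through the noisy channel.
belief = {"save": 0.5, "delete": 0.5}
belief = update_belief(belief, "delete")
first_update = dict(belief)
belief = update_belief(belief, "delete")
print(first_update, belief)
```

Each consistent observation moves probability mass towards "delete", but the belief never collapses to certainty, which is why the policy has to be defined over the continuous belief state rather than over a single assumed dialogue state.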

Problem: To keep optimisation tractable, standard methods discretise the belief space into a small number of points, which loses information.
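The loss of information can be illustrated with a small sketch. The five-point grid over P(delete) and the nearest-point rule below are assumptions chosen for illustration; the grid-based method used in the comparison may discretise the belief space differently.

```python
# Hypothetical grid over P(delete) for a two-state save/delete dialogue.
GRID = [0.0, 0.25, 0.5, 0.75, 1.0]


def nearest_grid_point(p_delete, grid=GRID):
    """Return the grid point closest to the belief P(delete)."""
    return min(grid, key=lambda g: abs(g - p_delete))


# Two clearly different beliefs are mapped to the same grid point,
# so a grid-based policy cannot tell them apart.
point_a = nearest_grid_point(0.70)
point_b = nearest_grid_point(0.84)
print(point_a, point_b)
```

After the mapping, both beliefs are treated identically, although the second one carries noticeably more evidence for "delete".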

Solution: Model the continuous nature of the belief state and include prior knowledge about the domain to speed up the learning process.

Gaussian Processes in Reinforcement Learning

• Gaussian Process (GP) – a non-parametric Bayesian model for function approximation

• Given prior function correlations and some noisy function observations, it estimates the posterior of any function value

• GP-Sarsa is an online reinforcement learning algorithm that models the Q-function as a Gaussian Process

A Gaussian Process models every Q-function value Q(a,b(s)) as a Gaussian distributed random variable. The variance of the Gaussian distribution provides a measure of uncertainty about the approximation.

[Figure: "Which action leads to success?" Axes: Q-function value, action, belief state. Standard approach: Q(a,b(s)) value for action a. GP-Sarsa: distribution of the Q(a,b(s)) value for action a.]

Kernel Function

If the Q-function value is known for one belief state-action pair, what is the Q-function value in another belief state for the same action?

• Prior knowledge about Q-function correlations is incorporated in the kernel function.

• Kernel hyper-parameters can be learned from data labelled with belief states, actions and rewards. In that way the kernel captures correlations found in the data.

Toy problem: VoiceMail Dialogue

The user asks the system to save or delete the message. The user input is corrupted with noise, so the true dialogue state is unknown.

[Figure: example dialogue. System: "Would you like to save or delete the message?" ... "Your message is deleted." Labels: belief states b(s) and b’(s), action a, (immediate) reward r.]

Results on VoiceMail

Comparison:

• GP-Sarsa with various kernel functions

• Grid-based Monte Carlo Control algorithm

• Exact POMDP solution

In order to estimate the speed of convergence, the policy was evaluated after every training batch.

Real-world Problem: CamInfo Dialogue

Hidden Information State Dialogue Manager

• POMDP-based Dialogue Manager that can tractably maintain the belief state for real-world problems

• Optimises the policy in a reduced summary space

CamInfo Domain

• Tourist Information domain for Cambridge, UK

Results on CamInfo

Comparison:

• GP-Sarsa with polynomial kernel

• Active learning GP-Sarsa – during exploration, selects the action that has the highest uncertainty as estimated by the Gaussian Process

• Grid-based Monte Carlo Control algorithm

Conclusion

• GP-Sarsa can obtain the optimal policy faster than standard methods, provided an adequate choice of the kernel function

• The measure of uncertainty that GP-Sarsa estimates can be utilised in an Active Learning framework to further speed up the learning process
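The GP posterior and the uncertainty-driven exploration described on the poster can be sketched with plain GP regression. This is a simplified illustration and not GP-Sarsa itself, which relates Q-values across consecutive turns. The squared-exponential kernel on P(delete), the zero correlation between different actions, the noise variance, the observed values and all function names are assumptions made for this sketch.

```python
import math


def kernel(x1, x2, length_scale=0.3):
    """Assumed kernel: squared-exponential on the belief, zero across actions."""
    (b1, a1), (b2, a2) = x1, x2
    if a1 != a2:
        return 0.0
    return math.exp(-((b1 - b2) ** 2) / (2.0 * length_scale ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def gp_posterior(train_x, train_y, query, noise_var=0.1):
    """Posterior mean and variance at `query` under a zero-mean GP prior."""
    n = len(train_x)
    gram = [[kernel(train_x[i], train_x[j]) + (noise_var if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    k_star = [kernel(x, query) for x in train_x]
    mean = sum(k * w for k, w in zip(k_star, solve(gram, train_y)))
    var = kernel(query, query) - sum(
        k * w for k, w in zip(k_star, solve(gram, k_star)))
    return mean, var


def most_uncertain_action(train_x, train_y, belief, actions):
    """Exploration rule: pick the action with the largest posterior variance."""
    return max(actions,
               key=lambda a: gp_posterior(train_x, train_y, (belief, a))[1])


# Hypothetical observations: (P(delete), action) pairs with noisy returns.
train_x = [(0.9, "delete"), (0.7, "delete"), (0.1, "save")]
train_y = [18.0, 12.0, 15.0]
belief = 0.8
mean_delete, var_delete = gp_posterior(train_x, train_y, (belief, "delete"))
mean_save, var_save = gp_posterior(train_x, train_y, (belief, "save"))
choice = most_uncertain_action(train_x, train_y, belief, ("save", "delete"))
print(var_delete, var_save, choice)
```

In this sketch the "delete" action has already been observed close to the queried belief, so its posterior variance is small, while "save" has only been observed far away. An exploration step that follows the highest uncertainty therefore tries "save".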
