Transcript of "Dialogue Policy Optimisation", Milica Gašić, Dialogue Systems Group.

Slide 1: Dialogue Policy Optimisation
Milica Gašić, Dialogue Systems Group

Slide 2: Dialogue Policy Optimisation
Milica Gašić, Dialogue Systems Group

Slide 3: Reinforcement learning

Slide 4: Dialogue as a partially observable Markov decision process (POMDP)
[Diagram: actions $a_t$, hidden states $s_t, s_{t+1}$, rewards $r_t$, observations $o_t, o_{t+1}$]
- The state is unobservable; it must be inferred from noisy observations.
- Action selection (the policy) is based on the distribution over all states at every time step t: the belief state $b(s_t)$.

Slide 5: Dialogue policy optimisation
[Diagram: the agent-environment loop of state, action and reward]

Slide 6: Optimal policy

Slide 7: Reinforcement learning, the idea
- Take actions randomly.
- Compute the average reward.
- Change the policy to take the actions that generated high reward.

Slide 8: Challenges in dialogue policy optimisation
- How to define the reward?
- The belief state is large and continuous.
- Reinforcement learning takes many iterations.

Slide 9: Problem 1: the reward function
Solution: the reward is a measure of how good the dialogue is.

Slide 10: Problem 2: the belief state is large and continuous
Solution: compress the belief state into a smaller-scale summary space [1].
[Diagram: a summary function maps the original belief space to the summary space; the summary policy chooses summary actions, which a master function maps back to full actions]
[1] J. Williams and S. Young (2005). "Scaling up POMDPs for Dialogue Management: The Summary POMDP Method."

Slide 11: Summary space
- The summary space contains the features of the belief space that are important for learning.
- This is hand-coded! It can contain the probabilities of concepts, their values and so on.
- Continuous variables can be discretised into a grid (a code sketch follows after Slide 15).

Slide 12: Q-function
- The Q-function measures the expected discounted reward that can be obtained at a grid point when an action is taken.
- It takes into account the reward of future actions.
- Optimising the Q-function is equivalent to optimising the policy.
$Q^{\pi}(p, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid p_{0} = p,\; a_{0} = a\right]$, where $\gamma \in (0,1]$ is the discount factor, $r_t$ the reward, $p$ the starting grid point, $a$ the starting action, and the expectation is taken with respect to the policy $\pi$.

Slide 13: Online learning
- Reinforcement learning in direct interaction with the environment.
- Actions are taken ε-greedily.
- Exploitation: choose the action according to the best estimate of the Q-function.
- Exploration: choose an action randomly (with probability ε).

Slide 14: Monte Carlo control algorithm (sketched in code after Slide 15)
  Initialise Q arbitrarily
  Repeat:
    For every turn in a dialogue:
      Update the belief state and map it to the summary space
      Record the grid point and the reward
    Until the end of the dialogue
    For each grid point, sum up all the rewards that followed it
    Update the Q-function and the policy

Slide 15: How many iterations?
- Each grid point needs to be visited often enough to obtain a good estimate.
- If the grid is coarse, the estimates are not precise enough.
- If there are lots of grid points, policy optimisation is slow.
- In practice, tens of thousands of dialogues are needed!
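The following is a minimal sketch of the hand-coded summary mapping described on Slides 10 and 11. The choice of features (the top-hypothesis probability of each concept) and the grid resolution are illustrative assumptions, not taken from the slides.

```python
def summary_function(belief, n_buckets=5):
    """Map a full belief state to a point in a small summary grid.

    `belief` is assumed to be a dict mapping each concept (e.g. "price",
    "area") to a probability distribution over its values. Keeping only
    the top-hypothesis probability per concept is an illustrative choice;
    real systems hand-pick richer features (Slide 11).
    """
    point = []
    for concept in sorted(belief):
        top_prob = max(belief[concept].values())
        # Discretise the continuous probability into n_buckets bins.
        bucket = min(int(top_prob * n_buckets), n_buckets - 1)
        point.append(bucket)
    return tuple(point)


# A noisy belief over two concepts collapses to a single grid point.
belief = {
    "price": {"cheap": 0.63, "expensive": 0.22, "moderate": 0.15},
    "area": {"centre": 0.5, "south": 0.3, "north": 0.2},
}
print(summary_function(belief))  # (2, 3): "area" bucket 2, "price" bucket 3
```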
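And here is a minimal every-visit Monte Carlo control loop with ε-greedy action selection, as outlined on Slides 13 and 14. The action names, hyperparameters and reward values are illustrative assumptions, and the outer loop that collects dialogues from the user is omitted.

```python
import random
from collections import defaultdict

ACTIONS = ["request", "confirm", "inform"]  # illustrative summary actions
EPSILON, GAMMA = 0.1, 0.99                  # illustrative hyperparameters

Q = defaultdict(float)     # Q[(grid_point, action)] -> current estimate
visits = defaultdict(int)  # visit counts for incremental averaging

def epsilon_greedy(grid_point):
    """Slide 13: explore with probability EPSILON, otherwise exploit."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)                      # exploration
    return max(ACTIONS, key=lambda a: Q[(grid_point, a)])  # exploitation

def monte_carlo_update(episode):
    """Slide 14: fold the discounted return that followed each
    (grid point, action) pair into its Q estimate.

    `episode` is one finished dialogue as a list of
    (grid_point, action, reward) turns.
    """
    g = 0.0
    # Walk the dialogue backwards, accumulating the discounted return.
    for grid_point, action, reward in reversed(episode):
        g = reward + GAMMA * g
        key = (grid_point, action)
        visits[key] += 1
        Q[key] += (g - Q[key]) / visits[key]  # incremental mean

# One hypothetical recorded dialogue: -1 per turn, +20 on success.
monte_carlo_update([((2, 3), "request", -1),
                    ((4, 3), "confirm", -1),
                    ((4, 4), "inform", 20)])
```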
Slide 16: Learning in interaction with a simulated user
[Diagram: the dialogue manager (dialogue state + dialogue policy) talks to a simulated user through speech understanding and speech generation; the expected reward is used to optimise the policy]

Slide 17: Simulated user
- Various models exist.
- They should exhibit a variety of behaviour.
- They should imitate real users.

Slide 18: Agenda-based user simulator (sketched in code after Slide 31)
Consists of an agenda and a goal.
- Goal: the concepts that describe the entity the user wants. Example: restaurant, cheap, Chinese.
- Agenda: the dialogue acts needed to elicit the user goal. It is dynamically changed during the dialogue and generated either deterministically or stochastically.

Slide 19: Learning with noisy input
True user act: inform(price=cheap, area=centre)
N-best list of recognised hypotheses with confidence scores:
  inform(price=cheap, area=south)  0.63
  inform(price=expensive)          0.22
  request(area)                    0.15

Slide 20: Evaluating a dialogue system
- A dialogue system consists of many components, and joint evaluation is difficult.
- What matters is the user experience.
- The dialogue manager uses the reward to optimise the dialogue policy.
- This reward can also be used for evaluation.

Slide 21: Results
[Figure]

Slide 22: Problem 3: policy optimisation requires a lot of dialogues
Policy optimisation requires tens of thousands of dialogues.
Solution: take into account the similarities between different belief states.
- Essential ingredients: a Gaussian process and a kernel function.
- Outcome: fast policy optimisation.

Slide 23: The Q-function as a Gaussian process
The Q-function in a POMDP is the expected long-term reward from taking action a in belief state b(s). It can be modelled as a stochastic process, in particular a Gaussian process, to take into account the uncertainty of the approximation.

Slide 24: VoiceMail example
The user asks the system to save or delete a message. System actions: save, delete, confirm. The user input is corrupted with noise, so the true dialogue state is unknown.

Slide 25: Q-function as a Gaussian process
[Figure: the Q-function as a Gaussian process over the belief state b(s)]

Slide 26: The role of the kernel function in a Gaussian process
The kernel function models the correlation between different Q-function values.
[Figure: Q-function values for the confirm action plotted against the belief state; nearby belief states have correlated values]

Slide 27: Exploration in online learning
- The state space needs to be sufficiently explored to find the optimal path.
- How to explore the space efficiently?

Slide 28: Active learning in GP reinforcement learning
The GP model of the Q-function gives the uncertainty of its estimates (a code sketch follows after Slide 31).
- Exploration: choose the action that the model is uncertain about.
- Exploitation: choose the action with the highest expected reward.

Slide 29: Results
Cambridge tourist information domain. [Figure]

Slide 30: Learning in interaction with real people

Slide 31: Conclusions
Statistical dialogue modelling:
- automates dialogue manager optimisation,
- is robust to speech recognition errors,
- enables fast learning.
Future work: a refined reward function.
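The sketch below illustrates the agenda-and-goal structure of Slide 18. The dialogue-act and slot names are hypothetical, and the agenda update shown covers only one case (correcting a misheard confirmation); a full simulator handles many more.

```python
class AgendaBasedUser:
    """A minimal agenda-based user simulator (Slide 18).

    The goal holds the constraints the user wants satisfied; the agenda
    is a stack of dialogue acts that would elicit that goal. Act and
    slot names here are illustrative, not from the original slides.
    """

    def __init__(self, goal):
        self.goal = goal  # e.g. {"food": "chinese", "price": "cheap"}
        # Initial agenda: ask for a venue after informing each constraint.
        self.agenda = [("request", ("name", None))]
        self.agenda += [("inform", (slot, value)) for slot, value in goal.items()]

    def respond(self, system_act):
        """Pop the next act, pushing a correction if the system misheard."""
        act, (slot, value) = system_act
        if act == "confirm" and self.goal.get(slot) != value:
            # The system confirmed a wrong value: correct it first.
            self.agenda.append(("inform", (slot, self.goal[slot])))
        if not self.agenda:
            return ("bye", (None, None))
        # A deterministic pop; real simulators pop stochastically so the
        # simulated user exhibits a variety of behaviour (Slide 17).
        return self.agenda.pop()


user = AgendaBasedUser({"food": "chinese", "price": "cheap"})
print(user.respond(("confirm", ("price", "expensive"))))
# -> ('inform', ('price', 'cheap'))
```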
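Finally, a sketch of the Gaussian-process treatment of Slides 23 to 28: a kernel over (belief, action) pairs, the standard GP regression posterior, and posterior sampling as one simple way to trade off the exploration and exploitation described on Slide 28. The kernel choice, noise level and data are assumptions, and the published work uses a temporal-difference variant (GP-SARSA) rather than the plain GP regression shown here.

```python
import numpy as np

ACTIONS = ["save", "delete", "confirm"]  # VoiceMail system actions (Slide 24)

def kernel(x1, a1, x2, a2, lengthscale=0.2):
    """Slide 26: the kernel models correlation between Q-function values.
    Here: an RBF kernel over belief features times a delta kernel over
    actions. Both choices are illustrative."""
    if a1 != a2:
        return 0.0
    return float(np.exp(-np.sum((x1 - x2) ** 2) / (2 * lengthscale ** 2)))

def gp_posterior(X, A, y, x_new, a_new, noise=0.1):
    """Mean and standard deviation of Q(x_new, a_new) under a GP prior,
    given returns y observed at (belief, action) points (X, A)."""
    K = np.array([[kernel(xi, ai, xj, aj) for xj, aj in zip(X, A)]
                  for xi, ai in zip(X, A)]) + noise * np.eye(len(X))
    k = np.array([kernel(xi, ai, x_new, a_new) for xi, ai in zip(X, A)])
    K_inv = np.linalg.inv(K)
    mean = k @ K_inv @ y
    var = kernel(x_new, a_new, x_new, a_new) - k @ K_inv @ k
    return mean, np.sqrt(max(var, 0.0))

def choose_action(X, A, y, belief, rng=np.random.default_rng(0)):
    """Slide 28: sample a Q-value from the posterior for each action and
    take the argmax. Actions the model is uncertain about get explored;
    actions with high expected reward get exploited."""
    sampled = {a: rng.normal(*gp_posterior(X, A, y, belief, a)) for a in ACTIONS}
    return max(sampled, key=sampled.get)


# Hypothetical history: belief = P(user wants the message saved).
X = [np.array([0.9]), np.array([0.5]), np.array([0.1])]
A = ["save", "confirm", "delete"]
y = np.array([10.0, -1.0, 10.0])
print(choose_action(X, A, y, np.array([0.8])))
```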