Statistical Spoken Dialogue System, Talk 2 – Belief tracking. CLARA Workshop. Presented by Blaise Thomson, Cambridge University Engineering Department. [email protected], http://mi.eng.cam.ac.uk/~brmt2

Slide 1

Statistical Spoken Dialogue System, Talk 2 – Belief tracking. CLARA Workshop. Presented by Blaise Thomson, Cambridge University Engineering Department. [email protected], http://mi.eng.cam.ac.uk/~brmt2

Slide 2: Human-machine spoken dialogue
Typical structure of a spoken dialogue system: the user's waveform passes through a Recognizer (words) and a Semantic Decoder (dialogue acts) to the Dialog Manager; the reply goes back through a Message Generator and a Synthesizer to the user.
Example: "I want a restaurant" -> inform(type=restaurant); "What kind of food do you want?" -> request(food).

Slide 3: Spoken Dialogue Systems – state of the art

Slide 4: Outline
- Introduction
- An example user model (spoken dialogue model)
- The Partially Observable Markov Decision Process (POMDP): POMDP models for dialogue systems, POMDP models for off-line experiments, POMDP models for simulation
- Inference: belief propagation (fixed parameters), Expectation Propagation (learning parameters), optimisations
- Results

Slide 5: Intro – An example user model?
The Partially Observable Markov Decision Process (POMDP): a probabilistic model of what the user will say.
Variables:
- Dialogue state, s_t (e.g. the user wants a restaurant)
- System action, a_t (e.g. "What type of food?")
- Observation of what was said, o_t (e.g. an N-best semantic list)
Assumes an Input-Output Hidden Markov structure over states s_1, s_2, ..., s_T, observations o_1, ..., o_T and actions a_1, ..., a_T (diagram on slide).

Slide 6: Intro – Simplifying the POMDP user model
Typically split the dialogue state, s_t (diagram on slide).

Slide 7: Intro – Simplifying the POMDP user model
Typically split the dialogue state, s_t, into the true user goal, g_t, and the true user act, u_t (diagram: goals g_1, ..., g_T and user acts u_1, ..., u_T, with observations o_t and actions a_t).

Slide 8: Intro – Simplifying the POMDP user model
Further split the goal, g_t, into sub-goals g_t,c. E.g. the user wants a Chinese restaurant: food=Chinese, type=restaurant (diagram: g_t split into g_t,food, g_t,type, g_t,stars, g_t,area).

Slide 9: Intro – Simplifying the POMDP user model
(Diagram: the factored network over two time slices, with sub-goals g_type and g_food, user acts u_type and u_food, observation o and action a, grouped into the goal G and the user act U.)

Slide 10: Intro – POMDP models for dialogue systems
Example dialogue with N-best hypotheses (user goal: Beer / Bar / Bye):
System: "How can I help you?" User: "I'm looking for a beer" [0.5], "I'm looking for a bar" [0.4].
System: "Sorry, what did you say?" User: "bar" [0.3], "bye" [0.3].
When decisions are based on probabilistic user goals: Partially Observable Markov Decision Processes (POMDPs).

Slide 11: Intro – POMDP models for dialogue systems
(Figure on slide.)

Slide 12: Intro – A belief model for dialogue systems
Choose actions (e.g. confirm(beer)) according to the beliefs in the goal (Beer / Bar / Bye) instead of the most likely hypothesis. This is more robust, for some key reasons: the full hypothesis list is used, and a user model is included.
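Not from the slides: a minimal sketch of the kind of belief tracking Slides 10-12 describe, under two simplifying assumptions: the user goal is static, and the N-best confidence scores are treated directly as (unnormalised) observation likelihoods, with a small floor for goals the recognizer did not return. The goal names and scores follow the beer/bar/bye example; everything else (function name, floor value) is illustrative.

```python
# Goal names and confidence scores follow the beer/bar/bye example on the slides;
# the static-goal assumption and the likelihood floor are simplifications.
GOALS = ["beer", "bar", "bye"]

def update_belief(belief, nbest, floor=0.1):
    """One turn of belief tracking: multiply the prior belief by the evidence
    for each goal and renormalise. Goals absent from the N-best list get a
    small floor likelihood, so they are discounted rather than eliminated."""
    new_belief = {g: belief[g] * nbest.get(g, floor) for g in belief}
    total = sum(new_belief.values())
    return {g: p / total for g, p in new_belief.items()}

belief = {g: 1.0 / len(GOALS) for g in GOALS}              # start uniform
belief = update_belief(belief, {"beer": 0.5, "bar": 0.4})  # "I'm looking for a beer" / "... a bar"
belief = update_belief(belief, {"bar": 0.3, "bye": 0.3})   # "bar" / "bye"
print(belief)  # "bar" now carries the most mass, "beer" and "bye" much less
```

After the two turns the belief concentrates on "bar" even though no single N-best list made it the clear winner, which is the point of acting on the full belief rather than on the top hypothesis alone.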
Slide 13: Intro – POMDP models for off-line experiments
The same example dialogue ("How can I help you?" / "I'm looking for a beer", "I'm looking for a bar" / "Sorry, what did you say?" / "bar", "bye"), now with the confidence scores of each hypothesis and the resulting beliefs over Beer / Bar / Bye shown on the slide.

Slide 14: Intro – POMDP models for simulation
Often useful to be able to simulate how people behave: for reinforcement learning, and for testing a given system. In theory, simply generate from the POMDP user model (diagram: sub-goals g_type = restaurant and g_food = Chinese generating user acts such as inform(type=restaurant) and silence()).

Slide 15: An example – voicemail
We have a voicemail system with 2 possible user goals:
- g = SAVE: the user wants to save
- g = DEL: the user wants to delete
In each turn, until we save or delete, we observe one of two things:
- o = OSAVE: the user said "save"
- o = ODEL: the user said "delete"
We assume that the goal can change between turns, and for the moment we only look at two turns. We start completely unsure of what the user wants.

Slide 16: An example – exercise
Observation probability P(o | g):

  g \ o   OSAVE   ODEL
  SAVE    0.8     0.2
  DEL     0.2     0.8

If we observe the user saying they want to save, what is the probability that they want to save, i.e. P(g_1 | o_1 = OSAVE)?
Use Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B).

Slide 17: An example – exercise
Observation probability P(o | g) as above; transition probability P(g_t | g_{t-1}):

  g_{t-1} \ g_t   SAVE   DEL
  SAVE            0.9    0.1
  DEL             0.0    1.0

If we observe the user saying they want to save and then saying they want to delete, what is the probability they want to save in the second turn, i.e. P(g_2 | o_1 = OSAVE, o_2 = ODEL)?

Slide 18: An example – answer

  g_2    from g_1 = SAVE    from g_1 = DEL    TOTAL   PROB
  SAVE   0.5*0.8*0.9*0.2    0.5*0.2*0.0*0.2   0.072   0.39
  DEL    0.5*0.8*0.1*0.8    0.5*0.2*1.0*0.8   0.112   0.61

Each entry is prior * P(o_1 | g_1) * P(g_2 | g_1) * P(o_2 | g_2); the PROB column is the TOTAL column renormalised.

Slide 19: An example – expanding further
In general we will want to compute probabilities conditional on the observations (we will call these the data, D). This always becomes a marginal on the joint distribution with the observation probabilities fixed (equations on slide). These sums can be computed much more cleverly using dynamic programming.
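The two-turn exercise can be checked mechanically, and the same loop is the dynamic-programming computation Slide 19 alludes to: a forward pass over the turns. The sketch below is not from the talk; it only uses the probability tables from Slides 16-17 and a uniform prior, and the function name is illustrative.

```python
P_OBS = {"SAVE": {"OSAVE": 0.8, "ODEL": 0.2},   # P(o | g), Slide 16
         "DEL":  {"OSAVE": 0.2, "ODEL": 0.8}}
P_TRANS = {"SAVE": {"SAVE": 0.9, "DEL": 0.1},   # P(g_t | g_{t-1}), Slide 17
           "DEL":  {"SAVE": 0.0, "DEL": 1.0}}
GOALS = ["SAVE", "DEL"]

def posterior(observations):
    """Return P(g_t | o_1..o_t) for the last observed turn, from a uniform prior."""
    belief = {g: 1.0 / len(GOALS) for g in GOALS}
    for t, o in enumerate(observations):
        if t > 0:  # predict: sum over the previous goal using the transition table
            belief = {g: sum(belief[gp] * P_TRANS[gp][g] for gp in GOALS) for g in GOALS}
        belief = {g: belief[g] * P_OBS[g][o] for g in GOALS}   # update with the observation
        z = sum(belief.values())
        belief = {g: b / z for g, b in belief.items()}          # renormalise each turn
    return belief

print(posterior(["OSAVE"]))           # P(SAVE) = 0.8, the answer to Slide 16
print(posterior(["OSAVE", "ODEL"]))   # P(SAVE) ~ 0.39, P(DEL) ~ 0.61, as on Slide 18
```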
Slide 20: Belief Propagation
We are interested in the marginals p(x | D). Assume the network is a tree with observations above and below the node x, D = {D_a, D_b}; then p(x | D) is proportional to p(x | D_a) p(D_b | x) (diagram and equations on slide).

Slide 21: Belief Propagation
When we split D_b = {D_c, D_d}, the marginal factorises further, into p(x | D_a), p(D_c | x) and p(D_d | x). These are called the messages into x: we have one message for every probability factor connected to x (diagram on slide).

Slide 22: Belief Propagation – message passing
(Diagram: nodes a and b with data D_a and D_b.)

Slide 23: Belief Propagation – message passing
(Diagram: nodes a, b and c with data D_a, D_b and D_c.)

Slide 24: Belief Propagation
We can do the same thing repeatedly. Start on one side and keep computing p(x | D_a); then start at the other end and keep computing p(D_b | x). To get a marginal, simply multiply these.

Slide 25: Belief Propagation – our example
(Diagram: the voicemail chain with goals g_1, g_2 and observations o_1, o_2.) Write the probabilities as vectors with SAVE on top.
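To make the two passes of Slides 24-25 concrete, here is a sketch (not from the talk) of forward and backward messages on the voicemail chain, with vectors ordered [SAVE, DEL] as the slide suggests. Multiplying the two messages at each turn and renormalising gives the marginals p(g_t | D); the array names and layout are illustrative.

```python
import numpy as np

P_OBS = {"OSAVE": np.array([0.8, 0.2]),   # P(o | g) for g = [SAVE, DEL]
         "ODEL":  np.array([0.2, 0.8])}
P_TRANS = np.array([[0.9, 0.1],           # rows: g_{t-1}, columns: g_t
                    [0.0, 1.0]])
observations = ["OSAVE", "ODEL"]
prior = np.array([0.5, 0.5])

# Forward messages: alpha[t] is proportional to p(g_t, o_1..o_t).
alpha = [prior * P_OBS[observations[0]]]
for o in observations[1:]:
    alpha.append((alpha[-1] @ P_TRANS) * P_OBS[o])

# Backward messages: beta[t] = p(o_{t+1}..o_T | g_t); all ones at the last turn.
beta = [np.ones(2) for _ in observations]
for t in range(len(observations) - 2, -1, -1):
    beta[t] = P_TRANS @ (P_OBS[observations[t + 1]] * beta[t + 1])

# Marginals: multiply the two messages at each turn and renormalise.
for t in range(len(observations)):
    m = alpha[t] * beta[t]
    print(f"p(g_{t + 1} | D) =", m / m.sum())   # turn 2 gives [0.39, 0.61], as on Slide 18
```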
Slide 26: Parameter Learning – the problem
(Diagram: the factored two-slice network from Slide 9, with sub-goals g_type and g_food, user acts u_type and u_food, observation o and action a.)

Slide 27: Parameter Learning – the problem
For every (action, goal, goal) triple there is a parameter: the parameters form a probability table P(g_t | g_{t-1}, a_t). The goals are all hidden and factorized, and there are many of them (diagram: g_{t-1}, g_t, a_t).

Slide 28: Parameter Learning – some options
1. Hand-craft (Roy et al., Zhang et al., Young et al., Thomson et al., Bui et al.)
2. Annotate the user goal and use Maximum Likelihood (Williams et al., Kim et al., Henderson & Lemon) – isn't always possible.
3. Expectation Maximisation (Doshi & Roy (7 states), Syed et al. (no goal changes)) – uses an unfactorised state; intractable.
4. Expectation Propagation (EP) – allows parameter tying (details in the paper), handles factorized hidden variables, handles large state spaces, and doesn't require any annotations (including of the user act), though it does use the semantic decoder output.

Slide 29: Belief Propagation as message passing
For a factor connecting a and b, with data D_a above and D_b below (diagram on slide):
- q\(a): the message from outside the factor, i.e. the input message from above a.
- q\(b): the message from outside the factor, i.e. the product of the input messages from below b.
- q*(b): the message from this factor to b.
- q*(a): the message from this factor to a.

Slide 30: Belief Propagation as message passing
Think in terms of approximations from each probability factor: the messages from outside the network are q\(a) = p(a | D_a) and q\(b) = p(D_b | b); the messages from this factor are q*(b) = p(b | D_a) and q*(a) = p(D_b | a).

Slide 31: Belief Propagation – unknown parameters?
Imagine we have a discrete choice for the parameters. Integrate over our estimate from the rest of the network; to estimate the parameters, we want to sum over a and b (equations on slide).

Slide 32: Belief Propagation – unknown parameters?
But we actually have continuous parameters. Again, integrate over our estimate from the rest of the network; to estimate the parameters, we want to sum over a and b (equations on slide).

Slide 33: Expectation Propagation
This doesn't make sense: the quantity is a probability! Multiplying by q\(θ) gives a mixture; choose q*(θ) to minimize the KL divergence with this. If we restrict ourselves to Dirichlet distributions, we need to find the Dirichlet that best matches a mixture of Dirichlets.

Slides 34-41: Expectation Propagation – example
(A worked example on the factored network, built up over several slides.) The observation factor gives p(o | inform(type=bar)) [0.5] and p(o | inform(type=hotel)) [0.2]; the corresponding user-act messages are inform(type=bar) [0.5] and inform(type=hotel) [0.2]; combined with the user act model (p(u=bar | g) 0.4, p(u=hotel | g) 0.1) these give goal messages type=bar [0.45] and type=hotel [0.18], refined to type=bar [0.44] and type=hotel [0.17]; a second observation gives p(o | inform(type=bar)) [0.6] and p(o | inform(type=rest)) [0.3].

Slide 42: Expectation Propagation – Optimisation 1
In dialogue systems, most of the values are equally likely. We can use this to reduce computation: compute the q distributions only once, and multiply instead of summing the same value repeatedly. (Example on slide: the number of stars, 1-5, "Two stars please".)

Slide 43: Expectation Propagation – Optimisation 2
For each value, assume the transition to most other values is the same (a mostly constant factor), e.g. a constant probability of change. The reduced number of parameters means we can speed up learning too.

Slide 44: Results – computation times
(Chart comparing computation times with no optimisation, grouping, constant change, and both.)

Slide 45: Results – simulated re-ranking
Train on 1000 simulated dialogues; re-rank simulated semantics on 1000 dialogues. Oracle accuracy is 93.5%.
Metrics: TAcc = semantic accuracy of the top hypothesis; NCE = Normalized Cross Entropy score (confidence scores); ICE = Item Cross Entropy score (accuracy + confidence).

                                      TAcc   NCE     ICE
  No rescoring                        75.7   0.541   0.921
  Trained with noisy semantics        81.7   0.650   0.870
  Trained with semantic annotations   81.5   0.632   0.903

Slide 46: Results – data re-ranking
Train on Mar09 TownInfo trial data (720 dialogues); test on Feb08 TownInfo trial data (648 dialogues). Oracle accuracy is 79.2%.

                                      TAcc   NCE      ICE
  No rescoring                        73.3   -0.033   1.687
  Trained with noisy semantics        73.4   0.327    1.586
  Trained with semantic annotations   73.9   0.338    1.655

Slide 47: Results – simulated dialogue management
Use reinforcement learning (the Natural Actor Critic algorithm) to train two systems: one uses hand-crafted parameters, the other uses parameters learned from 1000 simulated dialogues.

Slide 48: Results – live evaluations (control)
Tested in the Spoken Dialogue Challenge: provide bus timetables in Pittsburgh; 800 road names (pairs represent a stop); required to get the place from, the place to, and the time. All parameters of the Cambridge system were hand-crafted.

              # Dial   # Succ   % Succ          WER
  BASELINE    91       59       64.8 +/- 5.0    42.35
  System 2    61       23       37.7 +/- 6.2    60.66
  Cambridge   75       67       89.3 +/- 3.6    32.65
  System 4    83       62       74.7 +/- 4.8    34.34

Slide 49: Results – live evaluations (control)
(Plot: estimated success rate against WER for the Cambridge system and the baseline, with successes and failures marked for each.)

Slide 50: Summary
POMDP models are an effective model of dialogue: for use in dialogue systems, and for re-ranking semantic hypotheses off-line. Expectation Propagation allows parameter learning for complex models, without annotations of the dialogue state. Experiments show: EP gives improvements in re-ranked hypotheses; EP gives improvements in simulated dialogue management performance; probabilistic belief gives improvements in live dialogue management performance.

Slide 51: Current/Future work
Using the POMDP as a simulator too. Need to change the model to better handle user acts (the sub-acts are not independent!). (Diagram on slide.)

Slide 52: The End
Thanks! Dialogue Group homepage: http://mi.eng.cam.ac.uk/research/dialogue/ My homepage: http://mi.eng.cam.ac.uk/~brmt2/

Slide 53: Expectation Propagation – optimisations
(Diagram: a factor between A and B.) Assume this is constant for A-A*; compute this offline.
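As a closing illustration (not from the slides) of the tied transition factor behind Optimisation 2 and the final slide: a single "probability of change" parameter, with changes spread uniformly over the other values, means the sum over incoming values can be computed once and shared by every value. The parameter value, the belief vector and the function name below are all illustrative.

```python
import numpy as np

def constant_change_update(belief, theta):
    """Predict the next-turn belief for one slot under a tied transition model:
    with probability (1 - theta) the goal stays the same; with probability
    theta it changes, uniformly, to one of the other n - 1 values."""
    n = len(belief)
    total = belief.sum()                  # computed once and reused for every value
    return (1.0 - theta) * belief + theta * (total - belief) / (n - 1)

belief = np.array([0.7, 0.1, 0.1, 0.05, 0.05])     # belief over five values of a slot
print(constant_change_update(belief, theta=0.05))  # a little mass leaks to the other values
```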