Learning Optimal Strategies for Spoken Dialogue Systems
July 13, 2006 ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne) 1
Learning Optimal Strategies for Spoken Dialogue Systems
Diane Litman
University of Pittsburgh
Pittsburgh, PA 15260 USA
Outline
• Motivation
• Markov Decision Processes and Reinforcement Learning
• NJFun: A Case Study
• Advanced Topics
Motivation

• Builders of real-time spoken dialogue systems face fundamental design choices that strongly influence system performance:
  – when to confirm/reject/clarify what the user just said?
  – when to ask a directive versus an open prompt?
  – when to use user, system, or mixed initiative?
  – when to provide positive/negative/no feedback?
  – etc.
• Can such decisions be automatically optimized via reinforcement learning?
Spoken Dialogue Systems (SDS)
• Provide voice access to back-end via telephone or microphone
• Front-end: ASR (automatic speech recognition) and TTS (text to speech)
• Back-end: DB, web, etc.
• Middle: dialogue policy (what action to take at each point in a dialogue)
Typical SDS Architecture

(Diagram: Speech Recognition → Language Understanding → Dialogue Policy ↔ Domain Back-end; Dialogue Policy → Language Generation → Text to Speech)
Reinforcement Learning (RL)
• Learning is associated with a reward
• By optimizing reward, the algorithm learns an optimal strategy
• Application to SDS:
  – Key assumption: an SDS can be represented as a Markov Decision Process
  – Key benefit: formalization (when in a state, what is the reward for taking a particular action, among all action choices?)
Reinforcement Learning and SDS
(Diagram: the SDS architecture again — Speech Recognition, Language Understanding, Dialogue Manager, Domain Back-end, Language Generation, Speech Synthesis — with noisy semantic input flowing into the Dialogue Manager and actions (semantic output) flowing out)

• debate over design choices
• learn choices using reinforcement learning
• agent interacting with an environment
• noisy inputs
• temporal / sequential aspect
• task success / failure
Sample Research Questions
• Which aspects of dialogue management are amenable to learning and what reward functions are needed?
• What representation of the dialogue state best serves this learning?
• What reinforcement learning methods are tractable with large scale dialogue systems?
Outline
• Motivation
• Markov Decision Processes and Reinforcement Learning
• NJFun: A Case Study
• Advanced Topics
Markov Decision Processes (MDP)
• Characterized by:
  – a set of states S an agent can be in
  – a set of actions A the agent can take
  – a reward r(a, s) that the agent receives for taking an action in a state
  – (+ some other things I’ll come back to: gamma, state transition probabilities)
Modeling a Spoken Dialogue System as a Probabilistic Agent

• An SDS can be characterized by:
  – the current knowledge of the system: a set of states S the agent can be in
  – a set of actions A the agent can take
  – a goal G, which implies:
    • a success metric that tells us how well the agent achieved its goal
    • a way of using this metric to create a strategy or policy for what action to take in any particular state
Reinforcement Learning

• The agent interacts with its environment to achieve a goal
• It receives reward (possibly delayed reward) for its actions
  – it is not told what actions to take
  – instead, it learns from indirect, potentially delayed reward to choose sequences of actions that produce the greatest cumulative reward
• Trial-and-error search
  – neither exploitation nor exploration can be pursued exclusively without failing at the task
• Life-long learning
  – on-going exploration
Reinforcement Learning

(Diagram: the agent observes a state, takes an action, and receives a reward from the environment)

Policy π : S → A

Trajectory: s0 (r0), a0, s1 (r1), a1, s2 (r2), a2, …
State Value Function, V

(Diagram: from state s0, action a1 has r(s0, a1) = 2 and leads to s1 with p(s0, a1, s1) = 0.7 or to s2 with p(s0, a1, s2) = 0.3; action a2 has r(s0, a2) = 5 and leads to s2 with p(s0, a2, s2) = 0.5 or to s3 with p(s0, a2, s3) = 0.5)

State s    V(s)
s0         ...
s1         10
s2         15
s3         6

V(s) predicts the future total reward we can obtain by entering state s.

We can exploit V greedily, i.e. in s, choose the action a for which the following is largest:

r(s, a) + Σ_{s' ∈ S} p(s, a, s') V(s')

Choosing a1: 2 + 0.7 × 10 + 0.3 × 15 = 13.5
Choosing a2: 5 + 0.5 × 15 + 0.5 × 6 = 15.5
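The greedy one-step lookahead over V can be sketched in a few lines of Python, using exactly the numbers from the example above (state and action names are strings for readability):

```python
# One-step greedy lookahead over V, with the example's numbers.
V = {"s1": 10, "s2": 15, "s3": 6}          # V(s) for the successor states
R = {("s0", "a1"): 2, ("s0", "a2"): 5}     # immediate reward r(s, a)
P = {                                      # transition probabilities p(s, a, s')
    ("s0", "a1"): {"s1": 0.7, "s2": 0.3},
    ("s0", "a2"): {"s2": 0.5, "s3": 0.5},
}

def one_step_value(s, a):
    """r(s, a) + sum over s' of p(s, a, s') * V(s')."""
    return R[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)].items())

def greedy_action(s, actions):
    """Exploit V greedily: pick the action with the largest one-step value."""
    return max(actions, key=lambda act: one_step_value(s, act))
```

This reproduces the 13.5 vs. 15.5 comparison, so the greedy choice in s0 is a2.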
Action Value Function, Q

Q(s, a) predicts the future total reward we can obtain by executing a in s.

State s    Action a    Q(s, a)
s0         a1          13.5
s0         a2          15.5
s1         a1          ...
s1         a2          ...

We can exploit Q greedily, i.e. in s, choose the action a for which Q(s, a) is largest.
Q Learning

For each (s, a), initialise Q(s, a) arbitrarily
Observe the current state s
Do until the goal state is reached:
  Select action a by exploiting Q ε-greedily, i.e. with probability ε choose a randomly; else choose the a for which Q(s, a) is largest  [exploration versus exploitation]
  Execute a, entering state s' and receiving immediate reward r
  Update the table entry for Q(s, a) with the one-step temporal difference update rule, TD(0):

    Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))

  s ← s'

(Watkins 1989)
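As a concrete sketch, here is the loop above as tabular Q-learning in Python. The toy chain environment in the usage example, the constants (α = 0.5, γ = 0.9, ε = 0.2), and the episode cap are illustrative assumptions, not part of the slide:

```python
import random
from collections import defaultdict

def q_learning(env_step, start, is_goal, actions, episodes=300,
               max_steps=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy and the TD(0) update."""
    rng = random.Random(seed)
    Q = defaultdict(float)                     # Q(s, a), initialised to 0
    for _ in range(episodes):
        s = start
        for _ in range(max_steps):             # cap episode length
            if is_goal(s):
                break
            if rng.random() < epsilon:         # explore
                a = rng.choice(actions)
            else:                              # exploit
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r = env_step(s, a)             # execute a, observe s' and r
            best_next = 0.0 if is_goal(s2) else max(Q[(s2, act)] for act in actions)
            # TD(0): Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```

On a small chain of states 0..3 with reward 1 for reaching state 3, the learned Q values come to prefer moving right at every state.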
More on Q Learning

(Diagram: taking action a in state s yields reward r and next state s'; the TD(0) update moves Q(s, a) toward r + γ max over a' of Q(s', a'))
A Brief Tutorial Example
• A Day-and-Month dialogue system
• Goal: fill in a two-slot frame:
  – Month: November
  – Day: 12th
• Via the shortest possible interaction with the user
• Levin, E., Pieraccini, R. and Eckert, W. A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies. IEEE Transactions on Speech and Audio Processing. 2000.
What is a State?
• In principle, the MDP state could include any possible information about the dialogue
  – complete dialogue history so far
• Usually we use a much more limited set:
  – values of slots in the current frame
  – most recent question asked to the user
  – user's most recent answer
  – ASR confidence
  – etc.
State in the Day-and-Month Example
• Values of the two slots day and month
• Total:
  – 2 special states, initial si and final sf
  – 365 states with both a day and a month
  – 1 state for the leap-year day (February 29)
  – 12 states with a month but no day
  – 31 states with a day but no month
  – 411 total states
Actions in MDP Models of Dialogue
• Speech acts!
  – ask a question
  – explicit confirmation
  – rejection
  – give the user some database information
  – tell the user their choices
• Do a database query
Actions in the Day-and-Month Example
• ad: a question asking for the day
• am: a question asking for the month
• adm: a question asking for the day+month
• af: a final action submitting the form and terminating the dialogue
A Simple Reward Function

• For this example, let's use a cost function for the entire dialogue
• Let:
  – Ni = number of interactions (duration of dialogue)
  – Ne = number of errors in the obtained values (0-2)
  – Nf = expected distance from goal (0 for a complete date, 1 if either day or month is missing, 2 if both are missing)
• Then the (weighted) cost is:

  C = wiNi + weNe + wfNf
3 Possible Policies
Open prompt
Directive prompt
Dumb
P1=probability of error in open prompt
P2=probability of error in directive prompt
3 Possible Policies

p1 = probability of error in open prompt (OPEN)
p2 = probability of error in directive prompt (DIRECTIVE)

Strategy 3 (directive) is better than strategy 2 (open) when the improved error rate justifies the longer interaction:

p1 − p2 > wi / (2we)
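The comparison can be checked numerically under the cost function C defined earlier. The sketch below assumes the open strategy asks once (with per-slot error p1) and the directive strategy asks twice (with per-slot error p2), and the weights wi = 1, we = 5 are illustrative, not from the slide:

```python
# Expected costs of the two strategies (Nf = 0: both complete the date).
def open_cost(p1, wi=1.0, we=5.0):
    # strategy 2: one open question; each of the two slots errs w.p. p1
    return wi * 1 + we * 2 * p1

def directive_cost(p2, wi=1.0, we=5.0):
    # strategy 3: two directive questions; each slot errs w.p. p2
    return wi * 2 + we * 2 * p2

def directive_is_better(p1, p2, wi=1.0, we=5.0):
    return directive_cost(p2, wi, we) < open_cost(p1, wi, we)
```

Algebraically the comparison reduces to the condition on the slide: p1 − p2 > wi / (2we).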
That was an Easy Optimization

• Only two actions, only a tiny number of policies
• In general, the number of actions, states, and policies is quite large
• So finding the optimal policy is harder
• We need reinforcement learning
• Back to MDPs:
MDP

• We can think of a dialogue as a trajectory in state space
• The best policy is the one with the greatest expected reward over all trajectories
• How to compute a reward for a state sequence?
Reward for a State Sequence
• One common approach: discounted rewards
• The cumulative reward Q of a sequence is the discounted sum of the utilities of the individual states
• Discount factor γ between 0 and 1
• Makes the agent care more about current than future rewards; the more distant a reward, the more discounted its value
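The discounted sum described above can be computed directly; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t: the more distant a reward, the less it counts."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For example, with gamma = 0.5 the sequence [1, 1, 1] is worth 1 + 0.5 + 0.25 = 1.75, and a reward received late is worth less than the same reward received early.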
The Markov Assumption

• MDP assumes that state transitions are Markovian:

P(s_{t+1} | s_t, s_{t−1}, …, s_0, a_t, a_{t−1}, …, a_0) = P_T(s_{t+1} | s_t, a_t)
Expected Reward for an Action
• The expected cumulative reward Q(s, a) for taking a particular action from a particular state can be computed by the Bellman equation:
  – immediate reward for the current state
  – + expected discounted utility of all possible next states s'
  – weighted by the probability of moving to that state s'
  – and assuming once there we take the optimal action a'
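In symbols, the four ingredients above are the standard Bellman optimality equation for Q:

```latex
Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q(s', a')
```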
Needed for Bellman Equation
• A model of P(s'|s, a) and an estimate of R(s, a)
  – If we had labeled training data:
    P(s'|s, a) = C(s, s', a) / C(s, a)
  – If we knew the final reward for the whole dialogue, R(s1, a1, s2, a2, …, sn)
• Given these parameters, we can use the value iteration algorithm to learn Q values (pushing reward values back over state sequences) and hence the best policy
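A minimal sketch of value iteration over a tabular MDP, assuming P and R are dictionaries in the shapes shown in the usage example (the dictionary layout is my assumption):

```python
def value_iteration(states, actions, P, R, gamma=0.9, iters=100):
    """Repeatedly apply Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a'),
    pushing reward values back over state sequences."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        Q = {
            (s, a): R.get((s, a), 0.0)
                    + gamma * sum(p * max(Q[(s2, a2)] for a2 in actions)
                                  for s2, p in P.get((s, a), {}).items())
            for s in states for a in actions
        }
    return Q

def best_policy(Q, states, actions):
    """The optimal policy picks the argmax-Q action in each state."""
    return {s: max(actions, key=lambda act: Q[(s, act)]) for s in states}
```

For instance, in a two-state toy dialogue where "submit" ends the dialogue with reward 1 and "confirm" loops, the learned policy submits.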
How to Estimate P(s'|s, a) without Labeled Data

• Random conversations with real people:
  – carefully hand-tune a small number of states and policies
  – build a dialogue system which explores the state space by generating a few hundred random conversations with real humans
  – set probabilities from this corpus
• Random conversations with simulated people:
  – have millions of conversations with simulated people
  – can use a slightly larger state space
Final Reward
• What is the final reward for whole dialogue R(s1,a1,s2,a2,…,sn)?
• The general goodness of a whole dialogue!
• Can be automatically computed using PARADISE evaluation framework (more later)
Outline
• Motivation
• Markov Decision Processes and Reinforcement Learning
• NJFun: A Case Study
• Advanced Topics
NJFun: A Case Study
• Idea: use reinforcement learning to make a small set of optimal policy decisions while interacting with real people, and empirically evaluate
  – Singh, S., D. Litman, M. Kearns, and M. Walker. Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System. Journal of AI Research. 2002.
NJFun (aka RLDS) Dialogues

Task description: "You have lived in Stanhope for many years but have never managed to visit its historic sites. Today you are feeling virtuous. Use RLDS to find out what you can see this morning."

Unoptimized system:
TTS: Welcome to RLDS. Please say an activity name or say 'list activities' for a list of activities I know about.
USER: I would like to visit a historic site in Stanhope this morning.
TTS: Did you say you are interested in going to a zoo?
USER: No.
TTS: I know about amusement parks, aquariums, cruises, historic sites, museums, parks, theaters, wineries, and zoos. Please say an activity name from this list.
USER: I would like to visit a historic site.
TTS: Did you say you are interested in going to a historic site?
USER: Yes.
TTS: Please give me more information.
USER: ...
TTS: Please tell me the location that you are interested in. You can also tell me the time.
… (8 turns omitted)
TTS: Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'.
USER: Bad.
TTS: Goodbye and have a nice day!

Optimized system:
TTS: Welcome to RLDS. How may I help you?
USER: I would like to um find about Stanhope historic sites.
TTS: Did you say you are interested in Stanhope?
USER: Yes.
TTS: What time of the day do you want to go?
USER: Uh... morning.
TTS: Did you say you want to go in the morning?
USER: Yes.
TTS: I found 2 historic sites in Stanhope that are open in the morning. They are the Lenape Indian Village, and the Waterloo Village. Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'.
USER: Good.
TTS: Goodbye and have a nice day!
Some Issues in Automatic Speech Recognition (ASR)
• Inputs: audio file; grammar/language model; acoustic model
• Outputs: utterance matched from grammar, or no match; confidence score
• Performance tradeoff:
  – "small" grammar → high accuracy on constrained utterances, lots of no-matches
  – "large" grammar → match more utterances, but with lower confidence
Some Issues in Dialogue Policy Design
• Initiative policy
• Confirmation policy
• Criteria to be optimized
Initiative Policy
• System initiative vs. user initiative:
  – "Please state your departure city."
  – "How can I help you?"
• Influences expectations
• ASR grammar must be chosen accordingly
• Best choice may differ from state to state
• May depend on user population & task
Confirmation Policy
• High ASR confidence: accept the ASR match and move on
• Moderate ASR confidence: confirm
• Low ASR confidence: re-ask
• How to set the confidence thresholds?
• Early mistakes can be costly later, but excessive confirmation is annoying
Criteria to be Optimized
• Task completion
• Sales revenues
• User satisfaction
• ASR performance
• Number of turns
Typical System Design: Sequential Search
• Choose and implement several "reasonable" dialogue policies
• Field systems, gather dialogue data
• Do statistical analyses
• Refield the system with the "best" dialogue policy
• Can only examine a handful of policies
Why Reinforcement Learning?
• Agents can learn to improve performance by interacting with their environment
• Thousands of possible dialogue policies; we want to automate the choice of the "optimal" one
• Can handle many features of spoken dialogue:
  – noisy sensors (ASR output)
  – stochastic behavior (user population)
  – delayed rewards, and many possible rewards
  – multiple plausible actions
• However, many practical challenges remain
Proposed Approach

• Build an initial system that is deliberately exploratory wrt state and action space
• Use dialogue data from the initial system to build a Markov decision process (MDP)
• Use reinforcement learning methods to compute the optimal policy (here, dialogue policy) of the MDP
• Refield the (improved?) system given by the optimal policy
• Empirically evaluate
State-Based Design

• System state: contains information relevant for deciding the next action
  – info attributes perceived so far
  – individual and average ASR confidences
  – data on the particular user
  – etc.
• In practice, we need a compressed state
• Dialogue policy: a mapping from each state in the state space to a system action
Markov Decision Processes
• System state s (in S)
• System action a (in A)
• Transition probabilities P(s'|s, a)
• Reward function R(s, a) (stochastic)
• Our application: P(s'|s, a) models the population of users
SDSs as MDPs

(Diagram: a dialogue as an alternating sequence of system actions a1, a2, a3, … and user/environment responses e1, e2, e3, …, starting from the initial system utterance and the initial user utterance; actions have probabilistic outcomes)

Estimate the transition probabilities P(next state | current state & action) and the rewards R(current state, action) from a set of exploratory dialogues (random action choice) + system logs.

Violations of the Markov property! Will this work?
Computing the Optimal Policy

• Given the parameters P(s'|s, a) and R(s, a), we can efficiently compute the policy maximizing expected return
• Typically we compute the expected cumulative reward (or Q-value) Q(s, a) using value iteration
• The optimal policy selects the action with the maximum Q-value at each dialogue state
Potential Benefits

• A principled and general framework for automated dialogue policy synthesis
  – learn the optimal action to take in each state
• Compares all policies simultaneously
  – data efficient, because actions are evaluated as a function of state
  – traditional methods evaluate entire policies
• Potential for "lifelong learning" systems, adapting to changing user populations
The Application: NJFun

• Dialogue system providing telephone access to a DB of activities in NJ
• Want to obtain 3 attributes:
  – activity type (e.g., wine tasting)
  – location (e.g., Lambertville)
  – time (e.g., morning)
• Failure to bind an attribute: query the DB with don't-care
NJFun as an MDP
• define the state space
• define the action space
• define the reward structure
• collect data for training & learn a policy
• evaluate the learned policy
The State Space

Feature                    Values    Explanation
Attribute (A)              1,2,3     which attribute is being worked on
Confidence/Confirmed (C)   0,1,2     0,1,2 for low, medium, and high ASR confidence
                           3,4       3,4 for explicitly confirmed, disconfirmed
Value (V)                  0,1       whether a value has been obtained for the current attribute
Tries (T)                  0,1,2     how many times the current attribute has been asked
Grammar (G)                0,1       whether an open or closed grammar was used
History (H)                0,1       whether there was trouble on any previous attribute

N.B. Non-state variables record attribute values; the state does not condition on previous attributes!
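A sketch of this state representation as a feature vector. The value ranges come from the table above; the full cross product is larger than the 62 states NJFun actually reaches, and the 0/1 encodings for grammar and history are assumptions:

```python
from itertools import product

# Feature ranges from the state-space table.
FEATURES = {
    "attribute":  (1, 2, 3),        # which attribute is being worked on
    "confidence": (0, 1, 2, 3, 4),  # 0-2: low/med/high ASR conf; 3/4: (dis)confirmed
    "value":      (0, 1),           # value obtained for the current attribute?
    "tries":      (0, 1, 2),        # times the current attribute has been asked
    "grammar":    (0, 1),           # open vs. closed grammar (mapping assumed)
    "history":    (0, 1),           # trouble on any previous attribute?
}

def all_feature_vectors():
    """Cross product of all feature values; NJFun's 62 states are the reachable subset."""
    return list(product(*FEATURES.values()))
```

The initial state [1,0,0,0,0,0] mentioned later in the deck is one element of this space.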
Sample Action Choices
• Initiative (when T = 0)– user (open prompt and grammar)– mixed (constrained prompt, open grammar)– system (constrained prompt and grammar)
• Example– GreetU: “How may I help you?” – GreetS: “Please say an activity name.”
Sample Confirmation Choices
• Confirmation (when V = 1)– confirm– no confirm
• Example– Conf3: “Did you say want to go in the
<time>?”– NoConf3: “”
July 13, 2006 ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne) 55
Dialogue Policy Class

• Specify "reasonable" actions for each state
  – 42 choice states (binary initiative or confirmation action choices)
  – no choice for all other states
• Small state space (62), large policy space (2^42)
• Example choice state:
  – initial state: [1,0,0,0,0,0]
  – action choices: GreetS, GreetU
• Learn the optimal action for each choice state
Some System Details

• Uses AT&T's WATSON ASR and TTS platform and the DMD dialogue manager
• A natural language web version was used to build multiple ASR language models
• Initial statistics were used to tune the bins for confidence values and the history bit (informative state encoding)
The Experiment

• Designed 6 specific tasks, each with a web survey
• Split 75 internal subjects into training and test sets, controlling for M/F, native/non-native, experienced/inexperienced
• 54 training subjects generated 311 dialogues
• Training dialogues were used to build the MDP
• The optimal policy for BINARY TASK COMPLETION was computed and implemented
• 21 test subjects (for the modified system) generated 124 dialogues
• Did statistical analyses of performance changes
Example of Learning
• The initial state is always:
  – Attribute(1), Confidence/Confirmed(0), Value(0), Tries(0), Grammar(0), History(0)
• Possible actions in this state:
  – GreetU: "How may I help you?"
  – GreetS: "Please say an activity name or say 'list activities' for a list of activities I know about."
• In this state, the system learned that GreetU is the optimal action.
Reward Function

• Binary task completion (objective measure):
  – 1 for 3 correct bindings, else -1
• Task completion (allows partial credit):
  – -1 for an incorrect attribute binding
  – else 0, 1, 2, or 3 for the number of correct attribute bindings
• Other evaluation measures: ASR performance (objective); phone feedback, perceived completion, future use, perceived understanding, user understanding, ease of use (all subjective)
• Optimized for binary task completion, but predicted improvements in the other measures
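Both reward variants are easy to state in code; a minimal sketch (the function names are mine, not NJFun's):

```python
def binary_completion_reward(n_correct):
    """Objective binary measure: 1 if all 3 attributes bound correctly, else -1."""
    return 1 if n_correct == 3 else -1

def partial_credit_reward(n_correct, any_incorrect):
    """Partial-credit measure on the -1..3 scale: -1 if any binding is
    incorrect, otherwise the number of correctly bound attributes."""
    return -1 if any_incorrect else n_correct
```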
Main Results

• Task completion (-1 to 3):
  – train mean = 1.72
  – test mean = 2.18
  – p-value < 0.03
• Binary task completion:
  – train mean = 51.5%
  – test mean = 63.5%
  – p-value < 0.06
Other Results
• ASR performance (0-3):– train mean = 2.48 – test mean = 2.67 – p-value < 0.04
• Binary task completion for experts (dialogues 3-6):– train mean = 45.6%– test mean = 68.2 %– p-value < 0.01
Subjective Measures
• Subjective measures "move to the middle" rather than improve
• First graph: "It was easy to find the place that I wanted" (strongly agree = 5, …, strongly disagree = 1)
  – train mean = 3.38, test mean = 3.39, p-value = .98
Comparison to Human Design

• A fielded comparison is infeasible, but exploratory dialogues provide a Monte Carlo proxy of "consistent trajectories"
• Test policy: average binary completion reward = 0.67 (based on 12 trajectories)
• Outperforms several standard fixed policies:
  – SysNoConfirm: -0.08 (11)
  – SysConfirm: -0.6 (5)
  – UserNoConfirm: -0.2 (15)
  – Mixed: -0.077 (13)
  – UserConfirm: 0.2727 (11), no difference
A Sanity Check of the MDP

• Generate many random policies
• Compare value according to the MDP and value based on consistent exploratory trajectories
• MDP evaluation of a policy: ideally perfectly accurate (infinite Monte Carlo sampling), a linear fit with slope 1, intercept 0
• Correlation between Monte Carlo and MDP:
  – 1000 policies, > 0 trajectories: corr. 0.31, slope 0.953, intercept 0.067, p < 0.001
  – 868 policies, > 5 trajectories: corr. 0.39, slope 1.058, intercept 0.087, p < 0.001
Conclusions from NJFun

• MDPs and RL are a promising framework for automated dialogue policy design
• Practical methodology for system-building:
  – given a relatively small number of exploratory dialogues, learn the optimal policy within a large policy search space
• NJFun: the first empirical test of the formalism
• Resulted in measurable and significant system improvements, as well as interesting linguistic results
Caveats
• Must still choose states, actions, reward
• Must be exploratory with taste
• Data sparsity
• Violations of the Markov property
• A formal framework and methodology, hopefully automating one important step in system design
Outline
• Motivation
• Markov Decision Processes and Reinforcement Learning
• NJFun: A Case Study
• Advanced Topics
Some Current Research Topics
• Scale to more complex systems
• Automate the state representation
• POMDPs due to hidden state
• Learn terminal (and non-terminal) reward functions
• Online rather than batch learning
Addressing Scalability
• Approach 1: user models / simulations
  – costly to obtain real data → simulate users
    • inexpensive and potentially richer source of large corpora
    • but what is the quality of the simulated data?
      – again, real-world evaluation becomes paramount
• Approach 2: value function approximation
  – data-driven state abstraction / aggregation
Some Example Simulation Models
• P(userAction | systemAction)
• P(yesAnswer | explicitConfirmation)
• P(yesAnswer | explicitConfirmation, goal)
• E.g.:
  – Levin, Pieraccini, Eckert
  – Georgila, Henderson, Lemon
  – Pietquin
  – Scheffler and Young
Example Simulation Models in Spoken Dialogue Tutoring
• ProbCorrect Model
  – answers a question with the same correctness rate as a real student
• Random Model
  – gives random answers
• Hua Ai and Diane Litman. Comparing Real-Real, Simulated-Simulated, and Simulated-Real Spoken Dialogue Corpora. Proceedings of the AAAI Workshop on Statistical and Empirical Approaches for Spoken Dialogue Systems. 2006.
ProbCorrect Model

• Compute correctness rates from the system logs
• Use real student utterances as candidate answers
• A bigram model: pick an answer based on the question, from that question's answer pool (answers for question i, answers for question j, …)
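A sketch of such a simulated student. The class name and data layout are my assumptions, but the behavior matches the bigram description: sample correctness at the question's empirical rate, then sample a real student answer of that correctness:

```python
import random

class ProbCorrectStudent:
    """Simulated student: answers question q correctly with the empirical
    rate for q (from system logs), using real student utterances as the
    candidate answers."""

    def __init__(self, correct_rate, answer_pool, seed=0):
        self.correct_rate = correct_rate  # question -> P(correct answer)
        self.answer_pool = answer_pool    # question -> {True: [...], False: [...]}
        self.rng = random.Random(seed)

    def answer(self, question):
        # First sample whether this answer is correct, then pick a real
        # student utterance of that correctness for this question.
        correct = self.rng.random() < self.correct_rate[question]
        return self.rng.choice(self.answer_pool[question][correct])
```

The Random Model on the next slide differs only in ignoring the question: it draws from one pooled set of utterances.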
Random Model

• Randomly pick an answer from the whole utterance set
• A unigram model: pick an answer from the whole set, ignoring which question was asked
Simulated Dialogue (ProbCorrect Model)
Tutor said:
To analyze this problem, which law of motion would you use?
You said: The second law.
Tutor said:
The best law is the 3rd law. Do you recall what this law says?
You said: No.
Tutor said:
It says… Is there a gravitational force here?
You said: Yes.
Tutor said:
Good. Now, what’s the relationship between the two forces’ magnitudes?
You said: They are equal.
Tutor said:
Yeah. What’s the relationship between the two forces’ directions?
You said: They’re different.
Tutor said:
To summarize…
Simulated Dialogue (Random Model)
Tutor said: To analyze this problem, which law of motion would you use?
You said: More.
Tutor said: The best law is the 3rd law. Do you recall what this law says?
You said: After all forces.
Tutor said: It says… Is there a gravitational force here?
You said: Yes.
Tutor said: We just discussed... Now, what’s the relationship between the two forces’ magnitudes?
You said: Yes.
Tutor said: Well... What’s the relationship between the two forces’ directions?
You said: It’s decreasing.
Tutor said: To summarize…
Evaluating Simulation Models
• Does the model produce human-like behavior?
  – compare real and simulated user responses
  – metrics: precision and recall
• Does the model reproduce the variety of human behavior?
  – compare real and simulated dialogue corpora
  – metrics: statistical characteristics of dialogue features (see below)
Evaluating Simulated Corpora

• High-level dialogue features:
  – dialogue length (number of turns)
  – turn length (number of actions per turn)
  – participant activity (ratio of system/user actions per dialogue)
• Dialogue style and cooperativeness:
  – proportion of goal-directed dialogues vs. others
  – number of times a piece of information is re-asked
• Dialogue success rate and efficiency:
  – average goal/subgoal achievement rate

– Schatzmann, J., Georgila, K., and Young, S. Quantitative Evaluation of User Simulation Techniques for Spoken Dialogue Systems. In Proceedings 6th SIGdial Workshop on Discourse and Dialogue. 2005.
Evaluating ProbCorrect vs. Random
• Differences shown by similar metrics are not necessarily related to the reality level
  – two real corpora can be very different
• Metrics can distinguish to some extent
  – real from simulated corpora
  – two simulated corpora generated by different models trained on the same real corpus
  – two simulated corpora generated by the same model trained on two different real corpora
Scalability Approach 2: Function Approximation

• Q can be represented by a table only if the number of states and actions is small
• Besides, a table makes poor use of experience
• Hence, we use function approximation, e.g.,
  – neural nets
  – weighted linear functions
  – case-based/instance-based/memory-based representations
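As a concrete illustration of the weighted-linear-function option, here is a minimal Q-learning sketch with a linear approximator. The feature map, sizes, and learning parameters are illustrative assumptions, not a published configuration.

```python
# Q-value approximation with a weighted linear function over state
# features, updated by one Q-learning step via gradient descent.
import numpy as np

def featurize(state, action, n_features=4, n_actions=2):
    """Toy feature map: one copy of the state features per action."""
    phi = np.zeros(n_features * n_actions)
    phi[action * n_features:(action + 1) * n_features] = state
    return phi

def q_value(w, state, action):
    return w @ featurize(state, action)

def q_learning_update(w, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9):
    # TD target: immediate reward plus discounted best next-state value.
    target = reward + gamma * max(q_value(w, next_state, a) for a in (0, 1))
    td_error = target - q_value(w, state, action)
    return w + alpha * td_error * featurize(state, action)

w = np.zeros(8)
s = np.array([1.0, 0.0, 1.0, 0.0])
s_next = np.array([0.0, 1.0, 0.0, 1.0])
w = q_learning_update(w, s, action=0, reward=1.0, next_state=s_next)
```

Because the weight vector generalizes across states sharing features, one observed transition informs the value of many similar states, which is precisely why this makes better use of experience than a table.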
Current Research Topics
• Scale to more complex systems
• Automate state representation
• POMDPs due to hidden state
• Learn terminal (and non-terminal) reward function
• Online rather than batch learning
Designing the State Representation
• Incrementally add features to a state and test whether the learned strategy improves
– Frampton, M. and Lemon, O. Learning More Effective Dialogue Strategies Using Limited Dialogue Move Features. Proceedings ACL/Coling. 2006.
• Adding Last System and User Dialogue Acts improves performance by 7.8%
– Tetreault J. and Litman, D. Using Reinforcement Learning to Build a Better Model of Dialogue State. Proceedings EACL. 2006.
• See below
Example Methodology and Evaluation in SDS Tutoring
• Construct MDPs to test the inclusion of new state features against a baseline
  – Develop baseline state and policy
  – Add a state feature to the baseline and compare policies
  – A feature is deemed important if adding it results in a change in policy from the baseline policy
– Joel R. Tetreault and Diane J. Litman. Comparing the Utility of State Features in Spoken Dialogue Using Reinforcement Learning. Proceedings HLT/NAACL. 2006.
Baseline Policy
• Trend: if you only have student correctness as a model of student state, the best policy is to always give simple feedback
#   State        State Size   Policy
1   [Correct]    1308         SimpleFeedback
2   [Incorrect]  872          SimpleFeedback
Adding Certainty Features:Hypothetical Policy Change
#   Baseline State   Baseline Policy   +Certainty States
1   [C]              SimFeed           [C,Certain], [C,Neutral], [C,Uncertain]
2   [I]              SimFeed           [I,Certain], [I,Neutral], [I,Uncertain]

+Certainty Policy 1 (0 shifts): SimFeed, SimFeed, SimFeed / SimFeed, SimFeed, SimFeed
+Certainty Policy 2 (5 shifts): Mix, SimFeed, Mix / Mix, ComplexFeedback, Mix
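The shift-counting comparison above can be sketched as a small function: a "shift" is any feature-augmented state whose learned action differs from the baseline action for the corresponding baseline state. The policy tables mirror the hypothetical slide example, and all names are invented for illustration.

```python
# Count "policy shifts" between a baseline policy and a policy learned
# over a feature-augmented state space. Policies here are hypothetical,
# mirroring the "+Certainty Policy 2" example above.

def count_shifts(baseline_policy, augmented_policy, project):
    """project maps an augmented state to its baseline state."""
    return sum(
        1 for state, action in augmented_policy.items()
        if action != baseline_policy[project(state)]
    )

baseline = {"C": "SimFeed", "I": "SimFeed"}
augmented = {
    ("C", "Certain"): "Mix",
    ("C", "Neutral"): "SimFeed",
    ("C", "Uncertain"): "Mix",
    ("I", "Certain"): "Mix",
    ("I", "Neutral"): "ComplexFeedback",
    ("I", "Uncertain"): "Mix",
}
shifts = count_shifts(baseline, augmented, project=lambda s: s[0])
print(shifts)  # prints 5
```

Many shifts suggest the added feature genuinely changes what the learner considers optimal; zero shifts suggest the feature is uninformative for action choice.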
Evaluation Results

• Incorporating new features into the standard tutorial state representation has an impact on Tutor Feedback policies
• Including Certainty, Student Moves, and Concept Repetition in the state effected the most change
• Similar feature utility for choosing Tutor Questions
Designing the State Representation (continued)
• Other Approaches, e.g.,
• Paek, T. and Chickering, D. The Markov Assumption in Spoken Dialogue Management. Proc. SIGDial. 2005.
• Henderson, J., Lemon, O, and Georgila, K. Hybrid Reinforcement/Supervised Learning for Dialogue Policies from Communicator Data. Proc. IJCAI Workshop on K&R in Practical Dialogue Systems. 2005.
Current Research Topics
• Scale to more complex systems
• Automate state representation
• POMDPs due to hidden state
• Learn terminal (and non-terminal) reward function
• Online rather than batch learning
Beyond MDPs
• Partially Observable MDPs (POMDPs)
  – We don’t REALLY know the user’s state (we only know what we THOUGHT the user said)
  – So we need to take actions based on our BELIEF, i.e., a probability distribution over states rather than the “true state”
– e.g., Roy, Pineau and Thrun; Young and Williams
• Decision Theoretic Methods– e.g., Paek and Horvitz
Why POMDPs?
• Does “state” model uncertainty natively (i.e., is it partially rather than fully observable)?
  – Yes: POMDP and DT
  – No: MDP
• Does the system plan (i.e., can cumulative reward force the system to construct a plan for choice of immediate actions)?
  – Yes: MDP and POMDP
  – No: DT
POMDP Intuitions

• At each time step t the machine is in some hidden state s ∈ S
• Since we don’t observe s, we keep a distribution over states called a “belief state” b
• The probability of being in state s given belief state b is b(s)
• Based on the current belief state b, the machine
  – selects an action a_m ∈ A_m
  – receives a reward r(s, a_m)
  – transitions to a new (hidden) state s′, where s′ depends only on s and a_m
• The machine then receives an observation o′ ∈ O, which depends on s′ and a_m
• The belief distribution is then updated based on o′ and a_m
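The update step described above is a standard Bayes filter over hidden states: b′(s′) ∝ O(o′ | s′, a) · Σ_s T(s′ | s, a) · b(s). A minimal sketch, with toy transition and observation tables as assumptions:

```python
# One step of the POMDP belief update: predict via the transition
# model, then weight by the observation likelihood and renormalize.
# The two-state tables below are invented for illustration.

def belief_update(belief, action, observation, T, O):
    """belief: dict state -> prob; T[(s, a)]: dict s' -> prob;
    O[(s', a)]: dict observation -> prob."""
    new_belief = {}
    for s_next in belief:
        predicted = sum(T[(s, action)].get(s_next, 0.0) * p
                        for s, p in belief.items())
        new_belief[s_next] = O[(s_next, action)].get(observation, 0.0) * predicted
    norm = sum(new_belief.values())
    return {s: p / norm for s, p in new_belief.items()}

T = {("A", "ask"): {"A": 0.7, "B": 0.3},
     ("B", "ask"): {"A": 0.2, "B": 0.8}}
O = {("A", "ask"): {"yes": 0.9, "no": 0.1},
     ("B", "ask"): {"yes": 0.4, "no": 0.6}}

b = belief_update({"A": 0.5, "B": 0.5}, "ask", "yes", T, O)
```

In a dialogue setting the observation o′ would be the (possibly misrecognized) ASR output, which is exactly why the system cannot trust any single “true state”.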
How to Learn Policies?
• The state space is now continuous
  – With a smaller discrete state space, an MDP could use dynamic programming; this doesn’t work for a POMDP
• Exact solutions only work for small spaces
• Need approximate solutions
• And simplifying assumptions
Current Research Topics
• Scale to more complex systems
• Automate state representation
• POMDPs due to hidden state
• Learn terminal (and non-terminal) reward function
• Online rather than batch learning
Dialogue System Evaluation
• The normal reason: we need a metric to help us compare different implementations
• A new reason: we need a metric for “how good a dialogue went” to automatically improve SDS performance via reinforcement learning
  – Marilyn Walker. An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email. JAIR. 2000.
PARADISE: PARAdigm for DIalogue System Evaluation
• “Performance” of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and how it gets accomplished
• Walker, M. A., Litman, D. J., Kamm, C. A., and Abella, A. PARADISE: A Framework for Evaluating Spoken Dialogue Agents. Proceedings of ACL/EACL. 1997.
Performance as User Satisfaction (from Questionnaire)
PARADISE Framework
• Measure parameters (interaction costs and benefits) and performance in a corpus
• Train a model via multiple linear regression over parameters, predicting performance:

  System Performance = ∑_{i=1}^{n} w_i * p_i

• Test the model on a new corpus
• Predict performance during future system design
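The training step amounts to ordinary least squares over the measured parameters. A minimal sketch with invented data follows; the parameter columns and satisfaction scores are not from any real corpus.

```python
# PARADISE-style performance model: multiple linear regression from
# interaction parameters to a user-satisfaction score.
# All numbers below are invented for illustration.
import numpy as np

# columns: task completion, mean recognition score, barge-in %, reject %
params = np.array([
    [1.0, 0.95, 0.10, 0.05],
    [0.0, 0.60, 0.40, 0.30],
    [1.0, 0.80, 0.20, 0.10],
    [1.0, 0.90, 0.05, 0.00],
    [0.0, 0.70, 0.30, 0.20],
    [1.0, 0.85, 0.15, 0.05],
])
user_sat = np.array([4.5, 2.0, 3.8, 4.7, 2.5, 4.1])

# Add an intercept column, then solve least squares for the weights w_i
# in performance = sum_i w_i * p_i.
X = np.hstack([np.ones((len(params), 1)), params])
weights, *_ = np.linalg.lstsq(X, user_sat, rcond=None)
predicted = X @ weights
```

The learned weights then predict performance for dialogues in a held-out corpus, and eventually for dialogues from a system still being designed.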
Example Learned Performance Function from Elvis [Walker 2000]
• User Sat. = .27*COMP + .54*MRS - .09*BargeIn% + .15*Reject%
  – COMP: user perception of task completion (task success)
  – MRS: mean (concept) recognition accuracy (quality cost)
  – BargeIn%: normalized # of user interruptions (quality cost)
  – Reject%: normalized # of ASR rejections (quality cost)
• Amount of variance in User Sat. accounted for by the model
  – Average Training R² = .37
  – Average Testing R² = .38
• Used as Reward for Reinforcement Learning
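A tiny sketch of using this learned function as an RL reward signal: the coefficients are the ones reported on the slide, while the example inputs are invented.

```python
# Predicted user satisfaction from one dialogue's measured parameters,
# using the Elvis coefficients reported above. Inputs are invented.

def predicted_user_sat(comp, mrs, barge_in_pct, reject_pct):
    return 0.27 * comp + 0.54 * mrs - 0.09 * barge_in_pct + 0.15 * reject_pct

# e.g. a completed task with high recognition accuracy and one barge-in
reward = predicted_user_sat(comp=1.0, mrs=0.9, barge_in_pct=0.1, reject_pct=0.0)
print(round(reward, 3))  # prints 0.747
```

Computed once per completed dialogue, this score serves as the terminal reward that the reinforcement learner propagates back through the dialogue's states.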
Some Current Research Topics
• Scale to more complex systems
• Automate state representation
• POMDPs due to hidden state
• Learn terminal (and non-terminal) reward function
• Online rather than batch learning
Offline versus Online Learning
[Diagram: training data and an MDP (offline) yield a policy for the dialogue system, which interacts online with a user simulator or a human user]

• MDP typically works offline
• Would like to learn the policy online
  – The system can improve over time
  – The policy can change as the environment changes
• Interactions work online
Summary

• (PO)MDPs and RL are a promising framework for automated dialogue policy design
  – Designer states the problem and the desired goal
  – Solution methods find (or approximate) optimal plans for any possible state
  – Disparate sources of uncertainty are unified into a probabilistic framework
• Many interesting problems remain, e.g.,
  – using this approach as a practical methodology for system building
  – making more principled choices (states, rewards, discount factors, etc.)
Acknowledgements
• Talks on the web by Dan Bohus, Derek Bridge, Joyce Chai, Dan Jurafsky, Oliver Lemon and James Henderson, Jost Schatzmann and Steve Young, and Jason Williams were used in the development of this presentation
• Slides from ITSPOKE group at University of Pittsburgh
Further Information

• Reinforcement Learning
  – Sutton, R. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press. 1998 (much available online)
  – Artificial Intelligence and Machine Learning journals and conferences
• Application to Dialogue
  – Jurafsky, D. and Martin, J. Dialogue and Conversational Agents. Chapter 19 of Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of May 18, 2005 (available online only)
  – “ACL” literature
  – Spoken language community (e.g., IEEE and ISCA publications)