Error Awareness and Recovery in Task-Oriented Spoken Dialogue Systems Thesis Proposal Dan Bohus...

Error Awareness and Recovery in Task-Oriented Spoken Dialogue Systems

Thesis Proposal

Dan Bohus
Carnegie Mellon University, January 2004

Thesis Committee
Alex Rudnicky (Chair)
Roni Rosenfeld
Jeff Schneider
Eric Horvitz (Microsoft Research)

2

Problem

Lack of robustness when faced with understanding errors
  Spans most domains and interaction types
  Has a significant impact on performance

3

An example

S: Are you a registered user?
U: No I'm not. No [NO I'M NOT NO]
S: What is your full name?
U: Adam Schumacher [ATHENS IN AKRON]
S: There is an Athens in Georgia and in Greece. Which destination did you want?
U: Neither [NEITHER]
S: I'm sorry, I didn't catch that. There is an Athens in Georgia and in Greece. Which destination did you want?
U: Georgia [GEORGIA]
S: A flight from Athens... Where do you want to go?
U: Start over [START OVER]
S: Do you really want to start over?
U: Yes [YES]
S: What city are you leaving from?
U: Hamilton, Ontario [HILTON ONTARIO]
S: Sorry, I'm not sure I understood what you said. Where are you leaving from?
U: Hamilton [HILTON]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Toronto [TORONTO]

4

Some Statistics …

Corrections [Krahmer, Swerts, Litman, Levow]

  30% of utterances correct system mistakes
  2-3 times more likely to be misrecognized

Semantic error rates

CMU Communicator [CMU] 32%

CU Communicator [CU] 27%

How May I Help You? [AT&T] 36%

Jupiter [MIT] 28%

SpeechActs [SRI] 25%

5

Significant Impact on Interaction

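[Charts: share of sessions that contain understanding errors and share of failed sessions, for CMU Communicator and for the Multi-site Communicator Corpus [Shin et al]. Values shown, in slide order: CMU Communicator 40%, 26%; Multi-site Communicator Corpus 37%, 33%, 63%.]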

6

Outline

Problem
Approach
Infrastructure
Research Program
Timeline & Summary


7

Increasing Robustness …

Increase the accuracy of speech recognition

Assume recognition is unreliable, and create the mechanisms for acting robustly at the dialogue management level

  ASR performance increases / demands increase
  More general


8

Snapshot of Existing Work: Slide 1

Theoretical models of grounding
  Contribution Model [Clark], Grounding Acts [Traum]
  Analytical/Descriptive, not decision oriented

Practice: heuristic rules
  Misunderstandings: threshold(s) on confidence scores
  Non-understandings
  Ad-hoc, lack generality, not easy to extend
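The threshold heuristic for misunderstandings can be sketched as below. This is only an illustration: the function name `choose_action` and the threshold values 0.3 and 0.7 are assumptions made for the example, not values taken from any of the systems discussed.

```python
def choose_action(confidence: float,
                  reject_below: float = 0.3,
                  accept_above: float = 0.7) -> str:
    """Map a confidence score to an error handling action using two thresholds.

    Both thresholds are hypothetical; heuristic systems typically set them by hand.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence < reject_below:
        return "reject"            # too unreliable to use
    if confidence < accept_above:
        return "explicit_confirm"  # uncertain: ask the user
    return "accept"                # reliable enough to use directly
```

Hand-set rules of this kind are easy to write, but the thresholds do not transfer between domains, which is the lack of generality noted on the slide.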

9

Snapshot of Existing Work: Slide 2

Conversation as Action under Uncertainty [Paek and Horvitz]
  Belief networks to model uncertainties
  Decisions based on expected utility, VOI-analysis

Reinforcement learning for dialogue control policies [Singh, Kearns, Litman, Walker, Levin, Pieraccini, Young, Scheffler, etc]
  Formulate dialogue control as an MDP
  Learn optimal control policy from data

Do not scale up to complex, real-world domains


10

Research Program: Goals & Approach

Goal: A task-independent, adaptive and scalable framework for error recovery in task-oriented spoken dialogue systems

Approach: Decision making under uncertainty


11

Three Components

0. Infrastructure

1. Error awareness
  Develop indicators that …
    Assess reliability of information
    Assess how well the dialogue is advancing

2. Error recovery strategies
  Develop and investigate an extended set of conversational error handling strategies

3. Error handling decision process
  Develop a scalable reinforcement-learning based approach for error recovery in spoken dialogue systems


12

Infrastructure

RavenClaw (completed)
  Modern dialog management framework for complex, task-oriented domains

RavenClaw spoken dialogue systems (completed)
  Test-bed for evaluation

13

RavenClaw

Dialogue Task (Specification)
  RoomLine
    Login: Welcome, AskRegistered, AskName, GreetUser
    GetQuery: DateTime, Location, Properties (Network, Projector, Whiteboard)
    GetResults
    DiscussResults
  Concepts: registered, user_name, query, results

Domain-Independent Dialogue Engine
  Dialogue Stack: RoomLine, Login, AskRegistered
  Expectation Agenda (one level per agent on the stack)
    AskRegistered: registered: [No] -> false, [Yes] -> true
    Login: registered: [No] -> false, [Yes] -> true; user_name: [UserName]
    RoomLine: registered: [No] -> false, [Yes] -> true; user_name: [UserName]; query.date_time: [DateTime]; query.location: [Location]; query.network: [Network]
  Error Handling Decision Process
    Indicators
    Strategies (e.g. ExplicitConfirm)
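One way to read the expectation agenda is as an ordered list of levels that are searched from the agent in focus outward. The sketch below illustrates that reading only; the data layout and the `bind` function are hypothetical and are not RavenClaw's actual interface.

```python
from typing import Dict, List, Optional, Tuple

# Each level maps a grammar slot to the concept it would update.
# Levels are ordered from the agent in focus outward (assumed layout).
AGENDA: List[Tuple[str, Dict[str, str]]] = [
    ("AskRegistered", {"[Yes]": "registered", "[No]": "registered"}),
    ("Login", {"[Yes]": "registered", "[No]": "registered",
               "[UserName]": "user_name"}),
    ("RoomLine", {"[Yes]": "registered", "[No]": "registered",
                  "[UserName]": "user_name",
                  "[DateTime]": "query.date_time",
                  "[Location]": "query.location",
                  "[Network]": "query.network"}),
]


def bind(slot: str) -> Optional[Tuple[str, str]]:
    """Return (agent, concept) for the first agenda level that expects the slot."""
    for agent, expectations in AGENDA:
        if slot in expectations:
            return agent, expectations[slot]
    return None  # no agent on the stack expects this slot
```

With this layout, a `[No]` answer binds at the AskRegistered level, while an unprompted `[Location]` still binds at the RoomLine level.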


14

RavenClaw-based Systems

RoomLine
CMU Let’s Go!! Bus Information System
LARRI [Symphony]
TeamTalk [11-741]
Eureka [11-743]







16

Existing Work

Confidence Annotation
  Traditionally focused on speech recognizer [Bansal, Chase, Cox, and others]
  Recently, multiple sources of knowledge [San-Segundo, Walker, Bosch, Bohus, and others]
    Recognition, parsing, dialogue management
  Detect misunderstandings: ~ 80-90% accuracy

Correction and Aware Site Detection [Swerts, Litman, Levow and others]
  Multiple sources of knowledge
  Detect corrections: ~ 80-90% accuracy


17

Proposed: Belief Updating

Continuously assess beliefs in light of initial confidence and subsequent events

An example:

S: Where are you flying from?
U: [CityName={Aspen/0.6; Austin/0.2}]
S: Did you say you wanted to fly out of Aspen?
U: [No] [CityName={Boston/0.8}]

[CityName={Aspen/?; Austin/?; Boston/?}]

initial belief + system action + user response -> updated belief
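To make the update concrete, here is a minimal sketch that applies Bayes' rule to the yes/no answer that follows an explicit confirmation. It is a hand-set stand-in, not the proposed model: the function name, the "<other>" bucket and the likelihoods 0.9 and 0.1 are assumptions for illustration, and the sketch ignores the new Boston hypothesis carried by the same utterance.

```python
from typing import Dict


def update_after_confirmation(belief: Dict[str, float],
                              confirmed: str,
                              user_said_yes: bool,
                              p_yes_if_correct: float = 0.9,
                              p_yes_if_wrong: float = 0.1) -> Dict[str, float]:
    """Bayes update of a concept's belief after an explicit confirmation.

    The two likelihoods are hypothetical; a learned model would estimate them.
    """
    prior = dict(belief)
    # Put the leftover probability mass on an explicit "other value" hypothesis.
    prior["<other>"] = max(0.0, 1.0 - sum(belief.values()))
    posterior = {}
    for value, p in prior.items():
        p_yes = p_yes_if_correct if value == confirmed else p_yes_if_wrong
        likelihood = p_yes if user_said_yes else 1.0 - p_yes
        posterior[value] = p * likelihood
    total = sum(posterior.values())
    return {value: p / total for value, p in posterior.items()}


updated = update_after_confirmation({"Aspen": 0.6, "Austin": 0.2},
                                    confirmed="Aspen", user_said_yes=False)
```

Under these assumed likelihoods, the "No" answer moves most of the probability mass away from Aspen and toward Austin and the unlisted alternatives.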

18

Belief Updating: Approach

Model the update in a dynamic belief network

[Figure: dynamic belief network over time slices t and t + 1. A concept node C (initial belief) at t, with contents (CurrentTop, Current2nd, Current3rd), confidence and correction, is linked through the system action and the user response features (Confidence, Yes, No, PositiveMarkers, NegativeMarkers, UtteranceLength) to the concept node C (updated belief) at t + 1.]






20

Is the Dialogue Advancing Normally?

Locally, turn-level: Non-understanding indicators
  Non-understanding flag directly available
  Develop additional indicators
    Recognition, Understanding, Interpretation

Globally, discourse-level: Dialogue-on-track indicators
  Summary statistics of non-understanding indicators
  Rate of dialogue advance
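The discourse-level indicators could be computed from per-turn logs along the following lines. The definitions are assumptions made for illustration (a windowed non-understanding rate, a count of trailing non-understandings, and concepts acquired per turn); the slides do not fix the exact statistics.

```python
from typing import List


def non_understanding_rate(flags: List[bool], window: int = 5) -> float:
    """Fraction of non-understandings among the last `window` turns."""
    recent = flags[-window:]
    return sum(recent) / len(recent) if recent else 0.0


def trailing_non_understandings(flags: List[bool]) -> int:
    """Number of consecutive non-understandings ending at the current turn."""
    count = 0
    for flag in reversed(flags):
        if not flag:
            break
        count += 1
    return count


def rate_of_advance(concepts_acquired: List[int]) -> float:
    """Average number of new concepts acquired per turn (hypothetical definition)."""
    if not concepts_acquired:
        return 0.0
    return sum(concepts_acquired) / len(concepts_acquired)
```

For example, with flags `[False, True, False, True, True]` the dialogue ends on two consecutive non-understandings, and the rate over the last four turns is 0.75.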







22

Error Recovery Strategies

Identify
  Identify and define an extended set of error handling strategies

Implement
  Construct task-decoupled implementations of a large number of strategies

Evaluate
  Evaluate performance and bring further refinements

23

List of Error Recovery Strategies

User Initiated
  Help
  Where are we?
  Start over
  Scratch concept value
  Go back
  Channel establishment
  Suspend/Resume
  Repeat
  Summarize
  Quit

System Initiated

  Ensure that the system has reliable information (misunderstandings)
    Explicit confirmation
    Implicit confirmation
    Disambiguation
    Ask repeat concept
    Reject concept

  Ensure that the dialogue is on track

    Local problems (non-understandings)
      Switch input modality
      SNR repair
      Ask repeat turn
      Ask rephrase turn
      Notify non-understanding
      Explicit confirm turn
      Targeted help
      WH-reformulation
      Keep-a-word reformulation
      Generic help
      You can say

    Global problems (compounded, discourse-level problems)
      Restart subtask plan
      Select alternative plan
      Start over
      Terminate session / Direct to operator


25

Error Recovery Strategies: Evaluation

Reusability
  Deploy in different spoken dialogue systems

Efficiency of non-understanding strategies
  Simple metric: Is the next utterance understood?
  Efficiency depends on decision process
  Construct upper and lower bounds for efficiency
    Lower bound: decision process which chooses uniformly
    Upper bound: human performs decision process (WOZ)
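The simple metric and the uniform lower bound can be sketched from logged trials as below. The log format, the strategy names in the usage note and the assumption that a strategy's efficiency does not depend on context are all simplifications made for illustration; the upper bound comes from a human wizard and cannot be computed this way.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def strategy_efficiency(log: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Per-strategy fraction of trials in which the next utterance was understood."""
    counts = defaultdict(lambda: [0, 0])  # strategy -> [understood, total]
    for strategy, understood in log:
        counts[strategy][0] += int(understood)
        counts[strategy][1] += 1
    return {s: ok / total for s, (ok, total) in counts.items()}


def uniform_policy_efficiency(per_strategy: Dict[str, float]) -> float:
    """Expected efficiency of a decision process that picks strategies uniformly."""
    return sum(per_strategy.values()) / len(per_strategy)
```

For a log such as `[("AskRepeat", True), ("AskRepeat", False), ("AskRephrase", True), ("YouCanSay", False)]`, the per-strategy efficiencies are 0.5, 1.0 and 0.0, so the uniform policy scores 0.5.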







27

Previous Reinforcement Learning Work

Dialogue control ~ Markov Decision Process
  States
  Actions
  Rewards

Previous work: successes in small domains
  NJFun [Singh, Kearns, Litman, Walker et al]

Problems
  Lack of scalability
  Once learned, policies are not reusable

[Diagram: MDP with states S1, S2, S3 and action A]
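As a reminder of what solving such an MDP involves, here is a minimal value iteration sketch over a toy confirmation problem. Note the substitution: value iteration assumes the transition model is known, whereas the cited work learns policies from dialogue data. The states, actions, probabilities and rewards are all invented for the example.

```python
from typing import Dict, List, Tuple

# transitions[state][action] = list of (probability, next_state, reward)
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]


def q_value(outcomes, values, gamma):
    """Expected return of one action given current state values."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions: Transitions, gamma: float = 0.9,
                    tol: float = 1e-8):
    """Return (state values, greedy policy) for a small, fully known MDP."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:  # terminal state: value stays 0
                continue
            best = max(q_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    policy = {s: max(actions, key=lambda a: q_value(actions[a], values, gamma))
              for s, actions in transitions.items() if actions}
    return values, policy


# Toy problem (hypothetical numbers): confirming costs a turn but avoids errors.
toy: Transitions = {
    "low_conf": {"no_action": [(0.5, "done", 1.0), (0.5, "done", -2.0)],
                 "explicit_confirm": [(1.0, "done", 0.5)]},
    "high_conf": {"no_action": [(0.95, "done", 1.0), (0.05, "done", -2.0)],
                  "explicit_confirm": [(1.0, "done", 0.5)]},
    "done": {},
}
values, policy = value_iteration(toy)
```

In this toy setting the greedy policy confirms at low confidence and takes no action at high confidence.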

28

Proposed Approach

Overcome previous shortcomings:

1. Focus learning only on error handling
  Reduces the size of the learning problem
  Favors reusability of learned policies
  Lessens the system development effort

2. Use a “divide-and-conquer” approach
  Leverage independences in dialogue


29

Gated Markov Decision Processes

[Figure: the RoomLine dialogue task tree (RoomLine, Login, Welcome, GreetUser; concepts user_name and registered) with a Topic-MDP attached to each topic and a Concept-MDP attached to each concept. Each MDP proposes an action (No Action or Explicit Confirm), and the gating mechanism lets one through (Explicit Confirmation).]

Small-size models
Parameters can be tied across models
Easy to design initial policies
Decoupling favors reusability of policies
Accommodate dynamic task generation
Independence assumption

30

Reward structure & learning

Global, post-gate rewards
  [Diagram: several MDPs feed the Gating Mechanism, which outputs one Action; a single Reward follows the gated action]
  Atypical, multi-agent reinforcement learning setting

Local rewards
  [Diagram: several MDPs feed the Gating Mechanism, which outputs one Action; each MDP receives its own Reward]
  Multiple, standard RL problems

Rewards based on any dialogue performance metric

Model-based approaches


31

Evaluation

Performance
  Compare learned policies with initial heuristic policies
  Metrics
    Task completion
    Efficiency
    Number and lengths of error segments
    User satisfaction

Scalability
  Deploy in a system operating with a sizable task
  Theoretical analysis


32

Outline

Problem
Approach
Infrastructure
Research Program
Summary & Timeline


33

Summary of Contributions

Overall Goal: develop a task-independent, adaptive and scalable framework for error recovery in task-oriented spoken dialogue systems

Modern dialogue management framework
Belief updating framework
Investigation of an extended set of error handling strategies
Scalable data-driven approach for learning error handling policies


34

Timeline

Milestones: proposal (now), milestone 1, milestone 2, milestone 3, defense
Time axis: end of year 4, end of year 5, 5.5 years

data
  Data collection for belief updating and WOZ study
  Data collection for RL training
  Data collection for RL evaluation
  Contingency data collection efforts

indicators
  Develop and evaluate the belief updating models
  Implement dialogue-on-track indicators

strategies
  Misunderstanding and non-understanding strategies
  Evaluate non-understanding strategies; develop remaining strategies

decisions
  Investigate theoretical aspects of proposed reinforcement learning model
  Error handling decision process: reinforcement learning experiments
  Additional experiments: extensions or contingency work


35

Thank You!

Questions & Comments: committee members, then floor

36

Indicators: Goals

Goal: Increase awareness and capacity to detect problems
  Develop indicators which can inform the error handling process about potential problems

Understanding process
  System acquires information
    System acquires correct information: OK
    System acquires incorrect information: Misunderstanding
  System does not acquire information: Non-understanding

37

Detailed timeline

Time axis: months 1 to 24 across Year 4, Year 5 and Year 6 (Spring ’04, Summer ’04, Fall ’04, Winter ’04-05, Spring ’05, Summer ’05, Fall ’05, Winter). Milestones: M1 at month 7, M2 at month 12, M3 at month 18. Bracketed numbers identify work items.

Experiments: Data Collection and Experiments
  BDC: Background Data Collection
  [5] (4 months) DC-1: Data collection for belief updating and non-understanding strategies evaluation. WOZ: Wizard-of-Oz experiment for non-understanding strategies
  BDC: Background Data Collection
  [9] (3 months) DC-2L: Data collection for decision process training and baselines
  [11] (2 months) DC-2E: Data collection for decision process evaluation
  [14] (6 months) Contingency (or extension work items) data collections / experiments

Indicators: Belief Updating (Work Item 5)
  [6] (5 months) Build and evaluate belief updating models, integrate in RavenClaw

Indicators: Non-understanding and Dialogue-On-Track Indicators (Work Item 6)
  [7] (3 months) Implement dialogue-on-track indicators

Strategies: Error prevention and recovery strategies (Work Item 8)
  [1] (4 months) Finish RavenClaw implementations for the misunderstanding and non-understanding strategies
  [4] (6 months) Evaluate non-understanding strategies in random exploration mode and in a WOZ setting; develop the rest of the error handling strategies
  [15] (6 months) Refinements of the proposed model, follow-up work for evaluating adaptability and reusability of policies

Decision Process: Reinforcement Learning Work (Work Item 9)
  [2] (12 months) Further investigate the theoretical aspects of the proposed RL model, establish final structure for the topic and concept MDPs, design initial policies, and finalize structure for gating function. Implement the models in the RavenClaw dialogue management framework.
  [10] (6 months) Perform reinforcement learning experiments/evaluation for the decision process
  [16] (6 months) (Contingency time) Alternative data-driven models
  [13] (3 months) Write decision process paper

Writing
  [3] (3 months) Write paper on RavenClaw conversational strategies for error handling
  [8] (3 months) Write belief updating paper
  [12] (10 months) Write thesis document

38

Three Desired Properties

Task-Independence
  Reuse the proposed architecture across different spoken dialogue systems with a minimal amount of authoring effort

Adaptability
  Learn from experience how to adapt to the characteristics of various domains

Scalability
  Applicable in spoken dialogue systems operating with large, practical tasks

39

[Diagram: MDP with states HC, MC, LC and 0. In each of HC, MC and LC the available actions are ExplConf, ImplConf and NoAct; in state 0 the only action is NoAct.]

40

Belief Updating: Approach

Model the update in a dynamic belief network

[Figure: dynamic belief network over time slices t and t + 1, linking concept node C at t to concept node C at t + 1 through SystemAction and the user response features (Confidence, Yes, No, PositiveMarkers, NegativeMarkers, UtteranceLength).]

Top-N values
Fixed structure
Learn parameters

Data collection

Evaluation
  Accuracy
  Soft-error

41

Gated Markov Decision Processes

Issues:
  Structure of individual MDPs
  Gating mechanism
  Reward structure and learning

[Figure: the RoomLine dialogue task tree (RoomLine, Login, Welcome, GreetUser; concepts user_name and registered) with Topic-MDPs and Concept-MDPs proposing actions (No Action, Explicit Confirm) to the gating mechanism, which outputs Explicit Confirmation.]

42

Structure for individual MDPs

State-space: informative subset of corresponding indicators
  Concept-MDPs: confidence / beliefs
  Topic-MDPs: non-understanding, dialogue-on-track indicators

Action-space: corresponding system-initiated error handling strategies


43

Gating Mechanism

Heuristic derived from domain-independent dialogue principles
  Give priority to topics over concepts
  Give priority to entities closer to the conversational focus
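The two priority rules can be sketched as a sort key over the actions proposed by the individual MDPs. Everything named here is hypothetical: the `Proposal` record, the `focus_distance` field and the choice to apply the topic rule before the focus rule are assumptions made to get a runnable example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proposal:
    entity: str          # topic or concept name
    kind: str            # "topic" or "concept"
    focus_distance: int  # 0 = entity currently in conversational focus
    action: str          # action proposed by that entity's MDP


def gate(proposals: List[Proposal]) -> Optional[Proposal]:
    """Pick one proposal: topics before concepts, then closest to the focus."""
    active = [p for p in proposals if p.action != "NoAction"]
    if not active:
        return None  # every MDP proposed NoAction
    # False sorts before True, so topics come first; ties go to the nearest entity.
    return min(active, key=lambda p: (p.kind != "topic", p.focus_distance))
```

For instance, if the `registered` concept proposes an explicit confirmation while every topic proposes no action, the gate forwards the explicit confirmation.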

