SDS Architectures

Julia Hirschberg, COMS 4706 (Thanks to Josh Gordon for slides.)





SDS Architectures
- Software abstractions that coordinate the NLP components required for human-computer dialogue
- Conduct task-oriented, limited-domain conversations
- Manage the levels of information processing (e.g., utterance interpretation, turn-taking) needed for dialogue
- In real time, under uncertainty

Examples: Information-Seeking, Transactional
- Most common
- CMU bus route information (Let's Go Public)
- Columbia Virtual Librarian
- Google Directory service


Examples: USC Virtual Humans
- Multimodal input / output
- Prosody and facial expression
- Auditory and visual cues assist turn-taking
- Many limitations: scripting, constrained domain

http://ict.usc.edu/projects/virtual_humans

Examples: Interactive Kiosks

- Multi-participant conversations
- Surprises and challenges passersby to trivia games [Bohus and Horvitz, 2009]
(Note: Dan Bohus built RavenClaw at CMU.)

Examples: Robotic Interfaces

www.cellbots.com

- Speech interface to a UAV [Eliasson, 2007]

Conversational Skills
SDS architectures tie together:
- Speech recognition
- Turn-taking
- Dialogue management
- Utterance interpretation
- Grounding mutual information
- Natural language generation
And increasingly include:
- Multimodal input / output
- Gesture recognition

Research Challenges
- Speech recognition: accuracy in interactive settings, detecting emotion
- Turn-taking: fluidly handling overlap, backchannels
- Dialogue management: increasingly complex domains, better generalization, multi-party conversations
- Utterance interpretation: reducing constraints on what the user can say and how they can say it; attending to prosody, emphasis, speech rate

Real-World SDS: CMU Olympus
- Open-source collection of dialogue system components
- Research platform used to investigate dialogue management, turn-taking, and spoken language interpretation
- Actively developed
- Many implementations: Let's Go Public, TeamTalk, CheckItOut
www.speech.cs.cmu.edu

Conventional SDS Pipeline

Speech signals to words. Words to domain concepts. Concepts to system intentions. Intentions to utterances (represented as text). Text to speech.
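To make the pipeline concrete, here is a minimal sketch in Python; every function is a hypothetical stub standing in for a real component (this is not Olympus code, and all names are illustrative):

# Minimal sketch of a conventional SDS pipeline; each function is a stub
# standing in for a real component.

def recognize(audio):
    """ASR: speech signal -> best word hypothesis (plus n-best, in practice)."""
    return "the language of sycamores"

def understand(words):
    """SLU: words -> domain concepts (a semantic frame)."""
    return {"dialogue_act": "book_request", "title": words}

def decide(frame):
    """DM: concepts -> system intention."""
    return {"act": "explicit_confirm", "concept": "title", "value": frame["title"]}

def generate(intention):
    """NLG: intention -> utterance text."""
    return "Did you say %s?" % intention["value"]

def synthesize(text):
    """TTS: text -> audio (here, just a placeholder)."""
    return text.encode("utf-8")

def one_turn(audio):
    return synthesize(generate(decide(understand(recognize(audio)))))

A real system also threads confidence scores and timing information through every stage, which is what the remaining slides elaborate.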

Olympus under the Hood: Provider Components
(RavenClaw: Dan Bohus' thesis)

Speech recognition

- The DM asks for information

The Sphinx Open Source Recognition Toolkit
- PocketSphinx: a continuous-speech, speaker-independent recognition system
- Includes tools for language model compilation, pronunciation, and acoustic model adaptation
- Provides word-level confidence annotation and n-best lists
- Efficient: runs on embedded devices (including an iPhone SDK)
- Olympus supports parallel decoding engines / models
- Typically runs parallel acoustic models for male and female speech
http://cmusphinx.sourceforge.net/
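A minimal decoding loop with the PocketSphinx Python bindings might look roughly like the following. This assumes the pip-installed pocketsphinx package with its bundled en-us models and a raw 16 kHz, 16-bit mono audio file; the exact API differs between versions, so treat this as a sketch rather than reference usage:

import os
from pocketsphinx import Decoder, get_model_path

# Point the decoder at the bundled US English acoustic model, language model,
# and pronunciation dictionary (paths assume the default package layout).
model_path = get_model_path()
config = Decoder.default_config()
config.set_string('-hmm', os.path.join(model_path, 'en-us'))
config.set_string('-lm', os.path.join(model_path, 'en-us.lm.bin'))
config.set_string('-dict', os.path.join(model_path, 'cmudict-en-us.dict'))

decoder = Decoder(config)
decoder.start_utt()
with open('utterance.raw', 'rb') as f:      # assumed 16 kHz, 16-bit mono PCM
    while True:
        buf = f.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

# Best hypothesis plus a few n-best alternatives for downstream components.
if decoder.hyp() is not None:
    print(decoder.hyp().hypstr)
    for best, _ in zip(decoder.nbest(), range(5)):
        print(best.hypstr, best.score)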

Speech recognition challenges in interactive settings
- Noisy conditions

Spontaneous Dialogue Hard for ASR
- Poor in interactive settings compared to one-off applications like voice search and dictation
- Performance phenomena: backchannels, pause-fillers, false starts
- OOV words
- Interaction with an SDS is cognitively demanding for users: What can I say, and when? Will the system understand me?
- Uncertainty increases disfluency, resulting in further recognition errors

Sample Word Error Rates
Non-interactive settings:
- Google Voice Search: 17% deployed (0.57% OOV over 10k queries randomly sampled from Sept-Dec, 2008)
Interactive settings:
- Let's Go Public: 17% in controlled conditions vs. 68% in the field
- CheckItOut: used to investigate task-oriented performance under worst-case ASR: 30% to 70% depending on experiment
- Virtual Humans: 37% in laboratory conditions

Examples of (worst-case) Recognizer Error
S: What book would you like?
U: The Language of Sycamores
ASR: THE LANGUAGE OF IS .A. COMING WARS
ASR: SCOTT SARAH SCOUT LAW

Error Propagation
- Recognizer noise injects uncertainty into the pipeline
- Information loss occurs when moving from an acoustic signal to a lexical representation
- Most SDSs ignore prosody, amplitude, emphasis
- Information provided to downstream components includes an n-best list or word lattice, and low-level features: speech rate, speech energy

Spoken Language Understanding

SLU maps from words to concepts
- Dialog acts (the overall intent of an utterance)
- Domain-specific concepts (like a book, or a bus route)
- Single utterances vs. SLU across turns
- Challenging in noisy settings
Ex. "Does the library have The Hitchhiker's Guide to the Galaxy by Douglas Adams on audio cassette?"
  Dialog act: Book Request
  Title: The Hitchhiker's Guide to the Galaxy
  Author: Douglas Adams
  Media: Audio Cassette

Semantic Grammars
- Domain-independent concepts: [Yes], [No], [Help], [Repeat], [Number]
- Domain-specific concepts: [Book], [Author]
[Quit]
    (*THANKS *good bye)
    (*THANKS goodbye)
    (*THANKS +bye)
;

THANKS
    (thanks *VERY_MUCH)
    (thank you *VERY_MUCH)

VERY_MUCH
    (very much)
    (a lot)
;
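As a rough, hypothetical illustration of what such a grammar does (concept spotting over a noisy hypothesis; this is not the Phoenix implementation), a few regular expressions can stand in for the slot patterns above:

import re

# Toy concept spotter: each pattern loosely mirrors one of the grammar slots above.
CONCEPT_PATTERNS = {
    "Quit": re.compile(r"\b(?:thanks\s+|thank\s+you\s+)?(?:good\s*bye|bye)\b"),
    "Yes":  re.compile(r"\b(?:yes|yeah|sure)\b"),
    "No":   re.compile(r"\b(?:no|nope)\b"),
}

def spot_concepts(hypothesis):
    """Map a (possibly noisy) ASR hypothesis to the concepts it contains."""
    hyp = hypothesis.lower()
    found = {}
    for name, pattern in CONCEPT_PATTERNS.items():
        match = pattern.search(hyp)
        if match:
            found[name] = match.group(0)
    return found

print(spot_concepts("THANKS GOOD BYE"))   # {'Quit': 'thanks good bye'}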

Grammars Generalize Poorly
- Useful for extracting fine-grained concepts, but:
- Hand-engineered
- Time-consuming to develop and tune
- Requires expert linguistic knowledge to construct
- Difficult to maintain over complex domains
- Lack robustness to OOV words, novel phrasing
- Sensitive to recognizer noise

SLU in Olympus: the Phoenix Parser
- Phoenix is a semantic parser intended to be robust to recognition noise
- Phoenix parses the incoming stream of recognition hypotheses
- Maps words in ASR hypotheses to semantic frames
- Each frame has an associated CFG grammar specifying word patterns that match the slot
- Multiple parses may be produced for a single utterance
- The frame is forwarded to the next component in the pipeline

Statistical Methods
- Supervised learning is commonly used for single-utterance interpretation
- Given a word sequence W, find the semantic representation of meaning M that has maximum a posteriori probability P(M|W)
- Useful for dialogue act identification, determining broad intent
- Like all supervised techniques: requires a training corpus; often domain- and recognizer-dependent

Belief updating
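One way to picture belief updating is a simplified Bayesian update over a single concept value (a toy sketch, not the Olympus/RavenClaw belief-updating model): the system keeps a confidence that a concept value is correct and revises it after each confirmation answer, given assumed answer reliabilities.

def update_belief(p_correct, user_said_yes,
                  p_yes_if_correct=0.9, p_yes_if_wrong=0.1):
    """Bayesian update of the belief that a concept value is correct,
    after the user answers a confirmation question.
    The two likelihoods are illustrative assumptions, not learned values."""
    if user_said_yes:
        num = p_yes_if_correct * p_correct
        den = num + p_yes_if_wrong * (1.0 - p_correct)
    else:
        num = (1.0 - p_yes_if_correct) * p_correct
        den = num + (1.0 - p_yes_if_wrong) * (1.0 - p_correct)
    return num / den

belief = 0.6                               # initial ASR confidence in "10 a.m."
belief = update_belief(belief, user_said_yes=True)
print(round(belief, 2))                    # ~0.93: the confirmation raised the belief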

Cross-utterance SLU
U: "Get my coffee cup and put it on my desk. The one at the back."
- Difficult in noisy settings
- Mostly new territory for SDS [Zuckerman, 2009]

Dialogue Management

The Dialogue Manager
- Represents the system's agenda
- Many techniques: hierarchical plans, state / transaction tables, Markov processes
- System initiative vs. mixed initiative
- System initiative means less uncertainty about the dialogue state, but is time-consuming and restrictive for users
- Required to manage uncertainty and error handling: belief updating, domain-independent error handling strategies (sketched below)
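As a sketch of what a domain-independent error-handling decision might look like (the thresholds and strategy names are illustrative assumptions, not the RavenClaw policy), the manager can pick a strategy from its current confidence in a concept value:

def choose_strategy(confidence):
    """Pick an error-handling strategy from the belief in a concept value.
    Thresholds are illustrative; real systems tune or learn them."""
    if confidence >= 0.90:
        return "accept"                # use the value without asking
    elif confidence >= 0.70:
        return "implicit_confirm"      # "Starting at 10 a.m. ... until what time?"
    elif confidence >= 0.40:
        return "explicit_confirm"      # "Did you say you wanted a room starting at 10 a.m.?"
    else:
        return "ask_repeat"            # non-understanding: "Can you please repeat that?"

for c in (0.95, 0.8, 0.5, 0.2):
    print(c, choose_strategy(c))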

Task Specification, Agenda, and Execution

[Bohus, 2007] System for conference room scheduling.

Domain-Independent Error Handling

[Bohus, 2007] Parameterized for different concepts: outputs a decision to confirm (and how), or to request repetition.

Error Recovery Strategies
Misunderstanding strategies and example prompts:
- Explicit confirmation: "Did you say you wanted a room starting at 10 a.m.?"
- Implicit confirmation: "Starting at 10 a.m. ... until what time?"
Non-understanding strategies and example prompts:
- Notify that a non-understanding occurred: "Sorry, I didn't catch that."
- Ask the user to repeat: "Can you please repeat that?"
- Ask the user to rephrase: "Can you please rephrase that?"
- Repeat the prompt: "Would you like a small room or a large one?"

Statistical Approaches to Dialogue Management
- Learning a management policy from a corpus
- Dialogue can be modeled as a Partially Observable Markov Decision Process (POMDP)
- Reinforcement learning is applied (either to existing corpora or to user simulation studies) to learn an optimal strategy
- Evaluation functions typically reference the PARADISE framework
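To make the POMDP framing concrete, here is a toy belief update over hidden user goals. The states, observation likelihoods, and transition numbers below are invented for illustration; real systems estimate them from data or simulation.

# Toy POMDP-style belief update: b'(s') is proportional to
# P(o | s') * sum_s P(s' | s, a) * b(s). Numbers are illustrative only.
STATES = ["want_small_room", "want_large_room"]

def update(belief, transition, obs_likelihood):
    """One belief update, given a transition model P(s'|s) for the chosen
    system action and observation likelihoods P(o|s') for what was heard."""
    predicted = {
        s2: sum(transition[s1][s2] * belief[s1] for s1 in STATES)
        for s2 in STATES
    }
    unnormalized = {s2: obs_likelihood[s2] * predicted[s2] for s2 in STATES}
    total = sum(unnormalized.values())
    return {s2: v / total for s2, v in unnormalized.items()}

belief = {"want_small_room": 0.5, "want_large_room": 0.5}
transition = {s: {s2: (0.9 if s == s2 else 0.1) for s2 in STATES} for s in STATES}
obs_likelihood = {"want_small_room": 0.7, "want_large_room": 0.2}  # ASR suggests "small"
print(update(belief, transition, obs_likelihood))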

Interaction Management

The Interaction Manager
- Mediates between the discrete, symbolic reasoning of the Dialogue Manager and the continuous, real-time nature of user interaction
- Manages timing, turn-taking, and barge-in
- Yields the turn to the user on interruption
- Prevents the system from speaking over the user
- Notifies the Dialogue Manager of interruptions and incomplete utterances
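A toy sketch of the barge-in behavior described above (the event names and the TTS/DM interfaces are hypothetical stand-ins, not the Olympus interaction manager):

class InteractionManager:
    """Toy barge-in handling: stop system speech when the user starts talking."""

    def __init__(self, tts, dialogue_manager):
        self.tts = tts                      # assumed to provide speak() / stop()
        self.dm = dialogue_manager          # assumed to provide notify_interruption()
        self.system_speaking = False

    def on_system_prompt(self, text):
        self.system_speaking = True
        self.tts.speak(text)

    def on_user_speech_started(self):
        if self.system_speaking:            # user barged in
            self.tts.stop()                 # yield the turn to the user
            self.system_speaking = False
            self.dm.notify_interruption()   # tell the DM its prompt was cut off

    def on_system_prompt_finished(self):
        self.system_speaking = False

class _StubTTS:
    def speak(self, text): print("TTS:", text)
    def stop(self): print("TTS: (stopped)")

class _StubDM:
    def notify_interruption(self): print("DM notified of interruption")

im = InteractionManager(_StubTTS(), _StubDM())
im.on_system_prompt("Would you like a small room or a large one?")
im.on_user_speech_started()      # user barges in; the system yields the turn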

Natural Language Generation and Speech Synthesis

NLG and Speech Synthesis
- Template-based, e.g., for explicit error-handling strategies ("Did you say ...?"); more interesting cases arise in disambiguation dialogs
- A TTS system synthesizes the NLG output
- The audio server allows interruption mid-utterance
- Production systems incorporate prosody and intonation contours to indicate degree of certainty
- Open-source TTS frameworks:
  Festival - http://www.cstr.ed.ac.uk/projects/festival/
  Flite - http://www.speech.cs.cmu.edu/flite/
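Template-based generation of the kind described above can be as simple as filling slots in canned strings; the templates below are illustrative examples, not the Olympus templates.

# Toy template-based NLG for error-handling prompts; templates are illustrative.
TEMPLATES = {
    "explicit_confirm": "Did you say you wanted {concept} {value}?",
    "implicit_confirm": "{value} ... {follow_up}",
    "ask_repeat":       "Sorry, I didn't catch that. Can you please repeat that?",
}

def generate(intention):
    """Render a system intention (template name plus slot values) as text."""
    template = TEMPLATES[intention["template"]]
    return template.format(**intention.get("slots", {}))

print(generate({"template": "explicit_confirm",
                "slots": {"concept": "a room starting at", "value": "10 a.m."}}))
print(generate({"template": "ask_repeat"}))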

Asynchronous Architectures

- Blaylock, 2002: an asynchronous modification of TRIPS; most work is directed toward best-case speech recognition
- Lemon, 2003: a backup recognition pass enables better discussion of OOV utterances
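As a generic illustration of the asynchronous style (components exchanging messages through queues rather than a strict blocking pipeline; this is not a description of TRIPS or of Lemon's system):

import asyncio

async def recognizer(out_queue):
    """Pretend ASR that emits (partial) hypotheses as they become available."""
    for hyp in ["take the", "take the red book", "take the red book please"]:
        await asyncio.sleep(0.1)            # simulated decoding latency
        await out_queue.put(hyp)
    await out_queue.put(None)               # end-of-utterance marker

async def understander(in_queue):
    """Consumes partial hypotheses without waiting for the recognizer to finish."""
    while True:
        hyp = await in_queue.get()
        if hyp is None:
            break
        print("partial interpretation of:", hyp)

async def main():
    queue = asyncio.Queue()
    await asyncio.gather(recognizer(queue), understander(queue))

asyncio.run(main())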

Next: Dialogue management problems and strategies