Page 1: On Organic Interfaces Victor Zue (zue@csail.mit.edu) MIT Computer Science and Artificial Intelligence Laboratory.

On Organic Interfaces

Victor Zue (zue@csail.mit.edu)

MIT Computer Science and Artificial Intelligence Laboratory

Page 2

Acknowledgements

Research Staff: Eric Brill, Scott Cyphers, Jim Glass, Dave Goddeau, T J Hazen, Lee Hetherington, Lynette Hirschman, Raymond Lau, Hong Leung, Helen Meng, Mike Phillips, Joe Polifroni, Shinsuke Sakai, Stephanie Seneff, Dave Shipman, Michelle Spina, Nikko Ström, Chao Wang

Graduate Students: Anderson, M.; Aull, A.; Brown, R.; Chan, W.; Chang, J.; Chang, S.; Chen, C.; Cyphers, S.; Daly, N.; Doiron, R.; Flammia, G.; Glass, J.; Goddeau, D.; Hazen, T.J.; Hetherington, L.; Huttenlocher, D.; Jaffe, O.; Kassel, R.; Kasten, P.; Kuo, J.; Kuo, S.; Lauritzen, N.; Lamel, L.; Lau, R.; Leung, H.; Lim, A.; Manos, A.; Marcus, J.; Neben, N.; Niyogi, P.; Mou, X.; Ng, K.; Pan, K.; Pitrelli, J.; Randolph, M.; Rtischev, D.; Sainath, T.; Sarma, S.; Seward, D.; Soclof, M.; Spina, M.; Tang, M.; Wichiencharoen, A.; Zeiger, K.

Page 3

Introduction

Page 4

Speech interfaces are ideal for information access and management when:

• The information space is broad and complex,

• The users are technically naive,

• The information device is small, or

• Only telephones are available.

Virtues of Spoken Language

Natural: Requires no special training

Flexible: Leaves hands and eyes free

Efficient: Has high data rate

Economical: Can be transmitted/received inexpensively

Page 5

Communication via Spoken Language

[Diagram: Human and Computer communicating. Input: Speech → Recognition → Text → Understanding → Meaning. Output: Meaning → Generation → Text → Synthesis → Speech.]

Page 6

Components of a Spoken Dialogue System

[Diagram: Speech → SPEECH RECOGNITION → Words → LANGUAGE UNDERSTANDING → Meaning Representation → DISCOURSE CONTEXT and DIALOGUE MANAGEMENT (with DATABASE) → Meaning → LANGUAGE GENERATION → Sentence → SPEECH SYNTHESIS → Speech; Graphs & Tables are a further output.]
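
The component flow on this slide can be sketched as a chain of functions. This is only a minimal illustration: every function body is a hypothetical stub (a real recognizer, parser, or synthesizer would replace it), and the toy weather table and city list are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DiscourseContext:
    """Carries information across turns (hypothetical structure)."""
    history: List[Dict[str, str]] = field(default_factory=list)


def speech_recognition(speech: str) -> List[str]:
    # Stub: pretend the audio is already a transcript and split it into words.
    return speech.lower().strip("?.! ").split()


def language_understanding(words: List[str], ctx: DiscourseContext) -> Dict[str, str]:
    # Stub: build a flat meaning representation by keyword spotting.
    frame: Dict[str, str] = {}
    if "weather" in words:
        frame["topic"] = "weather"
    for city in ("boston", "seattle"):
        if city in words:
            frame["city"] = city
    # Discourse context: inherit missing slots from the previous turn.
    if ctx.history:
        for key, value in ctx.history[-1].items():
            frame.setdefault(key, value)
    ctx.history.append(frame)
    return frame


# Invented data standing in for the DATABASE box.
TOY_DATABASE = {("weather", "boston"): "sunny", ("weather", "seattle"): "rainy"}


def dialogue_management(frame: Dict[str, str]) -> Dict[str, str]:
    # Stub: look up the answer, or ask for the missing slot.
    if "city" not in frame:
        return {"act": "request", "slot": "city"}
    answer = TOY_DATABASE.get((frame.get("topic", ""), frame["city"]))
    return {"act": "inform", "city": frame["city"], "value": str(answer)}


def language_generation(reply: Dict[str, str]) -> str:
    if reply["act"] == "request":
        return "Which city?"
    return f"The weather in {reply['city'].title()} is {reply['value']}."


def speech_synthesis(sentence: str) -> str:
    # Stub: a real system would return audio; here the text is only tagged.
    return f"<audio:{sentence}>"


def respond(speech: str, ctx: DiscourseContext) -> str:
    words = speech_recognition(speech)
    frame = language_understanding(words, ctx)
    reply = dialogue_management(frame)
    return speech_synthesis(language_generation(reply))
```

Calling `respond` twice with the same `DiscourseContext` shows why the context box matters: a follow-up such as "How about Seattle?" only makes sense because the topic is carried over from the previous turn.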

Page 7

Tremendous Progress to Date

Technological Advances

Inexpensive Computing

Increased Task Complexity

Data Intensive Training

Page 8

Some Example Systems

BBN, 2007

MIT, 2007

KTH, 2007

Page 9

Speech Synthesis

• Recent trend moves toward corpus-based approaches

– Increased storage and compute capacity

– Availability of large text and speech corpora

– Modeled after successful utilization for speech recognition

• Many successful implementations, e.g.,

– AT&T

– Cepstral

– Microsoft

[Figure labels: "compassion", "disputed", "cedar city", "since", "giant", "since"; "computer science"]

Page 10

But we are far from done …

• Machine performance typically lags far behind human performance

• How can interfaces be truly anthropomorphic?

[Chart: word error rate on SWITCHBOARD (spontaneous speech), axis 0 to 80: machine 43%, human 4%. Lippmann, 1997]
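
The percentages in this comparison are word error rates. As a reminder of how that figure is usually computed, here is a minimal sketch using word-level edit distance; the function name and the example sentences in the usage note are illustrative and are not taken from the talk.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1] / len(reference)
```

For example, scoring the hypothesis "what is weather in austin" against the reference "what is the weather in boston" counts one deletion and one substitution over six reference words.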

Page 11

Premise of the Talk

• Propose a different perspective on development of speech-based interfaces

• Draw from insights in evolution of computer science

– Computer systems are increasingly complex

– There is a move towards treating these complex systems like organisms that can observe, grow, and learn

• Will focus on spoken dialogue systems

Page 12

Organic Interfaces

Page 13

Computer: Yesterday and Today

Yesterday:

• Computation of static functions in a static environment, with well-understood specification

• Computation is its main goal

• Single agent

• Batch processing of text and homogeneous data

• Stand-alone applications

• Binary notion of correctness

Today:

• Adaptive systems operating in environments that are dynamic and uncertain

• Communication, sensing, and control just as important

• Multiple agents that may be cooperative, neutral, adversarial

• Stream processing of massive, heterogeneous data

• Interaction with humans is key

• Trade off multiple criteria

Increasingly, we rely on probabilistic representation, machine learning techniques, and optimization principles to build complex systems

Page 14

Properties of Organic Systems

• Robust to changes in environment and operating conditions

• Learning through experiences

• Observe their own behavior

• Context aware

• Self healing

• …

Page 15

Research Challenges

Page 16

Some Research Challenges

• Robustness

– Signal Representation

– Acoustic Modeling

– Lexical Modeling

– Multimodal Interactions

• Establishing Context

• Adaptation

• Learning

– Statistical Dialogue Management

– Interactive Learning

– Learning by Imitation

* Please refer to written paper for topics not covered in talk

Page 17

Robustness: Acoustic Modeling

• Statistical n-grams have masked the inadequacies in acoustic modeling, but at a cost

– Size of training corpus

– Application-dependent performance

• To promote acoustic modeling research, we may want to develop a sub-word based recognition kernel

– Application independent

– Stronger constraints than phonemes

– Closed vocabulary for a given language

• Some success has been demonstrated (e.g., Chung & Seneff, 1998)

[Diagram: levels of linguistic knowledge (acoustics, phonetics, phonemics, phonotactics, morphology, word (syllable), syntax, semantics, sentence), with the labels Acoustic Models, LM Units, Sub-word Units, and Speech Recognition Kernel]
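
To make the cost named in the first bullet concrete, here is a minimal sketch of a bigram language model with add-one smoothing. The training sentences are invented, and the smoothing choice is an assumption made for brevity, not something the talk prescribes. Because the probabilities are only counts from the training corpus, the model is tied to that corpus's application.

```python
from collections import Counter
from typing import Iterable, Tuple


def train_bigrams(sentences: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count history words and word pairs, with sentence boundary markers."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])                  # every token that acts as a history
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (history, next word) pairs
    return unigrams, bigrams


def bigram_prob(prev: str, word: str, unigrams: Counter, bigrams: Counter,
                vocab_size: int) -> float:
    """P(word | prev) with add-one smoothing (an assumption of this sketch)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


# Invented toy corpus.
CORPUS = ["recognize speech", "recognize speech please", "a nice beach"]
UNIGRAMS, BIGRAMS = train_bigrams(CORPUS)
# Predictable symbols: the six distinct words plus the end marker.
VOCAB_SIZE = len({w for s in CORPUS for w in s.split()} | {"</s>"})
```

With this corpus, "speech" after "recognize" scores three times higher than "beach" after "recognize", regardless of how similar the two words sound; that preference comes entirely from the training text.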

Page 18

Robustness: Lexical Access

• Current approaches represent words as phoneme strings

• Phonological rules are sometimes used to derive alternate pronunciations

“temperature”

• Lexical representation based on features offers much appeal (Stevens, 1995)

– Fewer models, less training data, greater parsimony

– Alternative lexical access models (e.g., Zue, 1983)

• Lexical access based on islands of reliability might be better able to deal with variability
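
The phoneme-string approach with phonological rules, described in the first two bullets, can be sketched as optional rewrite rules applied to a base pronunciation. The ARPAbet-style base form for "temperature" and both rules below are illustrative assumptions, not the rule set of any particular recognizer.

```python
from typing import List, Set, Tuple

Phones = Tuple[str, ...]
Rule = Tuple[Phones, Phones]  # (pattern, replacement), applied optionally


def apply_rule_once(pron: Phones, rule: Rule) -> Set[Phones]:
    """Return every pronunciation obtained by applying the rule at one position."""
    pattern, replacement = rule
    n = len(pattern)
    variants = set()
    for i in range(len(pron) - n + 1):
        if pron[i:i + n] == pattern:
            variants.add(pron[:i] + replacement + pron[i + n:])
    return variants


def expand(base: Phones, rules: List[Rule]) -> Set[Phones]:
    """Collect the base form plus all variants reachable by repeated rule use.

    Terminates as long as the rules cannot grow a pronunciation indefinitely;
    the rules below never lengthen it.
    """
    seen = {base}
    frontier = [base]
    while frontier:
        pron = frontier.pop()
        for rule in rules:
            for variant in apply_rule_once(pron, rule):
                if variant not in seen:
                    seen.add(variant)
                    frontier.append(variant)
    return seen


# Assumed base form and illustrative rules.
TEMPERATURE: Phones = ("T", "EH", "M", "P", "ER", "AH", "CH", "ER")
RULES: List[Rule] = [
    (("ER", "AH"), ("R", "AH")),  # syllabic r reduced to r
    (("ER", "AH"), ("ER",)),      # following schwa deleted
]
VARIANTS = expand(TEMPERATURE, RULES)
```

Each variant still has to be matched against the acoustics as a whole phoneme string, which is the limitation the feature-based and island-driven proposals on this slide aim to address.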

Page 19

Robustness: Multimodal Interactions

• Other modalities can augment/complement speech

[Diagram: SPEECH RECOGNITION, GESTURE RECOGNITION, HANDWRITING RECOGNITION, and MOUTH & EYES TRACKING feed LANGUAGE UNDERSTANDING, which produces meaning]

Page 20

Challenges for Multimodal Interfaces

• Input needs to be understood in the proper context

– “What about that one”

• Timing information is a useful way to relate inputs

Speech: “Move this one over here”

Pointing: (object) (location)

time

• Handling uncertainties and errors (Cohen, 2003)

• Need to develop a unifying linguistic framework
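
The timing idea in the "Move this one over here" example can be sketched as nearest-in-time matching between deictic words and pointing events. Everything here is an assumption made for illustration: the data classes, the deictic word list, the 0.5 second tolerance, and the timestamps in the usage note.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Gesture:
    target: str   # what was pointed at: an object or a location
    time: float   # seconds


DEICTIC = {"this", "that", "here", "there"}


def bind_gestures(words: List[Word], gestures: List[Gesture],
                  max_gap: float = 0.5) -> List[Tuple[str, Optional[str]]]:
    """Bind each deictic word to the nearest unused gesture within max_gap seconds."""
    unused = list(gestures)
    bindings: List[Tuple[str, Optional[str]]] = []
    for word in words:
        if word.text not in DEICTIC:
            continue
        midpoint = (word.start + word.end) / 2
        best = min(unused, key=lambda g: abs(g.time - midpoint), default=None)
        if best is not None and abs(best.time - midpoint) <= max_gap:
            bindings.append((word.text, best.target))
            unused.remove(best)  # one gesture resolves one reference
        else:
            bindings.append((word.text, None))  # unresolved reference
    return bindings
```

With "this" spoken around 0.4 s and "here" around 1.05 s, pointing events at 0.45 s and 1.0 s bind to the object and the location respectively. A deictic word with no nearby gesture stays unresolved and is left to discourse context, which is where the uncertainty handling in the next bullet comes in.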

Page 21

Audio Visual Symbiosis

• The audio and visual signals both contain information about:

– Identity/location of the person

– Linguistic message

– Emotion, mood, stress, etc.

• Integration of these sources of information has been known to help humans

Benoit, 2000

Page 22

Audio Visual Symbiosis

• Exploiting this symbiosis can lead to robustness, e.g.,

– Locating and identifying the speaker (Hazen et al., 2003)

Page 23

Audio Visual Symbiosis

• Exploiting this symbiosis can lead to robustness, e.g.,

– Speech recognition/understanding augmented with facial features (Huang et al., 2004)

Page 24

Audio Visual Symbiosis

• Exploiting this symbiosis can lead to robustness, e.g.,

– Speech and gesture integration (Gruenstein et al., 2006; Cohen, 2005)

Page 25

Audio Visual Symbiosis

• Exploiting this symbiosis can lead to robustness, e.g.,

– Audio/visual information delivery (Ezzat, 2003)

Page 26

Establishing Context

• Context setting is important for dialogue interaction

– Environment

– Linguistic constructs

– Discourse

• Much work has been done, e.g.,

– Context-dependent acoustic and language models

– Sound segmentation

– Discourse modeling

• Some interesting new directions

– Tapestry of applications

– Acoustic scene analysis (Ellis, 2006)

[Figure labels: calendar, photos, weather, address, stocks, phonebook, music]

Page 27

Acoustic Scene Analysis

• Acoustic signals contain a wealth of information (linguistic message, environment, speaker, emotion, …)

• We need to find ways to adequately describe the signals

time

signal type: speech

transcript: although both of the, both sides of the Central Artery …

topic: traffic report

speaker: female

. . .

signal type: speech

transcript: Forecast calls for at least partly sunny weather …

topic: weather, sponsor acknowledgement, time

speaker: male

. . .

signal type: speech

transcript: This is Morning Edition, I’m Bob Edwards …

topic: NPR news

speaker: male, Bob Edwards

. . .

signal type: music

genre: instrumental

artist: unknown

. . .

Some time in the future …
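
One way to picture such a description is as a list of time-stamped segment records that can be queried. The sketch below is hypothetical: the field names follow the labels on the slide, but the start and end times are invented and the `find` helper is not from any existing system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Segment:
    start: float                 # seconds (invented for this sketch)
    end: float
    signal_type: str             # "speech" or "music"
    transcript: str = ""
    attributes: Dict[str, List[str]] = field(default_factory=dict)


SEGMENTS = [
    Segment(0.0, 12.0, "speech",
            "although both of the, both sides of the Central Artery …",
            {"topic": ["traffic report"], "speaker": ["female"]}),
    Segment(12.0, 25.0, "speech",
            "Forecast calls for at least partly sunny weather …",
            {"topic": ["weather", "sponsor acknowledgement", "time"],
             "speaker": ["male"]}),
    Segment(25.0, 40.0, "speech",
            "This is Morning Edition, I’m Bob Edwards …",
            {"topic": ["NPR news"], "speaker": ["male", "Bob Edwards"]}),
    Segment(40.0, 55.0, "music", "",
            {"genre": ["instrumental"], "artist": ["unknown"]}),
]


def find(segments: List[Segment], signal_type: Optional[str] = None,
         **required: str) -> List[Segment]:
    """Return segments of the given type whose attributes contain every required value."""
    hits = []
    for seg in segments:
        if signal_type is not None and seg.signal_type != signal_type:
            continue
        if all(value in seg.attributes.get(key, []) for key, value in required.items()):
            hits.append(seg)
    return hits
```

A query such as `find(SEGMENTS, signal_type="speech", topic="weather")` then retrieves the forecast segment without touching the raw audio.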

Page 28

Learning

• Perhaps the most important aspect of organic interfaces

– Use of stochastic modeling techniques for speech recognition, language understanding, machine translation, and dialogue modeling

• Many different ways to learn

– Passive learning

– Interactive learning

– Learning by imitation

Page 29

Interactive Learning: An Example

• New words are inevitable, and they cannot be ignored

• Acoustic and linguistic knowledge is needed to

– Detect

– Learn, and

– Utilize new words

• Fundamental changes in problem formulation and search strategy may be necessary

Hetherington, 1991

Page 30

Interactive Learning: An Example

• New words can be detected and incorporated through

– Dynamic update of vocabulary (Chung & Seneff, 2004)

Page 31

Interactive Learning: An Example

• New words can be detected and incorporated through

– Speak and Spell (Filisko & Seneff, 2006)

Page 32

Learning by Imitation

• Many tasks can be learned through interaction

– “This is how you enable Bluetooth.” “Enable Bluetooth.”

– “These are my glasses.” “Where are my glasses?”

• Promising research by James Allen (2007)

– Learning phase:

* User shows the system how to perform tasks (perhaps through some spoken commentary)

* System learns the task through learning algorithms and updates its knowledge base

– Application phase:

* Looks up tasks in its knowledge base and executes the procedure

Allen et al., 2007
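
The two phases can be sketched as a small knowledge base keyed by task name. This is a deliberately simplified illustration and not Allen et al.'s system: the "learning" here is plain memorization of the demonstrated steps, and the class, its methods, the commentary parsing, and the example steps are all hypothetical.

```python
from typing import Callable, Dict, List


class TaskLearner:
    """Toy two-phase learner: memorize a demonstration, then replay it on request."""

    def __init__(self) -> None:
        self.knowledge_base: Dict[str, List[str]] = {}

    def demonstrate(self, commentary: str, steps: List[str]) -> str:
        """Learning phase: store the observed steps under a name taken from the commentary."""
        name = self._task_name(commentary)
        self.knowledge_base[name] = list(steps)
        return name

    def perform(self, request: str, execute: Callable[[str], None]) -> bool:
        """Application phase: look the task up and run each stored step."""
        steps = self.knowledge_base.get(self._task_name(request))
        if steps is None:
            return False  # task was never demonstrated
        for step in steps:
            execute(step)
        return True

    @staticmethod
    def _task_name(utterance: str) -> str:
        # Normalize "This is how you enable Bluetooth." and "Enable Bluetooth."
        # to the same key.
        text = utterance.lower().strip(" .!?")
        prefix = "this is how you "
        if text.startswith(prefix):
            text = text[len(prefix):]
        return text
```

After `demonstrate("This is how you enable Bluetooth.", steps)`, a later `perform("Enable Bluetooth.", run_step)` replays the stored steps, while a request for a task that was never shown returns `False`. A real system would generalize from the demonstration rather than replay it verbatim.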

Page 33

In Summary

• Great strides have been made in speech technologies

• Truly anthropomorphic spoken dialogue interfaces can only be realized if they can behave like organisms

– Observe, learn, grow, and heal

• Many challenges remain …

Page 34

Thank You

Page 35

Dynamic Vocabulary Understanding

• Dynamically alter vocabulary within a single utterance

User: “What’s the phone number for Flora in Arlington.”

Recognition with the name unknown: “What’s the phone number of ???? in Arlington”

Clause: wh_question; Property: phone; Topic: restaurant; Name: ????; City: Arlington

Restaurant names: Arlington Diner, Blue Plate Express, Tea Tray in the Sky, Asiana Grille, Bagels etc, Flora, …

Recognition with the name resolved: “What’s the phone number of Flora in Arlington”

Clause: wh_question; Property: phone; Topic: restaurant; Name: Flora; City: Arlington

System: “The telephone number for Flora is …”

[Diagram: Hub connected to Audio, ASR, NLU, Context, Dialog, DB, NLG, and TTS]
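
The flow on this slide, from a frame with an unknown name to a frame with the name filled in, can be sketched as a second pass restricted to names from the database. This is a rough illustration only: the slide does not say how the unknown word is matched, so the rough spelling guess and the string-similarity step (Python's `difflib`) are stand-in assumptions, as is the shape of the toy database.

```python
import difflib
from typing import Dict, List

# Hypothetical database keyed by city; the Arlington names are those listed on the slide.
RESTAURANTS: Dict[str, List[str]] = {
    "Arlington": ["Arlington Diner", "Blue Plate Express", "Tea Tray in the Sky",
                  "Asiana Grille", "Bagels etc", "Flora"],
}

UNKNOWN = "????"


def resolve_name(frame: Dict[str, str], rough_guess: str) -> Dict[str, str]:
    """Second pass: restrict the candidate names to those found in the frame's city.

    rough_guess stands in for whatever sub-word hypothesis the recognizer
    produced for the unknown word (an assumption of this sketch).
    """
    if frame.get("Name") != UNKNOWN:
        return frame
    candidates = {name.lower(): name
                  for name in RESTAURANTS.get(frame.get("City", ""), [])}
    match = difflib.get_close_matches(rough_guess.lower(), list(candidates),
                                      n=1, cutoff=0.6)
    resolved = dict(frame)
    if match:
        resolved["Name"] = candidates[match[0]]
    return resolved


FIRST_PASS = {"Clause": "wh_question", "Property": "phone", "Topic": "restaurant",
              "Name": UNKNOWN, "City": "Arlington"}
```

The design point is that the city, which is already in vocabulary, narrows the search: only names from that city are considered, so the vocabulary can be altered while the utterance is still being processed.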