Multi-Modal Dialogue in Personal Navigation Systems
Arthur Chan
Introduction

The term "multi-modal" is a general description of an application that can be operated in multiple input/output modes, e.g.:
Input: voice, pen, gesture, facial expression
Output: voice, graphical output
[Also see the supplementary slides on Alex and Arthur's discussion of the definition.]
Multi-modal Dialogue (MMD) in Personal Navigation Systems

Motivation of this presentation: navigation systems provide an interesting scenario for MMD, a case for why MMD is useful.
Structure of this presentation: three system papers
AT&T MATCH: speech and pen input, with pen gestures
SpeechWorks Walking Directions System: speech and stylus input
Univ. of Saarland REAL: speech and pen input; both GPS and a magnetic tracker were used
Multi-modal Language Processing for Mobile Information Access
Overall Function

A working city guide and navigation system with easy access to restaurant and subway information. It runs on a Fujitsu pen computer. Users are free to:
give speech commands
draw on the display with a stylus
Types of Inputs

Speech input: "show cheap italian restaurants in chelsea"
Simultaneous speech and pen input: circle an area and say "show cheap italian restaurants in neighborhood" at the same time.
Functionalities include reviewing subway routes.
Input Overview

Speech input: uses the AT&T Watson speech recognition engine.
Pen input (electronic ink): allows pen gestures, which can be complex pen input; special aggregation techniques are used for these gestures.
The inputs are combined using lattice combination.
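The idea of combining the two input lattices can be sketched in miniature. The toy below (all hypothesis strings, scores, and the compatibility rule are invented; MATCH composes full finite-state lattices, not flat n-best lists) shows joint ranking of speech and gesture hypotheses:

```python
from itertools import product

def combine_lattices(speech_hyps, gesture_hyps, compatible):
    """Toy combination of two input lattices, flattened to weighted
    n-best lists: cross-product the hypotheses, keep the compatible
    pairs, and rank by joint score."""
    joint = [
        (s_text, g_label, s_score + g_score)  # log-domain scores add
        for (s_text, s_score), (g_label, g_score) in product(speech_hyps, gesture_hyps)
        if compatible(s_text, g_label)
    ]
    return sorted(joint, key=lambda h: h[2], reverse=True)

# Hypothetical n-best lists with log-probability scores.
speech = [("show cheap italian restaurants in neighborhood", -1.2),
          ("show cheap italian restaurants in chelsea", -2.5)]
gesture = [("area:loc", -0.3), ("point:sel", -1.8)]

# A deictic phrase like "in neighborhood" needs an area gesture.
def compatible(text, label):
    return label.startswith("area") if "neighborhood" in text else True

best = combine_lattices(speech, gesture, compatible)[0]
```

Here the deictic reading wins because the area gesture supports it; the incompatible point gesture is filtered out entirely.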
Pen Gesture and Speech Input

For example:
U: "How do I get to this place?" <user circles one of the restaurants displayed on the map>
S: "Where do you want to go from?"
U: "25th St & 3rd Avenue" <user writes 25th St & 3rd Avenue>
<System computes the shortest route>
Summary

Interesting aspects of the system: it illustrates a real-life scenario where multi-modal inputs can be used.
Design issue: how should different inputs be used together?
Algorithmic issue: how should different inputs be combined?
Multi-modal Spoken Dialog with Wireless Devices
Overview

Work by SpeechWorks, jointly conducted by speech recognition and user interface folks. Two distinct elements:
Speech recognition: in an embedded domain, which paradigm should be used? Embedded, network, or distributed speech recognition?
User interface: how to "situationalize" the application?
Overall Function

A walking-directions application; it assumes the user is walking in an unknown city, and runs on a Compaq iPAQ 3765 PocketPC. Users can:
select a city and start/end addresses
display a map and control the display
display directions, including interactive directions in the form of a list of steps
The system accepts speech input and stylus input, but not pen gestures.
Choice of Speech Recognition Paradigm

Embedded speech recognition: only simple commands can be used, due to computation limits.
Network speech recognition: bandwidth is required, and sometimes the network is cut off.
Distributed speech recognition: the client takes care of the front end; the server takes care of decoding. <Issue: higher complexity of the code.>
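As a rough illustration of the distributed split (a sketch with invented names, not SpeechWorks' code): the client computes compact spectral features and ships only those, while the server does the decoding. The "decoder" here is a toy nearest-mean classifier standing in for a real recognizer:

```python
import numpy as np

def client_front_end(samples, frame_len=400, hop=160, n_coeffs=13):
    """Client side: compute compact spectral features on the device and
    send only those to the server, saving bandwidth versus streaming
    raw audio."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]
    window = np.hamming(frame_len)
    feats = []
    for f in frames:
        spectrum = np.abs(np.fft.rfft(f * window))
        # Stand-in for a real MFCC front end: the log-spectrum folded
        # down to n_coeffs coarse bands.
        bands = np.array_split(np.log(spectrum + 1e-10), n_coeffs)
        feats.append([b.mean() for b in bands])
    return np.array(feats, dtype=np.float32)

def server_decode(feats, models):
    """Server side: score the feature matrix against each command model
    and return the best label (a toy nearest-mean 'decoder')."""
    mean = feats.mean(axis=0)
    return min(models, key=lambda name: np.linalg.norm(mean - models[name]))

# One second of "audio" at 16 kHz (silence, as stand-in input):
feats = client_front_end(np.zeros(16000))
label = server_decode(feats, {"show map": np.zeros(13), "zoom in": np.ones(13)})
```

The payload is one 13-dimensional vector per 10 ms frame rather than raw samples, which is the bandwidth argument for the distributed paradigm.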
User Interface

Situationalization: potential scenarios
Sitting at a desk
Getting out of a cab, building, or subway and preparing to walk somewhere
Walking somewhere with hands free
Walking somewhere carrying things
Driving somewhere in heavy traffic
Driving somewhere in light traffic
Being a passenger in a car
Being in a highly noisy environment
Their Conclusion

The balance of audio and visual information can be reduced to four complementary components:
Single-modal: 1. visual mode; 2. audio mode
Multi-modal: 3. visual-dominant; 4. audio-dominant
A Glance at the UI

[This slide showed screenshots of the interface.]
Summary

Interesting aspects: great discussion of
how speech recognition can be used in an embedded domain
how users would use the dialogue application
Multi-modal Dialog in a Mobile Pedestrian Navigation System
Overview

A pedestrian navigation system with two components:
IRREAL: indoor navigation, using a magnetic tracker
ARREAL: outdoor navigation, using GPS
Speech Input/Output

Speech input: HTK / IBM ViaVoice Embedded; Logox was also being evaluated.
Speech output: Festival
Visual Output

Both 2D and 3D spatialization are supported.
Interesting Aspects

The system is tailored for elderly people:
Speaker clustering, to improve the recognition rate for elderly speakers
Model selection: choose between elderly models and normal adult models based on likelihood
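The likelihood-based model selection can be sketched as follows. This toy scores a scalar feature under two single-Gaussian "models" (the feature, means, and variances are invented; the real system compares full acoustic-model likelihoods):

```python
import math

def select_model(utterance_feats, models):
    """Score the utterance under each speaker-group model and keep
    whichever assigns the higher total log-likelihood."""
    def log_likelihood(x, mean, var):
        # Log-density of a univariate Gaussian.
        return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    def score(model):
        mean, var = model
        return sum(log_likelihood(x, mean, var) for x in utterance_feats)
    return max(models, key=lambda name: score(models[name]))

# Hypothetical scalar feature (say, mean pitch in Hz) per speaker group.
models = {"elderly": (180.0, 900.0), "adult": (120.0, 900.0)}
choice = select_model([175.0, 182.0, 178.0], models)  # features near the elderly mean
```

In the real system the same comparison is made between full elderly and normal-adult acoustic models, and recognition proceeds with the winner.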
Conclusion

Aspects of multi-modal dialogue:
What kinds of inputs should be used?
How can speech and other inputs be combined and interact?
How would users use the system?
How should the system respond to the users?
Supplements: Definition of Multi-modal Dialogue, and How MATCH Combines Multi-modal Inputs
Definition of Multi-modal Dialogue

In the "Introduction" slide, Arthur's definition of a multi-modal application was: a general description of an application that can be operated in multiple input/output modes.
Alex's comment: "So how about the laptop? Will you consider it as a multi-modal application?"
I am stunned! Alex makes some sense!

The laptop example shows that we expect a "multi-modal application" to allow two different modes to operate simultaneously in some way. So, though a laptop allows both mouse input and keyboard input, it doesn't fit what people call a multi-modal application.
A Further Refinement

It is still important to consider a multi-modal application as a generalization of a single-modal application. This allows thinking about how to deal with situations where a particular mode fails.
How Could Multi-modal Inputs Be Combined?

How is speech entered? A simple click-to-speak input is used; the output is a speech lattice.
How Are Pen Gestures Input?

Pen strokes can contain:
lines and arrows
handwritten words
selections of entities on the screen
A standard template-based algorithm is used; it also extracts arrowheads and marks.
Recognition covers 285 words:
attributes of restaurants, e.g. "cheap", "chinese"
zones or points of interest, e.g. "soho", "empire"
10 basic gesture marks: lines, arrows, areas, points, and the question mark
The input is broken into a lattice of strokes.
Pen Input Representation

FORM MEANING (NUMBER TYPE) SEM
FORM: physical form of the gesture, e.g. area, point, line, arrow
MEANING: meaning of the form, e.g. an "area" could be loc(ation) or sel(ection)
NUMBER: number of entities in the selection, e.g. 1, 2, 3, or many
TYPE: the type of the entities, e.g. res(taurant) and theater
SEM: placeholder for the specific contents of a gesture, e.g. the points making up an area, or identifiers of an object
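One way to picture this representation (an illustrative class, not AT&T's code; the field names follow the slide):

```python
from dataclasses import dataclass, field

@dataclass
class Gesture:
    """One pen gesture in the FORM MEANING (NUMBER TYPE) SEM scheme."""
    form: str      # physical form: "area", "point", "line", "arrow"
    meaning: str   # interpretation of the form: "loc" or "sel"
    number: str    # entities selected: "1", "2", "3", or "many"
    type: str      # entity type: "res" (restaurant), "theater", ...
    sem: list = field(default_factory=list)  # specific content: area points, object ids

    def symbol(self):
        # Flatten to the kind of symbol string a gesture lattice carries.
        return f"G {self.form} {self.meaning} {self.number} {self.type}"

# A circle around two restaurants, interpreted as a selection:
g = Gesture(form="area", meaning="sel", number="2", type="res", sem=["id1", "id4"])
```

The SEM field is deliberately separate: the symbolic part (FORM through TYPE) is what the lattice machinery matches on, while SEM carries the payload (coordinates, identifiers) threaded through to the meaning.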
Example

First area gesture; second area gesture. [Figure showing the two gesture lattices.]
Example (cont.)

Gesture 1: either a location (0->1->2->3->7) or the restaurant (0->1->2->4->5->6->7)
Gesture 2: either a location (8->9->10->16) or two restaurants (8->9->11->12->13->16)
Aggregating the numerical expression from gestures 1 and 2: ->14->15
Example (cont.)

The user says: "show chinese restaurants in this and this neighborhood" (two locations are specified)
Example (cont.)

The user says: "Tell me about this place and these places" (two restaurants are specified)
Example (cont.)

Not covered here: if the user says "these three restaurants", the program needs to aggregate the two gestures together. This is covered by "Deixis and Conjunction in Multimodal Systems" by Michael Johnston. In brief: gestures are combined, forming new paths in the lattice.
How Are Multi-modal Inputs Integrated?

Issues:
1. Timing of inputs
2. How inputs are processed (FST)
3. Multi-modal grammars
Details can be found in "Finite-state multimodal parsing and understanding" and "Tight-coupling of Multimodal Language Processing with Speech Recognition".
Timing of Inputs

MATCH takes the speech and gesture lattices and creates a meaning lattice. A time-out system is used: when the user hits the click-to-speak button and the speech result arrives, if inking is in progress, MATCH waits for the gesture lattice within a short time-out; otherwise MATCH treats the input as unimodal. The case for the gesture lattice is similar.
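The time-out policy can be sketched like this (function and parameter names are made up; the real system is event-driven rather than a single call):

```python
def integrate_inputs(speech_lattice, gesture_source, inking, timeout=1.0):
    """A speech result has just arrived. If the user is still inking,
    wait up to `timeout` seconds for the gesture lattice
    (gesture_source returns None if the time-out expires); otherwise
    treat the turn as unimodal speech."""
    if inking:
        gesture_lattice = gesture_source(timeout)
        if gesture_lattice is not None:
            return ("multimodal", speech_lattice, gesture_lattice)
    return ("unimodal-speech", speech_lattice, None)

# A gesture source whose lattice arrives within the time-out:
turn = integrate_inputs(["show", "these"], lambda t: ["area:sel:2:res"], inking=True)
```

A symmetric rule applies when a gesture lattice arrives first and the system briefly waits for speech.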
FST Processing of Multi-modal Inputs

Multi-modal integration is modeled by a 3-tape finite-state device over:
the speech and gesture streams (word and gesture symbols)
their combined meaning (meaning symbols)
The device takes speech and gesture as inputs and creates the meaning output. It is simulated by two transducers:
G:W, which aligns speech and gesture
G_W:M, which takes the composite alphabet of speech and gesture symbols as input and outputs meaning
The speech and gesture inputs are first composed with G:W; the result, G_W, is then composed with G_W:M.
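The two-step composition can be illustrated on finite relations, the degenerate case of FST composition where each transducer is just a set of (input, output) pairs. All symbols below are invented, and the composite-alphabet step is flattened to the word sequence:

```python
def compose(rel1, rel2):
    """Relational composition, the operation underlying FST composition,
    shown on finite sets of (input, output) sequence pairs instead of
    weighted transducers."""
    return {(a, c) for (a, b) in rel1 for (b2, c) in rel2 if b == b2}

# Toy G:W relation: a gesture-symbol sequence paired with the word
# sequences it can align with.
G_to_W = {
    (("G.area.sel.1.res",), ("phone", "this", "restaurant")),
    (("G.area.sel.2.res",), ("phone", "these", "restaurants")),
}
# Toy G_W:M relation: the aligned stream mapped to a meaning string.
GW_to_M = {
    (("phone", "this", "restaurant"), "<cmd><phone>[id1]</phone></cmd>"),
    (("phone", "these", "restaurants"), "<cmd><phone>[id1] [id2]</phone></cmd>"),
}

# The observed gesture lattice (one hypothesis), lifted to an identity
# relation so it composes like a transducer:
gesture_input = {(("G.area.sel.1.res",), ("G.area.sel.1.res",))}

# Step 1: compose with G:W; step 2: compose the result with G_W:M.
meanings = compose(compose(gesture_input, G_to_W), GW_to_M)
```

True FST composition operates state-by-state over lattices with weights, but the chaining of the two machines is exactly this relational join.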
Multi-modal Grammar

The input word and gesture streams generate an XML representation of the meaning (eps = epsilon, the empty symbol). The output looks like:
<cmd> <phone> <restaurant> [id1] </restaurant> </phone> </cmd>
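One way to picture how the grammar's output tape yields this XML (the aligned triples below are invented for illustration): walk the aligned word/gesture/meaning triples and concatenate every non-epsilon meaning symbol:

```python
EPS = "eps"  # epsilon: the empty symbol on a tape

def emit_meaning(triples):
    """Concatenate the non-epsilon symbols on the meaning tape of a
    sequence of aligned (word, gesture, meaning) triples."""
    return "".join(m for _, _, m in triples if m != EPS)

# word          gesture              meaning symbol
triples = [
    ("phone",      EPS,                "<cmd><phone>"),
    ("this",       EPS,                EPS),
    ("restaurant", "G.area.sel.1.res", "<restaurant>[id1]</restaurant>"),
    (EPS,          EPS,                "</phone></cmd>"),
]
xml = emit_meaning(triples)
```

Note how the gesture contributes the [id1] content while the words contribute the command structure, which is the point of putting both streams in one grammar.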