The WAMI Toolkit for Developing, Deploying, and Evaluating Web-Accessible Multimodal Interfaces

Alexander Gruenstein
MIT CSAIL
32 Vassar Street, Cambridge, MA 02139, USA
[email protected]

Ian McGraw
MIT CSAIL
32 Vassar Street, Cambridge, MA 02139, USA
[email protected]

Ibrahim Badr
MIT CSAIL
32 Vassar Street, Cambridge, MA 02139, USA
[email protected]

ABSTRACT
Many compelling multimodal prototypes have been developed which pair spoken input and output with a graphical user interface, yet it has often proved difficult to make them available to a large audience. This unfortunate reality limits the degree to which authentic user interactions with such systems can be collected and subsequently analyzed. We present the WAMI toolkit, which alleviates this difficulty by providing a framework for developing, deploying, and evaluating Web-Accessible Multimodal Interfaces in which users interact using speech, mouse, pen, and/or touch. The toolkit makes use of modern web-programming techniques, enabling the development of browser-based applications which rival the quality of traditional native interfaces, yet are available on a wide array of Internet-connected devices. We will showcase several sophisticated multimodal applications developed and deployed using the toolkit, which are available via desktop, laptop, and tablet PCs, as well as via several mobile devices. In addition, we will discuss resources provided by the toolkit for collecting, transcribing, and annotating usage data from multimodal user interactions.

Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User Interfaces—graphical user interfaces, natural language, voice I/O; I.2.7 [Artificial Intelligence]: Natural Language Processing—language parsing and understanding, speech recognition and synthesis

General Terms
Design, Experimentation

Keywords
multimodal interface, speech recognition, dialogue system, World Wide Web, Voice over IP

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICMI'08, October 20–22, 2008, Chania, Crete, Greece.
Copyright 2008 ACM 978-1-60558-198-9/08/10 ...$5.00.

1. INTRODUCTION
A broad range of multimodal interfaces have been developed in which a spoken language understanding capability complements a graphical user interface (GUI). Many such interfaces provide for spoken natural language input in conjunction with drawing, clicking, and/or other rich GUI interaction, for example: [6, 9, 12, 16]. Such interfaces offer compelling alternatives to GUI-only applications, particularly when used on devices such as tablet computers and mobile phones which offer relatively high-resolution displays and audio input/output capabilities, but limited keyboard input. Indeed, the prevalence of such portable but powerful devices offers an unparalleled opportunity to bring a variety of multimodal interfaces to the mainstream.

Unfortunately, while researchers may develop many interesting multimodal applications, it is often challenging to make them available to a large number of users outside of the laboratory. The systems typically rely on a hodgepodge of components: speech and gesture recognition engines, natural language processing components, and homegrown graphical user interfaces. It can be challenging to package up these components so that they can be easily installed by novice users, particularly for an academic research lab with limited engineering resources. As such, system deployment is usually limited to hardware within the laboratory, limiting data collection to traditional in-lab user studies.

We believe, however, that breakthroughs in multimodal interfaces will come when they can more cheaply and easily be made available. Unimodal spoken language interfaces, for example, have benefited enormously from their accessibility via the telephone, which is a natural, easy, and cheap means of making them available to a large number of users. Interface quality benefits greatly from the iterative improvements which become possible by analyzing usage data gleaned from a large number of interactions with real users [14, 20].

In this paper, we present a toolkit which enables interface developers to cheaply develop, deploy, and evaluate multimodal interfaces. The toolkit, called WAMI, provides a basis for developing Web-Accessible Multimodal Interfaces, which are accessible to any user with a standard web browser and a network connection. Applications are easily accessed not only from desktop, laptop, and tablet computers, but from a variety of mobile devices as well. Moreover, because WAMI takes advantage of modern web-programming techniques revolving around AJAX (Asynchronous JavaScript and XML), graphical user interfaces can be developed in the framework which rival the quality of traditional native interfaces.

Multimodal interfaces developed using web standards have a number of key advantages beyond their easy accessibility to a large number of users. First, interfaces provided via the network can run computationally demanding processes such as speech recognition on fast servers, which is especially important for mobile devices. Second, web-based applications can make use of a growing array of powerful services available via the web. Third, collecting usage data for research is straightforward: users interact with the system via the comfort of their own devices, yet those interactions can be centrally logged.

The ability to analyze usage data collected via the web is critical for developers wishing to improve their multimodal interfaces; as such, the WAMI toolkit provides a streamlined interface for replaying, transcribing, and annotating usage data. System developers and annotators can replay a user's interaction, watching it unfold in a web browser as they transcribe utterances and/or annotate actions. In addition, a user-study management framework provides system designers with an easy way to gather data: users sign up for a study via the web and then are led through a series of tasks in their own browsers.

In the remainder of this paper, we will describe the development framework and evaluation architecture provided by WAMI in greater detail. We will then discuss several applications built within the WAMI framework, showing off its flexibility and power. Next, we'll reflect on some of the usage data collected with those applications, noting some of the differences between web-based and in-lab user studies. Finally, we will describe our nascent efforts to make WAMI available to outside developers.

2. RELATED WORK
A notable early effort at making multimodal interfaces accessible via the web was WebGALAXY [11], which takes advantage of the easy accessibility of both the telephone and the Web. Users view a web site while speaking over the telephone to a spoken dialogue system which provides weather information. The system's responses to user queries are spoken over the phone, with corresponding information also displayed on the web page. The WAMI toolkit differs from WebGALAXY in two key respects. First, it does not require the use of the telephone, which is awkward in many situations. Second, while it does support interfaces like WebGALAXY, in which the graphical modality is primarily a means of displaying information also provided in the speech output, the WAMI toolkit is mainly intended to support much richer GUI interactivity.

More recently, efforts have been undertaken to develop standards for deploying multimodal interfaces to web browsers, typically entailing the use of VoIP or local speech recognition. The two major standards under development, Speech Application Language Tags (SALT) [18] and XHTML+Voice (X+V) [3], are both aimed at augmenting the existing HTML standard. The intent of these specifications is that, just as current web browsers render a GUI described via HTML, browsers of the future would also provide a speech interface based on this augmented HTML. As such, interfaces implemented using either of these specifications require the use of a special web browser supporting the standard.

SALT and X+V are designed mainly to enable the use of speech to fill in the field values of traditional HTML forms. X+V, for example, integrates the capabilities of VoiceXML into the browser, so that interactions typically consist of telephony-like mixed initiative dialogues to fill in field values. Related multimodal markup languages, such as those in [10] and [7], generally attempt to make designing these sorts of interactions more straightforward.

Unlike SALT and X+V, applications produced with the WAMI toolkit do not require the use of special web browsers; instead, they can be accessed from today's standard browsers. Furthermore, the toolkit is not intended to provide form-filling capabilities, though it could potentially be used in this way. Though the formal structure of SALT and X+V provides the power to make authoring certain types of form-filling interfaces easier, this structure makes it difficult to design applications incorporating the rich and rapid multimodal interactivity in which we believe many researchers are interested.

Finally, several commercial multimodal applications have recently been released which are widely accessible via various mobile devices. These include Microsoft's Live Search for Mobile [1], TellMe's mobile download [17], and Yahoo's voice-enabled oneSearch [19]. In these applications, users generally click on a text input box, speak a few query terms, view and correct the results, and then submit the query to a search engine. The WAMI toolkit could also be used to support such an interaction model; however, it also supports much richer multimodal interactions. In addition, these are native applications, and as their download pages attest, they must be customized for each supported device—a time-consuming task for a researcher interested in exploring a novel multimodal interface.

3. SYSTEM DEVELOPMENT WITH WAMI
The WAMI toolkit supports two main threads of development. First, it can be used to connect a graphical user interface to “traditional” spoken dialogue systems built using the GALAXY architecture [15]. In this architecture, speech recognition using an n-gram language model typically serves as a first step before parsing, contextual integration, dialogue management, natural language generation, and speech synthesis. Interfaces built in this manner can be made to understand a wide range of natural language; yet they require a good deal of expertise to develop.

WAMI also supports a second, more lightweight development model meant to appeal to developers who don't have expertise in speech recognition, parsing, dialogue management, and natural language generation. WAMI's lightweight platform provides a standard structure with which developers can build highly interactive multimodal applications with modest natural language understanding capabilities. Experts also find the lightweight platform useful, as it provides an easy means to quickly prototype new applications.

3.1 Core Platform
At the center of both application development models is the core platform depicted in figure 1, which provides the basic infrastructure requisite for developing and deploying web-based multimodal applications. The platform consists of a small set of components split across a server and a client device running a standard web browser. Core components on the server provide speech recognition and (optionally) speech synthesis capabilities, a JavaEE (Java Enterprise Edition) web server, and a logger component which records user interactions.

Figure 1: WAMI toolkit core platform. Shaded boxes show standard core components. White clouds indicate application-specific components. [Diagram: browser-side Core GUI (html/javascript), Application GUI, GUI Controller (javascript), and Audio Controller (Java applet); server-side Web Server, Speech Recognizer, Speech Synthesizer, Logger, Language Model, and application-specific multimodal processing and generation; external web services.]

Applications must provide a language model for the speech recognizer, and any application-specific processing modules for handling spoken input and generating appropriate verbal and graphical responses. On the client, a web browser renders an application-specific GUI, which exchanges messages with the server via an interface provided by the AJAX-powered GUI Controller component.

In addition, an Audio Controller component runs on the client to capture a user's speech and stream it to the speech recognizer, as well as to play synthesized speech generated on the server and streamed to the client. On devices which support Java, the Audio Controller is a Java applet which is embedded directly in the web page, and which provides a button users click when they want to provide speech input. For other devices, native Audio Controllers must be developed, which run externally from the web browser. For example, on each of the mobile devices we will discuss in this paper, the Audio Controller is a native application which is controlled by physical button presses. Once started, these native Audio Controllers then point the web browser on the device to the appropriate URL to bring up the GUI. Though native Audio Controllers are a break with our web-only philosophy, they need only be written once for a particular device, as they can be shared by any WAMI-powered interface. This means that GUI developers are freed from developing native code for a particular platform, and that users need only install a single native application to gain access to any WAMI application.
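To make the Audio Controller's role concrete, the following is a minimal Java sketch of the kind of capture-and-stream loop such a component might run. The audio format, buffer size, endpoint URL, and fixed-length capture loop are all illustrative assumptions; the paper does not specify the applet's actual streaming protocol.

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class AudioControllerSketch {
    public static void main(String[] args) throws Exception {
        // 16 kHz, 16-bit, mono PCM: a typical format for speech recognizers (assumption).
        AudioFormat format = new AudioFormat(16000f, 16, 1, true, false);
        TargetDataLine mic = AudioSystem.getTargetDataLine(format);
        mic.open(format);

        // Hypothetical endpoint; the real WAMI audio protocol is not given in the paper.
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://wami-server.example.org/audio").openConnection();
        conn.setDoOutput(true);
        conn.setChunkedStreamingMode(4096);   // stream while the user is still talking
        conn.setRequestMethod("POST");

        mic.start();
        byte[] buffer = new byte[4096];
        try (OutputStream out = conn.getOutputStream()) {
            // In the real applet this loop would run while the click-to-talk button
            // is held; here we simply stream a fixed number of reads.
            for (int i = 0; i < 100; i++) {
                int n = mic.read(buffer, 0, buffer.length);
                if (n > 0) out.write(buffer, 0, n);
            }
        }
        mic.stop();
        mic.close();
        System.out.println("Recognizer response code: " + conn.getResponseCode());
    }
}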

The main value proposition of the core WAMI platform is that it provides developers with a standard mechanism for linking the client GUI and audio input/output to the server, insulating the developer from these chores. This is perhaps best understood by looking at the typical sequence of events involved in a user interaction. First, a user navigates to an application's web page, where the core WAMI components test the browser's compatibility and ensure sufficient capacity on the server exists for a new client. Next, the GUI and Audio Controllers connect to the server. Once both are connected, the web server notifies the GUI Controller, which passes the message along to the application-specific GUI so that it can initiate the interaction.

The user then presses the click-to-talk button (or holds a physical button on a mobile device) and begins speaking, causing the Audio Controller to stream the input audio to the speech recognizer. Concurrently, if the user interacts with the application's GUI—for example, by clicking or drawing—events encoded as XML are transmitted via the GUI Controller and Web Server to the application-specific module on the server. When the user finishes speaking, the recognizer's hypothesis is routed to the application-specific logic on the server, which formulates a response. It passes messages encoded as XML via the Web Server and GUI Controller to the GUI, which updates the display accordingly. It can also provide a string of text to be spoken, which is synthesized as speech and streamed to the Audio Controller.
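The Java sketch below illustrates the shape of the application-specific server logic in this exchange: GUI events and recognition hypotheses come in, and XML messages for the GUI (plus an optional string to synthesize) go out. The interface and class names (MultimodalHandler, Reply, ShapeGameHandler) are hypothetical; the toolkit's real server API is not detailed in the paper.

import java.util.Map;

/** Hypothetical contract for the application-specific module on the server. */
interface MultimodalHandler {
    /** Called when the GUI Controller forwards an XML-encoded GUI event. */
    void onGuiEvent(String xmlEvent);

    /** Called when the recognizer produces a hypothesis for the last utterance. */
    Reply onRecognition(String hypothesis, Map<String, String> slots);
}

/** A reply bundles an XML update for the GUI and optional text to speak. */
class Reply {
    final String guiUpdateXml;
    final String textToSpeak;   // null if no spoken response is needed
    Reply(String guiUpdateXml, String textToSpeak) {
        this.guiUpdateXml = guiUpdateXml;
        this.textToSpeak = textToSpeak;
    }
}

class ShapeGameHandler implements MultimodalHandler {
    private String lastGuiEvent;

    @Override
    public void onGuiEvent(String xmlEvent) {
        // A real handler would parse the XML; here we just remember the last event
        // so a spoken command could be resolved against it.
        lastGuiEvent = xmlEvent;
    }

    @Override
    public Reply onRecognition(String hypothesis, Map<String, String> slots) {
        if ("select".equals(slots.get("command"))) {
            String xml = "<highlight color=\"" + slots.getOrDefault("color", "any") + "\"/>";
            return new Reply(xml, "Okay, selected.");
        }
        return new Reply("<noop/>", "Sorry, I didn't understand that.");
    }
}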

3.2 Lightweight Platform
WAMI's core platform provides developers with a minimally sufficient set of components to build a web-based multimodal interface. A “lightweight” development platform which provides structure on top of the core components is designed to ease the development of multimodal interfaces with modest natural language understanding needs. Developers provide speech recognition language models for their applications in the form of grammars, written using the JSGF (Java Speech Grammar Format) standard [8]. Moreover, the grammars are constructed in a standard way, so that associated with each recognition hypothesis is a set of slots filled in with values. Figure 2 shows an example grammar for a hypothetical lightweight WAMI interface, as well as the slot/value pairs which are associated with an example utterance.
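As a concrete illustration of the slot/value convention in figure 2, the following Java sketch parses a tagged hypothesis string into an ordered map of slots. The exact string format handed to application code is an assumption; only the bracketed [slot=value] notation comes from the figure.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlotParser {
    // Matches tags of the form [slot=value], as in figure 2's example output.
    private static final Pattern TAG = Pattern.compile("\\[(\\w+)=(\\w+)\\]");

    public static Map<String, String> parse(String tagged) {
        Map<String, String> slots = new LinkedHashMap<>();
        Matcher m = TAG.matcher(tagged);
        while (m.find()) {
            slots.put(m.group(1), m.group(2));   // later values overwrite earlier ones
        }
        return slots;
    }

    public static void main(String[] args) {
        System.out.println(parse("[command=select] [size=small] [color=red] [shape=rectangle]"));
        // -> {command=select, size=small, color=red, shape=rectangle}
    }
}

Note that letting a later value overwrite an earlier one for the same slot is one simple way to accommodate the false starts discussed below, where "large" should replace an earlier "small".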

Because the lightweight platform relies on small constrained language models which output sequences of slot/value pairs, speech recognition and natural language understanding processing often occur rapidly, in near real time. We have taken advantage of this low overhead to add a perhaps surprising feature to the lightweight platform: incremental speech recognition and natural language understanding. The speech recognizer has been augmented to emit incremental recognition hypotheses as an utterance is processed. Because the grammar-based language models used provide strong constraints, the incremental hypotheses tend to be accurate, especially if the “bleeding edge” is ignored—so long as the user's utterance remains within the confines of the grammar. This means that while a user is speaking, slots and their values can be emitted as soon as the partial utterance is unambiguous. The grammar in figure 2 is written in just such a way: slots and values are assigned as soon as possible. This means that the slot/value pairs shown with the example utterance in that figure are emitted as soon as the user finishes speaking the word shown above each pair. An Incremental Understanding Aggregator on the server collects these slots and values as they are emitted.
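The sketch below suggests one way such an aggregator might behave. The stabilization heuristic (release a slot/value pair only after it has survived two consecutive partial hypotheses, thereby ignoring the bleeding edge) is our illustrative assumption, not the toolkit's documented algorithm.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IncrementalAggregator {
    private Map<String, String> previous = new LinkedHashMap<>();
    private final Map<String, String> stable = new LinkedHashMap<>();

    /** Feed the slot/value pairs of the latest partial hypothesis; returns newly stable pairs. */
    public Map<String, String> update(Map<String, String> partial) {
        Map<String, String> newlyStable = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : partial.entrySet()) {
            boolean seenBefore = e.getValue().equals(previous.get(e.getKey()));
            boolean alreadyStable = e.getValue().equals(stable.get(e.getKey()));
            if (seenBefore && !alreadyStable) {
                stable.put(e.getKey(), e.getValue());
                newlyStable.put(e.getKey(), e.getValue());   // e.g. trigger GUI highlighting
            }
        }
        previous = partial;
        return newlyStable;
    }

    public static void main(String[] args) {
        IncrementalAggregator agg = new IncrementalAggregator();
        // Illustrative partial hypotheses for "select the small red rectangle".
        List<Map<String, String>> partials = List.of(
                Map.of("command", "select"),
                Map.of("command", "select"),
                Map.of("command", "select", "size", "small"),
                Map.of("command", "select", "size", "small", "color", "red"),
                Map.of("command", "select", "size", "small", "color", "red", "shape", "rectangle"),
                Map.of("command", "select", "size", "small", "color", "red", "shape", "rectangle"));
        for (Map<String, String> p : partials) {
            System.out.println("newly stable: " + agg.update(p));
        }
    }
}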

Incremental speech recognition hypotheses can be used to provide incremental language understanding and immediate graphical feedback, which has been shown to improve the user experience and the system accuracy [2]. For instance, an application making use of the grammar in figure 2 could highlight the set of appropriate shapes as each attribute is spoken. Indeed, careful readers will note that the grammar is constructed so as to support false starts and repetitions, since visual feedback may prompt users to correct utterances as they speak. For instance, an in-grammar utterance with a false start would be: select the small . . . the large red rectangle. Such an utterance might arise because as soon as the user sees graphical confirmation of “small”, he might realize he actually meant to say “large” and correct himself mid-utterance. The vocabulary game applications described below in section 4.1 explore the utility of such incremental feedback.

<command> = <select> | <drop> | <make> ;
<select> = (select | choose) {[command=select]} <shape_desc>+ ;
<shape_desc> = [the] <shape_attr>+ ;
<shape_attr> = <size> | <color> | <shape_type> ;
<size> = <size_adj> [sized] ;
<size_adj> = (small | tiny) {[size=small]} | medium {[size=medium]} | (large | big) {[size=large]} ;
<color> = red {[color=red]} | green {[color=green]} | blue {[color=blue]} ;
<shape_type> = (rectangle | bar) {[shape_type=rectangle]} | square {[shape_type=square]} | circle {[shape_type=circle]} | triangle {[shape_type=triangle]} ;
...

select the small red rectangle
[command=select] [size=small] [color=red] [shape=rectangle]

Figure 2: Sample JSGF grammar snippet for a lightweight application and an example utterance with associated slot/value output.

3.3 Traditional Dialogue System Platform
The core WAMI toolkit also provides a platform for integrating web-based graphical user interfaces with “traditional” spoken dialogue systems built using the GALAXY architecture. Such systems typically provide an n-gram language model, which may be dynamically updated as a user interacts with an application. Speech recognition hypotheses are passed to the dialogue manager architecture for processing. The dialogue system can update the graphical user interface in one of two ways. First, as in the lightweight case, messages encoded as XML can be sent to the application-specific GUI, where calls to javascript then update the display. Alternatively, HTML may be output, in which case the GUI is refreshed using this HTML. The first method is usually preferred, as it can be used to create a dynamic, native-like interface. However, generating HTML via the dialogue system itself can be useful as well, as it can be generated, for instance, using traditional language generation techniques. It is an easy way to develop an application which is primarily a speech interface, but which benefits by showing content to the user in response to spoken input.
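A small Java sketch of the two update paths just described, purely for illustration: the GuiUpdate type and its factory methods are hypothetical, and the paper specifies only that a dialogue system turn either sends an XML message for the application GUI's javascript to interpret or emits HTML with which the GUI is refreshed wholesale.

public class GuiUpdate {
    enum Kind { XML_MESSAGE, HTML_REFRESH }

    final Kind kind;
    final String body;

    private GuiUpdate(Kind kind, String body) {
        this.kind = kind;
        this.body = body;
    }

    /** Preferred, dynamic path: a targeted XML command, e.g. panning a map. */
    static GuiUpdate xmlMessage(String xml) {
        return new GuiUpdate(Kind.XML_MESSAGE, xml);
    }

    /** Simpler path: HTML produced by the language generator, displayed as-is. */
    static GuiUpdate htmlRefresh(String html) {
        return new GuiUpdate(Kind.HTML_REFRESH, html);
    }

    public static void main(String[] args) {
        GuiUpdate pan = xmlMessage("<panTo lat=\"42.36\" lon=\"-71.09\"/>");
        GuiUpdate page = htmlRefresh("<ul><li>Result 1</li><li>Result 2</li></ul>");
        System.out.println(pan.kind + " / " + page.kind);
    }
}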

In addition, the GUI can send messages to notify the dialogue system of events such as clicks and gestures. These may be encoded directly in javascript in the native message-passing format of the dialogue system architecture. This allows events from the GUI to be easily integrated into the normal processing chain.

3.4 Supported Devices
The WAMI toolkit can be used to produce applications which run on any network-connected device with a microphone for audio input and a modern web browser. On desktop, laptop, and tablet computers, it supports Windows, GNU/Linux, and OS X running recent versions of Firefox, Opera, Safari, or Internet Explorer. Audio input and output on these platforms is handled via a Java applet embedded directly in the browser.

The toolkit also supports mobile devices with standards-compliant browsers; however, generally a native Audio Controller must be written. Currently, the toolkit has been tested on two mobile platforms:

• Several smartphone and PDA devices running Windows Mobile 5, with the Opera Mobile browser,

• The Nokia N810 internet tablet, using a native audio controller and the pre-installed Mozilla-derived browser.

We are exploring an Audio Controller for the iPhone.

Figure 3: Chinese Cards architecture; WAMI toolkit modules shaded. [Diagram: browser-side Core GUI (html/javascript), Card Game GUI, GUI Controller (javascript), and Audio Controller (Java applet); server-side Web Server, Speech Recognizer, Logger, Incremental Understanding Aggregator, JSGF Grammar, Card Game Engine, and Card Database; external Yahoo! Image Search and Chinese-English dictionary.]

4. EXAMPLE APPLICATIONS
In this section, we will briefly showcase several applications which highlight the utility and flexibility of the WAMI toolkit. We focus on applications in which both the speech and graphical modalities play important roles.

4.1 Chinese Cards
The lightweight platform has served as the basis for a novel group of multimodal games designed to help non-native speakers practice Mandarin Chinese. The games share a common Chinese Cards architecture, depicted in figure 3. Students and teachers first log in to a web application and build a personalized deck of image-based flash cards, a task which is made easier because Yahoo Image Search and an online Chinese-English dictionary¹ are integrated directly into the application. The personalized cards are stored in a back-end database, where they can then be used in any of the card games. JSGF grammars, personalized to include the vocabulary from each user's deck of flash cards, are used as language models for each card game.

One of the card games which can be played with the personalized deck of flash cards is Word War, a picture matching race [13]. A single player can race against the clock, or two players can race against each other from their respective web browsers, each using his or her own deck of cards. Figure 4 shows a snapshot of two players competing on a five-column game grid. The goal of Word War is to use spoken commands to move images from the bottom two rows of the grid into a set of numbered slots in the second row, so that they match the images along the top. While a simple task in a person's native language, it can be a challenging and fun way to practice vocabulary in a foreign language. Each player uses his or her own flash cards—so the two players may be practicing different vocabulary words, or even be speaking in different languages—but they compete over shared access to the numbered slots. When an image is matched, the slot is captured and the matching image appears on both players' game grids. Notice that in figure 4, Player 1 has captured the third and fourth slots, while Player 2 has only captured the first slot. The winner is the first player to fill in a majority of the slots.

¹ http://www.xuezhongwen.net

Figure 4: The Word War flash card game for language learners.

The language model used for the game is conceptually similar to the example snippet shown in figure 2. For example, Player 1 in figure 4 can utter the Mandarin equivalent of Take the hand and put it in slot number five, causing this action to be carried out on the GUI. Since the grammars are tailored to each player's vocabulary, they remain relatively small, aiding robust recognition of non-native speech.

Word War demonstrates well some of the advantages offered by incremental speech recognition and the Incremental Understanding Aggregator. Most importantly, the incremental understanding is used to provide visual feedback while the student is speaking. For instance, as soon as Player 1 has said Take the, all of the available items which can be selected have been highlighted. Likewise, once Player 2 has uttered Select the sheep and drop it, the sheep has already been highlighted, as have the available slots in which it can be dropped.

In a more conventional multimodal interface, the user would have to finish speaking a complete command before receiving visual confirmation of the system's understanding. Here, by contrast, incremental understanding gives users immediate feedback as they speak. This means that students of a foreign language can become aware of errors immediately and correct them mid-utterance. They can also string together a sequence of commands in a single utterance, allowing them to rapidly practice speaking fluently.

4.2 Multimodal Dialogue Systems
The WAMI toolkit has also served as a platform for developing several multimodal interfaces which make use of more traditional spoken dialogue system components. We will briefly describe two of the applications here with the intent of emphasizing that their interfaces behave with a responsiveness and sophistication typically associated with native multimodal interfaces, rather than traditional web interfaces.

Figure 5: City Browser architecture. WAMI modules shaded. [Diagram: browser-side Core GUI (html/javascript), Map Based GUI, GUI Controller (javascript), and Audio Controller (Java applet); server-side Web Server, Speech Recognizer, Speech Synthesizer, Logger, Context Sensitive Language Model, Confidence Annotator, NL Parser, Context & Gesture Resolver, Multimodal Corrections Driver, Dialogue Manager, NL Generator, Suggestions Generator, and POIs and Geography databases; external Google Maps API.]

City Browser is a map-based WAMI application which provides users access to urban information via an interface which leverages the Google Maps API [5]. Users can speak addresses to locate them on the map, obtain driving directions, get public transportation information, and search for Points of Interest (POIs) such as restaurants, museums, hotels, and landmarks. Search results are shown on the screen, and users can use a mouse, pen, stylus, or finger—depending on the device being used—to select a result, circle a set of results, outline a region of interest, or indicate a path along which to search.

Figure 5 depicts City Browser's architecture, which is typical of a multimodal dialogue system built on top of the WAMI core. The map-based GUI, shown on a mobile device in figure 6(a), is built around the WAMI skeleton. The GUI passes messages to the natural language understanding components on the server via the GUI Controller and Web Server. The natural language parser, context and gesture resolution component, dialogue manager, and natural language generator are previously developed generic components which have been used in the service of many dialogue systems. Application-specific logic for City Browser is encoded by the grammars, dialogue control scripts, generation templates, and so forth used by these components.

Figure 6: Web-based multimodal interfaces built using the WAMI toolkit, running on various devices. (a) City Browser on a Nokia N810 in its Mozilla-based browser. (b) Home entertainment multimodal interface on a Windows Mobile smartphone running the Opera browser and displayed on a television via Firefox on OS X.

We have also leveraged a similar set of dialogue system components in conjunction with the WAMI toolkit to develop and deploy a multimodal interface to a living-room multimedia server containing a music library and digital video recorder [4]. The interface, shown in figure 6(b), is split across a mobile device and a television screen. The microphone and speakers on the mobile device are used for speech input and synthesized speech output, and its GUI provides access to the music and video libraries, television listings, and recording settings. Users can speak and use a stylus or the arrow keys to browse television listings and schedule recordings on the mobile device from anywhere with a network connection. Then, while at home, a television display enables the playback of music and recordings.

We conclude this section by emphasizing that the screenshots in figures 4 and 6 show WAMI-based interfaces running on a variety of devices, using various web browsers: two mobile devices, a Windows laptop, and a Mac attached to a television screen. As these applications demonstrate, high-quality web-based multimodal interfaces can cheaply and easily be deployed to users in various environments.

5. DATA COLLECTION AND EVALUATION ARCHITECTURE
A fundamental step in the development process for any multimodal interface involves evaluating prototypes based on authentic user interactions. Because applications built using the WAMI toolkit are web-based, it's easy to make them available to any user with a web browser. However, because these users are remote, their interactions can't simply be observed and/or videotaped. In this section, we describe resources provided by the WAMI toolkit which enable developers to run studies to collect such interactions, as well as tools for transcribing and annotating the collected usage data.

Figure 7: Web-based user study management.

5.1 User Study Management
One method of data collection for a web-based application is to simply deploy the application, publicize its URL, and analyze whatever user interactions occur. However, it is often useful for researchers to have a greater amount of control over the types of users recruited and tasks performed. The WAMI toolkit provides a web-based user study management interface that gives researchers tools to manage studies involving remote users. Several user studies have been conducted with increasingly sophisticated versions of the tools.

To begin a new user study, a set of tasks must first be defined using the management interface, as shown in figure 7. Typical tasks might include: reading instructions, completing a warmup exercise, solving a problem, and filling out a survey. Defining common tasks, such as surveys, is particularly easy, as all WAMI applications inherit the ability to display user-defined web forms and record the results in a database. Application-specific tasks can also be defined, in which case the application must display appropriate content and monitor a user's progress.

Once the tasks are defined, one or more user groups are created, and each group is assigned a sequence of tasks which must be completed. User accounts can be created by the study administrator, or users can sign up online via an advertised URL. When a user logs into a WAMI application in user-study mode, he or she is presented with a sequence of tasks to complete. For example, in a recent City Browser study, which usually required 30–60 minutes, subjects were first presented with instructions on how to use the interface and then given a simple warmup task. Next, they were led through ten scenarios in which they were asked to, for example, get directions between two locations or find a restaurant with certain attributes. Then, they were given “free play” time to interact as they wished. Finally, they completed a survey. A less complex study performed with Word War simply required each user to complete as many games as possible in 10 minutes.
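The Java sketch below captures the kind of task-sequence definition implied here, using hypothetical Task and UserGroup types; the paper specifies only that tasks are defined, ordered into sequences, and assigned to user groups.

import java.util.List;

public class StudyDefinitionSketch {
    enum TaskKind { INSTRUCTIONS, WARMUP, SCENARIO, FREE_PLAY, SURVEY }

    record Task(TaskKind kind, String description) {}
    record UserGroup(String name, List<Task> sequence) {}

    public static void main(String[] args) {
        // Roughly the City Browser study described above: instructions, a warmup,
        // a set of scenarios, free play, and a closing survey.
        UserGroup group = new UserGroup("city-browser-pilot", List.of(
                new Task(TaskKind.INSTRUCTIONS, "How to use the interface"),
                new Task(TaskKind.WARMUP, "Locate a spoken address on the map"),
                new Task(TaskKind.SCENARIO, "Get directions between two locations"),
                new Task(TaskKind.SCENARIO, "Find a restaurant with certain attributes"),
                new Task(TaskKind.FREE_PLAY, "Interact as you wish"),
                new Task(TaskKind.SURVEY, "Post-study questionnaire")));
        group.sequence().forEach(t -> System.out.println(t.kind() + ": " + t.description()));
    }
}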

5.2 Transcription, Annotation, and Playback
When designing any application which makes use of a speech recognizer, it is critical to transcribe user utterances: transcripts are necessary to measure error rates and to improve the language and acoustic models used by the speech recognizer. Beyond transcripts, interface designers are typically interested in understanding how users interacted with an application. Such knowledge allows them to identify system malfunctions and the strengths and weaknesses of an interface. For speech-only interfaces, this sort of knowledge can generally be gleaned from transcripts, speech recognition hypotheses, and system responses. In developing multimodal interfaces, however, transcripts do not suffice: it is immensely helpful to also understand what the GUI looked like at the time of interaction, and what sort of non-speech interaction occurred (clicking, circling, etc.).

Figure 8: Web-based Session Playback.

When performing studies in the laboratory, experimenters can observe and/or videotape a user's interaction; however, this is simply not possible for web-based systems accessed remotely. Thus, another key capability of the WAMI toolkit is the ability to log user interactions and then “replay” the interactions later. We have found that this is often the only feasible manner in which an annotator can, for example, accurately judge whether or not an application's response to a user's utterance was appropriate. Playback and annotation tools can also be useful in evaluating human performance. For example, teachers are able to judge a student's speaking ability by replaying student interactions with an application like Word War.

The WAMI toolkit includes a logging component as part of the core architecture, as shown in figure 1. The logger records the user utterance waveform, recognition hypotheses, and the system's natural language responses. In addition, any events sent from the client GUI to the application-specific logic are logged, so that notifications of clicks, gestures, the scrolling of windows, and so forth are recorded. On the flip side, any messages sent from the application-specific logic residing on the server to the client GUI are also logged.

To play back these logs, the toolkit provides a web-based Transcription and Annotation tool as part of the management interface. Using this interface, depicted in figure 8, developers or annotators can control a server-based Log Player, which loads a log of one user's interaction and replays the time-stamped events at the appropriate pace. The web-based tool loads the application's GUI, which plays the events as it receives them from the Log Player. For instance, recorded user utterances are played back, while any GUI interactions such as gestures are shown on the GUI. As the log is replayed, the annotator can watch the interaction unfold, while simultaneously transcribing the user utterances and making any appropriate annotations.
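A minimal Java sketch of the paced replay just described, assuming a hypothetical LogEvent record; the paper states only that logged, time-stamped events (audio, hypotheses, and GUI messages) are replayed at their original pace.

import java.util.List;

public class LogPlayerSketch {
    record LogEvent(long timestampMillis, String source, String payloadXml) {}

    /** Replay events, sleeping so that inter-event gaps match the recording. */
    static void replay(List<LogEvent> log) throws InterruptedException {
        for (int i = 0; i < log.size(); i++) {
            if (i > 0) {
                Thread.sleep(log.get(i).timestampMillis() - log.get(i - 1).timestampMillis());
            }
            // In the real tool this would be pushed to the annotator's browser GUI;
            // here we just print it.
            System.out.println(log.get(i).source() + ": " + log.get(i).payloadXml());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        replay(List.of(
                new LogEvent(0, "gui", "<click target=\"map\"/>"),
                new LogEvent(850, "recognizer", "<hypothesis>show restaurants here</hypothesis>"),
                new LogEvent(1400, "server", "<showResults count=\"12\"/>")));
    }
}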

5.3 Usage Data Analysis
We have used the tools described in this section to collect, transcribe, and annotate usage data for the Word War and City Browser applications described above. For very early prototypes, we have relied on traditional laboratory studies, in which users come to the lab and are assigned a set of tasks to perform. As prototypes advance, we move to remote studies, in which subjects are recruited via public announcements and log in from their own computers. In our experience, this allows us to collect data much more rapidly with much less time invested on our part.

Moreover, our impression after transcribing and annotating thousands of utterances is that users act more naturally when they are in the comfort of their home or office. In particular, they experiment more with the capabilities of the system and push the bounds of what's possible. For the researcher, watching these interactions in playback mode can be a simultaneously thrilling and sobering experience. While replaying a Word War interaction captured in single-player practice mode, for example, it became apparent that a father and daughter were practicing Chinese together, at times sounding out words simultaneously! Similarly, we have observed interactions in which City Browser is used by two or more people at once in a social setting. A more subtle difference is that in-lab users of both applications tend to stick more closely to the provided sample sentence structures, while remote users are willing to experiment more.

Since individual users rely on their own hardware to perform the tasks in a remote user study, the audio quality also varies greatly. Though clear instructions and tutorial tasks can prevent many problems, users with poor microphones or mis-configured audio settings are difficult to avoid entirely. While in some ways this is a disadvantage, by deploying a WAMI application to the web and performing an error analysis on recognition results, the researcher can gain insight into the bottlenecks for robust recognition in a realistic environment. In [13], Word War recognition performance was analyzed along four dimensions, one of which was audio quality. While its effects were significant, it turned out that the “non-nativeness” of a student's speech was a larger impediment to accurate recognition.

Finally, another clear advantage of deploying web-based interfaces is that users can be recruited from all over the world, and can use the interface whenever it is most convenient. Figure 9 shows the geographical distribution of users who have interacted with Word War and City Browser, and a histogram of the local time of day at which these interactions took place. Subjects logged in from all over the world, and there was heavy usage outside of normal business hours, when user studies are typically conducted.

Figure 9: City Browser and Word War usage. (a) Geographical distribution of study subjects. (b) Local time of day of interactions (number of sessions by local hour of day).

6. CONCLUSIONS AND FUTURE WORK
We have presented the WAMI toolkit, which allows the development, deployment, and evaluation of highly interactive web-based multimodal interfaces on a wide range of network-connected devices. We have demonstrated the feasibility of this approach by describing several rich interfaces using the toolkit, and by showing how the provided web-based study management system makes it straightforward to gather and annotate user interactions with such systems. Our hope is that research in the field will benefit by making it relatively cheap and easy to deploy multimodal interfaces to a large number of users.

To that end, we are currently in the process of making the toolkit publicly available. In particular, we are developing a portal at http://web.sls.csail.mit.edu/wami/ which will provide speech recognition services to developers. Any developer will be able to create a multimodal application using the lightweight model described in section 3.2 without having to configure a speech recognizer; instead, the developer will upload a JSGF grammar and connect to a recognizer maintained on our servers. We hope that by significantly lowering the barrier to entry, we will enable developers with fresh ideas to create novel, compelling multimodal interfaces.

7. ACKNOWLEDGMENTS
Thanks to Stephanie Seneff for feedback on drafts of this paper. This research is sponsored by the T-Party Project, a joint research program between MIT and Quanta Computer, Inc., and by the Industrial Technology Research Institute.

8. REFERENCES
[1] A. Acero, N. Bernstein, R. Chambers, Y. C. Jui, X. Li, J. Odell, P. Nguyen, O. Scholz, and G. Zweig. Live search for mobile: Web services by voice on the cellphone. In Proc. of ICASSP, 2008.
[2] G. Aist, J. Allen, E. Campana, C. G. Gallo, S. Stoness, M. Swift, and M. K. Tanenhaus. Incremental understanding in human-computer dialogue and experimental evidence for advantages over nonincremental methods. In R. Artstein and L. Vieu, editors, Proc. of the 11th Workshop on the Semantics and Pragmatics of Dialogue, pages 149–154, 2007.
[3] J. Axelsson, C. Cross, J. Ferrans, G. McCobb, T. V. Raman, and L. Wilson. Mobile X+V 1.2. Technical report, 2005. http://www.voicexml.org/specs/multimodal/x+v/mobile/12/.
[4] A. Gruenstein, B.-J. P. Hsu, J. Glass, S. Seneff, L. Hetherington, S. Cyphers, I. Badr, C. Wang, and S. Liu. A multimodal home entertainment interface via a mobile device. In Proc. of the ACL Workshop on Mobile Language Processing, 2008.
[5] A. Gruenstein, S. Seneff, and C. Wang. Scalable and portable web-based multimodal dialogue interaction with geographical databases. In Proc. of INTERSPEECH, 2006.
[6] A. Hjalmarsson. Evaluating AdApt, a multi-modal conversational dialogue system using PARADISE. Master's thesis, KTH, Stockholm, Sweden, 2002.
[7] M. Honkala and M. Pohja. Multimodal interaction with XForms. In Proc. of the 6th International Conference on Web Engineering, pages 201–208, 2006.
[8] Java Speech Grammar Format. http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/.
[9] M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker, S. Whittaker, and P. Maloor. MATCH: An architecture for multimodal dialogue systems. In Proc. of ACL, 2002.
[10] K. Katsurada, Y. Nakamura, H. Yamada, and T. Nitta. XISL: A language for describing multimodal interaction scenarios. In Proc. of the 5th International Conference on Multimodal Interfaces, 2003.
[11] R. Lau, G. Flammia, C. Pao, and V. Zue. WebGALAXY: Beyond point and click – a conversational interface to a browser. Computer Networks and ISDN Systems, 29:1385–1393, 1997.
[12] O. Lemon and A. Gruenstein. Multithreaded context for robust conversational interfaces: Context-sensitive speech recognition and interpretation of corrective fragments. ACM Transactions on Computer-Human Interaction, 11(3):241–267, 2004.
[13] I. McGraw and S. Seneff. Speech-enabled card games for language learners. In Proc. of the 23rd AAAI Conference on Artificial Intelligence, 2008.
[14] A. Raux, D. Bohus, B. Langner, A. Black, and M. Eskenazi. Doing research on a deployed spoken dialogue system: One year of Let's Go! experience. In Proc. of INTERSPEECH-ICSLP, 2006.
[15] S. Seneff, E. Hurley, R. Lau, C. Pao, P. Schmid, and V. Zue. Galaxy-II: A reference architecture for conversational system development. In Proc. of ICSLP, 1998.
[16] A. Stent, J. Dowding, J. M. Gawron, E. O. Bratt, and R. Moore. The CommandTalk spoken dialogue system. In Proc. of ACL, 1999.
[17] Tellme mobile. http://m.tellme.com.
[18] K. Wang. SALT: A spoken language interface for web-based multimodal dialog systems. In Proc. of the 7th International Conference on Spoken Language Processing, 2002.
[19] Yahoo! oneSearch. http://mobile.yahoo.com.
[20] V. Zue, S. Seneff, J. Glass, J. Polifroni, C. Pao, T. J. Hazen, and L. Hetherington. JUPITER: A telephone-based conversational interface for weather information. IEEE Transactions on Speech and Audio Processing, 8(1), January 2000.