
Interleaved Preparation and Output in the COMIC Fission Module

Mary Ellen Foster
Institute for Communicating and Collaborative Systems
School of Informatics, University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, United Kingdom

[email protected]

Abstract

We give a technical description of the fission module of the COMIC multimodal dialogue system, which both plans the multimodal content of the system turns and controls the execution of those plans. We emphasise the parts of the implementation that allow the system to begin producing output as soon as possible by preparing and outputting the content in parallel. We also demonstrate how the module was designed to ensure robustness and configurability, and describe how the module has performed successfully as part of the overall system. Finally, we discuss how the techniques used in this module can be applied to other similar dialogue systems.

1 Introduction

In a multimodal dialogue system, even minor delays in processing at each stage can add up to give the whole system a sluggish feel. It is therefore critical that the output system avoid, as far as possible, adding any delays of its own to the sequence; there should be as little time as possible between the dialogue manager’s selection of the content of the next turn and the start of that turn’s output. When the output incorporates temporal modalities such as speech, the output unfolds over time, so it is possible to plan later parts of the turn while the earlier parts are being played. This means that the initial parts of the output can be produced more quickly, and any delay in preparing the later parts is partly or entirely hidden. The net effect is that the overall perceived delay in the output is much shorter than if the whole turn had been prepared before any output was produced.

In this paper, we give a technical description of the output system of the COMIC multimodal dialogue system, which is designed to allow exactly this interleaving of preparation and output. The paper is arranged as follows. In Section 2, we begin with a general overview of multimodal dialogue systems, concentrating on the design decisions that affect how output is specified and produced. In Section 3, we then describe the COMIC multimodal dialogue system and show how it addresses each of the relevant design decisions. Next, in Section 4, we describe how the segments of an output plan are represented in COMIC, and how those segments are prepared and executed in parallel. In Section 5, we discuss two aspects of the module implementation that are relevant to its role within the overall COMIC system: the techniques that were used to ensure the robustness of the fission module, and how it can be configured to support a variety of requirements. In Section 6, we then assess the practical impact of the parallel processing on the overall system responsiveness, and show that the output speed has a perceptible effect on the overall user experience with the system. Finally, in Section 7, we outline the aspects of the COMIC output system that are applicable to similar systems.


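[Figure 1: High-level architecture of a typical multimodal dialogue system. The User provides input to the input modules (ASR, Pen, ...), which feed Fusion; Fusion feeds the Dialogue manager, which draws on dialogue and domain knowledge; the Dialogue manager feeds Fission, which drives the output modules (Speech, Animation, ...) that present output to the User.]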

2 Output in Multimodal Dialogue Systems

Most multimodal dialogue systems use the basic high-level architecture shown in Figure 1. Input from the user is analysed by one or more input-processing modules, each of which deals with an individual input channel; depending on the application, the input channels may include speech recognition, pen-gesture or handwriting recognition, or information from visual sensors, for example. The messages from the various sources are then combined by a fusion module, which resolves any cross-modal references and produces a combined representation of the user input. This combined representation is sent to the dialogue manager, which uses a set of domain and dialogue knowledge sources to process the user input, interact with the underlying application if necessary, and specify the content to be output by the system in response. The output specification is sent to the fission module, which creates a presentation to meet the specification, using a combination of the available output channels. Again, depending on the application, a variety of output channels may be used; typical channels are synthesised speech, on-screen displays, or behaviour specifications for an animated agent or a robot.

This general structure is typical across multimodal dialogue systems; however, there are a number of design decisions that must be made when implementing a specific system. As this paper concentrates on the output components highlighted in Figure 1, we will discuss the design decisions that have a particular impact on those parts of the dialogue system: the domain of the application, the output modalities, the turn-taking protocol, and the division of labour among the modules. We will use as examples the WITAS (Lemon et al., 2002), MATCH (Walker et al., 2002), and SmartKom (Wahlster, 2005) systems.

The domain of the system and the interactions that it is intended to support both have an influence on the type of output that is to be generated. Many systems are designed primarily to support information exploration and presentation, and concentrate on effectively communicating the necessary information to the user. SmartKom and MATCH both fall into this category: SmartKom deals with movie and television listings, while MATCH works in the domain of restaurant recommendations. In a system such as WITAS, which incorporates real-time control of a robot helicopter, very different output must be generated to communicate the current state and goals of the robot to the user.

The choice of output modalities also affects the output system: different combinations of modalities require different types of temporal and spatial coordination, and different methods of allocating the content across the channels. Most multimodal dialogue systems use synthesised speech as an output modality, often in combination with lip-synch and other behaviours of an animated agent (e.g., MATCH, SmartKom). Various types of visual output are also often employed, including interactive maps (MATCH, WITAS), textual information presentations (SmartKom, MATCH), or images from visual sensors (WITAS). Some systems also dynamically adapt the output channels based on changing constraints; for example, SmartKom chooses a spoken presentation over a visual one in an eyes-busy situation.

Another factor that has an effect on the design of the output components is the turn-taking protocol selected by the system. Some systems, such as WITAS, support barge-in (Strom and Seneff, 2000); that is, the user may interrupt the system output at any time. Allowing the user to interrupt can permit a more intuitive interaction with the system; however, supporting barge-in creates many technical complications. For example, it is crucial that the output system be prepared to stop at any point, and that any parts of the system that track the dialogue history be made aware of how much of the intended content was actually produced. For simplicity, many systems, including SmartKom, instead use half-duplex turn-taking: when the system is producing output, the input modules are not active. This sort of system is technically more straightforward to implement, but requires that the user be given very clear signals as to when the system is and is not paying attention to their input. MATCH uses a click-to-talk interface, where the user presses a button on the interface to indicate that they want to speak; it is not clear whether the system supports barge-in.

The division of labour across the modules also differs among implemented systems. First of all, not all systems actually incorporate a separate component that could be labelled fission: for example, in WITAS, the dialogue manager itself also addresses the tasks of presentation planning and coordination. The components of the typical natural language generation “pipeline” (Reiter and Dale, 2000) may be split across the modules in a variety of ways. When it comes to content selection, for instance, in MATCH the dialogue manager specifies the content at a high level, while the text planner selects and structures the actual facts to include in the presentation; in WITAS, on the other hand, the specific content is selected by the dialogue manager. The tasks of text planning and sentence planning may be addressed by various combinations of the fission module and any text-generation modules involved: SmartKom creates the text in a separate generation module, while in MATCH text and sentence planning are more tightly integrated with content selection.

Coordination across multiple output channels is also implemented in various ways. If the only presentation modality is an animated agent, in many cases the generated text is sent directly to the agent, which then communicates privately with the speech synthesiser to ensure synchronisation. This “visual text-to-speech” configuration is the default behaviour of the Greta (de Rosis et al., 2003) and RUTH (DeCarlo et al., 2004) animated presentation agents, for instance. However, if the behaviour of the agent must be coordinated with other forms of output, it is necessary that the behaviour of all synchronised modules be coordinated centrally. How this is accomplished in practice depends on the capabilities of the speech synthesiser that is used. In SmartKom, for example, the presentation planner pre-synthesises the speech and uses the schedule returned by the synthesiser to create the full multimodal schedule; in MATCH, on the other hand, the speech synthesiser sends progress messages as it plays its output, which are used to control the output in the other modalities at run time.

3 The COMIC Dialogue System

COMIC [1] (COnversational Multimodal Interaction with Computers) is an EU IST 5th framework project combining fundamental research on human-human dialogues with advanced technology development for multimodal conversational systems. The COMIC multimodal dialogue system adds a dialogue interface to a CAD-like application used in sales situations to help clients redesign their bathrooms. The input to COMIC consists of speech, pen gestures, and handwriting; turn-taking is strictly half-duplex, with no barge-in or click-to-talk. The output combines the following modalities:

• Synthesised speech, with the text created using the OpenCCG surface realiser (White, 2005a;b) and synthesised by a custom Festival 2 voice (Clark et al., 2004) with support for APML prosodic markup (de Carolis et al., 2004).

• Facial expressions and gaze shifts of a talking head (Breidt et al., 2003).

• Direct commands to the design application.

• Deictic gestures at objects on the application screen, using a simulated mouse pointer.

Figure 2 shows the COMIC interface and a typical output turn, including commands for all modalities; the small capitals indicate pitch accents in the speech, with corresponding facial emphasis.

“[Nod] Okay. [Choose design] [Look at screen] THIS design [circling gesture] is CLASSIC. It uses tiles from VILLEROY AND BOCH’s CENTURY ESPRIT series. There are FLORAL MOTIFS and GEOMETRIC SHAPES on the DECORATIVE tiles.”

Figure 2: COMIC interface and sample output

[1] http://www.hcrc.ed.ac.uk/comic/

The specifications from the COMIC dialogue manager are high-level and modality-independent; for example, the specification of the output shown in Figure 2 would indicate that the system should show a particular set of tiles on the screen, and should give a detailed description of those tiles. When the fission module receives input from the dialogue manager, it selects and structures multimodal content to create an output plan, using a combination of scripted and dynamically-generated output segments. The fission module addresses the tasks of low-level content selection, text planning, and sentence planning; surface realisation of the sentence plans is done by the OpenCCG realiser. The fission module also controls the output of the planned presentation by sending appropriate messages to the output modules, including the text realiser, speech synthesiser, talking head, and bathroom-design GUI. Coordination across the modalities is implemented using a technique similar to that used in SmartKom: the synthesised speech is prepared in advance, and the timing information from the synthesiser is used to create the schedule for the other modalities.

The plan for an output turn in COMIC is represented in a tree structure; for example, Figure 3 shows part of the plan for the output in Figure 2. A plan tree like this is created from the top down, with the children created left-to-right at each level, and is executed in the same order. The planning and execution processes for a turn are started together and run in parallel, which makes it possible to begin producing output as soon as possible and to continue planning while output is active. In the following section, we describe the set of classes and algorithms that make this interleaved preparation and execution possible.

The COMIC fission module is implemented in a combination of Java and XSLT. The current module consists of 18 000 lines of Java code in 88 source files, and just over 9000 lines of XSLT templates. In the diagrams and algorithm descriptions that follow, some non-essential details are omitted for simplicity.


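[Figure 3: Output plan. A tree rooted at a Turn node whose children, from left to right, are Acknowledge (with children Nod and “Okay.”), Choose design, Look at screen, and Describe design (with children “This design is classic.”, “It uses tiles from ...”, and [...]).]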

4 Representing an Output Plan

Each node in an output-plan tree such as that shown in Figure 3 is represented by an instance of the Segment class. The structure of this abstract class is shown in Figure 4; the fields and methods defined in this class control the preparation and output of the corresponding segment of the plan tree, and allow preparation and output to proceed in parallel.

Each Segment instance stores a reference to its parent in the tree, and defines the following three methods:

• plan() Begins preparing the output.

• execute() Produces the prepared output.

• reportDone() Indicates to the Segment’s parent that its output has been completed.

plan() and execute() are abstract methods of the Segment class; the concrete implementations of these methods on the subclasses of Segment are described later in this section. Each Segment also has the following Boolean flags that control its processing; all are initially false.

• ready This flag is set internally once the Segment has finished all of its preparation and is ready to be output.

• skip This flag is set internally if the Segment encounters a problem during its planning, and indicates that the Segment should be skipped when the time comes to produce output.

• active This flag is set externally by the Segment’s parent, and indicates that this Segment should produce its output as soon as it is ready.

Segment
  # parent : Sequence
  # ready : boolean
  # skip : boolean
  # active : boolean
  + plan()
  + execute()
  # reportDone()

Figure 4: Structure of the Segment class

The activity diagram in Figure 5 shows how these flags and methods are used during the preparation and output of a Segment. Note that a Segment may send asynchronous queries to other modules as part of its planning. When such a query is sent, the Segment sets its internal state and exits its plan() method; when the response is received, preparation continues from the last state reached. Since planning and execution proceed in parallel across the tree, and the planning process may be interrupted to wait for responses from other modules, the ready and active flags may be set in either order on a particular Segment. Once both of these flags have been set, the execute() method is called automatically. If both skip and active are set, the Segment instead automatically calls reportDone() without ever executing; this allows Segments with errors to be skipped without affecting the output of the rest of the turn.

[Figure 5: Segment preparation and output. Activity diagram: plan() is called by the parent and communicates with other modules; on success the ready flag is set, and on error the skip flag is set; the active flag is set by the parent; execute() then produces the output, and reportDone() calls parent.childIsDone().]

[Figure 6: Segment class hierarchy. The classes shown in the diagram, with their fields and methods, are:

Segment: # parent : Sequence; # ready : boolean; # active : boolean; # skip : boolean; + plan(); + execute(); # reportDone()
Sequence: # children : List<Segment>; # cur : int; + execute(); + childIsDone()
BallisticSegment: + plan(); + execute()
StandaloneAvatarCommand: + execute(); + handlePoolMessage()
Sentence: + execute(); + handlePoolMessage()
ScriptedSentence: + plan(); - sendSpeechMessage()
PlannedSentence: + plan(); - sendRealiserMessage()
TurnSequence: + addTurn(); + plan(); + childIsDone()
Turn: + plan(); + handlePoolMessage()
ScriptedSequence: + plan()
PlannedSequence: + handlePoolMessage()
AskSee: # propName : String; # propVal : String; + plan()
Describe: # objIDs : List<String>; # requestedFeats : List<String>; # umResponse : UMRankings; # detailed : boolean; # propVal : String; + plan()

The inheritance arrows of the original diagram are not recoverable from this transcript.]
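The flag handling described above can be illustrated with a small Java sketch. This is not the COMIC source: only plan(), execute(), reportDone() and the three flags come from the description, while the helper names markReady(), markSkip(), activate() and maybeRun(), the started guard, and the log list are assumptions introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Segment flag logic; helper names are assumptions.
abstract class SketchSegment {
    protected boolean ready, skip, active;
    private boolean started;                       // guards against running twice
    public final List<String> log = new ArrayList<>();

    public abstract void plan();
    public abstract void execute();

    // Set internally once preparation has finished successfully.
    protected synchronized void markReady() { ready = true; maybeRun(); }

    // Set internally if preparation runs into a problem.
    protected synchronized void markSkip() { skip = true; maybeRun(); }

    // Set externally by the parent; may happen before or after markReady().
    public synchronized void activate() { active = true; maybeRun(); }

    private void maybeRun() {
        if (!active || started) return;
        if (skip) {                                // skipped: report without executing
            started = true;
            reportDone();
        } else if (ready) {                        // ready and active: produce output
            started = true;
            execute();
        }
    }

    // A real Segment would notify its parent here; the sketch only records it.
    protected void reportDone() { log.add("done"); }
}

// A trivial leaf whose planning either succeeds or fails immediately.
class EchoSegment extends SketchSegment {
    private final boolean fail;
    EchoSegment(boolean fail) { this.fail = fail; }
    @Override public void plan() { if (fail) markSkip(); else markReady(); }
    @Override public void execute() { log.add("output"); reportDone(); }
}
```

In this sketch a segment produces the same result whether activate() arrives before or after planning completes, and a segment whose planning fails reports completion without producing output.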

The full class hierarchy under Segment is shown in Figure 6. There are three main top-level subclasses of Segment, which differ primarily based on how they implement execute():

Sequence An ordered sequence of Segments. It is executed by activating each child in turn.

BallisticSegment A single command whose duration is determined by the module producing the output. It is executed by sending a message to the appropriate module and waiting for that module to report back that it has finished.

Sentence A single sentence, incorporating coordinated output in all modalities. Its schedule is computed in advance, as part of the planning process; it is executed by sending a “go” command to the appropriate output modules.

In the remainder of this section, we discuss each of these classes and its subclasses in more detail.

4.1 Sequence

All internal nodes in a presentation-plan tree (coloured blue in Figure 3) are instances of some type of Sequence. A Sequence stores a list of child Segments, which it plans and activates in order, along with a pointer to the currently active Segment. Figure 7 shows the pseudocode for the main methods of a typical Sequence.

Note that a Sequence sets its ready flag as soon as all of its necessary child Segments have been created, and only then begins calling plan() on them. This allows the Sequence’s execute() method to be called as soon as possible, which is critical to allowing the fission module to begin producing output from the tree before the full tree has been created.

When execute() is called on a Sequence, it calls activate() on the first child in its list. All subsequent children are activated by calls to the childIsDone() method, which is called by each child as part of its reportDone() method after its execution is completed. Note that this ensures that the children of a Sequence will always be executed in the proper order, even if they are prepared out of order. Once all of the Sequence’s children have reported that they are done, the Sequence itself calls reportDone().

    public void plan() {
        // Create child Segments

        cur = 0;
        ready = true;

        for (Segment seg : children) {
            seg.plan();
        }
    }

    public void execute() {
        children.get(0).activate();
    }

    public void childIsDone() {
        cur++;
        if (cur >= children.size()) {
            reportDone();
        } else {
            children.get(cur).activate();
        }
    }

Figure 7: Pseudocode for Sequence methods
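To make the ordering guarantee concrete, the following single-threaded Java sketch mirrors the Sequence pseudocode and adds a leaf whose preparation finishes only when a reply arrives. It is an illustration under stated assumptions rather than the COMIC code: the class names PlanNode, PlanSequence and PlanLeaf, the add() and finishPlanning() helpers, and the empty-sequence check are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, single-threaded sketch; not the COMIC implementation.
abstract class PlanNode {
    PlanSequence parent;
    boolean ready, active, started;

    abstract void plan();
    abstract void execute();

    void setReady() { ready = true; tryExecute(); }
    void activate() { active = true; tryExecute(); }

    // execute() runs once, as soon as both flags are set, in either order.
    private void tryExecute() {
        if (ready && active && !started) { started = true; execute(); }
    }

    void reportDone() { if (parent != null) parent.childIsDone(); }
}

class PlanSequence extends PlanNode {
    final List<PlanNode> children = new ArrayList<>();
    int cur;
    public boolean done;

    void add(PlanNode child) { child.parent = this; children.add(child); }

    @Override void plan() {
        cur = 0;
        setReady();                        // ready before any child is planned
        for (PlanNode child : children) child.plan();
    }

    @Override void execute() {
        if (children.isEmpty()) finish(); else children.get(0).activate();
    }

    void childIsDone() {
        cur++;
        if (cur >= children.size()) finish(); else children.get(cur).activate();
    }

    private void finish() { done = true; reportDone(); }
}

// A leaf whose preparation completes only when finishPlanning() is called,
// standing in for an asynchronous reply from another module.
class PlanLeaf extends PlanNode {
    final String name;
    final List<String> out;

    PlanLeaf(String name, List<String> out) { this.name = name; this.out = out; }

    @Override void plan() { /* the query would be sent here */ }
    void finishPlanning() { setReady(); }
    @Override void execute() { out.add(name); reportDone(); }
}
```

If three leaves a, b and c finish planning in the order c, b, a, nothing is output until a is ready, after which all three are output in list order.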

The main subclasses of Sequence, and their relevant features, are as follows:

TurnSequence The singleton class that is the parent of all Turns. It is always active, and new children can be added to its list at any time.

Turn Corresponds to a single message from the dialogue manager; the root of the output plan in Figure 3 is a Turn. Its plan() implementation creates a Segment corresponding to each dialogue act from the dialogue manager; in some cases, the Turn adds additional children not directly specified by the DAM, such as the verbal acknowledgement and the gaze shift in Figure 3.

ScriptedSequence A sequence of canned output segments stored as an XSLT template. A ScriptedSequence is used anywhere in the dialogue where dynamically-generated content is not necessary; for example, instructions to the user and acknowledgements such as the leftmost subtree in Figure 3 are stored as ScriptedSequences.

PlannedSequence In contrast to a ScriptedSequence, a PlannedSequence creates its children dynamically depending on the dialogue context. The principal type of PlannedSequence is a description of one or more tile designs, such as that shown in Figure 2. To create the content of such a description, the fission module uses information from the system ontology, the dialogue history, and the model of user preferences to select and structure the facts about the selected design and to create the sequence of sentences to realise that content. This process is described in detail in (Foster and White, 2004; 2005).

4.2 BallisticSegment

A BallisticSegment is a single command for a single output module, where the output module is allowed to choose the duration at execution time. In Figure 3, the orange Nod, Choose design, and Look at screen nodes are examples of BallisticSegments. In its plan() method, a BallisticSegment transforms its input specification into an appropriate message for the target output module. When execute() is called, the BallisticSegment sends the transformed command to the output module and waits for that module to report back that it is done; it calls reportDone() when it receives that acknowledgement.
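The send-then-wait pattern of a BallisticSegment might look roughly like the following Java sketch. The OutputModule interface, the callback-based acknowledgement, the FakeModule test double, and the message format built in plan() are all assumptions made for illustration; the paper does not specify the actual message format or transport.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface for an output module that chooses its own duration
// and invokes onDone when it has finished.
interface OutputModule {
    void send(String command, Runnable onDone);
}

class BallisticSketch {
    private final OutputModule module;
    private final String spec;
    private String message;                 // prepared by plan()
    public boolean done;

    BallisticSketch(OutputModule module, String spec) {
        this.module = module;
        this.spec = spec;
    }

    // Transform the input specification into a module-specific message.
    // The XML-like format is invented for this sketch.
    public void plan() {
        message = "<command name='" + spec + "'/>";
    }

    // Send the prepared command; completion is reported only when the
    // module acknowledges it.
    public void execute() {
        if (message == null) throw new IllegalStateException("plan() must run first");
        module.send(message, this::reportDone);
    }

    private void reportDone() { done = true; }
}

// A stand-in module that records commands and completes them on request.
class FakeModule implements OutputModule {
    public final List<String> received = new ArrayList<>();
    private final List<Runnable> pending = new ArrayList<>();

    @Override public void send(String command, Runnable onDone) {
        received.add(command);
        pending.add(onDone);
    }

    public void finishAll() {
        List<Runnable> callbacks = new ArrayList<>(pending);
        pending.clear();
        for (Runnable callback : callbacks) callback.run();
    }
}
```

The point of the callback is that the fission side never needs to know how long the command takes; it only reacts to the acknowledgement.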

4.3 Sentence

The Sentence class represents a single sentence, combining synthesised speech, lip-synch commands for the talking head, and possible coordinated behaviours on the other multimodal channels. The timing of a sentence is based on the timing of the synthesised speech; all multimodal behaviours are scheduled to coincide with particular words in the text. Unlike a BallisticSegment, which allows the output module to determine the duration at execution time, a Sentence must prepare its schedule in advance to ensure that output is coordinated across all of the channels. In Figure 3, all of the green leaf nodes containing text are instances of Sentence.

There are two types of Sentences: ScriptedSentences and PlannedSentences. A ScriptedSentence is generally created as part of a ScriptedSequence, and is based on pre-written text that is sent directly to the speech synthesiser, along with any necessary multimodal behaviours. A PlannedSentence forms part of a PlannedSequence, and is based on logical forms for the OpenCCG realiser (White, 2005a;b). The logical forms may contain multiple possibilities for both the text and the multimodal behaviours; the OpenCCG realiser uses statistical language models to make a final choice of the actual content of the sentence.

The first step in preparing either type of Sentence is to send the text to the speech synthesiser (Figure 8(a)). For a ScriptedSentence, the canned text is sent directly to the speech synthesiser; for a PlannedSentence, the logical forms are sent to the realiser, which then creates the text and sends it to the synthesiser. In either case, the speech-synthesiser input also includes marks at all points where multimodal output is intended. The speech synthesiser prepares and stores the waveform based on the input text, and returns timing information for the words and phonemes, along with the timing of any multimodal coordination marks.

The fission module uses the returned timing information to create the final schedule for all modalities. It then sends the animation schedule (lip-synch commands, along with any coordinated expression or gaze behaviours) to the talking-head module so that it can prepare its animation in advance (Figure 8(b)). Once the talking-head module has prepared the animation for a turn, it returns a “ready” message. The design application does not need its schedule in advance, so once the response is received from the talking head, the Sentence has finished its preparation and is able to set its ready flag.
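As a rough illustration of how a schedule can be derived from the synthesiser's timing information, the Java sketch below attaches each planned behaviour to the time of its coordination mark and sorts the result. The data shapes are assumptions: the mark names, the {mark, channel, action} triples, and the ScheduleEntry and ScheduleSketch classes are hypothetical and are not taken from the COMIC message formats.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// One scheduled behaviour, timed relative to the start of the sentence.
class ScheduleEntry {
    public final long offsetMs;
    public final String channel;
    public final String action;

    ScheduleEntry(long offsetMs, String channel, String action) {
        this.offsetMs = offsetMs;
        this.channel = channel;
        this.action = action;
    }
}

class ScheduleSketch {
    // markTimes: mark name -> offset in ms, as reported by the synthesiser.
    // behaviours: hypothetical triples of {mark name, channel, action}.
    static List<ScheduleEntry> build(Map<String, Long> markTimes, List<String[]> behaviours) {
        List<ScheduleEntry> schedule = new ArrayList<>();
        for (String[] b : behaviours) {
            Long time = markTimes.get(b[0]);
            if (time == null) {
                // In the terms of this paper, the Segment would set its skip flag here.
                throw new IllegalArgumentException("No timing for mark " + b[0]);
            }
            schedule.add(new ScheduleEntry(time, b[1], b[2]));
        }
        // Order the behaviours by when they should occur.
        schedule.sort(Comparator.comparingLong((ScheduleEntry e) -> e.offsetMs));
        return schedule;
    }
}
```

The offsets stay relative to the start of the sentence at this stage; an absolute start time is only chosen when the Sentence is executed.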

When a Sentence is executed by its parent, it selects a desired start time slightly in the future and sends two messages, as shown in Figure 8(c). First, it sends a “go” message with the selected starting time to the speech-synthesis and talking-head modules; these modules then play the prepared output for that turn at the given time. The Sentence also sends the concrete schedule for any coordinated gesture commands to the bathroom-design application at this point. After sending its messages, the Sentence waits until the scheduled duration has elapsed, and then calls reportDone().
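The timing arithmetic at execution time can be written out as a short Java sketch. The lead time, the millisecond units, and the ExecutionTiming class are assumptions for illustration; the paper says only that the start time is "slightly in the future".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that turns relative offsets into absolute times.
class ExecutionTiming {
    public final long startAtMs;            // carried by the "go" message
    public final long doneAtMs;             // when reportDone() would be called
    public final List<Long> gestureTimesMs; // concrete schedule for the GUI

    ExecutionTiming(long nowMs, long leadMs, long durationMs, List<Long> gestureOffsetsMs) {
        startAtMs = nowMs + leadMs;          // slightly in the future
        doneAtMs = startAtMs + durationMs;   // scheduled duration elapsed
        List<Long> absolute = new ArrayList<>();
        for (long offset : gestureOffsetsMs) {
            absolute.add(startAtMs + offset);
        }
        gestureTimesMs = absolute;
    }
}
```

For example, with the current time at 10 000 ms, a 200 ms lead, a 3 500 ms sentence and one gesture at offset 850 ms, output starts at 10 200 ms, the gesture fires at 11 050 ms, and the sentence is done at 13 700 ms.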


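[Figure 8: Planning and executing a Sentence. Each panel shows the Fission, Realiser, Synthesiser, Talking head, and Design application modules. (a) Preparing the speech: Fission sends logical forms to the Realiser, which sends generated text to the Synthesiser, or Fission sends canned text directly to the Synthesiser; the Synthesiser returns timing. (b) Preparing the animation: Fission sends phonemes and motions to the Talking head, which returns “Ready”. (c) Producing the output: Fission sends the start time to the Synthesiser and the Talking head, and the gesture schedule to the Design application.]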


5 Robustness and Configurability

In the preceding section, we gave a description of the data structures and methods that are used when preparing and executing an output plan. In this section, we describe two other aspects of the module that are important to its functioning as part of the overall dialogue system: its ability to detect and deal with errors in its processing, and the various configurations in which it can be run.

5.1 Error Detection and Recovery

Since barge-in is not implemented in COMIC, the fission module plays an important role in turn-taking for the whole COMIC system: it is the module that informs the input components when the system output is finished, so that they are able to process the next user input. The fission module therefore incorporates several measures to ensure that it is able to detect and recover from unexpected events during its processing, so that the dialogue is able to continue even if there are errors in some parts of the output.

Most input from external modules is validated against XML schemas to ensure that it is well-formed, and any messages that fail to validate are not processed further. As well, all queries to external modules are sent with configurable time-outs, and any Segment that is expecting a response to a query is also prepared to deal with a time-out.

If a problem occurs while preparing any Segment for output, whether due to an error in internal processing or because of an issue with some external module, that Segment immediately sets its skip flag and stops the preparation process. As described in Section 4, any Segments with this flag set are then skipped at execution time. This ensures that processing is able to continue as much as possible despite the errors, and that the fission module is still able to produce output from the parts of an output plan unaffected by the problems and to perform its necessary turn-taking functions.
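The combination of query time-outs and the skip flag can be illustrated with a short sketch. The class and method names below (`Segment.query`, `Segment.prepare`, `execute_all`, `PreparationError`) and the use of an in-process queue to receive replies are hypothetical; only the behaviour (time-out on queries, setting the skip flag on any failure, skipping flagged Segments at execution time) is taken from the description above.

```python
import queue


class PreparationError(Exception):
    """Raised when a Segment cannot be prepared for output."""


class Segment:
    def __init__(self, name, timeout=1.0):
        self.name = name
        self.timeout = timeout           # configurable time-out, in seconds
        self.skip = False                # set when preparation fails
        self.result = None
        self.responses = queue.Queue()   # replies from the external module

    def query(self, send_query):
        """Send a query and wait for the reply, giving up after the time-out."""
        send_query(self)
        try:
            return self.responses.get(timeout=self.timeout)
        except queue.Empty:
            raise PreparationError(f"{self.name}: query timed out") from None

    def prepare(self, send_query):
        try:
            self.result = self.query(send_query)
        except Exception:
            # Any internal or external problem: flag the Segment and stop preparing.
            self.skip = True


def execute_all(segments):
    """Execute the prepared Segments in order, skipping flagged ones."""
    produced = []
    for segment in segments:
        if segment.skip:
            continue
        produced.append(segment.name)    # stand-in for producing real output
    return produced
```

In this sketch a Segment whose external module never replies is simply left out of the output, while the Segments before and after it are still produced.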

5.2 Configurability

The COMIC fission module can be run in several different configurations, to meet a variety of evaluation, demonstration, and development situations. The fission module can be configured not to wait for “ready” and “done” responses from either or both of the talking-head and design-application modules; the fission module simply proceeds with the rest of its processing as if the required response had been received. This allows the whole COMIC system to be run without those output modules enabled. This is useful during development of other parts of the system, and for running demos and evaluation experiments where not all of the output channels are used. The module also has a number of other configuration options to control factors such as query time-outs and the method of selecting multimodal coarticulations.
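One way to picture the "do not wait" options is as flags that short-circuit the wait for a response. The sketch below is illustrative only: `FissionConfig`, its field names, the module keys, and the `receive` callback are assumptions, not the actual COMIC configuration interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FissionConfig:
    wait_for_talking_head: bool = True   # hypothetical option names
    wait_for_design_app: bool = True
    query_timeout: float = 5.0           # seconds


def await_response(module, expected, config, receive):
    """Return True once `expected` ("ready" or "done") can be treated as received."""
    must_wait = {
        "talking_head": config.wait_for_talking_head,
        "design_app": config.wait_for_design_app,
    }[module]
    if not must_wait:
        # Proceed as if the required response had been received.
        return True
    return receive(module, config.query_timeout) == expected
```

With `wait_for_talking_head=False`, the `receive` callback is never invoked for the talking head, so the rest of the processing can run with that module absent.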

As well, the fission module has the ability to generate multiple alternative versions of a single turn, using different user models, dialogue-history settings, or multimodal planning techniques; this is useful both as a testing tool and as part of a system demonstration. The module can also store all of the generated output to a script, and play back the scripted output at a later time using a subset of the full system. This allows alternative versions of the system output to be directly compared in user evaluation studies such as (Foster, 2004; Foster and White, 2005).

6 Output Speed

In the final version of the COMIC system, the average time (see footnote 2) that the speech synthesiser takes to prepare the waveform for a sentence is 1.9 seconds, while the average synthesised length of a sentence is 2.7 seconds. This means that, on average, each sentence takes long enough to play that the next sentence is ready as soon as it is needed; and even when this is not the case, the delay between sentences is still greatly reduced by the parallel planning process.
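The effect of these averages can be worked through with a small timing model. The sketch below is an idealisation, not a measurement: it assumes that sentences are prepared strictly one after another starting at time zero, that preparation of later sentences overlaps with playback of earlier ones, and that each sentence takes exactly the given preparation and playback times. The function names are introduced here for illustration.

```python
def interleaved_delays(prep_times, play_times):
    """Return (initial_delay, gaps) when preparation overlaps with playback."""
    ready = 0.0       # time at which the current sentence finishes preparing
    prev_end = None   # time at which the previous sentence finishes playing
    initial_delay, gaps = 0.0, []
    for prep, play in zip(prep_times, play_times):
        ready += prep
        if prev_end is None:
            start = initial_delay = ready
        else:
            # A sentence starts once it is ready and its predecessor has ended.
            start = max(ready, prev_end)
            gaps.append(start - prev_end)
        prev_end = start + play
    return initial_delay, gaps


def prepare_all_first_delay(prep_times):
    """Initial delay when the whole turn is prepared before any output."""
    return sum(prep_times)
```

Under these assumptions, a three-sentence turn in which every sentence takes the average 1.9 seconds to prepare and 2.7 seconds to play begins after 1.9 seconds with no gaps between sentences, compared with about 5.7 seconds if all three sentences were prepared first.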

The importance of beginning output as soon as possible was demonstrated by a user evaluation of an interim version of COMIC (White et al., 2005). Subjects in that study used the full COMIC system in one of two configurations: an “expressive” condition, where the talking head used all of the expressions it was capable of, or a “zombie” condition, where all of the behaviours of the head were disabled except for lip-synch. One effect of this difference was that the system gave a consistently earlier response in the expressive condition: a facial response was produced an average of 1.4 seconds after the dialogue-manager message, while spoken output did not begin for nearly 4 seconds. Although that version of the system was very slow, the subjects in the expressive condition were significantly less likely to mention the overall slowness than the subjects in the zombie condition.

Footnote 2: On a Pentium 4 1.6 GHz computer.

After this interim evaluation, effort was put into further reducing the delay in the final system. For example, we now store the waveforms for acknowledgements and other frequently-used texts pre-synthesised in the speech module instead of sending them to Festival, and other internal processing bottlenecks were eliminated. Using the same computers as the interim evaluation, the fission delay for initial output is under 0.5 seconds in the final system.

7 Conclusions

The COMIC fission module is able to prepare and control the output of multimodal turns. It prepares and executes its plans in parallel, which allows it to begin producing output as soon as possible and to continue with preparing later parts of the presentation while executing earlier parts. It is able to produce output coordinated and synchronised across multiple modalities, to detect and recover from a variety of errors during its processing, and to be run in a number of different configurations to support testing, demonstrations, and evaluation experiments. The parallel planning process is able to make a significant reduction in the time taken to produce output, which has a perceptible effect on user satisfaction with the overall system.

Some aspects of the fission module are specific to the design of the COMIC dialogue system; for example, the module performs content-selection and sentence-planning tasks that in other systems might be addressed by a dialogue manager or text-generation module. Also, aspects of the communication with the output modules are tailored to the particular modules involved: the fission module makes use of features of the OpenCCG realiser to help choose the content of many of its turns, and the implementation of the design application is obviously COMIC-specific.

However, the general technique of interleaving preparation and execution, using the time while the system is playing earlier parts of a turn to prepare the later parts, is easily applicable to any system that produces temporal output, as long as the same module is responsible for preparing and executing the output. There is nothing COMIC-specific about the design of the Segment class or its immediate sub-classes.

As well, the method of coordinating distributed multimodal behaviour with the speech timing (Section 4.3) is a general one. Although the current implementation relies on the output modules to respect the schedules that they are given, with no adaptation at run time, in practice the coordination in COMIC has been generally successful, provided that three conditions are met. First, the selected starting time must be far enough in the future that it can be received and processed by each module in time. Second, the clocks on all computers involved in running the system must be synchronised precisely. Finally, the processing load on each computer must be low enough that timer events do not get delayed or pre-empted.

Since the COMIC system does not support barge-in, the current fission module always produces the full presentation that is planned, barring processing errors. However, since the module produces its output incrementally, it would be straightforward to extend the processing to allow execution to be interrupted after any Segment, and to know how much of the planned output was actually produced.
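A rough sketch of such an extension is given below. It is not part of COMIC, which does not support interruption; the `execute_interruptibly` function, the representation of Segments as (name, action) pairs, and the use of a `threading.Event` as the interrupt signal are all assumptions made for illustration.

```python
import threading


def execute_interruptibly(segments, interrupt):
    """Run (name, action) pairs in order; stop once `interrupt` is set.

    Returns the names of the Segments that were actually produced.
    """
    produced = []
    for name, action in segments:
        # The interrupt is only checked between Segments, never mid-Segment.
        if interrupt.is_set():
            break
        action()
        produced.append(name)
    return produced
```

Because the check happens only at Segment boundaries, the returned list records exactly how much of the planned output was produced, which a dialogue manager could then use to update its record of what the user has heard.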

Acknowledgements

Many thanks to Peter Poller, Tilman Becker, and especially Michael White for helpful advice on and discussions about the implementation of the COMIC fission module, and to the anonymous reviewers for their comments on the initial version of this paper. This work was supported by the COMIC project (IST-2001-32311).


References

M. Breidt, C. Wallraven, D.W. Cunningham, and H.H. Bülthoff. 2003. Facial animation based on 3D scans and motion capture. In Neill Campbell, editor, SIGGRAPH 03 Sketches & Applications. ACM Press.

R.A.J. Clark, K. Richmond, and S. King. 2004. Festival 2 – build your own general purpose unit selection speech synthesiser. In Proceedings, 5th ISCA workshop on speech synthesis.

B. de Carolis, C. Pelachaud, I. Poggi, and M. Steedman. 2004. APML, a mark-up language for believable behaviour generation. In H. Prendinger, editor, Life-like Characters, Tools, Affective Functions and Applications, pages 65–85. Springer.

F. de Rosis, C. Pelachaud, I. Poggi, V. Carofiglio, and B. De Carolis. 2003. From Greta’s mind to her face: modelling the dynamics of affective states in a conversational embodied agent. International Journal of Human-Computer Studies, 59(1–2):81–118.

D. DeCarlo, M. Stone, C. Revilla, and J.J. Venditti. 2004. Specifying and animating facial signals for discourse in embodied conversational agents. Computer Animation and Virtual Worlds, 15(1):27–38.

M.E. Foster. 2004. User evaluation of generated deictic gestures in the T24 demonstrator. Public deliverable 6.5, COMIC project.

M.E. Foster and M. White. 2004. Techniques for text planning with XSLT. In Proceedings, NLPXML-2004.

M.E. Foster and M. White. 2005. Assessing the impact of adaptive generation in the COMIC multimodal dialogue system. In Proceedings, IJCAI 2005 Workshop on Knowledge and Reasoning in Practical Dialogue Systems.

O. Lemon, A. Gruenstein, and S. Peters. 2002. Collaborative activities and multi-tasking in dialogue systems. Traitement Automatique des Langues (TAL), 43(2):131–154.

E. Reiter and R. Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.

N. Strom and S. Seneff. 2000. Intelligent barge-in in conversational systems. In Proceedings, ICSLP-2000, volume 2, pages 652–655.

W. Wahlster, editor. 2005. SmartKom: Foundations of Multimodal Dialogue Systems. Springer. In press.

M.A. Walker, S. Whittaker, A. Stent, P. Maloor, J.D. Moore, M. Johnston, and G. Vasireddy. 2002. Speech-plans: Generating evaluative responses in spoken dialogue. In Proceedings, INLG 2002.

M. White. 2005a. Designing an extensible API for integrating language modeling and realization. In Proceedings, ACL 2005 Workshop on Software.

M. White. 2005b. Efficient realization of coordinate structures in Combinatory Categorial Grammar. Research on Language and Computation. To appear.

M. White, M.E. Foster, J. Oberlander, and A. Brown. 2005. Using facial feedback to enhance turn-taking in a multimodal dialogue system. In Proceedings, HCI International 2005 Thematic Session on Universal Access in Human-Computer Interaction.