Gesture semantics reconstruction based on motion capturing and complex event processing

By: Thies Pfeiffer, Florian Hofmann, Florian Hahn, Hannes Rieser, and Insa Röpke

Pfeiffer, T., Hofmann, F., Hahn, F., Rieser, H., Röpke, I. (2013). Gesture semantics reconstruction based on motion capturing and complex event processing. 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 270-279. https://www.aclweb.org/anthology/W13-4041

© 2013 Association for Computational Linguistics. Published under a Creative Commons Attribution International License (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/

Abstract: A fundamental problem in manual based gesture semantics reconstruction is the specification of preferred semantic concepts for gesture trajectories. This issue is complicated by problems human raters have annotating fast-paced three dimensional trajectories. Based on a detailed example of a gesticulated circular trajectory, we present a data-driven approach that covers parts of the semantic reconstruction by making use of motion capturing (mocap) technology. In our FA3ME framework we use a complex event processing approach to analyse and annotate multi-modal events. This framework provides grounds for a detailed description of how to get at the semantic concept of circularity observed in the data.

Keywords: semantics | complex event processing | gesture annotation | circular trajectory


Proceedings of the SIGDIAL 2013 Conference, pages 270–279, Metz, France, 22–24 August 2013. © 2013 Association for Computational Linguistics

Gesture Semantics Reconstruction Based on Motion Capturing and Complex Event Processing: a Circular Shape Example

Thies Pfeiffer, Florian Hofmann
Artificial Intelligence Group
Faculty of Technology
Bielefeld University, Germany
(tpfeiffe|fhofmann)@techfak.uni-bielefeld.de

Florian Hahn, Hannes Rieser, Insa Röpke
Collaborative Research Center
"Alignment in Communication" (CRC 673)
Bielefeld University, Germany
(fhahn2|hannes.rieser|iroepke)@uni-bielefeld.de

Abstract

A fundamental problem in manual based gesture semantics reconstruction is the specification of preferred semantic concepts for gesture trajectories. This issue is complicated by problems human raters have annotating fast-paced three dimensional trajectories. Based on a detailed example of a gesticulated circular trajectory, we present a data-driven approach that covers parts of the semantic reconstruction by making use of motion capturing (mocap) technology. In our FA3ME framework we use a complex event processing approach to analyse and annotate multi-modal events. This framework provides grounds for a detailed description of how to get at the semantic concept of circularity observed in the data.

1 Introduction

Focussing on iconic gestures, we discuss the benefit of motion capturing (mocap) technology for the reconstruction of gesture meaning and speech meaning: A fundamental problem is the specification of semantic concepts for gesture trajectories, e.g., for describing circular movements or shapes. We start with demonstrating the limitations of our manual based annotation. Then we discuss two strategies of how to deal with these, pragmatic inference vs. low level annotation based on mocap technology yielding a more precise semantics. We then argue that the second strategy is to be preferred to the inferential one.

The annotation of mocap data can be realised semi-automatically by our FA3ME framework for the analysis and annotation of multi-modal events, which we use to record multi-modal corpora. For mocap we use the tracking system ART DTrack2 (advanced realtime tracking GmbH, 2013), but the framework is not restricted to this technical set-up. In cooperation with others (e.g., (Kousidis et al., 2012)), we also have used products from Vicon Motion Systems (2013) and the Microsoft Kinect (Microsoft, 2013). Pfeiffer (2013) presents an overview on mocap technology for documenting multi-modal studies.

We thus provide details about the way gestures are analysed with FA3ME and about the procedure to reconstruct the gesture meaning for the circular movement in our chosen example. We conclude with a discussion of how these low-level reconstructions can be integrated into the reconstruction of speech and gesture meaning.

2 From Linguistic Annotation to MoCap

In this section we describe our methodology for the reconstruction of gesture meaning, speech meaning and its interfacing, illustrated by an example. We then show a shortcoming of our corpus-based annotation and discuss two possible solutions to amend it, pragmatic inference vs. semantics based on mocap technology. The technology described in Section 3 will in the end enable us to get the preferred reconstruction of gesture semantics.

The reconstruction of the gesture meaning and its fusion with speech meaning to get a multi-modal proposition works as follows: On the speech side we start with a manual transcription, upon which we craft a context free syntax analysis followed by a formal semantics. On the gesture side we build an AVM-based representation of the gesture resting on manual annotation.1 Taking the gesture as a sign with independent meaning (Rieser, 2010), this representation provides the basis for the formal gesture semantics. In the next step, the gesture meaning and the speech meaning are fused into an interface (Röpke et al., 2013).

1 We do not use an explicit gesture model, which would go against our descriptive intentions. The range of admissible gestures is fixed by annotation manuals and investigations in gesture typology.



Figure 1: Our example: a circular gesture (left: video still) to describe the path around the pond (right).

Every step in these procedures is infested by underspecification which, however, we do not deal with here. These are, for instance, the selection of annotation predicates, the attribution of logical form to gestures and the speech analysis.

In our example, we focus on two gesture parameters, the movement feature of the gesture and the representation technique used. It originates from the systematically annotated corpus, called SaGA, the Bielefeld Speech-and-Gesture Alignment corpus (Lücking et al., 2012). It consists of 25 dialogues of dyads conversing about a "bus ride" through a Virtual Reality town. One participant of each dyad, the so-called Route-Giver (RG), has done this ride and describes the route and the sights passed to the other participant, the so-called Follower (FO). The taped conversations are annotated in a fine-grained way.

In the example, the RG describes a route section around a pond to the FO. While uttering "Du gehst drei Viertel rum/You walk three quarters around", she produces the gesture depicted in Figure 1. Intuitively, the gesture conveys circularity information not expressed in the verbal meaning. In order to explicate the relation of speech and gesture meaning, we use our methodology as described above. To anticipate, we get a clear contribution of the speech meaning which is restricted by the gesture meaning conveying the roundness information. The first step is to provide a syntactical analysis which you can see in Figure 2.2

2 The gesture stroke extends over the whole utterance. Verb phrases can feature so-called "sentence brackets". Here, due to a sentence bracket, the finite verb stem "gehst" is separated from its prefix ("rum"). Together they embrace the German Mittelfeld, here "drei Viertel". Observe the N-ellipsis "∅" in the NP. The prefix and the finite verb stem cannot be fully interpreted on their own and are therefore marked with an asterisk.

Figure 2: Syntax analysis (parse tree of "Du gehst drei Viertel rum" / "You walk three quarters around", with S, VP and NP nodes; the gesture stroke extends over the whole utterance).

The speech representation is inspired by a Montague-Parsons-Reichenbach event ontology, and uses type-logic notions. Ignoring the embedding in an indirect speech act3, the speech semantics represents an AGENT (the FO) who is engaged in a WALK-AROUND event e related to some path F, and a THEME relating the WALK-AROUND event e with the path F.

∃e y F 3/4x (WALK-AROUND(e) ∧ AGENT(e, FO) ∧ THEME(e, x) ∧ F(x, y))  (1)

The gesture semantics is obtained using the annotated gesture features. The relevant features are the movement of the wrist (Path of Wrist) and the Representation Technique used.

Path of Wrist: ARC<ARC<ARC<ARC
Representation Technique: Drawing4

Interpreting the values ARC<ARC<ARC<ARC and Drawing, respectively, the calculated gesture semantics represents a bent trajectory consisting of four segments:

3 We have treated the function of speech-gesture ensembles in dialogue acts and dialogues elsewhere (Rieser and Poesio (2009), Rieser (2011), Hahn and Rieser (2011), Lücking et al. (2012)).

4 This is a shortened version of the full gesture-AVM. Features like hand shape etc. are ignored. See Rieser (2010) for other annotation predicates.



∃x y1 y2 y3 y4 (TRAJECTORY0(x) ∧ BENT(y1) ∧ BENT(y2) ∧ BENT(y3) ∧ BENT(y4) ∧ y1 < y2 < y3 < y4 ∧ SEGMENT(y1, x) ∧ SEGMENT(y2, x) ∧ SEGMENT(y3, x) ∧ SEGMENT(y4, x)).  (2)

The paraphrase is: There exists a TRAJECTORY0 x which consists of four BENT SEGMENTS y1, y2, y3, y4. We abbreviate this formula to:

∃x1(TRAJECTORY1(x1)) (3)

In more mundane verbiage: There is a particular TRAJECTORY1 x1. In a speech-gesture interface5 (Rieser, 2010) both formulae are extended by adding a parameter in order to compositionally combine them:

λY.∃e y F 3/4x (WALK-AROUND(e) ∧ AGENT(e, FO) ∧ THEME(e, x) ∧ F(x, y) ∧ Y(y))  (4)

We read this as: There is a WALK-AROUND event e the AGENT of which is FO related to a three quarters (path) F. This maps x onto y which is in turn equipped with property Y.

λz.∃x1(TRAJECTORY1(x1) ∧ x1 = z) (5)

This means "There is a TRAJECTORY1 x1 identical with an arbitrary z". The extensions (4) and (5) are based on the intuition that the preferred reading is a modification of the (path) F by the gesture.

Taking the gesture representation as an argument for the speech representation, we finally get a simplified multi-modal interface formula. The resulting proposition represents an AGENT (FO) who is engaged in a WALK-AROUND event e and a THEME that now is specified as being related to a bent trajectory of four arcs due to formula (2):

∃e y 3/4x ∃F (WALK-AROUND(e) ∧ AGENT(e, FO) ∧ THEME(e, x) ∧ F(x, y) ∧ TRAJECTORY1(y))  (6)

We take this to mean "There is an AGENT FO's WALK-AROUND event e related to a three quarters (path) F having a TRAJECTORY1 y".
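To make the composition explicit, here is a sketch (our illustration, assuming standard β-reduction) of how (6) results from applying the extended speech representation (4) to the extended gesture representation (5):

Instantiating the parameter Y in (4) with (5) gives
Y(y) = (λz.∃x1(TRAJECTORY1(x1) ∧ x1 = z))(y) = ∃x1(TRAJECTORY1(x1) ∧ x1 = y).

Substituting this for Y(y) in (4) and eliminating the identity x1 = y yields
∃e y F 3/4x (WALK-AROUND(e) ∧ AGENT(e, FO) ∧ THEME(e, x) ∧ F(x, y) ∧ TRAJECTORY1(y)),
which, with the quantifier over F written inside the prefix, is formula (6).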

As a result, the set of models in which the original speech proposition is true is restricted to the set of models that contain a bent trajectory standing in relation to the (path) F.

5 How our model deals with interfacing speech meaning and gesture meaning has been elaborated in a series of papers (see footnote 3). We are well aware of the work on gesture-speech integration by Lascarides and colleagues which we deal with in a paper on interfaces (Rieser (2013)).

But this restriction is too weak. Intuitively, the gesture conveys the meaning of a horizontal circular trajectory and not just four bent arcs. To see the shortcoming, note that the set of models also includes models that contain a path having four bends that do not form a circular trajectory.

We envisage two methods to get the appropriate circularity intuition: pragmatic enrichment and an improvement of our gesture datum to capture the additional information conveyed in the gesture: By pragmatic enrichment, on the one hand, horizontal orientation and circularity of the gesture trajectory are inferred using abduction or defaults. However, the drawback of working with defaults or abduction rules is that we would have to set up too many of them depending on the various shapes and functions of bent trajectories.

On the other hand, the datum can be improved to yield a circularly shaped trajectory instead of the weaker one consisting of four bent arcs. Our motion capture data supports the second method: The motion capture data allows us to compute the complete trajectory drawn in the gesture space. This will be the basis for producing a mapping from gesture parameters to qualitative relations which we need in the truth conditions. In the end, we achieve a circular trajectory that is defined as one approximating a circle, see Section 4.3.

In this mapping procedure resides an underspecification, which is treated by fixing a threshold for the application of qualitative predicates through raters' decisions. This threshold value will be used in giving truth conditions for, e.g., (11), especially for determining APPROXIMATE.

We prefer the second method since it captures our hypothesis that the gesture as a sign conveys the meaning circular trajectory. The gain of the automated annotation via mocap which we will see subsequently is an improvement of our original gesture datum to a more empirically founded one. As a consequence, the set of models that satisfy our multi-modal proposition can be specified. This is also the reason for explicitly focussing on gesture semantics in this paper.

3 FA3ME - Automatic Annotation as Complex Event Processing

The creation of FA3ME, our Framework for the Automatic Annotation and Augmentation of Multi-modal Events, is inter alia motivated by our key insight from previous studies that human raters have extreme difficulties when annotating 3D gesture poses and trajectories. This is especially true when they only have a restricted view on the recorded gestures. A typical example is the restriction to a fixed number of different camera angles from which the gestures have been recorded. In previous work (Pfeiffer, 2011), we proposed a solution for the restricted camera perspectives based on mocap data: Our Interactive Augmented Data Explorer (IADE) allowed human raters to immerse into the recorded data via virtual reality technology. Using a 3D projection in a CAVE (Cruz-Neira et al., 1992), the raters were enabled to move freely around and through the recorded mocap data, including a 3D reconstruction of the experimental setting. This interactive 3D visualisation supported an advanced annotation process and improved the quality of the annotations, but at high costs. Since then, we only know of Kipp (2010) who makes mocap data visible for annotators by presenting feature graphs in his annotation tool Anvil in a desktop-based setting. In later work, Nguyen and Kipp (2010) also support a 3D model of the speaker, but this needed to be handcrafted by human annotators. A more holistic approach for gesture visualizations is the Gesture Space Volumes (Pfeiffer, 2011), which summarize gesture trajectories over longer timespans or multiple speakers.

The IADE system also allowed us to add visual augmentations during the playback of the recorded data. These augmentations were based on the events from the mocap data, but aggregated several events to higher-level representations. In a study on pointing gestures (Lücking et al., 2013), we could test different hypotheses about the construction of the direction of pointing by adding visual pointing rays shooting in a 3D reconstruction of the original real world setting. This allowed us to assess the accuracy of pointing at a very high level in a data-driven manner and derive a new model for the direction of pointing (Pfeiffer, 2011).

3.1 Principles in FA3ME

In the FA3ME project, we iteratively refine our methods for analysing multi-modal events. As a central concept, FA3ME considers any recorded datum as a first-level multi-modal event (see Figure 3, left). This can be a time-stamped frame from a video camera, an audio sample, 6-degree-of-freedom matrices from a mocap system or gaze information from an eye-tracking system (e.g., see Kousidis et al. (2012)).

A distinctive factor of FA3ME is that we consider annotations as second-level multi-modal events. That is, recorded and annotated data share the same representation. Annotations can be added both by human raters and by classification algorithms (the event rules in Figure 3, middle). Annotations can themselves be the target of annotations. This allows us, for example, to create automatic classifiers that rely on recorded data and manual annotations (e.g., the first yellow event in Figure 3 depends on first-level events above and the blue second-level event to the right). This is helpful when classifiers for complex events are not (yet) available. If, for instance, no automatic classifier for the stroke of a gesture exists, these annotations can be made by human raters. Once this is done, the automatic classifiers can describe the movements during the meaningful phases by analysing the trajectories of the mocap data.
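To illustrate this layering (this is our own minimal Python sketch, not the FA3ME implementation; all class and function names are hypothetical), first-level events and a second-level annotation derived from them could be represented as follows:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MMEvent:
    """A multi-modal event: level 1 = recorded datum, level 2 = annotation,
    level 3 = augmentation/extrapolation."""
    level: int
    source: str                     # e.g. "mocap:right_hand" or "rater:stroke"
    t_start: float                  # seconds
    t_end: float
    payload: Dict = field(default_factory=dict)

def classify_movement(events: List[MMEvent], stroke: MMEvent) -> MMEvent:
    """Emit a second-level event that depends on first-level mocap events
    and on a manually annotated stroke (itself a second-level event)."""
    segment = [e for e in events
               if e.level == 1 and e.source.startswith("mocap:")
               and stroke.t_start <= e.t_start <= stroke.t_end]
    return MMEvent(level=2, source="classifier:movement",
                   t_start=stroke.t_start, t_end=stroke.t_end,
                   payload={"n_samples": len(segment)})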

Third-level multi-modal events are augmentations or extrapolations of the data. They might represent hypotheses, such as in the example of different pointing rays given above.

3.2 Complex Event Processing

In FA3ME, we consider the analysis of multi-modal events as a complex event processing (CEP) problem. CEP is an area of computer science dedicated to the timely detection, analysis, aggregation and processing of events (Luckham, 2002). In the past years, CEP has gained increased attention, especially in the analysis of business-relevant processes where large amounts of data, e.g., share prices, with high update rates are analysed. This has fostered many interesting tools and frameworks for the analysis of structured events (Arasu et al., 2004a; EsperTech, 2013; Gedik et al., 2008; StreamBase, 2013). Hirte et al. (2012) apply CEP to a motion tracking stream from a Microsoft Kinect for real-time interaction, but we know of no uses of CEP for the processing of multi-modal event streams for linguistic analysis.

Dedicated query languages have been developed by several CEP frameworks which allow us to specify our event aggregations descriptively at a high level of abstraction (Arasu et al., 2004b; Gedik et al., 2008).



Figure 3: In FA3ME, incoming multi-modal events are handled by a complex event processing framework that matches and aggregates events based on time windows to compose 2nd level multi-modal events. All multi-modal events can then be mapped to tiers in an annotation tool.

The framework we use for FA3ME is Esper (EsperTech, 2013), which provides a SQL-like query language. As a central extension of SQL, CEP query languages introduce the concept of event streams and time windows as a basis for aggregation (see Figure 3).
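The paper does not reproduce the actual Esper queries, so the following Python sketch only illustrates the underlying idea of a sliding time window over an event stream; the names are hypothetical:

from collections import deque

class TimeWindow:
    """Keep only the events of the last `span` seconds; such windows are the
    basic aggregation unit of CEP-style queries (cf. Figure 3)."""
    def __init__(self, span: float):
        self.span = span
        self.events = deque()       # (timestamp, payload) pairs, time-ordered

    def push(self, t: float, payload) -> list:
        self.events.append((t, payload))
        while self.events and self.events[0][0] < t - self.span:
            self.events.popleft()   # expire events that fell out of the window
        return list(self.events)

# Example: aggregate the hand positions of the last 0.5 s whenever a new
# mocap sample arrives, e.g. to emit a second-level "movement" event.
window = TimeWindow(span=0.5)
for t, pos in [(0.00, (0.10, 0.20, 0.90)), (0.02, (0.10, 0.21, 0.90))]:
    recent = window.push(t, pos)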

The CEP approach of FA3ME allows us to create second- and third-level multi-modal events on-the-fly. We can thus provide near real-time annotations of sensor events. However, we have to consider the latencies introduced by sensors or computations and back-date events accordingly.

As a practical result, once we have specified our annotation descriptions formally in the language of CEP, these descriptions can be used to create classifiers that operate both on pre-recorded multi-modal corpora and on real-time data. This makes CEP interesting for projects where research in Linguistics and Human-Computer Interaction meet.

4 From MoCap to Linguistic Models

In this section, we will now address the problem of annotating the circular trajectory. In order to get the preferred semantics we cannot yet rely exclusively on the automatic annotation. We need the qualitative predicate "phase" to identify the meaningful part of the gesture (the stroke). Additionally, the qualitative predicate "representation technique" is required to select the relevant mocap trackers. For instance, the representation technique "drawing" selects the marker of the tip of the index finger. We thus require a hybrid model of manual and automatic annotations. In the following, we will focus on the automatic annotation.

First of all, when using mocap to record data, a frame of reference has to be specified as a basis for all coordinate systems.

Figure 4: The coordinate system of the speaker (left). The orientations of the palms are classified into eight main directions (right).

We chose a person-centered frame of reference anchored in the solar plexus (see Figure 4). The coronal plane is defined by the solar plexus and the two shoulders. The transverse plane is also defined by the solar plexus, perpendicular to the coronal plane with a normal-vector from the solar plexus to the point ST (see Figure 4) between the two shoulders.
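As a minimal sketch (assuming numpy and three tracked marker positions; the marker names are hypothetical and ST is taken as the midpoint between the shoulders), the two plane normals described above can be derived as follows:

import numpy as np

def body_frame(solar_plexus, shoulder_left, shoulder_right):
    """Return the origin and the unit normals of the coronal and
    transverse planes of the speaker."""
    sp = np.asarray(solar_plexus, dtype=float)
    sl = np.asarray(shoulder_left, dtype=float)
    sr = np.asarray(shoulder_right, dtype=float)

    st = 0.5 * (sl + sr)                          # point ST between the shoulders
    coronal_n = np.cross(sl - sp, sr - sp)        # plane through sp, sl and sr
    coronal_n /= np.linalg.norm(coronal_n)
    transverse_n = (st - sp) / np.linalg.norm(st - sp)   # normal from sp to ST
    return sp, coronal_n, transverse_n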

4.1 Basic Automatic Gesture Annotations

The analysis of mocap data allows us to create basic annotations that we use in our corpora on-the-fly. This speeds up the annotation process and lets human raters focus on more complex aspects. One basic annotation that can be achieved automatically is the classification of the position of gesturing hands according to the gesture space model of McNeill (1992). As his annotation schema (see Figure 5, right) is tailored for the annotation of video frames, we extended this model to support mocap as well (see Figure 5, left).



Figure 5: Our extended gesture space categorisation (upper left) is based on the work of McNeill (lower right).

The important point is that the areas of our schema are derived from certain markers attached to the observed participant. The upper right corner of the area C-UR (Center-Upper Right), for example, is linked to the marker for the right shoulder. Our schema thus scales directly with the size of the participant. Besides this, the sub-millimeter resolution of the mocap system also allows us to have a more detailed structure of the center area. The schema is also oriented according to the current coronal plane of the participant and not, e.g., according to the perspective of the camera.

A second example is the classification of the orientation of the hand, which is classified according to the scheme depicted in Figure 4, right. This classification is made relative to the transversal plane of the speaker's body.
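One hypothetical way to obtain such a discrete classification (not necessarily the FA3ME classifier) is to quantise the palm normal into eight 45° sectors spanned by the body axes derived from the transversal and coronal planes:

import numpy as np

def classify_palm(palm_normal, transverse_n, coronal_n):
    """Quantise the palm normal into one of eight 45-degree sectors in the
    plane spanned by the body's up axis and front axis; the sector labels
    stand in for the eight main directions of Figure 4 (right)."""
    up = transverse_n / np.linalg.norm(transverse_n)
    front = coronal_n / np.linalg.norm(coronal_n)
    angle = np.arctan2(np.dot(palm_normal, front), np.dot(palm_normal, up))
    sector = int(np.round(angle / (np.pi / 4))) % 8
    return f"direction_{sector * 45}_deg"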

4.2 Example: The Circular Trajectory

For the detection and classification of gestures drawing shapes two types of multi-modal events are required. First, multi-modal events generated by the mocap system for the hand. These events contain matrices describing the position and orientation of the back of the hand. Second, multi-modal events that mark the gesture stroke (one event for the start and one for the end) have to be generated, either by hand or automatically. At the moment, we rely on our manual annotations for the existing SaGA corpus.

We realise the annotation of circular trajectories in two steps. First, we reduce the trajectory provided by the mocap system to two dimensions. Second, we determine how closely the 2D trajectory approximates a circle.

Projection of the 3D Trajectory

The classifier for circles collects all events for the hand that happened between the two events for the stroke. As noted above, these events represent the position and orientation of the hand in 3D-space. There are several alternatives to reduce these three dimensions to two for classifying a circle (a 3D object matching a 2D circle would be a sphere, a circular trajectory through all 3 dimensions a spiral). The principal approach is to reduce the dimensions by projecting the events on a 2D plane.

∃x y (TRAJECTORY(x) ∧ PROJECTION-OF(x, y) ∧ TRAJECTORY2D(y))  (7)

Which plane to choose depends on the choice made for the annotation (e.g., global for the corpus) and thus on the context. For the description of gestures in dialogue there are several plausible alternatives. First, the movements could be projected onto one of the three body planes (sagittal plane, coronal plane, transversal plane). In our context, the transversal plane is suitable, as we are dealing with descriptions of routes, which in our corpus are made either with respect to the body of the speaker or with respect to the plane of an imaginary map, both of which extend parallel to the floor. Figure 6 (upper left) shows the circular movement in the transversal plane. A different perspective is presented in Figure 6 (right). There the perspective of a bystander is chosen. This kind of perspective can be useful for describing what the recipient of a dialogue act perceives, e.g., to explain misunderstandings. For this purpose, the gesture could also be annotated twice, once from the speaker's and once from the recipient's perspective.
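A minimal sketch of this projection step (our illustration of (7), assuming numpy and a plane given by an origin and a normal, e.g. the transversal plane derived above; names are hypothetical):

import numpy as np

def project_to_plane(points_3d, origin, normal):
    """Map 3D trajectory points to 2D coordinates in an orthonormal
    basis (u, v) of the plane through `origin` with normal `normal`."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    helper = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(helper, n)) > 0.9:              # avoid a degenerate basis
        helper = np.array([0.0, 1.0, 0.0])
    u = np.cross(n, helper); u /= np.linalg.norm(u)
    v = np.cross(n, u)
    p = np.asarray(points_3d, dtype=float) - np.asarray(origin, dtype=float)
    return np.stack([p @ u, p @ v], axis=1)       # the TRAJECTORY2D of (7)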

At this point we want to emphasise that position and orientation of the planes do not have to be static. They can be linked to the reference points provided by the mocap system. Thus when the speaker turns her body, the sagittal, coronal and transversal planes will move accordingly and the gestures are always interpreted according to the current orientation.



Figure 6: The circle-like gesture from our example can be visualised based on the mocap data. The right side shows the visualisation from the perspective of an interlocutor, the visualisation in the upper left corner is a projection of the movement on the transversal plane of the speaker.


The plane used for projection can also be derived from the gesture itself. Using principal component analysis, the two main axes used by the gesture can be identified. These axes can then have arbitrary orientations. This could be a useful approach whenever 3D objects are described and the correct position and orientation of the ideal circle has to be derived from the gesture.

Circle Detection

Once the gesture trajectory has been projected onto a 2D plane, the resulting coordinates are classified. For this, several sketch-recognition algorithms have been proposed (e.g., (Alvarado and Davis, 2004; Rubine, 1991)). These algorithms have been designed for sketch-based interfaces (such as tablets or digitisers), either for recognising commands or for prettifying hand-drawn diagrams. However, once the 3D trajectory has been mapped to 2D, they can also be applied to natural gestures. The individual sketch-recognition algorithms differ in the way they approach the classification problem. Many algorithms follow a feature-based approach in which the primitives to be recognised are described by a set of features (such as aspect ratio or ratio of covered area) (Rubine, 1991). This approach is especially suited when new primitives are to be learned by example. An alternative approach is the model-based approach in which the primitives to be recognised are described based on geometric models (Alvarado and Davis, 2004; Hammond and Davis, 2006). Some hybrid approaches also exist (Paulson et al., 2008). The model-based approaches are in line with our declarative approach to modelling, and are thus our preferred way for classifying shapes.

In our case, the projected 2D trajectory of the gesture is thus classified by a model-based sketch-recognition algorithm, which classifies the input into one of several shape classes (circle, rectangle, ...) with a corresponding member function ISSHAPE(y, CIRCLE) ∈ [0 . . . 1]. By this, we can satisfy a subformula APPROXIMATES(y, z) ∧ CIRCLE(z) by pre-setting a certain threshold. The threshold has to be chosen by the annotators, e.g., by rating positive and negative examples, as it may vary between participants and express the sloppiness of their gestures.
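The paper leaves the concrete model-based recogniser open; purely as an illustration of how a graded ISSHAPE(y, CIRCLE)-style value and the rater-chosen threshold could interact, the following sketch fits a circle to the projected 2D points by least squares and turns the fit residual into a membership score:

import numpy as np

def fit_circle(points_2d):
    """Algebraic least-squares circle fit: solve x^2 + y^2 + a*x + b*y + c = 0."""
    pts = np.asarray(points_2d, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    A = np.column_stack([x, y, np.ones_like(x)])
    rhs = -(x ** 2 + y ** 2)
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    cx, cy = -a / 2.0, -b / 2.0
    r = np.sqrt(max(cx ** 2 + cy ** 2 - c, 1e-12))
    return (cx, cy), r

def circularity(points_2d):
    """Membership value in [0, 1]; 1.0 means the points lie exactly on a circle."""
    (cx, cy), r = fit_circle(points_2d)
    pts = np.asarray(points_2d, dtype=float)
    dist = np.hypot(pts[:, 0] - cx, pts[:, 1] - cy)
    return float(np.clip(1.0 - np.mean(np.abs(dist - r)) / r, 0.0, 1.0))

# APPROXIMATES(y, z) ∧ CIRCLE(z) would then be taken as satisfied
# iff circularity(trajectory_2d) >= the threshold chosen by the raters.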

4.3 From MoCap to a Revision of Semantics

The result of the FA3ME reconstruction of our gesture example can be expressed as follows:

∃x y z (TRAJECTORY(x) ∧ PROJECTION-OF(x, y) ∧ TRAJECTORY2D(y) ∧ APPROXIMATES(y, z) ∧ CIRCLE(z))  (8)

So we have: There is a projection of TRAJECTORY x, TRAJECTORY2D y, which is approximating a circle. We can now provide a description of the domain which can satisfy formula (8). Consequently, formula (8) is enhanced by definition (9).

CIRCULAR TRAJECTORY(x) =DEF ∃y z (TRAJECTORY2(x) ∧ PROJECTION-OF(x, y) ∧ APPROXIMATES(y, z) ∧ CIRCLE(z))  (9)

This definition reads as "a CIRCULAR TRAJECTORY x is a TRAJECTORY2 which has a PROJECTION y that approximates some circle z".

The formula (9) substitutes the TRAJECTORY1 notion. The improved multi-modal meaning is (10):

∃e y 3/4x ∃F (WALK-AROUND(e) ∧ AGENT(e, FO) ∧ THEME(e, x) ∧ F(x, y) ∧ CIRCULAR TRAJECTORY(y))  (10)

Interfacing the new gesture representation with the speech representation captures our intuition that the gesture reduces the original set of models to a set including a circular-shaped trajectory.



Figure 7: Specification of gesture semantics due to results of classification in FA3ME. Simulation data feed into the gesture semantics which interfaces with the speech semantics. (The diagram relates speech semantics, gesture semantics, the linguistic model, the classification of real world events in FA3ME, and the preferred models.)

The division of labour between linguistic semantics and FA3ME technology regarding the semantic reconstruction is represented in Figure 7.

By way of explanation: We have the multi-modal semantics integrating speech semantics and gesture semantics accomplished via λ-calculus techniques as shown in Section 2. As also explained there, it alone would be too weak to delimit the models preferred with respect to the gesture indicating roundness. Therefore FA3ME technology leading to a definition of CIRCULAR TRAJECTORY is used which reduces the set of models to the preferred ones, assuming a threshold n for the gesture's closeness of fit to a circle. Thus, the relation between some gesture parameters and qualitative relations like circular can be considered as a mapping, producing values in the range [0 . . . 1]. Still, it could happen that formula (8) cannot be satisfied in the preferred models. As a consequence, the multi-modal meaning would then fall short of satisfaction.

5 Conclusion

During our work on the interface between speech and gesture meaning our previous annotations turned out to be insufficient to support the semantics of concepts such as CIRCULAR TRAJECTORY. This concept is a representative of many others that for human annotators are difficult to rate with the rigidity required for the symbolic level of semantics. Scientific visualisations, such as depicted in Figure 6, can be created to support the human raters. However, there is still the problem of perspective distortions three dimensional gestures are subject to when viewed from different angles and in particular when viewed on a 2D screen. It is also difficult to follow the complete trajectory of such gestures over time. Thus, one and the same gesture can be rated differently depending on the rater, while an algorithm with a defined threshold is not subject to these problems.

The presented hybrid approach based on qualitative human annotations, mocap and our FA3ME framework is able to classify the particular 2D trajectories we are interested in, following a three-step process: After the human annotator has identified the phase and selected the relevant trackers, the dimensions are reduced to two and a rigid model-based sketch-recognition algorithm is used to classify the trajectories. This classification is repeatable, consistent and independent of perspective. A first comparison of the manually annotated data and the automatic annotations revealed a high match. All differences between the annotations can be explained by restrictions of the video data which yielded a lower precision in the human annotations specifying the slant of the hand. Thus, the main issues we had with the results of human raters have been addressed; however, a more formal evaluation on a large corpus remains to be done. What also remains is a specification of membership functions for each kind of gesture trajectory of interest (e.g., circular, rectangular, etc.). For this, a formal specification of what we commonly mean by, for instance, CIRCULAR, RECTANGULAR etc. is required.

The automated annotation via mocap improves our original gesture datum to capture the circularity-information conveyed in the gesture. We have a better understanding of the gesture meaning adopted vis-à-vis the datum considered. As it turns out, resorting to pragmatic inference cannot be avoided entirely, but we will exclude a lot of unwarranted readings which the manual-based logical formulae would still allow by using the approximation provided by body tracking methods. Not presented here is the way third-level multi-modal events are generated by re-simulating the data in a 3D world model to generate context events, e.g., to support pragmatics.

Acknowledgments

This work has been funded by the Deutsche Forschungsgemeinschaft (DFG) in the Collaborative Research Center 673, Alignment in Communication. We are grateful to three reviewers whose arguments we took up in this version.



References

[advanced realtime tracking GmbH2013] A.R.T. advanced realtime tracking GmbH. 2013. Homepage. Retrieved May 2013 from http://www.ar-tracking.de.

[Alvarado and Davis2004] Christine Alvarado and Randall Davis. 2004. SketchREAD: a multi-domain sketch recognition engine. In Proceedings of the 17th annual ACM symposium on User interface software and technology, UIST '04, pages 23–32, New York, NY, USA. ACM.

[Arasu et al.2004a] Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Mayur Datar, Keith Ito, Rajeev Motwani, Utkarsh Srivastava, and Jennifer Widom. 2004a. STREAM: The Stanford data stream management system. Technical report, Stanford InfoLab.

[Arasu et al.2004b] Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2004b. CQL: A language for continuous queries over streams and relations. In Database Programming Languages, pages 1–19. Springer.

[Cruz-Neira et al.1992] Carolina Cruz-Neira, Daniel J. Sandin, Thomas A. DeFanti, Robert V. Kenyon, and John C. Hart. 1992. The CAVE: audio visual experience automatic virtual environment. Communications of the ACM, 35(6):64–72.

[EsperTech2013] EsperTech. 2013. Homepage of Esper. Retrieved May 2013 from http://esper.codehaus.org/.

[Gedik et al.2008] Bugra Gedik, Henrique Andrade, Kun-Lung Wu, Philip S. Yu, and Myungcheol Doo. 2008. SPADE: The System S declarative stream processing engine. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1123–1134. ACM.

[Hahn and Rieser2011] Florian Hahn and Hannes Rieser. 2011. Gestures supporting dialogue structure and interaction in the Bielefeld speech and gesture alignment corpus (SaGA). In Proceedings of SEMdial 2011, Los Angelogue, 15th Workshop on the Semantics and Pragmatics of Dialogue, pages 182–183, Los Angeles, California.

[Hammond and Davis2006] Tracy Hammond and Randall Davis. 2006. LADDER: A language to describe drawing, display, and editing in sketch recognition. In ACM SIGGRAPH 2006 Courses, page 27. ACM.

[Hirte et al.2012] Steffen Hirte, Andreas Seifert, Stephan Baumann, Daniel Klan, and Kai-Uwe Sattler. 2012. Data3 – a Kinect interface for OLAP using complex event processing. Data Engineering, International Conference on, 0:1297–1300.

[Kipp2010] Michael Kipp. 2010. Multimedia annotation, querying and analysis in Anvil. Multimedia information extraction, 19.

[Kousidis et al.2012] Spyridon Kousidis, Thies Pfeiffer, Zofia Malisz, Petra Wagner, and David Schlangen. 2012. Evaluating a minimally invasive laboratory architecture for recording multimodal conversational data. In Proceedings of the Interdisciplinary Workshop on Feedback Behaviors in Dialog, INTERSPEECH 2012 Satellite Workshop, pages 39–42.

[Luckham2002] David Luckham. 2002. The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Addison-Wesley Professional.

[Lücking et al.2012] Andy Lücking, Kirsten Bergmann, Florian Hahn, Stefan Kopp, and Hannes Rieser. 2012. Data-based analysis of speech and gesture: the Bielefeld Speech and Gesture Alignment corpus (SaGA) and its applications. Journal on Multimodal User Interfaces, 1–14.

[Lücking et al.2013] Andy Lücking, Thies Pfeiffer, and Hannes Rieser. 2013. Pointing and reference reconsidered. International Journal of Corpus Linguistics. To appear.

[McNeill1992] David McNeill. 1992. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago.

[Microsoft2013] Microsoft. 2013. Homepage of KINECT for Windows. Retrieved May 2013 from http://www.microsoft.com/en-us/kinectforwindows/develop/.

[Nguyen and Kipp2010] Quan Nguyen and Michael Kipp. 2010. Annotation of human gesture using 3D skeleton controls. In Proceedings of the 7th International Conference on Language Resources and Evaluation. ELDA.

[Paulson et al.2008] Brandon Paulson, Pankaj Rajan, Pedro Davalos, Ricardo Gutierrez-Osuna, and Tracy Hammond. 2008. What!?! no Rubine features?: using geometric-based features to produce normalized confidence values for sketch recognition. In HCC Workshop: Sketch Tools for Diagramming, pages 57–63.

[Pfeiffer2011] Thies Pfeiffer. 2011. Understanding Multimodal Deixis with Gaze and Gesture in Conversational Interfaces. Berichte aus der Informatik. Shaker Verlag, Aachen, Germany, December.

[Pfeiffer2013] Thies Pfeiffer. 2013. Documentation with motion capture. In Cornelia Müller, Alan Cienki, Ellen Fricke, Silva H. Ladewig, David McNeill, and Sedinha Teßendorf, editors, Body-Language-Communication: An International Handbook on Multimodality in Human Interaction, Handbooks of Linguistics and Communication Science. Mouton de Gruyter, Berlin, New York. To appear.

[Rieser and Poesio2009] Hannes Rieser and M. Poesio. 2009. Interactive Gesture in Dialogue: a PTT Model. In P. Healey, R. Pieraccini, D. Byron, S. Young, and M. Purver, editors, Proceedings of the SIGDIAL 2009 Conference, pages 87–96.


[Rieser2010] Hannes Rieser. 2010. On factoring out a gesture typology from the Bielefeld Speech-And-Gesture-Alignment corpus (SAGA). In Stefan Kopp and Ipke Wachsmuth, editors, Proceedings of GW 2009: Gesture in Embodied Communication and Human-Computer Interaction, pages 47–60, Berlin/Heidelberg. Springer.

[Rieser2011] Hannes Rieser. 2011. Gestures indicating dialogue structure. In Proceedings of SEMdial 2011, Los Angelogue, 15th Workshop on the Semantics and Pragmatics of Dialogue, pages 9–18, Los Angeles, California.

[Rieser2013] Hannes Rieser. 2013. Speech-gesture interfaces. An overview. In Heike Wiese and Malte Zimmermann, editors, Proceedings of the 35th Annual Conference of the German Linguistic Society (DGfS), March 12–15, 2013, Potsdam, pages 282–283.

[Röpke et al.2013] Insa Röpke, Florian Hahn, and Hannes Rieser. 2013. Interface Constructions for Gestures Accompanying Verb Phrases. In Heike Wiese and Malte Zimmermann, editors, Proceedings of the 35th Annual Conference of the German Linguistic Society (DGfS), March 12–15, 2013, Potsdam, pages 295–296.

[Rubine1991] Dean Rubine. 1991. Specifying gestures by example. In Proceedings of the 18th annual conference on Computer graphics and interactive techniques, SIGGRAPH '91, pages 329–337, New York, NY, USA. ACM.

[StreamBase2013] StreamBase. 2013. Homepage of StreamBase. Retrieved May 2013 from http://www.streambase.com/.

[Vicon Motion Systems2013] Vicon Motion Systems. 2013. Homepage. Retrieved May 2013 from http://www.vicon.com.
