Conversational Pointing Gestures for Virtual Reality Interaction
Implications from an Empirical Study

Thies Pfeiffer ([email protected])
Marc E. Latoschik ([email protected])
Ipke Wachsmuth ([email protected])

AI Group, Faculty of Technology, Bielefeld University, Germany

Conversational Interface Agents facilitate natural interactions in Virtual Reality Environments.

Motivation

Deictic expressions (such as "put that there") are fundamental in human communication to refer to entities in the environment. In situated contexts, deictic expressions often comprise pointing gestures directed at regions or objects. One of the primary tasks in Virtual Reality applications is the manipulation of visually perceivable objects. Thus VR research has focused on developing metaphors that optimize the tradeoff between a swift and a precise selection of objects; prominent examples are ray casting, occlusion, or arm extension. These technologies are well suited for interacting directly with the system. When the interaction with the system is mediated, e.g., by an Embodied Conversational Agent (ECA), the primary focus lies on a smooth understanding of natural communication and natural gestures. It is thus recommended to improve the robustness and accuracy of the interpretation of natural pointing gestures, i.e., gestures made without the facilitation of visual aids or other auxiliaries. To attain these ends, we contribute results from an empirical study on pointing and draw conclusions for the implementation of pointing-based conversational interactions in immersive Virtual Reality.

Aim

How accurate are pointing gestures?
- Improved models for human pointing
- Advances for the interpretation and production of pointing gestures
- Contributing to more robust multimodal conversational interfaces

Applications
- Human-Computer Interaction and Human-Robot Interaction: multimodal interfaces
- Human-Agent Interaction: multimodal conversational interfaces
- Assistive Technology
- Empirical research and Usability Studies: automatic multimodal annotation, automatic grounding of gestures based on a world model

Research Background
- Kranstedt, Lücking, Pfeiffer, Rieser & Staudacher: Measuring and Reconstructing Pointing in Visual Contexts. In Proceedings of the Brandial 2006.
- Kranstedt, Lücking, Pfeiffer, Rieser & Wachsmuth: Deictic Object Reference in Task-oriented Dialogue. In Situated Communication. Mouton de Gruyter, Berlin, 2006.
- Weiß, Pfeiffer, Eikmeyer & Rickheit: Processing Instructions. In Situated Communication. Mouton de Gruyter, Berlin, 2006.
- Kranstedt, Lücking, Pfeiffer, Rieser & Wachsmuth: Deixis: How to Determine Demonstrated Objects Using a Pointing Cone. In 6th International Gesture Workshop. Springer-Verlag GmbH, Berlin Heidelberg, 2006.
- Pfeiffer & Latoschik: Resolving Object References in Multimodal Dialogues for Immersive Virtual Environments. In Proceedings of the IEEE Virtual Reality 2004.
- Pfeiffer, Voss & Latoschik: Resolution of Multimodal Object References Using Conceptual Short Term Memory. In Proceedings of the EuroCogSci03.

Study on object pointing
- Interaction of two participants
- Two conditions: speech + gesture and gesture only
- Real objects
- Study with 62 participants
- Cooperative effort with linguists

Technology
- Audio + video recordings
- Motion capturing using an ART GmbH optical tracking system
- Automatic adaptation of a model of the user's posture
- Special hand-made soft gloves

Interaction Game
- The Description Giver is presented with the object to demonstrate
- The Description Giver utters a deictic expression (speech + gesture, or gesture only)
- The Object Identifier tries to identify the object
- The Description Giver gives feedback (yes/no)
- Proceed with the next object
- No corrections or repairs!

Method

How to determine the parameters of the pointing extension model?
- What is the opening angle?
- What defines the anchor of the model? Do we aim with the index finger (IFP) or by gazing over the index finger (GFP)? (Both constructions are sketched below.)

[Setup figure: spotlight, camera, tracking system, and displays M1 (task) and M2 + M3 (system time).]
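
To make the two candidate anchors concrete, the sketch below constructs both pointing rays from tracked 3D positions. This is a minimal illustration, not the system used in the study; the function names, the plain NumPy vectors, and the example coordinates are assumptions.

```python
import numpy as np

def normalize(v):
    """Return the unit vector pointing in the direction of v."""
    return v / np.linalg.norm(v)

def ifp_ray(finger_base, finger_tip):
    """Index-Finger-Pointing (IFP): ray along the index finger,
    anchored at the fingertip and aimed along the finger."""
    return finger_tip, normalize(finger_tip - finger_base)

def gfp_ray(eye, finger_tip):
    """Gaze-Finger-Pointing (GFP): ray from the eye through the
    fingertip, i.e. 'aiming over' the index finger."""
    return finger_tip, normalize(finger_tip - eye)

# Hypothetical tracked positions (metres), for illustration only.
eye         = np.array([0.00, 1.65, 0.00])
finger_base = np.array([0.25, 1.20, 0.35])
finger_tip  = np.array([0.30, 1.15, 0.45])

origin_ifp, dir_ifp = ifp_ray(finger_base, finger_tip)
origin_gfp, dir_gfp = gfp_ray(eye, finger_tip)
```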

Exploring the data

The study combines multimodal data comprising audio, video, motion capture and annotation data. A coherent, synchronized view of all data sources is provided by the Interactive Augmented Data Explorer (IADE), developed at Bielefeld University. In an IADE session the scientist interactively explores the recorded data for qualitative analysis: a model of the table with the objects and a stick figure driven by the motion capture data are shown, the video taken from one camera perspective is displayed on a floating panel together with the audio recordings, and information from specific annotation tiers is presented as floating text. The scientist can, e.g., take the perspective of the Description Giver or the Object Identifier. All elements are interactive; e.g., the video panels and annotations can be resized or repositioned to allow for a comfortable investigation.

[Figure: a visualization of the intersections of the pointing ray (dots) for four different objects over all participants. The data is grouped via bagplots; the asterisk marks the mean, the darker area clusters 50 percent, the brighter area 75 percent of the demonstrations. In the depicted setting the person pointing was standing to the left, the person identifying the objects to the right.]

Observations
- Pointing is fuzzy (as expected), even at short range distances
- Fuzziness increases with distance
- Overshooting at the edge of the domain (intentional)
- Still, the human Object Identifier shows a good performance of 83.9% correct identifications

Modelling approach
- The ellipse shapes of the bagplots suggest a cone-based model of the extension of pointing (see the sketch below)
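
A cone-based model can be reduced to a single number per object: the angular deviation between the pointing ray and the direction from the ray's origin to the object. The sketch below computes this deviation; an object lies inside a pointing cone with opening angle α if its deviation is at most α/2. The function and variable names are illustrative assumptions, not the study's implementation.

```python
import numpy as np

def angular_deviation(origin, direction, target):
    """Angle (degrees) between the pointing ray and the line from
    the ray origin to the target position."""
    to_target = target - origin
    cos_angle = np.dot(direction, to_target) / (
        np.linalg.norm(direction) * np.linalg.norm(to_target))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def inside_cone(origin, direction, target, opening_angle_deg):
    """True if the target falls within a pointing cone whose full
    opening angle is opening_angle_deg, centred on the ray."""
    return angular_deviation(origin, direction, target) <= opening_angle_deg / 2.0
```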

Simulations

Several simulation runs have been conducted on the collected data to test different approaches to modelling the pointing extension.

Results

Strict Semantic Model

For a strict semantic model the pointing extension has to single out one and only one object. In the simulation runs we determined the optimal opening angle for such a pointing cone per row and for the overall area. The results are depicted in the table below. For the strict semantic model, GFP offers better performance than IFP while having a narrower opening angle: GFP is more accurate than IFP. One way to score such a run is sketched after the table.

Optimal opening angles per row for a strict semantic pointing cone model of the pointing extension, together with the performance (correctly identified objects as a percentage of all objects within the specified area), for both IFP and GFP. The row titled "all" shows the performance for rows 1-7; row 8 has been excluded because of the overshooting behavior.

row | IFP α (°) | IFP perf. (%) | GFP α (°) | GFP perf. (%)
  1 |        84 |         70.27 |        86 |         68.92
  2 |        80 |         61.84 |        68 |         75.00
  3 |        71 |         71.43 |        69 |         81.82
  4 |        60 |         53.95 |        38 |         65.79
  5 |        36 |         43.84 |        24 |         57.53
  6 |        24 |         31.15 |        25 |         42.62
  7 |        14 |         23.26 |        17 |         23.26
  8 |        10 |          7.14 |        10 |         14.29
all |        71 |         38.54 |        61 |         48.12
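
The following sketch shows one way such a simulation run could be scored under the strict semantic reading: a trial counts as correct only if the cone around the pointing ray contains the intended object and no other. It reuses inside_cone from the sketch above; the data layout (a list of trials, each with a ray origin and direction, the candidate object positions, and the index of the intended object) is an assumption made for illustration.

```python
def strict_semantic_score(trials, opening_angle_deg):
    """Fraction of trials in which the cone singles out exactly the
    intended object (one and only one object inside the cone)."""
    correct = 0
    for origin, direction, objects, target_idx in trials:
        inside = [i for i, obj in enumerate(objects)
                  if inside_cone(origin, direction, obj, opening_angle_deg)]
        if inside == [target_idx]:
            correct += 1
    return correct / len(trials)

def optimal_opening_angle(trials, candidate_angles=range(1, 181)):
    """Sweep candidate opening angles and keep the best-scoring one."""
    return max(candidate_angles,
               key=lambda alpha: strict_semantic_score(trials, alpha))
```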

Pragmatic Model

When the pointing extension is handled on the level of pragmatics, we can allow for inference mechanisms to disambiguate between several objects and hence use heuristics. For the simulation we used a basic heuristic based on the angular distance between the objects and the pointing ray. The table below depicts the results of these simulation runs. This time IFP performs better than GFP: IFP is more precise than GFP. The opening angles in the proximal rows are rather large, while the angles in the more distal rows are much smaller. This motivates us to distinguish between proximal and distal pointing. A corresponding selection heuristic is sketched after the table.

Optimal opening angles per row for a pragmatic pointing cone model of the pointing extension, together with the performance (correctly identified objects as a percentage of all objects within the specified area), for both IFP and GFP. The row titled "all" shows the performance for rows 1-7; row 8 has been excluded because of the overshooting behavior.

row | IFP α (°) | IFP perf. (%) | GFP α (°) | GFP perf. (%)
  1 |       120 |         98.65 |       143 |         98.65
  2 |       109 |        100.00 |       124 |        100.00
  3 |        99 |         94.81 |        94 |         93.51
  4 |       109 |         98.68 |        89 |         93.42
  5 |        72 |         97.26 |        75 |         94.52
  6 |        44 |         91.80 |        50 |         90.16
  7 |        38 |         86.05 |        41 |         67.44
  8 |        31 |         52.38 |        26 |         69.05
all |       120 |         96.04 |       143 |         92.71
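
A minimal sketch of such a pragmatic heuristic, under the same illustrative data layout as above: among the objects falling inside the cone, the one with the smallest angular distance to the pointing ray is taken as the referent. It reuses angular_deviation and inside_cone from the earlier sketch.

```python
def pragmatic_referent(origin, direction, objects, opening_angle_deg):
    """Return the index of the most plausible referent: the object with
    the smallest angular deviation from the pointing ray, restricted to
    objects inside the cone. Returns None if the cone is empty."""
    candidates = [(angular_deviation(origin, direction, obj), i)
                  for i, obj in enumerate(objects)
                  if inside_cone(origin, direction, obj, opening_angle_deg)]
    if not candidates:
        return None
    return min(candidates)[1]
```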

Conclusion

Primary
- Pointing is best interpreted at the level of pragmatics, not semantics
- Index-Finger-Pointing (IFP) is more precise
- Gaze-Finger-Pointing (GFP) is more accurate
- The results stated in the tables above and our qualitative observations using IADE suggest a dichotomy of proximal vs. distal pointing. This fits nicely with the dichotomy common in many languages (here vs. there).

Secondary
- Taking the direction of gaze into account does not always improve performance (contrary to the mainstream opinion); at least in the setting used in our study, with widely spaced objects (20 cm), it can be ignored when going for high overall success
- Humans display a non-linear behavior at the borders of the domain

Future Work
- Confirm the results in a mixed setting, with one human and one embodied conversational agent, over virtual objects

[Figure: a combined model for the extension of proximal and distal pointing, consisting of a proximal cone and a distal cone; the boundary between proximal and distal pointing is defined by the personal distance d.]
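
One way to read that combined model is as a piecewise cone: demonstrations whose target lies within the personal distance d use the wide proximal opening angle, and everything beyond d uses the narrower distal one. The sketch below encodes this reading; the parameter values are placeholders for illustration, not fitted results from the study.

```python
def combined_opening_angle(distance_to_target, personal_distance_d,
                           proximal_angle_deg, distal_angle_deg):
    """Opening angle of the combined model: a wide proximal cone up to
    the personal distance d, a narrower distal cone beyond it."""
    if distance_to_target <= personal_distance_d:
        return proximal_angle_deg
    return distal_angle_deg

# Placeholder parameters, for illustration only (not fitted values).
alpha = combined_opening_angle(distance_to_target=1.8,
                               personal_distance_d=1.2,
                               proximal_angle_deg=110.0,
                               distal_angle_deg=40.0)
```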

Acknowledgements

This work has been funded by the EC in the project PASION (FP6 IST program, reference number 27654), and by the Deutsche Forschungsgemeinschaft (DFG) in the Collaborative Research Center 360 (SFB 360), "Situated Artificial Communicators".

Bibliography
- D. A. Bowman, E. Kruijff, J. J. LaViola Jr., and I. Poupyrev. 3D User Interfaces – Theory and Practice. Addison-Wesley, 2005.
- E. Kaiser, A. Olwal, D. McGee, H. Benko, A. Corradini, X. Li, P. Cohen, and S. Feiner. Mutual Disambiguation of 3D Multimodal Interaction in Augmented and Virtual Reality. In Proceedings of the 5th International Conference on Multimodal Interfaces, pages 12–19. ACM Press, 2003.
- M. E. Latoschik. A Gesture Processing Framework for Multimodal Interaction in Virtual Reality. In Proceedings of the 1st International Conference on Computer Graphics, Virtual Reality and Visualisation in Africa, AFRIGRAPH 2001, pages 95–100. ACM SIGGRAPH, 2001.
- A. Olwal, H. Benko, and S. Feiner. SenseShapes: Using Statistical Geometry for Object Selection in a Multimodal Augmented Reality System. In Proceedings of The Second IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2003), pages 300–301, Tokyo, Japan, October 7–10, 2003.
- I. Wachsmuth, B. Lenzmann, T. Jörding, B. Jung, M. Latoschik, and M. Fröhlich. A Virtual Interface Agent and its Agency. In Proceedings of the First International Conference on Autonomous Agents, pages 516–517, 1997.
- C. A. Wingrave, D. A. Bowman, and N. Ramakrishnan. Towards Preferences in Virtual Environment Interfaces. In EGVE '02: Proceedings of the Workshop on Virtual Environments 2002, pages 63–72, Aire-la-Ville, Switzerland, 2002. Eurographics Association.