Combined Gesture-Speech Analysis and Synthesis

20
The SIMILAR NoE Summer Worksh op 2005 Combined Gesture-Speech Analysis and Synthesis M. Emre Sargın, Engin Erzin, Yücel Yemez, A. Murat Tekalp {msargin,eerzin,yyemez,mtekalp}@ku.edu.tr Multimedia Vision and Graphics Laboratory, Koc University

description

Combined Gesture-Speech Analysis and Synthesis. M. Emre Sargın, Engin Erzin, Yücel Yemez, A. Murat Tekalp {msargin,eerzin,yyemez,mtekalp}@ku.edu.tr Multimedia Vision and Graphics Laboratory, Koc University. Outline. Project Objective Technical Description - PowerPoint PPT Presentation

Transcript of Combined Gesture-Speech Analysis and Synthesis

Page 1: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Combined Gesture-Speech Analysis and Synthesis

M. Emre Sargın, Engin Erzin, Yücel Yemez, A. Murat Tekalp{msargin,eerzin,yyemez,mtekalp}@ku.edu.tr

Multimedia Vision and Graphics Laboratory, Koc University

Page 2: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Outline Project Objective Technical Description

Preparation of Gesture-Speech Database Detection of Gesture Elements Gesture-Speech Correlation Analysis Synthesis of Gestures Accompanying Speech

Resources Work Plan Team Members

Page 3: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Project Objective The production of speech and gesture is interactive

throughout the entire communication process. Computer-Human Interaction systems should be

interactive such that, for an edutainment application, animated person’s speech should be aided and complemented by it’s gestures.

Two main goals of this project: Analysis and modeling of correlation between speech and

gestures. Synthesis of correlated natural gestures accompanying

speech.

Page 4: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Technical Description Preparation of Gesture-Speech Database Detection of Gesture Elements Gesture-Speech Correlation Analysis Synthesis of Gestures Accompanying Speech

Page 5: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Preparation of Database

Gestures of a specific person will be investigated. The video database related with that specific person

should include the gestures that he/she frequently uses. Locations of head, arm, elbows, etc. should easily be

detectable and traceable.

Page 6: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Detection of Gesture Elements

In this project, we consider arm and head gestures. Main tasks included in detection of gesture elements:

Tracking of head region. Tracking of hand and possibly shoulder and elbow. Extraction of gesture features. Recognition and labeling of gestures.

Page 7: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Head Region Tracking To extract motion information coming from head one

should first extract head region. Exhaustive search of head in each frame is a possible

solution. However this is computationally inefficient. Tracking is efficient by the means of computational

complexity. Motion information calculated for tracking will be used

for head gesture features.

Page 8: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Tracking Methodology Exhaustive search for head region in initial frame

Haar-Based Face Detection Skin Color information

Extraction of motion information from head region Optical flow vectors Fitting global motion parameters optical flow vectors

Warp search window according to motion information. Search for head region in the search window.

Page 9: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Head Tracking Results

Page 10: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Hand Tracking Methodology Hand region will be extracted using skin color

information. Robust State-Space Tracking will be applied.

Observations are position of hand. States are position, speed and acceleration of hand. Kalman Filtering removes unwanted noise from features In Regular Kalman Filter, parameters are fixed. In Robust Kalman Filter parameters are re-adjusted for

each iteration to minimize MSE and overcome the effects of abrupt changes in motion of hand.

Page 11: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Extraction of Gesture Features Head Gesture Features: Global Motion Parameters

calculated within head region will be used. Hand Gesture Features: Hand center of mass position

and calculated velocity will form hand gesture features.

Page 12: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Gesture-Speech Correlation Analysis

Recognized gestures are labeled w.r.t. time. Head Gestures: Down, Up, Left, Right, Left-Right, … Arm Gestures: Abduction, Adduction, Extension, …

Recognized speech patterns are labeled w.r.t. time. Semantic Info: Approval, Refusal phrases, etc. Prosodic Info: Intonational phrases, ToBI transcriptions, etc.

Correlation Analysis via examining Co-occurrence Matrix Input/Output Hidden Markov Models

Page 13: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Co-occurrence Matrix Estimation of joint probability distribution function, f(g,s) For each time sample give a vote to related gesture-

speech label pair. For a specific speech element the most correlated

gesture feature will be: gi=argmax ( f (gx,si) )

Relatively easy to compute. Gives an intuition about what we are examining.

x

Page 14: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Input/Output Hidden Markov Models IOHMM is a graphical model which allows the mapping of

input sequences into output sequences. It is used in three tasks of sequence processing:

Prediction Regression Classification

The model is trained to maximize the conditional distribution of an output sequence {y1,…,yt} given an input sequence {x1,…,xt}.

In our project: Input sequence will be speech labels. Output sequence will be gesture labels.

Page 15: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Synthesis of Gestures Accompanying Speech

Based on the methodology used in correlation analysis given a speech signal: Features will be extracted. Most probable speech label will be designated to speech

patterns. Gesture pattern that is most correlated with speech pattern

will be used to animate a stick model of a person.

Page 16: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Resources Database Preparation and Labeling

VirtualDub Anvil Paraat

Image Processing and Feature Extraction: Matlab Image Processing Toolbox OpenCV Image Processing Library

Gesture-Speech Correlation Analysis HTK HMM Toolbox Torch Machine Learning Library

Page 17: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Work Plan Timeline of the project:

Schedule of the lectures:

Page 18: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Team Members Ferda Ofli

Koc University Image, Video Processing and Feature Extraction

Yelena Yasinnik Massachusetts Institute of Technology Audio-Visual Correlation Analysis

Oya Aran Bogazici University Gesture Based Human-Computer Interaction Systems

Page 19: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

Team Members Alexey Anatolievich Karpov

Saint-Petersburg Institute for Informatics and Automation Speech Based Human-Computer Interaction Systems

Stephen Wilson University College Dublin Audio-Visual Gesture Annotation

Alexander Refsum Jensenius Department of Music, Oslo University Gesture Analysis

Page 20: Combined Gesture-Speech Analysis and Synthesis

The SIMILAR NoE Summer Workshop 2005

References Jie Yao and Jeremy R. Cooperstock, “Arm Gesture Detection in a Classroom Environment,”

Proc. WACV’02 pp. 153-157, 2002. Y. Azoz, L. Devi. R. Sharma, “Tracking Hand Dynamics in Unconstrained Environments,” Proc.

Int. Conference on Automatic Face and Gesture Recognition’98 pp. 274-279, 1998. S. Malassiotis, N. Aifanti, M.G. Strintzis, “A Gesture Recognition System Using 3D Data,” Proc.

Int. Symposium on 3D Data Processing Visualization and Transmission’02 pp. 190-193,2002. J-M. Chung, N. Ohnishi, “Cue Circles: Image Feature for Measuring 3-D Motion of Articulated

Objects Using Sequential Image Pair,” Proc. Int. Conference on Automatic Face and Gesture Recognition’98 pp. 474-479, 1998.

S. Kettebekov, M. Yeasin, R. Sharma, “Prosody based co-analysis for continuous recognition of coverbal gestures,”Proc. ICMI’02 pp.161-166, 2002.

F. Quek, D. McNeill, R. Ansari, X-F. Ma, R. Bryll, S. Duncan, K.E. McCullough “Gesture cues for conversational interaction in monocular video,” Proc. Int. Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems’99 pp. 119-126, 1999.

For detailed information visit: http://htk.eng.cam.ac.uk Rabiner, L.; Juang, B., “An introduction to hidden Markov models” ASSP Magazine, IEEE, Vol.3,

Iss.1, pp. 4- 16, Jan 1986 Jae-Moon Chung; Ohnishi, N., “Cue circles: image feature for measuring 3-D motion of

articulated objects using sequential image pair” Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, Vol., Iss., pp. 474-479, 14-16 Apr 1998

A. Just, O. Bernier, S. Marcel., “Recognition of isolated complex mono- and bi-manual 3D hand gestures” Proc. 6. ICAFGR, 2004