Collection of multimodal data Face – Speech – Body

George Caridakis, ICCS
Ginevra Castellano, DIST
Loic Kessous, TAU

Overview
  Objectives
  Scenario
  Equipment specifications
  Subjects & Procedure
  Visual aspects
  Acoustic aspects
  Future processing
  Please try this at home…

Objectives
  Collection of emotional multimodal data
  Process different modalities
  Holy Grail: “EMOTION RECOGNITION”

Scenario
  Inspired by the GEMEP corpus
  Pseudo-language sentence (“Toko, damato ma gali sa”)
  Standing body posture
  10 subjects
  8 emotions uniformly distributed across the quadrants (2D emotion theory, valence-arousal)
  3 repetitions of an emotion-specific gesture
  3 repetitions of an emotion-independent gesture
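The recording plan implied by this scenario can be enumerated as a quick sanity check on the size of the corpus. This is a minimal sketch, assuming every subject performs every emotion with both gesture types, three repetitions each; the slides do not state this explicitly, and the names `build_session_plan` and the dictionary keys are hypothetical.

```python
from itertools import product

SUBJECTS = range(1, 11)  # 10 subjects, numbered 1..10 (numbering is an assumption)
EMOTIONS = ["despair", "hot anger", "irritation", "sadness",
            "interest", "pleasure", "joy", "pride"]
GESTURE_TYPES = ["emotion-specific", "emotion-independent"]
REPETITIONS = 3

def build_session_plan():
    """List every (subject, emotion, gesture type, repetition) take."""
    return [
        {"subject": s, "emotion": e, "gesture": g, "repetition": r}
        for s, e, g, r in product(SUBJECTS, EMOTIONS, GESTURE_TYPES,
                                  range(1, REPETITIONS + 1))
    ]

plan = build_session_plan()
print(len(plan))  # 10 subjects * 8 emotions * 2 gesture types * 3 repetitions = 480
```

Under these assumptions each subject contributes 48 takes, which gives a rough idea of session length per actor.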

Emotion-specific gestures
  despair      leave me alone
  hot anger    violent descent of hands
  irritation   smooth go away
  sadness      smooth falling hands
  interest     raise hands
  pleasure     open hands
  joy          italianate/explain
  pride        close hands

Equipment specifications
  2 DV cameras
    Full body
    Face
  Wireless microphone (shirt-mounted)
  PC + external sound card
  Uniform dark background
  2 artificial light sources
  Light-coloured, long-sleeved shirt ;)

Subjects & Procedure
  Subjects
    10 “actors”: 6 males, 4 females
    Emotions: despair, hot anger, irritation, sadness, interest, pleasure, joy, pride
  Procedure
    Subject instructions
    Clap before every execution to synchronize the streams
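The clap gives each stream a sharp, loud transient that can be located and used to align the recordings. The sketch below shows one simple way this might be done, a plain amplitude threshold; it is not the procedure used by the authors. The function names, the threshold ratio, and the sample rates in the example are all assumptions for illustration.

```python
def clap_time(samples, sample_rate, threshold_ratio=0.5):
    """Time in seconds of the first sample whose magnitude reaches
    threshold_ratio (between 0 and 1) times the peak magnitude."""
    peak = max(abs(x) for x in samples)
    if peak == 0:
        raise ValueError("signal is silent; no clap found")
    threshold = threshold_ratio * peak
    # The peak sample itself always satisfies the condition, so next() succeeds.
    index = next(i for i, x in enumerate(samples) if abs(x) >= threshold)
    return index / sample_rate

def stream_offset(samples_a, rate_a, samples_b, rate_b):
    """Seconds by which the clap occurs later in stream A than in stream B."""
    return clap_time(samples_a, rate_a) - clap_time(samples_b, rate_b)

# Synthetic example: silence, a single loud sample standing in for the clap, silence.
microphone = [0] * 22050 + [20000] + [0] * 100    # 44.1 kHz, clap at 0.5 s
camera_audio = [0] * 12000 + [15000] + [0] * 100  # 48 kHz (assumed), clap at 0.25 s
offset = stream_offset(microphone, 44100, camera_audio, 48000)
print(offset)  # 0.25
```

Shifting stream A earlier by `offset` seconds would line up the two claps. Real recordings contain noise and speech, so a more robust onset detector or cross-correlation would likely be needed in practice.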

Video quality issues
  Highest possible resolution
  Progressive video (not interlaced)
  Correct exposure
  Good color quality
  No compression artifacts
  Uniform lighting

Interlacing / Over-exposure
  Interlacing / De-interlacing
  Over-exposure
    70% zebra pattern
    Prefer lower exposure so the signal will not be clipped

Colour/Lighting
  Medium Y/C resolution
  Compression artifacts
  Exposure
  Good video quality
  Source: DV

Archiving
  PAL: 720x576 @ 25 frames/second
  DV format: ~36 Mbit/sec, ~16 GBytes/hour
  MPEG-2 @ 4-8 Mbit/sec (DVD quality): ~1.8-3.5 GBytes/hour
  MPEG-1 @ 1.1 Mbit/sec: ~500 MBytes/hour
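The storage figures follow directly from the bitrates: bits per second times 3600 seconds, divided by 8 bits per byte. A small sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes); the function name `gigabytes_per_hour` is only illustrative.

```python
def gigabytes_per_hour(mbit_per_second):
    """Convert a bitrate in Mbit/s to storage in GB per hour (decimal units)."""
    bits_per_hour = mbit_per_second * 1_000_000 * 3600
    return bits_per_hour / 8 / 1_000_000_000

for label, rate in [("DV", 36), ("MPEG-2 (low)", 4),
                    ("MPEG-2 (high)", 8), ("MPEG-1", 1.1)]:
    print(f"{label}: {gigabytes_per_hour(rate):.2f} GB/hour")
```

The results land close to the rounded values quoted above; small differences come from rounding and from whether decimal or binary gigabytes are used.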

Visual Aspects Summary
  Video camera
    DV or better
    Progressive scan capability
    Over-exposure indication (zebra patterns)
  Shooting
    Use the zebra patterns at 70%
    Zoom in as much as possible to increase the subject’s resolution
    Facial features must be visible for facial analysis
    Try to avoid occlusions (hair, glasses, clothes, hand movement)
    Uniform lighting conditions
  Archive
    DV tapes, DV video or frames (not MPEG-1)

Acoustic aspects
  Why “Toko, damato ma gali sa”?
    Toko: solicitation by naming the interlocutor
    Vowels found in the majority of languages
    Meaning: “Toko, can you open it?” (request), for maintaining the semantic aspect
  Sampling frequency: 44.1 kHz
  Bit depth: 16 bits, mono
  Uncompressed .wav files
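The audio format above (44.1 kHz, 16 bits, mono, uncompressed .wav) can be produced and checked with Python's standard `wave` module. The following is a minimal sketch that writes a short test tone to an in-memory buffer and reads the header back; the tone, its duration, and the helper names `write_test_tone` and `read_format` are assumptions for illustration, not part of the recording setup.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44_100  # Hz
SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16 bits
CHANNELS = 1          # mono

def write_test_tone(target, seconds=0.1, freq=440.0):
    """Write a sine tone as uncompressed 16-bit mono PCM at 44.1 kHz."""
    n_frames = round(SAMPLE_RATE * seconds)
    frames = b"".join(
        struct.pack("<h", int(0.5 * 32767 *
                              math.sin(2 * math.pi * freq * i / SAMPLE_RATE)))
        for i in range(n_frames)
    )
    with wave.open(target, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)

def read_format(source):
    """Return (channels, bits per sample, sample rate, frame count)."""
    with wave.open(source, "rb") as wav:
        return (wav.getnchannels(), wav.getsampwidth() * 8,
                wav.getframerate(), wav.getnframes())

buffer = io.BytesIO()
write_test_tone(buffer)
buffer.seek(0)
print(read_format(buffer))  # (1, 16, 44100, 4410)
```

The same `read_format` check could be run on recorded files (by passing a file path) to confirm that every take was captured in the intended format before further processing.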

Future processing
  Process different modalities
    Facial feature extraction
    Gesture expressiveness analysis
    Acoustic analysis
  Gesture recognition
  Synchronization
  Modalities fusion
    RNN
    RSOM + Markov
    SVM
    …
  Emotion recognition
