Collection of multimodal data Face – Speech – Body
George Caridakis, ICCS
Ginevra Castellano, DIST
Loic Kessous, TAU
Overview
  Objectives
  Scenario
  Equipment specifications
  Subjects & Procedure
  Visual aspects
  Acoustic aspects
  Future processing
  Please try this at home…
Objectives
  Collection of emotional multimodal data
  Process different modalities
  Holy Grail: “EMOTION RECOGNITION”
Scenario
  Inspired by the GEMEP corpus
  Pseudo-language sentence (“Toko, damato ma gali sa”)
  Standing body posture
  10 subjects
  8 emotions uniformly distributed across the quadrants (2D emotion theory, valence-arousal)
  3 repetitions of an emotion-specific gesture
  3 repetitions of an emotion-independent gesture
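The scenario figures above imply a fixed number of recordings. As a rough sketch, the plan can be enumerated in Python, assuming every subject performs every emotion in both gesture conditions (the slides do not state this explicitly). The emotion names are the eight listed elsewhere in this deck; the function name and dictionary keys are illustrative.

```python
from itertools import product

SUBJECTS = range(1, 11)  # 10 subjects
EMOTIONS = ["despair", "hot anger", "irritation", "sadness",
            "interest", "pleasure", "joy", "pride"]
GESTURE_TYPES = ["emotion-specific", "emotion-independent"]
REPETITIONS = 3  # repetitions per gesture type


def recording_plan():
    """List every (subject, emotion, gesture type, repetition) combination.

    Assumption: the design is fully crossed, i.e. each subject records
    each emotion with both gesture types, three times each.
    """
    return [
        {"subject": s, "emotion": e, "gesture": g, "repetition": r}
        for s, e, g, r in product(
            SUBJECTS, EMOTIONS, GESTURE_TYPES, range(1, REPETITIONS + 1)
        )
    ]


plan = recording_plan()
print(len(plan))  # 10 * 8 * 2 * 3 recordings
```

Under that assumption each subject contributes 48 recordings, which is useful for checking that no take is missing after a session.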
Emotion-specific gestures
  despair: leave me alone
  hot anger: violent descent of hands
  irritation: smooth go away
  sadness: smooth falling hands
  interest: raise hands
  pleasure: open hands
  joy: italianate/explain
  pride: close hands
Equipment specifications
  2 DV cameras
    Full body
    Face
  Wireless microphone (shirt-mounted)
  PC + external sound card
  Uniform dark background
  2 artificial light sources
  Light-coloured, long-sleeved shirt ;)
Subjects & Procedure
  Subjects
    10 “actors”: 6 males, 4 females
    Emotions: despair, hot anger, irritation, sadness, interest, pleasure, joy, pride
  Procedure
    Subject instructions
    Clap before every execution, to synchronize the streams
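The clap gives every stream a sharp, shared event to align on. A minimal sketch of that idea follows, assuming each stream carries an audio track that has already been decoded into a list of integer samples; the slides do not say how the alignment was actually done. The function names and the simple amplitude threshold are illustrative choices.

```python
def first_onset(samples, threshold):
    """Return the index of the first sample whose absolute amplitude
    reaches the threshold, or None if no sample does."""
    for index, value in enumerate(samples):
        if abs(value) >= threshold:
            return index
    return None


def clap_offset_seconds(track_a, rate_a, track_b, rate_b, threshold):
    """Time of the clap in track A minus its time in track B, in seconds.

    Assumption: the clap is the first loud event in both tracks.
    """
    onset_a = first_onset(track_a, threshold)
    onset_b = first_onset(track_b, threshold)
    if onset_a is None or onset_b is None:
        raise ValueError("no clap found above the threshold")
    return onset_a / rate_a - onset_b / rate_b


# Synthetic example: silence followed by a single loud sample.
microphone = [0] * 22050 + [20000] + [0] * 100  # 44.1 kHz track
camera = [0] * 72000 + [15000] + [0] * 100      # 48 kHz track
offset = clap_offset_seconds(microphone, 44100, camera, 48000, threshold=10000)
print(offset)  # negative: the clap occurs earlier in the microphone track
```

Shifting one stream by the returned offset lines up the two claps. Real recordings would need a more robust onset detector than a fixed threshold.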
Video quality issues
  Highest possible resolution
  Progressive video (not interlaced)
  Correct exposure
  Good color quality
  No compression artifacts
  Uniform lighting
Interlacing / Over-exposure
  Interlacing / De-interlacing
  Over-exposure
    70% zebra pattern
    Prefer lower exposure, so the signal will not be clipped
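The same check can be approximated offline on captured frames. The sketch below assumes a frame is available as rows of 8-bit luma values (0-255). It treats 70% of full scale as the zebra level, which only approximates a camera's 70% zebra setting. The function name and the report keys are illustrative.

```python
def exposure_report(luma_rows, zebra_level=0.70, max_value=255):
    """Report the fraction of pixels at or above the zebra level and the
    fraction that are clipped at the maximum value.

    Assumption: luma_rows is a list of rows of integer luma values in
    the range 0..max_value.
    """
    pixels = [p for row in luma_rows for p in row]
    if not pixels:
        raise ValueError("empty frame")
    threshold = zebra_level * max_value
    above = sum(1 for p in pixels if p >= threshold)
    clipped = sum(1 for p in pixels if p >= max_value)
    return {
        "zebra_fraction": above / len(pixels),
        "clipped_fraction": clipped / len(pixels),
    }


# Tiny synthetic frame: two rows of four pixels.
frame = [[10, 100, 180, 255],
         [50, 200, 255, 30]]
print(exposure_report(frame))
```

A non-zero clipped fraction on the face or hands suggests lowering the exposure, since clipped highlights cannot be recovered later.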
Colour/Lighting
  Medium: Y/C resolution, compression artifacts, exposure
  Good: video quality, source: DV
Archiving
  PAL: 720x576 @ 25 frames/second
  DV format: ~36 Mbit/sec, ~16 GBytes/hour
  MPEG-2 @ 4-8 Mbit/sec (DVD quality): ~1.8-3.5 GBytes/hour
  MPEG-1 @ 1.1 Mbit/sec: ~500 MBytes/hour
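The storage figures above follow directly from the bitrates. A small sketch of the arithmetic, assuming decimal units (1 Mbit = 10^6 bits, 1 GByte = 10^9 bytes); the function name is illustrative.

```python
def gbytes_per_hour(mbit_per_sec):
    """Convert a constant bitrate in Mbit/sec to storage in GBytes/hour.

    Assumption: decimal units, i.e. 1 Mbit = 1e6 bits, 1 GByte = 1e9 bytes.
    """
    bits_per_hour = mbit_per_sec * 1_000_000 * 3600
    return bits_per_hour / 8 / 1_000_000_000


for label, rate in [("DV", 36), ("MPEG-2 low", 4), ("MPEG-2 high", 8), ("MPEG-1", 1.1)]:
    print(f"{label}: {gbytes_per_hour(rate):.2f} GBytes/hour")
```

With these units DV comes to about 16.2 GBytes/hour, MPEG-2 to between 1.8 and 3.6, and MPEG-1 to just under 0.5, in line with the rounded values on the slide.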
Visual Aspects Summary
  Video camera
    DV or better
    Progressive scan capability
    Over-exposure indication, zebra patterns
  Shooting
    Use the zebra patterns at 70%
    Zoom in as much as possible to increase the subject’s resolution
    Facial features must be visible for facial analysis
    Try to avoid occlusions (hair, glasses, clothes, hand movement)
    Uniform lighting conditions
  Archive
    DV tapes, DV video or frames (not MPEG-1)
Acoustic aspects
  Why “Toko, damato ma gali sa”?
    Toko: solicitation by naming the interlocutor
    Vowels found in the majority of languages
    Meaning: “Toko, can you open it?” (a request), to maintain the semantic aspect
  Sampling frequency: 44.1 kHz
  16-bit information depth, mono
  Uncompressed .wav files
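For reference, the audio format above (44.1 kHz, 16-bit, mono, uncompressed) can be produced with Python's standard `wave` module. This is only a sketch of the file format, not the recording setup used in the collection. The test tone stands in for captured speech, and the helper names are illustrative.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44100  # Hz
SAMPLE_WIDTH = 2     # bytes per sample, i.e. 16 bits
CHANNELS = 1         # mono


def tone(num_frames, frequency=440.0, amplitude=10000):
    """Generate a sine tone as 16-bit integer samples (placeholder signal)."""
    return [
        int(amplitude * math.sin(2 * math.pi * frequency * n / SAMPLE_RATE))
        for n in range(num_frames)
    ]


def write_wav(target, samples):
    """Write samples as uncompressed 16-bit mono PCM at 44.1 kHz.

    `target` may be a file name or a writable, seekable binary file object.
    """
    with wave.open(target, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))


buffer = io.BytesIO()
write_wav(buffer, tone(4410))  # 0.1 seconds of audio
print(SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS, "bytes per second of audio")
```

At this format the data rate is 88,200 bytes per second, so an hour of speech needs roughly 318 MBytes, which is small next to the DV video.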
Future processing
  Process different modalities
    Facial feature extraction
    Gesture expressiveness analysis
    Acoustic analysis
    Gesture recognition
  Synchronization
  Modalities fusion
    RNN
    RSOM + Markov
    SVM
    …
  Emotion recognition