RNN Architectures for Emotion Classification in Multimodal Emotion Recognition Systems
Deepan Das, John Ridley — July 19, 2019
Advanced Methods for Emotion Recognition in Highly Automated Driving
Agenda
1. Background
2. Challenges
3. Existing Approaches
4. Demo
5. Proposed Solution
6. Conclusion
Deepan Das - John Ridley
BACKGROUND
● Autonomous driving is just being realised
  ○ But there is a massive transition window
  ○ An awkward ‘monitored autonomy’ mix
● Why monitor driver emotion?
  ○ Safety
  ○ Comfort/luxury
● Continuous affective state monitoring
  ○ Can they safely operate the vehicle?
  ○ Are they likely to be a hazard?
  ○ Can the vehicle influence/improve their mood?
1.0 - Introduction
[Figure: SAE autonomy levels — Level 1: Hands On; Level 2: Hands Off; Level 3: Attention Off; Level 4: Driver Redundancy; Level 5: No Interaction. Annotated ‘Emotion Monitoring Critical’ and ‘Emotion Monitoring Enables Luxury’.]
Russell’s Circumplex Model
BACKGROUND
● Emotions are...
  ○ not inherently bad, but influence behaviour
  ○ classified in various ways
  ○ varyingly relevant to driving behaviour
  ○ temporally dynamic
1.1 - Let’s talk about our feelings
[Figure: axes Arousal (active–inactive) and Valence (negative–positive), with example emotions plotted: Happy, Excited, Delighted, Serene, Relaxed, Calm, Fatigued, Depressed, Sad, Upset, Stressed, Tense.]
Plutchik’s Wheel of Emotions
Deepan Das - John Ridley
BACKGROUND
● Deep Neural Networks
  ○ Fundamentally changed the ML landscape
  ○ Not directly applicable to temporally dependent data
● Recurrent Neural Networks
  ○ Unroll networks temporally
  ○ Cells propagate temporal context/state
  ○ Handle time-dependent, variable-length sequences
  ○ Can also input/output single values
1.2 - Recurrent Neural Networks
[Figure: a regular ANN (layers mapping input to output) vs. an RNN unrolled over time steps t, t+1, t+2, where each cell passes state forward; a bidirectional RNN also passes state backward.]
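The unrolled recurrence can be written in a few lines. Below is a minimal NumPy sketch of a vanilla RNN cell stepped over a sequence; the dimensions and random weights are purely illustrative, not any particular model from the talk.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One unrolled time step of a vanilla RNN cell."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Hypothetical dimensions: 4-dim input features, 8-dim hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 8)) * 0.1
W_hh = rng.normal(size=(8, 8)) * 0.1
b_h = np.zeros(8)

sequence = rng.normal(size=(5, 4))   # 5 time steps of input features
h = np.zeros(8)                      # initial hidden state
for x_t in sequence:                 # unroll over time
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)  # (8,)
```

The same cell weights are reused at every step; only the hidden state carries temporal context forward, which is what lets the network consume variable-length sequences.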
BACKGROUND
1.3 - RNN Zoo
[Figure: RNN cell types — Vanilla (single tanh; suffers from vanishing gradients), LSTM (three σ gates plus tanh), GRU (two σ gates, a tanh candidate, and a 1 − z interpolation). Gated cells add gradient pathways for longer-term memory; cells can be stacked, and bidirectional variants reverse the gradient flow.]
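To make the gating concrete, here is a minimal NumPy sketch of one GRU step following the standard formulation (update gate, reset gate, tanh candidate, and the 1 − z interpolation); the sizes and weights are made-up placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, P):
    """One GRU step: two sigmoid gates and a tanh candidate state."""
    z = sigmoid(x @ P["Wxz"] + h @ P["Whz"])               # update gate
    r = sigmoid(x @ P["Wxr"] + h @ P["Whr"])               # reset gate
    h_tilde = np.tanh(x @ P["Wxh"] + (r * h) @ P["Whh"])   # candidate state
    return (1.0 - z) * h + z * h_tilde                     # the "1 -" branch

rng = np.random.default_rng(1)
d_in, d_h = 4, 6  # hypothetical input/hidden sizes
P = {k: rng.normal(size=s) * 0.1 for k, s in {
    "Wxz": (d_in, d_h), "Whz": (d_h, d_h),
    "Wxr": (d_in, d_h), "Whr": (d_h, d_h),
    "Wxh": (d_in, d_h), "Whh": (d_h, d_h)}.items()}

h = np.zeros(d_h)
for x in rng.normal(size=(10, d_in)):
    h = gru_step(x, h, P)
print(h.shape)  # (6,)
```

The additive interpolation (1 − z) · h + z · h̃ is the gradient pathway the diagram alludes to: when z is near 0, the old state passes through almost unchanged, which mitigates the vanishing-gradient problem of the vanilla cell.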
BACKGROUND
● Combine for ‘best of both worlds’
  ○ Certain modes can better indicate certain states [1]
  ○ More difficult than it seems
1.4 - Multimodal Emotion Classification
● Emotions observed in many ways - common sensor modalities:
  ○ Video: images, faces
  ○ Audio: signal, dialogue
  ○ Physiological: EEG, ECG, EOG, EDA
CHALLENGES
● Which modes to utilise, given a driving context?
  ○ Most physiological modes cannot be used - they require intrusive sensors
  ○ We focus on audio/visual modes (and variants) - the scope of most existing research
● How do we utilise RNNs?
  ○ RNNs work with sequences of aligned features
  ○ Where do we use the RNNs (before/after mode fusion)?
● Where/how are the modes and RNN(s) combined in the pipeline?
2.0 - Scope of Challenge
[Figure: pipeline sketch — video features and audio features feed, via some yet-undecided arrangement of RNN(s), into classification.]
CHALLENGES
● Mode Variability
  ○ Different sampling rates and numerical dimensions
  ○ Reliability/robustness of sensor or preprocessing methods
● Mode Applicability
  ○ Certain modes can better indicate certain states [1]
  ○ There are positive and negative conditions for both mode types
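One concrete instance of mode variability is the sampling-rate mismatch: audio features typically arrive far more often than video frames. A simple remedy is to interpolate the faster stream onto the slower stream's timestamps. The rates below (100 Hz audio, 25 Hz video) are hypothetical examples, not values from any reviewed paper.

```python
import numpy as np

# Hypothetical rates: audio features at 100 Hz, video features at 25 Hz.
t_audio = np.arange(0, 2.0, 1 / 100)   # 2 s of audio timestamps (200 samples)
t_video = np.arange(0, 2.0, 1 / 25)    # 2 s of video timestamps (50 frames)
audio_feat = np.random.default_rng(2).normal(size=(t_audio.size, 3))

# Linearly interpolate each audio feature dimension onto the video timestamps,
# yielding one aligned audio vector per video frame.
aligned = np.stack(
    [np.interp(t_video, t_audio, audio_feat[:, d])
     for d in range(audio_feat.shape[1])],
    axis=1)
print(aligned.shape)  # (50, 3)
```

After alignment, the per-frame audio and video vectors can be concatenated into the single aligned feature sequence that the RNN expects.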
2.1 - Mode-Based Challenges
Video:
  + Driver clearly visible
  - Face (feature) not visible
  - Poor lighting
  - Driver not visible
Audio:
  + Driver conversing
  - Passenger conversing
  - Ambient noise
  - Silence
CHALLENGES
● Preprocessing
  ○ Extraction of salient regions
  ○ Feature extraction (e.g. facial keypoints, audio features)
● Fusion
  ○ How/where are the modes combined (early or late)?
  ○ How are mode failure states handled/trained?
● RNN Placement
  ○ Before/after fusion (or both)
  ○ RNN type and depth
  ○ Combination with CNNs
  ○ Resource and gradient limitations
2.2 - Technical Challenges
[Figure: ‘Realistic Driving Face’ vs. ‘Unrealistic Driving Face’ examples.]
EXISTING APPROACHES
● Constraints for our reviewed approaches
  ○ Audio/visual data for ‘emotions in the wild’
  ○ Discrete emotion classification (but also some regression techniques)
● Some caveats
  ○ There are numerous datasets, all with different classifications and samples
  ○ No consistently used dataset - difficult to compare cross-paper results
  ○ ‘In the wild’ can imply acted or dramatised scenes
3.0 - Problem Statement
[Figure: sample frames from the AFEW dataset.]
EXISTING APPROACHES
End-to-End Multimodal Emotion Recognition using Deep Neural Networks [1]
3.1 - Tzirakis (2017)
● Mode-wise custom CNNs feed a multimodal LSTM
● Output: regressed arousal and valence
● The fusion approach yields the best of both modalities
[Figure: audio → Audio CNN and faces → Face CNN, combined into an LSTM producing the output; the whole pipeline is end-to-end trainable.]
EXISTING APPROACHES
An Early Fusion Approach for Multimodal Emotion Recognition using Deep Recurrent Networks [2]
3.2 - Bucur (2018)
● Concatenated RNN - no mode-dependent CNNs
● Inputs: audio, eye, face and depth features (pre-labelled & rate-corrected)
● Classified 6 emotions
[Figure: RNN architectures compared — LSTM, bidirectional LSTM, GRU, bidirectional GRU — arranged as parallel, concatenated, or context/face RNNs over the audio, eye, face and depth inputs.]
EXISTING APPROACHES
Context-aware Cascade Attention-based RNN for Video Emotion Recognition [3]
3.3 - Sun (2018)
● Image CNNs applied across the video sequence, followed by LSTMs
● Face and context (whole-image) modes
[Figure: per frame, a face CNN feeds a face LSTM and a context CNN feeds a context LSTM, forming the context-aware attention-based RNN.]
EXISTING APPROACHES
3.3 - Sun (2018)
● Proposed a context-attention mechanism
● Classified 8 emotions
[Figure: the context LSTM and face LSTM outputs are combined through the attention module; example of visual attention shown.]
EXISTING APPROACHES
Multimodal Dimension Affect Recognition using Deep Bidirectional LSTM RNNs [4]
3.4 - Pei (2015)
● Deeper (bidirectional) LSTMs
● Combines mode-wise RNNs, multimodal RNNs and a moving average
● Output: regressed arousal, valence and dominance
[Figure: audio and video frames each feed bidirectional LSTMs, stacked into deep bidirectional LSTMs (DBLSTMs), whose outputs are combined.]
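The moving-average stage can be sketched as simple temporal smoothing of the per-frame regression outputs. This is an illustrative implementation only; the window length and the exact placement in Pei's pipeline are assumptions.

```python
import numpy as np

def moving_average(preds, k=5):
    """Smooth per-frame predictions with a length-k moving average."""
    kernel = np.ones(k) / k
    # Smooth each output dimension (e.g. arousal, valence) independently.
    return np.array([np.convolve(preds[:, c], kernel, mode="same")
                     for c in range(preds.shape[1])]).T

# Toy per-frame arousal/valence predictions in [0, 1).
preds = np.random.default_rng(4).random(size=(100, 2))
smoothed = moving_average(preds, k=5)
print(smoothed.shape)  # (100, 2)
```

Smoothing suppresses frame-level jitter in the regressed affect values, which matters because emotional state changes slowly relative to the frame rate.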
EXISTING APPROACHES
Multi-Feature Based Emotion Recognition for Video Clips [5]
3.5 - Liu (2018)
● Late fusion, weighted by the accuracy of each branch
● Not end-to-end trainable - to be avoided despite its good performance
[Figure: four branches under late fusion — (1) face → landmark detector → CNN plus statistics on landmark distances → SVM; (2) face → DenseNet/Inception → statistics on extracted features → SVM; (3) face → tuned VGG16 → LSTM; (4) audio → tuned SoundNet.]
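Accuracy-weighted late fusion reduces to a weighted sum of the branch score vectors. The branch accuracies and class scores below are invented for illustration, not numbers from the paper.

```python
import numpy as np

# Hypothetical validation accuracies for four branches (not from the paper).
branch_acc = np.array([0.42, 0.48, 0.51, 0.39])
weights = branch_acc / branch_acc.sum()   # normalise to sum to 1

# Per-branch class scores for one clip (rows: branches, cols: emotion classes).
scores = np.array([
    [0.20, 0.50, 0.30],
    [0.10, 0.70, 0.20],
    [0.30, 0.40, 0.30],
    [0.25, 0.45, 0.30],
])
fused = weights @ scores              # accuracy-weighted sum of branch scores
print(fused.argmax())                 # index of the winning emotion class
```

More accurate branches thus pull the final decision harder, which is a cheap way to exploit branch reliability without any joint training.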
EXISTING APPROACHES
Group-Level Emotion Recognition Using Hybrid Deep Models Based on Faces, Scenes, Skeletons and Visual Attention [6]
3.6 - Guo (2018)
● Visual attention maps are processed by an LSTM as a set
● Skeleton data includes face, posture and hands
[Figure: face CNNs, scene CNNs, a skeleton CNN, and visual attention → LSTM, combined by late fusion. Example of visual attention, from [6].]
EXISTING APPROACHES
Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction [7]
3.7 - Brady (2016)
● Each mode’s error is modelled as measurement noise in the Kalman filter
● Outputs continuous-time valence and arousal values
[Figure: audio → feature extraction → SVM regressor; video → CNN features → LSTM; HRV and EDA → feature extraction → LSTMs; all combined by a Kalman filter (late fusion).]
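The Kalman-filter fusion idea can be illustrated with a simplified 1-D filter that treats each mode's per-frame prediction as a noisy measurement of the true affect value. This is a didactic sketch under a random-walk state model, not Brady's exact formulation; the noise levels are made up.

```python
import numpy as np

def kalman_fuse(measurements, R, Q=0.01):
    """1-D Kalman filter fusing per-mode affect predictions over time.

    measurements: (T, M) predictions from M modes.
    R: per-mode measurement-noise variances; noisier modes pull the
       estimate less (their Kalman gain is smaller).
    Q: process noise of the assumed random-walk affect dynamics.
    """
    x, P = 0.0, 1.0                      # state estimate and its variance
    out = []
    for z in measurements:
        P += Q                           # predict step (random walk)
        for m, z_m in enumerate(z):      # update sequentially with each mode
            K = P / (P + R[m])           # Kalman gain for mode m
            x += K * (z_m - x)
            P *= (1.0 - K)
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(5)
true_valence = np.sin(np.linspace(0, 3, 60))        # toy ground-truth signal
# Mode 0 is reliable (std 0.1), mode 1 is noisy (std 0.5).
meas = true_valence[:, None] + rng.normal(scale=[0.1, 0.5], size=(60, 2))
fused = kalman_fuse(meas, R=np.array([0.1 ** 2, 0.5 ** 2]))
print(fused.shape)  # (60,)
```

Because the gain for each mode scales inversely with its noise variance, an unreliable mode (e.g. a briefly occluded face) automatically contributes less, which is the appeal of this late-fusion scheme.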
DEMO
Our Setup
  ○ AFEW 2018 dataset [8]
  ○ Focus on the Neutral, Happy and Angry emotions
  ○ Precomputed facial features (from the CNN of [9]) and audio features (from openSMILE)
  ○ All code (except feature preprocessing) is our own
  ○ RNN networks trained from scratch
  ○ Tested various RNN architectures
4.0 - Demo Setup
[Figure: audio descriptor and face descriptor feed an RNN (trained by us) producing the output; evaluated on the unseen validation set of AFEW 2017 ‘Emotions in the Wild’.]
DEMO
4.1 - Demo Pipeline
[Figure: demo pipeline — audio descriptor and face descriptor → bidirectional LSTM → output.]
DEMO
4.2 - Our Demo
DEMO
How does the model perform?
  ○ Using our best-performing bidirectional LSTM model
  ○ Trained with modalities both enabled and disabled
4.3 - Demo Performance
PROPOSED METHOD
5.0 - Cross-Modal Learning
Drawbacks of existing approaches
  ○ Augmenting training with noisy/missing audio makes models more robust, but hurts performance
  ○ Encoder-decoders with tied weights do not scale well
  ○ Existing models are not forced to discover correlations across modalities
  ○ Different hidden units of existing models are not forced to learn different modes
[Figure: Multimodal Deep Learning [10] — audio and video inputs are mapped into a shared space and reconstructed as either mode; used for generating transcripts from both speech and ‘lip-reading’.]
PROPOSED METHOD
How can cross-modal learning be combined with RNNs for emotion classification?
● Train mode-wise networks to map into an ‘emotion space’
  ○ Similar emotions map to similar positions in the ‘emotion space’
  ○ Accomplished with a triplet loss and mode-wise encoder networks
  ○ Trained on precomputed features and three emotions
● The RNN learns from the ‘emotion space’
  ○ Concatenate mode features as before, but after mapping into the ‘emotion space’
  ○ The RNN is trained after the encoders
5.1 - Proposed Approach
[Figure: mode features → mode encoder → embedding space, trained so that for each anchor sample, the distance to a same-emotion sample is smaller than to a different-emotion sample: d(anchor, positive) < d(anchor, negative).]
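The triplet objective used to shape the ‘emotion space’ can be sketched directly. Below is a minimal NumPy version of the standard hinge triplet loss; the 2-D embeddings and the margin value are toy placeholders, not values from our experiments.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss: pull same-emotion embeddings together and
    push different-emotion embeddings at least `margin` further away."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)   # anchor ↔ same class
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)   # anchor ↔ other class
    return np.maximum(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings (hypothetical): anchor and positive share an emotion.
a = np.array([0.10, 0.90])
p = np.array([0.15, 0.85])   # same emotion, nearby -> constraint satisfied
n = np.array([0.90, 0.10])   # different emotion, far away
print(triplet_loss(a, p, n))  # 0.0 (margin already satisfied)
```

Each mode-wise encoder is trained against this loss, so that afterwards a single distance comparison in the shared space tells us whether two samples (from any mode) express the same emotion.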
PROPOSED METHOD
5.2 - Pipeline
[Figure: video features → video feature encoder and audio features → audio feature encoder, trained with a triplet loss into a joint embedding space; an RNN (an LSTM, or alternatively an FC network) then classifies from the embeddings.]
PROPOSED METHOD
5.3 - Evaluation
● Results
  ○ Emotion classes embedded into the ‘emotion space’
  ○ Projected to 2D using t-SNE, which is designed to show point distances
  ○ The embedding space is not sufficiently separated
● Possible cause
  ○ We use precomputed features - the feature descriptors cannot change
  ○ Errors cannot propagate back to the descriptor networks to select separable features
[Figure labelled ‘Validation Overfit’.]
PROPOSED METHOD
5.4 - Advantages & Disadvantages
+ Modes have the same meaning independent of each other
+ Easy to tell whether modes are in agreement
+ The RNN no longer needs to learn how modes are related
+ Easy to add other modalities - just train another encoder
+ The RNN can utilise the emotion description in the latent space
- Additional training steps & overhead
- Embedding is difficult with precomputed features
- Projection to the embedding space may remove mode-specific information
CONCLUSION
● Autonomous driving revolution - still a long way to go
  ○ Cars need to monitor drivers to ensure safe ‘hybrid’ operation
● Emotion recognition is a well-established field
  ○ But still very challenging - emotions are difficult to classify consistently
  ○ Multimodal approaches provide measurable benefits
● RNNs work well in multimodal emotion recognition
  ○ They take advantage of sequences of continuous features
6.0 - Summary
CONCLUSION
● Multimodal RNN placement/application research
  ○ Where does the RNN go? (Normally post-fusion, but sometimes before too)
  ○ Where do we fuse features? (Later is generally better)
  ○ What type(s) of RNNs are used? (Bidirectional LSTM/GRU)
● Cross-modal approaches
  ○ Generalise the representation of concepts across different modes
● Our approach
  ○ Combines both - unable to evaluate fully with pretrained features
6.1 - Approaches
[Figure: recap — the Tzirakis pipeline (audio → Audio CNN and faces → Face CNN → LSTM → output) alongside our encoder approach (mode features → mode encoder → embedding space).]
CONCLUSION
● How do we extract features from the various modes?
  ○ Manage representations of emotions from different modes
  ○ Utilise HRV extracted from faces as an additional mode
● Is there a better way to combine features?
  ○ Deal with failed detections in certain modes
  ○ Get the cross-modal representation working with descriptors
● How do we quickly and effectively train RNN encoders?
  ○ More difficult than regular deep networks
  ○ Adopt cutting-edge RNN encoder-decoder architectures
6.2 - Future Direction
Questions
CONCLUSION
[1] Tzirakis, Panagiotis, et al. "End-to-end multimodal emotion recognition using deep neural networks." IEEE Journal of Selected Topics in Signal Processing 11, no. 8 (2017): 1301-1309.
[2] Bucur, Beniamin, et al. "An early fusion approach for multimodal emotion recognition using deep recurrent networks." In 2018 IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP), pp. 71-78. IEEE, 2018.
[3] Sun, Man-Chin, et al. "Context-aware Cascade Attention-based RNN for Video Emotion Recognition." In 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia), pp. 1-6. IEEE, 2018.
[4] Pei, Ercheng, et al. "Multimodal dimensional affect recognition using deep bidirectional long short-term memory recurrent neural networks." In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 208-214. IEEE, 2015.
[5] Liu, Chuanhe, et al. "Multi-feature based emotion recognition for video clips." In Proceedings of the 2018 on International Conference on Multimodal Interaction, pp. 630-634. ACM, 2018.
[6] Guo, Xin, et al. "Group-Level Emotion Recognition using Hybrid Deep Models based on Faces, Scenes, Skeletons and Visual Attentions." In Proceedings of the 2018 on International Conference on Multimodal Interaction, pp. 635-639. ACM, 2018.
[7] Brady, Kevin, et al. "Multi-modal audio, video and physiological sensor learning for continuous emotion prediction." In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 97-104. ACM, 2016.
[8] Dhall, Abhinav, et al. "From individual to group-level emotion recognition: EmotiW 5.0." In Proceedings of the 19th ACM international conference on multimodal interaction, pp. 524-528. ACM, 2017.
[9] Knyazev, Boris, et al. "Convolutional neural networks pretrained on large face recognition datasets for emotion classification from video." arXiv preprint arXiv:1711.04598 (2017).
[10] Ngiam, Jiquan, et al. "Multimodal deep learning." In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689-696. 2011.
References