RNN Architectures for Emotion Classification in Multimodal Emotion Recognition Systems
Deepan Das · John Ridley · July 19, 2019
Advanced Methods for Emotion Recognition in Highly Automated Driving

Transcript of "RNN Architectures for Emotion Classification in Multimodal Emotion Recognition Systems"

Page 1:

RNN Architectures for Emotion Classification in Multimodal Emotion Recognition Systems

Deepan Das · John Ridley · July 19, 2019

Advanced Methods for Emotion Recognition in Highly Automated Driving

Page 2:

Agenda

1. Background

2. Challenges

3. Existing Approaches

4. Demo

5. Proposed Solution

6. Conclusion

Deepan Das - John Ridley

Page 3:

BACKGROUND

● Autonomous driving is just being realised
  ○ But there is a massive transition window
  ○ Awkward ‘monitored autonomy’ mix

● Why monitor driver emotion?
  ○ Safety
  ○ Comfort/luxury

● Continuous affective state monitoring
  ○ Can they safely operate the vehicle?
  ○ Are they likely to be a hazard?
  ○ Can the vehicle influence/improve their mood?

1.0 - Introduction

[Figure: SAE automation levels - Level 1 (Hands On), Level 2 (Hands Off), Level 3 (Attention Off), Level 4 (Driver Redundancy), Level 5 (No Interaction) - annotated ‘Emotion Monitoring Critical’ through ‘Emotion Monitoring Enables Luxury’.]

Page 4:

Russell’s Circumplex Model

BACKGROUND

● Emotions are...
  ○ not inherently bad, but influence behaviour
  ○ classified in various ways
  ○ varyingly relevant to driving behaviour
  ○ temporally dynamic

1.1 - Let’s talk about our feelings

[Figure: emotions placed on Arousal (active-inactive) and Valence (positive-negative) axes: excited, delighted, happy, serene, relaxed and calm at positive valence; stressed, tense, upset, sad, depressed and fatigued at negative valence. Shown alongside Plutchik’s Wheel of Emotions.]


Page 5:

BACKGROUND

● Deep Neural Networks
  ○ Fundamentally changed the ML landscape
  ○ Not directly applicable to temporally dependent data

● Recurrent Neural Networks
  ○ Unroll networks temporally
  ○ Cells propagate temporal context/state
  ○ Handle time-dependent, variable-length sequences
  ○ Can also input/output single values
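As a concrete illustration of the unrolling idea, here is a minimal NumPy sketch of a vanilla RNN stepped over a sequence (the dimensions and random weights are arbitrary placeholders, not values from the slides):

```python
import numpy as np

def rnn_unroll(x_seq, h0, W_xh, W_hh, b_h):
    """Unroll a vanilla RNN over time: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b).
    The hidden state h carries temporal context from cell to cell."""
    h = h0
    states = []
    for x_t in x_seq:                      # one cell per time step t, t+1, ...
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)                # (T, hidden_dim) sequence of states

rng = np.random.default_rng(0)
T, d_in, d_h = 3, 4, 5                     # sequence length, input dim, hidden dim
x_seq = rng.standard_normal((T, d_in))
states = rnn_unroll(x_seq,
                    h0=np.zeros(d_h),
                    W_xh=rng.standard_normal((d_h, d_in)) * 0.1,
                    W_hh=rng.standard_normal((d_h, d_h)) * 0.1,
                    b_h=np.zeros(d_h))
print(states.shape)  # (3, 5)
```

The same loop can emit only the final state (sequence-to-one classification) or every state (sequence-to-sequence), which is how a single cell handles variable-length input.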

1.2 - Recurrent Neural Networks

[Figure: a regular ANN (input → layers → output) versus an unrolled RNN, with one cell per time step (t, t+1, t+2) passing its state forward.]

Page 6:

Bidirectional RNN

BACKGROUND

1.3 - RNN Zoo

[Figure: RNN cell zoo - a vanilla tanh cell (suffers from the vanishing gradient), an LSTM cell (three σ gates plus tanh) and a GRU cell (two σ gates, tanh, and a ‘1 -’ interpolation path). Extra gradient pathways give longer-term memory; cells can be stacked, and bidirectional variants add a reverse gradient flow.]
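The GRU’s gating can be written out in a few lines; this is a minimal NumPy sketch (the weight shapes and the bias-folding trick are implementation choices, not from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, Wz, Wr, Wh):
    """One GRU step; biases are folded in by appending a constant 1."""
    hx = np.concatenate([h, x, [1.0]])        # [h_{t-1}, x_t, 1]
    z = sigmoid(Wz @ hx)                      # update gate (one σ in the diagram)
    r = sigmoid(Wr @ hx)                      # reset gate (the other σ)
    h_cand = np.tanh(Wh @ np.concatenate([r * h, x, [1.0]]))  # tanh candidate
    return (1.0 - z) * h + z * h_cand         # the '1 -' interpolation path

rng = np.random.default_rng(1)
d_h, d_in = 4, 3
Wz, Wr, Wh = (rng.standard_normal((d_h, d_h + d_in + 1)) * 0.1 for _ in range(3))
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):      # run five time steps
    h = gru_step(x, h, Wz, Wr, Wh)
print(h.shape)  # (4,)
```

Because the new state is a convex combination of the old state and a bounded candidate, gradients have a more direct pathway than in the vanilla tanh cell.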

Page 7:

BACKGROUND

● Combine for ‘best of both worlds’
  ○ Certain modes can better indicate certain states [1]
  ○ More difficult than it seems

1.4 - Multimodal Emotion Classification

● Emotions observed in many ways - common sensor modalities:
  ○ Video: images, faces
  ○ Audio: signal, dialogue
  ○ Physiological: EEG, ECG, EOG, EDA

Page 8:

CHALLENGES

● Which modes to utilise, given a driving context?
  ○ Most physiological modes cannot be used - they require intrusive sensors
  ○ We focus on audio/visual modes (and variants) - the scope of most existing research

● How do we utilise RNNs?
  ○ RNNs work with sequences of aligned features
  ○ Where do we use the RNNs (before/after mode fusion)?

● Where/how are the modes and RNN(s) combined in the pipeline?

2.0 - Scope of Challenge

[Figure: video features and audio features feed into an undetermined arrangement (‘?’) of fusion, RNN and classification.]

Page 9:

CHALLENGES

● Mode Variability
  ○ Different sampling rates and numerical dimensions
  ○ Reliability/robustness of sensor or preprocessing methods

● Mode Applicability
  ○ Certain modes can better indicate certain states [1]
  ○ There are positive and negative conditions for both mode types

2.1 - Mode-Based Challenges

Video: + driver clearly visible; - face (feature) not visible; - poor lighting; - driver not visible
Audio: + driver conversing; - passenger conversing; - ambient noise; - silence

Page 10:

CHALLENGES

● Preprocessing
  ○ Extraction of salient regions
  ○ Feature extraction (e.g. facial keypoints, audio features)

● Fusion
  ○ How/where are the modes combined (early or late)?
  ○ How are mode failure states handled/trained?

● RNN Placement
  ○ Before/after fusion (or both)
  ○ RNN type and depth
  ○ Combination with CNN
  ○ Resource and gradient limitations
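To make the early-vs-late fusion distinction concrete, here is a minimal sketch (the feature sizes and class probabilities are made up for illustration):

```python
import numpy as np

def early_fusion(audio_feat, video_feat):
    """Early fusion: concatenate per-frame mode features before a shared model."""
    return np.concatenate([audio_feat, video_feat], axis=-1)

def late_fusion(audio_probs, video_probs, w_audio=0.5):
    """Late fusion: classify each mode separately, then combine the decisions."""
    return w_audio * audio_probs + (1.0 - w_audio) * video_probs

audio_feat = np.ones((10, 32))    # 10 frames of hypothetical 32-d audio descriptors
video_feat = np.ones((10, 64))    # 10 frames of hypothetical 64-d face descriptors
print(early_fusion(audio_feat, video_feat).shape)   # (10, 96)

audio_probs = np.array([0.6, 0.3, 0.1])   # e.g. Neutral / Happy / Angry
video_probs = np.array([0.2, 0.7, 0.1])
print(late_fusion(audio_probs, video_probs))        # [0.4 0.5 0.1]
```

Late fusion makes mode failure easy to handle (drop or down-weight one branch), while early fusion lets a shared model discover cross-mode correlations.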

2.2 - Technical Challenges

Page 11:

[Figure: a realistic driving face vs. an unrealistic driving face.]

EXISTING APPROACHES

● Constraints for our reviewed approaches
  ○ Audio/visual data for ‘emotions in the wild’
  ○ Discrete emotion classification (but also some regression techniques)

● Some caveats
  ○ There are numerous datasets, all with different classifications and samples
  ○ No consistently used dataset - difficult to compare cross-paper results
  ○ ‘In the wild’ can imply acted or dramatised scenes

3.0 - Problem Statement

AFEW dataset

Page 12:

EXISTING APPROACHES

End-to-End Multimodal Emotion Recognition using Deep Neural Networks [1]

3.1 - Tzirakis (2017)

● Mode-wise custom CNNs feed a multimodal LSTM
● Output: regressed arousal and valence
● The fusion approach yields the best of both modalities

[Figure: audio → Audio CNN, faces → Face CNN, both feeding an LSTM that produces the output; the whole pipeline is end-to-end trainable.]

Page 13:

EXISTING APPROACHES

An Early Fusion Approach for Multimodal Emotion Recognition using Deep Recurrent Networks [2]

3.2 - Bucur (2018)

● Concatenated RNN - no mode-dependent CNNs
● Input: audio, eye, face and depth features (pre-labelled and rate-corrected)
● Classified 6 emotions

[Figure: audio, eye, face and depth features concatenated into one RNN; architectures evaluated: LSTM, bidirectional LSTM, GRU, bidirectional GRU.]

Page 14:

[Figure: three RNN arrangements - Parallel RNN, Concatenated RNN, Context/Face RNN.]

EXISTING APPROACHES

Context-aware Cascade Attention-based RNN for Video Emotion Recognition [3]

3.3 - Sun (2018)

● Image CNNs applied across the video sequence, combined with an LSTM
● Face and context (whole-image) modes

[Figure: per-frame Face CNNs and Context CNNs feeding a Face LSTM and a Context LSTM.]

Page 15:

Context-aware Attention-based RNN

EXISTING APPROACHES

Context-aware Cascade Attention-based RNN for Video Emotion Recognition [3]

3.3 - Sun (2018)

● Proposed a Context Attention mechanism
● Classified 8 emotions

[Figure: Face CNN and Context CNN feed a Face LSTM and a Context LSTM, combined through the attention module; an example of visual attention is shown.]

Page 16:

EXISTING APPROACHES

Multimodal Dimensional Affect Recognition using Deep Bidirectional LSTM RNNs [4]

3.4 - Pei (2015)

● Deeper (bidirectional) LSTMs
● Combines mode-wise and multimodal RNNs with a moving average
● Output: regressed arousal, valence and dominance

[Figure: audio and video frames each pass through a bidirectional LSTM and then stacked deep bidirectional LSTMs (DBLSTMs).]

Page 17:

EXISTING APPROACHES

Multi-Feature Based Emotion Recognition for Video Clips [5]

3.5 - Liu (2018)

[Figure: four branches, fused late - (1) face → landmark detector → CNN and stats on landmark distances → SVM; (2) face → DenseNet/Inception → stats on extracted features → SVM; (3) face → tuned VGG16 → LSTM; (4) audio → tuned SoundNet.]

● Late fusion, weighted by the accuracy of each branch
● Not end-to-end trainable - to be avoided despite good performance

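The accuracy-weighted late fusion described above can be sketched as follows (the branch probabilities and accuracies here are invented for illustration, not taken from [5]):

```python
import numpy as np

def weighted_late_fusion(branch_probs, branch_accs):
    """Combine per-branch class probabilities, weighting each branch by its accuracy."""
    w = np.asarray(branch_accs, dtype=float)
    w /= w.sum()                                   # normalise branch weights
    fused = sum(wi * p for wi, p in zip(w, branch_probs))
    return fused / fused.sum()                     # renormalise to a distribution

# Three hypothetical branches (e.g. landmark-SVM, CNN-SVM, LSTM) over 3 classes
probs = [np.array([0.5, 0.4, 0.1]),
         np.array([0.2, 0.6, 0.2]),
         np.array([0.3, 0.5, 0.2])]
accs = [0.35, 0.45, 0.40]                          # per-branch validation accuracies
fused = weighted_late_fusion(probs, accs)
print(fused.argmax())  # 1 - the class the weighted vote selects
```

The weighting means a weak branch cannot outvote stronger ones, but since the weights come from held-out accuracy rather than gradients, the combination is not end-to-end trainable.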

Page 18:

EXISTING APPROACHES

Group-Level Emotion Recognition Using Hybrid Deep Models Based on Faces, Scenes, Skeletons and Visual Attention [6]

3.6 - Guo (2018)

[Figure: face CNNs, scene CNNs, a skeleton CNN and visual attention (via an LSTM) combined by late fusion.]

● Visual attention regions processed by an LSTM as a set
● Skeleton data includes face, posture and hands

Example of visual attention, from [6]

Page 19:

EXISTING APPROACHES

Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction [7]

3.7 - Brady (2016)

[Figure: audio → feature extraction → SVM regressor; video → CNN features → LSTM; HRV and EDA → feature extraction → LSTMs; all branches combined by a Kalman filter (late fusion).]

● Each mode’s error is modelled as measurement noise in the Kalman filter
● Outputs continuous-time valence and arousal values
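As a rough sketch of this Kalman-style fusion, here is a scalar random-walk filter (all numbers are illustrative, not from [7]):

```python
import numpy as np

def kalman_fuse(measurements, meas_vars, x0=0.0, p0=1.0, q=0.01):
    """Scalar Kalman filter over per-step mode estimates.
    Each mode's error enters as measurement-noise variance r, so noisier
    modes receive a smaller Kalman gain and thus less weight."""
    x, p = x0, p0
    for z, r in zip(measurements, meas_vars):
        p = p + q                  # predict (random-walk state model)
        k = p / (p + r)            # Kalman gain
        x = x + k * (z - x)        # update with this mode's estimate
        p = (1.0 - k) * p
    return x

zs = [0.9, 0.5, 0.55, 0.6]         # hypothetical valence estimates (audio first, then video)
rs = [1.0, 0.1, 0.1, 0.1]          # hypothetical per-mode noise variances
print(round(kalman_fuse(zs, rs), 3))
```

The noisy first (audio) estimate is largely discounted, so the fused value tracks the cleaner video estimates.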

Page 20:

DEMO

Our Setup
○ AFEW 2018 dataset [8]
○ Focus on Neutral, Happy and Angry emotions
○ Precomputed facial features (from the CNN of [9]) and audio features (from openSMILE)
○ All code (except feature preprocessing) is our own
○ RNN networks trained from scratch
○ Tested various RNN architectures

4.0 - Demo Setup

[Figure: audio descriptor + face descriptor → RNN → output.]

Page 21:

DEMO

4.1 - Demo Pipeline

[Figure: audio descriptor + face descriptor → bidirectional LSTM (trained by us) → output; evaluated on the unseen validation set of AFEW 2017, ‘Emotions in the Wild’.]

Page 22:

DEMO

4.2 - Our Demo

Page 23:

DEMO

How does the model perform?
○ Using our best-performing bidirectional LSTM model
○ Trained with modalities both enabled and disabled

4.3 - Demo Performance

Page 24:

PROPOSED METHOD


5.0 - Cross-Modal Learning

Drawbacks of existing approaches
○ Augmenting training with noisy/missing audio makes models more robust, but hurts performance
○ Encoder-decoder architectures with tied weights do not scale well
○ Existing models are not forced to discover correlations across modalities
○ Different hidden units of existing models are not forced to learn different modes

[Figure: audio and video encoded into a shared space, from which each modality can be reconstructed.]

Multimodal Deep Learning [10]

Generating transcripts from both speech and ‘lip-reading’

Page 25:

PROPOSED METHOD

How can cross-modal learning be combined with RNNs for emotion classification?

● Train mode-wise networks to map to an ‘emotion space’
  ○ Similar emotions map to similar positions in ‘emotion space’
  ○ Accomplished with a triplet loss and mode-wise encoder networks
  ○ Trained on precomputed features and three emotions

● The RNN learns from the ‘emotion space’
  ○ Concatenate mode features as before, but after mapping into ‘emotion space’
  ○ The RNN is trained after the encoders
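The triplet loss mentioned above can be sketched in a few lines (the embeddings are hypothetical 2-d points and the margin is chosen arbitrarily):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull same-emotion embeddings together, push different ones apart:
    L = max(0, d(a, p) - d(a, n) + margin), with squared Euclidean distance d."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.1, 0.9])   # e.g. audio embedding of a 'happy' clip (illustrative)
p = np.array([0.2, 0.8])   # video embedding of another 'happy' clip
n = np.array([0.9, 0.1])   # embedding of an 'angry' clip
print(triplet_loss(a, p, n))  # 0.0 - positive already closer than negative by > margin
```

Minimising this loss over the mode-wise encoders drives similar emotions to nearby positions in the shared ‘emotion space’, regardless of which mode produced the embedding.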

5.1 - Proposed Approach

[Figure: mode features → mode encoder → embedding space, in which d(anchor, positive) < d(anchor, negative) for same-emotion pairs.]

Page 26:

PROPOSED METHOD

5.2 - Pipeline

[Figure: video features and audio features pass through per-mode feature encoders (trained with the triplet loss) into a joint embedding space, then into an RNN (an LSTM or an FC layer) for classification.]

Page 27:

PROPOSED METHOD

5.3 - Evaluation

● Results
  ○ Emotion classes embedded into the ‘emotion space’
  ○ Projected to 2D using t-SNE - designed to reflect point distances
  ○ Embedding space not sufficiently separated

● Possible Cause
  ○ We use precomputed features - the feature descriptors cannot change
  ○ Errors cannot propagate back to the descriptor networks to select separable features

[Figure labels: Validation · Overfit]

Page 28:

PROPOSED METHOD


5.4 - Advantages & Disadvantages

+ Modes have the same meaning independent of each other

+ Easy to tell whether modes are in agreement

+ RNN no longer needs to learn how modes are related

+ Easy to add other modalities - just train another encoder

+ RNN can utilise the emotion description in the latent space

- Additional training steps & overhead

- Embedding difficult with precomputed features

- Projection to embedding space may remove mode-specificities

Page 29:

CONCLUSION

● Autonomous driving revolution - still a long way to go
  ○ Cars need to monitor drivers to ensure safe ‘hybrid’ operation

● Emotion recognition is a well-established field
  ○ But still very challenging - emotions are difficult to classify consistently
  ○ Multimodal approaches provide measurable benefits

● RNNs work well in multimodal emotion recognition
  ○ They take advantage of sequences of continuous features

6.0 - Summary

Page 30:

CONCLUSION

● Multimodal RNN placement/application research
  ○ Where does the RNN go? (Normally post-fusion, but sometimes before too)
  ○ Where do we fuse features? (Later is generally better)
  ○ What type(s) of RNNs are used? (Bidirectional LSTM/GRU)

● Cross-modal approaches
  ○ Generalise the representation of concepts across different modes

● Our approach
  ○ Combines both - unable to evaluate with pretrained features

6.1 - Approaches

[Figure recap: the end-to-end pipeline (audio → Audio CNN, faces → Face CNN → LSTM → output) and the encoder approach (mode features → mode encoder → embedding space).]

Page 31:

CONCLUSION

● How do we extract features from various modes?
  ○ Manage representations of emotions from different modes
  ○ Utilise HRV extracted from faces as an additional mode

● Is there a better way to combine features?
  ○ Deal with failed detections in certain modes
  ○ Get the cross-modal representation working with descriptors

● How do we quickly and effectively train RNN encoders?
  ○ More difficult than regular deep networks
  ○ Adopt cutting-edge RNN encoder-decoder architectures

6.2 - Future Direction

Page 32:

Questions

Page 33:

CONCLUSION

[1] Tzirakis, Panagiotis et al. "End-to-end multimodal emotion recognition using deep neural networks." IEEE Journal of Selected Topics in Signal Processing 11, no. 8 (2017): 1301-1309.

[2] Bucur, Beniamin et al. "An early fusion approach for multimodal emotion recognition using deep recurrent networks." In 2018 IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP), pp. 71-78. IEEE, 2018.

[3] Sun, Man-Chin, et al. "Context-aware Cascade Attention-based RNN for Video Emotion Recognition." In 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia), pp. 1-6. IEEE, 2018.

[4] Pei, Ercheng, et al. "Multimodal dimensional affect recognition using deep bidirectional long short-term memory recurrent neural networks." In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 208-214. IEEE, 2015.

[5] Liu, Chuanhe, et al. "Multi-feature based emotion recognition for video clips." In Proceedings of the 2018 on International Conference on Multimodal Interaction, pp. 630-634. ACM, 2018.

[6] Guo, Xin, et al. "Group-Level Emotion Recognition using Hybrid Deep Models based on Faces, Scenes, Skeletons and Visual Attentions." In Proceedings of the 2018 on International Conference on Multimodal Interaction, pp. 635-639. ACM, 2018.

[7] Brady, Kevin, et al. "Multi-modal audio, video and physiological sensor learning for continuous emotion prediction." In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 97-104. ACM, 2016.

[8] Dhall, Abhinav, et al. "From individual to group-level emotion recognition: EmotiW 5.0." In Proceedings of the 19th ACM international conference on multimodal interaction, pp. 524-528. ACM, 2017.

[9] Knyazev, Boris, et al. "Convolutional neural networks pretrained on large face recognition datasets for emotion classification from video." arXiv preprint arXiv:1711.04598 (2017).

[10] Ngiam, Jiquan, et al. "Multimodal deep learning." In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689-696. 2011.

References