RNN Architectures for Emotion Classification in Multimodal Emotion Recognition Systems
Deepan Das, John Ridley — July 19, 2019
Advanced Methods for Emotion Recognition in Highly Automated Driving
Agenda
1. Background
2. Challenges
3. Existing Approaches
4. Demo
5. Proposed Solution
6. Conclusion
Deepan Das - John Ridley
BACKGROUND
● Autonomous driving is just being realised
  ○ But there is a massive transition window
  ○ An awkward ‘monitored autonomy’ mix
● Why monitor driver emotion?
  ○ Safety
  ○ Comfort/luxury
● Continuous affective state monitoring
  ○ Can they safely operate the vehicle?
  ○ Are they likely to be a hazard?
  ○ Can the vehicle influence/improve their mood?
1.0 - Introduction
[Figure: SAE autonomy levels — Level 1: Hands On; Level 2: Hands Off; Level 3: Attention Off; Level 4: Driver Redundancy; Level 5: No Interaction. Annotated ‘Emotion Monitoring Critical’ and ‘Emotion Monitoring Enables Luxury’.]
Russell’s Circumplex Model
BACKGROUND
● Emotions are...
  ○ not inherently bad, but influence behaviour
  ○ classified in various ways
  ○ varyingly relevant to driving behaviour
  ○ temporally dynamic
1.1 - Let’s talk about our feelings
[Figure: axes Arousal (active–inactive) and Valence (negative–positive), with example emotions plotted: Happy, Excited, Delighted, Serene, Relaxed, Calm, Fatigued, Depressed, Sad, Upset, Stressed, Tense.]
Plutchik’s Wheel of Emotions
Deepan Das - John Ridley
BACKGROUND
● Deep Neural Networks
  ○ Fundamentally changed the ML landscape
  ○ Not directly applicable to temporally dependent data
● Recurrent Neural Networks
  ○ Unroll networks temporally
  ○ Cells propagate temporal context/state
  ○ Handle time-dependent, variable-length sequences
  ○ Can also input/output single values
1.2 - Recurrent Neural Networks
[Figure: a regular ANN (layers mapping input to output) vs. an RNN unrolled over time steps t, t+1, t+2, where each cell passes state forward; a bidirectional RNN also passes state backward.]
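The unrolled recurrence can be written in a few lines. Below is a minimal NumPy sketch of a vanilla RNN cell stepped over a sequence; the dimensions and random weights are purely illustrative, not any particular model from the talk.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One unrolled time step of a vanilla RNN cell."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Hypothetical dimensions: 4-dim input features, 8-dim hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 8)) * 0.1
W_hh = rng.normal(size=(8, 8)) * 0.1
b_h = np.zeros(8)

sequence = rng.normal(size=(5, 4))   # 5 time steps of input features
h = np.zeros(8)                      # initial hidden state
for x_t in sequence:                 # unroll over time
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)  # (8,)
```

The same cell weights are reused at every step; only the hidden state carries temporal context forward, which is what lets the network consume variable-length sequences.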
BACKGROUND
1.3 - RNN Zoo
[Figure: RNN cell types — Vanilla (single tanh; suffers from vanishing gradients), LSTM (three σ gates plus tanh), GRU (two σ gates, a tanh candidate, and a 1 − z interpolation). Gated cells add gradient pathways for longer-term memory; cells can be stacked, and bidirectional variants reverse the gradient flow.]
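To make the gating concrete, here is a minimal NumPy sketch of one GRU step following the standard formulation (update gate, reset gate, tanh candidate, and the 1 − z interpolation); the sizes and weights are made-up placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, P):
    """One GRU step: two sigmoid gates and a tanh candidate state."""
    z = sigmoid(x @ P["Wxz"] + h @ P["Whz"])               # update gate
    r = sigmoid(x @ P["Wxr"] + h @ P["Whr"])               # reset gate
    h_tilde = np.tanh(x @ P["Wxh"] + (r * h) @ P["Whh"])   # candidate state
    return (1.0 - z) * h + z * h_tilde                     # the "1 -" branch

rng = np.random.default_rng(1)
d_in, d_h = 4, 6  # hypothetical input/hidden sizes
P = {k: rng.normal(size=s) * 0.1 for k, s in {
    "Wxz": (d_in, d_h), "Whz": (d_h, d_h),
    "Wxr": (d_in, d_h), "Whr": (d_h, d_h),
    "Wxh": (d_in, d_h), "Whh": (d_h, d_h)}.items()}

h = np.zeros(d_h)
for x in rng.normal(size=(10, d_in)):
    h = gru_step(x, h, P)
print(h.shape)  # (6,)
```

The additive interpolation (1 − z) · h + z · h̃ is the gradient pathway the diagram alludes to: when z is near 0, the old state passes through almost unchanged, which mitigates the vanishing-gradient problem of the vanilla cell.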
BACKGROUND
● Combine for ‘best of both worlds’
  ○ Certain modes can better indicate certain states [1]
  ○ More difficult than it seems
1.4 - Multimodal Emotion Classification
● Emotions observed in many ways - common sensor modalities:
  ○ Video: images, faces
  ○ Audio: signal, dialogue
  ○ Physiological: EEG, ECG, EOG, EDA
CHALLENGES
● Which modes to utilise, given a driving context?
  ○ Most physiological modes cannot be used - they require intrusive sensors
  ○ We focus on audio/visual modes (and variants) - the scope of most existing research
● How do we utilise RNNs?
  ○ RNNs work with sequences of aligned features
  ○ Where do we use the RNNs (before/after mode fusion)?
● Where/how are the modes and RNN(s) combined in the pipeline?
2.0 - Scope of Challenge
[Figure: pipeline sketch — video features and audio features feed, via some yet-undecided arrangement of RNN(s), into classification.]
CHALLENGES
● Mode Variability
  ○ Different sampling rates and numerical dimensions
  ○ Reliability/robustness of sensor or preprocessing methods
● Mode Applicability
  ○ Certain modes can better indicate certain states [1]
  ○ There are positive and negative conditions for both mode types
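One concrete instance of mode variability is the sampling-rate mismatch: audio features typically arrive far more often than video frames. A simple remedy is to interpolate the faster stream onto the slower stream's timestamps. The rates below (100 Hz audio, 25 Hz video) are hypothetical examples, not values from any reviewed paper.

```python
import numpy as np

# Hypothetical rates: audio features at 100 Hz, video features at 25 Hz.
t_audio = np.arange(0, 2.0, 1 / 100)   # 2 s of audio timestamps (200 samples)
t_video = np.arange(0, 2.0, 1 / 25)    # 2 s of video timestamps (50 frames)
audio_feat = np.random.default_rng(2).normal(size=(t_audio.size, 3))

# Linearly interpolate each audio feature dimension onto the video timestamps,
# yielding one aligned audio vector per video frame.
aligned = np.stack(
    [np.interp(t_video, t_audio, audio_feat[:, d])
     for d in range(audio_feat.shape[1])],
    axis=1)
print(aligned.shape)  # (50, 3)
```

After alignment, the per-frame audio and video vectors can be concatenated into the single aligned feature sequence that the RNN expects.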
2.1 - Mode-Based Challenges
Video:
  + Driver clearly visible
  - Face (feature) not visible
  - Poor lighting
  - Driver not visible
Audio:
  + Driver conversing
  - Passenger conversing
  - Ambient noise
  - Silence
CHALLENGES
● Preprocessing
  ○ Extraction of salient regions
  ○ Feature extraction (e.g. facial keypoints, audio features)
● Fusion
  ○ How/where are the modes combined (early or late)?
  ○ How are mode failure states handled/trained?
● RNN Placement
  ○ Before/after fusion (or both)
  ○ RNN type and depth
  ○ Combination with CNNs
  ○ Resource and gradient limitations
2.2 - Technical Challenges
[Figure: ‘Realistic Driving Face’ vs. ‘Unrealistic Driving Face’ examples.]
EXISTING APPROACHES
● Constraints for our reviewed approaches
  ○ Audio/visual data for ‘emotions in the wild’
  ○ Discrete emotion classification (but also some regression techniques)
● Some caveats
  ○ There are numerous datasets, all with different classifications and samples
  ○ No consistently used dataset - difficult to compare cross-paper results
  ○ ‘In the wild’ can imply acted or dramatised scenes
3.0 - Problem Statement
[Figure: sample frames from the AFEW dataset.]
EXISTING APPROACHES
End-to-End Multimodal Emotion Recognition using Deep Neural Networks [1]
3.1 - Tzirakis (2017)
● Mode-wise custom CNNs feed a multimodal LSTM
● Output: regressed arousal and valence
● The fusion approach yields the best of both modalities
[Figure: audio → Audio CNN and faces → Face CNN, combined into an LSTM producing the output; the whole pipeline is end-to-end trainable.]
EXISTING APPROACHES
An Early Fusion Approach for Multimodal Emotion Recognition using Deep Recurrent Networks [2]
3.2 - Bucur (2018)
● Concatenated RNN - no mode-dependent CNNs
● Inputs: audio, eye, face and depth features (pre-labelled & rate-corrected)
● Classified 6 emotions
[Figure: RNN architectures compared — LSTM, bidirectional LSTM, GRU, bidirectional GRU — arranged as parallel, concatenated, or context/face RNNs over the audio, eye, face and depth inputs.]
EXISTING APPROACHES
Context-aware Cascade Attention-based RNN for Video Emotion Recognition [3]
3.3 - Sun (2018)
● Image CNNs applied across the video sequence, followed by LSTMs
● Face and context (whole-image) modes
[Figure: per frame, a face CNN feeds a face LSTM and a context CNN feeds a context LSTM, forming the context-aware attention-based RNN.]
EXISTING APPROACHES
3.3 - Sun (2018)
● Proposed a context-attention mechanism
● Classified 8 emotions
[Figure: the context LSTM and face LSTM outputs are combined through the attention module; example of visual attention shown.]
EXISTING APPROACHES
Multimodal Dimension Affect Recognition using Deep Bidirectional LSTM RNNs [4]
3.4 - Pei (2015)
● Deeper (bidirectional) LSTMs
● Combines mode-wise RNNs, multimodal RNNs and a moving average
● Output: regressed arousal, valence and dominance
[Figure: audio and video frames each feed bidirectional LSTMs, stacked into deep bidirectional LSTMs (DBLSTMs), whose outputs are combined.]
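The moving-average stage can be sketched as simple temporal smoothing of the per-frame regression outputs. This is an illustrative implementation only; the window length and the exact placement in Pei's pipeline are assumptions.

```python
import numpy as np

def moving_average(preds, k=5):
    """Smooth per-frame predictions with a length-k moving average."""
    kernel = np.ones(k) / k
    # Smooth each output dimension (e.g. arousal, valence) independently.
    return np.array([np.convolve(preds[:, c], kernel, mode="same")
                     for c in range(preds.shape[1])]).T

# Toy per-frame arousal/valence predictions in [0, 1).
preds = np.random.default_rng(4).random(size=(100, 2))
smoothed = moving_average(preds, k=5)
print(smoothed.shape)  # (100, 2)
```

Smoothing suppresses frame-level jitter in the regressed affect values, which matters because emotional state changes slowly relative to the frame rate.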
EXISTING APPROACHES
Multi-Feature Based Emotion Recognition for Video Clips [5]
3.5 - Liu (2018)
● Late fusion, weighted by the accuracy of each branch
● Not end-to-end trainable - to be avoided despite its good performance
[Figure: four branches under late fusion — (1) face → landmark detector → CNN plus statistics on landmark distances → SVM; (2) face → DenseNet/Inception → statistics on extracted features → SVM; (3) face → tuned VGG16 → LSTM; (4) audio → tuned SoundNet.]
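Accuracy-weighted late fusion reduces to a weighted sum of the branch score vectors. The branch accuracies and class scores below are invented for illustration, not numbers from the paper.

```python
import numpy as np

# Hypothetical validation accuracies for four branches (not from the paper).
branch_acc = np.array([0.42, 0.48, 0.51, 0.39])
weights = branch_acc / branch_acc.sum()   # normalise to sum to 1

# Per-branch class scores for one clip (rows: branches, cols: emotion classes).
scores = np.array([
    [0.20, 0.50, 0.30],
    [0.10, 0.70, 0.20],
    [0.30, 0.40, 0.30],
    [0.25, 0.45, 0.30],
])
fused = weights @ scores              # accuracy-weighted sum of branch scores
print(fused.argmax())                 # index of the winning emotion class
```

More accurate branches thus pull the final decision harder, which is a cheap way to exploit branch reliability without any joint training.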
EXISTING APPROACHES
Group-Level Emotion Recognition Using Hybrid Deep Models Based on Faces, Scenes, Skeletons and Visual Attention [6]
3.6 - Guo (2018)
● Visual attention maps are processed by an LSTM as a set
● Skeleton data includes face, posture and hands
[Figure: face CNNs, scene CNNs, a skeleton CNN, and visual attention → LSTM, combined by late fusion. Example of visual attention, from [6].]
EXISTING APPROACHES
Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction [7]
3.7 - Brady (2016)
● Each mode’s error is modelled as measurement noise in the Kalman filter
● Outputs continuous-time valence and arousal values
[Figure: audio → feature extraction → SVM regressor; video → CNN features → LSTM; HRV and EDA → feature extraction → LSTMs; all combined by a Kalman filter (late fusion).]
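The Kalman-filter fusion idea can be illustrated with a simplified 1-D filter that treats each mode's per-frame prediction as a noisy measurement of the true affect value. This is a didactic sketch under a random-walk state model, not Brady's exact formulation; the noise levels are made up.

```python
import numpy as np

def kalman_fuse(measurements, R, Q=0.01):
    """1-D Kalman filter fusing per-mode affect predictions over time.

    measurements: (T, M) predictions from M modes.
    R: per-mode measurement-noise variances; noisier modes pull the
       estimate less (their Kalman gain is smaller).
    Q: process noise of the assumed random-walk affect dynamics.
    """
    x, P = 0.0, 1.0                      # state estimate and its variance
    out = []
    for z in measurements:
        P += Q                           # predict step (random walk)
        for m, z_m in enumerate(z):      # update sequentially with each mode
            K = P / (P + R[m])           # Kalman gain for mode m
            x += K * (z_m - x)
            P *= (1.0 - K)
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(5)
true_valence = np.sin(np.linspace(0, 3, 60))        # toy ground-truth signal
# Mode 0 is reliable (std 0.1), mode 1 is noisy (std 0.5).
meas = true_valence[:, None] + rng.normal(scale=[0.1, 0.5], size=(60, 2))
fused = kalman_fuse(meas, R=np.array([0.1 ** 2, 0.5 ** 2]))
print(fused.shape)  # (60,)
```

Because the gain for each mode scales inversely with its noise variance, an unreliable mode (e.g. a briefly occluded face) automatically contributes less, which is the appeal of this late-fusion scheme.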
DEMO
Our Setup
  ○ AFEW 2018 dataset [8]
  ○ Focus on the Neutral, Happy and Angry emotions
  ○ Precomputed facial features (from the CNN of [9]) and audio features (from openSMILE)
  ○ All code (except feature preprocessing) is our own
  ○ RNN networks trained from scratch
  ○ Tested various RNN architectures
4.0 - Demo Setup
[Figure: audio descriptor and face descriptor feed an RNN (trained by us) producing the output; evaluated on the unseen validation set of AFEW 2017 ‘Emotions in the Wild’.]
DEMO
4.1 - Demo Pipeline
[Figure: demo pipeline — audio descriptor and face descriptor → bidirectional LSTM → output.]
DEMO
4.2 - Our Demo
DEMO
How does the model perform?
  ○ Using our best-performing bidirectional LSTM model
  ○ Trained with modalities both enabled and disabled
4.3 - Demo Performance
PROPOSED METHOD
5.0 - Cross-Modal Learning
Drawbacks of existing approaches
  ○ Augmenting training with noisy/missing audio makes models more robust, but hurts performance
  ○ Encoder-decoders with tied weights do not scale well
  ○ Existing models are not forced to discover correlations across modalities
  ○ Different hidden units of existing models are not forced to learn different modes
[Figure: Multimodal Deep Learning [10] — audio and video inputs are mapped into a shared space and reconstructed as either mode; used for generating transcripts from both speech and ‘lip-reading’.]
PROPOSED METHOD
How can cross-modal learning be combined with RNNs for emotion classification?
● Train mode-wise networks to map into an ‘emotion space’
  ○ Similar emotions map to similar positions in the ‘emotion space’
  ○ Accomplished with a triplet loss and mode-wise encoder networks
  ○ Trained on precomputed features and three emotions
● The RNN learns from the ‘emotion space’
  ○ Concatenate mode features as before, but after mapping into the ‘emotion space’
  ○ The RNN is trained after the encoders
5.1 - Proposed Approach
[Figure: mode features → mode encoder → embedding space, trained so that for each anchor sample, the distance to a same-emotion sample is smaller than to a different-emotion sample: d(anchor, positive) < d(anchor, negative).]
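The triplet objective used to shape the ‘emotion space’ can be sketched directly. Below is a minimal NumPy version of the standard hinge triplet loss; the 2-D embeddings and the margin value are toy placeholders, not values from our experiments.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss: pull same-emotion embeddings together and
    push different-emotion embeddings at least `margin` further away."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)   # anchor ↔ same class
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)   # anchor ↔ other class
    return np.maximum(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings (hypothetical): anchor and positive share an emotion.
a = np.array([0.10, 0.90])
p = np.array([0.15, 0.85])   # same emotion, nearby -> constraint satisfied
n = np.array([0.90, 0.10])   # different emotion, far away
print(triplet_loss(a, p, n))  # 0.0 (margin already satisfied)
```

Each mode-wise encoder is trained against this loss, so that afterwards a single distance comparison in the shared space tells us whether two samples (from any mode) express the same emotion.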
PROPOSED METHOD
5.2 - Pipeline
[Figure: video features → video feature encoder and audio features → audio feature encoder, trained with a triplet loss into a joint embedding space; an RNN (an LSTM, or alternatively an FC network) then classifies from the embeddings.]
PROPOSED METHOD
5.3 - Evaluation
● Results
  ○ Emotion classes embedded into the ‘emotion space’
  ○ Projected to 2D using t-SNE, which is designed to show point distances
  ○ The embedding space is not sufficiently separated
● Possible cause
  ○ We use precomputed features - the feature descriptors cannot change
  ○ Errors cannot propagate back to the descriptor networks to select separable features
[Figure labelled ‘Validation Overfit’.]
PROPOSED METHOD
5.4 - Advantages & Disadvantages
+ Modes have the same meaning independent of each other
+ Easy to tell whether modes are in agreement
+ The RNN no longer needs to learn how modes are related
+ Easy to add other modalities - just train another encoder
+ The RNN can utilise the emotion description in the latent space
- Additional training steps & overhead
- Embedding is difficult with precomputed features
- Projection to the embedding space may remove mode-specific information
CONCLUSION
● Autonomous driving revolution - still a long way to go
  ○ Cars need to monitor drivers to ensure safe ‘hybrid’ operation
● Emotion recognition is a well-established field
  ○ But still very challenging - emotions are difficult to classify consistently
  ○ Multimodal approaches provide measurable benefits
● RNNs work well in multimodal emotion recognition
  ○ They take advantage of sequences of continuous features
6.0 - Summary
CONCLUSION
● Multimodal RNN placement/application research
  ○ Where does the RNN go? (Normally post-fusion, but sometimes before too)
  ○ Where do we fuse features? (Later is generally better)
  ○ What type(s) of RNNs are used? (Bidirectional LSTM/GRU)
● Cross-modal approaches
  ○ Generalise the representation of concepts across different modes
● Our approach
  ○ Combines both - unable to evaluate fully with pretrained features
6.1 - Approaches
[Figure: recap — the Tzirakis pipeline (audio → Audio CNN and faces → Face CNN → LSTM → output) alongside our encoder approach (mode features → mode encoder → embedding space).]
CONCLUSION
● How do we extract features from the various modes?
  ○ Manage representations of emotions from different modes
  ○ Utilise HRV extracted from faces as an additional mode
● Is there a better way to combine features?
  ○ Deal with failed detections in certain modes
  ○ Get the cross-modal representation working with descriptors
● How do we quickly and effectively train RNN encoders?
  ○ More difficult than regular deep networks
  ○ Adopt cutting-edge RNN encoder-decoder architectures
6.2 - Future Direction
Questions
CONCLUSION
[1] Tzirakis, Panagiotis, et al. "End-to-end multimodal emotion recognition using deep neural networks." IEEE Journal of Selected Topics in Signal Processing 11, no. 8 (2017): 1301-1309.
[2] Bucur, Beniamin, et al. "An early fusion approach for multimodal emotion recognition using deep recurrent networks." In 2018 IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP), pp. 71-78. IEEE, 2018.
[3] Sun, Man-Chin, et al. "Context-aware Cascade Attention-based RNN for Video Emotion Recognition." In 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia), pp. 1-6. IEEE, 2018.
[4] Pei, Ercheng, et al. "Multimodal dimensional affect recognition using deep bidirectional long short-term memory recurrent neural networks." In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 208-214. IEEE, 2015.
[5] Liu, Chuanhe, et al. "Multi-feature based emotion recognition for video clips." In Proceedings of the 2018 on International Conference on Multimodal Interaction, pp. 630-634. ACM, 2018.
[6] Guo, Xin, et al. "Group-Level Emotion Recognition using Hybrid Deep Models based on Faces, Scenes, Skeletons and Visual Attentions." In Proceedings of the 2018 on International Conference on Multimodal Interaction, pp. 635-639. ACM, 2018.
[7] Brady, Kevin, et al. "Multi-modal audio, video and physiological sensor learning for continuous emotion prediction." In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 97-104. ACM, 2016.
[8] Dhall, Abhinav, et al. "From individual to group-level emotion recognition: EmotiW 5.0." In Proceedings of the 19th ACM international conference on multimodal interaction, pp. 524-528. ACM, 2017.
[9] Knyazev, Boris, et al. "Convolutional neural networks pretrained on large face recognition datasets for emotion classification from video." arXiv preprint arXiv:1711.04598 (2017).
[10] Ngiam, Jiquan, et al. "Multimodal deep learning." In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689-696. 2011.
References