
RNN Architectures for Emotion Classification in Multimodal Emotion Recognition Systems

Deepan Das · John Ridley · July 19, 2019

Advanced Methods for Emotion Recognition in Highly Automated Driving

Agenda

1. Background

2. Challenges

3. Existing Approaches

4. Demo

5. Proposed Solution

6. Conclusion


BACKGROUND

● Autonomous driving is just being realised
  ○ But there is a massive transition window
  ○ Awkward ‘monitored autonomy’ mix

● Why monitor driver emotion?
  ○ Safety
  ○ Comfort/Luxury

● Continuous affective state monitoring
  ○ Can they safely operate the vehicle?
  ○ Are they likely to be a hazard?
  ○ Can the vehicle influence/improve their mood?


1.0 - Introduction


[Figure: SAE automation levels - Level 1 Hands On, Level 2 Hands Off, Level 3 Attention Off, Level 4 Driver Redundancy, Level 5 No Interaction - annotated ‘Emotion Monitoring Critical’ and ‘Emotion Monitoring Enables Luxury’]


BACKGROUND

● Emotions are...
  ○ not inherently bad, but influence behaviour
  ○ classified in various ways
  ○ varyingly relevant to driving behaviour
  ○ temporally dynamic

1.1 - Let’s talk about our feelings


[Figure: Russell’s Circumplex Model - valence (negative to positive) against arousal (inactive to active), placing emotions such as Happy, Excited, Delighted, Serene, Relaxed, Calm, Fatigued, Depressed, Sad, Upset, Stressed and Tense - shown alongside Plutchik’s Wheel of Emotions]


BACKGROUND

● Deep Neural Networks
  ○ Fundamentally changed the ML landscape
  ○ Not directly applicable to temporally dependent data

● Recurrent Neural Networks
  ○ Unroll networks temporally
  ○ Cells propagate temporal context/state
  ○ Handle time-dependent, variable-length sequences
  ○ Can also input/output single values (see the sketch after the figure below)


1.2 - Recurrent Neural Networks


[Figure: a regular ANN (input, layers, output) contrasted with an RNN unrolled over time steps t, t+1, t+2, where each cell passes state to the next; a bidirectional RNN is also shown]
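To make the unrolling concrete, here is a minimal sketch (in PyTorch, our choice of framework for illustration; all sizes are invented): one recurrent module steps through a feature sequence carrying hidden state, and the final state alone can drive a single-output prediction.

```python
# Minimal sketch of temporal unrolling: one nn.RNN processes a
# feature sequence, carrying hidden state across steps.
# Shapes and sizes are illustrative, not from the talk.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(1, 10, 16)          # (batch, time steps, features)
h0 = torch.zeros(1, 1, 32)          # initial hidden state (context)

out, hn = rnn(x, h0)                # out: per-step outputs, hn: final state
print(out.shape)                    # torch.Size([1, 10, 32])

# The same module can emit a single value for a whole variable-length
# sequence ("many-to-one") by reading only the final hidden state:
classifier = nn.Linear(32, 3)
logits = classifier(hn.squeeze(0))  # one prediction per sequence
```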

BACKGROUND


1.3 - RNN Zoo


[Figure: RNN cell zoo - the vanilla tanh cell (prone to vanishing gradients), the LSTM cell (three sigmoid gates plus tanh), and the GRU cell (two sigmoid gates plus tanh) - with notes on gradient pathways, longer-term memory, stacked RNN cells and reverse gradient flow]
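The cells in the figure can be contrasted in a few lines of PyTorch (illustrative sizes, not from the talk): the LSTM keeps a separate cell state behind its gates, the GRU folds gating into a single hidden state, and stacking cells corresponds to num_layers > 1.

```python
# Sketch contrasting the gated cells named in the figure above.
# Both gate designs ease the vanishing-gradient problem of vanilla cells.
import torch
import torch.nn as nn

x = torch.randn(4, 16)                    # one time step, batch of 4

lstm = nn.LSTMCell(input_size=16, hidden_size=32)
h, c = torch.zeros(4, 32), torch.zeros(4, 32)
h, c = lstm(x, (h, c))                    # hidden state + separate cell state

gru = nn.GRUCell(input_size=16, hidden_size=32)
g = torch.zeros(4, 32)
g = gru(x, g)                             # single hidden state, fewer gates

# Stacking cells for longer-term memory corresponds to num_layers > 1:
deep = nn.LSTM(input_size=16, hidden_size=32, num_layers=2, batch_first=True)
```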

BACKGROUND

● Combine modes for the ‘best of both worlds’
  ○ Certain modes can better indicate certain states [1]
  ○ More difficult than it seems


1.4 - Multimodal Emotion Classification


● Emotions can be observed in many ways - common sensor modalities:
  ○ Video: images, faces
  ○ Audio: signal, dialogue
  ○ Physiological: EEG, ECG, EOG, EDA

CHALLENGES

● Which modes to utilise, given a driving context?
  ○ Most physiological modes cannot be used - they require intrusive sensors
  ○ We focus on audio/visual modes (and variants) - the scope of most existing research

● How do we utilise RNNs?
  ○ RNNs work with sequences of aligned features
  ○ Where do we use the RNNs (before/after mode fusion)?

● Where/how are the modes and RNN(s) combined in the pipeline?


2.0 - Scope of Challenge

[Diagram: Video Features + Audio Features → ? RNN → Classification]

CHALLENGES

● Mode Variability
  ○ Different sampling rates and numerical dimensions (see the alignment sketch after the table below)
  ○ Reliability/robustness of sensors or preprocessing methods

● Mode Applicability
  ○ Certain modes can better indicate certain states [1]
  ○ There are positive and negative conditions for both mode types


2.1 - Mode-Based Challenges

Audio                        Video
+ Driver conversing          + Driver clearly visible
- Passenger conversing       - Face (feature) not visible
- Ambient noise              - Poor lighting
- Silence                    - Driver not visible
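One plausible way to deal with the sampling-rate mismatch noted under Mode Variability is to resample the denser feature track to the sparser one before fusion. The sketch below assumes 100 Hz audio features and 25 fps video features; the rates and dimensions are invented for illustration.

```python
# Hedged sketch: interpolate the audio feature track down to the video
# frame rate so the two modes align step-for-step before fusion.
import torch
import torch.nn.functional as F

audio = torch.randn(1, 40, 200)   # (batch, feat dim, 200 steps @ 100 Hz)
video = torch.randn(1, 512, 50)   # (batch, feat dim, 50 frames @ 25 fps)

# Linear interpolation along time brings audio to 50 aligned steps.
audio_aligned = F.interpolate(audio, size=50, mode='linear',
                              align_corners=False)
fused = torch.cat([audio_aligned, video], dim=1)  # (1, 552, 50)
```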

CHALLENGES

● Preprocessing
  ○ Extraction of salient regions
  ○ Feature extraction (e.g. facial keypoints, audio features)

● Fusion
  ○ How/where are the modes combined (early or late)? See the sketch below.
  ○ How are mode failure states handled/trained?

● RNN Placement
  ○ Before/after fusion (or both)
  ○ RNN type and depth
  ○ Combination with CNNs
  ○ Resource and gradient limitations


2.2 - Technical Challenges

[Images: realistic driving face vs. unrealistic driving face]
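The early/late distinction from the Fusion item above can be made concrete with a short sketch (module sizes are ours, not from any reviewed paper): early fusion concatenates features before a single shared RNN, while late fusion runs one RNN per mode and merges the decisions.

```python
# Sketch of the two fusion placements; sizes are illustrative.
import torch
import torch.nn as nn

audio = torch.randn(1, 50, 40)    # (batch, time, audio features)
video = torch.randn(1, 50, 512)   # (batch, time, video features)

# Early fusion: one RNN sees the concatenated feature stream.
early_rnn = nn.LSTM(40 + 512, 128, batch_first=True)
_, (h, _) = early_rnn(torch.cat([audio, video], dim=2))

# Late fusion: mode-wise RNNs, decisions merged at the end.
a_rnn = nn.LSTM(40, 64, batch_first=True)
v_rnn = nn.LSTM(512, 64, batch_first=True)
head = nn.Linear(64, 3)
_, (ha, _) = a_rnn(audio)
_, (hv, _) = v_rnn(video)
late_logits = (head(ha[-1]) + head(hv[-1])) / 2   # simple averaging
```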

EXISTING APPROACHES

● Constraints for our reviewed approaches
  ○ Audio/visual data for ‘emotions in the wild’
  ○ Discrete emotion classification (but also some regression techniques)

● Some caveats
  ○ There are numerous datasets, all with different classifications and samples
  ○ No consistently used dataset - difficult to compare cross-paper results
  ○ ‘In the wild’ can imply acted or dramatised scenes


3.0 - Problem Statement

[Images: AFEW dataset examples]

EXISTING APPROACHES

End-to-End Multimodal Emotion Recognition using Deep Neural Networks [1]


3.1 - Tzirakis (2017)

● Mode-wise custom CNNs feed a multimodal LSTM (see the sketch after the diagram below)
● Output: regressed arousal and valence
● The fusion approach yields the best of both modalities

[Diagram: Audio → Audio CNN, Faces → Face CNN, both into an LSTM → Output; end-to-end trainable]
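A loose sketch of this style of pipeline, with layer sizes that are ours rather than those of [1]: mode-wise CNNs feed one multimodal LSTM that regresses arousal and valence, and gradients flow end to end.

```python
# Hedged sketch of a Tzirakis-style end-to-end pipeline; all sizes ours.
import torch
import torch.nn as nn

class AVEmotionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_cnn = nn.Sequential(      # raw waveform -> features
            nn.Conv1d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.face_cnn = nn.Sequential(       # face crop -> features
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.head = nn.Linear(128, 2)        # arousal, valence

    def forward(self, audio, faces):
        # audio: (B, T, 1, samples); faces: (B, T, 3, H, W)
        B, T = audio.shape[:2]
        a = self.audio_cnn(audio.flatten(0, 1)).view(B, T, -1)
        v = self.face_cnn(faces.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(torch.cat([a, v], dim=2))
        return self.head(out[:, -1])         # trained end to end

model = AVEmotionNet()
pred = model(torch.randn(2, 10, 1, 640), torch.randn(2, 10, 3, 48, 48))
```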

EXISTING APPROACHES

An Early Fusion Approach for Multimodal Emotion Recognition using Deep Recurrent Networks [2]


3.2 - Bucur (2018)

● Concatenated RNN - no mode-dependent CNNs
● Inputs: audio, eye, face and depth features (pre-labelled and rate-corrected)
● Classified 6 emotions

[Diagram: Audio, Eyes, Faces and Depth features feeding the compared RNN architectures - LSTM, Bidirectional LSTM, GRU, Bidirectional GRU - arranged as Parallel RNN, Concatenated RNN or Context/Face RNN]

EXISTING APPROACHES

Context-aware Cascade Attention-based RNN for Video Emotion Recognition [3]


3.3 - Sun (2018)

● Image CNNs applied across the video sequence, followed by LSTMs
● Face and context (whole image) modes

[Diagram: per-frame Face CNNs and Context CNNs feeding a Face LSTM and a Context LSTM in the context-aware attention-based RNN]

EXISTING APPROACHES

Context-aware Cascade Attention-based RNN for Video Emotion Recognition [3]


3.3 - Sun (2018)

● Proposed a Context Attention mechanism (sketched below)
● Classified 8 emotions

[Diagram: Face CNN → Face LSTM and Context CNN → Context LSTM, joined by an Attention module; example of visual attention]
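One plausible form of such an attention step, given as our own illustration rather than the exact formulation of [3]: the final face LSTM state scores the time steps of the context LSTM output.

```python
# Hedged sketch: face state attends over context LSTM outputs.
import torch
import torch.nn.functional as F

face_h = torch.randn(1, 128)       # final face LSTM hidden state
context = torch.randn(1, 10, 128)  # per-step context LSTM outputs

scores = torch.bmm(context, face_h.unsqueeze(2)).squeeze(2)  # (1, 10)
weights = F.softmax(scores, dim=1)                           # attention
attended = (weights.unsqueeze(2) * context).sum(dim=1)       # (1, 128)
fused = torch.cat([face_h, attended], dim=1)                 # to classifier
```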

EXISTING APPROACHES

Multimodal Dimensional Affect Recognition using Deep Bidirectional LSTM RNNs [4]


3.4 - Pei (2015)

● Deeper (bidirectional) LSTMs
● Combines mode-wise and multimodal RNNs with a moving average
● Output: regressed arousal, valence and dominance

[Diagram: Audio and Video Frames feeding Bidirectional LSTMs and Deep Bidirectional LSTMs (DBLSTMs), stacked across stages]

EXISTING APPROACHES

Multi-Feature Based Emotion Recognition for Video Clips [5]


3.5 - Liu (2018)

[Diagram, fused late: Face → Landmark Detector → stats on landmark distances → SVM; Face → DenseNet/Inception → stats on extracted features → SVM; Face → tuned VGG16 → LSTM; Audio → tuned SoundNet]

● Late fusion, weighted by the accuracy of each branch (see the sketch below)
● Not end-to-end trainable - to be avoided despite good performance

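The weighting strategy can be sketched in a few lines; the branch accuracies and class count below are invented for illustration, not taken from [5].

```python
# Hedged sketch of late fusion weighted by per-branch validation accuracy.
import torch
import torch.nn.functional as F

branch_logits = [torch.randn(1, 7) for _ in range(4)]  # 4 branches, 7 classes
val_acc = torch.tensor([0.42, 0.48, 0.51, 0.39])       # hypothetical accuracies
weights = val_acc / val_acc.sum()

probs = torch.stack([F.softmax(l, dim=1) for l in branch_logits])
fused = (weights.view(-1, 1, 1) * probs).sum(dim=0)    # weighted vote
prediction = fused.argmax(dim=1)
```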

EXISTING APPROACHES

Group-Level Emotion Recognition Using Hybrid Deep Models Based on Faces, Scenes, Skeletons and Visual Attention [6]


3.6 - Guo (2018)

[Diagram: Face CNNs, Scene CNNs, a Skeleton CNN and a Visual Attention branch (with LSTM), combined by late fusion]

● Visual attention regions are processed by an LSTM as a set
● Skeleton data includes face, posture and hands

Example of visual attention, from [6]

EXISTING APPROACHES

Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction [7]


3.7 - Brady (2016)

[Diagram: Audio → feature extraction → SVM regressor; Video → CNN features → LSTM; HRV → feature extraction → LSTM; EDA → feature extraction → LSTM; all fused by a Kalman filter (late fusion)]

● Individual mode error is modeled as measurement noise in the Kalman filter
● Outputs continuous-time valence and arousal values
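A toy version of this fusion with a single scalar state (say, valence) and invented noise values: each mode's prediction updates the estimate through a Kalman gain, so noisier modes move it less.

```python
# Toy sketch of Kalman-style late fusion in the spirit of [7]:
# per-mode error becomes measurement noise; all values are invented.
import numpy as np

x, p = 0.0, 1.0            # state estimate and its variance
q = 0.01                   # process noise (state drift between frames)
r = {'audio': 0.30, 'video': 0.10}   # per-mode measurement noise

def step(predictions):
    global x, p
    p += q                               # predict: state may drift
    for mode, z in predictions.items():  # update with each mode
        k = p / (p + r[mode])            # Kalman gain: trust vs. noise
        x = x + k * (z - x)
        p = (1 - k) * p
    return x

for t in range(5):
    fused = step({'audio': 0.5 + 0.1 * t, 'video': 0.4 + 0.1 * t})
    print(f"t={t} fused valence={fused:.3f}")
```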

DEMO

Our Setup
  ○ AFEW 2018 dataset [8]
  ○ Focus on Neutral, Happy and Angry emotions
  ○ Precomputed facial features (from the CNN of [9]) and audio features (from openSMILE)
  ○ All code (except feature preprocessing) is our own
  ○ RNN networks trained from scratch
  ○ Tested various RNN architectures


4.0 - Demo Setup

[Diagram: Audio Descriptor and Face Descriptor → RNN (trained by us) → Output; evaluated on the AFEW 2017 ‘Emotions in the Wild’ unseen validation set]

DEMO


4.1 - Demo Pipeline

[Diagram: Audio Descriptor and Face Descriptor → Bidirectional LSTM → Output]
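A minimal sketch matching the pipeline's shape: concatenated audio and face descriptors feed a bidirectional LSTM with three output classes. The descriptor dimensions below are assumptions, not the demo's actual values.

```python
# Sketch of the demo pipeline shape; dimensions are assumptions.
import torch
import torch.nn as nn

class DemoClassifier(nn.Module):
    def __init__(self, audio_dim=1582, face_dim=1024, classes=3):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim + face_dim, 128,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 128, classes)  # forward + backward states

    def forward(self, audio, face):
        x = torch.cat([audio, face], dim=2)      # (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])             # Neutral / Happy / Angry

model = DemoClassifier()
logits = model(torch.randn(1, 20, 1582), torch.randn(1, 20, 1024))
```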

DEMO


4.2 - Our Demo

DEMO

How does the model perform?
  ○ Using our best-performing bidirectional LSTM model
  ○ Trained with modalities both enabled and disabled


4.3 - Demo Performance

PROPOSED METHOD


5.0 - Cross Modal Learning

Drawbacks of existing approaches
  ○ Augmenting training with noisy/missing audio makes models more robust, but hurts performance
  ○ Encoder-decoders with tied weights do not scale well
  ○ Existing models are not forced to discover correlations across modalities
  ○ Different hidden units of existing models are not forced to learn different modes

[Diagram: Audio and Video mapped into a Shared Space, with cross-modal reconstruction of Video from Audio and Audio from Video]

Multimodal Deep Learning [10]

Generating transcripts from both speech and ‘lip-reading’

PROPOSED METHOD

How can cross-modal learning be combined with RNNs for emotion classification?

● Train mode-wise networks to map into an ‘emotion space’
  ○ Similar emotions map to similar positions in the ‘emotion space’
  ○ Accomplished with a triplet loss and mode-wise encoder networks
  ○ Trained on precomputed features and three emotions

● An RNN learns from the ‘emotion space’
  ○ Concatenate mode features as before, but after mapping into the ‘emotion space’
  ○ The RNN is trained after the encoders (see the sketch after the pipeline below)


5.1 - Proposed Approach

[Diagram: Mode Features → Mode Encoder → Embedding Space, with triplet constraints of the form d(anchor, positive) < d(anchor, negative)]

PROPOSED METHOD


5.2 - Pipeline

[Diagram: Video Features → Video Feature Encoder and Audio Features → Audio Feature Encoder → Joint Embedding Space (trained with the Triplet Loss) → RNN (FC or LSTM)]
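A sketch of the encoder training step under the setup above; the MLP encoders, dimensions, margin and triplet sampling are illustrative choices, not our exact configuration.

```python
# Hedged sketch: mode-wise encoders map precomputed features into a
# joint 'emotion space', shaped by a triplet loss.
import torch
import torch.nn as nn

audio_enc = nn.Sequential(nn.Linear(1582, 256), nn.ReLU(), nn.Linear(256, 64))
video_enc = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 64))
triplet = nn.TripletMarginLoss(margin=1.0)

# Anchor and positive share an emotion label; negative differs. One
# possible sampling choice: draw anchor and positive from different
# modes, tying the two embedding spaces together.
anchor   = video_enc(torch.randn(8, 1024))
positive = audio_enc(torch.randn(8, 1582))
negative = audio_enc(torch.randn(8, 1582))

loss = triplet(anchor, positive, negative)  # pushes d(a,p) + margin below d(a,n)
loss.backward()

# The downstream RNN is then trained on the embedded, concatenated features.
```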

PROPOSED METHOD


5.3 - Evaluation

● Results
  ○ Emotion classes embedded into the ‘emotion space’
  ○ Projected to 2D using t-SNE, which is designed to preserve point distances (see the sketch below)
  ○ The embedding space is not sufficiently separated

● Possible Cause
  ○ We use precomputed features - the feature descriptors cannot change
  ○ Errors cannot propagate back to the descriptor networks to select separable features

[Plot: validation overfit]
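For reference, the 2D projection step can be done with scikit-learn's t-SNE; the sketch below uses dummy embeddings and labels rather than our data.

```python
# Sketch of the evaluation step: project embeddings to 2D with t-SNE
# to inspect class separation. Inputs are dummies.
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.randn(300, 64)          # learned 'emotion space'
labels = np.random.randint(0, 3, size=300)     # Neutral / Happy / Angry

points_2d = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
# Plot points_2d coloured by label; overlapping clusters indicate an
# insufficiently separated embedding space.
```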

PROPOSED METHOD


5.4 - Advantages & Disadvantages

+ Modes have the same meaning independent of each other

+ Easy to tell whether modes are in agreement

+ RNN no longer needs to learn how modes are related

+ Easy to add other modalities - just train another encoder

+ RNN can utilise the emotion description in the latent space

- Additional training steps & overhead

- Embedding difficult with precomputed features

- Projection to embedding space may remove mode-specificities

CONCLUSION

● Autonomous driving revolution - still a long way to go
  ○ Cars need to monitor drivers to ensure safe ‘hybrid’ operation

● Emotion recognition is a well-established field
  ○ But still very challenging - emotions are difficult to classify consistently
  ○ Multimodal approaches provide measurable benefits

● RNNs work well in multimodal emotion recognition
  ○ They take advantage of sequences of continuous features


6.0 - Summary

CONCLUSION

● Multimodal RNN placement/application research
  ○ Where does the RNN go? (Normally post-fusion, but sometimes before too)
  ○ Where do we fuse features? (Later is generally better)
  ○ What type(s) of RNNs are used? (Bidirectional LSTM/GRU)

● Cross-modal approaches
  ○ Generalise the representation of concepts across different modes

● Our approach
  ○ Combines both - unable to evaluate with pretrained features


6.1 - Approaches

[Recap diagrams: Audio → Audio CNN and Faces → Face CNN → LSTM → Output (from [1]); Mode Features → Mode Encoder → Embedding Space (our approach)]

CONCLUSION

● How do we extract features from various modes?
  ○ Manage representations of emotions from different modes
  ○ Utilise HRV extracted from faces as an additional mode

● Is there a better way to combine features?
  ○ Deal with failed detections in certain modes
  ○ Get the cross-modal representation working with descriptors

● How do we quickly and effectively train RNN encoders?
  ○ More difficult than regular deep networks
  ○ Adopt cutting-edge RNN encoder-decoder architectures


6.2 - Future Direction

Questions

CONCLUSION

[1] Tzirakis, Panagiotis, et al. "End-to-end multimodal emotion recognition using deep neural networks." IEEE Journal of Selected Topics in Signal Processing 11, no. 8 (2017): 1301-1309.

[2] Bucur, Beniamin, et al. "An early fusion approach for multimodal emotion recognition using deep recurrent networks." In 2018 IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP), pp. 71-78. IEEE, 2018.

[3] Sun, Man-Chin, et al. "Context-aware Cascade Attention-based RNN for Video Emotion Recognition." In 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia), pp. 1-6. IEEE, 2018.

[4] Pei, Ercheng, et al. "Multimodal dimensional affect recognition using deep bidirectional long short-term memory recurrent neural networks." In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 208-214. IEEE, 2015.

[5] Liu, Chuanhe, et al. "Multi-feature based emotion recognition for video clips." In Proceedings of the 2018 on International Conference on Multimodal Interaction, pp. 630-634. ACM, 2018.

[6] Guo, Xin, et al. "Group-Level Emotion Recognition using Hybrid Deep Models based on Faces, Scenes, Skeletons and Visual Attentions." In Proceedings of the 2018 on International Conference on Multimodal Interaction, pp. 635-639. ACM, 2018.

[7] Brady, Kevin, et al. "Multi-modal audio, video and physiological sensor learning for continuous emotion prediction." In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 97-104. ACM, 2016.

[8] Dhall, Abhinav, et al. "From individual to group-level emotion recognition: EmotiW 5.0." In Proceedings of the 19th ACM international conference on multimodal interaction, pp. 524-528. ACM, 2017.

[9] Knyazev, Boris, et al. "Convolutional neural networks pretrained on large face recognition datasets for emotion classification from video." arXiv preprint arXiv:1711.04598 (2017).

[10] Ngiam, Jiquan, et al. "Multimodal deep learning." In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689-696. 2011.


References