
Gated Recurrent Unit (GRU) for Emotion Classification from Noisy Speech

Rajib Rana∗, Julien Epps†, Raja Jurdak‡, Xue Li§, Roland Goecke¶, Margot Brereton‖ and Jeffrey Soar∗∗

∗Institute of Resilient Regions (IRR), University of Southern Queensland, Springfield, QLD 4300
Email: [email protected]
†School of Electrical Engineering and Telecommunications, University of New South Wales
‡Distributed Sensing Systems, CSIRO-Data61
§School of Information Technology and Electrical Engineering, University of Queensland
¶Information Technology & Engineering, University of Canberra
‖Computer Human Interaction, Queensland University of Technology
∗∗School of Management and Enterprise, University of Southern Queensland

Abstract—Despite the enormous interest in emotion classification from speech, the impact of noise on emotion classification is not well understood. This matters because the tremendous advancement of smartphone technology makes the smartphone a powerful medium for speech emotion recognition in natural environments outside the laboratory, where background noise is likely to contaminate the speech. We capitalize on the recent breakthrough of Recurrent Neural Networks (RNNs) and investigate their performance for emotion classification from noisy speech. We particularly focus on the recently proposed Gated Recurrent Unit (GRU), which is yet to be explored for emotion recognition from speech. Experiments conducted with speech compounded with eight different types of noise reveal that the GRU incurs an 18.16% smaller run-time while performing quite comparably to the Long Short-Term Memory (LSTM), the most popular Recurrent Neural Network proposed to date. This result is promising for embedded platforms in general and will motivate further studies to utilize the GRU to its full potential for emotion recognition on smartphones.

I. INTRODUCTION

Automatic speech emotion recognition has gained increasing interest in both research and commercial spaces due to its tremendous potential to determine affective states [1]. It can offer unprecedented opportunities for human-computer interaction systems; for example, recommendation systems can be designed to make more accurate affect-sensitive recommendations [2]. It will also be highly beneficial for burgeoning health and wellbeing applications; for example, affect diaries can be built to keep people aware of any prolonged negative mood, potentially preventing the onset or relapse of affective disorders [3].

In recent years, Deep Learning models have revolutionized many fields, in particular automatic speech recognition and computer vision [4], [5]. These models have also achieved improved performance in emotion recognition compared to conventional machine learning algorithms. For example, [6], [7], [8], [9], [10], [11], [12], and [13] present some of the best reported results for affective speech analysis using deep learning models.

Out of the many forms of Deep Learning models, we consider Recurrent Neural Networks (RNNs), as they are designed to model long-range dependencies between successive observations. Speech emotion possesses temporal dependency, as it is unlikely to change rapidly between subsequent speech utterances [14]. Recurrent Neural Networks are therefore well suited to emotion recognition from speech. However, RNNs often suffer from the classical "vanishing" and "exploding" gradient problems and consequently fail to learn long-range dependencies. Two variants of Recurrent Neural Networks that use a "gating" approach have been proposed to avoid these problems: (1) the Long Short-Term Memory (1997) [15] and (2) the Gated Recurrent Unit (2014) [16].

Smartphones are excellent platforms for speech emotion recognition, as people keep their phones close for extended periods; research shows that almost 79% of people have their smartphones with them 22 hours a day¹. In addition, the processing capacity of modern smartphones is extraordinary. Within the scope of this paper, we assume that speech samples are collected during phone conversations [17].

The key challenge of emotion recognition from phone conversations is background noise, as people can talk over the phone anywhere, including cafeterias, parks, offices, and many other places. Mainstream research in emotion recognition from speech has mainly focused on clean speech captured in controlled laboratory environments. The impact of noise on emotion classification is therefore not well understood [18].

Another major hurdle to running any Deep Learning model on a smartphone is computational complexity. Despite advances in processing capacity, smartphones are still limited by battery power, so a method with a small run-time is a must. Of the LSTM and the GRU, the GRU has the relatively simpler architecture; our focus is therefore mainly on the GRU.

¹http://www.adweek.com/socialtimes/smartphones/480485


A number of studies have looked into the performance of the LSTM (e.g., [19], [20], [21]) for emotion recognition from speech; however, the performance of the GRU for this task is currently unexplored.

In this paper, we address the above two challenges. The contributions of this paper are as follows:

1) To the best of our knowledge, we are the first to analyze the accuracy and run-time complexity of the Gated Recurrent Unit for emotion classification from speech.

2) We superimpose various real-life noises on clean speech and analyze the classification performance of the Gated Recurrent Unit under various noise exposures.

3) Using the LSTM as the benchmark, we conduct a detailed accuracy and run-time comparison of these two gated recurrent units for emotion classification from noisy speech.

The paper is organized as follows. In the next section, we provide background information on Recurrent Neural Networks, followed by a description of how the GRU can be used for emotion classification from noisy speech in Section III. We then present the results and discussion in Section IV. This is followed by existing work in Section V; finally, we conclude in Section VI.

[Fig. 1: Recurrent Neural Networks. The network is unrolled over time steps t-1, t, t+1, with inputs x, hidden states s, and outputs o connected by the weight matrices U, W, and V.]

II. RECURRENT NEURAL NETWORKS

The conventional feed-forward neural network is trained on labeled data until it minimizes the prediction error, whereas the decision an RNN reaches at time step t-1 affects the decision it will reach at time step t. RNNs therefore have two input sources, the present and the recent past, which they combine to determine the response to a new input.

An example Recurrent Neural Network (RNN) is shown in Figure 1, and the related symbols are defined in Table I. A hidden state ($s_t$) is a non-linear transformation of a linear combination of the input ($x_t$) and the previous hidden state ($s_{t-1}$). The output ($o_t$) at time t depends only on the current hidden state $s_t$. The output is associated with a probability $p_t$ determined through a "softmax" function. In neural networks, a softmax function is implemented in the final layer of a network used for classification: for a problem with K classes, it determines the probability of a probe belonging to each of the K classes.

$$ s_t = \tanh(W s_{t-1} + U x_t) $$

$$ o_t = V s_t $$

TABLE I: Symbol Definitions

  Symbol             Definition
  x_t ∈ R^d          d-dimensional input vector
  s_t ∈ R^p          p-dimensional hidden unit
  o_t ∈ R^{d'}       d'-dimensional output
  U ∈ R^{d×p}        maps the d-dimensional input to the p-dimensional hidden unit
  V ∈ R^{p×d'}       maps the p-dimensional hidden unit to the d'-dimensional output
  W ∈ R^{p×p}        maps one p-dimensional hidden unit to another hidden unit
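To make the recurrence concrete, here is a minimal NumPy sketch of one vanilla RNN step with the softmax output described above. It is illustrative only: the weights are random, and the matrices are stored transposed relative to Table I (e.g., U is p×d) so that `U @ x_t` maps the input to the hidden space.

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating, for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V):
    """One vanilla RNN step: s_t = tanh(W s_{t-1} + U x_t), o_t = V s_t."""
    s_t = np.tanh(W @ s_prev + U @ x_t)
    o_t = V @ s_t
    p_t = softmax(o_t)        # class probabilities over the K = d' classes
    return s_t, o_t, p_t

# Illustrative dimensions: d-dim input, p hidden units, d' output classes.
d, p, d_out = 13, 8, 7
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(p, d))
W = rng.normal(scale=0.1, size=(p, p))
V = rng.normal(scale=0.1, size=(d_out, p))

s = np.zeros(p)
for x in rng.normal(size=(5, d)):   # a toy sequence of 5 input frames
    s, o, prob = rnn_step(x, s, U, W, V)
```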

Vanishing and exploding gradients: both of these terms were introduced by Bengio et al. [22] in 1994. The exploding gradient problem refers to a large increase in the norm of the gradient during training. The vanishing gradient problem refers to the opposite behavior, where long-term components go exponentially fast to norm 0, making it impossible for the model to learn correlations between temporally distant events.

Any quantity multiplied repeatedly by a factor slightly greater than one can become immeasurably large (exploding); likewise, a quantity multiplied repeatedly by a factor slightly less than one shrinks toward zero (vanishing). Because the layers and time steps of deep neural networks relate to each other through multiplication, derivatives are susceptible to vanishing or exploding.
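A toy calculation of ours, not from the paper, makes the effect concrete:

```python
# Repeatedly multiplying by a factor slightly below or above 1:
# over 100 steps the product either collapses to ~0 or blows up.
for g in (0.9, 1.1):
    print(g, [round(g ** t, 6) for t in (1, 25, 50, 100)])
# 0.9 -> 0.9, 0.071790, 0.005154, 0.000027   (vanishing)
# 1.1 -> 1.1, 10.83, 117.39, 13780.61        (exploding)
```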

Encouragingly, there are a few ways to address these gradient problems. Proper initialization of the weight matrix (W) and regularization can reduce the impact of vanishing gradients. Another possible solution is to use the Rectified Linear Unit (ReLU) [23] instead of the conventional tanh or sigmoid activation functions. The ReLU derivative is either 0 or 1, so it is not as likely to cause a vanishing or exploding gradient. However, ReLU units can be fragile during training and can "die": a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron never activates on any data point again. If the learning rate is set too high, as much as 40% of the neurons may never activate across the entire training dataset. With a carefully chosen learning rate, this can be avoided.
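The 0-or-1 derivative, and the way a unit can die, can both be seen in a small illustration of ours (the bias value is contrived to force a dead unit):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
relu_grad = lambda z: (z > 0).astype(float)   # derivative is exactly 0 or 1

z = np.array([-2.0, -0.5, 0.3, 1.7])
print(relu_grad(z))   # [0. 0. 1. 1.] -- no factor < 1 to shrink gradients

# A "dead" ReLU: a large negative bias keeps the pre-activation below zero
# for every input, so the gradient is 0 everywhere and the unit never recovers.
w, b = np.array([0.2, -0.1]), -10.0
inputs = np.random.default_rng(1).normal(size=(1000, 2))
print(np.all(relu_grad(inputs @ w + b) == 0))   # True
```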

An even more effective solution is to use the gated Recurrent Neural Network architectures: the Long Short-Term Memory (LSTM) or the Gated Recurrent Unit (GRU). LSTMs, first proposed by Hochreiter et al. in 1997, are among the most widely used Recurrent Neural Networks today. GRUs, first proposed in 2014, are simplified versions of LSTMs. Both of these RNN architectures were purposely built to address the vanishing and exploding gradient problems and to efficiently learn long-range dependencies. To assist the understanding of the GRU in the next section, we first describe the LSTM.

A. Long Short-Term Memory Units (LSTMs)

The Long Short-Term Memory is a special kind of Recurrent Neural Network that eliminates the vanishing and exploding gradient problems of plain RNNs. This makes the LSTM suitable for learning from history to classify, process, and predict time series when there are very long and unknown time lags between important events. An LSTM network is made up of LSTM blocks.

[Fig. 2: Gated Recurrent Neural Networks. (a) Long Short-Term Memory unit, with input gate i, forget gate f, and output gate o acting on the memory cell c. (b) Gated Recurrent Unit, with update gate z and reset gate r acting on the hidden activation h.]

Each block has three gates, the input, output, and forget gates, which help it to remember a value for an arbitrarily long time, forget the value when it is no longer important, or output the value. LSTMs help preserve the error that can be backpropagated through time and layers. By maintaining a more constant error, they allow recurrent nets to continue learning over many time steps, for example over 1000.

LSTMs keep information outside the normal flow of the recurrent network in a gated cell. Information can be stored in, written to, or read from a cell, much like data in a computer's memory. The cell makes decisions about what to store and when to allow reads, writes, and erasures, via gates that open and close. Unlike the digital storage on computers, however, these gates are analog, implemented with element-wise multiplication by sigmoids, which all lie in the range [0, 1]. Analog has the advantage over digital of being differentiable, and is therefore suitable for backpropagation.

These gates act on the signals they receive and, similar to the nodes of a neural network, they block or pass information based on its strength and import, which they filter with their own sets of weights. These weights, like the weights that modulate the input and hidden states, are adjusted via the recurrent network's learning process. That is, the cells learn when to allow data to enter, leave, or be deleted through the iterative process of making predictions, backpropagating error, and adjusting weights via gradient descent.

An LSTM's memory cells give different roles to addition and multiplication in the transformation of the input. Instead of determining the subsequent cell state by multiplying the current state with the new input, they add the two, which helps them preserve a constant error when it must be backpropagated at depth.

Different sets of weights filter the input for the input, output, and forget gates. The forget gate is represented as a linear identity function because, if the gate is open, the current state of the memory cell is simply multiplied by one to propagate forward one more time step.

LSTMs need a forget gate even though their purpose is to link distant occurrences to a final output. This can be justified with an example: while analyzing a text corpus, when the pointer reaches the end of a document, it may be that the next document has no correlation with the previous one. The memory cell should therefore be set to zero before the net ingests the first element of the next document.

An LSTM cell is shown in Fig. 2a. It is not limited to computing a weighted sum of the inputs and then applying a nonlinear function; rather, each j-th LSTM unit maintains a memory $c_t^j$ at time t. The activation of the LSTM unit is

$$ h_t^j = o_t^j \tanh(c_t^j). $$

The output gate $o_t^j$ modulates the amount of memory content exposure. With $V_o$ a diagonal matrix, it is calculated as

$$ o_t^j = \sigma(W_o x_t + U_o h_{t-1} + V_o c_t)^j. $$

The memory $c_t^j$ is updated by partially forgetting the existing memory and adding new memory content $\tilde{c}_t^j$. The extent to which the existing memory is forgotten is controlled by a forget gate $f_t^j$,

$$ f_t^j = \sigma(W_f x_t + U_f h_{t-1} + V_f c_{t-1})^j, $$

and the extent to which the new memory content is added is controlled by an input gate $i_t^j$,

$$ i_t^j = \sigma(W_i x_t + U_i h_{t-1} + V_i c_{t-1})^j. $$

Controlled by these gates, the existing memory is updated using the following equations:

$$ \tilde{c}_t^j = \tanh(W_c x_t + U_c h_{t-1})^j, $$

$$ c_t^j = f_t^j c_{t-1}^j + i_t^j \tilde{c}_t^j. $$
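As a concrete rendering of these equations, below is a minimal NumPy sketch of a single LSTM step. It is an illustrative reconstruction, not the paper's implementation: the weights are random, and the diagonal matrices V_o, V_f, V_i are stored as length-p vectors so that `V * c` computes the diagonal product.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def init_lstm(d, p, rng):
    # W* map the d-dim input, U* map the p-dim hidden state; the diagonal
    # V* matrices are stored as length-p vectors.
    P = {}
    for g in "ofic":
        P["W" + g] = rng.normal(scale=0.1, size=(p, d))
        P["U" + g] = rng.normal(scale=0.1, size=(p, p))
        if g != "c":
            P["V" + g] = rng.normal(scale=0.1, size=p)
    return P

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step following the equations above (all p units at once)."""
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["Vf"] * c_prev)  # forget gate, peeks at c_{t-1}
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["Vi"] * c_prev)  # input gate, peeks at c_{t-1}
    c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev)               # candidate memory content
    c_t = f * c_prev + i * c_tilde                                    # additive memory update
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["Vo"] * c_t)     # output gate, peeks at c_t
    return o * np.tanh(c_t), c_t                                      # activation h_t and memory c_t

d, p = 13, 4
rng = np.random.default_rng(0)
P = init_lstm(d, p, rng)
h = c = np.zeros(p)
for x in rng.normal(size=(5, d)):   # a toy 5-frame sequence
    h, c = lstm_step(x, h, c, P)
```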

B. Gated Recurrent Units (GRUs)

The Gated Recurrent Unit is a slightly simplified variation of the LSTM. It combines the forget and input gates into a single "update gate" and has an additional "reset gate". The resulting model is simpler than standard LSTM models and is becoming increasingly popular.

[Fig. 3: Experimental setting for emotion classification from noisy speech.]

TABLE II: LSTM vs GRU

  Controlled memory exposure
    LSTM: the amount of memory seen by the other units of the network is controlled by the output gate
    GRU:  the whole memory is exposed to the network
  New memory computation
    LSTM: no separate control over the amount of information flow from the previous time step
    GRU:  controls the information flow from the previous activation
  Complexity vs performance
    LSTM: has an additional gate and is likely to have higher complexity
    GRU:  has fewer parameters, and thus may train comparatively faster or need less data to generalize

Like the LSTM, the Gated Recurrent Unit modulates information inside the unit, but without a separate memory cell (see Fig. 2b). The activation $h_t^j$ of the GRU at time t is a linear interpolation between the previous activation $h_{t-1}^j$ and the candidate activation $\tilde{h}_t^j$:

$$ h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j \tilde{h}_t^j. $$

The update gate $z_t^j$ decides how much the unit updates its activation:

$$ z_t^j = \sigma(W_z x_t + U_z h_{t-1})^j. $$

The candidate activation $\tilde{h}_t^j$ is computed similarly to the update gate:

$$ \tilde{h}_t^j = \tanh(W x_t + U(r_t \odot h_{t-1}))^j, $$

where $r_t^j$ is a set of reset gates and $\odot$ denotes element-wise multiplication [16]. When the reset gate is off ($r_t^j = 0$), the unit is allowed to forget the past; this is analogous to letting the unit read the first symbol of an input sequence. The reset gate is computed as

$$ r_t^j = \sigma(W_r x_t + U_r h_{t-1})^j. $$
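The same style of sketch renders a GRU step; the weights are again random and illustrative. Note that the GRU carries a single state vector h, whereas the LSTM sketch above carries both h and c, which is where its parameter saving comes from.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, P):
    """One GRU step following the equations above (weights in dict P)."""
    z = sigmoid(P["Wz"] @ x_t + P["Uz"] @ h_prev)            # update gate
    r = sigmoid(P["Wr"] @ x_t + P["Ur"] @ h_prev)            # reset gate
    h_tilde = np.tanh(P["W"] @ x_t + P["U"] @ (r * h_prev))  # candidate activation
    return (1.0 - z) * h_prev + z * h_tilde                  # interpolated new activation

d, p = 13, 4
rng = np.random.default_rng(0)
P = {k: rng.normal(scale=0.1, size=(p, d) if k.startswith("W") else (p, p))
     for k in ("Wz", "Wr", "W", "Uz", "Ur", "U")}
h = np.zeros(p)
for x in rng.normal(size=(5, d)):   # a toy 5-frame sequence
    h = gru_step(x, h, P)
```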

The update gate z controls how much the past state should matter now. Units that capture short-term dependencies will have active reset gates r, while units that capture long-term dependencies will have active update gates z.

The GRU and the LSTM have many commonalities, yet there are some differences between the two, which we summarize in Table II.

III. GATED RECURRENT UNIT FOR EMOTION CLASSIFICATION FROM NOISY SPEECH

An emotion classification framework embodying a Gated Recurrent Unit is shown in Fig. 3. Our focus is on emotion classification from noisy speech, so we simulate noisy speech by superimposing various environmental noises on clean speech. Features are extracted from the noisy speech and fed to the GRU for emotion classification. We use the same framework with an LSTM to contrast its classification performance with that of the GRU.
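The paper does not spell out the mixing procedure; the following is a minimal sketch of one way to superimpose a noise recording on a clean utterance. It assumes the soundfile package, hypothetical file names, mono signals already resampled to a common rate (DEMAND is recorded at 48 kHz), and an SNR-based scaling that is our assumption rather than the paper's stated method.

```python
import numpy as np
import soundfile as sf  # assumed available for reading WAV files

def mix_at_snr(clean, noise, snr_db):
    """Superimpose `noise` on `clean` at a target signal-to-noise ratio (dB)."""
    noise = np.resize(noise, clean.shape)              # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

speech, sr = sf.read("berlin/anger_01.wav")            # hypothetical file names;
noise, _ = sf.read("demand/traffic_ch01_16k.wav")      # noise pre-resampled to sr
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```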

A. Description of Datasets

The Berlin emotional speech database [?] is used in our experiments for classifying discrete emotions. In this database, ten actors, five male and five female, each uttered ten sentences (five short and five longer, typically between 1.5 and 4 seconds) in German to simulate seven different emotions: anger, boredom, disgust, anxiety/fear, happiness, sadness, and neutral. Utterances scoring higher than an 80% emotion recognition rate in a subjective listening test are included in the database. We classify all seven emotions in this work. The numbers of speech files for these emotion categories in the Berlin database are: anger (127), boredom (81), disgust (46), fear (69), joy (71), neutral (79), and sadness (62). We use this prototypical database for a preliminary investigation of the GRU's performance for emotion recognition.

To simulate noise-corrupted speech signals, the DEMAND (Diverse Environments Multi-channel Acoustic Noise Database) noise database [?] is used in this paper. This database contains 18 types of noise, of which we use noise from traffic, cafe, living room, park, washing, car, office, and river environments. The recordings were captured at a sampling rate of 48 kHz with a target length of 5 minutes (300 s). The actual audio capture time was slightly longer, allowing set-up noises and other artifacts to be removed by trimming. The recorded signals were not subject to any gain normalization, so the original noise power of each environment was preserved.

B. GRU Implementation

We use the PyBrain toolbox [24] to implement the GRU. We evaluate the classification performance across three parameters: learning rate, number of cells, and bias.

[Fig. 4: Classification performance: impact of noise, bias, and learning rate. Panels: (a) GRU, no noise; (b) GRU, traffic noise; (c) LSTM, no noise; (d) LSTM, traffic noise. Each panel plots validation error against learning rates from 1 down to 10^-9, with and without bias.]

To use the GRU and the LSTM for classification, we use a softmax function in the output layer. The PyBrain package uses 75% of the data for training and 25% for validation. For accuracy, we report the validation error in the plots.
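A minimal sketch of this setup in PyBrain follows. It is our reconstruction under stated assumptions, not the authors' code: stock PyBrain provides an LSTMLayer, so the sketch shows the LSTM variant, and the GRU run would substitute a GRU layer implementation in its place. The parameter values match those reported in Section IV.

```python
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure import LSTMLayer, SoftmaxLayer
from pybrain.datasets import ClassificationDataSet
from pybrain.supervised.trainers import BackpropTrainer

N_MFCC, N_CELLS, N_CLASSES = 13, 1, 7   # one cell and no bias, per Section IV

ds = ClassificationDataSet(N_MFCC, nb_classes=N_CLASSES)
# ... ds.addSample(mfcc_vector, [label]) for each speech segment ...
ds._convertToOneOfMany()                # one-hot targets for the softmax output

train, valid = ds.splitWithProportion(0.75)   # 75% training / 25% validation

net = buildNetwork(N_MFCC, N_CELLS, N_CLASSES,
                   hiddenclass=LSTMLayer,     # a GRU layer would be swapped in here
                   outclass=SoftmaxLayer,
                   recurrent=True, bias=False)

trainer = BackpropTrainer(net, dataset=train, learningrate=1.0)
for epoch in range(50):                 # epoch count is illustrative
    trainer.train()
```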

C. Feature Selection

For each speech segment, we choose 13 Mel Frequency Cepstral Coefficients (MFCCs) as the feature vector; these have been used by many researchers (e.g., [25]) for emotion classification from speech. We choose this small number of coefficients to reduce the dimensionality of the feature space: large feature sets consume a significant amount of memory, along with computing and power resources, and they do not always improve the recognition rate.
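The paper does not name a feature-extraction toolkit; as one possibility, librosa computes the 13 MFCCs per frame. The file name is hypothetical, and the per-segment averaging is our assumption.

```python
import librosa

y, sr = librosa.load("noisy_speech.wav", sr=None)    # hypothetical file, native rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
segment_feature = mfcc.mean(axis=1)                  # one 13-dim vector per segment
```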

IV. RESULTS AND DISCUSSION

A. Bias and Learning Rate

We compare the classification performance of the GRU with respect to bias and learning rate in Fig. 4. We reduce the learning rate gradually from 1 to 10^-9, in both the presence and absence of the bias term. We find that the learning rate has a very small impact and that the GRU performs better without bias terms. Similar behavior is observed for the LSTM, and these patterns hold in noisy conditions as well; we show results for arbitrarily chosen traffic noise.

Setting the learning rate for plain Stochastic Gradient Descent in a neural network is usually a process of starting from an arbitrary initial value, such as 0.01, and then cross-validating to find an optimal value; common values range over a few orders of magnitude, from 0.0001 up to 1. A small value is desired to avoid overshooting, but too small a value risks getting stuck in local minima or descending too slowly. For our experiments, a learning rate of 1 yields the best performance.

A bias value usually allows the activation function to be shifted to the left or right, which can be necessary for successful learning. In our case, however, adding a bias has a negative impact: it lowers the accuracy.

B. Bias and Number of Cells

We also verify the impact of the number of cells, in the presence and absence of bias, in Fig. 5. For both the GRU and the LSTM, the error is minimal when the number of cells is one and there is no bias term.

[Fig. 5: Classification performance: impact of noise, bias, and number of cells. Panels: (a) GRU, no noise; (b) GRU, washing noise; (c) LSTM, no noise; (d) LSTM, washing noise. Each panel plots validation error against the number of cells (1, 5, 10, 20, 30), with and without bias.]

Similar behavior is observed in the presence of noise; we show results for arbitrarily chosen washing noise.

Both GRU and LSTM layers consist of blocks, which in turn consist of cells. We anticipate that, due to the small size of our dataset, increasing the number of cells does not have a positive impact.

C. Accuracy: GRU versus LSTM

Having analyzed the number of cells, learning rate, and bias, we now compare the performance of the GRU and the LSTM under various noisy conditions in Fig. 6. In the following, we use these parameter values: number of cells = 1, bias = false, learning rate = 1.

Out of the eight noise cases, the LSTM and the GRU perform identically in three (see Fig. 6a). When the speech is corrupted with washing noise, the GRU performs better than the LSTM by 1.75%. For the four remaining noise types, the LSTM performs better than the GRU; for cafe and river noise it performs noticeably better (by 4.6% and 6.4%, respectively), but for traffic and park noise it performs only marginally better.

The comparison between the LSTM and the GRU provides some intuitive insights. For example, the GRU performs better for washing noise, which can be very periodic and is usually not continuous, whereas the LSTM performs noticeably better for river and cafe noise, which are usually continuous. We will conduct further studies in the future to explain these differences.

Fig. 6b shows the impact of noise as a whole on the GRU's performance, comparing the error under no noise with the combined error across all noisy conditions. We notice that the GRU is quite robust to noise; in fact, the error in noisy conditions is smaller than the error in the no-noise condition. It is not unusual for Deep Learning models to perform better in the presence of noise, as noise that is not dominant helps avoid overfitting. We observe similar robustness for the LSTM.

D. Run-Time: GRU versus LSTM

We compare the run-time performance of the GRU and the LSTM on a 2 GHz Intel Core i7 MacBook Pro with 8 GB of 1600 MHz DDR3 memory. For each noise type, and for the no-noise condition, we measure the run-time five times. To aggregate over the five runs, we choose the median over the mean, as the variances were quite high.

[Fig. 6: GRU versus LSTM. (a) Validation error for each noise type (traffic, cafe, living room, park, washing, car, office, river); (b) validation error with no noise versus all noises combined.]

We then combine the median values and, comparing the averaged medians, find that the GRU incurs an 18.16% smaller run-time than the LSTM.
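In code, this aggregation amounts to roughly the following sketch, where `run_once` is a hypothetical stand-in for one full train-and-classify pass:

```python
import time
import statistics

def median_runtime(run_once, repeats=5):
    """Median wall-clock time over `repeats` runs (median, since variance was high)."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        run_once()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

# One median per noise condition for each model, then averaged:
# avg_gru = statistics.mean(gru_medians)
# avg_lstm = statistics.mean(lstm_medians)
# saving = 100.0 * (avg_lstm - avg_gru) / avg_lstm   # reported as 18.16%
```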

Although this comparison was made on a desktop computer, it provides important insights into the time-complexity differences between the GRU and the LSTM. In a future study, we aim to deploy the GRU module on a smartphone and determine its run-time complexity there.

V. EXISTING WORK

The closest match to our work is that of Tang et al. [26], [27], where the authors use a Gated Recurrent Unit for question detection from speech. Our focus is instead on emotion detection from speech.

In this paper, we use the LSTM as the benchmark to evaluate the performance of the GRU, as the LSTM is the most popular Recurrent Neural Network implementing the gating mechanism. In particular, we justify the exploration of the GRU for emotion classification from speech by the success of the LSTM in that task. For example, the LSTM, as a gated recurrent neural network, has been shown to successfully exploit long-range dependencies between successive observations and to offer good classification accuracy [19], [20]. It has also been shown in the literature that the LSTM can be combined with other methods, such as Multiple Kernel Learning (MKL) [19] and Convolutional Neural Networks (CNNs) [8], to achieve greater accuracy, and that the LSTM [21] can outperform the widely used (e.g., [28], [29], [30]) Support Vector Machine (SVM) when there is enough training data.

Most of the studies described above consider emotion recognition from clean speech, but we are interested in emotion recognition from noisy speech. One paper, by Zhang et al. [31], performs extensive evaluations of the LSTM for speech emotion recognition in the presence of non-stationary additive noise and convolutional noise. Whereas those are synthetic noises (additive Gaussian noise), we are more interested in speech emotion recognition in the presence of real-life background noise. Also, our focus is mainly on the GRU.

VI. CONCLUSION

This paper investigates the feasibility of the Gated Recurrent Unit (GRU), a gated Recurrent Neural Network, for emotion classification from noisy speech. We create noisy speech by superimposing noises from the cafe, washing, river, and other environments on clean speech. The results show that the GRU offers accuracy very comparable to the most widely used Long Short-Term Memory (LSTM) while incurring a shorter run-time: the LSTM performs better than the GRU by 6.4% in the best case, but the GRU incurs an 18.16% smaller run-time than the LSTM. Interestingly, for washing noise the GRU incurs a 1.75% smaller error than the LSTM. This accuracy versus time-complexity trade-off of the GRU is highly advantageous for embedded platforms in general, and in future studies we aim to investigate the performance of the GRU for real-time emotion recognition on smartphones.

REFERENCES

[1] R. A. Calix, L. Javadpour, and G. M. Knapp, "Detection of affective states from text and speech for real-time human–computer interaction," Human Factors: The Journal of the Human Factors and Ergonomics Society, vol. 54, no. 4, pp. 530–545, 2012.
[2] Y. Shi, M. Larson, and A. Hanjalic, "Mining mood-specific movie similarity with matrix factorization for context-aware recommendation," in Proceedings of the Workshop on Context-Aware Movie Recommendation. ACM, 2010, pp. 34–40.
[3] G. E. Simon, "Social and economic burden of mood disorders," Biological Psychiatry, vol. 54, no. 3, pp. 208–215, 2003.
[4] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[5] A. Y. Hannun, C. Case, J. Casper, B. C. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, "Deep Speech: Scaling up end-to-end speech recognition," CoRR, vol. abs/1412.5567, 2014. [Online]. Available: http://arxiv.org/abs/1412.5567
[6] R. Brueckner and B. Schuller, "Be at odds? Deep and hierarchical neural networks for classification and regression of conflict in speech," in Conflict and Multimodal Communication. Springer, 2015, pp. 403–429.
[7] M. Wollmer, Y. Sun, F. Eyben, B. Schuller et al., "Long short-term memory networks for noise robust speech recognition," in INTERSPEECH, 2010, pp. 2966–2969.
[8] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, S. Zafeiriou et al., "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5200–5204.
[9] Q. Mao, M. Dong, Z. Huang, and Y. Zhan, "Learning salient features for speech emotion recognition using convolutional neural networks," IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2203–2213, 2014.
[10] L. Li, Y. Zhao, D. Jiang, Y. Zhang, F. Wang, I. Gonzalez, E. Valentin, and H. Sahli, "Hybrid deep neural network–hidden Markov model (DNN-HMM) based speech emotion recognition," in Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on. IEEE, 2013, pp. 312–317.
[11] M. Wollmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, R. Cowie et al., "Abandoning emotion classes: towards continuous emotion recognition with modelling of long-range dependencies," in INTERSPEECH, vol. 2008, 2008, pp. 597–600.
[12] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: The Munich open-source CUDA recurrent neural network toolkit," Journal of Machine Learning Research, vol. 16, no. 3, pp. 547–551, 2015.
[13] J. Deng, Z. Zhang, F. Eyben, and B. Schuller, "Autoencoder-based unsupervised domain adaptation for speech emotion recognition," IEEE Signal Processing Letters, vol. 21, no. 9, pp. 1068–1072, 2014.
[14] P. Georgiev, N. D. Lane, K. K. Rachuri, and C. Mascolo, "DSP.Ear: Leveraging co-processor support for continuous audio sensing on smartphones," in Proceedings of the 12th ACM Conference on Embedded Network Sensor Systems. ACM, 2014, pp. 295–309.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] K. Cho, B. Van Merrienboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.
[17] R. Rana, "Poster: Context-driven mood mining," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services Companion. ACM, 2016, pp. 143–143.
[18] N. Yang, I. Demirkol, J. Yuan, W. Heinzelman, Y. Zhou, and M. Sturge-Apple, "How does noise impact speech-based emotion classification?" in Designing Speech and Language Interactions Workshop, the ACM CHI Conference on Human Factors in Computing Systems, 2014.
[19] J. Wei, E. Pei, D. Jiang, H. Sahli, L. Xie, and Z. Fu, "Multimodal continuous affect recognition based on LSTM and multiple kernel learning," in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific. IEEE, 2014, pp. 1–4.
[20] M. Wollmer, M. Kaiser, F. Eyben, B. Schuller, and G. Rigoll, "LSTM-modeling of continuous emotions in an audiovisual affect recognition framework," Image and Vision Computing, vol. 31, no. 2, pp. 153–163, 2013.
[21] L. Tian, J. D. Moore, and C. Lai, "Emotion recognition in spontaneous and acted dialogues," in Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on. IEEE, 2015, pp. 698–704.
[22] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[23] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[24] T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber, "PyBrain," Journal of Machine Learning Research, vol. 11, no. Feb, pp. 743–746, 2010.
[25] D. Neiberg, K. Elenius, and K. Laskowski, "Emotion recognition in spontaneous speech using GMMs," in INTERSPEECH, 2006.
[26] Y. Tang, Y. Huang, Z. Wu, H. Meng, M. Xu, and L. Cai, "Question detection from acoustic features using recurrent neural network with gated recurrent unit," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 6125–6129.
[27] Y. Tang, Z. Wu, H. Meng, M. Xu, and L. Cai, "Analysis on gated recurrent unit based question detection approach," Interspeech 2016, pp. 735–739, 2016.
[28] R. Rana, D. Austin, P. Jacobs, M. Karunanithi, and J. Kaye, "Gait velocity estimation using time-interleaved between consecutive passive IR sensor activation," 2013.
[29] R. Rana, M. Yang, T. Wark, C. T. Chou, and W. Hu, "SimpleTrack: Adaptive trajectory compression with deterministic projection matrix for mobile sensor networks," IEEE Sensors Journal, vol. 15, no. 1, pp. 365–373, 2015.
[30] R. Rana, B. Kusy, J. Wall, and W. Hu, "Novel activity classification and occupancy estimation methods for intelligent HVAC (heating, ventilation and air conditioning) systems," Energy, vol. 93, pp. 245–255, 2015.
[31] Z. Zhang, F. Ringeval, J. Han, J. Deng, E. Marchi, and B. Schuller, "Facing realism in spontaneous emotion recognition from speech: Feature enhancement by autoencoder with LSTM neural networks," Proceedings of INTERSPEECH, San Francisco, CA, 2016.

Dr Rajib Rana is a Vice Chancellor's research fellow in the Institute of Resilient Regions, University of Southern Queensland (USQ) and an honorary fellow at the University of Queensland. He received his PhD (2011) in Computer Science and Engineering from the University of New South Wales (UNSW), Sydney, Australia. He was the recipient of the President's and Prime Minister's Gold Medals for his extraordinary achievement in his Bachelor degree (2004). His current research interests are in Deep Learning Neural Networks, Quantified Self, Personal Health Informatics, and the Internet of Things (IoT). He is the founder of the IoT-Health Lab at USQ, where he leads research on early detection of mental illness and autism spectrum disorder. Since 2011, Dr Rana has received a number of research grants, including the Queensland Health Innovation Grant (twice: 2014, 2016), a USQ major infrastructure grant, and a UNSW postdoctoral writing fellowship. He is the lead guest editor of the Special Issue on "Sensors: Deep Learning and Wearable Sensing" (DLWS).

Dr Julien Epps is an Associate Professor in Digital Signal Processing with the School of Electrical Engineering and Telecommunications at The University of New South Wales, Sydney, Australia. He is also a Scientific Advisor for the Boston-based startup Sonde Health, where he works on speech-based assessment of mental health, and a Contributed Principal Researcher with CSIRO in the ATP Laboratory, where he works on methods for automatic task analysis using behavioural and physiological signals. His research interests also include applications of speech modelling and processing, in particular to emotion and mental state recognition from speech, and genomic sequence processing. He has also worked on aspects of human-computer interaction, including multimodal interfaces and computer-supported cooperative work.

Dr Raja Jurdak is a Senior Principal Research Scientist at CSIRO, where he leads the Distributed Sensing Systems Group. He received a PhD in Information and Computer Science from the University of California, Irvine in 2005, an MS in Computer Networks and Distributed Computing from the Electrical and Computer Engineering Department at UCI (2001), and a BE in Computer and Communications Engineering from the American University of Beirut (2000). His current research interests focus on energy efficiency and mobility in networks. He has over 100 peer-reviewed journal and conference publications, as well as a book published by Springer in 2007 titled Wireless Ad Hoc and Sensor Networks: A Cross-Layer Design Perspective. He regularly serves on the organizing and technical program committees of international conferences (DCOSS, RTSS, Sensapp, Percomm, EWSN, ICDCS). Dr Jurdak is an Adjunct Professor at Macquarie University and James Cook University, and an Adjunct Associate Professor at the University of Queensland and the University of New South Wales. He is a Senior Member of the IEEE.


Professor Xue Li is a Professor in Information Technology and Electrical Engineering at the University of Queensland (UQ) in Brisbane, Australia. He has more than 160 research articles published in ACM and IEEE journals and international conferences, books, and book chapters since 1993. Dr Xue Li was an Editorial Board member (2004-2007) of the International Journal of Information Systems and Management & Technology (Elsevier) and a Guest Editor for three journal issues: Journal of Global Information Management (JGIM), International Journal of Data Warehouse and Mining (IJDWM), and International Journal of Systems Science (IJSS). Xue has more than 29 years of experience in Information Technology. He is currently a guest professor at three Chinese 985 universities: Central South University of China, Chongqing University, and Xian Jiaotong University. Xue is the founder and Steering Committee Chair of the Southeast Asian International Conference on Advanced Data Mining and Applications (ADMA), 2004-2015. He has been invited as a keynote speaker on Big Data Analytics, Data Mining, and Web Information Systems at numerous international conferences.

Professor Roland Goecke is a Professor of Affective Computing at the newly created (merged) Faculty of Education, Science, Technology and Mathematics at the University of Canberra. Before that, he was an Associate Professor in Software Engineering at the same Faculty and, prior to that, an Assistant Professor (Senior Lecturer) at the Faculty of Information Sciences and Engineering, University of Canberra, from January 2010 until December 2012. He leads the Vision and Sensing Group and is the Deputy Director of the Human-Centred Computing Laboratory. His research focus continues to be in the areas of face and facial feature tracking and its applications, and more generally in Computer Vision, Affective Computing, and Multimodal Human-Computer Interaction.

Professor Margot Brereton researches the participatory interaction design of ubiquitous computing technologies and their interfaces. She develops innovative designs, methods, and theoretical understandings by designing to support real user communities in selected challenging contexts. Her approach is highly iterative and often involves growing user communities as the design evolves, by understanding and responding to socio-cultural factors. Her broad areas of research include Human-Computer Interaction, Participatory Design, Interaction Design, Computer Supported Cooperative Work, Design Methods, and Ubiquitous Computing.

Professor Jeffrey Soar holds the Chair in Human-Centred Technology at the University of Southern Queensland, where he researches technology innovation for human benefit. He came into research from a career at the highest levels of ICT management within large public and private organisations. He was CIO in several government departments in Australia and New Zealand, including Director of Information and Technology for NZ Police, where he managed the largest-ever ICT project impacting across emergency services. In academia, he established two demonstration smart homes with the latest innovations for independent living, entertainment, security, and energy. His research has been supported by 7 Australian Research Council grants as well as over 30 grants from national and international technology and service organisations. His current research projects are in Technology for Economic Development, E-Learning and M-Learning, E-Government, E-Health, Decision Support, Mobile, Cloud, Algorithms, Adoption and Benefits Realisation, Human Computer Interaction, and User Experience.