
Robust Speech Based Emotion Analysis for Driver Assistance System

Ashish Tawari and Mohan Trivedi
LISA: Laboratory for Intelligent and Safe Automobiles

University of California San Diego, Dept. of ECE
[email protected], [email protected]

Abstract— Automated analysis of human affective behavior has attracted increasing attention in recent years. In the context of the driving task in particular, the driver's emotion often influences driving performance, which can be improved if the car actively responds to the driver's emotional state. It is important for an intelligent driver support system to monitor the driver's state accurately, unobtrusively, and robustly. The ever-changing environment encountered while driving poses a serious challenge to existing techniques for speech emotion recognition. In our research, we propose to utilize contextual information about both the outside environment and the user inside the car to improve emotion recognition accuracy. In particular, we propose a noise cancellation technique that suppresses noise adaptively based on the driving context, and we use gender-based context information in developing the classifier.

I. RESEARCH OBJECTIVE AND MOTIVATION

The automotive industry is integrating more and more technologies into modern cars. With potentially more complex devices for the driver to control, the risk of distracting the driver's attention increases. Current research and attention theory suggest that speech-based interactions are less distracting than interaction with a visual display [1]. Therefore, to improve both comfort and safety in the car, driver assistance technologies need an effective speech interface between the driver and the infotainment system. The introduction of speech-based interaction and conversation into the car highlights the potential importance of linguistic cues (such as word choice and sentence structure) and paralinguistic cues (such as pitch, intensity, and speech rate). Such cues play a fundamental role in enriching human-human interaction and convey, among other things, personality and emotion [2]. Driving in particular presents a context in which a user's emotional state plays a significant role. Emotions have been found to affect cognitive style and performance. The road-rage phenomenon [3] is undeniable evidence of the impact that emotion can have on roadway safety. Such phenomena can be mitigated, and driving performance improved, if the car actively responds to the emotional state of the driver. Research studies show that matching the in-car voice to the driver's emotional state has a great impact on driving performance [4]. While a number of studies have shown the potential of emotion monitoring, automatic recognition of emotion is still a very challenging task, especially in real-world driving scenarios. The car setting is far from an ideal environment for speech acquisition, and processing needs to deal with reverberation and noisy conditions inside the car cockpit.

Fig. 1. Emotion classification system.

In our research, we focus on developing a system that can adapt to the continually changing outside environment as well as to the driver inside the car. While it is only a matter of time before cars are fitted with camera technology, we have adopted a speech-based emotion recognition system, since speech interaction systems are commonplace in today's cars. The major challenge for speech processing technology is the presence of non-stationary noise, which in the driving task arises from the ever-changing environment outside the car. An adaptive noise cancellation technique is an obvious choice for enhancing speech quality based on the context of the outside environment [5]. On the other hand, the use of user context information (e.g., gender) can improve emotion recognition [6]. We explore the usefulness of such context information in the driving task. Another important lesson from the machine learning paradigm is that the performance of a classification system is highly dependent on the quality of the data. Toward this end, we are actively developing an audiovisual emotion database.

II. SPEECH TO EMOTION: COMPUTATIONAL FRAMEWORK

Fig. 1 shows the block diagram representing the three fundamental steps: information acquisition and preprocessing, extraction and processing of parameters, and classification of semantic units. The pre-processing stage consists of a speech enhancement module that adapts to the context of the outside environment. The classification system has three different phases: feature extraction and selection (phase 1), to identify the features to be used during classification; the model training phase (phase 2); and finally the testing phase (phase 3), to evaluate the performance of the system in terms of classification accuracy.
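
To make the three phases concrete, the Python sketch below (using scikit-learn, our choice of toolkit; enhance and extract_features are hypothetical placeholders standing in for the modules of Sections III-A and III-B, not the authors' implementation) shows one way such a pipeline could be wired together.

    # Minimal pipeline sketch: enhancement -> features -> train -> test.
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    def run_pipeline(utterances, labels, enhance, extract_features):
        # Pre-processing: context-adaptive speech enhancement (Sec. III-A).
        clean = [enhance(u) for u in utterances]
        # Phase 1: feature extraction (feature selection would follow here).
        X = [extract_features(u) for u in clean]
        # Phase 2: model training on a held-out split.
        X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2)
        clf = SVC(kernel="linear").fit(X_tr, y_tr)
        # Phase 3: testing, reported as classification accuracy.
        return clf.score(X_te, y_te)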


III. COMPUTATIONAL MODULES

A. Adaptive speech enhancement

During the last decades, several speech enhancement techniques have been proposed, ranging from beamforming with microphone arrays to adaptive noise filtering approaches. In this work, we utilize a speech enhancement technique based on adaptive thresholding in the wavelet domain [7]. The objective criterion used for parameter selection in our experiments, however, was classification performance rather than SNR.
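
As a rough illustration of this family of methods (not the specific per-subband adaptive scheme of [7]), the sketch below denoises a signal by soft thresholding of wavelet coefficients with the PyWavelets library; the db8 wavelet, the MAD noise estimate, and the universal threshold are our assumptions.

    import numpy as np
    import pywt

    def wavelet_denoise(x, wavelet="db8", level=4):
        # Simplified wavelet soft-thresholding; [7] adapts the threshold
        # per subband and over time, which this sketch does not.
        coeffs = pywt.wavedec(x, wavelet, level=level)
        # Estimate noise level from the finest detail subband (MAD rule).
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745
        thresh = sigma * np.sqrt(2.0 * np.log(len(x)))  # universal threshold
        coeffs[1:] = [pywt.threshold(c, thresh, mode="soft")
                      for c in coeffs[1:]]
        # Reconstruct; trim to input length (waverec may pad by one sample).
        return pywt.waverec(coeffs, wavelet)[:len(x)]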

B. Feature extraction and selection

A variety of features have been proposed to recognize emotional states from the speech signal. These features can be categorized as acoustic features and linguistic features. We avoid using the latter, since they demand robust speech recognition in the first place and are also a drawback for multi-language emotion recognition. Hence we consider only acoustic features. Within the acoustic category, we focus on prosodic features such as speech intensity, pitch, and speaking rate, and on spectral features such as mel frequency cepstral coefficients (MFCC), to model emotional states.
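
For illustration, these low-level contours could be obtained as follows; the sketch assumes the librosa library and the YIN pitch estimator, neither of which the paper specifies.

    import librosa

    def extract_contours(y, sr=16000):
        # 13 MFCCs per frame, the conventional spectral representation.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        # Frame-level energy (RMS) as the intensity contour.
        energy = librosa.feature.rms(y=y)[0]
        # Fundamental frequency contour via the YIN estimator.
        f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
        return mfcc, energy, f0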

In order to capture the characteristics of the pitch and energy contours, we perform cepstrum analysis over the contour [6]. For all of these sequences, the following statistics are calculated: mean, standard deviation, relative maximum/minimum, position of the relative maximum/minimum, 1st quartile, 2nd quartile (median), and 3rd quartile. Speaking rate is modeled as the fraction of voiced segments. Thus, the total feature vector per segment contains 3 · (13 + 13 + 13) · 9 + 1 = 1054 attributes.
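
The nine per-sequence statistics can be computed in a few lines of NumPy (a sketch; the paper's exact definition of the relative extrema and their positions may differ).

    import numpy as np

    def contour_stats(seq):
        # Nine statistics per sequence: mean, std, max, min,
        # normalized max/min positions, and the three quartiles.
        seq = np.asarray(seq, dtype=float)
        q1, q2, q3 = np.percentile(seq, [25, 50, 75])
        return np.array([
            seq.mean(), seq.std(),
            seq.max(), seq.min(),
            seq.argmax() / len(seq), seq.argmin() / len(seq),
            q1, q2, q3,
        ])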

Intuitively, a large number of features should improve classification performance; in practice, however, a large feature space suffers from the 'curse of dimensionality'. Therefore, in order to improve classification performance, a feature selection technique is utilized. One such method for eliminating redundant and insignificant features is to identify features with high correlation with the class but low correlation among themselves [8]. We also show that including gender information during feature selection significantly improves performance.
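
Hall's method [8] scores a subset of k features by its merit, k * r_cf / sqrt(k + k*(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The greedy forward-selection sketch below is our simplification: it substitutes Pearson correlation for the symmetric uncertainty measure used in [8].

    import numpy as np

    def cfs_forward_select(X, y, k_max=50):
        # Greedy forward selection by CFS merit; y must be numerically
        # encoded (Pearson correlation stands in for symmetric uncertainty).
        n_feat = X.shape[1]
        r_cf = np.abs(np.array(
            [np.corrcoef(X[:, j], y)[0, 1] for j in range(n_feat)]))
        r_ff = np.abs(np.corrcoef(X, rowvar=False))
        selected = []
        while len(selected) < k_max:
            best_j, best_merit = None, -np.inf
            for j in set(range(n_feat)) - set(selected):
                subset = selected + [j]
                k = len(subset)
                mean_cf = r_cf[subset].mean()
                # Mean off-diagonal inter-feature correlation of the subset.
                mean_ff = ((r_ff[np.ix_(subset, subset)].sum() - k)
                           / max(k * (k - 1), 1))
                merit = k * mean_cf / np.sqrt(k + k * (k - 1) * mean_ff)
                if merit > best_merit:
                    best_j, best_merit = j, merit
            selected.append(best_j)
        return selected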

IV. CLASSIFIER DESIGN

Support Vector Machines (SVMs) have shown great performance in practice owing to various attractive properties, including good generalization. For the task at hand, we use an SVM with a linear kernel and a one-vs-one multiclass discrimination scheme. Classification is done by a max-wins voting strategy.
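
In scikit-learn (our choice of toolkit; the paper names none), this classifier maps directly onto SVC, which for K classes trains K*(K-1)/2 pairwise binary SVMs and predicts by max-wins voting over them.

    from sklearn.svm import SVC

    def train_emotion_svm(X_train, y_train):
        # Linear-kernel SVM; SVC's multiclass handling is one-vs-one
        # with max-wins voting, matching the scheme described above.
        return SVC(kernel="linear",
                   decision_function_shape="ovo").fit(X_train, y_train)

Given the selected feature matrix and labels, train_emotion_svm(X_sel, labels).predict(X_new) then returns one of the class labels, e.g., 'pos', 'neg', or 'neu'.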

V. IN-VEHICLE DATA COLLECTION

Collecting real-world driving data is certainly useful for researchers interested in the design and development of intelligent vehicular systems capable of dealing with different situations in traffic. Real-world data, as opposed to driving-simulator recordings, are much more closely related to day-to-day situations and can provide valuable information on drivers' behavior.

Fig. 2. LISA testbed for data acquisition.

A. LISA audio-visual affect database

The user in the driver's seat was prompted by a computer program specifying the emotion to be expressed. It also provided an example of an utterance that the driver could use. We also recorded free conversation between passenger and driver. The database was collected in stationary and moving car environments. The cockpit of an automobile does not provide the comfort of a noiseless anechoic environment. In fact, a moving automobile with a lot of road noise has a drastic effect on the signal-to-noise ratio (SNR) of the audio channel, as well as producing challenging illumination conditions for the video channel. In this study, we analyze emotional speech data in the stationary-car setting, which captures the acoustics of the car cockpit at a relatively high SNR.

The database was collected using an analog video camera facing the driver and a directional microphone beneath the steering wheel. Fig. 2 shows the placement of the camera and microphone. Video frames were acquired at approximately 30 frames per second, and the captured audio signal was resampled to a 16 kHz sampling rate. The emotional speech has been labeled into three groups, 'pos', 'neg', and 'neu', for positive, negative, and neutral expressions. The data were acquired from 4 different subjects: 2 male and 2 female. The distribution of data across categories is: 82 pos, 82 neg, and 60 neu.

REFERENCES

[1] H. Lunenfeld, “Human factor considerations of motorist navigation and information systems,” in Vehicle Navigation and Information Systems Conference, 1989.

[2] D. L. Strayer and W. A. Johnston, “Driven to distraction: Dual-task studies of simulated driving and conversing on a cellular telephone,” Psychological Science, vol. 12, no. 6, pp. 462–466, 2001.

[3] T. E. Galovski and E. B. Blanchard, “Road rage: a domain for psychological intervention?” Aggression and Violent Behavior, vol. 9, no. 2, pp. 105–127, 2004.

[4] I.-M. Jonsson, C. Nass, H. Harris, and L. Takayama, “Matching in-car voice with driver state: Impact on attitude and driving performance,” in Proc. of the Third International Driving Symposium on Human Factors in Driver Assessment, Training and Vehicle Design, 2005, pp. 173–181.

[5] A. Tawari and M. Trivedi, “Speech emotion analysis in noisy real world environment,” in Proc. of the Int. Conf. on Pattern Recognition (ICPR), 2010.

[6] ——, “Context analysis in speech emotion recognition,” IEEE Transactions on Multimedia, 2010.

[7] Y. Ghanbari and M. R. Karami-Mollaei, “A new approach for speech enhancement based on the adaptive thresholding of the wavelet packets,” Speech Communication, vol. 48, no. 8, pp. 927–940, 2006.

[8] M. A. Hall, “Correlation-based feature subset selection for machine learning,” Ph.D. dissertation, University of Waikato, 1998.