
AUTOMATIC RECOGNITION OF AFFECTIVE CUES IN THE SPEECH OF CAR DRIVERS TO ALLOW APPROPRIATE RESPONSES

Christian Martyn Jones† and Ing-Marie Jonsson‡

†School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, UK
[email protected]

‡Department of Communication, Stanford University, California 94305, USA
[email protected]

ABSTRACT

Speech interaction with in-car systems is becoming more commonplace as the systems improve. New cars are often equipped with speech recognition to dial phone numbers or control the in-car environment, and with speech output to provide verbal directions from navigation systems. This paper explores the possibilities of richer speech interaction between driver and car, with automatic recognition of the emotional state of the driver and appropriate responses from the car. Drivers' emotions often influence driving performance, which could be improved if the car actively responded to the emotional state of the driver. This paper focuses on an in-car emotion recognition system to recognise the driver's emotional state.

Keywords: Emotional interface, emotion detection, emotional responses, driving simulator, in-car systems, affective computing, speech recognition

1. INTRODUCTION

Today's cars are fitted with interactive information systems including high-quality audio/video systems, pin-point satellite navigation, hands-free telephony and control over climate and car behaviour. Current research and attention theory both suggest that speech-based interactions are less distracting to the driver than interactions with a visual display [Lunenfeld, 1989]. With potentially more complex devices for the driver to control, speech interfaces are no longer a gimmick but a necessity. The introduction of speech-based interaction and conversation into the car highlights the potential influence of linguistic cues (such as word choice and sentence structure) and paralinguistic cues (such as pitch, frequency, accent, and speech rate). These cues play a critical role in human-human interactions and convey, among other things, personality and emotion [Strayer & Johnston, 2001]. Interaction with information systems has often been viewed as an exception, where it is felt that people must discard their emotions in order to interact efficiently and rationally. Recently, however, there has been an explosion of research on the psychology of emotion [Gross, 1999].

Emotion is no longer limited to, for example, excitement when a hard task is resolved or frustration when reading an incomprehensible error message. The literature on emotion has grown, and new results show that emotions play a critical role in all goal-directed activities. There are a number of definitions of "emotion"; however, two generally agreed-upon aspects of emotion stand out [Kleinginna & Kleinginna, 1981]: 1) emotion is a reaction to events deemed relevant to the needs, goals, or concerns of an individual; and 2) emotion encompasses physiological, affective, behavioural, and cognitive components.

To add more complexity, emotion and mood are often used interchangeably even though emotion can be distinguished from mood. Emotions are related to a particular object: we become scared of something, or angry at someone. Moods, on the other hand, are more general - a person is generally depressed (a mood) rather than sad (an emotional state). Emotions also tend to be relatively short-lived; they are reactions to particular situations, whereas moods affect behaviour over a longer period of time [Davidson, 1994]. A person in a bad mood tends to view everything in the worst possible way, while a person in a good mood views everything in a positive way [Niedenthal, Setterlund, & Jones, 1994]. Moods can, in this manner, bias the emotions that are experienced, lowering the thresholds for emotions that are consistent with the current mood. It is therefore important to assess, and control for, the user's mood when studying the effect of emotion and emotional cues in interactions. People who are in a good mood are more likely to be positive during an interaction than people who are in a bad mood.

Emotions direct and focus people's attention on objects and situations that have been appraised as important to current needs and goals. In a voice interface, this attention function can be used to alert the user, as in a navigation system's "turn left right now", or it can be distracting, as when users are frustrated by poor voice recognition. Just as emotions can direct users to a feature in an interface, emotions can also drive attention away from the stimulus eliciting the emotion [Gross, 1998]. For example, if a person becomes angry with a voice recognition system, the user may turn off or actively avoid parts of an interface that rely on voice input. Emotions have been found to affect cognitive style and performance, where even mildly positive feelings can have a profound effect on the flexibility and efficiency of thinking and problem solving [Murray, Sujan, Hirt, & Sujan, 1990]. People in a good mood are significantly more successful at solving problems [Isen, Daubman, & Nowicki, 1987]. Emotion also influences judgment and decision making. This suggests, for example, that users in a good mood would be more likely to judge both the voice interface itself and what the interface says more positively than if they were in a negative or neutral mood. It has also been shown that people in a positive emotional state accept recommendations and take fewer risks than people in a negative emotional state [Isen, 2000].

Driving in particular presents a context in which a user's emotional state plays a significant role. Attention, performance, and judgment are of paramount importance in automobile operation, with even the smallest disturbance potentially having grave repercussions. Studies with in-car information systems also show that alerting drivers to hazards in the road results in more cautious and safer driving [Jonsson, Nass, Harris, & Takayama, 2005]. The road-rage phenomenon [Galovski & Blanchard, 2004] provides one undeniable example of the impact that emotion can have on the safety of the roadways. Considering the effects of emotion, and in particular that positive affect leads to better performance and less risk-taking, it is not surprising that research and experience demonstrate that happy drivers are better drivers [Groeger, 2000]. The emotion of the car voice has also been found to impact driving performance. Results from a study pairing the emotion of the car voice with the emotion of the driver showed that matched emotions positively impacted driving performance [Nass, Jonsson, Harris, Reaves, Endo, Brave & Takayama, 2005]. With a focus on driving safety and driving performance, these results motivate the research to investigate the design of an emotionally responsive car.

2. AN EMOTIONALLY RESPONSIVE CAR

The development of an emotionally responsive car involves a number of technically demanding stages. In practice, the driver and car will hold a two-way conversation, where each listens and responds to the other's requests for information and their emotional wellbeing. Systems which understand the driver's requests and retrieve and respond with appropriate information are not considered in this paper. Instead we concentrate on the problem of recognising the emotional state of the driver. This could be used by the car to modify its response, both in the words it uses and in the presentation of the message, by stressing particular words and speaking in an appropriate emotional tone. As the car alters its 'standard' voice response it will be able to empathise with the driver and ultimately improve the driver's wellbeing and driving performance. A previous study on pairing the emotional state of the driver with the emotional colouring of the car voice shows that pairing the emotions has an enormous positive influence on driving performance [Nass, Jonsson, Harris, Reaves, Endo, Brave & Takayama, 2005]. The same study reported that engagement and the amount of conversation were significantly higher when the emotions were paired. We have not yet analysed factors of engagement; this paper reports only the work to develop the acoustic emotion recognition component of the emotionally responsive car.

3. DEVELOPMENT OF ACOUSTIC EMOTION RECOGNITION

There is considerable research interest in automatically detecting and recognising human emotions, both academically [Cowie, Douglas-Cowie, Tsapatsoulis, Votsis, Kollias, Fellenz & Taylor, 2001] [Humaine, 2004] and commercially [Jones, 2004]. Emotional information can be obtained by tracking facial motion, gestures and body language using image capture and processing [Kapoor, Qi & Picard, 2003]; tracking facial expressions using thermal imaging [Khan, Ward, & Ingleby, 2005]; monitoring physiological changes using biometric measurements taken from the steering wheel and seat/seat-belt [Healey & Picard, 2000]; and analysing the acoustic cues contained in speech [Jones, 2004]. Currently, video cameras and biometric sensors are not fitted as standard in cars; however, speech-controlled systems are already commonplace. Voice-controlled satellite navigation, voice-dial mobile phones and voice-controlled multimedia systems exist, and drivers are comfortable with their use. To keep the solution non-intrusive, our approach is voice based, since voice-based emotion recognition can be incorporated without adding hardware or changing the driver's environment.

Speech is a powerful carrier of emotional information, and it has been shown that most emotions are associated with acoustic properties of the voice such as fundamental frequency, pitch, loudness, speech rate and frequency range [Nass & Brave, 2005]. The emotion recognition system presented in this paper uses 10 acoustic features, including pitch, volume, rate of speech and other spectral coefficients. The system then maps these features to emotions such as boredom, sadness, grief, frustration, extreme anger, happiness and surprise, using statistical and neural network classifiers. The emotion recognition system uses changes in acoustic features representative of emotional state whilst suppressing what is said and by whom. It is therefore speaker independent and utterance independent, and can be readily adapted to other languages. The system is trained on emotional speech obtained from United Kingdom and North American English-speaking drama students at the Royal Scottish Academy of Music and Drama (RSAMD), using personalised and strongly emotive scenarios and both constrained and free speech. All examples from RSAMD were validated in a blind listening study using human listeners before inclusion in our emotive speech corpus.
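The paper does not give implementation details of the feature extraction or the classifier. The following Python sketch, using the librosa and scikit-learn libraries (our choice, not mentioned in the paper), only illustrates the general approach of summarising prosodic and spectral cues per clip and training a small neural-network classifier on labelled emotive speech; feature choices, window sizes and classifier settings are assumptions.

# Hedged sketch: ten summary acoustic features per clip, loosely echoing the
# "10 acoustic features" mentioned above, fed to a neural-network classifier.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def acoustic_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # pitch track
    rms = librosa.feature.rms(y=y)[0]                     # loudness/volume envelope
    zcr = librosa.feature.zero_crossing_rate(y)[0]        # crude voicing/articulation proxy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=5)     # coarse spectral shape
    # Summarise each track with mean/spread to give a fixed-length vector (10 values).
    return np.array([f0.mean(), f0.std(), rms.mean(), rms.std(),
                     zcr.mean()] + list(mfcc.mean(axis=1)))

def train_classifier(clips):
    # clips: list of (wav_path, label) pairs, e.g. ("clip_001.wav", "happiness");
    # the file names and labels here are hypothetical placeholders.
    X = np.vstack([acoustic_features(path) for path, _ in clips])
    y = [label for _, label in clips]
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)
    return clf.fit(X, y)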

Using a test set of previously unseen emotive speech, the overall performance of the emotion recognition system is greater than 70% for the five emotional groups of boredom, sadness/grief, frustration/extreme anger, happiness and surprise. The emotion recognition system can track changes in emotional state over time and presents its emotion decisions as a numerical indicator of the degree of emotional cues present in the speech. To aid visualisation, the emotional state of the speaker can be displayed as an emotional face image, see Table 1.

Table 1: Tracking changes in emotional state over time using an acoustic emotion recognition system.

Speech waveform        [waveform of the driver's speech over the utterance]

Boredom    1.00   1.00   0.00      0.00      0.00      0.00   0.00   0.00   0.01
Happiness  0.00   0.00   0.90      1.00      0.00      0.00   0.00   0.00   0.00
Surprise   0.00   0.00   0.15      1.00      1.00      0.00   0.00   0.00   0.00
Anger      0.06   0.00   0.00      0.00      0.00      1.00   1.00   0.00   0.00
Sadness    0.31   0.09   0.00      0.00      0.00      0.00   0.00   0.00   0.99
Decision   Bored  Bored  Happy     Not sure  Surprise  Anger  Anger  None   Sad

Emotional face images  [face icons corresponding to each decision]


Table 1 shows the output of the emotion recognition system and is divided into three parts: the top part shows the waveform of the speech; the second part shows the numerical classification of the emotions (boredom, happiness, surprise, anger/frustration, and sadness/grief) followed by the decision of the emotion recognition system; and the bottom part shows a graphical representation of the emotion.
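The paper does not state how the Decision row of Table 1 is derived from the numerical indicators. One plausible reading of the table (an assumption, not the authors' rule) is that the system reports the single emotion with a high score, 'Not sure' when two emotions score highly, and 'None' when no score is significant. A minimal sketch:

def decide(scores, high=0.5):
    # scores: dict mapping emotion name -> degree of emotional cues (0..1).
    # Hypothetical thresholding; the paper does not specify the actual rule.
    strong = [emotion for emotion, value in scores.items() if value >= high]
    if not strong:
        return "None"
    if len(strong) > 1:
        return "Not sure"
    return strong[0]

# Example: the fourth column of Table 1, where happiness and surprise both score 1.00.
print(decide({"Boredom": 0.0, "Happiness": 1.0, "Surprise": 1.0,
              "Anger": 0.0, "Sadness": 0.0}))   # -> "Not sure"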

4. THE EMOTIVE DRIVER PROJECT

The emotive driver project builds on previous research, which focused on five groups of basic emotions, to assess the feasibility of automatically detecting driver emotions from speech [Jones & Jonsson, 2005]. The strategy presented in this paper considers detecting emotions along two dimensions, valence (positive and negative) and arousal (low energy and aroused), instead of named basic emotions. Assessing emotions along these two dimensions rather than sorting them into basic emotions reduces the problem of cultural differences and labelling. Instead of the output being a predefined emotion, the output becomes a value along the valence and arousal dimensions. This value can later be mapped to a basic emotion or, in case of ambiguity, to a set of basic emotions.
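As an illustration of this mapping step, the following sketch assumes a valence and an arousal value each normalised to [-1, 1] and uses the approximate quadrant placements of Figure 3; the numeric boundaries and the ambiguity margin are assumptions, not values from the paper.

def emotions_for(valence, arousal, ambiguity=0.2):
    # Quadrant labels loosely following Figure 3; boundaries are assumptions.
    quadrants = {
        (False, False): ["bored", "sad", "grief"],   # negative valence, low arousal
        (False, True):  ["frustration", "anger"],    # negative valence, high arousal
        (True,  False): ["calm", "content"],         # positive valence, low arousal
        (True,  True):  ["happy", "surprise"],       # positive valence, high arousal
    }
    labels = set(quadrants[(valence >= 0, arousal >= 0)])
    # Near a boundary, return the union of adjacent quadrants rather than guessing.
    if abs(valence) < ambiguity:
        labels |= set(quadrants[(valence < 0, arousal >= 0)])
    if abs(arousal) < ambiguity:
        labels |= set(quadrants[(valence >= 0, arousal < 0)])
    return sorted(labels)

print(emotions_for(-0.7, -0.5))   # -> ['bored', 'grief', 'sad']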

We set up an experimental study in which we recorded conversations between the car and the driver, which we then analysed to test the accuracy and validity of the automatic acoustic emotion recognition. The experiment consisted of an 8-day study at Oxford Brookes University, UK, using 41 participants, 20 male and 21 female. Participants were all aged 18-25 and all drove with a mixed informational and conversational in-car system.

The experimental study used the STISIM driving simulator [Stisim, 2005] for the driving session with the in-car information system. The driving simulator displays the road and controls such as the speedometer and rev counter, and provides a rear-view mirror and control buttons for left and right side views. Participants are seated in a real car seat and control the driving simulator using an accelerator pedal, a brake pedal, and a force-feedback steering wheel, see Figure 1. All participants experienced the same pre-defined route, driving conditions and car properties. The drive lasted approximately 20 minutes for each participant.

Figure 1: STISIM simulator and controls.

Engine noise, brake screech, indicators, sirens etc., together with the output from the in-car information system, were played through stereo speakers. The in-car information system was described to participants as a system that would present two types of information to the drivers: informational and conversational. The informational part of the system related to road, traffic and driving conditions and was purely informational:

• This road is very busy during rush hour and traffic is slow
• The police often use radar here so make sure to keep to the speed limit
• Pedestrians cross the road without looking in this school zone
• There is a traffic jam ahead, if you turn left you might avoid it
• This road has many construction zones
• This stretch of road often has a problem with the fog


The part of the information system that focused on engaging the driver in conversation was based on self-disclosure. Self-disclosure is elicited by reciprocity; the system discloses something about itself and then asks the driver a question about the same (or a similar) situation. Engaging drivers in conversation can be useful for reasons such as detecting the driver's emotional state, gathering information on driver preferences for a personalised in-car information system, and potentially helping drowsy drivers. Examples of sentences that were selected to engage the driver in conversation:

• I get stressed in traffic almost every day, how often do you get stressed with traffic problems?
• What do you think about the driving conditions?
• Do you generally like to drive at, above or below the speed limit?
• I like to drive with people who talk to me, what is your favorite person to drive with?
• I like driving on mountain roads, what's your favorite road to drive on?
• This is miserable, what's your strategy for coping with rain and fog whilst driving?

Speech from the participants was recorded using an Andrea directional beam with 4 microphones, placed in front of and about 1.5 meters away from the driver. This microphone is typical of those used in cars today and provided a clean acoustic recording without overly sampling the car noise. The driving sessions were also videotaped from the front left of the driver to show the driver's hands, arms, upper body, head and eye motion, and facial expressions. Further work and analysis will correlate the results from the acoustic emotion recognition with the emotions displayed in the faces of the drivers and with self-reported emotions from questionnaires that were also collected in the experiment. The questionnaires were based on the DES (Differential Emotions Scale) [Izard, 1977], a scale based on the theory and existence of ten basic emotions.

5. RESULTS FROM LISTENER AND AUTOMATIC EMOTION RECOGNITION SYSTEMS

The participants exhibit a range of emotions including boredom, sadness, anger, happiness and surprise; however, for most of the drive the participants are in a neutral emotional state or one somewhere between bored and sad. When challenged during the drive by obstacles in the road, other drivers, difficult road conditions and pedestrians, we observe strong emotions (aroused emotional states) from the drivers. For selected parts of the conversations, a transcript of the drive has been created which includes not only the words of the conversation but also the emotional state of the driver. The transcribed speech recording was also processed by the acoustic emotion recognition system, and its output classification is represented as emotive faces for each second of the drive.

The performance of the automatic emotion recognition system was assessed by comparing the human emotion transcript against the output from the recognition system. The human transcripts were created by trained experts in affective cues in speech. These experts did not take part in the driver study and are familiar with the emotion classifications of boredom, sadness/grief, frustration/extreme anger, happiness and surprise used by the automatic emotion recognition system. The experts were asked to listen to the speech soundtrack for each drive and report on the perceived emotional state of the driver. It is difficult for a human listener to provide a second-by-second description of the emotion present in the speech of the driver. Instead the human listeners were requested to report any changes in the perceived driver emotion, but were required to provide at least one classification of emotion per sentence. The granularity of emotion tracking therefore differs between the automatic emotion recognition system and the human listener. However, by visually comparing the automatic emotion recognition and human listener tracks for correlations and disparities between classifications of driver emotion, we provide sufficient qualitative evidence of the performance of the system. A more quantitative measure of the performance is difficult due to the limitations in discrimination of the driver emotions by the human listeners. There are occasions where there is disparity between the emotion classifications from the human experts: sadness and boredom can be confused, as can happiness and surprise. In these cases the automated emotion recognition track is examined to see if both sadness and boredom (or happiness and surprise) acoustic cues are present. An example speech and emotion transcript created by the human listener is shown in Table 2.
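The paper compares the two tracks visually. As an illustration only (not the authors' procedure), sentence-level agreement could be quantified by taking the majority system label within each expert-labelled span, assuming the expert transcript provides a time span per sentence and the system emits one label per second:

from collections import Counter

def sentence_agreement(human_spans, system_track):
    # human_spans: list of (start_s, end_s, label) from the expert transcript.
    # system_track: dict mapping each second -> system label (or "None").
    hits, total = 0, 0
    for start, end, label in human_spans:
        window = [system_track.get(t, "None") for t in range(start, end)]
        if not window:
            continue
        majority = Counter(window).most_common(1)[0][0]
        hits += (majority == label)
        total += 1
    return hits / total if total else 0.0

# Hypothetical example: one sentence from 280 s to 285 s labelled "happiness".
track = {280: "happiness", 281: "happiness", 282: "surprise", 283: "happiness", 284: "None"}
print(sentence_agreement([(280, 285, "happiness")], track))   # -> 1.0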


Table 2: Transcript of the conversation between the car and driver with classification of the driver's emotional state by the human listener.

4.40 [car]    What do you think about the driving conditions?
4.50 [driver] They are very good, dry, dry road no rain, generally good I think
5.20 [driver] Nice day
5.30 [car]    The police often use radar here so make sure to keep to the speed limit
5.35 [driver] Thank you
6.40 [car]    Do you generally like to drive at, above or below the speed limit?
6.47 [driver] Erm I generally drive just a little bit above the speed limit, which probably explains why I'm speeding a little bit now, but I generally also try and keep to it especially in 30 zones, erm and where the road is you know dangerous
7.15 [driver] Not like that at all
7.30 [car]    How do you like the car's performance?
7.35 [driver] Erm not exactly a Ferrari but it's okay, it's got a bit of a kick on it, handling's not great, bit of understeer, can't really see myself as the next Jeremy Clarkson can you
8.00 [car]    This is a really windy stretch
8.05 [driver] Bloody 'ell yeh
8.20 [car]    What types of situations makes you feel stressed whilst driving?
8.25 [driver] Erm generally er I think traffic certainly has an effect, erm I think people blocking the road like this guy is now

The output from the acoustic recognition system is divided into four parts: the top part is the waveform of the speech, the second part is the pitch characteristics of the speech, the third part is the volume characteristics of the speech, and at the bottom is the automatic classification of the emotion of the speech. An example of this output can be seen in Figure 2.

Figure 2: Automatically generated transcript of the driver's speech using the emotion recognition system.


Figure 2 shows example output from the acoustic recognition system for the same driver as in Table 2. The driver is asked (at time 7.30) "How do you like the car's performance?" and responds by saying "Erm not exactly a Ferrari but it's okay, it's got a bit of a kick on it, handling's not great, bit of understeer, can't really see myself as the next Jeremy Clarkson can you". The emotion track (bottom) shows that the driver is talking in an upbeat, joking fashion at the start and is enthusiastic, but becomes more matter-of-fact and downbeat towards the end of the segment.

There is a correlation between the emotional transcript created by the human listener and the emotion output returned automatically by the acoustic emotion recognition system. However, there are occasions where the speech is masked by car noise (such as engine noise, sirens and brakes). At other times, the automatic system could not disambiguate between emotional states, so that the driver was assessed to be in one of two emotional states: bored or sad (negative emotions with low arousal), or happy or surprised (positive emotions with moderate arousal).

6. DISCUSSION ON GROUPING AND LABELLING EMOTIONS

Comparing the emotional transcript of the human listener and the automated emotion recognition system on a sentence-by-sentence basis (and noting and comparing any changes in driver emotion), there is on average a 60-70% correlation for the five emotional groups. The human listener, however, has similar difficulties when classifying speech into emotions as experienced by the automatic system, and is particularly confused when differentiating between sadness and boredom and between happiness and surprise. This trend suggests that we should define emotions based on values along the valence and arousal dimensions, see Figure 3.

By doing this it is then possible to later label a recognised valence and arousal value as a particular emotion, or as one of a set of emotions covered by the valence and arousal values; see Figure 3 for approximations based on labelling by UK participants. It would allow us to group emotions, such that boredom and sadness together could be called 'downbeat'. It would also overcome much of the confusion and ambiguity associated with labelling emotions, and especially with cross-cultural labelling of emotions.

Figure 3: Emotions assessed along dimensions of valence and arousal. [The figure places bored, sad and grief in the negative-valence/low-arousal region; calm and content in the positive-valence/low-arousal region; frustration, anger and rage in the negative-valence/high-arousal region; and happy and surprise in the positive-valence/high-arousal region.]

Similarly, grouping happiness and surprise together as 'upbeat' could also reduce confusion. Grouping the emotions into three groups of boredom/sadness/grief {downbeat}, happiness/surprise {upbeat}, and anger/frustration {anger} is projected to produce a correlation of 70% to 80% between the emotional transcript of the human listener and the automatic system.

The five emotions currently detected by the automatic recognition system may not be the optimal range of emotions required for the emotionally responsive car. Hence the performance of the emotion recognition can be improved by assessing emotion along the valence and arousal dimensions and then grouping emotions into sets such as 'downbeat', 'upbeat' and 'angry'. Further mappings of emotional groupings can be achieved by collapsing the valence/arousal space such that low arousal is called 'non-aroused' and includes boredom/sadness/grief, whilst high arousal is called 'aroused' and includes happiness/surprise/anger/frustration. Using two emotional groups of 'non-aroused' and 'aroused', rather than five, is projected to produce a correlation of greater than 80% between the emotional transcript of the human listener and the automatic system. These figures are based on similar experiments using emotional conversations over mobile phone networks [Jones, 2004].
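A minimal sketch of the relabelling described above (the group names follow the paper; the mapping function and label spellings are illustrative assumptions):

# Remap the five fine-grained labels into coarser groups before scoring agreement.
THREE_GROUPS = {
    "boredom": "downbeat", "sadness/grief": "downbeat",
    "happiness": "upbeat", "surprise": "upbeat",
    "anger/frustration": "anger",
}
TWO_GROUPS = {
    "downbeat": "non-aroused",
    "upbeat": "aroused", "anger": "aroused",
}

def regroup(label, levels=3):
    coarse = THREE_GROUPS.get(label, label)
    return TWO_GROUPS.get(coarse, coarse) if levels == 2 else coarse

print(regroup("surprise"))            # -> "upbeat"
print(regroup("boredom", levels=2))   # -> "non-aroused"

Applying such a remapping to both the human and the automatic tracks before scoring would yield the coarser-grained correlations projected above.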

7. EXTENSIONS TO THE CURRENT RESEARCH

The research is ongoing and we continue to analyse speech from the drivers and consider improvements for future work [Jones & Jonsson, 2005]. Although the in-car system asks questions of the driver, some of the participants did not respond and engage in conversation. Of the 41 participants, 5 did not converse with the car (3 female and 2 male), and thus we were unable to ascertain their emotional state acoustically. Future work will consider why these drivers did not talk with the car. Were they too focused on the task of driving? Did they not like the particular voice of the car? Did they not like the questions that the car was asking? Did they not feel comfortable talking with the car? Did they not feel the car was listening to them and responding appropriately? These and many more questions need to be answered so that we can adapt the conversational interface to provide engaging conversations and information that is trusted, without a negative impact on driving performance. To do this we also need to develop and tune the acoustic emotion recognition to gain insight into the mood and emotional state of the driver.

There are of course more questions to be answered with an adaptive system. How fast should the system change? What should trigger the change? How would the change be implemented? Previous studies have considered varying the paralinguistic cues only [Isen, 2000]; however, should the content of the response also change, and how? Should the car become less or more talkative depending on the mood of the driver? Should the car alter the telematics, climate or music in the car in response to the mood of the driver?

Further research should consider the effect of altering the car response and car environment according to driver emotion. One strategy is to exhibit empathy by changing the emotion of the car voice to match the user. Empathy fosters relationship development, as it communicates support, caring, and concern for the welfare of another [Brave, 2003]. A voice which expresses happiness in situations where the user is happy, and sounds subdued or sad in situations where the user is upset, would strongly increase the connection between the user and the voice [Nass, Jonsson, Harris, Reaves, Endo, Brave, & Takayama, 2005].

Looking at the rate of change, although rapid response to the predicted emotion of the user can be effective, there are a number of dangers in this approach. Emotions can change in seconds in the human brain and body [Picard, 1997]. A sad person may momentarily be happy if someone tells a joke, but will fall back into their sad state relatively quickly. Conversely, happy drivers may become frustrated as they slam on the brakes for a yellow light, but their emotion may quickly switch back to feeling positive. If the voice in the car immediately adapted to the user's emotions, drivers would experience occurrences such as the car voice changing its emotion in mid-sentence. This would dramatically increase cognitive load, constantly activate new emotions in the driver, and be perceived as psychotic.

Mood must be taken into account to make the car voice an effective interaction partner. Moods tend to bias feelings and cognition over longer terms, and while moods can be influenced by emotions, they are more stable and effectively filter events. A person in a good mood tends to view everything in a positive light, while a person in a bad mood does the opposite. Drivers who are in a good mood when entering a car are more likely to experience positive emotion during an interaction with a car voice than drivers in a bad mood. Therefore it seems that emotion in technology-based voices must balance responsiveness and inertia by orienting to both emotion and mood.
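One simple way to give a car voice this balance of responsiveness and inertia (an illustrative sketch only; no such mechanism is described in the paper) is to keep a slowly updated running estimate of valence as a proxy for mood alongside the per-utterance emotion, and to let the voice orient to the smoothed value:

class MoodTracker:
    """Exponential moving average of per-utterance valence.
    The smoothing factor and thresholds are illustrative assumptions."""
    def __init__(self, alpha=0.1):
        self.alpha = alpha      # small alpha -> the mood estimate changes slowly
        self.mood = 0.0         # running valence estimate in [-1, 1]

    def update(self, utterance_valence):
        self.mood = (1 - self.alpha) * self.mood + self.alpha * utterance_valence
        return self.mood

    def voice_tone(self):
        # The car voice follows the slow mood estimate, not the momentary emotion.
        if self.mood > 0.2:
            return "upbeat"
        if self.mood < -0.2:
            return "subdued"
        return "neutral"

tracker = MoodTracker()
for valence in [0.6, 0.5, -0.8, 0.4]:   # a brief frustration among positive utterances
    tracker.update(valence)
print(tracker.voice_tone())   # -> "neutral": mood moves slowly, so the spike does not flip the tone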

Performance, attention, knowledge, beliefs, and feelings are to a large extent determined by emotions. People are influenced by (voice) interactions with both people and interfaces, and this makes it important for designers of speech-based systems to work with linguistic and paralinguistic cues (including emotional cues) to create the desired effect when people interact with the system.


8. REFERENCES

Brave, S. (2003). Agents that care: Investigating the effects of orientation of emotion exhibited by an embodied computer agent, Doctoral dissertation. Stanford University, CA

Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W. & Taylor, J.G. (2001). Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 32-80.

Davidson, R. J. (1994). On emotion, mood, and related affective constructs. In P. Ekman & R. J. Davidson (Eds.), The nature of emotion (pp. 51-55). New York: Oxford University Press.

Galovski, T. & Blanchard, E. (2004) Road rage: a domain for psychological intervention? Aggressive Violent Behavior 9(2), pp. 105-127.

Groeger, J.A. (2000). Understanding driving: Applying cognitive psychology to a complex everyday task. Hove, U.K.: Psychology Press.

Gross, J. J. (1998). Antecedent- and response-focused emotion regulation: Divergent consequences for experience, expression, and physiology. Journal of Personality and Social Psychology, 74, 224-237

Gross, J. J. (1999). Emotion and emotion regulation. In L. A. Pervin & O. P. John (Eds.), Handbook of personality: Theory and research (2nd ed., pp. 525-552). New York: Guildford.

Healey, J. & Picard, R. (2000). SmartCar: Detecting driver stress, In Proceedings of ICPR 2000, Barcelona, Spain, 2000

Humaine Portal. (2004). Research on emotion and human-machine interaction, http://www.emotion-research.net/

Isen, A.M. (2000). Positive affect and decision making, in Lewis, M. and Haviland-Jones, J.M. eds. Handbook of emotions, The Guilford Press, 417-435

Izard, C. (1977). Human emotions. New York: Plenum Press.

Jones, C. (2004). Project to develop voice-driven emotive technologies, Scottish Executive, Enterprise transport and lifelong learning department, UK

Jones, C. & Jonsson, I-M. (2005). Speech patterns for older adults while driving, In Proceedings of HCI International 2005, Las Vegas, Nevada, USA, 22-27 July 2005

Jonsson, I-M., Nass, C., Harris, H., & Takayama, L., (2005). Got info? Examining the consequence of inaccurate information systems, In Proceedings of 3rd International Symposium on Human Factors in Driving Assessment, Training and Vehicle Design.

Kapoor, A., Qi, Y., & Picard, R. (2003). Fully automatic upper facial action recognition, IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG 2003) held in conjunction with ICCV 2003, Nice, France, October 2003

Khan, M., Ward, R, & Ingleby, M., (2005). Distinguishing facial expressions by thermal imaging using facial thermal feature points, In Proceedings of HCI 2005, Edinburgh, UK, 5-9 September 2005

Kleinginna, P. R., Jr., & Kleinginna, A. M. (1981). A categorized list of emotion definitions, with suggestions for a consensual definition. Motivation and Emotion, 5(4), 345-379.

Lunenfeld, H. (1989). Human factor considerations of motorist navigation and information systems. In Proceedings of Vehicle Navigation and Information Systems, 35-42.

Murray, N., Sujan, H., Hirt, E. R., & Sujan, M. (1990). The influence of mood on categorization: A cognitive flexibility interpretation. Journal of Personality and Social Psychology, 59, 411-425.

Nass, C., & Brave, S. (2005). Wired for speech: How voice activates and advances the human-computer relationship. Cambridge, MA: MIT Press.

Nass, C., Jonsson, I-M., Harris, H., Reaves, B., Endo, J., Brave, S., & Takayama, L. (2005). Increasing safety in cars by matching driver emotion and car voice emotion. In Proceedings of CHI 2005, Portland, Oregon, USA, 2-7 April 2005.

Niedenthal, P. M., Setterlund, M. B., & Jones, D. E. (1994). Emotional organization of perceptual memory. In P. M. Niedenthal & S. Kitayama (Eds.), The heart's eye (pp. 87-113). San Diego: Academic Press.

Picard, R. W. (1997). Affective computing. Cambridge, MA: MIT Press.

STISIM drive system (2005). Systems Technology, Inc., California, http://www.systemstech.com/

Strayer, D., & Johnston, W. (2001). Driven to distraction: Dual-task studies of simulated driving and conversing on a cellular telephone. Psychological Science, 12, 462-466.

9. ACKNOWLEDGEMENTS

The research is supported by Affective Media Limited, Edinburgh, UK. We thank them for their assistance in developing the acoustic emotion recognition system and in the analysis of the driver speech data. We also thank Mary Zajick at Oxford Brookes University for assistance in running the driving simulator studies in which the speech data were collected.