Labelling Emotional User States in Speech: Where's the Problems, where's the Solutions?
Anton Batliner, Stefan Steidl
University of Erlangen
HUMAINE WP5-WS, Belfast, December 2004
Overview
• decisions to be made
  – mapping data onto labels
  – the human factor
  – and later on: again some mapping
• illustrations
  – the sparse data problem
  – new dimensions
  – new measure for labeller agreement
• and afterwards: what to do with the data?
  – handling of databases
  – we proudly present: CEICES
• some statements
Mapping data onto labels I
• catalogue of labels
  – data-driven (selection from the "HUMAINE" catalogue?)
  – should be a semi-open class
• unit of annotation
  – word (phrase or turn) in speech
  – in video?
• alignment of video time stamps with speech data necessary
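The alignment of video time stamps with word-based speech annotations can be sketched as a simple frame-overlap lookup; the frame rate and the function name are assumptions for illustration, not part of the talk:

```python
# Sketch: map a word's speech-time span onto the video frames it overlaps.
# Assumes word boundaries in seconds and a fixed frame rate (illustrative).

def frames_for_word(start_s, end_s, fps=25.0):
    """Return the video frame indices that overlap a word's time span."""
    first = int(start_s * fps)          # frame containing the word onset
    last = int(end_s * fps)             # frame containing the word offset
    return list(range(first, last + 1))
```

A word spanning 0.5-1.0 s at 25 fps, for instance, covers frames 12 through 25.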
Mapping data onto labels II
• categorical (hard/soft) labelling vs. dimensions
• formal vs. functional labelling *
  – functional: holistic user states
  – formal: prosody, voice quality, syntax, lexicon, FAPs, ...
• reference baseline
  – speaker-/user-specific
  – neutral phase at the beginning of the interaction
  – sliding window

* emotion content vs. signs of emotion?
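The sliding-window variant of the reference baseline can be sketched as follows, e.g. for an F0 contour; the window length and the first-frame fallback are illustrative choices:

```python
# Sketch: normalise each frame's F0 against the mean of the preceding window,
# so "neutral" is speaker- and time-local rather than fixed once at the start
# of the interaction. Window size (5) is an arbitrary illustrative value.

def sliding_baseline(f0, window=5):
    normalised = []
    for i, value in enumerate(f0):
        past = f0[max(0, i - window):i] or f0[:1]  # fall back to the first frame
        baseline = sum(past) / len(past)
        normalised.append(value - baseline)
    return normalised
```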
The human factor I
• expert labellers vs. naïve labellers
  – experts:
    • experienced, i.e. consistent
    • with (theoretical) bias
    • expensive
    • few
  – "naïve" labellers:
    • maybe less consistent
    • no bias, i.e. ground truth?
    • less expensive
    • more
• representative data = many data = high effort
• are there "bad" labellers?
• does high interlabeller agreement really mean "good" labellers?
The human factor II: evaluation of annotations

[diagram: evaluation of annotations, linking WP3, past WP9, "kappa etc.", and engineering]
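One concrete instance of "kappa etc." is Fleiss' kappa for more than two labellers; a minimal pure-Python sketch (here `counts[i][j]` is how many labellers put item i into category j; all names are illustrative):

```python
# Sketch: Fleiss' kappa, a chance-corrected measure of inter-labeller
# agreement. Each row of `counts` holds the per-category label counts that
# the labellers assigned to one item (word/turn).

def fleiss_kappa(counts):
    n_items = len(counts)
    n_raters = sum(counts[0])
    # mean per-item observed agreement
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # chance agreement from the overall category proportions
    totals = [sum(col) for col in zip(*counts)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

With perfect agreement kappa is 1; systematic disagreement pushes it below 0.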
and later on
• mapping of labels onto cover classes
  – sparse data
  – classification performance
• embedding into the application task
  – small number of alternatives
  – criteria?
  – dimensional labels adequate? *
    • human processing
    • system restrictions

* cf. the story of 33 vs. 2 levels of accentuation
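Mapping labels onto cover classes can be sketched as a simple lookup; the grouping below (negative states pooled, rare states pooled into a rest class) is a hypothetical example, not the grouping fixed in the talk:

```python
# Sketch: collapse fine-grained emotional user states onto a few cover
# classes to counter sparse data. This particular grouping is illustrative.

COVER = {
    "angry": "Anger", "touchy": "Anger", "reprimanding": "Anger",
    "emphatic": "Emphatic",
    "motherese": "Motherese",
    "neutral": "Neutral",
}

def to_cover_class(label):
    # rare labels (joyful, surprised, bored, ...) fall into a rest class
    return COVER.get(label, "Other")
```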
The Sparse Data Problem
• un-balanced distribution (Pareto?)
  – (too) few for robust training
  – down- or up-sampling necessary for testing
• looking for "interesting" ("provocative"?) data: does this mean to beg the question?
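The down-sampling mentioned above might look like this; the function name, the random seed, and the 1:1 target ratio are illustrative choices:

```python
import random

# Sketch: down-sample the dominant class (typically "neutral") so a test set
# is not swamped by it. ratio=1.0 keeps as many majority items as there are
# minority items; the fixed seed makes the draw reproducible.

def downsample(items, labels, majority="neutral", ratio=1.0, seed=0):
    rng = random.Random(seed)
    pairs = list(zip(items, labels))
    minority = [p for p in pairs if p[1] != majority]
    dominant = [p for p in pairs if p[1] == majority]
    keep = min(len(dominant), int(ratio * len(minority)))
    return minority + rng.sample(dominant, keep)
```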
The Sparse Data Problem: Some Frequencies, word-based
Label        | SmartKom # $          | SympaFly # $ | AIBO (German, 5 labellers): min / max / mean / maj.vot. | AIBO (English, 3 labellers): mean / maj.vot.
-------------|-----------------------|--------------|---------------------------------------------------------|---------------------------------------------
Joyful       | strong 93 / weak 580  | 58           | 89 / 679 / 298 / 101                                    | 20 / 11
Neutral      | 7827                  | 15390        | 17180 / 44234 / 34526 / 39182                           | 6790 / 7172
Motherese    | -                     | -            | 768 / 3562 / 2084 / 1261                                | 96 / 55
Emphatic     | ?                     | 3708         | 361 / 22921 / 7989 / 2528                               | 982 / 631
Surprised    | 62                    | 31           | 1 / 30 / 7 / 0                                          | 0 / 0
Ironic       | -                     | 395          | -                                                       | -
Helpless     | 1065                  | 654          | 4 / 480 / 176 / 3                                       | 52 / 20
Bored        | -                     | -            | 92 / 593 / 231 / 11                                     | 46 / 0
Panic        | -                     | 43           | -                                                       | -
Reprimanding | -                     | -            | 152 / 2611 / 1185 / 310                                 | 369 / 127
Touchy       | -                     | 806          | 244 / 3130 / 1483 / 225                                 | 41 / 7
Angry        | strong 418 / weak 138 | 40           | 42 / 740 / 319 / 84                                     | 50 / 23

scenarios: SmartKom: human - WOZ multi-modal dialogue system; SympaFly: human - automatic call center; AIBO: human (child) - pet robot

$ consensus labelling
scenario-specific: ironic vs. motherese/reprimanding; emphatic: in-between; rare birds in AIBO: surprised, helpless, bored
not neutral in %: - /23 (27/9.6) 10.3/4.6 15.4/8 with/without emphatic
Towards New Dimensions
• from categories to dimensions
• confusion matrices = similarities
• Non-Metric Multi-Dimensional Scaling (NMDS)
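The "confusion matrices = similarities" step can be sketched in pure Python; assuming row-wise confusion percentages, mutual confusions are symmetrised and inverted into the dissimilarities an NMDS routine expects:

```python
# Sketch: derive symmetric dissimilarities from a labeller confusion matrix,
# as input for non-metric MDS (NMDS). Assumes rows hold confusion percentages
# (0..100); the more two labels are confused, the *smaller* their distance.

def confusion_to_dissimilarity(conf):
    n = len(conf)
    dis = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                                   # self-distance stays 0
            similarity = (conf[i][j] + conf[j][i]) / 2.0   # symmetrise
            dis[i][j] = 100.0 - similarity                 # invert into a distance
    return dis
```

Such a matrix could then be handed to an NMDS implementation with precomputed dissimilarities, e.g. scikit-learn's `MDS(metric=False, dissimilarity="precomputed")`, to obtain a 2-dimensional configuration.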
11 emotional user state labels, data-driven, word-based *

• joyful
• surprised
• motherese
• neutral (default)
• rest (waste-paper-basket, non-neutral)
• bored
• helpless, hesitant
• emphatic (possibly indicating problems)
• touchy (= irritated)
• angry
• reprimanding

* effort: 10-15 times real-time
confusion matrix: majority voting 3/5 vs. rest; if 2/2/1, both 2/2 as maj. vot. ("pre-emphasis")
          A   T   R   J   M   E   N   W   S   B   H
Angry     43  13  12  00  00  12  18  00  00  00  00
Touchy    04  42  11  00  00  13  23  00  00  02  00
Reprim.   03  15  45  00  01  14  18  00  00  00  00
Joyful    00  00  01  54  02  07  32  00  00  00  00
Mother.   00  00  01  00  61  04  30  00  00  00  00
Emph.     01  05  06  00  01  53  29  00  00  00  00
Neutral   00  02  01  00  02  13  77  00  00  00  00
Waste-p.  00  07  06  00  08  19  21  32  00  01  01
Surpr.    00  00  00  00  00  20  40  00  40  00  00
Bored     00  14  01  00  01  12  28  01  00  39  00
Helpl.    00  01  00  02  00  12  37  03  00  00  41

R: reprimanding, W: waste-paper-basket category
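The voting rule in the caption ("relative" majority; in a 2/2/1 split both 2-vote labels count) can be sketched as follows; returning a list of winners is one illustrative way to keep ties:

```python
from collections import Counter

# Sketch of the voting scheme: with 5 labellers, a 3/5 label wins outright;
# in a 2/2/1 split ("pre-emphasis") both 2-vote labels are kept as majority
# candidates; a 1/1/1/1/1 split yields no majority at all.

def majority_vote(labels):
    counts = Counter(labels)
    best = max(counts.values())
    if best < 2:
        return []                      # complete disagreement: no majority
    return sorted(lab for lab, c in counts.items() if c == best)
```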
"traditional" emotional dimensions in Feeltrace: VALENCE and AROUSAL
NMDS: 2-dimensional solution with 7 labels, "relative" majority voting with "pre-emphasis"

[figure: derived stimulus configuration (Euclidean distance model); Dimension 1 (= valence?) vs. Dimension 2 (= orientation?); plotted labels: ANGRY, TOUCHY, REPRIMANDING, JOYFUL, MOTHERESE, EMPHATIC, NEUTRAL]
and back
• from categories to dimensions
• what about the way back?
  – automatic clustering?
  – thresholds
  – ...
Towards New Quality Measures
Stefan Steidl
Entropy-Based Evaluation of Decoders
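One building block of an entropy-based evaluation is the entropy of the label distribution that the labellers produced for a single item: 0 bits means full agreement, higher values mean more disagreement. The sketch below only illustrates this core idea; Steidl's actual decoder evaluation is more elaborate:

```python
import math
from collections import Counter

# Sketch: entropy (in bits) of the labels assigned to one word/turn by the
# labellers. Identical labels give 0.0; an even split maximises the entropy.

def label_entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```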
Handling of Databases
• http://www.phonetik.uni-muenchen.de/Forschung/BITS/index.html
• Publications:
  – The Production of Speech Corpora (ISBN: 3-8330-0700-1)
  – The Validation of Speech Corpora (ISBN: 3-8330-0700-1)
The Production of Speech Corpora. Florian Schiel, Christoph Draxler, Angela Baumann, Tania Ellbogen, Alexander Steffen. Version 2.5, June 1, 2004.
CEICES
• Combining Efforts for Improving automatic Classification of Emotional user States: a "forced co-operation" initiative under the guidance of HUMAINE
  – evaluation of annotations
  – assessment of F0 extraction algorithms
  – assessment of the impact of single features (feature classes)
  – improvement of classification performance via sharing of features
Ingredients of CEICES
• speech data: German AIBO database
• annotations:
  – functional, emotional user states, word-based
  – (prosodic peculiarities, word-based)
• manually corrected:
  – segment boundaries for words
  – F0
• specifications of Train/Vali/Test, etc.
• reduction of effort: ASCII file sharing via portal
• forced co-operation via agreement
[diagram: manually corrected F0 (corr. F0) vs. automatic F0 (aut. F0), word boundaries, pitch labels]
Agreement
• open for non-HUMAINE partners
• nominal fee for distribution and handling
• commitments:
  – to share labels and extracted feature values
  – to use specified sub-samples
• expected outcome:
  – assessment of F0 extraction, impact of features, ...
  – set of feature classes/vectors with evaluation
  – common publication(s)
some statements
• annotation has to be data-driven
• there are no bad labellers
• classification results have to be used for labelling assessment
• automatic labelling is not good enough, or maybe you should call it "extraction"
• each label type has to be mapped onto very few categorical classes at the end of the day
Thank you for your attention