Labelling Emotional User States in Speech: Where's the Problems, where's the Solutions?
Anton Batliner, Stefan Steidl
University of Erlangen
HUMAINE WP5-WS, Belfast, December 2004
Overview
• decisions to be made
  – mapping data onto labels
  – the human factor
  – and later on: again some mapping
• illustrations
  – the sparse data problem
  – new dimensions
  – new measure for labeller agreement
• and afterwards: what to do with the data?
  – handling of databases
  – we proudly present: CEICES
• some statements
Mapping data onto labels I
• catalogue of labels
  – data-driven (selection from the "HUMAINE" catalogue?)
  – should be a semi-open class
• unit of annotation
  – word (phrase or turn) in speech
  – in video?
• alignment of video time stamps with speech data necessary
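The alignment of video time stamps with word-based speech annotations can be sketched as a simple frame-overlap lookup; the frame rate and the function name are assumptions for illustration, not part of the talk:

```python
# Sketch: map a word's speech-time span onto the video frames it overlaps.
# Assumes word boundaries in seconds and a fixed frame rate (illustrative).

def frames_for_word(start_s, end_s, fps=25.0):
    """Return the video frame indices that overlap a word's time span."""
    first = int(start_s * fps)          # frame containing the word onset
    last = int(end_s * fps)             # frame containing the word offset
    return list(range(first, last + 1))
```

A word spanning 0.5-1.0 s at 25 fps, for instance, covers frames 12 through 25.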
Mapping data onto labels II
• categorical (hard/soft) labelling vs. dimensions
• formal vs. functional labelling *
  – functional: holistic user states
  – formal: prosody, voice quality, syntax, lexicon, FAPs, ...
• reference baseline
  – speaker-/user-specific
  – neutral phase at the beginning of the interaction
  – sliding window

* emotion content vs. signs of emotion?
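The sliding-window variant of the reference baseline can be sketched as follows, e.g. for an F0 contour; the window length and the first-frame fallback are illustrative choices:

```python
# Sketch: normalise each frame's F0 against the mean of the preceding window,
# so "neutral" is speaker- and time-local rather than fixed once at the start
# of the interaction. Window size (5) is an arbitrary illustrative value.

def sliding_baseline(f0, window=5):
    normalised = []
    for i, value in enumerate(f0):
        past = f0[max(0, i - window):i] or f0[:1]  # fall back to the first frame
        baseline = sum(past) / len(past)
        normalised.append(value - baseline)
    return normalised
```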
The human factor I
• expert labellers vs. naïve labellers
  – experts:
    • experienced, i.e. consistent
    • with (theoretical) bias
    • expensive
    • few
  – "naïve" labellers:
    • maybe less consistent
    • no bias, i.e. ground truth?
    • less expensive
    • more
• representative data = many data = high effort
• are there "bad" labellers?
• does high interlabeller agreement really mean "good" labellers?
The human factor II: evaluation of annotations

[diagram: evaluation of annotations, linking WP3, past WP9, "kappa etc.", and engineering]
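One concrete instance of "kappa etc." is Fleiss' kappa for more than two labellers; a minimal pure-Python sketch (here `counts[i][j]` is how many labellers put item i into category j; all names are illustrative):

```python
# Sketch: Fleiss' kappa, a chance-corrected measure of inter-labeller
# agreement. Each row of `counts` holds the per-category label counts that
# the labellers assigned to one item (word/turn).

def fleiss_kappa(counts):
    n_items = len(counts)
    n_raters = sum(counts[0])
    # mean per-item observed agreement
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # chance agreement from the overall category proportions
    totals = [sum(col) for col in zip(*counts)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

With perfect agreement kappa is 1; systematic disagreement pushes it below 0.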
and later on
• mapping of labels onto cover classes
  – sparse data
  – classification performance
• embedding into the application task
  – small number of alternatives
  – criteria?
  – dimensional labels adequate? *
    • human processing
    • system restrictions

* cf. the story of 33 vs. 2 levels of accentuation
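Mapping labels onto cover classes can be sketched as a simple lookup; the grouping below (negative states pooled, rare states pooled into a rest class) is a hypothetical example, not the grouping fixed in the talk:

```python
# Sketch: collapse fine-grained emotional user states onto a few cover
# classes to counter sparse data. This particular grouping is illustrative.

COVER = {
    "angry": "Anger", "touchy": "Anger", "reprimanding": "Anger",
    "emphatic": "Emphatic",
    "motherese": "Motherese",
    "neutral": "Neutral",
}

def to_cover_class(label):
    # rare labels (joyful, surprised, bored, ...) fall into a rest class
    return COVER.get(label, "Other")
```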
The Sparse Data Problem
• un-balanced distribution (Pareto?)
  – (too) few for robust training
  – down- or up-sampling necessary for testing
• looking for "interesting" ("provocative"?) data: does this mean to beg the question?
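The down-sampling mentioned above might look like this; the function name, the random seed, and the 1:1 target ratio are illustrative choices:

```python
import random

# Sketch: down-sample the dominant class (typically "neutral") so a test set
# is not swamped by it. ratio=1.0 keeps as many majority items as there are
# minority items; the fixed seed makes the draw reproducible.

def downsample(items, labels, majority="neutral", ratio=1.0, seed=0):
    rng = random.Random(seed)
    pairs = list(zip(items, labels))
    minority = [p for p in pairs if p[1] != majority]
    dominant = [p for p in pairs if p[1] == majority]
    keep = min(len(dominant), int(ratio * len(minority)))
    return minority + rng.sample(dominant, keep)
```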
The Sparse Data Problem: Some Frequencies, word-based
Label        | SmartKom # $          | SympaFly # $ | AIBO (German, 5 labellers): min / max / mean / maj.vot. | AIBO (English, 3 labellers): mean / maj.vot.
-------------|-----------------------|--------------|---------------------------------------------------------|---------------------------------------------
Joyful       | strong 93 / weak 580  | 58           | 89 / 679 / 298 / 101                                    | 20 / 11
Neutral      | 7827                  | 15390        | 17180 / 44234 / 34526 / 39182                           | 6790 / 7172
Motherese    | -                     | -            | 768 / 3562 / 2084 / 1261                                | 96 / 55
Emphatic     | ?                     | 3708         | 361 / 22921 / 7989 / 2528                               | 982 / 631
Surprised    | 62                    | 31           | 1 / 30 / 7 / 0                                          | 0 / 0
Ironic       | -                     | 395          | -                                                       | -
Helpless     | 1065                  | 654          | 4 / 480 / 176 / 3                                       | 52 / 20
Bored        | -                     | -            | 92 / 593 / 231 / 11                                     | 46 / 0
Panic        | -                     | 43           | -                                                       | -
Reprimanding | -                     | -            | 152 / 2611 / 1185 / 310                                 | 369 / 127
Touchy       | -                     | 806          | 244 / 3130 / 1483 / 225                                 | 41 / 7
Angry        | strong 418 / weak 138 | 40           | 42 / 740 / 319 / 84                                     | 50 / 23

scenarios: SmartKom: human - WOZ multi-modal dialogue system; SympaFly: human - automatic call center; AIBO: human (child) - pet robot

$ consensus labelling
scenario-specific: ironic vs. motherese/reprimanding; emphatic: in-between; rare birds in AIBO: surprised, helpless, bored
not neutral in %: - /23 (27/9.6) 10.3/4.6 15.4/8 with/without emphatic
Towards New Dimensions
• from categories to dimensions
• confusion matrices = similarities
• Non-Metric Multi-Dimensional Scaling (NMDS)
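The "confusion matrices = similarities" step can be sketched in pure Python; assuming row-wise confusion percentages, mutual confusions are symmetrised and inverted into the dissimilarities an NMDS routine expects:

```python
# Sketch: derive symmetric dissimilarities from a labeller confusion matrix,
# as input for non-metric MDS (NMDS). Assumes rows hold confusion percentages
# (0..100); the more two labels are confused, the *smaller* their distance.

def confusion_to_dissimilarity(conf):
    n = len(conf)
    dis = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                                   # self-distance stays 0
            similarity = (conf[i][j] + conf[j][i]) / 2.0   # symmetrise
            dis[i][j] = 100.0 - similarity                 # invert into a distance
    return dis
```

Such a matrix could then be handed to an NMDS implementation with precomputed dissimilarities, e.g. scikit-learn's `MDS(metric=False, dissimilarity="precomputed")`, to obtain a 2-dimensional configuration.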
11 emotional user state labels, data-driven, word-based *

• joyful
• surprised
• motherese
• neutral (default)
• rest (waste-paper-basket, non-neutral)
• bored
• helpless, hesitant
• emphatic (possibly indicating problems)
• touchy (= irritated)
• angry
• reprimanding

* effort: 10-15 times real-time
confusion matrix: majority voting 3/5 vs. rest; if 2/2/1, both 2/2 as maj. vot. ("pre-emphasis")
          A   T   R   J   M   E   N   W   S   B   H
Angry     43  13  12  00  00  12  18  00  00  00  00
Touchy    04  42  11  00  00  13  23  00  00  02  00
Reprim.   03  15  45  00  01  14  18  00  00  00  00
Joyful    00  00  01  54  02  07  32  00  00  00  00
Mother.   00  00  01  00  61  04  30  00  00  00  00
Emph.     01  05  06  00  01  53  29  00  00  00  00
Neutral   00  02  01  00  02  13  77  00  00  00  00
Waste-p.  00  07  06  00  08  19  21  32  00  01  01
Surpr.    00  00  00  00  00  20  40  00  40  00  00
Bored     00  14  01  00  01  12  28  01  00  39  00
Helpl.    00  01  00  02  00  12  37  03  00  00  41

R: reprimanding, W: waste-paper-basket category
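The voting rule in the caption ("relative" majority; in a 2/2/1 split both 2-vote labels count) can be sketched as follows; returning a list of winners is one illustrative way to keep ties:

```python
from collections import Counter

# Sketch of the voting scheme: with 5 labellers, a 3/5 label wins outright;
# in a 2/2/1 split ("pre-emphasis") both 2-vote labels are kept as majority
# candidates; a 1/1/1/1/1 split yields no majority at all.

def majority_vote(labels):
    counts = Counter(labels)
    best = max(counts.values())
    if best < 2:
        return []                      # complete disagreement: no majority
    return sorted(lab for lab, c in counts.items() if c == best)
```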
"traditional" emotional dimensions in Feeltrace: VALENCE and AROUSAL
NMDS: 2-dimensional solution with 7 labels, "relative" majority voting with "pre-emphasis"

[figure: derived stimulus configuration (Euclidean distance model); Dimension 1 (= valence?) vs. Dimension 2 (= orientation?); plotted labels: ANGRY, TOUCHY, REPRIMANDING, JOYFUL, MOTHERESE, EMPHATIC, NEUTRAL]
and back
• from categories to dimensions
• what about the way back?
  – automatic clustering?
  – thresholds
  – ...
Towards New Quality Measures
Stefan Steidl
Entropy-Based Evaluation of Decoders
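One building block of an entropy-based evaluation is the entropy of the label distribution that the labellers produced for a single item: 0 bits means full agreement, higher values mean more disagreement. The sketch below only illustrates this core idea; Steidl's actual decoder evaluation is more elaborate:

```python
import math
from collections import Counter

# Sketch: entropy (in bits) of the labels assigned to one word/turn by the
# labellers. Identical labels give 0.0; an even split maximises the entropy.

def label_entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```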
Handling of Databases
• http://www.phonetik.uni-muenchen.de/Forschung/BITS/index.html
• Publications:
  – The Production of Speech Corpora (ISBN: 3-8330-0700-1)
  – The Validation of Speech Corpora (ISBN: 3-8330-0700-1)
The Production of Speech Corpora. Florian Schiel, Christoph Draxler, Angela Baumann, Tania Ellbogen, Alexander Steffen. Version 2.5, June 1, 2004.
CEICES
• Combining Efforts for Improving automatic Classification of Emotional user States: a "forced co-operation" initiative under the guidance of HUMAINE
  – evaluation of annotations
  – assessment of F0 extraction algorithms
  – assessment of the impact of single features (feature classes)
  – improvement of classification performance via sharing of features
Ingredients of CEICES
• speech data: German AIBO database
• annotations:
  – functional, emotional user states, word-based
  – (prosodic peculiarities, word-based)
• manually corrected:
  – segment boundaries for words
  – F0
• specifications of Train/Vali/Test, etc.
• reduction of effort: ASCII file sharing via portal
• forced co-operation via agreement
[diagram: manually corrected F0 (corr. F0) vs. automatic F0 (aut. F0), word boundaries, pitch labels]
Agreement
• open for non-HUMAINE partners
• nominal fee for distribution and handling
• commitments:
  – to share labels and extracted feature values
  – to use specified sub-samples
• expected outcome:
  – assessment of F0 extraction, impact of features, ...
  – set of feature classes/vectors with evaluation
  – common publication(s)
some statements
• annotation has to be data-driven
• there are no bad labellers
• classification results have to be used for labelling assessment
• automatic labelling is not good enough, or maybe you should call it "extraction"
• each label type has to be mapped onto very few categorical classes at the end of the day
Thank you for your attention