Post on 27-Dec-2015
Occasion: HUMAINE / WP4 / Workshop "From Signals to Signs of Emotion and Vice Versa" Santorin / Fira, 18th – 22nd September, 2004
Talk: Ronald Müller
Speech Emotion Recognition Combining Acoustic and Semantic Analyses
Institute for Human-Machine Communication
Technische Universität München
Slide -2-
Outline
System Overview
Emotional Speech Corpus
Acoustic Analysis
Semantic Analysis
Stream Fusion
Results
Slide -3-
System Overview
Acoustic stream: Speech signal -> Prosodic features -> Classifier (SVM)
Semantic stream: Speech signal -> ASR unit -> Semantic interpretation (Bayesian Networks)
Both streams -> Stream fusion (MLP) -> Emotion
Slide -4-
Emotional Speech Corpus
Emotion set:
Anger, disgust, fear, joy, neutrality, sadness, surprise
Corpus 1: Practical course
404 acted samples per emotion (2828 samples in total)
13 speakers (1 female)
Recorded within one year
Corpus 2: Driving simulator
500 spontaneous emotion samples
200 acted samples (disgust, sadness)
700 samples in total
Slide -6-
Acoustic Analysis
Low-level features
Pitch contour (AMDF, low-pass filtering)
Energy contour
Spectrum
Signal
High-level features
Statistical analysis of contours
Removal of the mean, normalization by the standard deviation
Duration of one utterance (1-5 seconds)
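The pitch and normalization steps above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the talk: the function names, the 120 to 400 Hz search range, and the synthetic 200 Hz test frame are assumptions, and the low-pass filtering mentioned on the slide is omitted.

```python
import math

def amdf_pitch(frame, fs, fmin=120.0, fmax=400.0):
    """Estimate the pitch (Hz) of one voiced frame with the average
    magnitude difference function (AMDF). fmin/fmax are assumed bounds."""
    lag_min = int(fs / fmax)  # shortest period considered, in samples
    lag_max = int(fs / fmin)  # longest period considered, in samples
    if len(frame) <= lag_max:
        raise ValueError("frame too short for the chosen fmin")
    best_lag, best_value = lag_min, float("inf")
    for lag in range(lag_min, lag_max + 1):
        n = len(frame) - lag
        # Mean absolute difference between the frame and its shifted copy.
        value = sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n
        if value < best_value:
            best_lag, best_value = lag, value
    # The lag with the smallest AMDF value is taken as the pitch period.
    return fs / best_lag

def normalize(values):
    """Remove the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]

# Synthetic 40 ms frame of a 200 Hz sine at 16 kHz (illustration only).
FS = 16000
FRAME = [math.sin(2 * math.pi * 200 * i / FS) for i in range(640)]
PITCH = amdf_pitch(FRAME, FS)
```

Running `amdf_pitch` frame by frame would give a pitch contour; statistics of that contour (mean, standard deviation, maximum gradient, and so on) would then be the high-level features.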
Slide -7-
Acoustic Analysis
Feature selection (1/2)
Initial set of 200 statistical features
Ranking 1: Single performance of each feature
(nearest-mean classifier)
Ranking 2: Sequential Forward Floating Search
wrapping by nearest-mean classifier
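Both rankings can be sketched around one nearest-mean scoring function. The code below is a simplified illustration under stated assumptions: the helper names and the toy data are hypothetical, the floating search is a compact variant of SFFS rather than the exact procedure used in the talk, and the toy example scores on its own training data only to stay short.

```python
def class_means(rows, labels):
    """Mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(means, row):
    """Label of the closest class mean (squared Euclidean distance)."""
    return min(means, key=lambda c: sum((a - b) ** 2 for a, b in zip(row, means[c])))

def accuracy(train, y_train, test, y_test, subset):
    """Recognition rate of a nearest-mean classifier using only `subset`."""
    def pick(row):
        return [row[j] for j in subset]
    means = class_means([pick(r) for r in train], y_train)
    hits = sum(predict(means, pick(r)) == t for r, t in zip(test, y_test))
    return hits / len(y_test)

def rank_single(train, y_train, test, y_test):
    """Ranking 1: score every feature on its own, best first."""
    scored = [(j, accuracy(train, y_train, test, y_test, [j]))
              for j in range(len(train[0]))]
    return sorted(scored, key=lambda item: -item[1])

def sffs(score, n_features, k):
    """Ranking 2 (simplified SFFS): requires 1 <= k <= n_features."""
    selected, best = [], {}
    while len(selected) < k:
        # Forward step: add the feature that helps the current set most.
        candidates = [f for f in range(n_features) if f not in selected]
        add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [add]
        s = score(selected)
        if len(selected) not in best or s > best[len(selected)][0]:
            best[len(selected)] = (s, list(selected))
        # Floating step: drop features while that beats the best smaller set.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            s_reduced = score(reduced)
            if s_reduced <= best[len(reduced)][0]:
                break
            selected = reduced
            best[len(reduced)] = (s_reduced, list(reduced))
    return best[k][1]

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.0], [0.1, 1.0], [0.2, 0.0], [0.3, 1.0],
     [5.0, 0.0], [5.1, 1.0], [5.2, 0.0], [5.3, 1.0]]
Y = [0, 0, 0, 0, 1, 1, 1, 1]
RANKING = rank_single(X, Y, X, Y)
SELECTED = sffs(lambda subset: accuracy(X, Y, X, Y, subset), 2, 1)
```

The wrapper design means the selection criterion is the recognition rate of the classifier itself, so any other classifier could be swapped in for `accuracy` without changing the search.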
Slide -8-
Acoustic Analysis
Feature selection (2/2)
Top 10 features
Acoustic feature                                                SFFS rank   Single perf.
Pitch, maximum gradient                                         1           31.5
Pitch, standard deviation of distance between reversal points   2           23.0
Pitch, mean value                                               3           25.6
Signal, number of zero-crossings                                4           16.9
Pitch, standard deviation                                       5           27.6
Duration of silences, mean value                                6           17.5
Duration of voiced sounds, mean value                           7           18.5
Energy, median of fall-time                                     8           17.8
Energy, mean distance between reversal points                   9           19.0
Energy, mean of rise-time                                       10          17.6
Slide -9-
Acoustic Analysis
Classification
Evaluation of various classification methods
33 features
Classifier   Error, % (speaker indep.)   Error, % (speaker dep.)
kMeans       57.05                       27.38
kNN          30.41                       17.39
GMM          25.17                       10.88
MLP          26.86                       9.36
SVM          23.88                       7.05
ML-SVM       18.71                       9.05
Output: Vector of (pseudo-) recognition confidences
Slide -10-
Acoustic Analysis
Classification
Multi-Layer Support Vector Machines
Input: acoustic feature vector
Layer 1: ang, ntl, fea, joy / dis, sur, sad
Layer 2: ang, ntl / fea, joy; dis, sur / sad
Layer 3: ang / ntl; fea / joy; dis / sur
Leaves: ang, ntl, fea, joy, dis, sur, sad
No confidence vector to forward to fusion
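The layered decision scheme can be pictured as a binary tree that is walked from the root to a single leaf. The sketch below only illustrates that structure: `decide` stands in for a trained binary SVM at each node and is a hypothetical placeholder, as are the other names.

```python
# Inner nodes are (left_subtree, right_subtree); leaves are emotion labels.
TREE = (
    (("ang", "ntl"), ("fea", "joy")),
    (("dis", "sur"), "sad"),
)

def leaves(node):
    """All emotion labels below a node, left to right."""
    if isinstance(node, str):
        return (node,)
    return leaves(node[0]) + leaves(node[1])

def classify(node, features, decide):
    """Walk the tree; decide(left_group, right_group, features) returns
    True to descend into the right group, False for the left group."""
    while not isinstance(node, str):
        left, right = node
        node = right if decide(leaves(left), leaves(right), features) else left
    return node

# Stub decision function for illustration: the "features" are simply the
# true label, so each node picks the group that contains it.
def oracle(left_group, right_group, features):
    return features in right_group

RESULT = classify(TREE, "sur", oracle)
```

Because each utterance ends in exactly one leaf, the result is a hard decision, which is consistent with the slide's remark that no confidence vector is available for fusion.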
Slide -12-
Semantic Analysis
ASR-Unit
HMM-based
German vocabulary of 1300 words
No language model
5-best phrase hypotheses
Recognition confidences per word
Example output (first hypothesis):
I      can't   stand   this   every   tray   traffic-jam
69.3   34.6    72.1    20.0   36.1    15.9   55.8
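One way to hand such a hypothesis to the semantic interpretation is to pair every word with its confidence. The sketch below uses the example output above; the function name, the scaling of the percentages to the range 0 to 1, and the rule of keeping the highest confidence for repeated words are assumptions made for illustration.

```python
# First hypothesis from the slide: words and recognition confidences (percent).
WORDS = ["I", "can't", "stand", "this", "every", "tray", "traffic-jam"]
CONFIDENCES = [69.3, 34.6, 72.1, 20.0, 36.1, 15.9, 55.8]

def word_evidence(words, confidences):
    """Map each lower-cased word to its confidence scaled to [0, 1].
    If a word occurs more than once, keep its highest confidence."""
    evidence = {}
    for word, conf in zip(words, confidences):
        key = word.lower()
        evidence[key] = max(evidence.get(key, 0.0), conf / 100.0)
    return evidence

EVIDENCE = word_evidence(WORDS, CONFIDENCES)
```

Low-confidence words such as "tray" then enter the later stages as weak evidence instead of being treated as certain.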
Slide -13-
Semantic Analysis
Conditions
Natural language
Erroneous speech recognition
Uncertain knowledge
Incomplete knowledge
Superfluous knowledge
Probabilistic spotting approach
Bayesian Belief Networks
Slide -14-
Semantic Analysis
Bayesian Belief Networks
Acyclic graph of nodes and directed edges
One state variable $X_i$ per node (here with the states $x_i$, $\bar{x}_i$)
Setting node dependencies via conditional probability matrices:
$P(X_C \mid X_P) = \begin{pmatrix} P(x_C \mid x_P) & P(x_C \mid \bar{x}_P) \\ P(\bar{x}_C \mid x_P) & P(\bar{x}_C \mid \bar{x}_P) \end{pmatrix}$ (C: child, P: parent)
Setting initial probabilities in root nodes:
$P(X_R) = \left( P(x_R),\ P(\bar{x}_R) \right)^T$
Observation A causes evidence in a child node (i.e. $P(x_C)$ is known)
Inference to direct parent nodes and finally to root nodes
Bayes' rule:
$P(X_P \mid X_C) = \dfrac{P(X_C \mid X_P)\, P(X_P)}{P(X_C)}$
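A single inference step from a child node to its parent follows directly from Bayes' rule. The sketch below works it through for two-state nodes; the function name and all probability values are hypothetical and chosen only for illustration.

```python
def posterior_parent(prior, p_child_given_parent, p_child_given_not_parent):
    """P(x_P | x_C) for two-state nodes after observing the child state x_C.

    prior                     -- P(x_P), initial probability of the parent state
    p_child_given_parent      -- P(x_C | x_P)
    p_child_given_not_parent  -- P(x_C | not x_P)
    """
    joint_true = p_child_given_parent * prior
    joint_false = p_child_given_not_parent * (1.0 - prior)
    evidence = joint_true + joint_false  # P(x_C) by total probability
    return joint_true / evidence

# Hypothetical numbers: prior 0.3, P(x_C | x_P) = 0.8, P(x_C | not x_P) = 0.1.
POSTERIOR = posterior_parent(0.3, 0.8, 0.1)
```

With these numbers the belief in the parent state rises from 0.3 to 0.24 / 0.31, roughly 0.77. Repeating the step from parent to grandparent carries the evidence up to the root nodes.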
Slide -15-
Semantic Analysis
Emotion modelling
[Figure: Bayesian network for emotion modelling. Levels: Input level, Words, Superwords, Phrases, Super-phrases, Interpretation. Steps between levels: Spotting, Clustering, Sequence handling. Example input: "I can't stand this nasty every tray traffic-jam"; spotted words: can't, stand, nasty; clustered into: cannot, stand, bad, disgusting. Further node labels: I, first_person, I_hate, I_like, Bad, Good, Abhorrence, Negative, Positive. Interpretation nodes shown: Joy, Anger, Disgust.]
Output: Vector of "real" recognition confidences
Slide -17-
Stream Fusion
Pairwise mean:
$P_{\text{fusion}}(E_n) = \tfrac{1}{2} \left( P_{\text{acoustic}}(E_n) + P_{\text{semantic}}(E_n) \right)$
Discriminative fusion applying MLP
Input layer: 2 x 7 confidences
Hidden layer: 100 nodes
Output layer: 7 recognition confidences
Decision: $\arg\max_n P_{\text{fusion}}(E_n)$
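Both fusion variants can be sketched compactly. The mean fusion and the maximum decision follow the slide; the MLP part only shows the 14-100-7 layer shape, and its tanh hidden units, softmax output, and random untrained weights are assumptions, as are all names and example confidences.

```python
import math
import random

EMOTIONS = ["ang", "dis", "fea", "joy", "ntl", "sad", "sur"]

def fuse_by_mean(acoustic, semantic):
    """Pairwise mean of the two 7-dimensional confidence vectors."""
    return [(a + s) / 2.0 for a, s in zip(acoustic, semantic)]

def decide(confidences):
    """Emotion with the highest fused confidence."""
    best = max(range(len(confidences)), key=lambda n: confidences[n])
    return EMOTIONS[best]

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: tanh hidden layer, softmax output (assumed)."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def random_weights(n_out, n_in, rng):
    """Untrained placeholder weights; a real system would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

# Hypothetical confidence vectors in the order of EMOTIONS.
ACOUSTIC = [0.50, 0.05, 0.10, 0.05, 0.20, 0.05, 0.05]
SEMANTIC = [0.20, 0.40, 0.05, 0.05, 0.10, 0.10, 0.10]
FUSED = fuse_by_mean(ACOUSTIC, SEMANTIC)
DECISION = decide(FUSED)

# Layer sizes from the slide: 2 x 7 inputs, 100 hidden nodes, 7 outputs.
RNG = random.Random(0)
MLP_OUT = mlp_forward(ACOUSTIC + SEMANTIC,
                      random_weights(100, 14, RNG), [0.0] * 100,
                      random_weights(7, 100, RNG), [0.0] * 7)
```

The mean treats both streams as equally reliable for every emotion, whereas a trained MLP can learn which stream to trust for which class.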
Slide -18-
Results
Acoustic recognition rates (SVM):
Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         95.5   61.3   78.7   75.1   78.5   62.1   68.3   74.2
Semantic recognition rates:
Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         78.4   71.2   53.4   57.7   56.0   35.0   65.5   59.6
Slide -19-
Results
Recognition rates after discriminative fusion:
Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         98.0   78.7   88.3   95.9   98.2   91.7   95.8   92.0
Overview:
Method   Acoustic information   Language information   Fusion by mean   Fusion by MLP
%        74.2                   59.6                   83.1             92.0
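The gain from fusion is easier to judge as a relative reduction of the error rate. The short calculation below uses only the four mean recognition rates from the overview; the function and variable names are illustrative.

```python
# Mean recognition rates in percent, taken from the overview.
RATES = {"acoustic": 74.2, "semantic": 59.6, "fusion_mean": 83.1, "fusion_mlp": 92.0}

def relative_error_reduction(baseline_rate, improved_rate):
    """Fraction of the baseline errors removed by the improved system."""
    baseline_error = 100.0 - baseline_rate
    improved_error = 100.0 - improved_rate
    return (baseline_error - improved_error) / baseline_error

GAIN_MEAN = relative_error_reduction(RATES["acoustic"], RATES["fusion_mean"])
GAIN_MLP = relative_error_reduction(RATES["acoustic"], RATES["fusion_mlp"])

for name, gain in (("fusion_mean", GAIN_MEAN), ("fusion_mlp", GAIN_MLP)):
    print(f"{name}: {100 * gain:.1f} % fewer errors than the acoustic stream alone")
```

Relative to the acoustic stream alone, the pairwise mean removes roughly a third of the errors and the MLP fusion roughly two thirds.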
Slide -20-
Summary
Acted Emotions
7 discrete emotion categories
Prosodic feature selection via
Single-feature performance
Sequential forward floating search
Evaluative comparison of different classifiers
SVMs outperform the other classifiers
Semantic analysis applying Bayesian Networks
Significant gain by discriminative stream fusion
Slide -21-