
1/27


Speech Emotion Recognition and Perception of Music

Mélanie Fernández Pradier

Supervisors: Prof. Dr.-Ing. Bin Yang, Dipl.-Ing. Fabian Schmieder

January 27, 2011

Mélanie Fernández Pradier Speech Emotion Recognition and Perception of Music

2/27



Motivation: Speech Emotion Recognition and Perception of Music

Emotion Recognition from Speech

Speech ∼ two-channel

linguistic

paralinguistic

Several Applications

support ASR

diagnoses

speech synthesis

entertainment

Music Perception

"language of emotion"

treatment of affective disorders

treatment of speech disorders

same origin of music and speech


3/27



Aim of the thesis: Apply Music Theory to Speech Emotion Recognition

Investigate Speech and Music similarities to derive universal features for Emotions

1 What is the link between music and speech?

2 How are emotions transmitted through music?

3 Can we apply musical knowledge to speech processing?


4/27




1 Introduction

Motivation

Aim of the thesis

2 Basic Features

General Concepts

Description

3 Musical Features

Interval and Triad Features

Based on Music Emotion Recognition

Perceptual Model of Intonation

4 Simulations and Results




Pattern Recognition


5/27




Feature Generation


6/27




Basic Features Description

Local Features

ZCR

MFCC

Energy: total + bands

Pitch

Voiced-unvoiced

VAD

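$$\mathrm{ZCR} = \frac{1}{2} \sum_{n=1}^{N} \left| \operatorname{sgn}(x_n) - \operatorname{sgn}(x_{n+1}) \right|$$

$$\mathrm{Cepstrum} = \left| \mathrm{FFT}\left\{ \log\left( |\mathrm{FFT}\{x\}|^2 \right) \right\} \right|^2$$

$$\mathrm{Energy} = \sum_{n=1}^{N} x_n \cdot x_n^{*}$$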
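The three local-feature formulas above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the implementation used in the thesis: the naive `dft` helper stands in for an FFT, and the `floor` parameter that guards the logarithm is an assumption of this sketch.

```python
import cmath
import math


def sgn(v):
    """Sign function: -1, 0 or +1."""
    return (v > 0) - (v < 0)


def zero_crossing_rate(x):
    """ZCR = 1/2 * sum |sgn(x_n) - sgn(x_{n+1})| over consecutive samples."""
    return 0.5 * sum(abs(sgn(a) - sgn(b)) for a, b in zip(x, x[1:]))


def energy(x):
    """Energy = sum x_n * conj(x_n); for real samples this is sum x_n^2."""
    return sum((complex(v) * complex(v).conjugate()).real for v in x)


def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def power_cepstrum(x, floor=1e-12):
    """|DFT{log(|DFT{x}|^2)}|^2; `floor` avoids log(0) (assumption)."""
    log_power = [math.log(max(abs(c) ** 2, floor)) for c in dft(x)]
    return [abs(c) ** 2 for c in dft(log_power)]


frame = [1.0, -1.0, 1.0, -1.0, 0.5, 0.5]
print(zero_crossing_rate(frame))  # four sign changes in this frame
print(energy(frame))
```

In practice these quantities are computed per short frame, which yields one contour per feature over the utterance.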

Global Features

Global statistics: min, mean, max, median, std, iqr...

computed on each contour directly, or on its 1st or 2nd derivative

Energy and pitch plateaux

Combination with logical features
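As an illustration of how global features can be derived from a local feature contour (for example the frame-wise pitch), the sketch below computes the listed statistics on the contour and on its first and second differences. The function names and the use of the population standard deviation are assumptions of this sketch; the plateau and logical-feature combinations are not covered.

```python
import statistics


def derivative(contour):
    """First-order difference of a frame-wise feature contour."""
    return [b - a for a, b in zip(contour, contour[1:])]


def global_stats(contour):
    """Min, mean, max, median, std and interquartile range of a contour."""
    q1, _, q3 = statistics.quantiles(contour, n=4)
    return {
        "min": min(contour),
        "mean": statistics.fmean(contour),
        "max": max(contour),
        "median": statistics.median(contour),
        "std": statistics.pstdev(contour),
        "iqr": q3 - q1,
    }


def global_features(contour):
    """Statistics of the contour itself and of its 1st and 2nd derivative."""
    d1 = derivative(contour)
    d2 = derivative(d1)
    return {
        "direct": global_stats(contour),
        "delta": global_stats(d1),
        "delta2": global_stats(d2),
    }


pitch_hz = [100.0, 110.0, 120.0, 115.0, 105.0, 100.0]  # made-up contour
features = global_features(pitch_hz)
print(features["direct"]["median"], features["delta"]["mean"])
```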


7/27




Interval and Triad Features


8/27




Interval Features

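Autocorrelation of the circular pitch density function

$$r_o(s) = \int_0^L p_o\left(\operatorname{mod}_L(s + \lambda)\right)\, p_o(\lambda)\, d\lambda$$

Intervallic dissonance

$$\mathrm{DIS} = \int_0^L d(s)\, r_o(s)\, ds \quad \text{where } d(s) \simeq \sqrt{N(s)\, D(s)}$$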

[Figure: Pitch Histogram (number of pitch samples vs. circular frequency in ST scale) and 2nd order Autocorrelation (vs. circular frequency in ST scale)]
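A discrete version of the interval features might look as follows. This is a sketch under stated assumptions, not the thesis code: pitch is folded into one octave with one bin per semitone, `ref_hz` is an arbitrary reference, and N(s)/D(s) is read as the numerator and denominator of a just-intonation frequency ratio for interval s (the `RATIOS` table is supplied here for illustration and does not come from the slides).

```python
import math

L = 12  # one octave in semitones

# Assumed just-intonation ratios N/D per semitone step (not from the slides).
RATIOS = [(1, 1), (16, 15), (9, 8), (6, 5), (5, 4), (4, 3),
          (45, 32), (3, 2), (8, 5), (5, 3), (16, 9), (15, 8)]


def circular_pitch_density(f0_hz, ref_hz=55.0):
    """Histogram of pitch folded into one octave, normalised to sum to 1."""
    hist = [0.0] * L
    for f in f0_hz:
        st = 12.0 * math.log2(f / ref_hz)  # Hz -> semitones above ref_hz
        hist[int(round(st)) % L] += 1.0
    total = sum(hist)
    return [h / total for h in hist]


def circular_autocorrelation(p):
    """r(s) = sum_l p((s + l) mod L) * p(l), the discrete form of the integral."""
    return [sum(p[(s + l) % L] * p[l] for l in range(L)) for s in range(L)]


def interval_dissonance(r):
    """DIS = sum_s d(s) * r(s), with d(s) = sqrt(N(s) * D(s))."""
    return sum(math.sqrt(n * d) * r[s] for s, (n, d) in enumerate(RATIOS))


f0 = [220.0, 220.0, 330.0, 277.18]  # made-up voiced-frame pitch values
p = circular_pitch_density(f0)
r = circular_autocorrelation(p)
print(r[7], interval_dissonance(r))  # r[7]: weight of the perfect fifth
```

Because the density is circular, r(s) is symmetric around the octave, so an interval and its inversion receive the same weight.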


9/27




Interval Dissonance

[Figure: frequency of tone sensation vs. frequency difference f = f2 - f1, with regions labelled one-tone sensation, beats, roughness, smoothness and two-tone sensation, and markers for 10 Hz, the critical bandwidth and the limits of discrimination]

[Figure: dissonance per interval (m2, M2, m3, M3, P4, 4+/5°, P5, m6, M6, m7, M7)]


10/27




Triad Features

1 Direct computation

2 Extraction of "dominant pitches"

Autocorrelation Triad Features

[Figure: Gaussian Mixture Model over the semitone scale]

Gaussian Triad Features


11/27




Tension and Modality


12/27




Loudness, Timbre and Rhythm

Intensity Features

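$$I(k) = \sum_{n=0}^{N/2} |\mathrm{FFT}_k(n)|$$

$$D_i(k) = \frac{1}{I(k)} \sum_{n=L_i}^{H_i} |\mathrm{FFT}_k(n)|$$

where $k$ refers to the frame

Timbre Features

$$\mathrm{FFT}_k \equiv \{x_{k1} \ldots x_{kN}\}, \qquad \text{sorted} \equiv \{x'_{k1} \ldots x'_{kN}\}$$

$$\mathrm{Peak}(k) = \log\left\{ \frac{1}{\alpha N} \sum_{i=1}^{\alpha N} x'_{ki} \right\}$$

$$\mathrm{Valley}(k) = \log\left\{ \frac{1}{\alpha N} \sum_{i=1}^{\alpha N} x'_{k(N-i+1)} \right\}$$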
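The intensity and timbre formulas above translate almost directly into code. The sketch below assumes that the magnitude spectrum of one frame is already available as a list of positive values, that the sorted sequence is in descending order (so the first values are the largest), and that `alpha=0.2` is a reasonable neighbourhood size; none of these choices is confirmed by the slides.

```python
import math


def intensity(mag):
    """I(k): sum of the magnitude spectrum of frame k."""
    return sum(mag)


def subband_ratios(mag, bands):
    """D_i(k): share of the frame intensity in each band (L_i, H_i), inclusive."""
    total = intensity(mag)
    return [sum(mag[lo:hi + 1]) / total for lo, hi in bands]


def peak_valley(mag, alpha=0.2):
    """Log-mean of the alpha*N largest (Peak) and smallest (Valley) magnitudes."""
    ordered = sorted(mag, reverse=True)    # x'_k1 >= ... >= x'_kN (assumed order)
    m = max(1, int(alpha * len(ordered)))  # alpha*N values, at least one
    peak = math.log(sum(ordered[:m]) / m)
    valley = math.log(sum(ordered[-m:]) / m)  # needs strictly positive values
    return peak, valley


mag = [0.5, 4.0, 1.0, 0.5, 8.0, 1.0, 0.5, 0.5, 2.0, 2.0]  # made-up spectrum
print(subband_ratios(mag, [(0, 4), (5, 9)]))
print(peak_valley(mag))
```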


13/27





Rhythm Features

1 Compute FFT

2 Extract amplitude envelope: $A_i(n) = \mathrm{FFT}_i(n) \otimes h_w(n)$

3 Apply Canny operator: $O_i(n) = A_i(n) \otimes C(n)$, with

$$C(n) = \frac{n}{\sigma^2}\, e^{-\frac{n^2}{2\sigma^2}}$$

We obtain the onset sequence $O_i(n)$

[Figure: Onset Sequence (amplitude vs. number of samples)]
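Steps 2 and 3 above can be sketched as two convolutions. The code below starts from the magnitude sequence of a single band over time rather than from an FFT, and it makes several assumptions that the slides do not state: h_w is taken to be a decaying half-Hann window, the values of `window_len`, `sigma` and `half_len` are arbitrary, and the output is negated and clipped so that rising envelopes give positive onsets.

```python
import math


def half_hanning(length):
    """Decaying half of a Hann window, normalised to unit sum (assumed h_w)."""
    w = [0.5 * (1.0 + math.cos(math.pi * i / length)) for i in range(length)]
    total = sum(w)
    return [v / total for v in w]


def envelope(x, window):
    """A(n) = sum_m x(n - m) * h_w(m): causal smoothing of the magnitudes."""
    return [sum(x[n - m] * h for m, h in enumerate(window) if n - m >= 0)
            for n in range(len(x))]


def canny_kernel(sigma, half_len):
    """C(n) = n / sigma^2 * exp(-n^2 / (2 sigma^2)) for n = -half_len..half_len."""
    return [n / sigma ** 2 * math.exp(-n ** 2 / (2.0 * sigma ** 2))
            for n in range(-half_len, half_len + 1)]


def convolve_same(x, kernel):
    """Linear convolution cropped to len(x); kernel length must be odd."""
    half = len(kernel) // 2
    out = []
    for n in range(len(x)):
        acc = 0.0
        for j, c in enumerate(kernel):
            idx = n - (j - half)
            if 0 <= idx < len(x):
                acc += x[idx] * c
        out.append(acc)
    return out


def onset_sequence(x, window_len=4, sigma=2.0, half_len=6):
    """Envelope extraction followed by the Canny operator."""
    a = envelope(x, half_hanning(window_len))
    o = convolve_same(a, canny_kernel(sigma, half_len))
    # With C(n) as written, a rising envelope yields negative values, so the
    # sign is flipped and falling edges are clipped (assumption of this sketch).
    return [max(0.0, -v) for v in o]


band = [0.0] * 20 + [1.0] * 20  # made-up band magnitude with a step at n = 20
onsets = onset_sequence(band)
print(max(range(len(onsets)), key=onsets.__getitem__))  # index of strongest onset
```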


14/27





Rhythm Features

Strength: Average value of the peaks

Regularity: Average value of peaks in the autocorrelation

Speed: Ratio of number of peaks and time duration

[Figure: Onset Sequence (amplitude vs. number of samples)]
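Given an onset sequence, the three rhythm descriptors can be sketched as below. The peak picker (strict local maxima), the normalisation of the autocorrelation by its lag-0 value, and the function names are assumptions of this illustration.

```python
def local_peaks(seq):
    """Indices of local maxima among the interior points of a sequence."""
    return [i for i in range(1, len(seq) - 1)
            if seq[i - 1] < seq[i] >= seq[i + 1]]


def autocorrelation(seq):
    """Autocorrelation for lags 0..len(seq)-1, normalised by the lag-0 value."""
    n = len(seq)
    r0 = sum(v * v for v in seq) or 1.0
    return [sum(seq[i] * seq[i + lag] for i in range(n - lag)) / r0
            for lag in range(n)]


def rhythm_features(onsets, duration_s):
    """Strength, regularity and speed of an onset sequence."""
    peaks = local_peaks(onsets)
    acf = autocorrelation(onsets)
    acf_peaks = local_peaks(acf)
    strength = sum(onsets[i] for i in peaks) / len(peaks) if peaks else 0.0
    regularity = (sum(acf[i] for i in acf_peaks) / len(acf_peaks)
                  if acf_peaks else 0.0)
    speed = len(peaks) / duration_s  # peaks per second
    return {"strength": strength, "regularity": regularity, "speed": speed}


onsets = [0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(rhythm_features(onsets, duration_s=1.2))
```

A perfectly periodic onset pattern produces high autocorrelation peaks at multiples of its period, which is why the average autocorrelation peak can serve as a regularity measure.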


15/27




Perceptual Model of Intonation

Perceptual principles

1 Segmentation Effect

2 Glissando Threshold: minimum amount of frequency change, $g_{th} = 0.16 / T^2$ [ST/s²]

3 Differential Glissando Threshold: minimum difference in slope, $d_{gth} = a_2 - a_1 = 20$ [ST/s]

4 Short-term integration in time

[Figure: F0 estimation with stylization 1 and stylization 2, frequency (Hz) vs. time (s)]
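The two thresholds above can be applied to a stylized pitch segment as in the sketch below. It assumes that a segment is described by its start and end frequency and its duration T, that the segment slope in ST/s is compared against 0.16/T², and that the differential threshold is applied to the absolute slope difference; the function names are hypothetical.

```python
import math


def hz_to_semitones(f_hz, ref_hz=1.0):
    """Convert a frequency to semitones relative to ref_hz."""
    return 12.0 * math.log2(f_hz / ref_hz)


def slope_st_per_s(f_start_hz, f_end_hz, duration_s):
    """Rate of pitch change of a segment in semitones per second."""
    return (hz_to_semitones(f_end_hz) - hz_to_semitones(f_start_hz)) / duration_s


def is_audible_glide(f_start_hz, f_end_hz, duration_s, g=0.16):
    """True if |slope| exceeds the glissando threshold g / T^2."""
    slope = slope_st_per_s(f_start_hz, f_end_hz, duration_s)
    return abs(slope) > g / duration_s ** 2


def slopes_differ(a1, a2, dgth=20.0):
    """True if two consecutive slopes differ by more than dgth (in ST/s)."""
    return abs(a2 - a1) > dgth


print(is_audible_glide(200.0, 220.0, 0.1))  # about 16.5 ST/s vs. threshold 16
print(is_audible_glide(200.0, 205.0, 0.1))  # about 4.3 ST/s vs. threshold 16
print(slopes_differ(5.0, 30.0))
```

Segments below the glissando threshold would be treated as level tones in a stylization, and neighbouring segments whose slopes do not differ enough could be merged.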


16/27



Database - Labels - Features

Database: emoDB (TUB)

10 speakers

708 files

6 emotions

BASIC SET (number of features)
duration    16
MFCC        91
ZCR         13
harmony      3
energy      58
pitch       33
Total      214

MUSICAL SET (number of features)
interval           31
autocorr. triad     4
gaussian triad     10
intensity          63
rhythm             15
Total             123


17/27



Strategies for evaluation 9-1 Vs 8-1-1
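The slide gives only the names of the two strategies. One plausible reading, given the 10 speakers of the database, is a speaker-independent split: 9 speakers for training and 1 for testing, versus 8 for training, 1 for validation and 1 for testing. The sketch below illustrates that reading only; the way the validation speaker is chosen is an arbitrary assumption.

```python
def split_9_1(speakers):
    """Leave-one-speaker-out folds as (train, test) pairs."""
    return [([s for s in speakers if s != test], [test]) for test in speakers]


def split_8_1_1(speakers):
    """Folds as (train, validation, test); validation = next speaker (assumed)."""
    folds = []
    for i, test in enumerate(speakers):
        val = speakers[(i + 1) % len(speakers)]
        train = [s for s in speakers if s not in (test, val)]
        folds.append((train, [val], [test]))
    return folds


speakers = list(range(10))  # placeholder speaker identifiers
print(len(split_9_1(speakers)), len(split_8_1_1(speakers)))
```

Under this reading, the 8-1-1 scheme keeps the test speaker untouched while the validation speaker is used for choices such as the number of selected features.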


18/27



Musical Universals

[Figure: mean normalized amplitude (dB) vs. frequency ratio (0.8 to 2.2), annotated with unison, m3, M3 (1.25), P4, 4+ or 5° (1.4), P5, m6, M6 (1.67), m7 and octave]


19/27



Plain Bayes classifier - Evaluation 8-1-1

[Figure: training hit rate and generalization hit rate vs. number of features (0 to 50), Basic set vs. Full set]


20/27



Nature of selected features

[Figure: distribution of the selected features over the groups time, MFCC, ZCR, harmony, energy, pitch, interval, auto-correlation triad, gaussian triad, intensity and rhythm]


21/27



Comparison plain Vs hierarchical Bayes classifier

[Figure: hierarchical classifier drawn as a tree that splits on Activation, then on Valence and Potency (high/low branches), with the leaves happy, angry, afraid, neutral, bored and sad]

                            plain Bayes   hierarchical Bayes
Basic                       76.12         84.22
Basic + Interval + Triad    80.61         85.04


22/27



Multi-dimensional Scaling

[Figure: Happy, Sad, Bored and Angry samples plotted on the 1st and 2nd principal components, for the BASIC and the FULL feature set]


23/27



Happy Vs Angry - Evaluation 8-1-1

[Figure: Happy vs. Angry, training hit rate and generalization hit rate vs. number of features (0 to 50), Basic set vs. Full set]


24/27



Final Comparison of Musical Features

[Figure: Comparison between different musical feature sets, accuracy rate vs. number of features (0 to 50), for Musical Set, Basic Set, B+Stylization Set, B+Interval+Triad, B+Intensity, B+Rhythm and Full Set]


25/27



Conclusion

Summary

1 Literature review about speech, music and emotions

2 Theoretical background on psychoacoustics

3 Re-implementation of the basic features

4 Implementation of speech processing algorithms

5 Implementation of musical features (music perception, MER and linguistics)

6 Simulations ⇒ Musical features can help to improve emotion recognition in speech


26/27



Conclusions

Further research

Environment: natural emotional speech, other languages

Pattern Recognition steps: feature transformation, pitch extraction, classification...

Improvement of musical features

Dissonance model, Perceptual model of intonation,

Emotionally meaningful moments

Systematization of feature extraction step

"Even monkeys express strong feelings in different tones - anger and impatience by low, - fear and pain by high notes."

Charles Darwin, Naturalist


27/27



Thank you!

Looking forward to your questions...

