Timbre and Modulation Features for Music Genre/Mood Classification


J.-S. Roger Jang & Jia-Min Ren
Multimedia Information Retrieval Lab
Dept. of CSIE, National Taiwan University

2/40

Outline
  Audio features and modulation spectral analysis
  MIREX 2011 method and its improvement
  Experimental setup and results
  Conclusions and future work

3/40

Introduction – music genres/moods

*pictures from www.playonradio.com, brainpickings.org & mpac.ee.ntu.edu.tw

Descriptions of music contents

4/40

Motivation
  Rapid growth of digital music
    Apple iTunes: 28 million songs; 7digital: 20 million tracks
  Organization of large collections of audio music
    Important but challenging
    Manual labeling by tags: labor intensive/time consuming
  Thus, machine learning for classification is called for!

[Figure: classification flowchart. Training: music clips for training → feature extraction → classifier training → classifiers. Test: music clip for test → feature extraction → classifiers → result → evaluation. Features: short-term (MFCC, OSC) and long-term (beat, tempo, pitch). Classifiers: KNN, GMM, SVM.]

5/40

System overview

[Figure: system flowchart. Training stage: music clips for training → feature extraction (frame-based timbre feature extraction and summarization; long-term modulation-based feature extraction) → concatenation → SVMs training → SVMs. Test stage: music clip for testing → feature extraction → classification with the trained SVMs → result.]

6/40

Performance evaluation
  Dataset-dependent criteria for evaluation
    GTZAN: 10-fold cross-validation
    ISMIR2004Genre: holdout test, same as the one used in the ISMIR 2004 Genre Classification Contest, with 729 clips for training and 729 clips for test

7/40

Audio features – short-term timbre features
  Statistical spectrum descriptors (SSD)
    Spectral centroid (SC), spectral flux (SF), spectral rolloff (SR), spectral skewness (SS), spectral kurtosis (SK)
  MFCC
    To model the subjective frequency contents of audio signals
    21-dim (including energy)
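The five SSD values can be illustrated per frame with a minimal sketch. The slide gives only the descriptor names, so the definitions here are assumptions: the magnitude spectrum is treated as a distribution over bin indices, rolloff uses a hypothetical 85% ratio, and flux is the squared difference between consecutive normalized spectra. The function name `spectral_descriptors` is illustrative.

```python
def spectral_descriptors(mag, prev_mag=None, rolloff_ratio=0.85):
    """Return (centroid, flux, rolloff, skewness, kurtosis) for one frame.

    mag is a magnitude spectrum; frequencies are expressed as bin indices.
    rolloff_ratio=0.85 is an assumed value, not taken from the slides.
    """
    total = sum(mag)
    if total <= 0:
        return 0.0, 0.0, 0.0, 0.0, 0.0
    p = [m / total for m in mag]  # spectrum normalized to sum to 1
    centroid = sum(k * pk for k, pk in enumerate(p))
    var = sum((k - centroid) ** 2 * pk for k, pk in enumerate(p))
    if var > 0:
        skew = sum((k - centroid) ** 3 * pk for k, pk in enumerate(p)) / var ** 1.5
        kurt = sum((k - centroid) ** 4 * pk for k, pk in enumerate(p)) / var ** 2
    else:
        skew = kurt = 0.0
    # Rolloff: first bin at which the cumulative magnitude reaches the ratio.
    cum, rolloff = 0.0, len(p) - 1
    for k, pk in enumerate(p):
        cum += pk
        if cum >= rolloff_ratio:
            rolloff = k
            break
    # Flux: squared change between this frame and the previous one.
    flux = 0.0
    if prev_mag is not None and sum(prev_mag) > 0:
        prev_total = sum(prev_mag)
        flux = sum((pk - q / prev_total) ** 2 for pk, q in zip(p, prev_mag))
    return centroid, flux, float(rolloff), skew, kurt
```

Calling the function on every frame yields a 5-dim SSD vector per frame, matching the "5" in the dimension count used later in the slides.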

8/40

Audio features – short-term timbre features
  Spectral contrast & valley (SCV)
    Measure spectral contrast/valley in octave-based subbands
    Peak: harmonic; valley: non-harmonic/noise
    Audio frame → FFT → for each subband, compute peak/valley by averaging values in the larger/smaller percentage of spectra (20%)
    Contrast = peak - valley: relative distribution
    8 frequency subbands (Hz):
      1: [0,100)
      2: [100,200)
      3: [200,400)
      4: [400,800)
      5: [800,1600)
      6: [1600,3200)
      7: [3200,6400)
      8: [6400,11025]
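The peak/valley procedure on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the sampling rate is taken to be 22,050 Hz (implied by the 11,025 Hz upper edge), the percentage is 20%, values are averaged directly without any log compression, and the function name `spectral_contrast_valley` is hypothetical.

```python
EDGES_HZ = [0, 100, 200, 400, 800, 1600, 3200, 6400, 11025]

def spectral_contrast_valley(mag, sample_rate=22050, alpha=0.2):
    """Return 8 contrast values followed by 8 valley values (16-dim).

    mag holds the magnitudes of bins 0..len(mag)-1 covering 0..sample_rate/2.
    """
    hz_per_bin = (sample_rate / 2) / (len(mag) - 1)
    contrasts, valleys = [], []
    n_bands = len(EDGES_HZ) - 1
    for a in range(n_bands):
        lo, hi = EDGES_HZ[a], EDGES_HZ[a + 1]
        last = a == n_bands - 1  # the last subband includes its upper edge
        band = sorted(
            m for k, m in enumerate(mag)
            if lo <= k * hz_per_bin and (last or k * hz_per_bin < hi)
        )
        if not band:
            contrasts.append(0.0)
            valleys.append(0.0)
            continue
        n = max(1, round(alpha * len(band)))
        valley = sum(band[:n]) / n   # mean of the smallest alpha fraction
        peak = sum(band[-n:]) / n    # mean of the largest alpha fraction
        contrasts.append(peak - valley)
        valleys.append(valley)
    return contrasts + valleys
```

With a 2048-point FFT (1025 magnitude bins), the first subband holds only 10 bins, so the 20% fraction averages just 2 values there.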

10/40

Audio features – short-term timbre features
  Spectral flatness measure (SFM)
    Measures the noisiness of spectra within a subband
    ≈1: a similar amount of power is distributed in all spectral bands
    ≈0: spectral power is concentrated in a relatively small number of bands

    $$\mathrm{SFM}(a) = \frac{\left(\prod_{i=1}^{N_a} B_{a,i}\right)^{1/N_a}}{\frac{1}{N_a}\sum_{i=1}^{N_a} B_{a,i}}$$

  Spectral crest measure (SCM)

    $$\mathrm{SCM}(a) = \frac{\max_{i=1,\ldots,N_a} B_{a,i}}{\frac{1}{N_a}\sum_{i=1}^{N_a} B_{a,i}}$$

  $B_{a,i}$: the i-th magnitude spectrum in the a-th subband
  $N_a$: # of spectra in the a-th subband

11/40

Audio features – short-term timbre features
  For each feature dimension, we compute its mean and standard deviation.
  Total dimensions for short-term timbre features: 2*(5+21+16+16)=116
    2: mean & std
    5+21+16+16: frame-based features (SSD, MFCC, SCV, SFM/SCM)
    SCV and SFM/SCM are computed in octave-based subbands
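The mean/std summarization can be sketched as below. It assumes the frame-based features are stored as one list per frame and uses the population standard deviation; the slides do not say which estimator was used, and the name `summarize` is illustrative.

```python
import statistics

def summarize(frames):
    """Collapse per-frame feature vectors into one clip-level vector.

    frames: list of equal-length feature vectors, one per audio frame.
    Returns all per-dimension means followed by all per-dimension stds.
    """
    dims = list(zip(*frames))  # one tuple of values per feature dimension
    means = [sum(d) / len(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means + stds
```

With 5+21+16+16 = 58 frame-level dimensions, the output has 2*58 = 116 values regardless of how many frames the clip contains.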

12/40

Modulation spectral analysis
  MFCC, SC, SFM/SCM
    Capture only short-time spectral properties of audio signals
  Modulation spectral analysis
    Captures long-term spectral dynamics within audio signals
    Computes spectrogram, then creates modulation spectrogram (by applying FFT again along time axis of spectrogram)
    Low/high modulation frequency ↔ slow/fast spectral change

[Figure: spectrogram → FFT → modulation spectrogram]
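The "FFT again along the time axis" step can be illustrated with a small sketch. To stay dependency-free it uses a naive O(n²) DFT from the standard library; in practice an FFT routine would be used instead. The function names and the `spectrogram[t][f]` layout are assumptions for illustration.

```python
import cmath

def dft_magnitude(x):
    """Magnitudes of the non-negative-frequency DFT bins of a real sequence."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def modulation_spectrogram(spectrogram):
    """spectrogram[t][f] -> mod[f][m]: one modulation spectrum per frequency bin."""
    trajectories = zip(*spectrogram)  # time trajectory of each frequency bin
    return [dft_magnitude(list(tr)) for tr in trajectories]

def modulation_frequency_hz(m, n_frames, frame_rate):
    """Modulation frequency of bin m for a window of n_frames frames."""
    return m * frame_rate / n_frames
```

A frequency bin whose magnitude stays constant over time puts all of its energy in modulation bin 0, while a bin that fluctuates quickly puts energy in higher modulation bins.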

13/40

Modulation spectral analysis of timbre features
  Flowchart
    1. OSC extraction (hop size 23 ms) from the music clip
    2. Segmentation into texture windows (256 frames ≈ 6 sec each)
    3. FFT (along feature dim): each texture window of 256 frames × 16 dim becomes 129 modulation-frequency bins × 16 dim
    4. Average over texture windows
    5. Modulation frequency decomposition: 129 bins → 7 subbands
    6. Modulation spectral peak/valley (MSP/MSV) computation: MSP, MSV, and MSC = MSP - MSV, each 16 dim × 7 subbands
    7. Mean/std computation (along feature & subband dims) on MSV and MSC → 92-dim feature vector (=16*2+7*2+16*2+7*2)
  MSC: modulation spectral contrast
  MSP/MSV: the strength of rhythm in music
  7 modulation freq. subbands (Hz): [0,0.33), [0.33,0.66), [0.66,1.32), [1.32,2.64), [2.64,5.28), [5.28,10.56), [10.56,21.03)
  The same process is applied to MFCC, SFM/SCM.
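Steps 5 to 7 of the flowchart can be sketched as follows, starting from an averaged modulation spectrum of 16 feature dimensions × 129 bins. Several details are assumptions: MSP and MSV are taken as the maximum and minimum within each modulation subband, the frame rate is 22050/512 Hz (about a 23 ms hop), the standard deviation is the population form, and all function names are hypothetical.

```python
import statistics

MOD_EDGES_HZ = [0, 0.33, 0.66, 1.32, 2.64, 5.28, 10.56, 21.03]

def msp_msv_msc(mod_spec, frame_rate=22050 / 512, n_fft=256):
    """mod_spec[d][m] -> (MSP, MSV, MSC), each with one row of 7 values per dim."""
    hz_per_bin = frame_rate / n_fft
    msp, msv, msc = [], [], []
    for row in mod_spec:
        peaks, valleys = [], []
        for lo, hi in zip(MOD_EDGES_HZ, MOD_EDGES_HZ[1:]):
            band = [v for m, v in enumerate(row) if lo <= m * hz_per_bin < hi]
            peaks.append(max(band) if band else 0.0)    # assumed: peak = max
            valleys.append(min(band) if band else 0.0)  # assumed: valley = min
        msp.append(peaks)
        msv.append(valleys)
        msc.append([p - v for p, v in zip(peaks, valleys)])
    return msp, msv, msc

def mean_std_both_axes(matrix):
    """Mean/std of every row, then mean/std of every column."""
    out = []
    for group in (matrix, list(zip(*matrix))):
        out += [sum(v) / len(v) for v in group]
        out += [statistics.pstdev(v) for v in group]
    return out

def modulation_feature(mod_spec):
    """Concatenate the MSV and MSC summaries into one clip-level vector."""
    _, msv, msc = msp_msv_msc(mod_spec)
    return mean_std_both_axes(msv) + mean_std_both_axes(msc)
```

For a 16 × 7 matrix, `mean_std_both_axes` returns 16*2 + 7*2 = 46 values, so MSV and MSC together give the 92-dim vector stated in step 7.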

14/40

Modulation spectral analysis of timbre features
  Reference
    C.-H. Lee, J.-L. Shih, K.-M. Yu, and H.-S. Lin, “Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features,” IEEE Trans. Multimedia, vol. 11, no. 4, pp. 670-682, June 2009.

15/40

Proposed joint acoustic frequency and modulation frequency features
  Motivation
    Averaging and mean/std computation smooth out MD info.
  Computation of joint frequency features (proposed)
    Compute modulation spectrogram from an entire music clip
    Compute SCV (spectral contrast/valley), SFM/SCM (spectral flatness/crest measure) within each joint acoustic-modulation (AM) frequency subband → AMSCV, AMSFM/AMSCM

[Figure: spectrogram of the clip → FFT → modulation spectrogram → compute AMSCV, AMSFM, AMSCM]
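One way to read the joint-subband computation is sketched below for AMSCV. The slides do not spell out the procedure, so this is an assumption: the modulation spectrogram is partitioned into 8 acoustic × 7 modulation subbands (using the subband edges given on earlier slides), and contrast and valley are computed in each cell with the same 20% averaging rule as SCV. All names are hypothetical.

```python
ACOUSTIC_EDGES_HZ = [0, 100, 200, 400, 800, 1600, 3200, 6400, 11025]
MODULATION_EDGES_HZ = [0, 0.33, 0.66, 1.32, 2.64, 5.28, 10.56, 21.03]

def band_index(freq, edges):
    """Index of the half-open subband containing freq, or None if outside."""
    for i in range(len(edges) - 1):
        if edges[i] <= freq < edges[i + 1]:
            return i
    return None

def am_scv(mod_spec, acoustic_hz, modulation_hz, alpha=0.2):
    """mod_spec[f][m] -> 56 contrasts followed by 56 valleys (112-dim).

    acoustic_hz[f] and modulation_hz[m] give the frequency of each bin.
    """
    n_a = len(ACOUSTIC_EDGES_HZ) - 1
    n_m = len(MODULATION_EDGES_HZ) - 1
    cells = [[[] for _ in range(n_m)] for _ in range(n_a)]
    for f, row in enumerate(mod_spec):
        a = band_index(acoustic_hz[f], ACOUSTIC_EDGES_HZ)
        for m, value in enumerate(row):
            j = band_index(modulation_hz[m], MODULATION_EDGES_HZ)
            if a is not None and j is not None:
                cells[a][j].append(value)
    contrasts, valleys = [], []
    for a in range(n_a):
        for j in range(n_m):
            vals = sorted(cells[a][j])
            if not vals:
                contrasts.append(0.0)
                valleys.append(0.0)
                continue
            n = max(1, round(alpha * len(vals)))
            valley = sum(vals[:n]) / n
            peak = sum(vals[-n:]) / n
            contrasts.append(peak - valley)
            valleys.append(valley)
    return contrasts + valleys
```

Because each cell keeps its own contrast and valley, nothing is averaged across acoustic or modulation subbands, which is the smoothing the motivation bullet refers to. The output size is 8*7*2 = 112.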

16/40

Audio features used in our study
  All possible audio features
    Extract SSD, MFCC, SCV, and SFM/SCM from audio frames → mean/std computation → MuStd
      MuStd dim=2*(5+21+16+16)=116
    Perform modulation spectral analysis on MFCC, OSC, SFM/SCM
      MMFCC dim=2*(21*2+7*2)=112
      MSCV dim=2*(16*2+7*2)=92
      MSFM/MSCM dim=2*(16*2+7*2)=92
    Compute SCV, SFM/SCM within acoustic-modulation (AM) frequency subbands → AMSCV, AMSFM/AMSCM
      AMSCV dim=8*7*2=112
      AMSFM/AMSCM dim=8*7*2=112

17/40

Audio feature sets and classifier
  Audio feature sets
    MIREX 2011 method: MuStd+MMFCC+MSCV+MSFM/MSCM
      dim=116+112+92+92=412
    Improved method: MuStd+MMFCC+AMSCV+AMSFM/AMSCM
      dim=116+112+112+112=452
  Classifier construction with RBF kernel SVMs
    Three-fold inner cross-validation to tune hyper-parameters
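The dimension counts quoted for the individual features and for the two feature sets can be verified with a few lines of arithmetic. The variable and function names are illustrative; the numbers are the ones given on the slides.

```python
# Frame-level dimensions of each short-term timbre feature.
SSD, MFCC, SCV, SFM_SCM = 5, 21, 16, 16
N_MOD_SUBBANDS = 7       # modulation frequency subbands
N_ACOUSTIC_SUBBANDS = 8  # octave-based acoustic subbands

def modulation_dim(n_feature_dims):
    """Two matrices, each summarized by mean/std along both axes."""
    return 2 * (n_feature_dims * 2 + N_MOD_SUBBANDS * 2)

mustd = 2 * (SSD + MFCC + SCV + SFM_SCM)
mmfcc = modulation_dim(MFCC)
mscv = modulation_dim(SCV)
msfm_mscm = modulation_dim(SFM_SCM)
amscv = N_ACOUSTIC_SUBBANDS * N_MOD_SUBBANDS * 2
amsfm_amscm = N_ACOUSTIC_SUBBANDS * N_MOD_SUBBANDS * 2

mirex_2011_dim = mustd + mmfcc + mscv + msfm_mscm
improved_dim = mustd + mmfcc + amscv + amsfm_amscm
print(mirex_2011_dim, improved_dim)
```

The improved method is 40 dimensions larger because each joint AM feature has 112 dimensions instead of 92.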

18/40

Experimental setup and results of MIREX 2011 genre/mood classification tasks
  Datasets
    Genre classification: 10 genres, 700 30-sec clips in each one
    Mood classification: 5 categories, 120 30-sec clips in each one
  Evaluation metric
    Three-fold cross-validation; classification accuracy
  Results (JR1 is ours)

[Figure: two bar charts of accuracy (%) per submission, axis ticks from 30 to 70. Genre classification labels: WR1, TCCP4, SSKS2, JR1, SSPK1, WR2, TCCP3, JR2, ES2, ES1, DM1, GDC2, EP2, GKC4. Mood classification labels: JR1, TCCP4, WR1, TCCP3, SSKS2, ES2, SSPK1, WR2, ES1, JR2, DM4, DM1, GDC1, EP2, GDC2, GKC4.]

19/40

Experimental results of MIREX 2008-2013 genre/mood classification tasks

Participations       Classification task (year)   Accuracy (%)   Rank (# of submissions)
Wu and Jang          Genre (2013)                 76.23          1 (13)
Wu and Jang          Genre (2012)                 76.13          1 (16)
Wu and Ren           Genre (2011)                 75.57          1 (15)
Our submission       Genre (2011)                 74.23          4 (15)
Seyerlehner et al.   Genre (2010)                 73.64          1 (24)
Cao and Li           Genre (2009)                 73.33          1 (31)
Tzanetakis           Genre (2008)                 67.83          1 (13)
Wu and Jang          Mood (2013)                  68.33          1 (23)
Panda and Paiva      Mood (2012)                  67.83          1 (20)
Our submission       Mood (2011)                  69.50          1 (17)
Wang et al.          Mood (2010)                  64.17          1 (36)
Cao and Li           Mood (2009)                  65.67          1 (33)
Peeters              Mood (2008)                  63.67          1 (13)

20/40

Extended experiments
  Four datasets

Dataset       Category   Class #   Min/Max # of clips in classes   Total # of clips   Duration of each clip
GTZAN         Genre      10        100/100                         1,000              30s
Unique        Genre      14        26/766                          3,115              ~30s
Soundtracks   Mood       6         30/30                           180                18s to 30s
MIR-Mood      Mood       4         464/619                         2,223              ~30s or ~60s

  Performance evaluation
    Randomly stratified 10-fold cross-validation
    Repeat the above process 10 times to obtain the average result
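The evaluation protocol (stratified 10-fold cross-validation, repeated 10 times and averaged) can be sketched with the standard library alone. This is an illustration under assumptions: folds are built by shuffling each class and dealing its clips round-robin, and `evaluate_fold` is a hypothetical callback that trains on `train_idx`, tests on `test_idx`, and returns an accuracy.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Split clip indices into folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)  # deal clips round-robin
    return folds

def repeated_cv_accuracy(labels, evaluate_fold, n_folds=10, n_repeats=10):
    """Average accuracy over n_repeats runs of stratified n_folds-fold CV."""
    accuracies = []
    for repeat in range(n_repeats):
        folds = stratified_folds(labels, n_folds, seed=repeat)
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            accuracies.append(evaluate_fold(train_idx, test_idx))
    return sum(accuracies) / len(accuracies)
```

Stratification matters most for the Unique dataset, where the smallest class has 26 clips and the largest has 766; an unstratified split could leave a fold with almost no clips from the smallest class.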

21/40

Extended experiments
  Averaged classification accuracy (%) of combining different feature sets on four datasets

22/40

Extended experiments
  Comparison of our methods with other recent work

23/40

Conclusions
  Timbre & modulation features
    Won 1st place (MIREX 2011 mood classification)
  Timbre & improved modulation
    Improves accuracy by 2.47%/2.08% on GTZAN/Unique
    Achieves 2.50%/0.14% higher than MIREX 2011 method on Soundtracks/MIR-Mood

24/40

Thank you for listening. Questions & comments welcome!
