1 Musical Genre Classification Prepared by Elliot Sinyor for MUMT 611 March 3, 2005.
1
Musical Genre Classification
Prepared by Elliot Sinyor for MUMT 611, March 3, 2005
2/36
Table of Contents
What is Genre?
Approaches to Genre Classification
  Manual
  Automatic
Related Work
  Soltau 1998
  Tzanetakis & Cook: prescriptive approach
  Pachet et al. 2001: emergent approach
Conclusion
3/36
What is Genre?
A way of describing what an item shares with other items as well as what differentiates it from other items
From Aucouturier and Pachet: “The genesis of genre is therefore to be found in our natural and irrepressible tendency to classify”
4/36
What is Genre?
A&P separate genre into two broad categories:
  Intentional vs. Extensional
5/36
What is Genre? - Intentional
More subjective
Relies on collective cultural knowledge
Social/historical context
E.g., 60s, hippies, Brit-pop
6/36
Problems with “Genre”
What do the names mean?
  Rock? Pop?
No fixed semantics
  Amazon.com genres by:
    Period (“60s pop”)
    Topic (“love song”)
    Country of origin (“Japanese music”)
Genre is based on extrinsic habits rather than intrinsic properties
  To a French person, C. Aznavour is “Variety”
  To an English person, C. Aznavour is “French”
7/36
What is Genre? - Extensional
Analysis-based
Describes the music itself
  Tempo, timbre, pitch, language, etc.
(Sometimes) easier for automatic genre classification systems
E.g., fast rock, mellow classical
8/36
Problems with “Genre”
What granularity to use?
  By artist?
    Please Please Me vs. Sgt. Pepper
  By album?
    Revolution 9 vs. Helter Skelter vs. Mother Nature’s Son
Does work for broad categories
  Rock vs. Classical
9/36
Problems with “Genre”
Does anyone agree?
  Allmusic.com – 531 genres
  Amazon.com – 719 genres
  Mp3.com – 430 genres
Only 70 words common to the three taxonomies (Pachet and Cazaly 2000)
10/36
Approaches to Genre Classification
Manual
  Musicologists and elbow grease
Automatic
  Prescriptive
    Signal-analysis based
  Emergent
    Uses existing human-entered meta-data to group things together
11/36
Manual Classification
Dannenberg et al. 2001: to build a taxonomy for the MSN Music Search Engine
  “Few hundred thousand songs”
  Hired full-time musicologists
  Took 30 human years
  “The details of the taxonomy and the design methodology are, however, not available”
12/36
Manual Classification
Pachet and Cazaly 2001 (CUIDADO)
  Separated descriptors – country, instrumentation, artist type, etc.
  _____ Rock
Too sensitive to musical evolution, difficult to build, difficult to maintain
Changed focus to artists instead of titles.
In any case, insufficient for millions of titles
13/36
Prescriptive – History
Originated from Speech Recognition work
Most early systems classified audio from TV into music, speech, or environmental sound
14/36
Prescriptive – Various Approaches
Saunders 1996
  Thresholding/ZCR techniques
Scheirer and Slaney 1997
  Multiple features and statistical pattern recognition
Kimber and Wilcox 1996
  MFCCs and HMM to classify into music, speech, laughter and nonspeech
Zhang and Kuo 2001
  Rule-based system for classifying audio from movies and TV into:
    Non-music: pure speech, non-harmonic environmental sound
    Music: harmonic environmental sound, pure music, song, speech with music, environmental sound with music
15/36
Prescriptive
Soltau et al 1998 – “Recognition of Music Types”
New approach – Explicit Time Modelling with Neural Network (ETM-NN)
16/36
Prescriptive – Soltau et al. 1998
In a nutshell:
  Transform the acoustic signal into a sequence of abstract sonic events
  Look at statistical patterns derived from the sequences
  Combine them into vectors that represent temporal structure
  3-layer feed-forward network
17/36
Prescriptive – Soltau et al. 1998
Experimental Results:
  3 hours of data (360 samples, 30 sec each)
  Rock, Pop, Techno, Classical
  67% training, 13% cross-validation, 20% evaluation
Compare ETM-NN vs. HMM, using cepstral coefficients
  ETM-NN: 86.1%
  HMM: 79.2%
18/36
“Musical Genre Classification of Audio Signals” – Tzanetakis and Cook, 2002
Timbral Texture Features
  Spectral {Centroid, Rolloff, Flux}, ZCR, MFCC (5 coefficients)
Analysis Window – features should be stable – 23 ms
Texture Window – “minimum amount of time to identify a 'texture’”
  43 analysis windows, 1 sec.
  “Memory of the past”
Statistics (means, variances) of features over the texture window
20/36
Timbral Texture Feature Vector
Statistics (means, variances) of features over the texture window
19 dimensions
  (mean, variance) of SC, SF, SR, ZCR, 5 MFCC
  “Low energy feature”: fraction of analysis windows over the texture window that have less than average RMS energy
    E.g., vocal music will have more silences
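The texture-window statistics and the low energy feature described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, only ZCR stands in for the full feature set, and the input is assumed to be a list of analysis windows, each a plain list of samples.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def rms(frame):
    """Root-mean-square energy of one analysis window."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def mean_and_variance(values):
    """Population mean and variance of a list of numbers."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

def texture_window_features(frames):
    """Summarise one texture window (hypothetical helper).

    frames: list of analysis windows, each a list of samples.
    """
    zcrs = [zero_crossing_rate(f) for f in frames]
    energies = [rms(f) for f in frames]
    zcr_mean, zcr_var = mean_and_variance(zcrs)
    avg_energy = sum(energies) / len(energies)
    # Low energy feature: share of windows quieter than the window average.
    low_energy = sum(1 for e in energies if e < avg_energy) / len(energies)
    return {"zcr_mean": zcr_mean, "zcr_var": zcr_var, "low_energy": low_energy}
```

A full 19-dimensional vector would repeat the mean/variance step for the spectral features and the 5 MFCCs, then append the low energy value.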
21/36
Rhythmic Content – “Beat Histogram”
“Pitch detection with larger periods”
Use DWT to divide signal into frequency bands
22/36
Rhythmic Content – “Beat Histogram”
23/36
Features taken from BH
A0, A1: relative amplitude (divided by the sum of amplitudes) of the first, and second histogram peak;
RA: ratio of the amplitude of the second peak divided by the amplitude of the first peak;
P1, P2: period of the first, second peak in bpm;
SUM: overall sum of the histogram (indication of beat strength).
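As a rough sketch of how the six features listed above could be read off a beat histogram, the following assumes the histogram is already available as a mapping from tempo (bpm) to accumulated amplitude. The function name, the input format, and the simple local-maximum peak picking are illustrative assumptions, and the histogram is assumed to contain at least two peaks.

```python
def beat_histogram_features(hist):
    """hist: dict mapping tempo in bpm -> accumulated amplitude (assumed format)."""
    bpms = sorted(hist)
    amps = [hist[b] for b in bpms]
    total = sum(amps)

    # A bin counts as a peak if it is higher than its left neighbour
    # and at least as high as its right neighbour.
    peaks = []
    for i, a in enumerate(amps):
        left = amps[i - 1] if i > 0 else float("-inf")
        right = amps[i + 1] if i < len(amps) - 1 else float("-inf")
        if a > left and a >= right:
            peaks.append((a, bpms[i]))
    peaks.sort(reverse=True)  # strongest peak first
    (a_first, p1), (a_second, p2) = peaks[:2]

    return {
        "A0": a_first / total,    # relative amplitude of the first peak
        "A1": a_second / total,   # relative amplitude of the second peak
        "RA": a_second / a_first, # second peak amplitude over first
        "P1": p1,                 # period of the first peak, in bpm
        "P2": p2,                 # period of the second peak, in bpm
        "SUM": total,             # overall sum, an indication of beat strength
    }
```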
24/36
Pitch Content Features
Used enhanced autocorrelation function to create folded (1 octave) and unfolded (all notes) pitch histograms
Mapped to MIDI note numbers
Folded – common pitch classes
Unfolded – pitch range
  Higher for jazz, classical
FA0, UP0, UP1, IPO1 (interval between 2 highest peaks), SUM
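The difference between the folded and unfolded histograms can be illustrated with a small sketch. It assumes pitch detection has already produced a list of MIDI note numbers; the function name is hypothetical, and folding is done with a plain modulo 12, which may differ from the exact mapping used in the paper.

```python
from collections import Counter

def pitch_histograms(midi_notes):
    """Return (unfolded, folded) histograms for a list of MIDI note numbers."""
    unfolded = Counter(midi_notes)               # one bin per note: shows pitch range
    folded = Counter(n % 12 for n in midi_notes) # one bin per pitch class (single octave)
    return unfolded, folded

# Hypothetical detected notes: C4, E4, G4, C5, E5, C4
notes = [60, 64, 67, 72, 76, 60]
unfolded, folded = pitch_histograms(notes)
pitch_range = max(unfolded) - min(unfolded)         # spread of the unfolded histogram
most_common_class = folded.most_common(1)[0][0]     # dominant pitch class
```

In this toy input all three C notes land in the same folded bin, while the unfolded histogram keeps them apart and exposes the overall range.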
25/36
Experimental Results
Used GMM classifiers with diagonal covariance matrices
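To make the classifier concrete, here is a minimal sketch of scoring a feature vector against Gaussian mixture models with diagonal covariance and picking the most likely genre. It covers only the classification step: the mixture parameters are assumed to have been estimated already (for example with EM), and all names and the model format are illustrative assumptions.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (per-dimension variances)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, components):
    """components: list of (weight, mean, var) tuples for one genre model."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    top = max(logs)
    # log-sum-exp keeps the mixture sum numerically stable
    return top + math.log(sum(math.exp(l - top) for l in logs))

def classify(x, models):
    """Return the genre whose mixture model gives x the highest likelihood."""
    return max(models, key=lambda genre: gmm_log_likelihood(x, models[genre]))

# Hypothetical two-dimensional models, for illustration only.
models = {
    "classical": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "rock": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}
```

With a diagonal covariance each dimension is scored independently, which is why the density reduces to a sum over dimensions.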
26/36
Experimental Results
27/36
Prescriptive – Some Results: (from A&P)
Gaussian and Gaussian Mixture Models: 48% successful classification in Ermolinskiy et al. (2001), using 100 songs for each class in the training phase. This result should be treated with care, since the system uses only pitch information.
Tzanetakis et al. (2001) achieves a rather disappointing 57%, but also reports 75% in Tzanetakis and Cook (2000a) using 50 songs per class.
90% in Lambrou and Sandler (1998) and 75% in Deshpande et al. (2001) on a very small training and test set, which may not be representative.
Pye (2000) reports 90% on a total set of 175 songs.
Soltau (1998) reports 80% with HMM, 86% with NN, with a database of 360 songs.
28/36
Emergent
Unlike Prescriptive, it is unsupervised
Based on “cultural similarity from text documents”
Possible to extract similarities that are not possible to extract from the audio signal
29/36
Emergent – Collaborative Filtering
Shardanand & Maes 1995, Pestoni et al. 2001
  There are patterns in tastes
  Have users rate their music, match like-tasted users, recommend unknown items to users
Problems
  Good for naïve profiles, bad for broad, eclectic tastes
  Favors “middle of the road” – liked by large proportion
  Only works some time after release of new music
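The rate, match, recommend loop above can be sketched as follows. This is a generic user-based scheme for illustration, not the algorithm of either cited system: the similarity measure (inverse of the mean squared rating difference), the function names, and the ratings format are all assumptions.

```python
def similarity(a, b):
    """Similarity of two users' rating dicts, based on titles both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    msd = sum((a[t] - b[t]) ** 2 for t in common) / len(common)
    return 1.0 / (1.0 + msd)  # identical ratings give 1.0

def recommend(target, ratings):
    """Rank titles the target user has not rated, by similarity-weighted ratings."""
    scores, weights = {}, {}
    for user, theirs in ratings.items():
        if user == target:
            continue
        w = similarity(ratings[target], theirs)
        for title, r in theirs.items():
            if title in ratings[target]:
                continue
            scores[title] = scores.get(title, 0.0) + w * r
            weights[title] = weights.get(title, 0.0) + w
    predicted = {t: scores[t] / weights[t] for t in scores if weights[t] > 0}
    return sorted(predicted, key=predicted.get, reverse=True)

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"A": 5, "B": 1},
    "bob": {"A": 5, "B": 1, "C": 5, "D": 1},
    "cat": {"A": 1, "B": 5, "C": 1, "D": 5},
}
```

The sketch also shows the cold-start problem from the list: a title nobody has rated yet never appears in `scores`, so it cannot be recommended.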
30/36
Emergent – Co-occurrence Analysis
Pachet et al. 2001
Looks at online text sources for co-occurrences of songs (aka data mining)
If 2 items appear in the same context (or share a common neighbour), this is evidence of some sort of similarity
31/36
Co-occurrence
Pachet et al. 2001 “Musical Data Mining for Electronic Music Distribution”
Sources used
  Track listing databases (CDDB)
    Mostly look at compilations of similar artists
  Radio show playlists
    Specialty programs better than daily commercial radio
    Lists made by experts
32/36
Co-occurrence
Build a matrix where:
  Value of entry (i, j) corresponds to the number of times title i co-occurs with title j
What about indirect co-occurrence?
  E.g., Eleanor Rigby/Good Vibrations and Good Vibrations/God Only Knows suggest a link between Eleanor Rigby and God Only Knows
  Correlation measure, using co-variance matrices of each title
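A small sketch may help show why a correlation measure captures indirect co-occurrence. It builds the co-occurrence matrix from a list of contexts (playlists or track listings) and then compares two titles by the Pearson correlation of their matrix rows. The function names and input format are hypothetical, and row-wise Pearson correlation is one plausible reading of the correlation measure, not necessarily the exact formula of Pachet et al.

```python
import math
from itertools import combinations

def cooccurrence_matrix(titles, contexts):
    """Entry (i, j) counts how many contexts contain both title i and title j."""
    index = {t: i for i, t in enumerate(titles)}
    n = len(titles)
    matrix = [[0] * n for _ in range(n)]
    for context in contexts:
        for a, b in combinations(sorted(set(context)), 2):
            i, j = index[a], index[b]
            matrix[i][j] += 1
            matrix[j][i] += 1
    return matrix

def correlation(u, v):
    """Pearson correlation of two co-occurrence rows (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 0.0
    return cov / (su * sv)

titles = ["Eleanor Rigby", "Good Vibrations", "God Only Knows"]
contexts = [
    ["Eleanor Rigby", "Good Vibrations"],
    ["Good Vibrations", "God Only Knows"],
]
matrix = cooccurrence_matrix(titles, contexts)
```

In this toy data Eleanor Rigby and God Only Knows never share a context, so their direct count is zero, yet their rows are identical because both co-occur only with Good Vibrations.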
33/36
Experimental Results
Using distance functions, use Ascendant Hierarchical Clustering
Used CDDB database, compared co-occurrence vs correlation
Manually examined results
  “70% of clusters had interesting similarities”
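Ascendant (agglomerative) hierarchical clustering starts with every title in its own cluster and repeatedly merges the two closest clusters. The sketch below assumes a precomputed symmetric distance matrix and uses average linkage; the linkage rule, the stopping criterion, and the function name are illustrative choices, since the slides do not specify them.

```python
def agglomerative(dist, n_clusters):
    """Merge the closest clusters until only n_clusters remain.

    dist: symmetric matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        _, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical distances: items 0 and 1 are close, items 2 and 3 are close.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
```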
34/36
Experimental Results
35/36
Challenges
Name format is not strictly enforced The Beatles; Beatles, The; Beatles
Difficult to characterize the nature of the similarities
Cover songs can sound nothing alike
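The name-format problem can be partly handled by normalising artist names before counting co-occurrences. The following is a small sketch under simple assumptions (English article "the", comma-inverted form); the function name and rules are hypothetical, and real metadata would need more cases.

```python
import re

def normalize_artist(name):
    """Map common spelling variants of an artist name to one lowercase key."""
    name = re.sub(r"\s+", " ", name.strip().lower())
    inverted = re.fullmatch(r"(.+),\s*the", name)  # "beatles, the" -> "beatles"
    if inverted:
        name = inverted.group(1)
    return re.sub(r"^the\s+", "", name)            # "the beatles" -> "beatles"

variants = ["The Beatles", "Beatles, The", "Beatles"]
keys = {normalize_artist(v) for v in variants}
```

A rule this blunt has its own failure modes: it would, for instance, shorten a band actually named "The The", so it is only a starting point.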
36/36
Conclusions and Future directions
“It seems that samples of Techno and Classical are easy to discriminate … Rock and Pop seems to be more difficult” – Soltau et al. 1998
Manual classification not feasible
Why not combine prescriptive/emergent techniques?