
MIREX 2010: AUDIO CLASSIFICATION USING SEMANTIC TRANSFORMATION AND CLASSIFIER ENSEMBLE

Ju-Chiang Wang*†, Hung-Yi Lo*, Shyh-Kang Jeng†, Hsin-Min Wang*

*Institute of Information Science, Academia Sinica, Taipei, Taiwan

†Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan

[email protected], [email protected], [email protected], [email protected]

ABSTRACT

Our audio classification system is implemented as follows. First, in the training phase, frame-based 70-dimensional feature vectors are extracted from each training song by MIRToolbox. Next, we apply the Posterior Weighted Bernoulli Mixture Model (PWBMM) to transform the frame-decomposed feature vectors of the training song into a semantic representation based on music tags; this procedure is called Semantic Transformation. In other words, a song is represented in terms of a multinomial distribution over relevant music tags. The PWBMM is fitted with 70-dimensional features from the MajorMiner music tagging dataset, which contains 2,472 ten-second audio clips and 45 pre-defined tags. Finally, for each class, the set of semantic representations of the songs is used to train an ensemble classifier consisting of SVM and AdaBoost classifiers. In the classification phase, unseen audio clips are submitted to the system. The resulting outputs of the individual classifiers for each class are transformed into calibrated probability scores so that they can easily be combined by the associated ensemble classifier. The class with the highest score is selected as the final output.

1. INTRODUCTION

In the music classification field, fixed numbers of categories or classes are usually pre-defined by experts for different classification tasks. For example, there are five such tasks in MIREX 2010: artist identification, US pop genre classification, Latin genre classification, music mood classification, and classical composer identification. In general, these categories or classes should be definite and as mutually exclusive as possible. For example, there are 10 genres in the US pop genre classification task: blues, jazz, country/western, baroque, classical, romantic, electronica, hip-hop, rock, and hard-rock/metal. However, when most people listen to a song they have never heard before, certain musical impressions are created in their minds, even though they may not be able to name the exact musical category of the song. These musical impressions can be described by general words such as exciting, noisy, fast, male vocal, drums, and guitars. Indeed, these musical impressions or concepts are inspired by direct auditory cues. In fact, we believe that a set of occurrences of these musical concepts may indicate membership in a specific audio classification class. Therefore, we are interested in exploring the relationships between the general words and the specific categories.

Since people listening to music tend to mentally tag a piece of music with specific words, music tags are a natural way to describe these general musical concepts. These tags can include different types of musical information such as genre, mood, and instruments. As a result, we believe that knowledge of pre-generated music tags in an audio dataset can help in the classification of another audio dataset. Our goal is to train a system to recognize musical concepts in terms of semantic tags first, and then to have the machine classify the music into specific classes based on the musical concepts.

metal, instrumental, horns, piano, guitar, ambient, saxophone, house, loud, bass, fast, keyboard, electronic, noise, british, solo, electronica, beat, 80s, dance, jazz, drum machine, strings, pop, r&b, female, rock, voice, rap, male, slow, vocal, quiet, techno, drum, funk, acoustic, distortion, organ, soft, country, hip hop, synth, trumpet, punk

Table 1. The 45 Tags in the MajorMiner Dataset

The system is overviewed as follows. In the training phase, we first extract audio features with respect to various musical characteristics, including dynamics, spectral, timbre, and tonal features, from the training songs. Next, we apply the Posterior Weighted Bernoulli Mixture Model (PWBMM) [1] to automatically tag the songs. The PWBMM is trained on the MajorMiner dataset crawled from the website of the MajorMiner music tagging game (http://majorminer.org/). The dataset contains 2,472 ten-second audio clips and their associated tags. As shown in Table 1, we select 45 tags to define the semantic space. In other words, a song is transformed into a 45-dimensional vector over the pre-defined tags using a tagging procedure we call Semantic Transformation (ST). In the MajorMiner dataset, the count of a tag given to a music clip ranges from 2 to 12.



These counts are also modeled by the PWBMM and have been shown to improve the performance of music tag annotation [1]. Finally, for each class, the associated training songs, each with a 45-dimensional semantic representation, are used to train an ensemble classifier consisting of SVM and AdaBoost classifiers. In the classification phase, the class with the highest output score is assigned to the unseen audio clip in question.

The remainder of this paper is organized as follows. In Section 2, we describe the music features used in this work. In Section 3, we present how to apply PWBMM for music semantic representation, and in Section 4 we present our ensemble classification method.

2. AUDIO FEATURE EXTRACTION

We use MIRToolbox 1.3 (http://www.jyu.fi/music/coe/materials/mirtoolbox) for music feature extraction [2]. As shown in Table 2, four types of features are used in this system: dynamics, spectral, timbre, and tonal features. To ensure alignment and prevent mismatch between the different features in a vector, all features are extracted over the same fixed-size short-time frames. A sequence of 70-dimensional feature vectors is extracted with a 50 ms frame size and a hop factor of 0.5.

Types      Feature Description                                              Dim
dynamics   rms                                                              1
spectral   centroid: mean                                                   1
           spread: standard deviation                                       1
           skewness                                                         1
           kurtosis                                                         1
           entropy                                                          1
           flatness                                                         1
           rolloff at 85%                                                   1
           rolloff at 95%                                                   1
           brightness: high frequency energy                                1
           roughness: sensory dissonance                                    1
           irregularity: the degree of variation of the successive peaks    1
timbre     zero crossing rate                                               1
           spectral flux                                                    1
           MFCC                                                             13
           delta MFCC                                                       13
           delta-delta MFCC                                                 13
tonal      key clarity                                                      1
           key mode possibility                                             1
           HCDF                                                             1
           chroma peak                                                      1
           chroma centroid                                                  1
           chroma                                                           12

Table 2. The music features used in the 70-dimensional frame-based vector.
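The paper performs this extraction with MIRToolbox 1.3 in Matlab. Purely as a hedged illustration of frame-based extraction with a 50 ms window and half-frame hop, the Python sketch below uses librosa as a stand-in and computes only a handful of the descriptors in Table 2; the file path and feature subset are assumptions, and the values will not match MIRToolbox's.

```python
# Illustrative frame-based feature extraction (librosa stand-in, not MIRToolbox).
import numpy as np
import librosa

def extract_frame_features(path, frame_ms=50, hop_ratio=0.5, sr=22050):
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(sr * frame_ms / 1000)        # 50 ms analysis window
    hop = int(n_fft * hop_ratio)             # half-frame hop
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    # Stack the per-frame descriptors: shape (n_frames, n_dims)
    return np.vstack([rms, centroid, zcr, mfcc]).T

# frames = extract_frame_features("clip.wav")   # hypothetical input file
```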


3. POSTERIOR WEIGHTED BERNOULLI MIXTURE MODEL

The PWBMM-based system consists of two steps. First, it converts the frame-based feature vectors of a song into a fixed-dimensional vector (a GMM posterior representation). Then, the Bernoulli Mixture Model (BMM) [3] predicts the scores over the 45 music tags for the song.

3.1 GMM Posterior Representation

The generation of the GMM posterior representation can be viewed as a process of soft tokenization from a music background model. First, the feature vectors from all audio clips are normalized to have a mean of 0 and a standard deviation of 1 in each dimension. Then the GMM is fitted using the expectation-maximization (EM) algorithm. With the GMM, we can describe how likely it is that a given feature vector belongs to a "latent music class" z_k, k = 1, …, K, by the posterior probability of the latent music class:

p(z_k = 1 \mid \mathbf{x}) = \frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.   (1)

Given a song s_j, and assuming that each frame contributes equally to the song, the posterior probability of a certain latent music class can be computed by

p(z_k = 1 \mid s_j) = \frac{1}{N_j} \sum_{i=1}^{N_j} p(z_k = 1 \mid \mathbf{x}_{ji}),   (2)

where x_ji is the feature vector of the i-th frame of song s_j, and N_j is the number of frames in song s_j.
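A minimal sketch of this soft tokenization, Eqs. (1)-(2), assuming scikit-learn's GaussianMixture as a stand-in for the paper's GMM; the number of latent music classes K shown here is an assumed value, not one stated in the paper.

```python
# GMM posterior representation: Eq. (1) per frame, Eq. (2) per song.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_background_gmm(all_frames, K=64):
    """all_frames: (total_frames, 70) z-score-normalized features pooled over
    all training clips; K latent music classes (value assumed)."""
    gmm = GaussianMixture(n_components=K, covariance_type='diag', random_state=0)
    gmm.fit(all_frames)
    return gmm

def song_posterior(gmm, frames):
    """Eq. (2): average the per-frame posteriors p(z_k = 1 | x_ji) over frames."""
    frame_post = gmm.predict_proba(frames)   # Eq. (1), shape (N_j, K)
    return frame_post.mean(axis=0)           # theta_j, shape (K,)
```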

3.2 Bernoulli Mixture Model

Assume a music corpus with J songs, each denoted s_j, j = 1, …, J, with associated tagging counts c_jw, w = 1, …, W. The tagging counts are positive integers indicating the number of times that tag t_w has been assigned to song s_j. The binary random variable y_jw ∈ {0, 1} represents the event of tag t_w applying to song s_j.

3.2.1 Generative Process

The generative process of BMM has two steps. First, the latent class z_kw, k = 1, …, K, is chosen according to song s_j's class weight vector θ_j:

p(z_{kw} = 1 \mid \boldsymbol{\theta}_j) = \theta_{jk},   (3)

where θ_jk is the weight of the k-th latent class. Next, a value of the binary variable y_jw is drawn according to the following conditional probabilities:

p(y_{jw} = 1 \mid z_{kw} = 1, \boldsymbol{\beta}) = \beta_{kw},
p(y_{jw} = 0 \mid z_{kw} = 1, \boldsymbol{\beta}) = 1 - \beta_{kw}.   (4)

The conditional probability that models whether song s_j has tag t_w is thus a Bernoulli distribution over the binary variable y_jw with parameter β_kw for the k-th class z_kw.

The complete joint distribution over y and z is described with the model parameter β and the weight matrix Θ, whose j-th row is the vector θ_j of song s_j:

p(\mathbf{y}, \mathbf{z} \mid \boldsymbol{\beta}, \Theta) = \prod_{j=1}^{J} \prod_{w=1}^{W} \prod_{k=1}^{K} \left[ \theta_{jk}\, p(y_{jw} \mid z_{kw} = 1, \beta_{kw}) \right]^{z_{kw}}.   (5)

3.2.2 Model Inference by the EM Algorithm

The Bernoulli Mixture Model can be fitted with respect to the parameter β and the weight matrix Θ by maximum-likelihood (ML) estimation. We link the latent classes of the BMM with the "latent music classes" of the GMM described in Section 3.1. Each song, which has been transformed into the posterior weights θ_jk = p(z_k = 1 | s_j) by Eq. (2), provides the observation information to the BMM; therefore, we only need to estimate β, which corresponds to the probability that a latent music class occurs. We apply the EM algorithm to maximize the log-likelihood of Eq. (5) in the presence of the latent variable z.

In the E-step, given the song-level weight matrix Θ and the model parameter β, the posterior probability of each latent variable z_kw can be computed by

\gamma_j(z_{kw}) = p(z_{kw} = 1 \mid y_{jw}, \boldsymbol{\theta}_j, \boldsymbol{\beta}) =
\begin{cases}
\dfrac{\theta_{jk}\, \beta_{kw}}{\sum_{i=1}^{K} \theta_{ji}\, \beta_{iw}}, & \text{for } y_{jw} = 1, \\[1ex]
\dfrac{\theta_{jk}\, (1 - \beta_{kw})}{\sum_{i=1}^{K} \theta_{ji}\, (1 - \beta_{iw})}, & \text{for } y_{jw} = 0.
\end{cases}   (6)

In the M-step, the update rule for β_kw is as follows:

\beta_{kw} \leftarrow \frac{\sum_{j} \gamma_j(z_{kw})\, y_{jw}}{\sum_{j} \gamma_j(z_{kw})}.   (7)

From the tag counts of the music corpus, we know that there exist different levels of relationship between a song and a tag. If song s_j has a tag count c_jw greater than one for tag t_w, we can make song s_j contribute to β_kw c_jw times, rather than only once, in each iteration of EM. This leads to a new update rule for β_kw:

\beta_{kw} \leftarrow \frac{\sum_{j} c_{jw}\, \gamma_j(z_{kw})\, y_{jw}}{\sum_{j: y_{jw}=1} c_{jw}\, \gamma_j(z_{kw}) + \sum_{j: y_{jw}=0} \gamma_j(z_{kw})}.   (8)
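The following NumPy sketch illustrates the E-step of Eq. (6) and the count-weighted M-step of Eq. (8), with the song-level posteriors theta from Eq. (2) held fixed as described above; the array layout, initialization, and iteration count are assumptions rather than the authors' implementation.

```python
# EM estimation of beta for the PWBMM (Eqs. (6) and (8)), theta held fixed.
import numpy as np

def fit_bmm_beta(theta, y, c, n_iter=50, eps=1e-12):
    """theta: (J, K) song-level posteriors; y: (J, W) binary tag indicators;
    c: (J, W) tag counts (>= 1 where y == 1). Returns beta of shape (K, W)."""
    beta = np.full((theta.shape[1], y.shape[1]), 0.5)       # assumed initialization
    for _ in range(n_iter):
        # E-step, Eq. (6): gamma[j, k, w] = p(z_kw = 1 | y_jw, theta_j, beta)
        pos = theta[:, :, None] * beta[None, :, :]          # branch for y_jw = 1
        neg = theta[:, :, None] * (1.0 - beta[None, :, :])  # branch for y_jw = 0
        num = np.where(y[:, None, :] == 1, pos, neg)
        gamma = num / (num.sum(axis=1, keepdims=True) + eps)
        # M-step, Eq. (8): positive songs contribute c_jw times, negatives once
        cw = np.where(y == 1, c, 1.0)
        numer = (cw[:, None, :] * gamma * y[:, None, :]).sum(axis=0)
        denom = (cw[:, None, :] * gamma).sum(axis=0) + eps
        beta = numer / denom
    return beta
```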

3.2.3 Semantic Representation

A song s to be classified is first transformed into a sequence of music feature vectors, and the posterior weight vector θ is computed from the pre-trained GMM. Finally, the transformed value of song s on tag t_w is computed as the conditional probability of y_w = 1 given the BMM parameters β:

p(y_w = 1 \mid \boldsymbol{\theta}, \boldsymbol{\beta}) = \sum_{k=1}^{K} \theta_k\, p(y_w = 1 \mid \beta_{kw}) = \sum_{k=1}^{K} \theta_k \beta_{kw} = v_w.   (9)

The transformed vector associated with the pre-defined tags provides the semantic representation of the song and its music tags.
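Continuing the sketch above, the semantic transformation of Eq. (9) reduces to a single matrix-vector product (names as assumed earlier):

```python
def semantic_vector(theta_song, beta):
    """Eq. (9): theta_song (K,) posterior weights, beta (K, W) -> (W,) semantic vector v."""
    return theta_song @ beta
```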

4. THE ENSEMBLE CLASSIFICATION METHOD

Assume that we have N classes for the audio classification task, and that all the classes are independent. We can then train N binary classifiers, denoted C_g, g = 1, 2, …, N, one for each class. Each binary classifier C_g calculates a final score by combining the outputs of two sub-classifiers: Support Vector Machine (SVM) and AdaBoost.

4.1 Support Vector Machine

SVM finds a separating surface with a large margin between training samples of two classes in a high-dimensional feature space implicitly introduced by a computationally efficient kernel mapping. According to statistical learning theory, the large margin implies good generalization ability. In this work, we employ a linear SVM classifier f(v) of the following form:

f(\mathbf{v}) = \sum_{w=1}^{W} \lambda_w v_w + b,   (10)

where v_w is the w-th dimension of the semantic vector v of a test song; λ_w and b are parameters trained from the pairs (v_m, l_mg), m = 1, …, M, where v_m is the semantic vector of the m-th training song and l_mg ∈ {1, 0} is the g-th class label of the m-th training song. The advantage of the linear SVM is its training efficiency; recent work has shown that its prediction performance is comparable to that of the non-linear SVM. The single cost parameter is determined by cross-validation.
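One plausible realization of this per-class linear SVM, sketched with scikit-learn's LinearSVC; the candidate grid for the cost parameter C is an assumption, and the AUC scoring anticipates the model selection criterion of Section 4.4.

```python
# Per-class linear SVM (Eq. (10)) with inner cross-validation over C.
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def train_linear_svm(V, labels_g):
    """V: (M, 45) semantic vectors; labels_g: (M,) binary labels for class g."""
    search = GridSearchCV(LinearSVC(), {'C': [0.01, 0.1, 1, 10, 100]},
                          scoring='roc_auc', cv=5)
    search.fit(V, labels_g)
    return search.best_estimator_   # f(v) is its decision_function
```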

4.2 AdaBoost

Boosting is a method of finding a highly accurate classifier by combining many base classifiers, each of which is only moderately accurate. We use decision stumps as the base learners. The decision function of the boosting classifier takes the following form:

g(\mathbf{v}) = \sum_{t=1}^{T} \alpha_t h_t(\mathbf{v}),   (11)

where h_t is the t-th base classifier (a decision stump) and α_t is set as suggested in [4]. The model selection procedure can be done efficiently, as we can iteratively increase the number of base learners and stop when the generalization ability on the validation set no longer improves.
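A corresponding sketch of the AdaBoost sub-classifier of Eq. (11) with decision stumps as base learners, using scikit-learn's AdaBoostClassifier as a stand-in; the number of rounds shown is arbitrary and would in practice be selected as described in Section 4.4.

```python
# AdaBoost with decision-stump base learners, cf. Eq. (11).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_adaboost(V, labels_g, n_rounds=200):
    """V: (M, 45) semantic vectors; labels_g: (M,) binary labels for class g."""
    stump = DecisionTreeClassifier(max_depth=1)          # decision stump
    clf = AdaBoostClassifier(stump, n_estimators=n_rounds)
    clf.fit(V, labels_g)
    return clf
```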


4.3 Calibrated Probability Scores and Probability Ensemble

As each classifier C_g is trained independently for class g, the raw scores of different classifiers are not directly comparable. In the audio classification task, we need to compare the scores of all N classifiers and determine which class should be assigned to a given test audio clip. Therefore, we transform the output scores of the two sub-classifiers, i.e., SVM and AdaBoost, into probability scores with a sigmoid function [5]:

\Pr(l = 1 \mid \mathbf{v}) \approx \frac{1}{1 + \exp(Af + B)},   (12)

where f is the output score of a sub-classifier (SVM or AdaBoost), and A and B are learned by solving a regularized maximum likelihood problem [6]. After each sub-classifier output has been calibrated into a probability score, a classifier ensemble is formed by averaging the probability scores of SVM and AdaBoost, and the probability scores of the N classifiers become comparable.
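A minimal sketch of Eq. (12) and the probability ensemble, assuming the sigmoid parameters A and B have already been fitted for each sub-classifier as in [5], [6]; the function and argument names are illustrative.

```python
import numpy as np

def platt_probability(f, A, B):
    """Eq. (12): Pr(l = 1 | v) ~ 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + np.exp(A * f + B))

def ensemble_score(f_svm, f_ada, svm_AB, ada_AB):
    """Average the calibrated probabilities of the SVM and AdaBoost sub-classifiers."""
    p_svm = platt_probability(f_svm, *svm_AB)
    p_ada = platt_probability(f_ada, *ada_AB)
    return 0.5 * (p_svm + p_ada)

# The test clip is assigned to the class g whose ensemble_score is highest.
```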

4.4 Cross-Validation

We first perform inner cross-validation on the training set to determine the cost parameter C of the linear SVM and the number of base learners in AdaBoost. Then, we retrain the classifiers on the complete training set with the selected parameters. Since the class distributions for some classes might be imbalanced (we do not have the classification dataset from MIREX), we use the AUC-ROC as the model selection criterion.
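As an illustration of this AUC-based selection, the number of AdaBoost rounds could be chosen by scanning the staged scores on a held-out validation split, as in the sketch below (helper and variable names are hypothetical):

```python
from sklearn.metrics import roc_auc_score

def select_n_rounds(fitted_adaboost, V_val, labels_val):
    """Return the boosting round count with the best validation AUC-ROC."""
    best_auc, best_t = -1.0, 1
    for t, scores in enumerate(fitted_adaboost.staged_decision_function(V_val), start=1):
        auc = roc_auc_score(labels_val, scores)
        if auc > best_auc:
            best_auc, best_t = auc, t
    return best_t
```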

5. REFERENCES

[1] J.-C. Wang, H.-S. Lee, S.-K. Jeng and H.-M. Wang: “Posterior Weighted Bernoulli Mixture Model for Music Tag Annotation and Retrieval,” Submitted to APSIPA ASC, 2010.

[2] O. Lartillot and P. Toiviainen: “A Matlab Toolbox for Musical Feature Extraction from Audio,” International Conference on Digital Audio Effects, Bordeaux, 2007.

[3] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

[4] Y. Freund and R. E. Schapire: “A Decision-theoretic Generalization of On-line Learning and An Application to Boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp.119–139, 1997.

[5] J. Platt: “Probabilistic Outputs for Support Vector Machines and Comparison to Regularized Likelihood Methods,” Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, 1999.

[6] H.-T. Lin, C.-J. Lin, and R.-C. Weng, “A Note on Platt's Probabilistic Outputs for Support Vector Machines,” Machine Learning, vol. 68, no.3, pp. 267-276, 2007.