Report Project 3 ASP (7/24/2019)
Audio Signal Processing
Project-3
Raghavendra Reddy Pappagari
Johns Hopkins University
1. Song Fingerprinting and Searching
Definition: Fingerprinting refers to the efficient summarization of a whole song based on its content.
A few applications of fingerprinting are:
Audio monitoring without having to use metadata
Instant broadcasting on user demand
Getting statistics on how many times a song has been aired on radio channels
Efficient management of the ever-increasing music content on the Internet
Retrieving the metadata of a song from just a small clip of that song
A good fingerprint is small in memory and quickly searchable given a small song clip.
A fingerprint is robust if it can be used to successfully search for noisy clips.
Challenges and trade-offs
For the fingerprint to be robust to noise, it should retain as much of the song's acoustic and perceptual information as possible, but then complexity increases, and so does the searching time.
If the size of the fingerprint is small, discrimination between songs may become difficult, i.e., false alarms increase.
The granularity of the system should be good, i.e., only a very few seconds of an audio clip should be required to retrieve the original song.
1.1. Generation of fingerprint
First, the input music signal is windowed and its spectrogram is obtained.
The frame length and frame shift in the time domain are experimented with, as they are observed to have a large impact on the complexity of the system (fingerprint memory and searching time) and on its robustness to noise.
For each frame, a few notes (spectral peaks) are chosen from its magnitude response to represent the frame. As shown in Figure 1(a) and (b), only prominent frequencies in the spectrogram are chosen to represent the music signal.
The criterion for choosing the peaks is as follows: the frequency axis is divided into octave bands, and the maximum frequency component is chosen from each band if its magnitude is more than a threshold.
The threshold is set as the mean of the per-band peaks times a constant factor C.
This constant factor controls the number of notes used to represent the frame.
In this manner the whole song is represented as a table of frequencies and their time indices.
The reason for choosing only a few prominent notes is their high probability of surviving under noisy conditions.
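The peak-picking step above can be sketched as follows. This is an illustrative reconstruction, not the report's exact code: the octave-band edges and the use of scipy's `spectrogram` are assumptions, while the defaults for frame size, frame shift and C follow the report's best settings.

```python
import numpy as np
from scipy.signal import spectrogram

def constellation_map(x, fs, frame_size=5000, frame_shift=1000, C=1.2):
    """Pick prominent spectral peaks ("notes") per frame.

    The octave-band edges below are an assumption for illustration;
    frame_size, frame_shift and C default to the report's best values.
    """
    f, t, S = spectrogram(x, fs=fs, nperseg=frame_size,
                          noverlap=frame_size - frame_shift,
                          mode='magnitude')
    n_bins = S.shape[0]
    # Hypothetical octave-band edges over the frequency-bin axis.
    edges = [0] + [n_bins >> k for k in range(5, -1, -1)]
    peaks = []                                # (frame_index, freq_bin_index)
    for j in range(S.shape[1]):
        frame = S[:, j]
        # One candidate per octave band: the band maximum.
        cand = [lo + int(np.argmax(frame[lo:hi]))
                for lo, hi in zip(edges[:-1], edges[1:]) if hi > lo]
        # Keep candidates whose magnitude exceeds C times the mean of
        # the per-band peaks (the report's threshold rule).
        thr = C * np.mean([frame[k] for k in cand])
        peaks.extend((j, k) for k in cand if frame[k] > thr)
    return peaks
```

For instance, a 1 kHz tone sampled at 8 kHz should yield peaks at frequency bin 625 (= 1000 * 5000 / 8000) in every frame.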
To increase the robustness of the algorithm, each frequency is paired with a few other points obtained in the previous step.
As there is a good probability of matching the same note in another clip, using single notes results in a high number of false alarms.
To overcome this problem, each note is paired with another note in the same song, so the probability of mismatch decreases, as there is less chance of the same pair of notes occurring at the same time difference.
More formally, each note (anchor point) is paired with 5 other notes (target zone), and their time differences and the absolute time of the anchor point are stored.
An example of an anchor point and its corresponding target zone is shown in Figure 1(c).
So each point is stored as an array [frequency of anchor, frequency of point in target zone, time difference, absolute time of anchor point]. This way each point is represented as 5 arrays, as 5 notes are chosen as the target zone.
In the literature this representation is called hashing, and the array [frequency of anchor, frequency of point in target zone, time difference, absolute time of anchor point, song index] is called the hash.
Assuming the song index is S, from Figure 1(c) an example hash can be written as [f1, f2, t2 - t1, t1, S].
In this work, the target zone for each anchor point is chosen as the 5 points that follow after skipping 2 points, as shown in Figure 1(c).
Here, the trade-off for the increased robustness is increased storage space, approximately 5 times the original representation of the song without any target zone.
For the test music clip, a similar list of hashes (a hash table) is computed.
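The anchor/target-zone pairing can be sketched with a hypothetical helper; `fan_out` and `skip` default to the report's 5-point target zone that skips the 2 notes immediately after the anchor.

```python
def generate_hashes(peaks, song_index, fan_out=5, skip=2):
    """Pair each anchor note with `fan_out` later notes, skipping the
    `skip` notes immediately after the anchor (the report's target zone:
    5 points, starting 3 points past the anchor).

    Each hash is (f_anchor, f_target, time_difference, t_anchor, song_index).
    """
    peaks = sorted(peaks)                 # sort by (time, frequency)
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 + skip : i + 1 + skip + fan_out]:
            hashes.append((f1, f2, t2 - t1, t1, song_index))
    return hashes
```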
1.2. Searching
After obtaining the hash tables for the audio database and for the test music clip, an efficient searching algorithm is required to retrieve the original song.
A good searching algorithm should be able to quickly return the correct song.
I followed the Shazam paper [4] for implementing the searching algorithm.
In the first step, given a test hash (array), I searched for that hash in one song, and if a match was found I noted down the difference of the time indices. This is shown in Figures 2(a) and 3(a) as small blue circles.
The previous step is iterated over all hashes of the test music clip, and the matches are noted down.
Then a histogram of the noted time differences is computed. If the clip is from that song, the histogram will have a peak at a point which denotes the location of the clip in that song.
The same procedure is repeated for all songs in the database, and the song with the largest peak among all histograms is chosen as the system decision.
A peak in the histogram corresponds to more matching hashes, so the above method intuitively makes sense.
It can be seen in Figure 2(a) that a diagonal is present at the marked area, which denotes that the region is a potential candidate for the test music clip.
This visible diagonal region can be picked out by plotting the histogram of time differences, shown in Figure 2(b).
Figure 3 shows the plots for a song which has no match.
It can be observed from Figures 2(b) and 3(b) that a clear peak stands out in the histogram only if the song is a match.
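The search described above can be sketched as follows, assuming hashes in the [f_anchor, f_target, time difference, time, song index] format; the inverted-index layout is an implementation assumption, not necessarily the report's.

```python
from collections import Counter, defaultdict

def build_index(database_hashes):
    """Invert the database: key on the time-invariant part of the hash."""
    index = defaultdict(list)
    for f1, f2, dt, t_ref, song in database_hashes:
        index[(f1, f2, dt)].append((t_ref, song))
    return index

def search(index, query_hashes):
    """Histogram reference-minus-query time offsets per song; the song
    whose histogram has the tallest peak wins (assumes >= 1 match)."""
    offsets = defaultdict(Counter)
    for f1, f2, dt, t_query, _ in query_hashes:
        for t_ref, song in index.get((f1, f2, dt), []):
            offsets[song][t_ref - t_query] += 1
    best = max(offsets, key=lambda s: offsets[s].most_common(1)[0][1])
    return best, offsets[best].most_common(1)[0]
```

A true match produces many identical offsets (a histogram peak at the clip's location in the song), while a non-matching song scatters its few accidental matches across offsets.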
[Figure 1 panels, images omitted: (a) Spectrogram (frequency bin index vs. frame index), (b) Constellation Map, (c) close-up of an anchor point (t1, f1) and its target zone containing a point (t2, f2)]
Figure 1: (a) Spectrogram of a query music clip, (b) Constellation Map generated from the spectrogram, (c) Details of
hash generation
[Figure 2 panels, images omitted: (a) scatterplot of matching hash locations (query frame index vs. reference frame index), (b) histogram of matching time differences]
Figure 2: Illustration of the accuracy of the searching algorithm for a matching song: (a) scatterplot of matching hash locations, (b) histogram of matching time differences
[Figure 3 panels, images omitted: (a) scatterplot of matching hash locations, (b) histogram of matching time differences]
Figure 3: Illustration of the accuracy of the searching algorithm for a non-matching song: (a) scatterplot of matching hash locations, (b) histogram of matching time differences
1.3. Quick Overview of the method
The song database is represented as a quick look-up table where each entry corresponds to a pair of frequencies in the corresponding song, as shown in Figure 1.
Each entry is called a hash and its format is
[frequency of anchor, frequency of point in target zone, time difference of the two frequencies, absolute time of anchor point, song index]
In this work, each anchor point has 5 points in its target zone, which starts 3 points away from the anchor point, as shown in Figure 1(c).
Each hash entry of the test music clip is searched for in the music database, and matching time differences are noted, as shown in Figures 2(a) and 3(a).
Then their histogram is computed, and the song with the highest number of matches is chosen as the identified song.
1.4. Experiments and Discussion
For all the experiments, I have randomly cut 3.12-second music clips, and results are reported over 300 such clips.
The critical parameters in this algorithm are the frame size, the frame shift, and the coefficient C used to choose selective frequencies. I have experimented with these parameters extensively.
I converged to the optimal parameter set by experimenting with each parameter while keeping the other parameters constant.
In this work, frame shift and frame size are noted in terms of samples.
I set FrameSize=400, FrameShift=160 and C=2 and experimented on clean signals; the performance is 100%, but the system is time-consuming, which makes it unusable in real time, and it is also not memory efficient.
To reduce the fingerprint memory size, I set FrameShift=400; the performance drops to 93.2%, a considerable loss, but the algorithm is sped up 2.5 times.
As human ears respond well to changes in frequencies, good frequency resolution should give a good representation.
Based on this observation, I set FrameSize=1024, FrameShift=1000 samples and C=1.2 and evaluated the system. The results of this experiment are shown in Table 1 for different types of time-domain windows.
It can be observed that the performance is better than in the case of FrameSize=400, FrameShift=160 samples and C=2; one more advantage is that the number of frames per second is reduced by almost 6 times, giving less memory use and better speed.
Small values of C result in a high number of hashes.
As the frequency scale is divided into octave bands while choosing peak frequencies, the introduction of higher frequencies does not affect system performance, as shown in Table 1.
In all the above cases, where the frame overlap is small, the performance is not 100%, from which we can conclude that frame overlap has a considerable effect on the system.
As most of the information is in the low frequency bands, and due to the sensitivity of human ears to changes in low frequencies, it is a good idea to have good frequency resolution and to consider only the low frequency bands for representing the music signal.
With the support of this idea, I set FrameSize=5000, FrameShift=1000 and C=1.2. Now the system is remarkably robust to noise and very fast. It is also memory efficient. Results are shown in Figure 4.
Table 1: Results of song identification system setting FrameSize=1024, FrameShift=1000 samples and C=1.2
SNR (dB) -10 -5 0 5 10 15
Hamming Window 2% 13% 42.5% 73.5% 96.5% 95.5%
Rectangular Window 1% 15% 49.5% 84.5% 94.5% 98%
[Figure 4, image omitted: recognition rate (%) under noisy conditions vs. signal-to-noise ratio (dB), comparing "considering only low freq" against "considering all freq"]
Figure 4: Illustration of the robustness of the song identification system and the effect of considering only a few frequency bands of the magnitude response. Parameter values used are FrameSize=5000, FrameShift=1000 samples and C=1.2
It can be observed that considering high frequencies has a detrimental effect on the system in noisy conditions, which can be because of two reasons: (1) domination of noise in the high frequency bands, (2) the human tendency to perceive changes mainly in the low frequency bands.
This reasoning is supported by the fact that the system performs marginally better with all frequencies considered only at high SNR, as can be seen at SNR=10dB.
So the best parameter set found is FrameSize=5000, FrameShift=1000 and C=1.2.
To further speed up the system, FrameShift was set to 2000, but the performance went down significantly at lower SNRs.
1.5. Quick overview of observations in the experiments
Frame size and frame shift have a huge effect on the performance of the final system.
Discarding high frequency bands helps in noisy conditions without any effect on system performance in clean conditions.
Frame overlap is required for good performance at low SNRs.
The reported results are better than the Shazam paper [4] at every SNR in every aspect: memory requirement, speed, and accuracy.
As that paper reports a 25ms frame length and a 10ms frame shift, its complexity is considerably higher than this system's, where a 312.5ms (5000-sample) frame length and a 62.5ms (1000-sample) frame shift are used.
The noise robustness of this system can be attributed to the hash table generation, where each anchor point is paired with 5 other peak frequencies (notes).
1.6. Future Work
Dimensionality reduction techniques like PCA can be applied to obtain a compact yet informative representation.
Modelling techniques can be employed for generating fingerprints.
2. Genre classification
Genre classification is an important application, mainly for managing large amounts of music data.
It can be used as a front end for song identification systems: first classifying a test music clip into one of the genres will save a lot of searching time.
Each genre has its own special characteristics, different from other genres, which are useful for classification.
MFCC features are widely explored and used in the speech community, and the question is whether these features can be applied to music signals.
This question is addressed in [2], which confirms that the use of the mel scale for frequency warping is not harmful for representing music signals.
Also, in [2], the validity of the DCT for decorrelating filter-bank energies is confirmed.
So, in this work, I decided to use 39-dimensional MFCC features for representing songs, as we know that the spectral components contain a lot of information.
Other music-specific features, such as timbre, spectral roll-off, etc., can be appended to the MFCCs for experiments.
However, it has been shown that appending music-specific features to MFCCs yields little improvement, and can in fact be worse than MFCC features alone [1].
In this work, I employed a Gaussian Mixture Model (GMM) to obtain the statistics of the training data.
2.1. Gaussian Mixture Models
Mixture models capture the underlying statistical properties of data
In particular, a GMM models the probability distribution of the data as a linear weighted combination of Gaussian densities. That is, given a data set $X = \{x_1, x_2, \ldots, x_n\}$, the probability of the data $X$ being drawn from the GMM is

$$p(X) = \sum_{i=1}^{M} w_i \, \mathcal{N}(X \mid \mu_i, \Sigma_i) \qquad (1)$$

where $\mathcal{N}(\cdot)$ is the Gaussian density, $M$ is the number of mixtures, $w_i$ is the weight of the $i$-th Gaussian component, $\mu_i$ is its mean vector, and $\Sigma_i$ is its covariance matrix.
The parameters of the GMM, $\lambda = \{w_i, \mu_i, \Sigma_i\}$ for $i = 1, 2, \ldots, M$, can be estimated using the Expectation Maximization (EM) algorithm [3].
Fig. 5 illustrates the joint-density capturing capabilities of a GMM, using 2-dimensional data uniformly distributed along a circular ring.
The red ellipses, superimposed on the (blue) data points, correspond to the locations and shapes of the estimated Gaussian mixtures. The second column shows a 3-dimensional plot of the captured density, with the third dimension being the probability of the corresponding data point.
In the case of a 4-mixture GMM with diagonal covariance matrices, the density is poorly estimated at odd multiples of 45 degrees, as shown in Fig. 5(a).
As the number of mixtures increases, the density is captured better, as shown in Fig. 5(b).
Since diagonal matrices cannot capture correlations between dimensions, the curvature of the circular ring is not captured well.
In the case of diagonal covariance matrices, the ellipses are aligned with the x-y axes, as shown in Fig. 5(a) and Fig. 5(b).
The density estimation can be improved using full covariance matrices, as shown in Fig. 5(c).
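The ring experiment of Fig. 5 can be reproduced in sketch form with scikit-learn's `GaussianMixture` (an assumption: the report does not state which implementation was used):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Data uniformly distributed along a circular ring, as in Fig. 5.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 2000)
radius = 1.0 + 0.05 * rng.standard_normal(2000)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

# 10-mixture fits with diagonal vs. full covariance matrices.
diag = GaussianMixture(n_components=10, covariance_type='diag',
                       random_state=0).fit(X)
full = GaussianMixture(n_components=10, covariance_type='full',
                       random_state=0).fit(X)

# Full covariances can tilt along the ring, so the average training
# log-likelihood should come out at least as high as the diagonal fit's.
print('diag:', diag.score(X), 'full:', full.score(X))
```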
[Figure 5 panels (a), (b), (c): images omitted]
Figure 5: Illustration of distribution capturing capability of GMM. GMM trained with diagonal covariance matrices (a)
4-mixtures (b) 10-mixtures and (c) 10-mixture GMM trained with full covariance matrices
However, this improvement comes at the expense of an increased number of parameters and more computation. We need to estimate M(2D + 1) parameters for an M-mixture GMM with diagonal covariances, where D is the dimension of the data.
For a GMM with full covariance matrices, we need to estimate M(0.5D^2 + 1.5D + 1) parameters, which in turn requires a large amount of data.
It has also been shown that, for a sufficiently high number of mixtures, diagonal covariance matrices can capture the correlations in the data, so in practice GMMs with full covariance matrices are not commonly used.
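The parameter counts above can be checked with a small helper (a hypothetical function, written here just to verify the formulas):

```python
def gmm_param_count(M, D, full_cov=False):
    """Free parameters of an M-mixture, D-dimensional GMM: M weights,
    plus per mixture D means and either D variances (diagonal) or
    D(D + 1)/2 covariance entries (full)."""
    cov = D * (D + 1) // 2 if full_cov else D
    return M * (D + cov + 1)

# The report's cases: M(2D + 1) diagonal, M(0.5 D^2 + 1.5 D + 1) full.
print(gmm_param_count(16, 39))            # 1264
print(gmm_param_count(16, 39, True))      # 13120
```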
2.2. System Development
I have employed a Gaussian Mixture Model for obtaining the statistics of each song.
I have followed [1] (an IEEE Signal Processing Letters paper) for implementing genre classification. Inter-genre similarities are explored in that paper, hence the method's name: Inter-Genre Similarity modelling (IGS).
From each genre, the first 15 songs are chosen for training and the next 15 songs are used for testing.
I have trained one GMM for each genre. The number of mixtures (say M) is experimented with.
So, for the 5 classes in the song database, 5 M-mixture GMMs are trained.
For all the frames in the training data, the negative log-likelihood is calculated with respect to the 5 GMMs.
Misclassified frames are separated from the training data, and the 5 GMMs are updated using only the correctly classified frames.
The updated means and variances of the GMMs are more stable and represent the corresponding genre well, as we have not considered the more confusable frames. The more confusable frames have spectral characteristics that are common across genres.
A GMM is then built from all the misclassified frames, so in total we have 6 GMMs from the training data, which will be used for testing.
The GMM built upon the misclassified frames is termed the Inter-Genre Similarity (IGS) GMM because it captures similarities of spectral components across genres (IGS-GMM).
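One filtering iteration of the IGS training described above might look as follows, using scikit-learn's `GaussianMixture` as a stand-in for the report's own implementation (an assumption); log-likelihoods are compared directly, which gives the same decision as comparing negative log-likelihoods.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_igs(frames_by_genre, n_mix=8, seed=0):
    """One filtering iteration of IGS training.

    frames_by_genre: one (n_frames, dim) feature array per genre.
    Returns the updated per-genre GMMs and the IGS-GMM.
    """
    fit = lambda F: GaussianMixture(n_components=n_mix,
                                    covariance_type='diag',
                                    random_state=seed).fit(F)
    gmms = [fit(F) for F in frames_by_genre]
    kept, confusable = [], []
    for g, F in enumerate(frames_by_genre):
        # Log-likelihood of every frame under every genre GMM.
        ll = np.column_stack([m.score_samples(F) for m in gmms])
        correct = ll.argmax(axis=1) == g
        kept.append(F[correct])         # frames matching their own genre
        confusable.append(F[~correct])  # misclassified frames, all genres
    # Update genre GMMs on correctly classified frames only, and build
    # the IGS-GMM from the pooled misclassified frames.
    return [fit(F) for F in kept], fit(np.vstack(confusable))
```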
2.3. Classification of Genres
For the test music clip, extract 39-dimensional MFCCs and find the negative log-likelihood of each frame with respect to the 6 GMMs.
Discard the frames which belong to the IGS-GMM; consider only the frames which belong to the GMMs corresponding to the genres, and form a new set of frames.
A frame is said to belong to a particular GMM if its likelihood is higher for that GMM than for the other GMMs.
Now compute the average log-likelihood of the new set of frames with respect to the GMMs corresponding to the genres.
The genre whose GMM has the maximum average log-likelihood is chosen as the system decision.
The average log-likelihood can be computed as follows:

$$n^{*} = \arg\max_{n} \frac{1}{\sum_{k} w_{kn}} \sum_{k=1}^{K} w_{kn} \log P(f_k \mid \lambda_n) \qquad (2)$$

where $w_{kn}$ is 0 if $f_k$ belongs to the IGS-GMM and 1 otherwise, $\lambda_n$ denotes the $n$-th genre GMM, and it is assumed that the input music clip consists of $K$ frames, namely $f_1, f_2, \ldots, f_K$.
$w_{kn}$ lets us choose only the so-called new set of frames.
This procedure can be repeated several times, filtering misclassified frames in each iteration and updating the genre GMMs.
I did only one iteration because of the small data set.
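The decision rule of Eq. (2) can be sketched as follows; the frame-discarding test (a frame "belongs to" the IGS-GMM when its likelihood is highest under that model) follows the definition above. This is an illustrative sketch, not the report's code.

```python
import numpy as np

def classify(genre_ll, igs_ll):
    """Decision rule of Eq. (2): drop frames claimed by the IGS-GMM,
    then pick the genre with the highest average log-likelihood.

    genre_ll: (K, N) per-frame log-likelihoods under the N genre GMMs
    igs_ll:   (K,)   per-frame log-likelihoods under the IGS-GMM
    """
    keep = genre_ll.max(axis=1) > igs_ll   # w_k = 1 for kept frames
    if not keep.any():                     # degenerate clip: keep everything
        keep = np.ones_like(igs_ll, dtype=bool)
    return int(genre_ll[keep].mean(axis=0).argmax())
```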
2.4. Experiments and Discussion
Music signal frames are obtained with a 25ms window and a 10ms shift.
I have extracted 39-dimensional MFCCs (13 MFCCs + 13 delta + 13 delta-delta coefficients) with 26 triangular filters linearly spaced along the mel-frequency axis.
I have experimented with different numbers of mixtures (4, 8 and 16), and the corresponding results are shown in Table 2.
These results cannot be compared directly with the results published in [1], as those were obtained with a different database, but the accuracies are very close, which suggests that the implemented algorithm is correct.
Table 2: Results of Genre classification system using only GMMs and IGS-GMMs
4-mix 8-mix 16-mix
Only GMM modelling 41.4% 46.6% 45.8%
IGS modelling 45.2% 49.6% 46.6%
I have observed that 8-mixture GMMs perform best, as shown in Table 2.
I believe the reason why 16-mixture GMMs are not performing best is data insufficiency.
A 1-mixture GMM estimates 78 parameters (39 means and 39 variances for 39-dimensional MFCCs).
A 16-mixture GMM estimates 78 x 16 + 16 (weights) = 1264 parameters, while the training data available for each genre is only 150 seconds, out of which many frames are filtered out as misclassified; this can lead to over-fitting, i.e., memorization of data points.
The 4-mixture GMM is not performing well, possibly because the number of mixtures is too low, forcing each mixture to represent a wide spread of the genre's characteristics.
It can be observed that IGS modelling has improved performance by 3% in the 8-mixture case.
In every case, IGS modelling performs better than GMM-only modelling.
It is noted that more than half of the training data frames are misclassified with GMM-only modelling.
So the IGS-GMM is trained on half of the training data, taken across genres.
Since only 150 seconds (15000 frames) of training data is available for each class, after GMM modelling only about 7000 frames are classified correctly, and each GMM is updated with these 7000 frames.
It follows that it is inappropriate to use more mixtures for training the GMMs, as only 7000 frames are available, and more iterations of IGS modelling also cannot be employed.
For these experiments, I have randomly cut 500 clips from the testing database. Each clip is 50000 samples, i.e., 3.125 seconds.
Table 3 shows the confusion matrix of the genres and their accuracies.
The leftmost column shows the original genre, and the top row shows the genres with which the original genre got confused.
As can be seen, all the diagonal entries are the highest in their rows, i.e., no genre got confused with any other genre more than with itself.
Also, blues got misclassified most often as jazz, and rock was confused most often with blues.
Electronic and jazz got misclassified with one another.
The electronic genre has the least accuracy.
Rock and hip-hop seem to have no common characteristics, as they are not confused with each other.
It can be observed that hip-hop has the highest accuracy.
2.5. Future Work
This method can be improved by iterative filtering of misclassified frames if enough data is provided, which would also enable more experiments with the number of mixtures.
More experiments need to be done to find the effect of frame size and frame shift, as it was observed in the song identification task that they have a huge effect.
Table 3: Confusion matrix of genres computed with IGS modelling and 8-mixture GMMs
           Blues    Electronic  Jazz     Rock     Hip-hop  Accuracy
Blues      44/104   10/104      36/104   3/104    11/104   42.30%
Electronic 13/89    31/89       21/89    8/89     16/89    34.83%
Jazz       3/90     34/90       47/90    3/90     0/90     52.22%
Rock       32/115   11/115      25/115   47/115   0/115    40.87%
Hip-hop    9/102    3/102       16/102   0/102    74/102   72.54%
3. References
[1] Bağcı, Ulaş, and Engin Erzin. "Automatic classification of musical genres using inter-genre similarity." IEEE Signal Processing Letters 14.8 (2007): 521-524.
[2] Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modeling." ISMIR. 2000.
[3] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2012.
[4] Wang, Avery. "An Industrial-Strength Audio Search Algorithm." ISMIR 2003, pp. 7-13.