Report Project 3 ASP (7/24/2019)
Audio Signal Processing
Project-3
Raghavendra Reddy Pappagari
Johns Hopkins University
1. Song Fingerprinting and Searching
Definition: Fingerprinting refers to the efficient summarization of a whole song based on its content.
A few applications of fingerprinting are:
Audio monitoring without having to use metadata
Instant broadcasting on user demand
Getting statistics on how many times a song has been aired on radio channels
Efficient management of the ever-increasing music content on the Internet
Retrieving the metadata of a song from just a small clip of that song
A good fingerprint is small in memory and quickly searchable given a small song clip.
A fingerprint is robust if it can be used to successfully search for noisy clips.
Challenges and trade-offs
For the fingerprint to be robust to noise, it should retain as much of the song's acoustic and perceptual information as possible, but then complexity increases, and so does the searching time.
If the size of the fingerprint is small, discrimination between songs may become difficult, i.e., false alarms increase.
The granularity of the system should be good, i.e., only a very few seconds of an audio clip should be required to retrieve the original song.
1.1. Generation of fingerprint
First, the input music signal is windowed and its spectrogram is obtained.
The frame length and frame shift in the time domain are experimented with, as they are observed to have a large impact on the complexity of the system (fingerprint memory and searching time) and on its robustness to noise.
For each frame, a few notes (spectral peaks) are chosen from its magnitude response to represent the frame. As shown in Figure 1(a) and (b), only prominent frequencies in the spectrogram are chosen to represent the music signal.
The criterion for choosing the peaks is as follows: the frequency axis is divided into octave bands, and the maximum frequency component is chosen from each band if its magnitude is more than a threshold.
The threshold is set as the mean of the per-band peaks times a constant factor C.
This constant factor controls the number of notes used to represent the frame.
In this manner the whole song is represented as a table of frequencies and their time indices.
The reason for choosing only a few prominent notes is their high probability of surviving under noisy conditions.
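The peak-picking step above can be sketched as follows. This is an illustrative reconstruction, not the report's exact code: the octave-band edges and the use of scipy's `spectrogram` are assumptions, while the defaults for frame size, frame shift and C follow the report's best settings.

```python
import numpy as np
from scipy.signal import spectrogram

def constellation_map(x, fs, frame_size=5000, frame_shift=1000, C=1.2):
    """Pick prominent spectral peaks ("notes") per frame.

    The octave-band edges below are an assumption for illustration;
    frame_size, frame_shift and C default to the report's best values.
    """
    f, t, S = spectrogram(x, fs=fs, nperseg=frame_size,
                          noverlap=frame_size - frame_shift,
                          mode='magnitude')
    n_bins = S.shape[0]
    # Hypothetical octave-band edges over the frequency-bin axis.
    edges = [0] + [n_bins >> k for k in range(5, -1, -1)]
    peaks = []                                # (frame_index, freq_bin_index)
    for j in range(S.shape[1]):
        frame = S[:, j]
        # One candidate per octave band: the band maximum.
        cand = [lo + int(np.argmax(frame[lo:hi]))
                for lo, hi in zip(edges[:-1], edges[1:]) if hi > lo]
        # Keep candidates whose magnitude exceeds C times the mean of
        # the per-band peaks (the report's threshold rule).
        thr = C * np.mean([frame[k] for k in cand])
        peaks.extend((j, k) for k in cand if frame[k] > thr)
    return peaks
```

For instance, a 1 kHz tone sampled at 8 kHz should yield peaks at frequency bin 625 (= 1000 * 5000 / 8000) in every frame.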
To increase the robustness of the algorithm, each frequency is paired with a few other points obtained in the previous step.
As there is a good probability of matching the same note in another clip, using single notes results in a high number of false alarms.
To overcome this problem, each note is paired with another note in the same song, so the probability of mismatch decreases, as there is less chance of the same pair of notes occurring at the same time difference.
More formally, each note (anchor point) is paired with 5 other notes (target zone), and their time differences and the absolute time of the anchor point are stored.
An example of an anchor point and its corresponding target zone is shown in Figure 1(c).
So each point is stored as an array [frequency of anchor, frequency of point in target zone, time difference, absolute time of anchor point]. This way each point is represented as 5 arrays, as 5 notes are chosen as the target zone.
In the literature this representation is called hashing, and the array [frequency of anchor, frequency of point in target zone, time difference, absolute time of anchor point, song index] is called the hash.
Assuming the song index is S, from Figure 1(c) an example hash can be written as [f1, f2, t2 - t1, t1, S].
In this work, the target zone for each anchor point is chosen as the 5 points that follow after skipping 2 points, as shown in Figure 1(c).
Here, the trade-off for the increased robustness is increased storage space, approximately 5 times the original representation of the song without any target zone.
For the test music clip, a similar list of hashes (a hash table) is computed.
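The anchor/target-zone pairing can be sketched with a hypothetical helper; `fan_out` and `skip` default to the report's 5-point target zone that skips the 2 notes immediately after the anchor.

```python
def generate_hashes(peaks, song_index, fan_out=5, skip=2):
    """Pair each anchor note with `fan_out` later notes, skipping the
    `skip` notes immediately after the anchor (the report's target zone:
    5 points, starting 3 points past the anchor).

    Each hash is (f_anchor, f_target, time_difference, t_anchor, song_index).
    """
    peaks = sorted(peaks)                 # sort by (time, frequency)
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 + skip : i + 1 + skip + fan_out]:
            hashes.append((f1, f2, t2 - t1, t1, song_index))
    return hashes
```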
1.2. Searching
After obtaining the hash tables for the audio database and for the test music clip, an efficient searching algorithm is required to retrieve the original song.
A good searching algorithm should be able to quickly return the correct song.
I followed the Shazam paper [4] for implementing the searching algorithm.
In the first step, given a test hash (array), I searched for that hash in one song, and if a match was found I noted down the difference of the time indices. This is shown in Figures 2(a) and 3(a) as small blue circles.
The previous step is iterated over all hashes of the test music clip, and the matches are noted down.
Then a histogram of the noted time differences is computed. If the clip is from that song, the histogram will have a peak at a point which denotes the location of the clip in that song.
The same procedure is repeated for all songs in the database, and the song with the largest peak among all histograms is chosen as the system decision.
A peak in the histogram corresponds to more matching hashes, so the above method intuitively makes sense.
It can be seen in Figure 2(a) that a diagonal is present at the marked area, which denotes that the region is a potential candidate for the test music clip.
This visible diagonal region can be picked out by plotting the histogram of time differences, shown in Figure 2(b).
Figure 3 shows the plots for a song which has no match.
It can be observed from Figures 2(b) and 3(b) that a clear peak stands out in the histogram only if the song is a match.
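The search described above can be sketched as follows, assuming hashes in the [f_anchor, f_target, time difference, time, song index] format; the inverted-index layout is an implementation assumption, not necessarily the report's.

```python
from collections import Counter, defaultdict

def build_index(database_hashes):
    """Invert the database: key on the time-invariant part of the hash."""
    index = defaultdict(list)
    for f1, f2, dt, t_ref, song in database_hashes:
        index[(f1, f2, dt)].append((t_ref, song))
    return index

def search(index, query_hashes):
    """Histogram reference-minus-query time offsets per song; the song
    whose histogram has the tallest peak wins (assumes >= 1 match)."""
    offsets = defaultdict(Counter)
    for f1, f2, dt, t_query, _ in query_hashes:
        for t_ref, song in index.get((f1, f2, dt), []):
            offsets[song][t_ref - t_query] += 1
    best = max(offsets, key=lambda s: offsets[s].most_common(1)[0][1])
    return best, offsets[best].most_common(1)[0]
```

A true match produces many identical offsets (a histogram peak at the clip's location in the song), while a non-matching song scatters its few accidental matches across offsets.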
[Figure 1 panels, images omitted: (a) Spectrogram (frequency bin index vs. frame index), (b) Constellation Map, (c) close-up of an anchor point (t1, f1) and its target zone containing a point (t2, f2)]
Figure 1: (a) Spectrogram of a query music clip, (b) Constellation Map generated from the spectrogram, (c) Details of
hash generation
[Figure 2 panels, images omitted: (a) scatterplot of matching hash locations (query frame index vs. reference frame index), (b) histogram of matching time differences]
Figure 2: Illustration of the accuracy of the searching algorithm for a matching song: (a) scatterplot of matching hash locations, (b) histogram of matching time differences
[Figure 3 panels, images omitted: (a) scatterplot of matching hash locations, (b) histogram of matching time differences]
Figure 3: Illustration of the accuracy of the searching algorithm for a non-matching song: (a) scatterplot of matching hash locations, (b) histogram of matching time differences
1.3. Quick Overview of the method
The song database is represented as a quick look-up table where each entry corresponds to a pair of frequencies in the corresponding song, as shown in Figure 1.
Each entry is called a hash and its format is
[frequency of anchor, frequency of point in target zone, time difference of the two frequencies, absolute time of anchor point, song index]
In this work, each anchor point has 5 points in its target zone, which starts 3 points away from the anchor point, as shown in Figure 1(c).
Each hash entry of the test music clip is searched for in the music database, and matching time differences are noted, as shown in Figures 2(a) and 3(a).
Then their histogram is computed, and the song with the highest number of matches is chosen as the identified song.
1.4. Experiments and Discussion
For all the experiments, I have randomly cut 3.12-second music clips, and results are reported over 300 such clips.
The critical parameters in this algorithm are the frame size, the frame shift, and the coefficient C used to choose selective frequencies. I have experimented with these parameters extensively.
I converged to the optimal parameter set by experimenting with each parameter while keeping the other parameters constant.
In this work, frame shift and frame size are noted in terms of samples.
I set FrameSize=400, FrameShift=160 and C=2 and experimented on clean signals; the performance is 100%, but the system is time-consuming, which makes it unusable in real time, and it is also not memory efficient.
To reduce the fingerprint memory size, I set FrameShift=400; the performance drops to 93.2%, a considerable loss, but the algorithm is sped up 2.5 times.
As human ears respond well to changes in frequencies, good frequency resolution should give a good representation.
Based on this observation, I set FrameSize=1024, FrameShift=1000 samples and C=1.2 and evaluated the system. The results of this experiment are shown in Table 1 for different types of time-domain windows.
It can be observed that the performance is better than in the case of FrameSize=400, FrameShift=160 samples and C=2; one more advantage is that the number of frames per second is reduced by almost 6 times, giving less memory use and better speed.
Small values of C result in a high number of hashes.
As the frequency scale is divided into octave bands while choosing peak frequencies, the introduction of higher frequencies does not affect system performance, as shown in Table 1.
In all the above cases, where the frame overlap is small, the performance is not 100%, from which we can conclude that frame overlap has a considerable effect on the system.
As most of the information is in the low frequency bands, and due to the sensitivity of human ears to changes in low frequencies, it is a good idea to have good frequency resolution and to consider only the low frequency bands for representing the music signal.
With the support of this idea, I set FrameSize=5000, FrameShift=1000 and C=1.2. Now the system is remarkably robust to noise and very fast. It is also memory efficient. Results are shown in Figure 4.
Table 1: Results of song identification system setting FrameSize=1024, FrameShift=1000 samples and C=1.2
SNR (dB) -10 -5 0 5 10 15
Hamming Window 2% 13% 42.5% 73.5% 96.5% 95.5%
Rectangular Window 1% 15% 49.5% 84.5% 94.5% 98%
[Figure 4, image omitted: recognition rate (%) under noisy conditions vs. signal-to-noise ratio (dB), comparing "considering only low freq" against "considering all freq"]
Figure 4: Illustration of the robustness of the song identification system and the effect of considering only a few frequency bands of the magnitude response. Parameter values used are FrameSize=5000, FrameShift=1000 samples and C=1.2
It can be observed that considering high frequencies has a detrimental effect on the system in noisy conditions, which can be because of two reasons: (1) domination of noise in the high frequency bands, (2) the human tendency to perceive changes mainly in the low frequency bands.
This reasoning is supported by the fact that the system performs marginally better with all frequencies considered only at high SNR, as can be seen at SNR=10dB.
So the best parameter set found is FrameSize=5000, FrameShift=1000 and C=1.2.
To further speed up the system, FrameShift was set to 2000, but the performance went down significantly at lower SNRs.
1.5. Quick overview of observations in the experiments
Frame size and frame shift have a huge effect on the performance of the final system.
Discarding high frequency bands helps in noisy conditions without any effect on system performance in clean conditions.
Frame overlap is required for good performance at low SNRs.
The reported results are better than the Shazam paper [4] at every SNR in every aspect: memory requirement, speed, and accuracy.
As that paper reports a 25ms frame length and a 10ms frame shift, its complexity is considerably higher than this system's, where a 312.5ms (5000-sample) frame length and a 62.5ms (1000-sample) frame shift are used.
The noise robustness of this system can be attributed to the hash table generation, where each anchor point is paired with 5 other peak frequencies (notes).
1.6. Future Work
Dimensionality reduction techniques like PCA can be applied to obtain a compact yet informative representation.
Modelling techniques can be employed for generating fingerprints.
2. Genre classification
Genre classification is an important application, mainly for managing large amounts of music data.
It can be used as a front end for song identification systems: first classifying a test music clip into one of the genres will save a lot of searching time.
Each genre has its own special characteristics, different from other genres, which are useful for classification.
MFCC features are widely explored and used in the speech community, and the question is whether these features can be applied to music signals.
This question is addressed in [2], which confirms that the use of the mel scale for frequency warping is not harmful for representing music signals.
Also, in [2], the validity of the DCT for decorrelating filter-bank energies is confirmed.
So, in this work, I decided to use 39-dimensional MFCC features for representing songs, as we know that the spectral components contain a lot of information.
Other music-specific features, such as timbre, spectral roll-off, etc., can be appended to the MFCCs for experiments.
However, it has been shown that appending music-specific features to MFCCs yields little improvement, and can in fact be worse than MFCC features alone [1].
In this work, I employed a Gaussian Mixture Model (GMM) to obtain the statistics of the training data.
2.1. Gaussian Mixture Models
Mixture models capture the underlying statistical properties of data
In particular, a GMM models the probability distribution of the data as a linear weighted combination of Gaussian densities. That is, given a data set $X = \{x_1, x_2, \ldots, x_n\}$, the probability of the data $X$ being drawn from the GMM is

$$p(X) = \sum_{i=1}^{M} w_i \, \mathcal{N}(X \mid \mu_i, \Sigma_i) \qquad (1)$$

where $\mathcal{N}(\cdot)$ is the Gaussian density, $M$ is the number of mixtures, $w_i$ is the weight of the $i$-th Gaussian component, $\mu_i$ is its mean vector, and $\Sigma_i$ is its covariance matrix.
The parameters of the GMM, $\lambda = \{w_i, \mu_i, \Sigma_i\}$ for $i = 1, 2, \ldots, M$, can be estimated using the Expectation Maximization (EM) algorithm [3].
Fig. 5 illustrates the joint-density capturing capabilities of a GMM, using 2-dimensional data uniformly distributed along a circular ring.
The red ellipses, superimposed on the (blue) data points, correspond to the locations and shapes of the estimated Gaussian mixtures. The second column shows a 3-dimensional plot of the captured density, with the third dimension being the probability of the corresponding data point.
In the case of a 4-mixture GMM with diagonal covariance matrices, the density is poorly estimated at odd multiples of 45 degrees, as shown in Fig. 5(a).
As the number of mixtures increases, the density is captured better, as shown in Fig. 5(b).
Since diagonal matrices cannot capture correlations between dimensions, the curvature of the circular ring is not captured well.
In the case of diagonal covariance matrices, the ellipses are aligned with the x-y axes, as shown in Fig. 5(a) and Fig. 5(b).
The density estimation can be improved using full covariance matrices, as shown in Fig. 5(c).
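The ring experiment of Fig. 5 can be reproduced in sketch form with scikit-learn's `GaussianMixture` (an assumption: the report does not state which implementation was used):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Data uniformly distributed along a circular ring, as in Fig. 5.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 2000)
radius = 1.0 + 0.05 * rng.standard_normal(2000)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

# 10-mixture fits with diagonal vs. full covariance matrices.
diag = GaussianMixture(n_components=10, covariance_type='diag',
                       random_state=0).fit(X)
full = GaussianMixture(n_components=10, covariance_type='full',
                       random_state=0).fit(X)

# Full covariances can tilt along the ring, so the average training
# log-likelihood should come out at least as high as the diagonal fit's.
print('diag:', diag.score(X), 'full:', full.score(X))
```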
[Figure 5 panels (a), (b), (c): images omitted]
Figure 5: Illustration of distribution capturing capability of GMM. GMM trained with diagonal covariance matrices (a)
4-mixtures (b) 10-mixtures and (c) 10-mixture GMM trained with full covariance matrices
However, this improvement comes at the expense of an increased number of parameters and more computation. We need to estimate M(2D + 1) parameters for an M-mixture GMM with diagonal covariances, where D is the dimension of the data.
For a GMM with full covariance matrices, we need to estimate M(0.5D^2 + 1.5D + 1) parameters, which in turn requires a large amount of data.
It has also been shown that, for a sufficiently high number of mixtures, diagonal covariance matrices can capture the correlations in the data, so in practice GMMs with full covariance matrices are not commonly used.
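The parameter counts above can be checked with a small helper (a hypothetical function, written here just to verify the formulas):

```python
def gmm_param_count(M, D, full_cov=False):
    """Free parameters of an M-mixture, D-dimensional GMM: M weights,
    plus per mixture D means and either D variances (diagonal) or
    D(D + 1)/2 covariance entries (full)."""
    cov = D * (D + 1) // 2 if full_cov else D
    return M * (D + cov + 1)

# The report's cases: M(2D + 1) diagonal, M(0.5 D^2 + 1.5 D + 1) full.
print(gmm_param_count(16, 39))            # 1264
print(gmm_param_count(16, 39, True))      # 13120
```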
2.2. System Development
I have employed a Gaussian Mixture Model for obtaining the statistics of each song.
I have followed [1] (an IEEE Signal Processing Letters paper) for implementing genre classification. Inter-genre similarities are explored in that paper, hence the method's name: Inter-Genre Similarity modelling (IGS).
From each genre, the first 15 songs are chosen for training and the next 15 songs are used for testing.
I have trained one GMM for each genre. The number of mixtures (say M) is experimented with.
So, for the 5 classes in the song database, 5 M-mixture GMMs are trained.
For all the frames in the training data, the negative log-likelihood is calculated with respect to the 5 GMMs.
Misclassified frames are separated from the training data, and the 5 GMMs are updated using only the correctly classified frames.
The updated means and variances of the GMMs are more stable and represent the corresponding genre well, as we have not considered the more confusable frames. The more confusable frames have spectral characteristics that are common across genres.
A GMM is then built from all the misclassified frames, so in total we have 6 GMMs from the training data, which will be used for testing.
The GMM built upon the misclassified frames is termed the Inter-Genre Similarity (IGS) GMM because it captures similarities of spectral components across genres (IGS-GMM).
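One filtering iteration of the IGS training described above might look as follows, using scikit-learn's `GaussianMixture` as a stand-in for the report's own implementation (an assumption); log-likelihoods are compared directly, which gives the same decision as comparing negative log-likelihoods.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_igs(frames_by_genre, n_mix=8, seed=0):
    """One filtering iteration of IGS training.

    frames_by_genre: one (n_frames, dim) feature array per genre.
    Returns the updated per-genre GMMs and the IGS-GMM.
    """
    fit = lambda F: GaussianMixture(n_components=n_mix,
                                    covariance_type='diag',
                                    random_state=seed).fit(F)
    gmms = [fit(F) for F in frames_by_genre]
    kept, confusable = [], []
    for g, F in enumerate(frames_by_genre):
        # Log-likelihood of every frame under every genre GMM.
        ll = np.column_stack([m.score_samples(F) for m in gmms])
        correct = ll.argmax(axis=1) == g
        kept.append(F[correct])         # frames matching their own genre
        confusable.append(F[~correct])  # misclassified frames, all genres
    # Update genre GMMs on correctly classified frames only, and build
    # the IGS-GMM from the pooled misclassified frames.
    return [fit(F) for F in kept], fit(np.vstack(confusable))
```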
2.3. Classification of Genres
For the test music clip, extract 39-dimensional MFCCs and find the negative log-likelihood of each frame with respect to the 6 GMMs.
Discard the frames which belong to the IGS-GMM; consider only the frames which belong to the GMMs corresponding to the genres, and form a new set of frames.
A frame is said to belong to a particular GMM if its likelihood is higher for that GMM than for the other GMMs.
Now compute the average log-likelihood of the new set of frames with respect to the GMMs corresponding to the genres.
The genre whose GMM has the maximum average log-likelihood is chosen as the system decision.
The average log-likelihood can be computed as follows:

$$n^{*} = \arg\max_{n} \frac{1}{\sum_{k} w_{kn}} \sum_{k=1}^{K} w_{kn} \log P(f_k \mid \lambda_n) \qquad (2)$$

where $w_{kn}$ is 0 if $f_k$ belongs to the IGS-GMM and 1 otherwise, $\lambda_n$ denotes the $n$-th genre GMM, and it is assumed that the input music clip consists of $K$ frames, namely $f_1, f_2, \ldots, f_K$.
$w_{kn}$ lets us choose only the so-called new set of frames.
This procedure can be repeated several times, filtering misclassified frames in each iteration and updating the genre GMMs.
I did only one iteration because of the small data set.
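The decision rule of Eq. (2) can be sketched as follows; the frame-discarding test (a frame "belongs to" the IGS-GMM when its likelihood is highest under that model) follows the definition above. This is an illustrative sketch, not the report's code.

```python
import numpy as np

def classify(genre_ll, igs_ll):
    """Decision rule of Eq. (2): drop frames claimed by the IGS-GMM,
    then pick the genre with the highest average log-likelihood.

    genre_ll: (K, N) per-frame log-likelihoods under the N genre GMMs
    igs_ll:   (K,)   per-frame log-likelihoods under the IGS-GMM
    """
    keep = genre_ll.max(axis=1) > igs_ll   # w_k = 1 for kept frames
    if not keep.any():                     # degenerate clip: keep everything
        keep = np.ones_like(igs_ll, dtype=bool)
    return int(genre_ll[keep].mean(axis=0).argmax())
```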
2.4. Experiments and Discussion
Music signal frames are obtained with a 25ms window and a 10ms shift.
I have extracted 39-dimensional MFCCs (13 MFCCs + 13 delta + 13 delta-delta coefficients) with 26 triangular filters linearly spaced along the mel-frequency axis.
I have experimented with different numbers of mixtures (4, 8 and 16), and the corresponding results are shown in Table 2.
These results cannot be compared directly with the results published in [1], as those were obtained with a different database, but the accuracies are very close, which suggests that the implemented algorithm is correct.
Table 2: Results of Genre classification system using only GMMs and IGS-GMMs
4-mix 8-mix 16-mix
Only GMM modelling 41.4% 46.6% 45.8%
IGS modelling 45.2% 49.6% 46.6%
I have observed that 8-mixture GMMs perform best, as shown in Table 2.
I believe the reason why 16-mixture GMMs are not performing best is data insufficiency.
A 1-mixture GMM estimates 78 parameters (39 means and 39 variances for 39-dimensional MFCCs).
A 16-mixture GMM estimates 78 x 16 + 16 (weights) = 1264 parameters, while the training data available for each genre is only 150 seconds, out of which many frames are filtered out as misclassified; this can lead to over-fitting, i.e., memorization of data points.
The 4-mixture GMM is not performing well, possibly because the number of mixtures is too low, forcing each mixture to represent a wide spread of the genre's characteristics.
It can be observed that IGS modelling has improved performance by 3% in the 8-mixture case.
In every case, IGS modelling performs better than GMM-only modelling.
It is noted that more than half of the training data frames are misclassified with GMM-only modelling.
So the IGS-GMM is trained on half of the training data, taken across genres.
Since only 150 seconds (15000 frames) of training data is available for each class, after GMM modelling only about 7000 frames are classified correctly, and each GMM is updated with these 7000 frames.
It follows that it is inappropriate to use more mixtures for training the GMMs, as only 7000 frames are available, and more iterations of IGS modelling also cannot be employed.
For these experiments, I have randomly cut 500 clips from the testing database. Each clip is 50000 samples, i.e., 3.125 seconds.
Table 3 shows the confusion matrix of the genres and their accuracies.
The leftmost column shows the original genre, and the top row shows the genres with which the original genre got confused.
As can be seen, all the diagonal entries are the highest in their rows, i.e., no genre got confused with any other genre more than with itself.
Also, blues got misclassified most often as jazz, and rock was confused most often with blues.
Electronic and jazz got misclassified with one another.
The electronic genre has the least accuracy.
Rock and hip-hop seem to have no common characteristics, as they are not confused with each other.
It can be observed that hip-hop has the highest accuracy.
2.5. Future Work
This method can be improved by iterative filtering of misclassified frames if enough data is provided, which would also enable more experiments with the number of mixtures.
More experiments need to be done to find the effect of frame size and frame shift, as it was observed in the song identification task that they have a huge effect.
Table 3: Confusion matrix of genres computed with IGS modelling and 8-mixture GMMs
           Blues    Electronic  Jazz     Rock     Hip-hop  Accuracy
Blues      44/104   10/104      36/104   3/104    11/104   42.30%
Electronic 13/89    31/89       21/89    8/89     16/89    34.83%
Jazz       3/90     34/90       47/90    3/90     0/90     52.22%
Rock       32/115   11/115      25/115   47/115   0/115    40.87%
Hip-hop    9/102    3/102       16/102   0/102    74/102   72.54%
3. References
[1] Bağcı, Ulaş, and Engin Erzin. "Automatic classification of musical genres using inter-genre similarity." IEEE Signal Processing Letters 14.8 (2007): 521-524.
[2] Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modeling." ISMIR. 2000.
[3] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2012.
[4] Wang, Avery. "An Industrial-Strength Audio Search Algorithm." ISMIR 2003, pp. 7-13.