
    Audio Signal Processing

    Project-3

    Raghavendra Reddy Pappagari

    Johns Hopkins University

    [email protected]

    1. Song Fingerprinting and Searching

Definition: Fingerprinting refers to the efficient summarization of a whole song based on its content.

A few applications of fingerprinting are:

Audio monitoring without having to use metadata

Instant broadcasting on user demand

Gathering statistics on how many times a song has been aired on radio channels

Efficient management of the ever-increasing music content on the Internet

Retrieving the metadata of a song from just a small clip of that song

A good fingerprint is small in memory and can be searched quickly given a small song clip.

A fingerprint is robust if noisy clips can still be matched against it successfully.

    Challenges and trade-offs

For the fingerprint to be robust to noise, it should retain as much of the acoustic and perceptual information of the song as possible, but this increases complexity and hence searching time.

If the fingerprint is small, discriminating between songs may become difficult, i.e., false alarms increase.

The granularity of the system should be good, i.e., only a few seconds of an audio clip should be required to identify the original song.

    1.1. Generation of fingerprint

First, the input music signal is windowed and its spectrogram is computed.

The frame length and frame shift in the time domain are experimented with, since they have a large impact on the complexity of the system (fingerprint memory and searching time) as well as on its robustness to noise.

For each frame, a few notes (spectral peaks) are chosen from its magnitude response to represent the frame. As shown in Figure 1(a) and (b), only prominent frequencies in the spectrogram are chosen to represent the music signal.

The criterion for choosing the peaks is as follows: the frequency axis is divided into octave bands, and from each band the largest frequency components whose magnitude exceeds a threshold are chosen.

The threshold is set as the mean of the peaks of each band times a constant factor C.

This constant factor controls the number of notes used to represent the frame.

In this manner, the whole song is represented as a table of frequencies and their time indices.

The reason for choosing only a few prominent notes is their high probability of surviving under noisy conditions.

  • 7/24/2019 Report Project 3 ASP

    2/13

To increase the robustness of the algorithm, each frequency is paired with a few other points obtained in the previous step.

Since there is a good probability of matching the same note in another clip, using single notes results in a high number of false alarms.

To overcome this problem, each note is paired with another note in the same song, so the probability of a mismatch decreases, as there is less chance of the same pair of notes occurring at the same time difference.

More formally, each note (anchor point) is paired with 5 other notes (the target zone), and their time differences together with the absolute time of the anchor point are stored.

An example of an anchor point and the corresponding target zone is shown in Figure 1(c).

So each point is stored as an array [frequency of anchor, frequency of point in target zone, time difference, absolute time of anchor point]. This way each point is represented by 5 arrays, as 5 notes are chosen for the target zone.

In the literature this representation is called hashing, and the array [frequency of anchor, frequency of point in target zone, time difference, absolute time of anchor point, song index] is called the hash.

Assuming the song index is S, from Figure 1(c) an example hash can be written as [f1, f2, t2-t1, t1, S].

In this work, the target zone for each anchor point is chosen as the 5 notes that follow after skipping the 2 notes immediately after the anchor, as shown in Figure 1(c).

The trade-off for this increased robustness is increased storage space, approximately 5 times that of the original representation of the song without any target zone.

For the test music clip, a similar list of hashes (hash table) is computed.
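The sketch below is my own Python illustration of the fingerprint generation described above, not code from the report: the sampling rate, the octave-band bin edges and the function names are assumptions, and the threshold uses the band mean as a simplification of the mean-of-peaks rule.

import numpy as np
from scipy.signal import stft

def constellation(x, fs=16000, frame_size=5000, frame_shift=1000, C=1.2,
                  band_edges=(0, 40, 80, 160, 320, 640)):
    # Spectrogram of the windowed signal; band_edges are assumed octave-like bin boundaries
    _, _, Z = stft(x, fs=fs, window='hamming', nperseg=frame_size,
                   noverlap=frame_size - frame_shift)
    mag = np.abs(Z)                                   # bins x frames
    notes = []                                        # list of (frame index, frequency bin)
    for t in range(mag.shape[1]):
        for lo, hi in zip(band_edges[:-1], band_edges[1:]):
            band = mag[lo:hi, t]
            peak = lo + int(np.argmax(band))
            # keep the band maximum only if it exceeds C times the band level (simplified threshold)
            if mag[peak, t] > C * band.mean():
                notes.append((t, peak))
    return notes

def make_hashes(notes, song_index, skip=2, fan_out=5):
    # Pair each anchor with the 5 notes that follow after skipping 2, as in Figure 1(c)
    hashes = []
    for i, (t1, f1) in enumerate(notes):
        for t2, f2 in notes[i + 1 + skip : i + 1 + skip + fan_out]:
            hashes.append((f1, f2, t2 - t1, t1, song_index))
    return hashes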

    1.2. Searching

After obtaining the hash tables for the audio database and for the test music clip, an efficient searching algorithm is required to identify the original song.

A good searching algorithm should be able to return the correct song quickly.

I followed the Shazam paper [4] for implementing the searching algorithm.

In the first step, given a test hash (array), I searched for that hash in one song, and if a match was found I noted down the difference of the time indices. This is shown in Figures 2(a) and 3(a) as small blue circles.

The previous step is iterated over all hashes of the test music clip and the matches are noted down.

Then a histogram of the noted time differences is computed. If the clip is from that song, the histogram will have a peak at a point that denotes the location of the clip in that song.

The same procedure is repeated for all songs in the database, and the song with the largest peak among all histograms is chosen as the system decision.

A peak in the histogram corresponds to more hash matches, so the above method intuitively makes sense.

It can be seen in Figure 2(a) that a diagonal is present in the marked area, which indicates that the region is a potential candidate for the test music clip.

This visible diagonal region can be picked out by plotting the histogram of time differences, which is shown in Figure 2(b).

Figure 3 shows the plots for a song that has no match.

It can be observed from Figures 2(b) and 3(b) that a clear peak stands out in the histogram only if the song is a match.
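As an illustration of this search, here is a minimal Python sketch assuming hashes in the array format of Section 1.1; the function names are illustrative only.

from collections import Counter

def match_song(query_hashes, song_hashes):
    # Index the song's hashes by (f1, f2, dt), then histogram the anchor-time offsets
    index = {}
    for f1, f2, dt, t_ref, _ in song_hashes:
        index.setdefault((f1, f2, dt), []).append(t_ref)
    offsets = Counter()
    for f1, f2, dt, t_query, _ in query_hashes:
        for t_ref in index.get((f1, f2, dt), []):
            offsets[t_ref - t_query] += 1            # one small blue circle in Figures 2(a)/3(a)
    return max(offsets.values()) if offsets else 0   # height of the histogram peak

def identify(query_hashes, database):
    # database: dict mapping song index -> list of hashes; pick the song with the tallest peak
    return max(database, key=lambda song: match_song(query_hashes, database[song]))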


Figure 1: (a) Spectrogram of a query music clip, (b) Constellation Map generated from the spectrogram, (c) Details of hash generation (anchor point at (t1, f1) paired with a target-zone point at (t2, f2)).


Figure 2: Illustration of the accuracy of the searching algorithm for a matching song: (a) scatter plot of matching hash locations (reference frame index vs. query frame index), (b) histogram of matching time differences.


Figure 3: Same plots as Figure 2 for a song with no match: (a) scatter plot of matching hash locations, (b) histogram of matching time differences.


    1.3. Quick Overview of the method

The song database is represented as a quick look-up table in which each entry corresponds to a pair of frequencies in the corresponding song, as shown in Figure 1.

Each entry is called a hash and its format is

[frequency of anchor, frequency of point in target zone, time difference of the two frequencies, absolute time of anchor point, song index]

In this work, each anchor point has 5 points in its target zone, which starts 3 points away from the anchor point, as shown in Figure 1(c).

Each hash entry of the test music clip is searched for in the music database, and the matching time differences are noted as shown in Figures 2(a) and 3(a).

Then their histogram is computed, and the song with the highest number of matches is chosen as the identified song.
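One way to realize such a look-up table, sketched under the assumption that hashes are stored in the format listed above: the frequency pair and time difference form the key, and the anchor time plus song index form the payload, so the whole database can be scored in one pass over the query.

from collections import defaultdict, Counter

def build_lookup(all_hashes):
    # Key the entire database by (f1, f2, dt); payload is (anchor time, song index)
    table = defaultdict(list)
    for f1, f2, dt, t_anchor, song in all_hashes:
        table[(f1, f2, dt)].append((t_anchor, song))
    return table

def search(table, query_hashes):
    # Accumulate (song, time offset) votes; the winning song is the identified song
    votes = Counter()
    for f1, f2, dt, t_query, _ in query_hashes:
        for t_ref, song in table.get((f1, f2, dt), []):
            votes[(song, t_ref - t_query)] += 1
    if not votes:
        return None
    (song, _), _ = votes.most_common(1)[0]
    return song

Because the key already encodes the time difference within the pair, a single dictionary access replaces a scan over all notes of every song.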

    1.4. Experiments and Discussion

For all the experiments, I randomly cut 3.12-second music clips, and results are reported over 300 such clips.

The critical parameters in this algorithm are the frame size, the frame shift and the coefficient C used to choose selective frequencies. I have experimented with these parameters extensively.

I converged to an optimal parameter set by experimenting with each parameter while keeping the other parameters constant.

In this work, frame shift and frame size are given in samples.

I set FrameSize=400, FrameShift=160 and C=2 and experimented on clean signals: the performance is 100%, but the system is time consuming, which makes it unusable in real time, and it is also not memory efficient.

To reduce the fingerprint memory size, I set FrameShift=400; the performance is 93.2%, a considerable loss, but the algorithm is sped up 2.5 times.

Since human ears respond well to changes in frequency, good frequency resolution should give a good representation.

Based on this observation, I set FrameSize=1024, FrameShift=1000 samples and C=1.2 and evaluated the system. The results of this experiment are shown in Table 1 for different types of time-domain windows.

It can be observed that the performance is better than in the case with FrameSize=400, FrameShift=160 samples and C=2; a further advantage is that the number of frames per second is reduced by almost 6 times, so less memory is needed and the speed is better.

Small values of C result in a high number of hashes.

Since the frequency scale is divided into octave bands when choosing peak frequencies, the introduction of higher frequencies does not affect system performance, as confirmed in Table 1.

In all the above cases where the frame overlap is small, the performance is not 100%, from which we can conclude that frame overlap has a considerable effect on the system.

Since most of the information is in the low-frequency bands, and because human ears are sensitive to changes in low frequencies, it is a good idea to have good frequency resolution and to consider only the low-frequency bands for representing the music signal.

Supported by the above idea, I set FrameSize=5000, FrameShift=1000 and C=1.2. The system is now very robust to noise, very fast, and also memory efficient. Results are shown in Figure 4.


    Table 1: Results of song identification system setting FrameSize=1024, FrameShift=1000 samples and C=1.2

    SNR (dB) -10 -5 0 5 10 15

    Hamming Window 2% 13% 42.5% 73.5% 96.5% 95.5%

    Rectangular Window 1% 15% 49.5% 84.5% 94.5% 98%

Figure 4: Illustration of the robustness of the song identification system and the effect of considering only a few frequency bands of the magnitude response (recognition rate vs. signal-to-noise ratio in dB, with one curve considering only low frequencies and one considering all frequencies). Parameter values used are FrameSize=5000, FrameShift=1000 samples and C=1.2.

It can be observed that considering high frequencies has a detrimental effect on the system in noisy conditions, which can be attributed to two reasons: (1) noise dominates the high-frequency bands, and (2) humans mainly perceive changes in the low-frequency bands.

The above reasoning is supported by the fact that the system performs only marginally better when all frequencies are considered, which can be seen at SNR=10 dB.

So the best parameter set found is FrameSize=5000, FrameShift=1000 and C=1.2.

To further speed up the system, FrameShift was set to 2000, but the performance dropped significantly at lower SNRs.
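The noise-robustness curves above require corrupting clips at controlled SNRs. Below is a minimal sketch of adding noise at a target SNR; the use of white Gaussian noise is my assumption, since the report does not state the noise type.

import numpy as np

def add_noise(clip, snr_db, rng=None):
    # Scale white Gaussian noise so that 10*log10(P_signal / P_noise) equals snr_db
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(len(clip))
    p_signal = np.mean(np.asarray(clip, dtype=float) ** 2)
    p_noise = np.mean(noise ** 2)
    noise *= np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10.0)))
    return clip + noise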

    1.5. Quick overview of observations in the experiments

Frame size and frame shift have a huge effect on the performance of the final system.

Discarding the high-frequency bands helps in noisy conditions without any effect on system performance in clean conditions.

Frame overlap is required for good performance at low SNRs.

The reported results are better than those of the Shazam paper [4] at every SNR and in every aspect: memory requirement, speed and accuracy.

Since they report a 25 ms frame length and a 10 ms frame shift, their complexity is considerably higher than that of this system, where a 312.5 ms (5000 samples) frame length and a 62.5 ms (1000 samples) frame shift are used.


The noise robustness of this system can be attributed to the hash table generation, where each anchor point is paired with 5 other peak frequencies (notes).

    1.6. Future Work

Dimensionality reduction techniques like PCA can be applied to obtain a compact and good representation.

    Modelling techniques can be employed for generating fingerprints


    2. Genre classification

Genre classification is an important application, mainly for managing large amounts of music data.

It can be used as a front end for song identification systems: first classifying the test music clip into one of the genres will save a lot of searching time.

Each genre has its own special characteristics, different from other genres, which are useful for classification.

MFCC features are widely explored and used in the speech community, and the question is whether these features can be applied to music signals.

This question is addressed in [2], which confirms that the use of the mel scale for frequency warping is not harmful for representing music signals.

Also, in [2], the validity of the DCT for decorrelating filter-bank energies is confirmed.

So, in this work, I decided to use 39-dimensional MFCC features for representing songs, since the spectral components contain a lot of information.

Other music-specific features such as timbre and spectral roll-off can be appended to the MFCCs for experiments.

It has also been shown that appending music-specific features to MFCCs gives little improvement, and in fact performs worse than MFCC features alone [1].
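As an illustration, here is a minimal sketch of extracting such 39-dimensional vectors, assuming librosa is available and assuming the standard convention of 13 static coefficients plus their first and second derivatives; the frame settings (25 ms window, 10 ms shift, 26 mel filters) are the ones reported in Section 2.4.

import numpy as np
import librosa

def extract_mfcc39(y, sr=16000):
    # 13 MFCCs with a 25 ms window, 10 ms shift and 26 mel filters, plus deltas and delta-deltas
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
                                n_mels=26)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T     # shape: (frames, 39)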

In this work, I employed Gaussian Mixture Models (GMMs) to obtain the statistics of the training data.

    2.1. Gaussian Mixture Models

    Mixture models capture the underlying statistical properties of data

    In particular, GMM models the probability distribution of the data as a linear weighted combination of Gaussian

densities. That is, given a data set $X = \{x_1, x_2, \ldots, x_n\}$, the probability of the data $X$ drawn from the GMM is

$$p(X) = \sum_{i=1}^{M} w_i \, \mathcal{N}(X \mid \mu_i, \Sigma_i) \qquad (1)$$

where $\mathcal{N}(\cdot)$ is the Gaussian density, $M$ is the number of mixtures, $w_i$ is the weight of the $i$-th Gaussian component, $\mu_i$ is its mean vector and $\Sigma_i$ is its covariance matrix.

The parameters of the GMM, $\lambda = \{w_i, \mu_i, \Sigma_i\}$ for $i = 1, 2, \ldots, M$, can be estimated using the Expectation-Maximization (EM) algorithm [3].

    Fig. 5 illustrates the joint density capturing capabilities of GMM, using 2-dimensional data uniformly distributed

    along a circular ring.

The red ellipses, superimposed on the data (blue) points, correspond to the locations and shapes of the estimated Gaussian mixtures. The second column shows a 3-dimensional plot of the captured density, with the third dimension being the probability of the corresponding data point.

In the case of the 4-mixture GMM with diagonal covariance matrices, the density is poorly estimated at odd multiples of 45°, as shown in Fig. 5(a).

    As the number of mixtures increases, the density is better captured as shown in Fig. 5(b).

    Since the diagonal matrices cannot capture correlations between dimensions, the curvature of the circular ring is

    not captured well.

    In the case of diagonal covariance matrices, the ellipses are aligned with the xy-axes as shown in Fig. 5(a) and

    Fig. 5(b).

    The density estimation can be improved using full covariance matrices, as shown in Fig. 5(c).

  • 7/24/2019 Report Project 3 ASP

    10/13

Figure 5: Illustration of the distribution-capturing capability of GMMs. GMMs trained with diagonal covariance matrices: (a) 4 mixtures, (b) 10 mixtures; and (c) a 10-mixture GMM trained with full covariance matrices.

However, this improvement comes at the expense of an increased number of parameters and more computation. We need to estimate $M(2D + 1)$ parameters for an M-mixture GMM with diagonal covariances, where $D$ is the dimension of the data.

For a GMM with full covariance matrices, we need to estimate $M(0.5D^2 + 1.5D + 1)$ parameters, which in turn requires a large amount of data.

It has also been shown that, for a sufficiently high number of mixtures, diagonal covariance matrices can capture the correlations in the data, so in practice GMMs with full covariance matrices are not commonly used.
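To illustrate this trade-off, the small sketch below fits GMMs with diagonal versus full covariances to ring-shaped data similar to Figure 5, using scikit-learn's EM implementation; the toolkit choice and all parameter values are mine, not the report's.

import numpy as np
from sklearn.mixture import GaussianMixture

# 2-D points spread around a circular ring, as in Figure 5
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 2000)
radius = 5 + 0.3 * rng.standard_normal(2000)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

for n_mix, cov in [(4, 'diag'), (10, 'diag'), (10, 'full')]:
    gmm = GaussianMixture(n_components=n_mix, covariance_type=cov, random_state=0).fit(X)
    # average log-likelihood per point; full covariances follow the ring's curvature better
    print(n_mix, cov, round(gmm.score(X), 3))

The diagonal models need many more components to trace the ring, which mirrors the parameter-count discussion above.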

    2.2. System Development

I employed Gaussian Mixture Models for obtaining the statistics of the songs.

I followed [1] (an IEEE Signal Processing Letters paper) for implementing genre classification. Inter-genre similarities are explored in that paper, hence the method name: Inter-Genre Similarity modelling (IGS).


From each genre, the first 15 songs are chosen for training and the next 15 songs are used for testing.

I trained one GMM for each genre. The number of mixtures (say M) is experimented with.

So, for the 5 classes in the song database, 5 M-mixture GMMs are trained.

For all the frames in the training data, the negative log-likelihood is calculated with respect to the 5 GMMs.

Misclassified frames are separated from the training data, and the 5 GMMs are updated using only the correctly classified frames.

The updated means and variances of the GMMs are now more stable and represent the corresponding genre well, since the more confusable frames have not been considered. Confusable frames have spectral characteristics that are common across genres.

A GMM is then built from all the misclassified frames, so in total we have 6 GMMs from the training data, which will be used for testing.

The GMM built on the misclassified frames is termed the Inter-Genre Similarity GMM (IGS-GMM) because it captures the similarities of spectral components across genres.
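Below is a minimal sketch of this training procedure, assuming per-genre MFCC frame matrices are already available; the function and variable names are illustrative, and scikit-learn's diagonal-covariance GMMs stand in for the EM training.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_igs(frames_by_genre, n_mix=8):
    # frames_by_genre: dict genre -> (num frames, 39) MFCC array
    genres = list(frames_by_genre)
    gmms = {g: GaussianMixture(n_components=n_mix, covariance_type='diag').fit(f)
            for g, f in frames_by_genre.items()}
    kept, confused = {}, []
    for g, f in frames_by_genre.items():
        # classify every training frame of genre g against all genre GMMs
        ll = np.stack([gmms[h].score_samples(f) for h in genres])      # genres x frames
        correct = np.array(genres)[ll.argmax(axis=0)] == g
        kept[g] = f[correct]
        confused.append(f[~correct])
    # update each genre GMM on its correctly classified frames, and pool the rest into the IGS-GMM
    gmms = {g: GaussianMixture(n_components=n_mix, covariance_type='diag').fit(kept[g])
            for g in genres}
    igs = GaussianMixture(n_components=n_mix, covariance_type='diag').fit(np.vstack(confused))
    return gmms, igs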

    2.3. Classification of Genres

For the test music clip, 39-dimensional MFCCs are extracted and the negative log-likelihood of each frame is computed with respect to the 6 GMMs.

The frames that belong to the IGS-GMM are discarded; only the frames that belong to the genre GMMs are considered, forming a new set of frames.

A frame is said to belong to a particular GMM if its likelihood is higher for that GMM than for the other GMMs.

The average likelihood of the new set of frames is then computed with respect to the genre GMMs.

The genre whose GMM has the maximum average likelihood is chosen as the system decision, denoted $\hat{n}$ below.

The average likelihood decision can be computed as follows:

$$\hat{n} = \arg\max_{n} \frac{1}{\sum_{k} w_{kn}} \sum_{k=1}^{K} w_{kn} \log P(f_k \mid \lambda_n) \qquad (2)$$

where $w_{kn}$ is 0 if $f_k$ belongs to the IGS-GMM and 1 otherwise, and it is assumed that the input music clip consists of $K$ frames, namely $f_1, f_2, \ldots, f_K$.

The weights $w_{kn}$ let us choose only the so-called new set of frames.

This procedure can be repeated several times by filtering out misclassified frames in each iteration and updating the genre GMMs.

I performed only one iteration because of the small data set.
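A minimal sketch of this decision rule (Eq. 2), reusing the GMMs from the previous sketch; the names are illustrative.

import numpy as np

def classify(frames, gmms, igs):
    # Drop frames won by the IGS-GMM (w_kn = 0), then pick the genre with the
    # highest average log-likelihood over the remaining frames, as in Eq. (2)
    genres = list(gmms)
    genre_ll = np.stack([gmms[g].score_samples(frames) for g in genres])   # genres x K
    igs_ll = igs.score_samples(frames)                                     # K
    keep = genre_ll.max(axis=0) >= igs_ll
    if not keep.any():               # degenerate case: every frame looks like inter-genre
        keep[:] = True
    avg_ll = genre_ll[:, keep].mean(axis=1)
    return genres[int(avg_ll.argmax())]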

    2.4. Experiments and Discussion

Music signal frames are obtained with a 25 ms window and a 10 ms shift.

I extracted 39-dimensional MFCCs (13 MFCCs + 13 deltas + 13 delta-deltas) with 26 triangular filters linearly spaced along the mel-frequency axis.

I experimented with different numbers of mixtures (4, 8 and 16); the corresponding results are shown in Table 2.

These results cannot be directly compared with the results published in [1], since they were obtained on a different database, but the accuracies are very close, which suggests that the implemented algorithm is correct.


    Table 2: Results of Genre classification system using only GMMs and IGS-GMMs

    4-mix 8-mix 16-mix

    Only GMM modelling 41.4% 46.6% 45.8%

    IGS modelling 45.2% 49.6% 46.6%

I observed that 8-mixture GMMs perform best, as shown in Table 2.

I believe the reason 16-mixture GMMs do not perform best is data insufficiency.

A 1-mixture GMM estimates 78 parameters (39 means and 39 variances for 39-dimensional MFCCs).

A 16-mixture GMM estimates 78×16 + 16 (weights) = 1264 parameters, while the training data available for each genre is only 150 seconds, out of which many frames are filtered out as misclassified; this can lead to over-fitting, i.e., memorization of data points.

The 4-mixture GMM does not perform well, possibly because the number of mixtures is too low, which forces each mixture to represent too broad a spread of the genre's characteristics.

It can be observed that IGS modelling improved performance by 3% in the 8-mixture case.

In every case, IGS modelling performs better than GMM-only modelling.

It is noted that more than half of the training data frames are misclassified with GMM-only modelling.

So the IGS-GMM is trained on about half of the training data, taken across genres.

Since only 150 seconds (15000 frames) of training data is available for each class, after GMM modelling only about 7000 frames are classified correctly, and each GMM is updated with these 7000 frames.

It follows that it is inappropriate to use more mixtures for training the GMMs, since only about 7000 frames are available, and more iterations of IGS modelling cannot be employed either.

For these experiments, I randomly cut 500 clips from the testing database. Each clip is 50000 samples, i.e., 3.125 seconds.

Table 3 shows the confusion matrix of the genres and their accuracies.

The leftmost column shows the original genre and the top row shows the genre with which the original genre was confused.

As can be seen, all diagonal entries are the highest in their rows, i.e., no genre is confused with any other genre more than with itself.

Also, blues is misclassified most often as jazz, and rock is confused most often with blues.

Electronic and jazz are frequently misclassified as one another.

The electronic genre has the lowest accuracy.

Rock and hip-hop seem to have no common characteristics, as they are never confused with each other.

It can be observed that hip-hop has the highest accuracy.

    2.5. Future Work

This method could be improved by iteratively filtering misclassified frames if enough data were available, which would also enable more experiments with the number of mixtures.

More experiments are needed to find the effect of frame size and frame shift, since in the case of song identification they were observed to have a huge effect.


    Table 3: Confusion matrix of genres computed with IGS-modelling and 8-mixture GMMs

    Blues Electronic Jazz Rock Hip-hop Accuracy

Blues 44/104 10/104 36/104 3/104 11/104 42.30%

    Electronic 13/89 31/89 21/89 8/89 16/89 34.83%

    Jazz 3/90 34/90 47/90 3/90 0/90 52.22%

    Rock 32/115 11/115 25/115 47/115 0/115 40.87%

    Hip-hop 9/102 3/102 16/102 0/102 74/102 72.54%

3. References

[1] Bağcı, Ulaş, and Engin Erzin. "Automatic classification of musical genres using inter-genre similarity." IEEE Signal Processing Letters 14.8 (2007): 521-524.

[2] Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modeling." ISMIR, 2000.

[3] Duda, R. O., Hart, P. E., and Stork, D. G. Pattern Classification. John Wiley & Sons, 2012.

[4] Wang, A. "An Industrial Strength Audio Search Algorithm." ISMIR 2003, pp. 7-13.