VLAD TABUS

IDENTIFICATION OF SIMILAR MUSIC CLIPS WITHIN A SONG

Diploma Thesis Work

Examiner: Dr. Heikki Huttunen
Examiner and topic accepted by the meeting of the Faculty Council of the Electrical and Computing Engineering Faculty on 18 December 2011

ABSTRACT

TAMPERE UNIVERSITY OF TECHNOLOGY
Degree Programme in Information Technology
TABUS, VLAD: Identification of similar music clips within a song
Bachelor of Science Thesis, 28 pages, 2 appendix pages
December 2011
Major: Signal Processing and Multimedia
Examiner: Dr. Heikki Huttunen
Keywords: Music Information Retrieval, MIR, Music Similarity Analysis, HPCP, Song Structure

This thesis proposes and describes a method that uses low-level tonal descriptors, namely harmonic pitch class profile (HPCP) vectors, to perform a chroma analysis of the audio signal, followed by the computation of a similarity matrix to identify the starting and ending points of similar audio clips within a song. The song is first processed to extract melodic information as a sequence of tens of thousands of twelve-dimensional HPCP vectors, each vector corresponding to an 11.6 ms time frame.

The assumption made in the calculation of chroma by HPCP is that only the local maxima in the estimated spectrum are important for distributing the power spectrum over the semitones. The similarity matrix is computed with a resolution of 1 s by averaging scalar products between the two HPCP vector sequences.

Similar audio segments appear as long diagonal lines. Lines lasting longer than 7 s are defined as similar audio clips. The method is applied to 20 songs, each belonging to a different genre, and the similarity of the extracted clips is rated by listening to the aligned audio clips rendered concurrently on the left and right channels of a stereo headset. The method was found to perform reasonably well for most of the music genres tested, with a few exceptions due to various genre-specific factors.


CONTENTS

Abstract
Terms and their definitions
1 Introduction
2 Background
  2.1 Sound perception
  2.2 Music information retrieval
  2.3 State of the art
3 Harmonic pitch class profile extraction
  3.1 Description of Pitch Class Profile
  3.2 From power spectrum to chroma
  3.3 Steps to extracting HPCPs from a song
4 Constructing a similarity matrix from calculated HPCP vectors
5 Results
  5.1 Experiments with note patterns and HPCP vectors
  5.2 Experiments with different genres
6 Conclusions
References
Appendix


TERMS AND THEIR DEFINITIONS

bpm – beats per minute, a measure of the tempo at which music is played.
B-H window – Blackman-Harris window.
Chroma – a representation of the spectral properties of a sound in which all frequencies are collapsed into 12 containers, each denoting a semitone.
FFT – Fast Fourier Transform.
Fundamental frequency – the main frequency contained in the sound generated by an instrument or human voice, over which additional harmonics are superposed to create the timbre of the sound.
Grayscale – an image format in which pixels have only their luminance defined, varying between a minimum value (usually 0) and a maximum value.
GUI – graphical user interface.
HPCP – Harmonic Pitch Class Profiling, the method used to extract the HPCP vectors. Points of local maxima, corresponding to harmonic components, are identified in the estimated spectrum, and the power spectrum values at these points are distributed over the semitones.
Interpolation – the process of estimating the value of a signal at an arbitrary time located within the range of defined discrete time points.
Main lobe and side lobe – the graphical representation of a time window's gain over frequency has a central part, usually centered on frequency 0, which constitutes the main lobe; additional lobes appear at higher frequencies.
Matlab – a matrix-based fourth-generation programming language.
MIDI – Musical Instrument Digital Interface, a standard enabling interconnectivity between computers and other electronic musical instruments, specifying parameters such as notes, velocities and timing.
MIR – Music Information Retrieval.
MIREX – Music Information Retrieval Evaluation eXchange, an annual evaluation campaign for MIR algorithms.
MPEG-7 – Multimedia Content Description Interface.
mp3 – a de facto audio storage standard.
Normalization – the process of scaling the values in a vector by multiplying them by the same value so that some property is obtained (e.g., for HPCP, normalization to a maximum value of 1).
Octave – a musical interval spanning eight notes or twelve semitones.
Power spectrum – describes how the energy of a signal is distributed over frequency.
Sampling rate – the rate, or frequency, at which a discrete signal is sampled from a continuous waveform. The standard audio sampling rate is 44100 Hz.
Thresholding – an operation in which the elements of a vector are compared to a threshold value; those larger than or equal to the threshold are set to 1, those smaller to 0.


1 INTRODUCTION

Personal music collections are growing quickly as music production has become a highly accessible hobby to people passionate about music all over the world. In times of such content abundance, new methods of music sorting, surfing and correlation are continuously under development.

The classical method of surfing music collections makes use of textual information associated with each track or sample. Using manual annotations to group and correlate audio, thus improving browsing suggestions, has been the traditional starting point of music retrieval [1]. Textual tagging comprises three facets: editorial, textual and bibliographic [6]. While it is still the cornerstone of orientation within the universe of audio data, there are many more ways of classifying and navigating large audio databases that are worth exploring.

Considering that a private music collection may comprise tens of thousands of songs, the average end user can hardly be expected to navigate solely via memorized textual tags. This challenge, commonly referred to as the underused music repository problem [1], is currently being addressed via creative new ways of computerized classification of audio data, such as tree-like arrangements based on artist, genre, type, tempo (bpm) and the mood induced in the listener.

One key component of audio surfing is the development of recommender systems [13] that automatically identify related content based on the current content or recent choices. This may be achieved, among other methods, by measuring pattern similarities. Measuring similarity becomes an especially interesting but also challenging problem when dealing with polyphonic audio, as there is no clear benchmark for comparing complex audio signals.

To outline any similarity-related problem it is important to clarify what type of audio is being examined, at what scale the comparison is attempted, which features of the audio information are examined, how similar the target features should be for the audio to be identified as similar, and how the obtained similarity information is used to improve the labeling and browsing of music collections.

Accordingly, this thesis considers the problem of similarity within polyphonic music at an intra-song level (i.e. audio clips belonging to the same song), making use of harmonic pitch class profile (HPCP) vectors as feature vectors, according to [2]. The similarity of the HPCP vectors is compared in groups, making use of a similarity matrix to store comparison grading results. From this matrix, information defining the start and end points of similar audio clips in the analyzed song is extracted, and through simultaneous listening a user may decide how similar these clips really are.

The reason for measuring similarity at an intra-song level is that it provides important information about the structure of a song and how repetitive its content is, while also providing a good testing environment for comparing results. The audio pattern does not vary much inside the same song, so a user may easily compare the audio clips identified by the program as similar. Furthermore, the running time when extracting feature vectors is very long due to computational complexity; applying such an algorithm to a large database in order to identify similar songs would therefore be very time consuming.

An additional step when comparing HPCP vectors of different songs is the estimation of the reference frequency for each song, because either of the songs may be tuned to a frequency other than the standard 440 Hz. This estimation process, also referred to as reference tuning [2], is complex and is not part of the algorithm described in this thesis.

A successful similarity analysis may be used to solve different problems when applied to audio patterns of different lengths. One of the most common and recurring research problems related to similarity analysis at an inter-song scale is cover song identification, a topic extensively covered and benchmarked in the annual Music Information Retrieval Evaluation eXchange (MIREX) competition [1,2,3].

When applied on an intra-song scale, a similarity analysis of sound clips within the song may lead to the identification of the structure of that particular composition [2].

The most repetitive part within a song is the chorus or hook, repeated multiple times within a composition due to its strong energy and melodic content. Identifying the hook of a song gives multiple clues about the structure, or sequence of building blocks, the song is composed of. Implicitly one may assume that before the first hook there is an intro and a verse, and that between hooks there are other verses.

In a nutshell, the aim of this thesis is to solve the common music information retrieval problem of similar audio clip identification at an intra-song level, using an algorithm implemented in Matlab. The algorithm combines methods from two research papers [1,3] and a Ph.D. thesis [2], which had a slightly different goal, namely to identify certain melodic sequences occurring at an inter-song level in multiple cover songs or remixes.

Chapter 2 provides the reader with the necessary background, discussing the perceptual characteristics of sound in humans, the field of music information retrieval and the reasons for choosing the particular solution discussed in this thesis. Chapter 3 connects music theory to the implemented solution and explains in detail the process of HPCP extraction. Chapter 4 presents the construction of a similarity matrix and its usage in finding similar clips. Chapter 5 presents experiments for validating the performance of the method in different genres. Finally, the conclusions of these experiments are presented in Chapter 6.


2 BACKGROUND

2.1 Sound perception

According to the Merriam-Webster dictionary, sound is defined as "a mechanical radiant energy that is transmitted by longitudinal pressure waves in a material medium"1. Another, more comprehensive definition is "an alteration in pressure, particle displacement, or particle velocity which is propagated in an elastic medium, or the superposition of such propagated alterations" [7, p. 3]. Additionally, it must be noted that for any sound to be audible to humans its frequency must be between 20 Hz and 20 kHz and its loudness, or sound-pressure level, between 0 dB and 100 dB [7, p. 399]. Above a sound-pressure level of 100 dB the perception of sound becomes painful and the risk of hearing damage increases sharply. Though it is useful to define sound, it is important to understand that sound is conceptually different from music, since sound may exist even without its psychoacoustic implication while music cannot. Music is the ordered arrangement of sounds so as to express an idea or emotion, or to create a mood. For a sequence of sounds to be considered music it must be consistent throughout its entire length, possessing to a certain extent some of the following qualities: rhythm, harmony and melody.

In order to analyze a song, before proceeding to the definition and computation of similarity metrics, its key physical properties must be defined, measured and linked to each other, to a progression in time, or to a new feature based on a pattern inferred from them. The most commonly used properties, also called facets of music information, are classified as editorial, textual, bibliographic, pitch, temporal, harmonic and timbral [6]. The editorial facet specifies details about the execution and performance of a musical composition, such as dynamics indicating that a certain part of the melody is to be played softly, strongly, or with a sudden or gradual change. The textual and bibliographic facets refer to the lyrical information and to the makers and publication information of the song. These three facets will not be used for similarity analysis here, as sorting based on such tags is already in wide use today.

Pitch is the human perception and interpretation, or psychological experience, of a sound wave's frequency. The frequency defines the number of physical vibrations per second and is measured in hertz. Most sound waves are complex, composed of superposed sinusoidal waves, each with a different frequency. The overall pitch or fundamental frequency of complex or aperiodic sound waves can be determined by comparison with a pure tone, but in fact it has been found that listeners are not able to identify precisely the fundamental frequency of a complex sound. It has been observed that the pitch of a complex tone is slightly lower than that of a pure tone with the same frequency [8, p. 141]; complex waves may therefore be attributed a fundamental frequency without actually containing a pure tone of that frequency. This is the reason for differentiating between the listener's psychological experience of pitch and the physical feature of sound, which is frequency. Good pitch discrimination is inherited and has been found not to vary significantly with training or intelligence [9, p. 59]. Temporal information refers to the duration and accentuation of beats, notes and sounds, and the relation among them that generates the overall feeling of rhythm and melody. Both the pitch and temporal facets are very useful in similarity analysis, defining melody and timing, and will therefore be used in the detection and alignment of similar audio clips.

1 http://www.merriam-webster.com/dictionary/sound (retrieved 9.11.2011)

Harmony is the richness of a polyphonic sound at a certain moment in time. At an instrumental level it refers to the multiple overtones heard concurrently with the fundamental frequency, and at a score level it refers to the multiple pitches occurring concurrently. Chords, being composed of multiple notes sounded simultaneously, are the simplest example of creating harmony.

The timbre, or "tone color or quality", is the attribute of sound dealing with differentiation according to the means of production and mediums of propagation. It covers all aspects of a sound except pitch and loudness [8, p. 299]. Timbre is determined primarily by "the number, the order, and the relative intensity of the fundamental and its overtones as expressed in the wave form" [9, p. 20].

It is important to understand the aforementioned concepts, as the human perception of sound is not that of a sound wave but of a tone having pitch, duration, harmony, loudness and timbre [9, p. 15].

The psychological interpretation of tonal properties may deviate from the corresponding parameters of the physical wave, making it difficult to prove any single approach perfect for sound clip similarity comparison. Furthermore, as the same song may have multiple interpreters and may incorporate multiple genres as different structural modules, a good algorithm for similar sound clip identification should be flexible with respect to interpretation style, voiceprint and musical genre while still successfully analyzing the pitch change sequence and melodic progression.

2.2 Music information retrieval

Many commercial applications aim to retrieve songs or music clips from large databases. This has led to the appearance of the field of music information retrieval (MIR), which aims to extract, quantify and collect meaningful features of compositions, providing methods for the comparison and classification of music. Related fields of research include signal processing, information retrieval, the study of music, statistics, database management, human-computer interaction and machine learning.

This thesis approaches music processing and feature extraction from a signal processing perspective. To identify features within a song using a computer, however, it is not enough to process the signal: the solution to a problem of MIR nature must also take human music cognition and perception into account and be congruent with the listener's overall perception of the song.

As the importance of MIR becomes more widely recognized, perhaps due to the recent rapid growth of the world's music output, multiple conferences and events are organized yearly, among which are the Audio Engineering Society Conventions and Conferences2, the European Conference on Information Retrieval3, the Digital Audio Effects Conferences4 and the Music Information Retrieval Evaluation eXchange (MIREX)5.

Sources of information for MIR are the audio signal itself, symbolic representations of the signal such as note scores, metadata, and social music networks such as Spotify, iTunes Ping, Microsoft Zune - The Social, and Jango.

2.3 State of the art

The decision regarding which method should be studied and implemented in order to find similar audio clips within a song should be based on choosing from the best solutions dealing with a similar topic submitted to a well-known yearly competition, in this case MIREX5, one of the biggest international conventions of its type in the world. In this annual convention some of the most important topics in the field of MIR and audio processing are scrutinized and, when necessary, practically tested on large databases to identify the most viable solution submitted to date. Among the different topics, cover song recognition is a very popular one, recurring in consecutive years. A few more examples of current (2011) topics linked to similarity from the same competition are audio music similarity and retrieval, structural segmentation, audio cover song identification and audio melody extraction6.

Since cover song identification deals with similarity of audio in large files, it provides a good starting point for adapting a solution for finding similar sound clips throughout a song as well. By comparing the results obtained for cover song identification in the years 2007-2010 and the success rate scores obtained on large-database tests in MIREX, it may be concluded that the best-scoring solutions are those of Serra, Gomez, and Herrera in 2008 and Serra, Zanin, and Andrzejak7 in 2009, boasting a precision of 75%. These solutions make use of low-level chroma or HPCP tonal descriptors to achieve a similarity comparison between songs and also between audio clips.

2 http://www.aes.org/ (retrieved 9.11.2011)
3 http://www.ecir2011.dcu.ie/ (retrieved 9.11.2011)
4 http://www.dafx.de/ (retrieved 9.11.2011)
5 http://www.music-ir.org/mirex/wiki/MIREX_HOME (retrieved 9.11.2011)
6 http://www.music-ir.org/mirex/wiki/2011:Main_Page (retrieved 9.11.2011)
7 Note: The best-scoring solutions were identified by examining the following sources:
http://www.music-ir.org/mirex/abstracts/2007/MIREX2007_overall_results.pdf (retrieved 9.11.2011)
http://www.music-ir.org/mirex/results/2008/MIREX2008_overview_A0.pdf (retrieved 9.11.2011)
http://www.music-ir.org/mirex/results/2009/MIREX2009ResultsPoster1.pdf (retrieved 9.11.2011)
http://www.music-ir.org/mirex/results/2010/mirex_2010_poster.pdf (retrieved 9.11.2011)

Low-level descriptors are useful tools for audio pattern recognition in a song because they are related to the most relevant facet of music information: the pitch, or rather the sequence of changes in tonality. This sequence is important because it provides the foundation of what people perceive as melody. The melody of a song generally continues uninterrupted, and its progression and change over time must follow a somewhat deterministic tonal pattern; otherwise the overall harmony would be destroyed and the music would turn into unpleasant noise.

Modern music typically has many variations in vocals (multiple singers and multiple styles per singer), added coloring effects and only locally occurring harmonics. All of these high-level features may appear to a pattern analysis algorithm as timbre, loudness and tuning variations overlaid with random noise. To be fit for music analysis, low-level descriptors must be robust to all such variations between two occurrences of the same audio clip.

There are many low-level descriptors; as many as 17 temporal and spectral parameters are described in the MPEG-7 Audio Low Level Descriptors documentation of the International Organization for Standardization (ISO)8. The best solutions submitted to MIREX use only spectral parameters, since temporal parameters vary between songs and even between different sections of the same song. The idea behind both chroma features and HPCP features is to group the entire spectrum of each frame into 12 bins, each bin representing a different semitone, thereby providing a detailed representation of the standard octave used in composing all modern western music.

Another method commonly used for detecting similarity in patterns that differ in their temporal features but are otherwise similar is dynamic time warping (DTW). This method has also been applied to audio patterns with a high success rate; one of the latest papers on the method is "Partial sequence matching using an unbounded dynamic time warping algorithm" [11]. However, this algorithm is computationally more demanding, making the analysis of long complex songs time consuming, and though it performs well for sound alignment and classification in MIDI files and monophonic samples, its performance may decrease significantly on modern music; it was therefore not considered a good choice.

8 http://mpeg.chiariglione.org/technologies/mpeg-7/mp07-aud%28ll%29/index.htm (retrieved 9.11.2011)


3 HARMONIC PITCH CLASS PROFILE EXTRACTION

3.1 Description of Pitch Class Profile

One approach to extracting the melodic line uses a feature vector called the pitch class profile (PCP). The initial signal is split into frames whose discrete Fourier transform is computed first and then transformed into a pitch class profile. The PCP is a vector of 12 elements that represent the intensities of the twelve semitone pitch classes [10]. An accurate transcription of each note, as in an automatic scoring approach, is not required, since PCPs are grosser measures grouping all notes of the same pitch class together. Noise and simultaneously occurring harmonics affect the accuracy of PCP detection. Note and chord smoothing, together with change sensing, may be employed to determine more accurately whether a pitch change occurred between analyzed frames [10].

Generally, both notes and chords tend to last multiple frames, as the length typically chosen for one frame is on the order of a few thousand samples while the sampling rate of modern music is 44.1 kHz. For example, a frame of 2048 samples at this sampling rate lasts only 46.4 ms. According to Guinness World Records, the fastest piano player in the world managed 498 key hits per minute9, translating to 0.386 key hits per 46.4 ms frame.

Harmonic pitch class profiling is an extension of PCP, also computed on a frame-wise basis. It is based on detecting local maximum values of the spectrum computed over the fast Fourier transform (FFT) frequency range, allocating each FFT frequency to the corresponding semitone, and normalizing. The advantage of the HPCP representation is that the resulting vector is largely unaffected by the source wave's tuning, loudness, dynamics or even noise.

HPCP vectors may take into account the possibility that the tuning frequency of a song is not the standard reference frequency A4 = 440 hertz (Hz) but a different nearby frequency. This is an important calibration of HPCP features when dealing with inter-song similarity; however, within a single song the reference frequency changes only minimally, so this part of the implementation may be left out.

9 http://www.guinnessworldrecords.com/records-7000/most-piano-key-hits-in-one-minute/ (retrieved 9.11.2011)


3.2 From power spectrum to chroma

Chroma is the ordering of the frequency components of the entire frequency spectrum into 12 bins, each corresponding to one of the 12 semitones commonly used when composing modern music. All octaves are superposed into a single 12-unit division. The power spectrum is the power of the signal at each frequency and is computed using the FFT of a frame consisting of N samples y = [y(0), y(1), ..., y(N-1)]. This is done for the frequencies $\omega_0, 2\omega_0, 3\omega_0, \dots, (N-1)\omega_0$, where $\omega_0 = 2\pi/N$ is the fundamental (normalized angular) frequency.

The fast Fourier transform is $Y = \mathrm{FFT}(y)$, where y is a vector of N elements and the $k$th element of Y is

$$Y(k\omega_0) = \sum_{i=0}^{N-1} y(i)\, e^{-jk\omega_0 i}.$$

The power spectrum for the normalized angular frequency $k\omega_0$ is computed as $P(k) = |Y(k\omega_0)|^2$. The normalized angular frequency $k\omega_0$ (measured in radians per sample) corresponds to the frequency

$$f_k = \frac{k\omega_0}{2\pi} \cdot f_s = \frac{k}{N} \cdot f_s \;\text{Hz},$$

therefore the fundamental frequency is calculated as $f_0 = f_s/N = 44100/4096 = 10.76$ Hz. It is worth mentioning that this fundamental frequency should not be confused with the reference frequency of 440 Hz that western music is generally tuned to.

The melodic line of a song may vary over very high and low frequency ranges. Therefore, though the human aural perception range is between 20 Hz and 20 kHz, a further limitation is imposed, bringing the frequency sensitivity of the algorithm down to the range between 100 Hz and 5 kHz. This choice aids in finding consistent patterns throughout the song by eliminating unimportant areas of the signal from consideration. The corresponding FFT frequencies run from $9 f_0$ to $465 f_0$ (approximately 97 Hz to 5 kHz) and may be seen plotted in Figure 2 and Figure 3.
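The computation above condenses into a few lines of Matlab. The following is a minimal sketch, assuming a zero-padded 4096-sample frame xq_padded (constructed as described in Section 3.3) is already available; the variable names are illustrative, not taken from the thesis code:

% Power spectrum of one zero-padded frame and selection of the
% 100 Hz - 5 kHz band used by the algorithm.
fs    = 44100;              % standard audio sampling rate in Hz
N_fft = 4096;               % extended (zero-padded) frame length
f0    = fs / N_fft;         % FFT frequency step, approximately 10.76 Hz

Y = fft(xq_padded, N_fft);  % complex spectrum of the frame
P = abs(Y).^2;              % power spectrum, P(k+1) = |Y(k*w0)|^2

k   = 9:465;                % FFT indices covering roughly 97 Hz to 5 kHz
f_k = k * f0;               % physical frequencies of the retained bins
P_k = P(k + 1);             % k is 0-based in the text; Matlab is 1-indexed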

In western music, the properties needed to specify a note are the octave number, the letter name and a sharp or flat symbol in case the note is an accidental. An octave contains 8 notes and 4 accidentals, altogether defining 12 semitones. The increase in frequency measured in hertz varies from one note to another, hence the notes are not equidistant on the frequency scale. On a logarithmic frequency scale, however, the notes are equidistant. Assuming tuning is performed with respect to the reference frequency A4 = 440 Hz ($\log_2(440) = 8.7814$), the 9 lower semitones of the 4th octave are computed by iteratively subtracting the step $1/12 = 0.0833$ from this value and, symmetrically, the upper two semitones by iteratively adding the same step. In general, the fundamental frequency of a note n is

$$f(n) = 440\,\text{Hz} \cdot 2^{n/12},$$

where 440 Hz is the commonly agreed reference and n varies between -48 and 39 on a piano keyboard [4]. Accordingly, the note A4, having n = 0, has the frequency $440\,\text{Hz} \cdot 2^{0/12} = 440$ Hz, half that of A5 $= 440\,\text{Hz} \cdot 2^{12/12} = 880$ Hz, whose n = 12.
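As a quick illustration of this formula, the following Matlab fragment (with illustrative variable names) generates all piano note frequencies and verifies the equal semitone spacing on the log2 scale:

% Fundamental frequencies of piano notes, f(n) = 440 * 2^(n/12).
n = -48:39;                 % n = 0 is A4; the range spans a piano keyboard
f = 440 * 2.^(n / 12);      % frequencies in Hz
step = diff(log2(f));       % every element equals 1/12 = 0.0833: semitones
                            % are equidistant on the log2 frequency scale
A4 = 440 * 2^(0/12);        % 440 Hz
A5 = 440 * 2^(12/12);       % 880 Hz, one octave above A4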

The points at which the FFT is calculated start, unlike the musical notes, at the frequency 0 and are obtained by iteratively adding the step of 10.76 Hz. Table 1 presents the semitones of the fourth octave, the frequencies corresponding to them and the FFT frequencies in their proximity. Accidentals are marked as the nearest note followed by "#" or "♭" to indicate a sharp or a flat note, respectively. The table shows that the semitone frequencies progress with a 0.0833 incremental step on the log2 scale, while the FFT frequencies progress with a 10.76 Hz incremental step. The note following the table would be C5, whose frequency obeys both of the following rules: $f_{C5} = 2^{8.031+1} = 2^{9.031} = 523.2$ Hz and $f_{C5} = 2 \cdot f_{C4} = 2 \cdot 261.6 = 523.2$ Hz.

FFT frequencies are computed according to $f_{FFT} = 10.76 \cdot k$, where for the 4th octave k is between 24 and 47. Thus, in the last column of Table 1, due to the logarithmic nature of the semitone frequency progression and the linear progression of the FFT values, a steady increase in the number of FFT frequencies neighboring each note may be noticed. This increase continues throughout all of the following octaves as well, resulting in a much better resolution at higher frequency ranges. FFT frequencies are associated with semitone frequencies simply by selecting the closest corresponding values; for example, the FFT frequency 473.7 Hz lies 7.6 Hz above 466.1 Hz (A#) but 20.1 Hz below 493.8 Hz (B), so it is assigned to A#.

Table 1: Frequencies of the 4th octave in hertz and on the log2 scale, together with the FFT frequencies (f_FFT = 10.76 * k Hz, k = 24...47) in their proximity, specifying the points where power spectrum values are calculated.

Index  Note      Syllable  Semitone freq. (Hz)  log2(freq.)  Nearby FFT freq. (Hz)
1      C         Do        261.6                8.031        258.4, 269.2
2      C#        -         277.1                8.114        279.9
3      D         Re        293.6                8.198        290.7, 301.5
4      D# / E♭   -         311.1                8.281        312.2
5      E         Mi        329.6                8.364        323.0, 333.8
6      F         Fa        349.2                8.448        344.5, 355.3
7      F#        -         369.9                8.531        366.1, 376.8
8      G         Sol       392.0                8.614        387.6, 398.4
9      G#        -         415.3                8.698        409.1, 419.9
10     A         La        440.0                8.781        430.7, 441.4, 452.2
11     A# / B♭   -         466.1                8.864        463.0, 473.7
12     B         Ti/Si     493.8                8.948        484.5, 495.3, 506.0


In Figures 1, 2 and 3 the first row of crosses and filled diamonds indicates the location of the FFT frequencies $f_{FFT} = 10.76 \cdot k$, with $f_0 = 10.76$ Hz and k = 3...460, and the second row of hollow circles and diamonds indicates the location of the semitones found in all octaves from one to eight: notes from 1C to 8D#, altogether 12 semitones x 8 octaves = 96 notes. Since neither 12 distinct symbols nor 12 distinct colors are readily available in Matlab, two symbols and the six predefined colors are combined to generate a differently colored symbol for each semitone. The particular symbol and color placed at each frequency of interest is not important; it would only be confusing to have the exact same symbol of the same color for more than one semitone. Figure 1 emphasizes why the frequency sensitivity range of the algorithm is set to begin at 100 Hz by showing how widely spaced the FFT frequencies are in this lower frequency range.

Figure 1: The first and part of the second octave. Notes in this frequency range are not significant to the main melodic content, therefore very few FFT frequencies are computed in this range.

Figure 2: The second, third and fourth octaves. A better FFT resolution per note may be noticed. The frequency range explained in Table 1 is displayed inside the red window. Regular human voice frequency is within this range, with singers being able to reach the fifth octave as well with practice [12].

Figure 3: The fifth, sixth and part of the seventh octave, having the best FFT resolution.


3.3 Steps to extracting HPCPs from a song

Spectrum analysis is composed of segmentation, windowing, zero padding and FFT computation. These steps are shown in the correct sequence in Figure 4 and further discussed in this subchapter.

Figure 4: Sequence of steps performed to compute an HPCP vector from the initial waveform audio signal.

Segmentation and windowing

The signal must be segmented into overlapping frames to begin the analysis and build a descriptor. Frame sizes chosen in the literature range from 4096 samples in [2] to 17617 samples in [10], which at the standard sampling rate of 44100 Hz corresponds to a temporal resolution between 93 ms and 400 ms. There are no set rules for selecting the frame size; the tradeoff is between frequency resolution and temporal resolution. As the frame size grows, the temporal resolution decreases because fewer frames are available for analysis in the overall signal, but more frequencies become available within each frame for grouping into the 12 bins of the HPCP vector.

In this thesis the frame size was chosen to be $N_{frame} = 2048$ samples (46.44 ms) and the hop size $N_{hop} = 512$ samples (11.6 ms); hence 1536 samples are shared between consecutive frames. The hop size simply means that if the current frame starts at sample s, the next frame starts at s + 512, one hop further. The variable q is the generic index of a frame, $0 \le q < (L_{song} - N_{frame})/N_{hop}$ with $L_{song}$ the total number of samples in the song, and is used to index frames throughout the entire document.

Once the frame is extracted, a Blackman-Harris window (B-H window) is applied to it in the time domain [5]. The magnitude response of the B-H window has an attenuation of -62 dB between the peak of the main lobe and the peak of the side lobe when the coefficients $\alpha_0 = 0.44859$, $\alpha_1 = 0.49364$ and $\alpha_2 = 0.05677$ are used to define a window of L terms in the time domain as

$$w_{BH}(n) = \frac{1}{N} \sum_{l=0}^{L-1} \alpha_l \cos\!\left(\frac{2nl\pi}{N}\right), \quad n = 0, 1, \dots, N-1 \;[2].$$

Substituting the parameters specific to this implementation into the formula, the window becomes

$$w_{BH}(n) = \frac{1}{2048} \sum_{l=0}^{2} \alpha_l \cos\!\left(\frac{2nl\pi}{2048}\right),$$

with n between 0 and 2047, $N = N_{frame} = 2048$ and L = 3. The windowing is applied in the time domain by element-wise multiplication between the qth frame and the window, resulting in the windowed signal

$$x_q(n) = x(n + q \cdot N_{hop}) \cdot w(n), \quad n = 0, 1, \dots, N-1 \;[2].$$

Zero padding

The windowed signal has only 2048 samples, and zero padding is used to increase the accuracy in frequency. For a frame of length $N_{frame} = 2048$ the resolution in normalized angular frequency is $2\pi/2048 \approx 0.0031$ radians, which is not sufficient. To increase the frequency resolution, the frame is extended to $N_{FFT} = 4096$ samples (93 ms) by zero padding: the 2048 values y(1024) to y(3071) are taken equal to the windowed signal $x_q(0)$ to $x_q(2047)$, and zeros are inserted from y(0) to y(1023) and from y(3072) to y(4095). A visual representation of the extended frame progression is provided in Figure 12 of the Appendix.

In Figure 5 the red rectangle marks the length of the entire extended frame, $N_{FFT}$, and the green rectangle the length of the initial frame, $N_{frame}$. The envelope generated by the B-H window is superposed on the initial waveform, both of length $N_{frame}$, and the rest of the extended frame is padded with zeros.
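The per-frame chain of Figure 4 up to the FFT fits in a few lines of Matlab. The sketch below is a minimal illustration under the parameters of this section; the signal x and frame index q are assumed given, and the window is written in the standard signed three-term Blackman-Harris form (the middle term enters with a negative sign in this form, and the 1/N scaling, which only affects the overall level, is omitted):

% Per-frame processing: extract frame q, window it, zero-pad and
% compute the power spectrum. x is the full signal, q >= 0 the frame index.
N_frame = 2048;  N_hop = 512;  N_fft = 4096;
alpha = [0.44859 0.49364 0.05677];        % window coefficients from [2,5]
n = (0:N_frame-1)';
w = alpha(1) - alpha(2)*cos(2*pi*n/N_frame) ...
             + alpha(3)*cos(4*pi*n/N_frame);  % 3-term B-H window, -62 dB

xq = x(q*N_hop + 1 : q*N_hop + N_frame);  % q-th frame (Matlab is 1-indexed)
xw = xq(:) .* w;                          % element-wise windowing

y = zeros(N_fft, 1);                      % extended frame, centered signal
y(1025:3072) = xw;                        % y(1024)...y(3071) in 0-based terms
P = abs(fft(y)).^2;                       % power spectrum of the frame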


Figure 5: From top to bottom: a random initial waveform, where n is the discrete time, followed by the B-H window and the resulting signal after element-wise multiplication of the initial waveform and the B-H window.

To complete the spectral analysis, the FFT of the qth extended frame is computed. The power spectrum values P(k) at the FFT frequencies are obtained as the squares of the absolute values of the elements of the FFT vector.

Though there are 4096 power values for each frame, not all of them are significant for comparing audio similarity. The points at which the sinusoidal components of the signal are strong are the local maximum peaks of the power spectrum. These peaks are not necessarily located at a multiple of $f_0$; therefore, once a peak is found, its position is calibrated by interpolation using parabola fitting. The process by which local maxima are detected in the power spectrum is called peak detection.

Peak detection

The power spectrum was computed at the resolution of $k \cdot f_0$ with 9 < k < 464. A value P(k) is a peak value if it satisfies the condition $P(k-1) < P(k) > P(k+1)$. Once a peak value is found, a parabola is drawn through the three points $(k-1, P(k-1))$, $(k, P(k))$ and $(k+1, P(k+1))$. The equation that defines a parabola is $y(k) = a(k - p)^2 + b$, where a is the concavity, b the offset and p the center of the parabola. The location of the maximum on the frequency axis is the center x of the parabola, defined by

$$x = k + \frac{P(k-1) - P(k+1)}{2\left(P(k-1) - 2P(k) + P(k+1)\right)},$$

where $P(k-1)$, $P(k)$, $P(k+1)$ are three consecutive values of the power spectrum located at indexes k-1, k, k+1. The magnitude of the parabola peak found at this location is

$$y(x) = P(k) - \frac{\left(P(k-1) - P(k+1)\right)(x - k)}{4}.$$
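A direct Matlab transcription of this procedure, as a sketch (P is assumed to hold the power spectrum of one frame, with P(k+1) storing the value at frequency $k f_0$, since Matlab arrays are 1-indexed):

% Peak picking with parabolic refinement over the 100 Hz - 5 kHz band.
peaks = [];                              % rows: [refined_location, magnitude]
for k = 10:464                           % 0-based spectral index
    Pm = P(k);  Pc = P(k+1);  Pp = P(k+2);      % P(k-1), P(k), P(k+1)
    if Pc > Pm && Pc > Pp                       % local maximum condition
        x  = k + (Pm - Pp) / (2*(Pm - 2*Pc + Pp));  % parabola center
        yx = Pc - (Pm - Pp)*(x - k)/4;              % interpolated magnitude
        peaks(end+1, :) = [x, yx];       %#ok<AGROW> grow the list of peaks
    end
end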

Figure 6: Left: power spectrum from 0 to 22 kHz (step $f_0$ = 10.76 Hz), with the area of interest (100 Hz to 5 kHz) highlighted in red. Right: local maxima detected in the frequency range 100 Hz to 5 kHz, plotted as red stars.

HPCP computation

The purpose of the harmonic pitch class profile low-level descriptor is to measure the overall intensity of each of the 12 semitones at each point in time throughout the input signal. Since octaves two through seven are included in the frequency range of interest, they are folded on top of each other and all notes contained in this range are allocated to only 12 bins. The HPCP vector for the qth frame is defined according to

$$HPCP_q(n) = \sum_{i=1}^{nPeaks} w(n, x_i f_0) \cdot y(x_i),$$

where n = 1 ... 12 is the semitone index, $x_i f_0$ is the frequency of peak i, $y(x_i)$ is the magnitude of peak i, and nPeaks is the number of peaks found by the peak detection procedure for the qth frame. The weighting function $w(n, f)$ allocates the power $y(x_i)$ to the closest semitone, and to its left and right neighbors as well, through a complex process described in [2]. The reason for choosing a maximum value of 12 for n is to get a 1:1 mapping of peaks to semitones, but there is no hard limitation on the upper value; it may be chosen to be any multiple of 12: 24, 36, 48 or more if useful.

As a post-processing step, normalization is performed by dividing all elements of the HPCP vector by its maximum value, so that the HPCP values range between 0 and 1.
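The folding and normalization can be sketched in Matlab as below. For brevity this sketch replaces the neighbor-spreading weighting function of [2] with a simple nearest-semitone assignment, and it places the reference pitch A in bin 1 rather than following any particular bin convention; peaks is the [location, magnitude] list produced by the peak detection sketch above:

% Fold detected spectral peaks into a 12-bin HPCP vector (simplified:
% each peak's power goes entirely to its nearest semitone bin).
f0   = 44100 / 4096;                 % FFT frequency step in Hz
fref = 440;                          % reference tuning frequency, A4
hpcp = zeros(12, 1);
for i = 1:size(peaks, 1)
    f = peaks(i, 1) * f0;            % refined peak frequency in Hz
    % distance from the reference in semitones, folded into one octave
    s = mod(round(12 * log2(f / fref)), 12) + 1;
    hpcp(s) = hpcp(s) + peaks(i, 2); % accumulate the peak magnitude
end
hpcp = hpcp / max(hpcp);             % normalize to [0, 1]; assumes at
                                     % least one peak was found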


4 CONSTRUCTING A SIMILARITY MATRIX FROM CALCULATED HPCP VECTORS

To find similarity within an audio signal once its HPCP feature vectors have been computed, a pairwise comparison method may be employed in which each vector is compared to every other vector extracted from the signal, yielding one similarity (or dissimilarity) value for each pair. The similarity matrix is a square two-dimensional matrix containing the values of a similarity function, each computed between the two HPCP vectors being compared and representing the data similarity between them.

Graphically, high values can be represented in white; when these values line up to form a continuous diagonal line, a long continuous similar pattern has likely been discovered, indicating that one segment of the song is similar to another segment.

To achieve a meaningful temporal resolution, the HPCP vectors must be grouped with an appropriate group size. One HPCP vector is calculated for every hop, i.e. every 11.6 ms; to achieve a temporal resolution of approximately one second, the group size G is therefore defined here as 86. The total number of hops throughout the song is $T = (L_{song} - N_{frame})/N_{hop}$, so the number of groups found in the song is $N_G = T/G$. This is the size of the similarity matrix along each dimension. The similarity between groups k and l is defined as

$$Sim(k, l) = \frac{1}{G}\sum_{m=1}^{G}\sum_{j=1}^{12} HPCP_{(k-1)G+m}(j) \cdot HPCP_{(l-1)G+m}(j).$$

In order to emphasize the diagonal lines, an average over the previous and next five elements along the diagonal is also taken, producing a new similarity matrix Sim2 of the same size with elements

$$Sim2(k, l) = \frac{1}{11}\sum_{i=-5}^{5} Sim(k + i, l + i).$$

Thresholding of the matrix Sim2 is done by comparing all elements against a threshold set to a fixed fraction of $\max_{k,l} Sim2(k, l)$; the results obtained are shown in Figures 7 and 8. The choice of threshold is specific to this thesis and has been proven suitable experimentally. The beginnings and ends of the diagonal lines indicating similar clips may be slightly imprecise due to the averaging operation executed when computing Sim2. Only the lower part of the similarity matrix is plotted, since it contains all the necessary information; the upper part, being a reflection of the lower part along the main diagonal, would only distract the viewer without presenting additional useful information.
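As a sketch of this construction (assuming HPCP is a 12 x T matrix holding one HPCP vector per hop; the threshold fraction of 0.8 below is an illustrative placeholder, not the experimentally chosen value, which is not restated here):

% Build the group-level similarity matrix, the diagonally averaged
% version Sim2 and the thresholded binary image BW.
G  = 86;                               % hops per group, about one second
NG = floor(size(HPCP, 2) / G);         % number of groups in the song
Sim = zeros(NG);
for k = 1:NG
    for l = 1:NG
        A = HPCP(:, (k-1)*G + (1:G));  % HPCP vectors of group k
        B = HPCP(:, (l-1)*G + (1:G));  % HPCP vectors of group l
        Sim(k, l) = sum(sum(A .* B)) / G;  % averaged scalar products
    end
end
Sim2 = zeros(NG);
for k = 6:NG-5                         % average along the diagonal
    for l = 6:NG-5
        Sim2(k, l) = mean(diag(Sim(k-5:k+5, l-5:l+5)));
    end
end
BW = Sim2 >= 0.8 * max(Sim2(:));       % placeholder threshold fraction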

Figure 7: Similarity matrix Sim2, where Sim2(k, l) represents the similarity between HPCP group k and group l (Elton John - Candle in the Wind). The song comprises 230 groups, marked on the axes and corresponding to 229.645 seconds.

In the similarity matrix the ones situated in a diagonal pattern are represented visually as lines. The diagonal lines are noisy and wider than one pixel, but applying a morphological thinning operator to the black-and-white image (Matlab function bwmorph) reduces them to the proper thickness. Any diagonal pattern of ones lasting longer than 7 seconds is considered long enough to be a valid similar sound clip match; as a result, the number of the clip and the two coordinates of its beginning (xa, ya) and end (xb, yb) are stored. These coordinates define the time spans of the two similar audio segments: one runs from xa to xb, the other from ya to yb. In Figure 8 a similarity matrix and the resulting similar audio segments are plotted next to each other (a sketch of this extraction step is given below).
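A sketch of this post-processing, assuming BW is the thresholded binary matrix from the previous sketch at roughly one-second resolution (bwmorph requires the Image Processing Toolbox):

% Thin the binary similarity image and keep diagonal runs longer than
% 7 s (7 groups at ~1 s resolution), recording their coordinates.
BW = tril(BW, -1);                     % keep only the lower triangle
BW = bwmorph(BW, 'thin', Inf);         % thin noisy lines to 1 px width
minLen = 7;                            % minimum clip length in groups
clips = [];                            % rows: [xa ya xb yb]
for d = 1:size(BW, 1) - 1              % scan every sub-diagonal
    v  = diag(BW, -d);                 % binary pattern along diagonal d
    dv = diff([0; v(:); 0]);
    s  = find(dv == 1);                % run starts
    e  = find(dv == -1) - 1;           % run ends
    for r = find((e - s + 1) > minLen)'
        clips(end+1, :) = [s(r)+d, s(r), e(r)+d, e(r)];  %#ok<AGROW>
    end
end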

Figure 8: Blues - Muddy Waters: similarity matrix and similar segment map.


5 RESULTS

5.1 Experiments with note patterns and HPCP vectors

For an audio signal, using the method described in Chapters 3 and 4, numerous HPCP vectors are generated and a similarity matrix is computed. By analyzing the similarity matrix, the beginning and ending times of similar audio clips within the same song are obtained. The HPCP vector progression can be plotted in grayscale to visually evaluate the accuracy of pitch detection applied to simple note patterns. Additionally, other dominant song components such as drums may be overlaid on these simple note patterns to observe how melody detection (the progression of notes) is affected by the non-melodic components that almost always exist in modern music. In the lower part of Figure 10 this phenomenon may be observed starting from the middle of the figure, where a loud drum pattern is superposed on the notes.
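Such a grayscale plot takes only a few lines of Matlab; the sketch below assumes the HPCP matrix convention of the earlier sketches (12 rows, one column per frame):

% Plot an HPCP vector progression in grayscale, as in Figures 9-11.
imagesc(1:size(HPCP, 2), 1:12, HPCP); % columns = frames, rows = semitone bins
colormap(gray);                       % 0 maps to black, 1 to white
axis xy;                              % put bin 1 at the bottom of the plot
xlabel('frame index');  ylabel('semitone bin');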

Figures 9 and 10 are each composed of two parts. The upper part is a screenshot showing the graphical user interface (GUI) of the home studio music composition program Fruity Loops (FL) Studio 8, in which note patterns were defined by the author. The lower part is a Matlab plot showing the HPCP vector progression computed, with the algorithm defined in Chapter 3, from an mp3 audio file exported from FL Studio 8. The connection between the upper and lower parts is that the audio file first composed in FL Studio 8 and then analyzed in Matlab is exactly the same. Usually it is preferable to present one image per figure, but in this case, to demonstrate the visually consistent alignment between the two note sequences, both images must be included in the same figure.

The upper part of Figure 9, composed in FL Studio 8, presents a chromatic scale octave, including all accidentals and therefore having 12 semitones, followed by a standard octave having 8 notes. A note lasts four squares on the scale, corresponding to 0.485 seconds, thus the total pattern length is 0.485 * 20 = 9.7 seconds. This may be observed, with slight approximation, in the lower half of the figure as well.

The upper part of Figure 10 presents an arbitrary melodious note pattern composed by the author, in which every two squares correspond to 0.485 seconds, a note's length thus being half that in Figure 9. This note pattern also contains pauses between notes, and exactly in the middle a strong drum loop starts, enhancing the rhythmic sensation but also distorting the core melody, as previously mentioned. The drum loop is not visually represented in the upper part of Figure 10, but its effect on the HPCP vector values is easily noticed in the lower part of the figure, where multiple short components, distributed fairly evenly across the 12 bins, make their appearance. The total length of this pattern is 32 * 0.485 = 15.5 seconds.


Figure 9: (Top) Fruity Loops Studio 8 is used to compose a chromatic scale octave followed by a standard octave and to generate an mp3 file. The image is cropped from the program's GUI. (Bottom) The algorithm described in Chapter 3 is used to compute the HPCP vectors from the mp3 file, and all vectors are plotted in Matlab in grayscale (ranging from zero for black to one for white). A slight time offset is noticeable due to imperfect image alignment.

Figure 10: (Top) Fruity Loops Studio 8 is used to compose a melodic note pattern and a drum beat (not shown here). The beat is superposed on the melody starting from the middle of the pattern, and an mp3 file is then generated from their combination. The image is cropped from the program's GUI and shows only the melodic note pattern. (Bottom) The algorithm described in Chapter 3 is used to compute the HPCP vectors from the mp3 file, and all vectors are plotted in Matlab in grayscale.

The figures above demonstrate good accuracy in note recognition when analyzing a melodic note progression alone and even when it is mixed with a beat. Since the aim of this thesis is to identify similar music clips in a very long note progression running through an entire song, it is important to also plot and examine at least one instance of those results. The starting and ending times of two music clips detected as similar are observed in the similarity matrix as the beginning and ending coordinates of the continuous diagonal lines, as explained in Chapter 4. As an example of similar music clips identified within a song, the longest pair of similar audio clips found in the blues song "I'm Your Hoochie Coochie Man" by Muddy Waters is plotted in Matlab as a sequence of HPCP vectors in an aligned manner in Figure 11.

Figure 11: A Matlab plot of the HPCP vectors found in a pair of similar audio clips in the blues song "I'm Your Hoochie Coochie Man". The two vector sequences are plotted in parallel to emphasize the similarity of the patterns.

5.2 Experiments with different genres

"Song" is a very general term, songs usually being very different from each other in structure, rhythm and tonal progression length. The well-known classification of songs by musical genre is used for sorting large collections and for describing the type of music a song belongs to. The algorithm described in this thesis is likely to perform differently across genres; therefore, to get a good idea of its overall performance on a random popular song, each particular genre should be considered separately.


By listening to the similar clip selections made automatically by the algorithm, its performance may be judged subjectively by the listener. The more clips within a song found to be similar, the better the overall coverage.

To rate the similarity of each audio clip, numbers from zero to four are used, where zero signifies complete randomness and difference, two signifies indecision as to whether the selection is feasible, and four means the two sequences match to a very high degree. Good coverage, meaning that multiple similar clips are found, ensures that the similarity rating is accurate.

The melody overall similarity rating is the average of the similarity ratings of the individual audio clips, and the melody overall coverage rating (Mcr) is simply the number of similar clips found. It is assumed that if Mcr ≥ 5 the coverage is sufficient for the similarity rating to be meaningful; otherwise no conclusions may be drawn regarding the genre. An Mcr ≥ 10 signifies excellent coverage. Table 2 presents the essential results (based on Table 3 in the Appendix) obtained when running the Matlab implementation of the algorithm described in Chapters 3 and 4 on a genre comparison using the two described ratings.

Table 2: Summary of the results obtained in the genre-based music similarity analysis, with one reference song per genre and two ratings: melody overall similarity rating (0-4) and melody overall coverage rating (number of similar clips found).

No.  Genre                           Song                                         Similarity  Coverage
1    Blues                           Muddy Waters - I'm Your Hoochie Coochie Man  3.5         10
2    Country                         Johnny Cash - Folsom Prison Blues            3.27        11
3    Dance                           David Guetta - One Love                      3           4
4    Electronic/Techno               Prodigy - Diesel Power                       3.58        ≥20
5    Rock                            Guns N' Roses - Knockin' on Heaven's Door    2.53        ≥20
6    Pop                             Madonna - Hollywood                          3.57        7
7    Hip-hop/Rap                     B.o.B - I'll Be in the Sky                   3.67        3
8    Jazz                            Miles Davis - Summertime                     0           0
9    Latin (Reggaeton)               Daddy Yankee - Like You                      3.17        6
10   Metal                           Marilyn Manson - Rock Is Dead                4           1
11   Reggae                          Bob Marley - Buffalo Soldier                 3.5         2
12   R&B/Soul/Funk                   Craig David - 7 Days                         2.45        ≥20
13   Minimal/Contemporary Classical  Philip Glass - Negro River                   3.85        ≥20
14   Alternative                     Nirvana - Smells Like Teen Spirit            3.05        ≥20
15   House                           Guru Josh Project - Infinity                 2.9         ≥20

Audio clip similarity rating scale: 4 = very similar, 3 = quite similar, 2 = not clear, 1 = quite different, 0 = totally different.


6 CONCLUSIONS

Based on the figures presented in Chapter 5, it may be concluded that HPCP vectors are suitable for mapping notes onto a chromatic scale when the signal is composed of clear modulated samples without additional percussion, vocals or noise. In the case of music, where all of these occur simultaneously, HPCP vectors still perform gracefully as low-level signal descriptors, being robust to noise even in the difficult case where the noise is not evenly spread over the entire spectrum. Figure 11 is a good example of the performance of the method used in this thesis for similar music clip identification, combining the HPCP vector progression with the information extracted from the similarity matrix to identify and display two similar audio clips parallel to each other. The two HPCP vector progressions contain very high values, representing the main melody, occurring at the same points in time. The lower and medium values in the vector progressions represent overlapping percussion, vocals differing from the main melody, effects and noise, and tend to vary greatly from one plot to another.

In the genre-based similarity analysis of Table 2, the best song coverage was achieved for house, alternative, minimal, R&B, techno and rock music, while the worst coverage occurred for metal, reggae, jazz, dance and hip-hop. Due to the poor coverage, the melody overall similarity rating is not considered meaningful for the latter genres. If this study took multiple songs per genre into account, the coverage would perhaps become sufficient in at least one of the songs; however, the specific features of these genres also offer some explanation for why good coverage was not possible. Metal usually has many noisy components overlapping the melody, originating from the purposeful distortion of vocals, highly reverberated drums and multiple overdriven guitar strings played simultaneously. In rap and reggae the vocals are in the foreground, overshadowing the melody; a possible explanation for the bad coverage is that vocal content tends to vary more often than every seven seconds (the minimum length for an audio clip to be taken into account) and emphasizes rhythm instead of melody, so very little melodic information is left in the pattern foreground. In the case of jazz no similar clips were found, suggesting that a chorus is missing or is shorter than seven seconds and that the genre contains very few longer repetitive parts. This conclusion matches the general consensus that jazz is difficult to define or separate from similar genres and that one of its key elements is improvisation. Any genre or music style based upon improvisation, thus not looped in the studio and usually played live, is likely to have less high-level structure, since improvisation is the opposite of a predefined, orderly arrangement.

Page 27: VLAD TABUS IDENTIFICATION OF SIMILAR MUSIC CLIPS WITHIN … Tabus.pdf · audio signal, followed by the computation of a similarity matrix to identify starting and ending points of

Conclusions 27

Of the genres with a good coverage rating, the ones with the best similarity rating are minimal or contemporary classical, with the maximum similarity rating of 3.85, closely followed by techno (3.58), pop (3.57) and blues (3.50). It is no surprise that minimalistic music scored the best similarity rating, since the genre is characterized by the use of few instruments, long melodic patterns repeated often, and a lack of vocals. Techno and pop were also expected to have a high similarity rating, since they are by definition built on a central melody made to be catchy, and thus repetitive.

Overall, visually and aurally, the music clip pairs identified in the similarity matrix are very similar; most genres had enough coverage for a conclusion to be drawn about how similar their internal structure is, and the genres with very little coverage or no similar audio clips identified were, according to their properties, of a correspondingly non-repetitive nature. The only surprising result was the coverage in the hip-hop/rap genre, which is created in a beat-looped manner usually following a simple repeated melody. The fact that only three similar audio clips were found may be due to the particular song used, and also to loud vocals overshadowing the melody, as previously mentioned.


REFERENCES

[1] J. S. Juliá, "Music similarity based on sequences of descriptors: tonal features applied to audio cover song identification", M.Sc. thesis, Pompeu Fabra University, 2007.
[2] E. Gómez, "Tonal description of music audio signals", Ph.D. thesis, Pompeu Fabra University, 2006.
[3] J. Serrá, E. Gómez, and P. Herrera, "Audio cover song identification and similarity: background, approaches, evaluation, and beyond", Studies in Computational Intelligence, vol. 274, pp. 307-332, 2010.
[4] A. Klapuri and M. Davy, "Signal processing methods for music transcription", Springer Science, New York, 2006.
[5] F. J. Harris, "On the use of windows for harmonic analysis with the discrete Fourier transform", Proceedings of the IEEE, 66(1):51-83, 1978.
[6] J. S. Downie, "Music information retrieval", 2005.
[7] H. F. Olson, "Music, physics and engineering", 1967.
[8] W. M. Hartmann, "Signals, sound and sensation", Springer Science, New York, 1997.
[9] C. E. Seashore, "Psychology of music", Dover Publications, New York, 1967.
[10] T. Fujishima, "Realtime chord recognition of musical sound: a system using Common Lisp Music", Proceedings of the 1999 International Computer Music Conference, Beijing, China, pp. 464-467, 1999.
[11] X. Anguera, R. Macrae, and N. Oliver, "Partial sequence matching using an unbounded dynamic time warping algorithm", ICASSP, 2010.
[12] I. R. Titze, "Principles of voice production", National Center for Voice and Speech, 2000.
[13] P. Melville and V. Sindhwani, "Recommender systems", Encyclopedia of Machine Learning, 2010.


APPENDIX

Figure 12 (Appendix): The generation of FFT windows based on the frame of length N_frame extracted from the audio signal, zero padding and the successive hop progression.


Table 3 (Appendix): The original genre-based similarity and coverage analysis results table.