Automatic Transcription of Flamenco Singing from Polyphonic Music Recordings

Nadine Kroher and Emilia Gómez

Abstract—Automatic note-level transcription is considered one of the most challenging tasks in music information retrieval. The specific case of flamenco singing transcription poses a particular challenge due to its complex melodic progressions, intonation inaccuracies, the use of a high degree of ornamentation and the presence of guitar accompaniment. In this study, we explore the limitations of existing state-of-the-art transcription systems for the case of flamenco singing and propose a specific solution for this genre: We first extract the predominant melody and apply a novel contour filtering process to eliminate segments of the pitch contour which originate from the guitar accompaniment. We formulate a set of onset detection functions based on volume and pitch characteristics to segment the resulting vocal pitch contour into discrete note events. A quantised pitch label is assigned to each note event by combining global pitch class probabilities with local pitch contour statistics. The proposed system outperforms state-of-the-art singing transcription systems with respect to voicing accuracy, onset detection and overall performance when evaluated on flamenco singing datasets.

Index Terms—Automatic music transcription, Music information retrieval, Singing voice, Pitch contour, Audio Content Description.

I. INTRODUCTION

A. Definition and Motivation

FLAMENCO music is a rich improvisational art form with roots in Andalusia, a region in southern Spain. Due to its particular characteristics and its importance for the cultural identity of its area of origin, flamenco as an art form was inscribed in the UNESCO List of the Intangible Cultural Heritage of Humanity1 in 2010. Given the growing community of flamenco enthusiasts around the world, a need for computational methods to aid the study and diffusion of the genre has been identified. First efforts have been made to adapt existing music information retrieval (MIR) techniques to the specific nature of flamenco [1]. Having evolved from an a cappella singing tradition [11], the vocals represent the central element of flamenco music, accompanied by the guitar, percussive hand-clapping and dance. Consequently, the main focus is set on developing algorithms which target the analysis of the singing voice.

Flamenco is a strongly improvisational and sparsely documented oral tradition, where songs and techniques have been passed from generation to generation. Given the resulting lack of scores, studies often rely on scant, labour-intensive manual transcriptions. In this study, we present a novel system for automatic singing transcription from polyphonic music recordings targeting the particular case of flamenco. The resulting automatic transcriptions are essential to a number of related MIR tasks, such as melodic similarity characterisation [3], similarity-based style recognition [2], singer identification [4] or melodic pattern retrieval [40], and can furthermore aid a broad variety of musicological studies [5].

N. Kroher was with the Music Technology Group, Department of Information and Communication Technologies, Universitat Pompeu Fabra during the time of submission and is currently with the Department of Applied Mathematics II, University of Seville, e-mail: [email protected] E. Gómez is with the Music Technology Group, Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Spain, e-mail: [email protected]

1 http://www.unesco.org/culture/ich/en/RL/flamenco-00363

Automatic singing transcription (AST) refers to the extraction of a note-level representation corresponding to the singing melody directly from audio recordings. This task comprises two main challenges: first, the estimation of the fundamental frequency corresponding to the sung melody (vocal pitch extraction) and second, its conversion into discrete note events (note transcription). In the resulting symbolic representation, each note is described by its onset time, duration and a pitch value, usually quantised to the equal tempered scale. While the generalised task of automatic music transcription (AMT) [6], [7] is considered a major challenge in MIR, the genre under study poses a number of additional difficulties for both the vocal pitch extraction and the note transcription stage. Pitch estimation is a well-studied problem in MIR with a variety of sub-tasks such as multi-pitch and predominant melody extraction [8]. The difficulty in the context of flamenco singing is to extract the pitch contour corresponding to the vocal melody while omitting contour segments which originate from the guitar accompaniment. The voice is usually not present throughout the entire song but alternates with interludes in which the guitar takes over the main melodic line. Note transcription can be a comparatively simple task when, depending on the instrumentation, note onsets coincide with significant pitch and volume discontinuities. The singing voice, on the other hand, given its non-percussive and pitch-continuous nature, poses a particular challenge when segmenting a pitch contour into discrete note events. Flamenco singing is characterised by a large amount of melodic, partly micro-tonal, ornamentations, extensive use of vibrato and instability of timbre and dynamics. Furthermore, vocal melodies are often composed of a succession of conjunct degrees and singers tend to intonate significantly above or below target notes [10]. Consequently, the complexity of onset detection and pitch labelling significantly increases and requires an algorithmic design which considers the aforementioned characteristics.

B. Related work

Approaches to automatic note-level transcription from audio recordings date back as far as 1977 [12] and are extensively reviewed by [6] and [7]. Most systems described in these reviews provide a generic transcription framework covering a wide range of instruments and musical genres. As mentioned in [7], specific instruments exhibit particular characteristics which might not be captured by a generic transcription system.

Approaches dealing with the specific task of singing transcription have usually been developed for user-oriented MIR tasks such as query-by-humming (QBH), query-by-singing (QBS), sight-singing tutors or computer-aided composition tools. Such systems mainly assume user input, i.e. the user singing a query, and are consequently designed to transcribe unaccompanied recordings sung by amateur singers. In this case, the vocal pitch extraction stage of the transcription algorithm is reduced to a monophonic pitch estimation problem and systems mainly differ in the note segmentation stage: [13] and [14] obtain an initial detection of note onsets directly from discontinuities in the volume envelope, and [15] uses a detection function based on changes in spectral band energies. There are obvious limitations to these rather simple segmentation approaches: Consecutive notes sung in legato may have a stable volume envelope, and accompanying instruments may cause sudden spectral variation without a singing voice onset being present. A first approach, entitled island building, based on pitch contour characteristics was proposed in [16] and still finds application in more recent systems [13], [14]: Consecutive frames with a pitch estimate within a limited range are grouped into note events. In a comparative study [17], a variety of pitch contour based segmentation approaches are compared, including adaptive filters, maximum likelihood estimation, probabilistic modelling and local quantisation. A recent approach to note segmentation [21] formulates interval transitions as a hysteresis curve and locates note onsets based on the local cumulative pitch deviation.

Addressing the more complex task of singing transcription from polyphonic recordings, [18] proposes a multiple fundamental frequency estimator combined with an accent signal to extract the dominant pitch trajectory. The segmentation stage relies on a probabilistic note event model, using a Hidden Markov Model (HMM) trained on manual transcriptions. In an extension to this approach [19], note transition probabilities were incorporated into the computational model. Recently, this note segmentation method was implemented in the computer-aided note transcription tool tony [20]. It should be mentioned that such probabilistic methods require a large amount of ground truth data during the training stage. This involves time-consuming manual annotations, in particular for improvisational and strongly ornamented singing performances, as is the case for flamenco music.

A recent trend in music information retrieval has focused on culture-specific non-Western music traditions [22] and has led to the adaptation of existing MIR approaches as well as the development of novel genre-specific techniques. It has been shown that underlying musicological assumptions, e.g. regarding rhythm or tonality, do not necessarily hold for non-Western music styles. In the context of flamenco singing, a first approach for monophonic transcription was proposed in [2]: Based on a contour simplification algorithm [27], an estimated monophonic pitch track is converted into a set of constant segments within which the absolute error between the pitch track and the fitted constant does not exceed a pre-determined threshold. A system for computer-assisted transcription of single-voiced a cappella singing recordings has been proposed in [23]: Given the absence of accompaniment, a monophonic pitch estimator based on spectral auto-correlation (SAC) represents the front end of the system. The note segmentation stage is based on a likelihood maximisation method [24]: A dynamic programming (DP) algorithm is used to find the best among all possible note segmentations along the entire track. The resulting short-note transcription is refined in a post-processing stage, comprising an iterative tuning estimation and a short-note consolidation process. The system was extended [25], [41] to the transcription of sung melodies from polyphonic flamenco recordings by replacing the monophonic pitch estimator with a predominant pitch extraction algorithm [26].

Nevertheless, the authors report mistakenly transcribed guitar notes as a main source of error and inaccurate vocal detection as the major limitation of the system performance. The note transcription accuracies reported in [23], with a note f-measure of slightly below 0.4, furthermore indicate the difficulty of transcribing flamenco singing: significantly higher performance is achieved when evaluating the same transcription algorithm on a dataset containing pop and jazz singing excerpts. We therefore identify a need to improve the state of the art in flamenco singing transcription and design an algorithm robust towards the particular characteristics of the genre which is furthermore suitable for the analysis of accompanied flamenco singing recordings.

C. Contributions and paper outline

We propose a novel AST system capable of transcribing strongly ornamented improvisational flamenco performances and achieving higher accuracies than state-of-the-art methods. Similar to Gómez et al. [25], we use a predominant melody extraction algorithm to extract the vocal pitch contour. We extend this method in two ways: In a pre-processing stage, we use spectral characteristics to select the stereo channel in which the vocals are more dominant for further processing. Subsequent to the predominant melody extraction, we apply a novel contour filtering method, in which pitch contour segments corresponding to guitar melodies are eliminated. In the note transcription stage, we formulate a set of novel onset detection functions based on pitch contour and volume characteristics, which prove robust towards the influence of the accompaniment as well as a high degree of melodic ornamentation. A pitch label is assigned to each note event by combining overall chroma probabilities with local pitch statistics. We refine the resulting symbolic transcription by applying musicological constraints regarding note duration and pitch range.

The remainder of the paper is organised as follows: We provide a detailed description of the proposed system in Section II and specify the evaluation methodology in Section III. We give experimental results in Section IV and finally conclude the study in Section V.


Fig. 1. Block diagram of the proposed system.

II. PROPOSED METHOD

The processing blocks of the proposed system are depicted in Figure 1. As mentioned above, automatic singing transcription systems consist of two main processing blocks: a vocal pitch extraction algorithm and a note transcription stage. Both stages of the proposed algorithm are described in detail below.

A. Vocal Pitch Extraction

In the context of singing voice transcription, we use the term voicing to describe whether the vocals are present in a frame or not. Consequently, for both monophonic and polyphonic transcription systems, the pitch values of unvoiced frames need to be eliminated in order to obtain an accurate time-frequency representation of the vocal pitch. For monophonic singing transcription systems, the task of vocal pitch extraction is reduced to a monophonic pitch estimation problem, since the singing voice represents the only instrument present. Nevertheless, silences and background sounds can cause noisy sections in the pitch track during unvoiced frames. While [13] and [14] locate unvoiced frames solely based on the volume envelope, in [21] a tree classifier based on a set of low-level descriptors is used. For polyphonic recordings, where the singing voice is accompanied by one or more instruments, the task of vocal pitch extraction gains in complexity due to the presence of various harmonic sources. In flamenco music, the singing voice is the dominant element accompanied by the guitar, which takes over the main melodic line during interludes. An obvious approach to extracting the vocal pitch would be to isolate the voice signal by applying source separation methods and using a monophonic pitch estimator on the resulting voice signal. A different strategy was applied in [25], where a predominant melody extraction algorithm is used to estimate the vocal melody. This method gives convincing results for sections of a track where the vocals represent the dominant musical element. The same study has furthermore shown that the obtained overall performance regarding voicing and raw pitch accuracy is superior to a source separation approach. Nevertheless, mistakenly transcribed guitar contour segments during interludes, where the voice is absent and the guitar takes over the main melodic line, cause a relatively high number of voicing false positives.

Based on these prior findings, we adopt the predominant melody approach and extend it in order to reduce the number of mistakenly transcribed guitar pitch contour segments.

We first apply a channel selection step in order to exploit the fact that in flamenco stereo recordings the vocals are usually more dominant in one of the two channels; only this channel is used for further processing. In pop music productions, it is common practice to place the vocals in the centre of the panorama during the mixing process. In flamenco recordings, on the other hand, the vocals often appear stronger in one stereo channel. One reason for this phenomenon is that many such recordings are live stereo recordings, where the resulting panorama distribution corresponds to the physical location of singer and guitarist on stage. Even in multi-track flamenco productions it is a tradition to separate the voice and guitar in the artificially created panorama. Exceptions are productions with an extended instrumentation, e.g. two guitars or additional instruments, which is rather uncommon, or the obvious case of mono recordings. We exploit this fact by automatically selecting the channel with the stronger presence of the vocals for further processing in order to reduce the influence of the guitar during the vocal pitch extraction stage. We would like to point out that while this channel selection improves system performance, as shown in Section IV, our system does not rely on a panorama separation of the sources; we simply exploit this property whenever it is present in a given track.

We then proceed to the predominant melody extraction from the selected channel. This process refers to the task of estimating the pitch track corresponding to the perceptually dominant melody in a given polyphonic music recording. It furthermore includes the task of determining whether the main melody is present in a given time frame or not. In the scope of this paper, we will refer to frames in which the main melody is estimated to be present as melody frames and to all remaining frames as non-melody frames. Furthermore, we define a contour as a sequence of consecutive melody frames. We extract the predominant melody from the previously selected channel using the algorithm proposed in [26]. We selected this particular method due to its available implementation in the essentia library [31] and in order to obtain a direct comparison to the study presented in [25], where it was also used as the front end of a flamenco singing transcription system. We would like to mention that this method can be replaced by any other predominant melody extraction algorithm. While in the general task of pitch extraction the performance is mainly limited by pitch estimation errors, e.g. octave errors, the limitation in the scope of vocal pitch estimation is to a large extent conceptual: Even a perfect predominant melody extraction algorithm will estimate contours originating from the guitar, since the guitar accompaniment represents the perceptually dominant element during certain sections of a flamenco song.


After extracting the predominant melody, we apply a novel contour filtering process and eliminate parts of the pitch contour which are located outside the vocal sections. Vocal detection has been explored mainly as a machine learning task [28]–[30]. While good results have been obtained on a frame level, such algorithms require a large amount of manually annotated training data. Here, we propose a novel vocal detection algorithm which works on a track level without any previous training stage and eliminates guitar contour segments based on spectral characteristics. The proposed scheme is based on two assumptions: First, the majority of all melody frames are voiced, or, in other words, the main melody largely corresponds to the vocal melody. This assumption is supported by the experiments carried out by Gómez et al. [25], where only around 15% of all estimated pitch values originate from the guitar. Second, we assume that guitar and vocal sections can be discriminated based on spectral characteristics. In preliminary experiments, we investigated a variety of spectral descriptors and found that the energies of the first 12 bark bands [32] are best suited for this task (Figure 3 (a) and (b)). Based on these assumptions, we model melody frames originating from the guitar as outliers in a Gaussian distribution of bark band energies extracted for all melody frames.

Below, we provide a detailed description of the three stages involved in the vocal pitch extraction process and their implementation.

1) Channel selection: The automatic channel selection is based on the fact that guitar and voice differ in their distribution of spectral energy (Figure 2): When the vocals are present, we observe an increase of energy in the frequency range of 500 Hz to 6 kHz. We therefore select the stereo channel with a higher average presence in this range. For both audio channels, sampled at fs = 44.1 kHz, we compute the Short-Time Fourier Transform (STFT) with a window size of N = 4096 samples, a Hann window function, a zero-padding factor of m = 2 and a hop size of hs = 1024 samples. Accordingly, the bin k corresponding to a centre frequency f is given as

k(f) = round(f · m · N / fs)    (1)

To characterise the presence of the voice, we compute the spectral band ratio S[n] from the ratio of summed normalised magnitudes in the upper band f21 = 500 Hz < f < f22 = 6000 Hz to the lower band f11 = 80 Hz < f < f12 = 400 Hz:

S[n] = 20 · log10( Σ_{k(f21)<k<k(f22)} |X̄[k, n]| / Σ_{k(f11)<k<k(f12)} |X̄[k, n]| )    (2)

In order to eliminate the influence of the overall volume, we divide the magnitude spectrum |X[k, n]| by its maximum value in the given frame, Xmax[n] = max_k |X[k, n]|, resulting in the normalised magnitude spectrum |X̄[k, n]| at time frame n. We compute the spectral band ratio S[n] for each frame n and average it for each channel separately over the entire length of the song. Based on the resulting two average spectral band ratios, S̄left and S̄right, we select the channel with the higher average value. Preliminary experiments have shown that comparing the summed bin magnitudes |X̄[k, n]| yields a better discrimination between the two bands than comparing the summed signal energies |X̄[k, n]|².

Fig. 2. Spectrogram for voiced and unvoiced sections: "GIT" denotes the presence of the guitar; "VOC" denotes the presence of the singing voice.

Fig. 3. Contour filtering: (a) predominant melody, (b) lower bark bands, (c) vocal / non-vocal classifier and (d) estimated vocal melody.
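As a rough illustration of this channel selection step, the following minimal Python sketch computes the average spectral band ratio (Eq. 2) per channel and keeps the channel with the higher value; all function and variable names are illustrative and not part of the described implementation.

```python
# Sketch of the channel-selection step (Sec. II-A1), assuming a 44.1 kHz
# stereo signal as a (2, n_samples) array; names are illustrative only.
import numpy as np
from scipy.signal import stft

def band_ratio(x, fs=44100, n_win=4096, hop=1024, pad=2):
    """Average spectral band ratio S (Eq. 2) of one channel, in dB."""
    f, t, X = stft(x, fs=fs, window='hann', nperseg=n_win,
                   noverlap=n_win - hop, nfft=pad * n_win)
    mag = np.abs(X)
    mag /= mag.max(axis=0, keepdims=True) + 1e-12          # frame-wise normalisation
    upper = mag[(f > 500) & (f < 6000), :].sum(axis=0)     # 500 Hz - 6 kHz
    lower = mag[(f > 80) & (f < 400), :].sum(axis=0)       # 80 Hz - 400 Hz
    return np.mean(20 * np.log10((upper + 1e-12) / (lower + 1e-12)))

def select_channel(stereo):
    """Return the channel in which the vocals appear more dominant."""
    s_left, s_right = band_ratio(stereo[0]), band_ratio(stereo[1])
    return stereo[0] if s_left >= s_right else stereo[1]
```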

2) Predominant melody extraction: The predominant melody extraction algorithm [26] estimates pitch candidates on a frame level based on harmonic summation and groups them into pitch- and time-continuous contours using auditory streaming principles. It furthermore filters out contours which do not form part of the dominant melodic line based on their average pitch salience. As a result, it outputs a single pitch value f0[n] in Hz for all melody frames. As this algorithm has previously been used in the context of flamenco singing, we adopt the parameters suggested in [25] as follows: The analysis window corresponds to N = 4096 samples with a hop size of hs = 128 samples at a sample rate of fs = 44.1 kHz. The lower and upper limits for considered fundamental frequency candidates are set to 120 Hz and 720 Hz, respectively. These values correspond to the expected pitch range in flamenco singing. The voicing threshold τv ∈ [−2, 3], which is related to the relative salience threshold for determining whether a contour belongs to the main melody, is adjusted to τv = 0.2. It has been shown [25] that with this value, around 90% of all vocal frames are retained during the elimination process.
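As an assumption-laden sketch, the front end could be configured with essentia's implementation of [26] using the parameter values quoted above; the algorithm and parameter names follow the essentia documentation, but the exact API details and the file name are assumptions and should be verified against the installed version.

```python
# Hedged sketch: predominant melody extraction with essentia's MELODIA
# implementation [26]; parameter values follow the text, the filename is
# hypothetical.
import essentia.standard as es

audio = es.MonoLoader(filename='track_selected_channel.wav', sampleRate=44100)()
melodia = es.PredominantPitchMelodia(frameSize=4096, hopSize=128,
                                     minFrequency=120, maxFrequency=720,
                                     voicingTolerance=0.2, sampleRate=44100)
f0, confidence = melodia(audio)   # f0[n] in Hz, 0 for non-melody frames
```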

3) Contour filtering: First, we extract the energy in the lower twelve bark bands with frequency limits specified as 0 Hz, 50 Hz, 100 Hz, 150 Hz, 200 Hz, 300 Hz, 400 Hz, 510 Hz, 630 Hz, 770 Hz, 920 Hz, 1080 Hz and 1270 Hz, in windows of length N = 1024 samples with a hop size of hs = 128 samples at a sampling rate of fs = 44.1 kHz. The energy in the mth bark band with lower frequency limit f1,m and upper frequency limit f2,m at time frame n is given as

B[n, m] = Σ_{k(f1,m)<k<k(f2,m)} |X[k, n]|²    (3)

where k(f) is the frequency bin corresponding to the frequency f (Eq. 1) and |X[k, n]| is the magnitude spectrum at time frame n. As a result, we obtain a feature vector x[n] = ⟨B[n, 1], ..., B[n, 12]⟩ holding the energies of the lower twelve bark bands for each analysed frame n.

We then assign an initial label to the feature vector in each frame based on the output of the predominant melody algorithm: Melody frames are labelled as voiced (x+) and non-melody frames as unvoiced (x−). This initial labelling corresponds to the case where the main melody coincides with the vocal melody. Subsequently, we fit a single multivariate Gaussian distribution to each of the two feature sets separately. Applying maximum likelihood estimation, we obtain estimates for the mean (µ+ and µ−) and covariance (Σ+ and Σ−) of both classes. The resulting likelihood p for an arbitrary feature vector x is given as:

p(x | µ, Σ) = 1 / √((2π)^12 |Σ|) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )    (4)

Evaluating this equation for a given feature vector x under both distributions, voiced p+(x | µ+, Σ+) and unvoiced p−(x | µ−, Σ−), we expect a higher value for p+ if the frame contains vocals. In this manner, we evaluate every frame and assign a binary vocal prediction v[n]:

v[n] = 1 if p+(x[n] | µ+, Σ+) ≥ p−(x[n] | µ−, Σ−), and 0 otherwise    (5)

Since we assume voiced sections to be continuous in time, we subsequently apply a binary moving average filter of 1 second length to the vocal prediction sequence v[n] to eliminate fast fluctuations. The resulting sequence is shown in Figure 3 (c).

We now use this frame-wise classification to filter contour segments. A given contour ranging from frame d1 to frame d2 is eliminated if it is entirely located outside the estimated vocal regions:

Σ_{n=d1}^{d2} v[n] = 0 → eliminate;   Σ_{n=d1}^{d2} v[n] > 0 → retain    (6)

An example of the resulting vocal pitch is depicted in Figure 3 (d).
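A minimal sketch of this contour-filtering stage is given below, assuming the bark-band features and the predominant pitch track have already been computed on a common frame grid; all names are illustrative, and the smoothing of the binary prediction is approximated with a thresholded moving average.

```python
# Hedged sketch of contour filtering (Eqs. 4-6): fit one Gaussian per class,
# classify each frame, smooth the binary prediction over 1 s, and discard
# contours that never enter the estimated vocal regions.
import numpy as np
from scipy.stats import multivariate_normal

def filter_contours(bark, f0, fps, smooth_s=1.0):
    """bark: (n_frames, 12) band energies; f0: predominant pitch, 0 = no melody."""
    melody = f0 > 0
    mu_p, cov_p = bark[melody].mean(0), np.cov(bark[melody].T)
    mu_n, cov_n = bark[~melody].mean(0), np.cov(bark[~melody].T)
    p_pos = multivariate_normal(mu_p, cov_p, allow_singular=True).logpdf(bark)
    p_neg = multivariate_normal(mu_n, cov_n, allow_singular=True).logpdf(bark)
    v = (p_pos >= p_neg).astype(float)                           # Eq. (5)
    win = max(1, int(smooth_s * fps))
    v = np.convolve(v, np.ones(win) / win, mode='same') >= 0.5   # 1 s smoothing
    f0_voc = f0.copy()
    # Eq. (6): drop contours located entirely outside the vocal regions
    m = melody.astype(np.int8)
    edges = np.flatnonzero(np.diff(np.concatenate(([0], m, [0]))))
    for d1, d2 in zip(edges[::2], edges[1::2]):
        if not v[d1:d2].any():
            f0_voc[d1:d2] = 0.0
    return f0_voc
```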

B. Note transcription

After eliminating unwanted guitar contours, we are now left with a set of contours corresponding to the vocal melody. Instead of a continuous time-frequency representation, we now aim to segment the remaining contours into discrete note events, characterised by onset time, duration and a quantised semitone pitch value. In other words, the task is to split the contours at vocal onsets and assign a pitch label to each note event.

Fig. 4. Note segmentation based on local maxima: Ground truth transcription (grey rectangles), pitch contour, local maxima (red circles) and upper envelope (red dashed line).

1) Segmentation: Conceptually, we can distinguish two types of note onsets: those which coincide with a change in pitch (interval onsets) and onsets where the pitch of the current note is the same as that of the previous note (steady pitch onsets). In what follows, we will define four detection schemes which in combination capture the majority of both types of onsets at a low false positive rate.

For each voiced segment, we first map the frequency contour f0[n] to a cent scale relative to a reference frequency of fRef = 440 Hz:

c0[n] = 1200 · log2(f0[n] / fRef)    (7)

Fig. 5. Gaussian derivative filtering; top: cent-scaled pitch contour; bottom: output of the Gaussian derivative filter. The green dashed line marks the detection threshold for local maxima.

First, we aim to discover interval onsets: When dealing with strongly ornamented pitch contours, the detection of significant pitch changes which indicate a note onset turns into a complex task. For the particular case of flamenco singing, fast fluctuations of the pitch track caused by micro-tonal ornamentations can exceed a semitone without a note onset being present. Low-pass filtering of the pitch contour reduces these fluctuations but also affects the steep slopes which indicate a note change. A close inspection of such pitch contours has shown that the relative change of the upper envelope gives a good indication of the underlying slowly changing perceived pitch (Figure 4): Vocal vibrato causes a series of local maxima and, in the case of a steady note, their pitch values remain within a strongly limited range, in contrast to the actual pitch contour, which might show fluctuations with an extent of up to a semitone. In preliminary experiments the upper envelope has proven to be slightly more reliable for this task than the lower envelope. We furthermore observe that, due to the periodicity of the vibrato fluctuations, the steep slope characterising a note onset is centred between two adjacent local maxima. We therefore extract the local maxima pm and assume a pitch change centred between two adjacent maxima if their relative distance ∆pmi,i+1 exceeds a threshold ∆pmin. For a pitch change of an exact semitone, we expect ∆pmi,i+1 = 100 cents. In order to leave room for intonation inaccuracies, we adjust the threshold to ∆pmin = 80 cents. Since not all local maxima are caused by vocal vibrato, we furthermore exclude cases where the time distance between two local maxima exceeds T = 0.25 s. We chose this value in order to cover modulation rates as low as 4 Hz, which corresponds to the lower bound of expected vocal vibrato rates [35]. With this procedure we detect a large number of pitch interval onsets, in particular in sections where vocal vibrato is present. Nevertheless, a number of onsets are missed, e.g. when the contour does not show a regular vibrato and consequently no relevant local maxima appear in the area of a pitch change.
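The following minimal sketch illustrates this interval-onset detector on the cent-scaled contour of one voiced segment; the 80-cent and 0.25 s thresholds come from the text, while the peak-picking helper and the neighbourhood size are assumptions.

```python
# Hedged sketch of interval-onset detection from adjacent local maxima of
# the pitch contour; names and the 'order' neighbourhood are illustrative.
import numpy as np
from scipy.signal import argrelmax

def interval_onsets_from_maxima(c0, fps, dp_min=80.0, t_max=0.25):
    """c0: cent-scaled pitch contour of one voiced segment."""
    peaks = argrelmax(c0, order=3)[0]          # local maxima of the contour
    onsets = []
    for i, j in zip(peaks[:-1], peaks[1:]):
        if (j - i) / fps > t_max:              # maxima too far apart: skip
            continue
        if abs(c0[j] - c0[i]) >= dp_min:       # pitch change exceeds 80 cents
            onsets.append((i + j) // 2)        # onset centred between the maxima
    return onsets
```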

In order to detect further undiscovered onsets, we refine the segmentation by applying a first-derivative Gaussian filter to the contour. Such filters have been used successfully in the related task of edge detection in image processing [33], [34]. The one-dimensional Gaussian density function g[n, σ] with zero mean and standard deviation σ is given as:

g[n, σ] = e^(−n² / (2σ²))    (8)

and its first derivative h[n] results in:

h[n] = δg[n, σ] / δn = −(n / σ²) · e^(−n² / (2σ²))    (9)

In the context of onset detection, the purpose of applying Gaussian derivative filtering is to detect long-term changes in the signal while omitting fast fluctuations. The parameter σ determines the effective length of the filter and consequently the period of averaging: If σ is too large, the filter might average over entire short notes; choosing a very small value for σ will lead to a large filter output at fast pitch fluctuations caused by vibrato. We choose σ = 43.5 ms, so that the effective filter length of 300 ms covers a full period of a slow vibrato with a rate of 4 Hz. Applying the filter h[n] of length Nh to the cent-scaled contour c0[n] of length Nc leads to the filtered output cF[n]. Analysing its absolute value (Figure 5) shows strong peaks in the areas where a change of the slowly varying underlying pitch takes place. We consequently detect onsets at local maxima of |cF[n]|. Since minor variations in the pitch contour may cause noise in the filter output, we discard local maxima below an empirically determined threshold of |cF[n]|min = 4.0.

We now proceed to the detection of steady pitch onsets. We define two characteristics which indicate this type of onset: a sudden local decay in volume (Figure 7) and a sudden dip in the pitch contour (Figure 6). It is important to state that at a given onset either one or both of these characteristics can be present. We therefore define two separate detection schemes instead of combining both features in a single detection function.

Fig. 6. Pitch contour dips at note onsets: pitch contour; note onsets characterised by a dip in the pitch contour. Green circles mark detected onsets, grey rectangles correspond to the ground truth transcription.

Detecting volume decays when the overall dynamics of a track vary, i.e. certain sections are generally of lower volume than others, requires an analysis of the short-term dynamics. We first extract the root mean square (RMS) of the signal, rms[n], in windows of length N = 4096 samples with a hop size of hS = 128 samples. In order to detect local decays, we define the local RMS fluctuation rLOC[n] by comparing each sample to the mean value of its 100 surrounding samples, corresponding to 150 ms, and map it to a decibel scale:

rLOC[n] = 20 · log10( rms[n] / (0.01 · Σ_{−50≤i≤50, i≠0} rms[n+i]) )    (10)

We segment the contour at frames n where local minima with rLOC[n] < −10 dB are found. We set this threshold in order to ignore local volume variations which often accompany pitch vibrato or are caused by dynamic fluctuations of the accompaniment.

The difficulty in detecting the previously described dips in the pitch contour is to distinguish them from local minima occurring during vocal vibrato. We therefore model local minima which are related to note onsets as outliers in the local distribution of pitch values over all frames in the considered contour. A common method to detect outlier values of a distribution is to analyse its z-score z[n], which describes the relation between a given data point and the standard deviation of all considered data points,

z[n] = (c[n] − µ) / σ    (11)

where µ denotes the mean and σ the standard deviation of the distribution of pitch values in the considered segment. Consequently, the described dips in a given contour will produce large negative values of z[n]. In order to avoid detecting local minima caused by vocal vibrato, we only consider negative peaks of z[n] below a threshold zmax. The standard deviation σsin of a zero-centred sinusoid with magnitude A corresponds to its RMS value:

σsin = A / √2    (12)

Local minima of the periodic oscillation will consequently cause a z-score zmin,sin of

zmin,sin = −A / σsin = −√2    (13)

Fig. 7. Note onsets characterised by a local decrease in volume; top: pitch contour; bottom: local RMS. The green circle marks a detected onset, grey rectangles correspond to the ground truth transcription.

Fig. 8. Pitch labelling: (a) contour segment; (b) local pitch histogram; (c) local pitch probability; (d) global pitch probability; (e) combined pitch probability.

Since we are looking for local pitch dips which are significantly larger than the deviations caused by vocal vibrato, and since the vibrato extent might furthermore vary during long notes, we adjust the threshold for detecting local minima below the theoretical boundary of −√2 to zmax = −2.

All previously described threshold values involved in the onset detection process (|cF[n]|min = 4.0, rLOC[n] < −10 dB and zmax = −2) were determined empirically based on observations of several onsets showing the respective characteristics, which were singled out from the cante2midi dataset (Section III-A).
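The pitch-dip detector based on the z-score (Eqs. 11–13) could be sketched as follows; the zmax = −2 threshold comes from the text, while the peak-picking neighbourhood and function name are illustrative.

```python
# Hedged sketch of the pitch-dip onset detector: local minima of the contour
# whose z-score falls below -2, i.e. clearly beyond the -sqrt(2) expected
# for a pure vibrato oscillation, are taken as onset candidates.
import numpy as np
from scipy.signal import argrelmin

def pitch_dip_onsets(c0, z_max=-2.0):
    z = (c0 - c0.mean()) / (c0.std() + 1e-12)   # Eq. (11)
    minima = argrelmin(c0, order=3)[0]
    return [m for m in minima if z[m] < z_max]
```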

2) Pitch labelling: After the note segmentation stage, the remaining task is to assign a pitch label to each resulting note event. There are various challenges in this stage: First, the tuning of the track might deviate from the reference, causing even constant segments to be located between two semitone bins. While the mean pitch value across the contour is often a sufficient approximation for stable contour segments or even symmetric ornamentations (i.e. vibrato), singing voice contours may contain portamentos at note beginnings and endings or over-swing before the target pitch is reached. Such non-symmetric ornamentations can cause an offset in the mean pitch value. Finally, local intonation might not always be accurate and the estimated pitch can be located slightly below or above the target value. McNab [16] estimates the resulting pitch as the peak bin of the local pitch histogram. Molina et al. [21] determine the pitch label of a given segment as the energy-weighted alpha-trimmed mean in order to exclude local pitch outliers and assign higher weights to high-energy frames. In the model-based transcription system [18], the pitch label of each found note results directly from the output of the HMM. In a similar way, Gómez & Bonada [23] deduce quantised pitch labels from the maximum likelihood estimation.

Here, we first estimate a global tuning reference for the entire song and then combine pitch statistics from the note event with global pitch class probabilities to finally determine the pitch value. The entire process is summarised in Figure 8.

We estimate a global tuning deviation ∆t from the reference value of A4,St = 440 Hz using an established method based on circular statistics [36]. The new tuning reference A4,T results in

A4,T = 2^(∆t / 1200) · A4,St    (14)

and we can re-map each contour f0[n] to a cent scale c0,T[n] relative to the estimated reference frequency fRef = A4,T by applying Eq. 7.

Even though in several music traditions and genres the main melody does not strictly follow a defined scale, certain pitches tend to occur more frequently than others according to the underlying harmonic context. For note events with unstable pitch or inaccurate intonation, such probabilities can assist in deciding on the pitch label. We estimate a so-called pitch class profile providing probability estimates for all twelve semitones from the chroma vector averaged over all frames. Chroma features were first introduced by Wakefield [37] and are frequently used in a variety of MIR tasks, such as chord and tonality estimation or cover song identification [38]. The chroma vector for a given time frame is obtained by quantising the spectrum |X[n, k]| to semitone bins and then mapping the entire analysed range into a single octave. In this manner, we obtain an instantaneous chroma vector chr[kchr, n] for each frame n consisting of K = 12 semitone bins kchr. Subsequently, we can estimate a global pitch class probability Lglobal for each semitone from the average chroma vector chr̄[kchr] over all frames in a given track:

Lglobal[kchr] = chr̄[kchr] / Σ_{kchr=1}^{12} chr̄[kchr]    (15)

The use of chroma vectors has the advantage that they include pitch information of both the singing voice and the guitar accompaniment, and should therefore give a better representation of the underlying tonality.

We now proceed to the statistical analysis of the pitch content within a single note event. The centre value Cc[ks] in cents of the quantised bin corresponding to the ks-th semitone above A4,T is given as:

Cc[ks] = ks · 100    (16)

Now, for each frame within the note event, the pitch contour c0,T[n] is quantised to the closest semitone bin ks, and accordingly a pitch histogram H[ks] is computed by accumulating the occurrences of each bin over all frames of the analysed segment and dividing by the total number of frames. As mentioned earlier, due to local intonation inaccuracies and non-symmetric ornamentations, the ground truth pitch might not correspond to the peak bin but to one of the adjacent bins. We therefore replace each bin value H[k′s] by a Gaussian distribution Gk′s[ks] originating from the bin k′s and spanning over several semitone bins ks:

Gk′s[ks] = H[k′s] · (1 / (σ√(2π))) · e^(−(ks − k′s)² / (2σ²))    (17)

The mean of the distribution corresponds to the semitone bin k′s from which the distribution originates. We furthermore assume a standard deviation of σ = 0.5, corresponding to a quarter tone, and weight with the occurrence of the respective bin H[k′s]. By accumulating the contributions of all distributions for each bin, we obtain the local pitch probability function Llocal[ks] as a Gaussian mixture distribution:

Llocal[ks] = (1/K) · Σ_{k′s} Gk′s[ks]    (18)

The pitch label can now be estimated by combining the global pitch class probability and the local pitch probability function. For a given semitone bin ks the corresponding chroma bin kchr is defined as:

kchr[ks] = ks mod 12    (19)

We calculate the product of global and local pitch class probabilities Lpitch[ks] as:

Lpitch[ks] = Lglobal[kchr[ks]] · Llocal[ks]    (20)

The MIDI pitch label Pmidi,i associated with the ith contour segment is finally computed from the semitone bin with the highest combined pitch probability:

Pmidi,i = 69 + argmax_{ks} (Lpitch,i[ks])    (21)
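The pitch-labelling step (Eqs. 16–21) could be sketched as follows for a single note event, assuming the tuning-corrected cent contour and a 12-dimensional global pitch class profile (indexed with A as bin 0) are already available; the semitone search range and all names are assumptions.

```python
# Hedged sketch of pitch labelling: a quarter-tone Gaussian-smoothed local
# pitch histogram is combined with the global chroma-based pitch class
# profile; semitone bins ks are counted relative to the tuning-corrected A4.
import numpy as np

def label_note(c0_t, chroma_profile, k_range=(-36, 36)):
    """c0_t: cent contour of one note event; chroma_profile: Eq. (15), 12-dim."""
    ks_axis = np.arange(k_range[0], k_range[1] + 1)
    quantised = np.rint(c0_t / 100.0).astype(int)            # closest semitone bin
    hist = np.array([(quantised == ks).mean() for ks in ks_axis])   # H[ks]
    sigma = 0.5                                              # quarter tone
    kernel = np.exp(-(ks_axis[:, None] - ks_axis[None, :])**2 / (2 * sigma**2))
    l_local = (hist[None, :] * kernel).sum(axis=1) / (sigma * np.sqrt(2 * np.pi)) / 12
    l_pitch = chroma_profile[np.mod(ks_axis, 12)] * l_local   # Eqs. (19)-(20)
    return 69 + ks_axis[np.argmax(l_pitch)]                   # Eq. (21)
```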

3) Note post-processing: By applying basic musicologically motivated restrictions regarding pitch and duration, we can further refine the obtained note transcription. First, we can exploit the fact that flamenco singing is usually limited to a pitch range of less than an octave. Figure 9 shows an analysis of the cante2midi dataset described in Section III, where the pitch of a ground truth note is displayed in relation to the median pitch of the corresponding track. Based on these observations, we subtract the median pitch of the entire track from each transcribed note and limit the range to ±8 semitones. Transcribed notes below this range are likely to belong to the accompaniment and are consequently eliminated. Notes above this range may be caused by octave errors in the vocal melody extraction and are transposed down by one octave. We furthermore assume a minimum note duration of 0.05 seconds and eliminate shorter transcribed notes. In the analysed datasets (III-A), only 0.3% of all ground truth notes have a shorter duration. In order to reduce computational complexity and to avoid labelling short notes which are afterwards discarded, segments shorter than 0.05 seconds are eliminated before the note labelling stage.
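These post-processing rules can be summarised in a short sketch; the note representation as (onset, duration, MIDI pitch) tuples is an illustrative assumption.

```python
# Hedged sketch of note post-processing: limit notes to +/- 8 semitones
# around the track's median pitch and discard notes shorter than 0.05 s.
import numpy as np

def post_process(notes, min_dur=0.05, max_dev=8):
    """notes: list of (onset_s, duration_s, midi_pitch) tuples."""
    median_pitch = np.median([p for _, _, p in notes])
    cleaned = []
    for onset, dur, pitch in notes:
        if dur < min_dur:
            continue                                  # too short: discard
        dev = pitch - median_pitch
        if dev < -max_dev:
            continue                                  # likely accompaniment
        if dev > max_dev:
            pitch -= 12                               # probable octave error
        cleaned.append((onset, dur, pitch))
    return cleaned
```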

III. EVALUATION METHODOLOGY

Below we provide a detailed description of the evaluation strategies applied in the scope of this study. We give an overview of the employed data collections and the corresponding ground truth annotation process and describe the evaluation metrics used. We furthermore provide a short description of the reference algorithms used during the comparative evaluation.

Fig. 9. Distribution of ground truth pitch values with respect to the track median for the cante2midi dataset.

TABLE I
DATABASE INFORMATION.

database                 cante2midi    fandangos    tonas
no. tracks               20            39           72
clip type                full track    excerpt      excerpt
no. singers              15            21           44
total duration           1h 6m         34m          20m
no. ground truth notes   6025          3070         2983
% voiced frames          42            50           82

A. Test collections

We use three data collections to evaluate the performance of the proposed approach and compare it to reference algorithms: The cante2midi (C2M) dataset was gathered in the scope of this study and contains a variety of flamenco styles and voice timbres with varying degree and complexity of ornamentation. The instrumentation of all 20 tracks is limited to vocals and guitar. The fandango (FAN) dataset comprises 39 excerpts from recordings of the fandango style, containing vocals and guitar accompaniment. This dataset was used previously in the context of automatic transcription of flamenco singing [25]. In order to compare to monophonic transcription systems, we furthermore evaluate the proposed system on the tonas (TON) dataset. This collection contains 72 clips of a cappella flamenco recordings and is publicly available2. It should be mentioned that a cappella singing represents only a small fraction of the flamenco genre and such performances are usually characterised by a large amount of melismatic ornamentation. Furthermore, the absence of guitar accompaniment often causes strong tuning fluctuations throughout a performance. Further information on all three data collections is provided in Table I.

2 mtg.upf.edu/download/datasets/tonas/

B. Ground truth annotations

A general guideline for the ground truth annotation process of all three collections was to obtain a detailed transcription including all audible notes. Since flamenco music does not always follow a strict rhythm and the proposed transcription system does not apply rhythmic quantisation, note onsets were transcribed as absolute time instants. The ground truth annotations for the C2M collection were conducted by a person with formal music education and training in melody transcription by ear, but only basic knowledge of flamenco. All transcriptions were verified and corrected by a flamenco expert. The output of the system described by Gómez et al. [25] was taken as a starting point. After conversion to MIDI, the transcriptions were edited in the digital audio workstation LogicPro. The annotator listened to the original track and the transcription synthesised with a piano sound simultaneously, with the possibility of muting one of the tracks when required. The tuning of the MIDI synthesiser was manually adjusted to match the tuning of the corresponding audio track. A visual representation of the pitch contour and the baseline transcription was provided as an additional aid. Ground truth annotations for both the FAN and the TON collections were conducted by a musician with limited knowledge of flamenco in order to avoid implicit knowledge of the style. Annotations were then verified and corrected by a flamenco expert and occasionally discussed with another flamenco expert. For more details on the annotation process of these two collections and general guidelines for manual flamenco singing transcription, we refer to [23] and [25]. In addition to the note transcriptions, we furthermore provide manually corrected pitch contours for both polyphonic databases, in which guitar contours were eliminated.

C. Reference algorithms

In the course of this study, we compared the proposed system to a number of reference algorithms. Below we briefly describe each of the methods used in the comparative evaluation in Section IV. For a detailed description we refer the reader to the references provided for each method.

• Curve fitting (FIT): A contour simplification algorithm [2] was applied to a given pitch contour. As suggested by the authors, the pitch deviation tolerance was set to αe = 1.

• Recursive least squares filtering (RLS): This approach performs a segmentation of the pitch contour based on the error function of an adaptive filter which tracks the semitone pitch contour, as described by Adams [17]. We adopt all system parameters suggested by the authors and segment at values of the squared error function d[n] > 0.25.

• Dynamic programming (DP): We used an existing implementation of the flamenco singing transcription system described by Gómez & Bonada [23] (DP-Mono) to transcribe a cappella recordings and applied its extension [25] (DP-Poly) to polyphonic recordings. The implementation furthermore allows segmenting any given pitch sequence.

• Segmentation based on hysteresis (SiPTH): The monophonic singing transcription system described by Molina et al. [21] was used to transcribe a cappella singing recordings.

• Segmentation based on probabilistic Hidden Markov Modelling (HMM): We compare to the segmentation algorithm proposed in [18], as implemented in the tony [20] transcription framework for monophonic singing recordings. The system uses the probabilistic YIN [39] algorithm as a front end to estimate the vocal pitch. We used the publicly available vamp plugin implementation3.

3 code.soundsoftware.ac.uk/projects/pyin

The systems SiPTH and HMM represent monophonic singing transcription frameworks and were consequently evaluated only on a cappella singing. FIT and RLS are contour segmentation algorithms without a vocal pitch extraction front end and were used in the scope of the onset detection evaluation. The DP algorithm is the only complete reference system for singing transcription from polyphonic recordings and was therefore part of a comparative study evaluating the overall system performance. Since it can furthermore operate on any given pitch contour input, it was also used in the onset detection evaluation.

• Vocal detection using Gaussian Mixture Models (GMM): In order to compare the proposed vocal detection stage to existing methods, we furthermore implemented the algorithm proposed by Song et al. [28] based on Gaussian Mixture Models. We adopted all parameters as suggested by the authors and processed both polyphonic datasets in a 10-fold cross-validation. The frame-wise accuracy in terms of voicing precision, recall and f-measure is shown in Figure 10 and is slightly above the values reported by the authors.

Fig. 10. Frame-wise accuracy of the GMM-based vocal detector [28]: cante2midi and fandango dataset.

D. Evaluation metrics

We aim to evaluate three aspects of the proposed algorithm: its capability of detecting the vocal sections, the performance of the segmentation of vocal contours into discrete note events, and its overall performance.

The vocal section retrieval is evaluated by means of voicing precision (Pr-V), voicing recall (Rec-V) and voicing f-measure (FM-V). Pr-V is defined as the fraction of all frames estimated as voiced which are labelled as voiced in the ground truth. Rec-V corresponds to the fraction of all voiced ground truth frames which are estimated as voiced. The resulting f-measure is calculated as the harmonic mean of Rec-V and Pr-V.

In a similar manner, we evaluate the onset detection stage by means of onset precision (Pr-On), onset recall (Rec-On) and onset f-measure (FM-On). In this case, Pr-On refers to the proportion of all detected onsets which correspond to ground truth onsets. Rec-On is defined as the proportion of all ground truth onsets which are correctly detected. FM-On again corresponds to the harmonic mean of Rec-On and Pr-On. We adopt a previously suggested threshold [23] and consider an onset as correctly detected if it is located within 0.15 seconds of a ground truth onset. Furthermore, each ground truth onset can only be associated with a single detected onset and vice versa.

In order to evaluate note transcriptions, we first define the frame-wise raw pitch accuracy (RPA) as the percentage of correctly transcribed frames. In order to incorporate the voicing detection into this measure, we define an unvoiced frame as correctly transcribed if it was estimated as unvoiced. Voiced frames are correctly transcribed if the estimated MIDI pitch corresponds to the ground truth in the respective frame. We furthermore evaluate transcriptions based on note precision (Pr-N), note recall (Rec-N) and note f-measure (FM-N). Pr-N refers to the proportion of all detected notes which are correctly transcribed ground truth notes. Rec-N is defined as the proportion of all ground truth notes which are correctly transcribed, and FM-N is calculated as the harmonic mean of Pr-N and Rec-N.

We adopt the evaluation thresholds suggested in [23]: A ground truth onset is correctly detected if an estimated onset lies within a range of 0.15 seconds. Furthermore, a note is correctly transcribed if the pitch label is correctly assigned and the estimated duration is within a tolerance of 30%. Analysing the test collections, we observe significant deviations from the standard tuning of A4 = 440 Hz, ranging even above 40 cents. For such large deviations it is difficult, even in the manual annotation process, to decide whether the track is tuned below or above the reference. Consequently, small errors of only a few cents in the tuning estimation can cause the entire melody to be transcribed a semitone above or below the ground truth. We assume that in cases of large tuning deviations such a transcription should still be considered valid. Therefore, we perform a preliminary evaluation of the entire transcription and of its transpositions one semitone above and below, and correct towards the best match.
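As an illustration of the onset evaluation, the following minimal sketch greedily matches detected onsets to ground truth onsets one-to-one within the 0.15 s tolerance and derives precision, recall and f-measure; the matching strategy and names are assumptions and not necessarily the exact procedure used in the study.

```python
# Hedged sketch of the onset metrics: one-to-one matching within a tolerance.
def onset_prf(detected, ground_truth, tol=0.15):
    gt = list(ground_truth)
    hits = 0
    for d in sorted(detected):
        match = [g for g in gt if abs(g - d) <= tol]
        if match:
            gt.remove(min(match, key=lambda g: abs(g - d)))  # one-to-one pairing
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    fm = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, fm
```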

IV. EXPERIMENTAL RESULTS

In this section, we present the results of a number of experiments carried out in order to evaluate the performance of the proposed system (P) and compare it to other methods. We first evaluate the overall performance for flamenco transcription from polyphonic and monophonic recordings and compare to several existing systems. In order to deal with monophonic recordings, we omit the channel selection and contour filtering stages and increase the voicing tolerance of the predominant melody extraction algorithm (Section II-A2) to τv = 3.0. This setup is denoted as P-Mono. We then analyse the accuracy of the proposed contour filtering stage, compare it to alternative system setups and investigate its influence on the note transcription. Subsequently, we first isolate the contour segmentation stage and then the entire note transcription block, and study the obtained accuracies with respect to alternative approaches. Finally, we conduct a component analysis of the proposed system and investigate the influence of certain processing blocks on the overall performance.

A. Overall system

Figure 11 shows the evaluation of the proposed system (P)

and the polyphonic implementation of the dynamic programming approach (DP-Poly) [25] for the two polyphonic datasets

Fig. 11. Overall system evaluation for polyphonic datasets: Proposed system (P) and [25] (DP-Poly).

Fig. 12. Overall system evaluation for the monophonic dataset (TON): Proposed system (P), [23] (DP-Mono), [18], [20] (HMM) and [21] (SiPTH).

described in Section III-C. Figure 12 shows the comparative evaluation on the monophonic dataset. For both monophonic and polyphonic recordings, the proposed system outperforms all reference systems. The note f-measure as well as the frame-wise raw pitch accuracy is significantly lower for the monophonic dataset, indicating the difficulty of transcribing this particular sub-genre (Section III-A).

B. Voicing

We now investigate the effectiveness of the proposed vocal pitch extraction stage (P) and compare to two alternative setups: P-RawPM refers to replacing the entire stage with the raw predominant melody without any further processing. This setup corresponds to the front-end of the algorithm proposed by Gómez & Bonada [25] and represents the baseline. We furthermore replace the proposed frame-wise voicing prediction v[n] (Section II-A3) with the output of the GMM-based approach described in [28] and eliminate contours accordingly. This setup is referred to as P-GMM. The experiments were conducted for both polyphonic datasets, evaluating the frame-wise voicing accuracy (Figure 13) as well as the influence on the resulting note transcription (Figure 14).

The results show that the raw predominant melody gives significantly better results for the fandango (FAN) than for the cante2midi (C2M) dataset. An explanation for this behaviour can be found when analysing the content of the excerpts

Fig. 13. Frame-wise voicing accuracy for the proposed system (P), the proposed system with [28] replacing the vocal detection function v[n] (P-GMM) and the raw predominant melody (P-RawPM).


Fig. 14. Note transcription evaluation for the proposed system (P), the proposed system with [28] replacing the vocal detection function v[n] (P-GMM) and the raw predominant melody (P-RawPM).

comprising FAN: Despite containing only 50% of voiced frames (Table I), the excerpts do not include the guitar introduction or melodic interludes. In contrast, C2M contains full tracks where these sections are included. These parts of the song represent the sections where the guitar is most dominant and consequently tends to produce melody contours. Nevertheless, applying channel selection and contour filtering leads to a significant increase in the obtained frame-wise voicing accuracy, which propagates to the resulting transcription quality: The proposed approach yields a note f-measure of 0.63 on the C2M dataset and 0.60 on the FAN dataset. The note f-measure obtained with the raw predominant melody decreases to 0.51 for C2M and 0.56 for FAN. We can furthermore observe that replacing v[n] with the output of the GMM-based vocal detection yields similar results when compared to the proposed system. The difference in voicing accuracy between the two setups is marginal and does not show in the resulting note transcription performance (both obtain an f-measure of 0.63 on C2M and 0.60 on FAN). Nevertheless, an advantage of the proposed system is that it works on a track level and does not require any training phase involving manual ground truth annotations.

C. Segmentation

We now isolate the note segmentation stage described in Section II-B1 and compare to a number of alternative approaches. The evaluation is carried out by means of onset precision, recall and f-measure. For all three databases, the corrected vocal pitch contour is provided as input to all methods. The results in Figure 15 show that the proposed approach (P-SEG) yields the best performance among all considered methods on all datasets, followed by the dynamic programming approach (DP-SEG) [23]. The curve fitting algorithm (FIT-SEG) [2] obtains a low precision but a high recall rate, indicating an over-segmentation of the contour. The reversed behaviour is observed for the RLS segmenter (RLS-SEG) [17]. We furthermore observe that the performance of the proposed approach is consistent across the three datasets. The onset f-measure of DP-SEG decreases for the monophonic dataset (0.78 for C2M, 0.76 for FAN and 0.68 for TON). Given the higher complexity of the segmentation task for the tonás dataset described in Section III-A, these results indicate that the proposed approach is robust towards tuning inaccuracies and is capable of dealing with a large amount of melismatic ornamentations of the melody.

Fig. 15. Note segmentation evaluation for the proposed system (P-SEG), the dynamic programming approach (DP-SEG) [23], the fitting algorithm (FIT-SEG) [2] and the RLS segmenter (RLS-SEG) [17].

Fig. 16. Note transcription evaluation for the proposed system (P) and the dynamic programming approach (DP) [23], both applied to the manually corrected pitch contour (corr).

D. Note transcription

After isolating the note segmentation stage, we now proceed to the evaluation of the entire note transcription stage (Section II-B), comprising contour segmentation, pitch labelling and note post-processing. We compare to the DP algorithm [23] and provide both systems with the manually corrected vocal pitch contour as input. This experimental setup represents a glass ceiling evaluation in the sense that it corresponds to a transcription with a perfect vocal pitch extraction stage in terms of voicing. Figure 16 provides the note-related evaluation for both polyphonic datasets. It can be observed that the decrease in performance from using the corrected pitch contours in this experiment to the real-world scenario (Figure 11) is significantly lower for the proposed system: For the C2M dataset, the note f-measure drops from 0.66 to 0.63 for the proposed system and from 0.54 to 0.39 for the DP algorithm. This indicates that the proposed system provides a better approximation of the vocal pitch contour than the raw predominant melody [23]. These results confirm the findings in Section IV-B.


Fig. 17. Note transcription evaluation for the proposed system (P) when several components are removed: Contour filtering (P-CF), channel selection (P-CS) and global pitch probability estimation (P-PP).

E. Component analysis

In a last experiment, we conduct a component analysis of the proposed system in order to verify that each of the core algorithm components contributes to the overall system performance. For this analysis, we chose the C2M dataset since it contains recordings which are both polyphonic and stereo. We evaluate the proposed system (P) and compare to scenarios where single components are removed: The contour filtering stage (P-CF, Section II-A3), the channel selection (P-CS, Section II-A1) and the global pitch probability (P-PP, Section II-B2).

The results displayed in Figure 17 confirm that each of the investigated algorithm components contributes to the overall system performance. Among the considered setups, the removal of the contour filtering stage caused the largest decrease in performance, reducing the note f-measure from 0.63 to 0.55. Nevertheless, this result is still superior to the performance observed for the system presented in [25] (f-measure 0.49).

F. Summary

The results of the previously described experiments show that the proposed system gives superior results when compared to a number of reference algorithms. We observe a higher overall system performance when comparing to a state of the art flamenco singing transcription system (Figure 11). The note transcription f-measure for the proposed system ranges between 0.60 and 0.63 depending on the evaluation dataset. The results in Figure 13 show that the contour filtering process yields an improved voicing detection and gives similar results to a GMM-based vocal detection scheme which requires a pre-trained model (Figure 14). We isolated the note segmentation and pitch labelling stages and observed a superior performance compared to a number of state of the art systems (Figures 15 and 16). Finally, we conducted a component analysis and demonstrated that each processing block contributes to the overall system performance (Figure 17).

V. CONCLUSIONS

In this paper we present a novel approach for automatic transcription of flamenco singing directly from polyphonic audio recordings. All involved signal processing blocks have been described in detail: For stereo recordings we first select the channel with stronger dominance of the vocals for further processing. We then extract the predominant melody and apply a novel contour filtering process to discard guitar contours. It has been shown that this process significantly improves the voicing accuracy when compared to the raw predominant pitch contour. The resulting estimated vocal pitch sequence is further processed in the note transcription stage, where contours are converted to discrete note events. We propose a novel contour segmentation procedure based on pitch and volume characteristics, which has proven to yield better results in terms of onset detection accuracy when compared to a number of alternative approaches. We assign a pitch label to each resulting segment by combining local and global pitch information. A number of experiments have shown that the overall system as well as isolated stages outperform various reference algorithms. Our approach has proven to give convincing results given the particular characteristics of flamenco singing, in particular strong ornamentations, extensive use of vocal vibrato and local intonation inaccuracies. The resulting automatic transcriptions therefore provide a suitable basis for a number of related MIR tasks, such as melodic similarity characterisation or automatic style identification. The system can furthermore aid in large-scale musicological studies by providing first estimates for computer-assisted transcription.

There are several aspects in the context of flamenco singing transcription which we aim to investigate in order to further improve the quality of automatic transcription. A main topic of interest is to include perceptual aspects in the pitch labelling stage, by exploring the influence of fast pitch fluctuations and guitar accompaniment on the perceived pitch. Furthermore, it would be interesting to investigate how the quantitative measures presented in this study relate to perceptual transcription quality and how the transcription accuracy influences the performance of MIR systems which rely on automatic transcriptions. We are furthermore interested in to what extent more detailed note representations, i.e. including micro-tonal pitch labels, compare to the proposed standard MIDI representation for different MIR tasks. Finally, we aim to annotate ground truth for music traditions with similar characteristics to flamenco in order to evaluate the suitability of our approach for a larger variety of genres.

ACKNOWLEDGMENT

This research was funded by the PhD fellowship of the Department of Information and Communication Technologies, Universitat Pompeu Fabra and the projects SIGMUS (TIN2012-36650) and COFLA II (P12-TIC-1362).

REFERENCES

[1] F. Gómez-Martín, J. M. Díaz-Báñez, E. Gómez and J. Mora, "Flamenco and its computational study," BRIDGES: Mathematical Connections in Art, Music and Science, Seoul, Korea, 2014.

[2] J. M. Díaz-Báñez and J. C. Rizo, "An efficient DTW-based approach for melodic similarity in flamenco singing," Lecture Notes in Computer Science, vol. 8821, Springer, 2014, pp. 289-300.

[3] N. Kroher and E. Gómez, "Computational models for perceived melodic similarity in a cappella flamenco cantes," 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, 2014.

[4] N. Kroher and E. Gómez, "Automatic Singer Identification for Improvisational Styles Based on Vibrato, Timbre and Statistical Performance Descriptors," 11th Sound and Music Computing Conference, Athens, Greece, 2014.


[5] N. Kroher, A. Chaachoo, J. M. Díaz-Báñez, E. Gómez, F. Gómez-Martín, J. Mora and M. Sordo, "Computational ethnomusicology: a study of flamenco and Arab-Andalusian vocal music," Handbook for Systematic Musicology, Springer, 2015.

[6] A. P. Klapuri, "Automatic music transcription as we know it today," Journal of New Music Research, vol. 33, no. 3, 2004, pp. 269-282.

[7] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff and A. Klapuri, "Automatic music transcription: challenges and future directions," Journal of Intelligent Information Systems, vol. 41, no. 3, 2013, pp. 407-434.

[8] G. E. Poliner, D. P. W. Ellis, A. F. Ehrmann, E. Gómez, S. Streich and B. Ong, "Melody transcription from music audio: Approaches and evaluation," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, 2007, pp. 1247-1256.

[9] J. Salamon, E. Gómez, D. P. W. Ellis and G. Richard, "Melody Extraction from Polyphonic Music Signals: Approaches, Applications and Challenges," IEEE Signal Processing Magazine, vol. 31(2), 2014, pp. 118-134.

[10] J. Mora, F. Gómez-Martín, E. Gómez, F. Escobar Borrego and J. M. Díaz-Báñez, "Characterization and melodic similarity of a cappella flamenco cantes," International Symposium on Music Information Retrieval, Utrecht, Netherlands, 2010.

[11] J. M. Gamboa, "Una historia del flamenco," Espasa-Calpe, Madrid, 2005.

[12] J. A. Moorer, "On the transcription of musical sound by computer," Computer Music Journal, vol. 1(4), 1977.

[13] M. Antonelli and A. Rizzo, "A correntropy-based voice to MIDI transcription algorithm," 10th IEEE Workshop on Multimedia Signal Processing, 2008, pp. 978-983.

[14] E. Pollastri, "A pitch tracking system dedicated to process singing voice for music retrieval," IEEE International Conference on Multimedia and Expo (ICME), 2002, pp. 341-344.

[15] C.-K. Wang and R.-Y. Lyu, "A robust singing melody tracker using adaptive round semitones (ARS)," 3rd International Symposium on Image and Signal Processing and Analysis, 2003, pp. 549-554.

[16] R. J. McNab, L. A. Smith and I. H. Witten, "Signal processing for melody transcription," 19th Australasian Computer Science Conference, 1997, pp. 301-307.

[17] R. H. Adams, M. A. Bartsch and G. H. Wakefield, "Note segmentation and quantisation for music information retrieval," IEEE Transactions on Audio, Speech and Language Processing, vol. 14(1), 2006, pp. 131-141.

[18] M. Ryynänen and A. P. Klapuri, "Transcription of the singing melody in polyphonic music," 7th International Conference on Music Information Retrieval, 2006.

[19] W. Krige, T. Herbst and T. Niesler, "Explicit transition modelling for automatic singing transcription," Journal of New Music Research, vol. 34(4), 2008, pp. 311-324.

[20] M. Mauch, C. Cannam, R. Bittner, G. Fazekas, J. Salamon, J. Dai, J. Bello and S. Dixon, "Computer-aided Melody Note Transcription Using the Tony Software: Accuracy and Efficiency," First International Conference on Technologies for Music Notation and Representation, 2015.

[21] E. Molina, L. J. Tardón, A. M. Barbancho and I. Barbancho, "SiPTH: Singing Transcription Based on Hysteresis Defined on the Pitch-Time Curve," IEEE Transactions on Audio, Speech and Language Processing, vol. 23(2), 2015, pp. 252-263.

[22] X. Serra, "A Multicultural Approach in Music Information Research," International Society for Music Information Retrieval Conference (ISMIR), 2011, pp. 151-156.

[23] E. Gómez and J. Bonada, "Towards computer-assisted flamenco transcription: An experimental comparison of automatic transcription algorithms as applied to a cappella singing," Computer Music Journal, vol. 37(2), 2013, pp. 73-90.

[24] J. J. Mestres, J. B. Sanjaume, M. De Boer and A. L. Mira, "Audio recording analysis and rating," U.S. Patent 8,158,871, Apr. 17, 2012.

[25] E. Gómez, F. Canadas, J. Salamon, J. Bonada, P. Vera and P. Cabanas, "Predominant fundamental frequency estimation vs singing voice separation for the automatic transcription of accompanied flamenco singing," 13th International Society for Music Information Retrieval Conference (ISMIR), 2012.

[26] J. Salamon and E. Gómez, "Melody extraction from polyphonic music signals using pitch contour characteristics," IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 6, 2011, pp. 1759-1770.

[27] J. M. Díaz-Báñez and J. A. Mesa, "Fitting rectilinear polygonal curves to a set of points in the plane," European Journal of Operational Research, vol. 130, no. 1, 2001, pp. 214-222.

[28] L. Song, M. Li and Y. Yan, "Automatic Vocal Segments Detection in Popular Music," 9th International Conference on Computational Intelligence and Security, 2013.

[29] M. Rocamora and P. Herrera, "Comparing audio descriptors for singing voice detection in music audio files," 11th Brazilian Symposium on Computer Music, 2007.

[30] S. D. You and Y.-C. Wu, "Comparative Study of Singing Voice Detection Methods," Computer Science and its Applications, Lecture Notes in Electrical Engineering, vol. 330, 2015, pp. 1291-1298.

[31] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. Zapata and X. Serra, "ESSENTIA: an open source library for audio analysis," ACM SIGMM Records, vol. 6(1), 2014.

[32] E. Zwicker and E. Terhardt, "Analytical expressions for critical band rate and critical bandwidth as a function of frequency," Journal of the Acoustical Society of America, vol. 68, 1980, pp. 1523-1525.

[33] M. Basu, "Gaussian-Based Edge-Detection Methods - A Survey," IEEE Transactions on Systems, Man, and Cybernetics, vol. 32(3), 2002, pp. 252-260.

[34] J. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8(6), 1986, pp. 679-698.

[35] J. Sundberg, "Acoustic and psychoacoustic aspects of vocal vibrato," in P. Dejonckere, M. Hirano and J. Sundberg: Vibrato, Singular Publishing Company, San Diego, CA, 1995.

[36] K. Dressler and S. Streich, "Tuning frequency estimation using circular statistics," in 8th International Conference on Music Information Retrieval, 2007, pp. 357-360.

[37] G. H. Wakefield, "Mathematical representation of joint time-chroma distributions," in Advanced Signal Processing Algorithms, Architectures and Implementations (SPIE), 1999.

[38] D. P. W. Ellis, "Identifying 'cover songs' with chroma features and dynamic programming beat tracking," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007, pp. 1429-1432.

[39] M. Mauch and S. Dixon, "pYIN: a fundamental frequency estimator using probabilistic threshold distributions," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 659-663.

[40] N. Kroher, A. Pikrakis, J. Moreno and J. M. Díaz-Báñez, "Discovery of Repeated Vocal Patterns in Polyphonic Audio: a Case Study on Flamenco Music," European Signal Processing Conference (EUSIPCO), 2015.

[41] E. Gómez, J. Bonada and J. Salamon, "Automatic Transcription of Flamenco Singing from Monophonic and Polyphonic Music Recordings," 3rd Interdisciplinary Conference on Flamenco Research (INFLA), 2012.

Nadine Kroher received an MSc in Audio and Electrical Engineering from Graz University of Technology (Austria) in 2011 and an MSc in Sound and Music Computing from Universitat Pompeu Fabra (Spain). Currently, she is a PhD student at the Department of Applied Mathematics, University of Seville (Spain). Her research focuses on Music Information Retrieval for computational ethnomusicology.

Emilia Gómez is an associate professor (Serra-Hunter fellow) at the Music Technology Group, Universitat Pompeu Fabra (UPF). She graduated as a Telecommunication Engineer specialized in Signal Processing at Universidad de Sevilla. Then, she received a DEA in Acoustics, Signal Processing and Computer Science applied to Music (ATIAM) at IRCAM, Paris. In July 2006, she completed her PhD in Computer Science and Digital Communication at the UPF. Her main research interests are related to melodic and tonal description of music audio signals, computer-assisted music analysis and computational ethnomusicology.